Repurposing a Proxmox Node as an AI Workstation

The Problem

Our AI agent swarm needs a Conductor — an orchestrator that parses natural-language intents via LLM and routes tasks to specialized subagents. The Conductor model needs to be large enough for strong reasoning (35B parameters) but doesn't require GPU acceleration because it handles orchestration, not high-throughput inference.

Meanwhile, pve1 — a Minisforum MS-01 with an i9-13900H and 64GB of DDR5 — was serving as a Proxmox node hosting only two lightweight VMs. The i9's 24 threads and 64GB of RAM were massively underutilized in that role.

The Decision

Migrate pve1's VMs to bighost (which had headroom after right-sizing BigBrain from 32GB to 16GB RAM), then repurpose the MS-01 as a bare-metal AI workstation. No hypervisor overhead, full access to all 24 threads and 64GB RAM for CPU inference.

VM Migration

Two VMs moved from pve1 to bighost via Proxmox's built-in migration:

  • Live migration kept both VMs online throughout; zero downtime
  • bighost had ~24GB RAM headroom after BigBrain's right-sizing
  • pve1 removed from Proxmox cluster cleanly

Bare-Metal Setup

Fresh Ubuntu 24.04 install on the MS-01, then a focused deployment:

1. Ollama with CPU Inference

Ollama installed natively. The MS-01 has an Intel Iris Xe iGPU, but testing confirmed it's unsuitable for inference — no matrix multiplication cores, no bf16 support. CPU inference on the i9 actually outperforms Vulkan GPU inference on this hardware.
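Once Ollama is up, the Conductor talks to it over the local HTTP API. A minimal sketch of a blocking call: Ollama's `/api/generate` endpoint and its `model`/`prompt`/`stream` fields are part of its documented API, but the helper names here are illustrative, not code from this deployment.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_generate_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for a single non-streaming generate call."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the payload to the local Ollama server and return the reply text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `stream` left at its default of true, Ollama instead returns newline-delimited JSON fragments; forcing `stream: False` keeps the orchestrator's request/response handling simple.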

2. Model Zoo

Model                               Purpose                  Speed
cyberpilot (qwen3.5:35b-a3b MoE)    Conductor orchestrator   8.2 t/s
qwen3-coder:30b                     Code generation          13 t/s
deepseek-r1:32b                     Deep reasoning           3 t/s
qwen3.5:27b (Q8_0)                  High-quality general     1.7 t/s

The cyberpilot model — a 35B parameter sparse Mixture of Experts with only 3B active parameters per token — is the sweet spot. It reasons like a 35B model but runs at speeds comparable to a 3B model, making CPU inference practical for orchestration tasks.
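The arithmetic behind that sweet spot: token generation on CPU is roughly memory-bandwidth bound, since every token streams the active weights through RAM once. A back-of-the-envelope sketch — the bandwidth figure (~70 GB/s effective for dual-channel DDR5) and the Q4 size (~0.5 bytes/param) are assumptions for illustration, and the helper name is made up:

```python
def est_tokens_per_sec(active_params_billion: float,
                       bytes_per_param: float,
                       mem_bw_gb_s: float) -> float:
    """Rough decode-speed ceiling: each token reads all active weights once."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 / bytes_per_token

# Assumed: ~70 GB/s effective DDR5 bandwidth, Q4 quantization ~0.5 bytes/param.
dense_35b = est_tokens_per_sec(35, 0.5, 70)  # ~4 t/s ceiling for a dense 35B
moe_3b_active = est_tokens_per_sec(3, 0.5, 70)  # ~47 t/s ceiling for 3B active
```

Real runs land well under the ceiling (compute, KV-cache reads, and routing overhead all cost bandwidth), but the roughly 12x gap between the two ceilings is why a 3B-active MoE is practical on this hardware where a dense 35B is not.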

3. Open WebUI + Aviation RAG

Open WebUI deployed as a chat interface with a custom Aviation RAG knowledge base: 35 FAA handbook chapters (Airplane Flying Handbook + Pilot's Handbook of Aeronautical Knowledge) converted from PDF to Markdown via Docling and indexed for retrieval-augmented generation.
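Between Docling's Markdown output and the retrieval index sits a splitting step: each chapter has to be cut into passages small enough to embed and retrieve. A minimal fixed-size overlapping chunker in the spirit of that step — the function name, chunk size, and overlap are illustrative, not Open WebUI's actual splitter:

```python
def chunk_markdown(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows so retrieval can match mid-section."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks
```

The overlap matters for handbook-style text: a sentence that straddles a chunk boundary still appears whole in one of the two neighboring chunks, so a query about it can still retrieve a coherent passage.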

4. OS Hardening & Monitoring

  • SSH key-only authentication
  • UFW firewall with deny-all default
  • Syslog forwarding to Splunk
  • SNMPv3 monitoring via LibreNMS

The Result

Aspect              Before (pve1)           After (MS-01)
Role                Proxmox node (2 VMs)    Conductor AI workstation
OS                  Proxmox VE              Ubuntu 24.04 bare-metal
CPU utilization     ~5%                     ~40-80% during inference
RAM utilization     ~12 GB                  ~35-50 GB during inference
Primary workload    Hypervisor overhead     35B MoE orchestration model

Key Takeaways

  • MoE models change the economics of CPU inference. A 35B model that activates 3B parameters per token runs at practical speeds on consumer hardware.
  • Right-size before buying new hardware. We freed up a powerful workstation by consolidating VMs onto an existing host — no new purchases needed.
  • iGPUs are not inference accelerators. Intel Iris Xe lacks the matrix cores needed for LLM workloads. Don't waste time trying.
  • Bare metal beats virtualization for inference. No hypervisor overhead means every thread and every byte of RAM goes directly to the model.