The Problem
Our AI agent swarm needs a Conductor: an orchestrator that parses natural-language intents via an LLM and routes tasks to specialized subagents. The Conductor model needs enough parameters for strong reasoning (35B) but not GPU acceleration, because orchestration is low-throughput work: a handful of routing decisions at a time, not bulk token generation.
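Concretely, the routing loop is only a few dozen lines. Here's a minimal sketch against Ollama's `/api/chat` endpoint; the `SUBAGENTS` set and `route_intent` helper are illustrative stand-ins, not our production code:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint
SUBAGENTS = {"code", "research", "ops"}          # hypothetical subagent names

def route_intent(user_text: str, model: str = "cyberpilot") -> str:
    """Ask the Conductor model to classify an intent into a subagent name."""
    system = (
        "You are an orchestrator. Reply with exactly one word naming the "
        f"subagent for the request: {', '.join(sorted(SUBAGENTS))}."
    )
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()
    choice = resp.json()["message"]["content"].strip().lower()
    return choice if choice in SUBAGENTS else "ops"  # fall back to a default

print(route_intent("Refactor the retry logic in the ingest service"))  # expected: "code"
```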
Meanwhile, pve1 — a Minisforum MS-01 with an i9-13900H and 64GB DDR5 — was serving as a Proxmox node hosting only two lightweight VMs. The i9's 14 cores (20 threads) and 64GB of RAM were massively underutilized as a hypervisor.
The Decision
Migrate pve1's VMs to bighost (which had headroom after right-sizing BigBrain from 32GB to 16GB RAM), then repurpose the MS-01 as a bare-metal AI workstation. No hypervisor overhead; full access to all 20 threads and 64GB of RAM for CPU inference.
VM Migration
Two VMs moved from pve1 to bighost via Proxmox's built-in migration (a scripted equivalent is sketched after this list):
- Migrated with zero downtime using live migration
- bighost had ~24GB RAM headroom after BigBrain's right-sizing
- pve1 removed from Proxmox cluster cleanly
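For anyone scripting this rather than clicking through the UI, the same online migration can be triggered through the Proxmox API. A rough sketch; the cluster endpoint, VMID, and API token are placeholders:

```python
import requests

PVE_HOST = "https://pve1.example.lan:8006"            # placeholder endpoint
TOKEN = "PVEAPIToken=root@pam!automation=<secret>"     # placeholder API token

def live_migrate(node: str, vmid: int, target: str) -> str:
    """Start an online (live) migration of a VM to another cluster node."""
    resp = requests.post(
        f"{PVE_HOST}/api2/json/nodes/{node}/qemu/{vmid}/migrate",
        headers={"Authorization": TOKEN},
        data={"target": target, "online": 1},
        verify=False,  # homelab self-signed cert; use a CA bundle in production
    )
    resp.raise_for_status()
    return resp.json()["data"]  # UPID of the migration task

print(live_migrate("pve1", 101, "bighost"))  # VMID 101 is illustrative
```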
Bare-Metal Setup
Fresh Ubuntu 24.04 install on the MS-01, then a focused deployment:
1. Ollama with CPU Inference
Ollama installed natively. The MS-01 has an Intel Iris Xe iGPU, but testing confirmed it's unsuitable for inference — no matrix multiplication cores, no bf16 support. CPU inference on the i9 actually outperforms Vulkan GPU inference on this hardware.
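A quick way to verify a loaded model really is running from system RAM is Ollama's `/api/ps` endpoint, which reports each loaded model's VRAM footprint. A small sketch:

```python
import requests

# Ollama lists loaded models and their memory placement at /api/ps
models = requests.get("http://localhost:11434/api/ps", timeout=10).json()["models"]

for m in models:
    # size_vram is the portion offloaded to GPU memory; 0 means fully on CPU
    placement = "CPU (system RAM)" if m["size_vram"] == 0 else "GPU offload"
    print(f'{m["name"]}: {m["size"] / 2**30:.1f} GiB, {placement}')
```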
2. Model Zoo
| Model | Purpose | Speed |
|---|---|---|
| cyberpilot (qwen3.5:35b-a3b MoE) | Conductor orchestrator | 8.2 t/s |
| qwen3-coder:30b | Code generation | 13 t/s |
| deepseek-r1:32b | Deep reasoning | 3 t/s |
| qwen3.5:27b (Q8_0) | High-quality general | 1.7 t/s |
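Speeds in the table are decode (generation) rates. Ollama returns raw timing counters on every non-streamed `/api/generate` call, so figures like these are easy to reproduce; the prompt and model name below are just examples:

```python
import requests

def decode_speed(model: str, prompt: str) -> float:
    """Return generation speed in tokens/second from Ollama's timing fields."""
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
    }, timeout=600)
    r.raise_for_status()
    stats = r.json()
    # eval_count = tokens generated, eval_duration = generation time in ns
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

print(f'{decode_speed("qwen3-coder:30b", "Write a binary search in Go."):.1f} t/s')
```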
The cyberpilot model, a 35B-parameter sparse Mixture of Experts that activates only 3B parameters per token, is the sweet spot: it reasons like a 35B model, but each token costs roughly what a 3B dense model's would in compute and memory traffic, which makes CPU inference practical for orchestration tasks.
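A back-of-envelope calculation shows why. CPU decode speed is largely bound by memory bandwidth, and per token a MoE only streams its active experts' weights. Ignoring attention, KV cache, and shared layers:

```python
# Bytes streamed from RAM per generated token at ~4.5 bits/weight (Q4-class quant)
BYTES_PER_PARAM = 4.5 / 8

dense_35b = 35e9 * BYTES_PER_PARAM / 2**30      # GiB touched per token, dense
moe_3b_active = 3e9 * BYTES_PER_PARAM / 2**30   # GiB touched per token, MoE

print(f"dense 35B:       ~{dense_35b:.1f} GiB/token")
print(f"MoE (3B active): ~{moe_3b_active:.1f} GiB/token "
      f"({dense_35b / moe_3b_active:.0f}x less memory traffic)")
```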
3. Open WebUI + Aviation RAG
Open WebUI deployed as a chat interface with a custom Aviation RAG knowledge base: 35 FAA handbook chapters (Airplane Flying Handbook + Pilot's Handbook of Aeronautical Knowledge) converted from PDF to Markdown via Docling and indexed for retrieval-augmented generation.
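The PDF-to-Markdown conversion itself is only a few lines of Docling; a minimal sketch, with the input directory as a placeholder:

```python
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert each FAA handbook chapter PDF to Markdown for RAG indexing
for pdf in Path("handbooks").glob("*.pdf"):
    result = converter.convert(pdf)
    out = pdf.with_suffix(".md")
    out.write_text(result.document.export_to_markdown())
    print(f"converted {pdf.name} -> {out.name}")
```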
4. OS Hardening & Monitoring
- SSH key-only authentication
- UFW firewall with deny-all default
- Syslog forwarding to Splunk
- SNMPv3 monitoring via LibreNMS
The Result
| Aspect | Before (pve1) | After (MS-01) |
|---|---|---|
| Role | Proxmox node (2 VMs) | Conductor AI workstation |
| OS | Proxmox VE | Ubuntu 24.04 bare-metal |
| CPU utilization | ~5% | ~40-80% during inference |
| RAM utilization | ~12 GB | ~35-50 GB during inference |
| Primary workload | Two lightweight VMs | 35B MoE orchestration model |
Key Takeaways
- MoE models change the economics of CPU inference. A 35B model that activates 3B parameters per token runs at practical speeds on consumer hardware.
- Right-size before buying new hardware. We freed up a powerful workstation by consolidating VMs onto an existing host — no new purchases needed.
- iGPUs are not inference accelerators. Intel Iris Xe lacks the matrix cores needed for LLM workloads. Don't waste time trying.
- Bare metal beats virtualization for inference. No hypervisor overhead means every thread and every byte of RAM goes directly to the model.