Blog

Build logs and technical write-ups from the lab.

Squeezing Blood from a Stone: Testing TurboQuant KV Cache Compression on a 12GB GPU

Google's TurboQuant compresses the KV cache to ~3 bits at inference time. We benchmarked it on a Qwen3 14B model with an RTX 4070 — 2GB of VRAM saved, a 25% speed penalty, and lessons learned from community forks.

RTX 5060 Blackwell vs RTX 4070: LLM Inference Benchmarks

The RTX 5060 was underperforming the 4070 despite being newer. Root cause: 8GB VRAM, not software. IQ4_XS quantization brought it to 67 t/s — within 8% of the 4070.

Replacing Graylog with Splunk Enterprise

Migrated from Graylog (Docker on cainfra01) to a dedicated Splunk Enterprise VM. A native install for real admin experience — service management, conf files, and proper resource isolation.

Repurposing a Proxmox Node as an AI Workstation

Freed pve1 (i9-13900H, 64GB) from hypervisor duty and deployed it as a bare-metal Conductor host running a 35B MoE model at 8.2 t/s on CPU.

Deploying Cisco Duo MFA Across the Lab

Replaced self-hosted PrivacyIDEA with Cisco Duo. Push-based MFA for Windows logon and RDP, with a RADIUS proxy for future VPN integration.

Launching the AI Agent Swarm

A 14-service multi-agent system deployed on cainfra02. Conductor, four subagents, three observers, and an N8N gatekeeper — all communicating via RabbitMQ.