The Problem
FastForward — our new RTX 5060 (Blackwell, 8GB VRAM) inference host — was consistently underperforming BigBrain's RTX 4070 (Ada, 12GB VRAM) on identical models. The 5060 is the newer architecture, and while it's a lower-tier card than the 4070, it shouldn't have been running at half speed, so something was wrong.
Initial benchmarks showed the 5060 running at roughly half the 4070's speed on the same Qwen 3 8B model. Task Manager revealed the root cause: only ~72% of the model was offloaded to the GPU, with the remaining 28% spilling to the CPU. The model's weights plus KV cache were overflowing the 8GB of VRAM.
The Investigation
Baseline: RTX 4070 (12GB VRAM)
| Model | Quant | Tokens/sec | GPU Offload |
|---|---|---|---|
| qwen3:8b | Q4_K_M | 73.1 | 100% |
| qwen3:14b | Q4_K_M | 42.3 | 100% |
| qwen3:32b | Q4_K_M | 18.7 | ~60% |
RTX 5060 — Before Optimization
| Model | Quant | Tokens/sec | GPU Offload |
|---|---|---|---|
| qwen3:8b | Q4_K_M | 38.2 | ~72% |
The Q4_K_M quantization of the 8B model uses ~5.2GB for weights alone. With the KV cache, metadata, and CUDA overhead, total VRAM demand exceeded 8GB, so the runtime kept the overflow in system RAM and ran those layers on the CPU, cutting throughput roughly in half.
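The budget is easy to sanity-check with rough numbers. The sketch below estimates the KV cache size; the model dimensions (36 layers, 8 KV heads under grouped-query attention, head dimension 128) are assumptions taken from the Qwen3-8B model card, and the overhead figure is a rough guess, not a measurement:

```python
# Back-of-the-envelope VRAM budget for qwen3:8b Q4_K_M on an 8GB card.
# Model dimensions are assumed from the Qwen3-8B model card.
n_layers, n_kv_heads, head_dim = 36, 8, 128
fp16_bytes = 2
ctx = 8192  # context length in tokens

# The KV cache stores one K and one V vector per layer per token.
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes
kv_gb = kv_per_token * ctx / 1024**3

weights_gb = 5.2   # Q4_K_M weight file (from the table above)
overhead_gb = 1.2  # CUDA context + compute buffers (rough guess)

total_gb = weights_gb + kv_gb + overhead_gb
print(f"KV cache: {kv_gb:.2f} GB, total demand: {total_gb:.2f} GB")
```

That lands around 7.5GB of demand — but an 8GB card on Windows typically has well under 8GB usable once the OS and display take their share, which is exactly where the spilling starts.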
The Fix: IQ4_XS Quantization
IQ4_XS is a more aggressive quantization format that reduces model weights by roughly 15% compared to Q4_K_M while maintaining nearly identical perplexity. For the 8B model, this brought weights down to ~4.4GB — leaving enough VRAM headroom for the KV cache to stay fully GPU-resident.
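For reference, re-quantizing to IQ4_XS is a one-step operation with llama.cpp's quantize tool. The sketch below just assembles the invocation; the file paths are illustrative, and you should check your llama.cpp build for the exact tool name (`llama-quantize` in recent builds):

```python
import shlex

# Sketch of the llama.cpp re-quantization step that produces an
# IQ4_XS GGUF from a full-precision source. Paths are hypothetical.
cmd = [
    "llama-quantize",
    "Qwen3-8B-F16.gguf",     # full-precision source GGUF (hypothetical path)
    "Qwen3-8B-IQ4_XS.gguf",  # quantized output file
    "IQ4_XS",                # target quantization type
]
print(shlex.join(cmd))
```

The resulting GGUF can then be imported into Ollama via a Modelfile that points at the local file.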
RTX 5060 — After IQ4_XS
| Model | Quant | Tokens/sec | GPU Offload |
|---|---|---|---|
| qwen3:8b | IQ4_XS | 67.3 | 100% |
67.3 t/s — within 8% of the 4070's 73.1 t/s. The remaining gap is architectural: the 4070 has 50% more CUDA cores and higher memory bandwidth. But the 5060 is now pulling its weight as a viable subagent inference host.
Failed Experiment: SGLang + NVFP4
We also tested SGLang with NVIDIA's native FP4 quantization (NVFP4), hoping Blackwell's hardware FP4 support would unlock better performance. The NVFP4 model alone required 7.24GB VRAM — leaving almost nothing for the KV cache. It OOM'd immediately on any non-trivial prompt.
NVFP4 is designed for cards with 16GB+ VRAM. On 8GB, it's not viable.
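The arithmetic makes the failure obvious. This estimate reuses assumed Qwen3-8B dimensions (36 layers, 8 KV heads, head dimension 128, fp16 cache entries) and a rough usable-VRAM figure, so treat the exact numbers as illustrative:

```python
# Why NVFP4 OOMs on an 8GB card: almost nothing is left for the KV cache.
weights_gb = 7.24                    # NVFP4 model footprint (measured)
usable_vram_gb = 7.3                 # rough usable VRAM on an 8GB card
kv_per_token = 2 * 36 * 8 * 128 * 2  # bytes of K+V cache per token (assumed dims)

headroom_bytes = (usable_vram_gb - weights_gb) * 1024**3
max_ctx = int(headroom_bytes / kv_per_token)
print(f"Tokens of KV cache that fit: ~{max_ctx}")
```

A few hundred tokens of cache, before even accounting for activation and compute buffers — hence the immediate OOM on any real prompt.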
Key Takeaways
- VRAM is king for local LLM inference. Architecture and clock speed are secondary to whether the model fits entirely in GPU memory.
- Quantization format matters on VRAM-constrained cards. IQ4_XS vs Q4_K_M was the difference between 38 t/s and 67 t/s — same model, same GPU.
- 8GB VRAM is the new minimum, not the sweet spot. It works for 8B models with careful quantization, but 12GB gives dramatically more flexibility.
- Check your GPU offload ratio before blaming software. If Task Manager shows CPU involvement during inference, your model is spilling.
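That last check can be scripted rather than eyeballed in Task Manager. The sketch below parses the PROCESSOR field that `ollama ps` reports; the field formats shown ("100% GPU", "100% CPU", split forms like "28%/72% CPU/GPU") are assumptions that may vary across Ollama versions:

```python
# Flag models that are spilling to CPU, based on the PROCESSOR field
# reported by `ollama ps`. Assumed formats: "100% GPU", "100% CPU",
# or a split like "28%/72% CPU/GPU".
def gpu_fraction(processor: str) -> float:
    """Return the fraction of the model resident on the GPU."""
    percents, devices = processor.split(" ", 1)  # e.g. "28%/72%", "CPU/GPU"
    values = [float(p.rstrip("%")) / 100 for p in percents.split("/")]
    names = devices.split("/")
    return dict(zip(names, values)).get("GPU", 0.0)

for field in ["100% GPU", "28%/72% CPU/GPU", "100% CPU"]:
    frac = gpu_fraction(field)
    status = "OK" if frac == 1.0 else "SPILLING"
    print(f"{field!r}: GPU {frac:.0%} -> {status}")
```

Anything below 100% GPU means throughput is already degraded, and the fix is a smaller quant, a shorter context, or a bigger card.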
Current Deployment
FastForward now runs as a dedicated subagent inference host in our AI swarm, serving IQ4_XS-quantized models at 67 t/s. BigBrain (RTX 4070) remains the primary agent model host at 73 t/s with room for larger models and longer contexts.