The Problem
FastForward — our new RTX 5060 (Blackwell, 8GB VRAM) inference host — was consistently underperforming BigBrain's RTX 4070 (Ada, 12GB VRAM) on identical models. The 5060 is the newer architecture, and while it's a lower-tier card than the 4070, it shouldn't have been running at half speed, so something was wrong.
Initial benchmarks showed the 5060 running at roughly half the 4070's speed on the same Qwen 3 8B model. Task Manager revealed the root cause: only ~72% of the model was offloaded to the GPU, with the remaining 28% spilling to the CPU. The model's weights plus KV cache were overflowing the 8GB of VRAM.
The Investigation
Baseline: RTX 4070 (12GB VRAM)
| Model | Quant | Tokens/sec | GPU Offload |
|---|---|---|---|
| qwen3:8b | Q4_K_M | 73.1 | 100% |
| qwen3:14b | Q4_K_M | 42.3 | 100% |
| qwen3:32b | Q4_K_M | 18.7 | ~60% |
RTX 5060 — Before Optimization
| Model | Quant | Tokens/sec | GPU Offload |
|---|---|---|---|
| qwen3:8b | Q4_K_M | 38.2 | ~72% |
The Q4_K_M quantization of the 8B model uses ~5.2GB for weights alone. With the KV cache, metadata, and CUDA overhead, total VRAM demand exceeded 8GB, so the runtime kept the overflow in system RAM and ran those layers on the CPU, cutting throughput roughly in half.
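The budget is easy to sanity-check with rough numbers. The sketch below estimates the KV cache size; the model dimensions (36 layers, 8 KV heads under grouped-query attention, head dimension 128) are assumptions taken from the Qwen3-8B model card, and the overhead figure is a rough guess, not a measurement:

```python
# Back-of-the-envelope VRAM budget for qwen3:8b Q4_K_M on an 8GB card.
# Model dimensions are assumed from the Qwen3-8B model card.
n_layers, n_kv_heads, head_dim = 36, 8, 128
fp16_bytes = 2
ctx = 8192  # context length in tokens

# The KV cache stores one K and one V vector per layer per token.
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes
kv_gb = kv_per_token * ctx / 1024**3

weights_gb = 5.2   # Q4_K_M weight file (from the table above)
overhead_gb = 1.2  # CUDA context + compute buffers (rough guess)

total_gb = weights_gb + kv_gb + overhead_gb
print(f"KV cache: {kv_gb:.2f} GB, total demand: {total_gb:.2f} GB")
```

That lands around 7.5GB of demand — but an 8GB card on Windows typically has well under 8GB usable once the OS and display take their share, which is exactly where the spilling starts.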
The Fix: IQ4_XS Quantization
IQ4_XS is a more aggressive quantization format that reduces model weights by roughly 15% compared to Q4_K_M while maintaining nearly identical perplexity. For the 8B model, this brought weights down to ~4.4GB — leaving enough VRAM headroom for the KV cache to stay fully GPU-resident.
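For reference, re-quantizing to IQ4_XS is a one-step operation with llama.cpp's quantize tool. The sketch below just assembles the invocation; the file paths are illustrative, and you should check your llama.cpp build for the exact tool name (`llama-quantize` in recent builds):

```python
import shlex

# Sketch of the llama.cpp re-quantization step that produces an
# IQ4_XS GGUF from a full-precision source. Paths are hypothetical.
cmd = [
    "llama-quantize",
    "Qwen3-8B-F16.gguf",     # full-precision source GGUF (hypothetical path)
    "Qwen3-8B-IQ4_XS.gguf",  # quantized output file
    "IQ4_XS",                # target quantization type
]
print(shlex.join(cmd))
```

The resulting GGUF can then be imported into Ollama via a Modelfile that points at the local file.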
RTX 5060 — After IQ4_XS
| Model | Quant | Tokens/sec | GPU Offload |
|---|---|---|---|
| qwen3:8b | IQ4_XS | 67.3 | 100% |
67.3 t/s — within 8% of the 4070's 73.1 t/s. The remaining gap is architectural: the 4070 has 50% more CUDA cores and higher memory bandwidth. But the 5060 is now pulling its weight as a viable subagent inference host.
Failed Experiment: SGLang + NVFP4
We also tested SGLang with NVIDIA's native FP4 quantization (NVFP4), hoping Blackwell's hardware FP4 support would unlock better performance. The NVFP4 model alone required 7.24GB VRAM — leaving almost nothing for the KV cache. It OOM'd immediately on any non-trivial prompt.
NVFP4 is designed for cards with 16GB+ VRAM. On 8GB, it's not viable.
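The arithmetic makes the failure obvious. This estimate reuses assumed Qwen3-8B dimensions (36 layers, 8 KV heads, head dimension 128, fp16 cache entries) and a rough usable-VRAM figure, so treat the exact numbers as illustrative:

```python
# Why NVFP4 OOMs on an 8GB card: almost nothing is left for the KV cache.
weights_gb = 7.24                    # NVFP4 model footprint (measured)
usable_vram_gb = 7.3                 # rough usable VRAM on an 8GB card
kv_per_token = 2 * 36 * 8 * 128 * 2  # bytes of K+V cache per token (assumed dims)

headroom_bytes = (usable_vram_gb - weights_gb) * 1024**3
max_ctx = int(headroom_bytes / kv_per_token)
print(f"Tokens of KV cache that fit: ~{max_ctx}")
```

A few hundred tokens of cache, before even accounting for activation and compute buffers — hence the immediate OOM on any real prompt.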
Key Takeaways
- VRAM is king for local LLM inference. Architecture and clock speed are secondary to whether the model fits entirely in GPU memory.
- Quantization format matters on VRAM-constrained cards. IQ4_XS vs Q4_K_M was the difference between 38 t/s and 67 t/s — same model, same GPU.
- 8GB VRAM is the new minimum, not the sweet spot. It works for 8B models with careful quantization, but 12GB gives dramatically more flexibility.
- Check your GPU offload ratio before blaming software. If Task Manager shows CPU involvement during inference, your model is spilling.
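That last check can be scripted rather than eyeballed in Task Manager. The sketch below parses the PROCESSOR field that `ollama ps` reports; the field formats shown ("100% GPU", "100% CPU", split forms like "28%/72% CPU/GPU") are assumptions that may vary across Ollama versions:

```python
# Flag models that are spilling to CPU, based on the PROCESSOR field
# reported by `ollama ps`. Assumed formats: "100% GPU", "100% CPU",
# or a split like "28%/72% CPU/GPU".
def gpu_fraction(processor: str) -> float:
    """Return the fraction of the model resident on the GPU."""
    percents, devices = processor.split(" ", 1)  # e.g. "28%/72%", "CPU/GPU"
    values = [float(p.rstrip("%")) / 100 for p in percents.split("/")]
    names = devices.split("/")
    return dict(zip(names, values)).get("GPU", 0.0)

for field in ["100% GPU", "28%/72% CPU/GPU", "100% CPU"]:
    frac = gpu_fraction(field)
    status = "OK" if frac == 1.0 else "SPILLING"
    print(f"{field!r}: GPU {frac:.0%} -> {status}")
```

Anything below 100% GPU means throughput is already degraded, and the fix is a smaller quant, a shorter context, or a bigger card.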
Current Deployment
FastForward now runs as a dedicated subagent inference host in our AI swarm, serving IQ4_XS-quantized models at 67 t/s. BigBrain (RTX 4070) remains the primary agent model host at 73 t/s with room for larger models and longer contexts.