RTX 5060 Blackwell vs RTX 4070: LLM Inference Benchmarks

The Problem

FastForward, our new RTX 5060 (Blackwell, 8GB VRAM) inference host, was consistently underperforming BigBrain's RTX 4070 (Ada, 12GB VRAM) on identical models. The 5060 is the newer architecture, so while some gap against the higher-tier 4070 was expected, a 2x deficit meant something was wrong.

Initial benchmarks showed the 5060 running at roughly half the 4070's speed on the same Qwen3 8B model. Task Manager revealed the root cause: only ~72% of inference was running on the GPU, with the remaining 28% spilling to the CPU. The model's weights plus KV cache were overflowing the card's 8GB of VRAM.

The Investigation

Baseline: RTX 4070 (12GB VRAM)

| Model     | Quant  | Tokens/sec | GPU Offload |
|-----------|--------|------------|-------------|
| qwen3:8b  | Q4_K_M | 73.1       | 100%        |
| qwen3:14b | Q4_K_M | 42.3       | 100%        |
| qwen3:32b | Q4_K_M | 18.7       | ~60%        |
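
For reproducibility: the post never names the serving stack, but the qwen3:8b tags suggest Ollama. Assuming that, here is a minimal sketch that pulls throughput straight from Ollama's /api/generate response, which reports eval_count (tokens generated) and eval_duration (decode time in nanoseconds):

```python
# Minimal throughput probe against a local Ollama server (an assumption:
# the post never names the stack, but the qwen3:8b tags suggest Ollama).
import requests  # pip install requests

def tokens_per_sec(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated, eval_duration = decode time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

for model in ("qwen3:8b", "qwen3:14b", "qwen3:32b"):
    rate = tokens_per_sec(model, "Explain KV caches in 200 words.")
    print(f"{model}: {rate:.1f} t/s")
```

A single run is noisy, so numbers like the table above should come from averaging several prompts of pinned length.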

RTX 5060 — Before Optimization

| Model    | Quant  | Tokens/sec | GPU Offload |
|----------|--------|------------|-------------|
| qwen3:8b | Q4_K_M | 38.2       | ~72%        |

The Q4_K_M quantization of the 8B model uses ~5.2GB for weights alone. With the KV cache, metadata, and CUDA overhead, total VRAM demand exceeded 8GB. The overflow spilled to system RAM, cutting throughput in half.
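
To make that budget concrete, a back-of-envelope estimator follows. The Qwen3-8B shape (36 layers, 8 KV heads, head dim 128, FP16 KV cache) and the ~1GB runtime overhead are assumptions for illustration; check the model card and your runtime's logs for exact figures.

```python
# Back-of-envelope VRAM budget: weights + KV cache + runtime overhead.
# Model shape and overhead below are assumptions, not measured values.

def kv_bytes_per_token(n_layers=36, n_kv_heads=8, head_dim=128, elem_bytes=2):
    # one K and one V vector per layer, FP16 elements (~144 KB/token here)
    return 2 * n_layers * n_kv_heads * head_dim * elem_bytes

def vram_demand_gb(weights_gb, context_len, overhead_gb=1.0):
    kv_gb = kv_bytes_per_token() * context_len / 2**30
    return weights_gb + kv_gb + overhead_gb

for quant, weights_gb in (("Q4_K_M", 5.2), ("IQ4_XS", 4.4)):
    for ctx in (8192, 16384):
        print(f"{quant} @ {ctx:>5} ctx: {vram_demand_gb(weights_gb, ctx):.2f} GB")
```

On Windows, the desktop compositor itself also holds a few hundred MB of VRAM, so the usable budget on an 8GB card is closer to 7 to 7.5GB, and the Q4_K_M build tips over it well before the advertised capacity.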

The Fix: IQ4_XS Quantization

IQ4_XS is a more aggressive quantization format that reduces model weights by roughly 15% compared to Q4_K_M while maintaining nearly identical perplexity. For the 8B model, this brought weights down to ~4.4GB — leaving enough VRAM headroom for the KV cache to stay fully GPU-resident.
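
For anyone reproducing the fix, llama.cpp's quantize tool performs the conversion. A minimal sketch with hypothetical file names (older llama.cpp builds name the binary quantize instead of llama-quantize; IQ-series quants also benefit from, but do not strictly require, an importance matrix generated with llama-imatrix):

```python
# Convert an FP16 GGUF to IQ4_XS with llama.cpp's quantize tool.
# File names are placeholders; llama-quantize must be on PATH.
import subprocess

subprocess.run(
    [
        "llama-quantize",
        "qwen3-8b-f16.gguf",     # hypothetical FP16 source GGUF
        "qwen3-8b-iq4_xs.gguf",  # output file
        "IQ4_XS",                # target quantization type
    ],
    check=True,
)
```

The resulting GGUF can then be loaded by an Ollama-style stack via a Modelfile (FROM ./qwen3-8b-iq4_xs.gguf), or pulled prequantized from community model repos.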

RTX 5060 — After IQ4_XS

| Model    | Quant  | Tokens/sec | GPU Offload |
|----------|--------|------------|-------------|
| qwen3:8b | IQ4_XS | 67.3       | 100%        |

67.3 t/s, within 8% of the 4070's 73.1 t/s. The remaining gap is architectural: the 4070 has roughly 50% more CUDA cores and about 12% more memory bandwidth (~504 GB/s vs ~448 GB/s), and token generation is largely memory-bandwidth-bound, so a single-digit deficit is about what the hardware predicts. The 5060 is now pulling its weight as a viable subagent inference host.

Failed Experiment: SGLang + NVFP4

We also tested SGLang with NVIDIA's native FP4 quantization (NVFP4), hoping Blackwell's hardware FP4 support would unlock better performance. The NVFP4 model alone required 7.24GB VRAM — leaving almost nothing for the KV cache. It OOM'd immediately on any non-trivial prompt.
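
A quick headroom check, reusing the ~144KB-per-token FP16 KV estimate from earlier and assuming ~0.5GB of CUDA context overhead (both illustrative figures, not measurements), shows why:

```python
# Why a 7.24 GB NVFP4 checkpoint OOMs on an 8 GB card.
# The 0.5 GB CUDA context and 144 KB/token KV figures are assumptions.
total_gb = 8.0
weights_gb = 7.24        # measured NVFP4 model size
cuda_ctx_gb = 0.5        # assumed runtime overhead

headroom_gb = total_gb - weights_gb - cuda_ctx_gb
kv_mb_per_token = 0.144  # FP16 KV cache, Qwen3-8B shape assumed above

max_ctx = headroom_gb * 1024 / kv_mb_per_token
print(f"{headroom_gb:.2f} GB headroom ~ {max_ctx:.0f} tokens of KV cache, at best")
```

Roughly 1,800 tokens of KV cache at best, before activations and compute buffers; and since SGLang preallocates its KV pool up front, the failure surfaces immediately rather than degrading gracefully.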

NVFP4 is designed for cards with 16GB+ VRAM. On 8GB, it's not viable.

Key Takeaways

  • VRAM is king for local LLM inference. Architecture and clock speed are secondary to whether the model fits entirely in GPU memory.
  • Quantization format matters on VRAM-constrained cards. IQ4_XS vs Q4_K_M was the difference between 38 t/s and 67 t/s — same model, same GPU.
  • 8GB VRAM is the new minimum, not the sweet spot. It works for 8B models with careful quantization, but 12GB gives dramatically more flexibility.
  • Check your GPU offload ratio before blaming software. If Task Manager shows CPU involvement during inference, your model is spilling; see the monitoring sketch below.
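
On that last point: Task Manager works, but the check is easy to script. Below is a minimal sketch using NVML through the nvidia-ml-py package; the 90% alert threshold is an arbitrary choice, and on an Ollama stack, ollama ps prints the CPU/GPU split directly.

```python
# VRAM pressure check via NVML (pip install nvidia-ml-py).
# High VRAM use plus low tokens/sec is the classic spill signature.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
used_gb, total_gb = mem.used / 2**30, mem.total / 2**30

print(f"VRAM: {used_gb:.2f} / {total_gb:.2f} GB ({used_gb / total_gb:.0%})")
if used_gb / total_gb > 0.9:  # arbitrary alert threshold
    print("VRAM nearly full: layers or KV cache may be spilling to system RAM")
pynvml.nvmlShutdown()
```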

Current Deployment

FastForward now runs as a dedicated subagent inference host in our AI swarm, serving IQ4_XS-quantized models at 67 t/s. BigBrain (RTX 4070) remains the primary agent model host at 73 t/s with room for larger models and longer contexts.