
Prefetching Weights in llama.cpp

This post is about a PR to llama.cpp that adds support for prefetching weights, overlapping the compute of the current layer with the weight transfer for the next layer. This is useful for memory-constrained environments where the model weights don't fit in VRAM.

This is the basic setup:

Timeline --------------------------------------->
[Compute Layer 1]       [Compute Layer 2]       ....
   [Prefetch Weights 2]    [Prefetch Weights 3] ....

The implementation changes ggml's backend scheduler to leverage the async copy engine on whichever backends support it.
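To make the mechanism concrete, here is a minimal sketch of how such an overlap can be built with CUDA streams and events. This is not the actual ggml scheduler code: `run_layers`, `compute_layer`, `host_weights`, `dev_weights`, and `bytes` are hypothetical placeholders, and the host buffers are assumed to be pinned so the copy engine can actually run asynchronously.

```cuda
// Minimal sketch of compute/transfer overlap -- NOT the ggml scheduler code.
// Assumptions: host_weights[] are pinned (cudaMallocHost) buffers,
// dev_weights[0..1] are two device buffers large enough for any layer, and
// compute_layer() is a hypothetical function launching layer n's kernels.
#include <cuda_runtime.h>

void compute_layer(int n, const void *weights, cudaStream_t stream);

void run_layers(void **host_weights, void *dev_weights[2],
                const size_t *bytes, int n_layers) {
    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);

    cudaEvent_t weights_ready[2], compute_done[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreate(&weights_ready[i]);
        cudaEventCreate(&compute_done[i]);
    }

    // Upload layer 0 synchronously so the pipeline has something to start on.
    cudaMemcpy(dev_weights[0], host_weights[0], bytes[0],
               cudaMemcpyHostToDevice);

    for (int n = 0; n < n_layers; ++n) {
        const int cur = n & 1, nxt = (n + 1) & 1;

        if (n + 1 < n_layers) {
            // Don't overwrite the other buffer until layer n-1 (its last
            // reader) has finished computing.
            if (n >= 1)
                cudaStreamWaitEvent(copy_stream, compute_done[nxt], 0);
            // Prefetch layer n+1 on the copy stream, overlapping with
            // layer n's compute on the compute stream.
            cudaMemcpyAsync(dev_weights[nxt], host_weights[n + 1],
                            bytes[n + 1], cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(weights_ready[nxt], copy_stream);
        }

        // Layer n's compute must wait until its own weights have landed.
        if (n >= 1)
            cudaStreamWaitEvent(compute_stream, weights_ready[cur], 0);
        compute_layer(n, dev_weights[cur], compute_stream);
        cudaEventRecord(compute_done[cur], compute_stream);
    }
    cudaStreamSynchronize(compute_stream);
}
```

The core ingredients (a second stream, cudaMemcpyAsync, and events expressing the cross-stream dependencies) are the ones the post names; the real scheduler presumably has to generalize this across backends and graph splits.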

Performance Analysis

Dense Models

For dense models, the optimization is relatively straightforward. Let Cn be the time taken to compute the nth layer on the GPU, and Tn+1 the time to transfer layer n+1's weights from the CPU to the GPU.

If you choose to override some weights to the CPU, prefetching overlaps Tn+1 with Cn, whereas currently everything happens sequentially. So if Cn > Tn+1 we can "hide" the transfer latency entirely and have it behave as if the weights were fully resident on the GPU. When Cn >> Tn+1, though, Cn dominates either way, so the optimization matters less. On the other hand, decreasing Tn+1 is only possible with newer hardware like PCIe Gen5 or with a lower-bpw quantization.
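As a back-of-envelope illustration (every number here is my own assumption, not a measurement from the PR):

```latex
% Assume ~25 GB/s of effective host-to-device bandwidth (PCIe Gen4 x16,
% ~32 GB/s theoretical) and ~200 MB of CPU-resident FFN weights per layer:
T_{n+1} \approx \frac{200~\text{MB}}{25~\text{GB/s}} = 8~\text{ms}
% The transfer is fully hidden iff C_n \ge 8 ms; otherwise the GPU
% stalls for roughly T_{n+1} - C_n per layer.
```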

There are two natural dimensions along which we can increase Cn without increasing Tn+1: the ubatch size and the KV-cache depth. Here is a graph using a relatively recent model; since it uses linear attention, compute doesn't grow with depth as quickly as it would for a quadratic-attention model. We override ffn_(gate|up|down).* to the CPU, which accounts for the bulk of the weights in each layer. Prefetching helps at all batch sizes, but the gap narrows at higher batch sizes (since Cn starts to dominate).

llama-bench -m /opt/models/Qwen3.5-27B-Q4_K_M.gguf -fa 1 -p 2048 -ub 512,1024,2048 -d 0,10000,20000,30000,40000,50000 -n 0 -ot "ffn_(gate|up|down).*=CPU" -pw 0,1 --mmap 0

[Graph: prompt processing speed vs. context depth at ubatch 512/1024/2048, with and without prefetching]

MoE models

For MoE models, the situation is different because of selective copying of experts (added in #15346), where only the experts actually selected for a ubatch are copied to the GPU. This is a massive improvement at smaller ubatch sizes, and prefetching naturally cannot replicate it, since the experts for the next layer aren't known until that layer's router runs. The gap widens for larger MoE models with more experts and larger inner dims, so prefetching will likely be slower for large MoE models unless they are deep (more layers) rather than wide (larger expert dims).

The condition for prefetching to be beneficial becomes Cn + Tc > Tn+1, where Tc is the time to transfer the selected experts for the nth layer. The same reasoning applies: scaling Cn makes prefetching more beneficial, but increasing the ubatch size also scales Tc, since a larger ubatch raises the expected number of experts used. In the graph below we see comparable performance at ubatch=512, and much better performance at 1024 and 2048. Since a 50k context fully fits on this GPU, I also added the theoretical maximum of fully offloading to the GPU.
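To see why Tc grows with ubatch size, here is a hedged back-of-envelope assuming independent, uniform top-k routing (real routers are far from uniform, so treat this as a rough upper bound):

```latex
% E = total experts per layer, k = active experts per token, u = ubatch size.
\mathbb{E}[\text{distinct experts used}] = E \left( 1 - \left(1 - \tfrac{k}{E}\right)^{u} \right)
% Example with E = 32, k = 4 (gpt-oss-20b's configuration):
%   u = 1  -> 4 experts
%   u = 32 -> ~31.6 experts, i.e. nearly the whole layer,
% so Tc approaches the full-layer transfer time and the advantage of
% selective copying over prefetching shrinks.
```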

llama-bench -m /opt/models/gpt_oss-20b-mxfp4.gguf -fa 1 -p 2048 -ub 512,1024,2048 -d 0,10000,20000,30000,40000,50000 -n 0 -ncmoe 999 -pw 0,1 --mmap 0

[Graph: gpt-oss-20b prompt processing speed vs. context depth at ubatch 512/1024/2048, with and without prefetching, plus the fully-offloaded upper bound]

Summary

Hiding memory latency behind compute is a classic technique in performance optimization; this is the same idea applied at the compute-graph level. CUDA's event API and async copy engine make it possible.