
Prefetching Weights in llama.cpp

This post is about a PR to llama.cpp that adds support for prefetching weights, overlapping the compute of the current layer with the weight transfer for the next layer. This is useful for memory-constrained environments where the model weights don't fit in VRAM.

This is the basic setup:

Timeline --------------------------------------->
[Compute Layer 1]       [Compute Layer 2]       ....
   [Prefetch Weights 2]    [Prefetch Weights 3] ....

The implementation changes ggml's backend scheduler to leverage the async copy engine on whichever backends support it.
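To make the mechanism concrete, here is a minimal sketch of how such an overlap can be built with CUDA streams and events. This is not the actual ggml scheduler code: `run_layers`, `compute_layer`, `host_weights`, `dev_weights`, and `bytes` are hypothetical placeholders, and the host buffers are assumed to be pinned so the copy engine can actually run asynchronously.

```cuda
// Minimal sketch of compute/transfer overlap -- NOT the ggml scheduler code.
// Assumptions: host_weights[] are pinned (cudaMallocHost) buffers,
// dev_weights[0..1] are two device buffers large enough for any layer, and
// compute_layer() is a hypothetical function launching layer n's kernels.
#include <cuda_runtime.h>

void compute_layer(int n, const void *weights, cudaStream_t stream);

void run_layers(void **host_weights, void *dev_weights[2],
                const size_t *bytes, int n_layers) {
    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);

    cudaEvent_t weights_ready[2], compute_done[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreate(&weights_ready[i]);
        cudaEventCreate(&compute_done[i]);
    }

    // Upload layer 0 synchronously so the pipeline has something to start on.
    cudaMemcpy(dev_weights[0], host_weights[0], bytes[0],
               cudaMemcpyHostToDevice);

    for (int n = 0; n < n_layers; ++n) {
        const int cur = n & 1, nxt = (n + 1) & 1;

        if (n + 1 < n_layers) {
            // Don't overwrite the other buffer until layer n-1 (its last
            // reader) has finished computing.
            if (n >= 1)
                cudaStreamWaitEvent(copy_stream, compute_done[nxt], 0);
            // Prefetch layer n+1 on the copy stream, overlapping with
            // layer n's compute on the compute stream.
            cudaMemcpyAsync(dev_weights[nxt], host_weights[n + 1],
                            bytes[n + 1], cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(weights_ready[nxt], copy_stream);
        }

        // Layer n's compute must wait until its own weights have landed.
        if (n >= 1)
            cudaStreamWaitEvent(compute_stream, weights_ready[cur], 0);
        compute_layer(n, dev_weights[cur], compute_stream);
        cudaEventRecord(compute_done[cur], compute_stream);
    }
    cudaStreamSynchronize(compute_stream);
}
```

The core ingredients (a second stream, cudaMemcpyAsync, and events expressing the cross-stream dependencies) are the ones the post names; the real scheduler presumably has to generalize this across backends and graph splits.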

Performance Analysis

Dense Models

For dense models, the optimization is relatively straightforward. Let Cn be the time taken to compute the nth layer on the GPU, and Tn+1 the time to transfer layer n+1's weights from the CPU to the GPU.

If you choose to override some weights to the CPU, prefetching overlaps Tn+1 with Cn, whereas currently everything happens sequentially. So if Cn > Tn+1 we can "hide" the transfer latency entirely and have it behave as if the weights were fully resident on the GPU. When Cn >> Tn+1, though, Cn dominates either way, so the optimization matters less. On the other hand, decreasing Tn+1 is only possible with newer hardware like PCIe Gen5 or with a lower-bpw quantization.
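As a back-of-envelope illustration (every number here is my own assumption, not a measurement from the PR):

```latex
% Assume ~25 GB/s of effective host-to-device bandwidth (PCIe Gen4 x16,
% ~32 GB/s theoretical) and ~200 MB of CPU-resident FFN weights per layer:
T_{n+1} \approx \frac{200~\text{MB}}{25~\text{GB/s}} = 8~\text{ms}
% The transfer is fully hidden iff C_n \ge 8 ms; otherwise the GPU
% stalls for roughly T_{n+1} - C_n per layer.
```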

There are two natural dimensions along which we can increase Cn without increasing Tn+1: the ubatch size and the KV-cache depth. Here is a graph using a relatively recent model; since it uses linear attention, compute doesn't grow with depth as quickly as it would for a quadratic-attention model. We override ffn_(gate|up|down).* to the CPU, which accounts for the bulk of the weights in each layer. Prefetching helps at all batch sizes, but the gap narrows at higher batch sizes (since Cn starts to dominate).

llama-bench -m /opt/models/Qwen3.5-27B-Q4_K_M.gguf -fa 1 -p 2048 -ub 512,1024,2048 -d 0,10000,20000,30000,40000,50000 -n 0 -ot "ffn_(gate|up|down).*=CPU" -pw 0,1 --mmap 0

[Graph: prompt processing speed vs. context depth at ubatch 512/1024/2048, with and without prefetching]

MoE models

For MoE models, the situation is different because of selective copying of experts (added in #15346), where only the experts actually selected for a ubatch are copied to the GPU. This is a massive improvement at smaller ubatch sizes, and prefetching naturally cannot replicate it, since the experts for the next layer aren't known until that layer's router runs. The gap widens for larger MoE models with more experts and larger inner dims, so prefetching will likely be slower for large MoE models unless they are deep (more layers) rather than wide (larger expert dims).

The condition for prefetching to be beneficial becomes Cn + Tc > Tn+1, where Tc is the time to transfer the selected experts for the nth layer. The same reasoning applies: scaling Cn makes prefetching more beneficial, but increasing the ubatch size also scales Tc, since a larger ubatch raises the expected number of experts used. In the graph below we see comparable performance at ubatch=512, and much better performance at 1024 and 2048. Since a 50k context fully fits on this GPU, I also added the theoretical maximum of fully offloading to the GPU.
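To see why Tc grows with ubatch size, here is a hedged back-of-envelope assuming independent, uniform top-k routing (real routers are far from uniform, so treat this as a rough upper bound):

```latex
% E = total experts per layer, k = active experts per token, u = ubatch size.
\mathbb{E}[\text{distinct experts used}] = E \left( 1 - \left(1 - \tfrac{k}{E}\right)^{u} \right)
% Example with E = 32, k = 4 (gpt-oss-20b's configuration):
%   u = 1  -> 4 experts
%   u = 32 -> ~31.6 experts, i.e. nearly the whole layer,
% so Tc approaches the full-layer transfer time and the advantage of
% selective copying over prefetching shrinks.
```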

llama-bench -m /opt/models/gpt_oss-20b-mxfp4.gguf -fa 1 -p 2048 -ub 512,1024,2048 -d 0,10000,20000,30000,40000,50000 -n 0 -ncmoe 999 -pw 0,1 --mmap 0

[Graph: gpt-oss-20b prompt processing speed vs. context depth at ubatch 512/1024/2048, with and without prefetching, plus the fully-offloaded upper bound]

Summary

Hiding memory latency behind compute is a classic technique in performance optimization; this is the same idea applied at the compute-graph level. CUDA's event API and async copy engine make it possible.