Aman's blog

Optimizing Token Generation in llama.cpp's CUDA Backend

Note: this is a copy of the discussion here

LLM inference is divided into two phases: prompt processing (PP) and token generation (TG). Prompt processing is when the entire prompt is fed through the model, whereas token generation is when the model starts outputting one token at a time. These are different workloads: for decoder-only auto-regressive LLMs, PP is compute-bound and TG is memory-bound.

This post covers optimizations for TG using kernel fusion and concurrent streams in the CUDA backend of llama.cpp. Other backends also do fusion, but differently; the rest of this post assumes the CUDA backend. The post is lightly technical, with links to PRs for people who want to go into further detail.

I ran tests on an RTX 4090 and an RTX 5090, with these optimizations disabled as the baseline.

GGML_CUDA_GRAPH_OPT=1 ./llama-bench -m qwen3_30b_qk_m.gguf,gpt_oss_20b-mxfp4.gguf -fa 1 -p 0

| model                          | device   | size      | params  | backend | ngl | fa | test  | t/s            |
| ------------------------------ | -------- | --------- | ------- | ------- | --- | -- | ----- | -------------- |
| qwen3moe 30B.A3B Q4_K - Medium | RTX 5090 | 17.28 GiB | 30.53 B | CUDA    | 99  | 1  | tg128 | 352.06 ± 1.76  |
| gpt-oss 20B MXFP4 MoE          | RTX 5090 | 11.27 GiB | 20.91 B | CUDA    | 99  | 1  | tg128 | 419.14 ± 0.89  |
| qwen3moe 30B.A3B Q4_K - Medium | RTX 4090 | 17.28 GiB | 30.53 B | CUDA    | 99  | 1  | tg128 | 271.04 ± 0.65  |
| gpt-oss 20B MXFP4 MoE          | RTX 4090 | 11.27 GiB | 20.91 B | CUDA    | 99  | 1  | tg128 | 271.99 ± 0.53  |

vs. the baseline:

GGML_CUDA_DISABLE_FUSION=1 GGML_CUDA_GRAPH_OPT=0 ./llama-bench -m qwen3_30b_qk_m.gguf,gpt_oss_20b-mxfp4.gguf -fa 1 -p 0

| model                          | device   | size      | params  | backend | ngl | fa | test  | t/s            |
| ------------------------------ | -------- | --------- | ------- | ------- | --- | -- | ----- | -------------- |
| qwen3moe 30B.A3B Q4_K - Medium | RTX 5090 | 17.28 GiB | 30.53 B | CUDA    | 99  | 1  | tg128 | 246.96 ± 0.31  |
| gpt-oss 20B MXFP4 MoE          | RTX 5090 | 11.27 GiB | 20.91 B | CUDA    | 99  | 1  | tg128 | 329.23 ± 0.16  |
| qwen3moe 30B.A3B Q4_K - Medium | RTX 4090 | 17.28 GiB | 30.53 B | CUDA    | 99  | 1  | tg128 | 198.39 ± 0.05  |
| gpt-oss 20B MXFP4 MoE          | RTX 4090 | 11.27 GiB | 20.91 B | CUDA    | 99  | 1  | tg128 | 232.05 ± 0.02  |

According to my rough napkin math [1], the theoretical speed-of-light for gpt-oss-20b on a 5090 is ~520 t/s, and we are able to reach ~417 t/s (~80%), although the speedups become more modest as context size grows.
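
For reference, the napkin math reduces to the standard memory-bound ceiling (a hedged sketch; the exact byte accounting lives in the footnote): every generated token has to stream the model's active weights, plus the KV cache, from VRAM at least once, so

TG_max ≈ memory bandwidth / bytes read per token (active weights + KV cache)

Plugging the 5090's advertised ~1.79 TB/s against gpt-oss-20b's per-token read traffic lands in the ~520 t/s ballpark.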

Most of the optimizations are enabled by default in llama.cpp; concurrent streams are enabled via the environment flag GGML_CUDA_GRAPH_OPT=1. (Note that this particular optimization only works for a single GPU at the moment, and that it has nothing to do with CUDA Graphs.)

Kernel Fusion

Fusing kernels reduces memory traffic and kernel launch overhead. Since TG is memory-bound, kernel fusion helps it the most. From experience, fusion is most useful when memory bandwidth is high; conversely, the results are not as great when bandwidth is low.

Note that some fusions increase register pressure and generally lead to more complex kernels, so fusion is not always worth doing.

llama.cpp's backend is based on ggml, a low-level tensor library. ggml offers a lot of flexibility to backends via its cgraph (compute graph) interface. After upfront tensor allocation and initial graph ordering, it passes control to the individual backends to compute the graph. The graph itself is composed of ggml_tensors, which contain all the information required for fusion.

The backends are then responsible for fusion. The CUDA backend inspects the order of operations and decides whether to fuse a run of individual kernels into one. The pattern can be described in pseudo-code:

for (int i = 0; i < graph->n_nodes; ++i) {
    // does the sequence of nodes starting at i match a known pattern?
    if (can_fuse(graph, i, FUSION_PATTERN)) {
        // replace the matched nodes with a single fused kernel...
        replace_nodes_with_kernel(graph, i, FUSION);
        i += SKIP_NODES; // ...and skip the nodes we just absorbed
    }
}

In ggml-cuda, the relevant portions start from here

Looking at the order of nodes, one can identify opportunities for fusion. A sequence of operations that shows up in a lot of models is usually a good candidate. The following are the most effective fusions that the CUDA backend does at the moment:

GEMV fusion

Matrix-vector multiplication (GEMV) is used instead of matrix-matrix multiplication during TG, as we operate on one token at a time. We identify places where a MUL_MAT is followed by an ADD or, more interestingly, where there is a gated activation, i.e. the operation o = σ(W_gate·x) ⊙ (W_up·x), so we can re-use the activation x and multiply by both matrices. More details here
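
To make the re-use concrete, here is a minimal sketch of what a fused gated GEMV can look like, assuming FP32 weights, SiLU as σ, one 256-thread block per output row, and invented names throughout; the real llama.cpp kernels operate on quantized weights and are considerably more tuned:

#include <cuda_runtime.h>

// o = silu(W_gate·x) ⊙ (W_up·x): one thread block per output row.
// Assumes blockDim.x == 256 (a power of two) and row-major weights.
__global__ void gated_gemv_f32(const float * __restrict__ w_gate,
                               const float * __restrict__ w_up,
                               const float * __restrict__ x,
                               float * __restrict__ o,
                               int n_cols) {
    const int row = blockIdx.x;

    float sum_gate = 0.0f;
    float sum_up   = 0.0f;

    // Each load of x feeds BOTH dot products: this is the activation
    // re-use that makes fusing the two GEMVs worthwhile.
    for (int col = threadIdx.x; col < n_cols; col += blockDim.x) {
        const float xv = x[col];
        sum_gate += w_gate[row * n_cols + col] * xv;
        sum_up   += w_up  [row * n_cols + col] * xv;
    }

    // Block-wide tree reduction through shared memory (kept simple here).
    __shared__ float s_gate[256];
    __shared__ float s_up  [256];
    s_gate[threadIdx.x] = sum_gate;
    s_up  [threadIdx.x] = sum_up;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s_gate[threadIdx.x] += s_gate[threadIdx.x + stride];
            s_up  [threadIdx.x] += s_up  [threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Gating epilogue in-register: no intermediate tensors ever hit VRAM.
    if (threadIdx.x == 0) {
        const float g = s_gate[0];
        o[row] = (g / (1.0f + expf(-g))) * s_up[0]; // silu(g) * up
    }
}

// Launch (illustrative): gated_gemv_f32<<<n_rows, 256>>>(w_gate, w_up, x, o, n_cols);

Unfused, the gate and up GEMVs would each stream x separately and the σ(·)⊙(·) epilogue would be its own elementwise kernel, with the intermediate vectors making a round-trip through global memory.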

TopK-MoE

This is the common scoring algorithm in MoE models for selecting expert weights per token. It usually involves a softmax over the per-expert logits, followed by a top-k selection of the experts that feed the sparse MoE matrix multiplication (called MUL_MAT_ID in ggml). More details here
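
For clarity, here is the sequence the fused kernel replaces, written out as plain (unfused, scalar) C++; this is just the textbook softmax-then-top-k described above, not the actual ggml implementation:

#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Returns the k (expert index, routing weight) pairs for one token.
std::vector<std::pair<int, float>> topk_moe(const std::vector<float> & logits, int k) {
    // Softmax over the per-expert logits (subtract the max for stability).
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - mx);
        sum += probs[i];
    }

    // Top-k selection: indices of the k largest probabilities.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // The selected experts and their weights feed the sparse MUL_MAT_ID.
    std::vector<std::pair<int, float>> out;
    for (int i = 0; i < k; ++i) {
        out.emplace_back(idx[i], probs[idx[i]] / sum);
    }
    return out;
}

Since the number of experts is small, the whole thing fits comfortably in one kernel per token instead of a softmax kernel, an argsort/top-k kernel, and the glue in between.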

Fused RMS Norm

This is a relatively simple one: an RMS_NORM is usually followed by a MUL and, optionally, an ADD. We just do everything in the same kernel. More details here.
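
A minimal sketch of the fused pattern, again assuming FP32, one 256-thread block per row, and made-up names rather than the actual ggml kernel:

// y = (x / rms(x)) * w [+ b], i.e. RMS_NORM + MUL (+ ADD) in one pass.
// Assumes blockDim.x == 256 (a power of two); b may be nullptr.
__global__ void rms_norm_mul_add_f32(const float * __restrict__ x,
                                     const float * __restrict__ w,
                                     const float * __restrict__ b,
                                     float * __restrict__ y,
                                     int n, float eps) {
    const float * xr = x + (size_t) blockIdx.x * n; // this block's row
    float       * yr = y + (size_t) blockIdx.x * n;

    // Pass 1: sum of squares for the RMS.
    float sumsq = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        sumsq += xr[i] * xr[i];
    }

    __shared__ float s[256];
    s[threadIdx.x] = sumsq;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    const float scale = rsqrtf(s[0] / n + eps);

    // Pass 2: normalize, multiply by the weight, optionally add the bias,
    // all while x is still hot: one read of x, one write of y.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        float v = xr[i] * scale * w[i];
        if (b) v += b[i];
        yr[i] = v;
    }
}

Unfused, this is three kernels (RMS_NORM, MUL, ADD), each reading and writing the full row.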

Other fusions include fusing consecutive adds and fusing the softcap operation.
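
Softcap, for reference, is the scaled tanh some models apply to cap logits; unfused it appears in the graph as scale → tanh → scale, three elementwise kernels. A hedged sketch of the fused version (illustrative, not the ggml kernel):

// y = cap * tanh(x / cap): the two scales and the tanh in one kernel.
__global__ void softcap_f32(const float * __restrict__ x,
                            float * __restrict__ y,
                            int n, float cap) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = cap * tanhf(x[i] / cap);
    }
}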

Concurrent CUDA Streams

Note: at the moment, you need to supply the environment variable GGML_CUDA_GRAPH_OPT=1 to run this.

Since the query (Q), key (K), and value (V) projections in the attention block are independent of each other prior to calculating the attention scores, we can run them concurrently using CUDA streams. This is better explained with an NSight diagram.

[NSight timeline showing the Q, K, and V projections overlapping on separate streams]
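
On the host side, the fork/join this requires looks roughly like the sketch below; the launch_* helpers are hypothetical stand-ins for the real projection launches, and the actual scheduling in llama.cpp happens on the ggml graph rather than hand-written like this:

#include <cuda_runtime.h>

void launch_q_proj(cudaStream_t s);    // hypothetical: Q = W_q·x
void launch_k_proj(cudaStream_t s);    // hypothetical: K = W_k·x
void launch_v_proj(cudaStream_t s);    // hypothetical: V = W_v·x
void launch_attention(cudaStream_t s); // hypothetical: consumes Q, K, V

void attention_block(cudaStream_t s_main) {
    cudaStream_t s_k, s_v;
    cudaEvent_t  e_in, e_k, e_v;
    cudaStreamCreate(&s_k);
    cudaStreamCreate(&s_v);
    cudaEventCreateWithFlags(&e_in, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&e_k,  cudaEventDisableTiming);
    cudaEventCreateWithFlags(&e_v,  cudaEventDisableTiming);

    // Fork: the side streams must not start before the shared input
    // (the normalized activations) is ready on the main stream.
    cudaEventRecord(e_in, s_main);
    cudaStreamWaitEvent(s_k, e_in, 0);
    cudaStreamWaitEvent(s_v, e_in, 0);

    launch_q_proj(s_main); // Q stays on the main stream
    launch_k_proj(s_k);    // K and V overlap with Q
    launch_v_proj(s_v);

    // Join: attention needs all three projections to have finished.
    cudaEventRecord(e_k, s_k);
    cudaEventRecord(e_v, s_v);
    cudaStreamWaitEvent(s_main, e_k, 0);
    cudaStreamWaitEvent(s_main, e_v, 0);
    launch_attention(s_main);
    // (cleanup of streams/events omitted for brevity)
}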

The ggml graph allocator optimizes for memory footprint by reusing 'expired' buffers immediately. In a multi-stream context, 'expired' is ambiguous, leading to race conditions where Stream B overwrites Stream A's input. This is currently "solved" by interleaving the graph nodes to extend their lifetimes; a better long-term fix in ggml to allow for such scenarios is planned. More details here

Conclusion

Although none of these PRs increases TG by more than 10% on its own, taking them all together we get a nice speedup.

Acknowledgements

Johannes, Jeff, Georgi, Oliver, Sigbjørn

  1. speed-of-light (SoL) napkin math