Simple Loop for Auto-Anything using LLMs
Autoresearch is all the rage; it's like people have never heard of a while loop before. Here it is, not even simplified:
```rust
let mut baseline = measure();
let mut experiment_log = Log::new();
loop {
    llm.make_changes();
    let new_metric = measure();
    if new_metric.is_better_than(baseline) {
        baseline = new_metric;
    }
    experiment_log.record_experience();
}
```
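For concreteness, here is the same loop fleshed out into a runnable sketch. Everything here is a stand-in: `mutate` plays the LLM making changes, `measure` plays the benchmark, and a losing candidate is simply dropped (which is the "revert" the real setup gets from git). The toy objective in `main` is invented for illustration.

```rust
// Generic hill-climbing harness in the shape of the loop above.
// `mutate` proposes a change (the LLM), `measure` scores it (the benchmark);
// candidates that don't beat the baseline are discarded.
fn autotune<T: Clone>(
    mut state: T,
    iterations: usize,
    mutate: impl Fn(&T, usize) -> T,
    measure: impl Fn(&T) -> f64,
) -> (T, f64, Vec<f64>) {
    let mut baseline = measure(&state);
    let mut experiment_log = Vec::new();
    for i in 0..iterations {
        let candidate = mutate(&state, i);    // llm.make_changes()
        let new_metric = measure(&candidate); // measure()
        if new_metric > baseline {
            baseline = new_metric;            // keep the improvement
            state = candidate;
        }
        experiment_log.push(new_metric);      // record_experience()
    }
    (state, baseline, experiment_log)
}

// Deterministic stand-in for the LLM's "creativity": a pseudo-random step.
fn step(i: usize) -> f64 {
    ((i as u64).wrapping_mul(2654435761) % 1000) as f64 / 1000.0 - 0.5
}

fn main() {
    // Toy objective: one tuning knob, optimum at 3.0.
    let measure = |knob: &f64| -(knob - 3.0f64).powi(2);
    let (knob, best, log) = autotune(0.0, 100, |k, i| k + step(i), measure);
    println!("knob={knob:.3} best={best:.3} experiments={}", log.len());
}
```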
You don't even need to code this up: you can specify all of it in a prompt.
A few things to keep in mind:
- **Get out of the way**: your job is to set up the experimental harness and walk away. For me this usually means `--dangerously-skip-permissions` in Claude Code, because I can't be bothered to set up nuanced permissions.
- **`measure()` should be stable**: it can't be very noisy. For example, while profiling you need to make sure there are no external factors that will change the measurement (e.g. thermal throttling); the LLM has no way to know about these.
- **`measure()` should be fast**: the quicker the signal, the faster the agent can iterate.
- **The experiment log is useful**: if your context window fills up, the agent can reboot the experiments from the log.md.
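The stability point is worth a sketch. One simple shape for a noise-resistant `measure()`: repeat the benchmark, report the median, and refuse to return a number at all if the runs disagree too much. `run_once` is a hypothetical stand-in for whatever actually runs the benchmark (e.g. invoking llama-bench and parsing tokens/sec).

```rust
// Repeat the benchmark `repeats` times (repeats > 0), take the median,
// and bail out if the relative spread exceeds `max_rel_spread`.
fn measure(mut run_once: impl FnMut() -> f64, repeats: usize, max_rel_spread: f64) -> Option<f64> {
    let mut samples: Vec<f64> = (0..repeats).map(|_| run_once()).collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = samples[repeats / 2];
    let spread = (samples[repeats - 1] - samples[0]) / median;
    // If runs disagree by more than max_rel_spread, the signal is noise
    // (thermal throttling, background load, ...): better to return nothing
    // than to feed the agent a lie.
    (spread <= max_rel_spread).then_some(median)
}

fn main() {
    // Toy benchmark: ~67 t/s with a little deterministic jitter.
    let mut i = 0u64;
    let noisy = move || {
        i += 1;
        67.0 + 0.2 * ((i * 37 % 10) as f64 / 10.0 - 0.5)
    };
    match measure(noisy, 10, 0.02) {
        Some(t) => println!("stable: {t:.2} t/s"),
        None => println!("too noisy, fix the rig first"),
    }
}
```

Returning `None` instead of a shaky number matters: a hill-climber that trusts noisy measurements will happily "improve" on luck.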
Here is the prompt from a llama.cpp auto-tuning run I tried. The goal was to optimize token-generation performance for Qwen3.5 models.
We are going to do token-generation performance improvements in llama.cpp on the CUDA backend.
Here is the model we would like to optimize for: /opt/models/Qwen3.5-27B-Q4_K_M.gguf,
Here is the command that will print the performance:
- CUDA_VISIBLE_DEVICES=4 ./build/bin/llama-bench -m /opt/models/Qwen3.5-27B-Q4_K_M.gguf -n 128 -p 0 -fa 1
Use ncu to profile the kernel if you wish, keep a log.md of experiments you have tried and iterate on them. Outline ideas which have a measurable performance improvement (>1%)
Do this in a loop, do not ask for permissions regarding building the project, or reading a file, or running ncu or nsys.
Note: Outlandish ideas are okay! Hand-written PTX? Let's go! Weird kernel fusion? Also great. We're exploring the frontier here.
Here is its final experiment log after churning for 45 minutes. Although it didn't improve anything, I found out that we're spending 38% of the time on the CPU between tokens, which is quite a lot!
Claude's experiments log 🤖
# Token Generation Performance Experiments
## Setup
- GPU: NVIDIA GeForce RTX 5090, compute capability 12.0 (Blackwell), 32GB GDDR7
- Model: Qwen3.5-27B Q4_K_M (15.58 GiB, 26.90B params)
- Architecture: Hybrid attention/SSM (64 layers, full_attention_interval=4 → 16 attn + 48 SSM)
- hidden_dim=5120, FFN=17408, head_count=24, KV_heads=4, key/value_length=256
- SSM: gated_delta_net, inner_size=6144, state_size=128
- Command: `CUDA_VISIBLE_DEVICES=4 ./build/bin/llama-bench -m /opt/models/Qwen3.5-27B-Q4_K_M.gguf -n 128 -p 0 -fa 1`
- Build: c08d28d08 (8667), sm_120
## Baseline
- **67.04 ± 0.19 t/s** (tg128, fa=1, -r 10 for stability)
- Without CUDA graphs: 59.53 t/s (CUDA graphs provide 12.4% improvement)
- Without flash attention: 65.02 t/s (flash attn provides 3.0% improvement)
## Profile Analysis
### Kernel Time Breakdown (nsys, tg16 warmup tokens)
| Kernel | Time% | Instances | Avg(μs) | DRAM% |
|--------|-------|-----------|---------|-------|
| mul_mat_vec_q Q4_K fused | 40% | 1110 | 53/67 | 82.5% |
| mul_mat_vec_q Q5_K | 13% | 960 | 21 | 76% |
| mul_mat_vec_q Q6_K fused | 11% | 320 | 53 | 78% |
| mul_mat_vec_q Q4_K non-fused | 7% | 870 | 13 | 68% |
| mul_mat_vec_q Q6_K non-fused | 4% | 110 | 64 | 73% |
| rms_norm_f32(1024) | 3% | 1290 | 4.0 | - |
| quantize_q8_1 | 3% | 4330 | 1.1 | - |
| k_get_rows_float | 2% | 980 | 3.4 | - |
| concat_f32_dim0 | 2% | 480 | 6.9 | - |
| mul_mat_vec_q Q8_0 | 1% | 960 | 2.1 | 3.5% |
| flash_attn_ext_vec | 1% | 160 | 8.3 | - |
### Key ncu Metrics (Q4_K fused, 17408 blocks)
- DRAM throughput: **82.5% of peak** (near optimal for memory-bound kernel)
- L1 tex bytes: 624 MB (vs ~40 MB theoretical weight data)
- L2 bytes: 105 MB (2.6x theoretical, due to 144-byte Q4_K block vs 128-byte cache lines)
- Registers: 48/thread
- Occupancy: ~62.5% (10 blocks × 128 threads per SM)
- Block config: (32, 4, 1) = 128 threads, nwarps=4
### System-Level Analysis
- Graph compute: **1 call per token**, 3656 nodes per graph
- GPU kernel time per token: ~9.2ms
- Total time per token: ~14.9ms
- **GPU idle: ~38%** (CPU-side processing between tokens)
- CUDA API: 631 graph launches + 29,616 individual kernel launches (warmup only)
## Experiments
| # | Description | t/s | vs Base | Notes |
|---|------------|-----|---------|-------|
| 0 | Baseline (stock) | 67.04 | - | -r 10 average |
| 1 | nwarps=8 (Blackwell) | 65.81 | **-1.8%** | Shared memory reduction overhead > benefit |
| 2 | rows_per_block=2 | 66.68 | **-0.5%** | Register pressure from double accumulators |
| 3 | nwarps=2 | 66.89 | ~0% | Same memory-bound behavior |
| 4 | Forced small_k (rpb=nwarps=4) | 64.83 | **-3.3%** | Too much register pressure from 4 accumulators |
| 5 | PTX L2 prefetch in main loop | 66.66 | **-0.6%** | Hardware prefetcher already effective |
| 6 | Force MMQ path | 65.15 | **-2.8%** | MMVQ already optimal for batch=1 |
| 7 | Force cuBLAS | 64.91 | **-3.2%** | cuBLAS not competitive for quantized GEMV |
| 8 | No flash attention | 65.02 | **-3.0%** | FA is helping even for 16 attn layers |
| 9 | Disable CUDA graphs | 59.53 | **-11.2%** | Confirms graph benefit |
| 10 | q8_0 KV cache | 65.10 | **-2.9%** | KV cache tiny at seq_len=128, no benefit |
| 11 | q4_0 KV cache | 65.09 | **-2.9%** | Same as above |
## Analysis & Conclusions
### Why kernel optimizations don't help
The dominant Q4_K fused MMVQ kernel is already at **82.5% of peak DRAM bandwidth**. This leaves only ~18% headroom for kernel-level improvements. All attempted changes (warp count, block size, prefetching) either increased register pressure, added reduction overhead, or competed with the hardware's existing optimizations.
### Where the remaining performance is
1. **DRAM bandwidth gap (18%)**: Caused by Q4_K block size (144 bytes) not aligning with 128-byte cache lines, resulting in 2.6x L2 traffic overhead. Fixing this requires changing the quantization data layout.
2. **GPU idle time (38%)**: Caused by CPU-side processing between tokens (logit readback, sampling, graph preparation). This is an inference-engine level bottleneck, not a kernel bottleneck.
3. **Non-MMVQ kernels (20%)**: Small kernels (rms_norm, quantize_q8_1, concat, scale) that could be fused to reduce launch overhead and memory round-trips.
### Potential improvements requiring deeper changes
- **Kernel fusion**: Fuse rms_norm + quantize_q8_1 to eliminate memory round-trip (~1-2% potential)
- **q8_1 caching**: Cache quantized input across Q/K/V projections that share the same input (~1% potential)
- **Async token pipeline**: Overlap CPU token processing with GPU computation (~5-10% potential)
- **SoA data layout for Q4_K**: Separate scales/qs for cache-line alignment (~5% potential from reducing L2 waste)
- **NVFP4 format**: Native Blackwell FP4 support for ~2x bandwidth efficiency (major quality/format change)
- **Concurrent streams**: Overlap independent projections within attention layers
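Given the 38% idle figure, the "async token pipeline" idea above is the interesting one to me, so here is a toy model of its shape, using plain std threads and channels, nothing llama.cpp-specific. The millisecond sleeps are stand-ins loosely based on the profile numbers (~9ms GPU, ~6ms CPU per token). Note this toy deliberately ignores the data dependency of the sampled token feeding back into the next forward pass, which is exactly what makes the real optimization hard; the point is only that CPU-side work per token can overlap the next step's GPU compute.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Stand-in workloads: sleeps model the per-token GPU kernel time and the
// CPU-side processing (logit readback, sampling, graph preparation).
fn gpu_forward(token: u32) -> u32 {
    thread::sleep(Duration::from_millis(9)); // ~9.2ms kernel time per token
    token + 1 // pretend: logits
}

fn cpu_sample(logits: u32) -> u32 {
    thread::sleep(Duration::from_millis(6)); // ~5.7ms CPU time per token
    logits
}

// Serial: the GPU sits idle while the CPU samples.
fn run_serial(n: u32) -> Duration {
    let t0 = Instant::now();
    let mut tok = 0;
    for _ in 0..n {
        tok = cpu_sample(gpu_forward(tok));
    }
    let _ = tok;
    t0.elapsed()
}

// Pipelined: CPU sampling of step t overlaps the GPU forward of step t+1.
fn run_pipelined(n: u32) -> Duration {
    let (tx, rx) = mpsc::channel();
    let t0 = Instant::now();
    let sampler = thread::spawn(move || {
        for logits in rx {
            cpu_sample(logits);
        }
    });
    let mut logits = 0;
    for _ in 0..n {
        logits = gpu_forward(logits);
        tx.send(logits).unwrap();
    }
    drop(tx); // close the channel so the sampler thread finishes
    sampler.join().unwrap();
    t0.elapsed()
}

fn main() {
    println!("serial:    {:?}", run_serial(8));
    println!("pipelined: {:?}", run_pipelined(8));
}
```

With these toy numbers the serial version costs roughly (9+6)ms per token and the pipelined version roughly 9ms, which is the same order of win the log estimates (~5-10%) once the real dependency is worked around.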