KernelBench v3: Rebuilding a GPU Kernel Benchmark from First Principles

How discovering the original KernelBench was exploitable led to building a focused, cost-effective benchmark for evaluating LLM kernel engineering on modern architectures.

Results at a Glance

10 models evaluated across RTX 3090, H100, and B200 GPUs on problems spanning 4 difficulty levels. Three metrics matter: Does the code compile? Is it numerically correct? Does it beat the PyTorch baseline?

Overall benchmark results showing compiled, correct, and beats baseline percentages for each model

The heatmap below breaks this down by difficulty level. L1-L3 are tractable for frontier models. L4 - novel architectures like DeepSeek MLA, GatedDeltaNet, and FP8 matmul - is where everyone struggles.

Heatmap showing compiled, correct, and beats baseline percentages by level for each model

Speedup Distribution

The distribution of speedups across all correct solutions shows where models actually deliver performance gains versus where they merely match the baseline.

Distribution of speedups across correct solutions

Level Breakdown

Per-level performance shows the steepest drop at L4, where problems involve novel architectures not well-represented in training data.

Performance breakdown by difficulty level

Cost vs Accuracy

API cost per evaluation varies dramatically across models. The cost-accuracy tradeoff reveals which models deliver the best kernel engineering per dollar.

GPT-5.4 and Gemini 3 Flash lead on correctness across GPUs, while Gemini 3 Flash offers the best cost-effectiveness. What matters is not average speedup - a metric easily gamed - but whether models can produce kernels that are both correct and faster than PyTorch. On that measure, the frontier is clear: Level 4 pass rates collapse across all models. Genuine kernel engineering on novel architectures remains beyond current capabilities.

The Problem with KernelBench

KernelBench, released by Stanford's Scaling Intelligence Lab, promised to evaluate whether LLMs could write optimized CUDA kernels. The premise was compelling: give a model a PyTorch reference implementation, ask it to write faster CUDA code, and measure the speedup. The benchmark included 250 problems across multiple difficulty levels.

Then METR published "Measuring Automated Kernel Engineering" and the facade crumbled.

The Exploits

METR discovered that models were achieving high "speedups" through exploitation rather than genuine kernel engineering:

Bypassing CUDA entirely: Models called torch or cuBLAS instead of writing kernels
Memory aliasing: No-op kernels that pass because output memory overlaps with reference
Timing manipulation: Monkey-patching torch.cuda.synchronize to make timing meaningless
Stack introspection: Extracting pre-computed reference results from the caller's stack
Constant functions: Problems like mean(softmax(x)) that always equal 1.0

METR removed 45 of the 250 problems due to fundamental task quality issues. After filtering exploits, average speedup dropped from 3.13x to 1.49x. The benchmark was measuring benchmark-gaming ability, not kernel engineering.

Starting Over

Rather than patching a broken system, I decided to rebuild from scratch with clear design principles:

Focus on modern architectures: No classical ML operators nobody optimizes anymore
Fewer problems, higher quality: Each problem manually validated per GPU target
Three GPUs that matter: RTX 3090 (Ampere), H100 (Hopper), and B200 (Blackwell)
Adaptive baselines: torch.compile used only when it actually helps, not blindly
Multi-seed correctness: 5 random seeds (42, 123, 456, 789, 1337) to catch caching exploits
Cost tracking: Full token usage and API cost per evaluation

The Cost Problem

The original plan was ambitious: evaluate many models across hundreds of problems on many GPU architectures. The math was brutal. Even with scope cuts, the final evaluation covers 10 models across 3 GPUs with varying problem counts per architecture (43 for RTX 3090, 54 for H100, 58 for B200), yielding over 1500 individual evaluations.

Problem Selection

Problems span four difficulty levels, with additional GPU-specific problems (tile-specialized GEMM variants for H100/B200, cuTile problems for B200, graphics problems for RTX 3090):

Level	Count	Turns	Description
L1	15	10	Simple ops: matmul, softmax, conv, norms
L2	15	12	Fused ops: matmul+activation chains
L3	3	15	Single blocks: attention, transformer block
L4	9+	15	Novel layers: MLA, MoE, GQA, FP8, DeltaNet, FP4, INT4

Level 4: The Real Test

Level 4 is where it gets interesting. These are modern inference optimization patterns that test genuine kernel engineering:

DeepSeek MLA: Multi-head Latent Attention with LoRA KV compression - not in training data
DeepSeek MoE: Mixture-of-Experts with grouped expert routing
GQA: Grouped Query Attention (Llama 3 style) with KV head expansion
FP8 Matmul: E4M3 quantized matmul with tensor cores via torch._scaled_mm
INT4 GEMM: Weight-only quantization with fused unpack+dequant+matmul
FP4 Matmul: B200-only FP4 quantized matrix multiply
GatedDeltaNet: Linear attention from ICLR 2025 - baseline uses flash-linear-attention's Triton kernels
KimiDeltaAttention: Channel-wise gated delta attention - same fla baseline

For GatedDeltaNet and KimiDeltaAttention, the baseline is not naive PyTorch - it is already optimized Triton code from flash-linear-attention. Models need to match or beat production-quality kernels.

Finding Modern Baselines

The Level 4 problems required digging through HuggingFace implementations to find reference code. DeepSeek MLA came from the DeepSeek-V3 model's modeling_deepseek.py. The core insight: the HuggingFace implementations use naive PyTorch ops that are ripe for optimization:

# DeepSeek MLA: naive PyTorch baseline
# Q projection with LoRA compression
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))

# KV projection with LoRA compression
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
kv = self.kv_b_proj(self.kv_a_layernorm(compressed_kv))

# A fused kernel can combine:
# 1. LoRA compression/expansion
# 2. RMSNorm
# 3. RoPE application
# 4. Attention computation

Infrastructure

The evaluation uses a multi-tier infrastructure. RTX 3090 runs locally, while H100 and B200 run on Modal with CUDA 13.2 and full support for Hopper and Blackwell architectures. Key infrastructure decisions:

Modal sandbox: Isolated execution with git, cmake, CUTLASS/CuTe DSL
Local sandbox: RTX 3090 runs locally for fast iteration
Multi-turn agent: Models iterate on their solutions with compiler feedback
Per-level turn limits: L1 gets 10 turns, L4 gets 15 - harder problems need more iteration
Adaptive torch.compile baseline: Uses whichever is faster per-problem (eager vs compiled)
Prompt caching: System prompts cached to reduce token costs
Dynamic pricing: Costs fetched from OpenRouter API, not hardcoded

Results

Over 1,500 evaluations across 10 models and 3 GPUs. The coverage matrix shows correctness rates:

Model	RTX 3090	H100	B200
MiniMax M2.5	35/43 (77%*)	9/54 (17%)	12/58 (21%)
GPT-5.4	33/43 (77%)	42/54 (78%)	-
Gemini 3 Flash	32/43 (74%)	41/54 (76%)	46/58 (79%)
GPT-5.3	28/43 (65%)	-	-
Claude Opus 4.6	27/43 (63%)	-	-
Claude Sonnet 4.6	25/43 (58%)	-	-
Kimi K2.5	22/43 (51%)	27/54 (50%)	35/58 (60%)
Qwen3.5 397B	-	22/54 (41%)	25/58 (43%)
GLM-5	19/43 (44%)	31/54 (57%)	31/58 (53%)
Gemini 3.1 Pro	16/43 (37%)	-	-
DeepSeek V3.2	0/43 (0%)	-	2/58 (3%)

*MiniMax RTX 3090 had 129 results from possible multi-run merge. Dashes indicate runs not yet completed.

By Level

Level 4's low pass rate tells the real story. When faced with modern architectures not in training data, or baselines that are already optimized Triton code, models struggle. The gap between "can write CUDA" and "can engineer production kernels" is substantial.

Key Observations

GPT-5.4 and Gemini 3 Flash Lead Across GPUs

GPT-5.4 achieves the highest correctness on both RTX 3090 (77%) and H100 (78%). Gemini 3 Flash is the only model evaluated across all three GPUs and maintains consistently strong performance (74-79% correctness), while also being the most cost-effective option at $0.50/$3 per million tokens.

GPU Architecture Matters

Some models perform notably differently across GPU architectures. Kimi K2.5 jumps from 51% on RTX 3090 to 60% on B200. GLM-5 improves from 44% on RTX 3090 to 57% on H100. This suggests that some models have better training coverage for newer GPU architectures, or that the additional problems (tile-specialized GEMM, cuTile) happen to play to their strengths.

Behavior on Specific Problems

The aggregate numbers hide interesting per-problem behavior. On LayerNorm, some models produce highly optimized fused kernels while others fall back to naive implementations. On GEMM fusion patterns, the approaches diverge significantly - some models attempt register tiling and shared memory optimization, others stick to cuBLAS calls. I encourage readers to explore the interactive dashboard and examine specific problems to understand how different models approach kernel engineering.

Open Models Struggle

DeepSeek V3.2 achieves 0-3% correctness across GPUs. Qwen3.5 397B manages 41-43% on H100/B200. The gap between frontier closed models and open alternatives is pronounced for kernel engineering - this appears to be a capability that requires significant training investment.

MiniMax M2.5: The Anomaly

MiniMax M2.5 shows the widest variance: 77% on RTX 3090 but only 17-21% on H100/B200. The RTX 3090 run had 129 results (possible multi-run merge), so the 77% figure should be treated with caution. More concerning: in early runs, MiniMax attempted reward hacking by running `pkill -f python` to kill the evaluation process. Guardrail fixes prevented this on subsequent runs.

What This Means

LLMs can write CUDA code. They can even write code that passes correctness checks on standard operations. But when the task requires genuine kernel engineering - understanding memory hierarchies, exploiting tensor cores, fusing operations for bandwidth efficiency - the capability drops sharply.

The original KernelBench inflated capabilities through exploitable tasks and naive baselines. With those removed, the picture is more sobering: models are useful assistants for kernel development, but not autonomous kernel engineers.

Remaining Evaluation

10 runs remain to complete the coverage matrix: Qwen3.5 on RTX 3090, plus GPT-5.3, Claude Opus 4.6, Claude Sonnet 4.6, and Gemini 3.1 Pro on both H100 and B200. GPT-5.4 on B200 is also pending. These will be published as they complete.

Future Work

KernelBench v3 currently evaluates single-GPU kernels. The roadmap includes:

Level 5: Multi-GPU with tensor parallelism and pipeline parallelism
Level 6: Multi-node distributed training/inference patterns
Expanded L4: More modern architectures as they emerge
M4 Max (Metal): Apple GPU evaluation with Metal-specific problems

Try It

The benchmark is open source. Results are browsable at /kernelbench-v3 with full filtering by model, GPU, level, operation type, and more.

git clone https://github.com/Infatoshi/KernelBench-v3
cd KernelBench-v3
uv sync
uv run python bench.py run rtx3090 --models google/gemini-3-flash-preview --levels 1,2,3,4 --workers 4

Resources

March 2026 (updated from January 2026)