Blog

Dendrite: 133x Faster Battery Simulation with Hand-Tuned CUDA

The battery simulation community is stuck on CPU. Naive GPU ports are actually slower than the CPU code they replace. Hand-tuned CUDA kernels achieve 89% of RTX 3090 peak bandwidth.

MegaQwen: 3.8x Faster LLM Inference by Fusing an Entire Transformer Block

A custom CUDA megakernel for Qwen3-0.6B that fuses RMSNorm, QKV projection, RoPE, attention, and MLP into a single kernel launch - achieving 527 tok/s decode on RTX 3090.

Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks

Custom CUDA kernels that eliminate computational bottlenecks in spherical harmonics and tensor product operations - the core primitives of equivariant GNNs like MACE, NequIP, and Allegro.

KernelBench v3: Rebuilding a GPU Kernel Benchmark from First Principles

How discovering the original KernelBench was exploitable led to building a focused, cost-effective benchmark for evaluating LLM kernel engineering on modern architectures.

Running a 30B Parameter Model on a Single RTX 3090: REAP Compression Experiments

Compressing Qwen3-30B-A3B from 6,144 to 1,698 experts while retaining 91.5% HumanEval performance - fitting a frontier-class MoE model into 18GB of VRAM.

Grassmann Flows: An Independent Reproduction Study

Reproducing "Attention Is Not What You Need" (arXiv 2512.19428) reveals a 22.6% performance gap vs the claimed 10-15%. Includes custom CUDA kernels with 2x inference speedup.

The Space-Native Data Center: A First Principles Design for 2030

AI is consuming energy at a rate that Earth's grids can barely sustain. I spent several days working with Gemini to model a 100-megawatt orbital compute cluster, designing a rig that lives in the vacuum.

MiniMax M2.1: The Open-Source Coding Powerhouse That Rivals Closed-Source Giants

A deep dive into MiniMax M2.1, the 230B parameter sparse MoE model that activates only 10B parameters per token while achieving SOTA performance at 10% of Claude Sonnet's cost.

GLM-4.7: Z.ai's Frontier Agentic Reasoning Model

A comprehensive technical analysis of GLM-4.7, the 358B parameter Mixture-of-Experts model pushing the boundaries of coding, reasoning, and agentic AI capabilities.

Should I Learn CUDA?

A down-to-earth answer drawing on my experience with CUDA: where it has and hasn't brought me success, where the ecosystem is going, and how to play strategically around that.