Dendrite: 133x Faster Battery Simulation with Hand-Tuned CUDA
The battery simulation community is stuck on CPU. Naive GPU ports are actually slower. Hand-tuned CUDA kernels achieve 89% of RTX 3090 peak bandwidth.
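The post's kernels aren't reproduced here, but the pattern is easy to sketch: electrode-scale battery models boil down to bandwidth-bound stencil updates, so achieved memory bandwidth, not FLOPs, is the metric that matters. A minimal illustrative diffusion step (all names and sizes are invented, not the post's code):

```cuda
// Illustrative sketch: an explicit diffusion step, the kind of
// bandwidth-bound stencil a battery electrode model runs thousands of
// times per simulated second.
#include <cuda_runtime.h>

__global__ void diffuse_step(const float* __restrict__ c,   // concentration at t
                             float* __restrict__ c_next,    // concentration at t+dt
                             int n, float r) {               // r = D*dt/dx^2
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0 || i >= n - 1) return;  // hold boundary cells fixed
    // Three coalesced reads + one coalesced write per cell: the kernel
    // is memory-bound, so achieved bandwidth is what you tune for.
    c_next[i] = c[i] + r * (c[i - 1] - 2.f * c[i] + c[i + 1]);
}
```

Counting three reads and one write per cell naively (16 bytes), 89% of the 3090's ~936 GB/s caps you near 5 × 10^10 cell updates per second; with cache reuse of neighboring loads the real ceiling sits higher.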
A custom CUDA megakernel for Qwen3-0.6B that fuses RMSNorm, QKV projection, RoPE, attention, and MLP into a single kernel launch - achieving 527 tok/s decode on RTX 3090.
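The full megakernel is far too long to excerpt, but the core move scales down. Here is a hedged sketch (mine, not the post's; all names invented) of the smallest fusion it builds on: keeping a residual add and RMSNorm in one launch so the intermediate tensor never waits on a second kernel:

```cuda
// Illustrative sketch: residual add fused with RMSNorm in one kernel.
// Launch with one block per token row, a power-of-two block size, and
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
#include <cuda_runtime.h>

__global__ void fused_residual_rmsnorm(const float* __restrict__ x,
                                       const float* __restrict__ residual,
                                       const float* __restrict__ weight,
                                       float* __restrict__ out,
                                       int hidden, float eps) {
    extern __shared__ float sdata[];
    const float* row_x = x + (long long)blockIdx.x * hidden;
    const float* row_r = residual + (long long)blockIdx.x * hidden;
    float* row_o = out + (long long)blockIdx.x * hidden;

    // Residual add and sum-of-squares in a single pass.
    float local = 0.f;
    for (int i = threadIdx.x; i < hidden; i += blockDim.x) {
        float v = row_x[i] + row_r[i];
        row_o[i] = v;          // stash the summed value for the second pass
        local += v * v;
    }
    sdata[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction over the block (assumes power-of-two blockDim.x).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    float inv_rms = rsqrtf(sdata[0] / hidden + eps);

    // Normalize and scale in place; no second kernel launch needed.
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
        row_o[i] = row_o[i] * inv_rms * weight[i];
}
```

The megakernel applies the same idea across the whole decoder layer, which is where the launch-overhead savings at decode batch size 1 come from.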
Custom CUDA kernels that eliminate computational bottlenecks in spherical harmonics and tensor product operations - the core primitives of equivariant GNNs like MACE, NequIP, and Allegro.
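For flavor, a minimal kernel (again mine, not the project's; the output layout is an assumption) for the first of those primitives: real spherical harmonics up to l = 2, one thread per edge vector. Equivariant GNNs evaluate this for every neighbor pair in every layer, which is why fusing it with the tensor product pays off:

```cuda
// Illustrative sketch: real spherical harmonics up to l=2, one thread
// per edge. Output layout [n_edges][9] is assumed; edges must be nonzero.
#include <cuda_runtime.h>

__global__ void sph_l2(const float3* __restrict__ edges,
                       float* __restrict__ Y, int n) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n) return;
    float3 v = edges[e];
    float inv = rsqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    float x = v.x * inv, y = v.y * inv, z = v.z * inv;  // unit direction
    float* o = Y + 9 * e;
    o[0] = 0.28209479f;                         // l=0
    o[1] = 0.48860251f * y;                     // l=1, m=-1
    o[2] = 0.48860251f * z;                     // l=1, m= 0
    o[3] = 0.48860251f * x;                     // l=1, m=+1
    o[4] = 1.09254843f * x * y;                 // l=2, m=-2
    o[5] = 1.09254843f * y * z;                 // l=2, m=-1
    o[6] = 0.31539157f * (3.f * z * z - 1.f);   // l=2, m= 0 (unit vectors)
    o[7] = 1.09254843f * x * z;                 // l=2, m=+1
    o[8] = 0.54627421f * (x * x - y * y);       // l=2, m=+2
}
```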
How discovering that the original KernelBench was exploitable led to building a focused, cost-effective benchmark for evaluating LLM kernel engineering on modern architectures.
Compressing Qwen3-30B-A3B from 6,144 to 1,698 experts while retaining 91.5% HumanEval performance - fitting a frontier-class MoE model into 18GB of VRAM.
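The VRAM math is easy to sanity-check. A back-of-envelope sketch below, assuming Qwen3-30B-A3B's public config (48 MoE layers × 128 experts = 6,144 total, hidden size 2048, expert FFN width 768, three projections per expert) and FP16 storage; the post's exact accounting may differ:

```cuda
// Back-of-envelope VRAM estimate for expert pruning. All shapes are
// assumptions taken from Qwen3-30B-A3B's public config; FP16 weights.
#include <cstdio>

int main() {
    const long long hidden = 2048, ffn = 768;
    const long long params_per_expert = 3 * hidden * ffn;  // gate+up+down, ~4.7M
    const long long before = 6144, after = 1698;
    const double bytes = 2.0;                               // FP16
    const double gb = 1024.0 * 1024.0 * 1024.0;
    printf("expert weights before: %.1f GB\n", before * params_per_expert * bytes / gb);
    printf("expert weights after:  %.1f GB\n", after  * params_per_expert * bytes / gb);
    // ~54 GB -> ~15 GB; attention, embeddings, and KV cache fill out
    // the remainder of the 18 GB budget.
    return 0;
}
```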
Reproducing "Attention Is Not What You Need" (arXiv 2512.19428) reveals a 22.6% performance gap versus the claimed 10-15%. Includes custom CUDA kernels with a 2x inference speedup.
AI is consuming energy at a rate Earth's grids can barely sustain. I spent several days modeling a 100 MW orbital compute cluster with Gemini, designing a rig that lives in the vacuum of space.
A deep dive into MiniMax M2.1, the 230B parameter sparse MoE model that activates only 10B parameters per token while achieving SOTA performance at 10% of Claude Sonnet's cost.
A comprehensive technical analysis of GLM-4.7, the 358B parameter Mixture-of-Experts model pushing the boundaries of coding, reasoning, and agentic AI capabilities.
A down-to-earth answer drawing on my experience with CUDA: where it has and hasn't brought me success, where the ecosystem is going, and how to play strategically around that.