Blog

All posts

Tutorials·January 8, 2026·6 min read

CS336 Notes: Lecture 6 - Kernels and Triton

Writing efficient GPU kernels with Triton: profiling, benchmarking, kernel fusion, and when to hand-optimize versus rely on torch.compile.

Tutorials·January 7, 2026·11 min read

GPU fundamentals for LLM training: memory hierarchy, arithmetic intensity, kernel optimization, FlashAttention, and bandwidth limits.

Tutorials·January 6, 2026·12 min read

Mixture of Experts (MoE): adding capacity without proportional compute, routing, load balancing, and what makes MoE stable.

Tutorials·January 5, 2026·8 min read

What modern LLMs converge on: pre-norm, RMSNorm, SwiGLU, RoPE, and stability tricks.

Tutorials·January 4, 2026·10 min read

Resource accounting for LLM training: compute estimates, memory budgets, dtypes, tensors, and mixed precision.
