CS336 Notes: Lecture 5 - GPUs
GPU fundamentals for LLM training: memory hierarchy, arithmetic intensity, kernel optimization, FlashAttention, and bandwidth limits.
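To make the arithmetic-intensity idea concrete before diving in: a minimal Python sketch, not from the lecture, that compares FLOPs performed to bytes moved for a matmul. The function name is my own, and the ~156 FLOPs/byte "ridge point" uses illustrative A100 figures (~312 TFLOP/s bf16, ~2 TB/s HBM).

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for C = A @ B, assuming each matrix is
    read from / written to HBM exactly once (bf16 => 2 bytes/elem)."""
    flops = 2 * m * k * n  # one multiply + one add per multiply-accumulate
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Illustrative A100 ridge point: ~312e12 FLOP/s / ~2e12 B/s ≈ 156 FLOPs/byte.
RIDGE = 312e12 / 2.0e12

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:  # square matmul vs. matrix-vector
    ai = matmul_arithmetic_intensity(*shape)
    bound = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{shape}: {ai:.1f} FLOPs/byte -> {bound}")
```

The square matmul lands around 1365 FLOPs/byte (well above the ridge, so compute-bound), while the matrix-vector product lands near 1 FLOP/byte (far below it, so bandwidth-bound), which is the core tension the rest of these notes revolve around.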