CS336 Notes: Lecture 7 - Parallelism 1
Distributed training fundamentals: data parallelism, ZeRO/FSDP for memory efficiency, tensor and pipeline parallelism, and how to combine strategies for frontier-scale models.
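To ground the data-parallel side of these notes before the details, here is a minimal sketch, assuming a standard PyTorch job launched with torchrun; the `wrap_model` helper and the `use_fsdp` flag are hypothetical names for illustration, not something defined in the lecture.

```python
# Minimal sketch of the two data-parallel regimes covered here, assuming a
# multi-GPU PyTorch job launched with torchrun (helper names are illustrative).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def wrap_model(model: torch.nn.Module, use_fsdp: bool) -> torch.nn.Module:
    # Set up the process group and pin this rank to one GPU.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)

    if use_fsdp:
        # ZeRO-3-style sharding: parameters, gradients, and optimizer state are
        # partitioned across ranks and gathered only around each forward/backward.
        return FSDP(model)

    # Plain data parallelism: every rank holds a full replica and gradients
    # are all-reduced after backward.
    return DDP(model, device_ids=[local_rank])
```

Tensor and pipeline parallelism split the model itself rather than the batch, and in practice frontier-scale runs combine them with one of the two wrappers above.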