CS336 Notes: Lecture 12 - Evaluation
LLM evaluation beyond accuracy: perplexity, knowledge benchmarks, instruction-following, agent tasks, safety, and why evaluation design shapes what models become.
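Of the metrics listed above, perplexity is the most mechanical: it is just the exponentiated average negative log-likelihood per token under the model. A minimal sketch (the `perplexity` helper and its inputs are illustrative, not from the lecture):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.

    `token_log_probs` holds the natural-log probabilities the model
    assigned to each observed token (hypothetical values here).
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is, on average,
# "choosing among 4 options" per token, so its perplexity is ~4:
ppl = perplexity([math.log(0.25)] * 10)
print(ppl)
```

Lower perplexity means the model spreads less probability mass away from the observed text, which is why it serves as the default intrinsic evaluation before any benchmark is involved.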