Tutorials · January 14, 2026 · 10 min read
CS336 Notes: Lecture 12 - Evaluation
LLM evaluation beyond accuracy: perplexity, knowledge benchmarks, instruction-following, agent tasks, safety, and why evaluation design shapes what models become.
machine-learning · evaluation · stanford-cs336 · benchmarks