Tutorials · January 18, 2026 · 16 min read
CS336 Notes: Lecture 16 - Alignment, RL 1
Advanced RL for alignment: PPO implementation details, GRPO as a simpler alternative, overoptimization risks, and case studies from DeepSeek R1, Kimi K1.5, and Qwen 3.
machine-learning · alignment · stanford-cs336 · rlhf