CS336 Notes: Lecture 16 - Alignment, RL 1
RLHF works, but it is fragile. Optimize against the reward model too hard and you overfit noisy human preferences. Models also become more confident and less calibrated.
Key Takeaways
DPO is simpler than PPO and dominated open-source post-training for a while. Many DPO variants behave differently across settings.
Overoptimization is the core failure mode in RLHF. Optimizing harder against the reward model first helps, then hurts the true human win rate.
PPO is a classic RL algorithm, but it is hard to run well. It needs a value model, advantage estimation, clipping, and lots of tuning.
GRPO keeps the PPO spirit but drops the value model and GAE. It is simpler and often more practical for language models.
The original GRPO objective has two math issues. Dividing by standard deviation and length-normalizing rewards can bias learning, especially toward very long outputs.
Modern reasoning systems (DeepSeek R1, Kimi K1.5, Qwen 3) follow a broad recipe: long chain-of-thought SFT, RL on verifiable rewards, then RLHF and distillation.
DeepSeek R1 suggests you can reach o1-style reasoning with simple outcome-based RL on verifiable tasks, without process reward models or tree search.
Kimi K1.5 shows that careful data curation, a different PPO-like objective, and explicit length control can reach similar performance while controlling cost.
Qwen 3 adds "thinking mode fusion" so one model can either think step by step or answer directly, trading accuracy for speed at inference.
Across these systems, data curation, reward design, and inference efficiency (especially chain-of-thought length) matter as much as the specific RL algorithm.
RLHF and DPO Recap
RLHF Setting
We collect pairwise preference data. For a prompt, humans choose the better of two model responses. We want a policy that maximizes an underlying reward consistent with these preferences.
DPO Idea
DPO optimizes an RL objective directly from preference data. The key trick is to treat the optimal policy nonparametrically. You rewrite reward as a log ratio between the target policy and a reference policy. You plug that reward into a Bradley-Terry preference model. Training becomes supervised: increase the likelihood of preferred responses over dispreferred ones.
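As a concrete reference, here is a minimal PyTorch-style sketch of the resulting loss. It assumes the summed log-probabilities of each response under the policy and the frozen reference have already been computed; names like logp_chosen and beta are illustrative, not from the lecture.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,          # log-probs under the policy being trained
             ref_logp_chosen, ref_logp_rejected,  # log-probs under the frozen reference
             beta=0.1):
    """Sketch of the DPO objective: Bradley-Terry on implied rewards.

    Each input is a tensor of per-response summed log-probabilities.
    The implied reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the probability that the preferred response beats the dispreferred one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```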
DPO Update Intuition
DPO looks like "upweight good, downweight bad," but with structured weights.
The strength term keeps the policy from drifting too far from the reference. The update is larger when the implied reward disagrees more with preferences.
Most RL methods reduce to this same pattern. The hard part is deciding what "good" means and how hard to push it.
DPO Variants: SimPO and Length Normalization
SimPO
SimPO changes DPO by normalizing updates by response length and removing the reference policy term. Without the reference, you lose the clean derivation in terms of policy ratios. As a heuristic, it still fits the same "upweight good, downweight bad" picture.
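A hedged sketch of a SimPO-style loss under the description above: scores are length-normalized log-probabilities, there is no reference model, and a target margin gamma is added. The hyperparameter values are placeholders.

```python
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """Length-normalized, reference-free preference loss in the SimPO style."""
    chosen_score = beta * logp_chosen / len_chosen
    rejected_score = beta * logp_rejected / len_rejected
    # Require the preferred response to beat the dispreferred one by a margin gamma.
    return -F.logsigmoid(chosen_score - rejected_score - gamma).mean()
```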
Length-Normalized DPO
Another variant keeps the reference policy but divides the update by response length. SimPO and length-normalized DPO were both used heavily in Tulu 3 experiments.
Empirical Caution
AI2 experiments once found PPO better than DPO. Later work, including Tulu 3, showed that improving SFT can absorb most of what PPO or DPO would have added. In one setup, SFT alone matched RLHF, and only DPO with length normalization gave small gains.
The lesson: RL results depend on the base model, the data, and the environment. Do not generalize from one paper.
Overoptimization and Calibration in RLHF
Overoptimization
Imagine you increase how hard you optimize a proxy reward. True human win rate rises at first, then falls, even as the proxy reward keeps rising. That drop is overoptimization: you are fitting the reward model, not the real preference.
Why It Happens
Preference labels are noisy. Human preferences are complex and incomplete.
Evidence
Students ran RLHF using three reward sources: real human preferences, noisy AI feedback, and clean noiseless AI feedback. The overoptimization curve shows up for humans and noisy AI. It mostly disappears when rewards are clean and noiseless.
Expect the curve: more RL on a proxy does not reliably mean better real performance.
Calibration
Pretraining and SFT fit likelihoods and preserve a probabilistic view of the world. RLHF optimizes reward instead. Empirically, RLHF models often become overconfident, especially at temperature 1, and show worse probability calibration than SFT baselines.
This is not a bug relative to the RL objective. Calibration is not in the reward. Do not treat RLHF models as well-calibrated probability estimators.
From RLHF to RL with Verifiable Rewards
Motivation
Human approval is noisy, easy to game, expensive, and prone to overoptimization. Classic RL successes relied on rewards that were stable, cheap, and scalable.
New Strategy
Keep RL tooling, but move to tasks with verifiable rewards: math, coding, logic, puzzles, and problems with clear correct answers or strong automatic judges.
This shift underlies o1-style models and the case studies here: DeepSeek R1, Kimi K1.5, and Qwen 3.
PPO: From Policy Gradient to a Practical Algorithm
Policy Gradient
We want to maximize expected reward under the policy. The gradient increases the probability of high-reward samples and decreases low-reward ones.
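In code, the naive estimator is just a reward-weighted log-likelihood loss. A minimal sketch, assuming per-sample summed log-probabilities and scalar rewards as tensors:

```python
def policy_gradient_loss(logps, rewards):
    """REINFORCE-style surrogate: minimizing this pushes probability mass
    toward high-reward samples and away from low-reward ones.

    logps: per-sample summed log-probabilities (tensor, requires grad).
    rewards: per-sample scalar rewards (tensor, treated as constants).
    """
    return -(rewards.detach() * logps).mean()
```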
Why Naive Policy Gradient is Hard
It is fully on-policy. Each update needs fresh rollouts and new reward computations. For language models, rollouts are expensive. We want to reuse rollouts for multiple optimization steps.
TRPO Idea
Use importance sampling from an older policy, correcting with likelihood ratios. Add a KL constraint so the new policy stays close to the old policy and steps do not blow up.
PPO Idea
PPO simplifies TRPO with clipping. It multiplies the likelihood ratio by an advantage term, but clips the ratio to a band like [1 - ε, 1 + ε]. Gains outside the band are cut off. This discourages large policy changes while staying easy to implement.
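A minimal sketch of the clipped surrogate, assuming per-sample log-probabilities and precomputed advantages (in practice it is applied per token):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate (to be minimized).

    logp_new: log-probs under the current policy (requires grad).
    logp_old: log-probs under the rollout policy (constants).
    advantages: advantage estimates (constants).
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the pessimistic (smaller) objective, then negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```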
Value Function and GAE
PPO typically uses a value network to predict expected returns. It uses generalized advantage estimation to reduce variance, trading bias and variance via γ and λ.
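A minimal sketch of GAE for a single trajectory, assuming one extra value prediction to bootstrap past the final step:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards: list of per-step rewards r_t.
    values: list of value predictions V(s_t), of length len(rewards) + 1
            (the last entry bootstraps the value after the final step).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```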
Why PPO is Painful in Practice for RLHF
To make PPO work well for big LLMs, you usually need: a reward model, a value model roughly as large as the policy, GAE implementation and tuning, KL penalties (per-token or per-sequence), and many small tricks that affect stability.
The value model alone doubles memory and complicates training.
PPO Implementation for LLMs
Contextual Bandit View
For LLMs, the RL problem is usually framed as a contextual bandit. Input is a prompt. Action is a full generated sequence. Reward is one scalar at the end. There are no real state transitions.
Reward Shaping
Training often uses token-level rewards: the task reward at the end (correct or incorrect), plus per-token KL penalties to keep the policy near a reference.
The final scalar reward is usually broadcast backward for training.
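A sketch of this shaping for one response, assuming per-token log-probabilities under the policy and the reference are available; the KL coefficient and the simple log-ratio KL estimate are one common choice, not the only one.

```python
def shaped_token_rewards(task_reward, logp_policy, logp_ref, kl_coef=0.05):
    """Per-token rewards for a single response: a KL penalty at every token,
    plus the scalar task reward added at the final token.

    logp_policy, logp_ref: per-token log-probs of the generated tokens
    under the policy and the frozen reference (same length, plain lists).
    """
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += task_reward  # the terminal reward lands on the last token
    return rewards               # returns then broadcast it backward
```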
Outer Loop
Collect rollouts, compute rewards and advantages, compute policy loss (with clipping) and value loss, backprop, step, and clip gradients.
Why This Digression Matters
PPO is conceptually clean. Its working versions are not. This motivates simpler algorithms like GRPO.
GRPO: A Simpler RL Algorithm for LLMs
Goal
Keep the PPO-style benefits but remove the hardest pieces: no value model and no GAE.
Core Idea
For each prompt, sample a group of G responses. Then, for each response i (see the sketch after this list):
- Compute a scalar reward R_i.
- Compute the group mean and standard deviation.
- Define advantage A_i = (R_i - mean_group) / std_group.
- Plug A_i into a PPO-like clipped objective, without a value network.
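A minimal sketch of the group-relative advantage computation; the epsilon guard against zero variance is an implementation detail, not part of the definition.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for G responses to the same prompt.

    rewards: tensor of shape (G,), one scalar reward per response (G > 1).
    Every token in response i shares the advantage A_i.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)
```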
Interpretation
The group is "all responses to the same question." The group mean is a natural baseline that reflects difficulty. Subtracting it reduces variance because learning depends on relative success within the same prompt. The standard deviation normalizes scale across prompts.
Clipping and KL
GRPO can add clipping or KL-style terms to keep the policy close to a reference. In a fully online single-step setting, you can drop clipping and recover a simple policy gradient with a baseline.
Why GRPO Got Popular
It removes the big value model and the GAE machinery. It fits existing LLM training loops. It worked well for math RL (for example in DeepSeekMath).
Problems in Original GRPO and Dr-GRPO Fixes
Baseline Theory
You can subtract any baseline that does not depend on the action without biasing the policy gradient. This often reduces variance.
Issue 1: Dividing by Standard Deviation
Subtracting the group mean is a valid baseline. Dividing by std_group is not guaranteed valid under the baseline theorem. It changes the gradient, so it no longer matches the true policy gradient.
What Std Division Does
When std is small, advantages blow up. Std is small when rewards are nearly all the same, which happens on very easy problems (all correct) or very hard problems (all wrong). That biases training toward extremes, not the mid-difficulty region where the learning signal is often best.
Issue 2: Length Normalization in Reward
Original GRPO often divides each response's total reward by its output length. When the answer is wrong, the task reward is near zero and the KL penalty makes the total negative, so dividing by length makes longer outputs look less bad per token: the model can shrink its penalty by generating very long sequences while still failing. When answers are correct, the same normalization pushes toward shorter outputs. This can produce long, pathological chains of thought when the model is stuck.
Dr-GRPO
A later paper reanalyzed GRPO and proposed Dr-GRPO: remove division by standard deviation, remove length normalization from reward.
It maintained or improved accuracy on tasks like GSM8K and stopped uncontrolled length growth. Output length stabilized at a reasonable plateau.
This suggests some very long chains of thought in GRPO runs may be objective artifacts, not a requirement for high performance.
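In code, the advantage-side fix is simply dropping the standard deviation term from the GRPO sketch above; one way to read the loss-side fix is to aggregate per-token terms without dividing each response's total by its length.

```python
def dr_grpo_advantages(rewards):
    """Dr-GRPO advantage: subtract the group mean only, no std division."""
    return rewards - rewards.mean()

def dr_grpo_aggregate(token_losses):
    """Aggregate per-token losses without per-response length normalization.

    token_losses: list of 1-D tensors, one per response in the group.
    Summing (optionally divided by a constant) avoids the bias toward long
    failing answers that per-response 1/length normalization introduces.
    """
    return sum(loss.sum() for loss in token_losses)
```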
DeepSeekMath and Early GRPO Results
Setup
DeepSeekMath used GRPO on math tasks with two reward types:
- Outcome reward: 1 if the final answer is correct, 0 otherwise.
- Process reward: a process reward model scoring intermediate reasoning steps.
They compared: fine-tuning on correct outputs (RFT), online RFT (using new correct outputs from the current model), GRPO with outcome rewards, GRPO with process rewards.
Main Observations
GRPO with outcome rewards beat RFT baselines. GRPO with process rewards did even better in DeepSeekMath. These results made GRPO a strong default for math RL.
Later, in R1, they chose outcome-based rewards for the main reasoning model.
DeepSeek R1-Zero: Pure RL with Verifiable Rewards
Setting
R1-zero starts from DeepSeek V3 before instruction tuning and RLHF and applies RL with verifiable rewards, mostly on math tasks. It does not include chain-of-thought SFT before RL.
Rewards
- Accuracy reward: binary reward for a correct final answer.
- Format reward: checks that reasoning is wrapped in tags like <think>...</think>.
The format reward keeps reasoning contained and machine-readable.
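A hedged sketch of what such a format check might look like; the exact tags and scoring used in R1 are assumptions here.

```python
import re

def format_reward(text: str) -> float:
    """Illustrative format check: 1.0 if the response contains exactly one
    <think>...</think> block followed by a non-empty final answer."""
    matches = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if len(matches) != 1:
        return 0.0
    answer = text.split("</think>", 1)[1].strip()
    return 1.0 if answer else 0.0
```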
Behavior and Claims
As RL progresses, chains of thought get longer. The paper frames this as learning to think longer on harder problems and shows "aha"-style behaviors like backtracking.
Later Perspective
After Dr-GRPO, some of this may be driven by length-biased objectives, not new reasoning ability. Large base models can also produce "aha"-style text from SFT alone.
Still, R1-zero shows that outcome-based RL on verifiable math rewards can produce a strong reasoning model from a base model.
Full DeepSeek R1 Pipeline
Goal
R1 is a production reasoning model, not just a math specialist.
Pipeline
Base model (DeepSeek V3), then:
- Long chain-of-thought SFT.
- RL with verifiable rewards.
- RLHF and general post-training.
- Distillation into smaller models.
Long Chain-of-Thought SFT
They fine-tune on many long reasoning traces. The source is vague, likely distilled from strong models. The goal is to make the model comfortable producing long, structured reasoning and to keep RL outputs readable.
Even small amounts of chain-of-thought data can lift math benchmarks for base models like Qwen 2.5.
Reasoning RL
Similar to R1-zero: GRPO-type RL on verifiable tasks, with format rewards to enforce thinking tags. They also add a language consistency reward to prevent random language mixing in the chain of thought.
General Post-Training
After reasoning RL, the model is strong on math and logic but can be less friendly. They do SFT on mixed tasks (reasoning and nonreasoning) and then RLHF using GRPO-like RL on preference data, following the DeepSeek V3 pipeline.
Distillation
They use R1 to generate reasoning traces and answers, then fine-tune smaller models on those traces. Distilled models gain a lot on math benchmarks compared to same-size base models.
Negative Results: PRMs and Search
Process Reward Models
Scoring steps should give richer signal than only scoring the final answer. DeepSeek had earlier PRM success in DeepSeekMath. But in the full R1 pipeline, PRMs did not beat simpler outcome-based rewards when aiming for o1-level reasoning. Outcome-based GRPO was enough.
Search Methods
Many expected heavy search, like tree search over reasoning paths. R1 experiments with search did not show clear gains over straight RL with outcome rewards.
Takeaway
In the tested setups, simple RL on verifiable outcomes plus good data beat PRMs or search.
Kimi K1.5: A Parallel Recipe
Headline
Kimi K1.5 matches or beats o1 on many reasoning benchmarks and landed around the same time as R1.
Data Curation
They tag math questions by domain and balance for diversity. They drop multiple-choice and true/false items because they are easy to guess or game. They focus on questions with automatically checkable answers: short text, numbers, code outputs.
Difficulty Selection
They run the base SFT model without reasoning on each problem, sample multiple answers, measure pass rate, and keep problems that are "hard enough" (fail best-of-8, for example). This forces RL to focus where the model is weak.
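A hedged sketch of this filter; sample_fn and grade_fn are hypothetical helpers, and "fails best-of-8" is just the example threshold mentioned above.

```python
def keep_hard_problems(problems, sample_fn, grade_fn, k=8):
    """Keep a problem only if the current model fails all k sampled attempts.

    sample_fn(problem, k) -> list of k model answers (hypothetical helper).
    grade_fn(problem, answer) -> True if the answer is correct (hypothetical helper).
    """
    hard = []
    for problem in problems:
        answers = sample_fn(problem, k)
        if not any(grade_fn(problem, a) for a in answers):
            hard.append(problem)  # model fails best-of-k: keep for RL
    return hard
```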
SFT Step
They do chain-of-thought SFT first. Details on data and prompts are limited. The goal is a warm start that produces usable reasoning traces.
Kimi RL Objective and Length Control
Objective
They start from a standard RL goal: maximize expected reward under the new policy, with KL regularization to the base policy. They use a nonparametric trick like DPO to express implied rewards via policy ratios, then penalize squared differences between implied rewards and actual rewards.
The gradient resembles a policy gradient with a batch-mean baseline plus an explicit squared KL term.
Length Control
They explicitly control chain-of-thought length to reduce inference cost. In each batch, they compute the shortest and longest outputs, and each output's normalized position within that range.
If correct, the length reward pushes toward the shortest. If wrong, it pushes toward the middle, not extreme short or extreme long.
So correct solutions get compressed. Wrong ones do not get rewarded for being very long.
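A hedged sketch of one way to implement this; the exact functional form in the paper may differ. A normalized score lam runs from +0.5 at the batch's shortest output to -0.5 at its longest; correct answers receive lam, wrong answers receive min(lam, 0), so they are never rewarded for being short and only penalized for being long.

```python
def length_reward(length, min_len, max_len, correct, weight=1.0):
    """Illustrative Kimi-style length reward for one response in a batch."""
    if max_len == min_len:
        return 0.0  # degenerate batch: all outputs the same length
    lam = 0.5 - (length - min_len) / (max_len - min_len)
    return weight * (lam if correct else min(lam, 0.0))
```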
Curriculum
If you apply length pressure too early, the model can collapse into short, failing answers. They use a two-phase schedule:
- First learn to solve.
- Then turn on stronger length reward to compress reasoning while keeping accuracy.
Kimi Systems and Infrastructure
Why RL is System-Heavy
LLM RL needs rollout generation and training updates running together. Rollouts are slow. Weights must move from RL workers to inference workers. Generated trajectories must flow back. Long chains of thought create huge length variance and hurt batching.
Kimi's Solution Sketch
They use vLLM for inference with separate inference and RL nodes, passing weights or deltas between them. They mention hacks like starting vLLM with dummy weights and overwriting later, and periodically restarting vLLM to manage GPU memory.
Training Dynamics
Accuracy rises with iterations. With length control, average chain-of-thought length grows early, then plateaus instead of exploding.
Qwen 3: A Recent Reasoning Pipeline
Structure
Qwen 3 follows the same broad pipeline: base model, long chain-of-thought SFT, GRPO-style RL on verifiable rewards, thinking mode fusion, general RLHF, then distillation.
Data and Efficiency
They decontaminate against validation benchmarks and manually filter SFT data to remove low-quality or guessing-style chains of thought. Their reasoning RL uses about 3,995 examples, yet still gives strong gains.
With strong base models, RL on verifiable tasks can be data-efficient.
Thinking Mode Fusion and Inference Control
Problem
Always thinking at length is expensive. Users sometimes want fast answers.
Fusion Idea
Train one model to do both:
"think" mode: produce chain of thought then answer.
"no_think" mode: answer directly with no chain of thought.
They add special tokens to prompts and use SFT data showing both behaviors.
Early Stopping of Thinking
The model can learn to switch from thinking to answering if forced to stop. At inference you can cap thinking tokens. If the model hits the limit, you stop and force answer mode.
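A hedged sketch of that inference-time control; generate is a hypothetical helper, not the real Qwen 3 API, and the tags are illustrative.

```python
def answer_with_thinking_budget(generate, prompt, budget=1024, answer_tokens=512):
    """Cap chain-of-thought length, then force the model into answer mode.

    generate(text, max_new_tokens, stop) is a hypothetical helper that
    continues `text` until it emits `stop` or exhausts the token budget.
    """
    # Phase 1: let the model think, but only up to `budget` tokens.
    thought = generate(prompt + "<think>\n", max_new_tokens=budget, stop="</think>")
    if "</think>" not in thought:
        # Budget exhausted: close the thinking block and switch to answering.
        thought += "\n</think>\n"
    # Phase 2: generate the final answer conditioned on the (possibly truncated) thought.
    return generate(prompt + "<think>\n" + thought, max_new_tokens=answer_tokens, stop=None)
```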
Scaling Behavior
Test-time scaling is smooth. More thinking tokens improve accuracy; fewer thinking tokens degrade performance gradually, not catastrophically. This gives a practical knob for trading cost against quality.
Effects of Later RLHF
Qwen 3 reports metrics after each stage:
Reasoning RL boosts math and STEM. Thinking mode fusion keeps gains while adding control. General RLHF improves instruction following and general tasks.
There is a tension: general RLHF can slightly hurt thinking-mode math and STEM performance while helping nonthinking mode and general helpfulness. It highlights the tradeoff between a reasoning specialist and a friendly general assistant.
Closing
RLHF is powerful, but it is noisy and easy to overfit. Overoptimization and calibration loss are natural outcomes of optimizing human feedback.
PPO is a workhorse, but its practical cost and complexity are high for large LLMs.
GRPO and related variants simplify RL for LLMs by removing value models and GAE and using group baselines.
Original GRPO can push models toward pathological long chains of thought because of standard deviation scaling and length-normalized rewards. Dr-GRPO removes these quirks without losing accuracy.
DeepSeek R1, Kimi K1.5, and Qwen 3 converge on the same pattern: strong base model, long chain-of-thought SFT, outcome-based RL on verifiable tasks with KL regularization, careful data selection, then RLHF and distillation.
So far, process reward models and search have not proven necessary for strong reasoning at scale in these pipelines.
Controlling inference cost, especially chain-of-thought length, is now part of the core design, not an afterthought.