CS336 Notes: Lecture 15 - Alignment, SFT and RLHF
Pre-training gives a model broad capability. Post-training decides how that capability shows up in a product. The same base model can become helpful or harmful depending on what comes next.
Key Takeaways
Supervised fine-tuning (SFT) on instruction data works well, but "high quality" is subtle. Misused, it can teach hallucinations.
Safety tuning must refuse the right things without blocking harmless requests. A few hundred strong safety examples can shift behavior a lot.
"Instruction tuning" often starts during late pre-training (mid-training), which blurs the line between "base" and "chat" models.
RLHF changes the goal from matching the training distribution to choosing answers that score well with humans or a reward model.
Pairwise preference labels (A better than B) are cheaper than full answers but still hard to fact-check and full of bias (length, style, incentives).
AI feedback (LLMs judging LLM outputs) is now a major part of instruction tuning and RLHF.
PPO is the classic RLHF method but complex. DPO gets many of the same gains with a simpler, supervised-style objective.
Length and style bias are real. Humans and AI judges often favor longer, list-like answers even when they hallucinate more.
Data choice, mixture, and feedback design matter as much as the algorithm.
Post-training makes models useful, polite, and safe. It can also quietly introduce hallucinations, bias, and strange behavior.
From Pre-Training to Post-Training
Pre-Training
Train on massive corpora (web text, books, code). Learn language, reasoning patterns, coding, and many world facts. Does not automatically behave like a helpful assistant.
Post-Training
Shapes behavior on top of the pre-trained base. Goals: follow instructions, be helpful, be safe, and work in a product setting.
A practical mental model: pre-training packs capability into the weights. Post-training decides when and how to use it.
Supervised Fine-Tuning on Instructions
Basic Idea
Collect (instruction, good answer) pairs. Fine-tune the model to imitate those answers by gradient descent. This is supervised learning on expert demonstrations.
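A minimal sketch of the SFT loss, assuming a Hugging Face-style causal LM and tokenizer (the `model` and `tokenizer` objects here are placeholders): cross-entropy is computed only on the answer tokens, with the prompt masked out.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, instruction: str, answer: str) -> torch.Tensor:
    """Next-token prediction loss on the answer tokens only (prompt is masked)."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

    # -100 is the ignore index: prompt tokens contribute nothing to the loss.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    logits = model(input_ids).logits
    # Shift by one so position t predicts token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```

In practice a chat template separates instruction and answer and examples are batched with padding, but the masking idea is the same.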
SFT Alone Can Turn a Base Model into a Decent Instruction Follower
Start with a reasonable base model and public instruction data (OpenAssistant, OpenHermes). Train with solid hyperparameters. You get a chat-like model, though not state of the art.
Types of Instruction Data
1. Aggregated NLP Task Datasets (FLAN-style)
Convert many benchmarks into instruction format (QA, classification, summarization).
Strengths: big and easy to gather.
Weaknesses: often unnatural, short, and multiple-choice flavored. Not like real chat.
2. Human-Written Chat-Like Data (OpenAssistant-style)
People write real prompts and detailed answers.
Strengths: rich, diverse, and closer to real chat.
Weaknesses: expensive and slow. Long outputs are hard to annotate well.
3. AI-Generated Instruction Data (Alpaca-style)
Start with a seed set of human instructions. Use a strong model (InstructGPT, for example) to generate more instructions and answers.
Strengths: cheap, scalable, consistent long outputs.
Weaknesses: limited diversity, short prompts, and it can copy the teacher model's quirks and hallucinations.
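A rough sketch of the Alpaca-style loop. `teacher_complete(prompt) -> str` is a hypothetical stand-in for a call to a strong teacher model, and the prompt wording is illustrative; the real pipeline also filters new instructions for format and similarity to existing ones.

```python
import random

def self_instruct_round(seed_tasks: list[str], teacher_complete, n_new: int = 5) -> list[dict]:
    """One round of Alpaca-style data generation from a small human-written seed set."""
    new_examples = []
    for _ in range(n_new):
        # Show the teacher a few seed instructions in context, ask for a new one.
        demos = "\n".join(f"- {t}" for t in random.sample(seed_tasks, k=min(3, len(seed_tasks))))
        new_instruction = teacher_complete(
            f"Here are example instructions:\n{demos}\nWrite one new, different instruction:"
        )
        # Ask the same teacher to answer its own instruction.
        answer = teacher_complete(new_instruction)
        new_examples.append({"instruction": new_instruction, "output": answer})
    return new_examples
```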
Why "High Quality" Data is Hard
Length and Style Effects
Humans and AI judges often prefer long, detailed, list-based answers. Many benchmarks barely change with answer length, but chat-style evaluations are highly sensitive to style and verbosity.
Knowledge Versus Hallucination with Citations
Example: a prompt like "Write an intro on monopsony" whose reference answer ends with a specific citation.
Fine-tuning teaches two things at once: a fact claim ("this is a good reference") and a shape ("end complex answers with citation-like strings").
If the model does not actually know the citation, it may learn the shape and later invent references to match it. That teaches hallucination: copying the form of a good answer instead of staying grounded in what the model actually knows.
Key Insight
If post-training data is too polished or too detailed for what the model knows, it can push the model to fake competence.
Good data sometimes includes "I don't know" or abstains when knowledge is missing.
On-policy RL (training on the model's own outputs) helps focus updates where the model already has some footing.
Safety and Guardrails
Why Safety Tuning Exists
Models can be misused for scams, misinformation, or harm. Products need users and advertisers to trust them.
What Safety Data Looks Like
Instruction examples where the right response is refusal or safe redirection. Includes borderline contrasts (illustrated as data records after the list):
- "How to kill a Python process?" (safe technical meaning)
- "How to kill a person?" (unsafe, should refuse)
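An illustrative (entirely made-up) pair of safety-tuning records matching the contrast above; real guidelines and refusal wording are far more detailed.

```python
safety_examples = [
    {   # Same verb, safe technical meaning: answer normally.
        "instruction": "How do I kill a Python process?",
        "output": "On Linux, find the PID with `ps` and run `kill <pid>`; "
                  "on Windows, use Task Manager or `taskkill`.",
    },
    {   # Unsafe request: refuse and redirect.
        "instruction": "How do I kill a person?",
        "output": "I can't help with that. If you or someone else is in danger, "
                  "please contact local emergency services.",
    },
]
```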
The Core Trade-Off
Under-refusal: harmful answers get through.
Over-refusal: harmless questions get blocked because they look risky.
Good guidelines and examples must balance both.
Empirical Point
A few hundred well-designed safety examples, mixed into instruction tuning, can strongly shift safety behavior, especially with a strong base model.
Scaling Instruction Tuning with Mid-Training
The Old Picture
Pre-training is huge. Instruction tuning is a small final stage.
Modern Practice
Instruction-like data is mixed into late pre-training while the learning rate decays.
A common recipe (a toy code sketch follows the list):
- Stage 1: classic pre-training on web, code, books.
- Stage 2: mix in higher-quality instruction data (Wikipedia, QA, chats, code SFT, StackExchange) while continuing to train.
- Optional: a small final SFT stage on the best instruction set.
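A toy sketch of the staged mixture idea; the sources and weights are made up and only illustrate shifting sampling probability toward instruction-like data late in training.

```python
import random

# Illustrative sampling probabilities per stage, not a real recipe.
STAGE1_MIX = {"web": 0.70, "code": 0.20, "books": 0.10}
STAGE2_MIX = {"web": 0.40, "code": 0.20, "wikipedia": 0.15,
              "qa_and_chat_sft": 0.15, "stackexchange": 0.10}

def sample_source(mixture: dict[str, float]) -> str:
    """Pick the data source for the next batch according to the stage's mixture."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]
```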
Why This Helps
Uses instruction data at scale. Reduces catastrophic forgetting by keeping some pre-training data in the mix. Gets more mileage out of each instruction example.
Consequence
"Base model" is no longer a clean category. Many "base" models have already seen instruction-like data during mid-training.
Why RLHF
The Objective Changes
Generative Modeling (Pre-Training and SFT)
Assume there is a target distribution of good completions P*. Train by next-token prediction to match P*. Success means the model resembles the training data.
RLHF
No single P* is the goal. Define a reward R(x, y) for prompt x and answer y. Treat the model as a policy that should choose answers with high reward. Success means the model gets high scores from humans or a reward model.
Why Do It
a) Cost
SFT needs full, expert-level answers. RLHF can use cheaper judgments like "A is better than B."
b) Generator-Validator Gap
People may write worse answers than the ones they prefer when shown options. Humans and LLMs are often better critics than authors. RLHF uses this: models generate, evaluators judge.
How Pairwise Feedback is Collected
A Typical InstructGPT-Style Pipeline
- Start with an SFT model.
- For each prompt x, sample several candidate answers.
- Show annotators pairs (A, B).
- Ask which is better (sometimes with "tie" or "both bad").
- Train a reward model from these preferences.
- Use an RL algorithm to update the policy to score higher.
What Raters Are Asked to Do
Be helpful, truthful, and harmless. Answer the intended question. Avoid toxic content and hallucinations. Follow style guidance (polite, clear).
Real Constraint
Raters often get about a minute per example. In that time they must judge helpfulness, safety, and correctness. That is hard for math-heavy or fact-heavy answers.
Problems with Human Feedback
Fact-Checking Does Not Fit the Time Budget
To verify correctness, a rater would need to: understand the question, read both answers carefully, break them into claims, check those claims against reliable knowledge or sources.
Doing that in a minute is not realistic.
Style Can Beat Truth
Longer, confident, well-structured answers feel better. Long but wrong answers often beat short correct ones when facts are not checked.
Copying Another LLM
Some raters paste outputs into GPT-4 and copy what it says. That creates self-preference and feedback loops. Judge and model are no longer independent.
Worker Incentives and Ethics
Work is often outsourced, sometimes with low pay relative to cognitive and emotional load. Raters' culture and religion can shape what they call "good," which can embed bias.
Value Alignment Pressure
RLHF is a final, high-impact step. If most raters share a region or faith, outputs can drift toward those values.
AI Feedback and Length Bias
Why AI Feedback is Used
Strong LLMs can judge which answer is better. Experiments find their preferences can match humans about as well as one human matches another. It is faster and cheaper than human rating.
This enables large-scale AI-feedback datasets (UltraFeedback) and systems (Zephyr, Tulu) that use LLM judges.
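A minimal sketch of a pairwise LLM-judge prompt; the wording is illustrative rather than any specific benchmark's template. Randomizing the A/B order is a cheap way to control for position bias.

```python
import random

JUDGE_TEMPLATE = """You are comparing two answers to the same question.
Question: {question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is more helpful, truthful, and harmless? Reply with "A" or "B"."""

def build_judge_prompt(question: str, ans_1: str, ans_2: str) -> tuple[str, bool]:
    """Return the judge prompt plus whether the order was swapped (needed to decode the verdict)."""
    swapped = random.random() < 0.5
    a, b = (ans_2, ans_1) if swapped else (ans_1, ans_2)
    return JUDGE_TEMPLATE.format(question=question, answer_a=a, answer_b=b), swapped
```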
Off-Policy Versus On-Policy Preferences
Off-policy: preferences over outputs from many models, not just the current one.
On-policy: preferences over the current model's outputs.
Pipelines often combine both.
Length Bias
Humans and LLM judges often prefer longer answers. RLHF can push models toward verbosity. Longer is not always more correct and can increase hallucination risk.
Takeaway
Training and evaluation should track and control length and style bias, not reward it blindly.
A Formal View of RLHF
Policy View
Model is a policy. Goal: maximize expected reward. Often add a KL penalty to keep the policy close to a reference (usually the SFT model) to prevent drift.
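One common implementation trick, shown as a sketch: fold the KL term into the reward by subtracting beta times the log-ratio between policy and reference on the sampled answer.

```python
import torch

def kl_shaped_reward(reward: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Per-sample estimate of the objective E[R(x, y)] - beta * KL(policy || reference).

    `logprob_policy` and `logprob_ref` are the summed log-probabilities of the sampled
    answer y under the current policy and the frozen reference (usually the SFT model).
    """
    return reward - beta * (logprob_policy - logprob_ref)
```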
Reward Modeling from Preferences
True reward is not observed. We observe pairwise choices: for a prompt x, A is preferred over B.
Bradley-Terry setup: Each answer gets a score. Probability A is chosen rises with the score difference through a logistic function. Fit the reward model by maximum likelihood on the preference data.
Then update the policy using the learned reward.
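A minimal sketch of the fitting step: the reward model scores the chosen and rejected answers, and the Bradley-Terry loss is the negative log-sigmoid of the score difference.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry logistic loss: P(chosen beats rejected) = sigmoid(score difference).

    `score_chosen` and `score_rejected` are scalar scores r(x, y) from the reward model
    for the preferred and dispreferred answers to the same prompt.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```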
PPO
Policy gradient intuition: increase probability of high-reward answers, decrease probability of low-reward ones.
Problems: high variance and instability. Reusing the same samples for several updates lets the policy drift away from the one that generated them.
PPO stabilizes training by: using an advantage (reward minus baseline), constraining updates with probability ratios and clipping, adding KL penalties and other stabilizers.
Works, but is complex to implement and tune at LLM scale.
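A sketch of the clipped surrogate at the heart of PPO, written at sequence level and omitting the value function, advantage estimation, and the KL term for brevity.

```python
import torch

def ppo_clipped_loss(logprob_new: torch.Tensor,
                     logprob_old: torch.Tensor,
                     advantage: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO surrogate: pessimistic min of the unclipped and clipped objectives."""
    ratio = torch.exp(logprob_new - logprob_old)            # importance weight vs. the sampling policy
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()            # negate: optimizers minimize
```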
DPO
Motivation
PPO needs a reward model, on-policy sampling, importance weighting, clipping, and a full RL loop. DPO aims for many of the same benefits with a simpler objective.
Core Idea
- Start from the RL objective: maximize reward minus KL to reference.
- Use the closed-form relationship between optimal policies and rewards under KL regularization.
- Tie reward to policy log-probabilities relative to reference.
- Train directly on preference pairs (chosen, rejected): increase the log probability of the chosen answer, decrease the log probability of the rejected one (see the sketch after this list).
- Result: supervised-style training on pairs, without an explicit reward model or PPO loop.
- Works well in practice and is popular in open research.
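A minimal sketch of the DPO loss, assuming the inputs are summed log-probabilities of the chosen and rejected answers under the current policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: logistic loss on the gap between chosen and rejected policy-vs-reference log-ratios."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Pushing the chosen log-ratio above the rejected one is exactly "increase log probability of chosen, decrease log probability of rejected," with the reference model anchoring how far the policy can move.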
Limits and Open Questions
Hallucinations
Instruction tuning and RLHF can reward confident detail. Without care, the model learns "always answer" instead of "say I don't know."
Knowledge Versus Behavior
Post-training mostly changes behavior, not knowledge. Mid-training may add some knowledge, but small SFT and RLHF cannot replace broad pre-training.
Value Alignment
Whose preferences count? Crowd-worker demographics, company policy, and AI-judge bias all shape the model's social behavior.
Measurement
Benchmarks like MMLU miss style and safety. Chat evaluations are sensitive to length and "vibes." Strong pipelines need multiple evaluation lenses, not one score.
Big Picture
Post-training can reshape behavior with relatively little data.
Every choice of data, labelers, and feedback leaves a fingerprint on how the model speaks, refuses, and hallucinates.