CS336 Notes: Lecture 3 - Architectures and Hyperparameters
Modern LLMs look alike because the designs that work keep getting copied. This lecture covers the choices that survived.
The test: if you see a new open model, can you predict its architecture? Probably yes. It's likely a LLaMA-style decoder-only transformer with pre-norm, RMSNorm, SwiGLU, RoPE, no biases, and familiar width-depth ratios.
Nine Claims About Modern Architectures
Small architectural choices matter. Moving LayerNorm to pre-norm, switching to RMSNorm, and dropping bias terms tend to improve stability and runtime.
Gated MLPs won. GLU variants (especially SwiGLU) reliably beat plain ReLU or GeLU at similar parameter counts. They're now the default.
Most models use serial blocks. Attention then MLP, each with its own pre-norm and residual add. Parallel blocks exist and can be efficient, but they're less common.
RoPE won for position encoding. Dense LMs mostly settled on rotary positional embeddings applied to queries and keys inside attention.
Hyperparameters cluster. MLP expansion ratios, head dimensions, depth vs width, and vocabulary size all follow simple rules of thumb across model families.
Regularization in pretraining is about optimization, not overfitting. Weight decay helps optimization, especially as the learning rate decays over long runs.
Stability tricks exist because large-scale training breaks without them. Gradient spikes and softmax blowups kill runs. Z-loss, extra norms, and QK normalization are the fixes.
Inference bottlenecks differ from training. With KV cache, attention becomes memory-bound. That's why MQA and GQA are attractive.
Long-context models use hybrid attention. Mostly local sliding-window attention with occasional global full-attention layers (often without RoPE) to mix information across the whole context.
Transformer Recap
The base model is a decoder-only transformer.
Token and position information enters a stack of transformer blocks. Each block has self-attention, an MLP, residual connections, and normalization. A final softmax produces next-token probabilities.
Modern variants change a few defaults: pre-norm instead of post-norm, RoPE instead of absolute or additive position embeddings, SwiGLU instead of ReLU in the MLP, and no bias terms in most linear layers.
The theme: many labs trained many LLMs. The designs that keep showing up are the ones that work.
LayerNorm, RMSNorm, and Bias Terms
Pre-norm vs Post-norm
The original transformer used post-norm: run a sub-block, add the residual, then normalize.
Modern models mostly use pre-norm: normalize before the sub-block, then add the sub-block output back to the residual stream.
Pre-norm tends to win because the residual path stays close to identity and gradients flow more cleanly through deep stacks. Post-norm can work, but it often needs more careful warmup and tuning to avoid loss spikes.
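A minimal sketch of the pre-norm pattern, assuming `attn` and `mlp` are any shape-preserving sub-blocks (names and the LayerNorm choice here are illustrative, not a specific model's code):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: normalize the input to each sub-block, then add the
    sub-block output back onto the un-normalized residual stream."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = attn   # assumed: (batch, seq, d_model) -> same shape
        self.mlp = mlp     # assumed: (batch, seq, d_model) -> same shape

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # residual path stays close to identity
        x = x + self.mlp(self.norm2(x))
        return x
```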
Double Norm
Some models add an extra normalization after the sub-block while keeping the residual stream itself un-normalized. This "double norm" pattern shows up as a stability tool at very large scale.
LayerNorm vs RMSNorm
LayerNorm normalizes by mean and variance and includes both a scale and a bias.
RMSNorm drops the mean subtraction and the bias term. It normalizes by root mean square and keeps only a learned scale.
RMSNorm is cheaper and tends to run faster without hurting quality, so it dominates in recent large LMs.
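A minimal RMSNorm sketch (the `eps` value and class layout are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: divide by the root mean square over the last dimension.
    No mean subtraction, no bias -- only a learned per-dimension gain."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```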
Dropping Bias Terms
Many modern LMs remove bias terms from most linear layers. The payoff is smaller parameter count, simpler kernels, and often better stability at scale. The expressive gain from biases rarely seems worth the cost.
Activations and Gated Linear Units
Plain Activations
Early transformers used ReLU. GPT-style models popularized GeLU.
GLUs
Modern MLPs often use a gated form:
- Compute a "main" projection and a "gate" projection.
- Apply a nonlinearity to the main path (GeLU or Swish).
- Multiply main and gate elementwise.
- Project back to the model dimension.
Common variants include GeGLU and SwiGLU (Swish-gated). The gate acts like a learned filter over hidden dimensions.
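A minimal SwiGLU sketch of that recipe, with no bias terms. The projection names (`w_main`, `w_gate`, `w_out`) are illustrative; some codebases apply the nonlinearity to the path they call the gate instead.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU MLP: Swish-activated main path, multiplied elementwise by a
    linear gate path, then projected back to d_model."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_main = nn.Linear(d_model, d_ff, bias=False)   # activated path
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gating path
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_main(x)) * self.w_gate(x))
```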
Why GLUs Won
Across many training runs, GLU variants reach lower loss and better downstream performance at similar parameter counts. That consistent edge is why they became the default.
Adjusting MLP Width for GLUs
Because GLUs add an extra projection, you usually shrink the hidden dimension to keep parameter counts comparable:
- Non-gated MLPs often use d_ff = 4 × d_model.
- GLU MLPs often use d_ff ≈ (8/3) × d_model (about 2.67×).
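A quick parameter-count check of why 8/3 is the matching ratio (the width is illustrative):

```python
d_model = 4096                                   # illustrative width

# Non-gated MLP: two projections, d_model -> 4*d_model -> d_model.
params_plain = 2 * d_model * (4 * d_model)       # = 8 * d_model**2

# GLU MLP: three projections (main, gate, output) with d_ff = (8/3) * d_model.
d_ff_glu = int(8 * d_model / 3)
params_glu = 3 * d_model * d_ff_glu              # ~ 8 * d_model**2

print(params_plain, params_glu)   # 134217728 vs 134209536: nearly identical
```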
Serial vs Parallel Transformer Blocks
Serial Blocks
The standard block runs attention, then MLP, each with its own pre-norm and residual add. This is still the most common structure.
Parallel Blocks
Some architectures compute attention and MLP in parallel from the same input and add both results into the residual stream. This can simplify scheduling and improve kernel fusion, but it's less common in newer open models.
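A sketch of the parallel layout for contrast, under the same assumptions as the pre-norm block above:

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel block: attention and MLP read the same normalized input,
    and both outputs are added into the residual stream."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = attn   # assumed shape-preserving
        self.mlp = mlp     # assumed shape-preserving

    def forward(self, x):
        h = self.norm(x)                      # one shared pre-norm
        return x + self.attn(h) + self.mlp(h)
```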
Position Embeddings and RoPE
Position methods varied: sinusoidal absolute, learned absolute, relative-bias methods, and ALiBi.
RoPE became the default for many dense LMs. Instead of adding a position vector at the bottom, RoPE rotates the query and key vectors inside attention. It pairs dimensions and rotates each pair by a position-dependent angle, with different rotation frequencies across dimensions.
Conceptually, it bakes relative position into the dot products themselves. This tends to work well and plays nicely with context extension methods.
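A minimal RoPE sketch for a single head. The frequency schedule follows the common `base=10000` convention; treat the exact pairing and slicing as illustrative, since implementations differ in how they interleave dimensions.

```python
import torch

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (..., seq_len, d_head).
    Dimensions are paired and each pair is rotated by an angle that grows
    with position, at a different frequency per pair."""
    seq_len, d = x.shape[-2], x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]        # the paired dimensions
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# In attention, apply this to queries and keys before the dot product:
# q, k = rope(q), rope(k)
```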
Key Hyperparameters
MLP Expansion Ratio
A standard 2-layer MLP maps d_model → d_ff → d_model.
Common choices cluster around:
- d_ff = 4 × d_model for non-gated MLPs.
- d_ff ≈ (8/3) × d_model for GLU MLPs to keep parameters similar.
Head Dimension and Number of Heads
A typical constraint is d_model = n_heads × d_head. This keeps attention sizing predictable. Too many heads can make d_head small and potentially limit expressiveness, but many successful models sit in the same broad range.
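An illustrative sizing example (the specific numbers are assumptions, not a particular model):

```python
import torch

batch, seq, d_model = 2, 16, 4096     # illustrative sizes
n_heads = 32
d_head = d_model // n_heads           # 128, a common head dimension

x = torch.randn(batch, seq, d_model)
# Split the model dimension into heads before attention:
q = x.view(batch, seq, n_heads, d_head).transpose(1, 2)   # (batch, heads, seq, d_head)
```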
Depth vs Width
Across model sizes, many families converge on a stable "shape": a fairly consistent ratio between hidden width and layer count. Pretraining loss tends to track total parameter count more than depth alone, though depth can help some downstream behavior at fixed compute.
Vocabulary Size
Older English-focused models often used 30k-50k tokens. Newer multilingual and production models often use 100k-250k to reduce sequence length across scripts and improve efficiency.
Regularization and Weight Decay
Classic overfitting intuition doesn't map cleanly to pretraining because data is huge and examples are often seen once. Dropout has largely faded during pretraining in many large runs.
Weight decay remains common, but the story is optimization: it interacts with learning rate schedules and can improve final training loss, especially late in training after the learning rate has decayed.
Training Stability Tricks
At scale, training failures often come from gradient spikes and numerical issues around softmax.
Two main softmax sites matter:
- The final vocabulary softmax.
- The attention softmax inside each block.
z-loss (Output Softmax)
Add a small penalty on (log Z)², where Z is the softmax normalizer. This discourages extreme logits and keeps the softmax in a safer numeric regime.
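A sketch of adding z-loss to the training objective; the coefficient (1e-4) is a commonly cited value, used here as an assumption.

```python
import torch
import torch.nn.functional as F

def loss_with_z_loss(logits, targets, z_coef=1e-4):
    """Cross-entropy plus a z-loss term: penalize (log Z)^2, where
    Z = sum(exp(logits)) is the softmax normalizer."""
    ce = F.cross_entropy(logits, targets)            # logits: (N, vocab), targets: (N,)
    log_z = torch.logsumexp(logits, dim=-1)          # log of the softmax normalizer
    return ce + z_coef * (log_z ** 2).mean()
```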
QK Norm (Attention Softmax)
Normalize queries and keys before the dot product. This controls attention logit scale and reduces overflow risk, making training more stable and sometimes allowing larger learning rates.
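A minimal QK-norm sketch using plain L2 normalization; many models use RMSNorm or LayerNorm with learned scales instead.

```python
import torch.nn.functional as F

def qk_norm_attention_logits(q, k, scale):
    """Normalize queries and keys over the head dimension before the dot
    product, so attention logits stay bounded."""
    q = F.normalize(q, dim=-1)                  # unit-norm queries
    k = F.normalize(k, dim=-1)                  # unit-norm keys
    return scale * (q @ k.transpose(-2, -1))    # logits bounded by +/- scale
```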
Other approaches include directly soft-capping attention logits, though results can be mixed. The broader pattern is clear: add normalization where scale can blow up.
Attention Variants, KV Cache, MQA, and GQA
Training is compute-heavy and parallel across sequence length. Inference is token-by-token, so attention becomes dominated by memory traffic.
KV cache stores past keys and values so you don't recompute them each step. That saves compute but makes inference more memory-bound as context grows.
MQA shares a single set of keys and values across all query heads, shrinking the KV cache dramatically.
GQA groups heads so several query heads share one K/V set. It's a middle ground that often preserves quality while still cutting memory.
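A rough back-of-envelope comparison of KV-cache sizes. The layer count, head sizes, context length, and fp16 storage are all illustrative assumptions; real caches also depend on batch size and implementation details.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: keys + values for every layer (hence the 2x)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Illustrative 32-layer model, 32 query heads of size 128, 8k context, fp16:
full_mha = kv_cache_bytes(32, n_kv_heads=32, d_head=128, seq_len=8192)  # every head keeps K/V
gqa      = kv_cache_bytes(32, n_kv_heads=8,  d_head=128, seq_len=8192)  # 4 query heads per K/V group
mqa      = kv_cache_bytes(32, n_kv_heads=1,  d_head=128, seq_len=8192)  # one shared K/V set
print(full_mha / 2**30, gqa / 2**30, mqa / 2**30)   # GiB: 4.0, 1.0, 0.125
```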
Long-Context Attention Patterns
Full attention is quadratic in sequence length, so very long contexts need structure.
A common modern pattern is hybrid:
- Most layers use sliding-window (local) attention with RoPE.
- Occasional layers use global full attention, often without RoPE.
Local layers handle nearby structure efficiently. Global layers mix information across the whole sequence without paying the full quadratic cost in every layer. Omitting RoPE in those global layers can help with length extrapolation.
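A sketch of the local mask a sliding-window layer would use (the window size is illustrative):

```python
import torch

def sliding_window_mask(seq_len, window):
    """Causal mask restricted to a local window: position i attends to
    positions j with i - window < j <= i. True where attention is allowed."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]     # rel[i, j] = i - j
    return (rel >= 0) & (rel < window)

# A hybrid stack would use this mask in most layers and a plain causal mask
# (rel >= 0) in the occasional global layers.
print(sliding_window_mask(6, window=3).int())
```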
What's Next
Modern LLM design is not a free-for-all. The field has converged on a small set of choices that train stably and run efficiently: pre-norm (often RMSNorm), few or no biases, gated MLPs like SwiGLU, RoPE inside attention, stable hyperparameter ratios, weight decay as an optimization aid, extra stability tricks around softmax, and inference-minded attention variants like MQA/GQA plus structured long-context attention.
Next lecture: Mixture of Experts. How to add capacity without proportional compute, and what makes routing work.