CS336 Notes: Lecture 1 - Overview and Tokenization
If you only call APIs, you lose touch with how models actually work. Building and fine-tuning keeps the mechanics real.
This lecture maps the modern LLM landscape, reframes the bitter lesson around efficiency, and introduces tokenization.
Five Claims Worth Testing
Large language models have deep roots. Most core ideas predate GPT-3. Frontier labs combined them with careful engineering and a willingness to spend.
Data choices set the ceiling. Raw web data is messy. Cleaning, filtering, deduplication, and transformation happen before training starts, not after.
Alignment turns a next-token predictor into something you can use. Supervised fine-tuning and feedback shape the base model into a controllable assistant.
Tokenization matters more than it looks. Byte Pair Encoding (BPE) is a simple learned tokenizer that trades sequence length for vocabulary size better than naive schemes.
The bitter lesson is often misread. It's not "only scale matters." It's "scale matters, and efficiency determines how much scale you can afford."
Frontier Models Are Industrial Projects
GPT-4 reportedly has about 1.8 trillion parameters and cost roughly $100 million to train. Training clusters may have around 200,000 H100 GPUs. Total investment in large models reaches hundreds of billions over a few years.
These models are also opaque. Major labs withhold training data, architecture details, and procedure specifics. They cite competition and safety. The result: frontier models are financially and informationally out of reach for most people.
This creates a gap for a course like this one. Students train small models. Small models don't always reflect large-model behavior.
Two Gaps That Matter at Scale
1. Attention vs MLP Compute
In small transformers, attention and MLP layers can have similar FLOPs. At around 175B parameters, MLP FLOPs dominate. If you optimize attention at small scale, you may miss the main cost at frontier scale.
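A rough back-of-envelope makes the gap concrete. The sketch below uses standard per-token FLOP approximations (2 FLOPs per multiply-add, a 4x MLP expansion); the small-model and GPT-3-scale dimensions are illustrative assumptions, not numbers from the lecture.

```python
def per_token_flops_per_layer(d_model: int, seq_len: int, mlp_ratio: int = 4):
    """Rough per-token FLOPs for one transformer layer (2 FLOPs per multiply-add)."""
    attn_proj = 8 * d_model ** 2                    # Q, K, V, O projections: four d x d matmuls
    attn_scores = 4 * seq_len * d_model             # QK^T scores plus attention-weighted sum
    mlp = 2 * (2 * d_model * mlp_ratio * d_model)   # up- and down-projection matmuls
    return attn_proj + attn_scores, mlp

# Small model: attention and MLP FLOPs are comparable.
print(per_token_flops_per_layer(d_model=768, seq_len=1024))    # ~(7.9e6, 9.4e6)
# GPT-3-scale dimensions: the MLP term pulls clearly ahead.
print(per_token_flops_per_layer(d_model=12288, seq_len=2048))  # ~(1.3e9, 2.4e9)
```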
2. Emergent Behavior
Some tasks stay flat as training FLOPs rise, until a threshold. Past that point, new capabilities appear quickly. In-context learning is one example. Small experiments can falsely suggest the approach doesn't work.
The Bitter Lesson, Corrected
Wrong version: only scale matters, algorithms don't.
Better version: scale matters, and efficiency matters. Algorithms that improve efficiency are how you get more from the same resources.
A simple frame:
accuracy = f(efficiency, resources)
You improve accuracy by using more resources with the same algorithms, or by improving efficiency to get more from the same resources.
Efficiency matters more when training costs are enormous. At tens or hundreds of millions of dollars, wasted compute is not an option. Frontier labs are probably more careful than typical academic training setups.
Evidence: ImageNet training between 2012 and 2019 saw about a 44× gain in algorithmic efficiency at the same accuracy. Without those advances, training would have cost about 44 times more.
History of Language Models
Key threads:
- Shannon used language models to estimate English entropy.
- Classical NLP used language models inside machine translation and speech recognition.
- By 2007, Google trained 5-gram models on about two trillion tokens. That's more than GPT-3's training data.
- N-gram models are shallow. They don't show the behaviors of modern neural language models.
Deep learning ingredients that came together in the 2010s:
- Neural language models (early 2000s)
- Sequence-to-sequence models
- Adam optimizer
- Attention, first in machine translation, then generalized
- Transformers (2017)
- Mixture of experts and large-scale parallelism, including early 100B-parameter experiments
Foundation models like ELMo, BERT, and T5 showed the power of one large base model fine-tuned for many tasks.
OpenAI pushed known ingredients to larger scale with strong engineering. Scaling laws became a central design principle. This drove GPT-2, GPT-3, and many follow-on efforts.
Closed vs Open
Three tiers:
- Closed models: API access only. No weights, no data.
- Open weight models: Weights released, but data and training details are often missing.
- Truly open source: Weights, data, and detailed reporting are released.
A crowded ecosystem exists: OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, Tencent, AI2, and a strong open community of models and tools.
Course Structure
Five units, one theme: given hardware and data, how do you train the best model within a budget?
Unit 1: Basics
Build a full, simple pipeline: tokenizer, model, training loop.
Tokenizer: Converts strings to integer token sequences and back. BPE is the main tokenizer. Byte-only approaches exist but aren't dominant at frontier scale yet.
Model architecture: Core backbone is the transformer. Common modern variations include SwiGLU-style MLP nonlinearities, rotary positional embeddings, RMSNorm instead of LayerNorm, and attention variants like grouped query attention.
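As a concrete example of one of these variations, here is a minimal RMSNorm in PyTorch; this is a sketch of the standard formulation, not the course's reference implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root mean square; unlike LayerNorm, no mean subtraction and no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```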
Training: Key choices include optimizer (often AdamW), learning rate schedule, batch size, regularization, and hyperparameter tuning. Details are not cosmetic. A well-tuned setup can beat a naive one by an order of magnitude.
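To make "learning rate schedule" concrete, here is one common shape, linear warmup followed by cosine decay; the exact schedule and constants are an assumed illustration, not the lecture's recipe.

```python
import math

def cosine_lr(step: int, max_lr: float, min_lr: float,
              warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```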
Unit 2: Systems and Efficiency
Goal: maximize hardware performance with efficient kernels, multi-GPU training, and optimized inference.
GPU reality: GPUs have many small floating-point units, on-chip caches, and off-chip memory. Compute happens on chip. Data often lives off chip. Data movement is often the real bottleneck, not raw FLOPs. Moving data between GPUs is even more expensive.
Kernels: Kernels implement operations like matrix multiplication and attention. Good kernels reduce memory traffic through tiling and fusion, keeping data on chip longer. The course uses Triton to write custom kernels.
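A minimal Triton kernel shows the shape of the abstraction: load a tile from off-chip memory, compute on chip, store the result. This elementwise add is an illustrative toy, not one of the course's kernels; the real wins come from tiling and fusing heavier operations like matmul and attention.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # each program handles one tile
    mask = offsets < n_elements                            # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)                # off-chip memory -> on-chip
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)          # compute on chip, write back once

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```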
Multi-GPU training: Large models and large batches require many GPUs. NVLink and NVSwitch provide high-bandwidth links between GPUs within a node (and, in some systems, to CPUs). The job is to place parameters, activations, and gradients to minimize communication.
Inference: Two phases. Prefill processes the whole prompt at once (usually compute bound and efficient). Decode generates tokens one at a time (often memory bound and latency sensitive). Total serving cost across users can exceed training cost.
Unit 3: Scaling Laws
Goal: use small experiments to predict large-scale behavior and choose model/data sizes under a compute budget.
Core question: With fixed FLOPs, what model size and data size minimize loss?
Tradeoff: Larger models see fewer tokens for the same compute. Smaller models see more tokens. There's an optimal balance.
Rule of thumb: For N parameters, train on about 20×N tokens. Example: 1.4B parameters means about 28B tokens.
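The arithmetic is short enough to write down. The ~6·N·D estimate for training FLOPs is a standard approximation used here as an assumption, not something derived in this lecture.

```python
def rule_of_thumb_tokens(n_params: float) -> float:
    """Roughly 20 training tokens per parameter."""
    return 20 * n_params

n = 1.4e9                                # 1.4B parameters
d = rule_of_thumb_tokens(n)
print(f"tokens: {d:.1e}")                # ~2.8e10, i.e. about 28B tokens
print(f"train FLOPs: {6 * n * d:.1e}")   # ~6*N*D, roughly 2.4e20
```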
Limitation: Classical scaling laws ignore inference cost. They optimize for training compute, not serving constraints.
Unit 4: Data
Goal: see how data choices set the ceiling on capabilities.
Data effects are direct. More multilingual text yields multilingual ability. More code yields coding ability. The target use case should shape the mixture.
Web data is not "ready." Common Crawl contains HTML, boilerplate, spam, and non-text artifacts. Clean, meaningful text is a small fraction.
Processing is unavoidable: convert HTML, PDFs, or code repos into text, filter low-quality content, deduplicate to avoid overcounting repeated documents.
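As a toy illustration of the deduplication step, the sketch below drops exact duplicates by hashing normalized text. Real pipelines also do near-duplicate detection (for example MinHash), which this does not attempt.

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first copy of each exactly-duplicated document."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```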
Evaluation guides data choices: perplexity for next-token prediction quality, benchmarks like MMLU for broad knowledge, instruction-following evals for response quality.
Unit 5: Alignment
Goal: turn a base next-token model into a helpful, controllable, safer assistant.
Supervised Fine Tuning (SFT): Collect examples of user prompts and desired assistant responses. Train with standard supervised learning. Most capabilities come from the base model. A small, high-quality SFT set can have large effects. Roughly 1,000 strong examples can produce good instruction following from a strong base.
Learning from Feedback: SFT alone may be insufficient, and large SFT datasets are expensive. Feedback can be cheaper. Types include preference data (compare two candidates, label which is better) and verifier data (use checkers for code or math).
Efficiency as the Unifying Theme
Everything maps back to efficiency, especially in a compute-constrained regime where data is abundant and compute is scarce.
- Data processing: filter aggressively so you don't waste compute on junk tokens.
- Tokenization: byte-only is elegant but inefficient today. BPE shortens sequences and cuts compute.
- Architecture: many design choices exist because they train and serve efficiently.
- Training: one epoch over a huge dataset is common. It can be better to see more unique data once than to repeat the same data.
- Scaling laws: reduce compute spent on badly scaled experiments.
- Alignment: good alignment can let smaller base models handle narrow tasks, cutting serving cost.
Shift Toward Data Constraints
Frontier labs may be moving toward a regime where compute grows faster than high-quality data. That changes best practices.
One pass over the full dataset may no longer be optimal if data is limited. Multiple passes or careful sampling may matter more. Architectures shaped by compute efficiency may change when data becomes the bottleneck.
The principle stays the same: think in terms of the limiting factor. When the constraint shifts, the best choices shift.
Tokenization
What a Tokenizer Does
A tokenizer maps raw strings to integer token sequences and back. It must be reversible for correct decoding. It shapes sequence length, vocabulary size, training cost, and how languages and scripts are represented.
What you see in GPT-style tokenizers:
- Tokens are often word pieces, not whole words.
- Spaces are often part of tokens. " hello" and "hello" are different.
- Numbers break into pieces that don't match human groupings.
- Compression ratio is bytes divided by tokens. For GPT-2 on sample text, it's about 1.6 bytes per token on average.
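You can check the compression ratio yourself with the GPT-2 encoding from the tiktoken package (assuming it is installed); the exact number depends on the text you feed it.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Tokenizers trade vocabulary size for sequence length."
tokens = enc.encode(text)
n_bytes = len(text.encode("utf-8"))
print(tokens)
print(f"{n_bytes} bytes / {len(tokens)} tokens = {n_bytes / len(tokens):.2f} bytes per token")
```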
Naive Tokenization Schemes and Why They Fail
Character-based: Treat each Unicode character as a token. Simple and reversible, but huge vocabulary full of rare characters, mediocre compression, and some characters require multiple bytes.
Byte-based: Convert to UTF-8 bytes and treat each byte as a token. 256-token vocabulary, clean and uniform. But compression is 1 byte per token, sequences are long, attention cost grows quickly, and it becomes too slow and expensive for mainstream models.
Word-based: Split by regex into words and non-word spans, treat each as a token. Frequent words become single tokens and compression is strong. But vocabulary grows without bound, many words appear once, and new words become out-of-vocabulary, which complicates training and evaluation.
In practice, naive schemes are either too big, too slow, or brittle.
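A few lines make the tradeoffs visible: character and byte schemes give long sequences with no learned compression, while word splitting gives short sequences but an open-ended vocabulary. The regex is a simplified stand-in for a real word-splitting pattern (it also drops whitespace), so treat it as an illustration only.

```python
import re

text = "Tokenization matters: même les accents coûtent des octets."

chars = list(text)                      # character-based: one token per Unicode character
byte_ids = list(text.encode("utf-8"))   # byte-based: one token per UTF-8 byte, vocab of 256
words = re.findall(r"\w+|\S", text)     # word-based: simplified split into words and symbols

print(len(chars), len(byte_ids), len(words))
# Accented characters cost two UTF-8 bytes each, so the byte sequence is the
# longest; the word split is the shortest, but every new word needs its own
# vocabulary entry.
```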
Byte Pair Encoding (BPE)
BPE started as a 1994 compression method. In tokenization, the idea is simple:
- Learn the tokenizer from data.
- Start from bytes, not words.
- Merge frequent adjacent sequences into new tokens.
Why it works: Frequent patterns compress into single tokens, shortening sequences. Rare patterns stay as compositions of smaller tokens. You avoid unknown tokens while still getting good compression.
Practical setup: Convert text to bytes. Optionally pre-tokenize with a word-like regex into segments. Run merges inside each segment. This reduces merges across whitespace and tends to produce better behavior for spaces and punctuation.
BPE Algorithm
Start with the text as a sequence of byte tokens (IDs 0-255), an empty list of merges, and a vocabulary mapping token IDs to byte strings.
Repeat for a fixed number of merges:
- Count frequencies of all adjacent token pairs in the data.
- Find the most frequent pair.
- Create a new token ID for that pair.
- Record the merge rule: (left token, right token) → new token.
- Replace all occurrences of that pair with the new token.
Each merge adds one vocabulary entry and reduces sequence length where the pattern is common.
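Here is a compact sketch of that training loop in Python. It runs over a single byte sequence with no pre-tokenization, so it is a minimal illustration of the algorithm above rather than a production trainer.

```python
from collections import Counter

def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int):
    """Learn BPE merge rules from raw text. Returns (merges, vocab)."""
    ids = list(text.encode("utf-8"))               # start from bytes: token IDs 0-255
    vocab = {i: bytes([i]) for i in range(256)}    # token ID -> byte string
    merges = {}                                    # (left, right) -> new token ID

    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))   # frequencies of adjacent pairs
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]  # most frequent pair
        new_id = 256 + len(merges)                 # next unused token ID
        merges[(a, b)] = new_id                    # record the merge rule
        vocab[new_id] = vocab[a] + vocab[b]        # its byte string is the concatenation
        ids = merge_pair(ids, (a, b), new_id)      # replace all occurrences
    return merges, vocab
```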
Encoding and Decoding
Encoding: convert string to bytes, then apply merge rules in learned order.
Decoding: map token IDs back to byte strings, concatenate bytes, then decode to Unicode.
Compression increases with more merges because more common sequences become single tokens.
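Encoding and decoding reuse the learned merges. This sketch assumes the `merges`, `vocab`, and `merge_pair` from the training sketch above; applying the merges in the order they were learned reproduces the training-time segmentation.

```python
def encode(text: str, merges: dict) -> list[int]:
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():     # dicts keep insertion order, i.e. learned order
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids: list[int], vocab: dict) -> str:
    data = b"".join(vocab[i] for i in ids)  # token IDs -> byte strings -> one byte sequence
    return data.decode("utf-8", errors="replace")
```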
Implementation Notes
A naive BPE implementation is easy to understand but slow. It can scan whole sequences repeatedly across many merges.
Real tokenizers use faster methods: they track only the pairs that actually occur in the current sequence and use data structures that avoid rescanning everything after each merge.
Production tokenizers also need special tokens (start, end, padding), clear whitespace and punctuation handling, and thoughtful pre-tokenization (it strongly shapes how tokens behave).
What's Next
Tokenization is the bridge between text and model inputs. Naive tokenizers waste vocabulary, create long sequences, or break on new words. BPE compresses what's common and composes what's rare, balancing vocabulary size and sequence length.
Looking ahead, models that work directly on bytes may remove tokenization. For now, BPE-style tokenization is the practical standard.
Next lecture: resource accounting for training. How much compute, how much memory, and where the time actually goes.