CS336 Notes: Lecture 4 - Mixture of Experts
Mixture of Experts (MoE): adding capacity without proportional compute, routing, load balancing, and what makes MoE stable.
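To make the topics in that summary concrete, here is a minimal sketch of a top-k MoE layer with Switch-style load balancing: a router scores experts per token, only the top-k experts run (capacity grows with the expert count while per-token compute stays roughly fixed), and an auxiliary loss nudges the router toward a balanced assignment. This is an illustrative sketch, not the lecture's reference code; the class name, hyperparameters, and the use of a GELU FFN for each expert are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Illustrative top-k mixture-of-experts FFN layer (not from the lecture)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.n_experts, self.k = n_experts, k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary FFN; total parameters scale with
        # n_experts, but each token only pays compute for k of them.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -- flatten batch and sequence dims beforehand.
        logits = self.router(x)                        # (tokens, n_experts)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalize the selected gates so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Find which tokens routed to expert e, and in which top-k slot.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += topk_probs[token_ids, slot, None] * expert(x[token_ids])

        # Switch-style load-balancing loss: push the fraction of tokens sent
        # to each expert (top-1 assignment) toward that expert's mean router
        # probability, so no expert collapses or dominates.
        frac_tokens = F.one_hot(topk_idx[:, 0], self.n_experts).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        aux_loss = self.n_experts * (frac_tokens * mean_probs).sum()
        return out, aux_loss
```

In training, the auxiliary loss would be added to the language-modeling loss with a small coefficient (often cited around 0.01); without it, the router tends to collapse onto a few experts, which is one of the main stability issues MoE training has to manage.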