Stanford CS336 - Language Modeling from Scratch
“Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpose system address a range of downstream tasks. As the field of artificial intelligence (AI), machine learning (ML), and NLP continues to grow, possessing a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.” (Adapted from the official course website)
These are condensed notes: a collection of the important points I encountered while working through this course on my own.
[In Progress]
Lecture 01: Overview and Tokenization
Why this course?
- Modern researchers are increasingly disconnected from the underlying tech stack: most work through prompting, some through fine-tuning, and few through full-scale training.
- Abstractions in LLMs are leaky and not well understood.
- The course teaches understanding through building: tokenizers, models, data pipelines, and training.
Scaling Laws:
- Goal: Predict optimal hyperparameters for large-scale models by experimenting at small scale.
- Given a FLOPs budget ($C$ ), find optimal model size ($N$ ) and data size ($D$ ).
- Rule of thumb: $D^* \approx 20N^*$, i.e., train on roughly 20 tokens per parameter (Hoffmann et al., 2022; Kaplan et al., 2020 established the earlier scaling-law methodology).
- This is known as Chinchilla optimality. Note that it does not account for inference cost! (A small allocation sketch follows this list.)
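To make the allocation concrete, here is a minimal sketch assuming the usual approximation that training cost is $C \approx 6ND$ FLOPs (that approximation, and the function name, are assumptions of mine, not something stated above):

```python
import math

def chinchilla_optimal(C: float) -> tuple[float, float]:
    """Given a FLOPs budget C, return (N*, D*) under D* = 20 * N*.

    Using the approximation C ~= 6 * N * D, substituting D = 20 * N gives
    C = 120 * N**2, so N* = sqrt(C / 120) and D* = 20 * N*.
    """
    n_opt = math.sqrt(C / 120.0)
    return n_opt, 20.0 * n_opt

# Sanity check: a ~6e23 FLOPs budget lands near 70B parameters and
# 1.4T tokens, roughly the published Chinchilla configuration.
N, D = chinchilla_optimal(6e23)
print(f"N* ~= {N:.3g} params, D* ~= {D:.3g} tokens")
```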
Historical Context:
- Pre-2010s: Shannon’s entropy, n-gram models.
- 2010s: Neural LMs (Bengio, Attention, Transformer).
- 2018+: Pretrained models like BERT, GPT-2, T5.
- Modern: Scaling laws, alignment, mixture of experts, compute efficiency.
What we can learn:
- Mechanics: How transformers, parallelism, and tokenization work.
- Mindset: Think in terms of resources and efficiency.
- Intuition: Data/modeling decisions (some do not transfer across scale).
Tokenization
Goal: Convert raw text (Unicode strings) to sequences of integers (tokens) and back.
Why care?
- Poor tokenization increases sequence length (hurts compute).
- Good tokenization yields high compression (more text per token) and efficient training; a quick way to measure this is sketched below.
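Compression is easy to quantify as bytes of UTF-8 input per token produced; higher is better. A tiny sketch (the helper name is mine):

```python
def compression_ratio(text: str, tokens: list[int]) -> float:
    """Bytes of UTF-8 input per output token (higher = better compression)."""
    return len(text.encode("utf-8")) / len(tokens)

# A pure byte-level tokenizer emits one token per byte, so the ratio is 1.0;
# a good subword tokenizer typically lands well above that.
text = "hello world"
print(compression_ratio(text, list(text.encode("utf-8"))))  # 1.0
```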
Tokenization Methods
Character-based:
- Each Unicode character becomes a token (e.g., 🌍 → 127757).
- Large vocab (~150k), poor compression.
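In Python this is just a round trip through `ord`/`chr` (a sketch):

```python
def char_encode(text: str) -> list[int]:
    """One token per Unicode character: its code point."""
    return [ord(ch) for ch in text]

def char_decode(tokens: list[int]) -> str:
    """Map code points back to characters."""
    return "".join(chr(t) for t in tokens)

print(char_encode("hi 🌍"))                          # [104, 105, 32, 127757]
print(char_decode(char_encode("hi 🌍")) == "hi 🌍")  # True (lossless round trip)
```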
Byte-based:
- Each byte (0–255) becomes a token via UTF-8.
- Fixed small vocab (256), but long sequences.
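The byte-level version is a UTF-8 round trip (sketch); note how the emoji that was a single character-level token becomes four byte tokens:

```python
def byte_encode(text: str) -> list[int]:
    """UTF-8 encode, then treat each byte (0-255) as a token."""
    return list(text.encode("utf-8"))

def byte_decode(tokens: list[int]) -> str:
    """Reassemble the bytes and decode back into a string."""
    return bytes(tokens).decode("utf-8")

print(byte_encode("🌍"))                   # [240, 159, 140, 141]
print(byte_decode(byte_encode("hi 🌍")))   # "hi 🌍"
```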
Word-based:
- Uses regex rules to segment strings into words.
- Vocabulary grows with corpus; OOV (out-of-vocab) issues.
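A toy regex segmenter to illustrate the idea (this pattern is a simplification I chose, not the pre-tokenization regex GPT-2 actually uses):

```python
import re

# Words, runs of whitespace, or single punctuation marks.
TOY_WORD_PATTERN = re.compile(r"\w+|\s+|[^\w\s]")

def word_tokenize(text: str) -> list[str]:
    return TOY_WORD_PATTERN.findall(text)

print(word_tokenize("Hello, world!"))  # ['Hello', ',', ' ', 'world', '!']
# Every distinct word still needs an ID, so the vocabulary grows with the
# corpus, and unseen words at test time are out-of-vocabulary (OOV).
```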
Byte Pair Encoding (BPE):
- Merges frequent byte pairs iteratively.
- Trained on raw text; balances vocab size and sequence length.
- Used in GPT-2, GPT-3 tokenizers.
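A minimal, unoptimized BPE training loop over bytes, just to show the merge step (this sketch is mine; the course builds a proper, efficient version from scratch):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Greedily merge the most frequent adjacent pair, num_merges times.

    Starts from the 256 byte tokens; each merge creates a new token ID.
    Returns the merges in the order they were learned.
    """
    ids = list(text.encode("utf-8"))
    merges = []
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Replace every occurrence of the best pair with the new token ID.
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges

print(train_bpe("the cat in the hat", 3))
```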
Observations
- GPT-2’s tokenizer merges frequently co-occurring byte pairs.
- Strings like "hello world" get segmented into different tokens depending on context.
- Special tokens (e.g., <|endoftext|>) must be preserved during encoding/decoding.
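These observations are easy to reproduce with the GPT-2 tokenizer via the `tiktoken` library (a usage sketch; run it to see the actual IDs):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

# The same word can map to different tokens depending on context
# (e.g., leading whitespace becomes part of the token).
print(enc.encode("hello world"))
print(enc.encode("say hello world"))

# Special tokens must be explicitly allowed, otherwise encode() raises,
# which protects against user text injecting control tokens.
tokens = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(tokens)              # the single <|endoftext|> token ID
print(enc.decode(tokens))  # round-trips back to "<|endoftext|>"
```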
Key Takeaways
- BPE tokenization strikes a balance between efficiency and practicality.
- Tokenization deeply affects model size, training cost, and downstream performance.
- In this course, we’ll build a tokenizer from scratch and integrate it into our pipeline.
Additional Videos
Assignment 1 Solution
https://github.com/DhyeyMavani2003/stanford-cs336-assignment1-basics-solution/tree/main
Next up: PyTorch building blocks and efficient training routines.
Lecture 02: PyTorch and Resource Accounting
Coming soon…
Lecture 03: Architectures, hyperparameters
Coming soon…
Lecture 04: Mixture of experts
Coming soon…
Lecture 05: GPUs
Coming soon…
Lecture 06: Kernels, Triton
Coming soon…
Lecture 07: Parallelism
Coming soon…
Lecture 08: Parallelism
Coming soon…
Lecture 09: Scaling laws
Coming soon…
Lecture 10: Inference
Coming soon…
Lecture 11: Scaling laws
Coming soon…
Lecture 12: Evaluation
Coming soon…
Lecture 13: Data
Coming soon…
Lecture 14: Data
Coming soon…
Lecture 15: Alignment - SFT/RLHF
Coming soon…
Lecture 16: Alignment - RL
Coming soon…
Lecture 17: Alignment - RL
Coming soon…
Lecture 18: Guest Lecture by Junyang Lin
Coming soon…
Lecture 19: Guest Lecture by Mike Lewis
Coming soon…