Stanford CS336 - Language Modeling from Scratch
“Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpose system address a range of downstream tasks. As the field of artificial intelligence (AI), machine learning (ML), and NLP continues to grow, possessing a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.” (Adapted from the official course website)
These are condensed notes and a collection of important points I encountered while working through this course on my own.
[In Progress]
Lecture 01: Overview and Tokenization
Why this course?
- Modern researchers are increasingly disconnected from the underlying tech stack (most only prompt; fewer fine-tune; fewer still train models from scratch).
- Abstractions in LLMs are leaky and not well understood.
- The course teaches understanding through building: tokenizers, models, data pipelines, and training.
Scaling Laws:
- Goal: Predict optimal hyperparameters for large-scale models by experimenting at small scale.
- Given a FLOPs budget ($C$), find the optimal model size ($N$) and data size ($D$).
- Rule of thumb: $D^* \approx 20N^*$, i.e., train on roughly 20 tokens per parameter (Hoffmann et al., 2022; scaling laws introduced in Kaplan et al., 2020).
- This is known as Chinchilla Optimality — doesn’t consider inference cost though!
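As a quick illustration (my own sketch, not from the lecture), combining the common approximation $C \approx 6ND$ with $D^* = 20N^*$ gives a closed-form compute-optimal allocation:

```python
import math

def chinchilla_optimal(C: float) -> tuple[float, float]:
    """Compute-optimal allocation assuming C ~= 6*N*D and D* = 20*N*.

    Substituting D = 20N into C = 6ND gives C = 120*N^2,
    so N* = sqrt(C / 120) and D* = 20 * N*.
    """
    N = math.sqrt(C / 120)
    return N, 20 * N

# Example: a 1e21 FLOP training budget.
N_opt, D_opt = chinchilla_optimal(1e21)
print(f"N* ~ {N_opt:.2e} params, D* ~ {D_opt:.2e} tokens")
```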
Historical Context:
- Pre-2010s: Shannon’s entropy, n-gram models.
- 2010s: Neural LMs (Bengio, Attention, Transformer).
- 2018+: Pretrained models like BERT, GPT-2, T5.
- Modern: Scaling laws, alignment, mixture of experts, compute efficiency.
What we can learn:
- Mechanics: How transformers, parallelism, and tokenization work.
- Mindset: Think in terms of resources and efficiency.
- Intuition: Which data and modeling decisions yield good models (some intuitions do not transfer across scale).
Tokenization
Goal: Convert raw text (Unicode strings) to sequences of integers (tokens) and back.
Why care?
- Poor tokenization increases sequence length (hurts compute).
- Good tokenization yields high compression, efficient training.
Tokenization Methods
Character-based:
- Each Unicode character becomes a token (e.g., 🌍 → 127757).
- Large vocab (~150k), poor compression.
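A quick sanity check of the code-point mapping, using Python built-ins (my own sketch):

```python
# Character-level tokenization: one Unicode code point per token.
text = "hello 🌍"
tokens = [ord(ch) for ch in text]        # [104, 101, 108, 108, 111, 32, 127757]
decoded = "".join(chr(t) for t in tokens)
assert decoded == text
print(tokens)
```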
Byte-based:
- Each byte (0–255) becomes a token via UTF-8.
- Fixed small vocab (256), but long sequences.
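Byte-level tokenization is essentially UTF-8 encoding; a minimal sketch:

```python
# Byte-level tokenization: each UTF-8 byte (0-255) is a token.
text = "hello 🌍"
tokens = list(text.encode("utf-8"))      # the emoji alone expands to 4 bytes
decoded = bytes(tokens).decode("utf-8")
assert decoded == text
print(len(text), len(tokens))            # 7 characters -> 10 bytes
```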
Word-based:
- Uses regex rules to segment strings into words.
- Vocabulary grows with corpus; OOV (out-of-vocab) issues.
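A rough illustration of word-level segmentation with a regex (a simplified sketch, not GPT-2's actual pre-tokenization pattern):

```python
import re

# Naive word-level segmentation: runs of letters, runs of digits,
# or single non-space symbols. GPT-2's pre-tokenizer uses a more
# elaborate pattern (Unicode categories, leading spaces, contractions).
PAT = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def word_tokenize(text: str) -> list[str]:
    return PAT.findall(text)

print(word_tokenize("Hello, world! It's 2024."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```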
Byte Pair Encoding (BPE):
- Merges frequent byte pairs iteratively.
- Trained on raw text; balances vocab size and sequence length.
- Used in GPT-2, GPT-3 tokenizers.
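A minimal sketch of the BPE training loop (greatly simplified; real tokenizers pre-tokenize with a regex, break ties deterministically, and reserve special tokens):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Train a tiny byte-level BPE: repeatedly merge the most frequent pair."""
    ids = list(text.encode("utf-8"))
    merges = []
    next_id = 256  # byte values 0-255 are the base vocabulary
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair (a, b) with the new token id.
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges

print(train_bpe("low low low lower lowest", 5))
```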
Observations
- GPT-2’s tokenizer merges frequently co-occurring byte pairs.
- Strings like "hello world" get segmented into different tokens depending on context.
- Special tokens (e.g., <|endoftext|>) must be preserved during encoding/decoding.
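For poking at GPT-2's tokenizer, the tiktoken library is convenient (a quick sketch; assuming a recent tiktoken version):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Leading whitespace changes segmentation: "world" and " world" are different tokens.
print(enc.encode("hello world"))
print(enc.encode("world"), enc.encode(" world"))

# Special tokens must be explicitly allowed, or encode() refuses them.
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids, enc.decode(ids))
```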
Key Takeaways
- BPE tokenization strikes a balance between efficiency and practicality.
- Tokenization deeply affects model size, training cost, and downstream performance.
- In this course, we’ll build a tokenizer from scratch and integrate it into our pipeline.
Next up: PyTorch building blocks and efficient training routines.