How LLMs Get Built
Inside the months-long pipeline that turns trillions of words of text into a deployable LLM: data, tokenizer, pre-training, alignment, and shipping.
Part 1 of the deep-dive series, picking up from The Map. The Map laid out all fifteen steps from raw text to streaming response. This post zooms into the first half: everything that has to happen before you type a single prompt.
The Map closed with “Now let’s go deeper.” This is what’s underneath Part I.
Six steps. Roughly six months. Thousands of GPUs and billions of dollars when you scale it to the frontier. But that’s the easy part to grasp. The harder thing is that what we casually call “training” is actually three distinct phases, and most explanations conflate them.
Pre-training is the expensive one - the part you’ve heard about. Post-training is what makes the model useful. Reasoning training is a third phase, newer, that some models add. Each one runs on a different signal, fixes a different problem, and produces a noticeably different model from the one that went in.
By the end of this post, the sequence should be in your head as a specific journey: corpus → tokenizer → pre-train → align → evaluate → deploy. Three distinct learning signals stacked along the way. Six structural steps with clean boundaries.
1. The corpus is not the internet
The phrase “trained on the internet” is shorthand. It’s also wrong.
The starting point IS web data - typically Common Crawl, an open repository of crawled HTML containing petabytes of raw pages. But raw web data is mostly garbage: spam, templates, duplicates, boilerplate navigation, machine-generated SEO content. The training corpus is what’s left after the curation pipeline finishes its work.
The pipeline runs in stages, each filtering out a specific failure mode:
- Language identification. Filter to target languages. A model trained on a multilingual corpus needs predictable language ratios. Random non-English in the middle of an English document confuses the signal and dilutes the gradient.
- Deduplication. Remove near-copies. This step is more important than it sounds. Duplicate data measurably degrades model quality. A model that sees the same paragraph 100 times learns to memorize, not generalize. Production pipelines use both exact-match dedup (content hashing) and near-duplicate detection (MinHash, locality-sensitive hashing); a simplified sketch follows this list.
- Quality classifiers. Score each document against a “Wikipedia-like” reference. The classifier itself is a smaller model trained on labeled examples of “good” and “bad” text. Documents below a threshold get dropped.
- Content filters. Remove toxic content, personally identifiable information, and known malware or exploit text. Partly a legal requirement, partly a quality measure - toxic content reinforces toxic patterns the model would reproduce later.
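To make the dedup stage concrete, here is a minimal sketch in Python: exact duplicates caught by hashing the document, near-duplicates by comparing character-shingle overlap. Production pipelines do the same thing at petabyte scale with MinHash signatures and locality-sensitive hashing instead of pairwise comparison; the function names and the threshold here are illustrative, not anyone's production code.

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-gram shingles used for near-duplicate comparison."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two shingle sets; 1.0 means identical."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(docs: list[str], threshold: float = 0.7) -> list[str]:
    seen_hashes: set[str] = set()          # catches exact copies
    kept: list[tuple[str, set[str]]] = []  # survivors with their shingle sets
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h in seen_hashes:
            continue                       # exact copy of something already kept
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue                       # near-copy of something already kept
        seen_hashes.add(h)
        kept.append((doc, sh))
    return [doc for doc, _ in kept]

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",   # exact duplicate
    "The quick brown fox jumped over the lazy dog.",  # near duplicate
    "Completely different text about tokenizers.",
]
print(dedup(corpus))   # keeps only the first and last documents
```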
After the filters, you balance domains. Too much code in the corpus and the model starts talking like a compiler. Too little and it can’t write a function. The ratio of web text to books to code to academic papers is a design decision in its own right, made by humans, and it shows up in the model’s behavior at inference time.
The output of this whole pipeline is what actually trains the model: a cleaned, deduplicated, balanced corpus running into the trillions of tokens. LLaMA 3 trained on 15 trillion. GPT-4’s training set has never been disclosed but is estimated to be of similar scale. Building this corpus takes months - large teams, large compute, large storage budgets - all before any model weights have been touched.
2. Tokenizer first, frozen forever
The corpus is text. The model thinks in integers. Something has to convert one to the other, and that thing has to exist before the model can learn anything.
The Map covered the basics. A tokenizer is trained on a representative sample of the corpus (not all of it - frequency statistics converge fast) using Byte Pair Encoding. The output is a merge table: an ordered list of roughly 100,000 rules describing how text decomposes into tokens. “the” is one token. “tokenization” is two: “token” + “ization.” “ChatGPT” is three.
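To see how a merge table comes into being, here is a toy sketch of BPE training in Python. The word frequencies are invented and real tokenizers operate at byte level over terabytes of text, but the loop is the same idea: repeatedly fuse the most frequent adjacent pair until the vocabulary budget is spent.

```python
from collections import Counter

def get_pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count each adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge rule: replace every occurrence of the pair with a fused symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy "corpus": word frequencies, each word split into characters.
vocab = {tuple("token"): 10, tuple("tokens"): 6, tuple("tokenization"): 4, tuple("broken"): 3}
merges = []                      # this list is the merge table
for _ in range(8):               # budget: stop after 8 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)   # early merges fuse the most frequent pairs; "token" soon becomes one symbol
```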
What the Map didn’t dwell on is why this is its own step.
Tokenization happens before pre-training because the tokenizer determines what the model can perceive. Change the merge table after the model exists and every learned association breaks. The embedding layer was indexed by token IDs; now those IDs point to different strings. The whole model becomes incoherent. That’s why the merge table is frozen forever the moment pre-training starts: shipping a new tokenizer means retraining the entire model from scratch.
One choice happens before the tokenizer can even be trained: vocabulary size. This is an architecture decision, not a linguistic one. LLaMA 1 chose 32,000 tokens. LLaMA 3 jumped to 128,000. BLOOM picked 250,880, a number divisible by 128 (GPU memory alignment) and by 4 (tensor parallelism). The vocabulary size sets a budget; BPE runs until the budget is filled.
Larger vocabulary means common words become single tokens, fewer tokens per text, cheaper inference. The cost is a bigger embedding table and more parameters. The industry trend has been to push vocabulary up - GPT-4o uses 200,000, Gemini uses 256,000 - betting that token efficiency matters more than parameter savings.
The merge table ships with every model file. Open the tokenizer.json of any modern open-weights LLM and you’re looking at the artifact built during this step. Same artifact, every prompt, model’s lifetime.
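If you want to poke at one yourself, the Hugging Face tokenizers library can load that file directly. A quick sketch, assuming you have the tokenizers package installed and a tokenizer.json downloaded from some open-weights model; the exact splits vary from model to model:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")   # the frozen merge table + vocabulary
enc = tok.encode("tokenization")
print(enc.tokens)   # e.g. ['token', 'ization'] - depends on the model's merge table
print(enc.ids)      # the integer IDs the model actually sees
```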
3. Pre-training: the next-token game
This is the expensive step. Months of wall-clock time. Thousands of GPUs running in parallel. Hundreds of millions of dollars at frontier scale. The thing you’re imagining when you imagine “training the model.”
The setup looks deceptively simple.
A token ID is just an integer. A neural network can’t multiply integers and get useful gradients, so the first thing the model needs is an embedding table: one row per vocabulary token, each row a vector of several thousand numbers. At initialization these vectors are pure random noise. The embedding table is what turns “integer 7085” into “a vector the model can do math on.”
The architecture that processes those vectors is a transformer: a stack of layers, typically 32 to 128 deep depending on model size. Each layer has two main components.
- Multi-head attention lets each token look at every previous token in the context and decide which ones matter. This is where the model learns relationships - that “bank” attends more to “river” in one sentence and more to “money” in another, depending on which neighbors are present.
- Feed-forward network processes each token’s vector independently through a non-linear transformation. This is where most of the model’s knowledge lives: facts, syntax patterns, code conventions, the connection between “Paris” and “France.” Attention routes information; the FFN stores it.
A token’s vector enters the bottom of the stack, passes through every layer, and exits as a prediction over the entire vocabulary - a probability distribution over which token comes next.
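Here is what that stack looks like as code - a deliberately tiny sketch in PyTorch, with made-up dimensions and none of the refinements (rotary position embeddings, RMSNorm, grouped-query attention) that production models use. The names TinyDecoderBlock and TinyLM are mine; the shape of the computation is the point.

```python
import torch
import torch.nn as nn

class TinyDecoderBlock(nn.Module):
    """One transformer layer: causal self-attention, then a feed-forward network."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True above the diagonal means "future positions are off-limits."
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # each token attends only to the past
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))                       # most of the "knowledge" lives here
        return x

class TinyLM(nn.Module):
    """Embedding table -> stack of blocks -> logits over the vocabulary."""
    def __init__(self, vocab_size=1000, d_model=128, n_layers=4, n_heads=4, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one row per token, random noise at init
        self.pos = nn.Embedding(max_len, d_model)        # positional information
        self.blocks = nn.ModuleList([TinyDecoderBlock(d_model, n_heads) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)       # prediction over the whole vocabulary

    def forward(self, token_ids):                        # (batch, seq_len) of integer token IDs
        T = token_ids.size(1)
        x = self.embed(token_ids) + self.pos(torch.arange(T, device=token_ids.device))
        for block in self.blocks:
            x = block(x)
        return self.head(x)                              # (batch, seq_len, vocab_size) logits
```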
Then the training loop runs (a code sketch follows the list):
- Take a sequence of tokens from the corpus.
- For each position, predict the next token.
- Compare the prediction to the actual next token (which you know - it’s just the corpus shifted by one).
- Compute the loss: how wrong was the prediction.
- Backpropagate the loss to adjust every weight in every layer.
- Repeat. Billions of times.
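In code, the whole loop is almost embarrassingly short. A sketch using the TinyLM from the previous section, with random integers standing in for real corpus tokens; at frontier scale the same loop is sharded across thousands of GPUs and runs for months.

```python
import torch
import torch.nn.functional as F

model = TinyLM(vocab_size=1000)                    # the sketch from the previous section
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(1000):                           # real runs: billions of steps, thousands of GPUs
    # A batch of token sequences from the corpus (random stand-ins here).
    batch = torch.randint(0, 1000, (8, 65))        # (batch, seq_len + 1)
    inputs, targets = batch[:, :-1], batch[:, 1:]  # targets are just the inputs shifted by one

    logits = model(inputs)                         # (batch, seq_len, vocab) predictions
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    opt.zero_grad()
    loss.backward()                                # backpropagate through every layer
    opt.step()                                     # nudge every weight toward better predictions
```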
The only training signal is “predict the next token.” That’s it. From this single signal, the model learns grammar, facts, reasoning, code, translation, dialogue, humor. Not because anyone designed those capabilities. Because they all reduce, eventually, to predicting what word comes next in some context.
Something striking happens to the embedding table during this process. The random vectors gradually drift. Tokens that behave similarly in text - “cat” and “kitten,” “big” and “large,” “Paris” and “London” - end up with similar vectors. The geometry of the embedding space comes to mirror the structure of language, even though no one designed it that way. Similar usage produced similar gradients, and similar gradients moved the vectors to similar locations.
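You can watch this happen by measuring the angle between embedding rows as training progresses. A hypothetical two-liner against the model above, where cat_id and kitten_id stand in for the token IDs of "cat" and "kitten" (you would look them up in the tokenizer):

```python
import torch.nn.functional as F

# cat_id and kitten_id are hypothetical - look them up in a real model's tokenizer.
cat_vec, kitten_vec = model.embed.weight[cat_id], model.embed.weight[kitten_id]
print(F.cosine_similarity(cat_vec, kitten_vec, dim=0))  # rises as training pulls similar-usage tokens together
```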
There’s no secret algorithm. There’s a single objective (“predict the next token”) applied to a colossal amount of text, and the structure of language imprints itself on the weights through gradient descent.
What comes out the other end after months of training is called a base model or foundation model. It’s astonishingly capable: it can complete code, solve math, translate languages, write essays. It’s also nearly useless as a product. Ask a base model “what’s the capital of France?” and it might respond “What’s the capital of Germany? What’s the capital of Italy?” because it pattern-matched the format and decided the most likely next text is more questions of the same shape, not an answer.
That’s where post-training comes in.
4. Post-training: making it useful
Three distinct phases live inside what most people call “fine-tuning.” They run sequentially, each fixing what the previous one couldn’t.
Supervised Fine-Tuning (SFT) comes first. The model trains on tens of thousands of carefully written (instruction, response) pairs - examples of what helpful assistant behavior looks like. “Translate this sentence” paired with a clean translation. “Explain quicksort” paired with a clear explanation. “How do I reverse a string in Python?” paired with working code and a brief explanation.
SFT teaches the format of being helpful. The base model already knows things; SFT teaches it how to deliver them. After SFT, the model stops continuing prompts and starts answering them. This is the single biggest behavioral change in the whole pipeline.
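Mechanically, SFT is the same next-token loss as pre-training; what changes is the data and a loss mask. A sketch of how one (instruction, response) pair becomes a training example - the chat template and special tokens here are illustrative, every model family has its own, and `tokenizer` stands in for any tokenizer whose encode() returns a list of IDs:

```python
# Illustrative chat format - real models each have their own special tokens and template.
def build_sft_example(instruction: str, response: str, tokenizer):
    prompt = f"<|user|>\n{instruction}\n<|assistant|>\n"
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(response + "<|end|>")
    input_ids = prompt_ids + response_ids
    # Loss mask: the model is only graded on the response tokens,
    # so it learns to produce answers rather than to echo questions.
    labels = [-100] * len(prompt_ids) + response_ids   # -100 = ignored by cross-entropy
    return input_ids, labels

example = build_sft_example(
    "How do I reverse a string in Python?",
    'Use slicing: "hello"[::-1] returns "olleh".',
    tokenizer,   # hypothetical: any tokenizer with encode() returning a list of IDs
)
```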
But SFT alone is brittle. The model can be helpful in the format it was shown - and confidently wrong outside that distribution. You also can’t write enough examples to cover every kind of question. A second phase has to teach the model what better responses look like, beyond what was demonstrated.
Preference Optimization is that second phase. Humans (or AI systems) rank multiple model responses to the same prompt: “Response A is better than Response B.” The model learns to prefer the higher-ranked responses across thousands of these comparisons. Two main approaches:
- RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model to predict human preferences, then uses that reward model to score the LLM’s outputs and update its weights via reinforcement learning. This was OpenAI’s approach with InstructGPT and the original ChatGPT.
- DPO (Direct Preference Optimization) skips the reward model. It updates the LLM directly from preference pairs using a clever loss function. Simpler, faster, and increasingly the default in 2026. A sketch of the loss follows this list.
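To make the DPO option concrete, here is a sketch of its loss on one batch of preference pairs. The sequence log-probabilities are assumed to be computed already (the summed per-token log-probs of each response under the policy being trained and under a frozen reference copy); beta is a hyperparameter controlling how far the policy may drift from the reference.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs (each argument: tensor of shape [batch]).

    Pushes the policy to widen the gap between chosen and rejected responses,
    measured relative to the frozen reference model so it doesn't drift too far.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```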
Anthropic adds a third trick called Constitutional AI: instead of relying entirely on human feedback, the model critiques and revises its own responses according to a set of written principles (“be helpful, be honest, avoid harmful content”). The model generates a response, critiques it against the constitution, rewrites it, and trains on the rewrite. This scales better than human annotation: you can generate millions of self-critiques cheaply, whereas each human ranking is slow and expensive.
The newest phase, added to models like OpenAI’s o1 and DeepSeek-R1, is reasoning training: a separate reinforcement learning stage that rewards the model for producing explicit chains of reasoning before answering. The model learns to think before it speaks. This is structurally different from earlier post-training - the reward is for the reasoning trace, not just the final answer.
The deployable assistant you talk to is the result of all of these stacked. The base model from pre-training had general capabilities. SFT taught it the format. Preference optimization taught it which response is better. Reasoning training (if applied) taught it to deliberate. Each phase fixes a problem the previous phase couldn’t.
When people say “the model was trained,” they usually mean pre-training. That’s a quarter of the actual story.
5. Eval and deploy
Training produces a model. Two more steps stand between that model and you.
Evaluation is how the lab knows the model is shippable. It runs against fixed benchmarks designed to test specific capabilities:
- MMLU measures general knowledge across 57 academic and professional subjects.
- HumanEval measures code generation: given a function signature and docstring, write working code.
- GSM8K measures grade-school math reasoning.
- MT-Bench measures multi-turn conversational ability.
Benchmarks alone aren’t enough - they can be gamed, and they don’t capture qualitative differences. So labs run human evaluation: blind side-by-side comparisons where annotators see two responses without knowing which model produced each, and rank them. The new model has to win statistically, not just match.
Safety testing runs in parallel: red-teaming (people actively try to make the model misbehave), bias audits, and capability evaluations for dangerous knowledge. Any of these can block a release.
If the model passes, it goes to deployment preparation. The trained weights aren’t directly servable - they’re far too expensive to run at scale. Two main optimizations happen here.
Quantization reduces numerical precision. The trained weights might be 16-bit floating point (about 30 GB for a 15B parameter model). Quantization compresses them to 8-bit or even 4-bit integers, often with negligible quality loss. The same model now fits on a smaller GPU and runs faster.
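The core idea fits in a few lines. A sketch of symmetric int8 quantization for a single weight matrix - real deployments use per-channel or per-group scales and smarter schemes (GPTQ, AWQ, FP8), but the memory arithmetic is the same:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: int8 weights plus one float scale."""
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)      # one weight matrix, ~33 MB at fp16
q, scale = quantize_int8(w.float())
print(q.element_size() * q.nelement() / 1e6, "MB")    # ~17 MB as int8 - half the memory
print((w - dequantize(q, scale)).abs().max())          # worst-case rounding error ~ scale / 2
```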
Serving infrastructure wraps the quantized model in software optimized for high-throughput inference: vLLM, TensorRT-LLM, SGLang. These systems handle batching across many concurrent users, KV-cache management for long contexts, paged attention to avoid memory fragmentation. The math is the same; the engineering around it is what makes the API economic to operate.
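Using one of these from Python is a short affair. A sketch with vLLM's offline API - the model name is only an example of an open-weights checkpoint, and running it assumes you have the weights and a capable GPU locally:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")          # loads weights, sets up batching + KV-cache paging
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```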
The model that responds to your prompt is the deployed model: pre-trained, post-trained, reasoning-trained (maybe), evaluated, quantized, served. Each step ahead of you, frozen by the time you arrived.
What’s Next
Everything in this post happens before you type a prompt. Months of work, three distinct training phases, six structural steps - all completed and frozen by the time you open the API.
The next post in the series is the other half of the Map: what happens when you send a prompt. The inference pipeline. From the moment your text hits Anthropic’s servers to the moment a response streams back to your terminal. Different timescale (milliseconds, not months), different machinery, different bottlenecks.
If pre-training is where the capability lives, inference is where it gets summoned. Same model, very different code path.
References
- Touvron et al., “LLaMA: Open and Efficient Foundation Language Models” - Open model with documented training pipeline (2023)
- Touvron et al., “LLaMA 2” - Detailed RLHF disclosure (2023)
- Ouyang et al., “Training language models to follow instructions with human feedback” - InstructGPT and the original RLHF (2022)
- Rafailov et al., “Direct Preference Optimization” - DPO (2023)
- Bai et al., “Constitutional AI: Harmlessness from AI Feedback” - Anthropic’s alignment approach (2022)
- Vaswani et al., “Attention Is All You Need” - The transformer architecture (2017)
- Hoffmann et al., “Training Compute-Optimal Large Language Models” - Chinchilla scaling laws (2022)
- Dao et al., “FlashAttention” - Memory-efficient attention (2022)
- OpenAI, tiktoken source code - BPE tokenizer implementation
Co-written with AI. Credit the prose, blame the opinions.