Understanding LLMs · Part 1

Understanding LLMs: The Map

From raw text to streaming response - every step of how large language models are built and how they process your prompts.

In the previous post, I explored how programming languages work - the compiler pipeline that turns source code into something a machine can execute. Seven stages, from lexer to code generation. That pipeline has been refined over decades and is well understood.

LLMs have a pipeline too. A different one. When I started pulling the thread - what happens between the moment you type a prompt and the moment text streams back? - I found something far more layered than I expected. Two distinct pipelines, fifteen steps, and a surprising amount of engineering that has nothing to do with neural networks.

This post is the map - the complete pipeline, end to end.

Two Pipelines, One Model

An LLM has two lives. First it’s built (training), then it’s used (inference). These happen at different times, in different places, using different algorithms. But they share the same model - the weights learned during training are the weights used during inference.

When you type a prompt into Claude Code, you’re seeing inference. But everything the model knows - every pattern, every capability, every failure mode - was determined during training, weeks or months earlier.

Let’s walk through both.


Part I: How an LLM Gets Built

1. Data Collection and Curation

Everything starts with text. Massive amounts of it.

The typical corpus begins with web crawls - Common Crawl alone contains petabytes of raw HTML. But raw web data is noisy. The curation pipeline is where the real work happens: language identification filters to target languages, deduplication removes near-copies (critical - duplicate data degrades model quality measurably), quality classifiers score documents against a “Wikipedia-like” standard, and content filters remove toxic material, PII, and malware.

Then comes domain balancing - the ratio of web text to books to code to academic papers matters enormously. Too much code and the model talks like a compiler. Too little and it can’t write a function.

The output is a cleaned, deduplicated, balanced corpus - typically trillions of tokens. This process takes months.

2. Tokenizer Training

Before the model can read a single word, text must become numbers. A tokenizer is trained on a representative sample of the corpus - not all of it, just enough for the frequency statistics to converge - using Byte Pair Encoding (BPE), a compression algorithm that discovers which character sequences appear frequently enough to deserve their own token.

One key decision happens before the tokenizer can be trained: vocabulary size. This is an architecture choice, not a linguistic one. LLaMA 1 chose 32K tokens. LLaMA 3 jumped to 128K. BLOOM chose 250,880 - a number divisible by 128 (GPU memory alignment) and by 4 (tensor parallelism). The vocabulary size is set, then BPE runs until it fills that many slots.

The result is a merge table - an ordered list of roughly 100,000 rules that define how text gets split. “the” becomes one token. “tokenization” becomes two (“token” + “ization”). This merge table is frozen and never changes again. Change it, and you retrain the entire model.
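The replay idea fits in a few lines of Python. This is a toy with a hand-written merge table of a dozen rules, not a real learned table of ~100,000:

```python
# Toy sketch of how a frozen BPE merge table splits text. The merges below
# are hand-picked for the demo; real tables are learned from corpus statistics.
def bpe_split(word: str, merges: list[tuple[str, str]]) -> list[str]:
    tokens = list(word)                      # start from single characters
    for pair in merges:                      # replay merges in learned priority order
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

merges = [("t", "h"), ("th", "e"),                               # builds "the"
          ("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),  # builds "token"
          ("i", "z"), ("a", "t"), ("iz", "at"), ("izat", "i"),
          ("izati", "o"), ("izatio", "n")]                       # builds "ization"

print(bpe_split("the", merges))           # ['the'] - one token
print(bpe_split("tokenization", merges))  # ['token', 'ization'] - two tokens
```

Swap in a different merge table and the same string splits differently - which is exactly why changing the table means retraining the model.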

This isn’t theoretical. When Anthropic shipped Claude 4.7, they included a new tokenizer - a different merge table that produces smaller, more granular tokens. The same text now becomes 1.3-1.47x more tokens. That meant retraining the entire model from scratch - new merge table, new embedding layer, new weights. They accepted that cost because finer tokens gave the model more literal instruction following and fewer tool-call errors.

Our running example through the tokenizer:

"The bank by the river had no money"
→ [791, 7085, 553, 279, 15140, 1047, 912, 3300]

“bank” is now 7085. Just a number. The tokenizer has no idea it means two different things.

3. Pre-training

This is the main event - the most expensive step, often running for months on thousands of GPUs.

Two terms worth defining first. Tokens are the sub-word chunks a model thinks in - pieces like “cat”, “ sat”, “ believ”, “ able” - built from a sample of training text by repeatedly merging the most common adjacent pairs until a fixed vocabulary is filled (tens to hundreds of thousands of tokens, depending on the model). That set is the vocabulary. The corpus is the trillions of tokens of actual text the model reads during training. The vocabulary is the lens; the corpus is what you look at through it.

The training loop is deceptively simple in concept. Take a sequence of tokens. For each position, predict the next token. Compare the prediction to reality. Adjust the weights to make the prediction slightly better. Repeat billions of times.

But a token is just an integer - 7085 for “bank.” A neural network can’t do math on raw integers. So the model starts with an embedding table: one row per vocabulary token, each row a vector - a list of several thousand numbers. At initialization, these are random. No meaning, no intelligence - just noise. When token 7085 enters the model, the embedding step is just a table lookup: give me row 7085. That’s it.
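The lookup really is that simple. A sketch with a toy table - tiny dimensions for the demo, where real models use several thousand:

```python
import random

random.seed(0)
vocab_size, d_model = 50_000, 8   # toy dimensions; real models use thousands

# The embedding table: one vector per vocabulary token. At initialization
# it's pure noise - training is what gives these numbers meaning.
embedding = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
             for _ in range(vocab_size)]

token_id = 7085                   # "bank" in our running example
vector = embedding[token_id]      # the embedding "lookup" is just row indexing
print(len(vector))                # 8
```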

These random vectors are what the model’s architecture receives. That architecture is a transformer - a stack of layers (typically 32-128, depending on model size), each containing two sub-components:

  • Multi-head attention: lets each token look at every previous token and decide which ones are relevant. This is where “bank” would learn to attend to “river.”
  • Feed-forward network: processes each token independently through a non-linear transformation. This is where most of the model’s knowledge gets stored.

Each token’s vector passes through all layers and exits as a prediction of what comes next. The difference between prediction and reality is the loss - and backpropagation adjusts every weight in every layer, including the embedding table itself, to reduce that loss. After billions of updates, the random numbers have been shaped into something meaningful. Tokens that behave similarly in text - “cat” and “kitten”, “big” and “large” - end up with similar vectors. Not because anyone designed it that way, but because similar vectors made the model better at predicting the next token.
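The loss at the heart of all this is ordinary cross-entropy on the next token. A minimal sketch with a four-token vocabulary:

```python
import math

def cross_entropy(logits, target):
    """Next-token loss: negative log-probability of the true next token."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return -math.log(exps[target] / total)

# Tiny 4-token vocabulary; raw scores the model produced for the next position
logits = [2.0, 0.5, -1.0, 0.1]
loss_right = cross_entropy(logits, target=0)  # true token was the model's favorite
loss_wrong = cross_entropy(logits, target=2)  # true token scored poorly
print(loss_right < loss_wrong)                # True
```

Backpropagation nudges every weight in the direction that shrinks this number, billions of times over.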

The only training signal is next-token prediction. The model learns grammar, facts, reasoning, code, translation, and humor - all from predicting what word comes next.

4. Post-Training and Alignment

The pre-trained model is a powerful text completer, but not a useful assistant. It will happily continue any text you give it, including toxic content, hallucinations, and rambling. Post-training transforms it into something helpful.

This happens in stages:

Supervised Fine-Tuning (SFT): The model trains on thousands of carefully written (instruction, response) pairs - examples of ideal assistant behavior. This teaches it the format of being helpful.

Preference Optimization: Humans (or AI systems) rank multiple model responses to the same prompt. The model then learns to prefer the higher-ranked responses. Two approaches dominate:

  • RLHF (Reinforcement Learning from Human Feedback) - trains a separate reward model, then optimizes the LLM against it using reinforcement learning.
  • DPO (Direct Preference Optimization) - skips the reward model entirely and trains directly on preference pairs. Simpler, increasingly popular.
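The DPO objective itself fits in a few lines. A sketch assuming per-response log-probabilities have already been computed - the values below are made up for illustration:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    """DPO: push the policy's preference margin above the frozen reference's."""
    margin = beta * ((logp_chosen - logp_chosen_ref)
                     - (logp_rejected - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Made-up log-probs: the policy prefers the chosen response more than the
# reference model does, so the loss is small; flip the preference and it grows.
good = dpo_loss(-10.0, -14.0, -12.0, -12.0)
bad  = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(good < bad)   # True
```

No reward model, no RL rollout - just gradient descent on preference pairs, which is why it's simpler than RLHF.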

Constitutional AI (Anthropic’s approach): Instead of relying solely on human feedback, the model critiques and revises its own responses according to a set of principles. This scales better than human annotation.

Reasoning Training (newer): For models like OpenAI’s o1 or DeepSeek-R1, a separate RL phase trains the model to produce explicit chains of reasoning before answering. This is a distinct step beyond general alignment.

5. Evaluation and Testing

Before deployment, the model runs through automated benchmarks (MMLU for knowledge, HumanEval for code, GSM8K for math), human evaluation (blind side-by-side comparisons), and safety testing (red-teaming, bias audits, capability evaluations for dangerous knowledge).

6. Deployment Preparation

The trained model weights are optimized for serving: quantization reduces numerical precision (32-bit to 8-bit or even 4-bit) to cut memory and increase speed with minimal quality loss. Serving infrastructure (vLLM, TensorRT-LLM) is configured for efficient multi-user serving.
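Quantization at its simplest is one scale factor. A sketch of symmetric per-tensor int8 quantization - production stacks use finer per-channel or per-group schemes, but the idea is the same:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]   # store these 1-byte ints
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]             # recover approximate floats

weights = [0.031, -0.254, 0.118, 0.007]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)   # close to the originals, 4x smaller than fp32
```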


Part II: What Happens When You Send a Prompt

Here’s every step when you hit Enter in Claude Code. Most of this is invisible to you.

YOUR MACHINE (Claude Code)
│  You type: "The bank by the river had no money"
│  Claude Code sends raw text over HTTPS

└──→ ANTHROPIC'S SERVERS ───────────────────────────────────

7. API Gateway

Your request hits infrastructure first. Authentication, rate limiting, quota checks, request validation. If your API key is invalid or your rate limit is exceeded, you get rejected here - no GPU touched.

8. Prompt Assembly

The raw text you typed is just one piece. The server assembles the full prompt:

  • System prompt: safety instructions, behavioral guidelines, current date
  • Tool definitions: schemas for any tools the model can use (in Claude Code, this includes file reading, editing, bash execution, etc.)
  • Conversation history: all previous turns in the conversation
  • Your message: what you just typed
  • Special tokens: delimiters that tell the model where each role’s message begins and ends

A typical Claude Code prompt is thousands of tokens before your message even appears.
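A sketch of the assembly step. The `<|role|>` delimiters here are illustrative placeholders - real special tokens are model-specific, and Anthropic's actual format is not public:

```python
# Hypothetical prompt assembly. Delimiters are illustrative placeholders only.
def assemble_prompt(system, tools, history, user_msg):
    parts = [f"<|system|>{system}"]
    for tool in tools:                       # tool schemas, usually JSON
        parts.append(f"<|tool|>{tool}")
    for role, text in history:               # all previous conversation turns
        parts.append(f"<|{role}|>{text}")
    parts.append(f"<|user|>{user_msg}")      # your message comes last
    return "".join(parts)

prompt = assemble_prompt(
    system="Be concise.",
    tools=['{"name": "read_file"}'],
    history=[("user", "hi"), ("assistant", "hello")],
    user_msg="The bank by the river had no money",
)
```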

9. Tokenization

The assembled prompt is split into token IDs using the same merge table that was built during training. First, a regex pre-tokenizes the text into chunks (splitting contractions, separating numbers, isolating punctuation). Then BPE merge rules replay in priority order, turning each chunk into token IDs.

This runs on the CPU. It’s fast - millions of tokens per second.
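The pre-tokenization pass can be sketched with a simplified pattern in the spirit of GPT-style tokenizers - not any production tokenizer's actual regex:

```python
import re

# Simplified pre-tokenization pattern (illustrative, not a real tokenizer's):
# contractions, optionally-space-prefixed words, numbers, punctuation, whitespace.
pretokenize = re.compile(r"'s|'t|'re|'ve| ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

chunks = pretokenize.findall("The bank by the river had no money")
print(chunks)
# ['The', ' bank', ' by', ' the', ' river', ' had', ' no', ' money']
```

Note the leading spaces: “ bank” and “bank” are different chunks, and usually different tokens.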

10. Prefill (Processing the Prompt)

The token IDs enter the GPU. The entire prompt is processed in a single parallel forward pass - this is what GPUs excel at. Each token passes through:

  1. Embedding lookup: token ID 7085 becomes a vector of 4,096-12,288 dimensions
  2. Positional encoding (RoPE): rotation applied to encode where each token sits in the sequence
  3. Transformer layers (32-128 of them, each containing):
    • Multi-head attention: each token attends to all previous tokens
    • Feed-forward network: each token processed independently
    • Residual connections and normalization
  4. KV cache populated: key and value vectors for every token, at every layer, are stored in GPU memory for later reuse

This phase is compute-bound - the bottleneck is raw computation speed. A long prompt means a slow Time to First Token (TTFT).
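Some back-of-envelope arithmetic shows why that cache dominates GPU memory. The model dimensions below are hypothetical but representative:

```python
# Back-of-envelope KV cache size (hypothetical mid-size model, fp16 values)
layers   = 64        # transformer layers
kv_heads = 8         # grouped-query attention: K/V heads, not query heads
head_dim = 128
bytes_per_value = 2  # fp16

# Per token: a key vector AND a value vector, at every layer
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(per_token)     # 262144 bytes = 256 KiB per token

# A 10,000-token prompt:
total_gib = per_token * 10_000 / 2**30
print(total_gib)     # ~2.44 GiB of GPU memory, just for the cache
```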

11. Sampling the First Token

The final layer outputs a vector for the last token position. This vector is projected to vocabulary size (~100K dimensions) to produce logits - a raw score for every possible next token.

These logits are then shaped by:

  • Temperature: controls randomness (lower = more deterministic)
  • Top-k: keeps only the k most likely tokens
  • Top-p (nucleus sampling): keeps the smallest set of tokens whose cumulative probability exceeds p

One token is sampled from the resulting distribution.
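The three filters compose in order. A sketch of the sampling step - implementations differ in details like tie-breaking and renormalization, but the shape is this:

```python
import math
import random

def sample(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """One sampling step: temperature -> top-k -> top-p -> draw a token id."""
    scaled = [l / temperature for l in logits]   # lower temp = sharper distribution
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]                      # top-k: keep the k most likely
    kept, cum = [], 0.0
    for p, i in ranked:                          # top-p: smallest set whose
        kept.append((p, i))                      # cumulative probability >= p
        cum += p
        if cum >= top_p:
            break
    r = rng.random() * sum(p for p, _ in kept)   # draw from the surviving set
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

With a near-zero temperature this collapses to always picking the top token; with `top_k=1` it's greedy decoding regardless of temperature.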

12. Decode Loop (Token by Token)

Now the autoregressive loop begins. Each new token is generated one at a time:

  1. Feed the new token through all transformer layers
  2. But only compute Q, K, V for this one token (not the whole sequence)
  3. Attend over the cached K, V from all previous tokens (this is why the KV cache exists)
  4. Append this token’s K, V to the cache
  5. Compute logits, sample next token
  6. Convert token ID to text, stream to client
  7. Check stop conditions (EOS token, stop sequence, max tokens)
  8. Repeat

This phase is memory-bandwidth-bound - each token requires reading the entire model’s weights from GPU memory, but does very little computation per parameter. This is why generation speed (tokens per second) is roughly constant regardless of prompt length.
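The loop's structure, with a toy model standing in for the real forward pass:

```python
# Structural sketch of the autoregressive decode loop. forward_one is a toy
# stand-in for a real model's single-token forward pass over a KV cache.
def decode_loop(prompt_ids, forward_one, kv_cache, eos_id, max_tokens=256):
    out = []
    token = prompt_ids[-1]                  # prefill already processed the rest
    for _ in range(max_tokens):
        logits, entry = forward_one(token, kv_cache)  # Q,K,V for ONE token only
        kv_cache.append(entry)                        # append its K,V to the cache
        token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if token == eos_id:                           # stop condition
            break
        out.append(token)                             # in reality: stream to client
    return out

def toy_forward(token, kv_cache):
    """Toy 'model': always predicts token + 1 over a 5-token vocabulary."""
    logits = [0.0] * 5
    logits[(token + 1) % 5] = 1.0
    return logits, token

print(decode_loop([0], toy_forward, kv_cache=[], eos_id=4))  # [1, 2, 3]
```

The cache grows by one entry per generated token - which is the whole trick: each step reuses every previous step's K and V instead of recomputing them.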

13. Tool Use (When Applicable)

When the model decides to call a tool - say, reading a file - the pipeline interrupts:

  1. Model emits a structured tool call and stops generating
  2. Response streams back to Claude Code with stop reason “tool_use”
  3. Claude Code executes the tool locally (reads the file, runs the command, etc.)
  4. Tool result is sent back as a new message
  5. The server assembles a new prompt (original + tool call + tool result)
  6. Prefill + decode starts again from step 10

Each tool call is a full round-trip. Prompt caching avoids recomputing the KV cache for the unchanged prefix.

14. Extended Thinking

When Claude uses extended thinking, it generates reasoning tokens before the visible response. This is the same autoregressive loop - there’s no separate “thinking module.” The model simply generates into a thinking region (with its own token budget), and those tokens become context that influences the final answer through attention.

15. Response Complete

The final token is generated. Output moderation runs a safety check on the response. Usage is counted (input tokens, output tokens, cached vs. uncached) for billing. The completion event streams back to your terminal.


The Running Example, End to End

From the moment you type “The bank by the river had no money” to the moment you see a response:

  • Prompt assembly - your text joins system prompt, tools, history (server CPU)
  • Pre-tokenize - regex splits into [“The”, “ bank”, “ by”, “ the”, “ river”, “ had”, “ no”, “ money”] (server CPU)
  • BPE encode - merge rules produce [791, 7085, 553, 279, 15140, 1047, 912, 3300] (server CPU)
  • Embedding - 7085 becomes a vector: [0.23, -0.41, 0.87, …], thousands of dimensions (GPU)
  • Attention - “bank” attends to “river” (high weight) and “money” (lower weight); the model resolves that this is a riverbank, not a financial institution (GPU)
  • Generation - the model predicts the most likely next token, one at a time (GPU)
  • Decode + stream - token IDs convert back to text, stream to your terminal (server CPU)

“bank” started as the string b-a-n-k. It became the integer 7085. That integer became a point in high-dimensional space. Attention shifted that point toward “riverbank” by connecting it to “river.” And the model generated its response understanding the joke.

Every step in this pipeline exists because the previous step wasn’t enough. Tokenization alone doesn’t capture meaning - you need embeddings. Embeddings alone don’t capture context - you need attention. Attention alone doesn’t generate text - you need the autoregressive loop.


What’s Next

This post is the map. The next posts are the territory.

Each step above has enough depth for its own deep-dive. The series will explore them one by one, using “The bank by the river had no money” as the running example throughout. Tokenization, embeddings, attention, generation, training, alignment - each gets its own post.

The tokenizer doesn’t understand language. The embedding layer doesn’t understand context. The attention mechanism doesn’t generate text. But stacked together, they produce something that feels like understanding.

That’s the complete picture. Now let’s go deeper.



Co-written with AI. Credit the prose, blame the opinions.