Primer Series: LLM Architecture
DRAFT — Unreviewed. May contain inaccuracies or oversimplifications. Don't trust this without verification.

How Language Models Think

Algorithmic intuition for software engineers. No linear algebra required.

12 minute read · Interactive

Chapter 1

The basic function abstraction

Talking to a language model is really just talking to a function:

function model(tokens: int[]): float[vocab_size] // returns P(next_token | tokens)

Token IDs are integers that represent chunks of text (not whole words, just pieces). You give the model a sequence of token IDs and it returns a probability distribution over every token in its vocabulary. The highest-probability token is its best guess at what comes next.

A vocabulary is typically 32k–128k tokens. "unhappiness" might be three tokens: un, happi, ness. The tokenizer handles the mapping from text to integers and back.
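The mapping can be sketched with a toy, hand-built vocabulary. A real tokenizer (e.g. BPE) learns its vocabulary and merge rules from data; `vocab`, `tokenize`, and `detokenize` here are invented purely for illustration:

```python
# Toy tokenizer sketch: a hand-built vocabulary, NOT a real BPE implementation.
# Real tokenizers learn their pieces from data; the text <-> integer mapping
# idea is the same.
vocab = {"un": 0, "happi": 1, "ness": 2}
inverse = {i: s for s, i in vocab.items()}

def tokenize(pieces):
    """Map text pieces to integer token IDs."""
    return [vocab[p] for p in pieces]

def detokenize(ids):
    """Map token IDs back to text."""
    return "".join(inverse[i] for i in ids)

ids = tokenize(["un", "happi", "ness"])
print(ids)              # [0, 1, 2]
print(detokenize(ids))  # unhappiness
```

The round trip is lossless by construction; everything downstream of the tokenizer only ever sees the integers.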

The model doesn't see text, just integers. It also returns floats, not text. All of the intelligence, all of the reasoning, is "just" a transformation from int[] to float[].

A language model is a next-token predictor. It takes a sequence of token IDs and returns a probability distribution over what comes next.

Chapter 2

The Loop

A single call to the model produces one token. To generate a full response, you run it in a loop:

tokens = tokenize("The cat sat on the")
while not done:
    probs = model(tokens)   // forward pass — the expensive part
    next = sample(probs)    // pick a token from the distribution
    tokens.append(next)     // grow the sequence by one

Every iteration, we feed the entire sequence back into the model — including the tokens it just generated. The model re-processes everything from scratch to produce the next token.

This is autoregressive generation. It's why models get slower as they write longer responses: each new token requires a forward pass over a longer input.
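Here is a minimal runnable version of the loop, with a lookup table standing in for the neural network. `toy_model`, `VOCAB`, and greedy `sample` are all invented for illustration; a real model's distribution comes from a forward pass, and sampling is usually stochastic:

```python
# Autoregressive loop with a stand-in "model": a deterministic lookup instead
# of a neural net. The loop structure (forward pass -> sample -> append) is
# the real thing; the model and vocabulary are toys.
VOCAB = ["The", " cat", " sat", " on", " the", " mat", "."]

def toy_model(tokens):
    """Return a probability distribution over VOCAB given the sequence so far."""
    next_id = (tokens[-1] + 1) % len(VOCAB)  # toy rule: predict the next entry
    probs = [0.0] * len(VOCAB)
    probs[next_id] = 1.0
    return probs

def sample(probs):
    """Greedy sampling: take the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

tokens = [0, 1, 2]                 # "The cat sat"
while tokens[-1] != VOCAB.index("."):
    probs = toy_model(tokens)      # forward pass (the expensive part in a real model)
    tokens.append(sample(probs))   # grow the sequence by one

print("".join(VOCAB[t] for t in tokens))  # The cat sat on the mat.
```

Note that the whole growing sequence is passed into the model on every iteration, which is exactly the cost the KV cache (below) attacks.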

Or it would, without a trick called the KV cache. The expensive part of a forward pass is computing intermediate representations for each token at each layer. But for tokens we've already processed, those representations don't change — only the newest token is genuinely new. So we cache the work from prior tokens and only compute the new one's contribution. The first token is expensive. Every token after that reuses prior work.

Generation is a loop: forward pass, sample, append, repeat. The KV cache avoids recomputing prior tokens. This is why "context length" and "time to first token" are distinct costs.

Chapter 3

Attention: The Data Structure

Now let's open the box. What happens inside model(tokens)?

The core operation is attention, and it's best understood as a data structure, not a metaphor. At each layer, every token produces three vectors:

Q (query): what this token is looking for
K (key): what this token offers for other tokens to match against
V (value): the information this token contributes when it's matched

The attention operation, for each token:

for token in sequence:
    scores = []
    for prior in sequence[:token.pos + 1]:    // this token and every earlier one
        scores.append(dot(token.Q, prior.K))  // how relevant is this?
    weights = softmax(scores)                 // normalize to sum to 1
    values = [prior.V for prior in sequence[:token.pos + 1]]
    output = weighted_sum(weights, values)    // blend the values

That's it. Each token's query asks a question. The question is answered by taking the dot product of that query against each key in its context, including its own. The softmax turns raw scores into a probability-like weighting. Then the token collects a weighted blend of the corresponding values.

The dot product is just a similarity score — if a query and key point in similar directions, that score is high. The token "bank" might have a query that scores highly against the key of "river" in one context and "account" in another. That's how context resolves ambiguity.
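The mechanism above can be run directly for a single token. The 2-dimensional vectors below are toys chosen so the first key aligns with the query; real models use vectors with dozens to hundreds of dimensions per head:

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One token's attention: score against each key, softmax, blend values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy 2-d vectors, invented for illustration.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]   # the first key aligns with the query
values = [[10.0, 0.0], [0.0, 10.0]]
out = attend(q, keys, values)
print(out)  # weighted toward the first value, since dot(q, keys[0]) is larger
```

Because dot(q, keys[0]) is larger than dot(q, keys[1]), the softmax puts more weight on the first value, and the output skews toward it. That skew is the "relevance" in action.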

Notice: each token can only attend to itself and the tokens before it, never the ones after. This is the causal mask — it's what makes the model autoregressive. Token 5 sees tokens 0–5 but not token 6.

Three vectors per token, per layer: Q, K, V. Dot the query against all prior keys, softmax, take a weighted sum of values. That's attention — and it's the only place in the network where information moves between token positions.

Now recall the KV cache from Chapter 2. The K and V vectors are what get cached. When generating token 100, tokens 0–99 already have their K and V computed and stored. We only need to compute Q, K, V for the new token, then dot its Q against the 100 cached keys. That's the optimization.
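That caching pattern can be sketched in a few lines. Identity functions stand in for the learned Q/K/V projections here, and every number is invented for illustration; a real model has separate learned projections at every layer and head:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# KV-cache sketch. The "projections" below are identity stand-ins; a real
# model computes Q, K, V with learned weight matrices.
k_cache, v_cache = [], []

def generate_step(embedding):
    """Process one NEW token: compute its Q/K/V, reuse cached K/V for the rest."""
    q, k, v = embedding, embedding, embedding     # toy projection
    k_cache.append(k)                             # cache grows by one per step
    v_cache.append(v)
    weights = softmax([dot(q, kk) for kk in k_cache])  # Q against ALL cached keys
    dim = len(v)
    return [sum(w * vv[d] for w, vv in zip(weights, v_cache)) for d in range(dim)]

for emb in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    out = generate_step(emb)

print(len(k_cache))  # 3 — one cached key per token processed
```

The per-step work is one new Q/K/V computation plus a dot product against the cache, instead of recomputing everything for every prior token.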

Chapter 4

The MLP: Per-Token Transform

After attention mixes information across positions, each token's vector gets transformed independently by a feedforward network — the MLP (multi-layer perceptron). Think of it as:

output = tokens.map(mlp) // no cross-talk between positions

The MLP applies the same function to every token, independently, in parallel. It doesn't know or care what the other tokens are — attention already handled that. The MLP's job is to transform each token's representation: refine it, add information from the weights, decide what features to amplify or suppress.

If attention is "gather relevant context from other tokens," the MLP is "now that I have context, think about it." Research suggests the MLP layers are where the model stores factual knowledge — the associative memory that maps "The capital of France is" to a representation pointing toward "Paris."
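A minimal per-token MLP looks like the sketch below. The tiny hand-picked weights are invented for illustration; a real MLP has two large learned matrices and operates on vectors with thousands of dimensions:

```python
# Per-token MLP sketch: the SAME function applied to every position,
# independently. Weights are toy hand-picked numbers, not learned values.
W1 = [[1.0, -1.0], [0.5, 0.5]]   # "expand / mix features" layer
W2 = [[1.0, 0.0], [0.0, 1.0]]    # "project back" layer (identity here)

def relu(x):
    """Nonlinearity: zero out negative activations."""
    return max(0.0, x)

def mlp(vec):
    hidden = [relu(sum(w * x for w, x in zip(row, vec))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

tokens = [[1.0, 2.0], [3.0, 0.0], [0.0, -1.0]]
output = list(map(mlp, tokens))   # no cross-talk between positions
print(output)
```

Each position is transformed with no reference to any other position, which is why the `map` framing from the text is accurate.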

Attention moves information between positions. The MLP transforms each position independently. These are the only two operations, and they alternate.

Chapter 5

The Full Picture

A transformer is these two operations, repeated:

input embeddings
mix_across_positions (attention)
transform_each_position (MLP)
× N layers (30–100+)
float[vocab_size] → probabilities

Each layer builds on the previous one. The first layer operates on raw token embeddings. The last layer produces the final representation that gets mapped to a probability distribution. In between, each layer has the opportunity to move information between positions (attention) and then process each position (MLP).
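The stack above can be written as a skeleton. The two operations are identity placeholders here; the point is the shape of the computation, not the math inside each block:

```python
# Skeleton of a transformer forward pass: attention then MLP, N times.
# `attend_all` and `mlp` are placeholders (identity functions), standing in
# for the real operations described in Chapters 3 and 4.
N_LAYERS = 4

def attend_all(xs):
    """Placeholder for 'mix information across positions'."""
    return xs

def mlp(x):
    """Placeholder for 'transform one position independently'."""
    return x

def forward(embeddings):
    xs = embeddings
    for _ in range(N_LAYERS):
        xs = attend_all(xs)        # mix_across_positions (attention)
        xs = [mlp(x) for x in xs]  # transform_each_position (MLP)
    return xs                      # then projected to float[vocab_size]

print(forward([[0.1], [0.2]]))
```

Swapping in real attention and MLP implementations changes what each block computes, but not this alternating structure.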

There's a rough tendency for earlier layers to handle more syntactic, local patterns and later layers to handle more semantic, abstract ones — but the reality is messier. Features are distributed throughout the network in ways we don't fully understand. Individual layers don't have clean "jobs." It's more like a gradient than a stack of labeled boxes.

Modern models have 30–100+ of these blocks, each with its own learned attention and MLP weights. GPT-4-class models are reported to have on the order of a trillion total parameters across all layers. The depth is what gives them capacity for complex reasoning — not because any single layer is complex, but because the composition of many simple operations can represent very complex functions.

A transformer is mix_across_positions then transform_each_position, repeated N times. Two simple operations, composed deeply.

Chapter 6

Watching It Generate

This brings it all together. Step through autoregressive generation and watch the KV cache grow. Each step shows the new token's query reaching back into cached keys from all prior tokens.

Interactive: Step-Through Generation

Click Next Token to advance one step. Watch the query reach back, attention scores form, and the KV cache grow.

Prompt: "The cat sat"
Chapter 7

Why This Matters

You don't need to understand matrix multiplication to build good intuition for how these systems behave. But knowing the algorithm — even at pseudocode depth — changes how you reason about them.

The autoregressive loop explains latency. Each token requires a full forward pass. Generating 1000 tokens means 1000 sequential passes, regardless of parallelism. Long responses are expensive not because the model is "thinking harder" but because it's running the function more times.

The KV cache explains context limits. Every token's key and value vectors, at every layer, stay in memory for the entire generation. A 128k context window with 100 layers of 128-dimensional KV pairs is a lot of GPU memory. Context length isn't free — it's a cache size.
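A back-of-envelope version of that arithmetic: every number below is an assumption chosen for illustration, and real deployments shrink the total substantially with techniques like grouped-query attention and quantized caches:

```python
# Back-of-envelope KV-cache memory. Every number here is an ASSUMPTION for
# illustration; real models vary widely in layer count, heads, and precision.
context_len = 128_000   # tokens in the window
n_layers    = 100
n_heads     = 64
head_dim    = 128
bytes_each  = 2         # fp16/bf16
kv          = 2         # one K and one V vector per head, per token, per layer

total_bytes = context_len * n_layers * n_heads * head_dim * bytes_each * kv
print(f"{total_bytes / 2**30:.0f} GiB")  # hundreds of GiB at these settings
```

The takeaway isn't the exact figure but the shape of the product: cache size scales linearly with context length, layer count, and head dimensions, which is why long context windows are priced as a real resource.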

Attention explains context-sensitivity. The model isn't pattern-matching on isolated tokens. Each token dynamically looks back at every prior token to decide what's relevant. "Bank" resolves differently after "river" than after "account" because the attention scores are different.

The two-operation structure explains what models can and can't do. Attention routes information; the MLP transforms it. If a task requires combining information spread across distant positions, it needs enough layers with enough attention heads to route that information together. If a task requires recalling a fact, it needs MLP capacity. Different failure modes trace to different components.

The architecture is simpler than it looks: a next-token function, run in a loop, composed of alternating "mix" and "transform" operations, cached for efficiency. Everything else — the reasoning, the knowledge, the apparent intelligence — emerges from training billions of parameters within this structure.