WTF Is Happening Inside a Transformer
The linear algebra behind attention and MLPs, stripped to the essentials.
If you can internalize "compute delta, add it back," most of the transformer stops being magical.
1. The One Pattern That Repeats
Transformer diagrams look like a maze of boxes. But there is one pattern that repeats through the entire architecture:
Every layer computes an update vector $\Delta$ and adds it back to the input (residual connection). Repeat ~100 times. Done.
That is the entire structure of GPT, LLaMA, and every other decoder-only transformer. Each layer does two sub-steps, both wrapped in this same residual pattern:
- Attention: mix information across tokens. Rows of the matrix talk to each other. The output is a delta: $E' = E + \Delta_{\text{attn}}$.
- MLP: mix information within each token. Features inside a single row get remixed. Another delta: $E'' = E' + \Delta_{\text{mlp}}$.
Stack $L$ of these layers, slap a vocabulary projection on the final hidden state, and you get next-token prediction. That is the whole model.
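The residual pattern above is literally a two-line loop. Here is a minimal sketch (the layer internals are stubbed out; `transformer_stack` and the layer functions are illustrative names, not any library's API):

```python
import numpy as np

def transformer_stack(E, layers):
    """Run the residual update pattern: each sub-layer returns a delta
    that is added back to the running representation."""
    for attn_delta, mlp_delta in layers:
        E = E + attn_delta(E)   # attention: mix information across tokens
        E = E + mlp_delta(E)    # MLP: mix features within each token
    return E

# Toy check: with zero deltas, the input passes through unchanged —
# the residual stream carries it all the way to the output.
E = np.random.randn(4, 8)
identity_layers = [(lambda x: np.zeros_like(x),
                    lambda x: np.zeros_like(x))] * 3
out = transformer_stack(E, identity_layers)
```

The point of the toy check: every layer only *adds* to the stream, so a layer that adds nothing is a no-op.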
2. Notation: Text as a Matrix
Take a sequence like "a cat sat ...". Tokenize it into $n$ tokens. Each token $i$ gets looked up in an embedding table to produce a vector:
$$e_i \in \mathbb{R}^d$$

where $d$ is the model dimension (e.g., 4096 for LLaMA-7B). Stack all $n$ token vectors as rows of a matrix:

$$E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \in \mathbb{R}^{n \times d}$$
The whole sequence is one $n \times d$ matrix. Row $i$ is token $i$'s representation. Columns correspond to features in the embedding space. A transformer layer is a function that takes this matrix and returns a refined version of it: $E \mapsto E_{\text{new}}$.
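To make the notation concrete, here is a toy embedding lookup (the vocabulary and table are made up; real models use a learned table with tens of thousands of rows):

```python
import numpy as np

d = 8                      # toy model dimension (LLaMA-7B uses 4096)
vocab = {"a": 0, "cat": 1, "sat": 2}
embedding_table = np.random.randn(len(vocab), d)

tokens = ["a", "cat", "sat"]
ids = [vocab[t] for t in tokens]
E = embedding_table[ids]   # (n, d): row i is token i's vector
```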
3. The Problem: Per-Token Layers Can't Do Context
Suppose all we did was apply a per-token linear map:
$$e_i \mapsto e_i W$$

Each row gets multiplied by the same matrix $W$ independently. Token "sat" cannot look at token "cat." It is isolated. This is fine for bag-of-words models, but terrible for language, where meaning depends on context ("bank" means different things in "river bank" vs. "bank account").
We need a mechanism where token $i$ can pull information from other tokens $j$. That mechanism is attention.
4. Attention: Context Mixing
What are Q, K, V?
For each token vector $e_i$, we create three derived vectors using learned linear projections (matrix multiplications with learned weights):
$$q_i = e_i W_Q, \qquad k_i = e_i W_K, \qquad v_i = e_i W_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d}$. The same three weight matrices are applied to every token. Each one extracts a different "view" of the same embedding:
- Query ($q_i \in \mathbb{R}^{d_k}$): "what am I looking for?" This token's search request.
- Key ($k_i \in \mathbb{R}^{d_k}$): "what do I contain?" The label this token advertises to others.
- Value ($v_i \in \mathbb{R}^{d}$): "what information should be added to whichever token attends to me?" The actual payload.
Notice the shapes. Q and K are projected down into a smaller $d_k$-dimensional space. This is the matching space: its only job is to produce dot-product scores that decide "who attends to whom." The value, by contrast, stays in the full $d$-dimensional embedding space, because its job is to be added back into the residual stream (which lives in $\mathbb{R}^d$).
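The shape story in code, with toy sizes (all names and dimensions here are illustrative):

```python
import numpy as np

n, d, d_k = 3, 8, 4                  # toy sizes; d_k < d is the matching space
rng = np.random.default_rng(0)
E = rng.standard_normal((n, d))      # the sequence matrix
W_Q = rng.standard_normal((d, d_k))  # projects into the matching space
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d))    # values stay in the full embedding dim

Q, K, V = E @ W_Q, E @ W_K, E @ W_V
```

Q and K only ever meet inside a dot product, so they can live in a smaller space; V must match the residual stream's dimension because it gets added back in.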
Attention scores are dot products
How much should token $i$ attend to token $j$? Compute the dot product between token $i$'s query and token $j$'s key:

$$s_{ij} = \langle q_i, k_j \rangle$$
The dot product has a clean geometric interpretation:
- Large positive: $q_i$ and $k_j$ point in the same direction. "Very relevant."
- Near zero: orthogonal. "Irrelevant."
- Negative: opposing directions. "Actively not what I want."
Then normalize scores across all $j$ with softmax so they form a probability distribution (non-negative, sum to 1):

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}$$
For causal (GPT-style) models, we mask out positions $j > i$ (set them to $-\infty$ before softmax) so each token can only look at the past and itself. This prevents the model from cheating by reading future tokens.
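Masking and softmax together look like this; a minimal sketch (the function name is mine, and the `-inf` trick is the standard way to zero out masked positions):

```python
import numpy as np

def causal_softmax(S):
    """Set entries with j > i to -inf, then softmax each row.
    exp(-inf) = 0, so masked positions get exactly zero weight."""
    n = S.shape[0]
    keep = np.tril(np.ones((n, n), dtype=bool))   # lower triangle incl. diagonal
    S = np.where(keep, S, -np.inf)
    S = S - S.max(axis=-1, keepdims=True)         # subtract row max for stability
    P = np.exp(S)
    return P / P.sum(axis=-1, keepdims=True)

A = causal_softmax(np.random.randn(3, 3))
```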
The core operation: weighted sum of values
With the attention weights in hand, the output for token $i$ is simply a weighted average of all the value vectors it can see:

$$\Delta e_i = \sum_{j \le i} \alpha_{ij} \, v_j$$
Then the residual update:
$$e'_i = e_i + \Delta e_i$$

That is the entire attention mechanism in one sentence: make a weighted mixture of other tokens' value vectors and add it to the current token. The original embedding is always preserved; attention only adds to it.
Walking through a concrete example
Consider three tokens: "a" ($e_1$), "cat" ($e_2$), "sat" ($e_3$), with causal masking.
Token 1 ("a"): can only see itself. Its attention weights are trivially $[\alpha_{11} = 1]$, so $\Delta e_1 = v_1$. It just copies its own value.
Token 2 ("cat"): can see tokens 1 and 2. First, compute two dot products: $\langle q_2, k_1 \rangle$ and $\langle q_2, k_2 \rangle$. Softmax these to get $[\alpha_{21}, \alpha_{22}]$ (two numbers that sum to 1). Then:
$$\Delta e_2 = \alpha_{21} \cdot v_1 + \alpha_{22} \cdot v_2$$

If "cat" strongly attends to "a" (maybe it learned that articles modify nouns), $\alpha_{21}$ will be large, and $\Delta e_2$ will contain mostly $v_1$'s information.
Token 3 ("sat"): sees all three. Three dot products, three softmax weights, three value vectors mixed:
$$\Delta e_3 = \alpha_{31} \cdot v_1 + \alpha_{32} \cdot v_2 + \alpha_{33} \cdot v_3$$

Each later token can attend to more context. This growing triangle of attention is why transformer language models get better at prediction as they see more tokens.
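The same walk-through with made-up numbers (the attention weights and 2-D value vectors below are hand-picked for illustration, not computed from real projections):

```python
import numpy as np

# Hand-picked value vectors for "a", "cat", "sat".
v1, v2, v3 = np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])

# Token 1 sees only itself: weight [1.0].
delta_1 = 1.0 * v1

# Token 2: suppose softmax gave [0.8, 0.2] — "cat" attends mostly to "a".
delta_2 = 0.8 * v1 + 0.2 * v2

# Token 3: sees all three; suppose weights [0.5, 0.3, 0.2].
delta_3 = 0.5 * v1 + 0.3 * v2 + 0.2 * v3
```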
Matrix form (the whole trick in 3 lines)
Stack all queries, keys, and values into matrices:

$$Q = E W_Q, \qquad K = E W_K, \qquad V = E W_V$$

Compute the $n \times n$ attention weight matrix (softmax applied row-wise, with the causal mask $M$ setting entries above the diagonal to $-\infty$):

$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$

Apply to values and add the residual:

$$E' = E + AV$$
Three lines. The $1/\sqrt{d_k}$ scaling prevents dot products from growing large as $d_k$ increases (which would push softmax into saturation with near-zero gradients).
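The three lines transcribe directly into NumPy; a minimal single-head sketch (function name and toy sizes are mine):

```python
import numpy as np

def attention(E, W_Q, W_K, W_V):
    """Single-head causal attention with residual connection."""
    n, d_k = E.shape[0], W_Q.shape[1]
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    S = Q @ K.T / np.sqrt(d_k)                            # (n, n) scaled scores
    keep = np.tril(np.ones((n, n), dtype=bool))           # causal mask
    S = np.where(keep, S, -np.inf)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return E + A @ V                                      # residual update

rng = np.random.default_rng(1)
E = rng.standard_normal((3, 8))
W_Q, W_K = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
W_V = rng.standard_normal((8, 8))
E_prime = attention(E, W_Q, W_K, W_V)
```

One sanity check worth noticing: token 1 can only attend to itself, so its row of $A$ is $[1, 0, 0]$ and its update is exactly its own value vector.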
Two facts worth internalizing
- Each $\Delta e_i$ lies in the span of $\{v_j\}$. Attention does not invent new directions; it recombines existing value vectors. The weights $\alpha_{ij}$ are just the coefficients of this linear combination.
- The mixing matrix $A$ depends on the input. A fixed matrix would apply the same transformation to any text. Attention computes $A$ from the data itself, which is what makes it context-sensitive.
5. MLP: Feature Computation
After attention, each token vector $e'_i$ now contains contextual information from other tokens. The MLP layer processes each token independently, so no cross-token mixing happens here.
Attention mixes across tokens. MLP mixes within a token.
The shape story: expand, activate, compress
The MLP has a characteristic "bottleneck in reverse" shape. It expands the dimension, applies a nonlinearity, then compresses back:

$$\text{MLP}(x) = \phi(x \, W_{\text{up}}) \, W_{\text{down}}$$
Where:
- $W_{\text{up}} \in \mathbb{R}^{d \times d_{ff}}$ expands the dimension. Typically $d_{ff} = 4d$ (e.g., 4096 to 16384).
- $\phi$ is a nonlinearity (GELU, SiLU, etc.) applied element-wise.
- $W_{\text{down}} \in \mathbb{R}^{d_{ff} \times d}$ compresses back to the original dimension.
With the residual connection:
$$e''_i = e'_i + \text{MLP}(e'_i)$$

What the up-projection actually computes
When you multiply $e'_i \in \mathbb{R}^d$ by $W_{\text{up}} \in \mathbb{R}^{d \times d_{ff}}$, each element $j$ of the resulting hidden vector $h_i \in \mathbb{R}^{d_{ff}}$ is a dot product:
$$h_{i,j} = \langle w^{\text{up}}_j, \; e'_i \rangle$$

where $w^{\text{up}}_j$ is the $j$-th column of $W_{\text{up}}$ (or equivalently, the $j$-th row of $W_{\text{up}}^\top$). Each of the $d_{ff}$ hidden units is checking: "how much does this token's embedding align with my learned direction?" The nonlinearity $\phi$ then selectively activates some of these matches and suppresses others.
The down-projection $W_{\text{down}}$ recombines the activated features back into a $d$-dimensional update vector $\Delta e'_i$.
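The whole expand-activate-compress pipeline in NumPy; a minimal sketch using the tanh approximation of GELU (toy sizes, names mine):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(E, W_up, W_down):
    """Expand -> activate -> compress, applied to each row independently."""
    return gelu(E @ W_up) @ W_down

rng = np.random.default_rng(2)
d, d_ff = 8, 32                       # typical ratio: d_ff = 4d
E = rng.standard_normal((3, d))
W_up = rng.standard_normal((d, d_ff))
W_down = rng.standard_normal((d_ff, d))
delta = mlp(E, W_up, W_down)
```

Because everything here is a row-wise operation, row $i$ of the output depends only on row $i$ of the input: the MLP genuinely cannot mix tokens.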
If you remove the nonlinearity, the MLP collapses
Without $\phi$:
$$\text{MLP}(x) = x \, W_{\text{up}} W_{\text{down}}$$

That is just $x \cdot M$ where $M = W_{\text{up}} W_{\text{down}}$ is a single matrix. Two linear layers without a nonlinearity between them collapse into one. The expansion to $d_{ff}$ dimensions buys you nothing. The nonlinearity (and often gating, as in SwiGLU) is what makes the MLP a flexible feature detector rather than a redundant linear transform.
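The collapse is easy to verify numerically (toy sizes, random weights):

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_ff = 8, 32
W_up = rng.standard_normal((d, d_ff))
W_down = rng.standard_normal((d_ff, d))
x = rng.standard_normal((5, d))

M = W_up @ W_down          # precompute the single equivalent (d, d) matrix
collapsed = x @ M          # identical to the two-layer "MLP" without phi
two_step = x @ W_up @ W_down
rank_M = np.linalg.matrix_rank(M)
```

Note also that $\text{rank}(M) \le d$ no matter how large $d_{ff}$ is: without the nonlinearity, the 4x expansion adds no expressive power at all.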
6. One Full Transformer Layer
Putting both pieces together (ignoring LayerNorm):

$$E' = E + \text{Attention}(E)$$
$$E'' = E' + \text{MLP}(E')$$
That is one layer. Stack $L$ layers:
$$E^{(0)} \to E^{(1)} \to \cdots \to E^{(L)}$$

Each layer refines the token representations by adding learned deltas. The residual connections mean information from early layers flows directly to later layers without being forced through every intermediate computation.
7. From Transformer Stack to Next-Token Prediction
After $L$ layers, we have the final representation $E^{(L)} \in \mathbb{R}^{n \times d}$. To predict the next token, take the last row (the representation of the most recent token) and project it into vocabulary space:

$$\text{logits} = e^{(L)}_n W_U \in \mathbb{R}^{|\mathcal{V}|}$$
where $W_U \in \mathbb{R}^{d \times |\mathcal{V}|}$ is the unembedding matrix and $|\mathcal{V}|$ is the vocabulary size (e.g., 32000 for LLaMA). Apply softmax to get a probability distribution over the vocabulary: $P(\text{next token})$.
Geometric interpretation: each column of $W_U$ is a direction in $\mathbb{R}^d$ corresponding to one vocabulary token. The logit for token $w$ is the dot product $\langle e^{(L)}_n, w_U \rangle$: how aligned is the hidden state with that token's direction? The predicted next token is whichever vocabulary direction is most aligned with the final hidden state. The entire transformer stack's job is to produce a hidden state that points toward the correct next token.
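The unembedding step in NumPy; a minimal sketch with a toy vocabulary size (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab_size = 8, 10
W_U = rng.standard_normal((d, vocab_size))   # column w = token w's direction
h = rng.standard_normal(d)                   # final hidden state (last row of E^(L))

logits = h @ W_U                             # dot product with every token direction
probs = np.exp(logits - logits.max())        # stable softmax
probs = probs / probs.sum()
next_token = int(np.argmax(probs))
```

Since softmax is monotone, the argmax of the probabilities equals the argmax of the raw logits: greedy decoding never actually needs the softmax.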
8. Pseudocode
The complete forward pass for one layer, single-head, in readable Python-style pseudocode:
```python
# E: (n, d) — the sequence matrix

# === Attention ===
Q = E @ W_Q             # (n, d_k) — queries
K = E @ W_K             # (n, d_k) — keys
V = E @ W_V             # (n, d)   — values (stays in embedding dim)

S = Q @ K.T             # (n, n) — raw scores
S = S / sqrt(d_k)       # scale to keep gradients stable
S = S + causal_mask     # set j > i entries to -inf
A = softmax(S, dim=-1)  # (n, n) — attention weights (rows sum to 1)

O = A @ V               # (n, d) — weighted sum of values
E_prime = E + O         # residual update

# === MLP ===
H = E_prime @ W_up      # (n, d_ff) — expand
H = phi(H)              # nonlinearity (GELU, SiLU, etc.)
Delta = H @ W_down      # (n, d) — compress back
E_out = E_prime + Delta # residual update
```
That is one transformer layer. For the full model: apply this $L$ times, then project the last token's hidden state through $W_U$ to get vocabulary logits.
Summary
| Component | What it does | Operates on |
|---|---|---|
| Q, K, V projections | Create three views of each token | Per token (same $W$ for all) |
| $QK^\top$ | Compute pairwise relevance scores | All token pairs |
| softmax + mask | Normalize into a mixing distribution | Per query (each row) |
| $AV$ | Weighted sum of value payloads | Across tokens |
| Residual add | $E' = E + \Delta$ | Per token |
| MLP (up → $\phi$ → down) | Per-token feature detection | Per token independently |
| Residual add | $E'' = E' + \Delta'$ | Per token |
| $W_U$ + softmax | Project to vocabulary, predict next token | Last token only (at inference) |
Cite this post:
@article{sedhain2026transformer,
title = {WTF Is Happening Inside a Transformer (Linear Algebra Edition)},
author = {Sedhain, Suvash},
journal = {ssedhain.com},
year = {2026},
month = {Mar},
url = {https://mesuvash.github.io/blog/2026/transformer-linalg/}
}