The linear algebra behind attention and MLPs, stripped to the essentials.
If you can internalize "compute delta, add it back," most of the transformer stops being magical.

1. The One Pattern That Repeats

Transformer diagrams look like a maze of boxes. But there is one pattern that repeats through the entire architecture:

Every layer computes an update vector $\Delta$ and adds it back to the input (residual connection). Repeat ~100 times. Done.

That is the entire structure of GPT, LLaMA, and every other decoder-only transformer. Each layer does two sub-steps, both wrapped in this same residual pattern:

  1. Attention: mix information across tokens. Rows of the matrix talk to each other. The output is a delta: $E' = E + \Delta_{\text{attn}}$.
  2. MLP: mix information within each token. Features inside a single row get remixed. Another delta: $E'' = E' + \Delta_{\text{mlp}}$.

Stack $L$ of these layers, slap a vocabulary projection on the final hidden state, and you get next-token prediction. That is the whole model.

Simplifications in this post: We intentionally drop LayerNorm, dropout, and multi-head splitting, and we keep the nonlinearity optional until we need it. These are important engineering details, but they obscure the core linear algebra story. We will note where they matter.

2. Notation: Text as a Matrix

Take a sequence like "a cat sat ...". Tokenize it into $n$ tokens. Each token $i$ gets looked up in an embedding table to produce a vector:

$$e_i \in \mathbb{R}^d$$

where $d$ is the model dimension (e.g., 4096 for LLaMA-7B). Stack all $n$ token vectors as rows of a matrix:

Embedding matrix:
$$E = \begin{bmatrix} e_1^\top \\ e_2^\top \\ \vdots \\ e_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}$$

The whole sequence is one $n \times d$ matrix. Row $i$ is token $i$'s representation. Columns correspond to features in the embedding space. A transformer layer is a function that takes this matrix and returns a refined version of it: $E \mapsto E_{\text{new}}$.

Intuition: Think of $E$ as a spreadsheet. Each row is a word. Each column is some learned feature (you don't choose what the features mean; training does). The transformer edits this spreadsheet in place, refining the rows one layer at a time.
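A concrete sketch with toy sizes and made-up token ids (real models use $d$ in the thousands and vocabularies in the tens of thousands):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 50, 8                    # toy sizes, not real model dimensions
embedding_table = rng.normal(size=(vocab_size, d))

token_ids = np.array([3, 17, 42])        # hypothetical ids for "a", "cat", "sat"
E = embedding_table[token_ids]           # stack the three embeddings as rows

assert E.shape == (3, 8)                 # one n x d matrix for the whole sequence
assert np.allclose(E[1], embedding_table[17])   # row i is token i's vector
```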

3. The Problem: Per-Token Layers Can't Do Context

Suppose all we did was apply a per-token linear map:

$$e_i \mapsto e_i W$$

Each row gets multiplied by the same matrix $W$ independently. Token "sat" cannot look at token "cat." It is isolated. This is fine for bag-of-words models, but terrible for language, where meaning depends on context ("bank" means different things in "river bank" vs. "bank account").
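A quick NumPy check that a per-token linear map cannot mix context: zeroing out one input row changes only the matching output row.

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(3, 4))              # three toy token vectors
W = rng.normal(size=(4, 4))              # one shared per-token map

out = E @ W                              # same W applied to every row

E2 = E.copy()
E2[2] = 0.0                              # wipe out token 3's input entirely
out2 = E2 @ W

# Tokens 1 and 2 are completely unaffected: no cross-token information flow.
assert np.allclose(out[:2], out2[:2])
```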

We need a mechanism where token $i$ can pull information from other tokens $j$. That mechanism is attention.

4. Attention: Context Mixing

What are Q, K, V?

For each token vector $e_i$, we create three derived vectors using learned linear projections (matrix multiplications with learned weights):

Query, Key, Value projections:
$$q_i = e_i W_Q, \quad k_i = e_i W_K, \quad v_i = e_i W_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d}$. The same three weight matrices are applied to every token. Each one extracts a different "view" of the same embedding: the query is what token $i$ is looking for, the key is what token $i$ offers to be matched against, and the value is what token $i$ contributes if someone attends to it.

Notice the shapes. Q and K are projected down into a smaller $d_k$-dimensional space. This is the matching space: its only job is to produce dot-product scores that decide "who attends to whom." The value, by contrast, stays in the full $d$-dimensional embedding space, because its job is to be added back into the residual stream (which lives in $\mathbb{R}^d$).

What really happens with $W_V$: In the actual implementation, V is first projected down to $d_k$ (just like Q and K), then the attention output is projected back up to $d$ via a separate output matrix $W_O$. Following 3Blue1Brown's excellent visual explanation, we collapse these two steps into a single $W_V \in \mathbb{R}^{d \times d}$ that maps directly from embedding space to embedding space. This is mathematically equivalent and makes the intuition cleaner: $W_V$ answers the question "if someone attends to me, what $d$-dimensional update should they receive?"

Intuition: Think of a library. The query is the question you walk in with. The key is the label on the spine of each book. The value is the content inside the book. You compare your question (query) against every label (key), and then read the content (value) of the books that match best.
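In NumPy, with toy sizes and the collapsed $W_V$ described above, the three projections are just three matrix multiplies:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_k = 3, 8, 4                      # toy sizes

E   = rng.normal(size=(n, d))            # the sequence matrix
W_Q = rng.normal(size=(d, d_k))          # projects into the matching space
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d))            # collapsed value/output matrix

Q, K, V = E @ W_Q, E @ W_K, E @ W_V

# Q and K live in the small matching space; V stays in embedding space.
assert Q.shape == (n, d_k) and K.shape == (n, d_k)
assert V.shape == (n, d)
```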

Attention scores are dot products

How much should token $i$ attend to token $j$? Compute the dot product between token $i$'s query and token $j$'s key:

Attention score:
$$s_{ij} = \langle q_i, k_j \rangle$$

The dot product has a clean geometric interpretation: it is large and positive when $q_i$ and $k_j$ point in similar directions, near zero when they are orthogonal, and negative when they point in opposite directions.

Then normalize scores across all $j$ with softmax so they form a probability distribution (non-negative, sum to 1):

Attention weights:
$$\alpha_{ij} = \text{softmax}_j(s_{ij}) = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}$$

For causal (GPT-style) models, we mask out positions $j > i$ (set them to $-\infty$ before softmax) so each token can only look at the past and itself. This prevents the model from cheating by reading future tokens.
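A minimal sketch of the masking step (random scores, softmax defined inline): adding $-\infty$ above the diagonal before the softmax zeroes out all attention to future positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
S = np.random.default_rng(3).normal(size=(n, n))  # raw attention scores
mask = np.triu(np.full((n, n), -np.inf), k=1)     # -inf wherever j > i
A = softmax(S + mask)

assert np.allclose(A.sum(axis=1), 1.0)            # each row is a distribution
assert np.allclose(A, np.tril(A))                 # no attention to the future
```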

The core operation: weighted sum of values

With the attention weights in hand, the output for token $i$ is simply a weighted average of all the value vectors it can see:

Attention output (per token):
$$\Delta e_i = \sum_{j \le i} \alpha_{ij} \, v_j$$

Then the residual update:

$$e'_i = e_i + \Delta e_i$$

That is the entire attention mechanism in one sentence: make a weighted mixture of other tokens' value vectors and add it to the current token. The original embedding is always preserved; attention only adds to it.

Walking through a concrete example

Consider three tokens: "a" ($e_1$), "cat" ($e_2$), "sat" ($e_3$), with causal masking.

Token 1 ("a"): can only see itself. Its attention weights are trivially $[\alpha_{11} = 1]$, so $\Delta e_1 = v_1$. It just copies its own value.

Token 2 ("cat"): can see tokens 1 and 2. First, compute two dot products: $\langle q_2, k_1 \rangle$ and $\langle q_2, k_2 \rangle$. Softmax these to get $[\alpha_{21}, \alpha_{22}]$ (two numbers that sum to 1). Then:

$$\Delta e_2 = \alpha_{21} \cdot v_1 + \alpha_{22} \cdot v_2$$

If "cat" strongly attends to "a" (maybe it learned that articles modify nouns), $\alpha_{21}$ will be large, and $\Delta e_2$ will contain mostly $v_1$'s information.

Token 3 ("sat"): sees all three. Three dot products, three softmax weights, three value vectors mixed:

$$\Delta e_3 = \alpha_{31} \cdot v_1 + \alpha_{32} \cdot v_2 + \alpha_{33} \cdot v_3$$

Each later token can attend to more context. This growing triangle of attention is why transformer language models get better at prediction as they see more tokens.
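The token-3 computation can be reproduced numerically, with random toy embeddings standing in for "a", "cat", "sat":

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_k = 8, 4
e = rng.normal(size=(3, d))              # toy embeddings for "a", "cat", "sat"
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d))

q3 = e[2] @ W_Q                          # token 3's query
k = e @ W_K                              # all three keys
v = e @ W_V                              # all three values

s = np.array([q3 @ k[0], q3 @ k[1], q3 @ k[2]])   # three dot-product scores
alpha = np.exp(s) / np.exp(s).sum()               # softmax -> three weights
delta_e3 = alpha[0]*v[0] + alpha[1]*v[1] + alpha[2]*v[2]

assert np.isclose(alpha.sum(), 1.0)      # the weights form a distribution
assert delta_e3.shape == (d,)            # the update lives in embedding space
```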

Matrix form (the whole trick in 3 lines)

Stack all queries, keys, and values into matrices:

Matrix projections:
$$Q = EW_Q, \quad K = EW_K, \quad V = EW_V$$

Compute the $n \times n$ attention weight matrix:

Attention weights:
$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times n}$$

Apply to values and add the residual:

Attention output + residual:
$$E' = E + AV$$

Three lines. The $1/\sqrt{d_k}$ scaling prevents dot products from growing large as $d_k$ increases (which would push softmax into saturation with near-zero gradients).

Intuition: $A$ is a data-dependent mixing matrix. Unlike a fixed weight matrix that does the same thing regardless of input, $A$ is computed fresh from the actual token content for every input sequence. It tells each token how much to "read from" every other token. Attention is essentially: "compute your own mixing matrix, then use it to blend the value payloads."
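The three lines translate almost verbatim to NumPy (toy sizes, random weights, softmax defined inline for self-containment):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
n, d, d_k = 4, 8, 4
E = rng.normal(size=(n, d))
Q = E @ rng.normal(size=(d, d_k))
K = E @ rng.normal(size=(d, d_k))
V = E @ rng.normal(size=(d, d))

mask = np.triu(np.full((n, n), -np.inf), k=1)
A = softmax(Q @ K.T / np.sqrt(d_k) + mask)        # n x n mixing matrix
E_prime = E + A @ V                               # residual update

assert E_prime.shape == (n, d)
assert np.allclose(A, np.tril(A))                 # causal: lower-triangular mixing
```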

Two facts worth internalizing

  1. Each $\Delta e_i$ lies in the span of $\{v_j\}$. Attention does not invent new directions; it recombines existing value vectors. The weights $\alpha_{ij}$ are just the coefficients of this linear combination.
  2. The mixing matrix $A$ depends on the input. A fixed matrix would apply the same transformation to any text. Attention computes $A$ from the data itself, which is what makes it context-sensitive.

Note on multi-head attention: Real transformers split $Q$, $K$, $V$ into $h$ "heads" (e.g., $h = 32$), each with dimension $d_k = d/h$. Each head computes its own attention pattern independently, then the $h$ outputs are concatenated and projected back to dimension $d$. The math per head is identical to what we described above. Multi-head just runs multiple attention patterns in parallel so different heads can attend to different things (one head might track syntax, another might track coreference).
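A shape-only sketch of the head split and merge (toy sizes; the per-head attention math itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, h = 4, 16, 4
d_k = d // h                                      # per-head dimension

X = rng.normal(size=(n, d))

# Split the feature dimension into h heads of size d_k each.
heads = X.reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)
assert heads.shape == (h, n, d_k)

# ... each head would run the same attention math independently here ...

# Concatenate the h per-head outputs back into one (n, d) matrix.
merged = heads.transpose(1, 0, 2).reshape(n, d)
assert np.allclose(merged, X)                     # split + merge round-trips
```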

5. MLP: Feature Computation

After attention, each token vector $e'_i$ now contains contextual information from other tokens. The MLP layer processes each token independently, so no cross-token mixing happens here.

Attention mixes across tokens. MLP mixes within a token.

The shape story: expand, activate, compress

The MLP has a characteristic "bottleneck in reverse" shape. It expands the dimension, applies a nonlinearity, then compresses back:

MLP per token:
$$\text{MLP}(x) = \phi(x \, W_{\text{up}}) \; W_{\text{down}}$$

Where:

  - $W_{\text{up}} \in \mathbb{R}^{d \times d_{ff}}$ expands from the model dimension $d$ to a larger hidden dimension $d_{ff}$ (typically around $4d$),
  - $\phi$ is an elementwise nonlinearity (GELU, SiLU, etc.),
  - $W_{\text{down}} \in \mathbb{R}^{d_{ff} \times d}$ compresses back down to $d$.

With the residual connection:

$$e''_i = e'_i + \text{MLP}(e'_i)$$

What the up-projection actually computes

When you multiply $e'_i \in \mathbb{R}^d$ by $W_{\text{up}} \in \mathbb{R}^{d \times d_{ff}}$, each element $j$ of the resulting hidden vector $h_i \in \mathbb{R}^{d_{ff}}$ is a dot product:

$$h_{i,j} = \langle w^{\text{up}}_j, \; e'_i \rangle$$

where $w^{\text{up}}_j$ is the $j$-th column of $W_{\text{up}}$ (or equivalently, the $j$-th row of $W_{\text{up}}^\top$). Each of the $d_{ff}$ hidden units is checking: "how much does this token's embedding align with my learned direction?" The nonlinearity $\phi$ then selectively activates some of these matches and suppresses others.

The down-projection $W_{\text{down}}$ recombines the activated features back into a $d$-dimensional update vector $\Delta e'_i$.

Intuition: The MLP has a menu of $d_{ff}$ possible features it can detect. The up-projection asks "is this feature present?" for each one. The nonlinearity decides "yes" or "no." The down-projection says "given which features fired, here is the update to add to the token." For a token like "bank" in "river bank," the attention step already mixed in context from "river." Now the MLP can fire features like "natural_landform" and suppress "financial_institution."
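A minimal sketch of the expand-activate-compress shape, with ReLU standing in for $\phi$ (toy sizes; real models use $d_{ff} \approx 4d$ and smoother activations):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, d_ff = 3, 8, 32                    # toy sizes

E_prime = rng.normal(size=(n, d))        # context-enriched token vectors
W_up   = rng.normal(size=(d, d_ff))
W_down = rng.normal(size=(d_ff, d))

H = np.maximum(0.0, E_prime @ W_up)      # expand, then gate with ReLU as phi
Delta = H @ W_down                       # compress back to model dimension
E_out = E_prime + Delta                  # residual update

assert H.shape == (n, d_ff)              # hidden layer lives in the wide space
assert E_out.shape == (n, d)             # update lands back in embedding space
```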

If you remove the nonlinearity, the MLP collapses

Without $\phi$:

$$\text{MLP}(x) = x \, W_{\text{up}} W_{\text{down}}$$

That is just $x \cdot M$ where $M = W_{\text{up}} W_{\text{down}}$ is a single matrix. Two linear layers without a nonlinearity between them collapse into one. The expansion to $d_{ff}$ dimensions buys you nothing. The nonlinearity (and often gating, as in SwiGLU) is what makes the MLP a flexible feature detector rather than a redundant linear transform.
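You can check the collapse numerically: with no $\phi$ between them, the two matmuls are exactly one matmul by $M = W_{\text{up}} W_{\text{down}}$.

```python
import numpy as np

rng = np.random.default_rng(8)
d, d_ff = 8, 32
x = rng.normal(size=(3, d))
W_up   = rng.normal(size=(d, d_ff))
W_down = rng.normal(size=(d_ff, d))

two_layers = (x @ W_up) @ W_down         # no nonlinearity in between
one_matrix = x @ (W_up @ W_down)         # a single d x d matrix M

assert np.allclose(two_layers, one_matrix)   # the expansion bought nothing
```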

6. One Full Transformer Layer

Putting both pieces together (ignoring LayerNorm):

Attention update: mix information across tokens. $$E' = E + \text{Attn}(E)$$ Each token reads from other tokens and adds the gathered information to itself.
MLP update: compute features within each token. $$E'' = E' + \text{MLP}(E')$$ Each token independently processes its (now context-enriched) representation.

That is one layer. Stack $L$ layers:

$$E^{(0)} \to E^{(1)} \to \cdots \to E^{(L)}$$

Each layer refines the token representations by adding learned deltas. The residual connections mean information from early layers flows directly to later layers without being forced through every intermediate computation.

Intuition: the two deltas carry different kinds of knowledge. $\text{Attn}(E)$ is contextual information: it is computed on the fly from the tokens seen so far, so it captures what is relevant in this particular sequence. $\text{MLP}(E')$ is world knowledge: it is produced by fixed, learned weight matrices, so it injects facts and patterns absorbed during training into the (now context-aware) representation $E'$. Attention figures out what to pay attention to; the MLP recalls what it knows about it.

Intuition: Think of the residual stream as a shared whiteboard. Each attention layer reads from the whiteboard, computes a suggestion based on context, and writes it back. Each MLP layer reads from the whiteboard, does some per-token thinking, and writes that back too. After $L$ rounds, the whiteboard holds a rich, refined representation of each token.
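A toy sketch of the whiteboard loop. The two delta functions below are illustrative stand-ins (uniform mixing and a tanh squash), not real attention or MLP layers; only the residual pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, L = 3, 8, 4

def attn(E):                             # stand-in for any cross-token delta
    A = np.full((n, n), 1.0 / n)         # uniform mixing, for illustration only
    return 0.1 * (A @ E)

def mlp(E):                              # stand-in for any per-token delta
    return 0.1 * np.tanh(E)

E = rng.normal(size=(n, d))              # E^(0)
for _ in range(L):                       # stack L layers
    E = E + attn(E)                      # attention writes its delta back
    E = E + mlp(E)                       # MLP writes its delta back

assert E.shape == (n, d)                 # the whiteboard never changes shape
```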

7. From Transformer Stack to Next-Token Prediction

After $L$ layers, we have the final representation $E^{(L)} \in \mathbb{R}^{n \times d}$. To predict the next token, take the last row (the representation of the most recent token) and project it into vocabulary space:

Logits:
$$\text{logits} = e^{(L)}_n \, W_U \in \mathbb{R}^{|\mathcal{V}|}$$

where $W_U \in \mathbb{R}^{d \times |\mathcal{V}|}$ is the unembedding matrix and $|\mathcal{V}|$ is the vocabulary size (e.g., 32000 for LLaMA). Apply softmax to get a probability distribution over the vocabulary: $P(\text{next token})$.

Geometric interpretation: each column of $W_U$ is a direction in $\mathbb{R}^d$ corresponding to one vocabulary token. The logit for token $w$ is the dot product $\langle e^{(L)}_n, w_U \rangle$: how aligned is the hidden state with that token's direction? The predicted next token is whichever vocabulary direction is most aligned with the final hidden state. The entire transformer stack's job is to produce a hidden state that points toward the correct next token.
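A toy sketch of the unembedding step (random hidden state and $W_U$; real vocabularies are tens of thousands of tokens):

```python
import numpy as np

rng = np.random.default_rng(10)
d, vocab = 8, 50                         # toy sizes
h_last = rng.normal(size=(d,))           # final hidden state of the last token
W_U = rng.normal(size=(d, vocab))        # one direction (column) per vocab token

logits = h_last @ W_U                    # logit_w = <h_last, column w of W_U>
probs = np.exp(logits - logits.max())    # softmax, stabilized
probs /= probs.sum()

assert logits.shape == (vocab,)
assert np.isclose(probs.sum(), 1.0)
# the predicted token is the most aligned vocabulary direction
assert np.argmax(probs) == np.argmax(logits)
```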

8. Pseudocode

The complete forward pass for one layer, single-head, in readable Python-style pseudocode:

# E: (n, d) — the sequence matrix

# === Attention ===
Q = E @ W_Q          # (n, d_k) — queries
K = E @ W_K          # (n, d_k) — keys
V = E @ W_V          # (n, d)   — values (stays in embedding dim)

S = Q @ K.T          # (n, n)   — raw scores
S = S / sqrt(d_k)    # scale to keep gradients stable
S = S + causal_mask  # set j > i entries to -inf
A = softmax(S, dim=-1)  # (n, n) — attention weights (rows sum to 1)

O = A @ V            # (n, d)   — weighted sum of values
E_prime = E + O      # residual update

# === MLP ===
H = E_prime @ W_up   # (n, d_ff) — expand
H = phi(H)           # nonlinearity (GELU, SiLU, etc.)
Delta = H @ W_down   # (n, d)   — compress back

E_out = E_prime + Delta  # residual update

That is one transformer layer. For the full model: apply this $L$ times, then project the last token's hidden state through $W_U$ to get vocabulary logits.
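The pseudocode leaves softmax, causal_mask, and phi undefined. One way to fill them in, shown here in NumPy with SiLU as the choice of $\phi$ (an assumption for illustration, not what any particular model uses):

```python
import numpy as np

def softmax(S, dim=-1):
    S = S - S.max(axis=dim, keepdims=True)        # subtract max for stability
    e = np.exp(S)
    return e / e.sum(axis=dim, keepdims=True)

def causal_mask(n):
    return np.triu(np.full((n, n), -np.inf), k=1) # -inf wherever j > i

def phi(x):
    return x / (1.0 + np.exp(-x))                 # SiLU, one common choice

n = 3
A = softmax(np.zeros((n, n)) + causal_mask(n))
assert np.allclose(A[0], [1, 0, 0])               # token 1 sees only itself
assert np.allclose(A[2], [1/3, 1/3, 1/3])         # uniform over visible tokens
```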


Summary

| Component | What it does | Operates on |
|---|---|---|
| Q, K, V projections | Create three views of each token | Per token (same $W$ for all) |
| $QK^\top$ | Compute pairwise relevance scores | All token pairs |
| softmax + mask | Normalize into a mixing distribution | Per query (each row) |
| $AV$ | Weighted sum of value payloads | Across tokens |
| Residual add | $E' = E + \Delta$ | Per token |
| MLP (up → $\phi$ → down) | Per-token feature detection | Per token independently |
| Residual add | $E'' = E' + \Delta'$ | Per token |
| $W_U$ + softmax | Project to vocabulary, predict next token | Last token only (at inference) |

Cite this post:

@article{sedhain2026transformer,
  title   = {WTF Is Happening Inside a Transformer (Linear Algebra Edition)},
  author  = {Sedhain, Suvash},
  journal = {ssedhain.com},
  year    = {2026},
  month   = {Mar},
  url     = {https://mesuvash.github.io/blog/2026/transformer-linalg/}
}