DeepSeek's Technical Playbook: From MLA to Conditional Memory
How DeepSeek systematically attacks efficiency at every layer of the stack: attention, routing, context, post-training, and now memory.
1. Why This Matters
DeepSeek has published a series of papers that, taken together, form a coherent efficiency playbook for large language models. Each paper targets a different bottleneck:
- KV cache memory dominates inference cost at long contexts. MLA compresses it by ~98%.
- Dense FFN layers waste compute on irrelevant experts. DeepSeekMoE activates only the experts each token needs.
- Quadratic attention makes long contexts prohibitively expensive. DeepSeek Sparse Attention (DSA) reduces it to near-linear.
- Post-training traditionally requires separate stages for reasoning, safety, and tool use. Their unified RL pipeline handles all three in one pass.
- Static knowledge retrieval forces transformers to waste depth simulating lookup. Engram adds a dedicated O(1) memory primitive.
The pattern is consistent: identify where compute or memory is wasted, then design a sparse or compressed alternative that preserves quality. The following diagram is a composed view showing how all of these innovations would fit inside a single transformer layer. Note that Engram is a separate 2026 paper, not part of the shipped DeepSeek-V3 architecture; the diagram shows the full playbook, not a single model checkpoint.
This post walks through each component in order: attention compression (MLA), sparse routing (MoE), sparse attention (DSA), post-training (GRPO), and conditional memory (Engram).
2. Multi-head Latent Attention (MLA)
Standard multi-head attention stores separate key and value vectors for every token at every layer. With $n_h$ attention heads (e.g., 128) each of dimension $d_h$ (e.g., 128), a 128K context window in BF16 means storing roughly 8-9 GB of KV cache per layer per sequence; across a 60-layer model that is over 500 GB (the exact number depends on the layer count and precision; FP8 quantization halves it). MLA compresses this by projecting keys and values into a shared low-dimensional latent space.
How It Works
Instead of caching full $K$ and $V$ matrices, MLA caches a single compressed vector $c_t^{KV}$ per token:

$$c_t^{KV} = W^{DKV} h_t$$

where $W^{DKV} \in \mathbb{R}^{d_c \times d}$ projects the hidden state $h_t$ (dimension $d$) down to a compressed representation $c_t^{KV}$ (dimension $d_c$, typically $d_c \ll n_h \cdot d_h$). At attention time, keys and values are reconstructed:

$$k_t^C = W^{UK} c_t^{KV}, \qquad v_t = W^{UV} c_t^{KV}$$
$W^{UK}$ and $W^{UV}$ are up-projection matrices that reconstruct the full key and value vectors from the compressed cache. The critical insight: you only cache $c_t^{KV}$, not the full keys and values.
Intuition: Think of MLA as JPEG compression for the KV cache. The full keys and values have redundancy across heads. MLA exploits this by storing a compressed version and reconstructing on the fly. With $d_c + d_R = 576$ cached per token vs. the original $2 \times n_h \times d_h = 32{,}768$ for full KV, you get a ~98% reduction in per-token cache size. (The DeepSeek-V2 paper reports 93.3% KV cache reduction relative to its predecessor DeepSeek 67B, which had a different head configuration.)
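To make the caching arithmetic concrete, here is a minimal NumPy sketch of the down-project/up-project cycle. The weights are random (not a trained model) and the hidden size $d$ is an assumed value; only the head, latent, and RoPE dimensions follow the numbers quoted above.

```python
import numpy as np

# Illustrative dimensions: 128 heads of dim 128, latent dim 512, RoPE dim 64.
# The hidden size d is an assumption; all weights are random placeholders.
d, n_h, d_h, d_c, d_R = 7168, 128, 128, 512, 64
rng = np.random.default_rng(0)

W_DKV = rng.standard_normal((d_c, d)) * 0.02         # down-projection
W_UK = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # up-projection (keys)
W_UV = rng.standard_normal((n_h * d_h, d_c)) * 0.02  # up-projection (values)

h_t = rng.standard_normal(d)   # hidden state for one token

# Only this latent (plus a d_R-dim decoupled RoPE key) is cached per token.
c_t = W_DKV @ h_t              # shape (512,)

# At attention time, per-head keys and values are reconstructed from the cache.
k_t = (W_UK @ c_t).reshape(n_h, d_h)   # 128 distinct per-head keys
v_t = (W_UV @ c_t).reshape(n_h, d_h)

cached = d_c + d_R             # 576 values cached per token
full = 2 * n_h * d_h           # 32,768 for standard MHA
print(f"cache reduction: {1 - cached / full:.1%}")   # -> 98.2%
```

Note that the up-projections add FLOPs at attention time; the trade is cache memory (and memory bandwidth) for a small amount of extra compute.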
Decoupled RoPE
There is a subtlety. Rotary Position Embeddings (RoPE) are applied to keys to encode position information, but RoPE is position-dependent, so it cannot be absorbed into the low-rank compression. MLA handles this by decoupling the positional component:

$$k_t^R = \text{RoPE}\!\left(W^{KR} h_t\right), \qquad k_t = \left[\,k_t^C;\; k_t^R\,\right]$$
The content key $k_t^C$ comes from the compressed cache. The positional key $k_t^R$ is a small separate vector (dimension $d_R$, typically 64) that carries the RoPE encoding. Only $k_t^R$ needs to be cached alongside $c_t^{KV}$, adding minimal overhead.
Comparison with Alternatives
| Method | KV Cache per Token | Quality vs. MHA | Mechanism |
|---|---|---|---|
| MHA (standard) | $2 n_h d_h$ | Baseline | Full cache per head |
| GQA | $2 n_g d_h$ | Slight degradation | Share KV across $n_g$ head groups |
| MQA | $2 d_h$ | Notable degradation | Single KV for all heads |
| MLA | $d_c + d_R$ | Matches or exceeds MHA | Low-rank compression + decoupled RoPE |
MLA's cache footprint sits between MQA's and typical GQA's (576 values per token is roughly what GQA with 2.25 groups would cache) while matching or exceeding the quality of full MHA. The key advantage over GQA/MQA: instead of reducing the number of heads, MLA reduces the dimensionality of the cached representation, preserving expressiveness.
Why not GQA? Grouped Query Attention (used by Llama 3, Gemma 2, and most open models) is simpler to implement and works well in practice. But GQA forces a hard trade-off: fewer KV groups means more compression but less per-head specialization. At DeepSeek's scale (128 heads, 128K context), even 8-group GQA still caches $2 \times 8 \times 128 = 2{,}048$ values per token, while MLA caches $d_c + d_R \approx 576$. MLA also preserves the full expressiveness of all 128 heads during attention computation (since keys and values are reconstructed per-head from the shared latent), whereas GQA forces heads within a group to share identical keys and values. The cost is implementation complexity: MLA requires the decoupled RoPE design and careful kernel optimization for the up-projections.
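The cache columns of the table can be checked with a few lines of arithmetic, using the head configuration discussed above (128 heads of dimension 128; GQA assumed at 8 groups):

```python
# Values cached per token per layer for each attention variant.
n_h, d_h, n_g = 128, 128, 8   # heads, head dim, GQA groups
d_c, d_R = 512, 64            # MLA latent dim, decoupled RoPE dim

sizes = {
    "MHA": 2 * n_h * d_h,  # full key + value per head
    "GQA": 2 * n_g * d_h,  # one key/value pair per head group
    "MQA": 2 * d_h,        # a single shared key/value pair
    "MLA": d_c + d_R,      # compressed latent + decoupled RoPE key
}
for name, n in sizes.items():
    print(f"{name}: {n:>6} values/token ({n / sizes['MHA']:.1%} of MHA)")
```

Multiply any of these by bytes-per-value, layer count, and context length to get the totals quoted earlier.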
MLA handles the attention side of the efficiency equation. But each transformer layer also has a feed-forward network, and in a dense model, every parameter in that FFN fires for every token.
3. DeepSeekMoE
A standard transformer's feed-forward network (FFN) activates all parameters for every token. This is wasteful: a token about Python syntax does not need the same parameters as a token about organic chemistry. Mixture-of-Experts (MoE) replaces the dense FFN with many small expert networks and a router that selects which ones to activate per token.
Architecture
DeepSeekMoE uses two types of experts:
- Shared experts (1 in DeepSeek-V3): Always activated for every token. These handle common, cross-domain patterns.
- Routed experts (up to 256): A gating network selects the top-$k$ (typically $k = 8$) most relevant experts per token.
The FFN output for a token $x$ is:

$$\text{FFN}(x) = \sum_{j=1}^{N_s} \text{FFN}_j^{(s)}(x) + \sum_{j \in \text{TopK}} g_j\, \text{FFN}_j^{(r)}(x)$$
where $N_s$ is the number of shared experts, $g_j$ are the gating weights from a softmax over router logits, and TopK selects the $k$ experts with the highest gate values. Each expert is a small FFN (typically SwiGLU) with its own weights.
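As a sketch, the shared + routed combination for a single token looks like the following. The expert networks are stubbed as random linear maps (real experts are small SwiGLU FFNs), and all sizes are illustrative toy values:

```python
import numpy as np

# Toy sizes: 1 shared expert, 16 routed experts, top-4 routing.
d, n_shared, n_routed, k = 64, 1, 16, 4
rng = np.random.default_rng(0)

shared = [rng.standard_normal((d, d)) * 0.02 for _ in range(n_shared)]
routed = [rng.standard_normal((d, d)) * 0.02 for _ in range(n_routed)]
W_r = rng.standard_normal((n_routed, d)) * 0.02   # router weights

def moe_ffn(x):
    logits = W_r @ x
    top = np.argsort(logits)[-k:]          # TopK routed experts for this token
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                           # softmax gates over selected experts
    out = sum(W @ x for W in shared)       # shared experts: always active
    out += sum(gi * (routed[i] @ x) for gi, i in zip(g, top))
    return out

y = moe_ffn(rng.standard_normal(d))        # only k + n_shared experts ran
```

Of the 17 expert networks, only 5 touch this token; the other 12 contribute parameters (capacity) but no FLOPs.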
Load Balancing Without Auxiliary Loss
A common problem with MoE: some experts get selected far more than others, leading to imbalanced computation. Most MoE systems (GShard, Switch Transformer) add an auxiliary balance loss to the training objective that penalizes uneven expert utilization. This works, but it creates a fundamental tension: the auxiliary loss pushes the router toward uniform distribution, while the language modeling loss pushes it toward selecting the best experts. The two gradients fight each other, and model quality suffers.
DeepSeek's solution is a loss-free load balancing strategy that separates the routing decision from the gradient signal entirely. Here is how it works:
The standard router computes a score for each expert $i$ given token $x$:

$$s_i = W_r^{(i)} \cdot h_x$$

where $W_r^{(i)}$ is the router weight for expert $i$ and $h_x$ is the token's hidden state. Normally, TopK selection and gating weights both come from these scores. DeepSeek adds a bias term $b_i$ to each expert's score, but only for the TopK selection step:

$$\mathcal{E}_x = \text{TopK}\big(\{\,s_i + b_i\,\}\big), \qquad g_j = \frac{e^{s_j}}{\sum_{j' \in \mathcal{E}_x} e^{s_{j'}}} \quad \text{for } j \in \mathcal{E}_x$$
The bias $b_i$ is not a learned parameter. It is adjusted by a simple heuristic after each training step:
- If expert $i$ is handling more tokens than average, decrease $b_i$ by a small step $\gamma$ (making it less likely to be selected).
- If expert $i$ is handling fewer tokens than average, increase $b_i$ by $\gamma$ (making it more likely to be selected).
The critical detail: $b_i$ shifts which experts get selected, but the actual gating weights $g_j$ (which determine how much each expert's output contributes) are computed from the original unbiased scores. This means the balance adjustment never introduces gradient noise into the model's learning signal.
Concrete example. Suppose you have 4 experts with router scores $s = [2.0,\; 1.5,\; 0.3,\; 0.1]$ and TopK = 2. Without any balancing, experts 1 and 2 are selected, and the gating weights are $g_1 = e^{2.0}/(e^{2.0} + e^{1.5}) \approx 0.62$ and $g_2 \approx 0.38$. Now suppose expert 2 is overloaded and acquires a bias of $b_2 = -1.5$, while underloaded expert 3 gets $b_3 = +1.0$. The biased selection scores become $[2.0,\; 0.0,\; 1.3,\; 0.1]$, so TopK now picks experts 1 and 3. But the gating weights are computed from the original scores of the selected pair: $g_1 = e^{2.0}/(e^{2.0} + e^{0.3}) \approx 0.85$ and $g_3 \approx 0.15$. Expert 3 got into the room because of the bias, but its low original score means it contributes only 15% of the output. The model still weights experts by true relevance.
Intuition: Think of it as two separate decisions. First: "which experts should this token visit?" (influenced by the bias). Second: "how much should I weight each expert's answer?" (unbiased, purely based on relevance). The bias is like a bouncer redirecting people to shorter queues at a club, but once you are inside, the DJ plays the same music regardless of how you got in. The model's loss function never sees the bias term, so its gradients are clean.
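The two decisions can be sketched in a few lines; the numbers reproduce the concrete example above, and the bias-update heuristic follows the description (step size $\gamma$ and the load numbers are illustrative):

```python
import numpy as np

def select_and_gate(scores, bias, k=2):
    chosen = np.argsort(scores + bias)[-k:]   # bias shifts *selection* only...
    g = np.exp(scores[chosen])
    return chosen, g / g.sum()                # ...gates use the original scores

scores = np.array([2.0, 1.5, 0.3, 0.1])
bias = np.array([0.0, -1.5, 1.0, 0.0])  # expert 2 overloaded, expert 3 underloaded

chosen, g = select_and_gate(scores, bias)
# chosen -> experts {0, 2} (0-indexed); their gates ~0.85 and ~0.15,
# computed from the unbiased scores, exactly as in the example above.

# After each training step, nudge each expert's bias toward balanced load:
def update_bias(bias, tokens_per_expert, gamma=0.001):
    return bias - gamma * np.sign(tokens_per_expert - tokens_per_expert.mean())
```

Because `update_bias` runs outside the loss, the balance mechanism is invisible to backpropagation.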
Scale
In DeepSeek-V3, this means 671B total parameters but only ~37B activated per token. You get the capacity of a 671B model at roughly the inference cost of a 37B dense model.
MoE makes each layer cheaper by activating fewer parameters. But even with sparse FFNs, attention still scales quadratically with sequence length, and at 128K tokens, it dominates the compute budget.
4. DeepSeek Sparse Attention (DSA)
Standard attention is $O(n^2)$ in sequence length $n$. At 128K tokens, the attention matrix has $\sim$16 billion entries per layer per head. DeepSeek Sparse Attention reduces this to near-linear by selecting only the most relevant tokens for each query to attend to.
The Core Idea
DSA splits the context into two categories for each query: a small set of globally relevant tokens (selected by a cheap scoring pass) and a local window of nearby tokens. Only these selected tokens participate in the full attention computation.
The process has two stages:
- Lightning Indexer (cheap scoring): For each query position $t$, compute a relevance score for every past token. The scoring uses MLA's compressed latent representations $c_j^{KV}$ (which are already cached), so this step adds minimal overhead. Concretely, the score for token $j$ given query $q_t$ is a dot product in the compressed space:
Token Relevance Score $$\text{score}(q_t, j) = q_t^{\text{idx}} \cdot (c_j^{KV})^\top$$
where $q_t^{\text{idx}}$ is a lightweight query projection (a small linear layer applied to the query, separate from the main attention query). This is much cheaper than full attention because it operates in the compressed dimension $d_c$ (e.g., 512) rather than the full head dimension.
- Selective Attention: Select the top-$B$ tokens by score (where $B$ is a fixed budget per query; the V3.2 paper uses $B = 2{,}048$ during training) and combine them with the local sliding window. Only this subset participates in the full attention computation:
Sparse Attention $$\text{Attn}(q_t) = \text{softmax}\!\left(\frac{q_t K_{\mathcal{S}_t}^\top}{\sqrt{d}}\right) V_{\mathcal{S}_t}$$
where $\mathcal{S}_t = \text{TopB}(\text{score}(q_t, \cdot)) \cup \text{Window}(t, w)$ is the union of the top-$B$ globally relevant tokens and the local window of size $w$.
Intuition: Instead of every token attending to every other token (quadratic), DSA first does a cheap scan to find which tokens matter, then computes full attention only on those. At 128K context with $B = 2{,}048$ selected tokens plus a local window, each query attends to a small fraction of the full context. The scoring pass itself is cheap because it operates in MLA's compressed space ($d_c = 512$) rather than the full attention dimension.
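The two stages can be sketched with random data. The latent dimension follows MLA's $d_c = 512$; the toy context length and budget are scaled down from the paper's 128K and 2,048:

```python
import numpy as np

# Toy sizes: the real system uses a 128K context and B = 2,048.
d_c, n_ctx, B, w = 512, 4096, 64, 32
rng = np.random.default_rng(0)

latents = rng.standard_normal((n_ctx, d_c))   # cached MLA latents c_j^KV
q_idx = rng.standard_normal(d_c)              # lightning-indexer query projection

# Stage 1: one cheap dot product per past token, in the compressed dimension.
scores = latents @ q_idx

# Stage 2: full attention runs only over TopB ∪ local window.
top_b = np.argsort(scores)[-B:]               # globally relevant tokens
window = np.arange(n_ctx - w, n_ctx)          # local sliding window
selected = np.union1d(top_b, window)          # S_t

print(f"attending to {len(selected)} of {n_ctx} tokens")
```

The full-dimension softmax attention then runs over `selected` only, so the quadratic term shrinks from $n^2$ to roughly $n \cdot (B + w)$.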
Why It Works with MLA
This is where MLA and DSA reinforce each other. MLA already compresses keys and values into latent vectors $c_t^{KV}$ for the purpose of reducing cache memory. DSA reuses these exact same compressed representations as input to the lightning indexer. The indexer does not need its own separate index or data structure; it simply dot-products the query against the cached latents. Without MLA's compression, the indexer would need to score against full-dimensional keys, making the "cheap scan" much less cheap.
Worked example: token selection at 128K context
Consider a model processing a 128K-token document. At position $t = 100{,}000$, the query asks about "the treaty signed in 1648". With dense attention, this query would compute 100,000 dot products at full dimension.
Step 1: Scoring. The lightning indexer computes $\text{score}(q_t, j)$ for all $j < t$ using the compressed latents. This is 100,000 dot products, but in dimension $d_c = 512$ instead of the full $n_h \cdot d_h = 128 \times 128 = 16{,}384$. Tokens mentioning "treaty", "Westphalia", "1648", and "Peace" score highest.
Step 2: Selection. The top $B = 2{,}048$ tokens by score are selected globally. These might be scattered across the document: some from the section discussing the Treaty of Westphalia (positions ~12,000-12,500), some from a timeline section (positions ~45,000-45,200), and some from a cross-reference (position ~89,000). The local sliding window captures the most recent few thousand positions.
Step 3: Full attention. The query computes full multi-head attention over only the selected + window tokens, a small fraction of the 100K total context, while still attending to the most relevant passages anywhere in the document.
Hardware Considerations
The practical challenge with any sparse attention scheme is turning the token selection into efficient GPU kernels. Naive implementations with scattered memory accesses can negate the theoretical FLOP savings. The DeepSeek-V3.2 paper describes DSA as "hardware-aligned and natively trainable," designed to work with existing optimized attention kernels rather than requiring entirely custom implementations.
Key limitation: DSA requires the model to be trained with sparse attention from the start (or at least fine-tuned with it for an extended period). You cannot simply drop DSA into a pre-trained dense-attention model and expect it to work. The model needs to learn which tokens are worth selecting, and the lightning indexer's query projection must be trained jointly with the rest of the attention mechanism.
MLA, MoE, and DSA together address the core inference bottlenecks: memory, compute per layer, and quadratic attention. But a pre-trained model still needs alignment. The next section covers how DeepSeek handles post-training with a unified RL approach.
5. Scalable Reinforcement Learning
After pre-training, DeepSeek uses reinforcement learning to improve reasoning, tool use, and safety. The core algorithm is Group Relative Policy Optimization (GRPO), which avoids the need for a separate value model (unlike PPO).
GRPO
For a given prompt $q$, GRPO samples a group of $G$ responses $\{o_1, \ldots, o_G\}$ from the current policy. Each response is scored by a reward function $r(o_i)$. The advantage is computed relative to the group:

$$A_i = \frac{r(o_i) - \text{mean}\big(\{r(o_1), \ldots, r(o_G)\}\big)}{\text{std}\big(\{r(o_1), \ldots, r(o_G)\}\big)}$$
The policy gradient loss clips the ratio (like PPO) and includes a KL penalty against the reference policy $\pi_{\text{ref}}$:

$$\mathcal{L} = -\frac{1}{G} \sum_{i=1}^{G} \left[ \min\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\text{old}}(o_i|q)}\, A_i,\; \text{clip}\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\text{old}}(o_i|q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \right]$$
where $\pi_\theta$ is the current policy being trained, $\pi_{\text{old}}$ is the policy that generated the samples (before the current update step), $\pi_\theta / \pi_{\text{old}}$ is the importance sampling ratio, $\epsilon$ is the clipping range (typically 0.2), and $\beta$ controls the KL penalty strength against the frozen reference policy $\pi_{\text{ref}}$.
Intuition: GRPO skips the value model entirely. Instead of estimating "how good is this state?" with a learned critic, it just compares each response to its siblings in the same group. If your response scored 0.8 and the group average was 0.5 with std 0.2, your advantage is $(0.8 - 0.5)/0.2 = 1.5$. Simple, stable, and cheap.
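The group-relative advantage from the intuition above takes three lines (the rewards here are made up):

```python
import numpy as np

rewards = np.array([0.8, 0.5, 0.3, 0.4])   # r(o_i) for a group of G = 4 samples
adv = (rewards - rewards.mean()) / rewards.std()
# The best response (0.8 vs mean 0.5, std ~0.19) gets an advantage of ~1.6;
# no learned critic is involved, only the siblings in the same group.
```

By construction the advantages are zero-mean within each group, so a group where every response fails (or every response succeeds) contributes no gradient pressure in any direction.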
Stability Fixes
Scaling GRPO to large models and diverse reward signals revealed several instabilities. DeepSeek-V3.2 introduced four fixes:
| Problem | Fix | Mechanism |
|---|---|---|
| KL divergence biased upward | Unbiased KL estimator | Replace the naive estimator with $\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$, which is unbiased and always nonnegative |
| Stale off-policy data | Off-policy sequence masking | For negative-advantage sequences, mask the entire sequence if the average log importance ratio $\frac{1}{T}\sum_t \log\frac{\pi_\theta}{\pi_{\text{old}}}$ exceeds a threshold $\delta$ |
| Expert routing drift during RL | Keep routing | Record the MoE expert routing decisions made during sampling and replay the same routing during the training forward pass, ensuring consistent expert assignments |
| Action-space mismatch between sampling and training | Keep sampling mask | Preserve the top-p/top-k token truncation masks from sampling and apply them during the training forward pass, so both policies operate over the same action subspace |
Unified Training Pipeline
Rather than running separate RL stages for reasoning, tool use, and safety (which risks catastrophic forgetting between stages), DeepSeek runs a three-phase unified pipeline:
- Specialist Training: Train domain-specific expert policies using GRPO with verifiable rewards. Each specialist focuses on a single domain:
- Math specialist: trained on competition-level problems; reward = 1 if final answer matches ground truth, 0 otherwise.
- Code specialist: trained on programming challenges; reward = fraction of test cases passed (0.0 to 1.0).
- Agent specialist: trained on multi-step tool-use tasks; reward = binary task completion signal from a programmatic verifier.
- Distillation: Merge the specialist policies into a single generalist model. For each domain's prompt distribution, the generalist is trained to minimize KL divergence against the corresponding specialist's output distribution: $\mathcal{L}_{\text{distill}} = \sum_{d \in \text{domains}} \mathbb{E}_{x \sim \mathcal{D}_d}\left[D_{\text{KL}}(\pi_{\text{specialist}}^d(\cdot|x) \| \pi_{\text{generalist}}(\cdot|x))\right]$. This is cheaper than training from scratch because the specialists have already found good policies; distillation just needs to merge them. The result is a single model that is competent (but not yet optimal) across all domains.
- Unified RL: Run GRPO on the distilled model with a mixed-domain reward signal. Each training batch contains prompts from all domains simultaneously. The router selects the appropriate reward function based on the prompt's domain tag: correctness for math, test-case pass rate for code, task completion for agents, and a safety classifier score for all prompts. The mixed batches are critical: they prevent the model from overfitting to any single domain's reward signal. The KL penalty against the distilled checkpoint acts as an anchor, preventing the model from forgetting capabilities while improving on each domain.
Agentic Task Synthesis
The agent specialist in phase 1 needs training data, but real multi-step agentic tasks are scarce and expensive to annotate. DeepSeek solves this with a synthesis pipeline that generates over 85,000 tasks spanning web browsing, file operations, API calls, and multi-tool composition.
The synthesis process works in three steps: (1) an LLM generates a task description and the ground-truth execution plan, (2) a programmatic verifier is auto-generated that checks whether an agent's execution trace achieves the intended outcome (not just the final answer, but intermediate state transitions), and (3) the task is validated by running it against a baseline agent to ensure it is solvable but non-trivial. Tasks that are either trivially solved (>90% baseline pass rate) or unsolvable (<5% pass rate) are filtered out.
All five innovations so far target efficiency during pre-training or post-training. The final innovation, Engram, addresses a different question: can we give the transformer a fundamentally new primitive for tasks it currently handles inefficiently?
6. Engram: Conditional Memory
This is the most recent innovation, introduced in "Conditional Memory via Scalable Lookup" (Cheng et al., Jan 2026). The core insight: transformers waste depth on static pattern retrieval (named entities, idiomatic phrases, factual associations) that could be handled by direct lookup.
The Problem
Consider what happens when a model encounters "Alexander the". A standard transformer must propagate this through multiple attention and FFN layers to "compute" that the next likely token is "Great". But this is not reasoning; it is memorized factual recall. The model is using expensive conditional computation to simulate what should be a cheap table lookup.
Engram adds a dedicated conditional memory module that retrieves static embeddings via hash-based lookup in O(1) time, freeing the transformer's depth for genuine reasoning.
Architecture
Engram operates in two phases. Here is what happens concretely when the model processes the sequence "... the great ruler Alexander the ___":
Phase 1: Retrieval
For each token position $t$, Engram extracts the preceding N-gram contexts (bigrams and trigrams). A tokenizer compression function first maps tokens to canonical IDs via NFKC normalization and lowercasing, reducing the effective vocabulary by 23%.
Each N-gram is hashed using $K = 8$ independent hash heads per N-gram order, and the resulting indices look up rows from large embedding tables:

$$e_t = \big\Vert_{n=2}^{N}\; \big\Vert_{k=1}^{K}\; E_{n,k}\!\left[\phi_{n,k}(g_{t,n})\right]$$
where $g_{t,n}$ is the N-gram context of order $n$ ending at position $t$ (e.g., $g_{t,2}$ is the preceding bigram), $\phi_{n,k}$ is a deterministic hash function for N-gram order $n$ and head $k$, $E_{n,k}$ is the embedding table, $N$ is the maximum N-gram order (typically 3, covering bigrams and trigrams), and $\|$ denotes concatenation. The hash uses lightweight multiplicative-XOR operations: each token ID in the N-gram is multiplied by a large prime, the results are XOR'd together, and the final value is taken modulo the table size $M$ (itself a prime). For example, for head $k$ with primes $p_k$: $\phi_{2,k}(a, b) = (a \cdot p_k \oplus b \cdot p_k') \bmod M$. Different heads use different primes, so the same N-gram maps to different rows in each head's table.
Worked example: retrieving the embedding for "Alexander the"
Step 1: Extract N-grams. The preceding tokens are compressed to canonical IDs (lowercased, normalized): "alexander" = ID 4821, "the" = ID 12. The bigram context is (4821, 12) and the trigram context includes the token before that, say (917, 4821, 12).
Step 2: Hash each N-gram, multiple times. Each N-gram is hashed with $K = 8$ independent hash functions. For the bigram (4821, 12), hash head 1 might compute $\phi_{2,1}(4821, 12) = 738{,}201$ and hash head 2 computes $\phi_{2,2}(4821, 12) = 1{,}402{,}557$, and so on for all 8 heads. The trigram is hashed the same way with its own 8 hash functions. That is 16 hash lookups total (8 for bigrams + 8 for trigrams).
Step 3: Retrieve embeddings. Each hash index pulls one row from its embedding table. If the total memory dimension is $d_{\text{mem}} = 1{,}280$ and each row has dimension $d_{\text{mem}} / (N \cdot K)$, say 80 dimensions, then the 16 retrieved vectors are concatenated into a single vector $e_t$ of dimension $16 \times 80 = 1{,}280$. This is the complete retrieved embedding for this position.
The whole process is deterministic: the same N-gram always retrieves the same embeddings. During training, backpropagation updates only the 16 retrieved rows (out of millions), making gradient updates sparse and efficient. The multiple hash heads serve a purpose similar to multiple attention heads: they give the model several independent "views" of each N-gram pattern, reducing the impact of hash collisions.
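Here is a sketch of the retrieval phase for the bigram above. The primes, table size, and head count ($K = 2$ instead of 8) are placeholders, not the paper's constants; only the multiplicative-XOR structure follows the description:

```python
import numpy as np

M = 10_007   # table size (prime); tiny here, millions of rows in practice
PRIMES = [(2_654_435_761, 2_246_822_519),
          (3_266_489_917, 668_265_263)]    # one prime pair per hash head

def hash_bigram(a, b, head):
    p1, p2 = PRIMES[head]
    return (a * p1 ^ b * p2) % M   # multiplicative-XOR, then mod a prime

K, d_row = len(PRIMES), 80         # 2 heads in this sketch; the paper uses K = 8
rng = np.random.default_rng(0)
tables = rng.standard_normal((K, M, d_row)).astype(np.float32) * 0.02

bigram = (4821, 12)                # canonical IDs for ("alexander", "the")
rows = [hash_bigram(*bigram, k) for k in range(K)]
e = np.concatenate([tables[k, r] for k, r in enumerate(rows)])

# Deterministic: the same bigram always lands on the same rows, so a gradient
# update touches only these K rows out of the whole table.
assert rows == [hash_bigram(*bigram, k) for k in range(K)]
```

No learned parameters are involved in addressing; all learning happens in the table rows themselves, which is what makes the lookup O(1).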
Phase 2: Fusion
The retrieved embedding $e_t$ is a context-independent prior: it only depends on the preceding N-gram, not on the broader sentence. The phrase "Alexander the" always retrieves the same embedding, whether it appears in "Alexander the Great conquered Persia" or "Alexander the plumber fixed my sink". The fusion phase makes this prior context-dependent.
The current hidden state $h_t$ (which has already been through attention and carries full-sentence context) acts as a query. The retrieved embedding is projected by learned matrices $W_k$ (key projection) and $W_v$ (value projection):

$$\alpha_t = \sigma\!\left( \frac{h_t^\top W_k\, e_t}{\sqrt{d}} \right)$$
Here $\sigma$ is the sigmoid function. This is a scalar gate $\alpha_t \in [0, 1]$: it computes a normalized dot product between the hidden state and the projected memory key, then squashes it through the sigmoid. The gated output is $\tilde{v}_t = \alpha_t \cdot W_v e_t$, where $W_v$ projects the retrieved embedding into the value used for the residual stream.
Worked example: same N-gram, two different contexts
Consider "Alexander the" appearing in two sentences:
- "... the great ruler Alexander the ___": The hidden state $h_t$ encodes "great ruler" context from earlier attention. This context is highly aligned with the N-gram embedding for "Alexander the" (which has learned associations with "Great" during training). The dot product is large, so $\alpha_t \approx 0.9$. The memory contributes strongly, pushing the model toward predicting "Great".
- "... call Alexander the plumber and ask ___": The hidden state encodes "call ... plumber and ask" context. This context is poorly aligned with the "Alexander the" embedding's learned associations. The dot product is small, so $\alpha_t \approx 0.1$. The memory is mostly suppressed, and the transformer proceeds with its own reasoning about what you would ask a plumber.
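The gate can be sketched with hand-built vectors that mimic the two contexts. The projections are random and the alignment is engineered for illustration; only the normalized-dot-product-plus-sigmoid structure follows the formula above:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d, d)) * 0.1   # key projection
W_v = rng.standard_normal((d, d)) * 0.1   # value projection

def fuse(h_t, e_t):
    score = h_t @ (W_k @ e_t) / np.sqrt(d)    # normalized dot product
    alpha = 1.0 / (1.0 + np.exp(-score))      # sigmoid -> scalar gate in [0, 1]
    return alpha, alpha * (W_v @ e_t)         # gated value for the residual stream

e = rng.standard_normal(d)        # retrieved embedding for "Alexander the"
h_ruler = 5.0 * (W_k @ e)         # hidden state aligned with the memory key
h_plumber = -5.0 * (W_k @ e)      # hidden state that contradicts it

a_ruler, _ = fuse(h_ruler, e)     # gate near 1: memory contributes strongly
a_plumber, _ = fuse(h_plumber, e) # gate near 0: memory mostly suppressed
```

The same `e` produces opposite gates purely because the query differs, which is the context-dependence the fusion phase adds on top of the static lookup.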
The gated values then pass through a depthwise causal convolution (kernel size 4, dilation equal to the max N-gram order) with SiLU activation and a residual connection:

$$h_t' = h_t + \text{SiLU}\!\left(\text{ConvCausal}(\tilde{v})_t\right)$$
The convolution operates over adjacent token positions with a receptive field of 4 tokens. Each output position is a weighted mix of itself and its neighbors' gated values. This serves two purposes: reinforcing consistent signals and suppressing noise.
Why this matters. The gate at each position makes an independent decision based on its own N-gram and hidden state. But language has multi-token patterns where individual positions should cooperate. The convolution lets them.
Consider the sequence "the Milky Way galaxy". At position "Milky", the bigram ("the", "Milky") retrieves an embedding with a strong gate (say 0.85). At "Way", the bigram ("Milky", "Way") also has a strong gate (0.88). Without convolution, each position's memory contribution is independent. With convolution, the kernel sees two adjacent high-gate values and amplifies both, because a 1D convolution computes a weighted sum over the window: $\text{out}_t = w_1 \cdot \tilde{v}_{t-3} + w_2 \cdot \tilde{v}_{t-2} + w_3 \cdot \tilde{v}_{t-1} + w_4 \cdot \tilde{v}_t$. When multiple neighbors carry strong memory signals, they reinforce each other in the output.
Now consider a hash collision: the bigram ("set", "the") at some position accidentally maps to the same table row as a common idiom, producing a spuriously high gate of 0.7. But the positions before and after it have low gates (0.1, 0.05), because their N-grams did not match anything meaningful. The convolution window sees one high value surrounded by near-zero values, and the weighted sum dilutes the outlier. The spurious signal is dampened rather than passed through at full strength.
Intuition: The gate at each position asks: "does my context agree with this N-gram's memory?" The convolution then asks: "do my neighbors agree too?" If multiple adjacent positions all have high gates (a named entity, an idiom), the convolution reinforces the signal. If only one position lights up while its neighbors are dark (likely a hash collision or noise), the convolution smooths it out. It is a consensus mechanism across positions.
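The consensus effect can be seen with a fixed causal kernel. The real kernel is learned (and followed by SiLU); the weights here are illustrative, and the gate values reproduce the two examples above:

```python
import numpy as np

kernel = np.array([0.1, 0.2, 0.3, 0.4])   # causal window over 4 positions

def causal_conv(gated):
    padded = np.concatenate([np.zeros(3), gated])   # left-pad: causal
    return np.array([padded[t:t + 4] @ kernel for t in range(len(gated))])

entity = np.array([0.0, 0.85, 0.88, 0.0])  # "Milky Way": adjacent high gates
spike = np.array([0.0, 0.70, 0.05, 0.0])   # isolated collision, dark neighbors

# At "Way" the neighbor's 0.85 reinforces the 0.88 (0.3*0.85 + 0.4*0.88 ≈ 0.61),
# while the lone 0.70 spike is diluted to 0.4*0.70 = 0.28.
print(causal_conv(entity)[2], causal_conv(spike)[1])
```

Relative to its own kernel tap ($w_4 = 0.4$), the entity position keeps far more of its signal than the collision does: agreement among neighbors is rewarded, isolation is penalized.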
The U-Shaped Scaling Law
Given a fixed parameter budget, how should you split inactive parameters between MoE experts and Engram memory? DeepSeek defines an allocation ratio $\rho \in [0, 1]$ as the fraction going to MoE:

$$\rho = \frac{P_{\text{MoE}}}{P_{\text{MoE}} + P_{\text{Engram}}}$$

where $P_{\text{MoE}}$ and $P_{\text{Engram}}$ are the inactive parameter counts allocated to routed experts and to memory tables, respectively.
Experiments at two compute scales ($2 \times 10^{20}$ and $6 \times 10^{20}$ FLOPs) reveal a U-shaped curve: validation loss is minimized at $\rho \approx 0.75\text{-}0.80$, meaning roughly 75-80% of sparse capacity should go to MoE and 20-25% to Engram.
| Allocation ($\rho$) | Val Loss ($C = 2 \times 10^{20}$) | Interpretation |
|---|---|---|
| 1.0 (pure MoE) | 1.7248 | Wastes depth simulating lookup |
| 0.75-0.80 (optimal) | 1.7109 | Best trade-off |
| 0.0 (pure Engram) | Higher | Lacks conditional computation |
Intuition: Go too far toward MoE ($\rho \to 1$) and the model wastes expert capacity memorizing static facts. Go too far toward Engram ($\rho \to 0$) and the model lacks the dynamic computation needed for reasoning. The sweet spot dedicates most sparse capacity to experts (for reasoning) with a meaningful slice for memory (for facts).
Engram-27B Results
The headline model, Engram-27B, has 26.7B total parameters with 3.8B active per token and 5.7B in Engram memory. Compared to a MoE-27B baseline (same total params, same FLOPs), it uses 55 routed experts instead of 72 (trading 17 experts for memory):
| Benchmark | MoE-27B | Engram-27B | Gain |
|---|---|---|---|
| MMLU | 57.4% | 60.4% | +3.0 |
| BBH | 50.9% | 55.9% | +5.0 |
| ARC-Challenge | 70.1% | 73.8% | +3.7 |
| HumanEval | 37.8% | 40.8% | +3.0 |
| MATH | 28.3% | 30.7% | +2.4 |
| GSM8K | 58.4% | 60.6% | +2.2 |
| RULER (long-ctx) | 84.2% | 97.0% | +12.8 |
The RULER result is remarkable: +12.8 points on multi-query needle-in-a-haystack. Engram's hash-based lookup gives the model near-perfect retrieval for factual patterns in long contexts.
A New Axis of Sparsity
The key conceptual contribution: language modeling has two qualitatively different sub-tasks, and they benefit from different types of sparsity:
| Sub-task | Sparsity Type | Mechanism | Example |
|---|---|---|---|
| Compositional reasoning | Conditional computation (MoE) | Route to relevant experts | Multi-step math derivation |
| Knowledge retrieval | Conditional memory (Engram) | Hash-based embedding lookup | "capital of France" => Paris |
Standard transformers force both tasks through the same computational pathway. Engram gives the model a native "memory fetch" primitive, analogous to how CPUs separate cache/memory access from ALU computation.
7. Engram in Practice
The architecture above describes what Engram does. This section covers how to deploy it: where to place it in the transformer stack, how to manage the large embedding tables, and the scaling properties.
System Design
Engram's embedding tables can grow to billions of parameters (the Engram-40B variant has 18.5B in memory alone). The system handles this through:
- Training: Tables are sharded across GPUs using all-to-all communication for active embedding retrieval.
- Inference: Hash indices are deterministic, enabling prefetching. A multi-level cache hierarchy exploits the Zipfian distribution of N-grams: frequent patterns in GPU HBM, rare patterns in host DRAM or NVMe.
- Overhead: Offloading a 100B-parameter table incurs only 2.8% throughput penalty (6,316 tok/s baseline vs 6,140 tok/s with offloading).
Layer Placement
Where you insert the Engram module within the transformer stack matters significantly. The paper sweeps single-module insertion across layers 1-12 of a 12-layer backbone and finds a clear pattern:
| Configuration | Val Loss | Notes |
|---|---|---|
| MoE baseline (no Engram) | 1.808 | All params in experts |
| Layer 1 (earliest) | 1.776 | Hidden state lacks context for gating |
| Layer 2 (best single) | 1.770 | One attention round provides enough context |
| Layer 6 (middle) | 1.778 | Good gating but backbone already spent depth |
| Layer 12 (latest) | 1.785 | Too late: backbone already reconstructed patterns |
| Layers 2 + 6 (split) | 1.768 | Best overall: early offload + mid-layer refinement |
The core trade-off: early insertion lets Engram offload static patterns before the backbone wastes depth reconstructing them, but the hidden state at layer 1 has not yet been through any attention and lacks the context needed for accurate gating. Layer 2 is the sweet spot: one round of attention provides a meaningfully contextualized hidden state while still being early enough to save depth.
Splitting the memory budget across two layers (2 and 6 in the ablation; 2 and 15 in the full Engram-27B model) outperforms any single-layer configuration. The early module handles high-confidence local patterns (named entities, common collocations). The later module handles patterns that require more accumulated context to gate correctly (e.g., domain-specific phrases where the gate needs to see several preceding tokens to decide relevance). This split also has a practical system benefit: two smaller tables are easier to distribute across the memory hierarchy than one large one.
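The gating decision discussed above can be sketched as a scalar gate computed from the hidden state. The sigmoid parameterization, shapes, and additive injection below are illustrative assumptions; the paper's gate involves RMSNorm (see the references), but its exact form is not reproduced here:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Root Mean Square LayerNorm (Zhang & Sennrich, 2019); learned scale omitted."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def gated_memory_inject(hidden, retrieved, w_gate):
    """Illustrative gated injection: a scalar gate in (0, 1), computed from
    the normalized hidden state, decides how much of the retrieved memory
    embedding to add back into the residual stream."""
    g = 1.0 / (1.0 + np.exp(-np.dot(rmsnorm(hidden), w_gate)))
    return hidden + g * retrieved, g

rng = np.random.default_rng(0)
d = 16
hidden = rng.normal(size=d)       # hidden state after some attention
retrieved = rng.normal(size=d)    # embedding fetched from the memory table
w_gate = rng.normal(size=d)       # gate projection (hypothetical parameter)

out, gate = gated_memory_inject(hidden, retrieved, w_gate)
# At layer 1 the hidden state is nearly context-free, so a gate like this
# has little signal to condition on -- the intuition behind inserting the
# first module at layer 2, after one round of attention.
```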
Warning: Late-only insertion (layer 12) recovers little of Engram's benefit, and in some configurations can be worse than no Engram at all. By the time the backbone reaches layer 12, it has already spent its depth budget reconstructing static patterns through computation, so a lookup primitive added at that point has little work left to offload.
8. Practical Notes
Which innovations matter most for your use case?
| Bottleneck | Innovation | When to care |
|---|---|---|
| Inference memory | MLA | Serving long contexts (>8K tokens) at scale |
| Inference/training compute | MoE | Want large model capacity without proportional cost |
| Long-context quality | DSA | Processing 64K+ token inputs |
| Post-training efficiency | GRPO | RL alignment without the cost of training a value model |
| Factual accuracy | Engram | Knowledge-heavy tasks, long-context retrieval |
The Efficiency Compounds
These innovations are not independent. MLA's compressed representations enable DSA's lightweight token scoring. MoE's sparse routing pairs with Engram's sparse memory. GRPO's simplicity makes it feasible to run unified RL across many reward domains. Each piece makes the others more effective.
What Engram Changes About Scaling
Engram introduces a genuinely new scaling axis. Previous scaling laws focused on compute (FLOPs) and model parameters (active and total). Engram demonstrates that memory capacity (lookup table size) scales independently: adding memory from 5.7B to 18.5B parameters improves performance with zero additional compute per token. This follows a power-law relationship with no sign of saturation at current scales.
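A power law in memory capacity would take the generic form below; the coefficients are unnamed placeholders, not fitted values from the paper:

$$\mathcal{L}(M) \approx L_\infty + a \cdot M^{-b}, \qquad a, b > 0$$

where $M$ is the number of memory (lookup table) parameters, $b$ is the scaling exponent, and $L_\infty$ is the irreducible loss. "No sign of saturation" means the observed range of $M$ still sits on the $M^{-b}$ decay, well away from the $L_\infty$ floor.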
Warning: Engram's gains are largest on knowledge-intensive and long-context tasks (+5.0 on BBH, +12.8 on RULER). On tasks that are primarily reasoning-bound (like MATH or GSM8K), the gains are more modest (+2-3 points). If your workload is pure chain-of-thought reasoning, MoE and GRPO matter more than Engram.
Pitfalls
- MLA: The up-projection matrices $W^{UK}$ and $W^{UV}$ add latency to each attention computation. This is usually hidden by the memory savings, but on very short sequences where KV cache is not a bottleneck, MLA can be slightly slower than standard MHA.
- MoE: Expert parallelism requires all-to-all communication across devices. On poorly connected clusters (low bisection bandwidth), MoE can be slower than dense models of the same active size.
- GRPO: The group size $G$ matters. Too small ($G < 8$) and the advantage estimates are noisy. Too large ($G > 64$) and you waste compute generating low-information samples. DeepSeek uses $G = 16\text{-}64$ depending on the task.
- Engram: Layer placement is critical (see Layer Placement above). Late-only insertion can be worse than no Engram at all. Always place the first module at layer 2, with an optional second at a middle layer.
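The group-size trade-off in the GRPO pitfall follows from how GRPO estimates advantages: each rollout's reward is normalized against the mean and standard deviation of its own group of $G$ samples (Shao et al., 2024), with no learned value model. A minimal sketch:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (Shao et al., 2024): normalize each
    rollout's reward by its group's mean and standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 8 rollouts for one prompt, scored 0/1 by a verifier (toy values).
rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
# Correct rollouts get positive advantage, incorrect ones negative.
# With small G, a lucky or unlucky group shifts mu and makes the
# estimates noisy; with very large G, most rollouts add little signal.
```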
References
- DeepSeek-AI (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. Introduces fine-grained expert segmentation and shared expert isolation.
- DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. Introduces MLA and loss-free load balancing for MoE.
- DeepSeek-AI (2024). DeepSeek-V3 Technical Report. Scales MLA + MoE to 671B params, introduces the multi-token prediction training objective.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. GRPO for reasoning, distillation pipeline.
- DeepSeek-AI (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. DeepSeek Sparse Attention (DSA) with lightning indexer.
- DeepSeek-AI (2025). DeepSeek-V3.2 Technical Report. Unified RL pipeline, agentic task synthesis, GRPO stability fixes (keep routing, off-policy masking, unbiased KL).
- Cheng et al. (2026). Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. Engram architecture, U-shaped scaling law, layer placement analysis.
- Shazeer (2020). GLU Variants Improve Transformer. SwiGLU activation used in DeepSeek's expert FFNs.
- Zhang & Sennrich (2019). Root Mean Square Layer Normalization. RMSNorm used throughout DeepSeek and in Engram's gating mechanism.
- Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Introduces GRPO (Group Relative Policy Optimization).
Cite this post:
@article{sedhain2026deepseek,
  title   = {DeepSeek's Technical Playbook: From MLA to Conditional Memory},
  author  = {Sedhain, Suvash},
  journal = {ssedhain.com},
  year    = {2026},
  month   = {Mar},
  url     = {https://mesuvash.github.io/blog/2026/deepseek-v3/}
}