DeepSeek's Technical Playbook: From MLA to Conditional Memory

How DeepSeek systematically attacks efficiency at every layer of the stack: attention, routing, context, post-training, and now memory.

1. Why This Matters

DeepSeek has published a series of papers that, taken together, form a coherent efficiency playbook for large language models. Each paper targets a different bottleneck:

  • MLA: KV-cache memory in attention.
  • DeepSeekMoE: per-token compute in the feed-forward layers.
  • DSA: quadratic attention cost at long context.
  • GRPO and the unified pipeline: post-training cost and stability.
  • Engram: transformer depth wasted on static factual recall.

The pattern is consistent: identify where compute or memory is wasted, then design a sparse or compressed alternative that preserves quality. The following diagram is a composed view showing how all of these innovations would fit inside a single transformer layer. Note that Engram is a separate 2026 paper, not part of the shipped DeepSeek-V3 architecture; the diagram shows the full playbook, not a single model checkpoint.

[Diagram: Composed view of DeepSeek innovations in one transformer layer. Input token embeddings flow through Multi-head Latent Attention (MLA; compressed KV cache ~98% smaller, with the DSA variant adding sparse token selection for long context), the Engram memory module (hash-based O(1) lookup, context-gated), and the DeepSeekMoE sparse FFN (1 shared + top-8 of 256 routed experts), then a residual connection to the layer output. After pre-training, unified RL (GRPO) runs specialist training → distillation → a unified stage covering reasoning, tool use, and safety in one pass.]

This post walks through each component in order: attention compression (MLA), sparse routing (MoE), sparse attention (DSA), post-training (GRPO), and conditional memory (Engram).

2. Multi-head Latent Attention (MLA)

Standard multi-head attention stores separate key and value vectors for every token at every layer. With $n_h$ attention heads (e.g., 128) each of dimension $d_h$ (e.g., 128), a 128K context window in BF16 means storing roughly 8-9 GB of KV cache per layer per sequence, or hundreds of GB across 60 layers (the exact number depends on the number of layers and precision; FP8 quantization halves it). MLA compresses this by projecting keys and values into a shared low-dimensional latent space.

How It Works

Instead of caching full $K$ and $V$ matrices, MLA caches a single compressed vector $c_t$ per token:

KV Compression $$c_t^{KV} = W^{DKV} h_t$$

where $W^{DKV} \in \mathbb{R}^{d_c \times d}$ projects the hidden state $h_t$ (dimension $d$) down to a compressed representation $c_t^{KV}$ (dimension $d_c$, typically $d_c \ll n_h \cdot d_h$). At attention time, keys and values are reconstructed:

KV Reconstruction $$k_t = W^{UK} c_t^{KV}, \quad v_t = W^{UV} c_t^{KV}$$

$W^{UK}$ and $W^{UV}$ are up-projection matrices that reconstruct the full key and value vectors from the compressed cache. The critical insight: you only cache $c_t^{KV}$, not the full keys and values.

Intuition: Think of MLA as JPEG compression for the KV cache. The full keys and values have redundancy across heads. MLA exploits this by storing a compressed version and reconstructing on the fly. With $d_c + d_R = 576$ cached per token vs. the original $2 \times n_h \times d_h = 32{,}768$ for full KV, you get a ~98% reduction in per-token cache size. (The DeepSeek-V2 paper reports 93.3% KV cache reduction relative to its predecessor DeepSeek 67B, which had a different head configuration.)
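The cache arithmetic and the down/up projection round trip can be sketched with stand-in weights. This is a minimal sketch: the projection matrices are random, not trained, and the hidden size is illustrative, but the cached-scalar counts match the figures above.

```python
# Sketch of MLA's cache-size arithmetic and down/up projections, using the
# dimensions quoted in this post (n_h=128 heads, d_h=128, d_c=512, d_R=64).
# Projection weights are random stand-ins, not trained parameters.
import numpy as np

n_h, d_h = 128, 128          # attention heads and per-head dimension
d, d_c, d_R = 2048, 512, 64  # hidden size (illustrative), latent and RoPE dims

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d)) * 0.01         # down-projection
W_UK = rng.standard_normal((n_h * d_h, d_c)) * 0.01  # key up-projection
W_UV = rng.standard_normal((n_h * d_h, d_c)) * 0.01  # value up-projection

h_t = rng.standard_normal(d)   # hidden state for one token
c_t = W_DKV @ h_t              # cache ONLY this (plus the small RoPE key)
k_t = W_UK @ c_t               # reconstructed at attention time
v_t = W_UV @ c_t

full_cache = 2 * n_h * d_h     # MHA: keys + values across all heads
mla_cache = d_c + d_R          # MLA: shared latent + decoupled RoPE key
print(full_cache, mla_cache)   # 32768 vs 576 cached scalars per token
print(1 - mla_cache / full_cache)  # ~0.982, i.e. a ~98% reduction
```

The reconstruction step shows why MLA keeps full per-head expressiveness: every head gets its own key and value back out of the shared latent.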

Decoupled RoPE

There is a subtlety. Rotary Position Embeddings (RoPE) are applied to keys to encode position information, but RoPE is position-dependent, so it cannot be absorbed into the low-rank compression. MLA handles this by decoupling the positional component:

Decoupled Position $$k_t = [k_t^C;\; k_t^R], \quad \text{where } k_t^R = \text{RoPE}(W^{KR} h_t)$$

The content key $k_t^C$ comes from the compressed cache. The positional key $k_t^R$ is a small separate vector (dimension $d_R$, typically 64) that carries the RoPE encoding. Only $k_t^R$ needs to be cached alongside $c_t^{KV}$, adding minimal overhead.

Comparison with Alternatives

| Method | KV Cache per Token | Quality vs. MHA | Mechanism |
|---|---|---|---|
| MHA (standard) | $2 n_h d_h$ | Baseline | Full cache per head |
| GQA | $2 n_g d_h$ | Slight degradation | Share KV across $n_g$ head groups |
| MQA | $2 d_h$ | Notable degradation | Single KV for all heads |
| MLA | $d_c + d_R$ | Matches or exceeds MHA | Low-rank compression + decoupled RoPE |

MLA achieves better compression than MQA while matching or exceeding the quality of full MHA. The key advantage over GQA/MQA: instead of reducing the number of heads, MLA reduces the dimensionality of the cached representation, preserving expressiveness.

Why not GQA? Grouped Query Attention (used by Llama 3, Gemma 2, and most open models) is simpler to implement and works well in practice. But GQA forces a hard trade-off: fewer KV groups means more compression but less per-head specialization. At DeepSeek's scale (128 heads, 128K context), even 8-group GQA still caches $2 \times 8 \times 128 = 2{,}048$ values per token, while MLA caches $d_c + d_R \approx 576$. MLA also preserves the full expressiveness of all 128 heads during attention computation (since keys and values are reconstructed per-head from the shared latent), whereas GQA forces heads within a group to share identical keys and values. The cost is implementation complexity: MLA requires the decoupled RoPE design and careful kernel optimization for the up-projections.

MLA handles the attention side of the efficiency equation. But each transformer layer also has a feed-forward network, and in a dense model, every parameter in that FFN fires for every token.

3. DeepSeekMoE

A standard transformer's feed-forward network (FFN) activates all parameters for every token. This is wasteful: a token about Python syntax does not need the same parameters as a token about organic chemistry. Mixture-of-Experts (MoE) replaces the dense FFN with many small expert networks and a router that selects which ones to activate per token.

Architecture

DeepSeekMoE uses two types of experts:

  • Shared experts: a small number ($N_s$; one in DeepSeek-V3) that process every token, capturing broadly useful knowledge.
  • Routed experts: a large pool (256 in DeepSeek-V3) from which the router selects the top $k$ (8 in V3) per token, capturing specialized knowledge.

The FFN output for a token $x$ is:

MoE Output $$\text{FFN}(x) = \sum_{i=1}^{N_s} \text{Expert}_i^{(s)}(x) + \sum_{j \in \text{TopK}} g_j \cdot \text{Expert}_j^{(r)}(x)$$

where $N_s$ is the number of shared experts, $g_j$ are the gating weights from a softmax over router logits, and TopK selects the $k$ experts with the highest gate values. Each expert is a small FFN (typically SwiGLU) with its own weights.
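The MoE output equation can be sketched for a single token. This is a toy-sized sketch: the experts are plain linear maps standing in for SwiGLU FFNs, and the sizes are shrunk from the real 256-expert, top-8 configuration.

```python
# Minimal sketch of the DeepSeekMoE forward pass for one token: one shared
# expert plus top-k routed experts, with softmax gating over the selected
# experts' router scores. Experts are stand-in linear maps, not SwiGLU FFNs.
import numpy as np

rng = np.random.default_rng(0)
d, n_routed, k = 16, 8, 2       # toy sizes (the real model: 256 routed, top-8)

shared = rng.standard_normal((d, d)) * 0.1           # one shared expert
routed = rng.standard_normal((n_routed, d, d)) * 0.1  # routed expert pool
W_r = rng.standard_normal((n_routed, d)) * 0.1        # router weights

x = rng.standard_normal(d)
scores = W_r @ x                 # s_i = W_r^(i) . h_x for each expert
topk = np.argsort(scores)[-k:]   # TopK selection
g = np.exp(scores[topk])
g /= g.sum()                     # softmax over the selected scores only

out = shared @ x                 # shared expert always fires
for gate, j in zip(g, topk):
    out += gate * (routed[j] @ x)  # gated routed-expert contributions
print(out.shape)
```

Note the gating softmax runs over only the selected experts, matching the TopK sum in the equation above.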

Load Balancing Without Auxiliary Loss

A common problem with MoE: some experts get selected far more than others, leading to imbalanced computation. Most MoE systems (GShard, Switch Transformer) add an auxiliary balance loss to the training objective that penalizes uneven expert utilization. This works, but it creates a fundamental tension: the auxiliary loss pushes the router toward uniform distribution, while the language modeling loss pushes it toward selecting the best experts. The two gradients fight each other, and model quality suffers.

DeepSeek's solution is a loss-free load balancing strategy that separates the routing decision from the gradient signal entirely. Here is how it works:

The standard router computes a score for each expert $i$ given token $x$:

Router Score $$s_i = W_r^{(i)} \cdot h_x$$

where $W_r^{(i)}$ is the router weight for expert $i$ and $h_x$ is the token's hidden state. Normally, TopK selection and gating weights both come from these scores. DeepSeek adds a bias term $b_i$ to each expert's score, but only for the TopK selection step:

Biased Selection $$\text{TopK selection uses: } s_i + b_i \quad \text{(biased)}$$ $$\text{Gating weights use: } g_j = \frac{e^{s_j}}{\sum_{j' \in \text{TopK}} e^{s_{j'}}} \quad \text{(unbiased)}$$

The bias $b_i$ is not a learned parameter. It is adjusted by a simple heuristic after each training step:

  • If expert $i$ received more than its fair share of tokens in the last batch, decrease $b_i$ by a fixed step size $\gamma$.
  • If it received fewer, increase $b_i$ by $\gamma$.

The critical detail: $b_i$ shifts which experts get selected, but the actual gating weights $g_j$ (which determine how much each expert's output contributes) are computed from the original unbiased scores. This means the balance adjustment never introduces gradient noise into the model's learning signal.

Concrete example. Suppose you have 4 experts with router scores $s = [2.0,\; 1.5,\; 0.3,\; 0.1]$ and TopK = 2. Without any balancing, experts 1 and 2 are selected, and the gating weights are $g_1 = e^{2.0}/(e^{2.0} + e^{1.5}) \approx 0.62$ and $g_2 \approx 0.38$. Now suppose expert 2 is overloaded and acquires a bias of $b_2 = -1.5$, while underloaded expert 3 gets $b_3 = +1.0$. The biased selection scores become $[2.0,\; 0.0,\; 1.3,\; 0.1]$, so TopK now picks experts 1 and 3. But the gating weights are computed from the original scores of the selected pair: $g_1 = e^{2.0}/(e^{2.0} + e^{0.3}) \approx 0.85$ and $g_3 \approx 0.15$. Expert 3 got into the room because of the bias, but its low original score means it contributes only 15% of the output. The model still weights experts by true relevance.
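The concrete example above can be checked numerically. This sketch reproduces the same scores and biases and confirms both the selection flip and the unbiased gating weights:

```python
# Reproducing the worked example: the bias shifts WHICH experts are selected,
# but gating weights come from the original unbiased scores of the selected pair.
import numpy as np

s = np.array([2.0, 1.5, 0.3, 0.1])   # router scores (experts 1-4 -> indices 0-3)
b = np.array([0.0, -1.5, 1.0, 0.0])  # balance biases (heuristic, not learned)
k = 2

selected = np.argsort(s + b)[-k:]    # TopK on biased scores [2.0, 0.0, 1.3, 0.1]
g = np.exp(s[selected])
g /= g.sum()                         # gates from the UNBIASED scores

print(sorted(selected.tolist()))     # [0, 2]: expert 3 displaced expert 2
print(np.round(np.sort(g)[::-1], 2)) # [0.85 0.15]: relevance still rules
```

Expert 3 (index 2) gets selected only because of its bias, yet contributes just ~15% of the output, exactly as described above.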

Intuition: Think of it as two separate decisions. First: "which experts should this token visit?" (influenced by the bias). Second: "how much should I weight each expert's answer?" (unbiased, purely based on relevance). The bias is like a bouncer redirecting people to shorter queues at a club, but once you are inside, the DJ plays the same music regardless of how you got in. The model's loss function never sees the bias term, so its gradients are clean.

Scale

In DeepSeek-V3, this means 671B total parameters but only ~37B activated per token. You get the capacity of a 671B model at roughly the inference cost of a 37B dense model.

MoE makes each layer cheaper by activating fewer parameters. But even with sparse FFNs, attention still scales quadratically with sequence length, and at 128K tokens, it dominates the compute budget.

4. DeepSeek Sparse Attention (DSA)

Standard attention is $O(n^2)$ in sequence length $n$. At 128K tokens, the attention matrix has $\sim$16 billion entries per layer per head. DeepSeek Sparse Attention reduces this to near-linear by selecting only the most relevant tokens for each query to attend to.

The Core Idea

DSA splits the context into two categories for each query: a small set of globally relevant tokens (selected by a cheap scoring pass) and a local window of nearby tokens. Only these selected tokens participate in the full attention computation.

The process has two stages:

  1. Lightning Indexer (cheap scoring): For each query position $t$, compute a relevance score for every past token. The scoring uses MLA's compressed latent representations $c_j^{KV}$ (which are already cached), so this step adds minimal overhead. Concretely, the score for token $j$ given query $q_t$ is a dot product in the compressed space:
    Token Relevance Score $$\text{score}(q_t, j) = q_t^{\text{idx}} \cdot (c_j^{KV})^\top$$

    where $q_t^{\text{idx}}$ is a lightweight query projection (a small linear layer applied to the query, separate from the main attention query). This is much cheaper than full attention because it operates in the compressed dimension $d_c$ (e.g., 512) rather than the full head dimension.

  2. Selective Attention: Select the top-$B$ tokens by score (where $B$ is a fixed budget per query; the V3.2 paper uses $B = 2{,}048$ during training) and combine them with the local sliding window. Only this subset participates in the full attention computation:
    Sparse Attention $$\text{Attn}(q_t) = \text{softmax}\!\left(\frac{q_t K_{\mathcal{S}_t}^\top}{\sqrt{d}}\right) V_{\mathcal{S}_t}$$

    where $\mathcal{S}_t = \text{TopB}(\text{score}(q_t, \cdot)) \cup \text{Window}(t, w)$ is the union of the top-$B$ globally relevant tokens and the local window of size $w$.

Intuition: Instead of every token attending to every other token (quadratic), DSA first does a cheap scan to find which tokens matter, then computes full attention only on those. At 128K context with $B = 2{,}048$ selected tokens plus a local window, each query attends to a small fraction of the full context. The scoring pass itself is cheap because it operates in MLA's compressed space ($d_c = 512$) rather than the full attention dimension.

Why It Works with MLA

This is where MLA and DSA reinforce each other. MLA already compresses keys and values into latent vectors $c_t^{KV}$ for the purpose of reducing cache memory. DSA reuses these exact same compressed representations as input to the lightning indexer. The indexer does not need its own separate index or data structure; it simply dot-products the query against the cached latents. Without MLA's compression, the indexer would need to score against full-dimensional keys, making the "cheap scan" much less cheap.

Worked example: token selection at 128K context

Consider a model processing a 128K-token document. At position $t = 100{,}000$, the query asks about "the treaty signed in 1648". With dense attention, this query would compute 100,000 dot products at full dimension.

Step 1: Scoring. The lightning indexer computes $\text{score}(q_t, j)$ for all $j < t$ using the compressed latents. This is 100,000 dot products, but in dimension $d_c = 512$ instead of the full attention dimension $n_h \times d_h = 128 \times 128 = 16{,}384$. Tokens mentioning "treaty", "Westphalia", "1648", and "Peace" score highest.

Step 2: Selection. The top $B = 2{,}048$ tokens by score are selected globally. These might be scattered across the document: some from the section discussing the Treaty of Westphalia (positions ~12,000-12,500), some from a timeline section (positions ~45,000-45,200), and some from a cross-reference (position ~89,000). The local sliding window captures the most recent few thousand positions.

Step 3: Full attention. The query computes full multi-head attention over only the selected + window tokens, a small fraction of the 100K total context, while still attending to the most relevant passages anywhere in the document.
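The three steps above can be sketched end to end on toy data. This is a shape-level sketch: the latents and keys are random, the budget and window are shrunk, and a single query stands in for multi-head attention.

```python
# Sketch of DSA's two stages: (1) cheap scoring in the compressed dimension,
# (2) full attention over the union of top-B tokens and a local window.
# All tensors are random stand-ins; sizes are toy versions of the real ones.
import numpy as np

rng = np.random.default_rng(0)
T, d_c, d_full = 1000, 32, 256   # toy context; the real d_c is 512
B, w = 16, 8                     # selection budget and local window size

c_kv = rng.standard_normal((T, d_c))   # MLA latents (already cached)
q_idx = rng.standard_normal(d_c)       # lightweight indexer query projection
K = rng.standard_normal((T, d_full))   # full keys (reconstructed per head)
V = rng.standard_normal((T, d_full))
q = rng.standard_normal(d_full)        # main attention query at position T

scores = c_kv @ q_idx                        # Stage 1: cheap scan in d_c
top_b = set(np.argsort(scores)[-B:].tolist())
window = set(range(T - w, T))
S = sorted(top_b | window)                   # union: global picks + local window

att = np.exp(q @ K[S].T / np.sqrt(d_full))   # Stage 2: attention on S only
att /= att.sum()
out = att @ V[S]
print(len(S), "of", T, "tokens attended")
```

Each query attends to at most B + w positions regardless of context length, which is where the near-linear scaling comes from.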

Hardware Considerations

The practical challenge with any sparse attention scheme is turning the token selection into efficient GPU kernels. Naive implementations with scattered memory accesses can negate the theoretical FLOP savings. The DeepSeek-V3.2 paper describes DSA as "hardware-aligned and natively trainable," designed to work with existing optimized attention kernels rather than requiring entirely custom implementations.

Key limitation: DSA requires the model to be trained with sparse attention from the start (or at least fine-tuned with it for an extended period). You cannot simply drop DSA into a pre-trained dense-attention model and expect it to work. The model needs to learn which tokens are worth selecting, and the lightning indexer's query projection must be trained jointly with the rest of the attention mechanism.

MLA, MoE, and DSA together address the core inference bottlenecks: memory, compute per layer, and quadratic attention. But a pre-trained model still needs alignment. The next section covers how DeepSeek handles post-training with a unified RL approach.

5. Scalable Reinforcement Learning

After pre-training, DeepSeek uses reinforcement learning to improve reasoning, tool use, and safety. The core algorithm is Group Relative Policy Optimization (GRPO), which avoids the need for a separate value model (unlike PPO).

GRPO

For a given prompt $q$, GRPO samples a group of $G$ responses $\{o_1, \ldots, o_G\}$ from the current policy. Each response is scored by a reward function $r(o_i)$. The advantage is computed relative to the group:

Group Advantage $$\hat{A}_i = \frac{r(o_i) - \text{mean}(\{r(o_j)\})}{\text{std}(\{r(o_j)\})}$$

The policy gradient loss clips the ratio (like PPO) and includes a KL penalty against the reference policy $\pi_{\text{ref}}$:

GRPO Objective $$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}\!\left[\min\!\left(\frac{\pi_\theta}{\pi_{\text{old}}} \hat{A},\; \text{clip}\!\left(\frac{\pi_\theta}{\pi_{\text{old}}}, 1\!\pm\!\epsilon\right) \hat{A}\right) - \beta\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

where $\pi_\theta$ is the current policy being trained, $\pi_{\text{old}}$ is the policy that generated the samples (before the current update step), $\pi_\theta / \pi_{\text{old}}$ is the importance sampling ratio, $\epsilon$ is the clipping range (typically 0.2), and $\beta$ controls the KL penalty strength against the frozen reference policy $\pi_{\text{ref}}$.

Intuition: GRPO skips the value model entirely. Instead of estimating "how good is this state?" with a learned critic, it just compares each response to its siblings in the same group. If your response scored 0.8 and the group average was 0.5 with std 0.2, your advantage is $(0.8 - 0.5)/0.2 = 1.5$. Simple, stable, and cheap.
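The group-relative advantage is a one-liner. This sketch uses made-up rewards for a group of five responses; note the advantages always come out with zero mean and unit variance within the group.

```python
# GRPO's group-relative advantage: score each response against its siblings.
# No learned critic; the group itself is the baseline. Rewards are toy values.
import numpy as np

rewards = np.array([0.8, 0.5, 0.3, 0.7, 0.2])   # one group of G=5 responses
adv = (rewards - rewards.mean()) / rewards.std()
print(np.round(adv, 2))   # best responses get positive advantage, worst negative
```

Because the advantage is standardized per group, the update signal stays on a comparable scale across prompts with very different reward ranges.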

Stability Fixes

Scaling GRPO to large models and diverse reward signals revealed several instabilities. DeepSeek-V3.2 introduced four fixes:

| Problem | Fix | Mechanism |
|---|---|---|
| KL divergence biased upward | Unbiased KL estimator | Replace standard KL with $\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$, which has zero bias when $\pi_\theta = \pi_{\text{ref}}$ |
| Stale off-policy data | Off-policy sequence masking | For negative-advantage sequences, mask the entire sequence if the average log importance ratio $\frac{1}{T}\sum_t \log\frac{\pi_\theta}{\pi_{\text{old}}}$ exceeds a threshold $\delta$ |
| Expert routing drift during RL | Keep routing | Record the MoE expert routing decisions made during sampling and replay the same routing during the training forward pass, ensuring consistent expert assignments |
| Action-space mismatch between sampling and training | Keep sampling mask | Preserve the top-p/top-k token truncation masks from sampling and apply them during the training forward pass, so both policies operate over the same action subspace |
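The unbiased KL estimator is easy to verify on toy log-probs: it is exactly zero wherever the two policies agree and non-negative elsewhere.

```python
# The unbiased per-token KL estimator: r - log(r) - 1 with r = pi_ref / pi_theta.
# Zero exactly where the policies agree, non-negative everywhere (toy log-probs).
import numpy as np

logp_theta = np.array([-1.2, -0.7, -2.1])   # current policy, per token
logp_ref = np.array([-1.0, -0.9, -2.1])     # frozen reference policy

r = np.exp(logp_ref - logp_theta)           # importance ratio pi_ref / pi_theta
kl_est = r - np.log(r) - 1
print(np.round(kl_est, 4))                  # last entry is exactly 0
```

Non-negativity matters in practice: a naive per-token estimator can go negative, which injects noise into the penalty gradient.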

Unified Training Pipeline

Rather than running separate RL stages for reasoning, tool use, and safety (which risks catastrophic forgetting between stages), DeepSeek runs a three-phase unified pipeline:

  1. Specialist Training: Train domain-specific expert policies using GRPO with verifiable rewards. Each specialist focuses on a single domain:
    • Math specialist: trained on competition-level problems; reward = 1 if final answer matches ground truth, 0 otherwise.
    • Code specialist: trained on programming challenges; reward = fraction of test cases passed (0.0 to 1.0).
    • Agent specialist: trained on multi-step tool-use tasks; reward = binary task completion signal from a programmatic verifier.
    Each specialist starts from the same SFT checkpoint and diverges during RL. This phase runs for thousands of GRPO steps per domain, producing 3-5 specialist checkpoints.
  2. Distillation: Merge the specialist policies into a single generalist model. For each domain's prompt distribution, the generalist is trained to minimize KL divergence against the corresponding specialist's output distribution: $\mathcal{L}_{\text{distill}} = \sum_{d \in \text{domains}} \mathbb{E}_{x \sim \mathcal{D}_d}\left[D_{\text{KL}}(\pi_{\text{specialist}}^d(\cdot|x) \| \pi_{\text{generalist}}(\cdot|x))\right]$. This is cheaper than training from scratch because the specialists have already found good policies; distillation just needs to merge them. The result is a single model that is competent (but not yet optimal) across all domains.
  3. Unified RL: Run GRPO on the distilled model with a mixed-domain reward signal. Each training batch contains prompts from all domains simultaneously. The router selects the appropriate reward function based on the prompt's domain tag: correctness for math, test-case pass rate for code, task completion for agents, and a safety classifier score for all prompts. The mixed batches are critical: they prevent the model from overfitting to any single domain's reward signal. The KL penalty against the distilled checkpoint acts as an anchor, preventing the model from forgetting capabilities while improving on each domain.

Agentic Task Synthesis

The agent specialist in phase 1 needs training data, but real multi-step agentic tasks are scarce and expensive to annotate. DeepSeek solves this with a synthesis pipeline that generates over 85,000 tasks spanning web browsing, file operations, API calls, and multi-tool composition.

The synthesis process works in three steps: (1) an LLM generates a task description and the ground-truth execution plan, (2) a programmatic verifier is auto-generated that checks whether an agent's execution trace achieves the intended outcome (not just the final answer, but intermediate state transitions), and (3) the task is validated by running it against a baseline agent to ensure it is solvable but non-trivial. Tasks that are either trivially solved (>90% baseline pass rate) or unsolvable (<5% pass rate) are filtered out.
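The final filtering step reduces to a range check on baseline pass rates. A minimal sketch, with hypothetical task records (the IDs and rates are invented for illustration):

```python
# Sketch of the solvability filter: drop tasks a baseline agent solves >90%
# of the time (trivial) or <5% of the time (unsolvable). Records are hypothetical.
tasks = [
    {"id": "web-001", "baseline_pass_rate": 0.97},    # trivial -> dropped
    {"id": "file-042", "baseline_pass_rate": 0.40},   # kept
    {"id": "api-113", "baseline_pass_rate": 0.02},    # unsolvable -> dropped
    {"id": "multi-207", "baseline_pass_rate": 0.55},  # kept
]

kept = [t for t in tasks if 0.05 <= t["baseline_pass_rate"] <= 0.90]
print([t["id"] for t in kept])   # ['file-042', 'multi-207']
```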

All five innovations so far target efficiency during pre-training or post-training. The final innovation, Engram, addresses a different question: can we give the transformer a fundamentally new primitive for tasks it currently handles inefficiently?

6. Engram: Conditional Memory

This is the most recent innovation, introduced in "Conditional Memory via Scalable Lookup" (Cheng et al., Jan 2026). The core insight: transformers waste depth on static pattern retrieval (named entities, idiomatic phrases, factual associations) that could be handled by direct lookup.

The Problem

Consider what happens when a model encounters "Alexander the". A standard transformer must propagate this through multiple attention and FFN layers to "compute" that the next likely token is "Great". But this is not reasoning; it is memorized factual recall. The model is using expensive conditional computation to simulate what should be a cheap table lookup.

Engram adds a dedicated conditional memory module that retrieves static embeddings via hash-based lookup in O(1) time, freeing the transformer's depth for genuine reasoning.

Architecture

Engram operates in two phases. Here is what happens concretely when the model processes the sequence "... the great ruler Alexander the ___":

[Diagram: Engram's two phases, traced for "... ruler Alexander the ___". Phase 1, Retrieval: the bigram ("alexander", "the") is hashed by 8 independent hash functions; the 8 indices retrieve rows from an embedding table of ~2.3 million rows × 80 dimensions each, and the retrieved rows are concatenated into a 1,280-dim embedding (8 heads × 2 N-gram orders × 80). Phase 2, Gating: the hidden state from the attention layers sets a sigmoid similarity gate. For "... great ruler Alexander the ___" the context aligns with memory (gate ≈ 0.9, memory used strongly); for "... call Alexander the plumber ___" it disagrees (gate ≈ 0.1, memory suppressed). The same N-gram always retrieves the same embedding; the gate decides how much to use it. The gated output passes through a causal convolution, then adds to the residual stream.]

Phase 1: Retrieval

For each token position $t$, Engram extracts the preceding N-gram contexts (bigrams and trigrams). A tokenizer compression function first maps tokens to canonical IDs via NFKC normalization and lowercasing, reducing the effective vocabulary by 23%.

Each N-gram is hashed using $K = 8$ independent hash heads per N-gram order, and the resulting indices look up rows from large embedding tables:

Hash Lookup $$e_{t,n,k} = E_{n,k}[\phi_{n,k}(g_{t,n})], \quad e_t = \|_{n=2}^{N} \|_{k=1}^{K} e_{t,n,k}$$

where $g_{t,n}$ is the N-gram context of order $n$ ending at position $t$ (e.g., $g_{t,2}$ is the preceding bigram), $\phi_{n,k}$ is a deterministic hash function for N-gram order $n$ and head $k$, $E_{n,k}$ is the embedding table, $N$ is the maximum N-gram order (typically 3, covering bigrams and trigrams), and $\|$ denotes concatenation. The hash uses lightweight multiplicative-XOR operations: each token ID in the N-gram is multiplied by a large prime, the results are XOR'd together, and the final value is taken modulo the table size $M$ (itself a prime). For example, for head $k$ with primes $p_k$: $\phi_{2,k}(a, b) = (a \cdot p_k \oplus b \cdot p_k') \bmod M$. Different heads use different primes, so the same N-gram maps to different rows in each head's table.

Worked example: retrieving the embedding for "Alexander the"

Step 1: Extract N-grams. The preceding tokens are compressed to canonical IDs (lowercased, normalized): "alexander" = ID 4821, "the" = ID 12. The bigram context is (4821, 12) and the trigram context includes the token before that, say (917, 4821, 12).

Step 2: Hash each N-gram, multiple times. Each N-gram is hashed with $K = 8$ independent hash functions. For the bigram (4821, 12), hash head 1 might compute $\phi_{2,1}(4821, 12) = 738{,}201$ and hash head 2 computes $\phi_{2,2}(4821, 12) = 1{,}402{,}557$, and so on for all 8 heads. The trigram is hashed the same way with its own 8 hash functions. That is 16 hash lookups total (8 for bigrams + 8 for trigrams).

Step 3: Retrieve embeddings. Each hash index pulls one row from its embedding table. If the total memory dimension is $d_{\text{mem}} = 1{,}280$ and each row has dimension $d_{\text{mem}} / (N \cdot K)$, say 80 dimensions, then the 16 retrieved vectors are concatenated into a single vector $e_t$ of dimension $16 \times 80 = 1{,}280$. This is the complete retrieved embedding for this position.

The whole process is deterministic: the same N-gram always retrieves the same embeddings. During training, backpropagation updates only the 16 retrieved rows (out of millions), making gradient updates sparse and efficient. The multiple hash heads serve a purpose similar to multiple attention heads: they give the model several independent "views" of each N-gram pattern, reducing the impact of hash collisions.
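The retrieval phase can be sketched in a few lines. This sketch uses a small table and randomly chosen odd multipliers standing in for the paper's primes; only the structure (multiplicative-XOR hash, per-head tables, concatenation) follows the description above.

```python
# Sketch of Engram retrieval: hash bigram and trigram IDs with K independent
# multiplicative-XOR hash functions, look up per-head embedding rows, concatenate.
# Table size and multipliers are toy stand-ins (the paper's tables: ~2.3M rows).
import numpy as np

M = 1009              # toy table size (ideally prime, as in the paper)
K, row_dim = 8, 80    # hash heads per N-gram order, row width
rng = np.random.default_rng(0)
mults = rng.integers(1, 1 << 31, size=(K, 3)) | 1   # odd stand-in multipliers

def hash_ngram(ids, head):
    # multiplicative-XOR: multiply each canonical token ID, XOR, reduce mod M
    h = np.int64(0)
    for pos, tok in enumerate(ids):
        h ^= np.int64(tok) * np.int64(mults[head, pos])
    return int(h) % M

# one embedding table per (N-gram order, head); rows are learned in training
tables = {n: rng.standard_normal((K, M, row_dim)) for n in (2, 3)}

bigram, trigram = (4821, 12), (917, 4821, 12)   # canonical IDs from the example
parts = []
for n, gram in ((2, bigram), (3, trigram)):
    for head in range(K):
        parts.append(tables[n][head, hash_ngram(gram, head)])
e_t = np.concatenate(parts)   # 16 retrieved rows x 80 dims
print(e_t.shape)              # (1280,)
```

The lookup is deterministic, so the same bigram always lands on the same 8 rows per table, which is what makes the sparse gradient updates possible.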

Phase 2: Fusion

The retrieved embedding $e_t$ is a context-independent prior: it only depends on the preceding N-gram, not on the broader sentence. The phrase "Alexander the" always retrieves the same embedding, whether it appears in "Alexander the Great conquered Persia" or "Alexander the plumber fixed my sink". The fusion phase makes this prior context-dependent.

The current hidden state $h_t$ (which has already been through attention and carries full-sentence context) acts as a query. The retrieved embedding is projected by learned matrices $W_k$ (key projection) and $W_v$ (value projection):

Gating $$\alpha_t = \sigma\!\left(\frac{\textbf{RMSNorm}(h_t)^\top \;\textbf{RMSNorm}(W_k e_t)}{\sqrt{d}}\right)$$
Both $h_t$ and $W_k e_t$ pass through RMSNorm before the dot product. RMSNorm divides a vector by its root-mean-square, $\text{RMSNorm}(x)_i = x_i / \sqrt{\tfrac{1}{d}\sum_j x_j^2}$; it is simpler than LayerNorm (no mean subtraction) and leaves a vector of unit RMS. Normalizing both vectors makes the gate measure directional alignment only: a large hidden state cannot spuriously open the gate; only genuine semantic similarity does.

Here $\sigma$ is the sigmoid function. This is a scalar gate $\alpha_t \in [0, 1]$: it computes a normalized dot product between the hidden state and the projected memory key, then squashes it through the sigmoid. The gated output is $\tilde{v}_t = \alpha_t \cdot W_v e_t$, where $W_v$ projects the retrieved embedding into the value used for the residual stream.
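The gate computation is compact enough to sketch directly. The projection weights here are random stand-ins, so the particular gate value is meaningless; the point is the shape of the computation.

```python
# Sketch of Engram's fusion gate: RMSNorm both sides, scaled dot product,
# sigmoid squash, then gate the projected value. Weights are random stand-ins.
import numpy as np

def rmsnorm(x):
    return x / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
d, d_mem = 64, 1280                    # hidden size (toy) and memory dim
W_k = rng.standard_normal((d, d_mem)) * 0.05   # key projection
W_v = rng.standard_normal((d, d_mem)) * 0.05   # value projection

h_t = rng.standard_normal(d)      # context-aware hidden state (the query)
e_t = rng.standard_normal(d_mem)  # context-independent retrieved embedding

logit = (rmsnorm(h_t) @ rmsnorm(W_k @ e_t)) / np.sqrt(d)
alpha = 1 / (1 + np.exp(-logit))  # scalar gate in (0, 1)
v_t = alpha * (W_v @ e_t)         # gated value headed for the residual stream
print(float(alpha))
```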

Worked example: same N-gram, two different contexts

Consider "Alexander the" appearing in two sentences:

  • "... the great ruler Alexander the ___": The hidden state $h_t$ encodes "great ruler" context from earlier attention. This context is highly aligned with the N-gram embedding for "Alexander the" (which has learned associations with "Great" during training). The dot product is large, so $\alpha_t \approx 0.9$. The memory contributes strongly, pushing the model toward predicting "Great".
  • "... call Alexander the plumber and ask ___": The hidden state encodes "call ... plumber and ask" context. This context is poorly aligned with the "Alexander the" embedding's learned associations. The dot product is small, so $\alpha_t \approx 0.1$. The memory is mostly suppressed, and the transformer proceeds with its own reasoning about what you would ask a plumber.

The gated values then pass through a depthwise causal convolution (kernel size 4, dilation equal to the max N-gram order) with SiLU activation and a residual connection:

Fusion Output $$Y = \text{SiLU}\!\left(\text{Conv1D}(\text{RMSNorm}(\tilde{V}))\right) + \tilde{V}$$

Here $\text{SiLU}(x) = x \cdot \sigma(x)$, a smooth activation that, unlike ReLU, passes small negative values, which helps gradient flow.

The convolution operates over adjacent token positions with a receptive field of 4 tokens. Each output position is a weighted mix of itself and its neighbors' gated values. This serves two purposes: reinforcing consistent signals and suppressing noise.

Why this matters. The gate at each position makes an independent decision based on its own N-gram and hidden state. But language has multi-token patterns where individual positions should cooperate. The convolution lets them.

Consider the sequence "the Milky Way galaxy". At position "Milky", the bigram ("the", "Milky") retrieves an embedding with a strong gate (say 0.85). At "Way", the bigram ("Milky", "Way") also has a strong gate (0.88). Without convolution, each position's memory contribution is independent. With convolution, the kernel sees two adjacent high-gate values and amplifies both, because a 1D convolution computes a weighted sum over the window: $\text{out}_t = w_1 \cdot \tilde{v}_{t-3} + w_2 \cdot \tilde{v}_{t-2} + w_3 \cdot \tilde{v}_{t-1} + w_4 \cdot \tilde{v}_t$. When multiple neighbors carry strong memory signals, they reinforce each other in the output.

Now consider a hash collision: the bigram ("set", "the") at some position accidentally maps to the same table row as a common idiom, producing a spuriously high gate of 0.7. But the positions before and after it have low gates (0.1, 0.05), because their N-grams did not match anything meaningful. The convolution window sees one high value surrounded by near-zero values, and the weighted sum dilutes the outlier. The spurious signal is dampened rather than passed through at full strength.

Intuition: The gate at each position asks: "does my context agree with this N-gram's memory?" The convolution then asks: "do my neighbors agree too?" If multiple adjacent positions all have high gates (a named entity, an idiom), the convolution reinforces the signal. If only one position lights up while its neighbors are dark (likely a hash collision or noise), the convolution smooths it out. It is a consensus mechanism across positions.
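The consensus effect can be demonstrated with a toy causal convolution. Uniform weights stand in for the learned kernel (and the dilation and residual connection are omitted), but the contrast survives: adjacent strong gates sum inside the window, while an isolated spike is averaged against near-zero neighbors.

```python
# Toy demonstration of the consensus mechanism: a causal kernel of size 4
# over gated magnitudes. Uniform weights stand in for the learned kernel;
# dilation and the residual connection are omitted for clarity.
import numpy as np

def causal_conv(x, w):
    # output[t] = weighted sum over x[t-3..t], left-padded with zeros (causal)
    pad = np.concatenate([np.zeros(len(w) - 1), x])
    return np.array([pad[t:t + len(w)] @ w for t in range(len(x))])

w = np.full(4, 0.25)                        # stand-in kernel weights
entity = np.array([0.0, 0.85, 0.88, 0.0])   # "Milky Way": adjacent high gates
spike = np.array([0.0, 0.70, 0.0, 0.0])     # isolated hash-collision gate

print(np.round(causal_conv(entity, w), 2))  # adjacent signals sum in the window
print(np.round(causal_conv(spike, w), 2))   # lone spike diluted by its neighbors
```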

The U-Shaped Scaling Law

Given a fixed parameter budget, how should you split inactive parameters between MoE experts and Engram memory? DeepSeek defines an allocation ratio $\rho \in [0, 1]$ where $\rho$ is the fraction going to MoE:

Allocation $$P_{\text{MoE}}^{\text{sparse}} = \rho \cdot P_{\text{sparse}}, \quad P_{\text{Engram}} = (1 - \rho) \cdot P_{\text{sparse}}$$

Experiments at two compute scales ($2 \times 10^{20}$ and $6 \times 10^{20}$ FLOPs) reveal a U-shaped curve: validation loss is minimized at $\rho \approx 0.75\text{-}0.80$, meaning roughly 75-80% of sparse capacity should go to MoE and 20-25% to Engram.

| Allocation ($\rho$) | Val Loss ($C = 2 \times 10^{20}$) | Interpretation |
|---|---|---|
| 1.0 (pure MoE) | 1.7248 | Wastes depth simulating lookup |
| 0.75-0.80 (optimal) | 1.7109 | Best trade-off |
| 0.0 (pure Engram) | Higher | Lacks conditional computation |

Intuition: Go too far toward MoE ($\rho \to 1$) and the model wastes expert capacity memorizing static facts. Go too far toward Engram ($\rho \to 0$) and the model lacks the dynamic computation needed for reasoning. The sweet spot dedicates most sparse capacity to experts (for reasoning) with a meaningful slice for memory (for facts).

Engram-27B Results

The headline model, Engram-27B, has 26.7B total parameters with 3.8B active per token and 5.7B in Engram memory. Compared to a MoE-27B baseline (same total params, same FLOPs), it uses 55 routed experts instead of 72 (trading 17 experts for memory):

| Benchmark | MoE-27B | Engram-27B | Gain |
| --- | --- | --- | --- |
| MMLU | 57.4% | 60.4% | +3.0 |
| BBH | 50.9% | 55.9% | +5.0 |
| ARC-Challenge | 70.1% | 73.8% | +3.7 |
| HumanEval | 37.8% | 40.8% | +3.0 |
| MATH | 28.3% | 30.7% | +2.4 |
| GSM8K | 58.4% | 60.6% | +2.2 |
| RULER (long-ctx) | 84.2% | 97.0% | +12.8 |

The RULER result is remarkable: +12.8 points on multi-query needle-in-a-haystack. Engram's hash-based lookup gives the model near-perfect retrieval for factual patterns in long contexts.

A New Axis of Sparsity

The key conceptual contribution: language modeling has two qualitatively different sub-tasks, and they benefit from different types of sparsity:

| Sub-task | Sparsity Type | Mechanism | Example |
| --- | --- | --- | --- |
| Compositional reasoning | Conditional computation (MoE) | Route to relevant experts | Multi-step math derivation |
| Knowledge retrieval | Conditional memory (Engram) | Hash-based embedding lookup | "capital of France" => Paris |

Standard transformers force both tasks through the same computational pathway. Engram gives the model a native "memory fetch" primitive, analogous to how CPUs separate cache/memory access from ALU computation.
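A toy sketch of that "memory fetch" primitive: hash an N-gram of token ids into a fixed-size embedding table. The table size, embedding dimension, and hash function here are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

TABLE_ROWS, DIM = 1 << 16, 8
rng = np.random.default_rng(0)
table = rng.normal(size=(TABLE_ROWS, DIM))

def ngram_fetch(token_ids: tuple[int, ...]) -> np.ndarray:
    """O(1) lookup: no attention, no FFN, just an address computation."""
    row = hash(token_ids) % TABLE_ROWS
    return table[row]

# The same bigram always fetches the same row, at a cost independent of
# context length. Collisions are possible by construction, which is why the
# context gate must filter what actually enters the residual stream.
v1, v2 = ngram_fetch((1012, 2054)), ngram_fetch((1012, 2054))
assert np.array_equal(v1, v2)
```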

7. Engram in Practice

The architecture above describes what Engram does. This section covers how to deploy it: where to place it in the transformer stack, how to manage the large embedding tables, and the scaling properties.

System Design

Engram's embedding tables can grow to billions of parameters (the Engram-40B variant has 18.5B in memory alone). Managing tables of this size comes down to the placement and budgeting choices described below.

Layer Placement

Where you insert the Engram module within the transformer stack matters significantly. The paper sweeps single-module insertion across layers 1-12 of a 12-layer backbone and finds a clear pattern:

| Configuration | Val Loss | Notes |
| --- | --- | --- |
| MoE baseline (no Engram) | 1.808 | All params in experts |
| Layer 1 (earliest) | 1.776 | Hidden state lacks context for gating |
| Layer 2 (best single) | 1.770 | One attention round provides enough context |
| Layer 6 (middle) | 1.778 | Good gating but backbone already spent depth |
| Layer 12 (latest) | 1.785 | Too late: backbone already reconstructed patterns |
| Layers 2 + 6 (split) | 1.768 | Best overall: early offload + mid-layer refinement |

The core trade-off: early insertion lets Engram offload static patterns before the backbone wastes depth reconstructing them, but the hidden state at layer 1 has not yet been through any attention and lacks the context needed for accurate gating. Layer 2 is the sweet spot: one round of attention provides a meaningfully contextualized hidden state while still being early enough to save depth.

Splitting the memory budget across two layers (2 and 6 in the ablation; 2 and 15 in the full Engram-27B model) outperforms any single-layer configuration. The early module handles high-confidence local patterns (named entities, common collocations). The later module handles patterns that require more accumulated context to gate correctly (e.g., domain-specific phrases where the gate needs to see several preceding tokens to decide relevance). This split also has a practical system benefit: two smaller tables are easier to distribute across the memory hierarchy than one large one.

Warning: Late-only insertion (layer 12) is worse than no Engram at all in some configurations. By the time the backbone reaches layer 12, it has already spent its depth budget reconstructing static patterns through computation. Adding a lookup primitive at this point provides little benefit because the work has already been done.

8. Practical Notes

Which innovations matter most for your use case?

| Bottleneck | Innovation | When to care |
| --- | --- | --- |
| Inference memory | MLA | Serving long contexts (>8K tokens) at scale |
| Inference/training compute | MoE | Want large model capacity without proportional cost |
| Long-context quality | DSA | Processing 64K+ token inputs |
| Post-training efficiency | GRPO | RL alignment without the cost of training a value model |
| Factual accuracy | Engram | Knowledge-heavy tasks, long-context retrieval |

The efficiency compounds

These innovations are not independent. MLA's compressed representations enable DSA's lightweight token scoring. MoE's sparse routing pairs with Engram's sparse memory. GRPO's simplicity makes it feasible to run unified RL across many reward domains. Each piece makes the others more effective.

What Engram changes about scaling

Engram introduces a genuinely new scaling axis. Previous scaling laws focused on compute (FLOPs) and model parameters (active and total). Engram demonstrates that memory capacity (lookup table size) scales independently: adding memory from 5.7B to 18.5B parameters improves performance with zero additional compute per token. This follows a power-law relationship with no sign of saturation at current scales.

Warning: Engram's gains are largest on knowledge-intensive and long-context tasks (+5.0 on BBH, +12.8 on RULER). On tasks that are primarily reasoning-bound (like MATH or GSM8K), the gains are more modest (+2-3 points). If your workload is pure chain-of-thought reasoning, MoE and GRPO matter more than Engram.

Pitfalls

References

  1. DeepSeek-AI (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. Introduces fine-grained expert segmentation and shared expert isolation.
  2. DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. Introduces MLA and loss-free load balancing for MoE.
  3. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. Scales MLA + MoE to 671B params, introduces the multi-token prediction training objective.
  4. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. GRPO for reasoning, distillation pipeline.
  5. DeepSeek-AI (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. DeepSeek Sparse Attention (DSA) with lightning indexer.
  6. DeepSeek-AI (2025). DeepSeek-V3.2 Technical Report. Unified RL pipeline, agentic task synthesis, GRPO stability fixes (keep routing, off-policy masking, unbiased KL).
  7. Cheng et al. (2026). Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. Engram architecture, U-shaped scaling law, layer placement analysis.
  8. Shazeer (2020). GLU Variants Improve Transformer. SwiGLU activation used in DeepSeek's expert FFNs.
  9. Zhang & Sennrich (2019). Root Mean Square Layer Normalization. RMSNorm used throughout DeepSeek and in Engram's gating mechanism.
  10. Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Introduces GRPO (Group Relative Policy Optimization).

Cite this post:

@article{sedhain2026deepseek,
  title   = {DeepSeek's Technical Playbook: From MLA to Conditional Memory},
  author  = {Sedhain, Suvash},
  journal = {ssedhain.com},
  year    = {2026},
  month   = {Mar},
  url     = {https://mesuvash.github.io/blog/2026/deepseek-v3/}
}