Semantic IDs: Replacing Random Item IDs with Content-Derived Tokens
A detailed walkthrough of Singh et al., 2023: how YouTube compresses a 2048-dim video embedding into 8 integer tokens using RQ-VAE, then plugs those tokens into a production ranking model via SentencePiece, improving cold-start recommendations while keeping a learnable embedding table for memorization.
1. Why Random Item IDs Are a Problem
Almost every production recommender system keeps a giant embedding table indexed by item ID. The ID itself is a random string (a video_id, a product_id) with no semantic meaning. The model learns an embedding per item from interaction data, and that embedding stores everything the model knows about the item: its quality, its audience, its engagement patterns.
This works beautifully for head items with millions of interactions. It breaks for three situations:
- New items. A video uploaded an hour ago has no training signal. Its embedding is either freshly-initialized noise or a random-hashed collision with unrelated videos.
- Long-tail items. Most items in a power-law catalog have few interactions. Their embeddings never converge to anything meaningful.
- Evolving catalogs. YouTube adds hundreds of hours of video per minute. The ID-to-embedding mapping is always chasing a moving target.
The natural fix is to replace the ID with a content embedding: take the video, run it through a pre-trained encoder, use the resulting vector as the item representation. Content embeddings generalize by construction: two videos about cooking sushi land near each other in embedding space, regardless of whether the model has seen either before.
But when the authors at YouTube tried this directly, quality dropped. The ranking model had been relying on the ID embedding table for memorization: the per-item quality signal that a fixed content embedding cannot carry. A deeper MLP recovered some of the loss, but it took 1.5x to 2x as many layers, and correspondingly more serving compute, on a system that runs at millions of QPS.
Semantic IDs are the answer. Instead of using a fixed content embedding as input, you compress the content embedding into a short sequence of discrete tokens, then learn an embedding table indexed by those tokens. Similar items share tokens, so they share embeddings (generalization). But each token still has its own learnable vector, so the model can still memorize what each semantic cluster tends to do (memorization). The embedding table is small because the token vocabulary is small.
2. Big Picture: Two Stages
The system has two decoupled stages. Stage 1 trains a quantizer that turns a content embedding into a token sequence. Stage 2 trains the ranking model with those token sequences as features. Stage 1 runs once and is frozen. Stage 2 runs continuously on fresh logged data.
Three things to notice:
- The video encoder is trained separately (using graph clustering techniques, not recommendation feedback) and frozen. The RQ-VAE is also frozen once trained. Only the embedding table indexed by Semantic ID tokens, and the rest of the ranking model, keep learning.
- The Semantic ID is a short sequence of integers: 8 tokens for YouTube, each from a vocabulary of 2048. That is tiny compared to a 2048-dim dense embedding, which matters when you need to store a user's entire watch history.
- The tokenizer in Stage 2 is the key production trick. A naive approach (one embedding per token, summed across positions) wastes the hierarchy. SentencePiece learns which token combinations are worth their own embedding.
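To put the size claim from the second point in numbers, here is a back-of-the-envelope sketch. It assumes the dense embedding is stored as float32 and that each token is packed into the minimum number of bits; neither storage choice comes from the paper.

```python
import math

NUM_TOKENS = 8        # Semantic ID length (from the text)
VOCAB_SIZE = 2048     # codebook size per level (from the text)
DENSE_DIM = 2048      # content embedding dimension (from the text)
BYTES_PER_FLOAT = 4   # assumption: float32 storage

bits_per_token = math.ceil(math.log2(VOCAB_SIZE))        # bits needed to index 2048 codes
sid_bytes = math.ceil(NUM_TOKENS * bits_per_token / 8)   # bit-packed Semantic ID
dense_bytes = DENSE_DIM * BYTES_PER_FLOAT                # raw dense embedding

print(f"Semantic ID: {sid_bytes} bytes, dense embedding: {dense_bytes} bytes")
print(f"ratio: {dense_bytes / sid_bytes:.0f}x")
```

Under these assumptions a Semantic ID takes 11 bytes against 8192 bytes for the dense vector, so a watch history of a few hundred videos stays in the kilobyte range.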
3. Stage 1: RQ-VAE for Semantic IDs
Residual-Quantized Variational AutoEncoder (RQ-VAE) is a hierarchical vector quantizer. It takes a continuous vector, encodes it into a latent, then approximates that latent by a sum of codebook vectors, one per level. The indices of the chosen codebook vectors become the discrete tokens.
Architecture
The YouTube setup uses a compact encoder/decoder:
- Encoder $E$: 1-layer MLP that maps the content embedding $x \in \mathbb{R}^{2048}$ to a latent $z \in \mathbb{R}^{256}$.
- Residual quantizer: $L = 8$ levels, each with a codebook $\mathcal{C}_l = \{e^l_k\}_{k=1}^{K}$ of $K = 2048$ vectors in $\mathbb{R}^{256}$.
- Decoder $D$: 1-layer MLP that maps the quantized latent $\hat{z}$ back to $\hat{x} \in \mathbb{R}^{2048}$.
The codebooks hold $8 \times 2048 \times 256 \approx 4.2\text{M}$ parameters, and the one-layer encoder and decoder add roughly another 1M ($2 \times 2048 \times 256$ weights plus biases). That is a rounding error compared to the ranking model it feeds.
The residual quantization procedure
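Given the latent $z$, the quantizer walks down the levels, at each step picking the codebook vector closest to the current residual and subtracting it. Starting from $r_1 = z$, for $l = 1, \dots, L$:
$$c_l = \arg\min_{k} \|r_l - e^l_k\|^2, \qquad r_{l+1} = r_l - e^l_{c_l}, \qquad \hat{z} = \sum_{l=1}^{L} e^l_{c_l}$$
The Semantic ID is the index sequence $(c_1, \dots, c_L)$.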
The key word is residual. Level 1 captures the coarsest structure of $z$. Level 2 captures what level 1 missed. Level 3 refines further. By level 8, the sum $\hat{z}$ is a close approximation of $z$, and the 8 indices together specify the video at whatever granularity matters.
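The walk down the levels can be sketched on a toy example. The sizes below (3 levels, 4 codes, 2 dimensions) and the random, untrained codebooks are illustrative assumptions, so the approximation will be poor; the point is only the mechanics of picking a code and passing the leftover to the next level.

```python
import torch

torch.manual_seed(0)

# Toy sizes, not the paper's 8 levels / 2048 codes / 256 dims.
num_levels, codebook_size, dim = 3, 4, 2
codebooks = torch.randn(num_levels, codebook_size, dim)  # random, untrained codes
z = torch.randn(dim)                                      # stand-in latent

residual = z.clone()
z_hat = torch.zeros(dim)
codes = []
for level in range(num_levels):
    # Pick the code closest to what is still unexplained.
    dists = ((codebooks[level] - residual) ** 2).sum(dim=-1)  # (K,)
    idx = int(dists.argmin())
    codes.append(idx)
    z_hat = z_hat + codebooks[level, idx]        # running approximation of z
    residual = residual - codebooks[level, idx]  # what the next level must explain

print("codes:", codes)
print("leftover residual norm:", residual.norm().item())
```

By construction, `z_hat` plus the final `residual` adds back up to `z`, which is why a smaller leftover residual means a better approximation.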
The training loss
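RQ-VAE is trained end-to-end with a sum of two losses. The reconstruction loss teaches the encoder and decoder to round-trip faithfully; the quantization loss teaches the codebook and the encoder to agree on where the latent lives.
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{rq}}, \qquad \mathcal{L}_{\text{recon}} = \|x - \hat{x}\|^2, \qquad \mathcal{L}_{\text{rq}} = \sum_{l=1}^{L} \left( \beta \, \|r_l - \text{sg}[e^l_{c_l}]\|^2 + \|\text{sg}[r_l] - e^l_{c_l}\|^2 \right)$$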
where $\text{sg}[\cdot]$ is the stop-gradient operator (identity on the forward pass, zero on the backward pass) and $\beta = 0.25$. To understand why the loss is written this way, you have to think about who owns which parameters.
Animation: the two-term structure of the RQ-VAE loss. First, a single-term loss oscillates because both encoder and codebook chase each other at full speed. Then, split into two stop-gradient terms: the commitment term nudges the encoder gently (β = 0.25), and the codebook term pulls each code to the centroid of its assigned residuals (online k-means).
Why two terms instead of one?
The natural thing to write is a single term, $\|r_l - e^l_{c_l}\|^2$. That looks cleaner and says what we want: make the residual and the code agree. The problem is that this one term updates both $r_l$ (via the encoder) and $e^l_{c_l}$ (the codebook vector) by the same gradient magnitude. Both sides move toward each other at the same rate. That creates two failure modes:
- Codebook chases a moving target. The encoder is being trained jointly. Its outputs $r_l$ shift with every gradient step. If the codebook also tries to move at full speed toward $r_l$, it never settles, because $r_l$ keeps moving.
- Encoder loses its reconstruction signal. The encoder has another job: produce latents that the decoder can reconstruct from. That signal comes from $\mathcal{L}_{\text{recon}}$. If the quantization loss updates the encoder aggressively, it dominates reconstruction and the latents collapse to whatever is easiest to quantize rather than whatever is most informative.
The fix is to split the one term into two and control each side independently with stop-gradients.
Term 1: the commitment loss
$$\beta \, \|r_l - \text{sg}[e^l_{c_l}]\|^2$$
The stop-gradient freezes $e^l_{c_l}$ during backprop. Only $r_l$ gets a gradient. The encoder is being told: "the codebook is fixed. Move your output toward the code it would have chosen." The scalar $\beta = 0.25$ says: do this gently, because the encoder also needs to listen to the reconstruction loss.
This is called a "commitment" loss because it forces the encoder to commit to values near one of the existing codes. Without it, the encoder would drift away from every code (since the codebook, not the encoder, is taking most of the quantization hit), and reconstruction would rely on a lucky alignment that never stabilizes.
Term 2: the codebook loss
$$\|\text{sg}[r_l] - e^l_{c_l}\|^2$$
Now the stop-gradient is on $r_l$. Only the codebook vector $e^l_{c_l}$ gets a gradient. The codebook is being told: "the encoder's output is fixed. Move the chosen code toward it." There is no $\beta$ in front, so the codebook moves at full speed (coefficient 1).
Intuitively, this is an online k-means update. Each codebook vector is the centroid of the residuals that get assigned to it, and this loss is the quadratic pull toward that centroid.
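The k-means reading can be checked numerically. The sketch below uses made-up residuals that are all assumed to be assigned to one code, sums the codebook term over them, and compares the autograd gradient with the closed form $2n(e - \bar{r})$, where $\bar{r}$ is the centroid. The data and sizes are illustrative only.

```python
import torch

torch.manual_seed(0)

# Hypothetical residuals that were all assigned to the same code.
assigned = torch.randn(16, 4)
code = torch.zeros(4, requires_grad=True)

# Codebook term summed over the assigned residuals; detach() plays the role of sg[.].
loss = ((assigned.detach() - code) ** 2).sum()
loss.backward()

# Closed form: d/d(code) of sum_i ||r_i - code||^2 = 2 * n * (code - mean(r)).
n = assigned.size(0)
centroid = assigned.mean(dim=0)
expected_grad = 2 * n * (code.detach() - centroid)
grad_matches = torch.allclose(code.grad, expected_grad, atol=1e-5)

# A single gradient step of size 1 / (2n) lands exactly on the centroid.
with torch.no_grad():
    stepped = code - code.grad / (2 * n)

print("gradient matches closed form:", grad_matches)
print("distance to centroid after one step:", (stepped - centroid).norm().item())
```

In real training the step size is set by the optimizer, so each code drifts toward the centroid of its current assignments rather than jumping to it.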
Why $\beta = 0.25$?
$\beta$ controls how loudly the quantization error yells at the encoder, relative to the reconstruction loss. Too small and the encoder ignores the codebook, producing residuals far from any code and making quantization lossy. Too large and quantization dominates reconstruction, producing latents that are easy to quantize but uninformative.
$\beta = 0.25$ is the value from the original VQ-VAE paper (van den Oord et al., 2017), and it has been carried over largely unchanged in follow-ups such as VQ-GAN and RQ-VAE. It is one of those hyperparameters that rarely needs retuning.
What about the reconstruction loss?
You might wonder how the encoder learns to produce good latents at all, not just latents close to codes. That signal comes from $\mathcal{L}_{\text{recon}} = \|x - \hat{x}\|^2$, which flows through the decoder, through the quantized latent $\hat{z}$, and back through the encoder via the straight-through estimator. The encoder thus has two jobs: (1) make $\hat{z}$ decode back to $x$ (reconstruction), and (2) make its output $r_l$ commit to the nearest code (commitment). The $\beta = 0.25$ weighting keeps job 2 from overwhelming job 1.
How this looks in code
The PyTorch implementation in Section 5 makes all of this concrete. Look at these three lines:
```python
commitment_loss = commitment_loss + F.mse_loss(residual, e.detach())
codebook_loss = codebook_loss + F.mse_loss(e, residual.detach())
e_st = residual + (e - residual).detach()
```
The .detach() calls are exactly the stop-gradients from the math. The third line is the straight-through estimator: in the forward pass e_st == e, but gradients from the decoder flow back through residual, so the reconstruction loss reaches the encoder even though $\arg\min$ is not differentiable.
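The straight-through behavior is easy to verify in isolation. The sketch below uses two hand-picked 2-dim vectors as stand-ins for the encoder output and the chosen code, and an arbitrary upstream gradient in place of the decoder; all values are illustrative.

```python
import torch

residual = torch.tensor([1.0, 2.0], requires_grad=True)  # stand-in encoder output
e = torch.tensor([0.5, 2.5], requires_grad=True)         # stand-in chosen code

e_st = residual + (e - residual).detach()

# Forward: the value is exactly the code.
forward_matches = torch.allclose(e_st, e)

# Backward: pretend the decoder sends back an arbitrary upstream gradient.
upstream = torch.tensor([3.0, -1.0])
e_st.backward(upstream)

print(forward_matches)   # True
print(residual.grad)     # equals upstream: the gradient passes straight through
print(e.grad)            # None: no gradient reaches the code on this path
```

The code vector receives no gradient through `e_st`, which is why the separate codebook loss is needed to train it.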
Training runs on a random sample of impressed YouTube videos for tens of millions of steps, until reconstruction loss stabilizes.
4. Handling Codebook Collapse
All vector quantization methods share one nasty failure mode: codebook collapse. A small fraction of codebook vectors become "winners" early in training, get selected by most inputs, and then accumulate all the gradient. The remaining vectors are never chosen, never updated, and effectively dead. A $K = 2048$ codebook can collapse to a few dozen active codes.
The fix, borrowed from SoundStream (Zeghidour et al., 2021): at every training step, detect codebook vectors that were not used by any input in the current batch, and reset each of them to a randomly-sampled content embedding from that same batch. This keeps the codebook alive without requiring any auxiliary loss or careful warm-up schedule.
The authors report that this single trick "significantly improved the codebook utilization." For a corpus of billions of videos, it is the difference between a working quantizer and a broken one.
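One way to see whether collapse is happening is to track, per level, the fraction of codes that a batch actually uses. The helper below is a hypothetical monitoring utility, not something described in the paper; it takes the `(B, L)` integer codes that a residual quantizer emits.

```python
import torch

def codebook_utilization(codes: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Fraction of codes used at least once, per level. codes: (B, L) integers."""
    num_levels = codes.size(1)
    used_fraction = torch.zeros(num_levels)
    for level in range(num_levels):
        used = torch.unique(codes[:, level]).numel()
        used_fraction[level] = used / codebook_size
    return used_fraction

# Toy batch: level 0 uses 4 distinct codes out of 8, level 1 has collapsed to one code.
codes = torch.tensor([[0, 5], [1, 5], [2, 5], [3, 5], [0, 5]])
utilization = codebook_utilization(codes, codebook_size=8)
print(utilization)
```

On the toy batch, level 0 comes out at one half and level 1 at one eighth. With a real batch much smaller than the codebook, a single batch will understate utilization, so averaging over many batches is more informative.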
5. RQ-VAE in PyTorch
The full picture fits in around 100 lines of PyTorch. Below is a minimal but complete implementation: encoder, decoder, residual quantizer with codebook reset, the loss function, and a training step. No tricks beyond what the paper describes.
The residual quantizer
Each level holds a codebook of $K$ vectors in $\mathbb{R}^{D'}$. Given the current residual, it finds the nearest code, emits its index, and returns the quantized vector along with the loss terms for that level. The straight-through estimator copies gradients from the quantized vector back to the residual, since $\arg\min$ is not differentiable.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualQuantizer(nn.Module):
    def __init__(self, num_levels=8, codebook_size=2048, latent_dim=256, beta=0.25):
        super().__init__()
        self.num_levels = num_levels
        self.codebook_size = codebook_size
        self.beta = beta
        # One codebook per level: (L, K, D')
        self.codebooks = nn.Parameter(
            torch.randn(num_levels, codebook_size, latent_dim) * 0.01
        )

    def forward(self, z):
        # z: (B, D')
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        commitment_loss = 0.0
        codebook_loss = 0.0
        for l in range(self.num_levels):
            cb = self.codebooks[l]  # (K, D')
            # Squared distance from residual to every code: (B, K)
            dists = torch.cdist(residual, cb) ** 2
            idx = dists.argmin(dim=-1)  # (B,)
            e = cb[idx]  # (B, D')
            # VQ losses at this level
            commitment_loss = commitment_loss + F.mse_loss(residual, e.detach())
            codebook_loss = codebook_loss + F.mse_loss(e, residual.detach())
            # Straight-through: forward uses e, backward flows through residual
            e_st = residual + (e - residual).detach()
            quantized = quantized + e_st
            residual = residual - e.detach()
            codes.append(idx)
        codes = torch.stack(codes, dim=1)  # (B, L)
        rq_loss = self.beta * commitment_loss + codebook_loss
        return quantized, codes, rq_loss

    @torch.no_grad()
    def reset_dead_codes(self, z_batch):
        """Replace codes not used in this batch with random latents from the batch.

        z_batch holds encoder outputs (B, D'), the latent-space stand-in for the
        content embeddings the paper samples from.
        """
        residual = z_batch
        for l in range(self.num_levels):
            cb = self.codebooks[l]
            dists = torch.cdist(residual, cb) ** 2
            idx = dists.argmin(dim=-1)
            used = torch.zeros(self.codebook_size, dtype=torch.bool, device=z_batch.device)
            used[idx] = True
            dead = (~used).nonzero(as_tuple=True)[0]
            if len(dead) > 0:
                # Sample row indices from the batch, with replacement
                sample_idx = torch.randint(0, z_batch.size(0), (len(dead),), device=z_batch.device)
                self.codebooks[l, dead] = z_batch[sample_idx]
            # Only used (non-reset) codes are indexed here, so the reset does not affect this
            residual = residual - cb[idx]
```
Two things to watch for. First, e_st = residual + (e - residual).detach() is the straight-through estimator. In the forward pass it equals e; in the backward pass its gradient is the identity with respect to residual, bypassing the non-differentiable $\arg\min$. Second, residual = residual - e.detach() uses the detached code so that gradients from deeper levels don't leak back through the codebook of the current level.
The full RQ-VAE
The encoder and decoder here are one-layer MLPs, matching the paper's "1-layer encoder decoder model with dimension 256." In practice you would swap in whatever depth you need.
```python
class RQVAE(nn.Module):
    def __init__(self, input_dim=2048, latent_dim=256, num_levels=8,
                 codebook_size=2048, beta=0.25):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_dim)
        self.quantizer = ResidualQuantizer(
            num_levels=num_levels,
            codebook_size=codebook_size,
            latent_dim=latent_dim,
            beta=beta,
        )

    def forward(self, x):
        z = self.encoder(x)                         # (B, D')
        z_hat, codes, rq_loss = self.quantizer(z)   # (B, D'), (B, L), scalar
        x_hat = self.decoder(z_hat)                 # (B, D)
        recon_loss = F.mse_loss(x_hat, x)
        loss = recon_loss + rq_loss
        return loss, codes, x_hat

    @torch.no_grad()
    def encode_to_sid(self, x):
        """Serving path: content embedding → Semantic ID."""
        z = self.encoder(x)
        _, codes, _ = self.quantizer(z)
        return codes  # (B, L) integers
```
The training loop
Standard optimizer, plus the codebook-reset call every step. The reset runs under no_grad and edits the parameters in place, so it does not interfere with autograd.
```python
model = RQVAE(input_dim=2048, latent_dim=256, num_levels=8, codebook_size=2048)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# loader is not defined here: any iterable yielding (B, 2048) float tensors works.
for step, content_emb in enumerate(loader):  # content_emb: (B, 2048)
    loss, codes, x_hat = model(content_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Codebook collapse fix: reset unused codes from current batch
    with torch.no_grad():
        z = model.encoder(content_emb)
        model.quantizer.reset_dead_codes(z)

    if step % 1000 == 0:
        print(f"step {step} loss {loss.item():.4f}")
```
What each piece is doing
Once training converges, you freeze the model and call encode_to_sid at serving time to turn any new video's content embedding into its 8-integer Semantic ID. That sequence is what feeds into Stage 2.
6. Stage 2: Using Semantic IDs in the Ranking Model
The ranking model is a multitask neural network that predicts CTR, watch time, and other engagement targets. The baseline uses three categorical features per scoring request, all keyed by random video IDs:
- The video the user is currently watching.
- The candidate video being considered for recommendation.
- The user's watch history, represented as a sequence of video IDs.
For a corpus of O(100M) videos, the ID embedding table uses O(10M) buckets via the hashing trick. Many videos collide into the same bucket, but since IDs are random, the collisions are uniform noise that averages out.
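The hashing trick itself is a one-liner: hash the ID string and take it modulo the table size. The sketch below uses MD5 from the standard library purely for determinism; the hash function YouTube actually uses is not stated in the source, and the video IDs are made up.

```python
import hashlib

NUM_BUCKETS = 10_000_000  # O(10M) buckets, per the text

def hash_bucket(video_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an arbitrary ID string to a row of the embedding table."""
    digest = hashlib.md5(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Hypothetical IDs, for illustration only.
for vid in ["video_a", "video_b", "video_a"]:
    print(vid, hash_bucket(vid))
```

The same ID always maps to the same row, and which unrelated IDs end up sharing a row is determined by the hash alone, not by content.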
The question: what replaces the video ID? You have an 8-token Semantic ID per video. How do you turn a sequence of tokens into a single dense vector that the rest of the ranking network can consume?
The tokenization problem
The obvious idea, summing a per-level embedding, throws away the hierarchy. If your tokens are $(c_1, \dots, c_8)$ and you look up $e_1[c_1] + e_2[c_2] + \dots + e_8[c_8]$ with a separate table per level, two videos that differ only at level 8 will have nearly identical representations, and two videos that differ at level 1 will look just as similar (the other 7 of 8 contributions are identical), even though a level-1 difference means they sit in different coarse clusters. The sum treats a coarse disagreement and a fine one the same way, which is exactly the wrong behavior.
The fix is to embed combinations of consecutive tokens, not individual tokens. The token combinations are called subwords, by analogy with LLM tokenization.
7. N-gram vs. SentencePiece Tokenization
Two options for grouping tokens into subwords:
N-gram: fixed-size grouping
Split the 8-token Semantic ID into contiguous chunks of size $N$. Each unique chunk gets its own embedding.
- Unigram ($N=1$): 8 subwords per video, each from a vocabulary of $K = 2048$. Total embedding table: $8 \times 2048 = 16\text{K}$ rows.
- Bigram ($N=2$): 4 subwords per video, each from a vocabulary of $K^2 = 2048^2 \approx 4.2\text{M}$ per group. Total embedding table: $4 \times K^2 \approx 17\text{M}$ rows.
- Trigram ($N=3$): $K^3 \approx 8.6\text{B}$ rows per group. Infeasible.
The embedding table size grows as $K^N$. The authors cap $N$ at 2 for memory reasons. The final video representation is the sum of the $L/N$ chunk embeddings.
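As a concrete illustration of the bigram case, the sketch below maps each chunk of two tokens to a row index. Stacking the four per-group tables into one index space with a $K^2$ offset per group is a layout assumption made here for illustration, and the Semantic ID is made up.

```python
K = 2048   # codebook size per level (from the text)
L = 8      # Semantic ID length (from the text)
N = 2      # bigram

def bigram_rows(sid):
    """Return one row index per chunk, with a separate K**N block per chunk position."""
    assert len(sid) == L
    rows = []
    for group, start in enumerate(range(0, L, N)):
        first, second = sid[start], sid[start + 1]
        within_group = first * K + second          # 0 .. K**2 - 1
        rows.append(group * K**N + within_group)   # offset into the stacked table
    return rows

sid = [3, 7, 0, 0, 2047, 2047, 1, 2]   # made-up Semantic ID
rows = bigram_rows(sid)
total_rows = (L // N) * K**N
print(rows)
print("total rows:", total_rows)
```

The total comes to 16,777,216 rows, matching the roughly 17M figure above; the video representation would then be the sum of the four looked-up rows.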
SPM: adaptive, variable-length subwords
SentencePiece (Kudo, 2018), the same tokenizer used in T5 and ALBERT, learns a subword vocabulary of a fixed size $V$ from a corpus of token sequences. Popular token combinations (ones that occur frequently across items) get their own multi-token subword. Rare combinations fall back to shorter subwords or unigrams.
You train SPM once on a large sample of video Semantic IDs. Given a new Semantic ID $(c_1, \dots, c_8)$, SPM returns a segmentation, for example $[(c_1, c_2, c_3), (c_4, c_5), (c_6), (c_7, c_8)]$. Each segment maps to a row in a single shared embedding table of size $V$. The video representation is the sum of the segment embeddings.
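The lookup path after segmentation can be sketched without the SentencePiece library. The code below is a simplified stand-in: it uses greedy longest-match against a small hand-written vocabulary, which is not how SentencePiece chooses segmentations, and the vocabulary, token values, and shortened Semantic ID are all hypothetical.

```python
# Hypothetical learned vocabulary: subword (tuple of tokens) -> embedding row.
vocab = {
    (12, 7, 301): 0,
    (12, 7): 1,
    (44, 9): 2,
    (12,): 3, (7,): 4, (301,): 5, (44,): 6, (9,): 7, (5,): 8,
}
MAX_LEN = max(len(piece) for piece in vocab)

def segment(sid):
    """Greedy longest-match segmentation of a Semantic ID into known subwords."""
    pieces, i = [], 0
    while i < len(sid):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(MAX_LEN, len(sid) - i), 0, -1):
            piece = tuple(sid[i:i + length])
            if piece in vocab:
                pieces.append(piece)
                i += length
                break
        else:
            raise KeyError(f"token {sid[i]} is not in the vocabulary")
    return pieces

sid = [12, 7, 301, 44, 9, 5]         # shortened, made-up Semantic ID
pieces = segment(sid)
rows = [vocab[p] for p in pieces]    # rows to look up and sum in the shared table
print(pieces, rows)
```

Here six tokens collapse into three lookups because two of the combinations are in the vocabulary; a Semantic ID made only of rare tokens would need one lookup per token.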
Direct comparison
| Aspect | N-gram | SentencePiece |
|---|---|---|
| Subword length | Fixed ($N$) | Variable (1 to $L$) |
| Vocabulary | $K^N$ per group | Any fixed $V$ |
| Embedding tables | $L/N$ separate tables | One shared table |
| Lookups per video | $L/N$ (fixed) | Adaptive; fewer for popular combos |
| Adapts to data distribution | No | Yes |
SPM's adaptivity is not just elegant; it is also efficient in production. Videos whose token combinations are common get represented by one or two long subwords (fewer lookups), while videos with rare combinations fall back to many short subwords. The average lookup count is comparable to N-gram, but the head of the distribution gets faster serving.
8. Experimental Results
Setup
Training data: sequentially-logged YouTube engagement data over $N$ days. Evaluation is on day $N+1$, which has not been seen during training. Two metrics:
- CTR AUC: overall click-through-rate AUC on day $N+1$. Measures generalization under distribution shift.
- CTR/1D AUC: same metric but sliced to items that were introduced on day $N+1$. Pure cold-start evaluation.
A 0.1% change in CTR AUC is significant in this production system. Two ranking-model variants are studied: one without user history (two item features), and one with user history (three item features).
Baselines and key findings
Four configurations were compared:
- Random Hashing: the original baseline. Video IDs hashed into 10M buckets.
- Dense Input: replace the ID embedding with the raw 2048-dim content embedding. Same model architecture.
- Dense Input + 1.5x / 2x layers: same as above but deeper MLP, to compensate for lost memorization capacity.
- Semantic ID (Unigram, Bigram, SPM): the proposed approach.
Three findings stand out.
Finding 1: Dense content embeddings alone hurt quality. Directly swapping video IDs for 2048-dim content embeddings causes CTR AUC to drop below the random-hashing baseline, on both overall and cold-start slices. The content embedding is a fixed input feature, so the model loses its per-item memorization capacity. Making the MLP 1.5x or 2x deeper partially recovers quality, but at a serving cost that is not acceptable in production.
Finding 2: Semantic IDs beat random hashing on cold-start, and match or beat it overall. When user history is included as a feature (the realistic setting), both Bigram-SID and SPM-SID outperform the random-hashing baseline on CTR AUC, and substantially outperform it on CTR/1D AUC. The Dense Input with 2x layers gets similar cold-start gains, but SIDs match those gains without the 2x compute.
Finding 3: SPM dominates N-gram for large embedding tables. At small table sizes (below roughly $8K$ or $4K^2$ rows), the fixed N-gram structure has a slight edge: it forces the tiny vocabulary to cover simple token patterns. At the larger table sizes that production ranking models actually use, SPM's adaptive vocabulary wins decisively, especially on cold-start CTR/1D AUC. Popular token combinations get their own embedding slot; rare ones fall back gracefully.
Stability of Semantic IDs over time
One concern with a frozen Stage 1 model: the video corpus drifts. Movies from 2026 may not look like movies from 2025 in the feature space. To test this, the authors trained two RQ-VAE models on data six months apart (RQ-VAEv0 and RQ-VAEv1) and evaluated the ranking model trained on the most recent engagement data using Semantic IDs from each.
Both versions give comparable ranking performance. The semantic token space is stable enough that you do not need to retrain the quantizer frequently. In practice, this means Stage 1 is a rare batch job, and Stage 2 is a continuous online job.
9. Hierarchy: What the Tokens Actually Capture
The authors include a simple but illuminating analysis. For each pair of videos that share a Semantic ID prefix of length $n$, compute their average pairwise cosine similarity in the original content embedding space. If the hierarchy is meaningful, longer shared prefixes should mean more similar videos.
| Shared prefix length | Avg. pairwise cosine similarity | Typical sub-trie size |
|---|---|---|
| 1 | 0.41 | 150,000-450,000 videos |
| 2 | 0.68 | 20-150 videos |
| 3 | 0.91 | 1-5 videos |
| 4 | 0.97 | 1 video |
The progression is clean. A length-1 prefix groups hundreds of thousands of videos at cosine similarity 0.41 (loose topical affinity). A length-4 prefix typically uniquely identifies a video at cosine similarity 0.97. Every additional token narrows the cluster by roughly one to two orders of magnitude.
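The analysis is straightforward to reproduce on your own catalog. The sketch below is one possible reading of the procedure: it groups items by their first `prefix_len` tokens and averages cosine similarity over all pairs within each group. The toy Semantic IDs and 2-dim embeddings are made up, and the paper's exact sampling of pairs may differ.

```python
import itertools
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def avg_similarity_by_prefix(sids, embeddings, prefix_len):
    """Average cosine similarity over all pairs that share the first prefix_len tokens."""
    groups = defaultdict(list)
    for sid, emb in zip(sids, embeddings):
        groups[tuple(sid[:prefix_len])].append(emb)
    sims = [
        cosine(u, v)
        for members in groups.values()
        for u, v in itertools.combinations(members, 2)
    ]
    return sum(sims) / len(sims) if sims else float("nan")

# Made-up toy data: three items, two of which share a length-2 prefix.
sids = [(1, 4, 9), (1, 4, 2), (1, 7, 3)]
embeddings = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
for n in (1, 2):
    print(n, avg_similarity_by_prefix(sids, embeddings, n))
```

On the toy data the average rises from one third at prefix length 1 to 1.0 at prefix length 2, the same direction as the table, though the numbers themselves mean nothing.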
The paper shows example sub-tries where the hierarchy aligns with human-understandable concepts: a sports cluster with sub-clusters for different sports; a food-vlogging cluster with sub-clusters for different cuisines. The model has learned a semantic trie over the catalog, for free, from a generic pretrained video encoder and a vector-quantization objective.
10. Practical Notes
Start with the right content encoder. The quality of the Semantic ID hierarchy is bounded by the quality of the content embedding going into RQ-VAE. YouTube uses a VideoBERT-based transformer trained on audio and visual features using graph clustering. If your content encoder doesn't capture the distinctions that matter for your task, no amount of quantization will fix it.
The RQ-VAE is small, but codebook collapse will ruin it. The codebook-reset trick (replacing unused codes with sampled content embeddings every step) is not optional. Skip it and you will end up with a codebook that collapses to a handful of clusters, making the whole Semantic ID pipeline degenerate to a poor hash.
SentencePiece, not N-gram, for production-scale embedding tables. N-gram is simpler and works for small-scale systems. Once your embedding budget is in the millions of rows, SPM's adaptive vocabulary wins on every axis: quality, memory efficiency, lookup count for head items. The infrastructure cost of adding SPM to your Stage-2 pipeline is small.
Don't replace random hashing, augment it, if you can afford both. The paper frames this as a replacement, and for YouTube it works. But if you have the capacity, the Semantic ID tokens and the random-hashed ID can be used as parallel features: the SID gives you generalization and cold-start, the ID table gives you extra memorization for head items beyond what the SPM subwords can hold.
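A minimal sketch of the parallel-feature idea, assuming the two embeddings are simply concatenated before the rest of the ranking network. The class name, table sizes, and the choice of concatenation are all hypothetical; the paper does not describe this setup.

```python
import torch
import torch.nn as nn

class ItemFeature(nn.Module):
    """Hypothetical item tower: Semantic ID subwords plus a hashed-ID embedding."""

    def __init__(self, sid_vocab=1000, id_buckets=5000, dim=16):
        super().__init__()
        # Sums the subword embeddings of each item (one bag per row of the input).
        self.sid_table = nn.EmbeddingBag(sid_vocab, dim, mode="sum")
        # One row per hash bucket of the random video ID.
        self.id_table = nn.Embedding(id_buckets, dim)

    def forward(self, sid_rows, id_bucket):
        # sid_rows: (B, S) subword row indices; id_bucket: (B,) hashed ID rows
        sid_emb = self.sid_table(sid_rows)   # (B, dim)
        id_emb = self.id_table(id_bucket)    # (B, dim)
        return torch.cat([sid_emb, id_emb], dim=-1)  # (B, 2 * dim)

feature = ItemFeature()
sid_rows = torch.tensor([[3, 17, 256], [3, 18, 999]])  # toy subword rows
id_bucket = torch.tensor([42, 4321])                    # toy hash buckets
out = feature(sid_rows, id_bucket)
print(out.shape)
```

This version assumes a fixed number of subwords per item; variable-length SPM segmentations would need the `offsets` form of `nn.EmbeddingBag` or padding.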
$L = 8$ and $K = 2048$ are good starting points, not magic numbers. The right depth is "enough levels to uniquely identify most items at the deepest prefix length." For YouTube's billions of videos, a length-4 prefix already offers $2048^4 \approx 1.76 \times 10^{13}$ possible codes, far more than the catalog size. Length 8 gives headroom for continued growth. Smaller catalogs can use shorter Semantic IDs.
The ranking model still learns everything about the world from data. Semantic IDs do not inject any supervised knowledge about what is "good." They inject similarity structure, letting the ranking model share engagement signal across similar items. The quality signal itself still comes from logged clicks and watch time. If your engagement data is biased, Semantic IDs will propagate that bias across clusters faster, not eliminate it.