Positional Encodings for LLMs: From Sinusoidal to RoPE
How transformers understand token order, from the original sinusoidal scheme to Rotary Position Embeddings (RoPE).
1. The Problem: Transformers Have No Sense of Order
Consider the sentences "the cat sat on the mat" and "the mat sat on the cat." They contain the exact same tokens. A transformer's self-attention computes pairwise dot products between all token embeddings, and that computation is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, with nothing recording which ordering was the original. Without any positional information, the model cannot tell the two sentences apart.
This is not a minor detail. Word order is meaning. "Dog bites man" and "man bites dog" are very different stories. If we want a transformer to understand language, we need to inject information about where each token sits in the sequence.
The standard solution: add a positional encoding vector to each token embedding before it enters the transformer. The encoding is a function of position only, and it injects enough structure that the model can recover token order from the modified embeddings.
2. Sinusoidal Positional Encoding
The original "Attention Is All You Need" paper (Vaswani et al., 2017) proposed a simple, elegant scheme: encode each position as a vector of sine and cosine values at different frequencies.
Given a token at position $\text{pos}$ in the sequence, the positional encoding $\mathbf{PE}$ has the same dimensionality $d$ as the token embedding. Each dimension pair $i$ of the encoding uses a sinusoid at a different frequency:

$$\mathbf{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \qquad \mathbf{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$
Where:
- $\text{pos}$ is the token's position in the sequence (0, 1, 2, ...)
- $i$ is the dimension index, ranging from $0$ to $d/2 - 1$
- $d$ is the embedding dimension (e.g., 512 or 768)
- $10000$ is the base of the geometric frequency progression (chosen empirically)
The final input to the transformer is the element-wise sum of the token embedding and the positional encoding:

$$\mathbf{x}_{\text{pos}} = \mathbf{e}_{\text{pos}} + \mathbf{PE}(\text{pos})$$

where $\mathbf{e}_{\text{pos}}$ is the embedding of the token at position $\text{pos}$.
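The two formulas above can be sketched in a few lines of NumPy (a minimal version for illustration; real implementations precompute this once and store it as a buffer):

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d: int) -> np.ndarray:
    """Return a (num_positions, d) matrix of sinusoidal positional encodings."""
    positions = np.arange(num_positions)[:, None]   # (pos, 1)
    i = np.arange(d // 2)[None, :]                  # (1, d/2)
    angles = positions / (10000 ** (2 * i / d))     # (pos, d/2)
    pe = np.empty((num_positions, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(2048, 512)   # one encoding row per position
```

Adding `pe[t]` to the embedding of the token at position `t` gives the transformer's input.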
What the encoding looks like
Think of each dimension as a clock hand ticking at a different speed. Dimension 0 (low $i$) oscillates rapidly, changing its value at every position. Dimension $d-1$ (high $i$) oscillates extremely slowly, barely changing across the entire sequence.
For a model with $d = 512$:
- Dimension 0 ($i=0$): period = $2\pi \approx 6.3$ positions. One full cycle every ~6 tokens.
- Dimension 256 ($i=128$): period = $2\pi \cdot 10000^{256/512} = 2\pi \cdot 100 \approx 628$ positions.
- Dimension 510 ($i=255$): period = $2\pi \cdot 10000^{510/512} \approx 60{,}000$ positions.
This creates a spectrum of frequencies spanning several orders of magnitude. The low dimensions encode fine-grained, local position information ("am I at position 5 or 6?"). The high dimensions encode coarse, global position information ("am I near the beginning or the end?").
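The periods quoted above follow directly from the frequency formula, which is easy to check:

```python
import math

d = 512

def period(i: int) -> float:
    """Period (in positions) of the sinusoid for dimension pair i."""
    omega = 1.0 / (10000 ** (2 * i / d))   # frequency for pair i
    return 2 * math.pi / omega

print(period(0))    # ~6.3 positions
print(period(128))  # ~628 positions
print(period(255))  # on the order of 60,000 positions
```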
Interactive: Sinusoidal Encoding Heatmap
Each cell shows the PE value at a given position (x-axis) and dimension (y-axis). Drag the position slider to see the encoding vector for that position highlighted. Notice how low dimensions oscillate fast and high dimensions oscillate slowly.
3. Why Sinusoids Work
The choice of sine and cosine was not arbitrary. These functions have a critical property that makes them ideal for positional encoding: the encoding of any position can be expressed as a linear transformation of the encoding at any other position.
The relative position property
For any fixed offset $k$, there exists a rotation matrix $M_k$ such that:

$$\mathbf{PE}(\text{pos} + k) = M_k \, \mathbf{PE}(\text{pos}), \quad \text{with } 2 \times 2 \text{ blocks } \begin{pmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{pmatrix} \text{ acting on each pair } (2i, 2i+1)$$
Where $\omega_i = 1/10000^{2i/d}$ is the frequency for dimension pair $(2i, 2i+1)$.
This is the standard sine/cosine angle addition identity. The matrix $M_k$ depends only on the offset $k$, not on the absolute position. This means the model can learn to attend to "the token 3 positions ago" through a single linear operation, regardless of where in the sequence it currently is.
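The identity is easy to verify numerically. A sketch, checking one $2 \times 2$ dimension pair (the helper names `pe_pair` and `M` are illustrative, not from any library):

```python
import numpy as np

def pe_pair(pos: float, omega: float) -> np.ndarray:
    """The (sin, cos) values of one dimension pair at a given position."""
    return np.array([np.sin(pos * omega), np.cos(pos * omega)])

def M(k: float, omega: float) -> np.ndarray:
    """Rotation-style matrix that shifts the encoding by offset k."""
    c, s = np.cos(k * omega), np.sin(k * omega)
    return np.array([[c, s], [-s, c]])

omega = 1.0 / 10000 ** (2 * 3 / 512)   # frequency for pair i=3, d=512
for pos in (0, 17, 1000):
    # The same M(5, omega) works at every absolute position.
    assert np.allclose(M(5, omega) @ pe_pair(pos, omega),
                       pe_pair(pos + 5, omega))
```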
Other desirable properties
- Unique encodings: Every position gets a distinct vector (the combination of waves at different frequencies creates a unique fingerprint up to very long sequences).
- Bounded values: All values lie in $[-1, 1]$, so the positional signal does not dominate the token embedding.
- No learned parameters: The encoding is deterministic, meaning it adds zero trainable parameters and works out of the box.
- Smooth interpolation: Nearby positions have similar encodings (small rotation), so the model gets a natural notion of proximity.
4. Limitations of Additive Positional Encodings
The sinusoidal scheme works, but it has a fundamental design issue: positional information and semantic content are mixed together by addition before the transformer ever sees them.
There are three concrete problems:
1. Position information decays through layers. The positional signal is added once, at the input. As information flows through successive attention and feed-forward layers, the positional signal gets progressively diluted. By the upper layers, the model may struggle to determine relative positions precisely.
2. Fixed sequence length at training time. The encoding itself can generate vectors for any position, but the model only sees positions 0 through $L_{\text{train}}-1$ during training. At inference, if you feed position 5000 to a model trained with $L_{\text{train}} = 2048$, the model has never learned to interpret those encoding values. Generalization beyond the training length is unreliable.
3. Attention does not natively see relative position. When computing $\mathbf{q}_m^T \mathbf{k}_n$ (the attention score between positions $m$ and $n$), the dot product sees the sum of token and position information at both positions. The relative position $m - n$ is buried inside this computation, not made explicit. The model can recover it (thanks to the linear-transform property), but it must learn to do so.
These limitations motivated a new question: instead of adding position to the embedding, can we inject position directly into the attention computation, in a way that naturally encodes relative position?
5. Rotary Position Embeddings (RoPE)
RoPE (Su et al., 2021) solves the limitations above with a single elegant idea: encode position by rotating the query and key vectors in attention, so that their dot product naturally depends on relative position.
The core idea
Instead of adding a positional vector to the token embedding, RoPE applies a position-dependent rotation to each query and key vector, right before the attention dot product. The rotation angle is proportional to the position.
Start with 2D to build intuition. Suppose our query and key vectors are 2-dimensional. RoPE rotates the query at position $m$ by angle $m\theta$ and the key at position $n$ by angle $n\theta$:

$$\mathbf{q}_m^{\text{rot}} = R(m\theta)\,\mathbf{q}, \qquad \mathbf{k}_n^{\text{rot}} = R(n\theta)\,\mathbf{k}, \qquad R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$$
Now compute the attention score:

$$(\mathbf{q}_m^{\text{rot}})^T\,\mathbf{k}_n^{\text{rot}} = \mathbf{q}^T R(m\theta)^T R(n\theta)\,\mathbf{k} = \mathbf{q}^T R\big((n-m)\theta\big)\,\mathbf{k}$$
Two properties of rotation matrices make this work: (1) the transpose is the inverse, so $R(\alpha)^T = R(-\alpha)$, and (2) successive rotations add their angles, so $R(-\alpha)\,R(\beta) = R(\beta - \alpha)$. Together: $R(m\theta)^T R(n\theta) = R(-m\theta)\,R(n\theta) = R\big((n-m)\theta\big)$. The absolute positions $m$ and $n$ disappear, and the dot product depends only on the relative position $n - m$.
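A quick numerical check of this identity, with arbitrary 2D vectors and an arbitrary $\theta$:

```python
import numpy as np

def R(alpha: float) -> np.ndarray:
    """2D rotation matrix by angle alpha."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s], [s, c]])

theta = 0.1
q = np.array([1.3, -0.7])
k = np.array([0.5, 2.0])

# Scores for (m, n) = (10, 14) and (110, 114): same gap n - m = 4.
s1 = (R(10 * theta) @ q) @ (R(14 * theta) @ k)
s2 = (R(110 * theta) @ q) @ (R(114 * theta) @ k)
assert np.isclose(s1, s2)                       # depends only on n - m
assert np.isclose(s1, q @ R(4 * theta) @ k)     # equals q^T R((n-m)θ) k
```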
Interactive: RoPE Rotation in 2D
Drag the position sliders to move the query (blue) and key (green) vectors. The dot product depends only on the relative distance (m - n), shown in amber. Try keeping the gap constant while moving both positions.
Scaling to d dimensions
Real query/key vectors have $d$ dimensions (e.g., 64 or 128 per attention head). RoPE pairs them up: $(d_0, d_1)$, $(d_2, d_3)$, ..., $(d_{d-2}, d_{d-1})$, giving $d/2$ pairs. Each pair gets rotated independently, using a different frequency:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$
These are the same frequencies as the sinusoidal encoding. The rotation for dimension pair $(2i, 2i+1)$ at position $m$ is:

$$\begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$
The full rotation is a block-diagonal matrix with $d/2$ independent $2 \times 2$ rotation blocks:

$$R_m = \begin{pmatrix} \cos m\theta_0 & -\sin m\theta_0 & & & \\ \sin m\theta_0 & \cos m\theta_0 & & & \\ & & \ddots & & \\ & & & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ & & & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \end{pmatrix}$$
Efficient implementation
You never actually construct the rotation matrix. Each $2 \times 2$ rotation decomposes into element-wise multiplications:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} x_{2i}\cos(m\theta_i) - x_{2i+1}\sin(m\theta_i) \\ x_{2i}\sin(m\theta_i) + x_{2i+1}\cos(m\theta_i) \end{pmatrix}$$
In code, this is two element-wise multiplies and one addition per dimension pair. The $\cos(m\theta_i)$ and $\sin(m\theta_i)$ values can be precomputed for all positions and cached as a table. The per-token cost is negligible compared to the attention computation itself.
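Here is a minimal NumPy sketch of the cached-table scheme. Interleaved even/odd pairing is assumed; some implementations instead pair the first half of the dimensions with the second half, so check the convention before comparing against a real model:

```python
import numpy as np

def rope_cache(max_pos: int, d: int, base: float = 10000.0):
    """Precompute cos/sin tables of shape (max_pos, d/2)."""
    theta = base ** (-2 * np.arange(d // 2) / d)        # (d/2,)
    angles = np.arange(max_pos)[:, None] * theta[None]  # (max_pos, d/2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x: np.ndarray, pos: int, cos: np.ndarray, sin: np.ndarray):
    """Rotate a (d,) query or key vector to position `pos`."""
    x_even, x_odd = x[0::2], x[1::2]
    c, s = cos[pos], sin[pos]
    out = np.empty_like(x)
    out[0::2] = x_even * c - x_odd * s   # two multiplies + one add per pair
    out[1::2] = x_even * s + x_odd * c
    return out

cos, sin = rope_cache(max_pos=4096, d=64)
q = np.random.default_rng(0).normal(size=64)
k = np.random.default_rng(1).normal(size=64)

# The attention score depends only on the relative position (gap of 4):
s1 = apply_rope(q, 3, cos, sin) @ apply_rope(k, 7, cos, sin)
s2 = apply_rope(q, 103, cos, sin) @ apply_rope(k, 107, cos, sin)
assert np.isclose(s1, s2)
```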
6. Why RoPE Works So Well
RoPE has become the default positional encoding for modern LLMs (LLaMA, Mistral, Qwen, Gemma, and many others). Here is why.
Relative position is built in
As shown above, the dot product $(\mathbf{q}_m^{\text{rot}})^T \mathbf{k}_n^{\text{rot}}$ depends only on $n - m$. The model does not need to learn to extract relative position from mixed signals. It gets relative position for free.
Position information is injected at every layer
RoPE applies the rotation in every attention layer, after the query and key projections. This means positional information is freshly injected at every layer, unlike additive encodings that only inject once at the input.
Semantic and positional information stay separate
The token embedding is never modified. Position is encoded purely through the rotation of queries and keys. The value vectors $\mathbf{v}$ are not rotated at all, meaning the content that gets aggregated by attention remains purely semantic.
Long-range decay
An important empirical property: the dot product between rotated queries and keys tends to decrease as $|m - n|$ increases. The high-frequency dimension pairs oscillate rapidly, and their contributions average toward zero for large relative distances. This gives RoPE a natural inductive bias toward local attention, without hard-coding a window.
Interactive: Long-Range Decay
This curve isolates the positional effect by setting $\mathbf{q} = \mathbf{k}$ (identical content). It computes $\hat{\mathbf{q}}^T R\big((n{-}m)\theta\big) \hat{\mathbf{q}}$: what happens to the dot product purely due to the rotation as distance grows? At distance 0 it is 1.0; it decays as tokens move apart. In a real model, content similarity also matters, so actual attention is this positional baseline modulated by how related the tokens are. Try the sliders: more dimensions make the drop-off steeper (nearby tokens dominate more), while a higher base frequency stretches the curve to the right (the model distinguishes positions over a longer range).
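The decay curve can be reproduced in a few lines. For a unit-norm vector with equal mass in every dimension pair (a simplifying assumption made here for clarity), the positional score reduces to the mean of $\cos\big((n-m)\theta_i\big)$ over pairs:

```python
import numpy as np

def decay_curve(d: int = 64, base: float = 10000.0, max_dist: int = 256):
    """Positional dot product q^T R(Δθ) q for q = k with equal pair norms."""
    theta = base ** (-2 * np.arange(d // 2) / d)   # (d/2,)
    dists = np.arange(max_dist)[:, None]           # (max_dist, 1)
    # Rotating a pair (a, b) by φ and dotting with itself gives (a²+b²)cos φ,
    # so with equal mass per pair the score is the mean of cos(Δ·θ_i).
    return np.cos(dists * theta[None]).mean(axis=1)

curve = decay_curve()
assert np.isclose(curve[0], 1.0)   # distance 0: no rotation, score is 1
assert curve[10] < curve[0]        # nearby-but-nonzero distances already lower
```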
7. Comparison: Sinusoidal vs Learned vs RoPE
| Property | Sinusoidal (additive) | Learned (additive) | RoPE (rotary) |
|---|---|---|---|
| How it works | Add fixed sin/cos vector to embedding | Add learned vector per position to embedding | Rotate query/key vectors before attention |
| Encodes relative position? | Implicitly (via linear transform property) | No (purely absolute) | Yes (built into dot product) |
| Extra parameters | 0 | $L \times d$ (one vector per position) | 0 |
| Position info in upper layers | Diluted (added once at input) | Diluted (added once at input) | Fresh (applied at every layer) |
| Extrapolation beyond $L_{\text{train}}$ | Poor | None (undefined) | Moderate (improved with RoPE scaling) |
| Mixes with content? | Yes (added to embedding) | Yes (added to embedding) | No (rotates Q/K only; V untouched) |
| Used by | Original Transformer, BERT | GPT-2, BERT (optional) | LLaMA, Mistral, Qwen, Gemma, etc. |
8. Practical Notes
Extending context length with RoPE scaling
RoPE's biggest practical advantage: you can extend the context length beyond what was used during training by modifying the base frequency. Two common approaches:
- NTK-aware scaling (Code Llama, many fine-tunes): increase the base from 10,000 to a larger value (e.g., 500,000 or 1,000,000). This stretches all the rotation frequencies, so positions that were at the edge of the training distribution now fall within it. Typically requires a small amount of fine-tuning on longer sequences.
- YaRN (Yet another RoPE extensioN): applies different scaling factors to different frequency bands. Low frequencies (already slow) get minimal scaling; high frequencies (which wrap around harmlessly) also get minimal scaling. The mid-range frequencies that would extrapolate worst get the most aggressive scaling.
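In code, changing the base is a one-line change to the frequency computation (a sketch; the exact scaled base value depends on the method and target length):

```python
import numpy as np

def rope_frequencies(d: int, base: float) -> np.ndarray:
    """Per-pair rotation frequencies theta_i = base^(-2i/d)."""
    return base ** (-2 * np.arange(d // 2) / d)

orig = rope_frequencies(64, base=10_000.0)
scaled = rope_frequencies(64, base=500_000.0)   # e.g. the LLaMA 3 setting
assert np.all(scaled[1:] < orig[1:])   # every non-trivial rotation slows down
```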
Where RoPE is applied in the model
RoPE is applied after the linear projections that produce Q and K, but before the dot product. It is not applied to the value (V) vectors. In multi-head attention, the rotation is applied independently within each head, using the head dimension $d_h$ (not the full model dimension $d_{\text{model}}$).
Base frequency matters
The base frequency $b$ (default 10,000) controls the spectrum of rotation speeds. Larger base values make all rotations slower: nearby positions become slightly harder to tell apart, but the rotation angles stay unambiguous over a much longer range, so distant positions remain distinguishable. LLaMA 3 uses $b = 500{,}000$, which enables its 128K context window. The optimal base depends on the target context length.
RoPE and KV cache
During autoregressive generation, you cache the key and value vectors for past tokens (the KV cache). Since RoPE is applied to keys before caching, the cached key vectors already contain the correct positional rotation. You do not need to re-rotate them when computing attention for new tokens.
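A toy generation loop illustrating this (NumPy only, with an illustrative `rotate` helper): each key is rotated once at its own position and cached; new queries attend over the cache with no re-rotation.

```python
import numpy as np

def rotate(x: np.ndarray, pos: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Apply the RoPE rotation for position `pos` to a (d,) vector."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

d = 64
rng = np.random.default_rng(0)
k_cache = []                            # stores already-rotated keys
for pos in range(5):                    # "generating" 5 tokens
    k = rng.normal(size=d)              # key projection for this token
    k_cache.append(rotate(k, pos, d))   # rotate once, cache forever

# A new query at position 5 attends over cached keys with no re-rotation:
q = rotate(rng.normal(size=d), 5, d)
scores = np.array([q @ k for k in k_cache])
assert scores.shape == (5,)
```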
References
- Vaswani et al. (2017). Attention Is All You Need. The original transformer paper introducing sinusoidal positional encodings.
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. The RoPE paper.
- Press et al. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ALiBi, an alternative relative position approach.
- bloc97 (2023). NTK-Aware Scaled RoPE. The NTK-aware frequency scaling method.
- Peng et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models. Frequency-band-aware RoPE scaling.
Cite this post:
@article{sedhain2026positional,
title = {Positional Encodings for LLMs: From Sinusoidal to RoPE},
author = {Sedhain, Suvash},
journal = {ssedhain.com},
year = {2026},
month = {Mar},
url = {https://mesuvash.github.io/blog/2026/positional-encodings/}
}