How transformers understand token order, from the original sinusoidal scheme to Rotary Position Embeddings (RoPE).

1. The Problem: Transformers Have No Sense of Order

Consider the sentences "the cat sat on the mat" and "the mat sat on the cat." They contain the exact same tokens. A transformer's self-attention computes pairwise dot products between all token embeddings, and those dot products depend only on which tokens are present, not on where they sit: reordering the input merely reorders the outputs. Without any positional information, the two sentences are indistinguishable to the model.

This is not a minor detail. Word order is meaning. "Dog bites man" and "man bites dog" are very different stories. If we want a transformer to understand language, we need to inject information about where each token sits in the sequence.

(Figure: without positional encoding, sentence A "the cat sat on the mat" and sentence B "the mat sat on the cat" are the same set of tokens and look identical to the transformer. We need a way to make position 2 "feel different" from position 6, so the model knows cat-then-mat ≠ mat-then-cat.)

The standard solution: add a positional encoding vector to each token embedding before it enters the transformer. The encoding is a function of position only, and it injects enough structure that the model can recover token order from the modified embeddings.

2. Sinusoidal Positional Encoding

The original "Attention Is All You Need" paper (Vaswani et al., 2017) proposed a simple, elegant scheme: encode each position as a vector of sine and cosine values at different frequencies.

Given a token at position $\text{pos}$ in the sequence, the positional encoding $\mathbf{PE}$ has the same dimensionality $d$ as the token embedding. Each dimension $i$ of the encoding uses a sinusoid at a different frequency:

Sinusoidal positional encoding:
$$\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$ $$\text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

Where $\text{pos}$ is the token's position in the sequence (0, 1, 2, ...), $i$ indexes dimension pairs so that $2i$ and $2i+1$ together range over all $d$ embedding dimensions, and the constant 10,000 sets the base of the geometric frequency spectrum.

The final input to the transformer is the element-wise sum of the token embedding and the positional encoding:

Input to transformer:
$$\mathbf{x}_{\text{pos}} = \mathbf{e}_{\text{token}} + \mathbf{PE}(\text{pos})$$
(Figure: the token embedding $\mathbf{e}_{\text{token}}$ and the positional encoding $\mathbf{PE}(\text{pos})$ are summed into the position-aware embedding $\mathbf{x}_{\text{pos}}$, which feeds into the transformer layers.)
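The two formulas can be evaluated for every position at once. Here is a minimal NumPy sketch (the function name `sinusoidal_pe` is illustrative, not from any library):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    """Build the sinusoidal positional encoding table, shape (seq_len, d)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2) -- dimension-pair index
    freqs = 1.0 / (10000 ** (2 * i / d))     # one frequency per pair
    angles = pos * freqs                     # (seq_len, d/2)
    pe = np.empty((seq_len, d))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d=512)
# The model input is then: x = token_embedding + pe[position]
```

Each row of the table is one position's encoding vector, ready to be added element-wise to the token embedding.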

What the encoding looks like

Think of each dimension as a clock hand ticking at a different speed. Dimension 0 (low $i$) oscillates rapidly, changing its value at every position. Dimension $d-1$ (high $i$) oscillates extremely slowly, barely changing across the entire sequence.

For a model with $d = 512$: the fastest pair ($i = 0$) completes a full cycle every $2\pi \approx 6.3$ positions, while the slowest pair ($i = 255$) has a wavelength close to $10000 \cdot 2\pi \approx 63{,}000$ positions, far longer than any sequence the model will see.

This creates a spectrum of frequencies spanning several orders of magnitude. The low dimensions encode fine-grained, local position information ("am I at position 5 or 6?"). The high dimensions encode coarse, global position information ("am I near the beginning or the end?").

Intuition: Sinusoidal positional encodings work like a binary counter, but with smooth waves instead of sharp bits. In a binary number, the least significant bit flips every step, the next bit flips every 2 steps, the next every 4, and so on. The sinusoidal encoding does the same thing with smooth sinusoids at geometrically spaced frequencies. Each position gets a unique "fingerprint" from the combination of all these waves, just like each integer has a unique binary representation.
(Figure: sinusoidal PE over positions 0 to 40; dimension 0 oscillates fast, dimension 4 at medium speed, dimension 8 slowly.)

Interactive: Sinusoidal Encoding Heatmap

Each cell shows the PE value at a given position (x-axis) and dimension (y-axis). Drag the position slider to see the encoding vector for that position highlighted. Notice how low dimensions oscillate fast and high dimensions oscillate slowly.


3. Why Sinusoids Work

The choice of sine and cosine was not arbitrary. These functions have a critical property that makes them ideal for positional encoding: the encoding of any position can be expressed as a linear transformation of the encoding at any other position.

The relative position property

For any fixed offset $k$, there exists a rotation matrix $M_k$ such that:

Relative position as linear transform:
$$\begin{bmatrix} \text{PE}(\text{pos}+k, 2i) \\ \text{PE}(\text{pos}+k, 2i+1) \end{bmatrix} = \begin{bmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{bmatrix} \begin{bmatrix} \text{PE}(\text{pos}, 2i) \\ \text{PE}(\text{pos}, 2i+1) \end{bmatrix}$$

Where $\omega_i = 1/10000^{2i/d}$ is the frequency for dimension pair $(2i, 2i+1)$.

This is the standard sine/cosine angle addition identity. The matrix $M_k$ depends only on the offset $k$, not on the absolute position. This means the model can learn to attend to "the token 3 positions ago" through a single linear operation, regardless of where in the sequence it currently is.

Intuition: Think of each pair of dimensions as a 2D clock face. The positional encoding places a point on this clock. Moving forward by $k$ positions rotates that point by a fixed angle. The angle depends on $k$ and the frequency of that dimension pair, but not on the absolute position. So the "distance" between position 5 and 8 looks the same as between position 100 and 103: a rotation by 3 steps.
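The rotation-matrix identity is easy to check numerically. A small NumPy sketch with a toy dimensionality $d = 8$ (the name `offset_matrix` is illustrative) builds the block-diagonal $M_k$ and verifies that it maps $\mathbf{PE}(\text{pos})$ to $\mathbf{PE}(\text{pos}+k)$ at any absolute position:

```python
import numpy as np

d = 8

def pe(pos):
    """Sinusoidal encoding of a single position, shape (d,)."""
    i = np.arange(d // 2)
    w = 1.0 / (10000 ** (2 * i / d))
    out = np.empty(d)
    out[0::2] = np.sin(pos * w)
    out[1::2] = np.cos(pos * w)
    return out

def offset_matrix(k):
    """Block-diagonal M_k mapping PE(pos) to PE(pos + k)."""
    i = np.arange(d // 2)
    w = 1.0 / (10000 ** (2 * i / d))
    M = np.zeros((d, d))
    for j, wj in enumerate(w):
        c, s = np.cos(k * wj), np.sin(k * wj)
        M[2*j:2*j+2, 2*j:2*j+2] = [[c, s], [-s, c]]
    return M

# The same M_3 works at every absolute position:
assert np.allclose(offset_matrix(3) @ pe(5), pe(8))
assert np.allclose(offset_matrix(3) @ pe(100), pe(103))
```

The key point is visible in the code: `offset_matrix` takes only `k`, never the absolute position.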

Other desirable properties

Beyond the relative-position property, sinusoids bring several practical benefits: every component is bounded in $[-1, 1]$, so the encoding never overwhelms the token embedding; the encoding is fixed and adds zero parameters; and the formula is defined for any position, not just those seen in training (though, as the next section argues, defined is not the same as usable).

4. Limitations of Additive Positional Encodings

The sinusoidal scheme works, but it has a fundamental design issue: positional information and semantic content are mixed together by addition before the transformer ever sees them.

Key limitation: Once you add $\mathbf{PE}(\text{pos})$ to $\mathbf{e}_{\text{token}}$, the model must disentangle "what is this token?" from "where is this token?" using the same set of dimensions. In practice, the model learns to use some dimensions primarily for position and others for semantics, but this is an implicit, imperfect separation. It wastes model capacity.

There are three concrete problems:

1. Position information decays through layers. The positional signal is added once, at the input. As information flows through successive attention and feed-forward layers, the positional signal gets progressively diluted. By the upper layers, the model may struggle to determine relative positions precisely.

2. Fixed sequence length at training time. The encoding itself can generate vectors for any position, but the model only sees positions 0 through $L_{\text{train}}-1$ during training. At inference, if you feed position 5000 to a model trained with $L_{\text{train}} = 2048$, the model has never learned to interpret those encoding values. Generalization beyond the training length is unreliable.

3. Attention does not natively see relative position. When computing $\mathbf{q}_m^T \mathbf{k}_n$ (the attention score between positions $m$ and $n$), the dot product sees the sum of token and position information at both positions. The relative position $m - n$ is buried inside this computation, not made explicit. The model can recover it (thanks to the linear-transform property), but it must learn to do so.

These limitations motivated a new question: instead of adding position to the embedding, can we inject position directly into the attention computation, in a way that naturally encodes relative position?

5. Rotary Position Embeddings (RoPE)

RoPE (Su et al., 2021) solves the limitations above with a single elegant idea: encode position by rotating the query and key vectors in attention, so that their dot product naturally depends on relative position.

The core idea

Instead of adding a positional vector to the token embedding, RoPE applies a position-dependent rotation to each query and key vector, right before the attention dot product. The rotation angle is proportional to the position.

Start with 2D to build intuition. Suppose our query and key vectors are 2-dimensional. RoPE rotates the query at position $m$ by angle $m\theta$ and the key at position $n$ by angle $n\theta$:

RoPE in 2D:
$$\mathbf{q}_m^{\text{rot}} = R(m\theta)\,\mathbf{q}_m, \quad \mathbf{k}_n^{\text{rot}} = R(n\theta)\,\mathbf{k}_n$$ $$\text{where } R(\alpha) = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}$$

Now compute the attention score:

Attention score with RoPE:
$$(\mathbf{q}_m^{\text{rot}})^T \mathbf{k}_n^{\text{rot}} = \mathbf{q}_m^T R(m\theta)^T R(n\theta) \, \mathbf{k}_n = \mathbf{q}_m^T R\big((n-m)\theta\big) \, \mathbf{k}_n$$

Two properties of rotation matrices make this work: (1) the transpose is the inverse, so $R(\alpha)^T = R(-\alpha)$, and (2) successive rotations add their angles, so $R(-\alpha)\,R(\beta) = R(\beta - \alpha)$. Together: $R(m\theta)^T R(n\theta) = R(-m\theta)\,R(n\theta) = R\big((n-m)\theta\big)$. The absolute positions $m$ and $n$ disappear, and the dot product depends only on the relative position $n - m$.

Intuition: Imagine you and a friend are both standing on a rotating merry-go-round, at different positions. You are at angle $m\theta$ from the starting point; your friend is at $n\theta$. From your perspective, your friend's position is $(n - m)\theta$ relative to you. The absolute positions wash out. RoPE uses exactly this trick: rotate each vector by its absolute position, and the dot product sees only the relative offset.
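The cancellation of absolute positions can be checked directly with a toy query and key in 2D (the numbers here are arbitrary illustrations):

```python
import numpy as np

def rot(alpha):
    """2x2 rotation matrix R(alpha)."""
    return np.array([[np.cos(alpha), -np.sin(alpha)],
                     [np.sin(alpha),  np.cos(alpha)]])

theta = 0.5
q = np.array([1.0, 2.0])    # toy query
k = np.array([0.5, -1.0])   # toy key

# Rotate by absolute positions m and n, then take the dot product
m, n = 7, 10
score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)

# Shifting both positions by the same amount leaves the score unchanged:
score_shifted = (rot((m + 50) * theta) @ q) @ (rot((n + 50) * theta) @ k)
assert np.isclose(score, score_shifted)

# And the score equals q^T R((n-m)θ) k directly:
assert np.isclose(score, q @ rot((n - m) * theta) @ k)
```

Only the gap $n - m$ survives; the absolute positions wash out exactly as the derivation predicts.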

Interactive: RoPE Rotation in 2D

Drag the position sliders to move the query (blue) and key (green) vectors. The dot product depends only on the relative offset (n - m), shown in amber. Try keeping the gap constant while moving both positions: the score does not change.


(Figure: left, the 2D rotation view of $\mathbf{q}_m$ and $\mathbf{k}_n$ separated by angle $(n-m)\theta$; right, full $d$-dimensional RoPE, where dimension pairs $(0,1), (2,3), (4,5), \ldots, (d-2, d-1)$ each rotate by their own frequency $\theta_0, \theta_1, \ldots, \theta_{d/2-1}$ per position.)

Scaling to d dimensions

Real query/key vectors have $d$ dimensions (e.g., 64 or 128 per attention head). RoPE pairs them up: $(d_0, d_1)$, $(d_2, d_3)$, ..., $(d_{d-2}, d_{d-1})$, giving $d/2$ pairs. Each pair gets rotated independently, using a different base frequency:

RoPE frequencies:
$$\theta_i = \frac{1}{10000^{2i/d}}, \quad i = 0, 1, \ldots, d/2 - 1$$

These are the same frequencies as the sinusoidal encoding. The rotation for dimension pair $(2i, 2i+1)$ at position $m$ is:

Per-pair rotation:
$$\begin{bmatrix} q_{m,2i}^{\text{rot}} \\ q_{m,2i+1}^{\text{rot}} \end{bmatrix} = \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix} \begin{bmatrix} q_{m,2i} \\ q_{m,2i+1} \end{bmatrix}$$

The full rotation is a block-diagonal matrix with $d/2$ independent $2 \times 2$ rotation blocks:

Full RoPE rotation matrix:
$$R_m = \text{diag}\Big(R_{\theta_0}(m), \; R_{\theta_1}(m), \; \ldots, \; R_{\theta_{d/2-1}}(m)\Big)$$

Efficient implementation

You never actually construct the rotation matrix. Each $2 \times 2$ rotation decomposes into element-wise multiplications:

Element-wise RoPE (used in practice):
$$q_{m,2i}^{\text{rot}} = q_{m,2i} \cos(m\theta_i) - q_{m,2i+1} \sin(m\theta_i)$$ $$q_{m,2i+1}^{\text{rot}} = q_{m,2i} \sin(m\theta_i) + q_{m,2i+1} \cos(m\theta_i)$$

In code, this is two element-wise multiplies and one addition per dimension pair. The $\cos(m\theta_i)$ and $\sin(m\theta_i)$ values can be precomputed for all positions and cached as a table. The per-token cost is negligible compared to the attention computation itself.
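A minimal NumPy sketch of this element-wise form (the name `apply_rope` is illustrative; real implementations differ in layout details, e.g. pairing adjacent dimensions versus split halves):

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray,
               base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to query or key vectors. x: (seq_len, d), d even."""
    seq_len, d = x.shape
    i = np.arange(d // 2)
    theta = 1.0 / (base ** (2 * i / d))           # (d/2,) frequencies
    angles = positions[:, None] * theta[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)     # precomputable tables
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin     # rotated even dims
    out[:, 1::2] = x_even * sin + x_odd * cos     # rotated odd dims
    return out

q = np.random.randn(16, 8)
q_rot = apply_rope(q, np.arange(16))
# Rotations preserve norms: the content magnitude is untouched
assert np.allclose(np.linalg.norm(q_rot, axis=1), np.linalg.norm(q, axis=1))
```

Note that position 0 is a no-op (all angles are zero), and the `cos`/`sin` tables depend only on position, so they can be cached across layers and reused for every token.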

6. Why RoPE Works So Well

RoPE has become the default positional encoding for modern LLMs (LLaMA, Mistral, Qwen, Gemma, and many others). Here is why.

Relative position is built in

As shown above, the dot product $(\mathbf{q}_m^{\text{rot}})^T \mathbf{k}_n^{\text{rot}}$ depends only on $n - m$. The model does not need to learn to extract relative position from mixed signals. It gets relative position for free.

Position information is injected at every layer

RoPE applies the rotation in every attention layer, after the query and key projections. This means positional information is freshly injected at every layer, unlike additive encodings that only inject once at the input.

Semantic and positional information stay separate

The token embedding is never modified. Position is encoded purely through the rotation of queries and keys. The value vectors $\mathbf{v}$ are not rotated at all, meaning the content that gets aggregated by attention remains purely semantic.

Long-range decay

An important empirical property: the dot product between rotated queries and keys tends to decrease as $|m - n|$ increases. The high-frequency dimension pairs oscillate rapidly, and their contributions average toward zero for large relative distances. This gives RoPE a natural inductive bias toward local attention, without hard-coding a window.

Intuition: At close range ($|m - n|$ small), all the rotation angles are small, so most dimension pairs contribute coherently to the dot product. At long range, the fast-rotating pairs spin through many full cycles, and their positive and negative contributions cancel out. Only the slow-rotating pairs (low frequency) still contribute. This is like how you can tell if a nearby sound is "ahead" or "behind" (phase coherent), but a very distant sound loses directional clarity.
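This positional baseline is straightforward to compute. For a unit query with equal energy in every dimension pair, the cross terms in $\hat{\mathbf{q}}^T R(k\theta)\,\hat{\mathbf{q}}$ cancel within each pair, leaving the average of $\cos(k\theta_i)$ over the pairs. A NumPy sketch (function name illustrative):

```python
import numpy as np

def positional_decay(max_dist: int, d: int = 64,
                     base: float = 10000.0) -> np.ndarray:
    """q-hat^T R(k*theta) q-hat for equal energy per pair: mean of cos(k*theta_i)."""
    i = np.arange(d // 2)
    theta = 1.0 / (base ** (2 * i / d))
    k = np.arange(max_dist)[:, None]              # relative distances
    return np.cos(k * theta[None, :]).mean(axis=1)

decay = positional_decay(200)
assert np.isclose(decay[0], 1.0)   # distance 0: perfectly aligned
# fast-rotating pairs cancel at longer range, pulling the curve below 1
```

Plotting `decay` against distance reproduces the long-range decay described above: high-frequency pairs dephase quickly, and only the slow pairs keep contributing at large offsets.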

Interactive: Long-Range Decay

This curve isolates the position-only effect by setting $\mathbf{q} = \mathbf{k}$ (identical content). It computes $\hat{\mathbf{q}}^T R\big((n{-}m)\theta\big) \hat{\mathbf{q}}$: what happens to the dot product purely due to the rotation as distance grows? At distance 0 it is 1.0; it decays as tokens move apart. In a real model, content similarity also matters, so actual attention is this positional baseline modulated by how related the tokens are. In the interactive version, adding more dimensions makes the drop-off steeper (nearby tokens dominate more), while a higher base frequency stretches the curve to the right (the model distinguishes positions over a longer range).


7. Comparison: Sinusoidal vs Learned vs RoPE

| Property | Sinusoidal (additive) | Learned (additive) | RoPE (rotary) |
|---|---|---|---|
| How it works | Add fixed sin/cos vector to embedding | Add learned vector per position to embedding | Rotate query/key vectors before attention |
| Encodes relative position? | Implicitly (via linear transform property) | No (purely absolute) | Yes (built into dot product) |
| Extra parameters | 0 | $L \times d$ (one vector per position) | 0 |
| Position info in upper layers | Diluted (added once at input) | Diluted (added once at input) | Fresh (applied at every layer) |
| Extrapolation beyond $L_{\text{train}}$ | Poor | None (undefined) | Moderate (improved with RoPE scaling) |
| Mixes with content? | Yes (added to embedding) | Yes (added to embedding) | No (rotates Q/K only; V untouched) |
| Used by | Original Transformer | BERT, GPT-2 | LLaMA, Mistral, Qwen, Gemma, etc. |

8. Practical Notes

Extending context length with RoPE scaling

RoPE's biggest practical advantage: you can extend the context length beyond what was used during training by modifying how positions map to rotation angles. Two common approaches: position interpolation, which linearly compresses positions by $L_{\text{train}} / L_{\text{target}}$ so every rotation angle stays within the trained range, and NTK-aware scaling (refined by YaRN), which instead raises the base frequency so that high-frequency pairs keep their trained local behavior while low-frequency pairs are stretched to cover the longer range.

Warning: Simply feeding longer sequences into a RoPE model without any scaling does not work. The attention scores for token pairs beyond $L_{\text{train}}$ will see rotation angles the model has never encountered, producing garbage attention patterns. You must either scale the frequencies or fine-tune on longer sequences (or both).
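The simplest scaling scheme, linear position interpolation, rescales positions by $L_{\text{train}} / L_{\text{target}}$ before computing the rotation angles, so the model never sees angles outside its trained range. A hedged NumPy sketch (function name illustrative; production implementations add further refinements and usually require some fine-tuning):

```python
import numpy as np

def rope_angles(positions: np.ndarray, d: int, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """RoPE rotation angles with linear position interpolation.

    scale = L_train / L_target (< 1 extends context): positions are
    compressed so angles never exceed what was seen in training.
    """
    i = np.arange(d // 2)
    theta = 1.0 / (base ** (2 * i / d))
    return (positions * scale)[:, None] * theta[None, :]

# Trained on 2048 tokens, extending to 8192: compress positions 4x
angles = rope_angles(np.arange(8192, dtype=float), d=64, scale=2048 / 8192)
# Every angle now stays within the range the model saw during training
assert angles.max() < 2048
```

The trade-off: compressing positions squeezes more tokens into the same angular range, so nearby positions become harder to tell apart, which is one motivation for the frequency-band-aware refinements in NTK-aware scaling and YaRN.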

Where RoPE is applied in the model

RoPE is applied after the linear projections that produce Q and K, but before the dot product. It is not applied to the value (V) vectors. In multi-head attention, the rotation is applied independently within each head, using the head dimension $d_h$ (not the full model dimension $d_{\text{model}}$).

Base frequency matters

The base frequency $b$ (default 10,000) controls the spectrum of rotation speeds. Larger base values make all rotations slower: nearby positions become slightly less sharply distinguished, but the slow dimensions take far longer to wrap around, so the model can tell positions apart across a much longer range. LLaMA 3 uses $b = 500{,}000$, which enables its 128K context window. The optimal base depends on the target context length.

RoPE and KV cache

During autoregressive generation, you cache the key and value vectors for past tokens (the KV cache). Since RoPE is applied to keys before caching, the cached key vectors already contain the correct positional rotation. You do not need to re-rotate them when computing attention for new tokens.


References

  1. Vaswani et al. (2017). Attention Is All You Need. The original transformer paper introducing sinusoidal positional encodings.
  2. Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. The RoPE paper.
  3. Press et al. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ALiBi, an alternative relative position approach.
  4. bloc97 (2023). NTK-Aware Scaled RoPE. The NTK-aware frequency scaling method.
  5. Peng et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models. Frequency-band-aware RoPE scaling.

Cite this post:

@article{sedhain2026positional,
  title   = {Positional Encodings for LLMs: From Sinusoidal to RoPE},
  author  = {Sedhain, Suvash},
  journal = {ssedhain.com},
  year    = {2026},
  month   = {Mar},
  url     = {https://mesuvash.github.io/blog/2026/positional-encodings/}
}