An intuition-first guide to the RL concepts behind RLHF, PPO, and GRPO.
The background you need before diving into alignment algorithms.

Why this post exists

  first, a rant about RL/ML literature's readability problem

ML/RL literature has a readability problem. Papers and textbooks are dense with notation, and too often the math arrives before the intuition. If you've ever stared at a policy gradient derivation and thought "but why are we doing this?", you're not alone. The barrier is rarely the ideas themselves; it's how they're presented.

The underlying principles of RL, even the parts that power RLHF and GRPO, are surprisingly simple. At every stage, the core question is intuitive: "which tokens made this response good or bad, and how do we produce more of the good ones?" Everything else is machinery to answer that question efficiently.

There's an old idea, often attributed to Feynman: if you can't explain something simply, you don't understand it well enough. This post is my attempt to explain simply. Every equation earns its place only after the intuition is clear, and every concept is introduced exactly when it's needed to solve a concrete problem with the previous approach.

The teaching approach here is inspired by J. Clark Scott's "But How Do It Know? — The Basic Principles of Computers for Everyone", a book that builds an entire computer from NAND gates upward, introducing each piece only when the previous piece creates a need for it. That's the structure I'm aiming for: start with the simplest thing that could work (REINFORCE), hit a wall, and let the wall motivate the next concept.

1. RL for LLMs in One Picture

After pre-training and supervised fine-tuning (SFT), your LLM can generate fluent text. But "fluent" is not the same as "good." The model may be confidently wrong, unhelpfully verbose, or subtly toxic. RL lets you optimize for overall response quality rather than just per-token likelihood.

Here is the setup, reduced to its essentials:

[Figure: a prompt flows into the LLM policy, which generates tok₁, tok₂, …, tok_T, EOS; the reward model returns R = 0.82; the open question, marked "?", is credit assignment: which tokens caused the reward?]

The whole game: generate tokens → get a reward at the end → figure out which tokens were responsible → adjust the policy to produce better outputs.

The rest of this tutorial builds up the tools to solve this problem, one piece at a time. We'll start with the simplest approach (REINFORCE), see why it breaks, and then introduce each new concept exactly when it's needed to fix the previous one.

2. RL Vocabulary Mapped to Text Generation

RL has its own jargon, but every term maps cleanly onto text generation. This table is worth internalizing because the rest of the tutorial (and all RLHF literature) uses these terms interchangeably.

| RL concept | In text generation | Example |
|---|---|---|
| State $s_t$ | Prompt + tokens generated so far | "Explain gravity" + "Gravity is" |
| Action $a_t$ | Next token chosen | "a" (the next token) |
| Policy $\pi_\theta$ | The LLM's token distribution | $P(\text{"a"}) = 0.3, P(\text{"the"}) = 0.2, \ldots$ |
| Trajectory $\tau$ | Complete prompt + response | The full generated text |
| Reward $r$ | Score for the response | Reward model output: 0.82 |
| Return $G$ | Total reward (often = terminal reward) | Same as reward when only scored at end |
| Discount $\gamma$ | Weight on future reward | Typically $1.0$ exactly (see note below) |
Note on $\gamma$: In robotics and game-playing RL, the agent may act forever (infinite horizon), so $\gamma < 1$ is required to keep the sum of rewards finite. LLM RL is different: every episode terminates at the EOS token, so the return is a finite sum regardless of $\gamma$. This makes $\gamma = 1.0$ mathematically safe, and most implementations use it exactly. You'll sometimes see $\gamma = 1.0$ left implicit in RLHF papers for this reason.
Intuition: RL for LLMs is not "teaching the model to reason." It's shaping which outputs the model prefers under a reward signal. The model already has the capability from pre-training. RL steers it toward the outputs humans (or a reward model) rate highly.

3. Policy Gradients: The Naive Approach

With the vocabulary in place, let's tackle the core question: how do we update the LLM's weights to produce higher-reward responses?

What we want to do

Our objective is simple to state: maximize the expected reward over responses sampled from the policy.

Objective:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

Where $\tau$ is a complete response (a sequence of tokens) sampled from the LLM, and $R(\tau)$ is the reward model's score for that response. We want to find $\theta$ that makes $J(\theta)$ as large as possible, so we need $\nabla_\theta J(\theta)$ in order to do gradient ascent.

Why we can't just differentiate through sampling

In supervised learning, the loss is a smooth function of the model's outputs, so backpropagation works directly. Here, the pipeline is:

$\theta$ → token probabilities → sample discrete tokens → response → $R$

The sampling step is the problem. "Pick the token with ID 3847" is a discrete, non-differentiable operation. The gradient of the reward with respect to $\theta$ doesn't flow back through it. We need a different route.
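To see the break in the gradient path concretely, here's a minimal PyTorch sketch with a toy 5-token vocabulary (the sizes and variable names are illustrative, not from any real training stack):

```python
import torch

logits = torch.randn(5, requires_grad=True)
probs = torch.softmax(logits, dim=-1)        # differentiable w.r.t. logits

token_id = torch.multinomial(probs, num_samples=1)   # discrete sample
print(token_id.requires_grad)                # False: an integer index carries no gradient

# Any reward computed from token_id alone cannot backpropagate into logits.
# The log-probability of the chosen token, however, is a smooth function of theta:
log_prob = torch.log(probs[token_id]).squeeze()
log_prob.backward()                          # gradients flow into logits just fine
print(logits.grad)
```

This is exactly the asymmetry the log-derivative trick exploits below: the sampled index is non-differentiable, but its log-probability is not.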

The log-derivative trick (REINFORCE)

The key insight (Williams, 1992): we don't need to differentiate through the sampling. We can rewrite the gradient of the expectation in a form that only requires differentiating the log-probabilities, which are smooth functions of $\theta$.

Start by expanding the expected reward:

Expected reward (expanded):
$$J(\theta) = \sum_{\tau} \pi_\theta(\tau) \, R(\tau)$$

This sums over all possible responses $\tau$, weighted by the probability the policy assigns to each one. (In practice, responses are sampled, not enumerated, but writing it as a sum makes the algebra clear.) Take the gradient:

Gradient of expected reward:
$$\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta \pi_\theta(\tau) \, R(\tau)$$

The reward $R(\tau)$ doesn't depend on $\theta$ (it's just a number for a given response), so only $\pi_\theta(\tau)$ gets differentiated. Now the trick: multiply and divide by $\pi_\theta(\tau)$:

The log-derivative trick:
$$\nabla_\theta J(\theta) = \sum_{\tau} \pi_\theta(\tau) \, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} \, R(\tau) = \sum_{\tau} \pi_\theta(\tau) \, \nabla_\theta \log \pi_\theta(\tau) \, R(\tau)$$

The identity $\frac{\nabla f}{f} = \nabla \log f$ is the entire trick. What we've done is convert $\sum_\tau \pi_\theta(\tau) \cdot [\ldots]$ back into an expectation under the policy:

Policy gradient (REINFORCE):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\nabla_\theta \log \pi_\theta(\tau) \cdot R(\tau)\Big]$$

Since $\pi_\theta(\tau) = \prod_t \pi_\theta(a_t | s_t)$, we have $\log \pi_\theta(\tau) = \sum_t \log \pi_\theta(a_t | s_t)$. This gives the per-token form:

Per-token form:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R\Big]$$

Each variable:

- $a_t$: the token generated at position $t$.
- $s_t$: the state at position $t$, i.e. the prompt plus all tokens generated before $t$.
- $\pi_\theta(a_t | s_t)$: the probability the LLM assigned to that token.
- $R$: the reward model's score for the complete response, applied to every token in it.

Intuition: We've sidestepped the non-differentiable sampling entirely. Instead of differentiating through "which token was picked," we differentiate through "how likely was the token that was picked." The reward $R$ just acts as a scalar weight: high reward → push those token probabilities up. Low reward → push them down. No need to differentiate through discrete choices.

Connection to SFT

Compare the REINFORCE gradient to the SFT (supervised fine-tuning) gradient:

| | SFT | REINFORCE |
|---|---|---|
| Gradient | $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R$ |
| Weight | 1 (always push up) | $R$ (push up if good, down if bad) |
| Tokens | From a fixed dataset | Sampled from the policy itself |

SFT always increases the probability of the target tokens. REINFORCE does the same thing, but weighted by how good the result was. It's "SFT with a dial."
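The "dial" is literally one factor in the loss. A minimal sketch, where `token_logprobs` stands in for the per-token log-probs of the target (SFT) or sampled (REINFORCE) tokens and `R` for the reward model's score:

```python
import torch

token_logprobs = torch.randn(12)   # stand-in for log pi_theta(a_t | s_t), shape (T,)
R = 0.82                           # stand-in for the reward model's score

# SFT: every target token gets weight 1 (always pushed up).
sft_loss = -token_logprobs.sum()

# REINFORCE: the same sum, scaled by how good the sampled response was.
reinforce_loss = -(token_logprobs.sum() * R)
```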

Estimating the expectation

In practice, we can't sum over all possible responses. We approximate the expectation by sampling:

1. Sample a batch of responses from the current policy.
2. Score each with the reward model → get $R$ for each.
3. Compute $\sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R$ for each response, and average over the batch.
4. Update $\theta$ in the direction of this gradient (gradient ascent).
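Put together, one update step looks roughly like the sketch below. The `policy.sample`, `policy.token_logprobs`, and `reward_model.score` methods are hypothetical stand-ins for whatever your stack provides; the structure is what matters.

```python
import torch

def reinforce_step(policy, reward_model, optimizer, prompts):
    """One REINFORCE update over a batch of prompts (illustrative sketch)."""
    losses = []
    for prompt in prompts:
        # 1. Sample a response from the current policy.
        response = policy.sample(prompt)                    # token ids
        # 2. Score the complete response with the reward model.
        R = reward_model.score(prompt, response)            # scalar
        # 3. Surrogate loss: minimizing it ascends sum_t log pi(a_t|s_t) * R.
        logprobs = policy.token_logprobs(prompt, response)  # shape (T,)
        losses.append(-(logprobs.sum() * R))
    # 4. Average over the batch and take a gradient step.
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```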

Why REINFORCE isn't enough

REINFORCE is mathematically correct. But in practice it's too noisy to use for LLMs. Two concrete problems:

1. The raw reward isn't centered. If the reward model scores every sampled response positively, every response gets pushed up, just by different amounts, and the gradient estimate swings wildly from batch to batch (high variance).
2. There's no per-token credit assignment. Every token in a response is weighted by the same scalar $R$, so the filler tokens in a great response get reinforced just as hard as the tokens that actually made it great.

Both problems point to the same need: instead of weighting every token by the raw reward $R$, we need a per-token signal that says "was this specific token better or worse than expected?" This is the advantage function.

4. The Advantage Function: Fixing REINFORCE

The advantage function replaces the raw reward $R$ in the policy gradient with a more informative, per-token signal.

The idea

Advantage:
$$A_t = Q(s_t, a_t) - V(s_t)$$

Where:

- $Q(s_t, a_t)$: the expected final reward if we generate token $a_t$ in state $s_t$ and then follow the policy.
- $V(s_t)$: the expected final reward from state $s_t$ under the policy, before committing to any particular next token.

Intuition: Suppose at token position 50, responses that pass through this state typically end up with reward ~0.7. If a particular token choice leads to a final reward of 0.9, the advantage is +0.2, so reinforce it. If it leads to 0.5, the advantage is −0.2, so suppress it. The advantage converts "was this response good?" into "was this token good, relative to what we'd normally do here?"

The improved policy gradient

Replacing $R$ with $A_t$ in the REINFORCE formula:

Policy gradient with advantage:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A_t\Big]$$

Now each token gets its own training signal. Tokens with positive advantage (better than expected) get reinforced. Tokens with negative advantage (worse than expected) get suppressed. This solves both problems: the signal is centered (positive and negative), and it's per-token.
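In code, the only change from the REINFORCE surrogate is that the single scalar reward becomes a per-token vector of advantages. A sketch, assuming `advantages` has already been computed (the next sections cover how):

```python
import torch

def pg_loss_with_advantage(token_logprobs: torch.Tensor,
                           advantages: torch.Tensor) -> torch.Tensor:
    """Per-token policy-gradient surrogate.

    token_logprobs: log pi_theta(a_t | s_t), shape (T,).
    advantages:     A_t for each token, shape (T,).
    """
    # detach() treats the advantages as constants: no gradient flows into them.
    return -(token_logprobs * advantages.detach()).sum()
```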

The catch

Computing $A_t = Q(s_t, a_t) - V(s_t)$ requires knowing $V(s_t)$, the expected reward from each partial response. But that's exactly the kind of thing we don't have. The reward model only scores complete responses.

We need a way to estimate $V(s_t)$ at every token position. This is the job of a value function, and the next two sections are about how to learn one.

5. Value Functions & The Bellman Equation

We just saw that the advantage function needs $V(s_t)$, the expected final reward given a partial response up to token $t$. This section covers what $V(s)$ is, and the key equation that makes it learnable.

What V(s) represents

$V(s)$ — state value: the expected total reward starting from state $s$ and following the current policy. In LLM terms: "given this prompt and the tokens generated so far, what reward does the final response typically get?"

[Figure: as the response grows from tok₁ to tok_T, the critic's estimates V(s₁) = 0.61, V(s₂) = 0.68, V(s₃) = 0.73, …, V(s_T) = 0.81 converge toward the actual reward R = 0.82.]

Why this matters: V(s) turns one end-of-response reward into a signal at every token position.

There's also $Q(s, a)$ — action value: the expected total reward if we choose token $a$ next, then follow the policy. The advantage is their difference: $A_t = Q(s_t, a_t) - V(s_t)$.

The Bellman equation: how to learn V(s)

We can't compute $V(s)$ by enumerating all possible continuations. The Bellman equation provides a recursive shortcut: express $V(s_t)$ in terms of $V(s_{t+1})$.

In plain English: the value of where you are = what you get now + the value of where you end up next (in expectation).

Bellman expectation equation:
$$V_\pi(s_t) = \mathbb{E}_{a_t \sim \pi,\; s_{t+1} \sim P}\big[r_t + \gamma \, V_\pi(s_{t+1})\big]$$

Each variable:

- $V_\pi(s_t)$: the value of the current state under the policy $\pi$.
- $r_t$: the reward collected at step $t$ (for LLMs with a terminal-only reward, zero until the final token).
- $\gamma$: the discount factor (typically $1.0$ for LLM RL).
- $V_\pi(s_{t+1})$: the value of the next state, after token $a_t$ has been appended.
- The expectation is over the policy's token choice $a_t \sim \pi$ and the transition $s_{t+1} \sim P$ (in text generation, appending the chosen token makes this transition deterministic).

[Figure: the Bellman decomposition: V(s_t) (current state) = r_t (reward now) + γ · V(s_{t+1}) (discounted future).]

In LLM RL (simplified): $r_t = 0$ for non-terminal tokens and $\gamma = 1$, so $V(s_t) = \mathbb{E}[V(s_{t+1})]$.

Why Bellman matters

The Bellman equation gives us a training objective for a value model. We train a neural network (the "critic") to predict $V(s_t)$ at every token position. The critic is trained to minimize the Bellman residual, the gap between its prediction $V(s_t)$ and the one-step target $r_t + \gamma V(s_{t+1})$, across sampled transitions.

On any single sample, $V(s_{t+1}) > V(s_t)$ is completely normal. It means that token had positive advantage. The critic aims for self-consistency in expectation, not on every individual token.

The per-sample mismatch $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error. It's the building block for computing advantages, which brings us to how TD errors are used.
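A sketch of what this looks like in code, assuming the critic's per-position predictions are stacked into a tensor `values` of length $T+1$ with the post-EOS entry set to zero (a common convention, not the only one):

```python
import torch

def critic_loss_and_td(values: torch.Tensor, rewards: torch.Tensor,
                       gamma: float = 1.0):
    """Bellman targets, TD errors, and the critic's loss for one response.

    values:  critic predictions V(s_t), shape (T+1,); values[-1] is the
             value after EOS, assumed to be 0 here.
    rewards: per-token rewards r_t, shape (T,); all zeros except the last
             entry under a terminal-only reward.
    """
    targets = rewards + gamma * values[1:]            # r_t + gamma * V(s_{t+1})
    td_errors = (targets - values[:-1]).detach()      # delta_t, used later for GAE
    value_loss = ((values[:-1] - targets.detach()) ** 2).mean()  # Bellman residual
    return value_loss, td_errors
```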

6. Monte Carlo vs Temporal Difference

We need to learn $V(s)$ and use it to compute advantages. There are two fundamental approaches to generating the training signal. Understanding the trade-off between them is essential because it shows up directly in GAE, the advantage estimator used by PPO.

Monte Carlo (MC): learn from complete outcomes

Wait until the response is fully generated and scored. Then use the actual return $G_t = \sum_{k=t}^{T} r_k$ as the target for $V(s_t)$ at every token position. Under terminal-only reward, $G_t = R$ for all $t$.

Temporal Difference (TD): learn from predictions

Don't wait for the end. After generating each token, update $V(s_t)$ using one step of reality plus the critic's prediction of what comes next.

TD error:
$$\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)$$

Each variable:

- $\delta_t$: the TD error at step $t$, i.e. how much better or worse the outlook became after this token.
- $r_t$: the reward at step $t$ (zero for non-terminal tokens under a terminal-only reward).
- $V(s_t)$, $V(s_{t+1})$: the critic's value estimates before and after token $a_t$.

Intuition: Under terminal-only reward with $\gamma = 1$, the TD error simplifies to $\delta_t = V(s_{t+1}) - V(s_t)$ for non-terminal tokens. The critic is saying: "after seeing this token, I now think the response will be worth $V(s_{t+1})$ instead of $V(s_t)$. This token's contribution is the difference." If the critic's estimate goes up, the token was helpful. If it goes down, the token was harmful.
Practical nuance: Many RLHF implementations add a per-token KL penalty to the reference model. The per-token reward becomes:
Reward with KL penalty:
$$r_t = \begin{cases} -\beta \log \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)} & \text{for } t < T \\[6pt] R(\tau) - \beta \log \dfrac{\pi_\theta(a_T|s_T)}{\pi_{\text{ref}}(a_T|s_T)} & \text{for } t = T \end{cases}$$
This changes the picture meaningfully: the "sparse terminal reward" becomes a dense reward at every token. The TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ now includes a non-zero $r_t$ at every step, not just at the end. The intuition from above still holds (each TD error measures "how much did things change after this token?"), but the KL term adds a per-token cost for deviating from the reference policy. If you're cross-referencing with code (e.g., TRL's PPO implementation), this is why you'll see rewards arrays that are non-zero at every position, not just the last one.
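A sketch of how such a rewards array might be assembled. The tensor names and `beta = 0.1` are illustrative, not taken from any particular library:

```python
import torch

def per_token_rewards(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
                      terminal_reward: float, beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards with a KL penalty to the reference model.

    logprobs / ref_logprobs: log pi_theta(a_t|s_t) and log pi_ref(a_t|s_t), shape (T,).
    terminal_reward: the reward model's score R for the full response.
    """
    # Rewards are treated as constants, so detach them from the graph.
    rewards = -beta * (logprobs - ref_logprobs).detach()   # -beta * log(pi/pi_ref)
    rewards[-1] = rewards[-1] + terminal_reward            # add R at the final token
    return rewards
```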

The spectrum: from TD to MC

MC and TD are not binary choices. They sit on a spectrum. You can blend them by looking $n$ steps ahead before bootstrapping:

| Method | Uses $n$ real steps | Then bootstraps? | Bias | Variance |
|---|---|---|---|---|
| TD(0) | 1 | Yes | High | Low |
| $n$-step TD | $n$ | Yes | Medium | Medium |
| MC | All (to end) | No | None | High |
| TD($\lambda$) | Weighted blend of all | Smoothly | Controllable | Controllable |

TD($\lambda$) is an exponentially-weighted average of all $n$-step returns. The parameter $\lambda \in [0, 1]$ controls the blend: $\lambda = 0$ is pure TD(0), $\lambda = 1$ is pure MC. This matters because GAE, the advantage estimator PPO uses, is exactly TD($\lambda$) applied to advantages.
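For reference, the $\lambda$-return that TD($\lambda$) targets is the exponentially weighted average of the $n$-step returns $G_t^{(n)}$; in the episodic form (as in Sutton & Barto):

$\lambda$-return:
$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} \, G_t^{(n)} \;+\; \lambda^{T-t-1} \, G_t$$

Each $G_t^{(n)}$ uses $n$ real rewards and then bootstraps from $V(s_{t+n})$; the final term is the full Monte Carlo return, which dominates as $\lambda \to 1$.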

7. Actor-Critic: Putting It All Together

We now have all the pieces. This section shows how they fit together into the actor-critic architecture, the backbone of PPO.

GAE: computing advantages from TD errors

We need the advantage $A_t$ at every token position. GAE computes it as an exponentially-weighted sum of TD errors, the same TD errors from Section 6:

Generalized Advantage Estimation (GAE):
$$A^{\text{GAE}}_t = \sum_{k=0}^{T-t-1} (\gamma\lambda)^k \, \delta_{t+k}$$

Where $\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})$ is the TD error at step $t+k$.

The two parameters control the bias-variance trade-off:

- $\gamma$: the usual discount factor, typically $1.0$ for LLM RL.
- $\lambda$: how far to trust the critic versus real outcomes. $\lambda = 0$ keeps only the one-step TD error (low variance, more bias); $\lambda = 1$ sums all the TD errors, which (with a zero value after EOS) telescopes to the Monte Carlo advantage $R - V(s_t)$ (no bias, high variance). Values around $0.95$ are a common middle ground.
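GAE is usually computed with a single backward pass over the TD errors, since $A_t = \delta_t + \gamma\lambda A_{t+1}$. A minimal sketch (the default $\lambda = 0.95$ is a common choice, not a prescription):

```python
import torch

def gae_advantages(td_errors: torch.Tensor, gamma: float = 1.0,
                   lam: float = 0.95) -> torch.Tensor:
    """GAE advantages from per-token TD errors (shape (T,)).

    Implements A_t = sum_k (gamma * lam)^k * delta_{t+k} via the
    backward recursion A_t = delta_t + gamma * lam * A_{t+1}.
    """
    advantages = torch.zeros_like(td_errors)
    running = 0.0
    for t in reversed(range(len(td_errors))):
        running = td_errors[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```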

Why the critic is essential (and expensive)

The critic provides $V(s_t)$ at every token position, which is needed to compute TD errors and therefore GAE advantages. Without a critic, you'd fall back to REINFORCE with raw rewards, which is too noisy for long text outputs.

The cost: you need the critic's value estimates alongside the policy, reference model, and reward model. Some implementations use a separate full-sized critic LLM; others share the policy's transformer trunk and add a value head (cheaper but couples the two). Either way, PPO requires juggling more model parameters than critic-free alternatives, typically 3-4 model-equivalents depending on the setup.

This is exactly the cost that GRPO eliminates. GRPO replaces the critic with group-based normalization: generate multiple responses per prompt, use the group mean as the baseline, and compute advantage as the z-score within the group. Same goal (per-token signal from a per-response reward), different mechanism.
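A sketch of that group-based baseline, assuming `group_rewards` holds the reward model's scores for $G$ responses to the same prompt:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-normalized advantages (the GRPO idea in miniature).

    group_rewards: rewards for G responses to one prompt, shape (G,).
    Returns one z-scored advantage per response; GRPO applies that scalar
    to every token of the corresponding response.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)   # eps guards against a zero std
```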

8. From Here to PPO and GRPO

Every concept in this tutorial maps directly to a component in the alignment algorithms. Here is how they connect:

| RL concept | Role in PPO | Role in GRPO |
|---|---|---|
| Value function $V(s)$ | Critic model predicts $V(s_t)$ at every token | Not used; replaced by group mean |
| Bellman equation | Critic trained via Bellman consistency (MSE loss) | Not used |
| TD error $\delta_t$ | Building block of GAE advantages | Not used |
| GAE / TD($\lambda$) | Computes per-token advantages from TD errors | Not used; response-level z-score instead |
| Advantage $A_t$ | Per-token, from GAE | Per-response z-score, applied to all tokens |
| Policy gradient | Clipped surrogate objective | Same clipped surrogate objective |
| Baseline / variance reduction | Critic provides baseline $V(s_t)$ | Group mean provides baseline |
Intuition: PPO uses the full RL toolkit (critic, Bellman, TD, GAE) to get fine-grained per-token credit assignment. GRPO trades that precision for simplicity: instead of learning a value function, it uses the empirical statistics of a group of responses as the baseline. Both aim to solve the same problem (credit assignment from sparse rewards), just with different tools.

The shared mechanism between PPO and GRPO is the clipped policy update: regardless of how advantages are computed, both algorithms clip the probability ratio to prevent the policy from changing too drastically in a single step. Combined with a KL penalty to the reference model, this is what keeps RL training stable.
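That shared clipped update, sketched (with the common default $\epsilon = 0.2$; the full mechanics are covered in the deep dive linked below):

```python
import torch

def clipped_pg_loss(new_logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                    advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective shared by PPO and GRPO.

    new_logprobs / old_logprobs: per-token log-probs under the current policy
    and the policy that generated the rollout, shape (T,).
    """
    ratio = torch.exp(new_logprobs - old_logprobs)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the more pessimistic of the two, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```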

For the full algorithmic details (clipping mechanics, the KL penalty, reward model architecture, the PPO training loop, and GRPO's group normalization), see the PPO & GRPO deep dive.


References: Williams, "Simple Statistical Gradient-Following Algorithms" (1992) · Sutton & Barto, Reinforcement Learning: An Introduction (2018) · Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2016) · Schulman et al., "Proximal Policy Optimization Algorithms" (2017)