Reinforcement Learning for LLMs
An intuition-first guide to the RL concepts behind RLHF, PPO, and GRPO.
The background you need before diving into alignment algorithms.
Why this post exists
first, a rant about RL/ML literature's readability problem
ML/RL literature has a readability problem. Papers and textbooks are dense with notation, and too often the math arrives before the intuition. If you've ever stared at a policy gradient derivation and thought "but why are we doing this?", you're not alone. The barrier is rarely the ideas themselves; it's how they're presented.
The underlying principles of RL, even the parts that power RLHF and GRPO, are surprisingly simple. At every stage, the core question is intuitive: "which tokens made this response good or bad, and how do we produce more of the good ones?" Everything else is machinery to answer that question efficiently.
There's an old idea, often attributed to Feynman: if you can't explain something simply, you don't understand it well enough. This post is my attempt to explain simply. Every equation earns its place only after the intuition is clear, and every concept is introduced exactly when it's needed to solve a concrete problem with the previous approach.
The teaching approach here is inspired by J. Clark Scott's "But How Do It Know? — The Basic Principles of Computers for Everyone", a book that builds an entire computer from NAND gates upward, introducing each piece only when the previous piece creates a need for it. That's the structure I'm aiming for: start with the simplest thing that could work (REINFORCE), hit a wall, and let the wall motivate the next concept.
1. RL for LLMs in One Picture
After pre-training and supervised fine-tuning (SFT), your LLM can generate fluent text. But "fluent" is not the same as "good." The model may be confidently wrong, unhelpfully verbose, or subtly toxic. RL lets you optimize for overall response quality rather than just per-token likelihood.
Here is the setup, reduced to its essentials:
- The LLM is a policy that samples tokens one at a time to produce a response.
- A trajectory is: prompt → sequence of tokens → complete response.
- The reward is usually sparse and delayed: a single score at the very end.
- The central problem is credit assignment: which tokens in a 200-token response were responsible for the reward?
The rest of this tutorial builds up the tools to solve this problem, one piece at a time. We'll start with the simplest approach (REINFORCE), see why it breaks, and then introduce each new concept exactly when it's needed to fix the previous one.
2. RL Vocabulary Mapped to Text Generation
RL has its own jargon, but every term maps cleanly onto text generation. This table is worth internalizing because the rest of the tutorial (and all RLHF literature) uses these terms interchangeably.
| RL concept | In text generation | Example |
|---|---|---|
| State $s_t$ | Prompt + tokens generated so far | "Explain gravity" + "Gravity is" |
| Action $a_t$ | Next token chosen | "a" (the next token) |
| Policy $\pi_\theta$ | The LLM's token distribution | $P(\text{"a"}) = 0.3, P(\text{"the"}) = 0.2, \ldots$ |
| Trajectory $\tau$ | Complete prompt + response | The full generated text |
| Reward $r$ | Score for the response | Reward model output: 0.82 |
| Return $G$ | Total reward (often = terminal reward) | Same as reward when only scored at end |
| Discount $\gamma$ | Weight on future reward | Typically $1.0$: responses are short and finite and the reward arrives only at the end, so no discounting is needed |
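To make the mapping concrete, here is the same vocabulary written out as a toy Python trajectory. The values and variable names are illustrative only, not tied to any particular library:

```python
# A toy trajectory for the prompt "Explain gravity". All values are made up.
prompt = "Explain gravity"
tokens = ["Gravity", " is", " a", " force", "."]                 # actions a_0 ... a_4

# State s_t = prompt + everything generated before step t.
states = [prompt + "".join(tokens[:t]) for t in range(len(tokens))]

reward = 0.82                        # terminal reward from the reward model
returns = [reward] * len(tokens)     # terminal-only reward and gamma = 1, so G_t = R at every step
```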
3. Policy Gradients: The Naive Approach
With the vocabulary in place, let's tackle the core question: how do we update the LLM's weights to produce higher-reward responses?
What we want to do
Our objective is simple to state: maximize the expected reward over responses sampled from the policy.
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

Where $\tau$ is a complete response (sequence of tokens) sampled from the LLM, and $R(\tau)$ is the reward model's score for that response. We want to find $\theta$ that makes $J(\theta)$ as large as possible, so we need the gradient $\nabla_\theta J(\theta)$ in order to do gradient ascent.
Why we can't just differentiate through sampling
In supervised learning, the loss is a smooth function of the model's outputs, so backpropagation works directly. Here, the pipeline is:
$\theta$ → token probabilities → sample discrete tokens → response → $R$
The sampling step is the problem. "Pick the token with ID 3847" is a discrete, non-differentiable operation. The gradient of the reward with respect to $\theta$ doesn't flow back through it. We need a different route.
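To see the break concretely, here is a tiny PyTorch sketch (not from the post; `logits` is a stand-in for the LLM's output at one step). The sampled token ID is an integer tensor that carries no gradient back to the logits:

```python
import torch

logits = torch.randn(5, requires_grad=True)          # pretend these are the LLM's logits
probs = torch.softmax(logits, dim=-1)                # differentiable w.r.t. logits
token_id = torch.multinomial(probs, num_samples=1)   # discrete sample: an integer index

print(token_id.dtype)           # torch.int64 -- an integer, not a float
print(token_id.requires_grad)   # False: no gradient flows from the sampled ID back to logits
```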
The log-derivative trick (REINFORCE)
The key insight (Williams, 1992): we don't need to differentiate through the sampling. We can rewrite the gradient of the expectation in a form that only requires differentiating the log-probabilities, which are smooth functions of $\theta$.
Start by expanding the expected reward:

$$J(\theta) = \sum_\tau \pi_\theta(\tau) \, R(\tau)$$

This sums over all possible responses $\tau$, weighted by the probability the policy assigns to each one. (In practice, responses are sampled, not enumerated, but writing it as a sum makes the algebra clear.) Take the gradient:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta \pi_\theta(\tau) \, R(\tau)$$

The reward $R(\tau)$ doesn't depend on $\theta$ (it's just a number for a given response), so only $\pi_\theta(\tau)$ gets differentiated. Now the trick: multiply and divide by $\pi_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \sum_\tau \pi_\theta(\tau) \, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} \, R(\tau)$$

The identity $\frac{\nabla f}{f} = \nabla \log f$ is the entire trick. What we've done is convert $\sum_\tau \pi_\theta(\tau) \cdot [\ldots]$ back into an expectation under the policy:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right]$$

Since $\pi_\theta(\tau) = \prod_t \pi_\theta(a_t | s_t)$, we have $\log \pi_\theta(\tau) = \sum_t \log \pi_\theta(a_t | s_t)$. This gives the per-token form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \, R \right]$$
Each variable:
- $\nabla_\theta$: gradient with respect to model weights.
- $\log \pi_\theta(a_t | s_t)$: log-probability of the token that was actually chosen at step $t$. This is differentiable with respect to $\theta$ (it's just the log of the LLM's softmax output for that token).
- $R$: the reward for the complete response, used as a scalar weight. (In general RL, each token $t$ would use $G_t$, the return from step $t$ onward. In the terminal-reward LLM setting, $G_t = R$ for all $t$, so we simplify.)
Connection to SFT
Compare the REINFORCE gradient to the SFT (supervised fine-tuning) gradient:
| | SFT | REINFORCE |
|---|---|---|
| Gradient | $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R$ |
| Weight | 1 (always push up) | $R$ (push up if good, down if bad) |
| Tokens | From a fixed dataset | Sampled from the policy itself |
SFT always increases the probability of the target tokens. REINFORCE does the same thing, but weighted by how good the result was. It's "SFT with a dial."
Estimating the expectation
In practice, we can't sum over all possible responses. We approximate the expectation by sampling $N$ responses from the current policy:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \, R_i$$
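In code, this estimate is one line of reward-weighted log-probs. A minimal PyTorch sketch, assuming you already have the per-token log-probabilities of the sampled tokens (masking for variable-length responses is omitted):

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """token_logprobs: [batch, seq_len], log pi_theta(a_t | s_t) for each sampled token.
    rewards: [batch], one scalar reward per sampled response."""
    per_response = token_logprobs.sum(dim=1) * rewards   # sum_t log pi(a_t|s_t) * R, per response
    return -per_response.mean()                          # minimizing this does gradient ascent on J

# Toy usage: 4 sampled responses of 6 tokens each, with made-up log-probs and rewards.
logprobs = -torch.rand(4, 6)
rewards = torch.tensor([0.8, 0.3, 0.9, 0.5])
print(reinforce_loss(logprobs, rewards))
```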
Why REINFORCE isn't enough
REINFORCE is mathematically correct. But in practice it's too noisy to use for LLMs. Two concrete problems:
- All-positive rewards. If rewards range from 0.3 to 0.9, every token in every response gets reinforced, just by different amounts. We want to push up tokens that are better than expected and push down tokens that are worse than expected.
- Per-response, not per-token. A 200-token response gets one reward $R$. Every token in the response gets the same gradient weight. Were the early tokens good? The late ones? REINFORCE can't tell. It's the credit assignment problem from Section 1, completely unsolved.
Both problems point to the same need: instead of weighting every token by the raw reward $R$, we need a per-token signal that says "was this specific token better or worse than expected?" This is the advantage function.
4. The Advantage Function: Fixing REINFORCE
The advantage function replaces the raw reward $R$ in the policy gradient with a more informative, per-token signal.
The idea
$$A_t = Q(s_t, a_t) - V(s_t)$$

Where:
- $Q(s_t, a_t)$: expected total reward if we take token $a_t$ here, then follow the policy. ("How good is this specific action in this state?")
- $V(s_t)$: expected total reward from state $s_t$ under the current policy, averaged over all possible next tokens. ("How good is this state on average?")
- $A_t$: the advantage — "was this token better or worse than what we'd typically produce here?"
The improved policy gradient
Replacing $R$ with $A_t$ in the REINFORCE formula:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \, A_t \right]$$
Now each token gets its own training signal. Tokens with positive advantage (better than expected) get reinforced. Tokens with negative advantage (worse than expected) get suppressed. This solves both problems: the signal is centered (positive and negative), and it's per-token.
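As a sketch, the only change from the REINFORCE loss above is that the shared scalar reward becomes a per-token `advantages` tensor (how to compute it is exactly what comes next):

```python
import torch

def advantage_weighted_loss(token_logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """token_logprobs: [batch, seq_len]; advantages: [batch, seq_len], one A_t per token."""
    return -(token_logprobs * advantages).sum(dim=1).mean()
```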
The catch
Computing $A_t = Q(s_t, a_t) - V(s_t)$ requires knowing $V(s_t)$, the expected reward from each partial response. But that's exactly the kind of thing we don't have. The reward model only scores complete responses.
We need a way to estimate $V(s_t)$ at every token position. This is the job of a value function, and the next two sections are about how to learn one.
5. Value Functions & The Bellman Equation
We just saw that the advantage function needs $V(s_t)$, the expected final reward given a partial response up to token $t$. This section covers what $V(s)$ is, and the key equation that makes it learnable.
What V(s) represents
$V(s)$ — state value: the expected total reward starting from state $s$ and following the current policy. In LLM terms: "given this prompt and the tokens generated so far, what reward does the final response typically get?"
There's also $Q(s, a)$ — action value: the expected total reward if we choose token $a$ next, then follow the policy. The advantage is their difference: $A_t = Q(s_t, a_t) - V(s_t)$.
The Bellman equation: how to learn V(s)
We can't compute $V(s)$ by enumerating all possible continuations. The Bellman equation provides a recursive shortcut: express $V(s_t)$ in terms of $V(s_{t+1})$.
$$V_\pi(s_t) = \mathbb{E}_{a_t \sim \pi,\ s_{t+1}}\left[ r_t + \gamma \, V_\pi(s_{t+1}) \right]$$

In plain English: the value of where you are = what you get now + the value of where you end up next (in expectation).
Each variable:
- $V_\pi(s_t)$: value of being at state $s_t$ under policy $\pi$.
- $r_t$: immediate reward after taking action $a_t$.
- $\gamma$: discount factor (typically $1.0$ in LLM RL; see Section 2).
- $V_\pi(s_{t+1})$: value of the next state (after generating one more token).
- The expectation is over both the action (which token the policy picks) and the next state. In text generation, the transition $s_t, a_t \to s_{t+1}$ is deterministic (just append the token), so the expectation over $s_{t+1}$ is trivial. But the expectation over actions still matters.
Why Bellman matters
The Bellman equation gives us a training objective for a value model. We train a neural network (the "critic") to predict $V(s_t)$ at every token position. The critic is trained to minimize the Bellman residual, the gap between its prediction $V(s_t)$ and the one-step target $r_t + \gamma V(s_{t+1})$, across sampled transitions.
On any single sample, $V(s_{t+1}) > V(s_t)$ is completely normal. It means that token had positive advantage. The critic aims for self-consistency in expectation, not on every individual token.
The per-sample mismatch $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error. It's the building block for computing advantages, which brings us to how TD errors are used.
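A minimal sketch of the critic's objective under these assumptions (terminal-only reward, $\gamma = 1$ by default, a stop-gradient on the bootstrap target; real PPO implementations often regress on GAE-based returns instead):

```python
import torch

def critic_loss(values: torch.Tensor, rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """values: [T], critic predictions V(s_0)..V(s_{T-1}), one per token position.
    rewards: [T], per-step rewards; in the basic setup all zeros except rewards[-1] = R."""
    next_values = torch.cat([values[1:], values.new_zeros(1)])  # V(s_{t+1}); 0 after the terminal state
    targets = rewards + gamma * next_values.detach()            # r_t + gamma * V(s_{t+1}), held fixed
    td_errors = targets - values                                # delta_t, the Bellman residual
    return (td_errors ** 2).mean()
```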
6. Monte Carlo vs Temporal Difference
We need to learn $V(s)$ and use it to compute advantages. There are two fundamental approaches to generating the training signal. Understanding the trade-off between them is essential because it shows up directly in GAE, the advantage estimator used by PPO.
Monte Carlo (MC): learn from complete outcomes
Wait until the response is fully generated and scored. Then use the actual return $G_t = \sum_{k=t}^{T} r_k$ as the target for $V(s_t)$ at every token position. Under terminal-only reward, $G_t = R$ for all $t$.
- Unbiased: you're using real outcomes, not estimates.
- High variance: one response is one data point for every token. A lucky response inflates all value estimates; an unlucky one deflates them.
- Slow credit assignment: for a 200-token response, every token gets the same final reward as its target. Position 5 and position 195 get the same signal.
Temporal Difference (TD): learn from predictions
Don't wait for the end. After generating each token, update $V(s_t)$ using one step of reality plus the critic's prediction of what comes next.
$$\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)$$

Each variable:
- $\delta_t$: the "surprise," i.e. how much the critic's estimate changed after seeing this token.
- $r_t$: immediate reward (0 for non-terminal tokens in LLM RL).
- $V(s_{t+1})$: the critic's estimate of value after generating token $t$.
- $V(s_t)$: the critic's estimate of value before generating token $t$.
(In practice, many RLHF setups also add a per-token KL penalty, which makes the rewards array non-zero at every position, not just the last one.)
- Lower variance: each update uses local information, not the entire trajectory.
- Biased: the estimate $V(s_{t+1})$ could be wrong. Garbage critic → garbage updates.
- Faster learning: you get a signal at every token, not just at the end.
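A toy numerical comparison of the two kinds of targets for a 5-token response with a single terminal reward (the value estimates are made up):

```python
import numpy as np

R = 0.8
values = np.array([0.50, 0.55, 0.40, 0.60, 0.75])   # critic guesses V(s_0)..V(s_4)
rewards = np.array([0.0, 0.0, 0.0, 0.0, R])         # terminal-only reward
next_values = np.append(values[1:], 0.0)            # V(s_{t+1}), 0 after the terminal state

mc_targets = np.full(5, R)                          # MC: every position trained toward the same outcome
td_targets = rewards + next_values                  # TD(0), gamma = 1: one real step + a bootstrap
td_errors = td_targets - values                     # delta_t, the per-token "surprise"
print(td_errors)                                    # approx. [ 0.05 -0.15  0.20  0.15  0.05]
```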
The spectrum: from TD to MC
MC and TD are not binary choices. They sit on a spectrum. You can blend them by looking $n$ steps ahead before bootstrapping:
| Method | Uses $n$ real steps | Then bootstraps? | Bias | Variance |
|---|---|---|---|---|
| TD(0) | 1 | Yes | High | Low |
| $n$-step TD | $n$ | Yes | Medium | Medium |
| MC | All (to end) | No | None | High |
| TD($\lambda$) | Weighted blend of all | Smoothly | Controllable | Controllable |
TD($\lambda$) is an exponentially-weighted average of all $n$-step returns. The parameter $\lambda \in [0, 1]$ controls the blend: $\lambda = 0$ is pure TD(0), $\lambda = 1$ is pure MC. This matters because GAE, the advantage estimator PPO uses, is exactly TD($\lambda$) applied to advantages.
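For reference, the $\lambda$-return can be written as the standard exponentially-weighted average of $n$-step returns, where $G_t^{(n)}$ is the return that uses $n$ real steps and then bootstraps:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} \, G_t^{(n)}$$

Setting $\lambda = 0$ leaves only $G_t^{(1)}$ (the TD(0) target); as $\lambda \to 1$ the weight shifts entirely onto the full return (MC).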
7. Actor-Critic: Putting It All Together
We now have all the pieces. This section shows how they fit together into the actor-critic architecture, the backbone of PPO.
- Actor = the policy (LLM). Generates tokens. Gets updated via the advantage-weighted policy gradient from Section 4.
- Critic = the value model. Predicts $V(s_t)$ at every token position. Trained via Bellman residual minimization from Section 5. Provides the baseline needed to compute advantages.
GAE: computing advantages from TD errors
We need the advantage $A_t$ at every token position. GAE computes it as an exponentially-weighted sum of TD errors, the same TD errors from Section 6:

$$\hat{A}_t = \sum_{k=0}^{\infty} (\gamma \lambda)^k \, \delta_{t+k}$$
Where $\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})$ is the TD error at step $t+k$.
The two parameters control the bias-variance trade-off:
- $\lambda = 0$: advantage is just the TD error at step $t$. Low variance, high bias.
- $\lambda = 1$: advantage sums all future TD errors. No bias, high variance (equivalent to MC).
- $\lambda \approx 0.95$: the sweet spot used in practice. Mostly looks ahead, with some smoothing.
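In code, GAE is a short backward recursion over the TD errors, using the identity $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch for a single response with terminal-only reward (names are illustrative):

```python
import numpy as np

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 1.0, lam: float = 0.95) -> np.ndarray:
    """rewards: [T] per-step rewards; values: [T] critic predictions V(s_0)..V(s_{T-1})."""
    next_values = np.append(values[1:], 0.0)            # V(s_{t+1}); 0 after the terminal state
    deltas = rewards + gamma * next_values - values     # per-step TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):              # accumulate (gamma * lam)^k * delta_{t+k}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Two sanity checks match the bullets above: with `lam=0` the advantages equal the raw TD errors, and with `lam=1`, `gamma=1` they reduce to $R - V(s_t)$.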
Why the critic is essential (and expensive)
The critic provides $V(s_t)$ at every token position, which is needed to compute TD errors and therefore GAE advantages. Without a critic, you'd fall back to REINFORCE with raw rewards, which is too noisy for long text outputs.
The cost: you need the critic's value estimates alongside the policy, reference model, and reward model. Some implementations use a separate full-sized critic LLM; others share the policy's transformer trunk and add a value head (cheaper but couples the two). Either way, PPO requires juggling more model parameters than critic-free alternatives, typically 3-4 model-equivalents depending on the setup.
8. From Here to PPO and GRPO
Every concept in this tutorial maps directly to a component in the alignment algorithms. Here is how they connect:
| RL concept | Role in PPO | Role in GRPO |
|---|---|---|
| Value function $V(s)$ | Critic model predicts $V(s_t)$ at every token | Not used; replaced by group mean |
| Bellman equation | Critic trained via Bellman consistency (MSE loss) | Not used |
| TD error $\delta_t$ | Building block of GAE advantages | Not used |
| GAE / TD($\lambda$) | Computes per-token advantages from TD errors | Not used; response-level z-score instead |
| Advantage $A_t$ | Per-token, from GAE | Per-response z-score, applied to all tokens |
| Policy gradient | Clipped surrogate objective | Same clipped surrogate objective |
| Baseline / variance reduction | Critic provides baseline $V(s_t)$ | Group mean provides baseline |
The shared mechanism between PPO and GRPO is the clipped policy update: regardless of how advantages are computed, both algorithms clip the probability ratio to prevent the policy from changing too drastically in a single step. Combined with a KL penalty to the reference model, this is what keeps RL training stable.
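A minimal sketch of that shared clipped objective (PyTorch, per-token tensors; the KL penalty to the reference model and all batching details are omitted):

```python
import torch

def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """logp_new / logp_old: per-token log-probs under the current and the sampling policy."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_new(a|s) / pi_old(a|s), per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()              # pessimistic choice, then gradient ascent
```

PPO plugs in per-token GAE advantages; GRPO plugs in the group-normalized response-level advantage broadcast to every token.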
For the full algorithmic details (clipping mechanics, the KL penalty, reward model architecture, the PPO training loop, and GRPO's group normalization), see the PPO & GRPO deep dive.
References: Williams, "Simple Statistical Gradient-Following Algorithms" (1992) · Sutton & Barto, Reinforcement Learning: An Introduction (2018) · Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2016) · Schulman et al., "Proximal Policy Optimization Algorithms" (2017)