Reinforcement Learning for LLMs
An intuition-first guide to the RL concepts behind RLHF, PPO, and GRPO.
The background you need before diving into alignment algorithms.
Why this post exists
first, a rant about RL/ML literature's readability problem
ML/RL literature has a readability problem. Papers and textbooks are dense with notation, and too often the math arrives before the intuition. If you've ever stared at a policy gradient derivation and thought "but why are we doing this?", you're not alone. The barrier is rarely the ideas themselves; it's how they're presented.
The underlying principles of RL, even the parts that power RLHF and GRPO, are surprisingly simple. At every stage, the core question is intuitive: "which tokens made this response good or bad, and how do we produce more of the good ones?" Everything else is machinery to answer that question efficiently.
There's an old idea, often attributed to Feynman: if you can't explain something simply, you don't understand it well enough. This post is my attempt to explain simply. Every equation earns its place only after the intuition is clear, and every concept is introduced exactly when it's needed to solve a concrete problem with the previous approach.
The teaching approach here is inspired by J. Clark Scott's "But How Do It Know? — The Basic Principles of Computers for Everyone", a book that builds an entire computer from NAND gates upward, introducing each piece only when the previous piece creates a need for it. That's the structure I'm aiming for: start with the simplest thing that could work (REINFORCE), hit a wall, and let the wall motivate the next concept.
1. RL for LLMs in One Picture
After pre-training and supervised fine-tuning (SFT), your LLM can generate fluent text. But "fluent" is not the same as "good." The model may be confidently wrong, unhelpfully verbose, or subtly toxic. RL lets you optimize for overall response quality rather than just per-token likelihood.
Here is the setup, reduced to its essentials:
- The LLM is a policy that samples tokens one at a time to produce a response.
- A trajectory is: prompt → sequence of tokens → complete response.
- The reward is usually sparse and delayed: a single score at the very end.
- The central problem is credit assignment: which tokens in a 200-token response were responsible for the reward?
The rest of this tutorial builds up the tools to solve this problem, one piece at a time. We'll start with the simplest approach (REINFORCE), see why it breaks, and then introduce each new concept exactly when it's needed to fix the previous one.
2. RL Vocabulary Mapped to Text Generation
RL has its own jargon, but every term maps cleanly onto text generation. This table is worth internalizing because the rest of the tutorial (and all RLHF literature) uses these terms interchangeably.
| RL concept | In text generation | Example |
|---|---|---|
| State $s_t$ | Prompt + tokens generated so far | "Explain gravity" + "Gravity is" |
| Action $a_t$ | Next token chosen | "a" (the next token) |
| Policy $\pi_\theta$ | The LLM's token distribution | $P(\text{"a"}) = 0.3, P(\text{"the"}) = 0.2, \ldots$ |
| Trajectory $\tau$ | Complete prompt + response | The full generated text |
| Reward $r$ | Score for the response | Reward model output: 0.82 |
| Return $G$ | Total reward (often = terminal reward) | Same as reward when only scored at end |
| Discount $\gamma$ | Weight on future reward | Typically $1.0$: responses are short and finite and the reward arrives only at the end, so no discounting is needed |
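To make the mapping concrete, here is the same vocabulary written out as a toy Python trajectory. The values and variable names are illustrative only, not tied to any particular library:

```python
# A toy trajectory for the prompt "Explain gravity". All values are made up.
prompt = "Explain gravity"
tokens = ["Gravity", " is", " a", " force", "."]                 # actions a_0 ... a_4

# State s_t = prompt + everything generated before step t.
states = [prompt + "".join(tokens[:t]) for t in range(len(tokens))]

reward = 0.82                        # terminal reward from the reward model
returns = [reward] * len(tokens)     # terminal-only reward and gamma = 1, so G_t = R at every step
```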
3. Policy Gradients: The Naive Approach
With the vocabulary in place, let's tackle the core question: how do we update the LLM's weights to produce higher-reward responses?
What we want to do
Our objective is simple to state: maximize the expected reward over responses sampled from the policy.
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

Where $\tau$ is a complete response (sequence of tokens) sampled from the LLM, and $R(\tau)$ is the reward model's score for that response. We want to find $\theta$ that makes $J(\theta)$ as large as possible, so we need the gradient $\nabla_\theta J(\theta)$ in order to do gradient ascent.
Why we can't just differentiate through sampling
In supervised learning, the loss is a smooth function of the model's outputs, so backpropagation works directly. Here, the pipeline is:
$\theta$ → token probabilities → sample discrete tokens → response → $R$
The sampling step is the problem. "Pick the token with ID 3847" is a discrete, non-differentiable operation. The gradient of the reward with respect to $\theta$ doesn't flow back through it. We need a different route.
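To see the break concretely, here is a tiny PyTorch sketch (not from the post; `logits` is a stand-in for the LLM's output at one step). The sampled token ID is an integer tensor that carries no gradient back to the logits:

```python
import torch

logits = torch.randn(5, requires_grad=True)          # pretend these are the LLM's logits
probs = torch.softmax(logits, dim=-1)                # differentiable w.r.t. logits
token_id = torch.multinomial(probs, num_samples=1)   # discrete sample: an integer index

print(token_id.dtype)           # torch.int64 -- an integer, not a float
print(token_id.requires_grad)   # False: no gradient flows from the sampled ID back to logits
```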
The log-derivative trick (REINFORCE)
The key insight (Williams, 1992): we don't need to differentiate through the sampling. We can rewrite the gradient of the expectation in a form that only requires differentiating the log-probabilities, which are smooth functions of $\theta$.
Start by expanding the expected reward:

$$J(\theta) = \sum_\tau \pi_\theta(\tau) \, R(\tau)$$

This sums over all possible responses $\tau$, weighted by the probability the policy assigns to each one. (In practice, responses are sampled, not enumerated, but writing it as a sum makes the algebra clear.) Take the gradient:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta \pi_\theta(\tau) \, R(\tau)$$

The reward $R(\tau)$ doesn't depend on $\theta$ (it's just a number for a given response), so only $\pi_\theta(\tau)$ gets differentiated. Now the trick: multiply and divide by $\pi_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \sum_\tau \pi_\theta(\tau) \, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} \, R(\tau)$$

The identity $\frac{\nabla f}{f} = \nabla \log f$ is the entire trick. What we've done is convert $\sum_\tau \pi_\theta(\tau) \cdot [\ldots]$ back into an expectation under the policy:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right]$$

Since $\pi_\theta(\tau) = \prod_t \pi_\theta(a_t | s_t)$, we have $\log \pi_\theta(\tau) = \sum_t \log \pi_\theta(a_t | s_t)$. This gives the per-token form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \, R \right]$$
Each variable:
- $\nabla_\theta$: gradient with respect to model weights.
- $\log \pi_\theta(a_t | s_t)$: log-probability of the token that was actually chosen at step $t$. This is differentiable with respect to $\theta$ (it's just the log of the LLM's softmax output for that token).
- $R$: the reward for the complete response, used as a scalar weight. (In general RL, each token $t$ would use $G_t$, the return from step $t$ onward. In the terminal-reward LLM setting, $G_t = R$ for all $t$, so we simplify.)
Connection to SFT
Compare the REINFORCE gradient to the SFT (supervised fine-tuning) gradient:
| | SFT | REINFORCE |
|---|---|---|
| Gradient | $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R$ |
| Weight | 1 (always push up) | $R$ (push up if good, down if bad) |
| Tokens | From a fixed dataset | Sampled from the policy itself |
SFT always increases the probability of the target tokens. REINFORCE does the same thing, but weighted by how good the result was. It's "SFT with a dial."
Estimating the expectation
In practice, we can't sum over all possible responses. We approximate the expectation by sampling $N$ responses from the current policy:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \, R_i$$
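In code, this estimate is one line of reward-weighted log-probs. A minimal PyTorch sketch, assuming you already have the per-token log-probabilities of the sampled tokens (masking for variable-length responses is omitted):

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """token_logprobs: [batch, seq_len], log pi_theta(a_t | s_t) for each sampled token.
    rewards: [batch], one scalar reward per sampled response."""
    per_response = token_logprobs.sum(dim=1) * rewards   # sum_t log pi(a_t|s_t) * R, per response
    return -per_response.mean()                          # minimizing this does gradient ascent on J

# Toy usage: 4 sampled responses of 6 tokens each, with made-up log-probs and rewards.
logprobs = -torch.rand(4, 6)
rewards = torch.tensor([0.8, 0.3, 0.9, 0.5])
print(reinforce_loss(logprobs, rewards))
```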
Why REINFORCE isn't enough
REINFORCE is mathematically correct. But in practice it's too noisy to use for LLMs. Two concrete problems:
- All-positive rewards. If rewards range from 0.3 to 0.9, every token in every response gets reinforced, just by different amounts. We want to push up tokens that are better than expected and push down tokens that are worse than expected.
- Per-response, not per-token. A 200-token response gets one reward $R$. Every token in the response gets the same gradient weight. Were the early tokens good? The late ones? REINFORCE can't tell. It's the credit assignment problem from Section 1, completely unsolved.
Both problems point to the same need: instead of weighting every token by the raw reward $R$, we need a per-token signal that says "was this specific token better or worse than expected?" This is the advantage function.
4. The Advantage Function: Fixing REINFORCE
The advantage function replaces the raw reward $R$ in the policy gradient with a more informative, per-token signal.
The idea
$$A_t = Q(s_t, a_t) - V(s_t)$$

Where:
- $Q(s_t, a_t)$: expected total reward if we take token $a_t$ here, then follow the policy. ("How good is this specific action in this state?")
- $V(s_t)$: expected total reward from state $s_t$ under the current policy, averaged over all possible next tokens. ("How good is this state on average?")
- $A_t$: the advantage — "was this token better or worse than what we'd typically produce here?"
The improved policy gradient
Replacing $R$ with $A_t$ in the REINFORCE formula:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \, A_t \right]$$
Now each token gets its own training signal. Tokens with positive advantage (better than expected) get reinforced. Tokens with negative advantage (worse than expected) get suppressed. This solves both problems: the signal is centered (positive and negative), and it's per-token.
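As a sketch, the only change from the REINFORCE loss above is that the shared scalar reward becomes a per-token `advantages` tensor (how to compute it is exactly what comes next):

```python
import torch

def advantage_weighted_loss(token_logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """token_logprobs: [batch, seq_len]; advantages: [batch, seq_len], one A_t per token."""
    return -(token_logprobs * advantages).sum(dim=1).mean()
```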
The catch
Computing $A_t = Q(s_t, a_t) - V(s_t)$ requires knowing $V(s_t)$, the expected reward from each partial response. But that's exactly the kind of thing we don't have. The reward model only scores complete responses.
We need a way to estimate $V(s_t)$ at every token position. This is the job of a value function, and the next two sections are about how to learn one.
5. Value Functions & The Bellman Equation
We just saw that the advantage function needs $V(s_t)$, the expected final reward given a partial response up to token $t$. This section covers what $V(s)$ is, and the key equation that makes it learnable.
What V(s) represents
$V(s)$ — state value: the expected total reward starting from state $s$ and following the current policy. In LLM terms: "given this prompt and the tokens generated so far, what reward does the final response typically get?"
There's also $Q(s, a)$ — action value: the expected total reward if we choose token $a$ next, then follow the policy. The advantage is their difference: $A_t = Q(s_t, a_t) - V(s_t)$.
The Bellman equation: how to learn V(s)
We can't compute $V(s)$ by enumerating all possible continuations. The Bellman equation provides a recursive shortcut: express $V(s_t)$ in terms of $V(s_{t+1})$.
$$V_\pi(s_t) = \mathbb{E}_{a_t \sim \pi,\ s_{t+1}}\left[ r_t + \gamma \, V_\pi(s_{t+1}) \right]$$

In plain English: the value of where you are = what you get now + the value of where you end up next (in expectation).
Each variable:
- $V_\pi(s_t)$: value of being at state $s_t$ under policy $\pi$.
- $r_t$: immediate reward after taking action $a_t$.
- $\gamma$: discount factor (typically $1.0$ in LLM RL; see Section 2).
- $V_\pi(s_{t+1})$: value of the next state (after generating one more token).
- The expectation is over both the action (which token the policy picks) and the next state. In text generation, the transition $s_t, a_t \to s_{t+1}$ is deterministic (just append the token), so the expectation over $s_{t+1}$ is trivial. But the expectation over actions still matters.
Why Bellman matters
The Bellman equation gives us a training objective for a value model. We train a neural network (the "critic") to predict $V(s_t)$ at every token position. The critic is trained to minimize the Bellman residual, the gap between its prediction $V(s_t)$ and the one-step target $r_t + \gamma V(s_{t+1})$, across sampled transitions.
On any single sample, $V(s_{t+1}) > V(s_t)$ is completely normal. It means that token had positive advantage. The critic aims for self-consistency in expectation, not on every individual token.
The per-sample mismatch $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error. It's the building block for computing advantages, which brings us to how TD errors are used.
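A minimal sketch of the critic's objective under these assumptions (terminal-only reward, $\gamma = 1$ by default, a stop-gradient on the bootstrap target; real PPO implementations often regress on GAE-based returns instead):

```python
import torch

def critic_loss(values: torch.Tensor, rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """values: [T], critic predictions V(s_0)..V(s_{T-1}), one per token position.
    rewards: [T], per-step rewards; in the basic setup all zeros except rewards[-1] = R."""
    next_values = torch.cat([values[1:], values.new_zeros(1)])  # V(s_{t+1}); 0 after the terminal state
    targets = rewards + gamma * next_values.detach()            # r_t + gamma * V(s_{t+1}), held fixed
    td_errors = targets - values                                # delta_t, the Bellman residual
    return (td_errors ** 2).mean()
```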
6. Monte Carlo vs Temporal Difference
We need to learn $V(s)$ and use it to compute advantages. There are two fundamental approaches to generating the training signal. Understanding the trade-off between them is essential because it shows up directly in GAE, the advantage estimator used by PPO.
Monte Carlo (MC): learn from complete outcomes
Wait until the response is fully generated and scored. Then use the actual return $G_t = \sum_{k=t}^{T} r_k$ as the target for $V(s_t)$ at every token position. Under terminal-only reward, $G_t = R$ for all $t$.
- Unbiased: you're using real outcomes, not estimates.
- High variance: one response is one data point for every token. A lucky response inflates all value estimates; an unlucky one deflates them.
- Slow credit assignment: for a 200-token response, every token gets the same final reward as its target. Position 5 and position 195 get the same signal.
Temporal Difference (TD): learn from predictions
Don't wait for the end. After generating each token, update $V(s_t)$ using one step of reality plus the critic's prediction of what comes next.
$$\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)$$

Each variable:
- $\delta_t$: the "surprise," i.e. how much the critic's estimate changed after seeing this token.
- $r_t$: immediate reward (0 for non-terminal tokens in LLM RL).
- $V(s_{t+1})$: the critic's estimate of value after generating token $t$.
- $V(s_t)$: the critic's estimate of value before generating token $t$.
(In practice, many RLHF setups also add a per-token KL penalty, which makes the rewards array non-zero at every position, not just the last one.)
- Lower variance: each update uses local information, not the entire trajectory.
- Biased: the estimate $V(s_{t+1})$ could be wrong. Garbage critic → garbage updates.
- Faster learning: you get a signal at every token, not just at the end.
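A toy numerical comparison of the two kinds of targets for a 5-token response with a single terminal reward (the value estimates are made up):

```python
import numpy as np

R = 0.8
values = np.array([0.50, 0.55, 0.40, 0.60, 0.75])   # critic guesses V(s_0)..V(s_4)
rewards = np.array([0.0, 0.0, 0.0, 0.0, R])         # terminal-only reward
next_values = np.append(values[1:], 0.0)            # V(s_{t+1}), 0 after the terminal state

mc_targets = np.full(5, R)                          # MC: every position trained toward the same outcome
td_targets = rewards + next_values                  # TD(0), gamma = 1: one real step + a bootstrap
td_errors = td_targets - values                     # delta_t, the per-token "surprise"
print(td_errors)                                    # approx. [ 0.05 -0.15  0.20  0.15  0.05]
```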
The spectrum: from TD to MC
MC and TD are not binary choices. They sit on a spectrum. You can blend them by looking $n$ steps ahead before bootstrapping:
| Method | Uses $n$ real steps | Then bootstraps? | Bias | Variance |
|---|---|---|---|---|
| TD(0) | 1 | Yes | High | Low |
| $n$-step TD | $n$ | Yes | Medium | Medium |
| MC | All (to end) | No | None | High |
| TD($\lambda$) | Weighted blend of all | Smoothly | Controllable | Controllable |
TD($\lambda$) is an exponentially-weighted average of all $n$-step returns. The parameter $\lambda \in [0, 1]$ controls the blend: $\lambda = 0$ is pure TD(0), $\lambda = 1$ is pure MC. This matters because GAE, the advantage estimator PPO uses, is exactly TD($\lambda$) applied to advantages.
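For reference, the $\lambda$-return can be written as the standard exponentially-weighted average of $n$-step returns, where $G_t^{(n)}$ is the return that uses $n$ real steps and then bootstraps:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} \, G_t^{(n)}$$

Setting $\lambda = 0$ leaves only $G_t^{(1)}$ (the TD(0) target); as $\lambda \to 1$ the weight shifts entirely onto the full return (MC).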
7. Actor-Critic: Putting It All Together
We now have all the pieces. This section shows how they fit together into the actor-critic architecture, the backbone of PPO.
- Actor = the policy (LLM). Generates tokens. Gets updated via the advantage-weighted policy gradient from Section 4.
- Critic = the value model. Predicts $V(s_t)$ at every token position. Trained via Bellman residual minimization from Section 5. Provides the baseline needed to compute advantages.
GAE: computing advantages from TD errors
We need the advantage $A_t$ at every token position. GAE computes it as an exponentially-weighted sum of TD errors, the same TD errors from Section 6:

$$\hat{A}_t = \sum_{k=0}^{\infty} (\gamma \lambda)^k \, \delta_{t+k}$$
Where $\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})$ is the TD error at step $t+k$.
The two parameters control the bias-variance trade-off:
- $\lambda = 0$: advantage is just the TD error at step $t$. Low variance, high bias.
- $\lambda = 1$: advantage sums all future TD errors. No bias, high variance (equivalent to MC).
- $\lambda \approx 0.95$: the sweet spot used in practice. Mostly looks ahead, with some smoothing.
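In code, GAE is a short backward recursion over the TD errors, using the identity $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch for a single response with terminal-only reward (names are illustrative):

```python
import numpy as np

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 1.0, lam: float = 0.95) -> np.ndarray:
    """rewards: [T] per-step rewards; values: [T] critic predictions V(s_0)..V(s_{T-1})."""
    next_values = np.append(values[1:], 0.0)            # V(s_{t+1}); 0 after the terminal state
    deltas = rewards + gamma * next_values - values     # per-step TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):              # accumulate (gamma * lam)^k * delta_{t+k}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Two sanity checks match the bullets above: with `lam=0` the advantages equal the raw TD errors, and with `lam=1`, `gamma=1` they reduce to $R - V(s_t)$.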
Why the critic is essential (and expensive)
The critic provides $V(s_t)$ at every token position, which is needed to compute TD errors and therefore GAE advantages. Without a critic, you'd fall back to REINFORCE with raw rewards, which is too noisy for long text outputs.
The cost: you need the critic's value estimates alongside the policy, reference model, and reward model. Some implementations use a separate full-sized critic LLM; others share the policy's transformer trunk and add a value head (cheaper but couples the two). Either way, PPO requires juggling more model parameters than critic-free alternatives, typically 3-4 model-equivalents depending on the setup.
8. From Here to PPO and GRPO
Every concept in this tutorial maps directly to a component in the alignment algorithms. Here is how they connect:
| RL concept | Role in PPO | Role in GRPO |
|---|---|---|
| Value function $V(s)$ | Critic model predicts $V(s_t)$ at every token | Not used; replaced by group mean |
| Bellman equation | Critic trained via Bellman consistency (MSE loss) | Not used |
| TD error $\delta_t$ | Building block of GAE advantages | Not used |
| GAE / TD($\lambda$) | Computes per-token advantages from TD errors | Not used; response-level z-score instead |
| Advantage $A_t$ | Per-token, from GAE | Per-response z-score, applied to all tokens |
| Policy gradient | Clipped surrogate objective | Same clipped surrogate objective |
| Baseline / variance reduction | Critic provides baseline $V(s_t)$ | Group mean provides baseline |
The shared mechanism between PPO and GRPO is the clipped policy update: regardless of how advantages are computed, both algorithms clip the probability ratio to prevent the policy from changing too drastically in a single step. Combined with a KL penalty to the reference model, this is what keeps RL training stable.
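A minimal sketch of that shared clipped objective (PyTorch, per-token tensors; the KL penalty to the reference model and all batching details are omitted):

```python
import torch

def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """logp_new / logp_old: per-token log-probs under the current and the sampling policy."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_new(a|s) / pi_old(a|s), per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()              # pessimistic choice, then gradient ascent
```

PPO plugs in per-token GAE advantages; GRPO plugs in the group-normalized response-level advantage broadcast to every token.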
For the full algorithmic details (clipping mechanics, the KL penalty, reward model architecture, the PPO training loop, and GRPO's group normalization), see the PPO & GRPO deep dive.
References: Williams, "Simple Statistical Gradient-Following Algorithms" (1992) · Sutton & Barto, Reinforcement Learning: An Introduction (2018) · Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2016) · Schulman et al., "Proximal Policy Optimization Algorithms" (2017)