PPO & GRPO for LLM Alignment
A first-principles guide for ML engineers with minimal RL background.
Inspired by Yuge Shi's guide.
1. Background: From Pre-training to SFT
Modern LLMs are built in stages. The first stage, pre-training, trains a transformer on a massive text corpus (web pages, books, code) using next-token prediction. The model learns to predict the most likely next word given everything before it. After pre-training, the model can complete text fluently, but it has no concept of "being helpful" or "following instructions." Ask it a question and it might continue with another question, or produce a Wikipedia-style paragraph that doesn't address what you asked.
Supervised fine-tuning (SFT) fixes this. You collect a dataset of (instruction, desired_response) pairs written by humans, then fine-tune the pre-trained model on these examples using the same next-token prediction loss. For example:
| Instruction | Desired response |
|---|---|
| Explain gradient descent in one paragraph. | Gradient descent is an optimization algorithm that iteratively adjusts parameters by moving in the direction of steepest decrease of the loss function... |
| Write a Python function to reverse a string. | def reverse(s): return s[::-1] |
| Is 127 a prime number? | Yes. 127 is not divisible by 2, 3, 5, 7, or 11, and $\sqrt{127} < 12$, so it is prime. |
After SFT, the model learns the format: given an instruction, produce a helpful response. SFT datasets typically contain 10k-100k examples. The resulting model is often called the SFT model or SFT checkpoint, and it serves as the starting point for the RL stage.
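To make the SFT mechanics concrete, here is a minimal sketch of the loss in PyTorch: next-token cross-entropy computed on the response tokens only, with the prompt positions masked out. It assumes a Hugging-Face-style model that returns `.logits`; the function and variable names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    """Next-token prediction loss on the response tokens only.

    input_ids:  (batch, seq_len) prompt + response token ids
    prompt_len: (batch,) number of prompt tokens in each example
    """
    logits = model(input_ids).logits                 # (batch, seq_len, vocab)
    # Shift so the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask out prompt positions: we only imitate the response.
    positions = torch.arange(shift_labels.size(1), device=input_ids.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_len.unsqueeze(1) - 1)
    shift_labels[prompt_mask] = -100                 # ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```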
What SFT gets right
- The model follows instructions and produces structured responses.
- It learns the right tone and format from the demonstration data.
- It works well for tasks that are well-represented in the SFT dataset.
Where SFT falls short
SFT trains on a fixed set of "gold" responses. The model learns to imitate those specific outputs, but it never sees what a bad response looks like or learns to distinguish good from bad. This creates several problems:
- No negative signal. If the model generates a subtly wrong answer during inference, SFT gave it no mechanism to recognize or avoid that mistake. It only learned "produce text like these examples."
- Exposure bias. During SFT, the model always sees ground-truth tokens as context. During inference, it conditions on its own (potentially wrong) earlier tokens. Errors compound.
- Ceiling on quality. The model can't exceed the quality of the demonstration data. If annotators wrote decent but not optimal responses, that's the best the model can do.
This is where reinforcement learning enters the picture.
2. Why RL on Top of SFT?
RL addresses the gaps that SFT leaves. Instead of imitating fixed examples, the model generates its own responses and receives a reward signal indicating how good they were. Three properties make this effective:
- Ranking is easier than generating. Humans can say "response A is better than response B" much more easily than they can write the perfect response from scratch. RL lets us exploit this comparison signal.
- Whole-response optimization. SFT optimizes per-token likelihood. It doesn't directly optimize for "was the entire response good?", which is what we actually care about. RL optimizes a scalar reward over the full output.
- Exploration. The model tries different responses and learns from what works, rather than only imitating fixed demonstrations. It can discover strategies not present in the SFT data.
The approach: train a reward model to approximate human judgment, then use RL to steer the SFT model toward higher-reward responses.
3. The Big Picture: RLHF Pipeline
The standard pipeline has three steps:
1. SFT: fine-tune the pre-trained model on human-written demonstrations (Section 1).
2. Reward model training: collect human preference comparisons and train a model to score responses (Section 4).
3. RL fine-tuning: optimize the SFT model against the reward model with PPO or GRPO, constrained by a KL penalty to stay close to the SFT checkpoint (Sections 5-8).
The complexity lives in Steps 2 and 3.
4. The Reward Model
What it is
A reward model takes a (prompt, response) pair and outputs a single scalar score: how good is this response? It is a learned proxy for human judgment.
Architecture
A language model with a scalar head replacing the vocabulary projection. Concretely:
- Start with a pre-trained LLM (often the same checkpoint you're about to fine-tune, or a similar-sized model).
- Remove the language modeling head (the vocabulary projection layer).
- Add a single linear layer that maps the final hidden state to a scalar: `hidden_dim -> 1`.
- The scalar is typically extracted from the last token position of the response (the EOS token), since it has attended to the entire sequence.
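A minimal sketch of this architecture in PyTorch, assuming a Hugging-Face-style backbone that exposes `last_hidden_state`; the class and attribute names (`RewardModel`, `value_head`) are illustrative, not a specific library's API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """LLM backbone + scalar head: one score per (prompt, response) pair."""

    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone                      # pre-trained transformer, LM head removed
        self.value_head = nn.Linear(hidden_dim, 1)    # hidden_dim -> 1

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state                           # (batch, seq_len, hidden_dim)
        # Index of the last non-padding token (typically the response's EOS token).
        last_idx = attention_mask.long().sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar scores
```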
Training data
You need a dataset of comparisons. For each prompt, humans see two (or more) model responses and rank them. Each training example is a triple (prompt, chosen_response, rejected_response), where the chosen response is the one the annotator preferred.
Training objective: Bradley-Terry model
We want the reward model to assign a higher score to the chosen response than the rejected one. We use the Bradley-Terry preference model, which says:

$$P(\text{chosen} \succ \text{rejected} \mid p) = \sigma\big(R_\phi(p, r_\text{chosen}) - R_\phi(p, r_\text{rejected})\big)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. This is just a sigmoid applied to the score difference. If the chosen response scores much higher than the rejected one, the sigmoid is close to 1 (high probability, low loss). The training loss is the negative log-likelihood:

$$L(\phi) = -\mathbb{E}\Big[\log \sigma\big(R_\phi(p, r_\text{chosen}) - R_\phi(p, r_\text{rejected})\big)\Big]$$
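In code, the loss is a one-liner on top of the reward model sketched above (`reward_model` refers to that illustrative module):

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Negative log-likelihood of the human preference under the Bradley-Terry model."""
    r_chosen = reward_model(chosen_ids, chosen_mask)        # (batch,)
    r_rejected = reward_model(rejected_ids, rejected_mask)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```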
Training details
- The backbone LM weights are typically fine-tuned end-to-end (not frozen) for best results, but this is expensive. Some setups freeze the backbone and only train the head + last few layers.
- Training is standard supervised learning (no RL involved). Batch of pairs, compute loss, backprop.
- Typical dataset sizes: 50k-500k comparison pairs.
- Once trained, the reward model is frozen during RL. It's only used for inference.
5. Policy Gradients: The Core Idea
Before getting to PPO, let's build intuition for the idea it is built on: the policy gradient.
The setup
Your LLM is a policy $\pi_\theta$. Given a prompt, it generates a response by sampling tokens one at a time. Each token is an "action." The full response is a sequence of actions. The reward model scores the complete response.
In RL notation: the state $s_t$ is the prompt plus all tokens generated so far (up to position $t$), and the action $a_t$ is the next token chosen. The policy $\pi_\theta(a_t | s_t)$ is just the LLM's next-token probability distribution.
Our goal: adjust $\theta$ (the LLM weights) so the model produces higher-reward responses.
The fundamental problem
We can't just do gradient descent on the reward directly, because the reward depends on discrete token choices (sampling is non-differentiable). We need a way to get gradients through the sampling process.
The REINFORCE trick
The key insight from the REINFORCE algorithm (Williams, 1992): we don't need to differentiate through the sampling. Instead, we can use the log-derivative trick:

$$\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

The right-hand side is an expectation over samples from the policy itself, so we can estimate it by sampling: generate responses, then weight each token's log-probability gradient by the reward of the whole response.
Concretely:
- Sample a response from your current policy.
- Score it with the reward model to get $R$.
- If $R$ is high: increase the probability of those tokens (positive gradient).
- If $R$ is low: decrease the probability of those tokens (negative gradient).
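As a sketch, here is the surrogate loss whose gradient matches that estimator for a single sampled response (`logprobs` are the per-token log-probabilities of the tokens that were sampled; `reward` is the scalar from the reward model, treated as a constant):

```python
def reinforce_loss(logprobs, reward):
    """REINFORCE surrogate for one response, negated for minimization.

    Minimizing this increases the log-probability of every sampled token,
    scaled by the scalar reward of the whole response.

    logprobs: (T,) tensor of log pi_theta(a_t | s_t) for the sampled tokens
    reward:   scalar reward-model score for the complete response
    """
    return -(reward * logprobs).sum()
```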
The problem with vanilla REINFORCE
Using the raw reward $R$ for every token is noisy. Consider: if you generate a 200-token response and get a reward of 0.8, which tokens were responsible? The early tokens? The late ones? All of them equally? This is the credit assignment problem.
Also, rewards might all be positive (e.g., 0.3 to 0.9), meaning every token gets reinforced, just by different amounts. We want to push up tokens that are better than expected and push down tokens that are worse than expected. This is where the advantage function comes in.
6. PPO: Step by Step
The advantage function
Instead of weighting the gradient by raw reward $R$, we use the advantage:

$$A_t = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
Where:
- $Q^\pi(s_t, a_t)$: the expected total reward if we take action $a_t$ in state $s_t$ and then continue following the current policy $\pi$. ("How good is this specific action here, given how we currently behave?")
- $V^\pi(s_t)$: the expected total reward from state $s_t$ under the current policy. ("How good is this state on average?")
- $A_t$: the advantage. "Was this action better or worse than what we'd typically do here?"
Computing $Q$ and $V$ exactly is intractable. We need to estimate them. This is where the critic comes in. (We'll cover the critic architecture in the next section.)
Generalized Advantage Estimation (GAE)
In the LLM setting, the reward model only gives a score at the very end of the response. There are no intermediate rewards. So how do we estimate per-token advantages?
We train a critic (value function) $V_\psi(s_t)$ that predicts, from any partial response, the expected final reward. Then we compute advantages using GAE:

$$A_t^{\text{GAE}} = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma \, V_\psi(s_{t+1}) - V_\psi(s_t)$$

where $\delta_t$ is the temporal-difference (TD) error at step $t$ and $r_t$ is the per-token reward.
The RM reward is terminal: it scores the full response at the end. However, many RLHF implementations (including InstructGPT) add a per-token KL penalty as reward shaping: $r_t = -\beta \, \text{KL}_t$ at each non-terminal step, where $\text{KL}_t = \log \pi_\theta(a_t|s_t) - \log \pi_\text{ref}(a_t|s_t)$. The final token gets $r_T = R_\phi(p, r) - \beta \, \text{KL}_T$. In the simplest case without per-token KL shaping, $r_t = 0$ for non-terminal tokens and $\delta_t$ simplifies to:
$$\delta_t = \gamma \, V_\psi(s_{t+1}) - V_\psi(s_t)$$

This is the critic saying: "after generating this token, I now think the response will be worth $V_\psi(s_{t+1})$ instead of $V_\psi(s_t)$, so this token's contribution is the difference."
$\gamma$ (discount factor, ~0.99) and $\lambda$ (GAE parameter, ~0.95) control how far ahead we look. This is an exponentially-weighted sum of TD errors, a smooth blend between "just look one step ahead" ($\lambda=0$) and "use the full remaining return" ($\lambda=1$).
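A sketch of GAE computed backwards over one response. It assumes `rewards` holds the per-token rewards (zeros everywhere except the last position, or including the per-token KL shaping) and `values` holds the critic's estimates, with no bootstrap value after the final token:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single response.

    rewards: (T,) per-token rewards
    values:  (T,) critic estimates V(s_t) at each token position
    returns: (T,) advantages A_t
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0     # no state after the final token
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                      # exponentially-weighted sum
        advantages[t] = gae
    return advantages
```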
The clipping mechanism
Now we have per-token advantages. The naive approach: use them directly in the policy gradient. If we update too aggressively, the policy can change drastically in one step and collapse (catastrophic forgetting, degenerate outputs, etc.).
PPO's solution: clip the policy update to prevent large changes.
First, define the probability ratio:

$$c_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$

where $\pi_{\theta_\text{old}}$ is the policy that generated the response (the weights before this update).
If $c_t = 1$, the new policy assigns the same probability as the old policy to this token. If $c_t = 1.5$, the new policy is 50% more likely to generate this token than before.
The $\text{clip}$ function clamps a value to a range:

$$\text{clip}(x, a, b) = \max(a, \min(x, b))$$

The PPO clipped objective uses this to bound the probability ratio:

$$L^{\text{clip}}(\theta) = \mathbb{E}_t\Big[\min\big(c_t(\theta)\, A_t,\ \text{clip}(c_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)\Big]$$
Where $\epsilon$ is typically 0.2. The clipped ratio $\text{clip}(c_t, 1-\epsilon, 1+\epsilon)$ forces $c_t$ to stay in $[1-\epsilon, 1+\epsilon]$. Here's what the $\min$ with this clipped version does:
- If $A_t > 0$ (good action): the objective wants to increase $c_t$ (make this token more likely). But the clip caps the benefit at $c_t = 1+\epsilon$. Beyond that, no extra gradient. The update is capped.
- If $A_t < 0$ (bad action): the objective wants to decrease $c_t$ (make this token less likely). But the clip caps the penalty at $c_t = 1-\epsilon$. You can't over-correct.
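A sketch of the clipped surrogate as a per-token loss in PyTorch (negated because optimizers minimize; `old_logprobs` are stored from the sampling pass and treated as constants):

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped PPO surrogate, negated for minimization.

    logprobs:     (T,) log pi_theta(a_t | s_t) under the current policy
    old_logprobs: (T,) log-probs under the policy that generated the response
    advantages:   (T,) per-token advantages (e.g., from GAE)
    """
    ratio = torch.exp(logprobs - old_logprobs)         # c_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```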
The full PPO objective for RLHF
Vanilla PPO is usually presented with clipping + value loss + entropy bonus. (The PPO paper also discusses a KL-penalty variant as an alternative to clipping, but it's a KL against the previous iterate, not a fixed reference.) In the RLHF setting for LLMs, we add a separate KL divergence term against the frozen SFT reference policy to prevent reward hacking. This "RLHF-flavored PPO" combines several terms (the objective is maximized with respect to $\theta$; the $w_i$ are scalar weights):

$$L_{\text{PPO}}(\theta, \psi) = L^{\text{clip}}(\theta) + w_1\, H(\theta) - w_2\, \text{KL}(\theta) - w_3\, L(\psi)$$
Each term serves a purpose:
| Term | What it does | Why |
|---|---|---|
| $L^{\text{clip}}$ | Clipped policy gradient | Make the model produce better responses |
| $H(\theta)$ | Entropy bonus: $-\mathbb{E}[\log \pi_\theta(a_t|s_t)]$ | Prevent the model from becoming too deterministic too fast |
| $\text{KL}(\theta)$ | KL divergence from the original SFT model | Prevent reward hacking; keep outputs coherent |
| $L(\psi)$ | Critic/value function loss | Train the critic to better estimate future rewards |
PPO training loop: one iteration
1. Sample a batch of prompts and generate a response for each with the current policy, storing the per-token log-probabilities (these become $\pi_{\theta_\text{old}}$ for this iteration).
2. Score each complete response with the frozen reward model.
3. Build per-token rewards: the KL penalty against the reference policy at every step, plus the reward model score on the final token.
4. Run the critic over every prefix to get per-token values, then compute advantages with GAE.
5. Update the policy with the clipped objective (plus entropy bonus) and update the critic by regressing its values toward the observed returns.
6. Move on to the next batch of prompts and repeat.
7. The Critic Model
The critic (also called the value model) estimates per-token values during PPO training.
What problem it solves
The reward model gives one score for the entire response. But we need a training signal for each token. The critic bridges this gap: given a prompt and a partial response (up to token $t$), it predicts the expected final reward.
Architecture
Structurally identical to the reward model: another LLM with a scalar head.
Training
The critic is trained during the RL loop (not beforehand). Its loss is simple regression: predict the actual observed reward.

$$L(\psi) = \mathbb{E}_t\Big[\big(V_\psi(s_t) - R_\text{target}(s_t)\big)^2\Big]$$
Where $R_\text{target}(s_t)$ is the actual return from position $t$ onward. In the simplest case (no intermediate rewards, discount $\gamma \approx 1$), this is just the reward model score for the complete response. So the critic at every token position is being trained to predict the final reward, given only a partial response.
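In code this is just a mean-squared error (a sketch; `values` are the critic's per-token predictions and `returns` are the regression targets, e.g. the reward-model score or advantages plus old values from the GAE pass):

```python
def critic_loss(values, returns):
    """Regress the critic's per-token value predictions toward the observed returns."""
    return ((values - returns.detach()) ** 2).mean()
```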
Initialization
Common choices:
- From the reward model checkpoint. A reasonable starting point since the reward model already understands response quality. Just swap the head.
- From the SFT model. The model that generated the responses. It understands the distribution.
- From scratch (random head, pre-trained backbone). Works but converges slower.
Why this is expensive
During PPO training, you have four models in memory:
| Model | Role | Updated? |
|---|---|---|
| Policy (LLM) | Generates responses, gets updated | Yes |
| Reference policy | Frozen copy for KL penalty | No |
| Reward model | Scores complete responses | No |
| Critic | Estimates per-token values | Yes |
That's roughly 4x the memory of a single LLM. If your policy is 7B parameters, you need memory for ~28B parameters (plus optimizer states for the two that are being trained). This memory cost is the main practical obstacle to PPO.
8. GRPO: Dropping the Critic
Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper, asks: can we get a useful advantage signal without training a critic?
The key insight
Instead of training a separate model to estimate "how good is this partial response," just generate multiple responses for the same prompt and compare them against each other.
How it works
No critic, no GAE, no temporal-difference learning. For each prompt, sample a group of $G$ responses from the current policy and score each one with the reward model, giving $R_1, \dots, R_G$. The advantage for response $i$ is simply how much better or worse it scored compared to the group average, normalized by the group's spread:

$$A_i = \frac{R_i - \text{mean}(R_1, \dots, R_G)}{\text{std}(R_1, \dots, R_G)}$$
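As a sketch (assuming `rewards` holds the $G$ reward-model scores for one prompt's group):

```python
import torch

def group_advantages(rewards, eps=1e-8):
    """GRPO advantage: z-score of each response's reward within its group.

    rewards: (G,) reward-model scores for G responses to the same prompt
    returns: (G,) one scalar advantage per response
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```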
What about per-token signal?
In GRPO, the same advantage $A_i$ is applied to every token in response $i$. This is cruder than PPO's per-token advantages from GAE. But it turns out to work well in practice, especially because:
- With a large enough group size, the normalization provides a strong signal.
- The clipping mechanism still prevents over-correction.
- For tasks with verifiable answers (math, code), the reward signal is clearer.
The GRPO objective
$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\Big(c_{i,t}(\theta)\, A_i,\ \text{clip}\big(c_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i\Big) - \beta\, \text{KL}\big(\pi_\theta \,\|\, \pi_\text{ref}\big)\right]$$

Notice: same clipped surrogate as PPO, same KL penalty against the frozen reference, but no critic loss term and no entropy bonus (one can optionally be added). The advantage $A_i$ is the z-score from above, applied uniformly to all tokens in response $i$.
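A sketch of the resulting per-token loss, reusing `group_advantages` from above and broadcasting each response's advantage to all of its tokens. The KL term is shown as the simple per-token log-ratio for clarity; the GRPO paper uses an unbiased estimator, a detail omitted here.

```python
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages, mask, eps=0.2, beta=0.02):
    """Clipped GRPO surrogate for a group of G responses, negated for minimization.

    logprobs, old_logprobs, ref_logprobs: (G, T) per-token log-probs under the current,
        sampling, and frozen reference policies (responses padded to length T)
    advantages: (G,) one advantage per response (z-score within the group)
    mask:       (G, T) 1 for real response tokens, 0 for padding
    """
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    adv = advantages.unsqueeze(1)                 # (G, 1): broadcast over tokens
    surrogate = torch.min(ratio * adv, clipped * adv)
    kl = logprobs - ref_logprobs                  # simple per-token KL estimate
    per_token = surrogate - beta * kl
    # Average over tokens within each response, then over the group.
    per_response = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_response.mean()
```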
Memory comparison
| Component | PPO | GRPO |
|---|---|---|
| Policy | Yes | Yes |
| Reference policy | Yes | Yes |
| Reward model | Yes | Yes |
| Critic | Yes | No |
| Approx. memory | ~4x policy size | ~3x policy size |
The trade-off: GRPO needs more compute per iteration (generating $G$ responses per prompt instead of 1), but less memory (no critic). In practice, the memory savings often matter more: it can be the difference between fitting on your cluster or not.
9. Practical Notes
What you need to train
- An SFT checkpoint to start from (it also serves as the frozen reference policy).
- A reward signal: a learned reward model trained on preference comparisons, or a rule-based reward for verifiable tasks.
- A dataset of prompts to sample responses from during RL.
- Memory for roughly 3-4 model copies (policy, reference, reward model, and, for PPO, a critic), plus optimizer states for the models being trained.
Common pitfalls
- Reward hacking. The model finds degenerate responses that get high reward scores but are actually bad. The KL penalty is your main defense. Monitor response quality during training, not just reward scores.
- Critic lag. In PPO, if the critic is poorly trained, the advantages are garbage and training diverges. Warm-starting the critic from the reward model helps.
- KL coefficient tuning. Too low and you get reward hacking. Too high and the model barely moves from the SFT checkpoint. InstructGPT used $\beta = 0.02$ (order of $10^{-2}$). Values above 0.05 are less common in practice but implementation-dependent. Some setups use adaptive KL, where the coefficient adjusts automatically to hit a target KL budget per update (see the sketch after this list).
- Group size in GRPO. Too small and the advantage estimates are noisy. Too large and it gets expensive. 8-64 responses per prompt is typical.
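A minimal sketch of an adaptive KL controller in the spirit of the InstructGPT setup; the clipping constant and parameter names here are illustrative assumptions, not a specific library's API.

```python
class AdaptiveKLController:
    """Nudges the KL coefficient so the observed KL tracks a target budget."""

    def __init__(self, init_beta=0.02, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_samples):
        # Proportional error, clipped so one bad batch can't blow up the coefficient.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta
```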
Rule-based rewards vs learned reward models
If your task has verifiable outcomes (correct answer, code compiles, recommendation leads to click/engagement), consider skipping the learned reward model entirely and using rule-based rewards with GRPO. This eliminates an entire source of noise and reward hacking.
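For instance, a minimal rule-based reward for math problems with a known final answer might look like the sketch below; the convention of grading the last number in the response is an illustrative assumption.

```python
import re

def math_reward(response: str, gold_answer: str) -> float:
    """1.0 if the last number in the response matches the gold answer (as a string), else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0
```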
References: Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992) · Schulman et al., "Proximal Policy Optimization Algorithms" (2017) · Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022) · Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024) · DeepSeek-AI, "DeepSeek-R1" (2025) · Yuge Shi, "A Vision Researcher's Guide to PPO & GRPO" (2025)