A first-principles guide for ML engineers with minimal RL background.
Inspired by Yuge Shi's guide.

1. Background: From Pre-training to SFT

Modern LLMs are built in stages. The first stage, pre-training, trains a transformer on a massive text corpus (web pages, books, code) using next-token prediction. The model learns to predict the most likely next word given everything before it. After pre-training, the model can complete text fluently, but it has no concept of "being helpful" or "following instructions." Ask it a question and it might continue with another question, or produce a Wikipedia-style paragraph that doesn't address what you asked.

Supervised fine-tuning (SFT) fixes this. You collect a dataset of (instruction, desired_response) pairs written by humans, then fine-tune the pre-trained model on these examples using the same next-token prediction loss. For example:

| Instruction | Desired response |
|---|---|
| Explain gradient descent in one paragraph. | Gradient descent is an optimization algorithm that iteratively adjusts parameters by moving in the direction of steepest decrease of the loss function... |
| Write a Python function to reverse a string. | `def reverse(s): return s[::-1]` |
| Is 127 a prime number? | Yes. 127 is not divisible by 2, 3, 5, 7, or 11, and $\sqrt{127} < 12$, so it is prime. |

After SFT, the model learns the format: given an instruction, produce a helpful response. SFT datasets typically contain 10k-100k examples. The resulting model is often called the SFT model or SFT checkpoint, and it serves as the starting point for the RL stage.
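
As a concrete sketch of the SFT loss (assuming the common, but not universal, convention of masking prompt tokens with -100 so that only the response tokens are scored):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss for SFT.

    logits: (T, vocab_size) model outputs for the concatenated instruction + response
    labels: (T,) next-token targets; prompt positions set to -100 so only the
            response tokens contribute to the loss (a common convention, assumed here)
    """
    return F.cross_entropy(logits, labels, ignore_index=-100)
```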

What SFT gets right

SFT is simple and effective at what it targets: with a relatively small dataset and the same next-token loss used in pre-training, the model learns the instruction → response format and a consistently helpful tone.

Where SFT falls short

SFT trains on a fixed set of "gold" responses. The model learns to imitate those specific outputs, but it never sees what a bad response looks like or learns to distinguish good from bad. This creates several problems: there is no negative signal to discourage plausible-but-wrong answers, the model only ever trains on human-written text rather than its own generations, and response quality is capped by what the annotators wrote, with no way to express that one acceptable answer is better than another.

This is where reinforcement learning enters the picture.

2. Why RL on Top of SFT?

RL addresses the gaps that SFT leaves. Instead of imitating fixed examples, the model generates its own responses and receives a reward signal indicating how good they were. Three properties make this effective: the model trains on its own outputs rather than someone else's (no mismatch between training and generation), it gets a signal for bad responses as well as good ones, and a scalar reward can express fuzzy judgments ("this answer is more helpful than that one") that are hard to capture in a single gold response.

The approach: train a reward model to approximate human judgment, then use RL to steer the SFT model toward higher-reward responses.

3. The Big Picture: RLHF Pipeline

Step 1. Collect preference data: humans rank response A vs. B.
Step 2. Train a reward model: it learns to score (prompt, response) pairs.
Step 3. RL fine-tune: generate, score, update the policy. The result is the aligned LLM.

The complexity lives in Steps 2 and 3.

4. The Reward Model

What it is

A reward model takes a (prompt, response) pair and outputs a single scalar score: how good is this response? It is a learned proxy for human judgment.

Architecture

A language model with a scalar head replacing the vocabulary projection. Concretely:

prompt tokens + response tokens + EOS → Transformer (pre-trained LLM backbone, frozen or fine-tuned) → hidden state at the EOS position only → Linear(d → 1) (new layer) → R(prompt, response)
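
A minimal PyTorch sketch of this architecture, assuming a Hugging Face backbone and right-padded batches (the class and defaults here are illustrative, not a specific library's API):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)            # pre-trained LLM backbone
        self.scalar_head = nn.Linear(self.backbone.config.hidden_size, 1)   # replaces the vocab projection

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state  # (B, T, d)
        # assumes right padding: the last non-padded position is the EOS token
        eos_pos = attention_mask.sum(dim=1) - 1                              # (B,)
        eos_hidden = hidden[torch.arange(hidden.size(0)), eos_pos]           # (B, d)
        return self.scalar_head(eos_hidden).squeeze(-1)                      # (B,) scalar R(prompt, response)
```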

Training data

You need a dataset of comparisons. For each prompt, humans see two (or more) model responses and rank them. The dataset looks like:

(prompt, response_chosen, response_rejected)
(prompt, response_chosen, response_rejected)
...

Training objective: Bradley-Terry model

We want the reward model to assign a higher score to the chosen response than the rejected one. We use the Bradley-Terry preference model, which says:

Probability that response $i$ is preferred over response $j$:
$$P(r_i \succ r_j) = \frac{\exp(R_\phi(p, r_i))}{\exp(R_\phi(p, r_i)) + \exp(R_\phi(p, r_j))} = \sigma\big(R_\phi(p, r_i) - R_\phi(p, r_j)\big)$$

Where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. This is just a sigmoid applied to the score difference. If the chosen response scores much higher than the rejected one, the sigmoid is close to 1 (high probability, low loss). The training loss is the negative log-likelihood:

Reward model loss:
$$\mathcal{L}(\phi) = -\log \sigma\big(R_\phi(p, r_\text{chosen}) - R_\phi(p, r_\text{rejected})\big)$$
Intuition: This is almost identical to binary cross-entropy. We're training a classifier that says "which response is better?" but parameterized through scalar scores. The key insight: we don't need absolute scores, just correct relative ordering. That's why the loss only depends on the score difference.
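
In code, the loss is essentially one line; a sketch assuming `chosen_scores` and `rejected_scores` are the reward model's outputs for a batch of comparisons:

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # -log sigma(R_chosen - R_rejected), averaged over the batch of comparisons
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```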

Training details

Key limitation: The reward model is trained on complete responses: human comparisons are between full outputs, not partial prefixes. Architecturally it could produce a scalar for any prefix, but those scores are unreliable since the model never saw partial completions during training. So we treat the RM reward as terminal: one score at the end of the response. Getting good per-token credit assignment requires either a critic (PPO) or a relative baseline (GRPO).

5. Policy Gradients: The Core Idea

Before getting to PPO, let's build intuition for the policy gradient idea it is built on.

The setup

Your LLM is a policy $\pi_\theta$. Given a prompt, it generates a response by sampling tokens one at a time. Each token is an "action." The full response is a sequence of actions. The reward model scores the complete response.

In RL notation: the state $s_t$ is the prompt plus all tokens generated so far (up to position $t$), and the action $a_t$ is the next token chosen. The policy $\pi_\theta(a_t | s_t)$ is just the LLM's next-token probability distribution.

Our goal: adjust $\theta$ (the LLM weights) so the model produces higher-reward responses.

The fundamental problem

We can't just do gradient descent on the reward directly, because the reward depends on discrete token choices (sampling is non-differentiable). We need a way to get gradients through the sampling process.

The REINFORCE trick

The key insight from the REINFORCE algorithm (Williams, 1992): we don't need to differentiate through the sampling. Instead, we can use the log-derivative trick:

Policy gradient:
$$\nabla_\theta \mathbb{E}[R] = \mathbb{E}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R\Big]$$

Concretely:

  1. Sample a response from your current policy.
  2. Score it with the reward model to get $R$.
  3. If $R$ is high: increase the probability of those tokens (positive gradient).
  4. If $R$ is low: decrease the probability of those tokens (negative gradient).
Intuition: This is conceptually similar to SFT, but instead of always pushing up the probability of the target tokens, we weight the gradient by how good the reward was. Good responses get reinforced; bad ones get suppressed.
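
A minimal sketch of the resulting loss for one sampled response (`token_logprobs` is assumed to hold the per-token log-probabilities of the sampled tokens under the current policy):

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """
    token_logprobs: (T,) log pi_theta(a_t | s_t) for each sampled token
    reward:         scalar score from the reward model for the full response
    """
    # maximizing E[ sum_t log pi(a_t|s_t) * R ] == minimizing its negative
    return -(token_logprobs.sum() * reward)
```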

The problem with vanilla REINFORCE

Using the raw reward $R$ for every token is noisy. Consider: if you generate a 200-token response and get a reward of 0.8, which tokens were responsible? The early tokens? The late ones? All of them equally? This is the credit assignment problem.

Also, rewards might all be positive (e.g., 0.3 to 0.9), meaning every token gets reinforced, just by different amounts. We want to push up tokens that are better than expected and push down tokens that are worse than expected. This is where the advantage function comes in.

6. PPO: Step by Step

The advantage function

Instead of weighting the gradient by raw reward $R$, we use the advantage:

Advantage:
$$A^\pi_t = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Where $Q^\pi(s_t, a_t)$ is the expected final reward if the policy picks token $a_t$ in state $s_t$ and then keeps generating as usual, and $V^\pi(s_t)$ is the expected final reward from state $s_t$, averaged over all tokens the policy might pick next.

Intuition: Suppose at token position 50, the model usually produces responses worth ~0.7 reward. If a particular token choice leads to a reward of 0.9, the advantage is positive (+0.2), so reinforce it. If it leads to 0.5, the advantage is negative (-0.2), so suppress it. The advantage function provides a per-token training signal from a per-response reward.

Computing $Q$ and $V$ exactly is intractable. We need to estimate them. This is where the critic comes in. (We'll cover the critic architecture in the next section.)

Generalized Advantage Estimation (GAE)

In the LLM setting, the reward model only gives a score at the very end of the response. There are no intermediate rewards. So how do we estimate per-token advantages?

We train a critic (value function) $V_\psi(s_t)$ that predicts, from any partial response, the expected final reward. Then we compute advantages using GAE:

TD (temporal difference) error:
$$\delta_t = r_t + \gamma \, V_\psi(s_{t+1}) - V_\psi(s_t)$$

The RM reward is terminal: it scores the full response at the end. However, many RLHF implementations (including InstructGPT) add a per-token KL penalty as reward shaping: $r_t = -\beta \, \text{KL}_t$ at each non-terminal step, where $\text{KL}_t = \log \pi_\theta(a_t|s_t) - \log \pi_\text{ref}(a_t|s_t)$. The final token gets $r_T = R_\phi(p, r) - \beta \, \text{KL}_T$. In the simplest case without per-token KL shaping, $r_t = 0$ for non-terminal tokens and $\delta_t$ simplifies to:

$$\delta_t = \gamma \, V_\psi(s_{t+1}) - V_\psi(s_t)$$

This is the critic saying: "after generating this token, I now think the response will be worth $V_\psi(s_{t+1})$ instead of $V_\psi(s_t)$, so this token's contribution is the difference."

GAE advantage:
$$A^{\text{GAE}}_t = \sum_{k=0}^{T-t-1} (\gamma\lambda)^k \, \delta_{t+k}$$

$\gamma$ (discount factor, ~0.99) and $\lambda$ (GAE parameter, ~0.95) control how far ahead we look. This is an exponentially-weighted sum of TD errors, a smooth blend between "just look one step ahead" ($\lambda=0$) and "use the full remaining return" ($\lambda=1$).
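
A minimal sketch of the GAE recursion, assuming per-token rewards set up as described above (zeros or KL-shaping terms, with the reward model score folded into the final step) and critic values padded with a trailing zero after the last token:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """
    rewards: (T,)   per-token rewards; rewards[-1] includes the reward model score
    values:  (T+1,) critic values V(s_0), ..., V(s_T); values[T] = 0 (nothing after EOS)
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                          # exponentially weighted sum of deltas
        advantages[t] = gae
    return advantages
```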

The clipping mechanism

Now we have per-token advantages. The naive approach: use them directly in the policy gradient. If we update too aggressively, the policy can change drastically in one step and collapse (catastrophic forgetting, degenerate outputs, etc.).

PPO's solution: clip the policy update to prevent large changes.

First, define the probability ratio:

Probability ratio:
$$c_t = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)}$$

If $c_t = 1$, the new policy assigns the same probability as the old policy to this token. If $c_t = 1.5$, the new policy is 50% more likely to generate this token than before.

The $\text{clip}$ function clamps a value to a range:

Clip function:
$$\text{clip}(x, a, b) = \begin{cases} a & \text{if } x < a \\ x & \text{if } a \le x \le b \\ b & \text{if } x > b \end{cases}$$

The PPO clipped objective uses this to bound the probability ratio:

PPO clipped surrogate objective:
$$L^{\text{clip}}(\theta) = \mathbb{E}_t \Big[\min\big(c_t \cdot A_t, \;\; \text{clip}(c_t, \, 1-\epsilon, \, 1+\epsilon) \cdot A_t\big)\Big]$$

Where $\epsilon$ is typically 0.2. The clipped ratio $\text{clip}(c_t, 1-\epsilon, 1+\epsilon)$ forces $c_t$ to stay in $[1-\epsilon, 1+\epsilon]$. Taking the $\min$ with this clipped version does two things: when $A_t > 0$, the objective is capped at $(1+\epsilon) A_t$, so once a token is already $1+\epsilon$ times more likely than under the old policy there is no incentive (zero gradient) to push it further; when $A_t < 0$, the objective flattens at $(1-\epsilon) A_t$ once the token has been suppressed by a factor of $1-\epsilon$, while harmful changes (making a bad token much more likely) remain fully penalized, because the $\min$ keeps the unclipped term whenever it is worse.

Intuition: PPO puts a leash on the policy. It says: "improve, but not too much in any single step." Each update makes a small, controlled step in the right direction.
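
A sketch of the clipped surrogate written as a loss to minimize, assuming `old_logprobs` were recorded at sampling time and held fixed across the mini-batch epochs:

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps: float = 0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)         # c_t = pi_new / pi_old, per token
    clipped_ratio = torch.clamp(ratio, 1 - eps, 1 + eps)
    # negative of the clipped surrogate objective (we minimize the loss)
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```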

The full PPO objective for RLHF

Vanilla PPO is usually presented with clipping + value loss + entropy bonus. (The PPO paper also discusses a KL-penalty variant as an alternative to clipping, but it's a KL against the previous iterate, not a fixed reference.) In the RLHF setting for LLMs, we add a separate KL divergence term against the frozen SFT reference policy to prevent reward hacking. This "RLHF-flavored PPO" combines several terms:

Full PPO objective:
$$\mathcal{L}_{\text{PPO}}(\theta, \psi) = \underbrace{L^{\text{clip}}(\theta)}_{\text{policy improvement}} + \underbrace{w_1 \, H(\theta)}_{\text{entropy bonus}} - \underbrace{w_2 \, \text{KL}(\theta)}_{\text{stay near original}} - \underbrace{w_3 \, L(\psi)}_{\text{critic loss}}$$

Each term serves a purpose:

| Term | What it does | Why |
|---|---|---|
| $L^{\text{clip}}$ | Clipped policy gradient | Make the model produce better responses |
| $H(\theta)$ | Entropy bonus: $-\mathbb{E}[\log \pi_\theta(a_t \mid s_t)]$ | Prevent the model from becoming too deterministic too fast |
| $\text{KL}(\theta)$ | KL divergence from the original SFT model | Prevent reward hacking; keep outputs coherent |
| $L(\psi)$ | Critic/value function loss | Train the critic to better estimate future rewards |

PPO training loop: one iteration

1. Sample. Given a batch of prompts, generate responses using the current policy $\pi_{\theta_\text{old}}$. Record the log-probability of each token under this policy.
2. Score. Pass each (prompt, response) to the frozen reward model. Get scalar rewards.
3. Estimate advantages. Run each (prompt, partial_response) through the critic to get $V_\psi(s_t)$ at every token position. Compute GAE advantages.
4. Update policy. For several mini-batch epochs, compute the clipped surrogate objective and update $\theta$. The "old" log-probs from Step 1 stay fixed; we only re-compute the "new" log-probs under the updating $\theta$.
5. Update critic. Using the same data, train the critic by regressing its predictions toward the actual observed returns (reward model scores).

7. The Critic Model

The critic (also called the value model) estimates per-token values during PPO training.

What problem it solves

The reward model gives one score for the entire response. But we need a training signal for each token. The critic bridges this gap: given a prompt and a partial response (up to token $t$), it predicts the expected final reward.

Architecture

Structurally identical to the reward model: another LLM with a scalar head.

prompt tokens + tok 1, tok 2, ..., tok t → Transformer (from SFT or reward model checkpoint) → hidden states at EVERY token position → Linear(d → 1) applied per position → V(s_1), V(s_2), ..., V(s_t)
Key difference from reward model: The reward model produces one score at the end (EOS position). The critic produces a value estimate at every token position. At position $t$, the critic sees the prompt + response tokens up to $t$ (thanks to causal masking) and predicts the final reward.

Training

The critic is trained during the RL loop (not beforehand). Its loss is simple regression: predict the actual observed reward.

Critic loss:
$$L(\psi) = \mathbb{E}_t \Big[\big(V_\psi(s_t) - R_\text{target}(s_t)\big)^2\Big]$$

Where $R_\text{target}(s_t)$ is the actual return from position $t$ onward. In the simplest case (no intermediate rewards, discount $\gamma \approx 1$), this is just the reward model score for the complete response. So the critic at every token position is being trained to predict the final reward, given only a partial response.
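
A sketch of the critic regression under this simplification, where every position's target is the final reward model score:

```python
import torch
import torch.nn.functional as F

def critic_loss(values: torch.Tensor, final_reward: float) -> torch.Tensor:
    """
    values:       (T,) critic predictions V(s_1), ..., V(s_T) for one response
    final_reward: reward model score for the complete response
    """
    targets = torch.full_like(values, final_reward)  # every position regresses toward the final score
    return F.mse_loss(values, targets)
```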

Initialization

Common choices: initialize the critic from the reward model checkpoint (it has already learned to map responses to scalar scores), or from the SFT checkpoint with a freshly initialized scalar head.

Why this is expensive

During PPO training, you have four models in memory:

| Model | Role | Updated? |
|---|---|---|
| Policy (LLM) | Generates responses | Yes |
| Reference policy | Frozen copy for the KL penalty | No |
| Reward model | Scores complete responses | No |
| Critic | Estimates per-token values | Yes |

That's roughly 4x the memory of a single LLM. If your policy is 7B parameters, you need memory for ~28B parameters (plus optimizer states for the two that are being trained). This memory cost is the main practical obstacle to PPO.

8. GRPO: Dropping the Critic

Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper, asks: can we get per-token advantages without a critic?

The key insight

Instead of training a separate model to estimate "how good is this partial response," just generate multiple responses for the same prompt and compare them against each other.

How it works

1. Sample a group. For each prompt, generate $G$ responses (e.g., $G = 8$ or $16$) using the current policy. This gives you a group $\mathcal{G} = \{r_1, r_2, \ldots, r_G\}$.
2. Score all of them. Pass each response through the reward model to get scores $R_1, R_2, \ldots, R_G$.
3. Normalize within the group. Compute the advantage for each response as its z-score within the group:

GRPO advantage:
$$A_i = \frac{R_i - \text{mean}(R_1, \ldots, R_G)}{\text{std}(R_1, \ldots, R_G)}$$

No critic, no GAE, no temporal difference learning. The advantage for response $i$ is simply how much better or worse it scored compared to the group average.

Intuition: Think of it like grading on a curve. Instead of trying to estimate an absolute "expected value" at each token (which requires a critic), we generate a bunch of responses and reinforce the ones that scored above average while suppressing the ones below. The group itself provides the baseline that the critic was trying to learn.
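
A sketch of the group-normalized advantage, assuming `rewards` holds the $G$ reward model scores for one prompt (the small epsilon guards against a zero standard deviation when all responses score identically):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) reward model scores for the G responses sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)  # z-score within the group
```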

What about per-token signal?

In GRPO, the same advantage $A_i$ is applied to every token in response $i$. This is cruder than PPO's per-token advantages from GAE. But it turns out to work well in practice, especially because the reward model's score is terminal anyway (so PPO's per-token credit is itself only a learned estimate), the per-token KL term still regularizes individual tokens, and for outcome-based rewards (math, code) what matters is whether the whole response reaches a correct answer.

The GRPO objective

GRPO objective:
$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_t \Big[\min\big(c_t \cdot A_i, \;\; \text{clip}(c_t, 1-\epsilon, 1+\epsilon) \cdot A_i\big)\Big] - w_1 \, \mathbb{D}_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$$

Notice: same clipped surrogate as PPO, same KL penalty, but no critic loss term; an entropy bonus is optional. The advantage $A_i$ is the z-score from above, applied uniformly to all tokens in response $i$.

Note on notation: The DeepSeek papers write the GRPO objective at the sequence level using $\pi_\theta(o_i | q)$ for the probability of generating full output $o_i$ given query $q$. In practice, implementations compute this via the sum of per-token log-probabilities: $\log \pi_\theta(o_i | q) = \sum_t \log \pi_\theta(a_t | s_t)$. The token-level formulation above is what you actually implement.
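
Putting the pieces together, a token-level sketch of the GRPO loss for one response, with its scalar advantage broadcast to every token; the KL term uses the low-variance per-token estimator from the DeepSeekMath paper, and the weights shown are illustrative defaults, not prescribed values:

```python
import torch

def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, advantage,
              eps: float = 0.2, kl_weight: float = 0.04):
    """
    new_logprobs, old_logprobs, ref_logprobs: (T,) per-token log-probs for response i
    advantage: scalar z-score A_i, applied uniformly to all T tokens
    """
    ratio = torch.exp(new_logprobs - old_logprobs)                 # c_t
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_term = torch.min(ratio * advantage, clipped * advantage)
    # per-token KL(pi_theta || pi_ref): exp(d) - d - 1 with d = log pi_ref - log pi_theta
    log_diff = ref_logprobs - new_logprobs
    kl = torch.exp(log_diff) - log_diff - 1
    return -(policy_term - kl_weight * kl).mean()
```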

Memory comparison

| | PPO | GRPO |
|---|---|---|
| Policy | Yes | Yes |
| Reference policy | Yes | Yes |
| Reward model | Yes | Yes |
| Critic | Yes | No |
| Approx. memory | ~4x policy size | ~3x policy size |

The trade-off: GRPO needs more compute per iteration (generating $G$ responses per prompt instead of 1), but less memory (no critic). In practice, the memory savings often matter more: it can be the difference between fitting on your cluster or not.

DeepSeek's approach: DeepSeek-R1 used GRPO with rule-based rewards (no learned reward model for math tasks, just "is the answer correct?"). This simplifies the pipeline even further: no reward model training, no critic. Sample, check, normalize, update.

9. Practical Notes

What you need to train

1. SFT your base model. This is your starting policy $\pi_\theta$; you'll also keep a frozen copy as the reference $\pi_\text{ref}$ for the KL penalty.
2. Collect preference data. Use your SFT model to generate responses, and have humans (or an AI judge) rank them.
3. Train a reward model on the preference data. (Skip this if you have rule-based rewards, e.g., code passes tests, math answer is correct.)
4. Run RL. PPO if you can afford the memory and want fine-grained credit assignment; GRPO if you want simplicity and lower memory.

Common pitfalls

Reward hacking: if the KL penalty is too weak, the policy finds outputs the reward model scores highly but humans would not. Overly aggressive updates: loosening the clipping range or KL weight too far can collapse the policy into degenerate outputs. Trusting reward-model scores on partial responses: the RM was trained only on complete outputs, so prefix scores are unreliable.

Rule-based rewards vs learned reward models

If your task has verifiable outcomes (correct answer, code compiles, recommendation leads to click/engagement), consider skipping the learned reward model entirely and using rule-based rewards with GRPO. This eliminates an entire source of noise and reward hacking.
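
For example, a rule-based reward for math can be as crude as an exact-match check on the final answer (a deliberately naive sketch; real pipelines extract and normalize answers far more carefully):

```python
def math_reward(response: str, gold_answer: str) -> float:
    """1.0 if the last whitespace-separated token matches the reference answer, else 0.0."""
    tokens = response.strip().split()
    predicted = tokens[-1] if tokens else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0
```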


References: Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992) · Schulman et al., "Proximal Policy Optimization Algorithms" (2017) · Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022) · Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024) · DeepSeek-AI, "DeepSeek-R1" (2025) · Yuge Shi, "A Vision Researcher's Guide to PPO & GRPO" (2025)