A first-principles guide for ML engineers with minimal RL background.
Inspired by Yuge Shi's guide.

1. Background: From Pre-training to SFT

Modern LLMs are built in stages. The first stage, pre-training, trains a transformer on a massive text corpus (web pages, books, code) using next-token prediction. The model learns to predict the most likely next word given everything before it. After pre-training, the model can complete text fluently, but it has no concept of "being helpful" or "following instructions." Ask it a question and it might continue with another question, or produce a Wikipedia-style paragraph that doesn't address what you asked.

Supervised fine-tuning (SFT) fixes this. You collect a dataset of (instruction, desired_response) pairs written by humans, then fine-tune the pre-trained model on these examples using the same next-token prediction loss. For example:

| Instruction | Desired response |
|---|---|
| Explain gradient descent in one paragraph. | Gradient descent is an optimization algorithm that iteratively adjusts parameters by moving in the direction of steepest decrease of the loss function... |
| Write a Python function to reverse a string. | `def reverse(s): return s[::-1]` |
| Is 127 a prime number? | Yes. 127 is not divisible by 2, 3, 5, 7, or 11, and $\sqrt{127} < 12$, so it is prime. |

After SFT, the model learns the format: given an instruction, produce a helpful response. SFT datasets typically contain 10k-100k examples. The resulting model is often called the SFT model or SFT checkpoint, and it serves as the starting point for the RL stage.
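
As a concrete sketch of the SFT loss (assuming the common, but not universal, convention of masking prompt tokens with -100 so that only the response tokens are scored):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss for SFT.

    logits: (T, vocab_size) model outputs for the concatenated instruction + response
    labels: (T,) next-token targets; prompt positions set to -100 so only the
            response tokens contribute to the loss (a common convention, assumed here)
    """
    return F.cross_entropy(logits, labels, ignore_index=-100)
```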

What SFT gets right

SFT is simple and effective at what it targets: with a relatively small dataset and the same next-token loss used in pre-training, the model learns the instruction → response format and a consistently helpful tone.

Where SFT falls short

SFT trains on a fixed set of "gold" responses. The model learns to imitate those specific outputs, but it never sees what a bad response looks like or learns to distinguish good from bad. This creates several problems: there is no negative signal to discourage plausible-but-wrong answers, the model only ever trains on human-written text rather than its own generations, and response quality is capped by what the annotators wrote, with no way to express that one acceptable answer is better than another.

This is where reinforcement learning enters the picture.

2. Why RL on Top of SFT?

RL addresses the gaps that SFT leaves. Instead of imitating fixed examples, the model generates its own responses and receives a reward signal indicating how good they were. Three properties make this effective: the model trains on its own outputs rather than someone else's (no mismatch between training and generation), it gets a signal for bad responses as well as good ones, and a scalar reward can express fuzzy judgments ("this answer is more helpful than that one") that are hard to capture in a single gold response.

The approach: train a reward model to approximate human judgment, then use RL to steer the SFT model toward higher-reward responses.

3. The Big Picture: RLHF Pipeline

Step 1. Collect preference data: humans rank response A vs. B.
Step 2. Train a reward model: it learns to score (prompt, response) pairs.
Step 3. RL fine-tune: generate, score, update the policy. The result is the aligned LLM.

The complexity lives in Steps 2 and 3.

4. The Reward Model

What it is

A reward model takes a (prompt, response) pair and outputs a single scalar score: how good is this response? It is a learned proxy for human judgment.

Architecture

A language model with a scalar head replacing the vocabulary projection. Concretely:

prompt tokens + response tokens + EOS → Transformer (pre-trained LLM backbone, frozen or fine-tuned) → hidden state at the EOS position only → Linear(d → 1) (new layer) → R(prompt, response)
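
A minimal PyTorch sketch of this architecture, assuming a Hugging Face backbone and right-padded batches (the class and defaults here are illustrative, not a specific library's API):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)            # pre-trained LLM backbone
        self.scalar_head = nn.Linear(self.backbone.config.hidden_size, 1)   # replaces the vocab projection

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state  # (B, T, d)
        # assumes right padding: the last non-padded position is the EOS token
        eos_pos = attention_mask.sum(dim=1) - 1                              # (B,)
        eos_hidden = hidden[torch.arange(hidden.size(0)), eos_pos]           # (B, d)
        return self.scalar_head(eos_hidden).squeeze(-1)                      # (B,) scalar R(prompt, response)
```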

Training data

You need a dataset of comparisons. For each prompt, humans see two (or more) model responses and rank them. The dataset looks like:

(prompt, response_chosen, response_rejected)
(prompt, response_chosen, response_rejected)
...

Training objective: Bradley-Terry model

We want the reward model to assign a higher score to the chosen response than the rejected one. We use the Bradley-Terry preference model, which says:

Probability that response $i$ is preferred over response $j$:
$$P(r_i \succ r_j) = \frac{\exp(R_\phi(p, r_i))}{\exp(R_\phi(p, r_i)) + \exp(R_\phi(p, r_j))} = \sigma\big(R_\phi(p, r_i) - R_\phi(p, r_j)\big)$$

Where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. This is just a sigmoid applied to the score difference. If the chosen response scores much higher than the rejected one, the sigmoid is close to 1 (high probability, low loss). The training loss is the negative log-likelihood:

Reward model loss:
$$\mathcal{L}(\phi) = -\log \sigma\big(R_\phi(p, r_\text{chosen}) - R_\phi(p, r_\text{rejected})\big)$$
Intuition: This is almost identical to binary cross-entropy. We're training a classifier that says "which response is better?" but parameterized through scalar scores. The key insight: we don't need absolute scores, just correct relative ordering. That's why the loss only depends on the score difference.
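
In code, the loss is essentially one line; a sketch assuming `chosen_scores` and `rejected_scores` are the reward model's outputs for a batch of comparisons:

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # -log sigma(R_chosen - R_rejected), averaged over the batch of comparisons
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```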

Training details

Key limitation: The reward model is trained on complete responses: human comparisons are between full outputs, not partial prefixes. Architecturally it could produce a scalar for any prefix, but those scores are unreliable since the model never saw partial completions during training. So we treat the RM reward as terminal: one score at the end of the response. Getting good per-token credit assignment requires either a critic (PPO) or a relative baseline (GRPO).

5. Policy Gradients: The Core Idea

Before getting to PPO, let's build intuition for the policy gradient idea it is built on.

The setup

Your LLM is a policy $\pi_\theta$. Given a prompt, it generates a response by sampling tokens one at a time. Each token is an "action." The full response is a sequence of actions. The reward model scores the complete response.

In RL notation: the state $s_t$ is the prompt plus all tokens generated so far (up to position $t$), and the action $a_t$ is the next token chosen. The policy $\pi_\theta(a_t | s_t)$ is just the LLM's next-token probability distribution.

Our goal: adjust $\theta$ (the LLM weights) so the model produces higher-reward responses.

The fundamental problem

We can't just do gradient descent on the reward directly, because the reward depends on discrete token choices (sampling is non-differentiable). We need a way to get gradients through the sampling process.

The REINFORCE trick

The key insight from the REINFORCE algorithm (Williams, 1992): we don't need to differentiate through the sampling. Instead, we can use the log-derivative trick:

Policy gradient:
$$\nabla_\theta \mathbb{E}[R] = \mathbb{E}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R\Big]$$

Concretely:

  1. Sample a response from your current policy.
  2. Score it with the reward model to get $R$.
  3. If $R$ is high: increase the probability of those tokens (positive gradient).
  4. If $R$ is low: decrease the probability of those tokens (negative gradient).
Intuition: This is conceptually similar to SFT, but instead of always pushing up the probability of the target tokens, we weight the gradient by how good the reward was. Good responses get reinforced; bad ones get suppressed.
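
A minimal sketch of the resulting loss for one sampled response (`token_logprobs` is assumed to hold the per-token log-probabilities of the sampled tokens under the current policy):

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """
    token_logprobs: (T,) log pi_theta(a_t | s_t) for each sampled token
    reward:         scalar score from the reward model for the full response
    """
    # maximizing E[ sum_t log pi(a_t|s_t) * R ] == minimizing its negative
    return -(token_logprobs.sum() * reward)
```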

The problem with vanilla REINFORCE

Using the raw reward $R$ for every token is noisy. Consider: if you generate a 200-token response and get a reward of 0.8, which tokens were responsible? The early tokens? The late ones? All of them equally? This is the credit assignment problem.

Also, rewards might all be positive (e.g., 0.3 to 0.9), meaning every token gets reinforced, just by different amounts. We want to push up tokens that are better than expected and push down tokens that are worse than expected. This is where the advantage function comes in.

6. PPO: Step by Step

The advantage function

Instead of weighting the gradient by raw reward $R$, we use the advantage:

Advantage:
$$A^\pi_t = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Where $Q^\pi(s_t, a_t)$ is the expected final reward if the policy picks token $a_t$ in state $s_t$ and then keeps generating as usual, and $V^\pi(s_t)$ is the expected final reward from state $s_t$, averaged over all tokens the policy might pick next.

Intuition: Suppose at token position 50, the model usually produces responses worth ~0.7 reward. If a particular token choice leads to a reward of 0.9, the advantage is positive (+0.2), so reinforce it. If it leads to 0.5, the advantage is negative (-0.2), so suppress it. The advantage function provides a per-token training signal from a per-response reward.

Computing $Q$ and $V$ exactly is intractable. We need to estimate them. This is where the critic comes in. (We'll cover the critic architecture in the next section.)

Generalized Advantage Estimation (GAE)

In the LLM setting, the reward model only gives a score at the very end of the response. There are no intermediate rewards. So how do we estimate per-token advantages?

We train a critic (value function) $V_\psi(s_t)$ that predicts, from any partial response, the expected final reward. Then we compute advantages using GAE:

TD (temporal difference) error:
$$\delta_t = r_t + \gamma \, V_\psi(s_{t+1}) - V_\psi(s_t)$$

The RM reward is terminal: it scores the full response at the end. However, many RLHF implementations (including InstructGPT) add a per-token KL penalty as reward shaping: $r_t = -\beta \, \text{KL}_t$ at each non-terminal step, where $\text{KL}_t = \log \pi_\theta(a_t|s_t) - \log \pi_\text{ref}(a_t|s_t)$. The final token gets $r_T = R_\phi(p, r) - \beta \, \text{KL}_T$. In the simplest case without per-token KL shaping, $r_t = 0$ for non-terminal tokens and $\delta_t$ simplifies to:

$$\delta_t = \gamma \, V_\psi(s_{t+1}) - V_\psi(s_t)$$

This is the critic saying: "after generating this token, I now think the response will be worth $V_\psi(s_{t+1})$ instead of $V_\psi(s_t)$, so this token's contribution is the difference."

GAE advantage:
$$A^{\text{GAE}}_t = \sum_{k=0}^{T-t-1} (\gamma\lambda)^k \, \delta_{t+k}$$

$\gamma$ (discount factor, ~0.99) and $\lambda$ (GAE parameter, ~0.95) control how far ahead we look. This is an exponentially-weighted sum of TD errors, a smooth blend between "just look one step ahead" ($\lambda=0$) and "use the full remaining return" ($\lambda=1$).
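
A minimal sketch of the GAE recursion, assuming per-token rewards set up as described above (zeros or KL-shaping terms, with the reward model score folded into the final step) and critic values padded with a trailing zero after the last token:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """
    rewards: (T,)   per-token rewards; rewards[-1] includes the reward model score
    values:  (T+1,) critic values V(s_0), ..., V(s_T); values[T] = 0 (nothing after EOS)
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                          # exponentially weighted sum of deltas
        advantages[t] = gae
    return advantages
```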

The clipping mechanism

Now we have per-token advantages. The naive approach: use them directly in the policy gradient. If we update too aggressively, the policy can change drastically in one step and collapse (catastrophic forgetting, degenerate outputs, etc.).

PPO's solution: clip the policy update to prevent large changes.

First, define the probability ratio:

Probability ratio:
$$c_t = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)}$$

If $c_t = 1$, the new policy assigns the same probability as the old policy to this token. If $c_t = 1.5$, the new policy is 50% more likely to generate this token than before.

The $\text{clip}$ function clamps a value to a range:

Clip function:
$$\text{clip}(x, a, b) = \begin{cases} a & \text{if } x < a \\ x & \text{if } a \le x \le b \\ b & \text{if } x > b \end{cases}$$

The PPO clipped objective uses this to bound the probability ratio:

PPO clipped surrogate objective:
$$L^{\text{clip}}(\theta) = \mathbb{E}_t \Big[\min\big(c_t \cdot A_t, \;\; \text{clip}(c_t, \, 1-\epsilon, \, 1+\epsilon) \cdot A_t\big)\Big]$$

Where $\epsilon$ is typically 0.2. The clipped ratio $\text{clip}(c_t, 1-\epsilon, 1+\epsilon)$ forces $c_t$ to stay in $[1-\epsilon, 1+\epsilon]$. Taking the $\min$ with this clipped version does two things: when $A_t > 0$, the objective is capped at $(1+\epsilon) A_t$, so once a token is already $1+\epsilon$ times more likely than under the old policy there is no incentive (zero gradient) to push it further; when $A_t < 0$, the objective flattens at $(1-\epsilon) A_t$ once the token has been suppressed by a factor of $1-\epsilon$, while harmful changes (making a bad token much more likely) remain fully penalized, because the $\min$ keeps the unclipped term whenever it is worse.

Intuition: PPO puts a leash on the policy. It says: "improve, but not too much in any single step." Each update makes a small, controlled step in the right direction.
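
A sketch of the clipped surrogate written as a loss to minimize, assuming `old_logprobs` were recorded at sampling time and held fixed across the mini-batch epochs:

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps: float = 0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)         # c_t = pi_new / pi_old, per token
    clipped_ratio = torch.clamp(ratio, 1 - eps, 1 + eps)
    # negative of the clipped surrogate objective (we minimize the loss)
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```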

The full PPO objective for RLHF

Vanilla PPO is usually presented with clipping + value loss + entropy bonus. (The PPO paper also discusses a KL-penalty variant as an alternative to clipping, but it's a KL against the previous iterate, not a fixed reference.) In the RLHF setting for LLMs, we add a separate KL divergence term against the frozen SFT reference policy to prevent reward hacking. This "RLHF-flavored PPO" combines several terms:

Full PPO objective:
$$\mathcal{L}_{\text{PPO}}(\theta, \psi) = \underbrace{L^{\text{clip}}(\theta)}_{\text{policy improvement}} + \underbrace{w_1 \, H(\theta)}_{\text{entropy bonus}} - \underbrace{w_2 \, \text{KL}(\theta)}_{\text{stay near original}} - \underbrace{w_3 \, L(\psi)}_{\text{critic loss}}$$

Each term serves a purpose:

| Term | What it does | Why |
|---|---|---|
| $L^{\text{clip}}$ | Clipped policy gradient | Make the model produce better responses |
| $H(\theta)$ | Entropy bonus: $-\mathbb{E}[\log \pi_\theta(a_t \mid s_t)]$ | Prevent the model from becoming too deterministic too fast |
| $\text{KL}(\theta)$ | KL divergence from the original SFT model | Prevent reward hacking; keep outputs coherent |
| $L(\psi)$ | Critic/value function loss | Train the critic to better estimate future rewards |

PPO training loop: one iteration

1. Sample. Given a batch of prompts, generate responses using the current policy $\pi_{\theta_\text{old}}$. Record the log-probability of each token under this policy.
2. Score. Pass each (prompt, response) to the frozen reward model. Get scalar rewards.
3. Estimate advantages. Run each (prompt, partial_response) through the critic to get $V_\psi(s_t)$ at every token position. Compute GAE advantages.
4. Update policy. For several mini-batch epochs, compute the clipped surrogate objective and update $\theta$. The "old" log-probs from Step 1 stay fixed; we only re-compute the "new" log-probs under the updating $\theta$.
5. Update critic. Using the same data, train the critic by regressing its predictions toward the actual observed returns (reward model scores).

7. The Critic Model

The critic (also called the value model) estimates per-token values during PPO training.

What problem it solves

The reward model gives one score for the entire response. But we need a training signal for each token. The critic bridges this gap: given a prompt and a partial response (up to token $t$), it predicts the expected final reward.

Architecture

Structurally identical to the reward model: another LLM with a scalar head.

prompt tokens + tok 1, tok 2, ..., tok t → Transformer (from SFT or reward model checkpoint) → hidden states at EVERY token position → Linear(d → 1) applied per position → V(s_1), V(s_2), ..., V(s_t)
Key difference from reward model: The reward model produces one score at the end (EOS position). The critic produces a value estimate at every token position. At position $t$, the critic sees the prompt + response tokens up to $t$ (thanks to causal masking) and predicts the final reward.

Training

The critic is trained during the RL loop (not beforehand). Its loss is simple regression: predict the actual observed reward.

Critic loss:
$$L(\psi) = \mathbb{E}_t \Big[\big(V_\psi(s_t) - R_\text{target}(s_t)\big)^2\Big]$$

Where $R_\text{target}(s_t)$ is the actual return from position $t$ onward. In the simplest case (no intermediate rewards, discount $\gamma \approx 1$), this is just the reward model score for the complete response. So the critic at every token position is being trained to predict the final reward, given only a partial response.
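
A sketch of the critic regression under this simplification, where every position's target is the final reward model score:

```python
import torch
import torch.nn.functional as F

def critic_loss(values: torch.Tensor, final_reward: float) -> torch.Tensor:
    """
    values:       (T,) critic predictions V(s_1), ..., V(s_T) for one response
    final_reward: reward model score for the complete response
    """
    targets = torch.full_like(values, final_reward)  # every position regresses toward the final score
    return F.mse_loss(values, targets)
```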

Initialization

Common choices: initialize the critic from the reward model checkpoint (it has already learned to map responses to scalar scores), or from the SFT checkpoint with a freshly initialized scalar head.

Why this is expensive

During PPO training, you have four models in memory:

| Model | Role | Updated? |
|---|---|---|
| Policy (LLM) | Generates responses | Yes |
| Reference policy | Frozen copy for the KL penalty | No |
| Reward model | Scores complete responses | No |
| Critic | Estimates per-token values | Yes |

That's roughly 4x the memory of a single LLM. If your policy is 7B parameters, you need memory for ~28B parameters (plus optimizer states for the two that are being trained). This memory cost is the main practical obstacle to PPO.

8. GRPO: Dropping the Critic

Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper, asks: can we get per-token advantages without a critic?

The key insight

Instead of training a separate model to estimate "how good is this partial response," just generate multiple responses for the same prompt and compare them against each other.

How it works

1. Sample a group. For each prompt, generate $G$ responses (e.g., $G = 8$ or $16$) using the current policy. This gives you a group $\mathcal{G} = \{r_1, r_2, \ldots, r_G\}$.
2. Score all of them. Pass each response through the reward model to get scores $R_1, R_2, \ldots, R_G$.
3. Normalize within the group. Compute the advantage for each response as its z-score within the group:

GRPO advantage:
$$A_i = \frac{R_i - \text{mean}(R_1, \ldots, R_G)}{\text{std}(R_1, \ldots, R_G)}$$

No critic, no GAE, no temporal difference learning. The advantage for response $i$ is simply how much better or worse it scored compared to the group average.

Intuition: Think of it like grading on a curve. Instead of trying to estimate an absolute "expected value" at each token (which requires a critic), we generate a bunch of responses and reinforce the ones that scored above average while suppressing the ones below. The group itself provides the baseline that the critic was trying to learn.
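
A sketch of the group-normalized advantage, assuming `rewards` holds the $G$ reward model scores for one prompt (the small epsilon guards against a zero standard deviation when all responses score identically):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) reward model scores for the G responses sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)  # z-score within the group
```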

What about per-token signal?

In GRPO, the same advantage $A_i$ is applied to every token in response $i$. This is cruder than PPO's per-token advantages from GAE. But it turns out to work well in practice, especially because the reward model's score is terminal anyway (so PPO's per-token credit is itself only a learned estimate), the per-token KL term still regularizes individual tokens, and for outcome-based rewards (math, code) what matters is whether the whole response reaches a correct answer.

The GRPO objective

GRPO objective:
$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_t \Big[\min\big(c_t \cdot A_i, \;\; \text{clip}(c_t, 1-\epsilon, 1+\epsilon) \cdot A_i\big)\Big] - w_1 \, \mathbb{D}_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$$

Notice: same clipped surrogate as PPO, same KL penalty, but no critic loss term; an entropy bonus is optional. The advantage $A_i$ is the z-score from above, applied uniformly to all tokens in response $i$.

Note on notation: The DeepSeek papers write the GRPO objective at the sequence level using $\pi_\theta(o_i | q)$ for the probability of generating full output $o_i$ given query $q$. In practice, implementations compute this via the sum of per-token log-probabilities: $\log \pi_\theta(o_i | q) = \sum_t \log \pi_\theta(a_t | s_t)$. The token-level formulation above is what you actually implement.
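
Putting the pieces together, a token-level sketch of the GRPO loss for one response, with its scalar advantage broadcast to every token; the KL term uses the low-variance per-token estimator from the DeepSeekMath paper, and the weights shown are illustrative defaults, not prescribed values:

```python
import torch

def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, advantage,
              eps: float = 0.2, kl_weight: float = 0.04):
    """
    new_logprobs, old_logprobs, ref_logprobs: (T,) per-token log-probs for response i
    advantage: scalar z-score A_i, applied uniformly to all T tokens
    """
    ratio = torch.exp(new_logprobs - old_logprobs)                 # c_t
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_term = torch.min(ratio * advantage, clipped * advantage)
    # per-token KL(pi_theta || pi_ref): exp(d) - d - 1 with d = log pi_ref - log pi_theta
    log_diff = ref_logprobs - new_logprobs
    kl = torch.exp(log_diff) - log_diff - 1
    return -(policy_term - kl_weight * kl).mean()
```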

Memory comparison

| | PPO | GRPO |
|---|---|---|
| Policy | Yes | Yes |
| Reference policy | Yes | Yes |
| Reward model | Yes | Yes |
| Critic | Yes | No |
| Approx. memory | ~4x policy size | ~3x policy size |

The trade-off: GRPO needs more compute per iteration (generating $G$ responses per prompt instead of 1), but less memory (no critic). In practice, the memory savings often matter more: it can be the difference between fitting on your cluster or not.

DeepSeek's approach: DeepSeek-R1 used GRPO with rule-based rewards (no learned reward model for math tasks, just "is the answer correct?"). This simplifies the pipeline even further: no reward model training, no critic. Sample, check, normalize, update.

9. Practical Notes

What you need to train

1. SFT your base model. This is your starting policy $\pi_\theta$; you'll also keep a frozen copy as the reference $\pi_\text{ref}$ for the KL penalty.
2. Collect preference data. Use your SFT model to generate responses, and have humans (or an AI judge) rank them.
3. Train a reward model on the preference data. (Skip this if you have rule-based rewards, e.g., code passes tests, math answer is correct.)
4. Run RL. PPO if you can afford the memory and want fine-grained credit assignment; GRPO if you want simplicity and lower memory.

Common pitfalls

Reward hacking: if the KL penalty is too weak, the policy finds outputs the reward model scores highly but humans would not. Overly aggressive updates: loosening the clipping range or KL weight too far can collapse the policy into degenerate outputs. Trusting reward-model scores on partial responses: the RM was trained only on complete outputs, so prefix scores are unreliable.

Rule-based rewards vs learned reward models

If your task has verifiable outcomes (correct answer, code compiles, recommendation leads to click/engagement), consider skipping the learned reward model entirely and using rule-based rewards with GRPO. This eliminates an entire source of noise and reward hacking.
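
For example, a rule-based reward for math can be as crude as an exact-match check on the final answer (a deliberately naive sketch; real pipelines extract and normalize answers far more carefully):

```python
def math_reward(response: str, gold_answer: str) -> float:
    """1.0 if the last whitespace-separated token matches the reference answer, else 0.0."""
    tokens = response.strip().split()
    predicted = tokens[-1] if tokens else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0
```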


References: Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (1992) · Schulman et al., "Proximal Policy Optimization Algorithms" (2017) · Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022) · Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024) · DeepSeek-AI, "DeepSeek-R1" (2025) · Yuge Shi, "A Vision Researcher's Guide to PPO & GRPO" (2025)