Reward Modeling and DPO: Learning What "Good" Means
How reward models turn human preferences into training signal, and how DPO skips the reward model entirely.
Companion to PPO & GRPO for LLM Alignment.
1. Why Reward Modeling?
The PPO & GRPO post covered how RL algorithms optimize a policy given a reward signal. But it glossed over a critical question: where does the reward come from?
For math or code, you can check correctness programmatically. But for most alignment goals (helpfulness, safety, tone, factuality in open-ended questions), there is no simple function that returns a score. The quality of "Explain quantum entanglement to a 10-year-old" has no ground truth.
The solution is to learn a reward function from human preferences. This is the reward model, and it is the foundation of the entire RLHF pipeline. If the reward model is wrong, the RL algorithm will optimize the wrong thing, often spectacularly.
- Humans can compare, even when they can't demonstrate. Asking an annotator to write the perfect response is hard. Asking "is A better than B?" is much easier and more reliable.
- Comparison data scales. Given $k$ responses per prompt, you get $\binom{k}{2}$ pairs. 4 responses yield 6 comparison pairs from one annotation session.
- The reward model amortizes human effort. Once trained, it can score millions of responses without further human input.
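The combinatorics in the second bullet checks out directly:

```python
from itertools import combinations

# 4 ranked responses to one prompt (placeholder strings)
responses = ["resp_A", "resp_B", "resp_C", "resp_D"]

# Every unordered pair of distinct responses becomes one comparison pair.
pairs = list(combinations(responses, 2))
# len(pairs) == 6, i.e. C(4, 2) comparison pairs from a single annotation session
```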
2. Preference Data: The Raw Material
Data format
Every preference dataset has the same structure: a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. The chosen/rejected labels come from human annotators who read both responses and pick the better one.
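A minimal record in this shape (field names are illustrative; open preference datasets use a similar chosen/rejected layout):

```python
# One preference triple as it might appear in a JSONL dataset.
record = {
    "prompt": "Explain quantum entanglement to a 10-year-old.",
    "chosen": "Imagine two magic coins that always land the same way...",   # preferred
    "rejected": "Entanglement is a nonlocal correlation of quantum states.",  # dispreferred
}
```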
How the data is collected
The typical loop: sample prompts from the target distribution, generate $k$ responses per prompt from the current model (often at different temperatures or from different checkpoints), and show annotators pairs of responses side by side. Each comparison the annotator makes yields one (prompt, chosen, rejected) triple.
Dataset sizes
Production reward models are trained on 100k-500k comparison pairs. Research-scale experiments sometimes use 10k-50k. OpenAI's InstructGPT used roughly 50k comparisons. Anthropic's early RLHF work used around 170k. Llama 2 used over 1 million.
3. The Reward Model
Architecture
A reward model is a language model with a scalar output head instead of a vocabulary head. You take a pre-trained LLM, remove the final projection to vocabulary size, and replace it with a linear layer that maps the last hidden state to a single number.
Concretely: the input is the concatenation of the prompt $x$ and response $y$. The model processes the entire sequence through the transformer. The hidden state at the final token position (EOS) has attended to all tokens via causal attention, so it summarizes the entire (prompt, response) pair. The linear head maps this $d$-dimensional vector to a single scalar $R(x, y)$.
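A toy sketch of the scalar head. The transformer itself is stubbed out with a deterministic placeholder; `encode_last_hidden`, `w`, and `b` are illustrative names, not a real library API:

```python
import random

D = 8  # hidden size (toy value; real models use thousands)

def encode_last_hidden(prompt: str, response: str) -> list[float]:
    # Stand-in for the transformer: in a real model this is the hidden state
    # at the final (EOS) position after processing prompt + response.
    rng = random.Random(hash((prompt, response)) & 0xFFFFFFFF)
    return [rng.uniform(-1, 1) for _ in range(D)]

# The scalar head: a single linear layer (d -> 1) replacing the vocab projection.
w = [0.1] * D
b = 0.0

def reward(prompt: str, response: str) -> float:
    h = encode_last_hidden(prompt, response)
    return sum(wi * hi for wi, hi in zip(w, h)) + b  # R(x, y): one number

score = reward("Explain entanglement.", "Imagine two magic coins...")
```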
The Bradley-Terry model
We want the reward model to assign higher scores to preferred responses. The Bradley-Terry model formalizes this. Given a prompt $x$, a chosen response $y_w$, and a rejected response $y_l$, the probability that $y_w$ is preferred is:

$$P(y_w \succ y_l \mid x) = \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)$$

Where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function and $R_\phi$ is the reward model parameterized by $\phi$. The key property: preference depends only on the score difference, not the absolute values. The training loss is the negative log-likelihood:

$$\mathcal{L}_\text{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)\right]$$
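As a sanity check, the loss for a single pair is a few lines; note the invariance to shifting both scores by the same constant:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(score_chosen: float, score_rejected: float) -> float:
    # Negative log-likelihood of the Bradley-Terry preference probability.
    # Only the score *difference* matters, not the absolute values.
    return -math.log(sigmoid(score_chosen - score_rejected))

loss_a = bt_loss(2.0, 1.0)      # ~0.3133
loss_b = bt_loss(102.0, 101.0)  # identical: same difference, same loss
```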
Where $\mathcal{D}$ is the dataset of preference triples. Each variable:
- $x$ is the prompt.
- $y_w$ is the chosen (winning) response.
- $y_l$ is the rejected (losing) response.
- $R_\phi(x, y)$ is the scalar score assigned by the reward model.
- $\sigma$ is the sigmoid function.
Why Bradley-Terry and not something simpler?
You might wonder: why not just train a regression model where chosen responses are labeled 1 and rejected are labeled 0? The Bradley-Terry framing is better for two reasons:
- Transitivity. Scalar scores give you a total ordering. If $R(A) > R(B) > R(C)$, the model automatically predicts $A \succ C$ even if it never saw that pair. A binary classifier over pairs does not guarantee consistency.
- Calibrated margin. The sigmoid on score differences means a pair that's "barely better" gets a small gradient, while a pair that's "clearly better but scored wrong" gets a large gradient. This is the right behavior for noisy human labels.
4. Reward Modeling in Practice
Model sizing
The reward model is typically the same size as (or slightly smaller than) the policy model. OpenAI's InstructGPT used a 6B reward model for a 175B policy. Llama 2 used a 70B reward model for a 70B policy. Anthropic has experimented with reward models up to 52B.
Smaller reward models are cheaper to run during RL (since every generated response needs scoring), but less accurate. The trade-off is simple: if your reward model is significantly less capable than your policy, the policy will quickly find outputs that fool the reward model.
Training recipe
The standard recipe: initialize from the SFT checkpoint, swap the vocabulary head for the scalar head, and fine-tune on comparison pairs with the Bradley-Terry loss above. Training typically runs for about one epoch at a low learning rate; reward models overfit quickly, and additional epochs tend to hurt held-out preference accuracy.
Reward normalization
Raw reward model scores drift during training. A score of 2.3 means nothing in absolute terms. Common normalization strategies:
- Per-batch normalization: subtract the batch mean and divide by standard deviation. Keeps scores centered around 0.
- Clipping: clip rewards to $[-5, 5]$ or similar range to prevent outlier rewards from causing large policy updates.
- Length penalty: longer responses tend to get higher reward scores simply because they contain more information. Some teams normalize by response length or add an explicit length penalty.
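The first two strategies combine naturally into one helper (a sketch; the default `clip` value matches the range above):

```python
import statistics

def normalize_rewards(rewards: list[float], clip: float = 5.0) -> list[float]:
    # Per-batch normalization: center at 0, scale to unit variance,
    # then clip so a single outlier can't dominate the policy update.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [max(-clip, min(clip, (r - mean) / std)) for r in rewards]

batch = [2.3, -0.5, 1.1, 40.0]  # one outlier reward
normed = normalize_rewards(batch)
```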
5. Reward Model Failure Modes
Reward hacking
The most important failure mode. The policy finds outputs that score high according to the reward model but are clearly bad to a human. This happens because the reward model is an imperfect proxy, and RL is very good at exploiting imperfect proxies.
Common reward hacking patterns:
- Verbosity exploit: the policy learns that longer responses get higher scores, so it generates unnecessarily verbose outputs. Almost every RLHF system sees this.
- Sycophancy: the model learns to agree with the user's stated position regardless of accuracy, because annotators sometimes prefer agreeable responses.
- Formatting tricks: excessive use of bullet points, headers, bold text, or emoji that correlate with higher reward but add no substance.
- Hedging: "as an AI language model..." prefixes that reduce perceived risk of wrong answers, which annotators may rate as "safer."
The KL penalty
The standard defense against reward hacking. During RL, the actual reward used is:

$$r(x, y) = R_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$$

Where $\pi_\text{ref}$ is the reference policy (typically the SFT model), $\pi_\theta$ is the current policy being trained, and $\beta$ controls the strength of the penalty. The log-ratio term is a per-sample estimate of the KL divergence, which measures how far the current policy has moved from the reference. Higher $\beta$ keeps the policy closer to the SFT model (less reward hacking but less improvement). Lower $\beta$ allows more exploration (more improvement but more risk of exploitation).
Typical $\beta$ values: 0.01 to 0.2. This is one of the most sensitive hyperparameters in RLHF.
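Assuming sequence-level log-probabilities are available, the shaping is a one-liner (function and argument names are illustrative):

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_ref: float,
                  beta: float = 0.05) -> float:
    # KL-penalized reward: subtract beta times the log-ratio, a per-sample
    # estimate of how far the policy has drifted from the reference.
    return rm_score - beta * (logp_policy - logp_ref)

# A response the policy has pushed far above its reference probability
# is penalized, even though the raw reward-model score is identical:
r_close = shaped_reward(1.0, logp_policy=-20.0, logp_ref=-20.5)  # 0.975
r_drift = shaped_reward(1.0, logp_policy=-5.0,  logp_ref=-20.5)  # 0.225
```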
Other failure modes
- Position bias: annotators tend to prefer the first response shown to them. If your annotation UI always shows responses in the same order, this bias leaks into the reward model. Mitigation: randomize presentation order.
- Length bias: annotators tend to prefer longer responses. The reward model inherits this bias. Mitigation: normalize by length, or include length-matched pairs in training data.
- Annotator disagreement: on subjective tasks, annotators disagree 20-35% of the time. The reward model is trained on a majority vote or random annotator choice, smoothing over genuine diversity in preferences.
6. DPO: Skip the Reward Model
Direct Preference Optimization (Rafailov et al., 2023) is a fundamentally different approach. Instead of training a separate reward model and then doing RL, DPO directly optimizes the policy on preference data. No reward model. No RL loop. Just supervised learning on preference pairs.
The key insight
The DPO paper makes a mathematical observation. In the standard RLHF objective, the optimal policy $\pi^*$ under a reward function $R$ with KL constraint has a closed-form solution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_\text{ref}(y \mid x) \exp\!\left(\frac{1}{\beta} R(x, y)\right)$$

Where $Z(x)$ is a normalization constant (partition function) and $\beta$ is the KL penalty coefficient. This can be rearranged to express the reward in terms of the policy:

$$R(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$$

The reward is proportional to how much the optimal policy upweights a response relative to the reference policy. Now substitute this into the Bradley-Terry model. The $\log Z(x)$ terms cancel (since they appear in both the chosen and rejected reward), giving:

$$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$
Each variable:
- $\pi_\theta$ is the policy being trained (initialized from $\pi_\text{ref}$).
- $\pi_\text{ref}$ is the reference policy (frozen SFT model).
- $y_w, y_l$ are the chosen and rejected responses.
- $\beta$ controls regularization strength (same role as in RLHF).
- $\log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ is the implicit reward: how much the policy has upweighted this response relative to the reference.
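The loss is cheap to compute from four sequence log-probabilities. A sketch for a single pair (argument names are illustrative):

```python
import math

def dpo_loss(logp_w_policy: float, logp_w_ref: float,
             logp_l_policy: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    # Implicit rewards: beta times the policy/reference log-ratio.
    r_w = beta * (logp_w_policy - logp_w_ref)
    r_l = beta * (logp_l_policy - logp_l_ref)
    # Bradley-Terry negative log-likelihood on the implicit reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# At initialization the policy equals the reference, so both log-ratios
# are zero and the loss is exactly -log(sigmoid(0)) = log 2.
loss0 = dpo_loss(-12.0, -12.0, -15.0, -15.0)
```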
The DPO pipeline
The pipeline is short: start from an SFT model, freeze a copy as the reference $\pi_\text{ref}$, gather preference pairs, and run standard supervised training with the DPO loss. There is no reward model to train and no generation loop during optimization.
What happens during training
For each preference pair $(x, y_w, y_l)$, DPO computes the log-probabilities of both responses under both the current policy $\pi_\theta$ and the frozen reference policy $\pi_\text{ref}$. This requires four forward passes per batch element (two responses, two models). In practice, the reference model log-probabilities are precomputed and cached, reducing to two forward passes.
The gradient pushes in two directions simultaneously:
- Increase the probability of the chosen response $y_w$ (relative to the reference).
- Decrease the probability of the rejected response $y_l$ (relative to the reference).
The gradient magnitude is weighted by how wrong the current model is. If the model already strongly prefers $y_w$, the sigmoid saturates and the gradient is small. If the model prefers $y_l$ (the wrong answer), the gradient is large. This is the same self-correcting property as in the reward model loss.
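This weighting can be made concrete: for a pair with log-ratio margin $m$, the per-pair gradient scale is $\beta \, \sigma(-\beta m)$, which vanishes as the model gets the pair right and grows toward $\beta$ when the model is wrong:

```python
import math

def dpo_grad_weight(margin: float, beta: float = 0.1) -> float:
    # Gradient scale for one pair: beta * sigmoid(-beta * margin), where
    # margin is the chosen-minus-rejected log-ratio difference.
    return beta / (1.0 + math.exp(beta * margin))

w_right = dpo_grad_weight(margin=50.0)   # model already right: tiny update
w_wrong = dpo_grad_weight(margin=-50.0)  # model wrong: near-maximal update
```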
7. DPO Variants: IPO, KTO, and Others
DPO opened the door to a family of offline preference optimization methods, each addressing a specific weakness.
IPO: Identity Preference Optimization
IPO (Azar et al., 2023) argues that DPO overfits to preference pairs because the sigmoid loss saturates too quickly. Once the model gets a pair "right" (large margin between chosen and rejected), the gradient vanishes and the model stops improving on that example, even if the margin is based on memorization rather than genuine understanding.
IPO replaces the sigmoid loss with a squared loss on the margin:
Let $m = \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}$ be the implicit reward margin between chosen and rejected. The IPO loss is simply $(m - \frac{1}{2\beta})^2$. This is a parabola centered at $\frac{1}{2\beta}$: the loss is zero when the margin equals the target, and increases quadratically in both directions.
If $m < \frac{1}{2\beta}$ (margin too small, model does not prefer the chosen response enough), the gradient pushes the margin up. If $m > \frac{1}{2\beta}$ (margin too large), the gradient pushes it back down. Compare this to DPO's sigmoid loss $-\log \sigma(\beta \cdot m)$: as $m$ grows large, the sigmoid saturates near 1, the loss approaches 0, and the gradient vanishes. DPO has no penalty for an excessively large margin, so the model can keep inflating it by memorizing specific preference pairs. IPO's squared loss prevents this by treating "too confident" the same as "not confident enough."
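The contrast between the two losses, evaluated for a single pair at margin $m$:

```python
import math

def dpo_pair_loss(m: float, beta: float = 0.1) -> float:
    # -log(sigmoid(beta * m)): decreases monotonically as the margin grows.
    return math.log(1.0 + math.exp(-beta * m))

def ipo_pair_loss(m: float, beta: float = 0.1) -> float:
    # Parabola centered at the target margin 1 / (2 * beta).
    return (m - 1.0 / (2.0 * beta)) ** 2

target = 1.0 / (2.0 * 0.1)        # 5.0
at_target = ipo_pair_loss(5.0)    # 0.0: loss vanishes exactly at the target
overshoot = ipo_pair_loss(50.0)   # large: IPO punishes an inflated margin
dpo_tail = dpo_pair_loss(50.0)    # ~0.007: DPO barely reacts to overshoot
```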
KTO: Kahneman-Tversky Optimization
KTO (Ethayarajh et al., 2024) addresses a different problem: it does not require paired preferences at all. Instead of needing (chosen, rejected) pairs for the same prompt, KTO works with unpaired data where each response is independently labeled as "good" or "bad."
This is significant because paired preference data is expensive to collect. You need two responses to the same prompt, shown to the same annotator, compared side-by-side. KTO only needs thumbs-up / thumbs-down labels, which can come from user feedback logs, upvotes, or simple quality ratings.
The KTO loss has two terms: one pushes up the implicit reward of good responses, the other pushes down the implicit reward of bad responses. A subtracted expectation term (a reference point estimated from the batch) centers the rewards, similar to a baseline in REINFORCE.
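A simplified sketch of the idea. The actual KTO loss additionally uses asymmetric weights for good versus bad examples and estimates the reference point from a KL term; here the baseline is simply passed in, and all names are illustrative:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_style_loss(implicit_reward: float, is_good: bool,
                   baseline: float, beta: float = 0.1) -> float:
    # Good responses: push the implicit reward above the baseline.
    # Bad responses: push it below. No paired comparison required.
    if is_good:
        return 1.0 - sigmoid(beta * (implicit_reward - baseline))
    return 1.0 - sigmoid(beta * (baseline - implicit_reward))

# Unpaired data: each response is labeled good or bad on its own.
loss_good = kto_style_loss(implicit_reward=3.0, is_good=True, baseline=0.0)
loss_bad = kto_style_loss(implicit_reward=3.0, is_good=False, baseline=0.0)
```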
Other variants
| Method | Key Idea | Data Requirement |
|---|---|---|
| DPO | Sigmoid loss on log-ratio margin | Paired preferences |
| IPO | Squared loss with target margin (anti-overfitting) | Paired preferences |
| KTO | Unpaired; works with thumbs-up/down labels | Unpaired good/bad labels |
| ORPO | Merges SFT and DPO into one loss (no ref model) | Paired preferences |
| SimPO | Length-normalized, no reference model needed | Paired preferences |
8. RLHF vs DPO: When to Use What
| Dimension | RLHF (PPO/GRPO) | DPO (and variants) |
|---|---|---|
| Training complexity | High. Requires reward model, value model (PPO), or group sampling (GRPO), plus the RL loop. | Low. Standard supervised training on preference pairs. One model, one loss. |
| Compute cost | 3-5x the cost of SFT. Needs generation + scoring + updates each iteration. | 1.5-2x the cost of SFT. Two forward passes per example (policy + reference). |
| Online exploration | Yes. Policy generates new responses and learns from them. Can correct novel errors. | No. Fixed dataset. Cannot adapt to distribution shift. |
| Reward hacking risk | High. The policy can exploit the reward model's blind spots over many RL steps. | Low. No explicit reward model to exploit. But can overfit to preference data. |
| Stability | Requires careful tuning (KL coefficient, learning rate, clipping). Can collapse. | Stable. Standard supervised training dynamics. |
| Ceiling | Higher. Online exploration can discover behaviors not in the training data. | Lower. Bounded by the quality and diversity of the preference dataset. |
| Iterative improvement | Natural. Collect new preferences on the improved model, retrain. | Requires regenerating preference data for each iteration. |
When DPO wins
- Small teams with limited compute and engineering bandwidth.
- Tasks where high-quality preference data already exists (or is easy to generate synthetically).
- Alignment tasks where the gap between SFT and the target behavior is small.
- Rapid iteration: you need a usable model today, not a perfectly optimized one in a month.
When RLHF wins
- Tasks with verifiable rewards (math, code, factual QA) where you can use rule-based scoring.
- Pushing past the ceiling of what preference data alone can teach.
- Large-scale production systems where the extra engineering cost is justified by quality gains.
- Iterative improvement loops where the model improves, generates new data, and improves again.
9. Practical Notes
Data quality matters more than quantity
A reward model trained on 50k high-quality preference pairs (clear differences between chosen and rejected, expert annotators, diverse prompts) will outperform one trained on 500k noisy pairs. Invest in annotation guidelines and quality control before scaling data collection.
The reference model matters
Both RLHF (via KL penalty) and DPO (via log-ratio) depend heavily on the reference policy. If the reference model is weak, the KL penalty in RLHF allows less room for improvement, and the log-ratios in DPO are less meaningful. Use the best SFT checkpoint you have as the reference.
Preference data goes stale
As the policy improves, the preference data (collected from an earlier, weaker model) becomes less informative. The distribution of responses the model now produces is different from what annotators compared. For RLHF, this matters less because the model generates fresh data. For DPO, this means you need to periodically regenerate preference data from the current policy ("online DPO" or "iterative DPO").
Process reward models vs outcome reward models
Standard reward models are outcome reward models (ORMs): they score the final response. Process reward models (PRMs) score each intermediate step in a chain-of-thought. PRMs are harder to train (you need step-level preference labels) but dramatically improve performance on reasoning tasks by providing denser feedback.
OpenAI's "Let's Verify Step by Step" paper showed that PRMs outperform ORMs on math, even with the same total annotation budget. The reason: an ORM can only say "this answer is wrong." A PRM can say "step 3 is where you went wrong," giving the policy much more useful gradient signal.
Synthetic preferences and RLAIF
Collecting human preferences is expensive (roughly \$0.50 to \$5 per comparison). An alternative: RLAIF (Reinforcement Learning from AI Feedback), where a strong LLM (GPT-4, Claude) judges which response is better. Anthropic's Constitutional AI and most open-source alignment efforts use some form of AI-generated preferences.
The trade-off: AI judges are cheaper and more consistent than humans, but they have systematic biases (preferring verbose, formal responses; struggling with factual verification; exhibiting position bias in multi-response comparisons). The best practice is to mix human and AI preferences.
References: Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022) · Rafailov et al., "Direct Preference Optimization" (DPO, 2023) · Azar et al., "A General Theoretical Paradigm to Understand Learning from Human Feedback" (IPO, 2023) · Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization" (2024) · Touvron et al., "Llama 2" (2023) · Lightman et al., "Let's Verify Step by Step" (2023) · Bai et al., "Constitutional AI" (2022)
Cite this post:
@article{sedhain2026rewardmodeling,
title = {Reward Modeling and DPO: Learning What "Good" Means},
author = {Sedhain, Suvash},
journal = {ssedhain.com},
year = {2026},
month = {Mar},
url = {https://mesuvash.github.io/blog/2026/reward-modeling/}
}