Reward Modeling and DPO: Learning What "Good" Means
How reward models turn human preferences into training signal, and how DPO skips the reward model entirely.
Companion to PPO & GRPO for LLM Alignment.
1. Why Reward Modeling?
The PPO & GRPO post covered how RL algorithms optimize a policy given a reward signal. But it glossed over a critical question: where does the reward come from?
For math or code, you can check correctness programmatically. But for most alignment goals (helpfulness, safety, tone, factuality in open-ended questions), there is no simple function that returns a score. The quality of "Explain quantum entanglement to a 10-year-old" has no ground truth.
The solution is to learn a reward function from human preferences. This is the reward model, and it is the foundation of the entire RLHF pipeline. If the reward model is wrong, the RL algorithm will optimize the wrong thing, often spectacularly.
- Humans can compare, even when they can't demonstrate. Asking an annotator to write the perfect response is hard. Asking "is A better than B?" is much easier and more reliable.
- Comparison data scales. Given $k$ responses per prompt, you get $\binom{k}{2}$ pairs. 4 responses yield 6 comparison pairs from one annotation session.
- The reward model amortizes human effort. Once trained, it can score millions of responses without further human input.
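The combinatorics in the second bullet checks out directly:

```python
from itertools import combinations

# 4 ranked responses to one prompt (placeholder strings)
responses = ["resp_A", "resp_B", "resp_C", "resp_D"]

# Every unordered pair of distinct responses becomes one comparison pair.
pairs = list(combinations(responses, 2))
# len(pairs) == 6, i.e. C(4, 2) comparison pairs from a single annotation session
```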
2. Preference Data: The Raw Material
Data format
Every preference dataset has the same structure: a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. The chosen/rejected labels come from human annotators who read both responses and pick the better one.
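A minimal record in this shape (field names are illustrative; open preference datasets use a similar chosen/rejected layout):

```python
# One preference triple as it might appear in a JSONL dataset.
record = {
    "prompt": "Explain quantum entanglement to a 10-year-old.",
    "chosen": "Imagine two magic coins that always land the same way...",   # preferred
    "rejected": "Entanglement is a nonlocal correlation of quantum states.",  # dispreferred
}
```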
How the data is collected
The typical loop: sample prompts from the target distribution, generate $k$ responses per prompt from the current model (often at different temperatures or from different checkpoints), and show annotators pairs of responses side by side. Each comparison the annotator makes yields one (prompt, chosen, rejected) triple.
Dataset sizes
Production reward models are trained on 100k-500k comparison pairs. Research-scale experiments sometimes use 10k-50k. OpenAI's InstructGPT used roughly 50k comparisons. Anthropic's early RLHF work used around 170k. Llama 2 used over 1 million.
3. The Reward Model
Architecture
A reward model is a language model with a scalar output head instead of a vocabulary head. You take a pre-trained LLM, remove the final projection to vocabulary size, and replace it with a linear layer that maps the last hidden state to a single number.
Concretely: the input is the concatenation of the prompt $x$ and response $y$. The model processes the entire sequence through the transformer. The hidden state at the final token position (EOS) has attended to all tokens via causal attention, so it summarizes the entire (prompt, response) pair. The linear head maps this $d$-dimensional vector to a single scalar $R(x, y)$.
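A toy sketch of the scalar head. The transformer itself is stubbed out with a deterministic placeholder; `encode_last_hidden`, `w`, and `b` are illustrative names, not a real library API:

```python
import random

D = 8  # hidden size (toy value; real models use thousands)

def encode_last_hidden(prompt: str, response: str) -> list[float]:
    # Stand-in for the transformer: in a real model this is the hidden state
    # at the final (EOS) position after processing prompt + response.
    rng = random.Random(hash((prompt, response)) & 0xFFFFFFFF)
    return [rng.uniform(-1, 1) for _ in range(D)]

# The scalar head: a single linear layer (d -> 1) replacing the vocab projection.
w = [0.1] * D
b = 0.0

def reward(prompt: str, response: str) -> float:
    h = encode_last_hidden(prompt, response)
    return sum(wi * hi for wi, hi in zip(w, h)) + b  # R(x, y): one number

score = reward("Explain entanglement.", "Imagine two magic coins...")
```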
The Bradley-Terry model
We want the reward model to assign higher scores to preferred responses. The Bradley-Terry model formalizes this. Given a prompt $x$, a chosen response $y_w$, and a rejected response $y_l$, the probability that $y_w$ is preferred is:

$$P(y_w \succ y_l \mid x) = \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)$$

Where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function and $R_\phi$ is the reward model parameterized by $\phi$. The key property: preference depends only on the score difference, not the absolute values. The training loss is the negative log-likelihood:

$$\mathcal{L}_\text{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)\right]$$
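As a sanity check, the loss for a single pair is a few lines; note the invariance to shifting both scores by the same constant:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(score_chosen: float, score_rejected: float) -> float:
    # Negative log-likelihood of the Bradley-Terry preference probability.
    # Only the score *difference* matters, not the absolute values.
    return -math.log(sigmoid(score_chosen - score_rejected))

loss_a = bt_loss(2.0, 1.0)      # ~0.3133
loss_b = bt_loss(102.0, 101.0)  # identical: same difference, same loss
```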
Where $\mathcal{D}$ is the dataset of preference triples. Each variable:
- $x$ is the prompt.
- $y_w$ is the chosen (winning) response.
- $y_l$ is the rejected (losing) response.
- $R_\phi(x, y)$ is the scalar score assigned by the reward model.
- $\sigma$ is the sigmoid function.
Why Bradley-Terry and not something simpler?
You might wonder: why not just train a regression model where chosen responses are labeled 1 and rejected are labeled 0? The Bradley-Terry framing is better for two reasons:
- Transitivity. Scalar scores give you a total ordering. If $R(A) > R(B) > R(C)$, the model automatically predicts $A \succ C$ even if it never saw that pair. A binary classifier over pairs does not guarantee consistency.
- Calibrated margin. The sigmoid on score differences means a pair that's "barely better" gets a small gradient, while a pair that's "clearly better but scored wrong" gets a large gradient. This is the right behavior for noisy human labels.
4. Reward Modeling in Practice
Model sizing
The reward model is typically the same size as (or slightly smaller than) the policy model. OpenAI's InstructGPT used a 6B reward model for a 175B policy. Llama 2 used a 70B reward model for a 70B policy. Anthropic has experimented with reward models up to 52B.
Smaller reward models are cheaper to run during RL (since every generated response needs scoring), but less accurate. The trade-off is simple: if your reward model is significantly less capable than your policy, the policy will quickly find outputs that fool the reward model.
Training recipe
The standard recipe: initialize from the SFT checkpoint, swap the vocabulary head for the scalar head, and fine-tune on comparison pairs with the Bradley-Terry loss above. Training typically runs for about one epoch at a low learning rate; reward models overfit quickly, and additional epochs tend to hurt held-out preference accuracy.
Reward normalization
Raw reward model scores drift during training. A score of 2.3 means nothing in absolute terms. Common normalization strategies:
- Per-batch normalization: subtract the batch mean and divide by standard deviation. Keeps scores centered around 0.
- Clipping: clip rewards to $[-5, 5]$ or similar range to prevent outlier rewards from causing large policy updates.
- Length penalty: longer responses tend to get higher reward scores simply because they contain more information. Some teams normalize by response length or add an explicit length penalty.
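The first two strategies combine naturally into one helper (a sketch; the default `clip` value matches the range above):

```python
import statistics

def normalize_rewards(rewards: list[float], clip: float = 5.0) -> list[float]:
    # Per-batch normalization: center at 0, scale to unit variance,
    # then clip so a single outlier can't dominate the policy update.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [max(-clip, min(clip, (r - mean) / std)) for r in rewards]

batch = [2.3, -0.5, 1.1, 40.0]  # one outlier reward
normed = normalize_rewards(batch)
```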
5. Reward Model Failure Modes
Reward hacking
The most important failure mode. The policy finds outputs that score high according to the reward model but are clearly bad to a human. This happens because the reward model is an imperfect proxy, and RL is very good at exploiting imperfect proxies.
Common reward hacking patterns:
- Verbosity exploit: the policy learns that longer responses get higher scores, so it generates unnecessarily verbose outputs. Almost every RLHF system sees this.
- Sycophancy: the model learns to agree with the user's stated position regardless of accuracy, because annotators sometimes prefer agreeable responses.
- Formatting tricks: excessive use of bullet points, headers, bold text, or emoji that correlate with higher reward but add no substance.
- Hedging: "as an AI language model..." prefixes that reduce perceived risk of wrong answers, which annotators may rate as "safer."
The KL penalty
The standard defense against reward hacking. During RL, the actual reward used is:

$$r(x, y) = R_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$$

Where $\pi_\text{ref}$ is the reference policy (typically the SFT model), $\pi_\theta$ is the current policy being trained, and $\beta$ controls the strength of the penalty. The log-ratio term is a per-sample estimate of the KL divergence, which measures how far the current policy has moved from the reference. Higher $\beta$ keeps the policy closer to the SFT model (less reward hacking but less improvement). Lower $\beta$ allows more exploration (more improvement but more risk of exploitation).
Typical $\beta$ values: 0.01 to 0.2. This is one of the most sensitive hyperparameters in RLHF.
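Assuming sequence-level log-probabilities are available, the shaping is a one-liner (function and argument names are illustrative):

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_ref: float,
                  beta: float = 0.05) -> float:
    # KL-penalized reward: subtract beta times the log-ratio, a per-sample
    # estimate of how far the policy has drifted from the reference.
    return rm_score - beta * (logp_policy - logp_ref)

# A response the policy has pushed far above its reference probability
# is penalized, even though the raw reward-model score is identical:
r_close = shaped_reward(1.0, logp_policy=-20.0, logp_ref=-20.5)  # 0.975
r_drift = shaped_reward(1.0, logp_policy=-5.0,  logp_ref=-20.5)  # 0.225
```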
Other failure modes
- Position bias: annotators tend to prefer the first response shown to them. If your annotation UI always shows responses in the same order, this bias leaks into the reward model. Mitigation: randomize presentation order.
- Length bias: annotators tend to prefer longer responses. The reward model inherits this bias. Mitigation: normalize by length, or include length-matched pairs in training data.
- Annotator disagreement: on subjective tasks, annotators disagree 20-35% of the time. The reward model is trained on a majority vote or random annotator choice, smoothing over genuine diversity in preferences.
6. DPO: Skip the Reward Model
Direct Preference Optimization (Rafailov et al., 2023) is a fundamentally different approach. Instead of training a separate reward model and then doing RL, DPO directly optimizes the policy on preference data. No reward model. No RL loop. Just supervised learning on preference pairs.
The key insight
The DPO paper makes a mathematical observation. In the standard RLHF objective, the optimal policy $\pi^*$ under a reward function $R$ with KL constraint has a closed-form solution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_\text{ref}(y \mid x) \exp\!\left(\frac{1}{\beta} R(x, y)\right)$$

Where $Z(x)$ is a normalization constant (partition function) and $\beta$ is the KL penalty coefficient. This can be rearranged to express the reward in terms of the policy:

$$R(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$$

The reward is proportional to how much the optimal policy upweights a response relative to the reference policy. Now substitute this into the Bradley-Terry model. The $\log Z(x)$ terms cancel (since they appear in both the chosen and rejected reward), giving:

$$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$
Each variable:
- $\pi_\theta$ is the policy being trained (initialized from $\pi_\text{ref}$).
- $\pi_\text{ref}$ is the reference policy (frozen SFT model).
- $y_w, y_l$ are the chosen and rejected responses.
- $\beta$ controls regularization strength (same role as in RLHF).
- $\log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ is the implicit reward: how much the policy has upweighted this response relative to the reference.
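The loss is cheap to compute from four sequence log-probabilities. A sketch for a single pair (argument names are illustrative):

```python
import math

def dpo_loss(logp_w_policy: float, logp_w_ref: float,
             logp_l_policy: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    # Implicit rewards: beta times the policy/reference log-ratio.
    r_w = beta * (logp_w_policy - logp_w_ref)
    r_l = beta * (logp_l_policy - logp_l_ref)
    # Bradley-Terry negative log-likelihood on the implicit reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# At initialization the policy equals the reference, so both log-ratios
# are zero and the loss is exactly -log(sigmoid(0)) = log 2.
loss0 = dpo_loss(-12.0, -12.0, -15.0, -15.0)
```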
The DPO pipeline
The pipeline is short: start from an SFT model, freeze a copy as the reference $\pi_\text{ref}$, gather preference pairs, and run standard supervised training with the DPO loss. There is no reward model to train and no generation loop during optimization.
What happens during training
For each preference pair $(x, y_w, y_l)$, DPO computes the log-probabilities of both responses under both the current policy $\pi_\theta$ and the frozen reference policy $\pi_\text{ref}$. This requires four forward passes per batch element (two responses, two models). In practice, the reference model log-probabilities are precomputed and cached, reducing to two forward passes.
The gradient pushes in two directions simultaneously:
- Increase the probability of the chosen response $y_w$ (relative to the reference).
- Decrease the probability of the rejected response $y_l$ (relative to the reference).
The gradient magnitude is weighted by how wrong the current model is. If the model already strongly prefers $y_w$, the sigmoid saturates and the gradient is small. If the model prefers $y_l$ (the wrong answer), the gradient is large. This is the same self-correcting property as in the reward model loss.
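This weighting can be made concrete: for a pair with log-ratio margin $m$, the per-pair gradient scale is $\beta \, \sigma(-\beta m)$, which vanishes as the model gets the pair right and grows toward $\beta$ when the model is wrong:

```python
import math

def dpo_grad_weight(margin: float, beta: float = 0.1) -> float:
    # Gradient scale for one pair: beta * sigmoid(-beta * margin), where
    # margin is the chosen-minus-rejected log-ratio difference.
    return beta / (1.0 + math.exp(beta * margin))

w_right = dpo_grad_weight(margin=50.0)   # model already right: tiny update
w_wrong = dpo_grad_weight(margin=-50.0)  # model wrong: near-maximal update
```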
7. DPO Variants: IPO, KTO, and Others
DPO opened the door to a family of offline preference optimization methods, each addressing a specific weakness.
IPO: Identity Preference Optimization
IPO (Azar et al., 2023) argues that DPO overfits to preference pairs because the sigmoid loss saturates too quickly. Once the model gets a pair "right" (large margin between chosen and rejected), the gradient vanishes and the model stops improving on that example, even if the margin is based on memorization rather than genuine understanding.
IPO replaces the sigmoid loss with a squared loss on the margin:
Let $m = \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}$ be the implicit reward margin between chosen and rejected. The IPO loss is simply $(m - \frac{1}{2\beta})^2$. This is a parabola centered at $\frac{1}{2\beta}$: the loss is zero when the margin equals the target, and increases quadratically in both directions.
If $m < \frac{1}{2\beta}$ (margin too small, model does not prefer the chosen response enough), the gradient pushes the margin up. If $m > \frac{1}{2\beta}$ (margin too large), the gradient pushes it back down. Compare this to DPO's sigmoid loss $-\log \sigma(\beta \cdot m)$: as $m$ grows large, the sigmoid saturates near 1, the loss approaches 0, and the gradient vanishes. DPO has no penalty for an excessively large margin, so the model can keep inflating it by memorizing specific preference pairs. IPO's squared loss prevents this by treating "too confident" the same as "not confident enough."
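The contrast between the two losses, evaluated for a single pair at margin $m$:

```python
import math

def dpo_pair_loss(m: float, beta: float = 0.1) -> float:
    # -log(sigmoid(beta * m)): decreases monotonically as the margin grows.
    return math.log(1.0 + math.exp(-beta * m))

def ipo_pair_loss(m: float, beta: float = 0.1) -> float:
    # Parabola centered at the target margin 1 / (2 * beta).
    return (m - 1.0 / (2.0 * beta)) ** 2

target = 1.0 / (2.0 * 0.1)        # 5.0
at_target = ipo_pair_loss(5.0)    # 0.0: loss vanishes exactly at the target
overshoot = ipo_pair_loss(50.0)   # large: IPO punishes an inflated margin
dpo_tail = dpo_pair_loss(50.0)    # ~0.007: DPO barely reacts to overshoot
```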
KTO: Kahneman-Tversky Optimization
KTO (Ethayarajh et al., 2024) addresses a different problem: it does not require paired preferences at all. Instead of needing (chosen, rejected) pairs for the same prompt, KTO works with unpaired data where each response is independently labeled as "good" or "bad."
This is significant because paired preference data is expensive to collect. You need two responses to the same prompt, shown to the same annotator, compared side-by-side. KTO only needs thumbs-up / thumbs-down labels, which can come from user feedback logs, upvotes, or simple quality ratings.
The KTO loss has two terms: one pushes up the implicit reward of good responses, the other pushes down the implicit reward of bad responses. A subtracted expectation term (a reference point estimated from the batch) centers the rewards, similar to a baseline in REINFORCE.
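A simplified sketch of the idea. The actual KTO loss additionally uses asymmetric weights for good versus bad examples and estimates the reference point from a KL term; here the baseline is simply passed in, and all names are illustrative:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_style_loss(implicit_reward: float, is_good: bool,
                   baseline: float, beta: float = 0.1) -> float:
    # Good responses: push the implicit reward above the baseline.
    # Bad responses: push it below. No paired comparison required.
    if is_good:
        return 1.0 - sigmoid(beta * (implicit_reward - baseline))
    return 1.0 - sigmoid(beta * (baseline - implicit_reward))

# Unpaired data: each response is labeled good or bad on its own.
loss_good = kto_style_loss(implicit_reward=3.0, is_good=True, baseline=0.0)
loss_bad = kto_style_loss(implicit_reward=3.0, is_good=False, baseline=0.0)
```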
Other variants
| Method | Key Idea | Data Requirement |
|---|---|---|
| DPO | Sigmoid loss on log-ratio margin | Paired preferences |
| IPO | Squared loss with target margin (anti-overfitting) | Paired preferences |
| KTO | Unpaired; works with thumbs-up/down labels | Unpaired good/bad labels |
| ORPO | Merges SFT and DPO into one loss (no ref model) | Paired preferences |
| SimPO | Length-normalized, no reference model needed | Paired preferences |
8. RLHF vs DPO: When to Use What
| Dimension | RLHF (PPO/GRPO) | DPO (and variants) |
|---|---|---|
| Training complexity | High. Requires reward model, value model (PPO), or group sampling (GRPO), plus the RL loop. | Low. Standard supervised training on preference pairs. One model, one loss. |
| Compute cost | 3-5x the cost of SFT. Needs generation + scoring + updates each iteration. | 1.5-2x the cost of SFT. Two forward passes per example (policy + reference). |
| Online exploration | Yes. Policy generates new responses and learns from them. Can correct novel errors. | No. Fixed dataset. Cannot adapt to distribution shift. |
| Reward hacking risk | High. The policy can exploit the reward model's blind spots over many RL steps. | Low. No explicit reward model to exploit. But can overfit to preference data. |
| Stability | Requires careful tuning (KL coefficient, learning rate, clipping). Can collapse. | Stable. Standard supervised training dynamics. |
| Ceiling | Higher. Online exploration can discover behaviors not in the training data. | Lower. Bounded by the quality and diversity of the preference dataset. |
| Iterative improvement | Natural. Collect new preferences on the improved model, retrain. | Requires regenerating preference data for each iteration. |
When DPO wins
- Small teams with limited compute and engineering bandwidth.
- Tasks where high-quality preference data already exists (or is easy to generate synthetically).
- Alignment tasks where the gap between SFT and the target behavior is small.
- Rapid iteration: you need a usable model today, not a perfectly optimized one in a month.
When RLHF wins
- Tasks with verifiable rewards (math, code, factual QA) where you can use rule-based scoring.
- Pushing past the ceiling of what preference data alone can teach.
- Large-scale production systems where the extra engineering cost is justified by quality gains.
- Iterative improvement loops where the model improves, generates new data, and improves again.
9. Practical Notes
Data quality matters more than quantity
A reward model trained on 50k high-quality preference pairs (clear differences between chosen and rejected, expert annotators, diverse prompts) will outperform one trained on 500k noisy pairs. Invest in annotation guidelines and quality control before scaling data collection.
The reference model matters
Both RLHF (via KL penalty) and DPO (via log-ratio) depend heavily on the reference policy. If the reference model is weak, the KL penalty in RLHF allows less room for improvement, and the log-ratios in DPO are less meaningful. Use the best SFT checkpoint you have as the reference.
Preference data goes stale
As the policy improves, the preference data (collected from an earlier, weaker model) becomes less informative. The distribution of responses the model now produces is different from what annotators compared. For RLHF, this matters less because the model generates fresh data. For DPO, this means you need to periodically regenerate preference data from the current policy ("online DPO" or "iterative DPO").
Process reward models vs outcome reward models
Standard reward models are outcome reward models (ORMs): they score the final response. Process reward models (PRMs) score each intermediate step in a chain-of-thought. PRMs are harder to train (you need step-level preference labels) but dramatically improve performance on reasoning tasks by providing denser feedback.
OpenAI's "Let's Verify Step by Step" paper showed that PRMs outperform ORMs on math, even with the same total annotation budget. The reason: an ORM can only say "this answer is wrong." A PRM can say "step 3 is where you went wrong," giving the policy much more useful gradient signal.
Synthetic preferences and RLAIF
Collecting human preferences is expensive (roughly \$0.50 to \$5 per comparison). An alternative: RLAIF (Reinforcement Learning from AI Feedback), where a strong LLM (GPT-4, Claude) judges which response is better. Anthropic's Constitutional AI and most open-source alignment efforts use some form of AI-generated preferences.
The trade-off: AI judges are cheaper and more consistent than humans, but they have systematic biases (preferring verbose, formal responses; struggling with factual verification; exhibiting position bias in multi-response comparisons). The best practice is to mix human and AI preferences.
References: Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022) · Rafailov et al., "Direct Preference Optimization" (DPO, 2023) · Azar et al., "A General Theoretical Paradigm to Understand Learning from Human Feedback" (IPO, 2023) · Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization" (2024) · Touvron et al., "Llama 2" (2023) · Lightman et al., "Let's Verify Step by Step" (2023) · Bai et al., "Constitutional AI" (2022)
Cite this post:
@article{sedhain2026rewardmodeling,
title = {Reward Modeling and DPO: Learning What "Good" Means},
author = {Sedhain, Suvash},
journal = {ssedhain.com},
year = {2026},
month = {Mar},
url = {https://mesuvash.github.io/blog/2026/reward-modeling/}
}