How reward models turn human preferences into training signal, and how DPO skips the reward model entirely.
Companion to PPO & GRPO for LLM Alignment.

1. Why Reward Modeling?

The PPO & GRPO post covered how RL algorithms optimize a policy given a reward signal. But it glossed over a critical question: where does the reward come from?

For math or code, you can check correctness programmatically. But for most alignment goals (helpfulness, safety, tone, factuality in open-ended questions), there is no simple function that returns a score. The quality of "Explain quantum entanglement to a 10-year-old" has no ground truth.

The solution is to learn a reward function from human preferences. This is the reward model, and it is the foundation of the entire RLHF pipeline. If the reward model is wrong, the RL algorithm will optimize the wrong thing, often spectacularly.

2. Preference Data: The Raw Material

Data format

Every preference dataset has the same structure: a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. The chosen/rejected labels come from human annotators who read both responses and pick the better one.

Prompt: "Explain why the sky is blue in one sentence."
Chosen: "Sunlight scatters off air molecules, and blue wavelengths scatter most, making the sky appear blue."
Rejected: "The sky is blue because of the way light works with the atmosphere and stuff like that."
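A preference triple like this is typically stored as one record per comparison (e.g., a line of JSONL). A minimal sketch, with illustrative field names (real datasets vary):

```python
# One preference record; field names are illustrative, not a fixed standard.
record = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "chosen": "Sunlight scatters off air molecules, and blue wavelengths "
              "scatter most, making the sky appear blue.",
    "rejected": "The sky is blue because of the way light works with the "
                "atmosphere and stuff like that.",
}

# Training code consumes (prompt, chosen, rejected) triples.
prompt, y_w, y_l = record["prompt"], record["chosen"], record["rejected"]
```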

How the data is collected

Sample responses from the policy. Given a set of prompts, generate 2-8 responses per prompt using the SFT model (or a mix of models). Sampling from the actual policy is important: the reward model needs to see the kinds of outputs it will score during RL, not just human-written text.
Human annotation. Annotators read a prompt and two responses, then select which response is better. Some pipelines collect rankings over $k > 2$ responses, then decompose into pairwise comparisons. Guidelines define what "better" means: more helpful, more accurate, less harmful, better formatted.
Quality control. Typical practices include inter-annotator agreement checks (disagreement rates of 20-35% are normal for subjective comparisons), majority voting across 3+ annotators per pair, and filtering out pairs where annotators split evenly with no clear majority.
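The ranking-to-pairs decomposition mentioned above is mechanical: a ranking over $k$ responses yields $\binom{k}{2}$ pairwise comparisons. A minimal sketch:

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Decompose a best-to-worst ranking into (chosen, rejected) pairs.

    A ranking over k responses yields k*(k-1)/2 pairwise comparisons.
    """
    # combinations preserves input order, so each pair is (better, worse).
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])  # A ranked best, D worst
# 4 responses -> 6 pairs: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)
```

This is why a single $k=4$ ranking is cheaper per comparison than four independent pairwise annotations.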

Dataset sizes

Production reward models are trained on 100k-500k comparison pairs. Research-scale experiments sometimes use 10k-50k. OpenAI's InstructGPT used roughly 50k comparisons. Anthropic's early RLHF work used around 170k. Llama 2 used over 1 million.

Warning: Preference data has a built-in ceiling. If annotators can't reliably distinguish good from great responses on a particular task, neither can the reward model. This is most problematic for technical topics (coding, math, science) where annotator expertise may be limited. Many teams supplement with AI-generated preferences ("RLAIF") for these domains.

3. The Reward Model

Architecture

A reward model is a language model with a scalar output head instead of a vocabulary head. You take a pre-trained LLM, remove the final projection to vocabulary size, and replace it with a linear layer that maps the last hidden state to a single number.

[Diagram: prompt tokens, response tokens, and EOS feed the transformer backbone (a pre-trained LLM, fine-tuned end-to-end); the hidden state at the EOS position (shape [d]) passes through a new Linear(d, 1) head to produce the scalar R(x, y).]

Concretely: the input is the concatenation of the prompt $x$ and response $y$. The model processes the entire sequence through the transformer. The hidden state at the final token position (EOS) has attended to all tokens via causal attention, so it summarizes the entire (prompt, response) pair. The linear head maps this $d$-dimensional vector to a single scalar $R(x, y)$.
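The scalar head itself is trivial: a dot product with the EOS hidden state plus a bias. A dependency-free sketch with a toy hidden size (real implementations wrap this in a framework like PyTorch, and $d$ is in the thousands):

```python
import random

d = 8  # hidden size (toy; real models use thousands)
random.seed(0)

# Randomly initialized linear head: Linear(d, 1) is a weight vector + bias.
w = [random.gauss(0.0, 0.02) for _ in range(d)]
b = 0.0

def reward_head(h_eos):
    """Map the d-dim hidden state at the EOS position to a scalar R(x, y)."""
    return sum(wi * hi for wi, hi in zip(w, h_eos)) + b

h = [0.5] * d             # stand-in for the transformer's EOS hidden state
score = reward_head(h)    # a single float: the reward
```

During training, gradients flow through this head into the entire backbone, so the whole model (not just the head) learns to evaluate responses.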

The Bradley-Terry model

We want the reward model to assign higher scores to preferred responses. The Bradley-Terry model formalizes this. Given a prompt $x$, a chosen response $y_w$, and a rejected response $y_l$, the probability that $y_w$ is preferred is:

Bradley-Terry preference probability:
$$P(y_w \succ y_l \mid x) = \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)$$

Where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function and $R_\phi$ is the reward model parameterized by $\phi$. The key property: preference depends only on the score difference, not the absolute values. The training loss is the negative log-likelihood:

Reward model training loss:
$$\mathcal{L}_\text{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)\big]$$

Where $\mathcal{D}$ is the dataset of preference triples $(x, y_w, y_l)$, and $\sigma$ and $R_\phi$ are as defined above.

Intuition: This is binary cross-entropy in disguise. You are training a classifier to answer "which response is better?" but instead of predicting a class directly, you parameterize the answer through scalar scores. If $R(x, y_w) \gg R(x, y_l)$, the sigmoid saturates near 1, loss is near 0. If the scores are close or reversed, loss is high. The model learns to push chosen scores up and rejected scores down, but only their relative ordering matters.
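The loss is simple enough to check by hand. A pure-Python sketch with illustrative scores, verifying the three properties just described (saturation, high loss on reversals, shift invariance):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_loss(r_chosen, r_rejected):
    """Negative log-likelihood under Bradley-Terry: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Correct ordering with a wide margin: loss is near zero.
assert rm_loss(3.0, -3.0) < 0.01
# Scores reversed: loss is large.
assert rm_loss(-3.0, 3.0) > 6.0
# Only the difference matters: shifting both scores leaves the loss unchanged.
assert abs(rm_loss(2.0, 1.0) - rm_loss(102.0, 101.0)) < 1e-9
```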

Why Bradley-Terry and not something simpler?

You might wonder: why not just train a regression model where chosen responses are labeled 1 and rejected are labeled 0? The Bradley-Terry framing is better for two reasons. First, absolute labels assume a fixed quality scale, but annotators only make relative judgments: a "1" response to an easy prompt may be worse than a "0" response to a hard one, so modeling the score difference sidesteps the calibration problem. Second, Bradley-Terry handles annotator noise gracefully: for genuinely close pairs, the optimal solution is a small score gap (preference probability near 0.5) rather than forcing the scores to hit 0 and 1.

4. Reward Modeling in Practice

Model sizing

The reward model is typically the same size as (or slightly smaller than) the policy model. OpenAI's InstructGPT used a 6B reward model for a 175B policy. Llama 2 used a 70B reward model for a 70B policy. Anthropic has experimented with reward models up to 52B.

Smaller reward models are cheaper to run during RL (since every generated response needs scoring), but less accurate. The trade-off is simple: if your reward model is significantly less capable than your policy, the policy will quickly find outputs that fool the reward model.

Training recipe

Initialize from the SFT checkpoint (or the pre-trained base model). Using the SFT checkpoint is common because it already understands instruction-following format, which helps it evaluate response quality.
Replace the LM head. Remove the vocabulary projection layer (shape: $d \times V$) and add a randomly initialized linear layer (shape: $d \times 1$).
Fine-tune on preference pairs. Standard supervised training. Learning rate around $1 \times 10^{-5}$ to $5 \times 10^{-6}$. Train for 1-2 epochs (overfitting is a serious risk with preference data). Batch sizes of 64-128 pairs.
Evaluate. Hold out 10-20% of preference pairs. Report accuracy (does the model assign higher score to the chosen response?). Good reward models achieve 70-75% accuracy on held-out data. Human agreement on these datasets is typically 73-80%, so 75% is near-ceiling.
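The evaluation metric in the last step is just pairwise accuracy: the fraction of held-out pairs where the chosen response outscores the rejected one. A minimal sketch with toy scores:

```python
def rm_accuracy(score_pairs):
    """Fraction of pairs where the chosen response outscores the rejected one.

    score_pairs: list of (r_chosen, r_rejected) tuples from the reward model.
    """
    correct = sum(1 for r_w, r_l in score_pairs if r_w > r_l)
    return correct / len(score_pairs)

# Toy held-out scores: 3 of 4 pairs ranked correctly -> 0.75 accuracy.
held_out = [(1.2, 0.3), (0.8, -0.5), (0.1, 0.4), (2.0, 1.1)]
accuracy = rm_accuracy(held_out)
```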

Reward normalization

Raw reward model scores drift during training, and a score of 2.3 means nothing in absolute terms (the Bradley-Terry loss is invariant to a constant shift). Common normalization strategies:

Mean-centering. Subtract the mean reward over a fixed set of reference responses, so "average" responses score near zero.
Z-normalization. Subtract the mean and divide by the standard deviation of scores on a reference batch, giving zero mean and unit variance.
Whitening at RL start. Normalize once against the initial policy's outputs, so the KL-penalized objective begins on a consistent scale across runs.
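Z-normalization is one common strategy, and it is a one-liner in practice. A minimal sketch:

```python
import math

def z_normalize(rewards):
    """Z-normalize a batch of raw reward scores: zero mean, unit variance.

    Makes scores comparable across reward-model checkpoints and runs,
    since only relative ordering is meaningful.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) if var > 0 else 1.0  # guard against constant batches
    return [(r - mean) / std for r in rewards]

normed = z_normalize([2.3, 1.1, 3.5, 0.7])
```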

5. Reward Model Failure Modes

Reward hacking

The most important failure mode. The policy finds outputs that score high according to the reward model but are clearly bad to a human. This happens because the reward model is an imperfect proxy, and RL is very good at exploiting imperfect proxies.

Common reward hacking patterns:

Length exploitation. Reward models often correlate length with quality, so the policy pads responses with filler.
Sycophancy. Agreeing with the user's stated opinion scores higher than correcting it.
Format gaming. Bullet points, headers, and confident phrasing can inflate scores regardless of content.
Over-refusal. If annotators penalized harmful content heavily, the policy may learn to refuse benign requests.

The KL penalty

The standard defense against reward hacking. During RL, the actual reward used is:

KL-penalized reward:
$$R_\text{total}(x, y) = R_\phi(x, y) - \beta \cdot D_\text{KL}\big(\pi_\theta(\cdot|x) \,\|\, \pi_\text{ref}(\cdot|x)\big)$$

Where $\pi_\text{ref}$ is the reference policy (typically the SFT model), $\pi_\theta$ is the current policy being trained, and $\beta$ controls the strength of the penalty. The KL divergence measures how far the current policy has moved from the reference. Higher $\beta$ keeps the policy closer to the SFT model (less reward hacking but less improvement). Lower $\beta$ allows more exploration (more improvement but more risk of exploitation).

Typical $\beta$ values: 0.01 to 0.2. This is one of the most sensitive hyperparameters in RLHF.

Intuition: The KL penalty says: "improve the reward, but don't stray too far from the SFT model." The further you drift, the more likely you are in a region where the reward model's predictions are unreliable (out of distribution). The SFT model acts as a safety anchor.
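In practice the KL term is usually estimated per sampled response from the per-token log-probabilities under the policy and the reference model. A toy computation of the penalized reward, with illustrative numbers:

```python
def kl_penalized_reward(r_model, logp_policy, logp_ref, beta=0.05):
    """Sequence-level KL-penalized reward, as commonly estimated in practice.

    logp_policy / logp_ref: per-token log-probs of the SAME sampled response
    under the current policy and the frozen reference model. Summing
    log(pi_theta / pi_ref) over the sampled tokens is a standard one-sample
    estimate of the KL term.
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return r_model - beta * kl_estimate

# Policy has drifted: it assigns higher log-probs than the reference.
logp_policy = [-1.0, -0.5, -0.8]
logp_ref = [-1.5, -1.2, -1.1]
r_total = kl_penalized_reward(2.0, logp_policy, logp_ref, beta=0.1)
# KL estimate = 0.5 + 0.7 + 0.3 = 1.5, so reward = 2.0 - 0.1 * 1.5 = 1.85
```

Note how a larger $\beta$ (or more drift) eats directly into the usable reward, which is exactly the "safety anchor" effect described above.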

Other failure modes

Beyond hacking, reward models suffer from out-of-distribution brittleness (scores are unreliable on responses unlike the training pairs), absorbed annotator biases (for example, preferring confident or verbose prose), and overoptimization: past a certain number of RL steps, the proxy reward keeps rising while true quality, as judged by humans, falls.

6. DPO: Skip the Reward Model

Direct Preference Optimization (Rafailov et al., 2023) is a fundamentally different approach. Instead of training a separate reward model and then doing RL, DPO directly optimizes the policy on preference data. No reward model. No RL loop. Just supervised learning on preference pairs.

The key insight

The DPO paper makes a mathematical observation. In the standard RLHF objective, the optimal policy $\pi^*$ under a reward function $R$ with KL constraint has a closed-form solution:

Optimal RLHF policy (closed-form):
$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_\text{ref}(y|x) \exp\!\Big(\frac{1}{\beta} R(x, y)\Big)$$

Where $Z(x)$ is a normalization constant (partition function) and $\beta$ is the KL penalty coefficient. This can be rearranged to express the reward in terms of the policy:

Reward as a function of the optimal policy:
$$R(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)$$

The reward is proportional to how much the optimal policy upweights a response relative to the reference policy. Now substitute this into the Bradley-Terry model. The $\log Z(x)$ terms cancel (since they appear in both the chosen and rejected reward), giving:

DPO loss:
$$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]$$

Each variable: $\pi_\theta$ is the policy being trained, $\pi_\text{ref}$ is the frozen reference policy (usually the SFT model), $\beta$ is the same KL coefficient as in RLHF, and $(x, y_w, y_l)$ is a preference triple from the dataset $\mathcal{D}$.

Intuition: DPO says: the policy itself is the reward model. Instead of learning a separate scorer and then optimizing against it, just directly adjust the policy so that it assigns higher probability to chosen responses and lower probability to rejected ones, relative to where it started. The log-ratio $\log(\pi_\theta / \pi_\text{ref})$ acts as an implicit reward score, and the loss pushes this implicit score higher for chosen responses than rejected ones.
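The loss is easy to compute once you have sequence log-probabilities. A pure-Python sketch with illustrative numbers:

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities.

    logp_* are log pi(y|x) summed over response tokens; the *_ref values
    come from the frozen reference model.
    """
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy upweights the chosen response and downweights the rejected one
# relative to the reference: implicit margin is positive, loss is small.
low = dpo_loss(logp_w=-10.0, logp_ref_w=-14.0, logp_l=-16.0, logp_ref_l=-12.0)
# Preferences reversed: margin is negative, loss is large.
high = dpo_loss(logp_w=-14.0, logp_ref_w=-10.0, logp_l=-12.0, logp_ref_l=-16.0)
```

Notice that only the log-ratios enter the loss; the absolute log-probabilities (which depend on response length and tokenization) cancel against the reference.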

The DPO pipeline

[Diagram: RLHF (3 steps): preference data → train reward model → RL training (PPO/GRPO) → aligned model. DPO (1 step): preference data → supervised training on pairs → aligned model. No reward model, no RL loop.]

What happens during training

For each preference pair $(x, y_w, y_l)$, DPO computes the log-probabilities of both responses under both the current policy $\pi_\theta$ and the frozen reference policy $\pi_\text{ref}$. This requires four forward passes per batch element (two responses, two models). In practice, the reference model log-probabilities are precomputed and cached, reducing to two forward passes.

The gradient pushes in two directions simultaneously: it increases the policy's log-probability of the chosen response $y_w$ (relative to the reference) and decreases the log-probability of the rejected response $y_l$.

The gradient magnitude is weighted by how wrong the current model is. If the model already strongly prefers $y_w$, the sigmoid saturates and the gradient is small. If the model prefers $y_l$ (the wrong answer), the gradient is large. This is the same self-correcting property as in the reward model loss.

Key limitation: DPO is an offline algorithm. It trains on a fixed dataset of preference pairs, never generating new responses. This means it cannot explore: the policy only learns from responses that were already in the dataset. If the SFT model produces a new kind of error not represented in the training pairs, DPO cannot correct it. RLHF methods (PPO, GRPO) are online: the policy generates fresh responses each iteration and learns from its own mistakes.

7. DPO Variants: IPO, KTO, and Others

DPO opened the door to a family of offline preference optimization methods, each addressing a specific weakness.

IPO: Identity Preference Optimization

IPO (Azar et al., 2023) argues that DPO overfits to preference pairs because the sigmoid loss saturates too quickly. Once the model gets a pair "right" (large margin between chosen and rejected), the gradient vanishes and the model stops improving on that example, even if the margin is based on memorization rather than genuine understanding.

IPO replaces the sigmoid loss with a squared loss on the margin:

IPO loss:
$$\mathcal{L}_\text{IPO}(\theta) = \mathbb{E}\left[\left(\log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} - \frac{1}{2\beta}\right)^2\right]$$

Let $m = \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}$ be the implicit reward margin between chosen and rejected. The IPO loss is simply $(m - \frac{1}{2\beta})^2$. This is a parabola centered at $\frac{1}{2\beta}$: the loss is zero when the margin equals the target, and increases quadratically in both directions.

If $m < \frac{1}{2\beta}$ (margin too small, model does not prefer the chosen response enough), the gradient pushes the margin up. If $m > \frac{1}{2\beta}$ (margin too large), the gradient pushes it back down. Compare this to DPO's sigmoid loss $-\log \sigma(\beta \cdot m)$: as $m$ grows large, the sigmoid saturates near 1, the loss approaches 0, and the gradient vanishes. DPO has no penalty for an excessively large margin, so the model can keep inflating it by memorizing specific preference pairs. IPO's squared loss prevents this by treating "too confident" the same as "not confident enough."
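The contrast between the two losses is easy to see numerically. A sketch comparing them at the target margin and at an inflated one:

```python
import math

beta = 0.1
target = 1.0 / (2.0 * beta)  # IPO's target margin: 1/(2*beta) = 5.0

def dpo_loss(m):
    """DPO loss as a function of the implicit reward margin m."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * m)))

def ipo_loss(m):
    """IPO loss: squared distance of the margin from the target."""
    return (m - target) ** 2

# At a huge margin, DPO's loss (and hence its gradient) has vanished...
dpo_at_100 = dpo_loss(100.0)
# ...while IPO penalizes overshooting the target just like undershooting it.
ipo_at_target = ipo_loss(5.0)
ipo_at_100 = ipo_loss(100.0)
```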

KTO: Kahneman-Tversky Optimization

KTO (Ethayarajh et al., 2024) addresses a different problem: it does not require paired preferences at all. Instead of needing (chosen, rejected) pairs for the same prompt, KTO works with unpaired data where each response is independently labeled as "good" or "bad."

This is significant because paired preference data is expensive to collect. You need two responses to the same prompt, shown to the same annotator, compared side-by-side. KTO only needs thumbs-up / thumbs-down labels, which can come from user feedback logs, upvotes, or simple quality ratings.

KTO loss (simplified):
$$\mathcal{L}_\text{KTO}(\theta) = \mathbb{E}_{y_w}\left[\sigma\left(-r_\theta(x, y_w)\right)\right] + \mathbb{E}_{y_l}\left[\sigma\left(r_\theta(x, y_l)\right)\right]$$ $$\text{where } r_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)} - \mathbb{E}_{y'}\left[\beta \log \frac{\pi_\theta(y'|x)}{\pi_\text{ref}(y'|x)}\right]$$

The first term pushes up the implicit reward of good responses; the second pushes down the implicit reward of bad responses. The subtracted expectation term centers the rewards, similar to a baseline in REINFORCE.
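A simplified sketch of the loss over unpaired examples, treating the centered implicit rewards as given (the real algorithm also computes the baseline term and weights good/bad examples separately):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(implicit_rewards, labels):
    """Simplified KTO loss over unpaired examples.

    implicit_rewards: beta * log(pi_theta/pi_ref) per response, assumed
    already centered by the baseline expectation term.
    labels: True for "good" (thumbs-up), False for "bad" (thumbs-down).
    """
    total = 0.0
    for r, good in zip(implicit_rewards, labels):
        # sigmoid(-r) shrinks as r grows (push good rewards up);
        # sigmoid(r) shrinks as r falls (push bad rewards down).
        total += sigmoid(-r) if good else sigmoid(r)
    return total / len(implicit_rewards)

# Good responses with high implicit reward and bad ones with low reward
# yield a smaller loss than the reverse assignment.
aligned = kto_loss([2.0, -2.0], [True, False])
misaligned = kto_loss([2.0, -2.0], [False, True])
```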

Other variants

| Method | Key Idea | Data Requirement |
|--------|----------|------------------|
| DPO | Sigmoid loss on log-ratio margin | Paired preferences |
| IPO | Squared loss with target margin (anti-overfitting) | Paired preferences |
| KTO | Works with thumbs-up/down labels | Unpaired good/bad labels |
| ORPO | Merges SFT and DPO into one loss (no reference model) | Paired preferences |
| SimPO | Length-normalized, no reference model needed | Paired preferences |
Note: All of these methods share the same core idea: use preference data to directly adjust the policy's output probabilities, without an explicit reward model. They differ in the loss function shape, regularization strategy, and what kind of data they require.

8. RLHF vs DPO: When to Use What

| Dimension | RLHF (PPO/GRPO) | DPO (and variants) |
|-----------|-----------------|--------------------|
| Training complexity | High. Requires reward model, value model (PPO) or group sampling (GRPO), plus the RL loop. | Low. Standard supervised training on preference pairs. One model, one loss. |
| Compute cost | 3-5x the cost of SFT. Needs generation + scoring + updates each iteration. | 1.5-2x the cost of SFT. Two forward passes per example (policy + reference). |
| Online exploration | Yes. Policy generates new responses and learns from them. Can correct novel errors. | No. Fixed dataset. Cannot adapt to distribution shift. |
| Reward hacking risk | High. The policy can exploit the reward model's blind spots over many RL steps. | Low. No explicit reward model to exploit, but can overfit to preference data. |
| Stability | Requires careful tuning (KL coefficient, learning rate, clipping). Can collapse. | Stable. Standard supervised training dynamics. |
| Ceiling | Higher. Online exploration can discover behaviors not in the training data. | Lower. Bounded by the quality and diversity of the preference dataset. |
| Iterative improvement | Natural. Collect new preferences on the improved model, retrain. | Requires regenerating preference data for each iteration. |
Intuition: DPO is the "fast and stable" option. RLHF is the "higher ceiling but harder to get right" option. In practice, many teams start with DPO to get a quick baseline, then switch to RLHF (typically GRPO) for the final push. DeepSeek-R1, for instance, uses GRPO with rule-based rewards after an initial DPO/SFT stage.

When DPO wins

You have limited compute or engineering bandwidth, a fixed high-quality preference dataset already exists, and you need a stable baseline quickly. DPO is also the safer choice when the target behaviors are well covered by the existing pairs.

When RLHF wins

9. Practical Notes

Data quality matters more than quantity

A reward model trained on 50k high-quality preference pairs (clear differences between chosen and rejected, expert annotators, diverse prompts) will outperform one trained on 500k noisy pairs. Invest in annotation guidelines and quality control before scaling data collection.

The reference model matters

Both RLHF (via KL penalty) and DPO (via log-ratio) depend heavily on the reference policy. If the reference model is weak, the KL penalty in RLHF allows less room for improvement, and the log-ratios in DPO are less meaningful. Use the best SFT checkpoint you have as the reference.

Preference data goes stale

As the policy improves, the preference data (collected from an earlier, weaker model) becomes less informative. The distribution of responses the model now produces is different from what annotators compared. For RLHF, this matters less because the model generates fresh data. For DPO, this means you need to periodically regenerate preference data from the current policy ("online DPO" or "iterative DPO").

Process reward models vs outcome reward models

Standard reward models are outcome reward models (ORMs): they score the final response. Process reward models (PRMs) score each intermediate step in a chain-of-thought. PRMs are harder to train (you need step-level preference labels) but dramatically improve performance on reasoning tasks by providing denser feedback.

OpenAI's "Let's Verify Step by Step" paper showed that PRMs outperform ORMs on math, even with the same total annotation budget. The reason: an ORM can only say "this answer is wrong." A PRM can say "step 3 is where you went wrong," giving the policy much more useful gradient signal.

Synthetic preferences and RLAIF

Collecting human preferences is expensive ($0.50-$5 per comparison). An alternative: RLAIF (Reinforcement Learning from AI Feedback), where a strong LLM (GPT-4, Claude) judges which response is better. Anthropic's Constitutional AI and most open-source alignment efforts use some form of AI-generated preferences.

The trade-off: AI judges are cheaper and more consistent than humans, but they have systematic biases (preferring verbose, formal responses; struggling with factual verification; exhibiting position bias in multi-response comparisons). The best practice is to mix human and AI preferences.


References: Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022) · Rafailov et al., "Direct Preference Optimization" (DPO, 2023) · Azar et al., "A General Theoretical Paradigm to Understand Learning from Human Feedback" (IPO, 2023) · Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization" (2024) · Touvron et al., "Llama 2" (2023) · Lightman et al., "Let's Verify Step by Step" (2023) · Bai et al., "Constitutional AI" (2022)

Cite this post:

@article{sedhain2026rewardmodeling,
  title   = {Reward Modeling and DPO: Learning What "Good" Means},
  author  = {Sedhain, Suvash},
  journal = {ssedhain.com},
  year    = {2026},
  month   = {Mar},
  url     = {https://mesuvash.github.io/blog/2026/reward-modeling/}
}