Paper Review: RLHF — Training Language Models to Follow Instructions with Human Feedback (InstructGPT) Complete Analysis

Updated Feb 6, 2026

Introduction

Large language models (LLMs) trained on massive internet corpora are remarkably capable, but they often generate outputs that are untruthful, toxic, or simply unhelpful to the user. The fundamental problem is a misalignment between the language modeling objective (predicting the next token) and the actual goal we care about — following the user’s instructions helpfully and safely.

In March 2022, OpenAI published “Training language models to follow instructions with human feedback” (Ouyang et al., 2022), introducing InstructGPT. This paper formalized the Reinforcement Learning from Human Feedback (RLHF) pipeline that would later become the backbone of ChatGPT, Claude, and virtually every modern aligned LLM.

The headline result of InstructGPT is striking: human evaluators prefer the outputs of a 1.3B parameter model fine-tuned with human feedback over those of the 175B GPT-3, despite the smaller model having over 100× fewer parameters.

This review provides a thorough analysis of the InstructGPT paper — its three-stage training pipeline, reward modeling, PPO optimization, evaluation methodology, and broader implications for AI alignment.


Background and Motivation

The Alignment Problem in Language Models

Traditional LLMs are trained with a next-token prediction objective:

\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)

where x_t is the token at position t, x_{<t} are the preceding tokens, and θ are the model parameters.
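In code, this objective is the standard teacher-forced cross-entropy over shifted tokens. The sketch below is illustrative (the function and argument names are not from the paper); it assumes logits from any autoregressive PyTorch model.

import torch.nn.functional as F

def language_modeling_loss(logits, input_ids):
    """
    Next-token prediction loss: position t predicts token t+1.

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids
    """
    shift_logits = logits[:, :-1, :].contiguous()   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:].contiguous()    # targets are tokens 1..T-1
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )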

This objective optimizes for statistical likelihood of internet text, not for:
– Helpfulness — actually answering what the user asked
– Truthfulness — providing factually accurate information
– Harmlessness — avoiding toxic, biased, or dangerous content

This gap between training objective and desired behavior is the alignment problem.

Prior Work

The RLHF approach builds on several lines of research:

Prior Work | Contribution | Limitation
Christiano et al. (2017) | RLHF for Atari games and simulated robotics | Not applied to language
Stiennon et al. (2020) | RLHF for text summarization | Single narrow task
Ziegler et al. (2019) | Fine-tuning LMs with human preferences | Small scale, limited evaluation
GPT-3 (Brown et al., 2020) | Few-shot learning via prompting | No alignment training
FLAN / T0 (2021) | Multi-task instruction tuning | No human preference signal

InstructGPT’s contribution was to scale RLHF to general-purpose instruction following on a production-grade model, with rigorous human evaluation.


The InstructGPT Three-Stage Pipeline

The training pipeline consists of three distinct stages, each building on the previous one.

Stage 1: Supervised Fine-Tuning (SFT)

The first stage collects demonstration data — human-written ideal responses to prompts — and fine-tunes GPT-3 on this data using standard supervised learning.

Data Collection:
– OpenAI hired a team of about 40 contractors (labelers), screened in part for sensitivity to the preferences of different demographic groups
– Labelers wrote high-quality responses to prompts drawn from the OpenAI API and a hand-crafted set
– ~13,000 training examples were collected

The SFT objective is the same language modeling loss, but conditioned on prompt-response pairs:

\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P(y_t \mid x, y_{<t}; \theta)

where x is the input prompt and y is the target response.
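In practice this is usually implemented as the same cross-entropy loss with the prompt tokens masked out of the targets, so that only response tokens contribute. The paper does not spell out this detail, so the sketch below (the names and the masking convention) is an assumption about a typical implementation.

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lengths, ignore_index=-100):
    """
    Supervised fine-tuning loss on concatenated (prompt, response) sequences.
    Only response tokens contribute; prompt tokens are masked out of the targets.

    logits:         (batch, seq_len, vocab_size)
    input_ids:      (batch, seq_len) prompt tokens followed by response tokens
    prompt_lengths: (batch,) tensor with the number of prompt tokens per example
    """
    labels = input_ids.clone()
    positions = torch.arange(labels.size(1), device=labels.device)
    labels[positions.unsqueeze(0) < prompt_lengths.unsqueeze(1)] = ignore_index

    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )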

SFT alone significantly improves output quality, but it is fundamentally limited by the diversity and scale of demonstration data. Collecting demonstrations is expensive and slow.

Stage 2: Reward Model (RM) Training

The key innovation is training a reward model that learns to predict which outputs humans prefer. This is far more scalable than collecting demonstrations because comparison judgments are easier and faster for humans than writing full responses.

Data Collection:
– For each prompt, the SFT model generates multiple outputs (typically K = 4 to K = 9)
– Labelers rank all K outputs from best to worst
– From each ranking of K outputs, \binom{K}{2} pairwise comparisons are extracted

The reward model r_θ(x, y) takes a prompt x and a response y and outputs a scalar reward. It is trained using the Bradley-Terry model of pairwise preferences:

\mathcal{L}_{\text{RM}} = -\frac{1}{\binom{K}{2}} \sum_{(y_w, y_l) \in \text{pairs}} \log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)

where:
– y_w is the preferred (winning) response
– y_l is the less preferred (losing) response
– σ is the sigmoid function
– the sum runs over all \binom{K}{2} pairs from each ranking

Critical design choice: The authors found that training on all \binom{K}{2} pairs from each ranking within a single batch was crucial. Without this, the reward model overfits, because the same prompt appears multiple times across mini-batches.
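A minimal sketch of this loss for a single prompt, assuming the K reward scores have already been stacked in preference order (best first); all \binom{K}{2} pairs are formed in one pass, mirroring the batching choice described above. Function and variable names are illustrative.

import itertools
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_ranked):
    """
    Bradley-Terry loss over all C(K, 2) pairs from one ranked set of responses.

    rewards_ranked: (K,) reward-model scores for K responses to the same prompt,
                    ordered from most preferred (index 0) to least preferred.
    """
    pair_losses = []
    for i, j in itertools.combinations(range(rewards_ranked.size(0)), 2):
        # i is ranked above j, so the winner/loser margin is r_i - r_j
        pair_losses.append(-F.logsigmoid(rewards_ranked[i] - rewards_ranked[j]))
    # Averaging over the pairs matches the 1 / binom(K, 2) normalization above
    return torch.stack(pair_losses).mean()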

Architecture: The reward model is initialized from the 6B parameter SFT model with the final unembedding layer replaced by a linear projection to a scalar output. The authors chose 6B over 175B because the larger model was found to be unstable during RM training.
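A sketch of that architecture, assuming a Hugging Face-style backbone that returns last_hidden_state; the class name and the choice to read the reward from the final non-padding token are common conventions, not details from the paper.

import torch
import torch.nn as nn

class ScalarRewardModel(nn.Module):
    """Transformer backbone (initialized from the SFT model) plus a scalar head."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # e.g. a pretrained base model
        self.reward_head = nn.Linear(hidden_size, 1)  # replaces the unembedding layer

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                           # (batch, seq_len, hidden)
        # Read the reward from the final non-padding token of each sequence
        last_index = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_index]
        return self.reward_head(last_hidden).squeeze(-1)   # (batch,) scalar rewards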

Stage 3: Reinforcement Learning via PPO

The final stage uses Proximal Policy Optimization (PPO) (Schulman et al., 2017) to optimize the language model policy against the learned reward model.

The optimization objective is:

\text{objective}(\phi) = E_{(x, y) \sim \pi_\phi} \Big[ r_\theta(x, y) - \beta \, \text{KL}\big(\pi_\phi(y \mid x) \,\|\, \pi_{\text{SFT}}(y \mid x)\big) \Big] + \gamma \, E_{x \sim D_{\text{pretrain}}} \big[ \log \pi_\phi(x) \big]

Let’s break this down term by term:

– π_φ(y|x): the language model policy (parameterized by φ) generating response y given prompt x
– r_θ(x, y): the learned reward model score
– β · KL(π_φ ‖ π_SFT): KL penalty preventing the policy from diverging too far from the SFT model
– γ · log π_φ(x): pretraining mix — a language modeling term on the original pretraining distribution to prevent catastrophic forgetting

The KL divergence penalty is critical. Without it, the policy quickly learns to exploit the reward model by generating adversarial outputs that receive high reward but are actually low quality (reward hacking). The coefficient β controls this trade-off.

The pretraining mix term (with coefficient γ) ensures the model retains its general language capabilities while being optimized for alignment. The authors call the variant with γ > 0 PPO-ptx.

import torch
import torch.nn.functional as F

def compute_rlhf_loss(
    policy_logprobs,      # Log probs under current policy
    sft_logprobs,         # Log probs under SFT reference model
    rewards,              # Reward model scores
    pretrain_logprobs,    # Log probs on pretraining data
    beta=0.02,            # KL penalty coefficient
    gamma=0.05            # Pretraining mix coefficient
):
    """
    Compute the InstructGPT RLHF objective.

    The objective maximizes reward while staying close to the SFT policy
    and maintaining pretraining language modeling performance.
    """
    # KL divergence between policy and SFT reference
    # Approximated as: E[log(pi_policy) - log(pi_sft)]
    kl_divergence = (policy_logprobs - sft_logprobs).mean()

    # Main RLHF objective: reward - KL penalty
    rlhf_objective = rewards.mean() - beta * kl_divergence

    # Pretraining mix to prevent catastrophic forgetting
    pretrain_loss = -pretrain_logprobs.mean()

    # Combined loss (negative because we maximize objective)
    total_loss = -rlhf_objective + gamma * pretrain_loss

    return total_loss, {
        "reward": rewards.mean().item(),
        "kl_divergence": kl_divergence.item(),
        "pretrain_loss": pretrain_loss.item()
    }
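For concreteness, here is how the function above might be called; the tensors are random placeholders standing in for real model outputs.

# Illustrative sanity check with dummy tensors (8 sampled responses)
policy_lp = torch.randn(8)                 # per-response log probs under the policy
sft_lp = policy_lp - 0.1 * torch.rand(8)   # reference log probs, slightly lower
reward_scores = torch.randn(8)             # reward-model scores
pretrain_lp = torch.randn(16)              # token log probs on a pretraining batch

loss, stats = compute_rlhf_loss(policy_lp, sft_lp, reward_scores, pretrain_lp)
print(loss.item(), stats)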

Training Details

Model Sizes

The authors trained InstructGPT at three scales:

Model | Parameters | SFT Epochs | RM Size | PPO Details
InstructGPT-1.3B | 1.3B | 16 | 6B | 256 episodes/batch
InstructGPT-6B | 6B | 16 | 6B | 256 episodes/batch
InstructGPT-175B | 175B | 16 | 6B | 512 episodes/batch

Prompt Dataset

The prompts came from two sources:

  1. OpenAI API traffic — real user prompts submitted to the GPT-3 API (with user consent, deduplicated, PII filtered)
  2. Labeler-written prompts — designed to cover diverse categories

The prompt distribution covered:

Category | Percentage
Generation (open-ended) | 45.6%
Open QA | 12.4%
Brainstorming | 11.2%
Chat / Conversation | 8.4%
Rewrite | 6.6%
Summarization | 4.2%
Classification | 3.5%
Closed QA | 2.6%
Extract | 1.9%
Other | 3.6%

Data Sizes

Dataset | Size
SFT training data | ~13,000 prompt-response pairs
RM comparison data | ~33,000 prompts, ~300,000 pairwise comparisons
PPO training prompts | ~31,000 unique prompts

Hyperparameters

Key PPO hyperparameters:
– Learning rate: 9.65 × 10^-6 (cosine schedule)
– KL penalty β: 0.02
– Pretraining mix γ: 27.8 (for PPO-ptx, calibrated to match the pretraining gradient scale)
– Clip ratio ε: 0.2 (standard PPO)
– Batch size: 512 episodes for 175B, 256 for the smaller models
– PPO epochs per batch: 4
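Collected into a small config object for readability; this is only a sketch of the reported values, not code from any released training setup.

from dataclasses import dataclass

@dataclass
class PPOTrainingConfig:
    """Key InstructGPT PPO hyperparameters as reported in the paper."""
    learning_rate: float = 9.65e-6     # cosine schedule
    kl_coef_beta: float = 0.02         # KL penalty coefficient
    ptx_coef_gamma: float = 27.8       # pretraining mix coefficient (PPO-ptx)
    clip_ratio_eps: float = 0.2        # standard PPO clipping
    batch_size_episodes: int = 512     # 512 for 175B, 256 for the smaller models
    ppo_epochs_per_batch: int = 4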


Evaluation Methodology

Human Evaluation

The primary evaluation was human preference judgments on a held-out test set of prompts. Labelers compared outputs from:
– GPT-3 (few-shot prompted)
– GPT-3 (with carefully written prompts)
– SFT models
– InstructGPT (PPO and PPO-ptx)

For each comparison, labelers rated outputs on a 1-7 Likert scale and expressed pairwise preferences.
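As a reminder of how headline numbers like "71% ± 3%" are read, a win rate from pairwise preferences comes with a binomial standard error; the counts below are purely illustrative, not the paper's raw data.

import math

def win_rate_with_error(wins, total):
    """Preference win rate and its binomial standard error."""
    p = wins / total
    stderr = math.sqrt(p * (1 - p) / total)
    return p, stderr

# Hypothetical counts: 710 wins out of 1,000 comparisons
p, se = win_rate_with_error(wins=710, total=1000)
print(f"win rate = {p:.0%} +/- {1.96 * se:.1%} (95% CI)")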

Automatic Metrics

The paper also evaluated on several NLP benchmarks to check for alignment tax (capability regression due to alignment training):

  • TruthfulQA — Measures tendency to produce false but plausible statements
  • RealToxicityPrompts — Measures toxic language generation
  • Winogender / BBQ — Measures bias
  • Standard NLP benchmarks — HellaSwag, WinoGrande, ARC, etc.

Experimental Results

Key Finding 1: InstructGPT Outputs Are Preferred Over GPT-3

Human labelers preferred InstructGPT-1.3B outputs over GPT-3-175B outputs 71% ± 3% of the time.

Model | Win Rate vs. GPT-3 175B (prompted) | Parameters
GPT-3 175B (few-shot) | 28% | 175B
SFT 1.3B | 59% | 1.3B
InstructGPT 1.3B (PPO-ptx) | 71% | 1.3B
SFT 6B | 68% | 6B
InstructGPT 6B (PPO-ptx) | 73% | 6B
SFT 175B | 72% | 175B
InstructGPT 175B (PPO-ptx) | 85% | 175B

A 1.3B model aligned with RLHF beats a 175B model without alignment — a 100× parameter advantage overcome by better training signal.

Key Finding 2: Improved Truthfulness

On TruthfulQA:
– GPT-3 175B: 21% truthful + informative
– InstructGPT 175B (PPO-ptx): 33% truthful + informative

This is a relative improvement of ~57%, though absolute performance remains modest — showing truthfulness is hard even with alignment.

Key Finding 3: Reduced Toxicity

On RealToxicityPrompts (measuring toxic continuations):
– GPT-3 175B: toxicity score ~0.44
– InstructGPT 175B (PPO): toxicity score ~0.28

A 36% reduction in toxic output generation. The model also showed smaller increases in toxicity when given toxic prompts (more robust to adversarial inputs).

Key Finding 4: Minimal Alignment Tax

On standard NLP benchmarks:

Benchmark | GPT-3 175B | InstructGPT 175B (PPO) | InstructGPT 175B (PPO-ptx)
HellaSwag | 78.9% | 71.4% (↓) | 79.1% (≈)
WinoGrande | 70.2% | 67.1% (↓) | 70.5% (≈)
ARC (Easy) | 68.8% | 66.5% (↓) | 69.2% (≈)
Lambada | 76.2% | 69.3% (↓) | 75.8% (≈)

PPO alone causes a noticeable regression on benchmarks, but PPO-ptx (with the pretraining mix) almost completely eliminates the alignment tax. This validates the importance of the γ · log π_φ(x) term.

Key Finding 5: Labeler Agreement

The paper reports inter-annotator agreement of ~73%, meaning labelers agreed on the preferred output about 73% of the time. This establishes an approximate human ceiling for the task.

Interestingly, when tested with held-out labelers who did not participate in training data collection, InstructGPT was still preferred — suggesting the model learned generalizable preferences rather than overfitting to specific labeler quirks.


Ablation Studies

Effect of Model Size

Performance scaled with model size, but the gains from RLHF were relatively consistent across scales:

Base Model Size | SFT Win Rate vs GPT-3 | PPO-ptx Win Rate vs GPT-3 | RLHF Gain
1.3B | 59% | 71% | +12%
6B | 68% | 73% | +5%
175B | 72% | 85% | +13%

The 175B model showed the largest absolute gain from RLHF, suggesting that larger models benefit more from alignment training (they have more capability to be unlocked).

Effect of Pretraining Mix (PPO vs PPO-ptx)

The ablation clearly showed that the pretraining mix coefficient γ is essential:

  • PPO without pretraining mix: Higher reward on the RM, but significant regression on NLP benchmarks and occasional degenerate outputs
  • PPO-ptx: Slightly lower reward, but no benchmark regression and more robust outputs

Effect of the KL Coefficient β

The paper explored different values of β:

β Value | Effect
Too low (< 0.01) | Reward hacking — the model exploits RM weaknesses
0.02 (optimal) | Good balance of reward and output quality
Too high (> 0.1) | Model barely deviates from SFT, limited improvement
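In many implementations β is not fixed but adapted during training to hold the measured KL near a target. The proportional controller from Ziegler et al. (2019), similar to trl's adaptive KL mode, is sketched below with illustrative defaults.

class AdaptiveKLController:
    """Proportional controller that keeps the measured KL near a target
    (Ziegler et al., 2019)."""

    def __init__(self, init_beta=0.02, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Clipped proportional error so beta changes gradually
        error = max(min(current_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta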

SFT Data Scaling

Surprisingly, the authors found that SFT performance plateaus quickly — performance on held-out data peaks after just ~1 epoch of SFT training (despite running 16 epochs). This suggests:

  1. The model learns the instruction-following “format” quickly
  2. Further SFT training leads to overfitting on demonstration data
  3. The RM and PPO stages are where the real alignment learning happens

Comparison with Contemporary Methods

Method | Human Feedback | Training Approach | Scalability | Key Limitation
InstructGPT (RLHF) | Rankings → Reward Model → PPO | 3-stage pipeline | Proven at 175B scale | Complex pipeline, reward hacking risk
FLAN (2021) | None (instruction templates) | Multi-task SFT | Highly scalable | No preference learning
Constitutional AI (2022) | AI-generated critiques + RLHF | AI feedback + RL | Reduces human annotation | Still needs some human labels
DPO (2023) | Pairwise preferences | Direct optimization (no RM) | Simpler pipeline | Less flexible than RLHF
RLAIF (2023) | AI-generated preferences | Same as RLHF but AI labels | Very scalable | Depends on AI judge quality
PPO-Max / ReMax (2023) | Rankings → RM → REINFORCE | Simplified RL | Easier to implement | May underperform PPO

InstructGPT’s three-stage pipeline remains the gold standard against which all subsequent alignment methods are compared, even if newer methods simplify parts of it.


Strengths of the Paper

1. Rigorous Human Evaluation

Unlike many papers that rely solely on automatic metrics, InstructGPT invested heavily in human evaluation — the gold standard for measuring alignment. The use of both trained labelers and held-out evaluators strengthens the findings.

2. Practical Scalability Demonstration

The paper demonstrated RLHF at 175B parameter scale, proving it was viable for production systems. This was not merely academic — InstructGPT was deployed as a real product.

3. Alignment Tax Analysis

The careful measurement of performance regression on standard benchmarks, and the PPO-ptx solution to mitigate it, was a valuable practical contribution.

4. Transparent Limitations Discussion

The paper openly discusses failure modes:
– The model can still generate harmful outputs
– Performance depends heavily on labeler quality and demographics
– The model can be overly cautious (refusing harmless requests)
– RLHF amplifies existing biases in the labeler pool

5. Reproducible Pipeline

The three-stage pipeline (SFT → RM → PPO) became a widely reproduced recipe. Projects like Anthropic’s Claude, LLaMA-2-Chat, Tulu, and Zephyr all follow variants of this pipeline.


Limitations and Criticisms

1. Labeler Bias

The 40 labelers were predominantly English-speaking, US-based contractors. Their preferences inevitably encode cultural assumptions about what constitutes a “good” response. The paper acknowledges this but does not resolve it.

2. Reward Hacking

Despite the KL penalty, the model can still learn to produce outputs that score highly on the reward model without being genuinely better. Examples include:
– Verbose responses (longer outputs tend to receive higher scores)
– Hedging language (“I’m not sure, but…” is often scored as more honest)
– Sycophantic agreement with the user’s stated beliefs

3. Training Instability

The PPO stage is notoriously unstable and sensitive to hyperparameters. The authors note that the 175B reward model was too unstable to train, forcing them to use a 6B RM. This mismatch between policy and reward model sizes is a known concern.

4. Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.”

The reward model is a proxy for human preferences. Optimizing against it too aggressively leads to reward over-optimization — outputs that exploit reward model weaknesses rather than improving actual quality.

The paper shows evidence of this:

\text{True quality} = f(\text{RM score}), \quad \text{which peaks at moderate RM scores and then declines}

This means there is an optimal level of RL optimization beyond which quality degrades.
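One common practical safeguard, not something the paper prescribes, is to treat KL from the SFT policy as an optimization budget and stop (or hand checkpoints to human evaluation) once it is exceeded; a rough sketch with an illustrative budget.

def overoptimization_guard(kl_from_sft, proxy_reward, kl_budget=10.0):
    """
    Flag likely reward over-optimization during RL training.

    kl_from_sft:  mean KL of the current policy from the SFT policy (nats)
    proxy_reward: mean reward-model score at this step
    kl_budget:    illustrative threshold beyond which further proxy-reward
                  gains are increasingly likely to be reward hacking
    """
    if kl_from_sft > kl_budget:
        print(f"KL {kl_from_sft:.1f} exceeds budget {kl_budget}: proxy reward "
              f"{proxy_reward:.2f} may no longer track true quality")
        return True
    return False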

5. Cost and Complexity

The three-stage pipeline requires:
– Specialized human annotation teams
– Training three separate models (SFT, RM, RL policy)
– Careful hyperparameter tuning at each stage
– Significant compute resources

This complexity motivated simpler alternatives like DPO (Direct Preference Optimization).


Impact and Legacy

InstructGPT’s impact on the field has been profound:

Direct Descendants

  1. ChatGPT (November 2022) — Built on the InstructGPT pipeline, applied to GPT-3.5, and became the fastest-growing consumer application in history
  2. GPT-4 (March 2023) — Continued using RLHF at larger scale
  3. Claude (Anthropic) — Extended RLHF with Constitutional AI (RLAIF)
  4. LLaMA-2-Chat (Meta, 2023) — Open-source reproduction of the InstructGPT pipeline

Methodological Influence

The paper established several now-standard practices:

  • Comparison data > demonstration data for preference learning
  • KL-constrained RL as the standard alignment optimization
  • Pretraining mix to prevent capability regression
  • Human evaluation as the primary alignment metric
  • Red teaming as part of safety evaluation

The RLHF Ecosystem

The paper spawned an entire ecosystem of tools and frameworks:

# Example: Modern RLHF pipeline using trl library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

# Step 1: Load SFT model as starting point
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Step 2: Configure PPO with KL penalty
ppo_config = PPOConfig(
    model_name="instructgpt-style-rlhf",
    learning_rate=1e-5,
    init_kl_coef=0.02,        # Beta: KL penalty coefficient
    target_kl=6.0,            # Adaptive KL target
    batch_size=256,
    mini_batch_size=64,
    ppo_epochs=4,             # PPO epochs per batch
)

# Step 3: Initialize PPO trainer
ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    tokenizer=tokenizer,
)

# Step 4: Training loop
for batch in dataloader:
    query_tensors = batch["input_ids"]

    # Generate responses from the policy
    response_tensors = ppo_trainer.generate(
        query_tensors,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7
    )

    # Score responses with reward model
    rewards = reward_model(query_tensors, response_tensors)

    # PPO update step (includes KL penalty internally)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

    print(f"Mean reward: {stats['ppo/mean_scores']:.3f}, "
          f"KL div: {stats['objective/kl']:.3f}")

Future Research Directions

Several open problems remain in the RLHF paradigm that InstructGPT established:

1. Scalable Oversight

As models become more capable, human evaluators may not be able to judge output quality reliably (especially for complex reasoning, code, or scientific content). Research directions include:
  • AI-assisted evaluation (constitutional AI, debate)
  • Process-based rewards (rewarding reasoning steps, not just final answers)
  • Recursive reward modeling (using aligned models to train better reward models)

2. Reducing Reward Hacking

  • Ensemble reward models to reduce exploitability
  • Conservative optimization methods that explicitly account for RM uncertainty
  • Iterated RLHF — alternating between collecting new human data and retraining

3. Multi-Objective Alignment

InstructGPT optimizes a single scalar reward, but alignment is inherently multi-dimensional (helpful, harmless, honest). Future work includes:
  • Multi-objective RL with Pareto-optimal solutions
  • Conditional alignment — different safety levels for different contexts
  • Constitutional approaches with explicit rule hierarchies

4. Eliminating the RL Stage

DPO (Rafailov et al., 2023) showed that the RM + PPO stages can be collapsed into a single supervised objective:

\mathcal{L}_{\text{DPO}} = -\log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)

Whether DPO can fully replace PPO-based RLHF at scale remains an active research question, with evidence on both sides.
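A minimal PyTorch sketch of this loss, assuming the summed log-probabilities of each chosen and rejected response under the policy and the frozen reference model have already been computed; the names are illustrative.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps         # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps   # log pi/pi_ref for y_l
    # -log sigma( beta * (logratio_w - logratio_l) ), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()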

5. Democratizing Alignment

The InstructGPT pipeline requires significant resources. Ongoing work aims to make alignment accessible to the broader research community through:
  • Open-source reward models (e.g., OpenAssistant, UltraFeedback)
  • Synthetic preference data generation
  • Parameter-efficient RLHF (combining LoRA with PPO)
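As one example of the parameter-efficient direction, LoRA adapters can be attached to the policy before preference optimization; a sketch using the Hugging Face peft library, where "sft-model" is a placeholder checkpoint and the target modules depend on the architecture.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# "sft-model" is a placeholder for whatever SFT checkpoint training starts from
base_model = AutoModelForCausalLM.from_pretrained("sft-model")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

policy = get_peft_model(base_model, lora_config)
policy.print_trainable_parameters()       # typically well under 1% of all parameters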


Conclusion

The InstructGPT paper is one of the most consequential ML papers of the 2020s. Its core contributions can be summarized as:

  1. A scalable three-stage pipeline (SFT → Reward Modeling → PPO) that transforms a raw language model into an instruction-following assistant
  2. Empirical proof that a small aligned model (1.3B) can be preferred by humans over a 100× larger unaligned model (175B)
  3. The PPO-ptx technique that mitigates alignment tax through pretraining data mixing
  4. Rigorous evaluation methodology combining human preference judgments with automatic safety and capability benchmarks
  5. An honest discussion of limitations including labeler bias, reward hacking, and the fundamental difficulty of specifying human values

The RLHF paradigm introduced by InstructGPT became the foundation upon which ChatGPT, Claude, Gemini, and virtually every modern conversational AI system was built. While newer methods like DPO offer simpler alternatives, the conceptual framework — learn a reward model from human preferences, then optimize against it with KL-constrained RL — remains the dominant paradigm in AI alignment.

For researchers and practitioners working on LLM alignment, this paper is essential reading. It demonstrates both the remarkable effectiveness of human feedback for steering AI behavior and the significant challenges that remain in making AI systems truly aligned with human values.


Reference: Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. arXiv:2203.02155
