Introduction
Large language models (LLMs) trained on massive internet corpora are remarkably capable, but they often generate outputs that are untruthful, toxic, or simply unhelpful to the user. The fundamental problem is a misalignment between the language modeling objective (predicting the next token) and the actual goal we care about — following the user’s instructions helpfully and safely.
In March 2022, OpenAI published “Training language models to follow instructions with human feedback” (Ouyang et al., 2022), introducing InstructGPT. This paper formalized the Reinforcement Learning from Human Feedback (RLHF) pipeline that would later become the backbone of ChatGPT, Claude, and virtually every modern aligned LLM.
The headline result of InstructGPT is striking: a 1.3B parameter model fine-tuned with human feedback is preferred by human evaluators over the 175B GPT-3, despite being over 100× smaller.
This review provides a thorough analysis of the InstructGPT paper — its three-stage training pipeline, reward modeling, PPO optimization, evaluation methodology, and broader implications for AI alignment.
Background and Motivation
The Alignment Problem in Language Models
Traditional LLMs are trained with a next-token prediction objective:
$$\mathcal{L}(\theta) = -\sum_{t} \log P(x_t \mid x_{<t};\, \theta)$$

where $x_t$ is the token at position $t$, $x_{<t}$ are all preceding tokens, and $\theta$ are the model parameters.
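In code, this objective is ordinary cross-entropy over shifted token sequences. A toy PyTorch sketch with made-up shapes (not code from the paper):

```python
import torch
import torch.nn.functional as F

# Toy setup: random logits stand in for a causal LM's output; shapes are illustrative only.
batch_size, seq_len, vocab_size = 2, 16, 50_000
logits = torch.randn(batch_size, seq_len, vocab_size)        # what model(tokens) would return
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Next-token prediction: the logits at position t predict the token at position t+1.
shift_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
shift_labels = tokens[:, 1:]       # targets are the next tokens

# Cross-entropy over the vocabulary = negative log-likelihood of the data.
lm_loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
print(f"language modeling loss: {lm_loss.item():.3f}")
```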
This objective optimizes for statistical likelihood of internet text, not for:
– Helpfulness — actually answering what the user asked
– Truthfulness — providing factually accurate information
– Harmlessness — avoiding toxic, biased, or dangerous content
This gap between training objective and desired behavior is the alignment problem.
Prior Work
The RLHF approach builds on several lines of research:
| Prior Work | Contribution | Limitation |
|---|---|---|
| Christiano et al. (2017) | RLHF for Atari games and simulated robotics | Not applied to language |
| Stiennon et al. (2020) | RLHF for text summarization | Single narrow task |
| Ziegler et al. (2019) | Fine-tuning LMs with human preferences | Small scale, limited evaluation |
| GPT-3 (Brown et al., 2020) | Few-shot learning via prompting | No alignment training |
| FLAN / T0 (2021) | Multi-task instruction tuning | No human preference signal |
InstructGPT’s contribution was to scale RLHF to general-purpose instruction following on a production-grade model, with rigorous human evaluation.
The InstructGPT Three-Stage Pipeline
The training pipeline consists of three distinct stages, each building on the previous one.
Stage 1: Supervised Fine-Tuning (SFT)
The first stage collects demonstration data — human-written ideal responses to prompts — and fine-tunes GPT-3 on this data using standard supervised learning.
Data Collection:
– OpenAI hired a team of 40 contractors (labelers) selected for sensitivity to preferences of different demographics
– Labelers wrote high-quality responses to prompts from the OpenAI API and a hand-crafted set
– ~13,000 training examples were collected
The SFT objective is the same language modeling loss, but conditioned on prompt-response pairs:
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t} \log \pi_\theta(y_t \mid x,\, y_{<t})$$

where $x$ is the input prompt and $y = (y_1, \dots, y_T)$ is the target response.
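In practice, SFT computes the same cross-entropy on concatenated prompt-response sequences, often with prompt tokens masked out of the loss. A minimal sketch under those assumptions (toy tensors; the prompt-masking choice is a common convention, not something the paper specifies):

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
# Toy concatenated sequence: [prompt tokens | response tokens]
prompt = torch.randint(0, vocab_size, (1, 10))
response = torch.randint(0, vocab_size, (1, 6))
input_ids = torch.cat([prompt, response], dim=1)   # shape (1, 16)

# Random logits stand in for model(input_ids).logits in a real SFT run.
logits = torch.randn(1, input_ids.shape[1], vocab_size)

# Shift for next-token prediction, then mask prompt positions so the loss
# is only taken on response tokens.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:].clone()
prompt_len = prompt.shape[1]
shift_labels[:, : prompt_len - 1] = -100           # ignore_index for cross_entropy

sft_loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
    ignore_index=-100,
)
print(f"SFT loss (response tokens only): {sft_loss.item():.3f}")
```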
SFT alone significantly improves output quality, but it is fundamentally limited by the diversity and scale of demonstration data. Collecting demonstrations is expensive and slow.
Stage 2: Reward Model (RM) Training
The key innovation is training a reward model that learns to predict which outputs humans prefer. This is far more scalable than collecting demonstrations because comparison judgments are easier and faster for humans than writing full responses.
Data Collection:
– For each prompt, the SFT model generates multiple outputs (typically $K = 4$ to $9$)
– Labelers rank all $K$ outputs from best to worst
– From each ranking of $K$ outputs, $\binom{K}{2}$ pairwise comparisons are extracted
The reward model takes a prompt and response and outputs a scalar reward. It is trained using the Bradley-Terry model of pairwise preferences:
$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]$$

where:
– $r_\theta(x, y)$ is the scalar reward for response $y$ to prompt $x$
– $y_w$ is the preferred (winning) response
– $y_l$ is the less preferred (losing) response
– $\sigma$ is the sigmoid function
– The expectation runs over all $\binom{K}{2}$ pairs from each ranking
Critical design choice: The authors found that training on all pairs from each ranking within a single batch was crucial. Without this, the reward model overfits because the same prompt appears multiple times across mini-batches.
Architecture: The reward model is initialized from the 6B parameter SFT model with the final unembedding layer replaced by a linear projection to a scalar output. The authors chose 6B over 175B because the larger model was found to be unstable during RM training.
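To make the batching point concrete, here is a minimal sketch of the Bradley-Terry loss over a single prompt's ranking, with all $\binom{K}{2}$ pairs handled together. The scores below are stand-ins for the reward model's scalar outputs (the real RM is a 6B transformer with a scalar head):

```python
import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_ranking(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over one prompt's ranking.

    `rewards_ranked` holds the reward model's scalar scores for K responses,
    ordered best-to-worst by the labeler. All K-choose-2 pairs from this
    ranking contribute together, mirroring the paper's batching scheme.
    """
    K = rewards_ranked.shape[0]
    pairs = list(itertools.combinations(range(K), 2))   # (winner_idx, loser_idx)
    losses = []
    for w, l in pairs:
        # -log sigmoid(r_w - r_l) == softplus(r_l - r_w)
        losses.append(F.softplus(rewards_ranked[l] - rewards_ranked[w]))
    return torch.stack(losses).mean()

# Toy scores for K = 5 responses to one prompt.
scores = torch.tensor([1.8, 1.1, 0.4, -0.2, -0.9], requires_grad=True)
loss = rm_loss_for_ranking(scores)
loss.backward()
print(f"RM loss over {5 * 4 // 2} pairs: {loss.item():.3f}")
```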
Stage 3: Reinforcement Learning via PPO
The final stage uses Proximal Policy Optimization (PPO) (Schulman et al., 2017) to optimize the language model policy against the learned reward model.
The optimization objective is:

$$\text{objective}(\phi) = \mathbb{E}_{(x,\, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[\, r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right] + \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\!\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]$$

Let’s break this down term by term:

| Term | Meaning |
|---|---|
| $\pi_\phi^{\mathrm{RL}}(y \mid x)$ | The language model policy (parameterized by $\phi$) generating response $y$ given prompt $x$ |
| $r_\theta(x, y)$ | The learned reward model score |
| $-\beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}$ | KL penalty preventing the policy from diverging too far from the SFT model |
| $\gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\!\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]$ | Pretraining mix: a language modeling term on the original pretraining distribution to prevent catastrophic forgetting |
The KL divergence penalty is critical. Without it, the policy quickly learns to exploit the reward model by generating adversarial outputs that receive high reward but are actually low quality (reward hacking). The coefficient $\beta$ controls this trade-off.
The pretraining mix term (with coefficient $\gamma$) ensures the model retains its general language capabilities while being optimized for alignment. The authors call the variant with $\gamma > 0$ PPO-ptx.
```python
import torch
import torch.nn.functional as F


def compute_rlhf_loss(
    policy_logprobs,    # Log probs under current policy
    sft_logprobs,       # Log probs under SFT reference model
    rewards,            # Reward model scores
    pretrain_logprobs,  # Log probs on pretraining data
    beta=0.02,          # KL penalty coefficient
    gamma=0.05,         # Pretraining mix coefficient (illustrative; the paper's 27.8 uses its own scaling)
):
    """
    Compute a simplified version of the InstructGPT RLHF objective.

    The objective maximizes reward while staying close to the SFT policy
    and maintaining pretraining language modeling performance. This is an
    illustration of the objective only; it omits PPO machinery such as
    clipping, advantages, and the value function.
    """
    # KL divergence between policy and SFT reference
    # Approximated as: E[log(pi_policy) - log(pi_sft)]
    kl_divergence = (policy_logprobs - sft_logprobs).mean()

    # Main RLHF objective: reward - KL penalty
    rlhf_objective = rewards.mean() - beta * kl_divergence

    # Pretraining mix to prevent catastrophic forgetting
    pretrain_loss = -pretrain_logprobs.mean()

    # Combined loss (negative because we maximize the objective)
    total_loss = -rlhf_objective + gamma * pretrain_loss

    return total_loss, {
        "reward": rewards.mean().item(),
        "kl_divergence": kl_divergence.item(),
        "pretrain_loss": pretrain_loss.item(),
    }
```
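A quick usage sketch with dummy per-sample values, reusing `compute_rlhf_loss` from above (numbers are made up):

```python
import torch

# Dummy per-sample values; a real run would compute these from the policy,
# the frozen SFT reference model, and the reward model.
policy_logprobs = torch.tensor([-12.3, -15.1, -9.8])
sft_logprobs = torch.tensor([-13.0, -14.7, -10.5])
rewards = torch.tensor([0.6, -0.1, 1.2])
pretrain_logprobs = torch.tensor([-45.2, -51.8, -48.0])

loss, stats = compute_rlhf_loss(policy_logprobs, sft_logprobs, rewards, pretrain_logprobs)
print(f"loss = {loss.item():.3f}, stats = {stats}")
```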
Training Details
Model Sizes
The authors trained InstructGPT at three scales:
| Model | Parameters | SFT Epochs | RM Architecture | PPO Details |
|---|---|---|---|---|
| InstructGPT-1.3B | 1.3B | 16 | 6B | 256 episodes/batch |
| InstructGPT-6B | 6B | 16 | 6B | 256 episodes/batch |
| InstructGPT-175B | 175B | 16 | 6B | 512 episodes/batch |
Prompt Dataset
The prompts came from two sources:
- OpenAI API traffic — real user prompts submitted to the GPT-3 API (with user consent, deduplicated, PII filtered)
- Labeler-written prompts — designed to cover diverse categories
The prompt distribution covered:
| Category | Percentage |
|---|---|
| Generation (open-ended) | 45.6% |
| Open QA | 12.4% |
| Brainstorming | 11.2% |
| Chat / Conversation | 8.4% |
| Rewrite | 6.6% |
| Summarization | 4.2% |
| Classification | 3.5% |
| Closed QA | 2.6% |
| Extract | 1.9% |
| Other | 3.6% |
Data Sizes
| Dataset | Size |
|---|---|
| SFT training data | ~13,000 prompt-response pairs |
| RM comparison data | ~33,000 prompts, ~300,000 pairwise comparisons |
| PPO training prompts | ~31,000 unique prompts |
Hyperparameters
Key PPO hyperparameters:
– Learning rate: cosine decay schedule
– KL penalty coefficient $\beta$: 0.02
– Pretraining mix coefficient $\gamma$: 27.8 (for PPO-ptx, calibrated to match the pretraining gradient scale)
– Clip ratio $\epsilon$: 0.2 (standard PPO)
– Batch size: 512 episodes for 175B, 256 for the smaller models
– PPO epochs per batch: 4
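For reference, the values above collected into a small config object (a sketch; field names are my own, and the peak learning rate is not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class PPOHyperparams:
    """PPO settings reported for InstructGPT (field names are my own)."""
    kl_coef_beta: float = 0.02          # KL penalty coefficient
    ptx_coef_gamma: float = 27.8        # pretraining mix coefficient (PPO-ptx)
    clip_ratio_eps: float = 0.2         # standard PPO clipping
    episodes_per_batch: int = 512       # 512 for 175B, 256 for smaller models
    ppo_epochs_per_batch: int = 4
    lr_schedule: str = "cosine decay"   # peak learning rate omitted here

config_175b = PPOHyperparams()
config_6b = PPOHyperparams(episodes_per_batch=256)
print(config_175b)
```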
Evaluation Methodology
Human Evaluation
The primary evaluation was human preference judgments on a held-out test set of prompts. Labelers compared outputs from:
– GPT-3 (few-shot prompted)
– GPT-3 (with carefully written prompts)
– SFT models
– InstructGPT (PPO and PPO-ptx)
For each comparison, labelers rated outputs on a 1-7 Likert scale and expressed pairwise preferences.
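The headline preference numbers are win rates with an uncertainty estimate over these pairwise judgments. A minimal sketch of that computation on made-up judgment data (using a simple binomial standard error, which may differ from the paper's exact error bars):

```python
import math

def win_rate_with_se(judgments: list) -> tuple:
    """Win rate of model A over model B from binary pairwise judgments.

    `judgments` holds 1 where labelers preferred model A, 0 otherwise.
    Returns the win rate and its standard error (binomial approximation).
    """
    n = len(judgments)
    p = sum(judgments) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Made-up data: 1 = InstructGPT preferred, 0 = baseline preferred.
judgments = [1] * 142 + [0] * 58
p, se = win_rate_with_se(judgments)
print(f"win rate: {p:.0%} ± {1.96 * se:.0%} (95% CI)")
```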
Automatic Metrics
The paper also evaluated on several NLP benchmarks to check for alignment tax (capability regression due to alignment training):
- TruthfulQA — Measures tendency to produce false but plausible statements
- RealToxicityPrompts — Measures toxic language generation
- Winogender / BBQ — Measures bias
- Standard NLP benchmarks — HellaSwag, WinoGrande, ARC, etc.
Experimental Results
Key Finding 1: InstructGPT Outputs Are Preferred Over GPT-3
Human labelers preferred InstructGPT-1.3B outputs over GPT-3-175B outputs 71% ± 3% of the time.
| Model | Win Rate vs. GPT-3 175B (prompted) | Parameters |
|---|---|---|
| GPT-3 175B (few-shot) | 28% | 175B |
| SFT 1.3B | 59% | 1.3B |
| InstructGPT 1.3B (PPO-ptx) | 71% | 1.3B |
| SFT 6B | 68% | 6B |
| InstructGPT 6B (PPO-ptx) | 73% | 6B |
| SFT 175B | 72% | 175B |
| InstructGPT 175B (PPO-ptx) | 85% | 175B |
A 1.3B model aligned with RLHF beats a 175B model without alignment — a 100× parameter advantage overcome by better training signal.
Key Finding 2: Improved Truthfulness
On TruthfulQA:
– GPT-3 175B: 21% truthful + informative
– InstructGPT 175B (PPO-ptx): 33% truthful + informative
This is a relative improvement of ~57%, though absolute performance remains modest — showing truthfulness is hard even with alignment.
Key Finding 3: Reduced Toxicity
On RealToxicityPrompts (measuring toxic continuations):
– GPT-3 175B: toxicity score ~0.44
– InstructGPT 175B (PPO): toxicity score ~0.28
A 36% reduction in toxic output generation. The model also showed smaller increases in toxicity when given toxic prompts (more robust to adversarial inputs).
Key Finding 4: Minimal Alignment Tax
On standard NLP benchmarks:
| Benchmark | GPT-3 175B | InstructGPT 175B (PPO) | InstructGPT 175B (PPO-ptx) |
|---|---|---|---|
| HellaSwag | 78.9% | 71.4% (↓) | 79.1% (≈) |
| WinoGrande | 70.2% | 67.1% (↓) | 70.5% (≈) |
| ARC (Easy) | 68.8% | 66.5% (↓) | 69.2% (≈) |
| Lambada | 76.2% | 69.3% (↓) | 75.8% (≈) |
PPO alone causes noticeable regression on benchmarks, but PPO-ptx (with the pretraining mix) almost completely eliminates the alignment tax. This validates the importance of the pretraining-mix term.
Key Finding 5: Labeler Agreement
The paper reports inter-annotator agreement of ~73%, meaning labelers agreed on the preferred output about 73% of the time. This establishes an approximate human ceiling for the task.
Interestingly, when tested with held-out labelers who did not participate in training data collection, InstructGPT was still preferred — suggesting the model learned generalizable preferences rather than overfitting to specific labeler quirks.
Ablation Studies
Effect of Model Size
Performance scaled with model size, but the gains from RLHF were relatively consistent across scales:
| Base Model Size | SFT Win Rate vs GPT-3 | PPO-ptx Win Rate vs GPT-3 | RLHF Gain |
|---|---|---|---|
| 1.3B | 59% | 71% | +12% |
| 6B | 68% | 73% | +5% |
| 175B | 72% | 85% | +13% |
The 175B model showed the largest absolute gain from RLHF (+13 points), though the pattern is not monotonic in model size: the 1.3B model gained nearly as much (+12 points), while the 6B model gained least.
Effect of Pretraining Mix (PPO vs PPO-ptx)
The ablation clearly showed that the pretraining mix (coefficient $\gamma$) is essential:
- PPO without pretraining mix: Higher reward on the RM, but significant regression on NLP benchmarks and occasional degenerate outputs
- PPO-ptx: Slightly lower reward, but no benchmark regression and more robust outputs
Effect of KL Coefficient
The paper explored different values of the KL coefficient $\beta$:

| $\beta$ | Effect |
|---|---|
| Too low | Reward hacking: the model exploits RM weaknesses |
| 0.02 (optimal) | Good balance of reward and output quality |
| Too high | Model barely deviates from SFT, limited improvement |
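A related technique from earlier RLHF work (Ziegler et al., 2019), also implemented in libraries such as trl, is an adaptive controller that nudges $\beta$ toward a target KL. The hyperparameters above list a fixed $\beta$, so treat this as a companion sketch rather than the paper's method:

```python
class AdaptiveKLController:
    """Adjusts the KL coefficient toward a target KL (Ziegler et al., 2019 style)."""

    def __init__(self, init_kl_coef: float = 0.02, target_kl: float = 6.0, horizon: int = 10_000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped so each adjustment stays gentle.
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef

controller = AdaptiveKLController()
for observed_kl in [2.0, 9.0, 12.0, 5.5]:   # made-up per-batch KL measurements
    beta = controller.update(observed_kl, n_steps=256)
    print(f"observed KL {observed_kl:4.1f} -> beta {beta:.4f}")
```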
SFT Data Scaling
Surprisingly, the authors found that SFT performance plateaus quickly — performance on held-out data peaks after just ~1 epoch of SFT training (despite running 16 epochs). This suggests:
- The model learns the instruction-following “format” quickly
- Further SFT training leads to overfitting on demonstration data
- The RM and PPO stages are where the real alignment learning happens
Comparison with Contemporary Methods
| Method | Human Feedback | Training Approach | Scalability | Key Limitation |
|---|---|---|---|---|
| InstructGPT (RLHF) | Pairwise rankings | SFT → RM → PPO (three stages) | Proven at 175B scale | Complex pipeline, reward hacking risk |
| FLAN (2021) | None (instruction templates) | Multi-task SFT | Highly scalable | No preference learning |
| Constitutional AI (2022) | AI-generated critiques + RLHF | AI feedback + RL | Reduces human annotation | Still needs some human labels |
| DPO (2023) | Pairwise preferences | Direct optimization (no RM) | Simpler pipeline | Less flexible than RLHF |
| RLAIF (2023) | AI-generated preferences | Same as RLHF but AI labels | Very scalable | Depends on AI judge quality |
| PPO-Max / ReMax (2023) | Rankings | RM + REINFORCE-style updates (simplified RL) | Easier to implement | May underperform PPO |
InstructGPT’s three-stage pipeline remains the gold standard against which all subsequent alignment methods are compared, even if newer methods simplify parts of it.
Strengths of the Paper
1. Rigorous Human Evaluation
Unlike many papers that rely solely on automatic metrics, InstructGPT invested heavily in human evaluation — the gold standard for measuring alignment. The use of both trained labelers and held-out evaluators strengthens the findings.
2. Practical Scalability Demonstration
The paper demonstrated RLHF at 175B parameter scale, proving it was viable for production systems. This was not merely academic — InstructGPT was deployed as a real product.
3. Alignment Tax Analysis
The careful measurement of performance regression on standard benchmarks, and the PPO-ptx solution to mitigate it, was a valuable practical contribution.
4. Transparent Limitations Discussion
The paper openly discusses failure modes:
– The model can still generate harmful outputs
– Performance depends heavily on labeler quality and demographics
– The model can be overly cautious (refusing harmless requests)
– RLHF amplifies existing biases in the labeler pool
5. Reproducible Pipeline
The three-stage pipeline (SFT → RM → PPO) became a widely reproduced recipe. Projects like Anthropic’s Claude, LLaMA-2-Chat, Tulu, and Zephyr all follow variants of this pipeline.
Limitations and Criticisms
1. Labeler Bias
The 40 labelers were predominantly English-speaking, US-based contractors. Their preferences inevitably encode cultural assumptions about what constitutes a “good” response. The paper acknowledges this but does not resolve it.
2. Reward Hacking
Despite the KL penalty, the model can still learn to produce outputs that score highly on the reward model without being genuinely better. Examples include:
– Verbose responses (longer is scored higher)
– Hedging language (“I’m not sure, but…” scored as more honest)
– Sycophantic agreement with the user’s stated beliefs
3. Training Instability
The PPO stage is notoriously unstable and sensitive to hyperparameters. The authors note that the 175B reward model was too unstable to train, forcing them to use a 6B RM. This mismatch between policy and reward model sizes is a known concern.
4. Goodhart’s Law
“When a measure becomes a target, it ceases to be a good measure.”
The reward model is a proxy for human preferences. Optimizing against it too aggressively leads to reward over-optimization — outputs that exploit reward model weaknesses rather than improving actual quality.
The paper shows evidence of this: as the policy is optimized further against the reward model (drifting further from the SFT policy), the proxy reward keeps rising even after human-judged quality stops improving. This means there is an optimal level of RL optimization beyond which quality degrades.
5. Cost and Complexity
The three-stage pipeline requires:
– Specialized human annotation teams
– Training three separate models (SFT, RM, RL policy)
– Careful hyperparameter tuning at each stage
– Significant compute resources
This complexity motivated simpler alternatives like DPO (Direct Preference Optimization).
Impact and Legacy
InstructGPT’s impact on the field has been profound:
Direct Descendants
- ChatGPT (November 2022) — Built on the InstructGPT pipeline, applied to GPT-3.5, and became the fastest-growing consumer application in history
- GPT-4 (March 2023) — Continued using RLHF at larger scale
- Claude (Anthropic) — Extended RLHF with Constitutional AI (RLAIF)
- LLaMA-2-Chat (Meta, 2023) — Open-source reproduction of the InstructGPT pipeline
Methodological Influence
The paper established several now-standard practices:
- Comparison data > demonstration data for preference learning
- KL-constrained RL as the standard alignment optimization
- Pretraining mix to prevent capability regression
- Human evaluation as the primary alignment metric
- Red teaming as part of safety evaluation
The RLHF Ecosystem
The paper spawned an entire ecosystem of tools and frameworks:
```python
# Example: a modern RLHF pipeline using the trl library.
# NOTE: API sketch based on older trl releases; exact names may differ across
# versions. `reward_model` and `dataloader` are assumed to be defined elsewhere.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

# Step 1: Load SFT model as starting point
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Step 2: Configure PPO with KL penalty
ppo_config = PPOConfig(
    model_name="instructgpt-style-rlhf",
    learning_rate=1e-5,
    init_kl_coef=0.02,   # Beta: KL penalty coefficient
    target_kl=6.0,       # Adaptive KL target
    batch_size=256,
    mini_batch_size=64,
    ppo_epochs=4,        # PPO epochs per batch
)

# Step 3: Initialize PPO trainer
ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    tokenizer=tokenizer,
)

# Step 4: Training loop
for batch in dataloader:
    query_tensors = batch["input_ids"]

    # Generate responses from the policy
    response_tensors = ppo_trainer.generate(
        query_tensors,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
    )

    # Score responses with the reward model
    # (must return a list of scalar tensors, one per sample)
    rewards = reward_model(query_tensors, response_tensors)

    # PPO update step (includes the KL penalty internally)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    print(f"Mean reward: {stats['ppo/mean_scores']:.3f}, "
          f"KL div: {stats['objective/kl']:.3f}")
```
Future Research Directions
Several open problems remain in the RLHF paradigm that InstructGPT established:
1. Scalable Oversight
As models become more capable, human evaluators may not be able to judge output quality reliably (especially for complex reasoning, code, or scientific content). Research directions include:
– AI-assisted evaluation (constitutional AI, debate)
– Process-based rewards (rewarding reasoning steps, not just final answers)
– Recursive reward modeling (using aligned models to train better reward models)
2. Reducing Reward Hacking
- Ensemble reward models to reduce exploitability (see the sketch after this list)
- Conservative optimization methods that explicitly account for RM uncertainty
- Iterated RLHF — alternating between collecting new human data and retraining
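As a toy illustration of the ensemble idea from the first bullet above: score each response with several reward models and aggregate conservatively, so that a response one model loves but the others distrust is down-weighted. The mean-minus-std rule here is one reasonable choice, not a prescription from the paper:

```python
import torch

def conservative_reward(per_model_scores: torch.Tensor, penalty: float = 1.0) -> torch.Tensor:
    """Aggregate an ensemble of reward scores conservatively.

    `per_model_scores` has shape (num_models, batch). The mean is penalized by
    the across-model standard deviation, so responses the models disagree on
    (a common symptom of reward hacking) get a lower effective reward.
    """
    mean = per_model_scores.mean(dim=0)
    std = per_model_scores.std(dim=0)
    return mean - penalty * std

# Made-up scores from 3 reward models for 4 candidate responses.
scores = torch.tensor([
    [0.9, 0.2, 1.5, -0.3],
    [1.0, 0.1, 0.2, -0.4],   # model 2 disagrees sharply on response 3
    [0.8, 0.3, 1.4, -0.2],
])
print(conservative_reward(scores))
```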
3. Multi-Objective Alignment
InstructGPT optimizes a single scalar reward, but alignment is inherently multi-dimensional (helpful, harmless, honest). Future work includes:
– Multi-objective RL with Pareto-optimal solutions
– Conditional alignment — different safety levels for different contexts
– Constitutional approaches with explicit rule hierarchies
4. Eliminating the RL Stage
DPO (Rafailov et al., 2023) showed that the RM + PPO stages can be collapsed into a single supervised objective:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$
Whether DPO can fully replace PPO-based RLHF at scale remains an active research question, with evidence on both sides.
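A minimal PyTorch sketch of the DPO objective, assuming per-sequence log-probabilities under the policy and a frozen reference model have already been computed (toy numbers; not the DPO authors' implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probabilities of the chosen (y_w)
    or rejected (y_l) response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (log-ratio_w - log-ratio_l)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy per-sequence log-probs for a batch of 3 preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-42.0, -37.5, -50.1]),
    policy_rejected_logps=torch.tensor([-44.2, -36.9, -55.0]),
    ref_chosen_logps=torch.tensor([-43.1, -38.0, -51.0]),
    ref_rejected_logps=torch.tensor([-43.9, -37.2, -54.1]),
)
print(f"DPO loss: {loss.item():.3f}")
```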
5. Democratizing Alignment
The InstructGPT pipeline requires significant resources. Ongoing work aims to make alignment accessible to the broader research community through:
– Open-source reward models (e.g., OpenAssistant, UltraFeedback)
– Synthetic preference data generation
– Parameter-efficient RLHF (combining LoRA with PPO)
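As a sketch of the last point, parameter-efficient fine-tuning libraries can wrap the policy so that only low-rank adapters are trained during the RL stage. The example below uses the peft library; the model id and target module names are placeholders, and hooking the wrapped model into a PPO trainer is omitted:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model id; any causal LM checkpoint works here.
base_model = AutoModelForCausalLM.from_pretrained("your-sft-model")

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # depends on the architecture
    task_type="CAUSAL_LM",
)

# Only the LoRA adapter weights are trainable; the base model stays frozen,
# which makes the PPO stage far cheaper in memory and compute.
policy = get_peft_model(base_model, lora_config)
policy.print_trainable_parameters()
```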
Conclusion
The InstructGPT paper is one of the most consequential ML papers of the 2020s. Its core contributions can be summarized as:
- A scalable three-stage pipeline (SFT → Reward Modeling → PPO) that transforms a raw language model into an instruction-following assistant
- Empirical proof that a small aligned model (1.3B) can be preferred by humans over a 100× larger unaligned model (175B)
- The PPO-ptx technique that mitigates alignment tax through pretraining data mixing
- Rigorous evaluation methodology combining human preference judgments with automatic safety and capability benchmarks
- An honest discussion of limitations including labeler bias, reward hacking, and the fundamental difficulty of specifying human values
The RLHF paradigm introduced by InstructGPT became the foundation upon which ChatGPT, Claude, Gemini, and virtually every modern conversational AI system was built. While newer methods like DPO offer simpler alternatives, the conceptual framework — learn a reward model from human preferences, then optimize against it with KL-constrained RL — remains the dominant paradigm in AI alignment.
For researchers and practitioners working on LLM alignment, this paper is essential reading. It demonstrates both the remarkable effectiveness of human feedback for steering AI behavior and the significant challenges that remain in making AI systems truly aligned with human values.
Reference: Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. arXiv:2203.02155