The GRPO Family

A Systematic Survey of Variants, Theory, and Applications

Sixiong Xie, Zhuofan Shi, Zhengyu Wang, Dongliang Huang, Hongxu An, ZiJie Guo, Shuyang Nie, Chunxiao Xin,
Yuntao Jiang, Yunshan Na, Yun Ma, Gang Huang, Xiang Jing*

Peking University  |  National Key Laboratory of Data Space Technology and System  |  Wenjing Future Lab
Equal contribution    * Corresponding author

Abstract

TL;DR: The first comprehensive survey of the GRPO variant ecosystem

Group Relative Policy Optimization (GRPO) removes the critic network from Proximal Policy Optimization by normalizing rewards across a group of sampled completions, roughly halving training memory at comparable performance. After DeepSeek-R1 showed that GRPO can elicit chain-of-thought reasoning from base models without supervised demonstrations, a rapidly growing family of variants has emerged to address its limitations. We catalogue these variants and organize them along two axes: a problem-oriented taxonomy with six dimensions (training stability, advantage estimation, length control, credit assignment, exploration, and domain adaptation) and a technique-oriented classification, linked by a method-dimension matrix. We cast the different advantage normalization strategies into a unified (b, s) framework, analyze their bias and variance properties, and connect GRPO's group normalization to self-normalized importance sampling. A horizontal comparison under overlapping base models and benchmarks positions GRPO against PPO, DPO, RLOO, and other alignment methods; we also document negative findings and open problems.

Key Contributions

Three main contributions of this survey

  Dual Taxonomy

Problem-oriented (6 dimensions) and technique-oriented classification of 20+ GRPO variants, linked by a method-dimension matrix revealing cross-cutting patterns across training stability, advantage estimation, length control, credit assignment, exploration, and domain adaptation.

  Unified Theory

A unified (b, s) framework connecting GRPO's group normalization to self-normalized importance sampling (SNIS), with rigorous bias-variance-compute trade-off analysis for Vanilla GRPO, REINFORCE++, Dr.GRPO, and RLOO under Gaussian reward assumptions.

  Critical Analysis

Horizontal comparisons under comparable conditions (same base model and benchmarks), explicit catalogue of negative results and failure modes, technique combination conflicts, and per-method reliability assessments.

Evolution of GRPO Variants

From DeepSeekMath (2024.02) to 20+ variants across six research directions

GRPO Evolution Tree
Figure 1. The evolutionary path of GRPO variants. Branches cover training efficiency (orange), reward/advantage estimation (blue), length control (green), credit assignment (purple), exploration (teal), and domain adaptation (red).

Background

RLHF pipeline and the GRPO algorithm

RLHF Pipeline
Figure 2. RLHF pipeline overview. GRPO eliminates the critic network (value model), reducing memory consumption while maintaining competitive performance.
GRPO Algorithm
Figure 3. Vanilla GRPO workflow: for each query, the policy samples G completions, computes rewards, normalizes advantages via group statistics, and updates with clipped policy gradients.
Architecture Comparison
Figure 4. Architecture comparison: PPO (4 models) → GRPO (3 models) → DAPO (2 models) → VAPO (4 models with enhanced critic). GRPO halves memory; DAPO further removes the reference model.
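
To make the Figure 3 workflow concrete, here is a minimal PyTorch sketch of one GRPO update step for a single query. The function names are ours, the KL penalty to the reference model and all padding/masking are omitted, and real implementations batch many queries:

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Normalize the G scalar rewards of one query by group statistics.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
        # logp_*: per-token log-probabilities, shape (G, T). Every token of
        # completion i shares the sequence-level advantage A_i.
        ratio = torch.exp(logp_new - logp_old)                    # (G, T)
        adv = advantages.unsqueeze(-1)                            # (G, 1)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        per_token = torch.min(ratio * adv, clipped * adv)         # (G, T)
        # Vanilla GRPO averages per sequence first, then over the group.
        return -per_token.mean(dim=-1).mean()

    # Usage: G = 8 completions for one query, binary correctness rewards.
    rewards = torch.tensor([1., 0., 1., 0., 0., 1., 0., 0.])
    adv = grpo_advantages(rewards)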

Taxonomy

Six-dimensional problem-oriented classification

Category | Core Issue | Representative Methods
Training Efficiency & Stability | High computational cost, entropy collapse | CPPO, Flow-GRPO, DAPO
Reward & Advantage Estimation | Biased advantage, sparse rewards | REINFORCE++, Dr.GRPO, Critique-GRPO, NTHR, SEED-GRPO
Length Control | Overthinking, verbose CoT | S-GRPO, GRPO-λ, BRPO, GRPO-LEAD
Credit Assignment | Coarse-grained uniform assignment | GSPO, SPO, GTPO, GRPO-S
Exploration & Diversity | Limited exploration, entropy collapse | DAPO Clip-Higher, RiskPO, Training-Free GRPO, F-GRPO
Domain Adaptation | Task-specific challenges | DRPO, ARPO, ToolRL, vsGRPO, VAPO

Detailed Comparisons

Visual comparison of techniques across all six dimensions

Advantage Estimation
Figure 5. Advantage estimation strategies: Vanilla GRPO (local), REINFORCE++ (global), Dr.GRPO (no scaling), SEED-GRPO (uncertainty), Critique-GRPO (NL feedback).
DAPO Techniques
Figure 6. DAPO: (a) Clip-Higher for asymmetric clipping, (b) Dynamic Sampling, (c) Token-Level Loss, (d) Overlong Reward Shaping.
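
Panel (a) is easy to state in code. DAPO decouples the two clipping bounds (the paper uses ε_low = 0.2 and ε_high = 0.28), so low-probability tokens with positive advantage retain an update signal; a minimal sketch with illustrative names:

    import torch

    def clip_higher_loss(ratio: torch.Tensor, adv: torch.Tensor,
                         eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
        # Asymmetric clipping: the upper bound 1 + eps_high is looser than
        # the symmetric PPO bound, leaving room for upward probability moves
        # and thereby counteracting entropy collapse.
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
        return -torch.min(ratio * adv, clipped * adv).mean()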
Length Control
Figure 7. Length control: Vanilla GRPO (no incentive), S-GRPO (decaying rewards), GRPO-λ (selective penalties), BRPO (budget constraints).
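
The decaying-reward idea can be sketched in a few lines; the exponential schedule below is a schematic stand-in, not S-GRPO's exact serial-group rule:

    def decaying_reward(correct: bool, exit_index: int,
                        base: float = 1.0, decay: float = 0.5) -> float:
        # Correct answers produced at earlier exit points earn more reward,
        # so shorter reasoning is preferred when accuracy is equal.
        # The 0.5 decay factor is illustrative only.
        return base * decay ** exit_index if correct else 0.0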
Credit Assignment
Figure 8. Credit assignment: sequence-level (GRPO/GSPO), segment-level (SPO), token-level (GTPO with entropy weighting).
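
At the sequence-level end of this spectrum, GSPO replaces per-token importance ratios with one length-normalized sequence ratio (the geometric mean of the token ratios), so clipping accepts or rejects whole sequences; a sketch ignoring padding:

    import torch

    def gspo_sequence_ratio(logp_new: torch.Tensor,
                            logp_old: torch.Tensor) -> torch.Tensor:
        # logp_*: per-token log-probabilities, shape (G, T). Averaging the
        # log-ratio over tokens before exponentiating yields the geometric
        # mean of the per-token ratios.
        return torch.exp((logp_new - logp_old).mean(dim=-1))      # (G,)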
Exploration
Figure 9. Exploration: DAPO Clip-Higher (asymmetric clipping), RiskPO (MVaR), Training-Free GRPO (context-space, 100x cost reduction).
Domain Adaptation
Figure 10. Domain adaptations: DRPO (medical), ARPO (GUI agents), ToolRL (tool use), vsGRPO (vision), VAPO (AIME 60.4 SOTA).

Unified Theoretical Framework

The (b, s) framework for advantage normalization

A_i(b, s) = (R(q, o_i) − b) / s

Variant | Baseline b | Scale s | Bias | Variance
Vanilla GRPO | Local mean | Local std | O(1/G) | O(σ_q^{-2}/G)
REINFORCE++ | Global mean | Global std | O(1/N) | O(σ_q^{-2}/G)
Dr.GRPO | Local mean | 1 (none) | 0 | O(σ_q^{2}/G)
RLOO | Leave-one-out mean | 1 (none) | 0 | O(σ_q^{2}/G)
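
Under this parameterization the four rows differ only in their choice of b and s. A minimal NumPy sketch (REINFORCE++'s batch-level statistics are assumed precomputed and passed in; the variant names are ours):

    import numpy as np

    def advantages(rewards, variant="grpo",
                   global_mean=0.0, global_std=1.0, eps=1e-6):
        # A_i(b, s) = (R_i - b) / s for one group of G rewards.
        R = np.asarray(rewards, dtype=float)
        G = len(R)
        if variant == "grpo":          # b = local mean, s = local std
            return (R - R.mean()) / (R.std() + eps)
        if variant == "reinforce++":   # b, s = global batch statistics
            return (R - global_mean) / (global_std + eps)
        if variant == "dr_grpo":       # b = local mean, s = 1
            return R - R.mean()
        if variant == "rloo":          # b = leave-one-out mean, s = 1
            return R - (R.sum() - R) / (G - 1)
        raise ValueError(f"unknown variant: {variant}")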

Comprehensive Comparison

Performance and reliability of GRPO variants

Method | Base Model | Critic | Ref. Model | AIME'24 | MATH-500 | Status | Reliability
Vanilla GRPO | Qwen2.5-7B | No | Yes | – | – | Pub. | ★★★
CPPO | Qwen2.5-7B | No | Yes | – | – | arXiv | ★
DAPO | Qwen2.5-32B | No | No | 50.0 | – | Pub. | ★★★
REINFORCE++ | various | No | Yes | – | – | arXiv | ★★
Dr.GRPO | Qwen2.5-7B | No | Yes | 43.3 | – | Pub. | ★★
SEED-GRPO | Qwen2.5-7B | No | Yes | 56.7 | 83.4 | arXiv | ★
S-GRPO | Qwen2.5-7B | No | Yes | – | ↑6.1% | Pub. | ★
GRPO-LEAD | Qwen2.5-14B | No | Yes | – | – | Pub. | ★★
GSPO | Qwen2.5-32B | No | Yes | – | – | arXiv | ★★★
RiskPO | various | No | Yes | – | – | Pub. | ★★
VAPO | Qwen2.5-32B | Yes | Yes | 60.4 | – | arXiv | ★★

★★★ = Independently reproduced/adopted  |  ★★ = Multiple evaluations  |  ★ = Single-team self-report. Different methods use different training data and hyperparameters; results indicate general capability rather than strict rankings.

Method-Dimension Matrix

Cross-cutting contributions of each method

Method | Efficiency | Reward/Adv. | Length | Credit | Exploration | Domain
CPPO | ● | | | | |
Flow-GRPO | ● | | | | |
DAPO | ● | | | | |
REINFORCE++ | | ● | | | |
Dr.GRPO | | ● | | | |
Critique-GRPO | | ● | | | |
NTHR | | ● | | | |
SEED-GRPO | | ● | | | |
S-GRPO | | | ● | | |
GRPO-λ | | | ● | | |
BRPO | | | ● | | |
GRPO-LEAD | | | ● | | |
GSPO | | | | ● | |
SPO | | | | ● | |
GTPO / GRPO-S | | | | ● | |
DAPO Clip-Higher | | | | | ● |
RiskPO | | | | | ● |
Training-Free GRPO | | | | | ● |
F-GRPO | | | | | ● |
DRPO | | | | | | ●
ARPO | | | | | | ●
ToolRL | | | | | | ●
vsGRPO | | | | | | ●
VAPO | | | | | | ●

● Primary contribution    ○ Secondary contribution

Citation

If you find this survey useful, please cite our paper

@article{xie2025grpofamily,
  title={The GRPO Family: A Systematic Survey of Variants, Theory, and Applications},
  author={Xie, Sixiong and Shi, Zhuofan and Wang, Zhengyu and Huang, Dongliang and An, Hongxu and Guo, ZiJie and Nie, Shuyang and Xin, Chunxiao and Jiang, Yuntao and Na, Yunshan and Ma, Yun and Huang, Gang and Jing, Xiang},
  doi={10.13140/RG.2.2.26262.92485},
  year={2025}
}