The GRPO Family: A Systematic Survey of Variants, Theory, and Applications

Abstract

TL;DR: The first comprehensive survey of the GRPO variant ecosystem

Group Relative Policy Optimization (GRPO) removes the critic network from Proximal Policy Optimization by normalizing rewards across a group of sampled completions, roughly halving training memory at comparable performance. After DeepSeek-R1 showed that GRPO can elicit chain-of-thought reasoning from base models without supervised demonstrations, a rapidly growing family of variants has emerged to address its limitations. We catalogue these variants and organize them along two axes: a problem-oriented taxonomy with six dimensions (training stability, advantage estimation, length control, credit assignment, exploration, and domain adaptation) and a technique-oriented classification, linked by a method-dimension matrix. We cast the different advantage normalization strategies into a unified (b, s) framework, analyze their bias and variance properties, and connect GRPO's group normalization to self-normalized importance sampling. A horizontal comparison under overlapping base models and benchmarks positions GRPO against PPO, DPO, RLOO, and other alignment methods; we also document negative findings and open problems.

Key Contributions

Three main contributions of this survey

Dual Taxonomy

Problem-oriented (6 dimensions) and technique-oriented classification of 20+ GRPO variants, linked by a method-dimension matrix revealing cross-cutting patterns across training stability, advantage estimation, length control, credit assignment, exploration, and domain adaptation.

Unified Theory

A unified (b, s) framework connecting GRPO's group normalization to self-normalized importance sampling (SNIS), with rigorous bias-variance-compute trade-off analysis for Vanilla GRPO, REINFORCE++, Dr.GRPO, and RLOO under Gaussian reward assumptions.

Critical Analysis

Horizontal comparisons under comparable conditions (same base model and benchmarks), explicit catalogue of negative results and failure modes, technique combination conflicts, and per-method reliability assessments.

Evolution of GRPO Variants

From DeepSeekMath (February 2024) to 20+ variants across six research directions

Figure 1. The evolutionary path of GRPO variants. Branches cover training efficiency (orange), reward/advantage estimation (blue), length control (green), credit assignment (purple), exploration (teal), and domain adaptation (red).

Background

RLHF pipeline and the GRPO algorithm

Figure 2. RLHF pipeline overview. GRPO eliminates the critic network (value model), reducing memory consumption while maintaining competitive performance.

Figure 3. Vanilla GRPO workflow: for each query, the policy samples G completions, computes rewards, normalizes advantages via group statistics, and updates with clipped policy gradients.
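The workflow in Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function names, the `eps` stabilizer, and `clip_eps=0.2` are hypothetical choices, log-probabilities are taken per completion rather than per token for brevity, and any KL penalty against the reference model is omitted.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for the G completions of one query."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Subtract the group mean and divide by the group std (eps avoids 0/0).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective averaged over the group (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes the incentive to move the ratio
    # outside the clipping range in the direction the advantage favors.
    return np.minimum(unclipped, clipped).mean()

# Example: G = 4 completions with binary (verifiable) rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
objective = clipped_surrogate(np.zeros(4), np.zeros(4), adv)
```

With identical old and new log-probabilities the ratio is 1 everywhere, so the objective reduces to the mean advantage, which is zero by construction.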

Figure 4. Architecture comparison: PPO (4 models) → GRPO (3 models) → DAPO (2 models) → VAPO (4 models with enhanced critic). GRPO drops the critic, roughly halving training memory; DAPO additionally removes the reference model.

Taxonomy

Six-dimensional problem-oriented classification

Category	Core Issue	Representative Methods
Training Efficiency & Stability	High computational cost, entropy collapse	CPPO, Flow-GRPO, DAPO
Reward & Advantage Estimation	Biased advantage, sparse rewards	REINFORCE++, Dr.GRPO, Critique-GRPO, NTHR, SEED-GRPO
Length Control	Overthinking, verbose CoT	S-GRPO, GRPO-λ, BRPO, GRPO-LEAD
Credit Assignment	Coarse-grained uniform assignment	GSPO, SPO, GTPO, GRPO-S
Exploration & Diversity	Limited exploration, entropy collapse	DAPO Clip-Higher, RiskPO, Training-Free GRPO, F-GRPO
Domain Adaptation	Task-specific challenges	DRPO, ARPO, ToolRL, vsGRPO, VAPO

Detailed Comparisons

Visual comparison of techniques across all six dimensions

Figure 5. Advantage estimation strategies: Vanilla GRPO (local, per-group statistics), REINFORCE++ (global statistics), Dr.GRPO (no scaling), SEED-GRPO (uncertainty-aware), Critique-GRPO (natural-language feedback).

Figure 6. DAPO: (a) Clip-Higher for asymmetric clipping, (b) Dynamic Sampling, (c) Token-Level Loss, (d) Overlong Reward Shaping.
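Three of the four DAPO components in Figure 6 are simple enough to illustrate directly. The sketch below is an assumption-laden illustration rather than DAPO's actual code: the function names are hypothetical, the clipping bounds `eps_low=0.2` and `eps_high=0.28` are illustrative defaults, and Overlong Reward Shaping is left out because the caption does not specify its form.

```python
import numpy as np

def clip_higher_surrogate(ratio, advantages, eps_low=0.2, eps_high=0.28):
    """(a) Clip-Higher: asymmetric clipping with a looser upper bound."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantages, clipped * advantages)

def keep_group(rewards):
    """(b) Dynamic Sampling filter: drop groups whose rewards are all equal,
    since their group-relative advantages are all zero."""
    rewards = np.asarray(rewards)
    return bool(rewards.max() > rewards.min())

def token_level_mean(per_token_losses):
    """(c) Token-Level Loss: average over all tokens in the batch, so long
    completions contribute in proportion to their length."""
    return float(np.concatenate(per_token_losses).mean())

def sequence_level_mean(per_token_losses):
    """Baseline for comparison: average per completion first, then across."""
    return float(np.mean([x.mean() for x in per_token_losses]))

# A short and a long completion weigh differently under the two reductions.
losses = [np.array([1.0, 1.0]), np.array([4.0, 4.0, 4.0, 4.0])]
tok, seq = token_level_mean(losses), sequence_level_mean(losses)
```

In this example the token-level mean is pulled toward the longer completion, while the sequence-level mean weighs both completions equally.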

Figure 7. Length control: Vanilla GRPO (no incentive), S-GRPO (decaying rewards), GRPO-λ (selective penalties), BRPO (budget constraints).

Figure 8. Credit assignment: sequence-level (GRPO/GSPO), segment-level (SPO), token-level (GTPO with entropy weighting).

Figure 9. Exploration: DAPO Clip-Higher (asymmetric clipping), RiskPO (MVaR), Training-Free GRPO (context-space, 100x cost reduction).

Figure 10. Domain adaptations: DRPO (medical), ARPO (GUI agents), ToolRL (tool use), vsGRPO (vision), VAPO (60.4 on AIME'24, reported state of the art).

Unified Theoretical Framework

The (b, s) framework for advantage normalization

A_i(b, s) = (R(q, o_i) − b) / s

Variant	Baseline b	Scale s	Bias	Variance
Vanilla GRPO	Local mean	Local std	O(1/G)	O(σ_q⁻²/G)
REINFORCE++	Global mean	Global std	O(1/N)	O(σ_q⁻²/G)
Dr.GRPO	Local mean	1 (none)	0	O(σ_q²/G)
RLOO	Leave-one-out mean	1 (none)	0	O(σ_q²/G)
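The four rows of the table differ only in the choice of baseline b and scale s, which the following NumPy sketch makes concrete. It is an illustration under assumptions: the function name, the variant keys, and the `eps` stabilizer are hypothetical, and the global statistics for REINFORCE++ are passed in by the caller because how they are aggregated is not specified here.

```python
import numpy as np

def bs_advantages(rewards, variant, global_mean=None, global_std=None, eps=1e-8):
    """Compute A_i(b, s) = (R_i - b) / s for one group of G rewards."""
    r = np.asarray(rewards, dtype=np.float64)
    G = r.size
    if variant == "grpo":            # local mean, local std
        b, s = r.mean(), r.std() + eps
    elif variant == "reinforce++":   # global mean, global std (caller-supplied)
        b, s = global_mean, global_std + eps
    elif variant == "dr_grpo":       # local mean, no scaling
        b, s = r.mean(), 1.0
    elif variant == "rloo":          # per-sample leave-one-out mean, no scaling
        b, s = (r.sum() - r) / (G - 1), 1.0
    else:
        raise ValueError(f"unknown variant: {variant}")
    return (r - b) / s

rewards = [1.0, 0.0, 0.0, 1.0]
for name in ("grpo", "dr_grpo", "rloo"):
    print(name, bs_advantages(rewards, name))
print("reinforce++", bs_advantages(rewards, "reinforce++", global_mean=0.25, global_std=0.5))
```

Because the leave-one-out baseline for sample i excludes R_i, the RLOO advantage equals the Dr.GRPO advantage multiplied by G/(G-1), so the two differ only by a constant factor within a group.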

Comprehensive Comparison

Performance and reliability of GRPO variants

Method	Base Model	Critic	Ref. Model	AIME'24	MATH-500	Status	Reliability
Vanilla GRPO	Qwen2.5-7B	No	Yes	—	—	Pub.	★★★
CPPO	Qwen2.5-7B	No	Yes	—	—	arXiv	★
DAPO	Qwen2.5-32B	No	No	50.0	—	Pub.	★★★
REINFORCE++	various	No	Yes	—	—	arXiv	★★
Dr.GRPO	Qwen2.5-7B	No	Yes	43.3	—	Pub.	★★
SEED-GRPO	Qwen2.5-7B	No	Yes	56.7	83.4	arXiv	★
S-GRPO	Qwen2.5-7B	No	Yes	—	↑6.1%	Pub.	★
GRPO-LEAD	Qwen2.5-14B	No	Yes	—	—	Pub.	★★
GSPO	Qwen2.5-32B	No	Yes	—	—	arXiv	★★★
RiskPO	various	No	Yes	—	—	Pub.	★★
VAPO	Qwen2.5-32B	Yes	Yes	60.4	—	arXiv	★★

★★★ = Independently reproduced/adopted | ★★ = Multiple evaluations | ★ = Single-team self-report. Different methods use different training data and hyperparameters; results indicate general capability rather than strict rankings.

Method-Dimension Matrix

Cross-cutting contributions of each method

Method	Efficiency	Reward/Adv.	Length	Credit	Exploration	Domain
CPPO	●
Flow-GRPO	●					○
DAPO	●	○	○	○	○
REINFORCE++		●
Dr.GRPO		●	○			○
Critique-GRPO		●			○
NTHR		●		○
SEED-GRPO		●
S-GRPO			●
GRPO-λ			●
BRPO			●
GRPO-LEAD		○	●			○
GSPO				●
SPO				●
GTPO / GRPO-S				●
DAPO Clip-Higher	○				●
RiskPO					●
Training-Free GRPO					●
F-GRPO		○			●
DRPO						●
ARPO					○	●
ToolRL		○				●
vsGRPO						●
VAPO		○	○	○	○	●

● Primary contribution ○ Secondary contribution

Citation

If you find this survey useful, please cite our paper

@article{xie2025grpofamily,
  title={The GRPO Family: A Systematic Survey of Variants, Theory, and Applications},
  author={Xie, Sixiong and Shi, Zhuofan and Wang, Zhengyu and Huang, Dongliang and An, Hongxu and Guo, ZiJie and Nie, Shuyang and Xin, Chunxiao and Jiang, Yuntao and Na, Yunshan and Ma, Yun and Huang, Gang and Jing, Xiang},
  doi={10.13140/RG.2.2.26262.92485},
  year={2025}
}