cs.LG, cs.AI

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

May 31, 2026

AI Summary (gemma4:e4b)

The paper introduces Group Prioritized Off-Policy Optimization (POPO), a novel framework that efficiently accelerates RL finetuning for LLM reasoning by leveraging effective off-policy training batches without requiring costly additional data rollouts.

Abstract


Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.
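The zero-variance problem described in the abstract can be illustrated with a minimal sketch. It assumes a group-normalized advantage of the form (reward minus group mean) divided by group standard deviation, which is common in group-based RLVR methods but is not specified in the text above. The function names `group_advantages` and `is_effective` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Assumption: this normalization is illustrative, not the paper's exact formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_effective(rewards):
    """A response group carries a learning signal only if its rewards differ."""
    return len(set(rewards)) > 1


# Binary verifiable rewards for three response groups (hypothetical values).
all_correct = [1.0, 1.0, 1.0, 1.0]
all_wrong = [0.0, 0.0, 0.0, 0.0]
mixed = [1.0, 0.0, 0.0, 1.0]

for name, group in [("all_correct", all_correct),
                    ("all_wrong", all_wrong),
                    ("mixed", mixed)]:
    print(name, is_effective(group), group_advantages(group))
```

Under this assumption, the all-correct and all-incorrect groups produce advantages that are identically zero, so they contribute no gradient, while the mixed group yields positive advantages for correct responses and negative ones for incorrect responses. Groups of the first kind are the "ineffective" samples that a replay mechanism would need to replace.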
