Deep Research as Rubric for Reinforcement Learning | ArxivCSExplorer

arXiv:2606.01091

cs.CL

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang

May 31, 2026

AI Summary (gemma4:e4b)

The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals for open-ended reasoning tasks.

Abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts (either hand-crafted or prompt-generated) and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-rubric achieves strongly competitive performance with only 1K to 3K training instances. GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics follow a specialization-to-rebalancing evolution, achieving the best overall performance at the third iteration. These results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.
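
To make the Stage II idea concrete, the following is a minimal sketch of how a rubric of atomic, independently verifiable constraints could be turned into a scalar reward and then into group-relative advantages of the kind used in GRPO. It is not the authors' implementation. The `RubricItem` schema, the weights, and the keyword-based `check` functions are all hypothetical; in practice each check would more plausibly be a call to an LLM judge, and the abstract does not say how DR-rubric aggregates or weights rubric items.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class RubricItem:
    """One atomic, independently checkable constraint (hypothetical schema)."""
    description: str
    check: Callable[[str], bool]  # assumption: stand-in for an LLM judge call
    weight: float = 1.0


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    passed = sum(item.weight for item in rubric if item.check(response))
    return passed / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rubric: keyword checks are illustrative placeholders only.
rubric = [
    RubricItem("Names the optimization method", lambda r: "grpo" in r.lower()),
    RubricItem("Cites at least one source", lambda r: "[1]" in r),
    RubricItem("States a limitation", lambda r: "limitation" in r.lower(), weight=2.0),
]

# One group of sampled responses to the same prompt.
responses = [
    "GRPO is used [1]; one limitation is judge noise.",
    "GRPO is used.",
    "No details given.",
]

rewards = [rubric_reward(r, rubric) for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.25, 0.0]
print(advantages)
```

Because each item is scored on its own, a response earns partial credit for the constraints it meets, which is what gives the reward its fine granularity. Standardizing within the group means only the relative quality of responses to the same prompt drives the policy update.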

Related Papers

01Low35%
PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
PReMISE introduces a framework to audit and improve the quality of rubrics used…
02Low43%
Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge
The paper introduces a novel, training-free method to automatically generate fin…
03Low40%
QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards
QUBRIC introduces a co-design framework that simultaneously optimizes queries an…
04Low27%
Reinforcement Learning with Robust Rubric Rewards
The paper introduces $\text{RLR}^3$, a novel framework that extends verifiable re…
05Low27%
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
This paper introduces CHERRL, a controllable hacking environment for rubric-base…
06Low26%
Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
The paper proposes DecomposeR, a planner-centric framework that structures deep…
07Low22%
Preference-Aware Rubric Learning for Personalized Evaluation
The paper introduces PARL, a framework that learns personalized evaluation rubri…
08Low19%
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
The paper introduces Ptah, a multi-agent harness designed to improve ver…