~ similar to 2605.27846· 20 results
Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu +3 more
CRITIC-R1 introduces a structured critic framework that treats RAG critique as an explicit error diagnosis problem using reinforcement learning, significantly improving answer quality over strong RAG…
The paper introduces a diagnostic benchmark for selective Question Answering over conflicting, multi-source personal memory, demonstrating that specialized fusion resolvers outperform general LLMs, es…
SCOPE introduces a data-free self-play framework that co-evolves a task-generating Challenger and a document-answering Solver, significantly improving open-ended performance on language models without…
Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang +7 more
QUBRIC introduces a co-design framework that simultaneously optimizes queries and rubrics, overcoming the bottleneck of vague rubrics derived from open-ended questions, leading to significant gains in…
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more
The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…
The paper introduces Cross-Model Entropy (CME), a novel label-free reward signal that uses an independent verifier model to assess the quality of a generator's output, significantly improving LLM perf…
Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue +7 more
The paper addresses the 'detection-to-abstention gap' in reasoning models, where detecting insufficient information does not lead to abstention, by proposing a novel control framework that forces mode…
The paper proposes Shapley-based input uncertainty Quantification (ShaQ), a novel framework that uses Shapley values to precisely attribute input-induced uncertainty to specific spans of text, providi…
Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin +7 more
HMPO introduces a single-stage, cost-effective reinforcement learning framework that achieves significant token compression of Chain-of-Thought reasoning with minimal loss of accuracy, applicable acro…
Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more
ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…
The paper introduces a logic-driven framework using a neural certificate function to rigorously evaluate and benchmark the generalization capabilities of reinforcement learning algorithms on unseen ta…
Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng +1 more
SPADER is a novel reinforcement learning framework that addresses the challenges of Multi-Answer Question Answering by improving credit assignment and promoting diverse exploration during long-horizon…
Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more
EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…
Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing +2 more
This paper investigates the non-monotonic role of sample difficulty in Reinforcement Learning with Verifiable Reward (RLVR), finding that medium-difficulty problems provide the most balanced and benef…
Critic-R introduces a novel framework that uses a critic model to provide natural language introspective feedback, significantly improving the performance of agentic search systems by optimizing retri…
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…
This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.
This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.