~ similar to 2605.29886· 20 results
Critic-R introduces a novel framework that uses a critic model to provide natural language introspective feedback, significantly improving the performance of agentic search systems by optimizing retri…
Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen +3 more
This paper proposes a post-training framework called Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT) to teach language models to reason by analogy.
Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more
The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.
Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang +7 more
QUBRIC introduces a co-design framework that simultaneously optimizes queries and rubrics, overcoming the bottleneck of vague rubrics derived from open-ended questions, leading to significant gains in…
The paper introduces OCC-RAG, a family of compact, task-specialized Small Language Models (SLMs) designed to achieve highly faithful, multi-hop question answering grounded strictly in provided context…
SCOPE introduces a data-free self-play framework that co-evolves a task-generating Challenger and a document-answering Solver, significantly improving open-ended performance on language models without…
The paper introduces a novel, training-free method to automatically generate fine-grained evaluation rubrics for LLM-as-a-Judge, and further proposes an iterative fine-tuning strategy that significant…
Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more
The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…
Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li +7 more
The paper proposes EAPO, an entropy-driven adaptive weighting method that dynamically adjusts the influence of positive samples during policy optimization to improve both response diversity and stabil…
This paper introduces cost-aware Retrieval-Augmented Generation (RAG), demonstrating that fixed evidence selection is brittle and that adaptive, agentic controllers are necessary for effective knowled…
This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…
The paper systematically compares multiple content representations for RAG pipelines and finds that answer retention—the ability of the representation to preserve the original answer-bearing content—i…
The paper introduces Contrastive Reflection (CORE), a novel non-parametric method that rapidly improves language model reasoning by distilling contrasts between successful and unsuccessful problem att…
Hao Chen, Xing Tang, Qirui Liu, Weijie Shi +5 more
The paper introduces the Data-centric Reasoning Compiler (DCRC), a novel data-driven framework that enhances financial QA systems by compiling user queries and retrieved documents into verifiable, exe…
The paper introduces HOPM, a hierarchical online prompt mutation framework that significantly improves the performance of language models in high-stakes evidence document generation by integrating dua…
Bin Chen, Xinye Liao, Yiming Liu, Xin Liao +1 more
The paper proposes Credit-Attenuated Privileged Feedback (CAPF), a training-time mechanism that uses verifier-side information to guide LLM search agents, significantly improving their performance on…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
Chen He, Yuhao Wu, Lei Wang, Wenxuan Zhang +1 more
The paper identifies and demonstrates that post-conclusion continuation in answer-correct long-CoT traces is harmful during LLM fine-tuning, proposing a method to cut this continuation.
Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan +4 more
The paper investigates the limits of self-evolution in LLM reasoning under closed-loop settings, finding that while self-improvement is significant, it consistently falls short of perfect oracle super…