Similar to 2605.27882 · 20 results
HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang +4 more
The paper argues that current search agents often verify existing knowledge rather than genuinely searching, and introduces LiveBrowseComp, a new benchmark to measure true evidence-driven discovery.
LongTraceRL addresses long-context reasoning challenges by generating highly challenging training data and introducing a fine-grained rubric reward, significantly improving evidence-grounded reasoning…
Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim +11 more
The paper introduces K-BrowseComp, a new web-browsing agent benchmark of 400 problems grounded in Korean contexts, demonstrating that current frontier LLMs struggle significantly with complex, context…
The paper introduces GTA, a scalable framework for generating realistic, multi-hop web-agent tasks with dense, executable trajectories, addressing the current lack of process-level supervision in web…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen +4 more
The paper introduces Graph-Distance Contribution Reward (GDCR) and Step Advantage Policy Optimization (SAPO) to provide fine-grained, step-level credit assignment for agentic search by modeling world…
Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao +1 more
The paper proposes MERIT, a dual-level, multi-horizon memory retrieval framework that significantly improves the performance of interactive text-to-SQL agents by providing both global and local memory…
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
Critic-R introduces a novel framework that uses a critic model to provide natural language introspective feedback, significantly improving the performance of agentic search systems by optimizing retri…
Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding +3 more
The paper introduces LongDS, a new benchmark for long-horizon, multi-turn data analysis, demonstrating that current AI agents struggle significantly with maintaining and updating complex analytical st…
Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong +3 more
The paper introduces VeriTrip, a new verifiable benchmark that evaluates travel planning agents' ability to perform evidence-grounded reasoning over complex, unstructured, and multimodal web data, rev…
Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu +4 more
The paper introduces Harness-1, a search agent that separates semantic decision-making from state management by using a stateful search harness, achieving state-of-the-art performance across diverse r…
Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung +3 more
GrepSeek introduces a novel direct corpus interaction (DCI) search agent that trains an LLM to find and compose evidence from large text corpora by issuing executable shell commands, achieving state-o…
Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng +3 more
The paper introduces the concept of Search-Time Contamination (STC), demonstrating that deep research agents can leak information from public benchmarks via web search, leading to an overestimation of…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua +4 more
ExpWeaver introduces a novel framework for LLM agents to learn from past experiences using latent retrieval-augmented generation, achieving state-of-the-art performance while significantly improving t…
Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more
The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.
Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang +4 more
The paper proposes ProRL, an effective Reinforcement Learning framework that rectifies gradient estimation deficiencies to optimize proactive recommendation paths, significantly outperforming existing…
Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng +1 more
SPADER is a novel reinforcement learning framework that addresses the challenges of Multi-Answer Question Answering by improving credit assignment and promoting diverse exploration during long-horizon…