Papers similar to 2605.28721

~ similar to 2605.28721· 20 results

cs.CLcs.AIRecentMay 27, 2026

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

The paper introduces VibeSearchBench, a new benchmark designed to evaluate long-horizon, proactive search capabilities, demonstrating that current state-of-the-art LLM agents are still significantly i…

View →

cs.CLRecentJun 1, 2026

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim +11 more

The paper introduces K-BrowseComp, a new web-browsing agent benchmark of 400 problems grounded in Korean contexts, demonstrating that current frontier LLMs struggle significantly with complex, context…

View →

cs.AIRecentMay 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

View →

cs.CRcs.AIRecentJun 3, 2026

Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng +3 more

The paper introduces the concept of Search-Time Contamination (STC), demonstrating that deep research agents can leak information from public benchmarks via web search, leading to an overestimation of…

View →

cs.IRRecentJun 3, 2026

SearchLog: A Web Browser Extension for Capturing Search Logs in Laboratory Studies

Jiaman He, Riccardo Xia, Dana McKay, Damiano Spina +1 more

The paper presents SearchLog, a web browser extension for collecting natural search logs during lab-based studies.

View →

cs.AIRecentMay 28, 2026

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati +2 more

MEMENTO proposes a novel framework that treats the open web as a continuous learning signal, enabling agents to acquire task-specific expertise and reusable research strategies in low-data domains wit…

View →

cs.AIRecentMay 28, 2026

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…

View →

cs.AIcs.IRRecentMay 28, 2026

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Gaurav Sahu, Laurent Charlin, Christopher Pal

The paper introduces a Deep Research pipeline that significantly improves literature search recall and demonstrates that human-curated citation lists are often unreliable and do not serve as a true gr…

View →

cs.CLRecentMay 30, 2026

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao +1 more

The paper proposes MERIT, a dual-level, multi-horizon memory retrieval framework that significantly improves the performance of interactive text-to-SQL agents by providing both global and local memory…

View →

cs.AIcs.CLRecentJun 1, 2026

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao +2 more

The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…

View →

cs.HCcs.AIRecentMay 27, 2026

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

Mert Yazan, Frederik Bungaran Ishak Situmeang, Suzan Verberne

Despite having access to web search, users' reliance on conversational AI for information remains high, driven primarily by pre-existing trust and influenced indirectly by the chatbot's conversational…

View →

cs.AIRecentMay 27, 2026

Plan Before Search: Search Agents Need Plan

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen +6 more

The paper introduces Plan, a structured agentic behavior that decomposes multi-hop questions into ordered sub-questions before retrieval, and proposes a self-bootstrapping paradigm to train it without…

View →

cs.CLRecentJun 1, 2026

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

Mingyan Wu, Han Yang, Omer Ben-Porat, Yftah Ziser

This paper introduces cost-aware Retrieval-Augmented Generation (RAG), demonstrating that fixed evidence selection is brittle and that adaptive, agentic controllers are necessary for effective knowled…

View →

cs.IRcs.AIRecentMay 30, 2026

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Md Zarif Ul Alam, Alireza Salemi, Hamed Zamani

Critic-R introduces a novel framework that uses a critic model to provide natural language introspective feedback, significantly improving the performance of agentic search systems by optimizing retri…

View →

cs.AIcs.LGRecentMay 29, 2026

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou

This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…

View →

cs.IRcs.AIRecentMay 27, 2026

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

The study compares agentic data retrieval using unstructured web data versus structured, semantically-annotated datasets, concluding that semantic metadata remains essential for high-precision, reliab…

View →

cs.DLcs.AIcs.CLRecentMay 27, 2026

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang +1 more

The paper introduces CITETRACE, a large-scale dataset and evaluation framework that systematically measures structural citation failures in search-augmented LLMs, revealing a pattern called Verified M…

View →

cs.AIcs.CLcs.IRRecentJun 1, 2026

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu +4 more

The paper introduces Harness-1, a search agent that separates semantic decision-making from state management by using a stateful search harness, achieving state-of-the-art performance across diverse r…

View →

cs.CLcs.AIcs.IRRecentMay 28, 2026

GrepSeek: Training Search Agents for Direct Corpus Interaction

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung +3 more

GrepSeek introduces a novel direct corpus interaction (DCI) search agent that trains an LLM to find and compose evidence from large text corpora by issuing executable shell commands, achieving state-o…

View →

stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →