Papers similar to 2605.30459

~ similar to 2605.30459· 20 results

cs.AIRecentMay 28, 2026

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng +49 more

The paper introduces Mindgames, a comprehensive multi-game arena for evaluating LLM agents' sustained social and strategic reasoning, demonstrating that current evaluations are limited by structural s…

View →

cs.CLcs.AIRecentMay 31, 2026

CA-BED: Conversation-Aware Bayesian Experimental Design

Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal +4 more

CA-BED is a novel framework that improves LLM performance in interactive question-answering by integrating Bayesian Experimental Design to strategically select questions that maximize information gain…

View →

cs.AIcs.CLcs.HCRecentMay 27, 2026

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig +3 more

This study investigates human-AI collaboration in question answering, finding that while collaboration is beneficial, humans make suboptimal decisions by both under-relying on correct AI suggestions a…

View →

cs.CLRecentMay 29, 2026

What Am I Missing? Question-Answering as Hidden State Probing

Chu Fei Luo, Samuel Dahan, Xiaodan Zhu

The paper proposes using question-asking as an inference-time intervention to probe a language model's hidden state, finding that the self-diagnosis process provides a predictive signal for final corr…

View →

cs.CLcs.AIRecentJun 1, 2026

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

Jiaming Qu, Lucheng fu, Yibo Hu

The study finds that in multi-agent systems, peer agreement makes LLMs more susceptible to adopting misleading answers than to correcting genuinely wrong ones, suggesting a need for verification over…

View →

cs.AIcs.LGRecentMay 28, 2026

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more

Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.

View →

cs.AIRecentMay 27, 2026

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more

The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.

View →

cs.AIRecentMay 27, 2026

Plan Before Search: Search Agents Need Plan

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen +6 more

The paper introduces Plan, a structured agentic behavior that decomposes multi-hop questions into ordered sub-questions before retrieval, and proposes a self-bootstrapping paradigm to train it without…

View →

cs.AIRecentMay 27, 2026

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

TRACER introduces a novel turn-level reinforcement framework that enables cooperative multi-LLM reasoning by separating decision-making into a regret-matching controller and a generation-credit layer.

View →

cs.MAcs.AIcs.CLRecentMay 28, 2026

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Tom Pecher

This paper simulates the Argumentative Theory of Reasoning (ATR) using multi-agent debate among LLMs, demonstrating that collective adversarial discourse significantly enhances truth-seeking performan…

View →

cs.CLRecentMay 30, 2026

Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate

Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao +3 more

The paper decomposes LLM debate convergence into three mechanisms (instability, conformity, persuasion) and finds that much observed convergence is harmful social compliance rather than genuine reason…

View →

cs.CLcs.AIRecentMay 28, 2026

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Ruoxi Su, Yuhan Liu, Jingyu Hu

The paper introduces an adaptive interview framework to gather rich persona context, demonstrating that LLMs improve decision alignment in moral dilemmas only when they selectively ground their decisi…

View →

cs.CLRecentMay 28, 2026

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana +1 more

The paper demonstrates that using English prompts causes large language models to prioritize globally dominant narratives over local cultural knowledge, even when local evidence is provided.

View →

cs.LGcs.AIRecentMay 28, 2026

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more

This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…

View →

cs.CLRecentJun 1, 2026

Not What, But How: A Communicative Audit of LLM Response Framing

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more

The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…

View →

cs.AIcs.CLcs.HCRecentMay 27, 2026

Mind Your Tone: Does Tone Alter LLM Performance?

Om Dobariya, Akhil Kumar

This study demonstrates that the tone of a prompt significantly affects the accuracy of various LLMs, requiring users to exercise caution regarding tone-robust reliability.

View →

cs.CLcs.AIcs.IRRecentMay 28, 2026

GrepSeek: Training Search Agents for Direct Corpus Interaction

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung +3 more

GrepSeek introduces a novel direct corpus interaction (DCI) search agent that trains an LLM to find and compose evidence from large text corpora by issuing executable shell commands, achieving state-o…

View →

cs.CLRecentJun 1, 2026

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Danqing Wang, Akshay Sivaraman, Lei Li

The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…

View →

cs.AIRecentMay 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

View →

cs.CLcs.IRRecentMay 29, 2026

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more

The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.

View →