~ similar to 2605.31354· 20 results
The paper demonstrates that extended pure neural reasoning fails on complex, deterministic state-tracking tasks beyond a certain 'Deterministic Horizon,' necessitating the integration of external tool…
The paper demonstrates that self-reflective agents can systematically confabulate incorrect memories, leading them to fail tasks even when the environment resets, and proposes a metric and mitigation…
The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…
Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang +3 more
This paper introduces a failure-aware observability framework to diagnose wasted computation in multi-agent LLM systems by mapping recurring failure modes to online trace signals.
The paper analyzes the failure modes of aggressive 2-bit quantization in large reasoning models, proposing lightweight controls like FP16 planning and loop rescue to restore accuracy and achieve pract…
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more
The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…
The paper introduces a metric, the compositional residual eps*, to quantify how multi-component LLM agents violate basic probability axioms when combining local, coherent claims into a global predicti…
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar +1 more
This paper proposes a method to recover recoverability structure from failed traces of post-trained language models, enabling test-time routing and post-training analysis.
Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more
The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…
Kaixiang Zhao, Tianrun Yu, Shawn Huang, Porter Jenkins +2 more
TIGER is an inference-time framework that uses graph-based evidence routing to independently assess and repair unsupported facts (hallucinations) in multimodal generation.
Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng +49 more
The paper introduces Mindgames, a comprehensive multi-game arena for evaluating LLM agents' sustained social and strategic reasoning, demonstrating that current evaluations are limited by structural s…
This paper investigates how different types of compressed reasoning data (Explicit, Composed, Implicit CoT) affect LLM performance during post-training, finding that the choice of compression and subs…
Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang +1 more
FALAT is a diagnostic framework that treats failure attribution in complex LLM agent trajectories as a dependency-guided search problem, successfully identifying both the responsible agent and the dec…
The paper proposes DySCo, a dynamic trust-aware sparse consensus mechanism, to efficiently manage communication in multi-agent LLM systems by selectively connecting agents based on real-time value, th…