~ similar to 2605.30335· 20 results
The paper proposes using an LLM aggregator that analyzes complete reasoning traces, demonstrating that trace-level synthesis is superior to traditional consensus methods like majority voting for solvi…
The paper evaluates multi-agent LLM oracle systems for prediction market resolution, finding that independent aggregation with confidence-weighted voting significantly outperforms single-model baselin…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
The paper evaluates LLM reasoning on Boolean satisfiability (SAT) problems, concluding that conventional metrics are misleading and proposing a paired-formula protocol with Accurate Differentiation Ra…
The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…
Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen +1 more
This paper introduces MCTS-Guided Group Relative Policy Optimization (M-GRPO) to enhance LLM spatial reasoning by improving the decomposition of complex tasks into optimal sub-tasks.
Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang +3 more
This paper introduces a failure-aware observability framework to diagnose wasted computation in multi-agent LLM systems by mapping recurring failure modes to online trace signals.
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
The paper proposes Multi-Order Communication (MOC) to overcome the limitations of standard first-order message passing in LLM-based multi-agent systems, significantly improving performance by capturin…
Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang +1 more
FALAT is a diagnostic framework that treats failure attribution in complex LLM agent trajectories as a dependency-guided search problem, successfully identifying both the responsible agent and the dec…
Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho +1 more
The paper proposes Faithful Agentic XAI (FAX), a verification framework that explicitly checks LLM-generated explanations against model behavior, significantly improving explanation faithfulness on a…
Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin +1 more
The paper proposes a novel temporal and structural credit assignment framework to efficiently optimize multi-agent LLM systems by decomposing the error signal and using targeted, discrete gradient upd…
The paper introduces an Integrated, cross-Architecture Reasoning (IAR) framework to provide a unified and robust method for interpreting the opaque reasoning processes within Large Language Models.
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang +5 more
The paper introduces Atomic Decomposition and Recombination (ADR), a novel framework that generates genuinely novel and challenging verifiable code tasks, significantly improving the scalability of Re…
The paper experimentally evaluates 12 multi-agent LLM collaboration topologies for software design, finding that structural adversarial prompting and cross-model review are the most effective approach…
The paper demonstrates that extended pure neural reasoning fails on complex, deterministic state-tracking tasks beyond a certain 'Deterministic Horizon,' necessitating the integration of external tool…
The paper proposes a local perturbation theory showing that cross-domain interference in multi-domain RL occurs via a low-dimensional shared conflict subspace, which can be selectively mitigated by sh…
The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…