~ similar to 2605.30087· 20 results
The paper proposes a deterministic, version-aware aggregation method that significantly outperforms existing LLM-based systems for resolving memory conflicts in fact consolidation tasks.
Eywa is a provenance-grounded memory architecture for AI agents that separates source evidence from derived beliefs, significantly improving memory reliability and diagnosability across multiple evalu…
Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more
The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…
Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu +6 more
MemGuard introduces a type-aware memory framework to prevent heterogeneous memory contamination in long-term memory-augmented LLMs, significantly improving memory reliability and efficiency.
Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more
The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.
Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng +3 more
The paper introduces the concept of Search-Time Contamination (STC), demonstrating that deep research agents can leak information from public benchmarks via web search, leading to an overestimation of…
The paper proposes a novel, efficient method for checking the factuality of claims generated by LLMs by framing it as a true/false reading comprehension task and incorporating explicit test-taking str…
Yuxi Sun, Wenbo Shang, Wei Gao, Xin Huang +1 more
The paper introduces a diagnostic testbed, PAVE, to evaluate how LLMs arbitrate between their internal knowledge and retrieved evidence during fact-checking, revealing that this arbitration is unrelia…
The paper proposes SAGE, a novelty-aware gate that efficiently controls memory updates in agentic LLMs by classifying new facts as clearly novel, clearly redundant, or uncertain, thereby significantly…
The paper evaluates LLM reasoning on Boolean satisfiability (SAT) problems, concluding that conventional metrics are misleading and proposing a paired-formula protocol with Accurate Differentiation Ra…
Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more
The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.
Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more
The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…
The paper demonstrates that self-reflective agents can systematically confabulate incorrect memories, leading them to fail tasks even when the environment resets, and proposes a metric and mitigation…
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala +2 more
The paper identifies the Misattribution Gap, showing that memory-layer attacks (Semantic Norm Drift) can mimic model failure in multi-agent AI systems, and proposes novel detection and mitigation tech…
The paper introduces MemCog, a Memory-as-Cognition system that integrates memory access directly into the reasoning process, significantly improving agent performance, especially in proactive memory r…
The paper introduces memorywire, a vendor-neutral JSON-Schema wire format and reference implementation designed to standardize and govern memory operations across disparate agent-memory frameworks.
HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang +4 more
The paper argues that current search agents often verify existing knowledge rather than genuinely searching, and introduces LiveBrowseComp, a new benchmark to measure true evidence-driven discovery.
The paper introduces OCC-RAG, a family of compact, task-specialized Small Language Models (SLMs) designed to achieve highly faithful, multi-hop question answering grounded strictly in provided context…
This paper investigates the production-evaluation gap in Large Reasoning Models (LRMs), finding that while LRMs excel at generating solutions, they struggle significantly to evaluate flawed reasoning,…
The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…