Papers similar to 2605.30981

~ similar to 2605.30981· 20 results

cs.AIcs.CLRecentMay 27, 2026

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu +14 more

The paper introduces MemTrace, a framework that treats LLM memory pipelines as traceable graphs to systematically diagnose and automatically correct memory-related errors, boosting performance by up t…

View →

cs.LGcs.AIcs.SERecentMay 30, 2026

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

The paper demonstrates that standard LLM evaluation metrics overestimate performance because they fail to account for the stability of outcomes, showing a significant gap between reported pass rates a…

View →

cs.AIcs.LGRecentMay 29, 2026

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou

This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…

View →

cs.CLcs.AIRecentMay 28, 2026

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

The paper introduces Reasoning in Memory (RiM), a latent reasoning method that replaces autoregressive token generation with fixed memory blocks to enable compute-efficient internal working memory for…

View →

cs.CLcs.AIRecentMay 27, 2026

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more

This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…

View →

cs.AIRecentMay 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

View →

cs.AIcs.CLcs.CRRecentApr 27, 2026

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

Hikmat Karimov, Rahid Zahid Alekberli

The paper proposes a novel information-geometric framework to analyze LLM stability by integrating task utility, external entropy, and internal structural proxies, showing this composite score improve…

View →

cs.LGcs.CLRecentMay 30, 2026

Task Structure Reverses Layerwise State Encoding in Sequence Models

Yuhang Jiang

The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…

View →

cs.LGcs.AIRecentMay 28, 2026

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Prakhar Dixit, Sadia Kamal, Tim Oates

The paper demonstrates that self-reflective agents can systematically confabulate incorrect memories, leading them to fail tasks even when the environment resets, and proposes a metric and mitigation…

View →

cs.CRcs.AIRecentMay 11, 2026

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

Saba Pourhanifeh, AbdulAziz AbdulGhaffar, Ashraf Matrawy

The paper empirically evaluates domain-adapted and general-purpose LLMs for structured threat modelling (STRIDE on 5G security), finding that domain adaptation and model size do not guarantee reliable…

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu +6 more

MemGuard introduces a type-aware memory framework to prevent heterogeneous memory contamination in long-term memory-augmented LLMs, significantly improving memory reliability and efficiency.

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

The paper introduces functional entropy, a code-specific uncertainty quantification method, which successfully predicts functional correctness in LLM-generated code by replacing natural language seman…

View →

cs.AIRecentMay 28, 2026

Temporal Stability and Few-Shot Prompting in Math Task Assessment

Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

This study investigated the stability and prompt-responsiveness of AI tools in classifying the cognitive demand of math tasks, finding that few-shot prompting was a more reliable performance booster t…

View →

cs.AIcs.CLRecentJun 1, 2026

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao +2 more

The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…

View →

cs.AIRecentMay 27, 2026

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Chen Ying Claude, Zhihan Luo

The paper identifies five persistent, deep-seated behavioral patterns ('training strata') in LLMs, observed through long-term, intimate human-AI interaction, suggesting that training artifacts survive…

View →

cs.CLcs.IRRecentMay 29, 2026

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more

The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.

View →

cs.CLRecentMay 31, 2026

Before and After Temperature: A Distributional View of Creative LLM Generation

V. S. Raghu Parupudi, Harsha Ponnada, Aditi Kaushal, S. Shria Parupudi +2 more

The paper introduces a novel, per-token feature derived from how sampling temperature reshapes the token distribution, demonstrating it is a significantly stronger predictor of LLM creativity than sta…

View →

cs.AIRecentMay 31, 2026

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang +3 more

This paper introduces a failure-aware observability framework to diagnose wasted computation in multi-agent LLM systems by mapping recurring failure modes to online trace signals.

View →

cs.LGcs.AIEmpiricalComprehensiveRecentJun 4, 2026

Pretraining Recurrent Networks without Recurrence

Akarsh Kumar, Phillip Isola

This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.

View →