~ similar to 2605.31370· 20 results
MOOSE-Copilot is a novel web-based framework that unifies scientific hypothesis discovery by formalizing human-AI interaction, significantly improving performance over autonomous LLM baselines.
Sen Fang, Weiyuan Ding, Zhezhen Cao, Zhou Yang +1 more
AEGIS is a novel multi-agent framework that grounds vulnerability reasoning by reconstructing per-variable dependency chains over a Code Property Graph, achieving state-of-the-art performance on the P…
Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma +4 more
This paper introduces interpretability-guided, training-free interventions that systematically improve the accuracy and controllability of latent reasoning in LLMs by leveraging structural and causal…
The paper introduces Contrastive Reflection (CORE), a novel non-parametric method that rapidly improves language model reasoning by distilling contrasts between successful and unsuccessful problem att…
Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury +1 more
The paper introduces LFTutor, an intelligent tutoring system leveraging LLMs and Socratic questioning to teach laypeople about logical fallacies, demonstrating its effectiveness in fostering critical…
Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue +7 more
The paper addresses the 'detection-to-abstention gap' in reasoning models, where detecting insufficient information does not lead to abstention, by proposing a novel control framework that forces mode…
The paper introduces Rationalize, a role-pair framework that facilitates shared semantic reasoning between humans and AI models to achieve deep alignment of intent and action.
The paper introduces STRIATUM-CTF, a modular agentic framework that uses a standardized context protocol to enable LLMs to perform multi-step, stateful reasoning for general-purpose CTF solving, achie…
The paper proposes using question-asking as an inference-time intervention to probe a language model's hidden state, finding that the self-diagnosis process provides a predictive signal for final corr…
BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.
The paper compares verbalized feature attributions and self-generated rationales for explaining model behavior, finding that the format and granularity of the explanation significantly affect its abil…
LongTraceRL addresses long-context reasoning challenges by generating highly challenging training data and introducing a fine-grained rubric reward, significantly improving evidence-grounded reasoning…
The paper identifies a failure mode called unfaithful capitulation (UC), where reasoning models maintain a correct internal thought process (chain-of-thought) but output an incorrect final answer when…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao +1 more
COLLEAGUE.SKILL introduces an automated system that distills heterogeneous traces of human expertise and role-specific knowledge into portable, inspectable, and usable AI skill packages.
Zheng Yuan, Chuang Zhou, Linhao Luo, Siyu An +3 more
MoG proposes a novel Mixture of Experts framework for graph-based RAG, which uses hub graphs to guide the sparse activation of domain-specific expert graphs, significantly improving retrieval accuracy…
The paper proposes a novel, efficient method for checking the factuality of claims generated by LLMs by framing it as a true/false reading comprehension task and incorporating explicit test-taking str…
The paper introduces Oracle Poisoning, an attack that corrupts knowledge graphs used by AI agents, demonstrating that all tested models blindly trust poisoned data at high sophistication levels.
AutoVerifier is an LLM-based agentic framework that automates the end-to-end verification of complex technical claims, enabling non-experts to generate evidence-backed intelligence assessments.
Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal +4 more
CA-BED is a novel framework that improves LLM performance in interactive question-answering by integrating Bayesian Experimental Design to strategically select questions that maximize information gain…