Papers similar to 2605.29712

Similar to 2605.29712 · 19 results

cs.AI · cs.LG · Recent · May 31, 2026

Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation

Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury +1 more

The paper introduces LFTutor, an intelligent tutoring system leveraging LLMs and Socratic questioning to teach laypeople about logical fallacies, demonstrating its effectiveness in fostering critical…

cs.AI · Recent · May 27, 2026

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant +2 more

The paper introduces Contrastive Reflection (CORE), a novel non-parametric method that rapidly improves language model reasoning by distilling contrasts between successful and unsuccessful problem att…

cs.AI · Recent · May 31, 2026

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

Yuxi Sun, Wenbo Shang, Wei Gao, Xin Huang +1 more

The paper introduces a diagnostic testbed, PAVE, to evaluate how LLMs arbitrate between their internal knowledge and retrieved evidence during fact-checking, revealing that this arbitration is unrelia…

cs.AI · Recent · May 27, 2026

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more

The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.

cs.CL · cs.AI · cs.LG · Recent · Jun 1, 2026

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Atoosa Chegini, Soheil Feizi

The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…

cs.IR · cs.CL · Recent · May 29, 2026

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Michael R. DeMarco

The paper introduces Factual Density (FD*), a novel retrieval signal that measures the proportion of verified facts, demonstrating that optimizing RAG retrieval based on this density significantly imp…

cs.AI · Recent · May 28, 2026

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Yundong Kim, Heyoung Yang

The paper introduces TRACE, a novel metric that evaluates the logical structure of LLM reasoning (CoT) by integrating Toulmin's argumentation theory, demonstrating that sound reasoning structure corre…

cs.AI · cs.CY · cs.HC · Recent · May 27, 2026

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…

stat.OT · cs.AI · Empirical · Recent · Jun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

cs.CL · cs.IR · Recent · Jun 3, 2026

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Zhenyu Yu, Shuigeng Zhou

This paper evaluates the causal reasoning abilities of large language models and finds that they rely heavily on lexical pattern matching rather than structural reasoning.

cs.AI · cs.CL · cs.LO · Recent · May 27, 2026

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Leizhen Zhang, Shuhan Chen, Sheng Chen

The paper evaluates LLM reasoning on Boolean satisfiability (SAT) problems, concluding that conventional metrics are misleading and proposing a paired-formula protocol with Accurate Differentiation Ra…

cs.AI · Recent · May 28, 2026

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

The paper introduces a diagnostic benchmark for selective Question Answering over conflicting, multi-source personal memory, demonstrating that specialized fusion resolvers outperform general LLMs, es…

cs.CL · Recent · May 31, 2026

Learning from Saturated Data: Signals Beyond Correctness for LLM Training

Hanno Hiss, Jasper Dekoninck, Martin Vechev

The paper proposes using fine-grained quality signals, such as pairwise self-judgments and token-level entropy, instead of simple binary correctness to improve LLM performance on saturated datasets, s…

cs.AI · Recent · May 29, 2026

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Liwei Kang, Yee Whye Teh, Wee Sun Lee

The paper introduces LinTree, a method that explicitly structures the search history of LLM reasoning traces using parent pointers, significantly improving task performance and search efficiency compa…

cs.CL · Recent · May 29, 2026

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić +1 more

The paper introduces a multilingual corpus and demonstrates that small, fine-tuned language models (SLMs) are highly effective for Citation Needed Detection (CND) in lower-resource languages, often ou…

cs.AI · cs.CL · Recent · May 27, 2026

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…

cs.AI · cs.CR · cs.IR · Recent · Apr 3, 2026

AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li

AutoVerifier is an LLM-based agentic framework that automates the end-to-end verification of complex technical claims, enabling non-experts to generate evidence-backed intelligence assessments.

cs.CL · Recent · May 31, 2026

Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales

Pingjun Hong, Benjamin Roth

The paper compares verbalized feature attributions and self-generated rationales for explaining model behavior, finding that the format and granularity of the explanation significantly affect its abil…
