Papers similar to 2605.28740

Similar to 2605.28740 · 20 results

cs.AI · Recent · May 27, 2026

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

The paper proposes Shapley-based input uncertainty quantification (ShaQ), a novel framework that uses Shapley values to precisely attribute input-induced uncertainty to specific spans of text, providi…
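
ShaQ attributes input-induced uncertainty to text spans with Shapley values. To illustrate only the attribution step, the sketch below computes exact Shapley values for a handful of spans. The `toy_uncertainty` value function is hypothetical: it stands in for whatever uncertainty score one would get from the model when only a subset of spans is kept. Exact enumeration is exponential in the number of spans, so this is a textbook sketch, not the paper's estimator.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List

def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for players 0..n-1 under the set function `value`.

    Enumerates every coalition, so cost grows as 2**n; fine for a few spans only.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal change in uncertainty when span i joins coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical value function: uncertainty observed when only the spans in `kept`
# remain in the prompt. Spans 1 and 2 interact (extra +0.3 only when both are kept).
BASE = {0: 0.1, 1: 0.7, 2: 0.2}

def toy_uncertainty(kept: FrozenSet[int]) -> float:
    bonus = 0.3 if {1, 2} <= kept else 0.0
    return sum(BASE[j] for j in kept) + bonus

if __name__ == "__main__":
    phi = shapley_values(3, toy_uncertainty)
    print([round(p, 6) for p in phi])  # -> [0.1, 0.85, 0.35]
```

The interaction bonus is split evenly between spans 1 and 2, and the attributions sum to the full-prompt score minus the empty-prompt score (1.3), which is the efficiency property of Shapley values.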

cs.CL · cs.AI · cs.LG · Recent · Jun 1, 2026

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

This paper proposes a method to improve error prediction for LLMs by explicitly disentangling input ambiguity from standard Uncertainty Quantification signals, showing that ambiguity information signi…

cs.CL · cs.AI · Recent · May 28, 2026

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal +1 more

The paper evaluates the semantic stability of clinical LLMs to linguistic variations, finding that domain specialization does not guarantee consistent robustness improvements.

cs.CL · Recent · Jun 1, 2026

On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke +1 more

The paper proposes 'Uncertainty,' a multiscale uncertainty estimator that focuses on low-probability tokens to improve the detection of AI-generated text by addressing boilerplate dominance and score…

cs.CL · cs.AI · cs.LG · Recent · May 27, 2026

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

The paper introduces functional entropy, a code-specific uncertainty quantification method, which successfully predicts functional correctness in LLM-generated code by replacing natural language seman…
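
One way to read "functional" here, by analogy with semantic entropy, is to group sampled programs by behavior instead of by natural-language meaning. The sketch below assumes exactly that: candidates are clustered by their outputs (or exception types) on a shared list of probe inputs, and the Shannon entropy of the cluster frequencies is reported. The probe inputs, the clustering rule, and the helper names (`behaviour_signature`, `functional_cluster_entropy`) are hypothetical, not the paper's definition.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

def behaviour_signature(fn: Callable[[int], object], probes: Sequence[int]) -> Tuple[str, ...]:
    """Run one candidate on every probe input, recording its result or exception type."""
    sig = []
    for x in probes:
        try:
            sig.append(repr(fn(x)))
        except Exception as exc:  # crashing is treated as a behavior of its own
            sig.append(f"error:{type(exc).__name__}")
    return tuple(sig)

def functional_cluster_entropy(candidates: List[Callable[[int], object]],
                               probes: Sequence[int]) -> float:
    """Shannon entropy (nats) of sampled candidates over behavior clusters."""
    counts = Counter(behaviour_signature(fn, probes) for fn in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical samples for "square an integer": three agree, one is buggy.
samples = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: abs(x) * abs(x),
    lambda x: x * 2,  # differs from the others on most inputs
]
probes = [-3, 0, 1, 2, 5]

if __name__ == "__main__":
    print(round(functional_cluster_entropy(samples, probes), 4))  # -> 0.5623
```

Lower entropy means the samples mostly behave alike, the kind of signal the paper reports as predictive of correctness; agreement alone cannot rule out programs that are consistently wrong.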

cs.CL · Recent · Jun 1, 2026

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde +20 more

The paper introduces a supervised fine-tuning pipeline using large language models to accurately categorize sentence-level clinical provenance across multidisciplinary hospital notes, demonstrating t…

cs.CL · Recent · May 31, 2026

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Md Motaleb Hossen Manik, Ge Wang

HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…

cs.CL · cs.AI · Recent · Jun 1, 2026

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

Christian Autenried, Cosimo Persia

This paper introduces KliniskVestBERT, a suite of BERT models specialized by pre-training on a large, diverse corpus of real-world Norwegian clinical texts, demonstrating superior performance for clin…

cs.CL · cs.AI · Recent · May 29, 2026

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward +1 more

This paper investigates whether large language models exhibit uncertainty signals similar to human judgment, examining both overt behavior and internal activation patterns to assess alignment and cali…

cs.IR · cs.CL · Recent · May 29, 2026

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Michael R. DeMarco

The paper introduces Factual Density (FD*), a novel retrieval signal that measures the proportion of verified facts, demonstrating that optimizing RAG retrieval based on this density significantly imp…
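
The summary describes Factual Density as the proportion of verified facts, used as a retrieval signal. A minimal sketch, assuming claims have already been extracted from each passage and that some verifier decides which are verified; here the verifier is a hypothetical lookup in `KNOWN_FACTS`, and `Passage`, `factual_density`, and `rerank_by_density` are illustrative names. The paper's FD* may weight or normalize differently.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Passage:
    source: str
    claims: List[str] = field(default_factory=list)  # atomic claims, extracted upstream

def factual_density(p: Passage, is_verified: Callable[[str], bool]) -> float:
    """Fraction of a passage's claims accepted by the verifier (0.0 when it has none)."""
    if not p.claims:
        return 0.0
    return sum(is_verified(c) for c in p.claims) / len(p.claims)

def rerank_by_density(passages: List[Passage],
                      is_verified: Callable[[str], bool], k: int) -> List[Passage]:
    """Keep the k passages with the highest factual density (ties keep input order)."""
    return sorted(passages, key=lambda p: factual_density(p, is_verified), reverse=True)[:k]

# Hypothetical verifier: membership in a small trusted set of claims.
KNOWN_FACTS = {"fact-a", "fact-b", "fact-c"}

def is_verified(claim: str) -> bool:
    return claim in KNOWN_FACTS

passages = [
    Passage("A", ["fact-a", "rumour-x"]),          # density 0.5
    Passage("B", ["fact-a", "fact-b", "fact-c"]),  # density 1.0
    Passage("C", []),                              # density 0.0
]

if __name__ == "__main__":
    print([p.source for p in rerank_by_density(passages, is_verified, k=2)])  # -> ['B', 'A']
```

Scoring a claim-free passage as 0.0 is a design choice in this sketch; excluding such passages from ranking would be an equally reasonable alternative.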

cs.AI · Recent · May 27, 2026

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more

The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…

cs.HC · cs.AI · cs.CL · Recent · May 28, 2026

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar +3 more

The paper introduces LLUMI, an open-source framework that improves LLM writing assistance for mental health support using community feedback, demonstrating comparable performance to proprietary models…

cs.AI · Recent · Jun 1, 2026

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang +2 more

This paper investigates whether model compression techniques (such as quantization and pruning) preserve a large language model's ability to quantify its own uncertainty, finding that accuracy-only evalu…
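
The benchmark evaluates uncertainty through conformal prediction. As background, here is a minimal split-conformal sketch for multiple-choice answers, assuming per-option probabilities are available from a model and using the common nonconformity score 1 - p(correct option). Running it once with an original model's probabilities and once with a compressed model's, then comparing coverage and average set size, is one plausible setup, not the paper's protocol. The Dirichlet data and helper names are purely illustrative, and `np.quantile(..., method=...)` requires NumPy 1.22 or newer.

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold on the nonconformity score 1 - p(true option)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level ceil((n + 1)(1 - alpha)) / n, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, level, method="higher"))

def prediction_sets(probs: np.ndarray, qhat: float) -> np.ndarray:
    """Boolean mask: option k is in the set when 1 - p_k <= qhat."""
    return (1.0 - probs) <= qhat

def coverage_and_size(sets: np.ndarray, labels: np.ndarray) -> tuple:
    """Empirical coverage of the true option and mean prediction-set size."""
    covered = sets[np.arange(len(labels)), labels]
    return float(covered.mean()), float(sets.sum(axis=1).mean())

def synthetic_split(rng: np.random.Generator, n: int, k: int = 4):
    """Illustrative data: Dirichlet option probabilities, labels drawn from them."""
    probs = rng.dirichlet(np.ones(k), size=n)
    labels = np.array([rng.choice(k, p=p) for p in probs])
    return probs, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cal_probs, cal_labels = synthetic_split(rng, 1000)
    test_probs, test_labels = synthetic_split(rng, 1000)
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
    cov, size = coverage_and_size(prediction_sets(test_probs, qhat), test_labels)
    print(f"coverage={cov:.3f}, mean set size={size:.2f}")
```

Because the marginal coverage guarantee holds for any model given exchangeable data, the more telling quantity when comparing models is usually the average set size: larger sets at the same coverage indicate less informative uncertainty.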

cs.AI · cs.CL · cs.CY · Recent · May 27, 2026

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more

The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…

cs.CR · Recent · Apr 21, 2026

Sensitivity Uncertainty Alignment in Large Language Models

Prakul Sunil Hiremath, Harshit R. Hiremath

The paper proposes Sensitivity-Uncertainty Alignment (SUA), a framework that measures the misalignment between a model's prediction instability and its stated uncertainty to improve model reliability.

cs.CL · cs.AI · cs.IR · Recent · May 27, 2026

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Yubo Li, Rema Padman, Ramayya Krishnan

This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…

cs.LG · cs.CR · Recent · Apr 29, 2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

Guillermo Iglesias, Gema Bello-Orgaz, María Navas-Loro, Cristian Ramirez-Atencia +2 more

This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…

cs.AI · Recent · May 27, 2026

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li +7 more

C-MIG is a novel retrieval-augmented generation framework that uses multi-view information gain to improve clinical diagnosis reasoning by providing richer, more nuanced reward signals than existing m…

cs.AI · Recent · May 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

cs.CR · cs.AI · cs.CL · Recent · Apr 23, 2026

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

Michele Miranda, Xinlan Yan, Nishant Mishra, Rachel Murphy +3 more

This paper conducts the first comparative study of Differential Privacy (DP), Named Entity Recognition (NER), and Large Language Models (LLMs) for de-identifying Dutch clinical notes, finding that com…
