Papers similar to 2605.09041v1

~ similar to 2605.09041v1· 20 results

cs.AIcs.CYcs.HCRecentMay 27, 2026

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…

View →

cs.CRcs.AIcs.LGRecentMar 23, 2026

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber

This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…

View →

cs.AIRecentMay 27, 2026

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more

The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…

View →

cs.CLcs.AIRecentMay 31, 2026

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat +1 more

The paper introduces IndoBias, a dual-track, culturally-grounded benchmark to evaluate biases in LLMs across Indonesian and three local languages, revealing significant differences in bias patterns ac…

View →

cs.CLRecentMay 28, 2026

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…

View →

cs.CLcs.AIRecentMay 29, 2026

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…

View →

cs.CLcs.CVRecentJun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

View →

cs.AIcs.CLRecentMay 27, 2026

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…

View →

cs.CLRecentMay 29, 2026

Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit

Jiwoo Choi, Seonwoo Ahn, Tongxin Zhang, Seohyon Jung

The paper audits six LLMs across four languages, finding that their gender stereotyping is significantly wider than human baselines and that cross-lingual translation fundamentally alters the nature o…

View →

cs.CLRecentJun 1, 2026

Not What, But How: A Communicative Audit of LLM Response Framing

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more

The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…

View →

cs.CLcs.AIRecentMay 27, 2026

DEPART: DEcomposing PARiTy across Multilingual LLMs

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad +2 more

The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.

View →

cs.AIcs.LGRecentMay 28, 2026

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more

Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.

View →

cs.CLcs.AIRecentJun 2, 2026

Quantifying Faithful Confidence Expression in Large Reasoning Models

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

The paper introduces a novel framework to quantify faithful confidence expression (FC) in Large Reasoning Models (LRMs), finding that FC remains a significant and challenging reliability target for th…

View →

cs.CLcs.AIRecentMay 27, 2026

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more

This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…

View →

cs.LGcs.CLRecentMay 30, 2026

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna

The paper introduces Multi-Response Training (MRT) to combat the 'mode lottery' problem in language model fine-tuning, showing that retaining multiple valid responses significantly improves distributi…

View →

cs.CRcs.AIRecentMay 8, 2026

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more

The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.

View →

cs.AIRecentMay 29, 2026

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris +4 more

PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of re…

View →

cs.AIcs.CLRecentMay 28, 2026

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński +1 more

The paper introduces PRAIB, a benchmark that demonstrates that LLM-generated peer reviews, while often verbose, systematically diverge from human norms by being less variable, positively biased, and f…

View →

cs.AIcs.CLcs.CRRecentApr 27, 2026

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

Hikmat Karimov, Rahid Zahid Alekberli

The paper proposes a novel information-geometric framework to analyze LLM stability by integrating task utility, external entropy, and internal structural proxies, showing this composite score improve…

View →

cs.AIRecentMay 28, 2026

Harnessing non-adversarial robustness in large language models

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more

The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.

View →