~ similar to 2606.01736 · 19 results
Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao +3 more
The paper decomposes LLM debate convergence into three mechanisms (instability, conformity, persuasion) and finds that much observed convergence is harmful social compliance rather than genuine reason…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
The study found that constraining LLM access, rather than banning it, can preserve students' sense of authorship and encourage more strategic writing behaviors while still providing scaffolding benefi…
Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury +1 more
The paper introduces LFTutor, an intelligent tutoring system leveraging LLMs and Socratic questioning to teach laypeople about logical fallacies, demonstrating its effectiveness in fostering critical…
This paper simulates the Argumentative Theory of Reasoning (ATR) using multi-agent debate among LLMs, demonstrating that collective adversarial discourse significantly enhances truth-seeking performan…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
The study finds that in multi-agent systems, peer agreement makes LLMs more susceptible to adopting misleading answers than to correcting genuinely wrong ones, suggesting a need for verification over…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
The paper introduces PRAIB, a benchmark that demonstrates that LLM-generated peer reviews, while often verbose, systematically diverge from human norms by being less variable, positively biased, and f…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
Wanying Ren, Xin Song, Futing Wang, Guoxiu He +1 more
The paper theoretically analyzes the limitations of parameter-based knowledge editing and empirically demonstrates that these methods consistently damage core LLM capabilities compared to retrieval-ba…
The paper proposes a comprehensive benchmark to systematically audit how varying persona prompts and model choices affect the technical quality and social representativeness of scholar recommendations…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
The paper evaluates LLM-generated reactions to Spanish online news, finding that off-the-shelf models fail to accurately reproduce the measurable properties of real audience discourse, and even fine-t…
The paper introduces a novel, per-token feature derived from how sampling temperature reshapes the token distribution, demonstrating it is a significantly stronger predictor of LLM creativity than sta…
This study demonstrates that the tone of a prompt significantly affects the accuracy of various LLMs, so users should not assume that model reliability is robust to tone.
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…