~ similar to 2606.01637· 20 results
Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao +3 more
The paper decomposes LLM debate convergence into three mechanisms (instability, conformity, persuasion) and finds that much observed convergence is harmful social compliance rather than genuine reason…
This paper investigates if team-based interaction improves LLM performance on complex reasoning tasks (ChGK), finding that structured team strategies significantly boost accuracy by acting as error-fi…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…
Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang +3 more
This paper introduces a failure-aware observability framework to diagnose wasted computation in multi-agent LLM systems by mapping recurring failure modes to online trace signals.
Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling +4 more
The paper proposes SafeReview, a co-evolutionary adversarial training framework that significantly improves the robustness of LLM-based peer review systems against sophisticated adversarial hidden pro…
The paper demonstrates that LLM performance in zero-shot annotation is significantly limited by the alignment between the model's internal understanding and the task definition, showing that prompt-ba…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
The paper demonstrates that the order and content of external information (the 'feed') an LLM agent consumes before making a decision can significantly and causally steer its final choice, often overr…
The paper demonstrates that the sequence and composition of external information (the 'feed') an LLM agent consumes can significantly and causally steer its final decisions, often overriding its defau…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang +1 more
The paper introduces CITETRACE, a large-scale dataset and evaluation framework that systematically measures structural citation failures in search-augmented LLMs, revealing a pattern called Verified M…
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…
The study found that human judgment of logical fallacies is significantly biased by source labels (e.g., human vs. AI), while LLM evaluations remained comparatively stable across these source conditio…
Chen He, Yuhao Wu, Lei Wang, Wenxuan Zhang +1 more
The paper identifies and demonstrates that post-conclusion continuation in answer-correct long-CoT traces is harmful during LLM fine-tuning, proposing a method to cut this continuation.
This study evaluates LLMs in conversational tutoring to identify high-confidence social biases, finding that state-of-the-art models are often overconfident in their incorrect assessments of stereotyp…
Jiatan Huang, Mingchen Li, Ziming Li, Sunjae Kwon +2 more
The paper proposes CAGE-CAL, a counterfactual graph calibration framework, to accurately assess the reliability and detect over-confidence in multi-agent LLM systems after agents communicate.