Similar to 2605.29027 · 17 results
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…
The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
This paper investigates whether team-based interaction improves LLM performance on complex reasoning tasks (ChGK), finding that structured team strategies significantly boost accuracy by acting as error-fi…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
The paper demonstrates that LLM performance in zero-shot annotation is significantly limited by the alignment between the model's internal understanding and the task definition, showing that prompt-ba…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
The paper introduces Multi-Response Training (MRT) to combat the 'mode lottery' problem in language model fine-tuning, showing that retaining multiple valid responses significantly improves distributi…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
The paper proposes using fine-grained quality signals, such as pairwise self-judgments and token-level entropy, instead of simple binary correctness to improve LLM performance on saturated datasets, s…
This paper systematically evaluates LLMs' ability to infer pragmatic meaning from non-verbal responses, finding that their accuracy drops significantly compared with verbal inputs.
The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…
The paper proposes using question-asking as an inference-time intervention to probe a language model's hidden state, finding that the self-diagnosis process provides a predictive signal for final corr…
The paper introduces a novel, per-token feature derived from how sampling temperature reshapes the token distribution, demonstrating it is a significantly stronger predictor of LLM creativity than sta…