~ similar to 2606.01995· 19 results
Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao +4 more
The paper introduces CultureForest, a new benchmark for evaluating Cultural Norm Grounded Reasoning in LLMs, demonstrating that models struggle to apply their cultural knowledge effectively in realist…
Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana +1 more
The paper demonstrates that using English prompts causes large language models to prioritize globally dominant narratives over local cultural knowledge, even when local evidence is provided.
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…
The paper introduces MIDI, a novel multilingual dataset that embeds idioms in realistic sentence and conversational contexts across diverse resource levels, revealing that idiom comprehension is signi…
The paper introduces a new quantitative metric, Contextual Alternative Choice (CAC), to rigorously test language models' syntactic and functional understanding of determiners, showing that current mod…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
The paper develops a theoretically grounded framework for evaluating multilingual LLMs in Social Sciences and Humanities, moving beyond traditional NLP benchmarks to assess interpretive validity and c…
Weiyi Chen, Shuaixiong Wang, Ziyun Gao, Kaichun Hu +4 more
The paper introduces TravelEval, a comprehensive, six-dimensional benchmarking framework that evaluates LLM-powered travel plans using realistic spatio-temporal simulation, revealing that current LLMs…
Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian +4 more
The paper proposes a dual-interventional framework to characterize how linguistic structures and contextual cues influence LLMs' spatial reasoning for navigation, finding that topological information…
The paper proposes an unsupervised Reinforcement Learning approach that enforces cross-lingual self-consistency to significantly enhance the multilingual reasoning capabilities of large language model…
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
The paper identifies specific attention heads in LLMs responsible for 'cultural binding'—associating cultural items with appropriate identities—and demonstrates that this capability is pre-trained and…
The paper audits six LLMs across four languages, finding that their gender stereotyping is significantly wider than human baselines and that cross-lingual translation fundamentally alters the nature o…