ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2606.01879 · 19 results

cs.CL · Recent · May 28, 2026

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana +1 more

The paper demonstrates that using English prompts causes large language models to prioritize globally dominant narratives over local cultural knowledge, even when local evidence is provided.

cs.CL · Recent · Jun 1, 2026

CARTE: A Benchmark for Mapping Language Model Knowledge Across France

Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang +1 more

The paper introduces CARTE, a new benchmark designed to test how well large language models understand fine-grained, regionally differentiated knowledge across the 13 metropolitan regions of France, r…

cs.AI, cs.CL, cs.LG · Recent · May 27, 2026

Cultural Binding Heads in Language Models

Avrile Floro, Luca Benedetto

The paper identifies specific attention heads in LLMs responsible for 'cultural binding'—associating cultural items with appropriate identities—and demonstrates that this capability is pre-trained and…

cs.CL · Recent · May 30, 2026

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

Wajdi Zaghouani

The paper develops a theoretically grounded framework for evaluating multilingual LLMs in Social Sciences and Humanities, moving beyond traditional NLP benchmarks to assess interpretive validity and c…

cs.CL, cs.AI · Recent · May 31, 2026

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat +1 more

The paper introduces IndoBias, a dual-track, culturally-grounded benchmark to evaluate biases in LLMs across Indonesian and three local languages, revealing significant differences in bias patterns ac…

cs.AI · Recent · May 27, 2026

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more

The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…

cs.AI · Recent · May 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

cs.AI, cs.CL, cs.LG · Recent · May 31, 2026

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

This paper investigates the production-evaluation gap in Large Reasoning Models (LRMs), finding that while LRMs excel at generating solutions, they struggle significantly to evaluate flawed reasoning,…

cs.CL · Recent · May 31, 2026

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more

The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…

cs.IR, cs.AI, cs.CY · Recent · May 27, 2026

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

The paper proposes a comprehensive benchmark to systematically audit how varying persona prompts and model choices affect the technical quality and social representativeness of scholar recommendations…

cs.CL, cs.AI · Recent · May 29, 2026

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…

cs.AI · Recent · May 31, 2026

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk +41 more

This survey provides a comprehensive analysis of Reasoning Language Model (RLM) adoption across 28 scientific disciplines, revealing significant disparities in RLM maturity across different scientific…

cs.CL, cs.AI · Recent · May 28, 2026

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Ruoxi Su, Yuhan Liu, Jingyu Hu

The paper introduces an adaptive interview framework to gather rich persona context, demonstrating that LLMs improve decision alignment in moral dilemmas only when they selectively ground their decisi…

cs.CL · Recent · Jun 1, 2026

Not What, But How: A Communicative Audit of LLM Response Framing

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more

The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…

cs.AI · Recent · May 28, 2026

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Yundong Kim, Heyoung Yang

The paper introduces TRACE, a novel metric that evaluates the logical structure of LLM reasoning (CoT) by integrating Toulmin's argumentation theory, demonstrating that sound reasoning structure corre…

cs.CL · Recent · May 28, 2026

Can LLM Teams Play What? Where? When?

Anastasia Kotelnikova, Viktor Byzov, Maria Dolzhenkova, Evgeny Kotelnikov

This paper investigates if team-based interaction improves LLM performance on complex reasoning tasks (ChGK), finding that structured team strategies significantly boost accuracy by acting as error-fi…

cs.AI · Recent · Jun 1, 2026

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma +1 more

BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.

cs.CV, cs.AI · Recent · May 29, 2026

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more

The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…

cs.CL, cs.AI · Recent · May 27, 2026

DEPART: DEcomposing PARiTy across Multilingual LLMs

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad +2 more

The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
