Papers similar to 2606.12392

~ similar to 2606.12392· 19 results

cs.CLcs.AIRecentMay 27, 2026

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…

View →

cs.CLRecentMay 31, 2026

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more

The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…

View →

cs.CLRecentJun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

View →

cs.CLcs.AIcs.LGRecentMay 30, 2026

Short-form Text Rewriting with Phi Silica

Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou +1 more

This paper demonstrates that targeted adaptation of the small language model Phi Silica, using dataset curation and fine-tuning, significantly improves its performance in short-form text rewriting, na…

View →

cs.CVRecentJun 1, 2026

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin +5 more

The paper introduces InsightVQA, a large-scale benchmark dataset designed for hierarchical visual question answering that assesses complex emotion understanding and cognitive reasoning beyond simple e…

View →

cs.CLcs.AIeess.ASRecentMay 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

View →

cs.CLcs.AIRecentMay 27, 2026

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

Volodymyr Ovcharov

The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…

View →

cs.CLRecentMay 29, 2026

Unlocking Fine-Grained Translation Quality Estimation in LRMs through Synergistically Evolving Implicit and Explicit Reasoning

Renfei Dang, Xinye Wang, Zhejian Lai, Weilu Xu +4 more

The paper proposes RIEQE, a two-stage training framework that synergistically co-evolves implicit and explicit reasoning capabilities in Large Reasoning Models (LRMs) to significantly improve fine-gra…

View →

cs.CLRecentMay 29, 2026

Model-Based Quality Assessment for Massively Multilingual Parallel Data

Abdelaziz M. A. Ibrahim, Zihao Li, Jörg Tiedemann, Shaoxiong Ji

The paper proposes decomposing the assessment of massive multilingual parallel data into separate parallelism and quality estimation components, concluding that no single universal metric is reliable…

View →

cs.CLcs.HCRecentMay 29, 2026

Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows

Yuri Balashov, Rex VanHorn, Mingxi Xu, Austin Downes

The paper benchmarks local, offline LLMs for confidential translation workflows, demonstrating that while they are viable for privacy-sensitive use, they generally lag behind top commercial NMT system…

View →

cs.CLcs.CRRecentMay 9, 2026

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Jialing Gan, Junhao Dong, Songze Li

The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…

View →

cs.AIcs.CLRecentMay 28, 2026

Demystifying Data Organization for Enhanced LLM Training

Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more

This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…

View →

cs.CLRecentMay 29, 2026

Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task

Bart Evelo, Meaghan Fowlie, Denis Paperno

The paper investigates compositional abilities in LLMs and humans using the Personal Relation Task, finding that LLMs excel at the structured (Intensional) task while humans are better at the real-wor…

View →

cs.CLRecentMay 28, 2026

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

Zijie Wang, Eduardo Blanco

The paper introduces a novel, training-free method to automatically generate fine-grained evaluation rubrics for LLM-as-a-Judge, and further proposes an iterative fine-tuning strategy that significant…

View →

cs.CLRecentMay 31, 2026

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

Máté Metzger, Nadnapang Phophichit, Hansa Dhammahaso

The paper proposes an advanced auditing framework for classical-to-modern LLM translations, demonstrating that embedding drift signals potential error severity rather than error itself, and identifyin…

View →

cs.CLcs.AIRecentMay 30, 2026

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Shefayat E Shams Adib, Ahmed Alfey Sani, Md Hasibur Rahman Alif, Ajwad Abrar

The paper introduces LinguIUTics, a system that significantly improves the classification of rare psychological defense mechanisms in conversational text by fine-tuning Qwen3-8B using specialized imba…

View →

cs.LGcs.CLRecentMay 28, 2026

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more

The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.

View →

cs.CLcs.AIRecentMay 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

View →

cs.CLcs.AIRecentMay 27, 2026

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more

The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…

View →