~ similar to 2605.28710· 20 results
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
The paper proposes an aggressive, parameter-efficient method to prune non-essential experts from Mixture-of-Experts (MoE) LLMs, significantly compressing the model while maintaining high machine trans…
Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more
This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…
The paper benchmarks local, offline LLMs for confidential translation workflows, demonstrating that while they are viable for privacy-sensitive use, they generally lag behind top commercial NMT system…
The paper proposes decomposing the assessment of massive multilingual parallel data into separate parallelism and quality estimation components, concluding that no single universal metric is reliable…
This paper proposes a domain-specialized large language model, PoetryQwen, for precise translation and emotional understanding of classical poetry.
The paper introduces a novel, training-free method to automatically generate fine-grained evaluation rubrics for LLM-as-a-Judge, and further proposes an iterative fine-tuning strategy that significant…
The paper introduces a multilingual corpus and demonstrates that small, fine-tuned language models (SLMs) are highly effective for Citation Needed Detection (CND) in lower-resource languages, often ou…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
This paper analyzes the multilinguality of LLMs by examining their structural properties, finding that low-resource languages are structurally more distinct from English than high-resource languages,…
This paper introduces robustness indicators to systematically analyze how multilingual text embedding model rankings change based on dataset composition and aggregation methods, revealing that only a…
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
The paper evaluates the inconsistency of using LLMs as automated judges for multi-dimensional safety evaluations, finding that LLMs are unreliable for nuanced safety issues like financial advice but m…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu +8 more
MIRA proposes a novel source-aware filtering framework that discovers and anchors evaluation rubrics during data selection, significantly improving code-oriented mid-training data quality while reduci…
The paper proposes using fine-grained quality signals, such as pairwise self-judgments and token-level entropy, instead of simple binary correctness to improve LLM performance on saturated datasets, s…