~ similar to 2603.22214v1· 20 results
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
The paper evaluates the inconsistency of using LLMs as automated judges for multi-dimensional safety evaluations, finding that LLMs are unreliable for nuanced safety issues like financial advice but m…
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu +5 more
This paper provides the first comprehensive Systematization of Knowledge (SoK) on the security aspects of LLM-as-a-Judge (LaaJ) systems, identifying key vulnerabilities and proposing a taxonomy for fu…
This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…
Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang +5 more
The paper introduces Prosecution Decision Prediction (PDP), a new legal AI task that assesses prosecutorial review decisions, showing that current state-of-the-art LLMs perform significantly worse on…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee +2 more
The paper addresses Perceptual Judgment Bias in multimodal LLM judges by introducing a new dataset and a unified training framework that forces models to prioritize visual evidence over plausible text…
The paper introduces a novel framework to quantify faithful confidence expression (FC) in Large Reasoning Models (LRMs), finding that FC remains a significant and challenging reliability target for th…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
The paper introduces a 'replication-first' paradigm for LLM behavioral benchmarking, demonstrating that this rigorous approach uncovers significant, non-obvious performance drops between successive mo…
The paper introduces PRAIB, a benchmark that demonstrates that LLM-generated peer reviews, while often verbose, systematically diverge from human norms by being less variable, positively biased, and f…
This paper shows that large language models can automate reproducibility assessments in the social and behavioral sciences.
Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang +2 more
The paper introduces CyberJurors, a multi-agent framework and the VerdictBench benchmark to simulate and solve complex e-commerce dispute verdicts by modeling the reasoning and consensus process of cr…
The paper empirically evaluates domain-adapted and general-purpose LLMs for structured threat modelling (STRIDE on 5G security), finding that domain adaptation and model size do not guarantee reliable…