~ similar to 2605.30803· 20 results
The paper introduces a novel, training-free method to automatically generate fine-grained evaluation rubrics for LLM-as-a-Judge, and further proposes an iterative fine-tuning strategy that significant…
Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more
The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…
The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…
The paper introduces a 'replication-first' paradigm for LLM behavioral benchmarking, demonstrating that this rigorous approach uncovers significant, non-obvious performance drops between successive mo…
Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang +7 more
QUBRIC introduces a co-design framework that simultaneously optimizes queries and rubrics, overcoming the bottleneck of vague rubrics derived from open-ended questions, leading to significant gains in…
The paper proposes a rigorous, fixed-budget, cluster-aware standard for LLM-as-a-judge evaluation of multi-hop RAG systems, demonstrating that current evaluation methods often overstate performance.
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…
Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more
The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu +14 more
The paper introduces $ ext{RLR}^3$, a novel framework that extends verifiable rewards in Reinforcement Learning to handle partially verifiable, multi-criteria vision-language tasks by integrating robu…
Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee +2 more
The paper addresses Perceptual Judgment Bias in multimodal LLM judges by introducing a new dataset and a unified training framework that forces models to prioritize visual evidence over plausible text…
The paper proposes a TEE-based architecture that enables external, auditable verification of AI-assisted grant evaluations without exposing the proprietary model, scoring logic, or intermediate reason…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu +3 more
CRITIC-R1 introduces a structured critic framework that treats RAG critique as an explicit error diagnosis problem using reinforcement learning, significantly improving answer quality over strong RAG…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…
The paper introduces TRACE, a novel metric that evaluates the logical structure of LLM reasoning (CoT) by integrating Toulmin's argumentation theory, demonstrating that sound reasoning structure corre…
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more
The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…