Papers similar to 2605.31351

~ similar to 2605.31351· 20 results

cs.CLRecentJun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

View →

cs.CVcs.AIRecentJun 1, 2026

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee +2 more

The paper addresses Perceptual Judgment Bias in multimodal LLM judges by introducing a new dataset and a unified training framework that forces models to prioritize visual evidence over plausible text…

View →

cs.AIRecentMay 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

View →

cs.CLcs.AIRecentMay 27, 2026

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak +5 more

The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…

View →

cs.CRcs.AIcs.LGRecentMar 23, 2026

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber

This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…

View →

cs.CRcs.AIRecentMar 31, 2026

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu +5 more

This paper provides the first comprehensive Systematization of Knowledge (SoK) on the security aspects of LLM-as-a-Judge (LaaJ) systems, identifying key vulnerabilities and proposing a taxonomy for fu…

View →

cs.CVcs.AIRecentMay 28, 2026

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai +8 more

VISUALTHINK-VLA introduces a visual intermediate-reasoning framework that guides action prediction using compact visual evidence, achieving high accuracy and significantly low latency for real-time Vi…

View →

cs.CVcs.AIcs.GRRecentMay 31, 2026

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more

The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.

View →

cs.CVRecentJun 1, 2026

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more

The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…

View →

cs.CVcs.AIRecentMay 30, 2026

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

Rashid Mushkani

The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…

View →

cs.AIRecentJun 1, 2026

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more

The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…

View →

cs.AIRecentMay 29, 2026

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris +4 more

PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of re…

View →

cs.CVRecentJun 1, 2026

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more

The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.

View →

cs.CVcs.AIRecentMay 29, 2026

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more

The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…

View →

cs.CVcs.AIcs.CLRecentMay 28, 2026

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton +2 more

This paper introduces a new evaluation framework, SpatialUncertain, demonstrating that current Vision-Language Models (VLMs) are prone to overconfident and incorrect answers to spatial questions when…

View →

cs.CRcs.AIRecentMar 17, 2026

Adversarial attacks against Modern Vision-Language Models

Alejandro Paredes La Torre

The paper evaluates the adversarial robustness of two open-source Vision-Language Models (LLaVA and Qwen2.5-VL) in a simulated e-commerce environment, finding that while LLaVA is vulnerable to gradien…

View →

cs.CVcs.AIcs.RORecentMay 28, 2026

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

Yuhang Jiang

The paper introduces PInVerify, an offline embodied benchmark for Active Instance Verification (AIV), a task requiring agents to actively select viewpoints to confirm if a candidate object matches a f…

View →

cs.AIRecentMay 27, 2026

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more

The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…

View →

cs.CLcs.AIRecentMay 27, 2026

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang +5 more

The paper introduces Prosecution Decision Prediction (PDP), a new legal AI task that assesses prosecutorial review decisions, showing that current state-of-the-art LLMs perform significantly worse on…

View →

cs.CLcs.AIRecentMay 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

View →