Similar to 2605.30639 · 20 results
Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi +6 more
The paper introduces OmniVerifier-M1, a multimodal meta-verifier that uses symbolic outputs and decoupled reinforcement learning to provide robust, fine-grained verification and error localization for…
Doguhuan Yeke, Yanming Zhou, Leo Y. Lin, Hongyu Cai +2 more
The paper introduces RoboJailBench, the first standardized evaluation framework for assessing adversarial jailbreak attacks and defenses in embodied AI systems like robots.
Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more
The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…
Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu +2 more
The paper introduces MOV-Bench, a challenging benchmark for multi-hop audio-visual reasoning, and proposes AOP-Agent, an agentic framework that significantly improves open-source Omni-LLMs' ability to…
Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou +4 more
The paper introduces ERGeoBench, a comprehensive diagnostic benchmark designed to evaluate the fine-grained capabilities of multimodal large language models (MLLMs) for embodied geo-localization acros…
Pramana introduces a standardized, protocol-level wire format for autonomous agent outputs, ensuring that every consequential claim is accompanied by a verifiable artifact that can be re-executed by a…
Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong +3 more
The paper introduces VeriTrip, a new verifiable benchmark that evaluates travel planning agents' ability to perform evidence-grounded reasoning over complex, unstructured, and multimodal web data, rev…
The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on th…
Yuefeng Peng, Mingzhe Li, Kejing Xia, Renhao Zhang +1 more
This paper presents the first systematic study of membership inference attacks (MIAs) against Vision-Language-Action (VLA) models, demonstrating that these models are highly vulnerable to privacy brea…
The paper proposes evaluating certified training methods by comparing their Pareto fronts across the natural-certified accuracy trade-off, revealing superior performance and previously unappreciated c…
Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho +1 more
The paper proposes Faithful Agentic XAI (FAX), a verification framework that explicitly checks LLM-generated explanations against model behavior, significantly improving explanation faithfulness on a…
The paper proposes an agentic pipeline for spatial reasoning by introducing a dynamic cognitive map and Spatial Assertion Codes (SAC), achieving state-of-the-art performance on complex spatial tasks.
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su +7 more
This paper proposes SpatialClaw, a training-free framework that enables open-ended, complex 3D/4D spatial reasoning.
Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more
The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar +2 more
The paper introduces RoboTrustBench, a comprehensive benchmark that evaluates the trustworthiness of video world models for robotic manipulation across challenging scenarios, finding that current mode…
Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang +6 more
The paper introduces MiraBench, a new benchmark that evaluates the action-conditioned reliability of robotic world models, finding that visual fidelity is insufficient and that optimism bias is a perv…
Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more
The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.