~ similar to 2606.01600· 18 results
Doguhuan Yeke, Yanming Zhou, Leo Y. Lin, Hongyu Cai +2 more
The paper introduces RoboJailBench, the first standardized evaluation framework for assessing adversarial jailbreak attacks and defenses in embodied AI systems like robots.
Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang +6 more
The paper introduces MiraBench, a new benchmark that evaluates the action-conditioned reliability of robotic world models, finding that visual fidelity is insufficient and that optimism bias is a perv…
Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more
The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…
The paper introduces TouchSafeBench, a physics-grounded benchmark, to evaluate collision grounding—the ability to predict robot-human collisions—and finds that current Vision-Language Models (VLMs) ar…
Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li +5 more
RoboDream introduces an embodiment-centric world model that synthesizes photorealistic, physically feasible robot demonstrations by decoupling motion generation from environment synthesis, significant…
Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen +4 more
The paper introduces RoboWits, a new bi-manual robotic benchmark designed to test a robot's cognitive reasoning and adaptability to unexpected challenges, revealing that current Vision-Language-Action…
Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang +7 more
The paper introduces WorldCoder-Bench, a comprehensive benchmark and evaluation protocol for testing LLMs' ability to autonomously generate complex, physically grounded, and interactive 3D web worlds.
Beichen Shao, Mengying Xie, Heng Su, Wanyi Zhang +4 more
GSAM introduces a generalizable and safe robotic framework for articulated object manipulation, significantly improving success rates and reducing variability across diverse tasks by integrating commo…
Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more
The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on th…
Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding +5 more
StressDream proposes a novel method to steer video world model imaginations toward high-impact, yet plausible outcomes, enabling robust policy evaluation and improvement by identifying undesirable fut…
The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…
This paper demonstrates that typographic attacks pose a significant, measurable, and physically consequential threat to household robot manipulation systems by causing the robot to grasp and transport…
The paper proposes an algorithmic method using conformal prediction to formally certify high-probability safety for Belief-Space Neural Safety Filters (BeliefSF), significantly improving safety guaran…
Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao +5 more
The paper introduces BilliardPhys-Bench, a new benchmark that demonstrates that current multimodal LLMs struggle with complex physical reasoning and predicting object dynamics in simulated environment…
Zhengxian Huang, Wenjun Zhu, Haoxuan Qiu, Xiaoyu Ji +1 more
This paper introduces TRAP, an adversarial attack that demonstrates how physical patches can hijack the Chain-of-Thought (CoT) reasoning process in Vision-Language-Action (VLA) models, forcing them to…
Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu +4 more
BORA is an offline-to-online RL framework that enhances dexterous VLA models for real-world robotics by using an action-conditioned critic and a lightweight residual adaptation mechanism to correct ex…