Rui Song
3 indexed papers
Research Timeline
The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on these complex, multi-modal tasks.
The paper introduces HLL, a benchmark that tests whether multimodal agents can substitute for human verification (such as CAPTCHA) in complex, real-world workflows, and finds that current agents remain brittle and fail under realistic conditions.
The paper argues that current embodied planning benchmarks prioritize superficial language prediction over true physical reasoning, introducing new benchmarks and a large-scale dataset to demonstrate that physically grounded causal reasoning is necessary for reliable autonomous agents.
Papers
HLL: Can Agents Cross Humanity's Last Line of Verification?
Xinhao Song, Su Su, Sirui Song, Hongliang Wu +5 more