Turning Video Models into Generalist Robot Policies

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

The paper introduces RoboTrustBench, a comprehensive benchmark that evaluates th…

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

The paper introduces PaSBench-Video, a comprehensive streaming video benchmark d…

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

SmartDirector is a novel framework that significantly improves cinematic video g…

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that rep…

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism introduces a novel heterogeneous Mixture-of-Experts framework that spec…

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

This paper presents GRAIL, a digital generation pipeline that synthesizes human-…

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

The paper introduces VIP-Net, a framework that leverages multi-modal spatio-temp…

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

The paper proposes VISTA, a multi-level event semantics mining framework, to acc…