Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Turning Video Models into Generalist Robot Policies

This paper explores decoupling video generative models from embodiment-specific…

Linear Scaling Video VLMs for Long Video Understanding

This paper introduces StateKV, an inference-time method that adapts pretrained v…

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Chal…

This paper addresses the challenge of understanding long-form egocentric videos…

Towards Consistent Video Geometry Estimation

ViGeo is a feed-forward foundation model designed to estimate spatially dense an…

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Archon is a unified multimodal model designed for holistic digital human generat…

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

This research addresses the memory and error accumulation issues in autoregressi…

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

minWM is a full-stack open-source framework designed to transform existing bidir…

EarlyTom: Early Token Compression Completes Fast Video Understanding

Video-LLMs are hindered by inefficiency in processing visual tokens. EarlyTom pr…