~ similar to 2604.15718v1· 9 results
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…
Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang +3 more
The paper introduces VIP-Net, a framework that leverages multi-modal spatio-temporal cues and a new dataset (Temporal-VIP) to accurately identify the most influential people in videos, overcoming the…
DeepIPCv3 is a novel multi-modal framework that fuses LiDAR and DVS event streams using cross-modal attention to achieve state-of-the-art, highly reactive avoidance maneuvers for sudden pedestrian cro…
Echo is a joint-embedding predictive architecture that uses a single, pretrained ViT encoder to simultaneously perform speaker diarization, speech recognition, and dynamic source separation in a share…
V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.
CLANE presents an end-to-end continual action recognition system deployed on neuromorphic hardware (Intel Loihi 2) using event cameras, achieving high accuracy with massive reductions in energy and la…
Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He +2 more
The paper proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that effectively balances high-level semantic control and low-level identity fidelity for text-to-video generat…
This paper investigates if upper-face affective cues enhance audiovisual sentence recognition, especially when audio is degraded, finding that while mouth cues are crucial for robustness, upper-face c…