Papers similar to 2604.15718v1

~ similar to 2604.15718v1· 9 results

cs.CVcs.AIRecentJun 1, 2026

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more

The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…

View →

cs.GRcs.AIcs.CVRecentMay 31, 2026

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…

View →

cs.CVcs.AIRecentMay 27, 2026

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang +3 more

The paper introduces VIP-Net, a framework that leverages multi-modal spatio-temporal cues and a new dataset (Temporal-VIP) to accurately identify the most influential people in videos, overcoming the…

View →

cs.ROcs.AIcs.CVRecentMay 31, 2026

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto +1 more

DeepIPCv3 is a novel multi-modal framework that fuses LiDAR and DVS event streams using cross-modal attention to achieve state-of-the-art, highly reactive avoidance maneuvers for sudden pedestrian cro…

View →

cs.SDcs.AIeess.ASRecentJun 1, 2026

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

Louis Mouchon

Echo is a joint-embedding predictive architecture that uses a single, pretrained ViT encoder to simultaneously perform speaker diarization, speech recognition, and dynamic source separation in a share…

View →

cs.CVcs.AIRecentMay 30, 2026

V-LynX: Token Interface Alignment for Video+X LLMs

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.

View →

cs.LGcs.AIcs.NERecentMay 27, 2026

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

Elvin Hajizada, Michael Neumeier, Edward Paxon Frady, Yulia Sandamirskaya +3 more

CLANE presents an end-to-end continual action recognition system deployed on neuromorphic hardware (Intel Loihi 2) using event cameras, achieving high accuracy with massive reductions in energy and la…

View →

cs.CVRecentJun 1, 2026

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He +2 more

The paper proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that effectively balances high-level semantic control and low-level identity fidelity for text-to-video generat…

View →

cs.SDcs.AIRecentMay 30, 2026

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

Zhou Yang, Yueyi Yang

This paper investigates if upper-face affective cues enhance audiovisual sentence recognition, especially when audio is degraded, finding that while mouth cues are crucial for robustness, upper-face c…

View →