~ similar to 2606.00775 · 17 results
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
The paper proposes a zero-shot reason-then-retrieve pipeline using Qwen3.5-27B to solve the challenging task of composed video retrieval (CoVR-R), achieving high performance on both validation and bli…
The paper introduces ConTrans, a novel local-global multi-scale encoder that combines convolutional and transformer features to significantly improve zero-shot temporal action localization by capturin…
Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang +3 more
The paper introduces VIP-Net, a framework that leverages multi-modal spatio-temporal cues and a new dataset (Temporal-VIP) to accurately identify the most influential people in videos, overcoming the…
The paper introduces a novel two-stage framework to achieve robust, category-agnostic in-context object localization (ICL) by optimizing attention and minimizing localization error using reinforcement…
Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He +2 more
The paper proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that effectively balances high-level semantic control and low-level identity fidelity for text-to-video generat…
Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more
LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
CIPER proposes a unified transformer framework to simultaneously perform cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.
The paper introduces Q-ALIGN DT, a novel framework that improves conditioned sequence models by enforcing alignment between the input return-to-go (RTG) signal and the output policy's expected Q-value…
Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai +5 more
TunerDiT introduces a training-free progressive steering method to enhance multi-event video generation using Diffusion Transformers, achieving state-of-the-art performance by explicitly managing even…
The paper introduces the Terminal Representation (TR), a novel, lower-dimensional, and structurally distinct formulation for encoding reward-weighted trajectories in RL that bypasses the need for eige…
Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake +1 more
The paper introduces BabyCL, a continual multimodal learning framework that processes egocentric video data in a single chronological pass, demonstrating that meaningful word-referent mappings can be…
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more
The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
Xiao replaces the permanent deletion of visual tokens with recoverable routing, improving the performance of vision-language models.