Papers similar to 2606.00775

~ similar to 2606.00775· 17 results

cs.CVRecentJun 1, 2026

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more

The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…

View →

cs.CVRecentJun 1, 2026

Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion

DongQing Liu, MengShi Qi, HongWei Ji

The paper proposes a zero-shot reason-then-retrieve pipeline using Qwen3.5-27B to solve the challenging task of composed video retrieval (CoVR-R), achieving high performance on both validation and bli…

View →

cs.CVcs.AIRecentMay 29, 2026

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan

The paper introduces ConTrans, a novel local-global multi-scale encoder that combines convolutional and transformer features to significantly improve zero-shot temporal action localization by capturin…

View →

cs.CVcs.AIRecentMay 27, 2026

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang +3 more

The paper introduces VIP-Net, a framework that leverages multi-modal spatio-temporal cues and a new dataset (Temporal-VIP) to accurately identify the most influential people in videos, overcoming the…

View →

cs.CVcs.AIcs.LGRecentMay 29, 2026

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

Mohammed Asad Karim, Vinay Kumar Verma

The paper introduces a novel two-stage framework to achieve robust, category-agnostic object localization in-context (ICL) by optimizing attention and minimizing localization error using reinforcement…

View →

cs.CVRecentJun 1, 2026

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He +2 more

The paper proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that effectively balances high-level semantic control and low-level identity fidelity for text-to-video generat…

View →

cs.CVRecentJun 1, 2026

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more

LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…

View →

cs.CVcs.CLRecentMay 30, 2026

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more

The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…

View →

cs.CVcs.AIRecentJun 1, 2026

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more

The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…

View →

cs.CVcs.RORecentJun 3, 2026

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

Yurim Jeon, Dongseong Seo, Seung-Woo Seo

CIPER proposes a unified transformer framework to simultaneously perform cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.

View →

cs.LGcs.AIRecentMay 27, 2026

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Yuxiao Yang, Weitong Zhang

The paper introduces Q-ALIGN DT, a novel framework that improves conditioned sequence models by enforcing alignment between the input return-to-go (RTG) signal and the output policy's expected Q-value…

View →

cs.CVcs.AIRecentMay 29, 2026

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai +5 more

TunerDiT introduces a training-free progressive steering method to enhance multi-event video generation using Diffusion Transformers, achieving state-of-the-art performance by explicitly managing even…

View →

cs.LGcs.AIRecentMay 29, 2026

The Terminal Representation in Reinforcement Learning

Amir Esterhuysen, Anders Jonsson

The paper introduces the Terminal Representation (TR), a novel, lower-dimensional, and structurally distinct formulation for encoding reward-weighted trajectories in RL that bypasses the need for eige…

View →

cs.CVcs.AIcs.CLRecentJun 3, 2026

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake +1 more

The paper introduces BabyCL, a continual multimodal learning framework that processes egocentric video data in a single chronological pass, demonstrating that meaningful word-referent mappings can be…

View →

cs.AIRecentMay 28, 2026

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more

The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…

View →

cs.CVcs.AIRecentMay 28, 2026

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan +3 more

VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…

View →

cs.CVcs.AIEmpiricalRecentJun 10, 2026

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu

肖代替了视觉令牌的永久删除，通过可恢复的路由来改进视觉语言模型的性能

View →