Similar to 2606.02441 · 18 results
Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang +1 more
The paper proposes a lightweight post-processing framework that enhances identity continuity in thermal pedestrian tracking by leveraging scene-level spatial-temporal consistency, achieving improved t…
Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more
LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…
Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang +3 more
The paper introduces VIP-Net, a framework that leverages multi-modal spatio-temporal cues and a new dataset (Temporal-VIP) to accurately identify the most influential people in videos, overcoming the…
The paper proposes a decoupled two-stage training pipeline to effectively learn a shared representation for person re-identification by mitigating optimization conflicts between image-based and text-b…
This paper proposes a 3D CNN detector that leverages temporal artifacts to accurately identify high-quality deepfake videos, demonstrating robust detection even after social media re-encoding.
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin +2 more
InfoMerge is a novel, training-free method that significantly compresses visual tokens for Video-LLMs by estimating temporal redundancy and allocating tokens based on content richness, achieving high…
The paper proposes VRPO, a reinforcement learning-based optimization strategy that replaces static alignment losses in diffusion models, significantly improving both convergence and image fidelity.
The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
VISReg introduces a novel regularization technique that combines variance control with a Sliced-Wasserstein-based sketching objective to stabilize self-supervised learning, achieving state-of-the-art…
V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.
Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang +12 more
VCap introduces a novel Witness-Adjudicator reward mechanism that provides highly precise, factually grounded feedback for visual captioning, enabling state-of-the-art performance in RL-trained multim…
VidPrism introduces a novel heterogeneous Mixture-of-Experts framework that specializes temporal processing by dividing labor among experts, achieving state-of-the-art performance in image-to-video tr…
Ziying Chen, Yang Cao, He Sun, Beining Yang +1 more
The paper proposes a novel geometric embedding hashing method to recover object correspondences (vector links) between two embedding clouds generated by different black-box encoders using only a small…
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
The paper introduces SEED, a large-scale benchmark dataset for tracing sequential deepfake facial edits, and proposes FAITH, a frequency-aware Transformer model that effectively detects and orders the…
Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang +1 more
GIRL-DETR introduces Gradient-Isolated Reinforcement Learning to enhance temporal localization in lightweight Video Moment Retrieval models, achieving high accuracy by decoupling feature representatio…