Similar to 2605.30968 · 18 results
The paper proposes Dynamic Adapter Routing (DAR), a novel method that significantly improves continual multimodal retrieval by adaptively selecting and merging specialized adapters.
Ziying Chen, Yang Cao, He Sun, Beining Yang +1 more
The paper proposes a novel geometric embedding hashing method to recover object correspondences (vector links) between two embedding clouds generated by different black-box encoders using only a small…
The paper introduces COMET, a novel PLS-SVD framework, to analyze the audio-text modality gap in CLAP models, showing that shared concepts are captured by a small subset of axes, and proposes a spectr…
The paper demonstrates that clinical vision-language models (VLMs) pose a significant privacy risk by allowing de-identified images to be re-linked to original reports, and proposes a targeted differe…
The paper introduces a distributional framework using Wasserstein distance to unify the semantic comparison of sparse autoencoder features across different layers and to automatically compress large f…
The paper proposes FedSAP, a framework that stabilizes federated prototype learning by delaying global alignment and enforcing inter-class structure, significantly improving representation quality und…
The paper proposes a decoupled two-stage training pipeline to effectively learn a shared representation for person re-identification by mitigating optimization conflicts between image-based and text-b…
DiffCrossGait is a novel trajectory-level alignment method that uses latent diffusion to overcome domain discrepancies in 2D-3D gait recognition, achieving state-of-the-art performance.
MASER is a lightweight framework that dynamically routes a shared Vision-Language Model (VLM) to the most appropriate modality-specific adapter (e.g., point cloud, RGB) based on the input question, si…
The paper proposes xModel-KD, a cross-modal knowledge distillation framework, to improve 3D point cloud segmentation by effectively transferring rich appearance cues from 2D images to sparse 3D geomet…
The paper proposes VRPO, a reinforcement learning-based optimization strategy that replaces static alignment losses in diffusion models, significantly improving both convergence and image fidelity.
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin +2 more
InfoMerge is a novel, training-free method that significantly compresses visual tokens for Video-LLMs by estimating temporal redundancy and allocating tokens based on content richness, achieving high…
Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more
This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…
Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more
STaR-KV is a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…
V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.
David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin +2 more
The paper proposes a novel global multi-modal alignment framework to robustly learn video representations from noisy and complementary sensor data, significantly improving driver distraction detection…