Papers similar to 2605.30968

18 results

cs.CV · cs.AI · Recent · May 29, 2026

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik +1 more

The paper proposes Dynamic Adapter Routing (DAR), a novel method that significantly improves continual multimodal retrieval by adaptively selecting and merging specialized adapters.

cs.AI · cs.DB · cs.IR · Recent · May 29, 2026

Vector Linking via Cross-Model Local Isometric Consistency

Ziying Chen, Yang Cao, He Sun, Beining Yang +1 more

The paper proposes a novel geometric embedding hashing method to recover object correspondences (vector links) between two embedding clouds generated by different black-box encoders using only a small…

cs.SD · cs.AI · cs.CL · Recent · May 28, 2026

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

The paper introduces COMET, a novel PLS-SVD framework, to analyze the audio-text modality gap in CLAP models, showing that shared concepts are captured by a small subset of axes, and proposes a spectr…

cs.CV · cs.AI · cs.CL · Recent · Jun 1, 2026

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

The paper demonstrates that clinical vision-language models (VLMs) pose a significant privacy risk by allowing de-identified images to be re-linked to original reports, and proposes a targeted differe…

cs.LG · cs.AI · Recent · May 27, 2026

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Tue M. Cao, Nguyen Do, My T. Thai

The paper introduces a distributional framework using Wasserstein distance to unify the semantic comparison of sparse autoencoder features across different layers and to automatically compress large f…

cs.LG · cs.CV · Recent · Jun 1, 2026

Closing the Alignment-Maturity Gap in Federated Prototype Learning

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

The paper proposes FedSAP, a framework that stabilizes federated prototype learning by delaying global alignment and enforcing inter-class structure, significantly improving representation quality und…

cs.CV · cs.AI · cs.LG · Recent · Jun 1, 2026

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

Karina Kvanchiani, Timur Mamedov

The paper proposes a decoupled two-stage training pipeline to effectively learn a shared representation for person re-identification by mitigating optimization conflicts between image-based and text-b…

cs.CV · cs.AI · Recent · May 29, 2026

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

Zhiyang Lu, Ming Cheng

DiffCrossGait proposes a novel trajectory-level alignment method using latent diffusion to overcome domain discrepancies in 2D-3D gait recognition, achieving state-of-the-art performance.

cs.CV · cs.AI · Recent · Jun 1, 2026

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

Hilton Raj, Vishnuram AV

MASER is a lightweight framework that dynamically routes a shared Vision-Language Model (VLM) to the most appropriate modality-specific adapter (e.g., point cloud, RGB) based on the input question, si…

cs.CV · cs.AI · Recent · May 28, 2026

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

The paper proposes xModel-KD, a cross-modal knowledge distillation framework, to improve 3D point cloud segmentation by effectively transferring rich appearance cues from 2D images to sparse 3D geomet…

cs.CV · cs.AI · cs.LG · Recent · May 30, 2026

Improving Visual Representation Alignment Generation with GRPO

Shentong Mo, Sukmin Yun

The paper proposes VRPO, a reinforcement learning-based optimization strategy that replaces static alignment losses in diffusion models, significantly improving both convergence and image fidelity.

cs.CV · cs.AI · cs.LG · Recent · Jun 1, 2026

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…

cs.CV · Recent · Jun 1, 2026

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more

The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…

cs.CV · cs.CL · Recent · Jun 1, 2026

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin +2 more

InfoMerge is a novel, training-free method that significantly compresses visual tokens for Video-LLMs by estimating temporal redundancy and allocating tokens based on content richness, achieving high…

cs.CV · cs.AI · Recent · May 29, 2026

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more

This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…

cs.CV · cs.AI · Recent · Jun 1, 2026

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more

STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…

cs.CV · cs.AI · Recent · May 30, 2026

V-LynX: Token Interface Alignment for Video+X LLMs

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.

cs.CV · Recent · Jun 1, 2026

Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin +2 more

The paper proposes a novel global multi-modal alignment framework to robustly learn video representations from noisy and complementary sensor data, significantly improving driver distraction detection…
