~ similar to 2606.02161· 17 results
V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si +7 more
AdaCodec introduces a predictive visual coding scheme for video MLLMs, significantly improving efficiency and performance by transmitting only inter-frame changes and full reference frames when necess…
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
Geng Li, Guohao Chen, Ting Chen, Shilin Shan +5 more
OccamToken introduces a training-free, adaptive token pruning framework that replaces fixed token budgets with relative evidence testing against a register-based reference, significantly improving VLM…
Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele +2 more
PARCEL introduces a novel visual tokenization architecture that combines spatial pooling anchors with conditioned elastic queries, efficiently reducing the computational cost of large Vision-Language…
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…
Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu +1 more
CIVIC is a path-consistent compact visual inference framework that achieves genuine hardware efficiency in Vision-Language Models by maintaining contiguous sequence representations across all inferenc…
肖代替了视觉令牌的永久删除,通过可恢复的路由来改进视觉语言模型的性能
This paper presents a black-box membership inference attack (MIA) against Video Large Language Models (VideoLLMs), demonstrating that they are vulnerable by analyzing generation behavior across varyin…
Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more
STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…
Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more
GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.
Xiangyi Chen, Zelun Wang, Xinyi Li, Yi-Ping Hsu +2 more
The paper proposes PrefixMem, a dedicated encoder for Semantic IDs (SIDs), demonstrating that structured, prefix-conditioned representations significantly improve the accuracy and recall of generative…
The paper proposes DLLM-VSR, a novel Diffusion Large Language Model framework for Visual Speech Recognition, achieving state-of-the-art performance by treating transcription as iterative masked denois…
Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more
LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…