Papers similar to 2605.30351

19 results

cs.CV · Jun 1, 2026

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more

LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…

cs.AR, cs.CL, cs.LG · Jun 1, 2026

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more

The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…

cs.CV, cs.CL · Jun 1, 2026

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin +2 more

InfoMerge is a novel, training-free method that significantly compresses visual tokens for Video-LLMs by estimating temporal redundancy and allocating tokens based on content richness, achieving high…

cs.CV, cs.AI · Jun 1, 2026

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more

STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…

cs.CV · Jun 1, 2026

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo +8 more

The paper introduces X-Stream, a new benchmark for multi-stream video understanding, and finds that current state-of-the-art MLLMs perform poorly when required to process multiple concurrent video str…

cs.CV, cs.AI · May 28, 2026

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Yuyang Zhao, Yicheng Pan, Qiyuan He, Jincheng Yu +5 more

SANA-Streaming introduces a novel, efficient framework that enables real-time, high-resolution streaming video-to-video editing by combining a hybrid diffusion transformer with specialized training an…

cs.AI · May 28, 2026

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.

cs.CL, cs.AI · May 30, 2026

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu +4 more

WaveFilter is a novel, training-free framework that uses wavelet transforms to efficiently filter critical tokens in the KV cache, significantly improving the long-context performance of Diffusion LLM…

cs.AI · May 28, 2026

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

The paper introduces NaRA, a noise-aware LoRA technique that dynamically adapts fine-tuning parameters based on the noise level during diffusion, significantly improving the performance of Diffusion L…

cs.CV, cs.AI · May 29, 2026

Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler +1 more

Real2SAM2Real introduces a framework that uses explicit 3D caches, derived from 3D lifting models, to provide robust geometric guidance to Video Diffusion Models, significantly improving spatiotempora…

cs.LG, cs.AI · May 27, 2026

Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more

The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…

cs.AR, cs.PF · May 30, 2026

Regular-Activation Concentration: Characterizing Column-Level Output Sparsity Across Diffusion Model Architectures

Dazhi Yang, Shafayat Mowla Anik, Byeong Kil Lee, Jeeho Ryoo

The paper systematically characterizes column-level activation sparsity across various diffusion model architectures, demonstrating that element-level sparsity metrics significantly overestimate the a…

cs.DC, cs.AI, cs.NI · May 31, 2026

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein

The paper proposes moving the query instead of the KV-cache during cross-instance attention, demonstrating that this approach is significantly cheaper than moving the cache, especially on modern GPU f…

cs.AI, cs.CV, eess.AS · May 27, 2026

Diffusion Large Language Models for Visual Speech Recognition

Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

The paper proposes DLLM-VSR, a novel Diffusion Large Language Model framework for Visual Speech Recognition, achieving state-of-the-art performance by treating transcription as iterative masked denois…

cs.LG, cs.AI · Jun 1, 2026

FLARE: Diffusion for Hybrid Language Model

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan +8 more

FLARE is a systematic conversion framework that enables a single checkpoint to support both autoregressive (AR) and diffusion-style parallel decoding for hybrid-attention large language models, achiev…

cs.CL, cs.AI · Jun 1, 2026

SimSD: Simple Speculative Decoding in Diffusion Language Models

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more

The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…

cs.CL · May 29, 2026

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more

GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.

cs.CV, cs.AI · May 30, 2026

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong +4 more

The paper proposes a compression pipeline combining few-step distillation and low-bit quantization to significantly reduce the deployment cost and parameter footprint of large dual-expert video diffus…

cs.LG, cs.AI · May 29, 2026

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more

BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…
