Similar to 2605.31500 · 18 results
The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…
Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more
The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…
Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng +10 more
The paper introduces PassNet, a large-scale ecosystem for generating compiler passes using LLMs, demonstrating that LLMs can significantly accelerate graph compilation for long-tail workloads, suggest…
HASTE introduces group-shared fixed fan-in sparsity for multi-label classification, achieving significant wall-clock speedups (up to 25x in backward pass) by enabling efficient GPU execution while mai…
The paper proposes moving the query instead of the KV-cache during cross-instance attention, demonstrating that this approach is significantly cheaper than moving the cache, especially on modern GPU f…
Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more
STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
The paper introduces Graph Cascades, a mesoscopic rewiring technique that enhances Graph Neural Networks by promoting node pairs with strong multi-hop connections to direct edges, improving performanc…
Hawkeye is a system that allows perfect, precision-preserving reproduction of GPU-level matrix multiplication operations on a CPU, enabling efficient and trustworthy third-party auditing of machine le…
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more
The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…
This paper demonstrates that Large Language Models (LLMs) can serve as accurate and selective surrogates for costly GPU kernel performance measurements, significantly expanding the search space for op…
Physical AI inference (batch-1 decode) is primarily memory-bandwidth-bound, but the observed latency gap between fast and slow GPUs is not solely due to memory bandwidth, as launch-side overheads beco…
The paper introduces Rotary GPU, an exploratory execution approach demonstrating that large Mixture-of-Experts models can be run locally on consumer GPUs with limited VRAM, achieving usable decode thr…
Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao +2 more
The paper proposes Task-Aware Coactivation Grouping (TACG) to significantly reduce communication costs in multi-task MoE inference by grouping experts based on task-specific co-activation patterns, ou…
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…
Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf +2 more
The paper introduces Parallax, a scalable and numerically stable parameterized Local Linear Attention mechanism that significantly improves LLM performance and efficiency compared to existing methods…
Yuyang Zhao, Yicheng Pan, Qiyuan He, Jincheng Yu +5 more
SANA-Streaming introduces a novel, efficient framework that enables real-time, high-resolution streaming video-to-video editing by combining a hybrid diffusion transformer with specialized training an…