~ similar to 2606.01502· 20 results
Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more
The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more
The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan +2 more
Lodestar is a novel online learning-based request routing system that significantly improves LLM inference efficiency by dynamically assigning incoming requests to the optimal GPU instance to minimize…
The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…
肖代替了视觉令牌的永久删除,通过可恢复的路由来改进视觉语言模型的性能
The paper proposes GroundedCache, an evidence-validated cache router that significantly improves the safety of reusing cached semantic answers in RAG systems by requiring multiple gates to validate th…
The paper introduces Entity-Collision, a rigorous protocol that separates genuine retrieval lift from simple lexical overlap, demonstrating that embedder performance depends critically on the query ty…
Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more
STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…
The paper characterizes 'dead-entry' TLB misses in GPUs, which occur when recently evicted translations are immediately re-walked, and proposes DEPOT, a Bloom filter mechanism that significantly reduc…
The paper introduces Curvature-Conditioned Query (CCQ), a novel read-time contraction mechanism that improves linear attention's performance on long-context and retrieval tasks by incorporating the ge…
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more
BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…
Guanlong Wu, Zhaohan li, Yao Zhang, Zheng Zhang +3 more
CachePrune introduces a privacy-aware, fine-grained KV cache sharing mechanism that allows LLM inference systems to safely reuse cache entries across users' requests, significantly improving efficienc…
Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele +2 more
PARCEL introduces a novel visual tokenization architecture that combines spatial pooling anchors with conditioned elastic queries, efficiently reducing the computational cost of large Vision-Language…
Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu +4 more
ElasticMem introduces a novel framework that treats memory as an elastic latent resource, allowing LLM agents to adaptively manage and inject variable-budget memories for improved performance in long-…
Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim +11 more
The paper introduces K-BrowseComp, a new web-browsing agent benchmark of 400 problems grounded in Korean contexts, demonstrating that current frontier LLMs struggle significantly with complex, context…
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…
Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.
The paper introduces Rotary GPU, an exploratory execution approach demonstrating that large Mixture-of-Experts models can be run locally on consumer GPUs with limited VRAM, achieving usable decode thr…