~ similar to 2606.00144· 20 results
TAPS introduces a target-aware prefix selection method that optimizes the trade-off between draft tree acceptance and verification cost, achieving significant speedups in speculative decoding.
The paper introduces CaDDTree, a cost-aware method that optimizes token throughput by jointly selecting the tree structure and node budget for speculative decoding, outperforming existing methods like…
Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu +8 more
DFlare introduces a lightweight layer-wise fusion mechanism to overcome the narrow conditioning bottleneck of existing block diffusion methods, enabling the scaling of draft models and achieving super…
Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more
The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more
The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…
Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more
This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…
Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more
This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more
The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…
Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more
The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
This paper investigates improving speculative decoding for multilingual LLM inference, finding that n-gram draft models offer consistent speed-ups across languages despite lower token acceptance rates…
PrunePath introduces a budget-adaptive structured sparsification framework that efficiently prunes Feed-forward networks in large language models, achieving hardware-friendly sparsity and measurable s…
Physical AI inference (batch-1 decode) is primarily memory-bandwidth-bound, but the observed latency gap between fast and slow GPUs is not solely due to memory bandwidth, as launch-side overheads beco…
Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more
This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…
The paper empirically audits the k-NAF budget accounting mechanism in Anchored Decoding, finding that observed high proxy spend ratios are likely artifacts rather than true budget exhaustion failures.
The paper empirically audits the k-NAF budget accounting mechanism in Anchored Decoding, finding that observed high proxy spend ratios are likely artifacts rather than genuine budget exhaustion failur…
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.
肖代替了视觉令牌的永久删除,通过可恢复的路由来改进视觉语言模型的性能
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…