Similar to 2606.00487 · 18 results
The paper introduces CaDDTree, a cost-aware method that optimizes token throughput by jointly selecting the tree structure and node budget for speculative decoding, outperforming existing methods like…
Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more
BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…
Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu +8 more
DFlare introduces a lightweight layer-wise fusion mechanism to overcome the narrow conditioning bottleneck of existing block diffusion methods, enabling the scaling of draft models and achieving super…
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more
The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…
Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more
The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…
Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more
The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…
This paper investigates improving speculative decoding for multilingual LLM inference, finding that n-gram draft models offer consistent speed-ups across languages despite lower token acceptance rates…
The paper proposes EPIC, an efficient, parallel decoding framework that significantly speeds up constraining diffusion language model outputs with context-free grammars (CFGs).
The paper introduces 'infilling extraction' to accurately model training data memorization in Diffusion Language Models (DLMs), finding that bidirectional masking significantly increases the extractab…
Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu +4 more
WaveFilter is a novel, training-free framework that uses wavelet transforms to efficiently filter critical tokens in the KV cache, significantly improving the long-context performance of Diffusion LLM…
PrunePath introduces a budget-adaptive structured sparsification framework that efficiently prunes feed-forward networks in large language models, achieving hardware-friendly sparsity and measurable s…
Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more
BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
The paper proposes a unified framework for designing efficient and expressive token mixing layers by separating the direct and recurrent influences of inputs, allowing for a principled trade-off betwe…
Chatterbox-Flash introduces a prior-calibrated block diffusion model for zero-shot TTS that achieves high-fidelity, streaming synthesis with significantly lower computational overhead than existing me…
Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu +2 more
The paper proposes a novel trace-aware decoding framework, combining Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE), to significantly accelerate the inference of diffusion…
The paper introduces NaRA, a noise-aware LoRA technique that dynamically adapts fine-tuning parameters based on the noise level during diffusion, significantly improving the performance of Diffusion L…
Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan +8 more
FLARE is a systematic conversion framework that enables a single checkpoint to support both autoregressive (AR) and diffusion-style parallel decoding for hybrid-attention large language models, achiev…