Papers similar to 2606.02091

~ similar to 2606.02091· 19 results

cs.CLRecentJun 1, 2026

Cost-Aware Diffusion Draft Trees for Speculative Decoding

Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai

The paper introduces CaDDTree, a cost-aware method that optimizes token throughput by jointly selecting the tree structure and node budget for speculative decoding, outperforming existing methods like…

View →

cs.AIRecentMay 30, 2026

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

Zhuoyu Wang, Junnan Huang, Xinyu Chen

TAPS introduces a target-aware prefix selection method that optimizes the trade-off between draft tree acceptance and verification cost, achieving significant speedups in speculative decoding.

View →

cs.CLRecentMay 29, 2026

Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more

The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…

View →

cs.LGcs.AIRecentMay 29, 2026

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more

BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…

View →

cs.CLcs.AIRecentJun 1, 2026

SimSD: Simple Speculative Decoding in Diffusion Language Models

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more

The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…

View →

cs.LGcs.AIRecentMay 28, 2026

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more

BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…

View →

cs.CLcs.AIRecentMay 31, 2026

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more

The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…

View →

cs.CLcs.LGRecentMay 28, 2026

Speculative Decoding Across Languages

Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer

This paper investigates improving speculative decoding for multilingual LLM inference, finding that n-gram draft models offer consistent speed-ups across languages despite lower token acceptance rates…

View →

cs.LGcs.AIRecentJun 1, 2026

FLARE: Diffusion for Hybrid Language Model

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan +8 more

FLARE is a systematic conversion framework that enables a single checkpoint to support both autoregressive (AR) and diffusion-style parallel decoding for hybrid-attention large language models, achiev…

View →

cs.CLRecentMay 29, 2026

dMoE: dLLMs with Learnable Block Experts

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more

dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…

View →

cs.CLcs.AIRecentJun 1, 2026

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…

View →

cs.AIRecentMay 28, 2026

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

The paper introduces NaRA, a noise-aware LoRA technique that dynamically adapts fine-tuning parameters based on the noise level during diffusion, significantly improving the performance of Diffusion L…

View →

cs.CLcs.AIRecentMay 30, 2026

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu +4 more

WaveFilter is a novel, training-free framework that uses wavelet transforms to efficiently filter critical tokens in the KV cache, significantly improving the long-context performance of Diffusion LLM…

View →

cs.DCcs.AIcs.LGRecentMay 31, 2026

Lodestar: An Online-Learning LLM Inference Router

Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan +2 more

Lodestar is a novel online learning-based request routing system that significantly improves LLM inference efficiency by dynamically assigning incoming requests to the optimal GPU instance to minimize…

View →

cs.AIRecentMay 30, 2026

Threshold-Based Exclusive Batching for LLM Inference

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more

This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…

View →

cs.ARRecentMay 29, 2026

SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference

Aradhana Mohan Parvathy, Soumendu Kumar Ghosh, Shamik Kundu, Arnab Raha +3 more

SPARQLe is a hardware-software co-design framework that exploits the inherent sub-precision sparsity of LLM activations to reduce memory traffic and enable efficient computation on lower-bit datapaths…

View →

cs.LGcs.AIRecentMay 27, 2026

Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more

The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…

View →

cs.LGcs.AIcs.DCRecentMay 27, 2026

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare +8 more

The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…

View →

cs.DCcs.AIRecentJun 1, 2026

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Yafan Huang, Sheng Di, Guanpeng Li

This paper systematically studies how soft errors propagate during Large Language Model (LLM) inference using a novel fault-injection framework, providing critical insights and mitigation strategies f…

View →