~ similar to 2606.00722· 20 results
The paper introduces CFGzip, an offline token space compression technique that significantly reduces the computational overhead of constrained decoding, making complex grammar enforcement feasible at…
Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more
The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more
The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…
Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu +2 more
The paper proposes a novel trace-aware decoding framework, combining Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE), to significantly accelerate the inference of diffusion…
TAPS introduces a target-aware prefix selection method that optimizes the trade-off between draft tree acceptance and verification cost, achieving significant speedups in speculative decoding.
This paper systematically studies how soft errors propagate during Large Language Model (LLM) inference using a novel fault-injection framework, providing critical insights and mitigation strategies f…
Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more
BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…
The paper introduces CaDDTree, a cost-aware method that optimizes token throughput by jointly selecting the tree structure and node budget for speculative decoding, outperforming existing methods like…
FPMoE introduces a sparse Mixture-of-Experts (MoE) architecture to improve functional code generation across multiple functional programming languages, achieving state-of-the-art performance with fewe…
Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan +8 more
FLARE is a systematic conversion framework that enables a single checkpoint to support both autoregressive (AR) and diffusion-style parallel decoding for hybrid-attention large language models, achiev…
The paper proposes projectional decoding, a novel framework that integrates a partial graph model alongside text generation to ensure the semantic validity of LLM-generated software artifacts.
The paper introduces 'infilling extraction' to accurately model training data memorization in Diffusion Language Models (DLMs), finding that bidirectional masking significantly increases the extractab…
Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi +5 more
UniScale proposes a unified framework that jointly optimizes model routing and test-time scaling to achieve a superior, fine-grained quality-cost trade-off for large language model inference.
The paper introduces an efficient, novel algorithm for incremental Byte Pair Encoding (BPE) tokenization that processes input text prefix by prefix, achieving significant speedups and enabling streami…
This paper investigates improving speculative decoding for multilingual LLM inference, finding that n-gram draft models offer consistent speed-ups across languages despite lower token acceptance rates…
The paper proposes a hybrid reasoning framework where Large Language Models (LLMs) generate code to encode complex optimization problems into a preference-based Maximum Satisfiability (MaxSAT) format,…
Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go +1 more
The paper introduces SARDI, a novel, training-free framework that uses low-confidence 'lookahead' tokens generated during the denoising process of discrete diffusion language models to dynamically gui…
The paper introduces prefix filters and an algorithm (Palla) to systematically learn and apply specific error patterns in Large Language Models, significantly improving constrained generation tasks li…
The paper introduces NaRA, a noise-aware LoRA technique that dynamically adapts fine-tuning parameters based on the noise level during diffusion, significantly improving the performance of Diffusion L…
Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang +5 more
The paper introduces DOMINO, a novel inductive framework that synthesizes domain-specific data for LLMs using only reference examples, significantly improving performance on challenging, implicitly de…