~ similar to 2605.30580· 20 results
Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more
The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…
Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more
The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…
Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu +8 more
DFlare introduces a lightweight layer-wise fusion mechanism to overcome the narrow conditioning bottleneck of existing block diffusion methods, enabling the scaling of draft models and achieving super…
TAPS introduces a target-aware prefix selection method that optimizes the trade-off between draft tree acceptance and verification cost, achieving significant speedups in speculative decoding.
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more
The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
The paper introduces CaDDTree, a cost-aware method that optimizes token throughput by jointly selecting the tree structure and node budget for speculative decoding, outperforming existing methods like…
COFT is a training-free decoding method that significantly reduces societal biases in large language model chain-of-thought reasoning by applying token-level fairness control at decode time.
Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more
BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…
The paper investigates whether modestly sized open-source language models can grasp the semantics of rare Paired-Focus constructions, finding that understanding emerges later in training and correlate…
Yunhai Hu, Zining Liu, Xiangyang Yin, Tianhua Xia +4 more
DREAM-R is a novel framework that significantly enhances speculative reasoning in large multimodal models by optimizing draft generation alignment, introducing a robust verification mechanism, and ena…
The paper proposes an aggressive, parameter-efficient method to prune non-essential experts from Mixture-of-Experts (MoE) LLMs, significantly compressing the model while maintaining high machine trans…
The paper proposes EPIC, an efficient and parallel decoding framework that significantly speeds up the process of constraining diffusion language model outputs using Context-Free Grammars (CFG).
This paper investigates how different types of compressed reasoning data (Explicit, Composed, Implicit CoT) affect LLM performance during post-training, finding that the choice of compression and subs…
This paper analyzes the multilinguality of LLMs by examining their structural properties, finding that low-resource languages are structurally more distinct from English than high-resource languages,…
The paper benchmarks local, offline LLMs for confidential translation workflows, demonstrating that while they are viable for privacy-sensitive use, they generally lag behind top commercial NMT system…
Renfei Dang, Xinye Wang, Zhejian Lai, Weilu Xu +4 more
The paper proposes RIEQE, a two-stage training framework that synergistically co-evolves implicit and explicit reasoning capabilities in Large Reasoning Models (LRMs) to significantly improve fine-gra…
Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou +1 more
This paper demonstrates that targeted adaptation of the small language model Phi Silica, using dataset curation and fine-tuning, significantly improves its performance in short-form text rewriting, na…
Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue +6 more
The paper introduces StreamSynth, a sequential setting for synthetic data generation, and proposes SynLearner, a framework that enables LLMs to improve synthesis performance by accumulating and transf…
This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…