Similar to 2605.28001 · 18 results
The paper empirically audits the k-NAF budget accounting mechanism in Anchored Decoding, finding that observed high proxy spend ratios are likely artifacts rather than true budget exhaustion failures.
Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more
BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…
This paper investigates the redundancy of the prompt KV cache during language model decoding, finding that the structure provided by chat templates is the primary source of redundancy, not the actual…
The paper introduces CaDDTree, a cost-aware method that optimizes token throughput by jointly selecting the tree structure and node budget for speculative decoding, outperforming existing methods like…
Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more
This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
The paper introduces SB-ECC, a novel score-based decoder that models error correction as continuous-time denoising, achieving state-of-the-art performance across various code families and noise levels…
The paper introduces NumLeak, a framework demonstrating that top-tier LLMs often exhibit high fidelity recall of specific public numeric benchmarks (like financial factors) due to memorization, which…
The paper introduces NumLeak, a framework demonstrating that top-tier LLMs often exhibit high fidelity recall of specific public numeric benchmarks, suggesting that their apparent skill may be due to…
The paper establishes an information-theoretic upper bound on the combined functional capacity and perturbation retention of code LLMs, quantifying the security budget available for code generation.
Geng Li, Guohao Chen, Ting Chen, Shilin Shan +5 more
OccamToken introduces a training-free, adaptive token pruning framework that replaces fixed token budgets with relative evidence testing against a register-based reference, significantly improving VLM…
Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more
The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more
The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…
TAPS introduces a target-aware prefix selection method that optimizes the trade-off between draft tree acceptance and verification cost, achieving significant speedups in speculative decoding.
Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen +5 more
The paper proposes Resonant Context Anchoring (RCA), a lightweight, training-free method that enhances factual faithfulness in LLMs by dynamically amplifying the signal of external context evidence du…
This paper introduces cost-aware Retrieval-Augmented Generation (RAG), demonstrating that fixed evidence selection is brittle and that adaptive, agentic controllers are necessary for effective knowled…
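肖 (Xiao) replaces the permanent deletion of visual tokens with recoverable routing to improve the performance of vision-language models.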