Similar to 2605.29986 · 20 results
The paper proposes EPIC, an efficient, parallel decoding framework that significantly speeds up constraining diffusion language model outputs with context-free grammars (CFGs).
The paper introduces an efficient, novel algorithm for incremental Byte Pair Encoding (BPE) tokenization that processes input text prefix by prefix, achieving significant speedups and enabling streami…
Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah +4 more
The paper introduces Hybrid Verified Decoding, a method that predicts the acceptance length of a cache draft to intelligently select between cache verification and model-based drafting, achieving sign…
Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more
The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao +2 more
The paper proposes a novel probabilistic globally constrained decoding (P-GCD) method that efficiently constructs proposals for locally constrained decoding, significantly improving convergence speed…
The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…
Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
The paper proposes CYKNN, a novel recurrent neural network architecture that directly encodes the CYK parsing algorithm, demonstrating superior performance over large language models on syntactic pars…
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo +8 more
The paper proposes SimSD, a plug-and-play speculative decoding algorithm that adapts diffusion language models (dLLMs) to achieve fast, token-level acceleration by restoring causal masking capabilitie…
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more
The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…
Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more
GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.
PrunePath introduces a budget-adaptive structured sparsification framework that efficiently prunes feed-forward networks in large language models, achieving hardware-friendly sparsity and measurable s…
This paper investigates the redundancy of the prompt KV cache during language model decoding, finding that the structure provided by chat templates is the primary source of redundancy, not the actual…
LongAttnComp introduces a novel, two-stage fine-tuning framework for context compression that significantly improves long-context reasoning performance, matching or exceeding full-context accuracy on…
The paper proposes a unified framework for designing efficient and expressive token mixing layers by separating the direct and recurrent influences of inputs, allowing for a principled trade-off betwe…
Kıvanç Kuzey Dikici, Serdar Kara, Semih Çağlar, Eray Tüzün +1 more
SERSEM introduces a selective entropy-weighted scoring framework to significantly improve Membership Inference Attacks (MIAs) against code LLMs by focusing on human-centric coding anomalies rather tha…
The paper proposes an aggressive, parameter-efficient method to prune non-essential experts from Mixture-of-Experts (MoE) LLMs, significantly compressing the model while maintaining high machine trans…
Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang +1 more
The paper introduces Token-Aware Gradient Optimization (TAGO), demonstrating that sparse optimization focusing only on high-gradient audio tokens is sufficient for effective jailbreaking of audio lang…