~ similar to 2605.31035· 17 results
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu +8 more
DFlare introduces a lightweight layer-wise fusion mechanism to overcome the narrow conditioning bottleneck of existing block diffusion methods, enabling the scaling of draft models and achieving super…
Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more
BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…
The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…
Xinjue Wang, Xiuheng Wang, Yejun Zhang, Sergiy A. Vorobyov +2 more
The paper investigates whether using fine-grained, tensorized adapters (CP components) instead of standard LoRA ranks improves the accuracy-budget trade-off in PEFT, finding that while they fill budge…
The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…
Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng +10 more
The paper introduces PassNet, a large-scale ecosystem for generating compiler passes using LLMs, demonstrating that LLMs can significantly accelerate graph compilation for long-tail workloads, suggest…
HARP introduces a novel, adaptive, learnable orthogonal processor that significantly improves the robustness and accuracy of extreme low-bit LLM quantization compared to fixed methods.
This paper presents a hardware-oriented description of GoldenFloat, a static-split floating-point family, and its concrete artefacts.
Clark Hash is a stateless, deterministic quantization method that significantly reduces the storage size of neural embeddings while maintaining high accuracy for cosine similarity search.
Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more
STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…
FPMoE introduces a sparse Mixture-of-Experts (MoE) architecture to improve functional code generation across multiple functional programming languages, achieving state-of-the-art performance with fewe…
SPARQLe is a hardware-software co-design framework that exploits the inherent sub-precision sparsity of LLM activations to reduce memory traffic and enable efficient computation on lower-bit datapaths…
The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…
Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong +4 more
The paper proposes a compression pipeline combining few-step distillation and low-bit quantization to significantly reduce the deployment cost and parameter footprint of large dual-expert video diffus…
This paper develops optimized algorithms and a pipeline architecture for high-throughput, memory-efficient batch processing of encrypted neural network inference, significantly improving performance o…
Hawkeye is a system that allows perfect, precision-preserving reproduction of GPU-level matrix multiplication operations on a CPU, enabling efficient and trustworthy third-party auditing of machine le…