~ similar to 2606.01117· 20 results
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more
The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…
PrunePath introduces a budget-adaptive structured sparsification framework that efficiently prunes Feed-forward networks in large language models, achieving hardware-friendly sparsity and measurable s…
Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more
This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…
The paper systematically characterizes column-level activation sparsity across various diffusion model architectures, demonstrating that element-level sparsity metrics significantly overestimate the a…
SPARQLe is a hardware-software co-design framework that exploits the inherent sub-precision sparsity of LLM activations to reduce memory traffic and enable efficient computation on lower-bit datapaths…
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
Clark Hash is a stateless, deterministic quantization method that significantly reduces the storage size of neural embeddings while maintaining high accuracy for cosine similarity search.
The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…
Lixuan Guo, Yifei Wang, Tiansheng Wen, Aosong Feng +2 more
The paper introduces Single-stage Sparse Retrieval (SSR), a method that replaces computationally expensive vector clustering with sparse autoencoding to achieve highly efficient multi-vector retrieval…
This paper develops specialized, I/O-aware GPU kernels for common GNN layer types, achieving significant speedups and memory reductions compared to existing frameworks.
Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more
This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…
Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more
BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…
TENNOR is a system that enables efficient and private training of wide neural networks in untrusted cloud environments by using doubly oblivious primitives and a novel memory-efficient hashing scheme.
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…
The paper introduces TSVD, a novel framework that efficiently pre-trains LLMs by enforcing both low rank and strict weight orthonormality, achieving performance comparable to full-parameter models wit…
Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu +2 more
FedMPT introduces a novel federated learning framework for Multi-Label Recognition (MLR) using Vision-Language Models (VLMs) by leveraging generalizable conditions to mitigate label overfitting and im…
Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more
GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.
Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more
BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…