Papers similar to 2606.01117

20 results

cs.CL · cs.AI · cs.LG · Recent · Jun 4, 2026

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more

The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…


cs.CL · cs.AI · Recent · May 27, 2026

PrunePath: Towards Highly Structured Sparse Language Models

Zhexuan Gu, Zixun Fu, Yancheng Yuan

PrunePath introduces a budget-adaptive structured sparsification framework that efficiently prunes feed-forward networks in large language models, achieving hardware-friendly sparsity and measurable s…


cs.LG · cs.AI · Recent · May 31, 2026

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more

This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…


cs.AR · cs.PF · Recent · May 30, 2026

Regular-Activation Concentration: Characterizing Column-Level Output Sparsity Across Diffusion Model Architectures

Dazhi Yang, Shafayat Mowla Anik, Byeong Kil Lee, Jeeho Ryoo

The paper systematically characterizes column-level activation sparsity across various diffusion model architectures, demonstrating that element-level sparsity metrics significantly overestimate the a…


cs.AR · Recent · May 29, 2026

SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference

Aradhana Mohan Parvathy, Soumendu Kumar Ghosh, Shamik Kundu, Arnab Raha +3 more

SPARQLe is a hardware-software co-design framework that exploits the inherent sub-precision sparsity of LLM activations to reduce memory traffic and enable efficient computation on lower-bit datapaths…


cs.LG · cs.AI · Recent · May 30, 2026

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more

The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…


cs.AI · Recent · May 27, 2026

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Stanislav Kirdey, Clark Labs Inc

Clark Hash is a stateless, deterministic quantization method that significantly reduces the storage size of neural embeddings while maintaining high accuracy for cosine similarity search.


cs.CL · cs.AI · Recent · Jun 1, 2026

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…


cs.IR · cs.AI · cs.LG · Recent · May 28, 2026

No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval

Lixuan Guo, Yifei Wang, Tiansheng Wen, Aosong Feng +2 more

The paper introduces Single-stage Sparse Retrieval (SSR), a method that replaces computationally expensive vector clustering with sparse autoencoding to achieve highly efficient multi-vector retrieval…


cs.LG · cs.AI · Recent · May 29, 2026

On Efficient Scaling of GNNs via IO-Aware Layers Implementations

Daria Fomina, Daniil Krasylnikov, Alexey Boykov, Andrey Dolgovyazov +2 more

This paper develops specialized, I/O-aware GPU kernels for common GNN layer types, achieving significant speedups and memory reductions compared to existing frameworks.


cs.AI · Recent · May 30, 2026

Threshold-Based Exclusive Batching for LLM Inference

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more

This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…


cs.LG · cs.AI · Recent · May 29, 2026

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen +3 more

BudgetDraft introduces an acceptance-aware multi-view training method that trains a sparse-KV speculative decoder to maintain high acceptance rates across varying context lengths and sparsity levels,…


cs.CV · cs.AI · Recent · May 28, 2026

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan +3 more

VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…


cs.LG · cs.AI · cs.DC · Recent · May 27, 2026

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare +8 more

The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…


cs.CR · Recent · May 8, 2026

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

Zifan Qu, Vasileios P. Kemerlis, Giuseppe Ateniese, Evgenios M. Kornaropoulos

TENNOR is a system that enables efficient and private training of wide neural networks in untrusted cloud environments by using doubly oblivious primitives and a novel memory-efficient hashing scheme.


cs.LG · cs.AI · Recent · May 27, 2026

Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more

The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…


cs.LG · cs.AI · Recent · May 27, 2026

Efficient Pre-Training of LLMs through Truncated SVD Layers

Kaivan Kamali, Kajetan Schweighofer, Hormoz Shahrzad, Olivier Francon +2 more

The paper introduces TSVD, a novel framework that efficiently pre-trains LLMs by enforcing both low rank and strict weight orthonormality, achieving performance comparable to full-parameter models wit…


cs.AI · Recent · May 27, 2026

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu +2 more

FedMPT introduces a novel federated learning framework for Multi-Label Recognition (MLR) using Vision-Language Models (VLMs) by leveraging generalizable conditions to mitigate label overfitting and im…


cs.CL · Recent · May 29, 2026

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more

GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.


cs.LG · cs.AI · Recent · May 28, 2026

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more

BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…
