Papers similar to 2606.01101

Similar to 2606.01101 · 19 results

cs.CL · Recent · May 31, 2026

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu

LongAttnComp introduces a novel, two-stage fine-tuning framework for context compression that significantly improves long-context reasoning performance, matching or exceeding full-context accuracy on…

cs.CL · Recent · May 30, 2026

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

Mateusz Śmigielski, Michał Rajkowski, Mateusz Zbrocki, Michał Bernacki-Janson +4 more

This study systematically evaluates a wide range of chunking methods for Retrieval-Augmented Generation (RAG) to assess their effectiveness and highlight the overlooked challenges associated with chun…

cs.CL · Recent · May 29, 2026

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more

GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.

cs.AI, cs.LG · Recent · May 27, 2026

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa +1 more

This paper investigates how different types of compressed reasoning data (Explicit, Composed, Implicit CoT) affect LLM performance during post-training, finding that the choice of compression and subs…

cs.LG, cs.AI · Recent · May 27, 2026

Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more

The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…

cs.AI · Recent · May 30, 2026

Threshold-Based Exclusive Batching for LLM Inference

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more

This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…

cs.CL, cs.AI, cs.LG · Recent · Jun 4, 2026

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more

The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…

cs.CL, cs.AI, eess.AS · Recent · May 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

cs.LG, cs.AI · Recent · May 27, 2026

Context Distillation as Latent Memory Management

Ziyang Zheng, Zeju Li, Xiangyu Wen, Jianyuan Zhong +4 more

The paper reframes context distillation as a latent memory management problem, proposing a modular framework using LoRA adapters and a Self-Gating mechanism for efficient, selective memory retrieval a…

cs.CL · Recent · May 29, 2026

dMoE: dLLMs with Learnable Block Experts

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more

dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…

cs.CL · Recent · May 29, 2026

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu +4 more

ElasticMem introduces a novel framework that treats memory as an elastic latent resource, allowing LLM agents to adaptively manage and inject variable-budget memories for improved performance in long-…

cs.AI · Recent · May 28, 2026

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

Jung Hyun Lee, June Yong Yang, Jungwook Choi, Eunho Yang

The paper introduces Logit-aware Final-block Quantization (LFQ), an enhancement to block-wise quantization that quantizes the final Transformer block using a cross-entropy loss to significantly boost…

cs.LG, cs.AI, eess.AS · Recent · May 31, 2026

MURMUR: An Efficient Inference System for Long-Form ASR

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

Murmur is an efficient inference system for long-form ASR that resolves the accuracy-latency trade-off by optimizing both inter-chunk processing and intra-chunk attention mechanisms.

cs.LG, cs.AI, cs.DC · Recent · May 27, 2026

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare +8 more

The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…

cs.LG, cs.AI · Recent · May 28, 2026

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu +1 more

BlockBatch introduces a novel framework that efficiently accelerates diffusion language model (dLLM) inference by simultaneously executing multiple block-size branches for a single request, achieving…

cs.CL, cs.IR · Recent · May 29, 2026

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more

The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.

cs.CL, cs.AI, cs.LG · Recent · May 27, 2026

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim +2 more

The paper introduces a systematic framework to convert large Mixture-of-Experts (MoE) models into memory-efficient, fully dense architectures, achieving superior performance compared to traditional pr…

cs.CL, cs.LG · Empirical · Recent · Jun 4, 2026

Latent Reasoning with Normalizing Flows

Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang +4 more

This paper proposes NF-CoT, a latent reasoning framework that preserves the advantages of chain-of-thought in large language models.
