ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.30574· 20 results

cs.CVcs.AIRecentJun 1, 2026

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin +3 more

STaR-KV introduces a novel, training-free KV cache compression framework that adaptively re-weights token importance across spatial, temporal, and distributional axes, significantly reducing GPU memor…

View →
cs.CLRecentMay 29, 2026

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Junjie Peng, You Wu, Haoyi Wu, Jialong Han +3 more

GRKV introduces a training-free KV-cache merging method that uses global regression to distribute information from evicted tokens, solving the over-merging problem inherent in span-based retention.

View →
cs.CLRecentJun 1, 2026

Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning

Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao +4 more

The paper introduces State-Adaptive Prompt Optimization (SAPO), a novel training strategy that treats prompts as dynamic variables to achieve robust fine-tuning, significantly mitigating catastrophic…

View →
cs.AIRecentMay 28, 2026

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.

View →
cs.AIcs.CRRecentMay 27, 2026

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

J. Vijayavallabh

The paper empirically audits the k-NAF budget accounting mechanism in Anchored Decoding, finding that observed high proxy spend ratios are likely artifacts rather than true budget exhaustion failures.

View →
cs.AIcs.CRRecentMay 27, 2026

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

J. Vijayavallabh

The paper empirically audits the k-NAF budget accounting mechanism in Anchored Decoding, finding that observed high proxy spend ratios are likely artifacts rather than genuine budget exhaustion failur…

View →
cs.CRRecentMay 28, 2026

Strengthening Polymorphic Prompt Assembling: Dynamic Separator Generation Against Emerging Prompt Injection Attacks

Nima Dorzhiev, Peng Liu

The paper introduces dynamic, per-request separator generation for Polymorphic Prompt Assembling (PPA), significantly reducing the blast-radius vulnerability to prompt injection attacks by ensuring un…

View →
cs.CLcs.AIRecentJun 1, 2026

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…

View →
cs.CRRecentJun 2, 2026

The Security Budget of Code LLMs: An Information-Theoretic Capacity-Security Bound

Jianwei Tai

The paper establishes an information-theoretic upper bound on the combined functional capacity and perturbation retention of code LLMs, quantifying the security budget available for code generation.

View →
cs.AIRecentMay 27, 2026

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang +6 more

The paper introduces Thinking as Compression (TaC), a novel paradigm showing that the inherent reasoning process of a large language model can naturally compress long context inputs, outperforming ded…

View →
cs.AIEmpiricalRecentJun 9, 2026

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more

This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…

View →
cs.AIEmpiricalRecentJun 9, 2026

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more

This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…

View →
cs.LGcs.AIRecentMay 27, 2026

Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more

The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…

View →
cs.ARcs.CLcs.LGRecentJun 1, 2026

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more

The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…

View →
cs.CRRecentMay 22, 2026

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

Guanlong Wu, Zhaohan li, Yao Zhang, Zheng Zhang +3 more

CachePrune introduces a privacy-aware, fine-grained KV cache sharing mechanism that allows LLM inference systems to safely reuse cache entries across users' requests, significantly improving efficienc…

View →
cs.LGcs.CLcs.CRRecentMay 18, 2026

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

Gabriel Garcia

The paper demonstrates that structural protection mechanisms are the dominant factor in maintaining high performance for KV cache eviction policies, often surpassing the benefits of complex scoring al…

View →
cs.CRcs.CVRecentMay 15, 2026

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang +2 more

The paper introduces CrossMPI, a novel cross-modal prompt injection attack that uses image-only perturbations to steer the interpretation of both textual and visual inputs in Large Vision-Language Mod…

View →
cs.CRcs.CLcs.SERecentMay 28, 2026

Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

Minor, single-character perturbations to prompts can significantly degrade the security of code generated by LLMs, suggesting that prompt fragility is a major security concern beyond simple prompt inj…

View →
cs.CRcs.SERecentApr 30, 2026

How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more

The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…

View →
cs.AIRecentMay 28, 2026

Accelerating Constrained Decoding with Token Space Compression

Michael Sullivan, Alexander Koller

The paper introduces CFGzip, an offline token space compression technique that significantly reduces the computational overhead of constrained decoding, making complex grammar enforcement feasible at…

View →