"cache partitioning" | ArxivCSExplorer

20 results for “cache partitioning”

CS papers only

Hybrid search: Keyword + semantic, ranked by combined score.ⓘ

Want pure semantic search? Try claim verification →

cs.AIEmpiricalRecentJun 9, 2026

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more

This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…

View →

cs.CRcs.AREmpiricalRecentJun 10, 2026

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew

This paper presents SCP, a cache partitioning design that combines strict eviction isolation with write-shared coherence to mitigate eviction-based cache side channels.

View →

cs.ARcs.CLcs.LGRecentJun 1, 2026

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more

The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…

View →

cs.CRRecentMay 22, 2026

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

Guanlong Wu, Zhaohan li, Yao Zhang, Zheng Zhang +3 more

CachePrune introduces a privacy-aware, fine-grained KV cache sharing mechanism that allows LLM inference systems to safely reuse cache entries across users' requests, significantly improving efficienc…

View →

cs.CRcs.OSRecentMay 7, 2026

Pomegranate: A Lightweight Compartmentalization Architecture using Virtualization Extensions

Shriram Raja, Zhiyuan Ruan, Richard West

Pomegranate is a novel framework that uses hardware-assisted virtualization and Extended Page Tables to securely compartmentalize existing operating systems with minimal source code modification, enab…

View →

cs.CRRecentMay 26, 2026

Cloak: Heuristic ORAM Optimization Through Fixed Temporal Distribution

Onur Eren Arpaci, Florian Kerschbaum, Sujaya Maiyya

Cloak is an oblivious storage system that significantly improves the performance of ORAM by exploiting temporal locality, achieving low overheads while maintaining security.

View →

cs.ARcs.PFRecentMay 30, 2026

Regular-Dead on Arrival: Characterizing and Protecting Against Dead-Entry TLB Misses in GPU Microarchitectures

Shafayat Mowla Anik, Yongchan Jung, Jeeho Ryoo, Byeong Kil Lee

The paper characterizes 'dead-entry' TLB misses in GPUs, which occur when recently evicted translations are immediately re-walked, and proposes DEPOT, a Bloom filter mechanism that significantly reduc…

View →

cs.CRcs.LGRecentMay 28, 2026

CacheProbe: Auditing Prompt Cache Isolation in Gateway APIs

Ryan Fahey

This paper audits the prompt caching mechanisms of the OpenRouter API gateway to determine if its architecture compromises the prompt cache isolation guarantees provided by underlying LLM inference pr…

View →

cs.DCcs.AIcs.LGRecentMay 31, 2026

Leyline: KV Cache Directives for Agentic Inference

Bole Ma, Jan Eitzinger, Harald Koestler

Leyline introduces a novel serving-side primitive that allows agentic LLMs to perform targeted, efficient edits to the KV cache, avoiding costly full re-prefilling after content modification.

View →

cs.CRcs.AIcs.CLRecentMay 26, 2026

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Syed Huma Shah

The paper proposes GroundedCache, an evidence-validated cache router that significantly improves the safety of reusing cached semantic answers in RAG systems by requiring multiple gates to validate th…

View →

cs.CRcs.ARcs.LGRecentApr 19, 2026

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems

Yuji Yamamoto, Satoshi Matsuura

The paper analyzes the bit-flip vulnerability of shared KV-cache blocks in LLM serving systems, demonstrating that these blocks are susceptible to silent, persistent, and selective data corruption.

View →

cs.PFcs.ARcs.DCRecentMay 27, 2026

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory

Myeong Jun Jo

The paper introduces Rotary GPU, an exploratory execution approach demonstrating that large Mixture-of-Experts models can be run locally on consumer GPUs with limited VRAM, achieving usable decode thr…

View →

cs.CRRecentMay 17, 2026

Loaded Dice: Solving the Non-Selection Problem for Scalable Probabilistic RowHammer Defense

Jeonghyun Woo, Junsu Kim, Aamer Jaleel, Prashant J. Nair

The paper proposes PrISM, an intersection-based probabilistic mitigation technique that significantly improves the scalability of RowHammer defense at low thresholds by correlating sampled row history…

View →

cs.SEcs.AIcs.LGRecentMay 28, 2026

AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve

Chaitanya Mamatha Ananda, Rajiv Gupta, Mircea Trofin, Aiden Grossman +3 more

AI-PROPELLER introduces a novel interprocedural code layout optimization system that uses an agentic evolutionary workflow to achieve significant, measurable performance gains in large-scale, real-wor…

View →

cs.SEcs.CRRecentMay 21, 2026

Automated Repair of TEE Partitioning Issues via DSL-Guided and LLM-Assisted Patching

Chengyan Ma, Jieke Shi, Ruidong Han, Ye Liu +3 more

The paper introduces TEERepair, a framework that automatically repairs severe security vulnerabilities caused by improper partitioning in Trusted Execution Environments (TEEs) by combining a domain-sp…

View →

cs.CRcs.ARcs.DCRecentMay 19, 2026

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

Nicola Barcarolo, Brahmaiah Gandham, Mohammad Sadrosadati, Roberto Passerone +2 more

This paper investigates the potential of real-world Processing-in-Memory (PIM) architectures, specifically using UPMEM, to accelerate cryptographic algorithms, demonstrating that distributing computat…

View →

cs.LGcs.CLRecentJun 1, 2026

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

Mind Lab, :, Song Cao, Vic Cao +51 more

The paper reframes Parameter-Efficient Fine-Tuning (PEFT) from a mere cost-saving alternative to a robust architecture for creating persistent, personalized models that layer specific behaviors onto l…

View →

cs.CRcs.ARRecentApr 6, 2026

GPIR: Enabling Practical Private Information Retrieval with GPUs

Hyesung Ji, Hyunah Yu, Jongmin Kim, Wonseok Choi +2 more

GPIR is a GPU-accelerated Private Information Retrieval (PIR) system that significantly boosts throughput by introducing a stage-aware hybrid execution model and optimizing data layouts for modern GPU…

View →

cs.DCcs.AIcs.NIRecentMay 31, 2026

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein

The paper proposes moving the query instead of the KV-cache during cross-instance attention, demonstrating that this approach is significantly cheaper than moving the cache, especially on modern GPU f…

View →

cs.CRcs.AIRecentApr 22, 2026

Onyx: Cost-Efficient Disk-Oblivious ANN Search

Deevashwer Rathee, Jean-Luc Watson, Zirui Neil Zhao, G. Edward Suh +1 more

Onyx proposes a novel, cost-efficient disk-oblivious Approximate Nearest Neighbor (ANN) search system that significantly reduces both cost and latency compared to state-of-the-art methods.

View →