20 results for “cache partitioning”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei +6 more
This paper proposes a training-free framework called ReasonAlloc to mitigate inference bottlenecks in large language models by recasting decoding-time key-value compression as a hierarchical budget al…
This paper presents SCP, a cache partitioning design that combines strict eviction isolation with write-shared coherence to mitigate eviction-based cache side channels.
Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more
The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…
Guanlong Wu, Zhaohan li, Yao Zhang, Zheng Zhang +3 more
CachePrune introduces a privacy-aware, fine-grained KV cache sharing mechanism that allows LLM inference systems to safely reuse cache entries across users' requests, significantly improving efficienc…
Pomegranate is a novel framework that uses hardware-assisted virtualization and Extended Page Tables to securely compartmentalize existing operating systems with minimal source code modification, enab…
Cloak is an oblivious storage system that significantly improves the performance of ORAM by exploiting temporal locality, achieving low overheads while maintaining security.
The paper characterizes 'dead-entry' TLB misses in GPUs, which occur when recently evicted translations are immediately re-walked, and proposes DEPOT, a Bloom filter mechanism that significantly reduc…
This paper audits the prompt caching mechanisms of the OpenRouter API gateway to determine if its architecture compromises the prompt cache isolation guarantees provided by underlying LLM inference pr…
Leyline introduces a novel serving-side primitive that allows agentic LLMs to perform targeted, efficient edits to the KV cache, avoiding costly full re-prefilling after content modification.
The paper proposes GroundedCache, an evidence-validated cache router that significantly improves the safety of reusing cached semantic answers in RAG systems by requiring multiple gates to validate th…
The paper analyzes the bit-flip vulnerability of shared KV-cache blocks in LLM serving systems, demonstrating that these blocks are susceptible to silent, persistent, and selective data corruption.
The paper introduces Rotary GPU, an exploratory execution approach demonstrating that large Mixture-of-Experts models can be run locally on consumer GPUs with limited VRAM, achieving usable decode thr…
The paper proposes PrISM, an intersection-based probabilistic mitigation technique that significantly improves the scalability of RowHammer defense at low thresholds by correlating sampled row history…
AI-PROPELLER introduces a novel interprocedural code layout optimization system that uses an agentic evolutionary workflow to achieve significant, measurable performance gains in large-scale, real-wor…
Chengyan Ma, Jieke Shi, Ruidong Han, Ye Liu +3 more
The paper introduces TEERepair, a framework that automatically repairs severe security vulnerabilities caused by improper partitioning in Trusted Execution Environments (TEEs) by combining a domain-sp…
This paper investigates the potential of real-world Processing-in-Memory (PIM) architectures, specifically using UPMEM, to accelerate cryptographic algorithms, demonstrating that distributing computat…
The paper reframes Parameter-Efficient Fine-Tuning (PEFT) from a mere cost-saving alternative to a robust architecture for creating persistent, personalized models that layer specific behaviors onto l…
Hyesung Ji, Hyunah Yu, Jongmin Kim, Wonseok Choi +2 more
GPIR is a GPU-accelerated Private Information Retrieval (PIR) system that significantly boosts throughput by introducing a stage-aware hybrid execution model and optimizing data layouts for modern GPU…
The paper proposes moving the query instead of the KV-cache during cross-instance attention, demonstrating that this approach is significantly cheaper than moving the cache, especially on modern GPU f…
Onyx proposes a novel, cost-efficient disk-oblivious Approximate Nearest Neighbor (ANN) search system that significantly reduces both cost and latency compared to state-of-the-art methods.