Papers similar to 2606.00486

~ similar to 2606.00486· 20 results

cs.CRRecentMay 17, 2026

Loaded Dice: Solving the Non-Selection Problem for Scalable Probabilistic RowHammer Defense

Jeonghyun Woo, Junsu Kim, Aamer Jaleel, Prashant J. Nair

The paper proposes PrISM, an intersection-based probabilistic mitigation technique that significantly improves the scalability of RowHammer defense at low thresholds by correlating sampled row history…

View →

cs.CRcs.ARcs.LGRecentApr 19, 2026

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems

Yuji Yamamoto, Satoshi Matsuura

The paper analyzes the bit-flip vulnerability of shared KV-cache blocks in LLM serving systems, demonstrating that these blocks are susceptible to silent, persistent, and selective data corruption.

View →

cs.CRcs.ARRecentApr 22, 2026

PVAC: A RowHammer Mitigation Architecture Exploiting Per-victim-row Counting

Jumin Kim, Seungmin Baek, Hwayong Nam, Minbok Wi +2 more

The paper introduces PVAC, a novel victim-based row counting mechanism that accurately tracks RowHammer attacks by incrementing counters on the victim row, thereby improving hammering tolerance and pe…

View →

cs.AIRecentMay 30, 2026

Threshold-Based Exclusive Batching for LLM Inference

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma +1 more

This paper proposes a hybrid scheduler that dynamically switches between exclusive batching and mixed batching for LLM inference, achieving superior throughput, especially on bandwidth-constrained GPU…

View →

cs.CRcs.ARRecentApr 27, 2026

RowHammer Vulnerability Counter (RVC): Redefining RowHammer Detection with Victim-Centric Tracking

Lavi Jain, Venkata Kalyan Tavva

The paper proposes Rowhammer Vulnerability Counter (RVC), a novel framework that improves RowHammer mitigation by tracking a row's actual vulnerability to bit flips rather than relying on simple activ…

View →

cs.ARcs.CLcs.LGRecentJun 1, 2026

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao +1 more

The paper proposes AsymCache, a computation-latency-aware KV cache management system that optimizes LLM inference by aligning cache eviction decisions with GPU attention kernel performance, significan…

View →

cs.CRcs.AREmpiricalRecentJun 10, 2026

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew

This paper presents SCP, a cache partitioning design that combines strict eviction isolation with write-shared coherence to mitigate eviction-based cache side channels.

View →

cs.CRcs.DCRecentApr 17, 2026

PoSME: Proof of Sequential Memory Execution via Latency-Bound Pointer Chasing with Causal Hash Binding

David L. Condrey

The paper introduces PoSME, a cryptographic primitive that enforces strict sequential memory execution by chaining data-dependent writes, providing verifiable delay and authorship attestation.

View →

cs.LGcs.AIcs.DCRecentMay 27, 2026

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare +8 more

The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…

View →

cs.DCcs.AIcs.NIRecentMay 31, 2026

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein

The paper proposes moving the query instead of the KV-cache during cross-instance attention, demonstrating that this approach is significantly cheaper than moving the cache, especially on modern GPU f…

View →

cs.CRcs.ARcs.LGRecentMar 20, 2026

Hawkeye: Reproducing GPU-Level Non-Determinism

Erez Badash, Dan Boneh, Ilan Komargodski, Megha Srivastava

Hawkeye is a system that allows perfect, precision-preserving reproduction of GPU-level matrix multiplication operations on a CPU, enabling efficient and trustworthy third-party auditing of machine le…

View →

cs.ARcs.AIcs.DCRecentMay 28, 2026

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Josef Chen

Physical AI inference (batch-1 decode) is primarily memory-bandwidth-bound, but the observed latency gap between fast and slow GPUs is not solely due to memory bandwidth, as launch-side overheads beco…

View →

cs.DCcs.AIcs.LGRecentMay 31, 2026

Lodestar: An Online-Learning LLM Inference Router

Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan +2 more

Lodestar is a novel online learning-based request routing system that significantly improves LLM inference efficiency by dynamically assigning incoming requests to the optimal GPU instance to minimize…

View →

cs.CRRecentApr 17, 2026

Low-Stack HAETAE for Memory-Constrained Microcontrollers

Gustavo Banegas, Kim Youngbeom, Seo Seog Chung, Vredendaal Christine Van

The paper presents a highly optimized, low-stack implementation of the HAETAE signature scheme, reducing peak stack usage significantly to enable its use on severely memory-constrained microcontroller…

View →

cs.PFcs.ARcs.DCRecentMay 27, 2026

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory

Myeong Jun Jo

The paper introduces Rotary GPU, an exploratory execution approach demonstrating that large Mixture-of-Experts models can be run locally on consumer GPUs with limited VRAM, achieving usable decode thr…

View →

cs.CRcs.ARcs.DCRecentMay 19, 2026

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

Nicola Barcarolo, Brahmaiah Gandham, Mohammad Sadrosadati, Roberto Passerone +2 more

This paper investigates the potential of real-world Processing-in-Memory (PIM) architectures, specifically using UPMEM, to accelerate cryptographic algorithms, demonstrating that distributing computat…

View →

cs.CRcs.ARcs.LGRecentApr 25, 2026

Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators

Animan Naskar

Tessera introduces a novel hardware architecture that achieves secure, near-line-rate weight streaming for DNNs on UMA edge accelerators by performing cache-line granularity decryption during DRAM fet…

View →

cs.CRRecentMay 3, 2026

GPU Fingerprinting for Location Verification

Wayne Tee, Jonathan Happel

The paper proposes using hardware fingerprints instead of vulnerable cryptographic keys to enhance the security and robustness of GPU location verification for governing advanced AI development.

View →

cs.CRRecentApr 4, 2026

Partial Number Theoretic Transform Masking in Post-Quantum Cryptography (PQC) Hardware: A Security Margin Analysis

Ray Iskander, Khaled Kirah

The paper analyzes the security of a partially masked hardware accelerator for Number Theoretic Transform (NTT) in PQC, demonstrating that the claimed security margins are significantly overestimated…

View →

cs.DCcs.AIRecentJun 1, 2026

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Yafan Huang, Sheng Di, Guanpeng Li

This paper systematically studies how soft errors propagate during Large Language Model (LLM) inference using a novel fault-injection framework, providing critical insights and mitigation strategies f…

View →