~ similar to 2606.02358· 19 results
Voktho Das, M Zafir Sadik Khan, Jafar Vafaei, Kimia Azar +1 more
The paper proposes a hybrid ASIC+eFPGA architecture to enhance the security and resilience of edge LLM inference accelerators against both runtime and supply-chain attacks.
O-POPE is a novel outer-product engine that accelerates floating-point GEMM by repurposing FPU pipeline registers as buffers, achieving high utilization and improved energy efficiency.
The paper introduces BLADEI, a hardware-accelerated framework that screens FPGA configuration bitstreams for anomalies in real-time, overcoming the latency bottleneck of traditional software-based det…
Tessera introduces a novel hardware architecture that achieves secure, near-line-rate weight streaming for DNNs on UMA edge accelerators by performing cache-line granularity decryption during DRAM fet…
The paper proposes n-VM, a novel Layer-1 architecture that unifies multiple heterogeneous virtual machines (VMs) onto a shared consensus and state layer, solving cross-chain fragmentation issues.
AEGIS is a novel system that significantly improves the scalability of running large, long-sequence Transformer models under Fully Homomorphic Encryption (FHE) on multi-GPU systems by optimizing data…
The paper proposes a Ferroelectric Charge-Domain Compute Cell (FCDC) using HZO memcapacitors to perform attention computation, achieving significant energy efficiency gains, especially for long-reside…
This paper provides the first systematic, isolated benchmarks of NIST-standardized post-quantum cryptography (ML-KEM and ML-DSA) on the highly constrained ARM Cortex-M0+ processor, showing performance…
The elasticAI.explorer is an extensible, unified Python framework that simplifies hardware-aware Neural Architecture Search (NAS) by decoupling search space definition from model implementation and de…
The paper characterizes 'dead-entry' TLB misses in GPUs, which occur when recently evicted translations are immediately re-walked, and proposes DEPOT, a Bloom filter mechanism that significantly reduc…
This paper investigates Confused Deputy Attacks (CDAs) on AI Accelerators (AIAs) and finds that CDA is feasible on most major vendor AIAs, impacting a vast number of devices.
Physical AI inference (batch-1 decode) is primarily memory-bandwidth-bound, but the observed latency gap between fast and slow GPUs is not solely due to memory bandwidth, as launch-side overheads beco…
This paper investigates the potential of real-world Processing-in-Memory (PIM) architectures, specifically using UPMEM, to accelerate cryptographic algorithms, demonstrating that distributing computat…
The paper systematically characterizes the fault response of the Intel NCS2 accelerator to electromagnetic fault injection, revealing a major degradation mode that is undetectable by standard inferenc…
Jianming Tong, Jingtian Dang, Simon Langowski, Tianhao Huang +5 more
The paper introduces MORPH, a framework that reformulates Zero-Knowledge Proof (ZKP) computations to efficiently utilize AI ASICs like TPUs, achieving up to 10x higher throughput on NTT.
Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan +2 more
Lodestar is a novel online learning-based request routing system that significantly improves LLM inference efficiency by dynamically assigning incoming requests to the optimal GPU instance to minimize…
The paper introduces memorywire, a vendor-neutral JSON-Schema wire format and reference implementation designed to standardize and govern memory operations across disparate agent-memory frameworks.
Junyi Yang, Shuai Dong, Zhengnan Fu, Hongyang Shang +1 more
The paper proposes a highly reconfigurable 256x128 in-memory computing array that significantly improves efficiency and performance for analog computing by introducing novel components for ADC, weight…
The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…