~ similar to 2606.02333· 19 results
This paper presents a hardware-oriented description of GoldenFloat, a static-split floating-point family, and its concrete artefacts.
Hawkeye is a system that allows perfect, precision-preserving reproduction of GPU-level matrix multiplication operations on a CPU, enabling efficient and trustworthy third-party auditing of machine le…
The paper introduces performance ruggedness analysis to quantify performance variance in GEMM workloads, proposing a two-stage software stack that significantly smooths the performance landscape and b…
The paper introduces Chimera, a highly efficient and scalable MCU designed for ultra-low-power edge AI inference, achieving 3.1 TOPS/W by integrating a dedicated transformer accelerator and a QoS-guar…
The paper proposes ZK-Flex, a flexible software-hardware co-designed framework that significantly accelerates Zero-Knowledge Proof (ZKP) generation by efficiently handling diverse polynomial and ellip…
The paper proposes ZK-Flex, a flexible software-hardware co-designed framework that significantly accelerates Zero-Knowledge Proof (ZKP) generation by efficiently handling diverse polynomial and ellip…
The paper proposes a Ferroelectric Charge-Domain Compute Cell (FCDC) using HZO memcapacitors to perform attention computation, achieving significant energy efficiency gains, especially for long-reside…
Jianming Tong, Jingtian Dang, Simon Langowski, Tianhao Huang +5 more
The paper introduces MORPH, a framework that reformulates Zero-Knowledge Proof (ZKP) computations to efficiently utilize AI ASICs like TPUs, achieving up to 10x higher throughput on NTT.
This paper provides the first systematic, isolated benchmarks of NIST-standardized post-quantum cryptography (ML-KEM and ML-DSA) on the highly constrained ARM Cortex-M0+ processor, showing performance…
Harshita Gupta, Mayank Kabra, Jaewoo Park, Priyam Mehta +8 more
The paper characterizes Homomorphic Encryption (HE) operations on a real-world Processing-In-Memory (PIM) system, demonstrating that while PIM is a viable alternative to CPUs/GPUs, performance is limi…
The paper characterizes 'dead-entry' TLB misses in GPUs, which occur when recently evicted translations are immediately re-walked, and proposes DEPOT, a Bloom filter mechanism that significantly reduc…
OpenEye is a scalable, sparsity-aware FPGA-based hardware accelerator designed to efficiently execute common deep neural network operations, demonstrating favorable performance-resource trade-offs acr…
This paper investigates the potential of real-world Processing-in-Memory (PIM) architectures, specifically using UPMEM, to accelerate cryptographic algorithms, demonstrating that distributing computat…
The paper introduces BOUNDARY FLOW, an LLVM-based framework that enhances kernel fuzzing and analysis by extracting per-task, state-aware data-flow information (arguments and return values) at functio…
The paper introduces a four-stage structural dependency analysis hierarchy that enables scalable, sound first-order masking verification for large, production-level post-quantum cryptographic accelera…
The paper introduces a novel hardware aging attack that exploits the commutative properties of addition to induce unbalanced stress on AI accelerator transistors, significantly degrading model accurac…
The paper analyzes the security of a partially masked hardware accelerator for Number Theoretic Transform (NTT) in PQC, demonstrating that the claimed security margins are significantly overestimated…
This empirical study of Pearl's cuPOW protocol demonstrates that the network's Proof-of-Useful-Work mechanism generates zero useful AI computation, instead causing economic harm and displacing legitim…
Tessera introduces a novel hardware architecture that achieves secure, near-line-rate weight streaming for DNNs on UMA edge accelerators by performing cache-line granularity decryption during DRAM fet…