~ similar to 2606.06470· 19 results
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
The paper introduces TSVD, a novel framework that efficiently pre-trains LLMs by enforcing both low rank and strict weight orthonormality, achieving performance comparable to full-parameter models wit…
HARP introduces a novel, adaptive, learnable orthogonal processor that significantly improves the robustness and accuracy of extreme low-bit LLM quantization compared to fixed methods.
The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…
NANOZK introduces a novel, highly efficient zero-knowledge proof system that allows users to cryptographically verify that the output of a large language model (LLM) was generated by a specific, claim…
The paper proposes FOAM, an adaptive damping method that stabilizes the Shampoo optimization algorithm by dynamically controlling damping and eigendecomposition frequency, thereby reducing staleness-i…
This paper demonstrates that Large Language Models (LLMs) can serve as accurate and selective surrogates for costly GPU kernel performance measurements, significantly expanding the search space for op…
Han Liu, Shanghao Shi, Yevgeniy Vorobeychik, Chongjie Zhang +1 more
This paper demonstrates that adversarial perturbations possess a low-rank structure, and proposes a two-step method to leverage this property to significantly improve the efficiency and effectiveness…
Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li +1 more
The paper proposes DiReCT, a novel framework that treats data selection during LLM annealing as a constrained optimization problem based on the spectral geometry of the loss landscape, achieving state…
The scaling exponent in neural scaling laws is not fixed but systematically depends on the optimizer used, with preconditioned optimizers generally yielding steeper scaling.
The paper proposes a decision-aware quadratic replacement for the ReLU activation function, enabling low-degree, calibration-lossless polynomial approximations for neural network inference under Fully…
The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…
Guoci Chen, Xiurui Pan, Qiao Li, Bo Mao +4 more
The paper introduces TIGER, a GPU-accelerated framework that significantly speeds up high-precision evaluation of nonlinear layers for encrypted LLM inference using TFHE.
CSULoRA is a post-hoc method that corrects trained LoRA adapters by estimating a safety-aligned subspace and solving a penalized minimum-change problem to attenuate unsafe update directions while pres…
This paper develops optimized algorithms and a pipeline architecture for high-throughput, memory-efficient batch processing of encrypted neural network inference, significantly improving performance o…
This paper provides the first non-vacuous generalization analysis for the Stochastic Variance Reduced Gradient (SVRG) method by establishing sharp, data-dependent algorithmic stability bounds, thereby…
Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf +2 more
The paper introduces Parallax, a scalable and numerically stable parameterized Local Linear Attention mechanism that significantly improves LLM performance and efficiency compared to existing methods…
Physical AI inference (batch-1 decode) is primarily memory-bandwidth-bound, but the observed latency gap between fast and slow GPUs is not solely due to memory bandwidth, as launch-side overheads beco…
The paper introduces Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) to achieve massive, structured compression of deep neural networks, demonstrating compression ratios up to 77,000x…