~ similar to 2606.02385· 18 results
The paper introduces a distributional framework using Wasserstein distance to unify the semantic comparison of sparse autoencoder features across different layers and to automatically compress large f…
The paper demonstrates that integrating Sparse Autoencoders (SAEs) into transformer residual streams significantly enhances the robustness of Large Language Models against various jailbreak attacks by…
The paper introduces Residualized Sparse Autoencoders (ReSAEs) to improve multi-layer interventions in transformers by training each layer on the residual activation, which better preserves cross-laye…
This paper demonstrates that Sparse Autoencoders (SAEs) can effectively steer Large Language Models (LLMs) on the AxBench benchmark, achieving performance comparable to LoRA baselines when combined wi…
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey +22 more
The paper demonstrates that sparse autoencoders can successfully extract a large set of interpretable, causally influential features from the production-scale Claude 3 Sonnet language model.
The study investigates the generalization of auto-generated natural-language labels for language model features, finding that while the underlying features show cross-lingual semantic consistency, the…
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more
This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…
The paper formalizes the problem of representation identifiability in supervised learning, showing that a representation property is identifiable if and only if it is constant across all possible fact…
Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou +1 more
The paper introduces residualized temporal Sparse Autoencoders (SAEs) to analyze the full spatiotemporal structure of activations generated during the iterative denoising process of diffusion models,…
Lixuan Guo, Yifei Wang, Tiansheng Wen, Aosong Feng +2 more
The paper introduces Single-stage Sparse Retrieval (SSR), a method that replaces computationally expensive vector clustering with sparse autoencoding to achieve highly efficient multi-vector retrieval…
Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou +1 more
The paper uses sparse autoencoders to identify specific latent features within LLM-based TTS models, enabling interpretable and fine-grained control over emotional expression by intervening in small s…
The paper introduces Latent Terms, a method that shows dense retrieval models implicitly learn sparse, Zipfian vocabularies that can be used for classical BM25-style sparse scoring without requiring s…
The paper proposes a unified, constrained optimization framework using KL divergence and likelihood constraints to achieve effective and principled unlearning in diffusion models.
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more
The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…
The paper introduces the Vector Network (VN), a novel recurrent architecture that replaces fixed weight matrices with reusable weight atoms, enabling superior compositional generalization by making st…
The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…
IRDS introduces a novel data selection method that uses a verifier-coupled sparse autoencoder framework to efficiently select high-quality Reinforcement Learning with Verifiable Rewards (RLVR) trainin…