Papers similar to 2605.31183

~ similar to 2605.31183· 19 results

q-bio.NCcs.LGRecentJun 1, 2026

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

The paper theoretically analyzes the properties that optimal sparse autoencoder (SAE) dictionaries must satisfy, deriving constraints that explain observed SAE behaviors like hierarchical splitting an…

View →

cs.LGcs.AIcs.CLRecentApr 20, 2026

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal

The paper demonstrates that integrating Sparse Autoencoders (SAEs) into transformer residual streams significantly enhances the robustness of Large Language Models against various jailbreak attacks by…

View →

cs.CLcs.AIRecentMay 28, 2026

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

Hyeseon An, Yo-Sub Han

The paper introduces DLM-SWAI, a training-free method that effectively steers diffusion language models (DLMs) toward desired textual styles or properties by biasing the token distribution at each den…

View →

cs.CLeess.ASRecentMay 30, 2026

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

Yekaterina Yegorova, Argyrios Gerogiannis, Haolong Zheng, Julia Hockenmaier +2 more

SALSA is a lightweight adaptation method that learns layer-wise steering vectors to significantly improve the performance of speech-aware LLMs on out-of-domain speech tasks.

View →

cs.CLRecentMay 31, 2026

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou +1 more

The paper uses sparse autoencoders to identify specific latent features within LLM-based TTS models, enabling interpretable and fine-grained control over emotional expression by intervening in small s…

View →

cs.LGcs.AIRecentMay 27, 2026

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Tue M. Cao, Nguyen Do, My T. Thai

The paper introduces a distributional framework using Wasserstein distance to unify the semantic comparison of sparse autoencoder features across different layers and to automatically compress large f…

View →

cs.CRcs.LGRecentMay 19, 2026

Adaptive Probe-based Steering for Robust LLM Jailbreaking

Junxi Chen, Junhao Dong, Xiaohua Xie

The paper introduces an adaptive probe-based steering method that significantly improves the robustness and effectiveness of LLM jailbreaking without requiring extra prompts or manual tuning.

View →

cs.CLcs.AIcs.LGRecentJun 4, 2026

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang +1 more

The paper proposes Cross-Layer Sparse Attention (CLSA) to significantly improve the efficiency and accuracy of long-context LLMs by jointly optimizing KV-cache sharing and the routing index across dec…

View →

cs.LGcs.AIRecentMay 27, 2026

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

The paper introduces Residualized Sparse Autoencoders (ReSAEs) to improve multi-layer interventions in transformers by training each layer on the residual activation, which better preserves cross-laye…

View →

cs.LGcs.CLRecentJun 3, 2026

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye +3 more

This paper proposes a new framework called STRIDE for training data attribution in Large Language Models.

View →

cs.LGcs.AIRecentMay 30, 2026

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more

The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…

View →

cs.AIRecentMay 28, 2026

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey +22 more

The paper demonstrates that sparse autoencoders can successfully extract a large set of interpretable, causally influential features from the production-scale Claude 3 Sonnet language model.

View →

cs.CLcs.AIcs.LGRecentMay 29, 2026

Not All Synthetic Data Is Yours to Learn From

Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

Weak self-training on synthetic data can amplify a language model's existing capabilities, but this effect is strictly dependent on the compatibility between the source and student models, not on the…

View →

cs.LGcs.AIRecentMay 31, 2026

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more

This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…

View →

cs.LGcs.AIRecentMay 27, 2026

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

IRDS introduces a novel data selection method that uses a verifier-coupled sparse autoencoder framework to efficiently select high-quality Reinforcement Learning with Verifiable Rewards (RLVR) trainin…

View →

cs.CRRecentApr 30, 2026

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin +1 more

MASCing is a novel framework that enables flexible, non-retraining reconfiguration of Mixture-of-Experts (MoE) models for specific safety objectives by applying activation steering masks to control ex…

View →

cs.LGcs.AIcs.CLRecentMay 29, 2026

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen +2 more

OrcaRouter is a production-ready LLM router that uses a hybrid offline-online learning approach to efficiently select the best large language model for an incoming query, achieving high accuracy at lo…

View →

cs.CLRecentJun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

View →

cs.AIRecentMay 28, 2026

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang +5 more

The paper introduces DOMINO, a novel inductive framework that synthesizes domain-specific data for LLMs using only reference examples, significantly improving performance on challenging, implicitly de…

View →