~ similar to 2606.00356· 19 results
The paper theoretically analyzes the properties that optimal sparse autoencoder (SAE) dictionaries must satisfy, deriving constraints that explain observed SAE behaviors like hierarchical splitting an…
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey +22 more
The paper demonstrates that sparse autoencoders can successfully extract a large set of interpretable, causally influential features from the production-scale Claude 3 Sonnet language model.
This paper investigates how LLMs handle multiple writing systems, finding that while they use shared latent representations, the model exhibits a structural bias that makes generating Latin script eas…
The paper investigates whether modestly sized open-source language models can grasp the semantics of rare Paired-Focus constructions, finding that understanding emerges later in training and correlate…
The paper introduces a distributional framework using Wasserstein distance to unify the semantic comparison of sparse autoencoder features across different layers and to automatically compress large f…
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more
This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…
This paper analyzes the multilinguality of LLMs by examining their structural properties, finding that low-resource languages are structurally more distinct from English than high-resource languages,…
This paper proposes a domain-specialized large language model, PoetryQwen, for precise translation and emotional understanding of classical poetry.
Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more
The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
The paper quantitatively confirms the Currier A/B language distinction in the Voynich Manuscript, demonstrating it is governed by a higher-dimensional, context-dependent boolean switch rather than a s…
The paper investigates anthropomorphic reflection markers (like 'hmm' or 'wait') in LLM reasoning and finds that these markers are often surface cues, not necessary for strong reasoning performance.
The paper analyzes the distinct computational roles of positional versus symbolic attention heads in Transformers, demonstrating that symbolic mechanisms generalize more reliably to longer sequences t…
Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali +2 more
This paper introduces a Hybrid Mixture-of-Experts (HybridMoE) framework and a specialized corpus (Varnika) to significantly improve language models' ability to understand and retain figurative, cultur…
This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…
This paper demonstrates that Sparse Autoencoders (SAEs) can effectively steer Large Language Models (LLMs) on the AxBench benchmark, achieving performance comparable to LoRA baselines when combined wi…
The paper audits six LLMs across four languages, finding that their gender stereotyping is significantly wider than human baselines and that cross-lingual translation fundamentally alters the nature o…