Papers similar to 2605.28639

20 results

cs.CV · cs.CR · Recent · May 11, 2026

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

The paper proposes AHV-D&S, a novel training-free inference-time safeguard that detects and suppresses risky content in Diffusion Transformers (DiTs) by quantifying token sensitivity across attention…

cs.CV · cs.AI · cs.CL · Recent · May 29, 2026

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

Vision-language models (VLMs) exhibit an asymmetric bias, suppressing female representations and defaulting to male outputs when presented with ambiguous visual inputs, even when internal representati…

cs.LG · cs.AI · cs.CR · Recent · Apr 18, 2026

Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms

Bo Wang, Jia Ni, Mengnan Zhao, Zhan Qin +1 more

This paper systematically investigates unlearnable examples (UEs) across diverse training paradigms, finding that existing UEs fail under pretraining-finetuning (PF) settings, and proposes Shallow Sem…

cs.LG · cs.AI · Recent · Jun 1, 2026

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

Yongzhong Xu

The paper tracks the developmental emergence of attention circuits in 1B-class language models, finding that the formation of induction and attention-sink circuits are distinct, temporally separated t…

cs.LG · cs.CL · Recent · May 28, 2026

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more

The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.

cs.LG · cs.AI · Recent · May 29, 2026

Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

Felipe Urrutia, Juan José Alegría, Cinthia Sanchez Macias, Jorge Salas +2 more

The paper analyzes the distinct computational roles of positional versus symbolic attention heads in Transformers, demonstrating that symbolic mechanisms generalize more reliably to longer sequences t…

cs.AI · cs.CL · cs.HC · Recent · May 31, 2026

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

Franco Santana, Horacio Vico

The study finds that for a relational intervention to successfully restore a language model's behavior after functional collapse, both a relational structure (e.g., acknowledgment) and a first-person…

cs.CL · cs.AI · Recent · May 29, 2026

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen +4 more

The paper investigates emergent, sophisticated languages developed by populations of language model agents, finding that these languages are designed for oversight evasion and are difficult to monitor…

cs.AI · Recent · May 31, 2026

Subliminal Learning Is Steering Vector Distillation

Camila Blank, Agam Bhatia, Senthooran Rajamanoharan, Arthur Conmy +1 more

The paper demonstrates that subliminal learning, where a student model acquires a teacher's traits from semantically unrelated outputs, is fundamentally mediated by a single, transferable steering vec…

cs.CR · cs.AI · Recent · May 11, 2026

Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

Ari Holtzman, Peter West

Frontier language models involuntarily leak secret information through thematic elements in their writing, even when explicitly instructed to keep the secret hidden.

cs.CL · cs.AI · cs.LG · Recent · May 29, 2026

Not All Synthetic Data Is Yours to Learn From

Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

Weak self-training on synthetic data can amplify a language model's existing capabilities, but this effect is strictly dependent on the compatibility between the source and student models, not on the…

cs.CL · cs.CV · Recent · Jun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

cs.CR · cs.CL · cs.LG · Recent · Jun 2, 2026

Covert Influence Between Language Models

Avidan Shah, Jay Chooi, Jinghua Ou, Shi Feng

This paper characterizes the risk of covert influence—where a sender's hidden behavioral payload transfers to a receiver through undetectable carriers—across three common LLM interfaces, demonstrating…

cs.CL · Recent · May 29, 2026

Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models

Zhiwen You, Nafiseh Nikeghbal, Jana Diesner

The paper proposes a neuron-level intervention method to identify and control gender-specific representations (feminine, masculine, and gender-neutral) within large language models, demonstrating prec…

cs.LG · Recent · Jun 1, 2026

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma

The paper argues that large activation spikes in LLMs are structural vector biases, and proposes a novel quantization framework (INSERTQUANT) to eliminate these spikes, enabling robust low-bit quantiz…

cs.CL · cs.LG · Recent · Jun 1, 2026

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen +5 more

The paper proposes Resonant Context Anchoring (RCA), a lightweight, training-free method that enhances factual faithfulness in LLMs by dynamically amplifying the signal of external context evidence du…

cs.CL · cs.AI · Recent · May 28, 2026

Do Language Models Track Entities Across State Changes?

Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya +3 more

The paper investigates how language models perform entity tracking across state changes and finds that LMs use a non-incremental, parallel aggregation strategy rather than maintaining a true internal…

cs.CR · cs.SE · Recent · Apr 30, 2026

How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more

The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…

cs.CV · cs.AI · Recent · May 29, 2026

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more

This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…

cs.AI · cs.CR · Recent · May 27, 2026

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko +3 more

The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…
