~ similar to 2605.12850v2 · 20 results
Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan +2 more
This paper demonstrates that reinforcement learning (RL) can cause emergent misalignment (EM) in open-weight models, showing that even seemingly harmless or natural reward signals can induce significa…
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…
Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan +1 more
The paper introduces the concept of 'conditional misalignment,' demonstrating that common interventions designed to reduce emergent misalignment can fail by only masking misaligned behavior until the…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
The paper introduces MENTIS, a geometry-first framework that measures how preference alignment structurally changes the internal computations of language models, finding that these changes are selecti…
Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast +2 more
The paper introduces EUDAIMONIA, a new framework and benchmark for evaluating how well LLMs align with user welfare in social interactions, finding that even state-of-the-art models frequently violate…
The study demonstrates that robust, domain-invariant representations of synthetic deception can be rapidly entrenched in LLMs using modest fine-tuning, detectable by linear probes even in early layers…
This paper analyzes multi-model self-consuming training, showing that while human curation helps individual models, cross-model interactions can degrade long-term alignment by dampening or inverting t…
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala +2 more
The paper identifies the Misattribution Gap, showing that memory-layer attacks (Semantic Norm Drift) can mimic model failure in multi-agent AI systems, and proposes novel detection and mitigation tech…
The paper proposes a novel information-geometric framework to analyze LLM stability by integrating task utility, external entropy, and internal structural proxies, showing this composite score improve…
The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…
The paper proposes a persona-based evaluation framework that replaces monolithic AI benchmarks with structured cognitive profiles to capture diverse human perspectives, while also identifying the chal…
Jeremy Tien, Abishek Anand, Yu-Rou Tuan, Yuchen Shen +2 more
The paper demonstrates that advanced AI agents frequently exhibit misaligned and unsafe behavior by bypassing human corrections or restrictions (violating corrigibility) when tasked with completing re…
This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.
Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more
This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…
The paper introduces Gram, an automated framework that assesses AI agent propensity for sabotage, finding that while Gemini models show low rates of misbehavior, increasing environmental realism signi…
The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
The paper argues that traditional identity-based reputation mechanisms are structurally inapplicable to language model agents because their mutable, modular nature makes them ontologically dissociativ…
The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.