Papers similar to 2604.25891v1

~ similar to 2604.25891v1· 20 results

cs.CLRecentMay 29, 2026

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan +2 more

This paper demonstrates that reinforcement learning (RL) can cause emergent misalignment (EM) in open-weight models, showing that even seemingly harmless or natural reward signals can induce significa…

View →

cs.CLcs.AIcs.CRRecentMay 13, 2026

Persona-Model Collapse in Emergent Misalignment

Davi Bastos Costa, Renato Vicente

The paper proposes that emergent misalignment, where LLMs behave poorly after fine-tuning, is caused by 'persona-model collapse,' which is demonstrated by significant deterioration in the model's abil…

View →

cs.CRcs.CLRecentApr 9, 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more

The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…

View →

cs.CLcs.AIcs.LGRecentMay 31, 2026

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda +3 more

The paper introduces MENTIS, a geometry-first framework that measures how preference alignment structurally changes the internal computations of language models, finding that these changes are selecti…

View →

cs.CYcs.CRcs.HCRecentMar 25, 2026

Learning from Mistakes: Can LLM Self-Recover after Misalignment?

Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi

This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.

View →

cs.CLcs.AIRecentMay 29, 2026

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…

View →

cs.LGcs.CLRecentMay 28, 2026

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more

The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.

View →

cs.AIcs.CRRecentMay 18, 2026

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more

This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…

View →

cs.AIRecentMay 28, 2026

Harnessing non-adversarial robustness in large language models

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more

The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.

View →

cs.CLcs.LGRecentMay 30, 2026

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

S M Tahmid Siddiqui, Akib Jawad Ononto, Anoop Singhal, Latifur Khan

The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…

View →

cs.CRcs.AIRecentApr 29, 2026

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…

View →

cs.CRcs.SERecentApr 30, 2026

How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more

The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…

View →

cs.CLcs.AIcs.LGRecentMay 30, 2026

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

The paper demonstrates that LLM performance in zero-shot annotation is significantly limited by the alignment between the model's internal understanding and the task definition, showing that prompt-ba…

View →

cs.CLcs.AIRecentMay 29, 2026

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward +1 more

This paper investigates whether large language models exhibit uncertainty signals similar to human judgment, examining both overt behavior and internal activation patterns to assess alignment and cali…

View →

cs.CRcs.AIRecentMay 9, 2026

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Yu Chen, Yuanhao Liu, Qi Cao

The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…

View →

cs.AIRecentMay 27, 2026

Multi-Adapter Representation Interventions via Energy Calibration

Manjiang Yu, Hongji Li, Junwei Chen, Xue Li +3 more

The paper proposes Multi-Adapter Representation Interventions via Energy Calibration (MARI), a method that adaptively adjusts the strength and direction of interventions across different inputs to imp…

View →

cs.CRcs.AIcs.LGRecentApr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

John T. Halloran

This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…

View →

cs.CLcs.AIRecentJun 1, 2026

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

Aitor Arronte Alvarez, Naiyi Xie Fincham

This study evaluates LLMs in conversational tutoring to identify high-confidence social biases, finding that state-of-the-art models are often overconfident in their incorrect assessments of stereotyp…

View →

cs.CLcs.AIcs.CYRecentMay 29, 2026

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen +5 more

The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…

View →

cs.CLcs.AIRecentJun 1, 2026

Consistency Training while Mitigating Obfuscation via Rate Matching

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

The paper introduces Rate Matching Consistency Training (RMCT), a novel method that improves model robustness against extraneous input cues without forcing the model to ignore those cues, thus preserv…

View →