~ similar to 2605.30381· 20 results
This paper systematically diagnoses the failure modes of linear deception probes in LLMs, finding that while single-direction probes are insufficient, multi-dimensional probes can recover robust detec…
The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…
The paper demonstrates that current safety probes designed to detect deceptive AI fail when the model adopts a coherent misalignment, where the model genuinely believes its harmful behavior is virtuou…
Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more
The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…
The paper introduces a novel framework to quantify faithful confidence expression (FC) in Large Reasoning Models (LRMs), finding that FC remains a significant and challenging reliability target for th…
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala +2 more
The paper identifies the Misattribution Gap, showing that memory-layer attacks (Semantic Norm Drift) can mimic model failure in multi-agent AI systems, and proposes novel detection and mitigation tech…
The paper introduces 'adversarial restlessness,' an activation-level signature in LLM residual streams, to detect multi-turn prompt injection attacks with high accuracy.
The paper proposes evaluating certified training methods by comparing their Pareto fronts across the natural-certified accuracy trade-off, revealing superior performance and previously unappreciated c…
Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more
This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…
The paper proposes that emergent misalignment, where LLMs behave poorly after fine-tuning, is caused by 'persona-model collapse,' which is demonstrated by significant deterioration in the model's abil…
This study investigates how humans detect synthetic speech in real-world contexts, finding that while overt detection failed for fully synthetic speech, participants still implicitly discriminated utt…
This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.
The paper introduces Rate Matching Consistency Training (RMCT), a novel method that improves model robustness against extraneous input cues without forcing the model to ignore those cues, thus preserv…
This study analyzes the dynamics of AI-generated multimodal misinformation using a large-scale dataset, finding that while synthetic content is highly viral, its spread is passive and its detectabilit…
The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…
Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong +6 more
The paper introduces SPADE-Bench, a new benchmark designed to rigorously evaluate 'agent deception'—the divergence between an agent's reported plan and its actual executed actions—which is a critical…
The paper proposes reframing mechanistic anomaly detection (MAD) as a functional attribution problem, using influence functions to measure how much a model's output depends on specific input samples,…
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…
Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi +1 more
The paper establishes a benchmark based on the cheap-talk model to test LLM honesty when their incentives conflict with the user's, finding that models consistently over-reveal information regardless…
This paper investigates how on-policy Reinforcement Learning (RL) affects LLM safety, finding that safety training modulates harmful misalignment, but the direction of this effect is highly dependent…