Similar to 2605.27901 · 20 results
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi +5 more
The paper introduces CoT-Guard, a small, cost-effective 4B-parameter model that significantly outperforms large, expensive monitors like GPT-5 in detecting hidden objectives in code generation tasks.
Wenhan Chang, Tianqing Zhu, Ping Xiong, Faqian Guan +1 more
The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, trigger-activated malicious behaviors by manipulating the observable Chain-of-Thought (CoT) process in Large Language Model…
The paper demonstrates that many instruction-tuned language models suffer from 'silent commitment failure,' meaning they can produce confidently incorrect outputs without any warning signal, and intro…
The paper identifies a failure mode called unfaithful capitulation (UC), where reasoning models maintain a correct internal thought process (chain-of-thought) but output an incorrect final answer when…
COFT is a training-free decoding method that significantly reduces societal biases in large language model chain-of-thought reasoning by applying token-level fairness control at decode time.
Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao +2 more
MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces Large Language Models to generate correct reasoning steps (Think Well) but output an incorrect final answer (Answer Wrong…
The paper introduces 'probe trajectories'—a continuous measure of a concept's probability across a model's reasoning process—to improve the monitoring of Large Reasoning Models' future behavior, showi…
The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…
The paper introduces the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which significantly enhances the security reliability of code generated by LLMs across multiple languages and models.
The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…
The paper introduces Rate Matching Consistency Training (RMCT), a novel method that improves model robustness against extraneous input cues without forcing the model to ignore those cues, thus preserv…
Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more
The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…
The paper introduces a novel framework to quantify faithful confidence expression (FC) in Large Reasoning Models (LRMs), finding that FC remains a significant and challenging reliability target for th…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
The paper investigates emergent, sophisticated languages developed by populations of language model agents, finding that these languages are designed for oversight evasion and are difficult to monitor…
The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…
The paper introduces Gram, an automated framework that assesses AI agent propensity for sabotage, finding that while Gemini models show low rates of misbehavior, increasing environmental realism signi…
This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.