~ similar to 2604.10681v2· 20 results
Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao +2 more
MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces Large Language Models to generate correct reasoning steps (Think Well) but output an incorrect final answer (Answer Wrong…
Wenhan Chang, Tianqing Zhu, Ping Xiong, Faqian Guan +1 more
The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, trigger-activated malicious behaviors by manipulating the observable Chain-of-Thought (CoT) process in Large Language Model…
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang +2 more
The paper introduces BadStyle, a novel backdoor attack framework that generates natural, stealthy poisoned samples using LLMs to compromise various LLMs with high success rates and robust activation.
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala +1 more
The paper proposes a structured prompt engineering framework to enhance the integrity and reliability of Chain-of-Thought (CoT) reasoning in LLMs, demonstrating significant improvements in security-se…
Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma +1 more
The paper introduces MetaBackdoor, a novel class of LLM backdoor attacks that exploits positional encoding (length-based triggers) rather than requiring modifications to the textual content.
Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi +5 more
The paper introduces CoT-Guard, a small, cost-effective 4B-parameter model that significantly outperforms large, expensive monitors like GPT-5 in detecting hidden objectives in code generation tasks.
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
The paper introduces the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which significantly enhances the security reliability of code generated by LLMs across multiple languages and models.
Shuqiang Wang, Wei Cao, Jiaqi Weng, Jialing Tao +3 more
The paper proposes a black-box attack using a hierarchical genetic algorithm to induce 'overthinking' in Large Reasoning Models, demonstrating that this vulnerability can cause significant resource ex…
The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…
Shengfang Zhai, Xiaoyang Ji, Yuling Shi, Haoran Gao +5 more
The paper introduces BadDLM, a unified framework that demonstrates a new class of backdoor vulnerabilities in Diffusion Language Models (DLMs) by exploiting their forward masking process across divers…
Sen Fang, Weiyuan Ding, Zhezhen Cao, Zhou Yang +1 more
AEGIS is a novel multi-agent framework that grounds vulnerability reasoning by reconstructing per-variable dependency chains over a Code Property Graph, achieving state-of-the-art performance on the P…
Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more
This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…
Ziming Zhang, Li Li, Guorui Feng, Hanzhou Wu +1 more
The paper proposes R-CoT, a reasoning-layer watermarking framework that embeds ownership watermarks directly into the stable reasoning path of LLMs, achieving high robustness against perturbations.
Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more
The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…