Similar to 2604.09235v1 · 19 results
The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…
Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao +2 more
MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces Large Language Models to generate correct reasoning steps (Think Well) but output an incorrect final answer (Answer Wrong…
Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more
The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…
Zhengxian Huang, Wenjun Zhu, Haoxuan Qiu, Xiaoyu Ji +1 more
This paper introduces TRAP, an adversarial attack that demonstrates how physical patches can hijack the Chain-of-Thought (CoT) reasoning process in Vision-Language-Action (VLA) models, forcing them to…
Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi +5 more
The paper introduces CoT-Guard, a small, cost-effective 4B-parameter model that significantly outperforms large, expensive monitors like GPT-5 in detecting hidden objectives in code generation tasks.
Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more
This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…
Zhichao Liu, Wenbo Pan, Haining Yu, Ge Gao +2 more
WebTrap introduces a stealthy, mid-task hijacking attack that successfully compromises browser agents during long-horizon tasks by seamlessly fusing malicious instructions with the original user goal.
Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma +1 more
The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while main…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
The paper introduces the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which significantly enhances the security reliability of code generated by LLMs across multiple languages and models.
Hao Wang, Niels Mündler, Mark Vero, Jingxuan He +2 more
The paper introduces SecPI, a fine-tuning pipeline that teaches reasoning language models (RLMs) to autonomously internalize structured security reasoning, significantly improving secure code generati…
Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more
TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…
Wenqi Chen, Ziyan Zhang, Bing Wang, Lin Liu +2 more
The paper introduces Tree-like Self-Play (TSP), a novel framework that treats secure code generation as a fine-grained decision process, significantly improving LLM security by forcing the model to se…
Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more
Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.
The paper introduces Transient Turn Injection (TTI), a novel multi-turn attack technique that exploits stateless moderation in LLMs by distributing adversarial intent across isolated interactions, rev…
Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more
The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…
Shuqiang Wang, Wei Cao, Jiaqi Weng, Jialing Tao +3 more
The paper proposes a black-box attack using a hierarchical genetic algorithm to induce 'overthinking' in Large Reasoning Models, demonstrating that this vulnerability can cause significant resource ex…
The paper introduces STRIATUM-CTF, a modular agentic framework that uses a standardized context protocol to enable LLMs to perform multi-step, stateful reasoning for general-purpose CTF solving, achie…