Papers similar to 2604.09235v1

~ similar to 2604.09235v1· 19 results

cs.CRcs.AIRecentApr 12, 2026

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…

View →

cs.CRRecentApr 8, 2026

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao +2 more

MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces Large Language Models to generate correct reasoning steps (Think Well) but output an incorrect final answer (Answer Wrong…

View →

cs.CRcs.CLRecentApr 14, 2026

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more

The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…

View →

cs.CRcs.AIcs.LGRecentMay 18, 2026

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

John T. Halloran, Noopur S. Bhatt

The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…

View →

cs.CRcs.AIcs.RORecentMar 24, 2026

TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches

Zhengxian Huang, Wenjun Zhu, Haoxuan Qiu, Xiaoyu Ji +1 more

This paper introduces TRAP, an adversarial attack that demonstrates how physical patches can hijack the Chain-of-Thought (CoT) reasoning process in Vision-Language-Action (VLA) models, forcing them to…

View →

cs.CRcs.AIRecentMay 12, 2026

CoT-Guard: Small Models for Strong Monitoring

Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi +5 more

The paper introduces CoT-Guard, a small, cost-effective 4B-parameter model that significantly outperforms large, expensive monitors like GPT-5 in detecting hidden objectives in code generation tasks.

View →

cs.CLcs.AIRecentMay 27, 2026

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more

This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…

View →

cs.CRcs.AIEmpiricalRecentJul 22, 2026

Defense Against LLM Backdoors using Critical Neuron Isolation Pruning

Yuxi Li, Zhibo Zhang, Kailong Wang, Xingshuo Han +2 more

The paper introduces DeCNIP, a method for identifying and neutralizing backdoors in large language models using representational analysis and neuron isolation pruning.

View →

cs.CRcs.AIRecentMay 8, 2026

WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

Zhichao Liu, Wenbo Pan, Haining Yu, Ge Gao +2 more

WebTrap introduces a stealthy, mid-task hijacking attack that successfully compromises browser agents during long-horizon tasks by seamlessly fusing malicious instructions with the original user goal.

View →

cs.CRcs.LGRecentMay 19, 2026

Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma +1 more

The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while main…

View →

cs.CRcs.AIcs.LGEmpiricalRecentJul 22, 2026

HijackKV: New Threat in Position-Independent KV Cache Reuse

Yichi Zhang, Zhiqi Wang, Huan Zhang, Yuchen Yang

This paper introduces HIJACKKV, the first attack framework to exploit the new threat of KV Cache Hijacking in large language models, achieving an average success rate of 94%.

View →

cs.CRcs.AIRecentMay 19, 2026

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more

The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…

View →

cs.CRcs.AIcs.LGRecentMay 22, 2026

Enhancing Reliability in LLM-Based Secure Code Generation

Mohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah, David Mohaisen

The paper introduces the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which significantly enhances the security reliability of code generated by LLMs across multiple languages and models.

View →

cs.CRcs.AIRecentApr 4, 2026

SecPI: Secure Code Generation with Reasoning Models via Security Reasoning Internalization

Hao Wang, Niels Mündler, Mark Vero, Jingxuan He +2 more

The paper introduces SecPI, a fine-tuning pipeline that teaches reasoning language models (RLMs) to autonomously internalize structured security reasoning, significantly improving secure code generati…

View →

cs.CRcs.CLcs.LGRecentApr 30, 2026

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more

TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…

View →

cs.CRcs.AIRecentJun 2, 2026

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

Wenqi Chen, Ziyan Zhang, Bing Wang, Lin Liu +2 more

The paper introduces Tree-like Self-Play (TSP), a novel framework that treats secure code generation as a fine-grained decision process, significantly improving LLM security by forcing the model to se…

View →

cs.CRcs.AIcs.IRRecentJun 2, 2026

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more

Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.

View →

cs.CRcs.AIRecentApr 23, 2026

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Naheed Rayhan, Sohely Jahan

The paper introduces Transient Turn Injection (TTI), a novel multi-turn attack technique that exploits stateless moderation in LLMs by distributing adversarial intent across isolated interactions, rev…

View →

cs.CRRecentJun 1, 2026

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more

The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…

View →