Papers similar to 2605.13411v1

~ similar to 2605.13411v1· 20 results

cs.CRcs.CLRecentMay 29, 2026

EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

Yu Li, Yuenan Hou, Yingmei Wei, Yanming Guo +1 more

EvoDefense introduces an experience-guided, co-evolving black-box defense mechanism that significantly improves the robustness of LLMs against unseen and diverse attacks without requiring model retrai…

View →

cs.CRcs.CLRecentMay 29, 2026

EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

Yu Li, Yuenan Hou, Yingmei Wei, Yanming Guo +1 more

EvoDefense introduces an experience-guided, co-evolving black-box defense mechanism that significantly improves LLM robustness against unseen and diverse attacks without requiring model retraining.

View →

cs.CRcs.CLRecentApr 24, 2026

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien

The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…

View →

cs.CRcs.AIRecentApr 5, 2026

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Siyuan Li, Zehao Liu, Xi Lin, Qinghua Mao +5 more

CoopGuard is a novel stateful, multi-round defense framework using cooperative agents to significantly reduce the success rate of evolving adversarial attacks against Large Language Models.

View →

cs.CRcs.CLRecentJun 4, 2026

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim +4 more

Membrane introduces a self-evolving guardrail using Contrastive Safety Memory (CSM) that generalizes across topical jailbreak variants, achieving superior safety performance while minimizing benign re…

View →

cs.CRcs.AIRecentMay 7, 2026

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou +4 more

SafeHarbor is a novel, hierarchical memory-augmented framework that establishes context-aware decision boundaries for LLM agents, achieving state-of-the-art safety while minimizing over-refusal.

View →

cs.AIcs.CRcs.LGRecentApr 20, 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more

ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…

View →

cs.CRcs.CLcs.ETRecentMay 30, 2026

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

Subhadip Mitra

The paper introduces a quality-diversity evolutionary framework that evolves interpretable attack strategies, successfully discovering distinct and systematic vulnerabilities in major LLMs like GPT-4o…

View →

cs.CRcs.CLcs.ETRecentMay 30, 2026

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

Subhadip Mitra

The paper introduces a quality-diversity evolutionary framework that discovers diverse, interpretable vulnerabilities in large language models by evolving attack strategies at the semantic level, reve…

View →

cs.CRcs.CVRecentMar 18, 2026

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.

View →

cs.CRcs.AIcs.MARecentMay 25, 2026

Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li +3 more

Evo-Attacker introduces a memory-augmented reinforcement learning framework to perform generalized, long-horizon tool attacks on LLM-MAS, significantly outperforming existing methods.

View →

cs.CRcs.CLcs.ETRecentMay 30, 2026

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

Subhadip Mitra

The study demonstrates that LLM safety alignment is non-monotonic across model generations, showing that Gemma 3 exhibits unexpectedly high vulnerability to adversarial attacks compared to both its pr…

View →

cs.CRcs.CLcs.ETRecentMay 30, 2026

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

Subhadip Mitra

The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and s…

View →

cs.CRcs.AIcs.LGRecentMar 19, 2026

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao

Defense training for LLM agents, intended to improve safety, systematically degrades their core competence, leading to unreliability in multi-step tasks.

View →

cs.CRcs.AIcs.LGRecentMay 8, 2026

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Jun Wen Leong

The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…

View →

cs.CRRecentApr 1, 2026

When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

Jiaqing Li, Zhibo Zhang, Shide Zhou, Yuxi Li +2 more

The paper introduces TrojanMerge, a framework demonstrating that model merging can be exploited to systematically compromise the safety alignment of multiple individually safe LLMs.

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more

BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen +12 more

BraveGuard is a self-evolving defense framework that improves the safety of computer-use agents by training guard models on open-world, multi-step threat trajectories rather than static benchmarks.

View →

cs.CRcs.AIcs.CLRecentMay 7, 2026

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more

The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…

View →

cs.AIcs.CRRecentMay 28, 2026

Provably Secure Agent Guardrail

Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang +1 more

The paper introduces an executable Proof-Constrained Action (ePCA) framework that secures AI agents by forcing them to formalize their intentions into first-order logical constraints, achieving provab…

View →