Papers similar to 2604.20704v1

~ similar to 2604.20704v1· 20 results

cs.LGcs.AIcs.CRRecentMar 25, 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye +2 more

The paper demonstrates that using advanced AI agents in an autoresearch loop can discover novel and highly effective adversarial attack algorithms, significantly advancing the state-of-the-art for jai…

View →

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Taiwo Onitiju, Iman Vakilinia

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →

cs.CRcs.CLRecentApr 24, 2026

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien

The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…

View →

cs.LGcs.CRRecentMar 31, 2026

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin, Pengyang Wang +4 more

The paper introduces Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that significantly reduces the reported robustness of Dummy Classes-based defenses by simultaneously targeting both t…

View →

cs.CRcs.AIRecentMay 2, 2026

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

Pang Liu, Yingjie Lao

The paper introduces a dual-dimension evaluation for universal adversarial attacks on Vision-Language Models (VLMs), demonstrating that high reported attack success rates significantly overestimate th…

View →

cs.CRRecentApr 4, 2026

AttackEval: A Systematic Empirical Study of Prompt Injection Attack Effectiveness Against Large Language Models

Jackson Wang

AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…

View →

cs.AIcs.CRcs.IRRecentApr 3, 2026

AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li

AutoVerifier is an LLM-based agentic framework that automates the end-to-end verification of complex technical claims, enabling non-experts to generate evidence-backed intelligence assessments.

View →

cs.CRcs.CLcs.LGRecentMay 7, 2026

Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

Samuel Korn

The paper evaluates four RAG architectures under knowledge base poisoning, demonstrating that advanced architectures significantly improve robustness against adversarial contradictions, localizing the…

View →

cs.CRcs.AIcs.MARecentApr 23, 2026

AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

Tanmay Gautam, Alireza Bahramali, Sandeep Atluri

AutoRISE proposes optimizing the entire attack strategy—by searching over executable programs—rather than just optimizing prompts, achieving significant improvements in red-teaming large language mode…

View →

cs.CRcs.LGRecentMar 19, 2026

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong

The paper introduces AutoMIA, a novel framework that uses LLM agents to automate the discovery and implementation of Membership Inference Attacks (MIAs), achieving state-of-the-art performance by syst…

View →

cs.CYcs.CLcs.CRRecentApr 15, 2026

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

Alexander Nemecek, Osama Zafar, Yuqiao Xu, Wenbiao Li +1 more

The paper argues that current AI content watermarking benchmarks fail to test for bias across different languages, cultures, and demographics, proposing a new set of evaluation standards to ensure fai…

View →

cs.CRcs.LGRecentMay 5, 2026

Laundering AI Authority with Adversarial Examples

Jie Zhang, Pura Peetathawatchai, Florian Tramèr, Avital Shafran

The paper demonstrates that adversarial examples can be used to manipulate Vision-Language Models (VLMs) into confidently providing authoritative but incorrect information, a process termed 'AI author…

View →

cs.CRcs.AIcs.LGRecentMar 19, 2026

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao

Defense training for LLM agents, intended to improve safety, systematically degrades their core competence, leading to unreliability in multi-step tasks.

View →

cs.AIcs.CRRecentJun 4, 2026

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more

GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…

View →

cs.CRRecentMay 14, 2026

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more

This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…

View →

cs.LGcs.CLcs.CRRecentMay 30, 2026

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

Mohammed Sameer Syed, Rozhin Yasaei

The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's vulnerability to adversarial content changes based on whether the malicious input arrives via the user message, tool meta…

View →

cs.LGcs.CLcs.CRRecentMay 30, 2026

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

Mohammed Sameer Syed, Rozhin Yasaei

The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's susceptibility to adversarial attacks changes based on whether the malicious content arrives via the user message, tool m…

View →

cs.CRcs.AIcs.CLRecentMay 5, 2026

Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.

View →

cs.CRcs.AIRecentJun 1, 2026

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue +2 more

MaskForge is a novel, adaptive, black-box attack framework that significantly improves jailbreaking diffusion large language models (dLLMs) by treating red-teaming as an optimized search over reusable…

View →

cs.LGcs.CRRecentMay 26, 2026

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Kevin Kuo, Chhavi Yadav, Virginia Smith

This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…

View →