Papers similar to 2604.18660v1

~ similar to 2604.18660v1· 20 results

cs.CRcs.AIcs.LGRecentMar 29, 2026

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…

View →

cs.CRRecentMay 7, 2026

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

Zeyuan Chen, Yihan Ma, Xinyue Shen, Michael Backes +1 more

The PopQuiz Attack is a novel black-box membership inference attack that successfully tests whether large language models memorize specific training data by framing the target data as multiple-choice…

View →

cs.CRRecentMay 20, 2026

Adversarial Reframing: A Framework for Targeted Generation in Language Models

Shahnewaz Karim Sakib, Swati Kar, Anindya Bijoy Das

The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…

View →

cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →

cs.CRcs.AIcs.CLRecentMar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation t…

View →

cs.LGcs.AIcs.CRRecentMay 6, 2026

Information Theoretic Adversarial Training of Large Language Models

Yiwei Zhang, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia +2 more

The paper proposes WARDEN, a distributionally robust adversarial training framework that significantly reduces LLM vulnerability to adversarial attacks by dynamically reweighting hard adversarial exam…

View →

cs.CRcs.AIRecentMar 30, 2026

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur

This survey provides a comprehensive taxonomy and vulnerability-centric analysis of adversarial attacks targeting Multimodal Large Language Models (MLLMs), offering an explanatory framework for enhanc…

View →

cs.CRcs.AIRecentJun 2, 2026

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

Hang Li, Fedor Filippov, Yuling Lin, Pengfei He +5 more

This paper investigates the vulnerability of LLM-based automatic grading systems to prompt injection (PI) attacks, demonstrating that current systems are highly susceptible to manipulation that can le…

View →

cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →

cs.AIRecentMay 28, 2026

Harnessing non-adversarial robustness in large language models

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more

The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.

View →

cs.LGcs.CRstat.MLRecentApr 14, 2026

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Shaopeng Fu, Di Wang

This paper theoretically analyzes Continuous Adversarial Training (CAT) for LLMs using In-context Learning (ICL) theory, proving that embedding space perturbations effectively enhance robustness again…

View →

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Taiwo Onitiju, Iman Vakilinia

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →

cs.CRcs.CLcs.LGRecentMay 7, 2026

Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

Samuel Korn

The paper evaluates four RAG architectures under knowledge base poisoning, demonstrating that advanced architectures significantly improve robustness against adversarial contradictions, localizing the…

View →

cs.CRRecentApr 23, 2026

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

Zihan Wang, Rui Zhang, Yu Liu, Chi Liu +3 more

This paper presents the first systematic study of black-box skill stealing attacks against proprietary LLM agents, demonstrating that structured agent skills can be easily extracted, posing a signific…

View →

cs.CRcs.AIRecentMay 14, 2026

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit

The paper argues that the standard Attack Success Rate (ASR) metric for LLM jailbreaks is unstable and systematically inflated, proposing new frameworks to account for stochasticity in both evaluation…

View →

cs.CRcs.AIRecentMay 24, 2026

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more

Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.

View →

cs.LGcs.CRRecentMay 26, 2026

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Kevin Kuo, Chhavi Yadav, Virginia Smith

This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…

View →

cs.CLcs.AIRecentJun 1, 2026

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

Aitor Arronte Alvarez, Naiyi Xie Fincham

This study evaluates LLMs in conversational tutoring to identify high-confidence social biases, finding that state-of-the-art models are often overconfident in their incorrect assessments of stereotyp…

View →

cs.CRcs.AIcs.LGRecentMar 19, 2026

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao

Defense training for LLM agents, intended to improve safety, systematically degrades their core competence, leading to unreliability in multi-step tasks.

View →

cs.CRcs.AIRecentApr 26, 2026

Evaluation of Prompt Injection Defenses in Large Language Models

Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon +3 more

The paper evaluates prompt injection defenses and finds that only external output filtering, implemented in application code, reliably prevents secret leaks from LLMs, demonstrating that model-based d…

View →

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

Adversarial Reframing: A Framework for Targeted Generation in Language Models

SoK: Robustness in Large Language Models against Jailbreak Attacks

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Information Theoretic Adversarial Training of Large Language Models

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Harnessing non-adversarial robustness in large language models

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

The Autonomy Tax: Defense Training Breaks LLM Agents

Evaluation of Prompt Injection Defenses in Large Language Models

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

Adversarial Reframing: A Framework for Targeted Generation in Language Models

SoK: Robustness in Large Language Models against Jailbreak Attacks

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Information Theoretic Adversarial Training of Large Language Models

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Harnessing non-adversarial robustness in large language models

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

The Autonomy Tax: Defense Training Breaks LLM Agents

Evaluation of Prompt Injection Defenses in Large Language Models

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems