Papers similar to 2606.00392

~ similar to 2606.00392· 20 results

cs.CRcs.AIRecentMay 1, 2026

A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

Soumil Datta, Melissa Umble, Daniel S. Brown, Guanhong Tao

The paper introduces SONAR, a prompt sanitization framework that uses natural language inference metrics to identify and remove malicious instructions injected into LLM prompts, achieving near-zero at…

View →

cs.AIRecentMay 31, 2026

Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

Boxuan Wang, Zhuoyun Li, Xiaowei Huang, Yi Dong

The paper introduces an A*-inspired framework to generate highly effective and efficient adversarial prompts that cause LLMs to hallucinate commonsense errors while maintaining the original prompt's i…

View →

cs.CRcs.AIRecentApr 1, 2026

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…

View →

cs.AIRecentMay 27, 2026

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more

The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…

View →

cs.CRcs.AIRecentMay 17, 2026

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

Nanxi Li, Zhengyue Zhao, Chaowei Xiao

The paper introduces Latent Policy Guardrail (LPG), a novel framework that efficiently enforces dynamic safety policies for LLMs by compressing complex policy deliberation into a small set of latent t…

View →

cs.LGcs.AIRecentMay 28, 2026

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son +2 more

The paper proposes LaRA, a layer-wise representation analysis framework that detects data contamination in RL post-trained LLMs by analyzing geometric deviations across model layers.

View →

cs.CRRecentApr 4, 2026

AttackEval: A Systematic Empirical Study of Prompt Injection Attack Effectiveness Against Large Language Models

Jackson Wang

AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…

View →

cs.CRcs.AIcs.CVRecentMay 27, 2026

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Xiang Fang, Wanlong Fang

The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85%…

View →

cs.CRcs.AIcs.CVRecentMay 27, 2026

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Xiang Fang, Wanlong Fang

The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85% reduction…

View →

cs.CLcs.CRcs.LGRecentMar 29, 2026

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Duanyi Yao, Changyue Li, Zhicong Huang, Cheng Hong +1 more

The paper introduces Hidden Ads, a novel backdoor attack for Vision-Language Models (VLMs) that injects unauthorized advertisements by exploiting natural, recommendation-seeking user behaviors, mainta…

View →

cs.CRcs.AIRecentMar 25, 2026

Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search

Yulin Shen, Xudong Pan, Geng Hong, Min Yang

The paper introduces Tree structured Injection for Payloads (TIP), a novel black-box attack framework that reliably generates stealthy injection payloads to seize control of LLM agents utilizing the M…

View →

cs.CRcs.AIcs.CLRecentMay 5, 2026

Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.

View →

cs.LGcs.AIcs.CRRecentMay 11, 2026

Leveraging RAG for Training-Free Alignment of LLMs

John T. Halloran

The paper introduces RAG-Pref, a novel, training-free Retrieval Augmented Generation (RAG) method for preference alignment that significantly improves LLM refusal guardrails against agentic attacks wi…

View →

cs.CRRecentApr 21, 2026

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

Alex Polyakov, Daniel Kuznetsov

The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…

View →

cs.CRcs.AIRecentApr 24, 2026

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu +1 more

RouteGuard is a novel detector that identifies skill poisoning in LLM agents by monitoring structured internal attention shifts, achieving high detection rates on critical skill-injection attacks.

View →

cs.AIcs.CRRecentMar 26, 2026

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more

This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…

View →

cs.CLcs.CRRecentMay 26, 2026

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

Akindoyin Akinrele, Shreyank N Gowda

The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…

View →

cs.LGcs.AIcs.CLRecentMay 22, 2026

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Nesreen K. Ahmed, Nima Nafisi

The paper introduces Agent-ToM, a Theory-of-Mind (ToM) based framework that learns to monitor autonomous LLM agents by explicitly reasoning about their hidden beliefs and intentions to detect covert m…

View →