The paper introduces REALISTA, a latent-space adversarial attack framework that generates semantically equivalent, coherent rephrasings of benign prompts to induce hallucinations in large language models (LLMs), matching or outperforming state-of-the-art realistic attacks on open-source LLMs.
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to systematically evaluate their reliability under realistic adversarial inputs. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find coherent adversarial prompts that are semantically equivalent to benign user prompts yet elicit hallucinated responses. Existing attack methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.
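To make the idea of "optimizing continuous combinations of editing directions" concrete, the following is a minimal toy sketch and not the authors' implementation. It rests on several assumptions that do not come from the abstract: latents are plain NumPy vectors, each editing direction is the difference between a rephrasing's latent and the benign prompt's latent, the combination weights are constrained to the probability simplex, and the hallucination objective is replaced by a stand-in linear score. All function names (build_dictionary, project_to_simplex, attack) are hypothetical.

```python
import numpy as np


def build_dictionary(z0, rephrasing_latents):
    """Rows are editing directions: latent of a valid rephrasing minus z0.

    Assumption: directions are simple latent differences.
    """
    return rephrasing_latents - z0


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index that stays positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def attack(z0, D, grad_fn, steps=100, lr=0.1):
    """Projected gradient ascent over the combination weights alpha.

    grad_fn(z) returns d(objective)/dz; here it is a toy stand-in for a
    hallucination objective, which the abstract does not specify.
    """
    k = D.shape[0]
    alpha = np.full(k, 1.0 / k)               # start from the uniform mix
    for _ in range(steps):
        z = z0 + alpha @ D                    # current adversarial latent
        g_alpha = D @ grad_fn(z)              # chain rule: dz/dalpha = D
        alpha = project_to_simplex(alpha + lr * g_alpha)
    return alpha, z0 + alpha @ D


# Toy usage with random latents and a linear stand-in objective w . z
rng = np.random.default_rng(0)
z0 = rng.normal(size=8)
rephrasings = z0 + 0.1 * rng.normal(size=(5, 8))
w = rng.normal(size=8)

D = build_dictionary(z0, rephrasings)
alpha, z_adv = attack(z0, D, grad_fn=lambda z: w)
print("weights:", np.round(alpha, 3))
```

Restricting the weights to the simplex keeps the optimized latent inside the convex hull of the rephrasing latents, which is one simple way to stay close to valid rephrasings. The actual constraint set, latent space, and decoding step used by REALISTA may differ.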
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
This paper theoretically analyzes Continuous Adversarial Training (CAT) for LLMs…
When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion
The paper introduces TrojanMerge, a framework demonstrating that model merging c…
DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection
The paper introduces PromptFuzz-SC, a novel semantic-character dual-space mutati…
PIDP-Attack: Combining Prompt Injection with Database Poisoning Attacks on Retrieval-Augmented Gener…
The paper introduces PIDP-Attack, a novel compound adversarial attack that combi…
Adversarial attacks against Modern Vision-Language Models
The paper evaluates the adversarial robustness of two open-source Vision-Languag…
GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models
The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enh…
PIArena: A Platform for Prompt Injection Evaluation
The paper introduces PIArena, a unified and extensible platform designed to addr…
QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits
The paper proposes QShield, a hybrid quantum-classical neural network architectu…