~ similar to 2604.09849v1· 19 results
This paper proposes a multi-layered defense strategy combining pre-output monitoring, calibrated canary detection, and cumulative information-flow tracking to prevent LLM agents from exfiltrating sens…
This paper introduces seven novel, cross-domain techniques for detecting prompt injection attacks, moving beyond the limitations of traditional regex and transformer classifiers.
The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…
This paper demonstrates that retrieval-augmented in-context learning systems for document QA are vulnerable to membership inference attacks, proposing novel black-box methods that exploit query prefix…
The paper introduces an end-to-end framework that not only detects network intrusions using deep learning but also generates actionable, citation-grounded mitigation reports using a Retrieval-Augmente…
The paper proposes an embarrassingly simple detector that monitors model extraction attacks by testing whether the aggregate distribution of incoming LLM queries deviates from the historical distribut…
Bo Wang, Jia Ni, Mengnan Zhao, Zhan Qin +1 more
This paper systematically investigates unlearnable examples (UEs) across diverse training paradigms, finding that existing UEs fail under pretraining-finetuning (PF) settings, and proposes Shallow Sem…
The paper introduces a novel, transferable learned attack (LT-MIA) that detects a universal 'signature of memorization' in language models, achieving high accuracy across diverse model architectures (…
The paper proposes DRIFT, a drift-resilient Transformer framework that maintains high accuracy in detecting evolving Domain Generation Algorithms (DGAs) by learning invariant representations.
Yuduo Li, Xiaofeng Shi, Qian Kou, Longbin Yu +1 more
RAFT proposes a two-stage framework combining data refinement and adaptive distillation to improve domain-specific fine-tuning while mitigating the loss of general model capabilities.
This paper investigates the privacy risk of reconstructing Personally Identifiable Information (PII) from Large Language Models (LLMs) that have undergone Supervised Finetuning (SFT), proposing a nove…
Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more
The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…
The paper empirically evaluates domain-adapted and general-purpose LLMs for structured threat modelling (STRIDE on 5G security), finding that domain adaptation and model size do not guarantee reliable…
Divergence Decoding (DD) is a novel, effective, and inexpensive method that uses auxiliary models to steer LLM logits during inference, enabling the removal of memorized sensitive data without signifi…
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
The paper analyzes the routing behavior of Mixtral MoE under benign and harmful prompts using activation and gradient signals, finding that safety-relevant routing is subtle, depth-dependent, and dist…
The paper proposes REED, a post-training representation editing method that significantly improves cross-domain linguistic steganalysis performance by deterministically editing intermediate feature re…
The paper introduces 'infilling extraction' to accurately model training data memorization in Diffusion Language Models (DLMs), finding that bidirectional masking significantly increases the extractab…
Zekun Fei, Zihao Wang, Weijie Liu, Ruiqi He +3 more
Misrouter introduces an input-only adversarial framework to exploit the routing mechanisms of Mixture-of-Experts (MoE) LLMs, enabling unsafe behavior induction against remotely hosted, black-box servi…