~ similar to 2605.16462v1· 20 results
Max Hartman, Vidhata Jayaraman, Moulik Choraria, Yash Savani +1 more
The paper introduces TraceGuard, a detectability-aware antidistillation method that identifies and poisons 'thought anchors'—sparsely critical sentences—to degrade student model learning without makin…
Shuhao Zhang, Yuli Chen, Jiale Han, Bo Cheng +1 more
The paper proposes Adaptive Stealing (AS), a novel and more robust watermark stealing algorithm that dynamically selects optimal attack perspectives to significantly increase the efficiency of comprom…
Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen +1 more
The paper demonstrates that combining outputs from multiple large language models (LLMs) effectively cancels out statistical watermarks, revealing a fundamental vulnerability in current AI text detect…
Zhiyang Dai, Yansong Gao, Boyu Kuang, Haodong Li +4 more
This paper repurposes the statistical signals from data-poisoning backdoor attacks on contrastive learning (CL) models to create a multi-level, effective watermarking scheme for dataset intellectual p…
The paper proposes SteganoPrompt, an input-side watermark embedded in the assignment prompt that forces LLMs to generate a detectable signature in their output, thereby exposing verbatim copy-pasting.
Hang Li, Fedor Filippov, Yuling Lin, Pengfei He +5 more
This paper investigates the vulnerability of LLM-based automatic grading systems to prompt injection (PI) attacks, demonstrating that current systems are highly susceptible to manipulation that can le…
This paper addresses the vulnerability of existing LLM safety monitors to adaptive attackers and proposes activation watermarking, a technique that significantly improves detection robustness against…
Yan Liang, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou +1 more
The paper proposes SubPopMark, a novel subpopulation-driven framework that injects harmless, verifiable markers into distilled datasets to prevent copyright infringement and data leakage.
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
The paper introduces ASPI, a benchmark showing that requiring LLM agents to seek clarification significantly amplifies their vulnerability to prompt injection attacks.
Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen +1 more
The paper proposes SAFESEAL, a novel key-conditioned watermarking framework that embeds robust, provider-specific watermarks into LLM outputs with minimal semantic distortion, effectively protecting i…
The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…
The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…
This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…
BackFlush introduces a novel, knowledge-free framework that detects and eliminates unknown backdoor attacks in LLMs while simultaneously preserving existing watermarks, achieving high detection rates…
The paper introduces WaveGuard, a frequency-aware, single-pass defense framework that safeguards text-to-image models by injecting structured, imperceptible perturbations into generated images, thereb…
The paper identifies a critical vulnerability, the Camouflage Detection Gap (CDG), where standard LLM injection detectors fail dramatically when malicious payloads mimic the target domain's language a…
The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85%…
The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85% reduction…
Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa +1 more
The paper introduces Reasoning Exposure Prompting (REP), a method that demonstrates that even when LLMs hide their internal reasoning steps from users, useful reasoning supervision can still be elicit…