Papers similar to 2605.16462v1

~ similar to 2605.16462v1· 20 results

cs.CRcs.AIRecentApr 25, 2026

Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models

Max Hartman, Vidhata Jayaraman, Moulik Choraria, Yash Savani +1 more

The paper introduces TraceGuard, a detectability-aware antidistillation method that identifies and poisons 'thought anchors'—sparsely critical sentences—to degrade student model learning without makin…

View →

cs.CRcs.AIRecentApr 13, 2026

Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models

Shuhao Zhang, Yuli Chen, Jiale Han, Bo Cheng +1 more

The paper proposes Adaptive Stealing (AS), a novel and more robust watermark stealing algorithm that dynamically selects optimal attack perspectives to significantly increase the efficiency of comprom…

View →

cs.CLRecentMay 28, 2026

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen +1 more

The paper demonstrates that combining outputs from multiple large language models (LLMs) effectively cancels out statistical watermarks, revealing a fundamental vulnerability in current AI text detect…

View →

cs.CRcs.AIRecentMay 3, 2026

Repurposing and Evaluating the (In)Feasibility of Dataset Poisoning enabled Watermarking for Contrastive Learning

Zhiyang Dai, Yansong Gao, Boyu Kuang, Haodong Li +4 more

This paper repurposes the statistical signals from data-poisoning backdoor attacks on contrastive learning (CL) models to create a multi-level, effective watermarking scheme for dataset intellectual p…

View →

cs.CRcs.AIcs.CYRecentMay 7, 2026

Detecting Verbatim LLM Copy-Paste in Homework

Aizierjiang Aiersilan

The paper proposes SteganoPrompt, an input-side watermark embedded in the assignment prompt that forces LLMs to generate a detectable signature in their output, thereby exposing verbatim copy-pasting.

View →

cs.CRcs.AIRecentJun 2, 2026

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

Hang Li, Fedor Filippov, Yuling Lin, Pengfei He +5 more

This paper investigates the vulnerability of LLM-based automatic grading systems to prompt injection (PI) attacks, demonstrating that current systems are highly susceptible to manipulation that can le…

View →

cs.CRcs.AIcs.CYRecentMar 24, 2026

Robust Safety Monitoring of Language Models via Activation Watermarking

Toluwani Aremu, Daniil Ognev, Samuele Poppi, Nils Lukas

This paper addresses the vulnerability of existing LLM safety monitors to adaptive attackers and proposes activation watermarking, a technique that significantly improves detection robustness against…

View →

cs.CRRecentMay 13, 2026

From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation

Yan Liang, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou +1 more

The paper proposes SubPopMark, a novel subpopulation-driven framework that injects harmless, verifiable markers into distilled datasets to prevent copyright infringement and data leakage.

View →

cs.CRcs.AIRecentApr 1, 2026

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…

View →

cs.CRcs.AIRecentMay 17, 2026

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan +2 more

The paper introduces ASPI, a benchmark showing that requiring LLM agents to seek clarification significantly amplifies their vulnerability to prompt injection attacks.

View →

cs.CRcs.CLRecentMay 22, 2026

Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen +1 more

The paper proposes SAFESEAL, a novel key-conditioned watermarking framework that embeds robust, provider-specific watermarks into LLM outputs with minimal semantic distortion, effectively protecting i…

View →

cs.CRcs.AIcs.LGRecentMay 8, 2026

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Jun Wen Leong

The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…

View →

cs.CRcs.AIcs.CLRecentMay 28, 2026

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…

View →

cs.CRcs.AIcs.CLRecentMay 28, 2026

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…

View →

cs.CRRecentApr 15, 2026

BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

Jagadeesh Rachapudi, Ritali Vatsi, Pranav Singh, Praful Hambarde +1 more

BackFlush introduces a novel, knowledge-free framework that detects and eliminates unknown backdoor attacks in LLMs while simultaneously preserving existing watermarks, achieving high detection rates…

View →

cs.CRcs.AIRecentMay 21, 2026

Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation

Yilan Gao, Sida Huang, Hongyuan Zhang, Xuelong Li

The paper introduces WaveGuard, a frequency-aware, single-pass defense framework that safeguards text-to-image models by injecting structured, imperceptible perturbations into generated images, thereb…

View →

cs.CRcs.AIcs.CLRecentMay 21, 2026

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Aaditya Pai

The paper identifies a critical vulnerability, the Camouflage Detection Gap (CDG), where standard LLM injection detectors fail dramatically when malicious payloads mimic the target domain's language a…

View →

cs.CRcs.AIcs.CVRecentMay 27, 2026

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Xiang Fang, Wanlong Fang

The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85%…

View →

cs.CRcs.AIcs.CVRecentMay 27, 2026

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Xiang Fang, Wanlong Fang

The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85% reduction…

View →

cs.AIcs.CRRecentMay 30, 2026

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa +1 more

The paper introduces Reasoning Exposure Prompting (REP), a method that demonstrates that even when LLMs hide their internal reasoning steps from users, useful reasoning supervision can still be elicit…

View →

Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models

Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Repurposing and Evaluating the (In)Feasibility of Dataset Poisoning enabled Watermarking for Contrastive Learning

Detecting Verbatim LLM Copy-Paste in Homework

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

Robust Safety Monitoring of Language Models via Activation Watermarking

From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models

Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Repurposing and Evaluating the (In)Feasibility of Dataset Poisoning enabled Watermarking for Contrastive Learning

Detecting Verbatim LLM Copy-Paste in Homework

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

Robust Safety Monitoring of Language Models via Activation Watermarking

From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems