~ similar to 2606.00402 · 20 results
Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen +1 more
The paper demonstrates that combining outputs from multiple large language models (LLMs) effectively cancels out statistical watermarks, revealing a fundamental vulnerability in current AI text detect…
The paper introduces OpAI-Bench, a novel benchmark designed to study how AI authorship signals evolve and accumulate during the progressive co-editing process between humans and AI.
The paper introduces TSM-Bench, a new benchmark that demonstrates existing LLM-generated text detectors fail to accurately identify task-specific machine-generated content found in real-world Wikipedi…
The paper introduces BREW, a novel framework that significantly improves the reliability of multi-bit text watermarking for LLMs by replacing flawed decoding-centric methods with a designated two-stag…
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…
The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…
The paper introduces SONAR, a prompt sanitization framework that uses natural language inference metrics to identify and remove malicious instructions injected into LLM prompts, achieving near-zero at…
Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou +1 more
This paper demonstrates that targeted adaptation of the small language model Phi Silica, using dataset curation and fine-tuning, significantly improves its performance in short-form text rewriting, na…
PASA introduces a robust, semantic-level watermarking technique that embeds and detects watermarks in the latent embedding space, successfully resisting semantic-invariant attacks like paraphrasing.
Cong Kong, Xin Cheng, Zhaoxia Yin, Shuai Li +2 more
VertMark introduces a novel, unified, training-free framework to embed robust watermarks into vertical-domain pre-trained language models (VPLMs) for copyright protection across multiple specializ…
The paper introduces Synthesis Data Reversion (SDR), a method that infers the data laundering transformation used in LLM training and synthesizes queries to restore the detection signals lost when pro…
Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng +1 more
The paper introduces RLSpoofer, a lightweight, black-box reinforcement learning attack that exposes the fragility of current LLM watermarking schemes by achieving a high spoofing success…
XMark introduces a novel multi-bit watermarking technique that reliably embeds binary messages into LLM-generated text while maintaining high text quality and robust performance even with limited toke…
The paper proposes REED, a post-training representation editing method that significantly improves cross-domain linguistic steganalysis performance by deterministically editing intermediate feature re…
The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…
Yuanfan Li, Qi Zhou, Chengzhengxu Li, Zhaohan Zhang +4 more
The paper introduces MGTEVAL, a comprehensive and extensible platform designed to systematically evaluate the performance, robustness, and efficiency of machine-generated text detectors.
This paper introduces seven novel, cross-domain techniques for detecting prompt injection attacks, moving beyond the limitations of traditional regex and transformer classifiers.
The paper analyzes the robustness of current LLM watermarking schemes against various text modifications, concluding that watermarks can be removed with reasonable effort.
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
The paper proposes SteganoPrompt, an input-side watermark embedded in the assignment prompt that forces LLMs to generate a detectable signature in their output, thereby exposing verbatim copy-pasting.