~ similar to 2603.23171v2· 20 results
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
Shuhao Zhang, Yuli Chen, Jiale Han, Bo Cheng +1 more
The paper proposes Adaptive Stealing (AS), a novel and more robust watermark stealing algorithm that dynamically selects optimal attack perspectives to significantly increase the efficiency of comprom…
Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen +1 more
The paper proposes SAFESEAL, a novel key-conditioned watermarking framework that embeds robust, provider-specific watermarks into LLM outputs with minimal semantic distortion, effectively protecting i…
Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen +1 more
The paper demonstrates that combining outputs from multiple large language models (LLMs) effectively cancels out statistical watermarks, revealing a fundamental vulnerability in current AI text detect…
The paper introduces GuardPhish, a large-scale dataset and evaluation framework, demonstrating that even high-performing open-source LLMs can generate actionable phishing content despite accurate inte…
The paper argues that watermarking must be viewed as a monitoring primitive, introducing an observer-based threat model that shows even zero-bit watermarking can enable entity-level attribution throug…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon +3 more
The paper evaluates prompt injection defenses and finds that only external output filtering, implemented in application code, reliably prevents secret leaks from LLMs, demonstrating that model-based d…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
This paper develops provably undetectable and robust watermarking schemes for LLM outputs even when the per-token entropy is only constant, removing previous dependencies on high entropy rates or larg…
This paper adapts LLM watermarking techniques, specifically the KGW watermark, to create detectable watermarks for AI game-playing strategies in perfect-information games, showing minimal impact on ga…
This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…
The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…
Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more
The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
The paper analyzes LLM vulnerability detection using mechanistic interpretability, finding that models primarily rely on safety detectors rather than direct vulnerability signature recognition.
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng +1 more
The paper introduces RLSpoofer, a lightweight, black-box reinforcement learning attack that demonstrates the fragile resilience of current LLM watermarking schemes by achieving a high spoofing success…
The paper introduces LUNA, a linguistically adaptive watermarking technique that achieves high detection accuracy across diverse languages while maintaining minimal text distortion, outperforming exis…
The paper challenges the assumption that LLM safety is a binary threshold, proposing that safety failures occur in an 'instability region' and introducing Furina, a transferable attack that exploits t…