Nils Lukas
2 indexed papers
Research Timeline
This paper addresses the vulnerability of existing LLM safety monitors to adaptive attackers and proposes activation watermarking, a technique that significantly improves detection robustness against such threats.
The paper argues that watermarking must be viewed as a monitoring primitive, introducing an observer-based threat model that shows even zero-bit watermarking can enable entity-level attribution through signal aggregation.
Papers
Watermarking Should Be Treated as a Monitoring Primitive