Matteo Gioele Collu
2 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detection and exploitation.
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detection and exploitation.
Papers
Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…