John T. Halloran
3 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capabilities, making it a greater risk than Jailbreak-Tuning (JT).
The paper introduces RAG-Pref, a novel, training-free Retrieval Augmented Generation (RAG) method for preference alignment that significantly improves LLM refusal guardrails against agentic attacks with minimal computational overhead.
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving safety performance.
Papers
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…