Colin Samplawski
1 indexed paper
Recent (6 mo): 1
With code: 0
Influential cites: 0
Benchmarked: 0
Publications per year (2026): 1
Top categories
Crypto ×1, ML ×1
Research Timeline
2026
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies significantly, with Llama-3 models showing high vulnerability.
Papers
cs.CR, cs.LG | Recent | Apr 22, 2026
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more