Colin Samplawski

1 indexed paper

Top categories

Crypto ×1, ML ×1

Frequent co-authors

Anirban Roy (1×)

Research Timeline

2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs. It finds that interpretability-based steering is a powerful auditing tool and that robustness varies significantly across models, with Llama-3 models showing high vulnerability.

Papers

cs.CR, cs.LG | Recent | Apr 22, 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
