Alex Kwon
1 indexed paper
Recent (6 mo): 1
With code: 0
Influential cites: 0
Benchmarked: 0
Publications per year
Top categories
AI×1
Research Timeline
2026
When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
The paper introduces "brittle safety," a failure mode in which aligned language models do not adapt their safety behavior when the situational context changes, and proposes state-aware validation to detect these failures.