Kristiyan Haralambiev

1 indexed paper

Top categories

ML ×1, AI ×1, Crypto ×1

Research Timeline

2026

Why Safety Probes Catch Liars But Miss Fanatics

The paper demonstrates that current safety probes designed to detect deceptive AI fail when a model is coherently misaligned, that is, when it genuinely believes its harmful behavior is virtuous.

Papers

cs.LG, cs.AI, cs.CR | Recent | Mar 26, 2026

Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev
