ArXivCSExplorer

Kristiyan Haralambiev

1 indexed paper

Recent (6 mo): 1
With code: 0
Influential cites: 0
Benchmarked: 0

Publications per year

2026: 1

Top categories

ML ×1, AI ×1, Crypto ×1

Research Timeline

2026
Why Safety Probes Catch Liars But Miss Fanatics

The paper demonstrates that current safety probes designed to detect deceptive AI fail when a model is coherently misaligned, that is, when it genuinely believes its harmful behavior is virtuous.

Papers

cs.LG, cs.AI, cs.CR | Recent | Mar 26, 2026

Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev
