Koushik Sen
1 indexed paper
Recent (6 mo): 1
With code: 0
Influential cites: 0
Benchmarked: 0
Publications per year: 1 (2026)
Top categories
AI ×1, Crypto ×1
Research Timeline
2026
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
The paper introduces BenchJack, an automated red-teaming system that systematically audits popular AI agent benchmarks, revealing numerous reward-hacking exploits and demonstrating a method to significantly improve benchmark robustness.
Papers
cs.AI, cs.CR · Recent · May 12, 2026
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung +2 more
The paper introduces BenchJack, an automated red-teaming system that systematically audits popular AI agent benchmarks, revealing numerous reward-hacking exploits and demonstrating a method to signifi…