Junchen Wan
1 indexed paper
Recent (6 mo): 1
With code: 0
Influential cites: 0
Benchmarked: 0
Research Timeline
2026
Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking
The paper introduces a "replication-first" paradigm for LLM behavioral benchmarking and reports that this approach uncovers significant, non-obvious performance drops between successive model versions, such as a notable decline in advice restraint for GPT-5.