Tom Biskupski

1 indexed paper

Top categories

Crypto ×1, AI ×1, ML ×1

Frequent co-authors

Stephan Kleber ×1

Research Timeline

2026

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts and powerful models are used.

Papers

cs.CR, cs.AI, cs.LG · Recent · Mar 23, 2026

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber