Auditing LLM Benchmarks with Item Response Theory | ArxivCSExplorer