Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation