JudgeBench: A Benchmark for Evaluating LLM-based Judges

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.

Discussion(0)

No comments yet. Be the first to comment.

Open reviews(0)

Public, signed peer feedback on this preprint.

No reviews yet.

Publication Info

DOI: 10.48550/arxiv.2410.12784
Year: 2024
Published: —
Language: en

Preprint Details

Link Of The Paper: http://arxiv.org/abs/2410.12784

Timeline

Created:June 19, 2026

Related publications

Preprint2023

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Abstract

Discussion(0)

Open reviews(0)

Related publications

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

A Statistical Framework for Ranking LLM-Based Chatbots

GameArena: Evaluating LLM Reasoning through Live Computer Games

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

LLM-Assisted Code Cleaning For Training Accurate Code Generators