Copilot Arena: A Platform for Code LLM Evaluation in the Wild

Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the more realistic distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code: user preference is consistent across programming languages, yet varies significantly by task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.

Publication Info

DOI: 10.48550/arxiv.2502.09328
Year: 2025
Published: —
Language: en

Preprint Details

Link Of The Paper: http://arxiv.org/abs/2502.09328

Timeline

Created: June 19, 2026
