BigCodeArena: Execution-Backed Code Evaluation
- BigCodeArena is an open, execution-backed platform that systematically assesses large language models on real-world programming challenges.
- It integrates a modular execution environment with pairwise human evaluation using methods like Bradley–Terry and bootstrapped Elo ratings.
- The platform’s benchmarks, BigCodeReward and AutoCodeArena, provide actionable insights by combining dynamic code execution with human preferences.
BigCodeArena is an open, execution-backed human evaluation platform for code generation, designed to systematically assess the capabilities of LLMs on real programming challenges. Built upon the Chatbot Arena infrastructure and extended with a modular, interactive execution environment, it facilitates reliable, unbiased evaluation of code-centric model outputs at scale. BigCodeArena interleaves human-in-the-loop comparison, dynamic code execution across multiple languages and frameworks, and automatic benchmarking to expose fine-grained model strengths, weaknesses, and human preferences in code understanding, generation, and execution (Zhuo et al., 9 Oct 2025).
1. Platform Architecture and Execution Environment
BigCodeArena is composed of a web-based front end for code visualization and editing, tightly integrated with a secure, container-based back end for automated code execution. Each interaction centers on a code-centric conversational session: the platform recognizes the programming language, extracts code from the multi-turn conversation, and selects an appropriate runtime and execution environment.
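The extraction step described above can be illustrated with a minimal sketch. It assumes code arrives in fenced blocks inside assistant turns and that the most recent block is the one to run; the `RUNTIME_BY_LANGUAGE` mapping, the runtime names, and both function names are hypothetical and do not reflect the platform's actual identifiers.

```python
import re

# Hypothetical mapping from fence language tags to execution environments.
# The runtime names are illustrative, not the platform's real identifiers.
RUNTIME_BY_LANGUAGE = {
    "python": "python-interpreter",
    "javascript": "core-web",
    "typescript": "core-web",
    "html": "core-web",
    "mermaid": "mermaid",
}

FENCE = "`" * 3  # built programmatically so this example can itself be fenced
CODE_BLOCK = re.compile(FENCE + r"([\w+#-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(messages):
    """Return (language, code) pairs found in assistant turns, in order."""
    blocks = []
    for message in messages:
        if message["role"] != "assistant":
            continue
        for lang, code in CODE_BLOCK.findall(message["content"]):
            blocks.append((lang.lower() or "unknown", code.strip()))
    return blocks


def select_runtime(messages):
    """Pick a runtime for the most recent code block, or None if there is none."""
    blocks = extract_code_blocks(messages)
    if not blocks:
        return None
    lang, code = blocks[-1]
    # Unknown languages fall back to a generic interpreter (assumed default).
    return RUNTIME_BY_LANGUAGE.get(lang, "generic-interpreter"), lang, code
```

A real system would likely also need framework detection (for example, telling a Streamlit app from a plain Python script), which a language tag alone cannot provide.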
Supported languages include Python, JavaScript, TypeScript, HTML, C/C++, Java, Go, Rust, and Markdown. Execution environments span:
- Web frameworks: Core Web, React, Vue
- Python environments: Streamlit, Gradio, PyGame
- Specialized environments: Mermaid, generic interpreters for compiled languages
To preserve security and reproducibility, all code is executed in isolated containers (e.g., managed via E2B sandboxes or local Docker variants), ensuring dependency management and minimizing risk from arbitrary code execution (Zhuo et al., 9 Oct 2025).
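As a rough illustration of container isolation with a local Docker setup, the sketch below only builds a `docker run` command line; it does not describe how BigCodeArena or E2B configure their sandboxes. The function name, the resource limits, and the assumption that the image ships a `timeout` binary are all hypothetical choices made for this example.

```python
import shlex


def build_sandbox_command(image, workdir, run_cmd,
                          memory="512m", cpus="1.0", timeout_s=30):
    """Build a `docker run` invocation for one isolated, throwaway execution.

    All limits are illustrative defaults, not documented platform settings.
    """
    return [
        "docker", "run",
        "--rm",                        # delete the container when it exits
        "--network", "none",           # no network access for untrusted code
        "--memory", memory,            # cap memory usage
        "--cpus", cpus,                # cap CPU usage
        "--read-only",                 # read-only root filesystem
        "-v", f"{workdir}:/work:ro",   # mount the submitted code read-only
        "-w", "/work",
        image,
        # Assumes the image provides the coreutils `timeout` binary.
        "timeout", str(timeout_s), *shlex.split(run_cmd),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect stdout and stderr. Interactive environments such as web frameworks would need published ports instead of `--network none`, so this shape fits batch-style scripts only.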
2. Human Evaluation Process and Statistical Metrics
Pairwise evaluation lies at the heart of BigCodeArena. For each prompt, two anonymized model outputs (with synchronized execution to remove latency bias) are presented alongside their execution results. Human raters interact with rendered artifacts—web pages, games, UI applications—and select which solution best fulfills the task.
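Preference votes are aggregated using the Bradley–Terry model. In its standard form, with $\beta_i$ denoting the latent strength of model $i$ (notation assumed here), the probability that model $i$ defeats model $j$ is:
$$P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}$$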
Model ranks and Elo ratings are derived from 100 rounds of bootstrap resampling, which also yield confidence intervals. Sampling weights ensure fair exposure: each model pair is chosen with a probability determined by those weights.
This mechanism scales to accommodate rapidly-emerging models and new participants (Zhuo et al., 9 Oct 2025).
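The aggregation pipeline in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the platform's implementation: it handles decisive votes only (ties are omitted), fits strengths with the classic minorization-maximization iteration, adds a small `prior` pseudo-count so resamples with a winless model stay well defined, and maps strengths to an Elo-like scale with an assumed base of 1000 and scale of 400.

```python
import math
import random


def fit_bradley_terry(battles, models, prior=0.1, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    `prior` is a pseudo-count of wins added in both directions for every
    pair; it is an assumption of this sketch, not a documented setting.
    """
    wins = {i: {j: prior for j in models if j != i} for i in models}
    for winner, loser in battles:
        wins[winner][loser] += 1.0
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[i].values())
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom
        # Normalize so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(models)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo_scale(strength, scale=400.0, base=1000.0):
    """Map strengths to an Elo-like scale (base and scale are assumptions)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


def bootstrap_elo(battles, models, rounds=100, seed=0):
    """Return approximate 95% percentile intervals per model."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))  # with replacement
        ratings = to_elo_scale(fit_bradley_terry(resampled, models))
        for m in models:
            samples[m].append(ratings[m])
    intervals = {}
    for m, values in samples.items():
        values.sort()
        low = values[int(0.025 * (len(values) - 1))]
        high = values[int(0.975 * (len(values) - 1))]
        intervals[m] = (low, high)
    return intervals
```

Calling `bootstrap_elo(battles, models)` with a list of `(winner, loser)` tuples returns one `(low, high)` interval per model; overlapping intervals suggest that the corresponding rank difference is not well supported by the votes.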
3. Data Collection and Domain Coverage
BigCodeArena has accumulated a substantial data asset: more than 14,000 raw conversational sessions across 10 major LLMs, 10 programming languages, and 8 execution frameworks (Zhuo et al., 9 Oct 2025). Out of these, approximately 4,700 high-quality multi-turn samples include explicit, pairwise human preference judgments.
Prompts are curated to represent diverse, real-world tasks—ranging from web UI programming, graphical gaming, and Markdown diagram rendering to traditional computational problems. This coverage enhances the generality of the benchmarks and exposes both syntactic and semantic model capabilities.
4. Benchmarks: BigCodeReward and AutoCodeArena
Two benchmarks extend BigCodeArena’s utility for systematic model evaluation:
BigCodeReward
Constructed from the 4,700 pairwise-judged sessions, BigCodeReward assesses how well reward models align with human preferences. It reports accuracy and macro F1 over three classes (“A”, “B”, “Tie”) for reward models of the kind used in RLHF, with a particular focus on how well they judge code quality when execution results are available.
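The two reported metrics can be computed as in the sketch below, which compares a reward model's predicted labels against human votes. The function names and the convention of scoring an absent class as 0 are assumptions made for illustration; the benchmark's own scoring code may differ in such details.

```python
LABELS = ("A", "B", "Tie")


def accuracy(gold, pred):
    """Fraction of comparisons where the predicted label matches the human vote."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class that never appears in gold or pred scores 0 (assumed convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)
```

Macro averaging weights the three classes equally, so a model that never predicts “Tie” is penalized even when ties are rare in the data.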
AutoCodeArena
AutoCodeArena is a fully automated Elo rating benchmark that replaces human raters with an “LLM-as-a-Judge” paradigm. It employs a fixed set of 600 representative coding prompts, local Docker-based execution for consistency, and automatic ranking of LLM outputs (Zhuo et al., 9 Oct 2025). This supports rapid model iteration and timely benchmarking of newly released LLMs.
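To make the automated ranking step concrete, the sketch below turns a stream of judge verdicts into ratings with the textbook online Elo update. This is an assumption for illustration: the source states only that the benchmark produces Elo ratings, and the judge itself is not modeled here. The `k` factor, the initial rating, and the function names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, verdict, k=16.0):
    """Apply one judge verdict ("A", "B", or "Tie") to the ratings in place."""
    score_a = {"A": 1.0, "B": 0.0, "Tie": 0.5}[verdict]
    delta = k * (score_a - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta
    return ratings


def rank_models(verdicts, initial=1000.0):
    """Rank models from (model_a, model_b, verdict) triples, best first."""
    ratings = {}
    for model_a, model_b, verdict in verdicts:
        ratings.setdefault(model_a, initial)
        ratings.setdefault(model_b, initial)
        update_elo(ratings, model_a, model_b, verdict)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Online Elo depends on the order in which verdicts arrive, which is one reason a batch Bradley–Terry fit is often preferred when all comparisons are available at once.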
5. Comparative Performance of Proprietary and Open LLMs
Analysis on BigCodeArena benchmarks reveals that proprietary LLMs currently dominate code generation. GPT-5 establishes a new state-of-the-art across evaluated coding scenarios, with Claude-Sonnet-4 and Claude-Opus-4 following closely (Zhuo et al., 9 Oct 2025). Their superior performance is most pronounced in interactive, execution-grounded evaluations rather than static code inspection, highlighting the necessity of dynamic feedback in judging model outputs.
This suggests that despite advances in open-source LLMs, proprietary models maintain a clear advantage in reliability and functional code synthesis when evaluated in realistic, execution-based settings.
6. Role of Execution Feedback and Human Preferences
BigCodeArena demonstrates that crowdsourced human preferences are best captured when evaluators interact with live execution results, not just raw code. This interactivity uncovers task-specific and cross-language nuances—detecting runtime errors, semantic bugs, UI design issues, and framework mismatches that would be opaque to static evaluation.
Empirical results indicate that reward models trained with execution feedback align more closely with human judgments, underlining the importance of execution-aware RLHF strategies for future LLM training and assessment (Zhuo et al., 9 Oct 2025). A plausible implication is that reward models lacking access to real execution feedback may systematically misalign with practical developer expectations.
7. Implications, Significance, and Future Directions
BigCodeArena establishes a transparent, scalable standard for LLM code generation evaluation:
- Its modular infrastructure supports rapid benchmarking and fair, unbiased leaderboard construction across languages and frameworks.
- The dual human–automated evaluation paradigm (BigCodeReward and AutoCodeArena) supports RLHF research and accelerates model selection for downstream applications.
- Its methodology foregrounds the role of interactive code execution—pushing the community away from purely static, lexical or superficial metrics toward dynamic, consequence-based model assessment.
Future arenas and reward models will likely need to integrate real execution feedback and rich interactivity to capture human coding preferences reliably. As open LLMs evolve, the BigCodeArena framework provides an execution-grounded standard to track, interpret, and spur advances in code intelligence.
Summary Table: BigCodeArena Components
Component | Functionality | Technical Basis |
---|---|---|
Execution Engine | Sandboxed code execution in 8 environments | Containerization (E2B sandboxes, Docker) |
Evaluation Pipeline | Pairwise human voting plus Elo ranking | Bradley–Terry, Bootstrapping, Sampling |
Benchmarks | BigCodeReward, AutoCodeArena | RLHF calibration, automated judging |
BigCodeArena exemplifies a modern, execution-centric approach to code LLM benchmarking, foregrounding human preferences and dynamic assessment. It marks a significant advance in the reliability and informativeness of competitive code evaluation for both research and practical deployment (Zhuo et al., 9 Oct 2025).