
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution (2410.16256v1)

Published 21 Oct 2024 in cs.CL and cs.AI

Abstract: Efficient and accurate evaluation is crucial for the continuous improvement of LLMs. Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.

Overview of CompassJudger-1: An Advanced Open-Source Judge Model

The evaluation of LLMs remains a significant challenge in the AI research community, particularly in effectively aligning model performance with human preferences. The paper outlines CompassJudger-1, an open-source LLM designed to address these evaluation challenges. This model serves as an all-encompassing judge capable of model scoring, comparative evaluation, critique generation, and diverse task execution. Furthermore, the paper introduces JudgerBench, a comprehensive benchmark for evaluating the effectiveness of different judge models in subjective scenarios.
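
In practice, an all-in-one judge model is driven entirely through prompting: the question, the candidate responses, and the judging instructions are packed into one prompt, and the verdict is parsed from the generated text. The sketch below illustrates this pattern for a pairwise comparison using Hugging Face transformers; the model identifier and prompt wording are illustrative assumptions rather than the authors' released prompt templates (the official checkpoints and templates are listed in the GitHub repository).

```python
# Minimal pairwise-judging sketch with Hugging Face transformers.
# The model id and prompt wording are illustrative assumptions; consult the
# CompassJudger GitHub repository for the released checkpoints and the
# official prompt templates.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "opencompass/CompassJudger-1-7B-Instruct"  # assumed name, check the repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of two answers better addresses the question."""
    prompt = (
        "You are a strict evaluator. Compare the two answers to the question "
        "and reply with 'A', 'B', or 'Tie', followed by a brief justification.\n\n"
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (the verdict and justification).
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

print(judge_pair("What causes tides?", "The Moon's gravity.", "Wind patterns."))
```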

Core Contributions

  1. All-in-One LLM Evaluation:
    • CompassJudger-1 exemplifies a versatile LLM with robust judging capabilities. It performs functions traditionally associated with reward models, while also handling complex critique tasks.
  2. Comprehensive Benchmarking:
    • JudgerBench offers a nuanced testing environment allowing for evaluating judge models across various dimensions, including alignment with human evaluations and critique proficiency.

Data Collection and Training

The paper underscores the importance of high-quality data for effective model training. Training data for CompassJudger-1 encompasses multiple sources:

  • Public Judge Data: Existing judge datasets such as PandaLM and AlpacaFarm, re-evaluated with capable models such as Qwen-2.5-72B to ensure relevance.
  • Reward Data: Integrated in balanced proportions to bolster the model’s judgment capabilities while avoiding overfitting.
  • Self-Collected Data: Subjective evaluations gathered during iterative model development stages, a pragmatic approach to expanding the dataset.

Through extensive data filtering, categorization, and sampling strategies, the authors ensure a balanced dataset that enhances both the generalization and specificity of CompassJudger-1.
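
The summary does not reproduce the exact filtering pipeline, but the general pattern of deduplicating records and capping the number of examples per category can be sketched as follows; the field names and the per-category cap are hypothetical placeholders, not values from the paper.

```python
# Hypothetical sketch of category-balanced sampling for judge training data.
# Field names ("category", "prompt") and the per-category cap are assumptions
# for illustration, not values reported in the paper.
import random
from collections import defaultdict

def balance_by_category(records, per_category_cap=5000, seed=0):
    """Deduplicate by prompt text, then cap the number of examples per category."""
    rng = random.Random(seed)
    seen_prompts = set()
    buckets = defaultdict(list)
    for rec in records:
        if rec["prompt"] in seen_prompts:
            continue  # drop exact duplicates
        seen_prompts.add(rec["prompt"])
        buckets[rec["category"]].append(rec)
    balanced = []
    for category, items in buckets.items():
        rng.shuffle(items)
        balanced.extend(items[:per_category_cap])
    rng.shuffle(balanced)
    return balanced
```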

Training and Ablation Studies

Training uses the XTuner framework, and the balance of critique, reward, and general SFT data is studied through ablations to optimize the model's performance:

  • Optimal Data Ratios: Ablation studies identify a 1:3:1 critique:reward:SFT ratio as the best mix of training data, balancing judging ability against general capability (a minimal mixing sketch follows this list).
  • Impact of G-SFT Data: Incorporating general SFT data reinforces the model's universality; even a small proportion helps maintain performance across varied tasks.
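
As a rough illustration of what a 1:3:1 critique:reward:SFT ratio means when assembling a training set, the sketch below resamples three data pools to that ratio before shuffling them together; the resampling procedure itself is an assumption, since the paper reports only the ratio.

```python
# Illustrative sketch of composing a training mixture at a 1:3:1
# critique : reward : general-SFT ratio. The resampling strategy here is an
# assumption; only the ratio comes from the paper.
import random

def mix_datasets(critique, reward, general_sft, ratio=(1, 3, 1), seed=0):
    """Subsample the three pools so their sizes follow the given ratio, then shuffle."""
    rng = random.Random(seed)
    pools = [list(critique), list(reward), list(general_sft)]
    # Largest number of ratio "units" that every pool can supply.
    unit = min(len(pool) // r for pool, r in zip(pools, ratio))
    mixture = []
    for pool, r in zip(pools, ratio):
        rng.shuffle(pool)
        mixture.extend(pool[: unit * r])
    rng.shuffle(mixture)
    return mixture
```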

Evaluation on JudgerBench

The evaluation against JudgerBench, comprising both Arena and Benchmark components, substantiates CompassJudger-1's capabilities:

  • Alignment with Human Preferences: On the arena-style JDB-A tasks, the model's verdicts show high agreement with human judgments (see the sketch after this list).
  • Critique and Format Adherence: On JDB-B, the model produces detailed critiques and adheres closely to the specified evaluation formats.
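
Arena-style alignment with humans is typically scored as simple verdict agreement: the fraction of comparisons where the judge's choice matches the human label. A minimal version of that computation is sketched below; the "A"/"B"/"Tie" label convention is an assumed data format, not JudgerBench's exact schema.

```python
# Minimal sketch of arena-style agreement between a judge model and humans.
# The "A" / "B" / "Tie" labels are an assumed convention, not the exact
# JudgerBench schema.
def judge_human_agreement(judge_verdicts, human_verdicts):
    """Fraction of comparisons where the judge's verdict matches the human label."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

print(judge_human_agreement(["A", "B", "Tie", "A"], ["A", "B", "A", "A"]))  # 0.75
```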

Comparative Analysis

In comparative testing with models such as Qwen and GPT-4o, CompassJudger-1 demonstrates strong generalizability and robustness, scoring well on the JudgerBench metrics and positioning itself as a credible open-source alternative to GPT-based evaluation.

Implications and Future Prospects

CompassJudger-1’s development addresses pivotal gaps in existing judge models by providing a flexible, all-encompassing solution that enhances subjective evaluations. This open-source contribution, coupled with JudgerBench, offers researchers tools to advance LLM evaluation methodologies, ultimately fostering innovation in AI assessment protocols. Future exploration may focus on further enhancing integration capabilities and expanding training sets to include more diverse evaluation scenarios.

The introduction of CompassJudger-1 and JudgerBench illustrates a significant step forward in creating versatile, accessible tools for LLM evaluation, supporting ongoing advancements in AI technology and evaluation strategies.

References (24)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. InternLM2 technical report. arXiv preprint arXiv:2403.17297, 2024.
  3. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132, 2024.
  4. OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023a.
  5. XTuner Contributors. XTuner: A toolkit for efficiently fine-tuning LLM. https://github.com/InternLM/xtuner, 2023b.
  6. UltraFeedback: Boosting language models with high-quality feedback, 2023.
  7. RLHF workflow: From reward modeling to online RLHF, 2024.
  8. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024a.
  9. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024b.
  10. CritiqueLLM: Scaling LLM-as-critic for effective and explainable evaluation of large language model generation. arXiv preprint arXiv:2311.18702, 2023.
  11. The BiGGen Bench: A principled benchmark for fine-grained evaluation of language models with language models. arXiv preprint arXiv:2406.05761, 2024.
  12. RewardBench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.
  13. CriticBench: Evaluating large language models as critic. arXiv preprint arXiv:2402.13764, 2024.
  14. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470, 2023.
  15. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.
  16. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770, 2024.
  17. AlignBench: Benchmarking Chinese alignment of large language models. arXiv preprint arXiv:2311.18743, 2023.
  18. OffsetBias: Leveraging debiased data for tuning evaluators, 2024.
  19. Skywork critic model series. https://huggingface.co/Skywork, September 2024.
  20. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023.
  21. FoFo: A benchmark to evaluate LLMs' format-following capability. arXiv preprint arXiv:2402.18667, 2024.
  22. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
  23. LLMEval-2, July 2023. URL https://github.com/llmeval/llmeval-2.
  24. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631, 2023.
Authors (6)
  1. Maosong Cao
  2. Alexander Lam
  3. Haodong Duan
  4. HongWei Liu
  5. Songyang Zhang
  6. Kai Chen