- The paper introduces Atla Selene Mini, a state-of-the-art small LLM-as-a-Judge (SLMJ) that excels across 11 out-of-distribution evaluation benchmarks.
- It is trained with a hybrid loss combining DPO and SFT on carefully curated, synthetically augmented data, which boosts its evaluation accuracy.
- The model demonstrates strong real-world applicability, achieving top rankings on RewardBench and Judge Arena while aligning with expert evaluations.
The paper "Atla Selene Mini: A General Purpose Evaluation Model" introduces Atla Selene Mini, a state-of-the-art Small LLM-as-a-Judge (SLMJ). Atla Selene Mini is specifically designed to evaluate other LLMs across a variety of tasks, outperforming existing SLMJs and even GPT-4o-mini on 11 out-of-distribution benchmarks. These benchmarks include absolute scoring, classification, and pairwise preference tasks. Notably, it is deemed the best 8B generative model on RewardBench, outperforming notable models such as GPT-4o.
Key Contributions and Methods:
- Data Curation and Augmentation: The authors develop a principled data curation strategy that augments existing datasets with synthetically generated critiques, and they enforce data quality through strict filtering and dataset ablations (see the filtering sketch after this list).
- Hybrid Training Loss: Selene Mini is trained with a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) objective. The DPO term widens the likelihood margin between chosen and rejected judgments, while an added negative log-likelihood term on the chosen responses keeps the model anchored to high-quality critiques (see the loss sketch after this list).
- General-purpose Evaluator: The model delivers both critiques and judgments and adapts well to real-world conditions, showing a marked improvement in zero-shot agreement with human expert evaluations, particularly on specialized financial and medical datasets.
- Training Details: The model is fine-tuned from Llama 3.1 8B Instruct on a blend of 16 public datasets totaling 577k data points. Synthetic augmentation adds generated critiques and judgments to these datasets so the model learns to reason about a response before rendering a verdict.
- Robustness and Promptability: Selene Mini retains strong performance under variations in prompt format, and its adaptability to real-world tasks is reflected in improved alignment with expert judgments across domain-specific datasets.
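To make the curation step above concrete, here is a minimal sketch of how such a quality filter could work, assuming a hypothetical helper `critique_fn` (not an interface from the paper) that prompts an LLM to produce a critique and a final judgment for each example; an example is kept only if that judgment matches the ground-truth label.

```python
from typing import Callable, Iterable, Tuple

def filter_synthetic_pairs(
    examples: Iterable[dict],
    critique_fn: Callable[[str, str], Tuple[str, str]],
) -> list:
    """Keep only examples whose synthetically generated judgment agrees
    with the ground-truth label (a simple quality filter).

    critique_fn(prompt, response) wraps whatever LLM call produces a
    (critique, judgment) pair; it is a placeholder for this sketch.
    """
    kept = []
    for ex in examples:
        critique, judgment = critique_fn(ex["prompt"], ex["response"])
        if judgment == ex["label"]:  # discard disagreements with ground truth
            kept.append({**ex, "critique": critique})
    return kept
```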
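The hybrid training loss can likewise be sketched as a standard DPO term plus a negative log-likelihood (SFT) term on the chosen responses. This is only an illustration assuming precomputed sequence log-probabilities; `beta`, `alpha`, and the function name are illustrative choices, not values or code from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_dpo_sft_loss(policy_chosen_logps: torch.Tensor,
                        policy_rejected_logps: torch.Tensor,
                        ref_chosen_logps: torch.Tensor,
                        ref_rejected_logps: torch.Tensor,
                        beta: float = 0.1,
                        alpha: float = 1.0) -> torch.Tensor:
    """DPO loss plus an NLL (SFT) term on the chosen responses.

    Inputs are summed log-probabilities of whole responses, shape (batch,).
    beta and alpha are illustrative hyperparameters.
    """
    # Implicit reward margins of the policy relative to the frozen reference.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO term: push the chosen judgment above the rejected one.
    dpo_loss = -F.logsigmoid(chosen_margin - rejected_margin)

    # SFT term: keep probability mass on the chosen critique and judgment.
    nll_loss = -policy_chosen_logps

    return (dpo_loss + alpha * nll_loss).mean()
```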
Results:
The model's efficacy is underscored by its performance in a live, community-driven "Judge Arena" and by detailed benchmark comparisons against other current models. It achieves an average of 0.648 on absolute scoring tasks and ranks as the top 8B generative model.
- Performance Across Benchmarks: Selene Mini excels across benchmarks such as EvalBiasBench and RewardBench, and it shows reduced susceptibility to biases common in LLM-based evaluation, such as length and positional bias.
- Real-world Applicability: Its evaluation quality is also tested on real-world financial and medical datasets. Measured by accuracy against expert labels, the fine-tuned model shows higher agreement with experts on benchmarks such as FinanceBench and CRAFT-MD.
- Community-driven Evaluation: Selene Mini tops the Judge Arena rankings, ahead of other strong evaluators, reinforcing its standing as an effective general-purpose evaluation tool.
Discussion:
The authors argue that high-quality data curation and the combination of training objectives are central to building capable evaluators, and that the structured curation approach used here yields substantial gains without resorting to larger model architectures. They also discuss implications for future AI evaluation, including agent-based systems and inference-time-compute-driven evaluation, which will call for adaptable, sophisticated evaluators such as Atla Selene Mini.
In summary, Atla Selene Mini is a robust, accurate, and flexible evaluator suited to diverse applications, and a meaningful step forward for automated evaluation of LLMs and broader AI-driven systems.