Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models (2508.12461v1)
Abstract: In August 2025, OpenAI released the GPT-OSS models, its first open-weight LLMs since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open-source LLMs ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, including HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open-source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open-source deployments.
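The abstract's statistical validation step can be made concrete with a minimal sketch of a paired McNemar comparison between two models on the same benchmark items. This is an illustration of the general technique, not the authors' actual evaluation harness: the helper `compare_models`, the toy data, and the choice of discordant-pair odds ratio as the effect size are all assumptions introduced here.

```python
# Hedged sketch of a paired McNemar test between two models' per-item
# benchmark outcomes, in the spirit of the paper's validation procedure.
# All names and data below are illustrative assumptions, not the paper's code.
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(correct_a, correct_b):
    """Exact McNemar test on paired per-item correctness (lists of bools)."""
    assert len(correct_a) == len(correct_b), "models must be scored on the same items"
    # 2x2 contingency table indexed by (A incorrect?, B incorrect?):
    # table[0][1] = A correct, B incorrect; table[1][0] = A incorrect, B correct.
    table = [[0, 0], [0, 0]]
    for a, b in zip(correct_a, correct_b):
        table[int(not a)][int(not b)] += 1
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    b01, b10 = table[0][1], table[1][0]  # discordant counts
    # One simple effect size (hypothetical choice): odds ratio over discordant pairs.
    odds_ratio = b01 / b10 if b10 else float("inf")
    return result.pvalue, odds_ratio

# Toy usage with fabricated per-item outcomes (illustrative only).
a = [True, True, False, True, False, True, True, False, True, True]
b = [True, False, False, True, False, True, False, False, True, False]
pvalue, odds = compare_models(a, b)
print(f"McNemar p-value: {pvalue:.3f}, discordant odds ratio: {odds:.2f}")
```

McNemar's test is appropriate here because the two models are scored on the same benchmark items, so only the discordant pairs (items where exactly one model is correct) carry information about which model is stronger.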