A study compares GPT-4 with other large language models (LLMs) on complex reasoning tasks spanning mathematics, science, symbolic reasoning, knowledge, and coding.
GPT-4 outperforms the other models on the GSM8K and MMLU benchmarks, while the 65B-parameter LLaMA model comes close to the performance of text-davinci-002 and code-davinci-002.