
Explain Claude-3-Sonnet’s parity with Claude-3-Opus on long-form factuality

Determine why Claude-3-Sonnet achieves long-form factuality comparable to Claude-3-Opus despite being a smaller model, when evaluated on LongFact-Objects using the SAFE (Search-Augmented Factuality Evaluator) pipeline and aggregated with F1@K.


Background

The paper benchmarks thirteen LLMs across four families (Gemini, GPT, Claude, PaLM-2) on the LongFact-Objects prompts, evaluates responses using SAFE, and aggregates results via F1@K.
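The F1@K aggregation combines factual precision (fraction of a response's facts that SAFE labels supported) with recall against a target number of supported facts K. A minimal sketch, with the function name and example numbers chosen for illustration:

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Harmonic mean of factual precision and recall@K for one response.

    supported / not_supported: fact counts as labeled by SAFE.
    k: the desired number of supported facts (the recall target).
    """
    total = supported + not_supported
    if total == 0 or supported == 0:
        return 0.0  # no supported facts yields F1 = 0
    precision = supported / total
    recall = min(supported / k, 1.0)  # capped at 1 once K supported facts are reached
    return 2 * precision * recall / (precision + recall)

# Example: 80 supported facts, 20 unsupported, K = 64
# precision = 0.8, recall = 1.0, so F1 = 2*0.8*1.0/1.8 ≈ 0.889
print(f1_at_k(80, 20, 64))
```

Per-model scores on the benchmark are then means of this quantity over the LongFact-Objects prompts.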

While larger models generally show better long-form factuality, the authors observe that Claude-3-Sonnet (a smaller model) performs similarly to Claude-3-Opus (a larger model) on this benchmark but, lacking further details about the models, cannot explain why.

References

Notably, we found that Claude-3-Sonnet achieves similar long-form factuality as Claude-3-Opus despite being a smaller model, but without access to further details about these models, it is unclear why this was the case.

Long-form factuality in large language models (2403.18802 - Wei et al., 27 Mar 2024) in Section 6, Larger LLMs are more factual (sec:main-results)