Effect of substituting SAFE’s evaluator language model

Determine how substituting the evaluator language model within SAFE (for example, replacing GPT-3.5-Turbo with GPT-4) affects SAFE’s annotation performance and the resulting model rankings on LongFact.

Background

SAFE’s pipeline (splitting a response into individual facts, revising each fact to be self-contained, checking its relevance to the prompt, and verifying it against search results) is driven by an evaluator LLM at every step. The current implementation uses GPT-3.5-Turbo to balance performance and cost.
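The evaluator-driven structure can be sketched as a pipeline parameterized by the evaluator model, which is what makes the substitution question well-posed. This is an illustrative skeleton, not the SAFE implementation: the names (`EvaluatorFn`, `run_safe`, `FactVerdict`) are invented here, and the real system issues LLM calls and Google Search queries where this sketch uses stubs.

```python
# Illustrative sketch of a SAFE-style pipeline with a swappable evaluator.
# All names are hypothetical; the real SAFE steps are LLM- and search-driven.
from dataclasses import dataclass
from typing import Callable, List

# An evaluator is any callable mapping a prompt to a text response.
# In practice this would wrap GPT-3.5-Turbo, GPT-4, or another model.
EvaluatorFn = Callable[[str], str]

@dataclass
class FactVerdict:
    fact: str
    supported: bool

def run_safe(response: str, evaluator: EvaluatorFn) -> List[FactVerdict]:
    """Run the SAFE-style steps using the given evaluator model."""
    # 1. Split the response into individual facts (LLM-driven in SAFE;
    #    naive sentence splitting here).
    facts = [s.strip() for s in response.split(".") if s.strip()]
    verdicts = []
    for fact in facts:
        # 2-3. Make the fact self-contained and check relevance to the prompt.
        if evaluator(f"Is this fact relevant? {fact}").lower() != "yes":
            continue
        # 4. Verify the fact against search results (stubbed as an LLM call).
        supported = evaluator(f"Is this fact supported? {fact}").lower() == "yes"
        verdicts.append(FactVerdict(fact, supported))
    return verdicts

# Trivial stand-in evaluator; substituting a different model wrapper here is
# exactly the intervention whose effect the open question asks about.
def toy_evaluator(prompt: str) -> str:
    return "yes"
```

Because the evaluator appears as a single injected dependency, an ablation only needs to rerun the pipeline with a different wrapper and compare the resulting verdicts.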

The authors note that substituting the evaluator model might materially alter SAFE’s annotation performance and the downstream model rankings, but this has not been tested, primarily due to the cost and latency of rerunning the full evaluation with a larger model.
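One way to make "downstream rankings might materially alter" measurable is to compare the model rankings induced by two evaluators with a rank correlation. The sketch below uses Spearman's rho on hypothetical per-model factuality scores; the scores and variable names are invented for illustration, not results from the paper.

```python
# Sketch: quantify how much model rankings shift when the evaluator changes,
# via Spearman rank correlation between two per-model score lists.

def spearman_rho(xs, ys):
    """Spearman rank correlation for equal-length score lists without ties."""
    n = len(xs)
    # Map each score to its rank (0 = best) under a descending sort.
    rank = lambda vs: {v: r for r, v in enumerate(sorted(vs, reverse=True))}
    rx, ry = rank(xs), rank(ys)
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical factuality scores for four benchmarked models,
# once with a GPT-3.5-Turbo evaluator and once with a GPT-4 evaluator.
scores_gpt35 = [0.91, 0.84, 0.78, 0.65]
scores_gpt4  = [0.93, 0.80, 0.82, 0.60]
print(round(spearman_rho(scores_gpt35, scores_gpt4), 2))  # prints 0.8
```

A rho near 1 would indicate that swapping the evaluator leaves the leaderboard ordering intact even if absolute annotation scores move; a low rho would signal that conclusions drawn from LongFact rankings depend on the evaluator choice.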

References

Nonetheless, an open question is whether substituting the LLM used in SAFE significantly affects performance.

Long-form factuality in large language models (2403.18802 - Wei et al., 27 Mar 2024) in Appendix, SAFE details → Future investigation possibilities → Using other language models (sec:safe-future-investigation-possibilities)