Effect of substituting SAFE’s evaluator language model
Determine how substituting the evaluator language model within SAFE (for example, replacing GPT-3.5-Turbo with GPT-4) affects SAFE’s annotation performance and the resulting model rankings on LongFact.
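A minimal sketch of how the two effects named above might be quantified, assuming one already has (a) per-fact labels from each evaluator alongside reference labels such as human annotations, and (b) one aggregate score per model under each evaluator. The function names, label strings, model names, and numbers are hypothetical placeholders for illustration, not SAFE's API and not results from the paper.

```python
from itertools import combinations


def agreement_rate(labels, reference):
    """Fraction of facts on which an evaluator's labels match the reference labels."""
    if len(labels) != len(reference):
        raise ValueError("label lists must have the same length")
    return sum(l == r for l, r in zip(labels, reference)) / len(reference)


def ranking(scores):
    """Model names ordered from highest to lowest aggregate score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models.

    Tied pairs count as neither concordant nor discordant (no tie correction).
    """
    models = list(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


if __name__ == "__main__":
    # Placeholder inputs only: hypothetical labels and scores, not measured data.
    reference = ["supported", "supported", "not_supported"]
    evaluator_x = ["supported", "not_supported", "not_supported"]
    print("agreement with reference:", agreement_rate(evaluator_x, reference))

    scores_x = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}
    scores_y = {"model_a": 0.85, "model_b": 0.6, "model_c": 0.7}
    print("ranking under X:", ranking(scores_x))
    print("ranking under Y:", ranking(scores_y))
    print("Kendall tau:", kendall_tau(scores_x, scores_y))
```

A tau near 1 would suggest the substitution leaves the model ranking largely intact even if per-fact agreement shifts; the two measures are kept separate because an evaluator change could move one without moving the other.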
References
Nonetheless, an open question is whether substituting the LLM used in SAFE significantly affects performance.
— Long-form factuality in large language models
(2403.18802 - Wei et al., 27 Mar 2024) in Appendix, SAFE details → Future investigation possibilities → Using other language models (sec:safe-future-investigation-possibilities)