Generalization of SAFE–human comparison beyond biographies
Establish whether SAFE’s observed advantage over crowdsourced human annotators on the FActScore biography setting generalizes to additional topics covered by the LongFact benchmark.
References
The extent to which our findings on SAFE outperforming human annotators (sec:safe-outperforms-human-annotators) generalize to additional topics is thus still unclear.
— Long-form factuality in large language models
(2403.18802 - Wei et al., 27 Mar 2024) in Appendix, SAFE details → Future investigation possibilities → Ability to generalize to other topics (sec:safe-future-investigation-possibilities)