Generalization of SAFE–human comparison beyond biographies

Establish whether SAFE’s observed advantage over crowdsourced human annotators on the FActScore biography setting generalizes to additional topics covered by the LongFact benchmark.

Background

SAFE (Search-Augmented Factuality Evaluator) was compared with crowdsourced human annotations on the FActScore dataset, which consists of biographical content verified against Wikipedia.

Although the paper applies SAFE to diverse topics in LongFact, the direct SAFE-versus-human comparison was limited to the biography domain, leaving uncertainty about broader generalization.

References

The extent to which our findings in \cref{sec:safe-outperforms-human-annotators} generalize to additional topics is thus still unclear.

— Long-form factuality in large language models (2403.18802 - Wei et al., 27 Mar 2024) in Appendix, SAFE details → Future investigation possibilities → Ability to generalize to other topics (sec:safe-future-investigation-possibilities)
