Robust verifiers for domains without automatic verification
Develop robust verification algorithms that reliably identify correct solutions among large collections of samples generated by large language models, for tasks that lack automatic verifiers, such as the math word problem datasets GSM8K and MATH. Such algorithms must overcome the limitations of majority voting and reward-model scoring, whose accuracy plateaus as the sample budget grows instead of scaling with it.
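As a rough illustration of the gap, the following Python sketch implements the two baseline selection methods named above, assuming each sample has already been reduced to a final-answer string. The function names, the toy answers, and the score table are hypothetical and are not taken from the paper; in practice the scorer would be a learned reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Pick the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Pick the answer with the highest score from a (hypothetical) scorer."""
    return max(answers, key=score)


def is_covered(answers: Sequence[str], correct: str) -> bool:
    """True if at least one sample is correct (what an oracle verifier could find)."""
    return correct in answers


# Toy data (made up for illustration): the correct answer appears only once.
samples = ["18", "20", "20", "18", "20", "24"]
correct_answer = "24"

# Stand-in for a reward model: a fixed, imperfect score table (assumption).
toy_scores = {"18": 0.4, "20": 0.7, "24": 0.6}

voted = majority_vote(samples)
scored = best_of_n(samples, lambda a: toy_scores[a])
covered = is_covered(samples, correct_answer)

print(voted, scored, covered)
```

In this toy case the correct answer is among the samples, yet both selection rules return a wrong one. Drawing more samples can raise the chance that a correct answer is present, but it does not help a selection rule that cannot recognize it.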
References
These results highlight that building robust verifiers remains an open problem.
— Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
(2407.21787 - Brown et al., 31 Jul 2024) in Introduction (Section 1)