Robust verifiers for domains without automatic verification

Develop robust verification algorithms that reliably identify correct solutions among large collections of samples generated by large language models on tasks that lack automatic verifiers, such as the math word problem datasets GSM8K and MATH. Such algorithms must overcome the limitations of existing selection methods, like majority voting and reward-model scoring, whose success does not improve with large sample budgets.

Background

The paper shows that repeated sampling substantially increases coverage (the fraction of problems for which at least one generated sample is correct) across tasks, including math word problems where automatic verifiers are unavailable. On GSM8K and MATH, coverage with Llama-3 models grew to over 95% with 10,000 samples, but the common methods for selecting a correct answer from many samples, majority voting and reward models, plateaued beyond roughly 100 samples and failed to improve with larger budgets.
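As a concrete note on the metric: coverage with k samples corresponds to pass@k. A standard unbiased estimator, introduced by Chen et al. (2021) and widely used in this line of work, computes it from n total samples of which c are correct; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total samples of which
    c are correct, is correct.  Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10,000 samples, 500 correct, evaluated at k = 100.
print(f"{pass_at_k(10_000, 500, 100):.3f}")
```

Averaging this quantity over all problems in a benchmark yields the coverage curves reported in the paper.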

This disconnect between rising coverage and stagnant success rates under mainstream selection methods indicates a critical need for better verification techniques that can identify rare correct samples amid many incorrect ones in unstructured domains, motivating the open problem of building robust verifiers.
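A toy simulation makes the disconnect concrete. All numbers below (the 5% per-sample correctness rate, the spread of wrong answers over five distinct values) are hypothetical illustrations, not figures from the paper: when wrong answers cluster on values more common than the correct one, coverage approaches 100% as the sample budget grows while majority voting stalls.

```python
import random
from collections import Counter

random.seed(0)

def simulate(p_correct: float, k: int, n_problems: int = 1000):
    """Draw k sampled answers per problem; each sample is correct with
    probability p_correct, otherwise one of 5 distinct wrong answers.
    Returns (coverage, majority-vote accuracy) over n_problems."""
    covered = 0
    majority_correct = 0
    for _ in range(n_problems):
        samples = [
            "correct" if random.random() < p_correct
            else f"wrong-{random.randrange(5)}"
            for _ in range(k)
        ]
        if "correct" in samples:
            covered += 1
        top_answer, _ = Counter(samples).most_common(1)[0]
        if top_answer == "correct":
            majority_correct += 1
    return covered / n_problems, majority_correct / n_problems

for k in (1, 10, 100, 1000):
    cov, mv = simulate(p_correct=0.05, k=k)
    print(f"k={k:5d}  coverage={cov:.2f}  majority-vote={mv:.2f}")
```

In this regime, each distinct wrong answer is individually more likely than the correct one, so majority voting selects a wrong answer ever more reliably as k grows, even as coverage climbs toward 1. A robust verifier would need to recognize the rare correct sample directly rather than rely on answer frequency.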

References

These results highlight that building robust verifiers remains an open problem.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2407.21787 - Brown et al., 31 Jul 2024) in Introduction (Section 1)