- The paper introduces expected validation performance as a function of computational budget, enabling robust and equitable comparisons of NLP models.
- It demonstrates that relying solely on test-set performance can mask true model improvements and obscure the impact of hyperparameter tuning.
- The approach enhances transparency by detailing computational resources, infrastructure, and performance variability to foster reproducibility in experiments.
Improved Reporting of Experimental Results in NLP
The paper "Show Your Work: Improved Reporting of Experimental Results" addresses significant challenges in the reporting standards of experimental results in NLP research. It calls for a shift from relying solely on test-set performance as the primary metric for model evaluation and superiority claims, underscoring the need for more comprehensive reporting practices. A crucial aspect discussed is the inclusion of expected validation performance relative to computational budget, which offers a more nuanced view of how models perform under varying computational resources.
Key Arguments and Methodology
The authors argue that test-set performance, traditionally the sole evidence offered for model superiority, is an insufficient basis for robust conclusions. They identify mismatched computational budgets across experiments as a critical confound: differences in how much hyperparameter search each model received can masquerade as architectural or methodological advances. The paper therefore introduces a reporting technique, expected validation performance (EVP) as a function of computational budget, which captures the expected performance of the best model configuration found after a given amount of hyperparameter search.
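As a concrete illustration, the sketch below estimates expected best validation performance from a pool of observed hyperparameter-trial scores, computed as the expected maximum of n draws from the empirical score distribution. It is a minimal sketch under that assumption; the function name and the use of NumPy are illustrative choices, not code from the paper.

```python
import numpy as np

def expected_max_performance(val_scores, n):
    """Expected best validation score after n hyperparameter trials,
    estimated as the expected maximum of n i.i.d. draws (with replacement)
    from the empirical distribution of the observed scores."""
    v = np.sort(np.asarray(val_scores, dtype=float))
    N = len(v)
    cdf_leq = np.arange(1, N + 1) / N   # P(V <= v_i) under the empirical CDF
    cdf_lt = np.arange(0, N) / N        # P(V <  v_i) under the empirical CDF
    # Probability that the i-th sorted score is the maximum of n draws.
    prob_max = cdf_leq ** n - cdf_lt ** n
    return float(np.sum(v * prob_max))
```

At n = 1 this reduces to the mean observed score, and as n grows it approaches the best observed score, which is what makes the curve informative across budgets.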
This technique enables more meaningful comparisons of NLP models, where budget refers to the number of hyperparameter search trials or the overall training time. The authors present scenarios in which conclusions about which model is better change with the computational budget: a model that wins under a large tuning budget may lose to a simpler or more stable alternative when only a few trials are affordable. Superiority claims are therefore only meaningful relative to a stated budget, and reporting should reflect the variability and cost of the search; a toy comparison of this kind is sketched below.
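For example, a hypothetical comparison (with made-up validation accuracies, not numbers from the paper) might look like the following, reusing the expected_max_performance helper from the previous sketch:

```python
# Hypothetical validation accuracies from random hyperparameter search.
model_a = [0.78, 0.81, 0.79, 0.85, 0.80, 0.77, 0.83, 0.82]  # low variance
model_b = [0.70, 0.88, 0.72, 0.69, 0.86, 0.71, 0.74, 0.68]  # high variance

for n in (1, 2, 4, 8):
    print(f"budget={n}: A={expected_max_performance(model_a, n):.3f} "
          f"B={expected_max_performance(model_b, n):.3f}")
# With these illustrative numbers, the consistent model A leads at small
# budgets, while the high-variance model B overtakes it at the largest budget.
```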
Empirical Evidence and Findings
In its empirical analyses, the paper highlights recent model comparisons whose conclusions would change under different computational budgets. It also estimates the computational effort required to reach given accuracy levels across several published papers, finding variation that ranges from hours to weeks of computation. These results indicate that conclusions about model superiority can depend heavily on how much computation is invested in hyperparameter tuning.
The researchers recommend including plots of expected validation performance over a range of computational budgets as part of standard reporting, so that subsequent work can assess reproducibility and compare models fairly. They further suggest reporting details such as computing infrastructure, hyperparameter search bounds, and expected performance variability to enhance transparency in experimental NLP research.
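A minimal plotting sketch of such a curve might look like the following, again assuming the expected_max_performance helper and the illustrative score lists defined above:

```python
import matplotlib.pyplot as plt

budgets = range(1, len(model_a) + 1)
for name, scores in [("Model A", model_a), ("Model B", model_b)]:
    # Expected best validation score at each search budget.
    curve = [expected_max_performance(scores, n) for n in budgets]
    plt.plot(list(budgets), curve, marker="o", label=name)

plt.xlabel("Hyperparameter search budget (number of trials)")
plt.ylabel("Expected best validation accuracy")
plt.legend()
plt.savefig("expected_validation_performance.png")
```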
Implications and Future Directions
The call for improved reporting practices has both theoretical and practical implications. Theoretically, it encourages a cultural shift toward more rigorous scientific methodology in machine learning, promoting reproducibility and transparency. Practically, it can guide practitioners toward better-informed model selection and help them allocate resources by identifying where additional computational expenditure yields diminishing returns.
Future developments in AI, especially in how models are trained and evaluated, should adopt these guidelines to avoid skewed assessments of performance. Reporting standards such as these could become integral to evaluating AI methods, facilitating the design of more robust and reliable systems. By institutionalizing such practices, the NLP community moves toward ensuring that reported advances reflect genuine innovation rather than differences in computational budgets.
In conclusion, the paper advocates for a paradigm shift in experimental result reporting in NLP to reflect the intricate interplay between model performance and computational budgets. By addressing this gap, it sets a foundational precedent for the methodological rigor necessary to advance reproducible and equitable AI research.