Validity in Music Information Research Experiments

Published 4 Jan 2023 in cs.SD and eess.AS | (2301.01578v1)

Abstract: Validity is the truth of an inference made from evidence, such as data collected in an experiment, and is central to working scientifically. Given the maturity of the domain of music information research (MIR), validity in our opinion should be discussed and considered much more than it has been so far. Considering validity in one's work can improve its scientific and engineering value. Puzzling MIR phenomena like adversarial attacks and performance glass ceilings become less mysterious through the lens of validity. In this article, we review the subject of validity in general, considering the four major types of validity from a key reference: Shadish et al. 2002. We ground our discussion of these types with a prototypical MIR experiment: music classification using machine learning. Through this MIR experimentalists can be guided to make valid inferences from data collected from their experiments.

Summary

  • The paper argues that explicitly considering experimental validity, including statistical conclusion, internal, construct, and external types, is essential for improving the scientific and engineering rigor of Music Information Research (MIR).
  • It critiques the prevalent Cranfield paradigm, which relies on test collections as user proxies, highlighting limitations and the need for rigorous experimental design to test specific hypotheses and operationalize experiment components effectively.
  • Engaging with validity provides a framework for understanding phenomena like adversarial attacks and performance plateaus in MIR, ultimately enhancing the reproducibility, generalizability, and insights derived from research.

This paper, "Validity in Music Information Research Experiments" (2301.01578), addresses the concept of validity within the MIR field, arguing for its increased consideration to enhance scientific and engineering rigor. The authors posit that phenomena like adversarial attacks and performance glass ceilings in MIR become more understandable when examined through the lens of validity.

Core Arguments on Validity

The paper defines validity as the truth of an inference derived from experimental evidence. It critiques the prevalent Cranfield paradigm in MIR, where computer-based experiments using "test collections" serve as proxies for human users. The authors discuss the trade-off this paradigm makes between cost and replicability on one side and relevance and reliability on the other, and they highlight the limits of judging a system solely by how well it reproduces the ground truth of a dataset. They emphasize the necessity of rigorous experimental design to test well-defined hypotheses, operationalizing the components of an experiment (units, treatments, design, observations, and settings) to optimize quality and minimize cost. Although the community has repeatedly called for methodological improvements, the authors note that it has not engaged with validity systematically.

Four Types of Validity

The paper structures its analysis around the four principal types of validity, drawing from Shadish et al. (2002).

Statistical Conclusion Validity

This concerns the validity of inferences regarding the covariation between variables, specifically treatment and outcome. Threats to statistical conclusion validity include violations of statistical test assumptions, small sample sizes leading to insufficient statistical power, the practice of "p-hacking," and heterogeneity within the experimental units.

The paper grounds this in a prototypical MIR experiment: music classification using ML on the BALLROOM dataset. It asks whether the results are statistically significant (e.g., whether the ML models outperform random chance). It underscores the importance of employing appropriate statistical tests and of evaluating whether observed statistical significance translates into practical relevance for actual users.
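The comparison against chance can be illustrated with a one-sided exact binomial test. This is a minimal sketch, not the test used in the paper: the function name, the counts, and the chance level of 1/8 are all hypothetical, and the test assumes independent trials with a fixed success probability under the null hypothesis.

```python
from math import comb


def binomial_p_value(correct: int, total: int, chance: float) -> float:
    """One-sided exact binomial test.

    Returns P(X >= correct) for X ~ Binomial(total, chance), i.e. the
    probability of seeing at least this many correct answers if the
    model were only guessing with success probability `chance`.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical numbers for illustration only: 40 of 100 test excerpts
# classified correctly, with an assumed chance level of 1/8.
p = binomial_p_value(correct=40, total=100, chance=1 / 8)
print(f"p-value against chance: {p:.3g}")
```

A small p-value here only supports the claim that the model is not guessing uniformly. When classes are imbalanced, a majority-class baseline is a stricter reference than 1/8, and neither result says anything about usefulness to actual users.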

Internal Validity

Internal validity focuses on establishing whether the observed covariation between variables reflects a causal relationship. A primary threat to internal validity is confounding, where the treatment effect is conflated with other factors due to inadequate operationalization.

Applied to music classification, the question becomes which factors cause the trained ML models to respond in ways that are inconsistent with random selection. The paper explores potential confounds such as tempo and instrumentation. It introduces an intervention that time-dilates the test recordings, which reveals that the models rely on tempo. Despite this, the paper concludes that the models have learned aspects of the dataset beyond mere rhythm.
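The logic of such an intervention can be sketched with a toy example. Everything here is hypothetical and is not the paper's procedure: recordings are reduced to lists of onset times, the "model" is a stand-in that labels by tempo alone, and the 120 BPM threshold is arbitrary. The point is only that a model which depends on tempo changes its predictions when nothing but the time axis is stretched.

```python
from statistics import median


def dilate(onsets, factor):
    """Stretch onset times (in seconds); factor > 1 slows the excerpt down."""
    return [t * factor for t in onsets]


def estimate_bpm(onsets):
    """Estimate tempo from the median inter-onset interval."""
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    return 60.0 / median(intervals)


def toy_tempo_classifier(onsets):
    """Hypothetical stand-in for a trained model: it uses tempo only."""
    return "fast" if estimate_bpm(onsets) >= 120.0 else "slow"


def flip_rate(model, recordings, factor):
    """Fraction of recordings whose predicted label changes after dilation."""
    flips = sum(model(r) != model(dilate(r, factor)) for r in recordings)
    return flips / len(recordings)


# Two synthetic excerpts with one onset per beat: 130 BPM and 100 BPM.
recordings = [
    [i * 60.0 / 130.0 for i in range(16)],
    [i * 60.0 / 100.0 for i in range(16)],
]
rate = flip_rate(toy_tempo_classifier, recordings, factor=1.2)
print(f"labels changed by dilation: {rate:.0%}")
```

With a real system, `dilate` would be replaced by a time-stretch of the audio and `toy_tempo_classifier` by the trained model under study. A high flip rate under small dilations suggests that tempo is confounded with whatever the labels are meant to capture.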

Construct Validity

Construct validity addresses inferences about the higher-order constructs that the particulars of an experiment are meant to represent. In essence, it asks whether what is actually measured corresponds to what was intended to be measured. The main threat is a tenuous link between the two.

In the music classification example, the paper questions whether classification accuracy on a labeled dataset truly reflects the intended construct (e.g., rhythm recognition). The construction of the BALLROOM dataset, which is based on different ballroom dance rhythms, is examined for its impact on the validity of construct inferences. The paper suggests modifying the experiment itself to ensure that the measured ability genuinely reflects rhythm recognition capabilities.

External Validity

External validity concerns the extent to which a causal relationship holds across variations in experimental units, settings, treatment variables, and measurement variables. It pertains to the generalizability of a causal inference.

The paper questions whether models can generalize rhythm recognition beyond the BALLROOM dataset, particularly since prior analysis indicated that models were not truly recognizing rhythm within that dataset. The extended BALLROOM dataset (X-BALLROOM) is introduced to assess whether model performance generalizes to recordings from the same source that were gathered a decade later.
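One simple way to check whether accuracy differs between two test collections is a two-proportion z-test. This is a hedged sketch rather than the paper's analysis: the function name and all counts are hypothetical, and the test assumes two independent samples that are large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Two-sided z-test for a difference between two accuracies.

    Returns (z, p_value). Assumes independent samples and a pooled
    accuracy strictly between 0 and 1 (otherwise the standard error is 0).
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts for illustration only: the same model evaluated
# on an original test set (a) and on a later collection (b).
z, p_value = two_proportion_z_test(
    correct_a=90, total_a=100, correct_b=60, total_b=100
)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

A significant drop in accuracy on the newer collection would count against generalization. A non-significant difference does not establish it, since both collections may share the same confounds.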

Summary

In summary, the paper advocates for a more thorough engagement with the different facets of validity to bolster the rigor, reproducibility, and generalizability of MIR research, thereby enriching the insights derived about music and information.
