Machine Learning Testing: Survey, Landscapes and Horizons (1906.10742v2)

Published 19 Jun 2019 in cs.LG, cs.AI, cs.SE, and stat.ML

Abstract: This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.

Authors (4)
  1. Jie M. Zhang (39 papers)
  2. Mark Harman (31 papers)
  3. Lei Ma (197 papers)
  4. Yang Liu (2256 papers)
Citations (684)

Summary

  • The paper comprehensively surveys the ML testing landscape by analyzing 144 studies to uncover testing properties and research gaps.
  • The paper evaluates diverse methodologies such as domain-specific synthesis, fuzzing, and metamorphic testing to generate effective ML test cases.
  • The paper highlights key challenges like the test oracle problem and high generation costs, suggesting paths for automation and enhanced reliability.

Overview of "Machine Learning Testing: Survey, Landscapes and Horizons"

The paper "Machine Learning Testing: Survey, Landscapes and Horizons" by Jie M. Zhang et al. presents a comprehensive overview of the field of Machine Learning Testing (ML testing). This survey covers 144 papers, examining various facets of ML testing, including testing properties, components, workflows, and practical applications. It also discusses research trends, dataset usage, and open challenges, culminating in an analysis of current research gaps and promising future directions.

Testing Properties and Components

The paper categorizes ML testing into functional and non-functional properties. Functional aspects like correctness and model relevance are fundamental to ML systems, while non-functional properties such as robustness, fairness, and interpretability are critical for trustworthiness in real-world applications. The discussion on testing components addresses the data, the learning program, and the framework, which are integral to ML system construction and performance. The inherent intricacies and evolving behaviors of machine learning models make testing both challenging and crucial.
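
As an illustration of how a non-functional property such as fairness can be turned into a concrete test, the following is a minimal sketch (not from the paper) of a demographic parity check; the `y_pred` array, the binary `sensitive` attribute, and the 0.25 threshold are all hypothetical assumptions chosen for the example.

```python
import numpy as np

def demographic_parity_gap(y_pred, sensitive):
    """Absolute difference in positive-prediction rates between two groups.

    y_pred:    array of 0/1 predictions from the model under test
    sensitive: array of 0/1 group membership (e.g., a protected attribute)
    """
    rate_a = y_pred[sensitive == 0].mean()
    rate_b = y_pred[sensitive == 1].mean()
    return abs(rate_a - rate_b)

# Hypothetical usage: fail the fairness test if the gap exceeds a chosen threshold.
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
sensitive = np.array([0, 0, 0, 0, 1, 1, 1, 1])
assert demographic_parity_gap(y_pred, sensitive) < 0.25, "fairness property violated"
```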

Methodological Approaches

The authors discuss various techniques for ML test input generation, including domain-specific synthesis, fuzzing, search-based methods, and symbolic execution. These methods aim to produce both adversarial and natural inputs to evaluate ML models effectively. Metamorphic testing is highlighted as a key approach for addressing the test oracle problem, complemented by cross-referencing techniques and model evaluation metrics. A small sketch of a metamorphic test follows below.
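
The sketch below shows the general shape of a metamorphic invariance test: a label-preserving transformation of a source input should not change the model's prediction. It is a generic illustration rather than any specific technique from the surveyed papers; the `model.predict` interface and the `brighten` transform are assumptions made for the example.

```python
import numpy as np

def metamorphic_invariance_test(model, x, transform, n_trials=10):
    """Metamorphic test: a label-preserving transform should not change the prediction.

    model:     object with a predict(batch) -> labels method (assumed interface)
    x:         a single source input, e.g. an image as a numpy array in [0, 1]
    transform: function producing a follow-up input from x
    Returns the trials whose follow-up prediction disagrees with the source prediction.
    """
    source_label = model.predict(x[None, ...])[0]
    violations = []
    for i in range(n_trials):
        follow_up = transform(x)
        follow_label = model.predict(follow_up[None, ...])[0]
        if follow_label != source_label:
            violations.append((i, source_label, follow_label))
    return violations

# Hypothetical transform: a small brightness shift, clipped to the valid pixel range.
def brighten(x, max_shift=0.05):
    return np.clip(x + np.random.uniform(0, max_shift, size=x.shape), 0.0, 1.0)
```

Each metamorphic relation of this kind sidesteps the oracle problem: the expected output of the follow-up input is defined relative to the source output, so no ground-truth label is needed.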

Evaluation Metrics and Criteria

The survey examines test adequacy evaluation, covering criteria such as neuron coverage and mutation testing. These metrics, while inspired by traditional software testing, need adaptation to account for ML systems' unique characteristics. The authors emphasize the importance of understanding how these criteria correlate with fault-revealing capabilities.
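
To make the notion of neuron coverage concrete, here is a minimal sketch, assuming activations have already been extracted from the network under test; the per-input, per-layer scaling and the 0.75 threshold are assumptions in the spirit of coverage-guided DNN testing work, not the exact definition used by any one surveyed paper.

```python
import numpy as np

def neuron_coverage(layer_activations, threshold=0.75):
    """Fraction of neurons whose scaled activation exceeds `threshold`
    on at least one input of the test suite.

    layer_activations: dict mapping layer name -> array of shape
                       (num_inputs, num_neurons), assumed to be collected
                       from the model under test beforehand.
    """
    covered, total = 0, 0
    for acts in layer_activations.values():
        # Scale activations within each layer, per input, to [0, 1].
        lo = acts.min(axis=1, keepdims=True)
        hi = acts.max(axis=1, keepdims=True)
        scaled = (acts - lo) / np.maximum(hi - lo, 1e-12)
        # A neuron counts as covered if any input pushes it above the threshold.
        covered += int(np.any(scaled > threshold, axis=0).sum())
        total += acts.shape[1]
    return covered / total

# Hypothetical usage with random activations for two layers and 100 test inputs.
rng = np.random.default_rng(0)
layers = {"dense_1": rng.normal(size=(100, 64)), "dense_2": rng.normal(size=(100, 32))}
print(f"neuron coverage: {neuron_coverage(layers):.2f}")
```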

Practical Implications and Challenges

Testing in real-world applications like autonomous driving and machine translation is discussed, underscoring the importance of robustness and correctness in deployed systems. The paper also outlines challenges such as test generation cost and the oracle problem, suggesting that future work should focus on automation and improving test reliability.
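
One way the oracle problem is mitigated in application domains such as machine translation is cross-referencing (differential testing): disagreement between independent systems flags suspicious inputs for inspection. The sketch below is an illustrative assumption, not a method from the paper; `translate_a` and `translate_b` stand in for two hypothetical MT systems, and string similarity is only a crude proxy for semantic agreement.

```python
from difflib import SequenceMatcher

def cross_reference_translations(sentences, translate_a, translate_b, min_similarity=0.6):
    """Differential check: flag inputs where two translation systems disagree strongly.

    translate_a / translate_b: callables mapping a source sentence to a target
    sentence (assumed interfaces for two independent MT systems).
    """
    suspicious = []
    for s in sentences:
        out_a, out_b = translate_a(s), translate_b(s)
        similarity = SequenceMatcher(None, out_a, out_b).ratio()
        if similarity < min_similarity:
            suspicious.append((s, out_a, out_b, similarity))
    return suspicious
```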

Future Directions

The authors identify several research opportunities, including the need to explore testing techniques for unsupervised and reinforcement learning, as well as advancing benchmarks specifically for ML testing. They also point out the necessity of tool support and systematic assessments of test adequacy to enhance the development of reliable ML systems.

Conclusion

Zhang et al.'s survey provides a well-structured synthesis of ML testing literature, aiming to align the efforts of software engineering and machine learning researchers towards a more robust and trustworthy paradigm. Their work underscores the challenges and untapped potential within the field of ML testing, offering a pivotal reference for future research endeavors.