On Leakage in Machine Learning Pipelines (2311.04179v2)

Published 7 Nov 2023 in cs.LG and cs.AI

Abstract: Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction, with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage, typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding of the causes of leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.

Authors (12)
  1. Leonard Sasse (3 papers)
  2. Eliana Nicolaisen-Sobesky (1 paper)
  3. Juergen Dukart (2 papers)
  4. Simon B. Eickhoff (11 papers)
  5. Michael Götz (11 papers)
  6. Sami Hamdan (4 papers)
  7. Vera Komeyer (2 papers)
  8. Abhijit Kulkarni (5 papers)
  9. Juha Lahnakoski (2 papers)
  10. Bradley C. Love (19 papers)
  11. Federico Raimondo (5 papers)
  12. Kaustubh R. Patil (10 papers)
Citations (3)