
Training Complex Models with Multi-Task Weak Supervision (1810.02840v2)

Published 5 Oct 2018 in stat.ML and cs.LG

Abstract: As machine learning models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlated labels, and may label different tasks or apply at different levels of granularity. We propose a framework for integrating and modeling such weak supervision sources by viewing them as labeling different related sub-tasks of a problem, which we refer to as the multi-task weak supervision setting. We show that by solving a matrix completion-style problem, we can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model. Theoretically, we show that the generalization error of models trained with this approach improves with the number of unlabeled data points, and characterize the scaling with respect to the task and dependency structures. On three fine-grained classification problems, we show that our approach leads to average gains of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a majority vote baseline, and 4.1 points over a previously proposed weak supervision method that models tasks separately.

Authors (6)
  1. Alexander Ratner (24 papers)
  2. Braden Hancock (12 papers)
  3. Jared Dunnmon (14 papers)
  4. Frederic Sala (55 papers)
  5. Shreyash Pandey (4 papers)
  6. Christopher Ré (194 papers)
Citations (204)

Summary

The proliferation of complex machine learning models has highlighted a significant challenge: the need for large-scale, hand-labeled training sets. Hand-labeling is labor-intensive, costly, and often infeasible when domain expertise is required. In response, many practitioners have turned to weak supervision, which leverages noisier but cheaper labels from diverse sources such as knowledge bases, heuristic patterns, and crowdsourced annotations. However, these sources have diverse and unknown accuracies, may output correlated labels, and may label different tasks or apply at different levels of granularity, which complicates their integration.

This paper introduces MeTaL, a framework that addresses these complexities by modeling weak supervision sources as labelers of different related sub-tasks of a problem, a setting the authors term multi-task weak supervision. The approach recovers the accuracies of these sources without any labeled data by solving a matrix completion-style problem that exploits the sources' dependency structure and cross-task signals, yielding higher-quality supervision for training the end model.
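To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's algorithm) of why source accuracies can be recovered without labels. It assumes a drastically simplified setting: a single binary task, balanced classes, conditionally independent sources, and no abstentions. In that case the off-diagonal of the observable second-moment matrix of the votes factors through the unknown accuracies, which is the kind of structure the paper's matrix completion-style estimator generalizes to multi-task sources with known dependencies.

```python
import numpy as np

# Toy, hypothetical setup: m weak sources vote {-1, +1} on n unlabeled points.
# Under the simplifying assumptions above, the off-diagonal entries of the
# second-moment matrix O = L^T L / n satisfy O_ij ~= a_i * a_j, where
# a_i = E[lambda_i * y] = 2 * accuracy_i - 1 is unobserved. Recovering the a_i
# from the observable O is thus a rank-one special case of the paper's
# label-free accuracy estimation.
rng = np.random.default_rng(0)
n, m = 50_000, 4
y = rng.choice([-1, 1], size=n)                     # latent labels, used only to simulate votes
true_acc = np.array([0.85, 0.75, 0.70, 0.60])       # P(lambda_i = y)
L = np.where(rng.random((n, m)) < true_acc, y[:, None], -y[:, None])

O = (L.T @ L) / n                                   # observable second-moment matrix

# "Triplet" recovery of each a_i from pairwise agreement rates, without labels:
# a_i^2 = O_ij * O_ik / O_jk for any distinct j, k.
a_est = np.zeros(m)
for i in range(m):
    j, k = [idx for idx in range(m) if idx != i][:2]
    a_est[i] = np.sqrt(O[i, j] * O[i, k] / O[j, k])

est_acc = (a_est + 1) / 2                           # map back to P(lambda_i = y)
print("true accuracies:", true_acc)
print("estimated      :", np.round(est_acc, 3))
```

The same identity can instead be fit by minimizing a reconstruction objective with gradient descent, which is closer in spirit to how the estimation is framed in practice; a sketch of that appears further below.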

The theoretical contributions elucidate how the generalization error of the end model decreases with the number of unlabeled data points, and characterize its scaling with respect to the task and dependency structures. The empirical results are compelling: across three fine-grained classification tasks, the framework achieved an average gain of 20.2 percentage points in accuracy over a traditional supervised approach, 6.8 points over a majority-vote baseline, and 4.1 points over a previously proposed weak supervision method that models tasks separately. In addition, explicitly modeling unipolar sources, i.e., sources that label only a single class, contributes an average accuracy gain of 2.8 points.
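Schematically, the headline scaling result can be written as follows; this is a stylized statement with problem-dependent constants suppressed, not the paper's exact bound. The point is that the end model's excess risk shrinks at the familiar inverse-square-root rate, but in the number of unlabeled points n rather than labeled ones.

```latex
% Stylized form of the scaling result (problem-dependent constants omitted):
% the generalization error of the end model trained on the recovered labels
% decreases as O(1/sqrt(n)) in the number n of unlabeled data points.
\mathbb{E}\left[ R(\hat{f}_n) - R(f^{*}) \right] \;=\; \tilde{O}\!\left(\tfrac{1}{\sqrt{n}}\right)
```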

From a practical standpoint, MeTaL offers a scalable and straightforward implementation: the accuracy-estimation problem can be solved with standard stochastic gradient descent libraries such as PyTorch, and the method significantly reduces runtime compared to previous approaches. This matters in practice because it lets practitioners integrate and benefit from a wide array of weak supervision sources without requiring extensive computational resources.
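As an illustration of that point, here is a toy sketch of fitting the same simplified rank-one model from the earlier example with off-the-shelf SGD in PyTorch. The function name, hyperparameters, and objective below are hypothetical simplifications; the paper's actual estimator additionally encodes the multi-task label space and the known dependency structure.

```python
import torch

# Toy sketch: fit the rank-one model a a^T to the off-diagonal of the observed
# second-moment matrix with plain SGD. This mirrors the spirit of "solve the
# estimation with a standard SGD library"; it is not the paper's objective.

def estimate_accuracies(L: torch.Tensor, steps: int = 2000, lr: float = 0.05) -> torch.Tensor:
    """L: (n, m) float tensor of {-1, +1} votes from m weak sources on n unlabeled points."""
    n, m = L.shape
    O = (L.T @ L) / n                              # observable second-moment matrix
    mask = ~torch.eye(m, dtype=torch.bool)         # fit off-diagonal entries only
    a = torch.full((m,), 0.5, requires_grad=True)  # a_i = 2 * accuracy_i - 1
    opt = torch.optim.SGD([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (((torch.outer(a, a) - O) * mask) ** 2).sum()
        loss.backward()
        opt.step()
    a = a.detach()
    if a.mean() < 0:                               # resolve the global sign ambiguity
        a = -a
    return (a + 1) / 2                             # back to P(lambda_i = y)

# Example usage with the simulated votes L from the earlier sketch:
# est = estimate_accuracies(torch.from_numpy(L).float())
```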

The broader implications of this research are twofold. Theoretically, it provides a structured approach to modeling multi-task weak supervision that is both analytically tractable and practicable. Practically, it offers a robust method for building training sets from weak supervision, making the training of complex models more accessible and less reliant on exhaustive hand-labeled data. As machine learning models continue to grow in complexity and breadth of application, frameworks like MeTaL pave the way for more efficient and effective model training. Future work might explore automatically learning the dependency structure or extending the approach to additional weak supervision settings. As the machine learning landscape evolves, so too must the methodologies that support it, and this framework represents a step toward more adaptable and scalable model training.