ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision (2204.06863v4)
Abstract: A cost-effective alternative to manual data labeling is weak supervision (WS), where data samples are automatically annotated using a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the associated classes. In this work, we investigate noise reduction techniques for WS based on the principle of k-fold cross-validation. We introduce a new algorithm, ULF (Unsupervised Labeling Function correction), which denoises WS data by leveraging models trained on all but some LFs to identify and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. Evaluation on multiple datasets confirms ULF's effectiveness in enhancing WS learning without the need for manual labeling.
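The idea of training on all but some LFs can be illustrated with a minimal sketch of the data split alone. This is not the authors' implementation: the function name `lf_cross_validation_splits`, the binary match matrix `Z`, and the rule that a sample goes to the held-out set as soon as any held-out LF matches it are all assumptions made for illustration. Model training and the re-estimation of the LF-to-class assignment are not shown.

```python
import random


def lf_cross_validation_splits(Z, k, seed=0):
    """Yield (held_out_lfs, train_idx, test_idx) for k folds over labeling functions.

    Z is a list of rows; Z[i][j] is 1 if LF j matched sample i, else 0.
    Assumption: the folds partition the LFs, not the samples.
    """
    lf_ids = list(range(len(Z[0])))
    random.Random(seed).shuffle(lf_ids)
    # Deal the shuffled LF ids into k folds round-robin.
    folds = [lf_ids[f::k] for f in range(k)]

    for held_out in folds:
        held = set(held_out)
        train_idx, test_idx = [], []
        for i, row in enumerate(Z):
            matched = {j for j, hit in enumerate(row) if hit}
            if not matched:
                continue  # assumption: samples without any LF match are skipped
            if matched & held:
                test_idx.append(i)   # matched by at least one held-out LF
            else:
                train_idx.append(i)  # matched only by the remaining LFs
        yield sorted(held_out), train_idx, test_idx


if __name__ == "__main__":
    # Toy match matrix: 5 samples, 4 LFs (hypothetical data).
    Z = [
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
        [0, 0, 0, 0],
    ]
    for held_out, train_idx, test_idx in lf_cross_validation_splits(Z, k=2):
        print(held_out, train_idx, test_idx)
```

A model fitted on `train_idx` has never seen the held-out LFs, so under these assumptions its predictions on `test_idx` can disagree with those LFs, and such disagreements are the signal a correction step would use.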