
FAIR: Filtering of Automatically Induced Rules (2402.15472v2)

Published 23 Feb 2024 in cs.LG

Abstract: The availability of large annotated data can be a critical bottleneck in training machine learning algorithms successfully, especially when applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domain-specific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlabeled data. Automatic Rule Induction (ARI) approaches circumvent this problem by automatically creating rules from features on a small labeled set and filtering a final set of rules from them. In the ARI approach, the crucial step is to filter a high-quality, useful subset of rules from the large set of automatically created rules. In this paper, we propose an algorithm, FAIR (Filtering of Automatically Induced Rules), to filter rules from a large number of automatically induced rules using submodular objective functions that account for the collective precision, coverage, and conflicts of the rule set. We experiment with three ARI approaches and five text classification datasets to validate the superior performance of our algorithm with respect to several semi-supervised label aggregation approaches. Further, we show that FAIR achieves statistically significant results in comparison to existing rule-filtering approaches.
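To make the idea concrete, the sketch below shows how greedy selection under a submodular-style objective can trade off a rule set's coverage, precision, and conflicts. This is an illustrative toy only, not the paper's actual FAIR objective: the objective function, the overlap-as-conflict proxy, and the weights `alpha` and `beta` are all assumptions made for the example.

```python
# Illustrative sketch (NOT the paper's exact objective): greedily filter
# rules so the selected set balances coverage, precision, and conflicts.
# Each rule is a pair (estimated_precision, set_of_covered_example_ids).

def coverage(selected, rules):
    """Union coverage of the selected rules (a submodular set function)."""
    covered = set()
    for i in selected:
        covered |= rules[i][1]
    return len(covered)

def objective(selected, rules, alpha=1.0, beta=0.5):
    """Reward coverage and precision; penalize overlap as a conflict proxy."""
    cov = coverage(selected, rules)
    prec = sum(rules[i][0] for i in selected)
    overlap = sum(len(rules[i][1]) for i in selected) - cov
    return cov + alpha * prec - beta * overlap

def greedy_filter(rules, k):
    """Pick up to k rules by largest positive marginal gain."""
    selected = []
    remaining = set(range(len(rules)))
    for _ in range(k):
        base = objective(selected, rules)
        best, best_gain = None, 0.0
        for i in remaining:
            gain = objective(selected + [i], rules) - base
            if best is None or gain > best_gain:
                best, best_gain = i, gain
        if best is None or best_gain <= 0:  # stop when no rule helps
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rules = [
    (0.9, {1, 2, 3}),
    (0.8, {3, 4}),
    (0.5, {1, 2}),      # mostly redundant with rule 0
    (0.95, {5, 6, 7}),
]
print(greedy_filter(rules, 3))  # → [3, 0, 1]; the redundant rule 2 is skipped
```

The greedy loop is the standard heuristic for maximizing monotone submodular functions, which is what makes this style of rule filtering tractable even when the candidate rule set is large.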

