Document Set Expansion with Positive-Unlabeled Learning: A Density Estimation-based Approach (2401.11145v1)
Abstract: Document set expansion (DSE) aims to identify relevant documents in a large collection, given a small set of documents on a fine-grained topic. Previous work shows that positive-unlabeled (PU) learning is a promising method for this task. However, some serious issues remain unresolved, namely the typical challenges that PU methods suffer from, such as an unknown class prior and imbalanced data, and the need for transductive experimental settings. In this paper, we propose a novel PU learning framework based on density estimation, called puDE, that can handle these issues. The advantage of puDE is that it is neither constrained by the SCAR (selected completely at random) assumption nor requires any class prior knowledge. We demonstrate the effectiveness of the proposed method on a series of real-world datasets and conclude that it is a better alternative for the DSE task.
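The abstract does not describe how puDE works internally, so the following is only a minimal sketch of the general idea of ranking unlabeled documents by estimated density, not the authors' method. It assumes documents are already embedded as small numeric vectors, uses a plain Gaussian kernel density estimate, and scores each unlabeled document by the log-ratio of the density under the positive seeds to the density under the unlabeled pool. The names `log_kde` and `rank_unlabeled`, the bandwidth value, and the toy vectors are all hypothetical. Note that this score needs no class prior.

```python
import math

def log_kde(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate at point x (illustrative only)."""
    d = len(x)
    log_norm = -0.5 * d * math.log(2 * math.pi * bandwidth ** 2)
    log_kernels = []
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
        log_kernels.append(log_norm - sq_dist / (2 * bandwidth ** 2))
    # Log-sum-exp for numerical stability, then average over the samples.
    m = max(log_kernels)
    return m + math.log(sum(math.exp(v - m) for v in log_kernels)) - math.log(len(samples))

def rank_unlabeled(positives, unlabeled, bandwidth=0.5):
    """Rank unlabeled points by log p_positive(x) - log p_unlabeled(x), highest first."""
    scored = []
    for i, x in enumerate(unlabeled):
        score = log_kde(x, positives, bandwidth) - log_kde(x, unlabeled, bandwidth)
        scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored]

# Toy data (assumed): three seed documents near the origin, and an unlabeled
# pool in which indices 0 and 4 lie near the seeds and the rest lie far away.
positives = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]
unlabeled = [(0.1, 0.0), (3.0, 3.1), (2.9, 3.0), (3.1, 2.8), (-0.2, 0.2)]
ranking = rank_unlabeled(positives, unlabeled)
```

In this toy setup the two unlabeled points closest to the seeds are ranked first. A real system would use learned document representations and a more expressive density estimator, which this sketch does not attempt.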
- Haiyang Zhang
- Qiuyi Chen
- Yuanjie Zou
- Yushan Pan
- Jia Wang
- Mark Stevenson