Interactive Distillation of Large Single-Topic Corpora of Scientific Papers (2309.10772v1)

Published 19 Sep 2023 in cs.IR, cs.CL, cs.DL, and cs.LG

Abstract: Highly specific datasets of scientific literature are important for both research and education. However, it is difficult to build such datasets at scale. A common approach is to build these datasets reductively by applying topic modeling on an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively in which a subject matter expert (SME) handpicks documents. This method does not scale and is prone to error as the dataset grows. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize the embeddings through dimensionality reduction. Papers are kept in the dataset if they are "similar" to the core or are otherwise pruned through human-in-the-loop selection. Additional insight into the papers is gained through sub-topic modeling using SeNMFk. We demonstrate our new tool for literature review by applying it to two different fields in machine learning.
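The abstract describes an iterative loop: expand a citation network outward from a small core corpus, embed the candidate papers, and keep only those that are similar to the core, with a human making the final call. The paper does not specify the embedding model or the similarity rule, so the following is a minimal Python sketch of a single expansion step, assuming a SentenceTransformer embedding model and a fixed cosine-similarity threshold against the core centroid standing in for the interactive pruning; the model name and threshold are illustrative assumptions, not details from the paper.

    import numpy as np
    from sentence_transformers import SentenceTransformer  # assumed embedding backend

    def expansion_step(core_texts, candidate_texts, threshold=0.6):
        """One expansion step: keep candidates whose embedding lies near the core.

        'threshold' is an illustrative stand-in for the paper's
        human-in-the-loop pruning decision, not a value from the paper.
        """
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
        core = model.encode(core_texts, normalize_embeddings=True)
        cand = model.encode(candidate_texts, normalize_embeddings=True)
        centroid = core.mean(axis=0)
        centroid /= np.linalg.norm(centroid)  # re-normalize the mean vector
        sims = cand @ centroid                # cosine similarity to the core
        kept = [t for t, s in zip(candidate_texts, sims) if s >= threshold]
        return kept, sims

In the tool itself, the keep-or-prune decision is made by a person inspecting a 2-D projection of the embeddings (the paper uses dimensionality reduction for visualization); the threshold above merely approximates that judgment for a self-contained example.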
