Logic Mill -- A Knowledge Navigation System (2301.00200v2)
Abstract: Logic Mill is a scalable, openly accessible software system that identifies semantically similar documents within a single domain-specific corpus or across multi-domain corpora. It uses advanced natural language processing (NLP) techniques to generate numerical representations of documents; currently, these representations are derived from a large pretrained language model. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface, is continuously updated, and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other fields.
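The core mechanism the abstract describes — representing each document as a numerical vector and ranking other documents by vector similarity — can be sketched as follows. This is a minimal illustration with toy four-dimensional vectors and plain cosine similarity; the system's actual embeddings come from a large pretrained language model and are searched at scale, not computed like this.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two document embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three documents (hypothetical values for illustration).
doc_patent = [0.9, 0.1, 0.3, 0.0]   # e.g. a patent document
doc_paper  = [0.8, 0.2, 0.4, 0.1]   # a semantically related publication
doc_other  = [0.0, 0.9, 0.1, 0.8]   # an unrelated document

# The related pair scores higher than the unrelated pair,
# which is how nearest-neighbour retrieval ranks candidates.
print(cosine_similarity(doc_patent, doc_paper))
print(cosine_similarity(doc_patent, doc_other))
```

In practice, systems of this kind precompute embeddings for the whole corpus and use an approximate nearest-neighbour index rather than pairwise comparison, since exact search over hundreds of millions of vectors would be too slow.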