Logic Mill -- A Knowledge Navigation System (2301.00200v2)

Published 31 Dec 2022 in cs.CL

Abstract: Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced NLP techniques to generate numerical representations of documents. Currently it leverages a large pre-trained LLM to generate these document representations. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface. Moreover, it is continuously being updated and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other domains.

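The abstract's core pipeline, encoding each document into a dense numerical vector with a large pre-trained language model and then comparing vectors to find semantically similar documents, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the choice of model checkpoint (allenai/specter) and the [CLS]-token pooling strategy are assumptions.

```python
# Illustrative sketch only (assumed checkpoint and pooling, not Logic Mill's actual code):
# embed two documents with a pre-trained transformer and compare them by cosine
# similarity, the kind of numerical document representation the abstract describes.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")  # assumed checkpoint
model = AutoModel.from_pretrained("allenai/specter")

docs = [
    "A method for linking patent documents to the scientific publications they build on.",
    "Neural document embeddings for large-scale semantic similarity search.",
]

# Tokenize both documents and use the [CLS]-token embedding as the document vector.
inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0, :]

# Cosine similarity between the two document vectors; higher means more similar.
score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {score.item():.3f}")
```

At the scale reported in the abstract (more than 200 million documents), a system of this kind would pair such embeddings with an approximate nearest-neighbour index rather than exhaustive pairwise comparison; the sketch above shows only the representation step.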