Harvesting Textual and Structured Data from the HAL Publication Repository (2407.20595v2)
Abstract: HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. Although it is a rich repository of academic documents, its potential for advanced research has not been fully explored. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of HAL-submitted articles to help with authorship attribution and verification. This first iteration consists of approximately 700,000 documents, spanning 56 languages across 13 identified domains. We transform articles' metadata into a citation network, producing a heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open-access documents and their references. Finally, we mine 14.5 million high-quality sequence pairs from HALvest for contrastive learning purposes. By providing different views of HAL, suited for modern machine learning, we aim to assist practitioners in better analyzing and interpreting research dynamics.