
On the Use of ArXiv as a Dataset (1905.00075v1)

Published 30 Apr 2019 in cs.IR, cs.LG, cs.SI, and physics.soc-ph

Abstract: The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure---created by citation, affiliation, and co-authorship---makes the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal, by providing a pipeline which standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles. We present some baseline classification results, and motivate application of more exciting generative graph models.


Summary

  • The paper develops an open-source pipeline that standardizes extraction and processing of arXiv’s rich multimodal data into detailed co-citation networks.
  • It demonstrates that combining title, abstract, full-text, and co-citation features achieves up to 78.4% top-1 and 94.5% top-5 classification accuracy.
  • The comprehensive dataset empowers future research in relational and multimodal machine learning across diverse scientific fields.

On the Use of ArXiv as a Dataset

The paper "On the Use of ArXiv as a Dataset" explores the potential of using the arXiv database as a benchmark for developing and evaluating next-generation models in the context of multimodal, relational data. The arXiv repository, with its extensive collection of 1.5 million pre-print articles across various scientific disciplines, provides an ample dataset with rich multimodal features, including text, figures, authorship, and metadata. These data points are naturally embedded within a graph framework, which offers unique possibilities for analysis and model development.

Contribution and Methodology

The primary contribution of this paper is the creation of an open-source pipeline that standardizes the extraction and processing of the arXiv's publicly available data. This pipeline facilitates access to metadata and full-text documents, enabling the construction of a detailed co-citation network with 1.35 million nodes and 6.72 million edges, alongside an 11 billion word corpus. The implementation covers downloading PDFs, converting them to plaintext, building co-citation networks, and normalizing author data.
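The co-citation step of such a pipeline can be sketched in a few lines: two papers are linked whenever some third paper cites both, with the edge weight counting how many papers cite the pair together. This is an illustrative reconstruction, not the paper's actual implementation; the function name, the toy arXiv IDs, and the dict-based graph representation are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_cocitation_graph(citations):
    """citations: dict mapping a citing paper's arXiv ID to the list of
    IDs it cites. Returns a dict mapping each unordered pair of cited
    papers to its co-citation weight (how many papers cite both)."""
    weights = defaultdict(int)
    for citing, cited in citations.items():
        # every pair of references in one bibliography is a co-citation
        for a, b in combinations(sorted(set(cited)), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Toy bibliography data (hypothetical citation lists)
citations = {
    "1905.00075": ["hep-th/9711200", "1406.2661"],
    "2001.00001": ["hep-th/9711200", "1406.2661", "1706.03762"],
}
graph = build_cocitation_graph(citations)
# ("1406.2661", "hep-th/9711200") is cited together by both papers,
# so its co-citation weight is 2
```

At the paper's scale (1.35 million nodes, 6.72 million edges) the same counting logic would be run over every extracted bibliography; the quadratic blow-up per bibliography stays manageable because individual reference lists are short.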

Empirical Results

The authors report baseline results for predicting a paper's primary arXiv category from several feature types: titles, abstracts, full text, and co-citations. Combining all feature types yields up to 78.4% top-1 accuracy, rising to 94.5% for top-5 predictions. These results highlight the value of leveraging the full range of multimodal data present in the arXiv.
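One simple way to combine per-modality predictions, sketched below, is late fusion: each feature type (title, abstract, full text, co-citations) produces a probability distribution over categories, and the distributions are averaged before taking the top-k. The paper does not specify its fusion scheme, so this is an assumed, minimal illustration; the probability values are made up.

```python
def fuse(prob_dicts, weights=None):
    """Weighted average of per-modality category probability dicts."""
    weights = weights or [1.0 / len(prob_dicts)] * len(prob_dicts)
    categories = set().union(*prob_dicts)
    return {c: sum(w * p.get(c, 0.0) for w, p in zip(weights, prob_dicts))
            for c in categories}

def top_k(probs, k=1):
    """Return the k highest-probability categories (ties broken by name)."""
    return [c for c, _ in sorted(probs.items(),
                                 key=lambda kv: (-kv[1], kv[0]))[:k]]

# Hypothetical per-modality scores for one paper
title_probs    = {"cs.LG": 0.50, "cs.IR": 0.30, "cs.SI": 0.20}
abstract_probs = {"cs.LG": 0.60, "cs.IR": 0.25, "cs.SI": 0.15}
cocite_probs   = {"cs.IR": 0.50, "cs.LG": 0.40, "cs.SI": 0.10}

fused = fuse([title_probs, abstract_probs, cocite_probs])
top_k(fused, 1)  # -> ["cs.LG"]
```

Weighted fusion (rather than a plain average) would let stronger modalities, such as full text, dominate weaker ones; top-1 versus top-5 accuracy then corresponds to calling `top_k` with k=1 or k=5.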

Implications

This standardized and comprehensive dataset has numerous implications for the development of models that handle complex relational and multimodal data. By enabling easy access to a vast and structured dataset, researchers can pursue advancements in machine learning techniques such as relational modeling, language modeling, citation prediction, and topic modeling. The extensive scope and rich metadata of the arXiv also allow for unique explorations in areas such as text segmentation and mathematical formula recognition.

Future Work

The arXiv dataset opens avenues for future research into more sophisticated models that can incorporate relational inductive biases—enhancing tasks like link prediction and automatic summarization. Future work may also explore enhancing the dataset with additional cleaning procedures and extending relation modeling beyond first-order connections.

Conclusion

The paper provides a critical toolset for researchers interested in advancing relational and multimodal model development. By offering a standardized and scalable way to access and utilize arXiv data, this research enables broader exploration and benchmarking in these emerging fields. As the dataset and tools continue to evolve, they will likely become integral to advancing AI research across scientific domains.