- The paper develops an open-source pipeline that standardizes extraction and processing of arXiv’s rich multimodal data into detailed co-citation networks.
- It demonstrates that combining title, abstract, full-text, and co-citation features achieves up to 78.4% top-1 and 94.5% top-5 category-classification accuracy.
- The comprehensive dataset empowers future research in relational and multimodal machine learning across diverse scientific fields.
On the Use of ArXiv as a Dataset
The paper "On the Use of ArXiv as a Dataset" explores the potential of using the arXiv database as a benchmark for developing and evaluating next-generation models in the context of multimodal, relational data. The arXiv repository, with its extensive collection of 1.5 million pre-print articles across various scientific disciplines, provides an ample dataset with rich multimodal features, including text, figures, authorship, and metadata. These data points are naturally embedded within a graph framework, which offers unique possibilities for analysis and model development.
Contribution and Methodology
The primary contribution of this paper is an open-source pipeline that standardizes the extraction and processing of arXiv's publicly available data. The pipeline provides access to metadata and full-text documents, enabling the construction of a detailed co-citation network with 1.35 million nodes and 6.72 million edges, alongside an 11-billion-word corpus. The implementation covers downloading PDFs, converting them to plaintext, building co-citation networks, and normalizing author data.
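A minimal sketch of three of those stages is shown below, assuming pdfminer.six for text extraction and networkx for the graph. It is illustrative rather than the authors' implementation: bulk downloads should go through arXiv's bulk-access service rather than per-paper HTTP requests, the regular expression only catches new-style identifiers, and author normalization is omitted.

```python
import re

import networkx as nx
import requests
from pdfminer.high_level import extract_text

# New-style arXiv identifiers only (e.g. "arXiv:1905.00075"); old-style IDs are ignored.
ARXIV_ID_RE = re.compile(r"arXiv:(\d{4}\.\d{4,5})", re.IGNORECASE)


def download_pdf(arxiv_id: str, path: str) -> str:
    """Fetch one PDF; large-scale work should use arXiv's bulk-access buckets instead."""
    resp = requests.get(f"https://arxiv.org/pdf/{arxiv_id}", timeout=60)
    resp.raise_for_status()
    with open(path, "wb") as fh:
        fh.write(resp.content)
    return path


def pdf_to_text(path: str) -> str:
    """Convert a downloaded PDF to plaintext."""
    return extract_text(path)


def add_citations(graph: nx.DiGraph, arxiv_id: str, fulltext: str) -> None:
    """Add one node per paper and one edge per intra-arXiv reference found in its text."""
    graph.add_node(arxiv_id)
    for cited in set(ARXIV_ID_RE.findall(fulltext)):
        if cited != arxiv_id:
            graph.add_edge(arxiv_id, cited)
```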
Empirical Results
The authors provide baseline classification results that use several features from the dataset (titles, abstracts, full text, and co-citations) to demonstrate its suitability for classification tasks. When all feature types are combined, the subject-category classifier reaches 78.4% top-1 accuracy and 94.5% top-5 accuracy. These results highlight the value of leveraging the full range of multimodal data in arXiv.
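For intuition, a text-only baseline of this kind might look like the sketch below: TF-IDF features plus logistic regression, scored with top-1 and top-5 accuracy. This is a simplified stand-in, not the authors' models, and it omits the full-text and co-citation features behind the reported numbers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import train_test_split


def text_baseline(texts: list[str], labels: list[str]) -> tuple[float, float]:
    """TF-IDF + logistic regression over title/abstract text;
    returns (top-1, top-5) accuracy for category prediction."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0)
    vec = TfidfVectorizer(max_features=50_000, sublinear_tf=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    probs = clf.predict_proba(vec.transform(X_test))
    top1 = top_k_accuracy_score(y_test, probs, k=1, labels=clf.classes_)
    top5 = top_k_accuracy_score(y_test, probs, k=5, labels=clf.classes_)
    return top1, top5
```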
Implications
This standardized and comprehensive dataset has numerous implications for the development of models that handle complex relational and multimodal data. By providing easy access to a vast, structured dataset, it lets researchers pursue advances in techniques such as relational modeling, language modeling, citation prediction, and topic modeling. The extensive scope and rich metadata of arXiv also support explorations in areas such as text segmentation and mathematical formula recognition.
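As one concrete example of such downstream use, a quick topic-modeling baseline over the abstracts could be as simple as the following scikit-learn LDA sketch; this is an illustration, not an experiment reported in the paper.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def topic_model(abstracts: list[str], n_topics: int = 20, n_top_words: int = 10):
    """Fit a simple LDA topic model on abstracts and return the top words per topic."""
    vec = CountVectorizer(max_features=20_000, stop_words="english")
    counts = vec.fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vec.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_top_words:][::-1]]
            for topic in lda.components_]
```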
Future Work
The arXiv dataset opens avenues for future research into more sophisticated models that incorporate relational inductive biases, improving tasks such as link prediction and automatic summarization. Future work may also extend the dataset with additional cleaning procedures and push relation modeling beyond first-order connections.
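A first-order link-prediction baseline of the kind such work would extend can be as simple as a neighborhood-overlap heuristic on an undirected view of the citation graph. The sketch below uses networkx's Jaccard coefficient and is an assumed illustration, not the paper's method.

```python
import networkx as nx


def rank_candidate_links(citation_graph: nx.DiGraph, node: str, k: int = 5):
    """Rank unlinked papers for `node` by Jaccard overlap of neighborhoods,
    a classic first-order link-prediction heuristic."""
    graph = citation_graph.to_undirected()  # neighborhood heuristics need an undirected view
    candidates = [(node, other) for other in graph.nodes
                  if other != node and not graph.has_edge(node, other)]
    scores = nx.jaccard_coefficient(graph, candidates)
    return sorted(scores, key=lambda uvp: uvp[2], reverse=True)[:k]
```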
Conclusion
The paper provides a critical toolset for researchers interested in advancing relational and multimodal model development. By offering a standardized and scalable way to access and utilize arXiv data, this research enables broader exploration and benchmarking in these emerging fields. As the dataset and tools continue to evolve, they will likely become integral to advancing AI research across scientific domains.