- The paper presents a 242-billion token dataset from Harvard Library's historic texts, enabling enhanced LLM training.
- It details comprehensive data retrieval, OCR post-processing, and deduplication methodologies across 983,004 volumes.
- It emphasizes rights determination via HathiTrust API, ensuring public domain compliance for responsible AI research.
Institutional Books 1.0: A 242 Billion Token Dataset from Harvard Library's Collections
The technical report, titled "Institutional Books 1.0," introduces a substantial dataset derived from Harvard Library's collections, intended to support LLM training. The dataset comprises approximately 242 billion tokens across 983,004 volumes of historic texts spanning more than 250 languages, with temporal coverage reaching back several centuries and concentrated between 1820 and 1920.
Dataset Acquisition and Processing
This dataset is rooted in Harvard Library's participation in the Google Books digitization project, a collaboration that began in 2006. A key methodological step was retrieving digitized copies of Harvard's collection via Google's Return Interface. Of the 1,075,899 listed volumes, 1,004,977 were successfully retrieved, a comprehensive assembly despite scanning and retrieval discrepancies that remain to be investigated.
Post-retrieval, the volumes underwent extensive analysis and post-processing, including OCR artifact analysis and deduplication to extract usable text from the OCR output. An accompanying analysis of temporal and language coverage indicates the dataset's utility for textual work across diverse linguistic contexts and historical eras.
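The report does not spell out the deduplication method in this summary, so as a minimal illustrative sketch, exact deduplication over normalized OCR text can be done by hashing each volume's text and keeping the first volume seen per hash (the `deduplicate` helper and the volume-ID keys below are hypothetical, not the report's actual pipeline):

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different scans hash alike."""
    return " ".join(text.lower().split())

def deduplicate(volumes: dict[str, str]) -> dict[str, str]:
    """Keep the first volume encountered for each normalized-text hash."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for volume_id, text in volumes.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[volume_id] = text
    return kept
```

Real pipelines at this scale typically add near-duplicate detection (e.g. MinHash over shingles) on top of exact hashing, since OCR noise makes byte-identical duplicates rare.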
Rights Determination and Public Domain Access
A critical step before release was the legal determination of rights using HathiTrust's API to identify public domain volumes. This ensured compliance with intellectual property law: the released dataset contains only volumes whose HathiTrust rights status confirms they are in the public domain in the United States.
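As a hedged sketch of what such a filter can look like (the record structure and field names below are hypothetical, not the actual HathiTrust API response shape): HathiTrust assigns each volume a rights code, and codes such as "pd" (public domain worldwide) and "pdus" (public domain in the United States) are the statuses that would clear a volume for a US public-domain release.

```python
# Rights codes assumed to follow HathiTrust's attribute scheme; "pd" and
# "pdus" mark volumes as public domain (worldwide / in the US respectively).
PUBLIC_DOMAIN_US = {"pd", "pdus"}

def filter_public_domain(records: list[dict]) -> list[dict]:
    """Keep only records whose rights code marks them public domain in the US."""
    return [r for r in records if r.get("rights_code") in PUBLIC_DOMAIN_US]
```

In practice the rights code would be looked up per volume via the HathiTrust Bibliographic API rather than read from a local field, and in-copyright codes would exclude the volume from release.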
Analytical Insights and Scholarly Opportunities
The release also includes an exploratory topic classification that assigns volumes to categories from the Library of Congress Classification Outline, compensating for initially sparse topic and subject metadata. These labels let practitioners select topical subsets for model training and evaluation in specific domains.
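To make the classification scheme concrete: the Library of Congress Classification Outline organizes knowledge into 21 lettered top-level classes, so a call number's leading letter identifies its broad topic. A minimal sketch (the `lcc_class` helper is illustrative, and only a subset of classes is shown; the report's classifier works from text and metadata, not just call numbers):

```python
# A subset of the Library of Congress Classification top-level classes;
# the full outline defines 21 lettered classes.
LCC_TOP_LEVEL = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "T": "Technology",
}

def lcc_class(call_number: str) -> str:
    """Map a call number (e.g. 'QA76.9') to its top-level LCC class."""
    letter = call_number.strip()[:1].upper()
    return LCC_TOP_LEVEL.get(letter, "Unknown")
```

A lookup like this supports the selective use the report envisions, e.g. filtering the corpus to Science or Language and Literature volumes before training.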
Implications and Future Prospects
Overall, this preprint presents a well-documented and scalable resource for training LLMs, highlighting the potential for enhanced performance from diverse, high-quality data sources. The public availability of these historical texts opens new avenues for research and collaboration across both the library and AI spheres, nurturing a community-led process for dataset refinement and expansion. Speculatively, releasing the raw scan images alongside the structured text dataset may further support multimodal model training.
The careful stewardship of historical and scholarly repositories demonstrated in this project sets the stage for advancing responsible AI practices. It encourages collaboration between knowledge institutions and AI communities on the ongoing technical challenges of data stewardship, grounded in documentation, provenance, and ethical considerations. Anticipated future work includes refining text post-processing techniques, expanding the volumes accessible through collaborations, and exploring finer-grained topic classification, steps toward an institutional data commons cultivated through collective scholarly stewardship.
This report and dataset make a significant contribution toward a reliable, well-documented foundation of training data for LLM development, invigorating scholarly discourse grounded in diverse linguistic resources.