
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability (2506.08300v1)

Published 10 Jun 2025 in cs.CL and cs.DL

Abstract: LLMs use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.

Summary

  • The paper presents a 242-billion token dataset from Harvard Library's historic texts, enabling enhanced LLM training.
  • It details comprehensive data retrieval, OCR post-processing, and deduplication methodologies across 983,004 volumes.
  • It emphasizes rights determination via HathiTrust API, ensuring public domain compliance for responsible AI research.

Institutional Books 1.0: A 242 Billion Token Dataset from Harvard Library's Collections

The technical report, titled "Institutional Books 1.0," introduces a substantial dataset derived from Harvard Library's collections, intended to support LLM training efforts. The dataset comprises approximately 242 billion tokens across 983,004 volumes, making it a significant resource for broadening the scope and quality of LLM training data. The collection consists of historic texts spanning over 250 languages, with temporal coverage stretching across several centuries and concentrated primarily between 1820 and 1920.
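For readers who want to work with the released volumes directly, the sketch below shows one way to stream them with the Hugging Face `datasets` library. The repository id and record field names used here are assumptions for illustration and may differ from the actual release.

```python
# Minimal sketch: streaming released volumes with the Hugging Face `datasets`
# library. The repository id and field names are assumptions, not confirmed
# identifiers from the report.
from datasets import load_dataset

ds = load_dataset(
    "institutional/institutional-books-1.0",  # assumed repository id
    split="train",
    streaming=True,  # avoid downloading the full 242B-token corpus up front
)

for volume in ds.take(3):
    # Hypothetical fields: a volume identifier and its detected language.
    print(volume.get("barcode"), volume.get("language"))
```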

Dataset Acquisition and Processing

This dataset is rooted in Harvard Library's participation in the Google Books digitization project, a collaboration that began in 2006. A key methodological step was retrieving digitized copies of Harvard's collection via Google's Return Interface. Of the 1,075,899 listed volumes, 1,004,977 were successfully retrieved, a comprehensive assembly despite certain scanning discrepancies that the authors flag for further investigation.

Post-retrieval, the volumes underwent extensive analysis and post-processing, including OCR artifact analysis and deduplication to recover valid text from the OCR output. The accompanying temporal and language coverage analysis indicates the dataset's utility for textual work spanning diverse languages and historical eras.
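To make the deduplication idea concrete, here is a minimal sketch of exact-match duplicate removal over OCR text, assuming an iterable of (volume_id, text) pairs. The report's actual pipeline is more involved; this only illustrates the basic hash-based approach.

```python
# Minimal sketch of exact-match deduplication over OCR text. This is an
# illustrative simplification, not the report's actual deduplication method.
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial OCR variations hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(volumes):
    """Yield only the first occurrence of each distinct normalized text."""
    seen = set()
    for volume_id, text in volumes:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield volume_id, text
```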

Rights Determination and Public Domain Access

A critical facet of the dataset's release was the determination of rights using HathiTrust's API, which identified volumes in the public domain. This step ensured compliance with intellectual property rights: the released dataset includes only volumes whose HathiTrust rights status confirms they are in the public domain in the United States.
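As an illustration of this kind of rights lookup, the sketch below queries HathiTrust's public Bibliographic API for a volume and checks its rights code. The endpoint pattern is real, but the exact response fields and the set of codes treated as public domain here are assumptions to verify against the API documentation; this is not the report's actual rights-determination pipeline.

```python
# Minimal sketch of a rights-status check against HathiTrust's Bibliographic
# API. Response fields and public-domain codes below are assumptions.
import requests

PUBLIC_DOMAIN_CODES = {"pd", "pdus"}  # assumed: public domain worldwide / US-only


def is_public_domain(htid: str) -> bool:
    """Return True if any item for this HathiTrust id carries a public-domain rights code."""
    url = f"https://catalog.hathitrust.org/api/volumes/brief/htid/{htid}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    items = response.json().get("items", [])
    return any(item.get("rightsCode") in PUBLIC_DOMAIN_CODES for item in items)
```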

Analytical Insights and Scholarly Opportunities

The release also includes an exploratory topic classification that assigns volumes to categories from the Library of Congress Classification Outline, compensating for the limited topic and subject metadata originally available. These labels let practitioners select topical subsets for model training and evaluation in specific domains.
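For a sense of how such labeling can be approximated, the sketch below assigns a Library of Congress Classification top-level class to a volume using an off-the-shelf zero-shot classifier. This is not the classifier used in the report; the model choice and the truncated label list are illustrative assumptions.

```python
# Minimal sketch: zero-shot assignment of an LCC top-level class to a volume.
# Model and label subset are illustrative assumptions, not the report's method.
from transformers import pipeline

LCC_CLASSES = [
    "A - General Works",
    "B - Philosophy, Psychology, Religion",
    "D - World History",
    "P - Language and Literature",
    "Q - Science",
    "T - Technology",
]  # truncated subset of the full Classification Outline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def classify_volume(title: str, snippet: str) -> str:
    """Return the highest-scoring LCC class for a title plus a short text snippet."""
    result = classifier(f"{title}. {snippet}"[:1000], candidate_labels=LCC_CLASSES)
    return result["labels"][0]
```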

Implications and Future Prospects

Overall, this report presents a well-documented and scalable resource for training LLMs, highlighting the performance gains that diverse, high-quality data sources can enable. The public availability of these historical texts opens new avenues for research and collaboration across the library and AI communities, supporting a community-led process of dataset refinement and expansion. Speculatively, a future release of raw scan images alongside the structured text dataset could further support multimodal model training.

The careful stewardship of historical and scholarly collections demonstrated in this project sets the stage for advancing responsible AI practices. It encourages collaboration between knowledge institutions and AI communities to navigate ongoing technical challenges in data stewardship, grounded in documentation, provenance, and ethical considerations. Anticipated future work includes refining text post-processing techniques, expanding the set of volumes made accessible through collaborations, and probing finer-grained topic classification, with a view toward an institutional data commons cultivated through collective scholarly stewardship.

This report and dataset make a significant contribution toward a stronger training-data foundation for LLM development, grounding scholarly discourse in reliable and diverse linguistic resources.
