Institutional Books 1.0 Dataset

Updated 30 June 2025
  • Institutional Books 1.0 is a massive textual dataset featuring 242B tokens from nearly 1M curated public domain volumes across diverse languages and historical eras.
  • It employs rigorous OCR, dual language detection, and deduplication techniques to ensure high data quality, transparency, and reproducibility.
  • The dataset supports large language model training and digital scholarship with verifiable provenance and sustainable curation.

Institutional Books 1.0 is a landmark large-scale textual dataset: a 242 billion-token subset of Harvard Library’s digitized collections, meticulously processed for quality, diversity, and provenance. Drawn from public domain volumes scanned under the Google Books project since 2006 and refined by the Institutional Data Initiative and the Library Innovation Lab, it provides a sustainable, transparent, and rigorously documented resource for LLM training, digital scholarship, and data stewardship (2506.08300).

1. Dataset Scope, Structure, and Coverage

Institutional Books 1.0 comprises 983,004 public domain volumes drawn from an initial corpus of 1,075,899 scanned works. Cumulative token counts range from roughly 248B to 311B depending on the tokenizer; the headline figure of 242B tokens corresponds to OpenAI’s o200k_base encoding. The dataset spans nearly 400 million pages, averaging 367 pages per volume. Language coverage is broad, with at least 254 unique main languages at the volume level and up to 379 distinct languages in total, reflecting extensive multilingual content and intra-volume code-switching. Temporally, the collection is richest between 1820 and 1920 (60.55% of dated volumes), with significant representation from earlier periods (2506.08300).

Tokenizer | Total Tokens | Avg. Tokens per Volume
--- | --- | ---
OpenAI GPT-4o | 248,299,000,580 | 230,783
Mistral Mixtral-8x22B | 311,589,475,275 | 289,608
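
As a rough illustration of how such per-volume counts can be reproduced, the sketch below tokenizes page text with OpenAI’s o200k_base encoding (the tokenizer used by GPT-4o) via the tiktoken library. The page-list structure and helper names are illustrative assumptions, not the dataset’s actual schema.

```python
# Minimal sketch: per-volume token counting with the o200k_base encoding.
# Requires `pip install tiktoken`. The `pages` structure (a list of page
# strings) is an illustrative assumption, not the dataset's published schema.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o

def count_volume_tokens(pages: list[str]) -> int:
    """Sum token counts over all pages of a volume."""
    return sum(len(enc.encode(page)) for page in pages)

pages = ["First page of OCR text ...", "Second page of OCR text ..."]
print(count_volume_tokens(pages))
```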

The predominant languages by token count are English (43.8%), German (17.3%), and French (14%), with notable shares in Latin, Italian, Spanish, Russian, Greek, Dutch, and Hebrew.

2. Data Acquisition, Cleaning, and Annotation Methodologies

Volumes were extracted via the Google Return Interface (GRIN) with strict filtering against the HathiTrust Rights Database, including only volumes identified as public domain, public domain in the US, or Creative Commons Zero. Each volume underwent comprehensive OCR analysis, leveraging both vendor-supplied and project-specific metrics, typically achieving OCR quality scores above 88. Language attribution employed a dual approach of ISO 639-3 bibliographic codes and trigram-based machine detection, increasing reliability for multilingual works (2506.08300).
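
The trigram-based detection mentioned above can be illustrated with a small character-trigram profile comparison in the style of Cavnar-Trenkle; the toy reference profiles below are assumptions for illustration and do not reproduce the project’s actual detector or its training data.

```python
# Minimal sketch of character-trigram language identification, in the spirit of
# the trigram-based detection described above. Reference profiles are toy
# examples; real detectors are trained on large per-language corpora.
from collections import Counter

def trigram_profile(text: str, top_k: int = 300) -> list[str]:
    """Return the most frequent character trigrams, most common first."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [t for t, _ in counts.most_common(top_k)]

def out_of_place_distance(doc: list[str], ref: list[str]) -> int:
    """Cavnar-Trenkle 'out-of-place' distance between two ranked profiles."""
    ref_rank = {t: i for i, t in enumerate(ref)}
    penalty = len(ref)  # cost for trigrams absent from the reference profile
    return sum(abs(i - ref_rank[t]) if t in ref_rank else penalty
               for i, t in enumerate(doc))

def detect_language(text: str, references: dict[str, list[str]]) -> str:
    doc = trigram_profile(text)
    return min(references, key=lambda lang: out_of_place_distance(doc, references[lang]))

references = {
    "eng": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "deu": trigram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
print(detect_language("the dog and the fox", references))  # expected: eng
```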

Post-processing included deduplication via Simhash locality-sensitive hashing, computation of per-volume and per-page token statistics under multiple tokenizer standards, and production of both raw and cleaned OCR text. Cleaned text variants were generated through a machine-assisted pipeline that uses sentence-transformers to classify line types and remove structural noise (e.g., page numbers, running heads), enhancing downstream usability for NLP tasks. The dataset further integrates fine-grained MARC metadata, computed genre/topic assignments (using a fine-tuned BERT classifier for Library of Congress top-level classes with 97.8% accuracy), OCR metrics, and provenance tags on every data element (2506.08300).
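
The Simhash-based deduplication step can be sketched as follows; the shingle size, hash choice, and distance cutoff are illustrative assumptions rather than the pipeline’s actual parameters.

```python
# Minimal sketch of Simhash near-duplicate detection over word 3-shingles.
# Shingle size, hash function, and threshold are illustrative assumptions.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a Simhash fingerprint over word 3-shingles."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(len(words) - 2, 1))]
    weights = [0] * bits
    for shingle in shingles:
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc_a = " ".join(["institutional books dataset page text"] * 20)
doc_b = doc_a + " with one extra trailing phrase"
# A small distance suggests near-duplicates; the cutoff is tuned in practice.
print(hamming(simhash(doc_a), simhash(doc_b)))
```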

Metadata Field | Description
--- | ---
Barcode | Primary key for volume
OCR Text | Both original and cleaned, per page
Languages | Bibliographic and detected, with token distribution
Topic Class | LCC-based, BERT-classified
Rights/Provenance | Matched to HathiTrust, fully tracked
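
For orientation, a per-volume record mirroring these fields might look like the sketch below; the field names and types are assumptions for illustration, not the dataset’s published schema.

```python
# Illustrative per-volume record mirroring the metadata fields listed above.
# Field names and types are assumptions, not the dataset's published schema.
from dataclasses import dataclass, field

@dataclass
class VolumeRecord:
    barcode: str                      # primary key for the volume
    ocr_text_original: list[str]      # raw OCR text, one string per page
    ocr_text_cleaned: list[str]       # machine-cleaned OCR text, per page
    language_bibliographic: str       # ISO 639-3 code from MARC metadata
    languages_detected: dict[str, float] = field(default_factory=dict)  # language -> token share
    topic_lcc: str = ""               # Library of Congress top-level class (BERT-assigned)
    rights: str = ""                  # rights status matched against HathiTrust
    provenance: str = ""              # source tag: original, generated, or external
```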

3. Provenance, Stewardship, and Sustainable Data Principles

The assembly and release of Institutional Books 1.0 are grounded in principles of sustainable stewardship, comprehensive provenance, and reproducibility. Every datum is tagged by source (e.g., original, generated, external), ensuring auditability and traceability. All processing pipelines, classifier models, and extraction scripts are open-source and documented. Frugal computing strategies were prioritized to minimize computational and environmental footprint. The data release adheres to strict rights determination, ensuring compliance with intellectual property norms and ethical standards (2506.08300).

4. Curation and the Harvard Library Context

The reliance on Harvard Library’s collections introduces a level of bibliographic authority, curation history, and diversity rarely found in large internet-scraped datasets. This curation includes a breadth of academic, legal, literary, scientific, and historic materials, significantly enhancing representation for low-resource languages and underrepresented subject domains. The stability, updatability, and documentary completeness afforded by the Harvard context serve both the immediate needs of LLM development and the long-term demands of digital scholarship (2506.08300).

5. Significance for LLM Development and Research Applications

Institutional Books 1.0 presents several distinctive utilities for LLMs and AI research:

  • Provides a foundational corpus for training or evaluating LLMs where provenance, rights status, and bibliographic context are critical.
  • Enables creation of language-, topic-, and genre-filtered training and evaluation subsets, supporting both domain adaptation (e.g., legal, scientific, historical) and low-resource language advances (see the filtering sketch after this list).
  • Facilitates robust evaluation on tasks involving long context, historical language, and multi-language processing.
  • Offers a transparent, auditable alternative to opaque web-scraped corpora, supporting regulatory scrutiny and responsible AI development.
  • Supports digital humanities, NLP/translation, and library science initiatives, transforming historic print collections into analyzable, reusable data (2506.08300).
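
As a concrete example of the subset-building use case noted above, a language- and topic-filtered selection might be assembled as in the sketch below, reusing the illustrative VolumeRecord fields from Section 2; the field names and thresholds are assumptions, not part of the published schema.

```python
# Minimal sketch of building a language- and topic-filtered training subset.
# Assumes records expose `languages_detected` (language -> token share) and
# `topic_lcc` fields as in the illustrative VolumeRecord above.
def filter_subset(records, language="eng", topic_prefix="K", min_share=0.9):
    """Yield volumes dominated by `language` whose LCC class starts with `topic_prefix`."""
    for record in records:
        share = record.languages_detected.get(language, 0.0)
        if share >= min_share and record.topic_lcc.startswith(topic_prefix):
            yield record

# Example: English-dominant legal volumes (LCC top-level class "K" is Law).
# legal_subset = list(filter_subset(all_records, language="eng", topic_prefix="K"))
```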

6. Accessibility, Licensing, and Community Engagement

The initial release is under a non-commercial license (due to uncertainties around subsequent, non-Harvard digitizations of public domain materials), with explicit encouragement for research and community stewardship contributions. All documentation, code, and metadata are public, and the dataset is accessible via platforms such as HuggingFace and GitHub, accompanied by detailed appendices and expanded schema definitions (2506.08300).
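
A minimal way to start exploring the release programmatically is shown below, assuming the Hugging Face datasets library; the repository identifier is an assumption here, so consult the project’s Hugging Face page for the actual ID and available configurations.

```python
# Minimal sketch: streaming a few records with the Hugging Face `datasets` library.
# The repository ID below is an assumption; check the project's Hugging Face page
# for the actual identifier and available configurations.
from itertools import islice
from datasets import load_dataset

ds = load_dataset("institutional/institutional-books-1.0",  # assumed repo ID
                  split="train", streaming=True)

# Stream a handful of records without downloading the full corpus.
for record in islice(ds, 3):
    print(sorted(record.keys()))
```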

7. Broader Context and Impact

Institutional Books 1.0 addresses deficiencies in the openness, traceability, and diversity of data underlying modern AI systems. As a product of professional library stewardship, it sets a precedent for future large-scale dataset creation, emphasizing verifiable provenance, inclusivity of languages and genres, and sustainable curation. Its release marks a transition from proprietary and ad hoc practices toward transparent, collaborative, and equitable data infrastructures suitable for a range of human and machine users.

In summary, Institutional Books 1.0 is distinguished by its scholarly scale, methodological transparency, rights clarity, and substantive diversity, offering a gold-standard resource for LLM development, digital scholarship, and the responsible evolution of AI (2506.08300).
