Tahoe 100M Dataset (ZhihuRec)
The Tahoe 100M Dataset, formally referred to as ZhihuRec (or Zhihu100M), is a large-scale, rich-context data collection from a major online knowledge-sharing platform. It is designed to facilitate research in recommendation systems, user modeling, information retrieval, and auxiliary machine learning tasks. The dataset is characterized by its scale (roughly 100 million interactions), comprehensive user and item metadata, and the inclusion of explicit search query information, which distinguishes it from previous public benchmarks.
1. Origin, Scale, and Scope
Tahoe 100M is compiled from Zhihu, a leading Chinese knowledge-sharing platform, encapsulating real-world interaction data over a 10-day period. It encompasses approximately 100 million user-answer interactions and incorporates:
- 798,086 distinct users
- 165,310 questions
- 554,976 answers
- 240,395 authors
- 70,308 topics
- 501,918 user query keywords
In addition to the main corpus, two smaller subsets—Zhihu20M (~20M interactions) and Zhihu1M (~1M interactions)—are provided to support scalability tests and method benchmarking.
2. Structural Composition and Data Modalities
Tahoe 100M is organized into interconnected tables reflecting the multi-faceted nature of user engagement:
A. User-Answer Interaction Logs
- Each interaction records anonymized user and answer identifiers, together with the timestamps at which the answer was shown and read.
- Clicked answers indicate positive feedback, while non-click impressions are treated as negative feedback, supporting algorithms requiring both observed and unobserved preference data.
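The labeling convention described above can be sketched in a few lines of Python. The record layout (`user_id`, `answer_id`, `show_ts`, `read_ts`) and the convention that a `read_ts` of 0 marks an answer that was shown but never clicked are assumptions made for illustration; the actual field names and encoding should be checked against the dataset's documentation.

```python
from typing import NamedTuple


class Impression(NamedTuple):
    # Hypothetical record layout; the real column names may differ.
    user_id: str
    answer_id: str
    show_ts: int  # time the answer was shown to the user
    read_ts: int  # time the user opened it; assumed to be 0 if never clicked


def label_impressions(impressions):
    """Return (user_id, answer_id, label) triples: 1 = clicked, 0 = shown but skipped."""
    return [
        (imp.user_id, imp.answer_id, 1 if imp.read_ts > 0 else 0)
        for imp in impressions
    ]


# Toy log: one click and two skipped impressions.
log = [
    Impression("u1", "a1", 1000, 1005),
    Impression("u1", "a2", 1010, 0),
    Impression("u2", "a1", 1020, 0),
]
labeled = label_impressions(log)
```

The resulting triples can be fed directly to models that expect explicit positive and negative examples, without sampling negatives from unobserved items.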
B. User Search Query Logs
- Each user is associated with up to their 20 most recent search queries, represented as keywords and timestamps.
- This element provides direct insight into users' explicit information needs, a unique feature among public recommender datasets.
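As a rough illustration of the "up to 20 most recent queries" structure, the sketch below groups a raw query log by user and keeps only the newest entries. The `(user_id, keyword, timestamp)` row layout and the function name `latest_queries` are hypothetical, chosen only for this example.

```python
from collections import defaultdict

MAX_QUERIES = 20  # the dataset keeps at most 20 recent queries per user


def latest_queries(query_log, limit=MAX_QUERIES):
    """Group (user_id, keyword, timestamp) rows by user, newest first, capped at `limit`."""
    per_user = defaultdict(list)
    for user_id, keyword, ts in query_log:
        per_user[user_id].append((ts, keyword))
    return {
        user: [kw for _, kw in sorted(rows, key=lambda r: r[0], reverse=True)[:limit]]
        for user, rows in per_user.items()
    }


# Toy log with a small limit so the truncation is visible.
toy_log = [
    ("u1", "python", 10),
    ("u1", "gpu", 30),
    ("u1", "rust", 20),
    ("u2", "tea", 5),
]
recent = latest_queries(toy_log, limit=2)
```

With `limit=2`, user `u1` keeps only the two newest keywords (`gpu`, then `rust`), mirroring how the per-user query history is truncated.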
C. Rich Side Information
Entity | Attributes Sampled |
---|---|
Users | Gender (anonymized), location, registration details, engagement stats |
Answers | Parent question, author ID, creation and recommendation status, metadata |
Questions | Associated topics, title, creation time, engagement stats |
Authors | Anonymized profile, engagement and recognition indicators |
Topics | Identifiers, textual description |
All potentially sensitive content (e.g., direct identifiers, textual fields) is anonymized, hashed, or encoded per established privacy norms.
3. Distinctive Features and Impact
Several elements set Tahoe 100M apart from prior open datasets:
- Inclusion of User Search Queries: By cataloging each user’s latest explicit search terms, the dataset permits joint modeling of information seeking and content engagement, enabling hybrid search–recommendation architectures.
- Both Positive and Negative Feedback: Availability of both clicked (positive) and non-clicked (negative) impressions supports the development and evaluation of recommendation algorithms that can explicitly model user disinterest and address the prevalent one-class problem in earlier datasets.
- Comprehensive Metadata: Extensive feature coverage—spanning behavioral, contextual, and network dimensions—for users, content, and interactions allows development of context-, content-, or knowledge-aware models.
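One common way to use logged negatives is a pairwise ranking objective. The sketch below implements a BPR-style loss in plain Python, pairing each clicked answer's score with the score of a skipped impression. Drawing the negative from a user's non-clicked impressions (rather than from unobserved items) is an assumption about the training setup, not something the dataset prescribes.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(s_pos - s_neg)) over paired (clicked, skipped) scores."""
    if not pos_scores or len(pos_scores) != len(neg_scores):
        raise ValueError("need equally sized, non-empty score lists")
    total = 0.0
    for s_pos, s_neg in zip(pos_scores, neg_scores):
        diff = s_pos - s_neg
        # -log(sigmoid(diff)) equals log(1 + exp(-diff)); computed in a
        # numerically stable form to avoid overflow for large |diff|.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pos_scores)


# The loss is small when clicked answers outscore skipped ones, large otherwise.
good_ranking = bpr_loss([2.0, 1.5], [0.0, -0.5])
bad_ranking = bpr_loss([0.0, -0.5], [2.0, 1.5])
```

Because the loss depends only on score differences, it penalizes a model for ranking a skipped impression above a clicked one, which is exactly the signal that one-class datasets cannot provide.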
4. Applications and Experimental Benchmarks
A. Recommendation System Research
- General Top-N Recommendation: The dataset has been used to assess collaborative filtering and embedding-based models, including Pop, ItemKNN, BPR, LightGCN, and ENMF. Standard evaluation metrics include hit rate at K (HR@K) and NDCG@K.
- Sequential Recommendation: With temporal logs on all interactions, methods such as FPMC, GRU4Rec, NARM, and SASRec are evaluated for predicting the next answer in a user’s behavior sequence.
- Context-Aware Recommendation: By incorporating side features from all entities, methods such as Wide&Deep, Neural Factorization Machine (NFM), ACCM, and CC-CC have been evaluated, using AUC (area under the ROC curve) as the principal metric.
- Integration of Search and Recommendation: The co-availability of search queries and recommendation logs allows for session-based or knowledge-enhanced recommender formulations, further connecting information retrieval and recommender system paradigms.
- Modeling Negative Feedback: Loss functions such as Pairwise Ranking Loss, which differentiate between positive, negative, and unobserved samples, can be robustly tested.
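The evaluation metrics named in this list (HR@K, NDCG@K, and AUC) can be written down compactly. The following is a minimal sketch using common textbook definitions with binary relevance; exact conventions (for example, how HR@K is aggregated when a user has several relevant items) vary between papers, so these functions are illustrative rather than the definitions used in any specific benchmark.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k of the ranked list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: the single relevant answer "a1" is ranked second.
ranked = ["a3", "a1", "a7", "a2"]
relevant = {"a1"}
hr1 = hit_rate_at_k(ranked, relevant, 1)
hr2 = hit_rate_at_k(ranked, relevant, 2)
ndcg2 = ndcg_at_k(ranked, relevant, 2)
click_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

The pairwise AUC above is quadratic in the number of samples and is meant only to make the definition concrete; at the scale of this dataset a sort-based implementation would be needed.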
B. Applications Beyond Recommendation
- User Gender Prediction: Profiles, behavioral signals, and network features are used to model anonymized user gender labels using standard classifiers such as LinearSVC, Decision Trees, Naive Bayes, KNN, Random Forest, and MLP.
- High-Quality Answer Recognition: Detection of “editorially recommended” or otherwise high-quality answers, leveraging answer meta-features and author engagement metrics.
- Most Valuable Answerer Identification: Identification and prediction of prolific or influential contributors based on their engagement profiles and interaction patterns.
- Additional Tasks: User interest prediction, demographic inference, session analysis, and content quality assessment are also supported by the dataset’s breadth.
5. Data Loading and Scalability Considerations
Large-scale datasets such as Tahoe 100M present acute challenges in terms of random, memory-efficient data loading for deep learning. Established loading solutions for the AnnData format (frequently used for single-cell omics extensions of Tahoe 100M) often require in-memory operations or format conversion, both impractical at this scale.
scDataset (D'Ascenzo et al., 2 Jun 2025) introduces efficient block sampling and batched fetching:
- Block Sampling: Randomly shuffled, contiguous blocks ($b$ samples per block) are read, reducing the number of random disk seeks per minibatch (size $m$) from $m$ to $m/b$.
- Batched Fetching: Larger in-memory buffers (size $m \cdot f$, where $f$ is the fetch factor) are filled, reshuffled, and split into minibatches. This restores minibatch diversity, which is quantified via entropy analysis, and amortizes I/O costs.
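The two ideas above can be illustrated with a small index-level simulation. This is a sketch of the general technique, not the scDataset API: the function name `block_shuffled_minibatches` and its parameters are hypothetical, and the code only produces sample indices rather than reading anything from disk.

```python
import random


def block_shuffled_minibatches(n_samples, block_size, batch_size, fetch_factor, seed=0):
    """Yield minibatches of sample indices using block sampling and batched fetching."""
    rng = random.Random(seed)

    # Block sampling: split [0, n_samples) into contiguous blocks and shuffle
    # the order of the blocks, so each block costs one seek instead of one per sample.
    blocks = [
        list(range(start, min(start + block_size, n_samples)))
        for start in range(0, n_samples, block_size)
    ]
    rng.shuffle(blocks)
    flat = [idx for block in blocks for idx in block]

    # Batched fetching: gather batch_size * fetch_factor indices per fetch
    # (a real loader would read them from disk here), reshuffle them in memory,
    # then split the buffer into minibatches.
    fetch_size = batch_size * fetch_factor
    for start in range(0, len(flat), fetch_size):
        buffer = flat[start:start + fetch_size]
        rng.shuffle(buffer)
        for offset in range(0, len(buffer), batch_size):
            yield buffer[offset:offset + batch_size]


batches = list(
    block_shuffled_minibatches(n_samples=100, block_size=4, batch_size=8, fetch_factor=3)
)
```

In this toy configuration each fetch of 24 indices touches 6 blocks, so the three minibatches carved from it cost 2 block reads each on average instead of 8 individual reads, while the in-memory reshuffle mixes samples from different blocks within each minibatch.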
Performance benchmarks on Tahoe 100M report single-core speed-ups over AnnLoader, HuggingFace Datasets, and BioNeMo, with throughput scaling to $2593$ samples/sec and a further speed-up when multiprocessing is used.
6. Significance for Research and Model Development
Tahoe 100M is the largest known real-world, public recommender system dataset featuring search query logs, impressions, and rich side information. It directly addresses critical needs in the academic and professional research communities by enabling:
- Development and evaluation of deep user models that learn from both what users search and what they engage with or skip.
- Experimentation with context-, content-, and graph-based recommendation strategies due to its multi-faceted features.
- Study of fairness, explainability, and temporal dynamics at scale, with auxiliary support for demographic inference, interest modeling, and content quality analysis.
- Benchmarking of next-generation data loaders, such as scDataset, for high-throughput and randomness-balanced minibatch generation on massive sparse datasets.
7. Dataset Features Overview
Feature | Included? | Notes |
---|---|---|
Textual Content (encoded) | ✓ | Questions, answers, topics |
User Profile | ✓ | Anonymized, diverse features |
Item Attributes | ✓ | Answers, questions, authors, topics info |
Timestamped Interactions | ✓ | All interactions and queries |
Impression (Negative) Data | ✓ | Not only clicks, but shown and skipped entries |
Search Queries (Explicit) | ✓ (unique) | Up to 20 latest user search keywords |
Favorites/Engagement Stats | ✓ (aggregate) | Likes, thanks, collected in summary stats |
Negative Feedback | ✓ | Comprehensive negative feedback logged |
8. Accessibility and Citation
The Tahoe 100M dataset (ZhihuRec) is available at: https://github.com/THUIR/ZhihuRec-Dataset
9. Conclusion
Tahoe 100M establishes a new standard for scale and contextual richness in public datasets for recommendation and related data-driven research. Its integration of explicit search queries, comprehensive user and content metadata, and multi-type feedback fosters deep investigations into hybrid information retrieval–recommendation modeling, context-aware algorithms, and large-scale user behavior analysis. The dataset also serves as a proving ground for advances in efficient, scalable machine learning infrastructure, as evidenced by the performance of scDataset on its records, and is poised to support the next generation of methodological and applied research in this domain.