Tahoe 100M Dataset (ZhihuRec)

Updated 22 June 2025

The Tahoe 100M Dataset, formally referred to as ZhihuRec (or Zhihu100M), is a large-scale, rich-context data collection from a major online knowledge-sharing platform. It is designed to facilitate research in recommendation systems, user modeling, information retrieval, and auxiliary machine learning tasks. The dataset is characterized by its scale (roughly 100 million interactions), its comprehensive user and item metadata, and its inclusion of explicit search query information, which distinguishes it from previous public benchmarks.

1. Origin, Scale, and Scope

Tahoe 100M is compiled from Zhihu, a leading Chinese knowledge-sharing platform, encapsulating real-world interaction data over a 10-day period. It encompasses approximately 100 million user-answer interactions and incorporates:

798,086 distinct users
165,310 questions
554,976 answers
240,395 authors
70,308 topics
501,918 user query keywords

In addition to the main corpus, two smaller subsets—Zhihu20M (~20M interactions) and Zhihu1M (~1M interactions)—are provided to support scalability tests and method benchmarking.

2. Structural Composition and Data Modalities

Tahoe 100M is organized into interconnected tables reflecting the multi-faceted nature of user engagement:

A. User-Answer Interaction Logs

Each interaction records anonymized user and answer identifiers together with the show and read timestamps.
Clicked answers indicate positive feedback, while non-click impressions are treated as negative feedback, supporting algorithms requiring both observed and unobserved preference data.
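
The click/non-click labeling described above can be sketched as follows. This is a minimal illustration, not the dataset's official loader: the field names (`user_id`, `answer_id`, `show_ts`, `read_ts`) are hypothetical, and the convention that a read timestamp of 0 marks a shown-but-unclicked answer is an assumption that should be checked against the dataset's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impression:
    # Hypothetical field names; the released files may use a different layout.
    user_id: str
    answer_id: str
    show_ts: int  # when the answer was shown to the user
    read_ts: int  # assumed to be 0 when the answer was never opened


def label_impressions(impressions):
    """Return (user_id, answer_id, label) triples.

    label 1 = clicked (positive feedback), label 0 = shown but skipped.
    """
    return [
        (imp.user_id, imp.answer_id, 1 if imp.read_ts > 0 else 0)
        for imp in impressions
    ]


logs = [
    Impression("u1", "a10", 1525000000, 1525000042),
    Impression("u1", "a11", 1525000000, 0),
    Impression("u2", "a10", 1525000300, 0),
]
labeled = label_impressions(logs)
```

Keeping the skipped impressions as explicit zeros, rather than dropping them, is what lets a model train on observed negatives instead of randomly sampled ones.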

B. User Search Query Logs

Each user is associated with up to their 20 most recent search queries, represented as keywords and timestamps.
This element provides direct insight into users' explicit information needs, a unique feature among public recommender datasets.

C. Rich Side Information

Entity	Attributes Sampled
Users	Gender (anonymized), location, registration details, engagement stats
Answers	Parent question, author ID, creation and recommendation status, metadata
Questions	Associated topics, title, creation time, engagement stats
Authors	Anonymized profile, engagement and recognition indicators
Topics	Identifiers, textual description

All potentially sensitive content (e.g., direct identifiers, textual fields) is anonymized, hashed, or encoded per established privacy norms.

3. Distinctive Features and Impact

Several elements set Tahoe 100M apart from prior open datasets:

Inclusion of User Search Queries: By cataloging each user’s latest explicit search terms, the dataset permits joint modeling of information seeking and content engagement, enabling hybrid search–recommendation architectures.
Both Positive and Negative Feedback: Availability of both clicked (positive) and non-clicked (negative) impressions supports the development and evaluation of recommendation algorithms that can explicitly model user disinterest and address the prevalent one-class problem in earlier datasets.
Comprehensive Metadata: Extensive feature coverage—spanning behavioral, contextual, and network dimensions—for users, content, and interactions allows development of context-, content-, or knowledge-aware models.

4. Applications and Experimental Benchmarks

A. Recommendation System Research

General Top-N Recommendation: The dataset has been used to assess collaborative filtering and embedding-based models, including Pop, ItemKNN, BPR, LightGCN, and ENMF. Standard evaluation metrics include hit rate at K (HR@K) and NDCG@K:

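$HR@K = \frac{\text{number of hits among top } K}{\text{total predictions}}$

$NDCG@K = \frac{1}{|U|}\sum_{u \in U} \frac{\mathbb{1}[\text{rank}_u \le K]}{\log_2(\text{rank}_u+1)}$

Here $U$ is the set of test users and $\text{rank}_u$ is the rank of the correct answer for user $u$; a correct answer ranked outside the top $K$ contributes zero.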
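
As a rough sketch of how these two metrics can be computed, the snippet below assumes one held-out correct answer per user and takes the 1-based rank of that answer in each user's recommendation list. The function names are illustrative only and do not come from any particular evaluation library.

```python
import math


def hit_rate_at_k(ranks, k):
    """Fraction of users whose correct answer appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """Mean of 1 / log2(rank + 1), counting only ranks inside the top k.

    With a single relevant item per user the ideal DCG is 1,
    so no further normalization is needed.
    """
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


# 1-based ranks of the held-out answer for three users (made-up values).
ranks = [1, 3, 12]
hr = hit_rate_at_k(ranks, 10)  # two of three users are hits
ndcg = ndcg_at_k(ranks, 10)    # (1/log2(2) + 1/log2(4) + 0) / 3
```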

Sequential Recommendation: With temporal logs on all interactions, methods such as FPMC, GRU4Rec, NARM, and SASRec are evaluated for predicting the next answer in a user’s behavior sequence.
Context-Aware Recommendation: By incorporating side features from all entities, methods such as Wide&Deep, Neural Factorization Machine (NFM), ACCM, and CC-CC have been evaluated, using AUC (Area Under the ROC Curve) as the principal metric:

$AUC = P(\hat{y}_{u,i} > \hat{y}_{u,j}~|~ y_{u,i}=1, y_{u,j}=0)$
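
The probabilistic definition of AUC can be checked directly by counting positive-negative pairs. The sketch below is a quadratic-time illustration for small inputs; counting tied scores as half a win is a common convention assumed here, not something the formula itself specifies.

```python
def pairwise_auc(scores, labels):
    """Estimate P(score of a positive > score of a negative) by enumeration."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Each correctly ordered pair counts 1, each tie counts 0.5 (assumed convention).
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up predictions: clicked impressions have label 1, skipped ones label 0.
scores = [0.9, 0.7, 0.6, 0.4]
labels = [1, 0, 1, 0]
auc = pairwise_auc(scores, labels)  # 3 of the 4 pairs are ordered correctly
```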

Integration of Search and Recommendation: The co-availability of search queries and recommendation logs allows for session-based or knowledge-enhanced recommender formulations, further connecting information retrieval and recommender system paradigms.
Modeling Negative Feedback: Loss functions such as pairwise ranking loss, which distinguish between positive, negative, and unobserved samples, can be tested against logged impressions rather than sampled negatives alone.
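
To make the pairwise idea concrete, here is a minimal sketch of the BPR-style loss $-\log \sigma(\hat{y}_{u,i} - \hat{y}_{u,j})$ for a single (positive, negative) pair. BPR is used only as a familiar example of a pairwise ranking loss; the source does not state which exact loss variant was tested. In this setting the negative item $j$ could be either a logged skipped impression or a sampled unobserved answer.

```python
import math


def bpr_loss(pos_score, neg_score):
    """-log(sigmoid(pos_score - neg_score)) for one (positive, negative) pair.

    Computed as softplus(-diff) in a numerically stable form.
    """
    diff = pos_score - neg_score
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# The loss shrinks as the clicked answer is scored further above the skipped one.
loss_tied = bpr_loss(0.0, 0.0)   # equals log(2)
loss_good = bpr_loss(2.0, 0.0)   # positive ranked well above negative
loss_bad = bpr_loss(0.0, 2.0)    # positive ranked below negative
```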

B. Applications Beyond Recommendation

User Gender Prediction: Profiles, behavioral signals, and network features are used to model anonymized user gender labels using standard classifiers such as LinearSVC, Decision Trees, Naive Bayes, KNN, Random Forest, and MLP.
High-Quality Answer Recognition: Detection of “editorially recommended” or otherwise high-quality answers, leveraging answer meta-features and author engagement metrics.
Most Valuable Answerer Identification: Identification and prediction of prolific or influential contributors based on their engagement profiles and interaction patterns.
Additional Tasks: User interest prediction, demographic inference, session analysis, and content quality assessment are also supported by the dataset’s breadth.

5. Data Loading and Scalability Considerations

Large-scale datasets such as Tahoe 100M make randomized, memory-efficient data loading for deep learning difficult. Established loading solutions for the AnnData format (frequently used for single-cell omics extensions of Tahoe 100M) often require in-memory operations or format conversion, both of which are impractical at this scale.

scDataset (D'Ascenzo et al., 2 Jun 2025) introduces efficient block sampling and batched fetching:

Block Sampling: Randomly shuffled, contiguous blocks ($b$ samples per block) are read, reducing random disk seeks from $m$ per minibatch (size $m$) to $m/b$.
Batched Fetching: Larger in-memory buffers (size $m \times f$, where $f$ is the fetch factor) are filled, reshuffled, and split into minibatches. This restores diversity (quantified via entropy analysis) and amortizes I/O costs.
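
The two ideas can be sketched with plain index arithmetic. The generator below is an illustration of the strategy as described above, not the scDataset API: the function and parameter names are hypothetical, and it yields index lists only, where a real loader would read the corresponding rows from disk.

```python
import random


def block_shuffled_batches(n_samples, batch_size, block_size, fetch_factor, seed=0):
    """Yield minibatches of indices using block sampling plus batched fetching."""
    rng = random.Random(seed)
    # Block sampling: cut the index range into contiguous blocks and shuffle
    # the blocks, so each block can be read with a single contiguous access.
    blocks = [
        list(range(start, min(start + block_size, n_samples)))
        for start in range(0, n_samples, block_size)
    ]
    rng.shuffle(blocks)
    order = [i for block in blocks for i in block]

    # Batched fetching: load batch_size * fetch_factor indices at once,
    # reshuffle inside the buffer, then split the buffer into minibatches.
    fetch_size = batch_size * fetch_factor
    for start in range(0, len(order), fetch_size):
        buffer = order[start:start + fetch_size]
        rng.shuffle(buffer)
        for b in range(0, len(buffer), batch_size):
            yield buffer[b:b + batch_size]


batches = list(
    block_shuffled_batches(n_samples=64, batch_size=8, block_size=4, fetch_factor=2)
)
```

With $m = 64$ and $b = 16$, for example, the number of random seeks per minibatch drops from 64 to $64/16 = 4$. A larger fetch factor mixes more blocks into each buffer, trading memory for minibatch diversity.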

Performance benchmarks on Tahoe 100M indicate up to $48\times$ speed-up over AnnLoader, $27\times$ over HuggingFace Datasets, and $18\times$ over BioNeMo (single-core), with scaling to $2593$ samples/sec and $129\times$ speed-up using multiprocessing.

6. Significance for Research and Model Development

Tahoe 100M is the largest known real-world, public recommender system dataset featuring search query logs, impressions, and rich side information. It directly addresses critical needs in the academic and professional research communities by enabling:

Development and evaluation of deep user models that learn from both what users search and what they engage with or skip.
Experimentation with context-, content-, and graph-based recommendation strategies due to its multi-faceted features.
Study of fairness, explainability, and temporal dynamics at scale, with auxiliary support for demographic inference, interest modeling, and content quality analysis.
Benchmarking of next-generation data loaders, such as scDataset, for high-throughput and randomness-balanced minibatch generation on massive sparse datasets.

7. Dataset Features Overview

Feature	Included?	Notes
Textual Content (encoded)	✓	Questions, answers, topics
User Profile	✓	Anonymized, diverse features
Item Attributes	✓	Answers, questions, authors, topics info
Timestamped Interactions	✓	All interactions and queries
Impression (Negative) Data	✓	Not only clicks, but shown and skipped entries
Search Queries (Explicit)	✓ (unique)	Up to 20 latest user search keywords
Favorites/Engagement Stats	✓ (aggregate)	Likes, thanks, collected in summary stats
Negative Feedback	✓	Comprehensive negative feedback logged

8. Accessibility and Citation

The Tahoe 100M dataset (ZhihuRec) is available at: https://github.com/THUIR/ZhihuRec-Dataset

9. Conclusion

Tahoe 100M establishes a new standard for scale and contextual richness in public datasets for recommendation and related data-driven research. Its integration of explicit search queries, comprehensive user and content metadata, and multi-type feedback fosters deep investigations into hybrid information retrieval–recommendation modeling, context-aware algorithms, and large-scale user behavior analysis. The dataset also serves as a proving ground for advances in efficient, scalable machine learning infrastructure, as evidenced by the performance of scDataset on its records, and is poised to support the next generation of methodological and applied research in this domain.
