Guru Datasets
- "Guru Dataset" refers to distinct resources used in research across AI, online labor markets, and education, designed for rigorous empirical evaluation and benchmarking.
- Specific "Guru Datasets" support research in online labor market team formation and AI model training for complex, multi-domain reasoning tasks.
- Another "Guru Dataset" tracks teacher professional development, while all datasets are characterized by rigorous cleaning and formal evaluation methods.
The Guru dataset refers to several distinct resources in recent academic literature, each designed for research applications in domains such as online labor market optimization, LLM reasoning, and teacher professional development. Each "Guru dataset" instance is shaped by its domain, but all are empirically grounded and crafted for systematic, large-scale evaluation.
1. Online Labor Market: Guru Team Formation Dataset
The Guru dataset originating from the online labor marketplace guru.com is a large-scale, real-world dataset used to study team formation and task assignment with heterogeneous skills (Nikolakaki et al., 2020).
Dataset Specifications:
- Experts: 6,120, each with an average of 13.07 skills (post-filtered for relevance to job requirements).
- Tasks: 3,195, with an average of 5.24 required skills per task.
- Data Scope: Skills are drawn from anonymized expert profiles and real project postings, reflecting redundant, overlapping, and sometimes informal requirements. All skills listed by experts but never required by any task are filtered out, ensuring the saliency of the final dataset.
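The saliency filter amounts to intersecting each expert's skill set with the union of all task requirements. A minimal sketch, with toy illustrative data (not the released preprocessing code):

```python
# Toy expert profiles and task requirements as sets of skill strings.
experts = [{"python", "sql", "cobol"}, {"java", "sql"}]
tasks = [{"python", "sql"}, {"java"}]

# Keep only skills that at least one task requires.
required = set().union(*tasks)                        # {"python", "sql", "java"}
experts = [skills & required for skills in experts]   # "cobol" is dropped
# -> [{"python", "sql"}, {"java", "sql"}]
```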
Comparative Table of Labor Datasets:
| Characteristic | Freelancer | Guru | Upwork |
|---|---|---|---|
| # Experts | 1,212 | 6,120 | 1,500 |
| # Tasks | 993 | 3,195 | 3,000 |
| Avg. skills/expert | 1.46 | 13.07 | 6.2 |
| Avg. skills/task | 2.86 | 5.24 | 39.9 |
Team Formation Model:
- Skills universe: $S = \{s_1, \dots, s_k\}$
- Expert pool: $X = \{x_1, \dots, x_n\}$, with each expert's skill set $x_i \subseteq S$
- Tasks: $J = \{J_1, \dots, J_m\}$, with each requirement set $J_j \subseteq S$

A team $T_j \subseteq X$ is assigned to each task $J_j$, aiming to cover as many task-required skills as possible. The key metric is coverage, not completion; partial task coverage is permitted, with solution quality proportional to the fraction of requirements met.
Core Formulas:
- Task coverage: $\mathrm{Cov}(T_j) = \dfrac{\lvert J_j \cap \bigcup_{x_i \in T_j} x_i \rvert}{\lvert J_j \rvert}$
- Skill coverage deficit: $\mathrm{Def}(T_j) = \lvert J_j \setminus \bigcup_{x_i \in T_j} x_i \rvert$
- Maximum expert load: $L_{\max} = \max_{x_i \in X} L_i$, with $L_i = \lvert \{\, j : x_i \in T_j \,\} \rvert$ the number of teams expert $x_i$ joins
- Total assignment cost: a combined objective trading coverage against load, e.g. $\sum_{j} \mathrm{Def}(T_j) + \lambda\, L_{\max}$ for a balance parameter $\lambda$
Three scalable heuristic algorithms—EXGreedy, TeamEXGreedy, and LP-based methods—are developed to optimize load and coverage. The problem is NP-hard, and the dataset’s expert surplus and high skill density support robust benchmarking of such algorithms. The practical implication is effective, load-balanced team formation in real labor marketplaces, with explicit trade-off between expert overload and skill coverage.
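For intuition, a minimal coverage-greedy baseline under the model above might look as follows. This is an illustrative sketch, not the paper's EXGreedy or TeamEXGreedy implementation; function and variable names are assumptions.

```python
def form_team(task: set[str], experts: list[set[str]], load: list[int],
              max_size: int = 5) -> list[int]:
    """Greedily add the expert covering the most still-uncovered task skills,
    preferring lightly loaded experts on ties; returns chosen expert indices."""
    team, covered = [], set()
    while len(team) < max_size and covered != task:
        candidates = [i for i in range(len(experts)) if i not in team]
        if not candidates:
            break
        # Most newly covered skills first; lower current load breaks ties.
        best = max(candidates,
                   key=lambda i: (len((experts[i] & task) - covered), -load[i]))
        if not (experts[best] & task) - covered:
            break  # no remaining candidate improves coverage
        team.append(best)
        covered |= experts[best] & task
    for i in team:
        load[i] += 1  # track per-expert load for the L_max objective
    return team

experts = [{"python", "sql"}, {"java"}, {"python", "java"}]
load = [0, 0, 0]
print(form_team({"python", "java"}, experts, load))  # -> [2]
```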
2. Reasoning Dataset for RL-Learned LLMs
The Guru dataset described in recent LLM research is a curated corpus for reinforcement learning (RL) in general reasoning, consisting of 92,000 verifiable examples across six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular (Cheng et al., 17 Jun 2025).
Composition:
- Domains: Math (competition/academic), Code (LeetCode, LiveCodeBench), Science (WebInstruct-Verified), Logic (ARC-AGI, Zebra Puzzle), Simulation (Code I/O), Tabular (HiTab, MultiHierTT).
- Data cleaning: Aggressive deduplication (e.g., 27% duplicates removed in Math).
- Reward Design: Automated, domain-specific, and low-noise (a minimal reward sketch follows this list):
- Math, Logic, Simulation, Tabular: Rule-based, format-sensitive answer extraction and numerical-equivalence checks (e.g., SymPy for math).
- Code: Execution-based, with output scripts tested against hidden test cases, time and memory constraints.
- Science: Model-based verification, using a 1.5B LLM for entailment.
- Difficulty filtering: Weak and strong LLMs are used to discard examples that are overly easy, impossible, or prone to label errors.
- Public release: Full dataset, RL-trained Guru-7B and Guru-32B models, training/evaluation code, and benchmark suite at https://github.com/LLM360/Reasoning360.
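As a concrete illustration of the rule-based math reward, the sketch below extracts a `\boxed{...}` answer and checks symbolic equivalence with SymPy. The function names and format convention are assumptions for illustration, not the released reward code.

```python
import re
import sympy

def extract_boxed(text: str) -> str | None:
    """Return the payload of the last \\boxed{...} in a response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def math_reward(response: str, gold: str) -> float:
    """1.0 if the predicted answer is symbolically equal to gold, else 0.0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0  # format-sensitive: no parseable \boxed answer, no reward
    try:
        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold))
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0

# math_reward("... so the answer is \\boxed{2/4}", "1/2")  -> 1.0
```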
Key Findings:
- RL from cross-domain data is effective in Math, Code, and Science (common in pretraining); Logic, Simulation, and Tabular require in-domain RL for meaningful skill acquisition.
- Guru models achieve state-of-the-art results among open RL-trained reasoning models: Guru-7B improves over prior baselines by 9.0% and Guru-32B by 6.7%, with gains across all six domains.
- Pass@k performance reveals that RL not only improves base accuracy but can expand the "reasoning boundary," especially for complex or underrepresented domains.
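Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); the source does not specify its estimator, so the sketch below reflects standard practice rather than the paper's exact evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=8))  # e.g. ~0.972
```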
This Guru dataset establishes a foundation for RL research in general-purpose LLM reasoning, with robust multi-domain benchmarking and domain-adapted reward structures.
3. Application in Multi-Dataset QA and Transfer Learning
In the context of multi-dataset question answering (QA), the Guru dataset is relevant as a candidate domain for integration within adapter-based architectures (Friedman et al., 2021).
Methodology:
- MADE (Multi-Adapter Dataset Experts): Single-dataset expert adapters coupled to a shared Transformer backbone.
- Training: Joint optimization (all model parameters) on all datasets, followed by adapter-specific fine-tuning.
- Zero-shot/few-shot transfer: Parameter-averaged adapters enable competitive results on unseen datasets; rapid specialization is achieved with few labeled examples.
Speculative Application:
If the Guru dataset (as a QA resource) were integrated, a new adapter-classifier set could be initialized by averaging the existing adapters. Where the new domain structurally overlaps the existing datasets, zero-shot performance would already be competitive, and few-shot fine-tuning would enable rapid adaptation. Parameter efficiency and modular growth are the main advantages, particularly for new or under-resourced tasks.
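A minimal sketch of the parameter-averaged initialization just described, assuming each adapter is a plain state dict of tensors; the structure and names are illustrative, not Friedman et al.'s released code.

```python
import torch

def average_adapters(adapters: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Element-wise mean of matching parameter tensors across expert adapters."""
    return {name: torch.stack([a[name] for a in adapters]).mean(dim=0)
            for name in adapters[0]}

# new_adapter = average_adapters([squad_adapter, nq_adapter, trivia_adapter])
# ...then fine-tune new_adapter on a few labeled Guru-domain QA examples.
```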
MADE consistently outperforms unified and single-dataset baselines in both in-distribution (F1 scores) and transfer settings.
4. Teacher Professional Development: Guru Dataset in Educational Research
In educational research, the "Guru" dataset refers to a cohort of 60 PAUD (early childhood education) teachers engaged in tiered training in Indonesia (Miftakhi et al., 2023).
Dataset Structure:
- Size and Groups: 60 teachers, equally distributed among basic, intermediate, and expert training cohorts.
- Content: Participation status, completion and assessment of independent assignments (lesson planning, execution, reflection), mentor assessment, and certification outcome.
- Outcomes: Mean assignment scores were 85.85 (Basic), 81.00 (Intermediate), and 81.10 (Expert), with all participants meeting performance standards.
Implications:
Dataset design for professional development should include stage, assessment output, certification status, and qualitative mentoring notes. Variables enable tracking of progress, effectiveness, and policy compliance, facilitating targeted interventions and evaluation of training gaps.
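A hedged sketch of the record layout such a dataset implies; the field names below are illustrative inferences from the variables listed above, not taken from Miftakhi et al.

```python
from dataclasses import dataclass

@dataclass
class TeacherRecord:
    teacher_id: str
    tier: str                 # "basic" | "intermediate" | "expert"
    participated: bool
    assignment_score: float   # lesson planning, execution, reflection
    mentor_notes: str         # qualitative mentoring assessment
    certified: bool           # certification outcome
```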
5. Comparative Perspective and Practical Significance
Across all instantiations, Guru datasets are distinguished by:
- High real-world complexity or domain-specificity (e.g., redundant or extensive skill listings in team formation, multi-domain challenges in RL and QA).
- Comprehensive data cleaning and validation pipelines, including deduplication, difficulty filtering, or mentor-evaluated performance.
- Formal or heuristic reward and evaluation schemes suited to empirical benchmarking.
- Open-source or systematic public release for transparency and reproducibility.
Significance:
The availability and rigorous construction of these datasets underpin empirical advances in RL for reasoning, scalable QA transfer methodologies, and labor market optimization. As foundational benchmarks or case studies, Guru datasets inform algorithm development, evaluation methodology, and applied decision-making in both AI systems and human resource management.