Guru Datasets
- "Guru Dataset" refers to distinct resources used in research across AI, online labor markets, and education, designed for rigorous empirical evaluation and benchmarking.
- Specific "Guru Datasets" support research in online labor market team formation and AI model training for complex, multi-domain reasoning tasks.
- Another "Guru Dataset" tracks teacher professional development, while all datasets are characterized by rigorous cleaning and formal evaluation methods.
The Guru dataset refers to several distinct resources in recent academic literature, each designed for research applications in domains such as online labor market optimization, LLM reasoning, and teacher professional development. Each "Guru dataset" instance is shaped by its domain, but all are empirically grounded and crafted for systematic, large-scale evaluation.
1. Online Labor Market: Guru Team Formation Dataset
The Guru dataset originating from the online labor marketplace guru.com is a large-scale, real-world dataset used to study team formation and task assignment with heterogeneous skills (Nikolakaki et al., 2020).
Dataset Specifications:
- Experts: 6,120, each with an average of 13.07 skills (post-filtered for relevance to job requirements).
- Tasks: 3,195, with an average of 5.24 required skills per task.
- Data Scope: Skills are drawn from anonymized expert profiles and real project postings, reflecting redundant, overlapping, and sometimes informal requirements. All skills listed by experts but never required by any task are filtered out, ensuring the saliency of the final dataset.
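The saliency filter amounts to intersecting each expert's skill set with the union of all task requirements. A minimal sketch, with toy illustrative data (not the released preprocessing code):

```python
# Toy expert profiles and task requirements as sets of skill strings.
experts = [{"python", "sql", "cobol"}, {"java", "sql"}]
tasks = [{"python", "sql"}, {"java"}]

# Keep only skills that at least one task requires.
required = set().union(*tasks)                        # {"python", "sql", "java"}
experts = [skills & required for skills in experts]   # "cobol" is dropped
# -> [{"python", "sql"}, {"java", "sql"}]
```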
Comparative Table of Labor Datasets:
| Characteristic | Freelancer | Guru | Upwork |
|---|---|---|---|
| # Experts | 1,212 | 6,120 | 1,500 |
| # Tasks | 993 | 3,195 | 3,000 |
| Avg. skills/expert | 1.46 | 13.07 | 6.2 |
| Avg. skills/task | 2.86 | 5.24 | 39.9 |
Team Formation Model:
- Skills universe: $S = \{s_1, \dots, s_k\}$
- Expert pool: $X = \{x_1, \dots, x_n\}$, with each expert's skill set $x_i \subseteq S$
- Tasks: $J = \{J_1, \dots, J_m\}$, with each requirement set $J_j \subseteq S$

A team $T_j \subseteq X$ is assigned to each task $J_j$, aiming to cover as many task-required skills as possible. The key metric is coverage, not completion; partial task coverage is permitted, with solution quality proportional to the fraction of requirements met.
Core Formulas:
- Task coverage: $\mathrm{Cov}(T_j) = \dfrac{\lvert J_j \cap \bigcup_{x_i \in T_j} x_i \rvert}{\lvert J_j \rvert}$
- Skill coverage deficit: $\mathrm{Def}(T_j) = \lvert J_j \setminus \bigcup_{x_i \in T_j} x_i \rvert$
- Maximum expert load: $L_{\max} = \max_{x_i \in X} L_i$, with $L_i = \lvert \{\, j : x_i \in T_j \,\} \rvert$ the number of teams expert $x_i$ joins
- Total assignment cost: a combined objective trading coverage against load, e.g. $\sum_{j} \mathrm{Def}(T_j) + \lambda\, L_{\max}$ for a balance parameter $\lambda$
Three scalable heuristic algorithms—EXGreedy, TeamEXGreedy, and LP-based methods—are developed to optimize load and coverage. The problem is NP-hard, and the dataset’s expert surplus and high skill density support robust benchmarking of such algorithms. The practical implication is effective, load-balanced team formation in real labor marketplaces, with explicit trade-off between expert overload and skill coverage.
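For intuition, a minimal coverage-greedy baseline under the model above might look as follows. This is an illustrative sketch, not the paper's EXGreedy or TeamEXGreedy implementation; function and variable names are assumptions.

```python
def form_team(task: set[str], experts: list[set[str]], load: list[int],
              max_size: int = 5) -> list[int]:
    """Greedily add the expert covering the most still-uncovered task skills,
    preferring lightly loaded experts on ties; returns chosen expert indices."""
    team, covered = [], set()
    while len(team) < max_size and covered != task:
        candidates = [i for i in range(len(experts)) if i not in team]
        if not candidates:
            break
        # Most newly covered skills first; lower current load breaks ties.
        best = max(candidates,
                   key=lambda i: (len((experts[i] & task) - covered), -load[i]))
        if not (experts[best] & task) - covered:
            break  # no remaining candidate improves coverage
        team.append(best)
        covered |= experts[best] & task
    for i in team:
        load[i] += 1  # track per-expert load for the L_max objective
    return team

experts = [{"python", "sql"}, {"java"}, {"python", "java"}]
load = [0, 0, 0]
print(form_team({"python", "java"}, experts, load))  # -> [2]
```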
2. Reasoning Dataset for RL-Learned LLMs
The Guru dataset described in recent LLM research is a curated corpus for reinforcement learning (RL) in general reasoning, consisting of 92,000 verifiable examples across six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular (Cheng et al., 17 Jun 2025).
Composition:
- Domains: Math (competition/academic), Code (LeetCode, LiveCodeBench), Science (WebInstruct-Verified), Logic (ARC-AGI, Zebra Puzzle), Simulation (Code I/O), Tabular (HiTab, MultiHierTT).
- Data cleaning: Aggressive deduplication (e.g., 27% duplicates removed in Math).
- Reward Design: Automated, domain-specific, and low-noise (a minimal reward sketch follows this list):
- Math, Logic, Simulation, Tabular: Rule-based, format-sensitive answer extraction and numerical-equivalence checks (e.g., SymPy for math).
- Code: Execution-based, with output scripts tested against hidden test cases, time and memory constraints.
- Science: Model-based verification, using a 1.5B LLM for entailment.
- Difficulty filtering: Weak and strong LLMs are used to discard examples that are overly easy, impossible, or prone to label errors.
- Public release: Full dataset, RL-trained Guru-7B and Guru-32B models, training/evaluation code, and benchmark suite at https://github.com/LLM360/Reasoning360.
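As a concrete illustration of the rule-based math reward, the sketch below extracts a `\boxed{...}` answer and checks symbolic equivalence with SymPy. The function names and format convention are assumptions for illustration, not the released reward code.

```python
import re
import sympy

def extract_boxed(text: str) -> str | None:
    """Return the payload of the last \\boxed{...} in a response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def math_reward(response: str, gold: str) -> float:
    """1.0 if the predicted answer is symbolically equal to gold, else 0.0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0  # format-sensitive: no parseable \boxed answer, no reward
    try:
        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold))
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0

# math_reward("... so the answer is \\boxed{2/4}", "1/2")  -> 1.0
```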
Key Findings:
- RL from cross-domain data is effective in Math, Code, and Science (common in pretraining); Logic, Simulation, and Tabular require in-domain RL for meaningful skill acquisition.
- Guru models achieve state-of-the-art results among open RL-trained reasoning models: Guru-7B improves over prior baselines by 9.0% and Guru-32B by 6.7%, with gains across all six domains.
- Pass@k performance reveals that RL not only improves base accuracy but can expand the "reasoning boundary," especially for complex or underrepresented domains.
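Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); the source does not specify its estimator, so the sketch below reflects standard practice rather than the paper's exact evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=8))  # e.g. ~0.972
```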
This Guru dataset establishes a foundation for RL research in general-purpose LLM reasoning, with robust multi-domain benchmarking and domain-adapted reward structures.
3. Application in Multi-Dataset QA and Transfer Learning
In the context of multi-dataset question answering (QA), the Guru dataset is relevant as a candidate domain for integration within adapter-based architectures (Friedman et al., 2021).
Methodology:
- MADE (Multi-Adapter Dataset Experts): Single-dataset expert adapters coupled to a shared Transformer backbone.
- Training: Joint optimization (all model parameters) on all datasets, followed by adapter-specific fine-tuning.
- Zero-shot/few-shot transfer: Parameter-averaged adapters enable competitive results on unseen datasets; rapid specialization is achieved with few labeled examples.
Speculative Application:
If the Guru dataset (as a QA resource) were integrated, a new adapter-classifier set could be initialized by averaging the existing adapters. Where the new domain structurally overlaps the existing datasets, zero-shot performance would already be competitive, and few-shot fine-tuning would enable rapid adaptation. Parameter efficiency and modular growth are the main advantages, particularly for new or under-resourced tasks.
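A minimal sketch of the parameter-averaged initialization just described, assuming each adapter is a plain state dict of tensors; the structure and names are illustrative, not Friedman et al.'s released code.

```python
import torch

def average_adapters(adapters: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Element-wise mean of matching parameter tensors across expert adapters."""
    return {name: torch.stack([a[name] for a in adapters]).mean(dim=0)
            for name in adapters[0]}

# new_adapter = average_adapters([squad_adapter, nq_adapter, trivia_adapter])
# ...then fine-tune new_adapter on a few labeled Guru-domain QA examples.
```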
MADE consistently outperforms unified and single-dataset baselines in both in-distribution (F1 scores) and transfer settings.
4. Teacher Professional Development: Guru Dataset in Educational Research
In educational research, the "Guru" dataset refers to a cohort of 60 PAUD (early childhood education) teachers engaged in tiered training in Indonesia (Miftakhi et al., 2023).
Dataset Structure:
- Size and Groups: 60 teachers, equally distributed among basic, intermediate, and expert training cohorts.
- Content: Participation status, completion and assessment of independent assignments (lesson planning, execution, reflection), mentor assessment, and certification outcome.
- Outcomes: Mean assignment scores were 85.85 (Basic), 81.00 (Intermediate), and 81.10 (Expert), with all participants meeting performance standards.
Implications:
Dataset design for professional development should include stage, assessment output, certification status, and qualitative mentoring notes. Variables enable tracking of progress, effectiveness, and policy compliance, facilitating targeted interventions and evaluation of training gaps.
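A hedged sketch of the record layout such a dataset implies; the field names below are illustrative inferences from the variables listed above, not taken from Miftakhi et al.

```python
from dataclasses import dataclass

@dataclass
class TeacherRecord:
    teacher_id: str
    tier: str                 # "basic" | "intermediate" | "expert"
    participated: bool
    assignment_score: float   # lesson planning, execution, reflection
    mentor_notes: str         # qualitative mentoring assessment
    certified: bool           # certification outcome
```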
5. Comparative Perspective and Practical Significance
Across all instantiations, Guru datasets are distinguished by:
- High real-world complexity or domain-specificity (e.g., redundant or extensive skill listings in team formation, multi-domain challenges in RL and QA).
- Comprehensive data cleaning and validation pipelines, including deduplication, difficulty filtering, or mentor-evaluated performance.
- Formal or heuristic reward and evaluation schemes suited to empirical benchmarking.
- Open-source or systematic public release for transparency and reproducibility.
Significance:
The availability and rigorous construction of these datasets underpin empirical advances in RL for reasoning, scalable QA transfer methodologies, and labor market optimization. As foundational benchmarks or case studies, Guru datasets inform algorithm development, evaluation methodology, and applied decision-making in both AI systems and human resource management.