Autodata: An agentic data scientist to create high quality synthetic data

Published 24 Jun 2026 in cs.AI, cs.CL, and cs.LG | (2606.25996v1)

Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

Authors (15)

Summary

The paper presents Autodata, an agentic framework that iteratively generates high-quality synthetic data via AI-driven feedback loops.
It significantly outperforms traditional methods in computer science, legal, and mathematical reasoning tasks with enhanced weak–strong discrimination and training metrics.
Meta-optimization refines the agent’s prompt and scaffolding, raising validation pass rates from 62.1% to 79.6%; separately, training on agentic data improves token efficiency by reducing reasoning truncation.

Autodata: An Agentic Data Scientist for High-Quality Synthetic Data

Framework and Methodological Innovations

The paper presents Autodata, a general framework that leverages AI agents as data scientists for constructing and curating high-quality synthetic training and evaluation data. Unlike classical prompt-based or filtering-centric synthetic data generation, Autodata introduces a cyclical data-science process: the agent iteratively generates data, inspects and quantitatively evaluates its utility, synthesizes learnings, and revises its data-generation strategy. Crucially, the agent itself is meta-optimized: an outer loop scores the agent with the same criteria applied to its outputs and updates it so that it becomes a more effective data scientist.

Figure 1: The Autodata pipeline’s cyclical agentic workflow, spanning iterative data generation, feedback, refinement, and meta-optimization.

Autodata unifies—and generalizes—prior synthetic data creation mechanisms. The central methodological contribution is the explicit agentic loop: an orchestrator agent, supported by subagents (challenger, weak/strong solvers, verifier/judge), generates candidate examples, tests discrimination between weak and strong solvers, analyzes failures, and adapts its strategy to enhance utility for subsequent rounds.

Agentic Self-Instruct: Instantiation and Mechanisms

A canonical implementation, Agentic Self-Instruct (ASI), operationalizes the framework. The main LLM orchestrator coordinates four subagents: the challenger generates tasks; weak and strong solvers attempt them; and a judge evaluates all outputs. The pipeline aims to produce training examples where the strong solver reliably succeeds and the weak solver fails, guided by rubric-based evaluation and explicit gap criteria. This discriminative approach ensures data is neither too easy (ineffective for skill acquisition) nor too hard (impractical for learning).
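The acceptance rule described above can be sketched as a small predicate. This is a minimal illustration, not the paper's implementation: the threshold values, the `Thresholds` class, and the `classify_candidate` function are assumptions introduced here, since the summary does not state the actual gap criteria.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; the source does not state the real thresholds.
    min_strong: float = 0.7   # strong solver must score at least this
    max_weak: float = 0.6     # weak solver must score at most this
    min_gap: float = 0.2      # required strong-minus-weak margin


def classify_candidate(weak_scores, strong_scores, t=Thresholds()):
    """Label one candidate task from the judge's per-rollout rubric scores."""
    weak, strong = mean(weak_scores), mean(strong_scores)
    if strong < t.min_strong:
        return "too_hard"   # even the strong solver fails
    if weak > t.max_weak:
        return "too_easy"   # the weak solver already succeeds
    if strong - weak < t.min_gap:
        return "low_gap"    # both in range, but not discriminative enough
    return "accept"


print(classify_candidate([0.40, 0.50, 0.45], [0.80, 0.75, 0.80]))
```

Anything other than `accept` would be handed back to the challenger together with the judge's feedback, which is the step that separates the agentic loop from one-shot filtering.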

Figure 2: The Agentic Self-Instruct weak-vs-strong setup, with subagents generating, solving, and evaluating candidate examples; the loop maximizes discriminative learning signal.

Empirical Results Across Diverse Domains

Computer Science Research Tasks

The approach was validated on CS research question generation, extracting data from 10k S2ORC papers. Agentic Self-Instruct required multiple iterative rounds (mean 6.59 per accepted item) and demonstrably outperformed standard CoT Self-Instruct across every key metric. Rubric-based rewards for strong solvers improved from 0.696 (CoT) to 0.772 (Agentic), while the weak solver’s score dropped precipitously (0.677 → 0.458). The strong–weak gap expanded from 0.019 to 0.314 under Agentic Self-Instruct. RL-trained Qwen3.5-4B models using Agentic data generalized better and exhibited superior performance on both in-distribution and harder out-of-distribution test sets.
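As a quick check of the arithmetic, the strong–weak gaps follow directly from the four mean scores quoted above. The dictionary layout and the `gap` helper are illustrative; only the numbers come from the summary.

```python
# Mean rubric scores reported in the summary above.
scores = {
    "CoT Self-Instruct": {"strong": 0.696, "weak": 0.677},
    "Agentic Self-Instruct": {"strong": 0.772, "weak": 0.458},
}


def gap(method):
    """Strong-minus-weak score for one data-generation method."""
    s = scores[method]
    return round(s["strong"] - s["weak"], 3)


gaps = {method: gap(method) for method in scores}
for method, g in gaps.items():
    print(f"{method}: gap = {g:.3f}")
```

The gap widens mostly because the weak solver's score falls (0.677 to 0.458) while the strong solver's score rises modestly (0.696 to 0.772).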

Figure 3: RL training curves for Qwen3.5-4B on Agentic vs CoT Self-Instruct data, showing persistent superiority of Agentic created data across rubrics and test sets.

Agentic Self-Instruct systematically evolved questions to increase difficulty, utilizing judge feedback to push towards deeper reasoning and technical specificity. The iterative agentic loop operationalizes data discovery beyond initial prompts, generating examples that probe technical mechanisms and nuanced reasoning, thereby maximizing learning signal.

Figure 4: Agentic Self-Instruct’s multi-stage question and rubric generation for CS research tasks, tracking evolution in response to solver and judge feedback.

Legal Reasoning Tasks

In legal reasoning over public court opinions, a contrasting failure mode was observed: standard CoT Self-Instruct produced tasks that were too hard, yielding sparse reward signals. The agentic loop, guided by flexible loop-judge decisions and targeted feedback, adjusted question formats, increased weak solver variance (from 7.93 to 12.63), and shifted distributions to maximize RL reward suitability. RL on Agentic Self-Instruct data (2.8k items) produced higher scores on PRBench-Legal (0.441 GPT-5, 0.393 Kimi) than both CoT-trained and even much larger strong baseline models.

Figure 5: RL training dynamics on legal reasoning, highlighting Agentic Self-Instruct’s consistent performance lead over CoT Self-Instruct in PRBench evaluation and validation.

The critical insight is not to optimize for harder tasks, but to tune data “just right” for maximal gradient signal. The Agentic Self-Instruct loop adapts data characteristics towards utility, avoiding degenerate reward signals.
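One way to see why "too hard" data gives a degenerate signal is to look at group-relative advantages as used in GRPO-style training, where each rollout's reward is normalized against the other rollouts for the same prompt. The sketch below uses the common mean and standard-deviation normalization; whether the authors use exactly this variant is an assumption, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


too_hard = [0.0, 0.0, 0.0, 0.0]     # every rollout fails: no variance
just_right = [0.2, 0.8, 0.4, 0.6]   # mixed outcomes: usable signal

print([round(a, 2) for a in group_relative_advantages(too_hard)])
print([round(a, 2) for a in group_relative_advantages(just_right)])
```

When all rollouts receive the same reward, every advantage is zero and the prompt contributes nothing to the policy update. This is why raising the weak solver's score variance matters more here than raising difficulty.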

Scientific Reasoning

On mathematical reasoning benchmarks (Principia collection), Agentic Self-Instruct produced the most robust improvement (+3.20% avg@8, +1.04% on Principia OOD benchmark) compared to CoT Self-Instruct or even combined datasets of double size. Training on harder, more discriminative agentic data transferred to easier distributions, substantiating the hypothesis that inference-time compute invested in data quality is more effective than simply increasing dataset size. Moreover, training improved token efficiency dramatically, reducing reasoning truncation rates from 23.75% to 4.09% in the agentic configuration.

Meta-Optimization of the Agentic Data Scientist

Meta-optimization was applied to the data scientist’s prompt and scaffolding using an evolution-inspired framework. This outer loop evaluated candidate prompt modifications for data scientist agents (using trajectory analysis, root-cause diagnosis of solver failures, and automated code-editing), accepting only those yielding improved validation performance. Over 233 iterations, validation pass rates increased from 62.1% (baseline) to 79.6%. Automatically learned modifications included enforcement of paper-specific insights, prevention of context leakage, rubric format restructuring, and weight capping, all contributing to more rigorous discrimination between weak and strong solvers.
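The validation-gated outer loop can be illustrated with a toy hill-climbing sketch. Everything here is a stand-in: `validation_pass_rate` is a made-up scoring function (a real system would run the data scientist agent on held-out papers), `propose_edit` replaces the trajectory analysis and code-editing agent with a random mutation, and the 0.60 baseline is arbitrary rather than the paper's figure. Only the rule names echo the modifications listed above.

```python
import random

# Rule names echo the learned modifications listed above; the rest is a toy.
HELPFUL_RULES = [
    "require paper-specific insight",
    "forbid context leakage",
    "restructure rubric format",
    "cap rubric weights",
]


def validation_pass_rate(rules):
    """Toy evaluator (assumption): each helpful rule adds a fixed amount."""
    return 0.60 + 0.05 * len(set(rules) & set(HELPFUL_RULES))


def propose_edit(rules, rng):
    """Toy mutation (assumption): toggle one rule chosen at random."""
    rule = rng.choice(HELPFUL_RULES + ["ask longer questions"])
    return rules - {rule} if rule in rules else rules | {rule}


def meta_optimize(iterations=50, seed=0):
    rng = random.Random(seed)
    best = frozenset()
    best_score = validation_pass_rate(best)
    for _ in range(iterations):
        candidate = propose_edit(best, rng)
        score = validation_pass_rate(candidate)
        if score > best_score:   # accept only strict validation improvements
            best, best_score = candidate, score
    return best, best_score


best_rules, best_score = meta_optimize()
print(sorted(best_rules), round(best_score, 2))
```

Because edits are kept only when validation improves, neutral or harmful changes (here, "ask longer questions" or dropping a helpful rule) never enter the accepted prompt.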

Figure 6: Meta-optimization process for the agentic data scientist, achieving substantial gains in weak–strong discriminator success rate.

Implications and Prospective Directions

Autodata reframes synthetic data creation as an agentic, data-science-driven process, capable of both producing more challenging benchmarks and optimizing datasets for discriminative RLHF and preference learning. By explicitly targeting weak–strong separation, agentic data creation operationalizes curriculum and task tuning, mitigating the diminishing returns from classical methods as LLM capabilities expand.

Practical implications include more efficient pipelines for post-training RL, robust cross-task generalization, reduced dependence on human labor for data curation, and improved token efficiency. Theoretically, the framework elevates data-centric optimization to equal footing with architecture or recipe search in scaling AI capabilities.

Future directions include extending agentic data generation to more diverse domains and modalities, co-research systems where human and agentic loops interact, robust safeguards against agent gaming or degenerate task generation, dataset-level diversity analysis, and integration of meta-optimization with ongoing solver training for autonomous, self-improving loops.

Conclusion

Autodata establishes a rigorous, agentic framework for synthetic data creation, leveraging iterative analysis, solver differentiation, judge feedback, and meta-optimization for superior training and evaluation data. Empirical evidence demonstrates consistent gains over classical approaches across CS, legal, and mathematical reasoning domains, with scalable implications for RLHF, robust benchmarking, and autonomous data science. The agentic paradigm—centered on discriminative, feedback-driven synthetic data—constitutes a fundamental shift in data-driven AI pipeline optimization (2606.25996).

Explain it Like I'm 14

Simple Explanation of “Autodata: An agentic data scientist to create high quality synthetic data”

1) What is this paper about?

This paper introduces Autodata, a way to let AI act like a data scientist. Instead of people writing all the practice problems and test questions that train AI, Autodata uses AI agents to design, check, and improve those questions on their own. The goal is to make better “synthetic data” (AI-made training examples) that helps models learn faster and perform better, especially on tough reasoning tasks.

2) What questions are the researchers trying to answer?

They ask:

Can an AI “data scientist” design training and test questions that are the right level of difficulty for improving another AI model?
Does iteratively checking and improving those questions lead to better results than older one-shot methods (like simple prompting)?
Can we also train the data-creating agent itself to get even better over time?

3) How does it work? (Methods explained simply)

Think of training an AI like training a sports team:

The team needs good practice drills (training data) that are neither too easy nor too hard.
A coach watches how players perform, then adjusts drills to focus on what needs improvement.

Autodata does this with AI:

An AI agent creates practice questions, looks at how models do on them, learns from mistakes, and updates the question-making strategy. This loop repeats to steadily improve data quality.

A specific version they test is called Agentic Self-Instruct. It uses four AI “subagents” with different roles:

The system first creates a candidate question (like a problem from a math paper, a legal case, or a computer science paper).
It then tests two solvers:
- A “weak solver” (a smaller or restricted model),
- A “strong solver” (a bigger or more capable model).
A “judge” evaluates their answers using a rubric (a checklist with scoring rules), and the main agent decides whether the question is accepted or needs improvement.

Here is the lineup used in the loop:

Challenger: proposes the question, the context, and a scoring rubric.
Weak solver: tries to answer; ideally struggles somewhat.
Strong solver: tries to answer; ideally does well.
Judge: scores the answers, checks question quality, and gives feedback.

If the question is too easy (both solvers do well) or too hard (both fail), the agent revises the question to hit the “just right” zone where the strong solver succeeds but the weak solver struggles—this creates useful training data to help the weak solver improve.
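The revise-until-just-right loop can be simulated with a toy model in which a question is reduced to a single difficulty number. The two solver functions, the thresholds, and the step size are all assumptions made for illustration; in the real system the challenger rewrites the question text using the judge's feedback rather than nudging a scalar.

```python
def weak_solver(difficulty):
    """Toy model (assumption): the weak solver's score drops quickly."""
    return max(0.0, 1.0 - 1.6 * difficulty)


def strong_solver(difficulty):
    """Toy model (assumption): the strong solver's score drops slowly."""
    return max(0.0, 1.0 - 0.4 * difficulty)


def refine_question(difficulty, max_rounds=10, step=0.1,
                    min_strong=0.7, max_weak=0.5):
    """Revise until the strong solver succeeds and the weak one struggles."""
    history = []
    for round_no in range(1, max_rounds + 1):
        weak, strong = weak_solver(difficulty), strong_solver(difficulty)
        history.append((round_no, round(difficulty, 2),
                        round(weak, 2), round(strong, 2)))
        if strong < min_strong:
            difficulty -= step            # too hard: make it easier
        elif weak > max_weak:
            difficulty += step            # too easy: make it harder
        else:
            return difficulty, history    # accepted
    return None, history                  # round budget exhausted


accepted, history = refine_question(0.0)
for row in history:
    print(row)
```

Starting from a trivially easy question, the loop needs several rounds before the weak solver starts to struggle while the strong solver still succeeds, which mirrors the multi-round behavior reported for accepted items.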

They also “ground” questions on real documents (like actual legal cases or research papers). Grounding means the AI uses real sources, which reduces made-up facts and keeps questions specific and diverse.

Finally, they explore meta-optimization: training the data-creating agent itself. You can picture this as “teaching the coach to be a better coach.” The system analyzes where its data creation goes wrong (like giving away answers in the context or writing confusing rubrics), then updates the agent’s prompts and rules to fix those issues.

Technical terms in everyday language:

Synthetic data: practice questions and examples made by AI rather than by humans.
LLM: a powerful text-based AI that reads and writes.
Chain-of-Thought (CoT): the model writes its step-by-step reasoning, like showing work in math.
Reinforcement Learning (RL): a way to train models using “rewards” based on how well they perform; here, the judge scores are the rewards.
GRPO: a specific RL training setup where the model generates multiple answers per question and learns from judged scores.

4) What did they find, and why does it matter?

They test Autodata on three areas—computer science research questions, legal reasoning, and scientific/mathematical problems—and consistently see better results than older synthetic data methods.

Key takeaways:

Computer Science tasks: The agentic loop shifts questions from generic summaries to detailed, paper-specific reasoning (like algorithm steps or ablation details). Training on these improved questions makes the weak solver significantly better, and it beats the baseline training method throughout training.
Legal reasoning: The usual CoT approach made questions too hard (weak solver often scored near zero, which is bad for learning). The agentic loop adjusted difficulty to “just right”: it increased the weak solver’s score variability and made questions more learnable. After training on agentic data, a small model (4B) even outperformed a much larger model (397B) on the PRBench-Legal benchmark.
Scientific/mathematical reasoning: Training on agentic-generated, harder problems improved performance both in-distribution (on similar data) and out-of-distribution (on independent benchmarks like Principia). It showed better generalization, not just memorization. Training also reduced cases where the model’s reasoning got cut off by token limits (it became more “token-efficient” in its thinking).

Meta-optimization results:

When they “trained the coach,” the agent’s prompt and rules improved. For example, they enforced paper-specific questions, banned context leaks (not revealing the answer in the provided text), and fixed rubric format issues. This raised the agent’s success rate in producing useful questions (from roughly 62% to nearly 80% on validation), without manual tuning.

Why this matters:

Stronger training data leads to stronger models.
The agent can adapt question difficulty to the model’s current skill level—like a good coach.
It shows that spending compute on smarter data creation can be more impactful than just collecting more data or only tuning model architecture.

5) What’s the big picture?

Autodata points to a future where AI systems help build the very training data that makes other AI systems better. Instead of relying heavily on human-written examples (which can be slow and expensive) or simple one-shot AI prompts (which may be too easy or too hard), agentic data creation:

Iteratively improves question quality,
Targets the “just right” difficulty for learning,
Converts extra compute at inference time into better training signals,
Scales to different domains (CS, law, math) and evaluation styles (rubrics, verifiable answers).

In short, Autodata shows a promising way to keep AI progress going: teach the models using smarter, tailored practice that they themselves help design—then teach the teacher to improve that process, too.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research.

Judge dependence and circularity: The pipeline relies heavily on LLM-based judges (primarily Kimi-K2.6) for acceptance decisions, RL rewards, and evaluation. Quantify cross-judge robustness (e.g., Kimi vs multiple independent graders, including humans), assess susceptibility to reward hacking, and measure agreement rates and sensitivity to judge identity.
Generalization across domains: Experiments cover CS research QA, legal reasoning, and math/science problems only. Test Autodata on additional domains (code with execution, multimodal tasks, safety- and policy-sensitive tasks, multilingual settings) to assess transferability.
Verifiability gaps: For non-verifiable tasks, correctness rests on LLM-generated rubrics. Establish independent verifiability (e.g., tool-based checks, rule-executable rubrics, or human audits) and quantify rubric accuracy, coverage, and consistency across graders.
Compute cost and scalability: The agentic loop requires multiple subagents, rollouts, and several rounds per accepted item (e.g., ~6.6 rounds in CS, ~5 in Legal). Provide a rigorous cost-benefit analysis (tokens per accepted item, energy/latency) and study scaling laws and optimal budget allocation across components (rounds, rollouts, judge invocations).
Sensitivity to solver choices: Results use a specific weak/strong solver pair (Qwen3.5-4B vs Qwen3.5-397B-A17B). Systematically vary teacher models, modes (scaffolding, self-consistency), and sizes to evaluate stability, data quality, and transfer when teacher-student capabilities are closer or the same model is used in different modes.
Acceptance criteria design: The CS setting uses hard thresholds on weak/strong scores and gap, while Legal uses a flexible judge. Study how different acceptance rules affect data usefulness (ablation of thresholds, dynamic vs fixed criteria) and identify principled difficulty-calibration methods that generalize across tasks.
Robustness to grading noise: Solvers run with temperature and stochastic rollouts; acceptance decisions may be noisy. Quantify how rollout count, temperature, and majority-vote schemes affect acceptance reliability and downstream RL, and develop uncertainty-aware acceptance (e.g., confidence intervals, repeated grading).
Guardrails and anti-gaming: The paper mentions guardrails but does not specify them. Design and evaluate anti-exploitation measures (e.g., adversarial audits, independence between judge and generator, leakage detection, perturbation tests) to prevent agents from optimizing toward judge-specific artifacts.
Grounding and leakage measurement: Although context leakage checks are used, there is no systematic estimate of leakage/hallucination rates across datasets. Build automated leakage detectors, conduct human spot-checks, and report leakage prevalence and its impact on downstream performance.
Data diversity and mode collapse: The iterative loop may bias toward question types that best separate weak/strong models, potentially narrowing topical diversity. Measure and enforce diversity (topic/skill coverage, novelty vs redundancy) and study its relation to generalization.
Downstream evaluation breadth: CS experiments rely on in-house rubric-graded tests derived from the same process; there is no external, human-authored CS benchmark used. Add independent, human-curated evaluations to validate real-world utility.
Human evaluation: Aside from PRBench’s judge-based scoring, no human rating of question quality, rubric adequacy, or solver responses is reported. Include human studies to validate correctness, difficulty, clarity, and fairness.
Training objective comparisons: Only GRPO is used. Compare RL approaches (PPO, RLAIF/RLHF, DPO, KTO) and SFT (with or without preference/RM augmentation) on the same agentic data to understand which objectives exploit Autodata-generated data most effectively.
Tool-augmented verification: Math/science tasks could leverage symbolic solvers; code tasks could use execution; legal tasks could use retrieval/citation validators. Evaluate whether tool-based verifiers improve correctness and reduce judge bias compared to pure LLM judging.
Difficulty calibration theory: The paper shows “too easy” vs “too hard” failure modes and heuristic fixes. Develop formal, task-agnostic metrics and controllers for “just-right” difficulty (e.g., target success bands, adaptive bandits) and test their stability across domains.
Meta-optimization external validity: Meta-optimization of the agent’s prompts is only demonstrated on a small CS set (50 train/25 val papers) and evaluated on pass rate, not downstream RL. Validate across larger, heterogeneous tasks, measure downstream RL gains, and test for overfitting to validation tasks or judges.
Outer-loop safety and stability: The code-editing agent can introduce brittle or unintended prompt changes. Study convergence, reproducibility, rollback strategies, and safety constraints (e.g., schema enforcement, diff linting, regression tests) to ensure stable evolution.
Teacher bias imprinting: Data is optimized to separate a particular strong solver from a weak solver, risking teacher-specific biases. Evaluate cross-teacher generalization (train on data created with Teacher A, test with Teacher B) and develop teacher-agnostic acceptance criteria.
Extractor quality in Legal: The legal pipeline depends on a document extractor agent; its accuracy and failure modes are not quantified. Evaluate extractor precision/recall and its impact on downstream data quality and RL outcomes.
Capacity limits and model scaling: Scientific reasoning gains on OOD Principia are modest, with hints of 4B model capacity limits. Test with larger student models and report how student capacity interacts with the difficulty of Autodata-generated data and with dataset size.
Alternative baselines under matched budgets: The paper compares against CoT Self-Instruct but not against other strong synthetic data strategies (e.g., instruction evolution, refinement, self-challenging) under the same compute budget. Provide head-to-head comparisons with compute accounting.
Contamination checks: Ensure no overlap between Autodata-generated data and evaluation sets (e.g., PRBench), and assess whether grounding sources leak into test distributions. Report contamination auditing procedures and results.
Multi-turn and tool-use tasks: The current focus is single-turn QA/rubric grading. Extend to interactive tasks (planning, tool-use, debugging, multi-step research) and study how the agentic loop adapts acceptance and verification in these settings.
Rollout and sampling policies: Only a few rollout counts (e.g., 3–5) are used; their impact on acceptance and RL is not ablated. Explore adaptive rollout budgeting and self-consistency strategies to trade off compute for reliability.
Data and scaffold release: The paper does not state whether datasets, prompts, or agent scaffolds will be released. Public release is needed for reproducibility and third-party auditing of judge-dependence and leakage.
Fairness and bias: No analysis of demographic, topical, or jurisdictional biases in generated data (especially in legal) is provided. Conduct bias audits and develop constraints/regularizers to mitigate biased or harmful generations.
Formal quality metrics: “Data quality” is proxied by downstream scores and weak–strong gaps. Define and validate more general, task-agnostic quality metrics (e.g., correctness, novelty, verifiability, diversity) and analyze their predictive power for downstream gains.
Stopping criteria and diminishing returns: Average rounds per accepted sample are reported, but not the marginal utility of additional rounds. Develop stopping rules that maximize performance-per-token and study round-level return curves.
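The "robustness to grading noise" item above mentions uncertainty-aware acceptance via confidence intervals. One possible shape for that idea, which is not something the paper reports doing, is a percentile bootstrap on the strong-minus-weak gap, accepting a question only when the lower bound clears a threshold. The function names, the 0.2 threshold, and the sample scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_gap_interval(weak, strong, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(strong) - mean(weak)."""
    rng = random.Random(seed)
    gaps = sorted(
        mean(rng.choices(strong, k=len(strong)))
        - mean(rng.choices(weak, k=len(weak)))
        for _ in range(n_boot)
    )
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def accept_with_confidence(weak, strong, min_gap=0.2, **kwargs):
    """Accept only if the lower bound of the gap interval clears min_gap."""
    lo, _ = bootstrap_gap_interval(weak, strong, **kwargs)
    return lo >= min_gap


clear = accept_with_confidence([0.10, 0.20, 0.10, 0.15, 0.20],
                               [0.80, 0.90, 0.85, 0.80, 0.90])
noisy = accept_with_confidence([0.2, 0.9, 0.3, 0.8],
                               [0.5, 0.6, 0.4, 0.7])
print(clear, noisy)
```

With only a handful of rollouts per solver, a bootstrap interval is crude, so a sketch like this mainly shows where repeated grading or more rollouts would plug in.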

Practical Applications

Immediate Applications

Below are specific, deployable use cases that draw directly from the paper’s Autodata framework, Agentic Self-Instruct method, and meta-optimization results. Each item notes target sector(s), a concrete tool/workflow that could be stood up now, and key dependencies/assumptions that affect feasibility.

Software/AI and MLOps

Synthetic curriculum builder for small/medium models
- What: Use a strong solver + judge to generate difficulty-calibrated training data (with rubrics) that systematically separates a “weak” target model from a stronger reference (weak-vs-strong loop).
- Sectors: Software/AI, ML platform providers
- Tools/workflows: “Autodata Trainer” service (Agentic Self-Instruct loop + GRPO or PPO reward), “Gap Controller” (acceptance criteria), “Weak-rollout variance monitor”
- Dependencies: Access to a strong LLM (teacher/judge), inference budget for multi-round loops, domain-grounded source documents, RL pipeline (e.g., GRPO), safeguards against reward hacking
Continuous benchmark and regression-suite generation (“Autobench”)
- What: Automatically produce new, document-grounded evaluation items that remain challenging as models improve, with rubric-based, reproducible grading
- Sectors: Software/AI, Model evaluation vendors, Enterprise AI QA
- Tools/workflows: “Autobench Generator” (judge-calibrated difficulty), rubric standardization and JSON schema validator (to avoid parsing errors found in meta-optimization)
- Dependencies: Reliable judge model, curated grounding corpora, stable storage of evaluation provenance and rubrics, guardrails for context leakage
Data quality copilot for RLHF/RLAIF pipelines
- What: Insert the agentic data scientist loop as a pre-RL data curator, improving reward signal quality (e.g., raising weak-rollout variance when too low; narrowing/widening gaps as needed)
- Sectors: AI labs, LLM ops teams
- Tools/workflows: “Loop Judge” for GRPO-suitability tagging (high/medium/low), targeted challenger feedback templates, acceptance dashboards
- Dependencies: Reward-model availability (or rubric judging), telemetry to monitor weak/strong gaps and rollout variance, budget for multiple loop rounds
Prompt-harness meta-optimizer for autonomous data generation
- What: Apply the paper’s outer-loop optimization to evolve data-scientist prompts and guardrails, raising validated pass-rates without manual prompt engineering
- Sectors: MLOps, AI tooling vendors
- Tools/workflows: “Harness Evolver” (code-diff prompt evolution, validation-gated acceptance), failure-mode analyzers (e.g., context-leak detector), population-based search
- Dependencies: Validation sets and pass/fail criteria, reproducibility tracking, concurrency-safe prompt versioning
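The rubric validator mentioned under "Autobench" above, combined with the weight capping noted in the meta-optimization results, could look roughly like the following. The rubric layout (a JSON array of objects with `criterion` and `weight` keys), the cap of 5, and the `load_rubric` function are assumptions; the source does not specify the schema or the cap.

```python
import json

MAX_WEIGHT = 5  # illustrative cap; the source does not give the actual value


def load_rubric(raw, max_weight=MAX_WEIGHT):
    """Parse a JSON rubric, reject malformed items, and cap each weight."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"rubric is not valid JSON: {exc}") from exc
    if not isinstance(items, list) or not items:
        raise ValueError("rubric must be a non-empty JSON array")

    cleaned = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {i}: expected a JSON object")
        criterion, weight = item.get("criterion"), item.get("weight")
        if not isinstance(criterion, str) or not criterion.strip():
            raise ValueError(f"item {i}: missing 'criterion' text")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (isinstance(weight, bool)
                or not isinstance(weight, (int, float)) or weight <= 0):
            raise ValueError(f"item {i}: 'weight' must be a positive number")
        cleaned.append({"criterion": criterion.strip(),
                        "weight": min(weight, max_weight)})
    return cleaned


raw = '[{"criterion": "Identifies the ablation result", "weight": 10}]'
print(load_rubric(raw))
```

Failing fast on malformed rubrics keeps parsing errors from silently turning into zero rewards, and the cap stops a single criterion from dominating the score.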

Legal Tech and Compliance

Document-grounded legal QA generation with rubrics (training + eval)
- What: Produce realistic, rubric-graded legal questions from public opinions/regulations that are “learnable” for target models (reshape weak-rollout distributions as shown in PRBench experiments)
- Sectors: Legal tech, In-house counsel tools, Compliance
- Tools/workflows: Extractor agents (facts/holdings synthesis), “Loop Judge” tuned for GRPO-suitability, PRBench-compatible graders
- Dependencies: Licensed/approved corpora (jurisdictional coverage), stringent context-leak checks, expert review for safety/policy alignment, record-keeping for auditability
Continuous regulatory-change test suites
- What: Generate evolving test sets from updated statutes/guidance to validate enterprise legal assistants’ adherence to latest rules
- Sectors: Finance, Healthcare, Insurance, RegTech
- Tools/workflows: Nightly retrieval + agentic generation, risk-weighted rubrics, drift monitors
- Dependencies: Access to up-to-date legal sources, human legal review for high-stakes use, risk triage workflows

Education and EdTech

Adaptive practice generator with rubric-based grading
- What: Personalized exercises calibrated to a student’s current ability (weak solver) with a strong solver guaranteeing correctness and a judge grading via rubrics
- Sectors: Education, Corporate L&D, Test prep
- Tools/workflows: “Difficulty Calibrator” (gap and variance targets), per-learner capability profiling, rubric-aligned feedback
- Dependencies: Pedagogical review of rubrics, safeguards against hallucinated facts, content licensing when grounded on textbooks/papers
Automated exam/test-bank authoring from source texts
- What: Generate balanced, diverse question sets with verified rubrics from course materials/papers; enforce paper-specific insight tests and context leak prevention
- Sectors: Higher ed, MOOCs, Certification providers
- Tools/workflows: “Context Leak Guard,” structured JSON rubric schema, per-topic coverage analyzers
- Dependencies: Instructor approval workflows, alignment with syllabus/standards, bias/content quality checks

Scientific R&D and Knowledge Work

Paper-comprehension QA and review aids
- What: Create paper-specific Q&A with rubrics to train/evaluate assistants that summarize, critique, or cross-check claims (shown effective for CS research tasks)
- Sectors: Academia, Industrial R&D, Publishing
- Tools/workflows: Agentic Self-Instruct over literature repositories, citation-verification checks, numeric/ablation detail targeting
- Dependencies: Access to full texts/metadata, IP/licensing, judge reliability for open-ended rubrics
Advanced math/science problem generation for model training
- What: Generate harder, verifiable problems to improve reasoning beyond CoT-generated datasets (as with Principia-style tasks)
- Sectors: Scientific computing, EdTech (STEM), AI research
- Tools/workflows: Verifier modules (symbolic/math tools), pass@k analytics, truncation-rate monitors
- Dependencies: Strong solver competence in target domains, verifiability or robust rubric grading, enough inference budget for multi-round exploration

Enterprise Knowledge Management

Internal, document-grounded QA dataset creation
- What: Build private, rubric-scored training/eval sets from enterprise docs to improve internal assistants’ accuracy and provenance
- Sectors: Any enterprise with large document stores
- Tools/workflows: Retrieval-grounded challenger prompts, redaction/PII filters, acceptance criteria tuned for business KPIs
- Dependencies: Data governance approvals, privacy/security constraints, storage of provenance and rubrics for audit

Safety, Red Teaming, and Policy

Self-challenging adversarial evaluation suites
- What: Use the challenger + strong solver + judge to surface failure cases and calibrate their difficulty so they’re neither trivial nor impossible
- Sectors: Safety evaluation, Security, Policy compliance
- Tools/workflows: “Self-Challenge” harness, tool-augmented strong solvers, per-category coverage dashboards
- Dependencies: Governance for risky content, guardrails preventing prompt injection and reward hacking, periodic human review

Finance and Healthcare (low-risk scopes first)

Policy/Guideline-grounded QA training and checks
- What: Generate rubrics and questions from financial regulations or clinical guidelines to train assistants on compliant summaries and procedures (non-diagnostic for healthcare)
- Sectors: Finance Ops, Healthcare Ops, Quality Assurance
- Tools/workflows: Domain-specific extractors, rubric standards aligned to SOPs, “accept/improve” loop judge for GRPO-suitability
- Dependencies: Regulatory and privacy constraints (HIPAA/GDPR), human-in-the-loop for high-stakes outputs, provenance tracking of sources

Daily Life

Personalized study, interview prep, and skill practice
- What: Difficulty-calibrated practice questions with rubric-based feedback (coding katas, bar exam prep, case interviews)
- Sectors: Consumer apps, Career services
- Tools/workflows: “Practice Loop” app (weak/strong/judge roles abstracted), session-level difficulty targets
- Dependencies: Access to a capable base model (cloud or local), content licensing for grounded materials, accuracy safeguards

Long-Term Applications

These opportunities extend the paper’s approach to new modalities, larger scales, regulated contexts, or self-improving ecosystems; they will likely need further research, scaling, or standardization.

Cross-Modal and Embodied AI

Agentic synthetic data for robotics and multimodal models
- What: Generate curricula and evaluation tasks for perception/manipulation with verifiable success metrics (sim-to-real)
- Sectors: Robotics, Autonomous systems
- Tools/workflows: Simulator-integrated judge/verifier, tool-augmented strong solvers, sensor-grounded rubrics
- Dependencies: High-fidelity simulators, reliable verifiers beyond text-only LLMs, safety and real-world validation

Continuous, Self-Evolving Benchmarks and AutodataOps

Organization-wide “AutodataOps” platforms
- What: Always-on pipelines that auto-generate, evaluate, and version high-quality training/eval data; meta-optimized data-scientist agents maintain targets as models evolve
- Sectors: AI-first enterprises, Model providers
- Tools/workflows: Population-based harness evolution, lineage/provenance registries, drift detection and automatic difficulty rebalancing
- Dependencies: Significant inference budgets, governance for synthetic data provenance, integration with CI/CD for ML

Regulatory-Grade Evaluation and Audit

Standardized, regulator-accepted synthetic evaluation suites
- What: Benchmarks whose rubrics and provenance meet audit standards for compliance certification
- Sectors: Finance, Healthcare, Government
- Tools/workflows: Audit trails, data cards and rubrics with versioning, cross-grader validation (multiple judges)
- Dependencies: Policy buy-in, third-party certification processes, handling of bias/fairness/equity metrics
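
One way to picture the "cross-grader validation (multiple judges)" workflow above is a pairwise agreement check over per-criterion rubric verdicts. The following is a minimal sketch, not a method from the paper: the function name `pairwise_agreement`, the judge names, and the boolean pass/fail verdict format are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_agreement(judge_verdicts):
    """Fraction of rubric criteria on which each pair of judges agrees.

    judge_verdicts maps a (hypothetical) judge name to a list of boolean
    pass/fail verdicts, one per rubric criterion, all for the same response.
    """
    names = sorted(judge_verdicts)
    lengths = {len(judge_verdicts[name]) for name in names}
    if len(names) < 2:
        raise ValueError("need at least two judges to compare")
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("judges must score the same non-empty set of criteria")

    rates = {}
    for a, b in combinations(names, 2):
        matches = sum(x == y for x, y in zip(judge_verdicts[a], judge_verdicts[b]))
        rates[(a, b)] = matches / len(judge_verdicts[a])
    return rates


# Illustrative verdicts only; these are not results from the paper.
verdicts = {
    "judge_a": [True, True, False, True],
    "judge_b": [True, False, False, True],
    "judge_c": [True, True, False, False],
}
agreement = pairwise_agreement(verdicts)
```

Low agreement on a given rubric could flag it for human review before it enters an audited suite. Raw agreement ignores chance agreement, so a chance-corrected statistic may be preferable in practice.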

Scalable Safety and Oversight

Dynamic safety curricula and scalable oversight
- What: Agentic generation of nuanced safety and misuse cases, with meta-optimized guardrails and reward models
- Sectors: AI safety, Trust & Safety
- Tools/workflows: Safety-specific loop judges, red-team marketplaces, continuous “oversight-as-data” services
- Dependencies: Stronger judges than today’s LLMs for safety domains, human governance, robust tool-use and sandboxing

Low-Resource and Multilingual Expansion

High-quality synthetic corpora for underserved languages/domains
- What: Grounded, agentic data generation with bilingual judges/solvers to bootstrap alignment and reasoning
- Sectors: Global education, Public-interest AI
- Tools/workflows: Cross-lingual evaluators, cultural-context rubrics, domain-specific grounding
- Dependencies: High-quality bilingual source corpora, multilingual strong solvers, community review

Scientific Discovery Assistants

From paper QA to hypothesis generation and experiment planning
- What: Extend the loop to propose, evaluate, and refine testable hypotheses and experimental protocols
- Sectors: Pharma, Materials, Basic science
- Tools/workflows: Integration with lab simulators/ELNs, verifiers tied to known constraints, uncertainty-aware judges
- Dependencies: Reliable domain tools beyond LLMs, safety/ethics oversight, reproducibility standards

Personalization at the Edge

On-device tutors and copilots with continual, difficulty-calibrated data generation
- What: Local “weak” models that are continuously improved via small, on-device agentic loops
- Sectors: Consumer devices, Education
- Tools/workflows: Distilled strong solvers as teacher snapshots, efficient judges, compute-aware loop policies
- Dependencies: On-device acceleration, privacy-preserving evaluation, energy constraints

Market and Ecosystem Evolution

Data-as-Compute services and judge/solver marketplaces
- What: Providers selling difficulty-calibrated data, judge APIs, and meta-optimized harnesses as commodities
- Sectors: Cloud AI, Data vendors
- Tools/workflows: SLAs around grading reliability, standardized rubric schemas, cost/performance dashboards
- Dependencies: Interoperability standards, transparent metrics, competition regulation

Notes on Common Assumptions/Dependencies Across Applications

Access to a strong solver (teacher) and a reliable judge is pivotal; results depend on their capability and calibration.
Inference budgets must cover multi-round agentic loops; cost control and variance reduction techniques are important.
Grounding sources must be licensed and appropriate; privacy/PII and context-leak prevention are essential in regulated domains.
Acceptance criteria (gap/variance thresholds, GRPO-suitability) should be tuned per domain; poor criteria can produce unhelpful data.
Guardrails against reward hacking, prompt injection, and rubric mis-specification are necessary; structured rubric formats help.
Human-in-the-loop review is advised in high-stakes domains (legal, medical, safety) until judges are demonstrably robust.
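
The gap/variance thresholds mentioned above can be sketched as a simple acceptance check. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function name `accept_example` and the threshold values `min_gap` and `min_weak_std` are hypothetical. The minimum-spread condition reflects the fact that GRPO computes advantages relative to the group mean, so a prompt on which every rollout earns the same reward yields no learning signal.

```python
from statistics import mean, pstdev


def accept_example(weak_scores, strong_scores, min_gap=0.3, min_weak_std=0.05):
    """Decide whether a generated example passes a gap/variance check.

    weak_scores / strong_scores: per-rollout scores in [0, 1] from the weak
    and strong solvers on the same prompt. Thresholds are hypothetical
    defaults and would need tuning per domain.
    """
    gap = mean(strong_scores) - mean(weak_scores)   # strong minus weak
    weak_std = pstdev(weak_scores)                  # weak rollout std
    accepted = gap >= min_gap and weak_std >= min_weak_std
    return accepted, {"gap": gap, "weak_std": weak_std}


# Illustrative scores: 5 weak rollouts and 3 strong rollouts on one prompt.
ok, stats = accept_example([0.2, 0.4, 0.2, 0.4, 0.3], [0.8, 0.9, 0.7])

# A prompt the weak solver always fails has a large gap but zero spread,
# so this sketch rejects it.
ok_flat, stats_flat = accept_example([0.0] * 5, [1.0] * 3)
```

Whether zero-variance prompts should be rejected outright or sent back for revision is a design choice. The loop judge described in the glossary weighs several factors rather than applying fixed cutoffs.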

Glossary

Acceptance criterion: The explicit rule used to decide whether a generated example is accepted by the agentic loop. Example: "We therefore define the acceptance criterion of the agentic loop directly in terms of this gap:"
Agentic data creation: Using autonomous AI agents to iteratively generate, evaluate, and refine datasets. Example: "Autodata, via agentic data creation, provides a way to convert increased inference compute into higher quality model training"
Agentic Self-Instruct: A specific Autodata implementation that orchestrates challenger, solver, and judge agents to synthesize targeted training data. Example: "which we call Agentic Self-Instruct, depicted in \autoref{fig:fig2}"
Aggregation: Combining multiple model outputs (e.g., via voting or ensembling) to improve reliability. Example: "scaffolding or aggregation \citep{zhao2025majority}"
Autodata: The proposed framework where an agent acts as a data scientist to create and improve training/evaluation data. Example: "We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data."
Autoresearch: A methodology for automatically exploring and optimizing research pipelines (here, for agent optimization). Example: "using autoresearch \citep{karpathy_autoresearch_2026}"
Boltzmann sampling: A probabilistic selection method where each choice is sampled with probability proportional to its exponentiated, temperature-scaled score. Example: "Select a parent from the population via Boltzmann sampling, where candidate $c$ is chosen with probability proportional to $\exp(score_c / T)$"
Chain-of-Thought reasoning: Structured, step-by-step reasoning traces used during generation or solving. Example: "use Chain-of-Thought reasoning during the generation process"
Challenger: A subagent that generates candidate tasks or examples given instructions from the main agent. Example: "Challenger, which creates training examples given a detailed prompt from the main agent,"
Context leakage: When provided context inadvertently reveals answers or evaluation cues. Example: "A quality verifier then checks for context leakage, rubric coverage, and question quality"
CoT Self-Instruct: A Self-Instruct variant that leverages Chain-of-Thought to synthesize more complex data. Example: "CoT Self-Instruct \citep{yu2025cot} extended that to use Chain-of-Thought reasoning"
Gap (strong − weak): The performance difference between strong and weak solvers on the same tasks. Example: "Gap (strong $-$ weak)"
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm used for fine-tuning LLMs. Example: "We train Qwen3.5-4B with GRPO"
Grounded Self-Instruct: A method that grounds generation on external sources (e.g., documents) to reduce hallucinations. Example: "Grounded Self-Instruct \citep{lupidi2024source2synth,yuan2025naturalreasoning} extended that to ground on documents"
Grounding: Anchoring generation to specific external data sources (papers, legal documents, etc.). Example: "The autodata agent grounds on some provided data"
In-context learning: Teaching a model via examples and instructions included directly in the prompt. Example: "the ability to use in-context learning and instruction following"
Inner loop: The iterative data creation and analysis cycle whose outcomes define the quality criteria. Example: "using the same inner loop criteria (creating better data) to guide the optimization of the outer loop"
Loop judge: A decision-making component that accepts or rejects a round’s generation based on multi-factor analysis (e.g., variance, gap, quality). Example: "we instead adopt a more flexible {\em loop judge} to decide if a round's generation is accepted."
Majority vote: An aggregation scheme where the final answer is determined by the most common output among multiple runs/agents. Example: "require that majority vote over the strong solver is correct, while majority vote over the weak solver is wrong."
Meta-harness: A framework for meta-level optimization of agent scaffolds or prompts. Example: "meta-harness \citep{lee2026meta} style optimization"
Meta-optimization: Optimizing the agent itself (its prompts/strategy) using metrics from the inner data-generation loop. Example: "we thus apply meta-optimization to the data scientist agent itself,"
Outer loop: The meta-level process that tunes the agent’s strategy based on inner-loop performance. Example: "to guide the optimization of the outer loop (the agent optimization itself)."
Out-of-distribution: Evaluation on data that differ from the training distribution to measure generalization. Example: "as well as the out-of-distribution Principia benchmark."
Pass@8: The probability of obtaining at least one correct solution in 8 attempts (rollouts). Example: "pass@8"
Pile of Law: A large corpus of public legal texts used as grounding data. Example: "Pile of Law~\citep{henderson2022pile}"
PRBench-Legal: A benchmark split for evaluating legal reasoning performance. Example: "evaluate on PRBench-Legal and the PRBench-Legal-Hard subset"
Reward model: A grader model that assigns rewards to outputs for reinforcement learning. Example: "using Kimi-K2.6 as the reward model to score responses against the generated rubrics."
Rollout: A single sampled attempt by a model on a prompt; several rollouts per prompt are used to assess performance and variance. Example: "Each candidate is rolled out by the weak solver 5 times and the strong solver 3 times."
Rubric: Structured, often weighted, evaluation criteria used by a judge to score responses. Example: "the judge scores their answers against the rubric on a per-criterion basis."
S2ORC: The Semantic Scholar Open Research Corpus, a large dataset of academic papers. Example: "We process over 10k CS papers from the S2ORC corpus (2022+)"
Scaffolding: Procedures that structure or extend inference (e.g., tools, multi-step plans) to improve model performance. Example: "scaffolding or aggregation \citep{zhao2025majority}"
Self-challenging: A paradigm where an agent proposes increasingly difficult tasks, often with tool use, to stress-test models. Example: "so called ``self-challenging'' methods \citep{zhou2025self}"
Self-Instruct: A method where models bootstrap synthetic instruction data from seed prompts. Example: "Self-Instruct \citep{wang2023self} emerged as a method to create synthetic data"
Strong solver: A higher-capability model (or mode) used to validate correctness and set difficulty. Example: "``Strong'' solver that is expected to generally succeed at the created training data,"
Token budget: The maximum token limit available for reasoning/generation within a run. Example: "with a 65,536 token budget"
Truncation rates: The proportion of outputs cut off due to exceeding the token limit. Example: "reduces reasoning truncation rates (from 23.75\% to 4.09\%)"
Verifiable tasks: Tasks whose answers can be automatically checked for correctness. Example: "For verifiable tasks (using an LLM-based verifier)"
Verifier: An agent that checks correctness/quality and provides feedback to guide data generation. Example: "Verifier/judge that given the example and a model solution, checks its quality,"
Weak rollout standard deviation: The variability (standard deviation) of scores across multiple weak-solver attempts on a prompt. Example: "Weak rollout std"
Weak solver: The target model to improve; tasks are tailored so it initially struggles. Example: "``Weak'' solver that is expected to generally struggle to solve the created training data;"
Weighted grading rubric: A rubric whose criteria have weights to reflect their relative importance in scoring. Example: "a weighted grading rubric"
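
The Boltzmann sampling entry above, where candidate $c$ is chosen with probability proportional to $\exp(score_c / T)$, can be sketched in a few lines. This is a minimal illustration rather than the paper's code: the function names, the candidate labels, and the example scores are assumptions. Subtracting the maximum score before exponentiating leaves the probabilities unchanged and only guards against overflow.

```python
import math
import random


def boltzmann_probabilities(scores, temperature):
    """Return selection probabilities proportional to exp(score / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)
    # Shifting by the max score cancels in the normalization below.
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_parent(population, scores, temperature, rng=None):
    """Sample one candidate from the population via Boltzmann sampling."""
    rng = rng or random.Random()
    probs = boltzmann_probabilities(scores, temperature)
    return rng.choices(population, weights=probs, k=1)[0]


# Hypothetical population of agent variants with illustrative scores.
population = ["harness_a", "harness_b", "harness_c"]
scores = [0.40, 0.55, 0.70]

sharp = boltzmann_probabilities(scores, temperature=0.05)  # near-greedy
flat = boltzmann_probabilities(scores, temperature=5.0)    # near-uniform
parent = sample_parent(population, scores, temperature=0.1, rng=random.Random(0))
```

A low temperature concentrates selection on the best-scoring candidate, while a high temperature approaches uniform sampling, which keeps weaker candidates in play for exploration.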
