LoongBench: Scalable Synthetic Data for RL
- LoongBench is a modular benchmark comprising 8,729 human-vetted QA–code triples across 12 diverse domains for reinforcement learning.
- It employs Few-Shot, Self-Instruct, and Evol-Instruct strategies to generate synthetic questions with measurable executability and semantic consistency.
- Its RLVR integration rewards verified long-chain-of-thought reasoning, driving scalable alignment of large language models.
LoongEnv is a modular, domain-agnostic environment for scalable synthetic data generation and automatic verification across diverse, reasoning-intensive domains. It forms a core component of the Loong Project’s open-source framework for reinforcement learning (RL) with verifiable reward, aimed at advancing long-chain-of-thought (CoT) capabilities in LLMs (Huang et al., 3 Sep 2025). LoongEnv supports the autonomous creation of question–answer–code triples, enabling extensive RL alignment for mathematical, scientific, and logical reasoning benchmarks.
1. System Architecture and Data Flow
LoongEnv is architected as a multi-agent, modular pipeline, structurally decoupled from the downstream LLM agent and designed to plug into RL loops as an “environment.” The overall agent–environment cycle is structured as follows:
- Seed Corpus: A curated dataset (LoongBench; 8,729 human-vetted QA–code triples spanning 12 domains) is ingested as the initial substrate.
- Synthetic Generation: Based on seed samples, LoongEnv generates new questions using various prompting strategies (Few-Shot, Self-Instruct, Evol-Instruct).
- Coder Agent: Constructs executable code to answer the new question; code is run to yield a synthetic answer.
- Verifiers: A judge module checks code executability and semantic equivalence between the code-produced answer and the LLM-generated answer.
- Reward Signal: RL loop rewards the LLM agent for generating a valid CoT answer matching the verifiable answer from the code execution.
The pipeline supports dynamic scaling, pluggable question synthesis techniques, domain-specific verification, and interaction with the RL agent for feedback-based sample selection.
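The data flow above can be summarized in code. The following Python sketch is purely illustrative: the function names (`generate_question`, `coder_agent`, `run_sandboxed`, `env_step`) and prompt wording are assumptions for exposition and do not correspond to the actual camel-ai/loong API.

```python
# Illustrative sketch of the LoongEnv agent–environment cycle (hypothetical
# names, not the camel-ai/loong API): seed -> question -> code -> answer -> verify -> reward.
import random

def generate_question(seed_examples, strategy, llm):
    """Prompt a generator LLM to synthesize a new question from seed triples."""
    demos = "\n".join(s["question"] for s in seed_examples)
    return llm(f"Strategy: {strategy}\nSeed questions:\n{demos}\n"
               "Write one new question in the same domain:")

def coder_agent(question, llm):
    """Ask a coder LLM for Python code that stores its result in `answer`."""
    return llm("Write Python code that stores the answer to this question "
               f"in a variable named `answer`:\n{question}")

def run_sandboxed(code):
    """Execute generated code and return (ok, answer). A real system isolates this."""
    scope = {}
    try:
        exec(code, scope)                    # assumes the code assigns `answer`
        return True, scope.get("answer")
    except Exception:
        return False, None

def env_step(seed_corpus, strategy, llm, judge, agent):
    """One pass through the pipeline: synthesize, verify, and reward the RL agent."""
    seeds = random.sample(seed_corpus, k=3)
    question = generate_question(seeds, strategy, llm)
    code = coder_agent(question, llm)
    ok, verified_answer = run_sandboxed(code)
    if not ok:
        return None                          # non-executable samples are discarded
    cot, agent_answer = agent(question)      # RL agent emits a CoT and a final answer
    reward = 1.0 if judge(agent_answer, verified_answer) else 0.0
    return {"question": question, "code": code, "answer": verified_answer,
            "cot": cot, "reward": reward}
```

In practice the generator, coder, and judge roles are separate agents, which is what makes the pipeline multi-agent and modular.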
2. Prompting and Generation Strategies
LoongEnv implements three primary automated data-generation paradigms:
- Few-Shot Prompting: Human seed examples are used as in-context demonstrations for synthesizing further problem instances with similar structure. This approach yields high correctness and executability (e.g., Physics: 93.9% pass rate, Logic: 92.6% pass rate).
- Self-Instruct: Recursively prompts instruction-tuned models to enhance novelty and diversity beyond surface-level copying. While this increases the semantic range (e.g., lower average embedding similarity to seeds), it sometimes decreases executability (higher judge rejection rates).
- Evol-Instruct: Iteratively mutates existing questions via targeted generalization, specification, or complexity scaling. This approach is optimized for generating high-difficulty, structurally complex problems, commonly resulting in richer reasoning chains but increased execution failure rates (e.g., Logic: 55% not executable, Physics: 14% not executable).
Together, these strategies provide both wide semantic coverage and control over difficulty, forming the basis for large-scale synthetic reasoning data creation; illustrative prompt templates for each strategy are sketched below.
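The templates below show one possible way the three strategies could differ at the prompt level; the wording is an assumption for illustration, not the prompts actually used by LoongEnv.

```python
# Illustrative prompt templates for the three generation strategies
# (simplified; not the actual LoongEnv prompts).

def few_shot_prompt(seeds):
    # Show seed question–code pairs verbatim and ask for one more of the same kind.
    demos = "\n\n".join(f"Q: {s['question']}\nCode:\n{s['code']}" for s in seeds)
    return f"{demos}\n\nWrite one more question of the same kind.\nQ:"

def self_instruct_prompt(seeds):
    # Summarize existing topics and explicitly request a novel task,
    # trading executability for diversity.
    topics = "; ".join(s["question"][:80] for s in seeds)
    return (f"Existing task topics: {topics}\n"
            "Invent a genuinely new task in the same domain. "
            "Do not copy the wording or structure of the examples.")

def evol_instruct_prompt(seed, operation="increase_complexity"):
    # Mutate a single question along a chosen axis to raise difficulty.
    ops = {
        "increase_complexity": "Add an extra constraint or reasoning step.",
        "generalize": "Rewrite it so it covers a broader class of cases.",
        "specify": "Add concrete numbers or conditions that narrow it down.",
    }
    return f"Original question: {seed['question']}\nMutate it as follows: {ops[operation]}"
```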
3. Automatic Verification and Quality Control
LoongEnv rigorously filters synthetic outputs through multi-stage, automated judging:
- Executability: All generated code is run in a sandbox; only samples that execute without error are retained.
- Semantic Verification: A judge agent (possibly domain-specialized) assesses if the code and generated answer semantically solve the original question; malformed or unverifiable problems are rejected.
- Empirical Outcomes: Few-shot samples exhibit the highest execution and verification rates; self-instruct and evol-instruct methods yield a tradeoff, increasing diversity and difficulty at the expense of reliability.
This process results in a dataset enriched in high-quality, verifiable question–answer–code triples that are suitable for RL.
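A minimal sketch of this two-stage filter is shown below, reusing the `run_sandboxed` helper from the earlier pipeline sketch; the judge prompt and the `judge_llm` interface are assumptions, not the project's actual implementation.

```python
# Two-stage quality filter (sketch): stage 1 keeps only executable code,
# stage 2 keeps only samples the judge deems semantically valid.
# `judge_llm` is a hypothetical callable that returns free-form text.

def passes_filters(sample, judge_llm):
    ok, answer = run_sandboxed(sample["code"])          # stage 1: executability
    if not ok:
        return False
    verdict = judge_llm(
        f"Question:\n{sample['question']}\n\n"
        f"Answer produced by executing code: {answer}\n"
        "Does this answer correctly and completely solve the question? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")    # stage 2: semantic verification

def filter_synthetic_batch(samples, judge_llm):
    return [s for s in samples if passes_filters(s, judge_llm)]
```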
4. Agent–Environment Loop for Reinforcement Learning
LoongEnv is designed as an RL-ready environment, supporting alignment and skill acquisition for reasoning LLMs via the RLVR (RL with verifiable reward) paradigm:
- Agent Action: The LLM agent receives the synthetic question and produces a chain-of-thought and final answer.
- Environment Response: The generated answer is semantically compared to the code-executed answer from LoongEnv.
- Reward Assignment: Only answers verified as correct (by the judge) receive a reward signal. Formally,

$$
R(q, a_{\mathrm{LLM}}) =
\begin{cases}
1, & \text{if } \mathcal{J}\big(a_{\mathrm{LLM}}, a_{\mathrm{code}}\big) = \text{equivalent} \\
0, & \text{otherwise,}
\end{cases}
$$

where $\mathcal{J}$ denotes the judge's equivalence verdict between the agent's answer $a_{\mathrm{LLM}}$ and the answer $a_{\mathrm{code}}$ obtained by executing the generated code.
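In code, this reward can be expressed as follows; the exact-match and numeric fallbacks before the judge call are an assumption added for illustration, not a specification from the paper.

```python
# Binary verifiable reward (sketch): 1.0 iff the agent's final answer matches the
# code-verified answer. Cheap exact/numeric checks run first; an LLM judge
# (hypothetical `equivalence_judge` callable) handles free-form answers.

def rlvr_reward(agent_answer, verified_answer, equivalence_judge=None):
    if str(agent_answer).strip() == str(verified_answer).strip():
        return 1.0
    try:
        if abs(float(agent_answer) - float(verified_answer)) < 1e-6:
            return 1.0
    except (TypeError, ValueError):
        pass
    if equivalence_judge is not None:
        return 1.0 if equivalence_judge(agent_answer, verified_answer) else 0.0
    return 0.0
```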
This architecture allows scalable RL alignment in previously under-resourced or high-supervision-requirement domains.
5. Domain Coverage, Diversity, and Difficulty
LoongEnv leverages the LoongBench seed set and supports extension across 12 diverse domains: advanced mathematics, logic, chemistry, programming, physics, finance, medicine, security, board games, and others.
Empirical Findings
- Correctness: Few-Shot generation has the highest pass rate; Evol-Instruct produces more execution failures but enhances coverage of “hard” and edge-case reasoning.
- Diversity: Measured by pairwise embedding similarity and cluster analysis (e.g., t-SNE projection), Self-Instruct and Evol-Instruct variants yield greater semantic drift from seeds; Evol-Instruct maintains high structural similarity but increases solution complexity.
- Difficulty: Accuracy of SOTA models (e.g., GPT-4.1-mini, DeepSeek-R1) drops substantially on Evol-Instruct data, quantifying its added challenge (e.g., GPT-4.1-mini: 62.0% on Evol-Instruct logic vs. 92.0% on Few-Shot logic).
Accuracy by generation strategy (logic domain):

| Model | Few-Shot | Self-Instruct | Evol-Instruct | Seed Dataset |
|---|---|---|---|---|
| GPT-4.1-mini | 92.0% | 83.0% | 62.0% | 71.8% |
| DeepSeek-R1 | 93.2% | 87.4% | 70.3% | 77.4% |
This gradient in difficulty and diversity is systematically controlled through LoongEnv’s design, supporting targeted curriculum construction for RL.
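The diversity measurement referenced above (pairwise embedding similarity among generated questions and relative to the seed set) can be sketched as follows; `embed` stands in for any sentence-embedding model and is an assumption, not a specific component of the benchmark.

```python
# Diversity metrics (sketch): lower average cosine similarity indicates greater
# semantic drift. `embed(text) -> np.ndarray` is a hypothetical embedding callable.
import numpy as np

def mean_pairwise_similarity(questions, embed):
    """Average cosine similarity among generated questions (diagonal excluded)."""
    vecs = np.stack([embed(q) for q in questions])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(questions)
    return float((sims.sum() - n) / (n * (n - 1)))

def mean_similarity_to_seeds(questions, seeds, embed):
    """Average cosine similarity between generated questions and seed questions."""
    q = np.stack([embed(x) for x in questions])
    s = np.stack([embed(x) for x in seeds])
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    return float((q @ s.T).mean())
```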
6. Innovations and Distinguishing Characteristics
Key innovations of LoongEnv, relative to prior synthetic data generators (especially those limited to mathematics or programming), include:
- Executable Reasoning: Always produces executable code along with text answers, enabling rigorous, automated verification.
- Multi-domain and Multi-agent Support: Expands scalable synthetic data creation beyond traditional domains, with plug-in generation and verification modules.
- Evolutionary and Recursive Prompting: Introduces structures for controlled, progressive escalation in question difficulty and novelty.
- RLVR Integration: Native support for reinforcement learning via verifiable rewards, reducing annotation bottleneck and improving reasoning alignment in complex problem spaces.
7. Implications and Future Directions
LoongEnv establishes a systematic, automated workflow for high-fidelity reasoning data generation across scientific and logical domains. Its design permits:
- Automated scaling of RL alignment for LLMs in fields where verifiable data are scarce and manual annotation costs are prohibitive.
- Fine-grained curricular control over training samples, including reliability–difficulty tradeoffs.
- Expansion to novel, under-explored domains via new seed corpora and verifier plugins.
A plausible implication is that LoongEnv workflows, when embedded in RL pipelines, may shift the development paradigm for LLM reasoning—moving from reliance on narrow human-labeled datasets towards scalable, domain-extensible, and verification-driven alignment approaches. However, challenges persist regarding the reliability of advanced synthesis strategies (notably Evol-Instruct) and the need for further improvement in judge and code execution robustness.
LoongEnv’s implementation, seed corpora, benchmarking scripts, and documentation are provided at https://github.com/camel-ai/loong (Huang et al., 3 Sep 2025), enabling empirical validation and extension by the research community.