UltraLogic Data Framework
- UltraLogic Data Framework is a scalable system for generating high-quality, difficulty-calibrated reasoning tasks by decoupling logical cores from natural language.
- It employs a code-based approach with a graded bipolar float reward mechanism to overcome sparse rewards in reinforcement learning and encourage perfect logical chains.
- The framework integrates diverse seed tasks, an automated data synthesis pipeline, and precise difficulty calibration to systematically improve LLM training outcomes.
UltraLogic is a data framework for large-scale, high-quality, and difficulty-calibrated reasoning task generation, designed to systematically enhance LLM reasoning. It achieves this by decoupling the logical core of problems from their natural language representations using a code-based solving methodology, and by introducing a graded bipolar float reward (BFR) mechanism that reshapes the RL reward landscape to more effectively guide multi-step logical reasoning. UltraLogic comprises hundreds of unique seed tasks across diverse reasoning domains, an automated data synthesis pipeline, a ten-level difficulty calibration mechanism, and rigorous reward shaping for reinforcement learning with verifiable rewards (RLVR) (Liu et al., 6 Jan 2026).
1. Architectural Principles and Goals
UltraLogic directly addresses the lack of complex, difficulty-calibrated data for general-purpose LLM reasoning and the inherent limitations of sparse, binary reward signals in RLVR. The primary architectural objectives are:
- Provision of large-scale, high-quality data across a broad range of reasoning tasks, each carefully calibrated for difficulty.
- Structural decoupling of logical task cores from linguistic instantiations to ensure reproducibility, verifiability, and prevent rote pattern learning.
- Introduction of a bipolar reward signal to sharply penalize sub-perfect reasoning chains, mitigating gradient sparsity and sub-optimal convergence.
The top-level framework is organized as follows:
| Component | Description | Purpose |
|---|---|---|
| Original Task Repository | Hundreds of unique seed tasks, classified by Task Domain, Reasoning Ability, and Difficulty Source | Diversity and orthogonality |
| Task Template Repository | Human-verified natural language templates in multiple languages (e.g., English, Chinese) | Prevent pattern overfitting |
| Data Synthesis Pipeline | Composed of Input Code, Solution Code, and Difficulty Control Module | Automated, scalable task instance generation |
| Bipolar Float Reward Layer | Reward reshaping for RLVR on top of synthesized tasks | Effective RL signal for logic optimization |
2. Code-Based Solving and Data Synthesis
Central to UltraLogic's methodology is the abstraction of logical complexity from natural language, formalized via code-based specifications for both input and solution generation. Each seed task is defined by two deterministic Python functions:
- `input(difficulty, language)` → core parameters (e.g., numerical variables, graph structures)
- `solution(params, language)` → ground-truth answer
The natural language component serves as a wrapper only, with templates filled programmatically. This orthogonality ensures that linguistic surface forms do not confound the logical core, allowing rigorous control of logical complexity and reproducibility. The workflow for data instance generation is as follows:
```
for each seed_task in OriginalTaskRepo:
    for difficulty in 1..10:
        for lang in {en, zh}:
            repeat until N_samples generated:
                params, nl_slots = input(difficulty, lang)
                answer = solution(params, lang)
                select a random template t from TemplateRepo[seed_task, lang]
                instance.text = fill_slots(t, nl_slots)
                instance.answer = answer
                instance.difficulty = difficulty
                save(instance)
```
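To make this contract concrete, the sketch below shows what one seed task's paired functions could look like for a hypothetical modular-arithmetic chain task; the function names, difficulty-to-parameter mapping, and English template are illustrative assumptions rather than the paper's actual code.

```python
import random

# Hypothetical seed task: evaluate a chain of modular additions.
# The names (input_fn, solution_fn) and the difficulty scaling are assumptions.
def input_fn(difficulty: int, language: str, rng: random.Random):
    chain_len = 2 + 2 * difficulty            # longer chains at higher levels
    modulus = 5 + 5 * difficulty              # larger value range at higher levels
    terms = [rng.randint(0, modulus - 1) for _ in range(chain_len)]
    params = {"terms": terms, "modulus": modulus}
    nl_slots = {"terms": ", ".join(map(str, terms)), "modulus": str(modulus)}
    return params, nl_slots

def solution_fn(params: dict, language: str) -> str:
    # Deterministic ground truth, computed from the logical core alone.
    return str(sum(params["terms"]) % params["modulus"])

# Minimal template fill, mirroring the pipeline's fill_slots step.
template_en = "Compute ({terms}) mod {modulus}. Give only the final number."
rng = random.Random(0)
params, slots = input_fn(difficulty=3, language="en", rng=rng)
instance = {
    "text": template_en.format(**slots),
    "answer": solution_fn(params, "en"),
    "difficulty": 3,
}
```

Because the answer is derived from `params` alone, swapping templates or languages changes only the surface form, never the verifiable label.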
3. Difficulty Calibration and Control
UltraLogic implements an automated, ten-level difficulty calibration pipeline. Each level $d \in \{1, \dots, 10\}$ is associated with an explicit target success probability $p_d^*$.
Calibration proceeds by generating validation instances and evaluating model accuracy $\hat{p}_d$. If $|\hat{p}_d - p_d^*| > \epsilon$, task parameters are adjusted (e.g., increasing the search space or constraints) and the loop repeats. The main algorithmic steps are as follows (a code sketch follows the list):
- Initialize task parameters for level $d$
- Loop:
  - Generate $N$ validation samples at the current parameters
  - Evaluate the model and compute empirical accuracy $\hat{p}_d$
  - If $|\hat{p}_d - p_d^*| > \epsilon$: adjust parameters and repeat; otherwise, calibration for level $d$ is complete
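A minimal sketch of this loop, assuming a tolerance-based stopping rule and caller-supplied evaluation and adjustment hooks; the tolerance, sample count, and iteration cap below are illustrative, not the paper's values.

```python
def calibrate_level(target_p, init_params, evaluate_model, adjust_params,
                    n_samples=200, tol=0.05, max_iters=20):
    """Tune task parameters until empirical accuracy matches the target success rate.

    evaluate_model(params, n_samples) -> empirical accuracy p_hat in [0, 1]
    adjust_params(params, gap)        -> new params (harder if gap > 0, easier if gap < 0)
    """
    params = init_params
    for _ in range(max_iters):
        p_hat = evaluate_model(params, n_samples)
        gap = p_hat - target_p
        if abs(gap) <= tol:
            break                      # |p_hat - p*| within tolerance: calibration complete
        params = adjust_params(params, gap)
    return params
```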
This ensures a stable, reproducible mapping of difficulty levels to empirical model performance, critical for curriculum strategies and controlled evaluation (Liu et al., 6 Jan 2026).
4. Bipolar Float Reward (BFR) for RLVR
UltraLogic introduces the Bipolar Float Reward (BFR) to address the gradient sparsity of binary (0/1) rewards and the “non-negative reward trap” associated with unipolar floats. For a model response with graded correctness score $s \in [0, 1]$ (e.g., accuracy, F1), only a fully correct response ($s = 1$) receives a positive reward; any imperfect answer with $s \in [0, 1)$ receives a strictly negative reward whose magnitude shrinks as $s$ approaches 1. This structure creates a “reward cliff” at $s = 1$ and ensures that only logically perfect solutions are positively reinforced, while all suboptimal chains are penalized regardless of how close they come. In the reported RL setup (rollout size 16, maximum response length 32,768 tokens, together with the stated learning rate, sampling temperature, and Top-p settings), this mechanism eliminates reward plateaus and guides optimization toward global logical optima rather than “good-enough” local minima. A format bonus (+0.1) is optionally added to isolate logic evaluation from minor formatting errors (Liu et al., 6 Jan 2026).
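The paper's exact functional form is not reproduced above, but a reward function consistent with the stated properties (positive reward only for a perfect score, graded negative reward otherwise, optional +0.1 format bonus) can be sketched as follows; the constants +1.0 and score − 1.0, and the way the bonus combines with the penalty, are assumptions.

```python
def bipolar_float_reward(score: float, format_ok: bool = True) -> float:
    """Illustrative BFR shaping: only a perfect score earns positive reward;
    imperfect chains receive a strictly negative reward whose magnitude shrinks
    as the score approaches 1, producing the 'reward cliff' at score == 1.

    score: graded correctness in [0, 1] (e.g., accuracy or F1).
    The constants below are assumptions, not the paper's exact values.
    """
    base = 1.0 if score >= 1.0 else score - 1.0
    bonus = 0.1 if format_ok else 0.0   # optional format bonus (assumed additive)
    return base + bonus
```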
5. Seed Task Suite and Diversity
UltraLogic's task suite covers hundreds of seed tasks cross-classified along three axes:
- Task Domain: Symbolic Manipulation, Spatial Pathfinding, Classic Games, etc.
- Core Reasoning Ability: Constraint Satisfaction, Algorithmic Thinking, Information Extraction
- Difficulty Source: Large Search Space, Complex Rules, Tedious Steps
Example categories include:
- Truth-and-Lie Game (constraint satisfaction over logical statements)
- Maze Pathfinding (spatial reasoning, instruction following)
- Symbolic Seal Decoding (targeting intrinsic LLM weaknesses)
- Rectangle-painting on a grid (spatial geometry, combinatorial search)
- Causal Chain Extraction (multi-step planning, complex dependencies)
Programmatic expansion yields millions of diverse instances. Experimental ablations demonstrate that diversity—coverage of numerous orthogonal task types—contributes more to reasoning improvements than simple data scale (Liu et al., 6 Jan 2026).
| Dimension | Example Category | Characteristic |
|---|---|---|
| Domain | Maze Pathfinding | Spatial/Instructional |
| Core Ability | Symbolic Manipulation | Algorithmic Weaknesses |
| Difficulty Source | Rectangle-painting | Large Combinatorial Space |
6. Training Strategies and Experimental Evaluation
UltraLogic has been used to train dense LLMs such as Qwen3-8B and Qwen3-14B. Key strategies and findings include:
- Training uses GRPO (Group Relative Policy Optimization) with BFR or baseline rewards.
- Difficulty-matched curricula: three datasets (Easy: levels 1–4, Medium: 4–7, Hard: 7–10), each with 50 tasks × 10k samples; a split-construction sketch follows this list.
- Zone of Proximal Development: Qwen3-8B achieves its best results when trained on the Easy set, while Qwen3-14B performs best on the Medium set.
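A minimal sketch of assembling these difficulty-matched splits, assuming each synthesized instance is a dict carrying `task_id` and `difficulty` fields; the record layout and sampling details are assumptions.

```python
import random
from collections import defaultdict

# Bucket boundaries follow the levels stated above; Easy/Medium and
# Medium/Hard intentionally share a boundary level (4 and 7).
CURRICULA = {"easy": range(1, 5), "medium": range(4, 8), "hard": range(7, 11)}

def build_splits(instances, per_task=10_000, seed=0):
    """Group instances into Easy/Medium/Hard by difficulty level,
    capping each (split, task) pair at per_task samples."""
    rng = random.Random(seed)
    by_split_task = defaultdict(list)
    for inst in instances:
        for split, levels in CURRICULA.items():
            if inst["difficulty"] in levels:
                by_split_task[(split, inst["task_id"])].append(inst)
    splits = defaultdict(list)
    for (split, _task), items in by_split_task.items():
        rng.shuffle(items)
        splits[split].extend(items[:per_task])
    return splits
```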
Reward ablations on the Easy subset with Qwen3-8B show that BFR accelerates convergence (steeper learning curves) and eliminates sub-optimal plateauing in validation accuracy, while separate ablations confirm that increasing seed-task diversity yields greater reasoning gains than increasing the sample count for any given task. These findings establish UltraLogic's utility for curriculum learning and reward shaping in LLMs (Liu et al., 6 Jan 2026).
7. Implementation Protocols
Several guidelines underpin practical UltraLogic deployments:
- Seed-task development is performed by domain experts, who specify the `input` and `solution` code for each logical challenge.
- Template generation employs prompt-driven LLMs with human review for linguistic variation and semantic preservation.
- Automated difficulty calibration is maintained with periodic re-evaluation against the best open-source models, preserving the alignment between target success probabilities $p_d^*$ and measured accuracies $\hat{p}_d$.
- Training pipelines preprocess (prompt, completion, difficulty, lang) tuples, implement the BFR function for reward computation, and use adaptive RL schedulers that introduce harder instances when model success rates reach 40–60% (a scheduler sketch follows this list).
- Quality assurance: Before broad release, sample-based checkpoints at levels 1–3 validate the match between templates, generated code, and labeled solutions.
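The adaptive scheduling rule above can be sketched as follows; the rolling-window size and the exact promotion condition within the 40–60% band are assumptions layered on the stated success-rate trigger.

```python
from collections import deque

class DifficultyScheduler:
    """Track a rolling success rate and move to harder instances once the
    model's success rate sits in the 40-60% band (assumed promotion rule)."""

    def __init__(self, start_level=1, max_level=10, window=512, low=0.40, high=0.60):
        self.level = start_level
        self.max_level = max_level
        self.low, self.high = low, high
        self.results = deque(maxlen=window)

    def record(self, solved: bool) -> int:
        """Record one rollout outcome and return the difficulty level to sample next."""
        self.results.append(1.0 if solved else 0.0)
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if self.low <= rate <= self.high and self.level < self.max_level:
                self.level += 1        # introduce harder instances
                self.results.clear()
        return self.level
```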
This design enables rigorous, iterative expansion of reasoning benchmarks with consistent logical standards and effective RL training signals (Liu et al., 6 Jan 2026).