UltraLogic Data Framework

Updated 9 January 2026

UltraLogic Data Framework is a scalable system for generating high-quality, difficulty-calibrated reasoning tasks by decoupling logical cores from natural language.
It employs a code-based approach with a graded bipolar float reward mechanism to overcome sparse rewards in reinforcement learning and encourage perfect logical chains.
The framework integrates diverse seed tasks, an automated data synthesis pipeline, and precise difficulty calibration to systematically improve LLM training outcomes.

UltraLogic is a data framework for large-scale, high-quality, and difficulty-calibrated reasoning task generation, designed to systematically enhance LLM reasoning. It achieves this by decoupling the logical core of problems from their natural language representations using a code-based solving methodology, and by introducing a graded bipolar float reward (BFR) mechanism that reshapes the RL reward landscape to more effectively guide multi-step logical reasoning. UltraLogic comprises hundreds of unique seed tasks across diverse reasoning domains, an automated data synthesis pipeline, a ten-level difficulty calibration mechanism, and rigorous reward shaping for reinforcement learning with verifiable rewards (RLVR) (Liu et al., 6 Jan 2026).

1. Architectural Principles and Goals

UltraLogic directly addresses the lack of complex, difficulty-calibrated data for general-purpose LLM reasoning and the inherent limitations of sparse, binary reward signals in RLVR. The primary architectural objectives are:

Provision of large-scale, high-quality data across a broad range of reasoning tasks, each carefully calibrated for difficulty.
Structural decoupling of logical task cores from linguistic instantiations to ensure reproducibility, verifiability, and prevent rote pattern learning.
Introduction of a bipolar reward signal to sharply penalize sub-perfect reasoning chains, mitigating gradient sparsity and sub-optimal convergence.

The top-level framework is organized as follows:

Component	Description	Purpose
Original Task Repository	Hundreds of unique seed tasks, classified by Task Domain, Reasoning Ability, and Difficulty Source	Diversity and orthogonality
Task Template Repository	Human-verified natural language templates in multiple languages (e.g., English, Chinese)	Prevent pattern overfitting
Data Synthesis Pipeline	Composed of Input Code, Solution Code, and Difficulty Control Module	Automated, scalable task instance generation
Bipolar Float Reward Layer	Reward reshaping for RLVR on top of synthesized tasks	Effective RL signal for logic optimization

2. Code-Based Solving and Data Synthesis

Central to UltraLogic's methodology is the abstraction of logical complexity from natural language, formalized via code-based specifications for both input and solution generation. Each seed task is defined by two deterministic Python functions:

input(difficulty, language) → core parameters (e.g., numerical variables, graph structures)
solution(params, language) → ground-truth answer

The natural language component serves as a wrapper only, with templates filled programmatically. This orthogonality ensures that linguistic surface forms do not confound the logical core, allowing rigorous control of logical complexity and reproducibility. The workflow for data instance generation is as follows:

for each seed_task in OriginalTaskRepo:
    for difficulty in 1..10:
        for lang in {en, zh}:
            repeat until N_samples generated:
                params, nl_slots = input(difficulty, lang)
                answer = solution(params, lang)
                select a random template t from TemplateRepo[seed_task, lang]
                instance.text = fill_slots(t, nl_slots)
                instance.answer = answer
                instance.difficulty = difficulty
                save(instance)

This process yields millions of unique, verifiable task instances, with rigorous quality assurance implemented by restricting initial sample releases—requiring ≥ 90% success rate on flagship models—before full-scale synthesis (Liu et al., 6 Jan 2026).

3. Difficulty Calibration and Control

UltraLogic implements an automated, ten-level difficulty calibration pipeline. Each level is associated with an explicit target success probability ( $P_{\mathrm{target}}(d)$ ), e.g.:

$P_{\mathrm{target}}(1) \approx 100\%$
$P_{\mathrm{target}}(3) \approx 70\%$
$P_{\mathrm{target}}(5) \approx 50\%$
$P_{\mathrm{target}}(7) \approx 30\%$
$P_{\mathrm{target}}(10) \approx 0\%$

Calibration proceeds by generating validation instances and evaluating model accuracy $P_{\mathrm{actual}}(d)$ . If $|P_{\mathrm{actual}}(d) - P_{\mathrm{target}}(d)| > \epsilon$ , task parameters are adjusted (e.g., increasing search space or constraints) and the loop repeats. The main algorithmic steps are:

Initialize $difficulty\_params[d]$
Loop:
- Generate $M$ samples at current $difficulty\_params[d]$
- Evaluate model; compute $P_{\mathrm{actual}}(d) = \frac{1}{M} \sum_{i=1}^M \mathbf{1}\{\text{model}_i \text{ correct}\}$
- If $|P_{\mathrm{actual}}(d) - P_{\mathrm{target}}(d)| > \delta$ : adjust $difficulty\_params[d]$ , repeat; else, calibration complete

This ensures a stable, reproducible mapping of difficulty levels to empirical model performance, critical for curriculum strategies and controlled evaluation (Liu et al., 6 Jan 2026).

4. Bipolar Float Reward (BFR) for RLVR

UltraLogic introduces the Bipolar Float Reward (BFR) to address the gradient sparsity of binary (0/1) rewards and the “non-negative reward trap” associated with unipolar floats. For a model response with correctness score $S \in [0,1]$ (e.g., accuracy, F1):

$R(\text{response}) = \begin{cases} 1, & S = 1 \ S - 1, & 0 \leq S < 1 \ \end{cases}$

Thus, fully correct responses yield a reward of $+1$ ; imperfect answers in $[0,1)$ yield a strictly negative reward $(−1,0)$ . This structure generates a “reward cliff” at $S=1$ and ensures only logically perfect solutions receive positive reinforcement, while all suboptimal chains are penalized regardless of proximity. In reinforcement learning (learning rate $1\times 10^{-6}$ , rollout size 16, max response length 32,768 tokens, temperature $=1.0$ , Top- $p=1.0$ ), this mechanism eliminates reward plateaus and guides optimization toward global logical optima rather than “good-enough” local minima. A format bonus (+0.1) is optionally added to isolate logic evaluation from minor formatting errors (Liu et al., 6 Jan 2026).

5. Seed Task Suite and Diversity

UltraLogic's task suite covers hundreds of seed tasks cross-classified along three axes:

Task Domain: Symbolic Manipulation, Spatial Pathfinding, Classic Games, etc.
Core Reasoning Ability: Constraint Satisfaction, Algorithmic Thinking, Information Extraction
Difficulty Source: Large Search Space, Complex Rules, Tedious Steps

Example categories include:

Truth-and-Lie Game (constraint satisfaction over logical statements)
Maze Pathfinding (spatial reasoning, instruction following)
Symbolic Seal Decoding (targeting intrinsic LLM weaknesses)
Rectangle-painting on a grid (spatial geometry, combinatorial search)
Causal Chain Extraction (multi-step planning, complex dependencies)

Programmatic expansion yields millions of diverse instances. Experimental ablations demonstrate that diversity—coverage of numerous orthogonal task types—contributes more to reasoning improvements than simple data scale (Liu et al., 6 Jan 2026).

Dimension	Example Category	Characteristic
Domain	Maze Pathfinding	Spatial/Instructional
Core Ability	Symbolic Manipulation	Algorithmic Weaknesses
Difficulty Source	Rectangle-painting	Large Combinatorial Space

6. Training Strategies and Experimental Evaluation

UltraLogic has been used to train dense LLMs such as Qwen3-8B and Qwen3-14B. Key strategies and findings include:

Training uses GRPO (Generalized Reinforcement Policy Optimization) with BFR or baseline rewards.
Difficulty-matched curricula: three datasets (Easy: levels 1–4, Medium: 4–7, Hard: 7–10), each with 50 tasks × 10k samples.
Zone of Proximal Development: Qwen3-8B achieves optimal performance on Easy set; Qwen3-14B best on Medium.

Reward ablation on Easy subset (Qwen3-8B):

Reward Type	AIME24	HMMT25	BBH
Binary	81.7%	52.3%
Graded Float		53.0%	90.4%
BFR	82.6%	56.6%	91.1%

Ablation studies confirm that increasing seed task diversity yields greater reasoning gains than increasing sample size for any given task. BFR also accelerates convergence (steeper learning curves) and eliminates sub-optimal plateauing in validation accuracy. These findings establish UltraLogic's utility for curriculum learning and reward shaping in LLMs (Liu et al., 6 Jan 2026).

7. Implementation Protocols

Several guidelines underpin practical UltraLogic deployments:

Seed-task development is performed by domain experts, specifying input and solution code for each logical challenge.
Template generation employs prompt-driven LLMs with human review for linguistic variation and semantic preservation.
Automated difficulty calibration is maintained with periodic re-evaluation against the best open-source models, preserving the alignment between $P_{\mathrm{actual}}(d)$ and $P_{\mathrm{target}}(d)$ .
Training pipelines preprocess (prompt, completion, difficulty, lang) tuples, implement the BFR function for reward computation, and use adaptive RL schedulers that introduce harder instances when model success rates reach 40–60%.
Quality assurance: Before broad release, sample-based checkpoints at levels 1–3 validate the match between templates, generated code, and labeled solutions.

This design enables rigorous, iterative expansion of reasoning benchmarks with consistent logical standards and effective RL training signals (Liu et al., 6 Jan 2026).

PDF Markdown Chat (Pro)

References (1)

UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward (2026)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to UltraLogic Data Framework.