UltraIF: Scalable LLM Instruction Framework

Updated 13 November 2025
  • UltraIF is a scalable framework for training large language models to follow complex, real-world instructions using open-source data.
  • It decomposes prompts into atomic queries, explicit constraints, and evaluative questions to generate high-quality, constraint-aware datasets.
  • Through its two-stage process of decomposition and composer-driven synthesis, UltraIF achieves performance competitive with proprietary models while enabling robust closed-loop self-alignment.

UltraIF is a scalable framework for training LLMs to follow complex, real-world instructions using only open-source data. It overcomes the quality gap between open-source and proprietary instruction-following models by systematically decomposing user prompts into atomic queries, explicit constraints, and associated evaluative questions. Through a two-stage process—decomposition and composer-driven synthesis—UltraIF produces high-quality, constraint-aware datasets used to fine-tune base LLMs and enable closed-loop self-alignment, yielding instruction-following performance competitive with leading proprietary models. The methodology is distinguished by the UltraComposer module, which automates constraint injection and evaluation, drastically improving synthesis efficiency and alignment quality.

1. High-Level UltraIF Process

UltraIF operates through two tightly integrated stages: decomposition and generate–then–evaluate synthesis.

  • Decomposition Stage: Real-world instructions are collected from sources such as ShareGPT, OpenHermes, and No Robots. Each instruction X is decomposed by a supervisor LLM into a set of triplets (x_i, c_i, q_i), where x_i is a basic query, c_i is an atomic constraint, and q_i is a corresponding yes/no evaluation question.
  • UltraComposer Training: An 8B-parameter transformer (“UltraComposer”) is fine-tuned to map each x_i to the serialized pair [X || q_i], enabling automated prompt composition with embedded constraints and evaluation protocols.
  • Generate–then–Evaluate Synthesis:
    • New instructions x are iteratively augmented by UltraComposer to produce \bar{x} with up to k constraints and a cumulative set of evaluation questions \bar{q}.
    • For each augmented instruction \bar{x}, K response candidates y_1, \ldots, y_K are generated by the model.
    • All responses are filtered using \bar{q}; only those passing every evaluation are accepted.
    • Preference tuples (\bar{x}, y_{chosen}, y_{rejected}) are formed for downstream supervised fine-tuning (SFT) and optional preference learning (DPO/NCA).

This modular pipeline produces large, diverse, and quality-controlled instruction–response datasets with minimal human oversight, forming the foundation for training robust instruction-following LLMs.
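
For concreteness, each quality-controlled record the pipeline emits can be pictured as below. This is a sketch with illustrative field names and placeholder values (the instruction is the running example introduced in Section 2), not a format prescribed by the source.

# One synthesized record (illustrative fields and placeholder values).
record = {
    "instruction": "In Shakespeare's tone, recommend me ten Chinese books.",
    "questions": [
        "Is the response written in Shakespeare's tone?",
        "Are exactly ten books recommended?",
    ],
    "chosen": "Hark! Ten tomes of Cathay I commend to thee: ...",  # passed every check
    "rejected": "Here are some books you might enjoy: ...",        # failed a check
}
# SFT consumes (instruction, chosen); DPO/NCA additionally consumes rejected.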

2. Decomposition of User Prompts

Central to UltraIF is the decomposition of real-world (“in-the-wild”) user instructions X into a collection of triplets (x_i, c_i, q_i):

X \longrightarrow \left\{(x_1, c_1, q_1),\ \ldots,\ (x_n, c_n, q_n)\right\}

  • x_i : “basic” query, with constraint c_i removed for atomic granularity.
  • c_i : explicit, atomic requirement (style, count, format, content).
  • q_i : evaluative yes/no question verifying c_i for any candidate response.

Example: For the prompt “In Shakespeare’s tone, recommend me ten Chinese books.”

  • (x_1 = “Recommend me ten Chinese books.”, c_1 = “In Shakespeare’s tone.”, q_1 = “Is the response written in Shakespeare’s tone?”)
  • (x_2 = “Recommend me Chinese books.”, c_2 = “ten”, q_2 = “Are exactly ten books recommended?”)

This formalized decomposition allows constraint injection and systematic quality verification, forming the substrate for robust synthesis.
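
In code, a triplet is naturally a small record. The following is a minimal sketch (the Triplet class name is illustrative, not from the source), populated with the example above:

from dataclasses import dataclass

@dataclass
class Triplet:
    x: str  # basic query with the constraint removed
    c: str  # explicit atomic constraint
    q: str  # yes/no evaluation question verifying c

t1 = Triplet(
    x="Recommend me ten Chinese books.",
    c="In Shakespeare's tone.",
    q="Is the response written in Shakespeare's tone?",
)
t2 = Triplet(
    x="Recommend me Chinese books.",
    c="ten",
    q="Are exactly ten books recommended?",
)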

3. UltraComposer: Model, Objective, and Algorithms

Model Architecture:

  • UltraComposer adopts a standard transformer decoder (initialized from LLaMA-3.1-8B-Instruct) to ingest x_i with a decomposition prefix and output the serialized sequence [X <SEP> q_i].

Training Objective:

  • The prompt-composition loss is the token-level cross-entropy:

\mathcal{L}_{\text{comp}}(\theta) = - \sum_{(x_i,\, X,\, q_i)} \log P_\theta\big([X \,||\, q_i] \mid x_i\big)
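
A minimal training sketch for this objective, assuming Hugging Face-style causal-LM fine-tuning in which prompt tokens are masked from the loss with label -100 (the "Compose:" decomposition prefix and the <SEP> literal are illustrative assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def composer_loss(x_i, X, q_i):
    """Token-level cross-entropy on the target [X <SEP> q_i] given x_i."""
    prompt = f"Compose: {x_i}\n"   # illustrative decomposition prefix
    target = f"{X} <SEP> {q_i}"    # serialized constrained prompt + eval question
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # loss only on target tokens
    return model(input_ids=full_ids, labels=labels).loss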

Generation and Filtering Procedures:

def ComposeConstraints(x, num_iters):
    """Iteratively layer constraints onto instruction x via UltraComposer."""
    constraints, questions = set(), set()
    bar_x = x
    for _ in range(num_iters):
        # UltraComposer rewrites the current prompt into a more constrained
        # version X_i and emits an evaluation question q_i for the new constraint.
        X_i, q_i = UltraComposer.generate(bar_x)
        c_i = extract_constraint(X_i, bar_x)  # isolate the newly injected constraint
        constraints.add(c_i)                  # tracked for analysis/deduplication
        questions.add(q_i)
        bar_x = X_i                           # carry the augmentation into the next round
    return bar_x, questions

def FilterResponses(bar_x, questions, K):
    """Sample K candidates for bar_x; keep only those passing every question."""
    candidates = sample_responses(bar_x, K)
    accepted, rejected = [], []
    for y in candidates:
        verdicts = [Judge(y, q) for q in questions]  # per-constraint yes/no checks
        if all(v == "YES" for v in verdicts):
            accepted.append(y)   # candidates for y_chosen
        else:
            rejected.append(y)   # candidates for y_rejected
    return accepted, rejected
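
The Judge routine used above is left abstract by the pipeline. A minimal sketch, assuming a generic single-turn completion helper llm_complete (a hypothetical function, not an API from the source):

def Judge(response, question):
    """Ask an LLM a yes/no evaluation question about a candidate response."""
    prompt = (
        f"Response:\n{response}\n\n"
        f"Question: {question}\n"
        "Answer strictly YES or NO."
    )
    answer = llm_complete(prompt)  # hypothetical completion helper
    return "YES" if answer.strip().upper().startswith("YES") else "NO"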

Through iterative composition, UltraComposer can layer constraints to synthesize complex instructions, while per-constraint evaluation questions yield a filter with strong empirical discriminative power.

4. Data Synthesis and Quality Assurance

Synthesis operates at scale via iterative batch augmentation:

  1. For each seed x, use ComposeConstraints(x, t) to obtain \bar{x} with t constraints and evaluation set \bar{q}.
  2. Sample K responses per \bar{x}.
  3. Assess each y_j against every q \in \bar{q}; keep only those with Judge(y_j, q) = YES for all q.
  4. Select one passing y_c (“positive”) and one rejected y_r (“negative”) to construct preference tuples, as in the sketch below.
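
A minimal sketch of this loop, reusing the ComposeConstraints and FilterResponses routines from Section 3 (the pairing strategy shown here, first accepted versus first rejected, is an assumption):

def SynthesizeBatch(seeds, t, K):
    """Build SFT examples and preference tuples from seed instructions."""
    sft_data, pref_data = [], []
    for x in seeds:
        bar_x, questions = ComposeConstraints(x, t)      # t layered constraints
        accepted, rejected = FilterResponses(bar_x, questions, K)
        if accepted:
            y_c = accepted[0]                            # a fully passing response
            sft_data.append((bar_x, y_c))                # for SFT
            if rejected:
                pref_data.append((bar_x, y_c, rejected[0]))  # for DPO/NCA
    return sft_data, pref_data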

Empirical results: UltraIF achieves an 85% pass rate in SFT-data synthesis versus 20% for AutoIF, demonstrating a substantial efficiency gain in generating high-quality, constraint-compliant training data.

5. Experimental Protocols and Performance

No Benchmark Leakage:

  • Decomposition and prompt generation are performed only with supervisor LLMs (e.g., LLaMA-3.1-70B-Instruct), and no held-out benchmark examples are used to fine-tune the 8B base model.

Self-Alignment:

  • The 8B-Instruct model can serve as its own supervisor: UltraIF generates data from the 8B-Instruct model and re-aligns it further via closed-loop self-supervision.
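
A minimal sketch of this closed loop, where every helper (sample_seed_instructions, sft, dpo) is assumed rather than taken from the source, and SynthesizeBatch is the routine sketched in Section 4:

def self_align(model, num_rounds, t=3, K=8):
    """Closed-loop self-alignment: the model supervises its own synthesis.
    All helper functions here are assumptions, not the paper's API."""
    for _ in range(num_rounds):
        seeds = sample_seed_instructions()
        # sample_responses and Judge (used inside SynthesizeBatch) are assumed
        # to be backed by `model` itself in the self-alignment setting.
        sft_data, pref_data = SynthesizeBatch(seeds, t, K)
        model = sft(model, sft_data)    # supervised fine-tuning stage
        model = dpo(model, pref_data)   # optional preference learning
    return model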

Benchmark Results (8B Base, Table 1):

Benchmark     Metric                 UltraIF (8B Base)
IFEval        Pr(S) Accuracy         58.22
IFEval        Pr(L) Accuracy         65.25
IFEval        Ins(S) Accuracy        68.11
IFEval        Ins(L) Accuracy        74.22
Multi-IF      Turn 1 Success Rate    58.14%
Multi-IF      Turn 2 Success Rate    35.65%
Multi-IF      Turn 3 Success Rate    26.55%
InfoBench     DRFR                   83.56
LiveBench     Score                  49.50
FollowBench   SSR                    59.99

Scaling to 175K SFT examples and 20K DPO pairs enables the UltraIF base model to match or slightly surpass LLaMA-3.1-8B-Instruct on IFEval Pr(S) (71.35 vs. 69.13) and remain competitive across all five benchmarks. This suggests that the UltraIF approach closes the gap with instruct-tuned proprietary models using open data and relatively small model footprints.

6. Advantages, Constraints, and Future Work

Advantages:

  • Scalability: Constraint composition enables generation of millions of diverse, high-quality instructions with minimal handcrafted rules.
  • Quality Control: Per-constraint evaluation provides an efficient, lightweight filtering mechanism to ensure data fidelity.
  • Self-Alignment: Even a strong instruct-tuned model can enhance itself autonomously in a closed feedback loop.

Limitations:

  • The decomposition quality is dependent on the supervisor LLM's inherent capabilities.
  • Evaluation questions are binary, unable to capture fine-grained quality attributes (e.g., nuance, creativity).
  • Domain-shift outside the training data sources may reduce effectiveness for specialized application areas.

Potential Extensions:

  • Multi-label or scalar evaluation questions for richer assessment (e.g., rating numerical adherence or stylistic compliance on a scale).
  • Joint multitask learning for decomposition and composition stages.
  • Incorporation of human-in-the-loop feedback for scarce or highly specialized constraints.
  • Extension to multimodal instruction settings, including image–text pairs.

A plausible implication is that UltraIF's methodology—especially constraint injection and automated evaluation—could generalize to other instruction-following domains requiring precise multi-faceted response validation.


UltraIF exemplifies a modular, scalable architecture for instruction-following model alignment using open-source data. Its decomposition-driven synthesis and high-yield, constraint-filtered training regime demonstrate that open models can approach proprietary instruction-following standards with careful pipelining, evaluation, and self-alignment mechanisms.
