Flow-Factory: Unified RL & Flow Sampling
- Flow-Factory is a unified framework that decouples and modularizes reinforcement learning for flow-matching and diffusion models through a registry-driven plug-and-play design.
- It minimizes engineering complexity and supports advanced features like memory optimization, multi-reward training, and distributed computing for rapid prototyping.
- The framework also defines a Bernoulli factory for sampling integral flows in flow-based polytopes, bridging concepts in combinatorial probability and network theory.
Flow-Factory is a unified framework designed to decouple, modularize, and streamline reinforcement learning (RL) for flow-matching and diffusion models. It achieves this through a registry-driven architecture that enables plug-and-play integration of RL algorithms, generative models, and reward functions, with built-in support for memory optimization, multi-reward training, and distributed compute. In parallel, the term Flow-Factory also refers, in combinatorial probability, to an explicit Bernoulli factory construction for sampling integral flows in flow-based polytopes subject to network constraints. Both concepts share a unifying perspective: compositional, exact, and extensible frameworks targeting complex stochastic systems (Ping et al., 13 Feb 2026, Niazadeh et al., 2022).
1. Motivation and Design Principles
The primary motivation for Flow-Factory is to address severe fragmentation and algorithm-model coupling found in RL for flow-matching models: each new RL algorithm often resides in an independent codebase with bespoke stochastic differential equation (SDE) or ordinary differential equation (ODE) logic and model-specific optimizer code, making it costly to benchmark models against algorithmic variants. This impedes reproducible research and rapid prototyping.
Key design goals include:
- Minimizing engineering complexity by reducing integration cost through decoupling,
- Enabling mix-and-match experimentation through configuration,
- Supporting future extensibility with minimal boilerplate for new architectures and RL objectives.
These are operationalized through a registry-based plug-and-play system, preprocessing-based memory optimizations (precomputing embeddings to offload non-essential components from GPU), and a unified multi-reward engine capable of weighted- and GDPO-style (Generalized Deterministic Policy Optimization) aggregation (Ping et al., 13 Feb 2026).
2. Registry-Based Modular Architecture
All core abstractions—adapters, trainers, reward models, and SDE/ODE schedulers—are registered via decorators that populate global registries (Python dicts). At runtime, users supply a single YAML configuration:
```yaml
model:
  type: "flux"
  pretrained_ckpt: "/.../flux.ckpt"
trainer:
  type: "FlowGRPO"
  sde_type: "Flow-SDE"
rewards:
  - name: "PickScore"
    weight: 0.6
  - name: "TextRender"
    weight: 0.4
```
The pipeline instantiates adapters, trainers, and reward objects dynamically:
```python
adapter_cls = AdapterRegistry[cfg.model.type]
trainer_cls = TrainerRegistry[cfg.trainer.type]
reward_objs = [RewardRegistry[r.name](**r.args) for r in cfg.rewards]

adapter = adapter_cls(**cfg.model)
trainer = trainer_cls(adapter, reward_objs, **cfg.trainer.args)
trainer.train()
```
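The decorator-and-registry mechanism itself reduces to a few lines. A minimal self-contained sketch, where the registry and decorator names mirror the snippets above but their exact signatures are assumptions rather than the framework's actual API:

```python
# Global registries: plain dicts mapping config names to classes.
TrainerRegistry: dict = {}
AdapterRegistry: dict = {}

def register_trainer(name):
    """Decorator that stores a trainer class under an explicit name."""
    def wrap(cls):
        TrainerRegistry[name] = cls
        return cls
    return wrap

def register_adapter(name):
    """Decorator that stores an adapter class under an explicit name."""
    def wrap(cls):
        AdapterRegistry[name] = cls
        return cls
    return wrap

@register_trainer("FlowGRPO")
class FlowGRPOTrainer:
    def __init__(self, adapter, rewards, **kwargs):
        self.adapter, self.rewards = adapter, rewards

@register_adapter("flux")
class FluxAdapter:
    def __init__(self, **cfg):
        self.cfg = cfg

# Lookup by name, exactly as the config-driven pipeline would:
trainer_cls = TrainerRegistry["FlowGRPO"]
adapter = AdapterRegistry["flux"](pretrained_ckpt="demo.ckpt")
```

Because registration happens at import time, a new algorithm or model becomes addressable from YAML the moment its module is imported, with no changes to the pipeline code.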
UML-style dataflow in BaseTrainer.train():
```text
+----------------------------------------------------+
| BaseTrainer.train()                                |
+----------------------------------------------------+
| 1. adapter.prepare_dataloader()                    |
| 2. for batch in dataloader:                        |
|      traj = adapter.sample_trajectory(trainer)     |
|      logp, states = self.compute_log_probs(traj)   |
|      rewards = MultiRewardLoader(traj, rewards)    |
|      advs = self.compute_advantages(rewards)       |
|      loss = self._compute_loss(states, advs)       |
|      optimizer.zero_grad(); loss.backward()        |
|      optimizer.step()                              |
+----------------------------------------------------+
```
This allows for modular extension and reproducible experiments via configuration (Ping et al., 13 Feb 2026).
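The dataflow above can be exercised end-to-end with stub components. The following toy skeleton mirrors the loop's structure; all classes and the fake log-prob/reward computations are illustrative stand-ins, not framework code:

```python
import random

class StubAdapter:
    """Stand-in adapter: returns fake trajectories instead of SDE rollouts."""
    def prepare_dataloader(self):
        return [["a photo of a cat"], ["a photo of a dog"]]  # two tiny batches
    def sample_trajectory(self, batch):
        return [random.random() for _ in range(4)]  # fake denoising states

class StubTrainer:
    """Mirrors the BaseTrainer.train() dataflow from the diagram."""
    def __init__(self, adapter, weighted_rewards):
        self.adapter = adapter
        self.weighted_rewards = weighted_rewards   # list of (weight, fn)
    def compute_log_probs(self, traj):
        return sum(traj)                           # fake trajectory log-prob
    def compute_advantages(self, rewards):
        mean = sum(rewards) / len(rewards)         # group-centred baseline
        return [r - mean for r in rewards]
    def train(self):
        losses = []
        for batch in self.adapter.prepare_dataloader():
            group = [self.adapter.sample_trajectory(batch) for _ in range(4)]
            logps = [self.compute_log_probs(t) for t in group]
            rewards = [sum(w * fn(t) for w, fn in self.weighted_rewards)
                       for t in group]
            advs = self.compute_advantages(rewards)
            loss = -sum(lp * a for lp, a in zip(logps, advs)) / len(group)
            losses.append(loss)
        return losses

random.seed(0)
losses = StubTrainer(StubAdapter(), [(0.6, sum), (0.4, max)]).train()
```

Swapping StubAdapter or the reward functions for registered implementations changes nothing in the loop, which is the decoupling the architecture is built around.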
3. Implemented Algorithms and Model Integrations
The framework encapsulates several critical RL methodologies for flow-matching models:
- Flow-GRPO (and Variants): Implements SDE-based policy updates with pluggable SDE types (“Flow-SDE”, “Dance-SDE”, “CPS”) and schedule control via SDESchedulerMixin.
- DiffusionNFT: Supports generalized pairwise objectives and is agnostic to the ODE sampler (uniform, logit-normal, high-order) supplied through the adapter.
- AWM (Advantage Weighted Matching): Incorporates the RL advantage as a per-sample weight on the flow-matching velocity loss.
Adapters include FluxAdapter, QwenImageAdapter, and WANVideoAdapter. Algorithms are instantiated and composed in code or YAML without model-specific changes (Ping et al., 13 Feb 2026).
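The AWM idea of weighting the velocity-matching loss by an RL advantage can be sketched in a few lines of NumPy. This is a toy illustration under the linear-interpolation target $x_t = (1-t)x_0 + t x_1$; the model, data, and exact weighting scheme are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 8, 4                        # batch of 8 samples in 4 dimensions
x0 = rng.normal(size=(B, D))       # noise samples
x1 = rng.normal(size=(B, D))       # data samples
t = rng.uniform(size=(B, 1))
xt = (1 - t) * x0 + t * x1         # linear-interpolation path
target = x1 - x0                   # velocity target along this path

def v_theta(x, t):
    """Toy 'model': a fixed linear map standing in for the network."""
    return 0.5 * x

adv = rng.normal(size=(B,))        # per-sample RL advantages

per_sample = ((v_theta(xt, t) - target) ** 2).mean(axis=1)
fm_loss = per_sample.mean()           # plain flow-matching loss
awm_loss = (adv * per_sample).mean()  # advantage-weighted variant
```

Samples with positive advantage pull the model toward their velocity targets more strongly, which is the essential difference from the unweighted pretraining loss.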
4. Mathematical Framework
4.1. Flow-Matching Pretraining
The flow-matching pretraining loss, in its standard linear-interpolation (rectified-flow) form, is

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_0,\; x_1 \sim p_1}\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2, \qquad x_t = (1-t)\,x_0 + t\,x_1.$$
4.2. RL Policy-Gradient Formulation
Treating the flow-matching model's denoising chain as the policy, the policy gradient takes the standard score-function form

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ A(\tau) \sum_{t} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t) \Big],$$

where $p_\theta(x_{t-1} \mid x_t)$ are the per-step transition densities induced by the SDE sampler and $A(\tau)$ is the trajectory advantage.
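The score-function estimator can be made concrete on a one-dimensional toy chain. Everything here is illustrative: Gaussian per-step transitions with a scalar parameter, and a finite-difference gradient in place of autodiff:

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma = 5, 0.1
theta = 0.3                        # scalar stand-in for policy parameters

# A fake trajectory: each step's mean depends on the previous state.
xs = [0.0]
for _ in range(T):
    xs.append(theta * xs[-1] + 1.0 + sigma * rng.normal())

def traj_logp(theta, xs):
    """Sum of per-step Gaussian log-densities: log p_theta(trajectory)."""
    lp = 0.0
    for prev, cur in zip(xs, xs[1:]):
        mean = theta * prev + 1.0
        lp += -0.5 * ((cur - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return lp

advantage = 1.7                    # would come from reward models in practice
eps = 1e-5                         # finite-difference gradient for the sketch
grad = (traj_logp(theta + eps, xs) - traj_logp(theta - eps, xs)) / (2 * eps)
pg_term = advantage * grad         # one trajectory's REINFORCE contribution
```

Averaging `pg_term` over many sampled trajectories yields the Monte Carlo estimate of the gradient above.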
4.3. Multi-Reward Aggregation
For rewards $R_1, \dots, R_K$, each yielding a per-sample advantage $A_k$:
- Weighted sum: $A = \sum_{k} w_k A_k$ with user-specified weights $w_k$.
- GDPO: Normalize each $A_k$ to zero mean/unit variance, then sum or clip.
These objectives generalize prior work in aligning diffusion/flow models with complex user preferences (Ping et al., 13 Feb 2026).
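The two aggregation modes reduce to a few array operations. A sketch assuming per-sample advantage arrays (reward names and scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-sample advantages from two reward models with mismatched scales.
A = {"PickScore": rng.normal(2.0, 5.0, size=16),
     "TextRender": rng.normal(0.0, 0.1, size=16)}
weights = {"PickScore": 0.6, "TextRender": 0.4}

# Weighted sum: the high-variance reward dominates the mixture.
weighted = sum(w * A[name] for name, w in weights.items())

def normalize(a):
    """GDPO-style: zero mean / unit variance per reward stream."""
    return (a - a.mean()) / (a.std() + 1e-8)

# After normalization every reward contributes at a comparable scale.
gdpo = sum(w * normalize(A[name]) for name, w in weights.items())
```

The normalization step is what lets heterogeneous rewards (aesthetic scores, text-rendering accuracy) be mixed without one silently dominating the gradient.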
5. System Features and Extension
5.1. Memory Optimization
Text-encoder embeddings and VAE latents are precomputed and cached, offloading the frozen encoders from GPU and reducing peak GPU memory by 13.0% (from $61.08$ GB to $53.14$ GB) and step time by 42.6% (from $144.02$ s to $82.68$ s, a 1.74× speedup) on Flux.1-dev (8×H200, 140 GB) (Ping et al., 13 Feb 2026).
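The preprocessing idea is: run the frozen encoders once per unique prompt, cache the result, and keep only the trainable flow model resident during RL. A minimal sketch with a dictionary cache (the encoder function is a hypothetical stand-in for a frozen text encoder or VAE):

```python
import hashlib

def fake_text_encoder(prompt: str) -> list[float]:
    """Stand-in for a frozen text encoder; expensive in reality."""
    h = hashlib.sha256(prompt.encode()).digest()
    return [b / 255.0 for b in h[:4]]

_cache: dict[str, list[float]] = {}
calls = 0

def get_embedding(prompt: str) -> list[float]:
    global calls
    if prompt not in _cache:          # encode once, reuse thereafter
        calls += 1
        _cache[prompt] = fake_text_encoder(prompt)
    return _cache[prompt]

# RL training revisits the same prompts every rollout, so hits dominate:
prompts = ["a cat", "a dog", "a cat", "a cat"]
embs = [get_embedding(p) for p in prompts]
```

Because RL fine-tuning samples the same prompt set repeatedly, the encoder cost is amortized to a one-time preprocessing pass, which is where the reported memory and step-time savings come from.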
5.2. Flexible Multi-Reward Training
Reward functions, aggregation type, and scheduling are controlled via YAML. Deduplication and scheduling (via callbacks) allow for curriculum and multi-granularity reward mixing.
5.3. Distributed Training
Built atop PyTorch Lightning and torch.distributed, Flow-Factory enables DataParallel/DDP for batch splitting, all-reduce for gradients and groupwise rewards, and ProcessGroup barriers for synchronized multi-reward scoring.
5.4. Extension Guidelines
To add algorithms:
- Create a new file subclassing
BaseTrainer(optionally withSDESchedulerMixin). - Register the algorithm with
@register_trainer. - Create a YAML config; training command is uniform.
To add models:
- Implement a new adapter.
- Register via
@register_adapter. - Reference by name in configuration.
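A newly registered algorithm is then selected purely through configuration. The names below are illustrative, not actual Flow-Factory identifiers:

```yaml
model:
  type: "qwen-image"        # hypothetical name resolved via the adapter registry
trainer:
  type: "MyNewRL"           # hypothetical name resolved via the trainer registry
rewards:
  - name: "PickScore"
    weight: 1.0
```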
Directory structure reflects this modular approach, with separate adapters, trainers, rewards, schedulers, and a central registry (Ping et al., 13 Feb 2026).
6. Empirical Results and Benchmarking
Flow-Factory reproduces published reward curves for Flow-GRPO, DiffusionNFT, and AWM on the Flux.1-dev + PickScore benchmark within ±2% of the original studies. Qualitatively, base Flux generations are the least detailed, Flow-GRPO improves texture, DiffusionNFT enhances color fidelity, and AWM yields the best style-content trade-off.
Training Efficiency (Flux.1-dev / PickScore / 100 steps):
| Metric | Without Preproc | With Preproc |
|---|---|---|
| Peak GPU Memory/device | 61.08 GB | 53.14 GB (–13.0%) |
| Time per Step | 144.02 s | 82.68 s (1.74× speedup) |
Scalability benchmarks on Qwen-Image and WAN video demonstrate >90% code reuse. Changing reward types or RL objectives typically requires only a YAML edit—no code changes—highlighting both extensibility and rapid prototyping capability (Ping et al., 13 Feb 2026).
7. Flow-Factory in Combinatorial Probability
Independently, "Flow-Factory" denotes an explicit combinatorial Bernoulli factory for sampling vertices of flow-based polytopes, as developed by Anari et al. (Niazadeh et al., 2022). Given with in the interior of a flow-polytope , the factory adaptively flips -coins to produce a binary flow such that for all .
The construction uses:
- Bernstein polynomials indexed by integral flows,
- The Matrix-Tree Theorem for arborescence enumeration and Laplacian minor invariance,
- An explicit rejection-sampling scheme correcting product-measure weights by summing over arborescences.
Algorithm Steps:
- Flip each $x_e$-coin to form a provisional assignment $y$.
- Reject if $y$ violates the network's flow constraints.
- Pick a directed tree such that the flipped edges and their reversals yield an arborescence.
- Flip the coins on the tree edges again; reject on mismatch with $y$.
- Accept $y$.
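The exactness guarantee $\mathbb{E}[y_e] = x_e$ is easiest to see in a degenerate instance: two parallel s-t edges carrying one unit of flow, so $x_1 + x_2 = 1$ and the polytope's vertices are $(1,0)$ and $(0,1)$. There a single coin flip already is the factory, needing none of the arborescence machinery; a Monte Carlo check of the marginals:

```python
import random

def tiny_flow_factory(x1: float, rng: random.Random) -> tuple[int, int]:
    """Two parallel s-t edges, total flow 1: flipping the x1-coin once
    selects a vertex whose expected edge values equal (x1, 1 - x1)."""
    y1 = 1 if rng.random() < x1 else 0
    return (y1, 1 - y1)

rng = random.Random(42)
x1 = 0.3
n = 200_000
mean_y1 = sum(tiny_flow_factory(x1, rng)[0] for _ in range(n)) / n
# mean_y1 concentrates near x1 = 0.3, matching E[y1] = x1.
```

General networks require the rejection and arborescence corrections above precisely because independent coin flips alone do not preserve the marginals once constraints couple the edges.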
Correctness and expectation bounds are established by algebraic properties of Bernstein coefficients and Laplacian minors.
Concrete cases include s–t path sampling in DAGs, circulations in cycles, and -flows in general networks, generalizing prior matchings-based factories and revealing deep links to algebraic combinatorics (Niazadeh et al., 2022).
References:
- (Ping et al., 13 Feb 2026) "Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models"
- (Niazadeh et al., 2022) "Bernoulli Factories for Flow-Based Polytopes"