Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions (2511.09465v1)

Published 12 Nov 2025 in stat.ML and cs.LG

Abstract: Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding & design, and discrete, exemplified by diffusion LLMs. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a LLM, or the number of amino acids in a protein chain is not known a priori. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and 'multimodal' product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.

Summary

  • The paper introduces Branching Flows, a generative framework that leverages stochastic splits and deletions to dynamically model variable-length sequences.
  • It employs a binary tree structure with Poisson counting flows and deletion flows to efficiently sample and track branching events across discrete, continuous, and multimodal domains.
  • Empirical results demonstrate the model's effectiveness in generating diverse molecules, antibody sequences, and protein structures with enhanced fidelity and structural integrity.

Branching Flows: A Unified Framework for Variable-Length Flow Matching via Stochastic Splits and Deletions

Introduction: Problem Context and Motivation

The paper introduces Branching Flows, a generative modeling framework that generalizes diffusion and flow matching approaches to domains with variable-length sequences by incorporating stochastic splitting and deletion of elements. Traditional diffusion and flow matching models are limited to fixed-cardinality state spaces, constraining their applicability to tasks where the number of elements must be determined dynamically (e.g., generation of natural language, proteins, or molecules of varying length and composition). Previous methods handle variable-length discrete sequences via autoregressive models or, more recently, edit-based flows; scalable, principled approaches for variable-length continuous and multimodal state spaces have remained elusive.

Branching Flows formulates the generative process as the evolution of a population of elements over a forest of binary trees, governed by split (branching) and deletion events with time-dependent rates learned by the model. This approach composes naturally with any flow matching base process, enabling compatibility with sequences over discrete sets, continuous Euclidean spaces, smooth manifolds, and multimodal product spaces. Conditioning and anchor-based interpolation enable the framework to handle complex generative tasks such as the insertion of variable-length segments conditioned on fixed regions ("unknown-length infix sampling"), a problem unaddressed by prior models.

Formalization and Theoretical Foundation

Branching Flows are developed within the Generator Matching paradigm, parameterizing the conditional process via an auxiliary latent variable Z that encodes the initial and final states, the branching structure (a forest of binary trees), and auxiliary anchor variables. This construction enables the model to generate stochastic paths from simple initial distributions to complex, variable-length data distributions, satisfying boundary constraints via a hierarchical conditional path.

Event Modeling: Counting and Deletion Flows

The stochastic evolution of sequence length is managed via Poisson counting flows (for splits) and deletion flows (for removals), both formally described as continuous-time Markov chains (CTMCs) with time-inhomogeneous hazard rates. For splits, the number of remaining branches is explicitly tracked, yielding efficient sampling strategies for waiting times and events, with losses formulated as Bregman divergences on the predicted event intensities. Deletion flows are analogously structured, employing binary states and time-dependent hazard rates, with a cross-entropy loss for predicting event occurrence.
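
For illustration, waiting times under a time-inhomogeneous hazard can be sampled by the standard thinning (Lewis-Shedler) algorithm. The base-Julia sketch below assumes an illustrative hazard function, horizon, and rate bound rather than the paper's learned intensities.

```julia
# Minimal sketch: next event time of a time-inhomogeneous Poisson process
# via thinning (Lewis-Shedler). Hazard and bound are illustrative.
using Random: randexp

# Sample the next event time after `t0` for hazard `lambda(t)`, given an
# upper bound `lambda_max >= lambda(t)` on [t0, t_end].
# Returns `nothing` if no event occurs before `t_end`.
function next_event_time(lambda, t0, t_end, lambda_max)
    t = t0
    while true
        t += randexp() / lambda_max              # candidate from homogeneous process
        t > t_end && return nothing              # no event within the horizon
        rand() < lambda(t) / lambda_max && return t  # accept with prob lambda(t)/lambda_max
    end
end

# Usage: first event of a linearly increasing hazard on [0, 1].
t = next_event_time(t -> 1.0 + 4.0t, 0.0, 1.0, 5.0)
```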

Conditional Paths and Marginalization

Branching Flows leverage the latent tree structure to generate conditional paths where elements are independently evolved along branches, splitting and dying according to their assigned trajectories. Detailed formalism is provided for representing branch indices, implementing split and deletion operators, and anchoring element evolution via mean-reverting stochastic processes (e.g., Ornstein–Uhlenbeck for continuous variables, Discrete Flow Matching for discrete labels). The framework permits marginalizing over unobserved branching indicators, rigorously connecting the marginal generator to conditional processes as demonstrated in recent theoretical extensions.
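
To make the mean-reverting anchoring concrete, the following base-Julia sketch simulates an Ornstein-Uhlenbeck-style process whose drift pulls an element's continuous coordinates toward its anchor. The reversion rate, noise scale, and Euler-Maruyama discretization are illustrative choices, not the paper's exact parameterization.

```julia
# Minimal sketch of mean-reverting (OU-style) anchoring: coordinates drift
# toward an anchor point under noise. `theta` and `sigma` are illustrative.
function ou_path_to_anchor(x0::Vector{Float64}, anchor::Vector{Float64};
                           theta=4.0, sigma=0.5, nsteps=100)
    dt = 1.0 / nsteps
    x = copy(x0)
    path = [copy(x)]
    for _ in 1:nsteps
        # Euler-Maruyama step: mean reversion toward the anchor plus noise.
        x .+= theta .* (anchor .- x) .* dt .+ sigma .* sqrt(dt) .* randn(length(x))
        push!(path, copy(x))
    end
    return path
end

# Usage: a 3D element pulled from a random start toward the origin.
path = ou_path_to_anchor(randn(3), zeros(3))
```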

Loss Functions and Training

The composite loss integrates three components: split event prediction, deletion event prediction, and the base flow matching loss, each applied elementwise and aggregated. Prediction targets (the number of splits ahead and the deletion probability) are obtained directly from the sampled conditional path, while anchors guide the covariance and interpolation for each variable. Training proceeds within a stochastic gradient descent framework, sampling latent trajectories and intermediate states and taking gradient steps on the joint loss.
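
A minimal sketch of how the three components might compose per element is given below, assuming a Poisson-style Bregman loss for split counts, binary cross-entropy for deletion, and a squared-error stand-in for the base flow matching term; the weights and the concrete base loss are assumptions, not the paper's exact formulation.

```julia
# Illustrative per-element composite loss: Poisson-style Bregman loss for
# splits ahead, binary cross-entropy for deletion, squared error as a
# placeholder base flow matching term. Weights are assumed hyperparameters.
poisson_loss(pred_rate, target_count) = pred_rate - target_count * log(pred_rate + 1e-12)
bce_loss(p, y) = -(y * log(p + 1e-12) + (1 - y) * log(1 - p + 1e-12))

function branching_flow_loss(split_rate, splits_ahead,
                             del_prob, deleted,
                             v_pred, v_target;
                             w_split=1.0, w_del=1.0, w_base=1.0)
    l_split = poisson_loss(split_rate, splits_ahead)
    l_del   = bce_loss(del_prob, deleted)
    l_base  = sum(abs2, v_pred .- v_target)   # stand-in for the base process loss
    return w_split * l_split + w_del * l_del + w_base * l_base
end
```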

Implementation Strategies and Practical Considerations

Implementing Branching Flows requires efficient sampling and tracking of trees, anchors, and element states. The Julia ecosystem is leveraged for GPU-accelerated deep learning (Flux.jl, CUDA.jl), visualization (Makie.jl), transformer architectures with pairwise features, and specialized attention kernels. Modular code composition allows the integration of arbitrary base processes, supporting multimodal state spaces and leveraging domain-specific encodings (e.g., random Fourier features for spatial positions, Rotary Positional Encoding for sequence ordering).
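
As an illustration of one domain-specific encoding mentioned above, here is a minimal base-Julia sketch of random Fourier features for spatial positions; the frequency scale and feature count are illustrative, and the paper's actual encoding may differ.

```julia
# Minimal random Fourier feature encoding for spatial positions.
struct FourierFeatures
    B::Matrix{Float64}   # random projection, fixed at initialization
end
FourierFeatures(dim_in::Int, nfeat::Int; scale=1.0) =
    FourierFeatures(scale .* randn(nfeat, dim_in))

# Maps x in R^dim_in to [cos(2*pi*Bx); sin(2*pi*Bx)] in R^(2*nfeat).
encode(ff::FourierFeatures, x::Vector{Float64}) =
    vcat(cos.(2π .* ff.B * x), sin.(2π .* ff.B * x))

# Usage: encode a 3D position into 32 features.
ff = FourierFeatures(3, 16; scale=1.0)
z = encode(ff, randn(3))
```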

Branch indices can be tracked explicitly via pairwise count matrices or implicitly via ordering, with empirical evidence suggesting the latter suffices for most domains. Anchors for continuous spaces interpolate via geodesics, while discrete spaces utilize mask/dummy tokens, both sampled recursively in post-order from leaves to roots.
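
A minimal sketch of the post-order (leaves-to-roots) anchor recursion follows, assuming for illustration that a parent anchor in Euclidean space is the mean of its children's anchors; the paper's geodesic interpolation and stochastic anchor sampling are more general.

```julia
# Post-order anchor assignment over a binary tree: leaves carry data
# coordinates; each internal node's anchor is derived from its children
# (here, the Euclidean midpoint, an illustrative choice).
struct Node
    children::Vector{Node}    # empty for leaves
    anchor::Vector{Float64}   # data for leaves; placeholder (e.g. zeros) for internal nodes
end
Node(anchor::Vector{Float64}) = Node(Node[], anchor)

function assign_anchors!(node::Node)
    isempty(node.children) && return node.anchor
    child_anchors = [assign_anchors!(c) for c in node.children]   # recurse first (post-order)
    node.anchor .= sum(child_anchors) ./ length(child_anchors)    # midpoint of children
    return node.anchor
end

# Usage: two leaves under a root; the root anchor becomes their midpoint.
root = Node([Node([1.0, 0.0]), Node([0.0, 1.0])], zeros(2))
assign_anchors!(root)   # root.anchor == [0.5, 0.5]
```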

Marginal path sampling adopts an Euler scheme: per-element split and deletion intensities are computed at each step and realized by sampling discrete events, yielding variable-length outputs dynamically during inference.
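
The inference loop can be sketched as follows in base Julia. Everything here is illustrative: `rates_and_velocity` is a hypothetical stand-in for the learned network, first-order Euler updates with Bernoulli event sampling are assumed, and branch-index bookkeeping is omitted.

```julia
# Hypothetical stand-in for the learned network: per-element split rates,
# deletion rates, and base-flow drifts (here, constant rates and drift
# toward the origin; a real model conditions on all elements and on t).
rates_and_velocity(model, elements, t) = (
    fill(model.split_rate, length(elements)),
    fill(model.del_rate, length(elements)),
    [-x for x in elements],
)

function sample_marginal_path(model, elements::Vector{Vector{Float64}}; nsteps=200)
    dt = 1.0 / nsteps
    for step in 1:nsteps
        t = (step - 1) * dt
        split_rates, del_rates, drifts = rates_and_velocity(model, elements, t)
        next = Vector{Vector{Float64}}()
        for (i, x) in enumerate(elements)
            rand() < del_rates[i] * dt && continue    # deletion event (first-order approx.)
            y = x .+ drifts[i] .* dt                  # Euler update of the base flow
            push!(next, y)
            rand() < split_rates[i] * dt && push!(next, copy(y))  # split: add a sibling
        end
        elements = next
    end
    return elements
end

# Usage: start from four standard-normal points in R^3.
final = sample_marginal_path((split_rate=0.3, del_rate=0.1),
                             [randn(3) for _ in 1:4])
```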

Empirical Evaluation

Small Molecule Generation: QM9

Branching Flows generates variable-length atom clouds with discrete atom types, outperforming transdimensional jump-diffusion models (Campbell et al., 2023) in matching the data distribution across numerous molecular properties (atom counts, weights, fingerprints). Quantitative assessment via Kolmogorov–Smirnov statistics demonstrates higher fidelity in property distributions. Generated molecules are visualized via UMAP embeddings and scrutinized for chemical validity (RDKit), demonstrating that the stochastic branching process successfully reproduces the diversity and structure of real data without a priori length constraints.
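
For reference, the two-sample Kolmogorov-Smirnov statistic underlying this assessment is simply the sup-distance between empirical CDFs; a naive, quadratic-time base-Julia sketch (illustrative only, with no significance test):

```julia
# Two-sample KS statistic: max |F_a(x) - F_b(x)| over the pooled samples.
function ks_statistic(a::Vector{Float64}, b::Vector{Float64})
    xs = sort(vcat(a, b))
    ecdf(v, x) = count(<=(x), v) / length(v)   # empirical CDF at x
    return maximum(abs(ecdf(a, x) - ecdf(b, x)) for x in xs)
end

# Usage: compare two property samples (e.g. molecular weights).
ks = ks_statistic(randn(1000), randn(1000) .+ 0.1)
```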

Discrete Sequence Generation: Antibody Sequences

Branching Flows achieves near-parity with oracle-length discrete flow matching models on antibody sequence generation (using non-redundant, length-diverse OAS data (Kovaltsuk et al., 2018)), measured both by an external perplexity metric from a large autoregressive language model and by distributional analyses of amino acid frequencies, CDR3 lengths, and overall sequence novelty. The framework captures strong positional dependencies without explicit length conditioning, implying that split and deletion event modeling suffices for distributional coverage in high-entropy discrete domains.

Multimodal Protein Generation

Fine-tuned from the ChainStorm architecture [OrestenChainStorm], Branching Flows enables concurrent stochastic generation of protein backbone frames (spatial positions and SO(3) rotations) and amino acid identities, supporting both unconditional generation and conditional design (e.g., unknown-length CDR loop insertion). Self-consistency TM-score assays, using ProteinMPNN for sequence design and Boltz2 for structure prediction, confirm that generated structures retain correct geometric and topological properties. The ability to insert arbitrary-length segments ("infix sampling") conditioned on fixed spatial context enables applications in protein engineering (interface design, flexible linker design) unattainable with previous fixed-length or autoregressive models.

Relation to Prior Work

Branching Flows structurally generalizes discrete-only edit-based flows ("Edit Flows"; Havasi et al., 2025) and jump-diffusion models for continuous states (Campbell et al., 2023), overcoming the limitations of insertion/deletion modeling in the continuous domain and sidestepping problematic assumptions about insertion distributions during dimension changes. Prior approaches such as DrugFlow (Schneuing et al., 2025) rely on virtual node augmentation, confining them to deletion-only mechanisms and fixed-length upper bounds, while Branching Flows naturally accommodates unbounded growth and arbitrary event composition.

The theoretical scaffolding connects rigorously to Generator Matching, enabling deep integration with established frameworks for learning marginal couplings between distributions, and ensuring that practical algorithms recover correct sample marginal distributions in expectation, even when marginalizing over event histories.

Implications and Future Directions

Branching Flows introduces a principled family of methods for flow matching in variable-length and multimodal domains, bridging a critical gap in generative modeling. The structure of the branching process affords flexibility in modeling tree-like or nested generative tasks, suggesting potential applications in code generation (syntax-driven tree sampling), document editing, and compositional design in bioinformatics and chemistry.

Fine-tuning from existing pre-trained models demonstrates computational efficiency and transferability, indicating that learned representations from fixed-length domains can be repurposed for stochastic event-driven frameworks with minimal architectural adaptation.

Further theoretical and empirical investigation will address optimization of tree and anchor sampling schemes, domain-specific event modeling, downstream benchmarking, and integration with rigorous scoring metrics in application domains (e.g., foldability, developability in protein design, chemical feasibility in molecule generation).

Conclusion

Branching Flows offers a robust, extensible approach for generative modeling over variable-length, multimodal spaces, providing stable learning dynamics, accurate marginal distribution matching, and new capabilities such as dynamic infix sampling. Its foundation in Generator Matching, rigorous event modeling, and compositional loss formulation positions it as a versatile tool for researchers addressing complex generative tasks in computational biology, chemistry, natural language, and beyond. Further refinement and benchmarking are anticipated to elucidate its practical advantages and extend its utility in emerging domains.
