BiDiFuse: Bidirectional Diffusion Fusion

Updated 17 September 2025
  • Bidirectional Diffusion Fusion (BiDiFuse) is a framework that enables two-way, context-aware feature integration in multi-modal diffusion models, ensuring coherent synthesis and accurate segmentation.
  • The method integrates dual branches—one for RGB appearance and one for semantic or part-based decomposition—using symmetric, learnable fusion operators to maintain spatial, temporal, and semantic consistency.
  • Combined with a contrastive part consistency loss in a dual-branch UNet architecture, BiDiFuse enhances cross-modal alignment and stability, benefiting applications in animation and motion tracking.

Bidirectional Diffusion Fusion (BiDiFuse) refers to a class of frameworks and architectural modules that enable two-way, context-aware feature integration within multi-branch diffusion or generative models. BiDiFuse is most commonly applied to scenarios involving multi-modal synthesis, semantic segmentation, kinematic part generation, and cross-modal fusion, where strong coherence between appearance data and structural, semantic, or part-level decompositions is crucial. The essential property of BiDiFuse is bidirectional exchange of intermediate features during the denoising (diffusion) process, typically realized with learnable fusion operators, which symmetrically integrate the representations produced by distinct branches (such as RGB generation and part segmentation) at multiple levels within the architecture. This results in enhanced alignment across spatial, temporal, and semantic axes.

1. Dual-Branch Diffusion Architecture and Feature Symmetry

BiDiFuse is implemented within a dual-branch UNet-backed structure. One branch is dedicated to high-frequency appearance synthesis (generating RGB image or video frames), and the other to structural or semantic decomposition (e.g., kinematic part maps). Although these branches operate with different objectives and loss functions, they are synchronized by their conditioning sequences—the same spatial, temporal, and view parameters.

The BiDiFuse module is inserted at corresponding encoder and decoder blocks of both branches. At each insertion point, the RGB and part feature maps are concatenated along the channel dimension and passed through a lightweight fusion function (typically two stacked 1×1 convolutions with ReLU). Formally,

h^{RGB}_{fused} = h^{RGB} + \mathcal{V}([h^{RGB}, h^{Part}])

h^{Part}_{fused} = h^{Part} + \mathcal{V}([h^{RGB}, h^{Part}])

where \mathcal{V} denotes the fusion operator and h^{RGB}, h^{Part} are intermediate features at the same resolution. This symmetric fusion ensures bidirectional information exchange, maintaining the modality-specific discriminative power while introducing cross-branch regularization.
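
The source does not include reference code, so the following PyTorch sketch of a symmetric fusion operator is only an illustration of the formulation above; the class name, channel handling, and 2D convolution choice are assumptions rather than details of the SP4D implementation.

```python
import torch
import torch.nn as nn


class BiDiFuse(nn.Module):
    """Symmetric bidirectional fusion of RGB and part branch features (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # Lightweight fusion function: two stacked 1x1 convolutions with ReLU,
        # mapping the concatenated features back to the branch channel width.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, h_rgb: torch.Tensor, h_part: torch.Tensor):
        # Concatenate along the channel dimension and compute the shared term V([h_RGB, h_Part]).
        v = self.fuse(torch.cat([h_rgb, h_part], dim=1))
        # Residual, symmetric update: both branches receive the same cross-modal signal.
        return h_rgb + v, h_part + v
```

Calling the module on RGB and part feature maps of matching shape applies one bidirectional update; in a video diffusion model the same idea would extend to spatio-temporal feature tensors.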

2. Cross-Modal Consistency and Alignment

A central motivation behind BiDiFuse is to ensure that the part segmentation branch benefits from appearance cues (e.g., edges, texture boundaries) provided by the RGB branch, while the RGB branch is implicitly guided by the motion and articulation structure encoded in the part branch. Without explicit feature sharing, part maps tend to drift from perceptual boundaries, and RGB frames lose coherence with physical articulation, especially under strong deformation or complex motion.

By integrating fused representations at every block, the two branches co-adapt, yielding outputs that are not only consistent in appearance but also aligned with true underlying kinematic structure. Ablation studies reported in the original work demonstrate that removing BiDiFuse drastically reduces alignment metrics such as mean Intersection-over-Union (mIoU) and Adjusted Rand Index (ARI), substantiating its role as a necessary cross-modal constraint.

3. Contrastive Part Consistency Loss

To further reinforce spatial and temporal identity alignment of kinematic parts, a contrastive part consistency loss is introduced. Part predictions, encoded as continuous RGB-like values, are aggregated into discrete part features across regions and frames. An InfoNCE-style contrastive objective ensures that embeddings associated with the same physical part (from different views or time steps) are maximally similar, while those from different parts are pushed apart.

\mathcal{L}_{contrast} = - \mathbb{E}_{i \in \mathcal{P},\, j \in \mathcal{P}_i^+} \log \frac{\exp(\mathrm{sim}(f_i, f_j)/\tau)}{\sum_{k \in \mathcal{P} \setminus \{i\}} \exp(\mathrm{sim}(f_i, f_k)/\tau)}

Here, f_i is the aggregated embedding of part i, \mathcal{P}_i^+ contains the indices of features affiliated with the same part, \mathrm{sim}(\cdot,\cdot) is cosine similarity, and \tau is a temperature hyperparameter. The effect is to enforce cross-view and cross-temporal consistency, producing part segmentations suitable for downstream animation and 3D rigging pipelines.
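
As a concrete illustration, a minimal PyTorch implementation of such an objective might look as follows. The batching convention (one aggregated embedding per part instance across views and frames), the integer part-identity labels, and the per-anchor averaging over positives are assumptions made for this sketch, not details taken from the source.

```python
import torch
import torch.nn.functional as F


def part_consistency_loss(features: torch.Tensor,
                          part_ids: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over aggregated part embeddings (illustrative sketch).

    features: (N, D) embeddings f_i, one per part instance across views/frames.
    part_ids: (N,) integer labels; equal labels mark the same physical part,
              so they define the positive set P_i^+ for each anchor i.
    """
    f = F.normalize(features, dim=1)               # unit norm -> dot product equals cosine similarity
    sim = (f @ f.t()) / temperature                # (N, N) scaled similarity matrix
    n = f.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=f.device)
    pos_mask = (part_ids.unsqueeze(0) == part_ids.unsqueeze(1)) & ~self_mask

    # Denominator runs over all k != i, so exclude the anchor itself.
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0 on the diagonal below

    # Average the positives' log-probabilities for each anchor that has at least one positive.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    valid = pos_mask.any(dim=1)
    return per_anchor[valid].mean()
```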

4. Practical Implementation and Training Dynamics

Within the Stable Part Diffusion 4D (SP4D) framework (Zhang et al., 12 Sep 2025), BiDiFuse modules are placed after every encoder and decoder block in both the RGB and part branches. Both branches share a latent VAE backbone and use a spatial color encoding for part masks, which allows efficient continuous mapping and easy recovery during post-processing. The dual-branch model is conditioned jointly on view (camera pose), time (frame index), and object instance.
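
To make the placement concrete, the structure below sketches one paired stage of such a dual-branch UNet, reusing the BiDiFuse module sketched in Section 1. The block interface, conditioning format, and class names are illustrative assumptions rather than the SP4D reference implementation.

```python
import torch.nn as nn


class DualBranchStage(nn.Module):
    """One paired encoder (or decoder) stage of the dual-branch UNet,
    followed by a BiDiFuse exchange (illustrative structure only)."""

    def __init__(self, rgb_block: nn.Module, part_block: nn.Module, channels: int):
        super().__init__()
        self.rgb_block = rgb_block        # appearance-branch block (e.g., a ResNet/attention block)
        self.part_block = part_block      # part-branch block with the same layout
        self.fuse = BiDiFuse(channels)    # fusion module sketched in Section 1

    def forward(self, h_rgb, h_part, cond):
        # Both branches see the same conditioning (camera pose, frame index, object instance).
        h_rgb = self.rgb_block(h_rgb, cond)
        h_part = self.part_block(h_part, cond)
        # Bidirectional exchange after every paired encoder/decoder block.
        return self.fuse(h_rgb, h_part)
```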

During training, the network is exposed to multi-view, multi-frame sequences from the KinematicParts20K dataset. Each mini-batch comprises paired RGB and part sequences, and optimization proceeds via denoising reconstruction losses for both modalities, with additional contrastive part consistency regularization. Empirical findings show that including BiDiFuse substantially improves cross-modal alignment and multi-view stability, and yields skeletons and skinning weights readily usable for animation.
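
A hedged sketch of the corresponding combined objective is shown below, reusing the part_consistency_loss function from the Section 3 sketch; the exact denoising prediction target and the weighting of the contrastive term are assumptions, not values reported in the source.

```python
import torch.nn.functional as F


def combined_loss(pred_rgb, target_rgb, pred_part, target_part,
                  part_embeddings, part_ids, lambda_contrast: float = 0.1):
    """Total training objective: denoising reconstruction for both branches
    plus the contrastive part-consistency regularizer (illustrative sketch)."""
    loss_rgb = F.mse_loss(pred_rgb, target_rgb)        # RGB-branch denoising loss
    loss_part = F.mse_loss(pred_part, target_part)     # part-branch denoising loss
    loss_consist = part_consistency_loss(part_embeddings, part_ids)  # Section 3 sketch
    return loss_rgb + loss_part + lambda_contrast * loss_consist
```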

5. Impact on Downstream Applications and Evaluation

SP4D’s BiDiFuse module enables outputs where 2D part maps can be robustly lifted to 3D, facilitating kinematic skeleton extraction and harmonic skinning weight computation with minimal manual input. The generated part decompositions exhibit strong spatial and temporal coherence even in challenging, rare pose scenarios and across real-world or synthetic data. Quantitative gains are observed in metrics such as mIoU, ARI, and part identity consistency compared to dual-branch models without bidirectional fusion.

In animation and motion tracking tasks, the outputs demonstrate smoother joint localization, fewer fragmented parts, and enhanced multi-view alignment, which simplifies rigging, editing, and re-animation. The technique is also broadly extensible, with plausible implications for any multi-modal generative model requiring strong coherence between structure and appearance.

6. Theoretical Context and Broader Significance

Bidirectional Diffusion Fusion constitutes a generalizable architectural principle for multi-branch generative or denoising models. Precise bidirectional feature sharing can be considered a technical superset of one-way cross-modal fusion, resolving longstanding issues in alignment, consistency, and coherence between modalities. This approach is supported by robust empirical evidence of performance gains and alignment improvements in challenging generative, segmentation, and reconstruction tasks.

The methodology also establishes a foundation for future research into multi-modal generative modeling of articulated or non-Euclidean structures, as well as for principled design of contrastive and cross-modal regularization losses. Given its success in SP4D, BiDiFuse is poised to serve as a template for bidirectional fusion in other contexts, including human motion synthesis, cross-modal perception, and multi-sensor data reconstruction.
