PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

Published 15 Apr 2026 in cs.CV | (2604.13863v1)

Abstract: Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.


Authors (10)

Summary

The paper introduces a diffusion-based synthesis framework that integrates assembly relationship constraints to realistically generate industrial anomaly images.
It employs multi-view feature decoupling and time-feature modulation with position-pose priors to ensure precise structural and textural fidelity.
Empirical results show superior performance in SSIM, LPIPS, and CLIP-I metrics, enhancing downstream detection with YOLOv5 in industrial settings.

PostureObjectstitch: Anomaly Image Generation with Assembly Relationship Modeling

Problem Formulation and Motivation

The rarity and diversity of industrial assembly anomalies, such as misassembled components, severely limit the effectiveness of supervised anomaly detection models. Traditional approaches to dataset construction are economically impractical and unable to exhaustively enumerate all possible errors. General image synthesis paradigms, including text-conditioned diffusion models and object composition frameworks, do not explicitly enforce pose, orientation, or assembly relationship constraints necessary for downstream detection robustness. These limitations prevent deployment of synthesized anomaly images in realistic industrial inspection pipelines. The "PostureObjectstitch" framework directly addresses this gap by introducing a generation protocol that encodes and preserves assembly relationship constraints during the diffusion-based synthesis of anomaly images.

Figure 1: High-level schematic of the PostureObjectstitch pipeline, detailing feature decomposition, temporal modulation, and semantic-geometry loss integration across time-steps.

Methodological Advances

Condition Decoupling and Feature Decomposition

PostureObjectstitch uses multi-view reference images to disentangle the foreground object representation into three channels: high-frequency (edges, contours), texture (local granular details), and RGB (global color and luminance). The channels are extracted independently using Sobel, Laplacian, Canny, and HOG pipelines, then encoded with CLIP for robust cross-modal feature alignment. Foreground segmentation removes distracting background content before feature extraction, maximizing foreground fidelity for downstream synthesis.
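As a rough illustration of the high-frequency branch, the following sketch masks the foreground and computes a Sobel gradient magnitude with plain NumPy. The function names (`sobel_magnitude`, `decouple_conditions`) and the dictionary layout are assumptions made for this example, not the authors' code; the texture branch and the CLIP encoding step are omitted.

```python
import numpy as np

def sobel_magnitude(gray: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude of a 2-D grayscale image (edge-padded)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Sliding-window cross-correlation; the kernel flip of a true
    # convolution only changes the sign, not the magnitude.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def decouple_conditions(rgb: np.ndarray, fg_mask: np.ndarray) -> dict:
    """Hypothetical helper: split a masked foreground into condition maps.

    rgb:     (H, W, 3) image
    fg_mask: (H, W) array, 1 inside the foreground and 0 elsewhere
    """
    foreground = rgb.astype(np.float64) * fg_mask[..., None]
    gray = foreground.mean(axis=2)
    return {
        "high_frequency": sobel_magnitude(gray),  # edges and contours
        "rgb": foreground,                        # global color and luminance
    }
```

Applying the mask before filtering keeps background edges out of the condition, at the cost of introducing an edge along the mask boundary itself.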

Feature Temporal Modulation

Because diffusion models generate images from coarse to fine, the method introduces a time-feature modulation scheme. Timestep encodings are combined with learned modulation parameters (scale/shift MLPs and fusion weights) to dynamically reweight and adapt each feature type throughout the denoising process. Early denoising steps prioritize structural outline and positional priors, while later steps emphasize detailed texture and RGB reconstruction. This adaptive scheduling maintains multi-level consistency across the generative progression: object morphology first, textural fidelity second.
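The modulation idea can be sketched as a timestep-conditioned scale and shift followed by a weighted fusion of the feature types. The sketch below is only an assumption about the general shape of such a module: it uses single random linear maps where the description mentions MLPs, and the class name, dimensions, and initialization are invented for illustration.

```python
import numpy as np

def timestep_embedding(t: int, dim: int = 16) -> np.ndarray:
    """Standard sinusoidal embedding of a diffusion timestep."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

class TimeFeatureModulator:
    """Hypothetical module: per-timestep scale/shift plus fusion weights."""

    def __init__(self, feat_dim: int, n_types: int = 3,
                 emb_dim: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.emb_dim = emb_dim
        # Random stand-ins for learned parameters.
        self.w_scale = rng.normal(0.0, 0.02, (emb_dim, feat_dim))
        self.w_shift = rng.normal(0.0, 0.02, (emb_dim, feat_dim))
        self.w_fuse = rng.normal(0.0, 0.02, (emb_dim, n_types))

    def fusion_weights(self, t: int) -> np.ndarray:
        """Softmax weights over feature types for timestep t."""
        logits = timestep_embedding(t, self.emb_dim) @ self.w_fuse
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def __call__(self, feats: np.ndarray, t: int) -> np.ndarray:
        """feats: (n_types, feat_dim) -> fused condition of shape (feat_dim,)."""
        emb = timestep_embedding(t, self.emb_dim)
        scale = emb @ self.w_scale
        shift = emb @ self.w_shift
        modulated = feats * (1.0 + scale) + shift
        return self.fusion_weights(t) @ modulated
```

In a trained model the fusion weights would be expected to favor the high-frequency features at high-noise steps and the texture and RGB features near the end of denoising; with random parameters the sketch only shows the data flow.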

Integration of Semantic and Geometric Assembly Constraints

To prevent semantic drift and geometric misplacement, the framework deploys two loss mechanisms:

Conditional Loss: The loss combines a variational lower bound term, an MSE term, and a binary-weighted OCR auxiliary loss. The OCR term is activated only when text is present on the foreground; it enforces semantic coherence between generated and reference text via a frozen OCR backbone during the low-noise denoising stages.
Figure 2: Auxiliary OCR loss structure, ensuring preservation of semantic textual content in synthesized anomaly regions.
Position-Pose Prior: High-frequency foreground features superimposed on the initial noise provide explicit positional and orientation cues. This ensures that assembly relationships and spatial accuracy are not compromised during denoising. At inference, these priors are retained in the noise initialization, maintaining assembly-plausible geometry.
Figure 3: Fusion procedure for pose and orientation prior information, constraining generated anomalies to physically valid assembly relationships.
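The two mechanisms above can be sketched as follows. Everything numeric here is an assumption for illustration: the plain sum of loss terms, the threshold `t_low`, the weight `lam`, the blending strength `alpha`, and the nearest-neighbor placement of the prior inside a bounding box are not taken from the paper.

```python
import numpy as np

def conditional_loss(vlb: float, mse: float, ocr: float,
                     has_text: bool, t: int,
                     t_low: int = 200, lam: float = 0.1) -> float:
    """Sum of VLB and MSE terms plus a binary-gated OCR term.

    The gate is 1 only when the foreground carries text AND the timestep
    is in the low-noise range (t < t_low); both values are hypothetical.
    """
    gate = 1.0 if (has_text and t < t_low) else 0.0
    return vlb + mse + lam * gate * ocr

def init_noise_with_prior(high_freq: np.ndarray, bbox: tuple,
                          shape: tuple, alpha: float = 0.3,
                          seed: int = 0) -> np.ndarray:
    """Gaussian noise with a high-frequency map added inside a target box.

    high_freq: (h0, w0) edge map of the posed foreground
    bbox:      (y0, x0, y1, x1) target placement in the noise tensor
    shape:     (C, H, W) shape of the initial noise
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)
    y0, x0, y1, x1 = bbox
    h, w = y1 - y0, x1 - x0
    # Nearest-neighbor resize of the edge map to the box size.
    rows = np.arange(h) * high_freq.shape[0] // h
    cols = np.arange(w) * high_freq.shape[1] // w
    patch = high_freq[np.ix_(rows, cols)]
    # Standardize so the prior has a scale comparable to unit noise.
    patch = (patch - patch.mean()) / (patch.std() + 1e-8)
    noise[..., y0:y1, x0:x1] += alpha * patch  # broadcast over channels
    return noise
```

Because the prior is added to the starting noise rather than injected as a separate condition, the same initialization can be reused unchanged at inference, which matches the description of the prior being retained in the noise.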

Dataset Construction and Benchmarking

DreamAssembly Dataset

PostureObjectstitch introduces DreamAssembly, a purpose-built dataset for assembly anomaly evaluation, partitioned into authentic industrial backgrounds and 25 canonical foreground components, each annotated under various pose and tilt configurations. Backgrounds span production lines, inspection stations, and equipment environments, ensuring domain coverage and realism.

Figure 4: Representation of DreamAssembly dataset with segmented backgrounds and foreground components, emphasizing pose diversity and contextual variation.

Evaluative Protocols

The evaluation uses DINO and CLIP-I for foreground similarity, complemented by SSIM and LPIPS computed on masked background regions to isolate background preservation. Baselines include ObjectStitch, AnyDoor, MureObjectStitch, and IC-Custom; PostureObjectstitch is reported to outperform them in pose accuracy, scene context preservation, and assembly fidelity.
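To make the protocol concrete, the sketch below shows the two ingredients in simplified form: cosine similarity between embedding vectors (the quantity behind CLIP-I and DINO scores, assuming the embeddings come from the respective encoders) and a background-only comparison in which the foreground region is blanked before scoring. The `global_ssim` function is a single-window simplification of SSIM, not the locally windowed version normally reported, and all helper names are assumptions.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def background_only(img: np.ndarray, fg_mask: np.ndarray,
                    fill: float = 0.0) -> np.ndarray:
    """Copy of img with the foreground region replaced by a constant."""
    out = np.array(img, dtype=np.float64, copy=True)
    out[fg_mask.astype(bool)] = fill
    return out

def global_ssim(x: np.ndarray, y: np.ndarray,
                data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole image (simplified)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def background_ssim(generated: np.ndarray, original: np.ndarray,
                    fg_mask: np.ndarray) -> float:
    """SSIM between two images after blanking the foreground in both."""
    return global_ssim(background_only(generated, fg_mask),
                       background_only(original, fg_mask))
```

Blanking the foreground in both images means a change confined to the inserted object does not affect the background score, which is the point of masking.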

Empirical Results and Visualizations

Quantitative Performance

On the MureCom dataset, PostureObjectstitch achieves the best CLIP-I (0.8143) and LPIPS (0.0645), with competitive DINO and SSIM scores. On DreamAssembly, it achieves the best SSIM (0.9821) and LPIPS (0.0230), while its CLIP-I and DINO scores trail only AnyDoor, whose higher scores are attributed to trivial pose copying at the cost of background context. Downstream, YOLOv5 trained on PostureObjectstitch-synthesized data outperforms all baselines (accuracy 0.8000, F1 0.8211), suggesting closer distributional alignment with real-world data.

Qualitative Analysis

Visual comparisons illustrate that PostureObjectstitch generates anomalies with precise assembly relationships, maintains geometric plausibility, and restores fine textural details, including text content, which is especially important in industrial scenes. Baseline outputs, by contrast, display pose errors or background collapse.

Figure 5: Comparative samples on MureCom, evidencing superior pose, detail fidelity, and semantic restoration in PostureObjectstitch outputs.

Figure 6: Comparative anomaly generations on DreamAssembly, highlighting preservation of assembly relationships and textual elements unique to PostureObjectstitch.

Component Ablation and Analysis

Ablation studies validate that feature decoupling and time modulation yield significant improvements in semantic and structural alignment. Pose priors further enhance LPIPS/SSIM, confirming their benefit for assembly relationship integrity. OCR auxiliary loss, while quantitatively marginal, provides substantial qualitative improvements in text restoration.

Theoretical and Practical Implications

PostureObjectstitch outlines a path toward physically aware anomaly synthesis suitable for industrial QA pipelines. Explicit modeling of pose, orientation, and semantic fidelity enables realistic simulation of assembly errors, supporting robust training of anomaly detectors where real data is scarce. In practice, this aids the development of automated inspection systems and augments training data for vision models in manufacturing.

Theoretically, the protocol demonstrates the impact of timestep-adaptive feature modulation, geometric priors, and semantic auxiliary losses in bridging the domain gap for industrial compositional image generation. Future extensions integrating 3D asset and point cloud priors are anticipated to further enforce physical constraints and generalize to multi-modal inspection platforms.

Conclusion

PostureObjectstitch introduces a specialized diffusion-based anomaly generation framework with assembly relationship modeling, outperforming generic methods in pose accuracy, scene context preservation, and downstream task utility. Its integration of feature decoupling, time-feature modulation, geometric and semantic loss design, and the DreamAssembly dataset establishes a rigorous foundation for industrial anomaly synthesis. The methodology advances the state-of-the-art in realistic, constraint-preserving anomaly generation for smart manufacturing, with future directions towards multi-modal, physically consistent asset synthesis and more intelligent inspection automation.
