- The paper presents USO, a unified framework that disentangles and recombines content and style features for both style-driven and subject-driven image generation.
- It employs a novel triplet curation pipeline and a DiT-based latent diffusion backbone to improve style fidelity and subject consistency, evidenced by superior CLIP-I, DINO, and CSD scores.
- Ablation studies confirm the critical role of style reward learning, hierarchical projection, and disentangled encoders in advancing multi-conditional generative modeling.
Unified Style and Subject-Driven Generation via Disentangled and Reward Learning: An Expert Analysis
Introduction and Motivation
The USO framework addresses a central challenge in conditional image generation: disentangling and recombining content and style features from reference images. Prior work has treated style-driven and subject-driven generation as distinct tasks, each with its own isolated disentanglement strategy. USO posits that these tasks are inherently complementary: learning which features to preserve for subject-driven generation simultaneously teaches the model which features to discard for style-driven generation, and vice versa. This insight motivates a unified approach that leverages cross-task co-disentanglement to mutually enhance both style and subject fidelity.
Figure 1: Joint disentanglement of content and style enables unification of style-driven and subject-driven generation within a single framework.
Cross-Task Triplet Curation Framework
USO introduces a systematic triplet curation pipeline, constructing datasets of the form ⟨style reference, de-stylized subject reference, stylized subject result⟩. This is achieved by training stylization and de-stylization experts atop a state-of-the-art subject-driven model, then filtering candidates with VLM-based metrics that enforce style similarity and subject consistency (a minimal filtering sketch follows Figure 2). The resulting dataset contains both layout-preserved and layout-shifted triplets, enabling flexible recombination of subjects and styles.
Figure 2: The cross-task triplet curation framework generates both layout-preserved and layout-shifted triplets for unified training.
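To make the filtering step concrete, below is a minimal sketch of threshold-based triplet filtering. The `style_score` and `subject_score` callables stand in for the paper's VLM-based metrics, and the threshold values are illustrative assumptions, not USO's settings.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    style_ref: str          # path to the style reference image
    subject_ref: str        # path to the de-stylized subject reference
    stylized_result: str    # path to the stylized subject image

def filter_triplets(candidates, style_score, subject_score,
                    style_thresh=0.5, subject_thresh=0.7):
    """Keep triplets whose stylized result is close enough to the style
    reference (style similarity) and to the subject reference (subject
    consistency). Both scorers are assumed to return values in [0, 1];
    the thresholds are illustrative, not the paper's."""
    kept = []
    for t in candidates:
        if (style_score(t.style_ref, t.stylized_result) >= style_thresh
                and subject_score(t.subject_ref, t.stylized_result) >= subject_thresh):
            kept.append(t)
    return kept
```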
Unified Customization Architecture
USO’s architecture is built on a DiT-based latent diffusion backbone, augmented with disentangled encoders for style and content. The training proceeds in two stages:
- Style Alignment Training: Style features are extracted using SigLIP embeddings, projected via a hierarchical projector, and aligned with the text token distribution. Only the projector is updated, ensuring rapid adaptation to style cues (see the projector sketch after Figure 3).
- Content-Style Disentanglement Training: Content images are encoded via a frozen VAE, and the DiT model is trained to disentangle content and style features explicitly. This design mitigates content leakage and enables precise control over feature inclusion/exclusion.
Figure 3: USO training framework: Stage 1 aligns style features; Stage 2 disentangles content and style; style-reward learning supervises both stages.
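To illustrate Stage 1, here is a minimal PyTorch sketch of a hierarchical projector: SigLIP features from several encoder depths are linearly projected into the text-token space and concatenated. The class name, layer count, and dimensions are assumptions for illustration, not USO's exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalProjector(nn.Module):
    """Projects SigLIP features from multiple encoder depths into the
    text-token embedding space. The default dimensions are illustrative."""
    def __init__(self, vision_dim=1152, text_dim=4096, num_levels=3):
        super().__init__()
        # One linear projection per selected SigLIP layer.
        self.proj = nn.ModuleList(
            nn.Linear(vision_dim, text_dim) for _ in range(num_levels)
        )
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, level_features):
        # level_features: list of [B, N, vision_dim] tensors, one per depth.
        tokens = [p(f) for p, f in zip(self.proj, level_features)]
        # Concatenate along the token axis so style tokens from every depth
        # can attend alongside the text tokens in the DiT.
        return self.norm(torch.cat(tokens, dim=1))
```

Consistent with Stage 1, only the projector's parameters would receive gradients; SigLIP and the diffusion backbone stay frozen.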
Style Reward Learning (SRL)
USO incorporates a style reward learning paradigm, extending flow-matching objectives with explicit style similarity rewards. The reward is computed using VLM-based or CSD metrics, and backpropagated to sharpen the model’s ability to extract and retain desired features. SRL is shown to improve both style fidelity and subject consistency, even in tasks not explicitly targeted during training.
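One way such a reward can be folded into a flow-matching objective is sketched below: the velocity prediction yields a one-step estimate of the clean latent, which a differentiable style scorer then rewards. The `style_reward` function and `lambda_srl` weight are assumptions; USO's exact formulation may differ.

```python
import torch

def srl_loss(model, x0, x1, cond, style_ref_emb, style_reward, lambda_srl=0.1):
    """Flow-matching loss plus a style-similarity reward term.

    x0: noise sample; x1: target latent; cond: conditioning tokens.
    style_reward must be differentiable (e.g. cosine similarity between
    style embeddings of the estimated latent and the style reference);
    the reward form and weighting here are illustrative."""
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(-1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1               # linear interpolation path
    v_target = x1 - x0                       # constant-velocity target
    v_pred = model(xt, t.flatten(), cond)    # predicted velocity

    fm = ((v_pred - v_target) ** 2).mean()   # standard flow-matching term

    # One-step estimate of the clean latent from the current prediction,
    # exposing the model's output to the reward scorer.
    x1_hat = xt + (1 - t) * v_pred
    reward = style_reward(x1_hat, style_ref_emb)  # higher = more on-style
    return fm - lambda_srl * reward.mean()
```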
Experimental Results
USO is evaluated on the newly introduced USO-Bench and on DreamBench, covering subject-driven, style-driven, and joint style-subject-driven tasks. Quantitative metrics include CLIP-I and DINO for subject consistency, CSD for style similarity, and CLIP-T for text-image alignment (a minimal CLIP-I sketch follows the results list).
- Subject-Driven Generation: USO achieves the highest DINO (0.793) and CLIP-I (0.623) scores, outperforming all baselines in subject fidelity.
- Style-Driven Generation: USO attains the highest CSD (0.557) and competitive CLIP-T (0.282), demonstrating superior style transfer capabilities.
- Style-Subject-Driven Generation: USO leads with CSD (0.495) and CLIP-T (0.283), supporting arbitrary combinations of subjects and styles.
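For reference, the sketch below shows how a CLIP-I-style score is commonly computed: cosine similarity between CLIP image embeddings of the reference and generated images. The checkpoint name is an assumption, DINO is computed analogously with a self-supervised ViT backbone, and USO's evaluation code may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is illustrative; the paper does not pin this exact model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i(reference_path: str, generated_path: str) -> float:
    """Cosine similarity between CLIP image embeddings of the subject
    reference and the generated image (the CLIP-I metric)."""
    images = [Image.open(p).convert("RGB")
              for p in (reference_path, generated_path)]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```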
Figure 4: USO model demonstrates versatile abilities across subject-driven, style-driven, and joint style-subject-driven generation.
Figure 5: Qualitative comparison on subject-driven generation; USO maintains subject identity and applies style edits robustly.
Figure 6: Qualitative comparison on style-driven generation; USO preserves global and fine-grained stylistic features.
Figure 7: Qualitative comparison on identity-driven generation; USO achieves high identity consistency and realism.
Figure 8: Qualitative comparison on style-subject-driven generation; USO supports layout-preserved and layout-shifted scenarios.
Figure 9: Radar charts from user evaluation show USO’s top performance in subject and style-driven tasks across multiple dimensions.
Ablation Studies
Ablation experiments confirm the necessity of each architectural component:
- SRL: Removing SRL sharply reduces both CSD and CLIP-I, indicating its critical role in maintaining style fidelity and subject consistency alike.
- Style Alignment Training: Omitting this stage degrades both style fidelity and text alignment.
- Disentangled Encoders: Using a single encoder for both style and content harms all metrics, underscoring the importance of explicit disentanglement.
- Hierarchical Projector: The hierarchical design yields the highest style similarity scores, outperforming the alternative projection strategies tested.
Figure 10: Ablation study of SRL; style reward learning enhances both style similarity and subject consistency.
Figure 11: Ablation study of USO; the hierarchical projector and disentangled encoders are essential for optimal performance.
Implementation Details and Resource Requirements
USO is implemented atop FLUX.1 dev and SigLIP pretrained models. Training is staged: style alignment (23k steps, batch 16, 768px), followed by content-style disentanglement (21k steps, batch 64, 1024px). LoRA rank 128 is used for parameter-efficient adaptation. The model supports high-resolution generation and flexible conditioning, but requires substantial GPU resources for training and inference at scale.
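For concreteness, a minimal sketch of rank-128 LoRA adaptation on a single linear layer follows; the scaling convention and initialization reflect common LoRA practice rather than a confirmed USO recipe.

```python
import math

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update
    W x + (alpha / r) * B(A(x)), as in standard LoRA. r=128 matches the
    rank reported for USO; alpha and the init are common defaults."""
    def __init__(self, base: nn.Linear, r: int = 128, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)   # updates start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In practice such wrappers would replace the attention and MLP projections of the DiT blocks, keeping only the low-rank factors (and the style projector) trainable.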
Practical and Theoretical Implications
USO’s unified approach enables free-form recombination of subjects and styles, supporting both layout-preserving and layout-shifting scenarios. This has direct applications in creative industries, personalized content generation, and advanced customization pipelines. The cross-task co-disentanglement paradigm may inform future research in multi-conditional generative modeling, suggesting that joint training on complementary tasks can yield emergent capabilities and improved generalization.
Future Directions
Potential avenues for further research include:
- Extending USO to video generation and multi-modal synthesis.
- Investigating more granular disentanglement of additional attributes (e.g., lighting, pose).
- Scaling the triplet curation framework to larger, more diverse datasets.
- Integrating reinforcement learning from human feedback for finer control over subjective quality metrics.
Conclusion
USO establishes a unified framework for style-driven, subject-driven, and joint style-subject-driven image generation, leveraging cross-task co-disentanglement and style reward learning. Empirical results demonstrate state-of-the-art performance across all evaluated tasks, with robust subject consistency, style fidelity, and text controllability. The framework’s modular design and training strategies provide a blueprint for future advances in multi-conditional generative modeling.