USO-Bench: Unified Style & Subject Benchmark
- USO-Bench is a unified benchmark designed to evaluate both style-driven and subject-driven image generation through precise, quantitative measures.
- It utilizes a triplet dataset—comprising a content image, a style image, and a stylized output—to ensure thorough assessment of subject consistency and style similarity.
- The framework supports pure subject, pure style, and joint style–subject tasks, employing metrics like CLIP-I, DINO cosine similarity, and CSD scores to guide model improvements.
USO-Bench is a unified benchmark specifically established to evaluate the joint capabilities of generative models in both style-driven and subject-driven image generation. It is designed to measure a model’s ability to reproduce desired artistic or photographic styles while simultaneously preserving subject identity and detailed content, advancing evaluation beyond existing benchmarks that focus on only one of these aspects.
1. Definition and Scope
USO-Bench is introduced as the first evaluation framework capable of jointly and quantitatively measuring both style similarity and subject fidelity within a single protocol. The benchmark is constructed to address a central challenge in generative modeling: the ability to disentangle and recombine content and style such that models can faithfully render an input subject in any target style while retaining both the visual identity of the subject and the formal characteristics of the style.
USO-Bench encompasses evaluations over three task regimes:
- Pure subject-driven generation: measuring the maintenance of subject identity without style changes.
- Pure style-driven generation: evaluating the correct transfer of stylistic features, independent of content rearrangement.
- Joint “style–subject” scenario: models are tasked with both style transfer and subject preservation, with added complexity through support for layout-preserved and layout-shifted triplets.
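As a quick reference, the mapping below pairs each regime with the metrics reported for it (see Sections 2 and 6). The grouping is an assumption inferred from the reported results rather than a prescribed configuration.

```python
# Illustrative mapping from USO-Bench's three task regimes to the metrics reported
# for each; the grouping is an assumption inferred from the results described below.
TASK_METRICS = {
    "subject_driven": ["CLIP-I", "DINO", "CLIP-T"],        # subject identity + prompt alignment
    "style_driven":   ["CSD", "CLIP-T"],                    # style transfer + prompt alignment
    "style_subject":  ["CLIP-I", "DINO", "CSD", "CLIP-T"],  # joint evaluation
}
```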
2. Benchmark Design and Metrics
USO-Bench leverages a comprehensive set of metrics to cover both principal axes of evaluation:
| Evaluation Aspect | Metric(s) | Description |
|---|---|---|
| Subject Consistency | CLIP-I, DINO (cosine similarity) | Fidelity of the generated image to the subject reference |
| Style Similarity | CSD score | Stylistic congruence between the output and the style reference |
| Text–Image Alignment | CLIP-T | Alignment between the text prompt and generated image content |
Subject consistency is measured through embedding similarity (using CLIP-I and DINO), while style similarity is quantitatively assessed via the CSD score. CLIP-T provides a complementary check of text–image semantic alignment.
Unlike prior benchmarks restricted to either subject or style transfer, USO-Bench requires models to optimize performance along both dimensions simultaneously, underscoring the significance of disentanglement and recomposition.
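As a concrete point of reference, the sketch below shows one way to compute the embedding-based metrics (CLIP-I, DINO, CLIP-T) with off-the-shelf encoders from the `transformers` library. The checkpoint names are illustrative choices rather than those prescribed by USO-Bench, and the CSD score is omitted because it relies on a dedicated style-descriptor model.

```python
# Hedged sketch of USO-Bench-style embedding metrics using off-the-shelf encoders.
# Checkpoints are illustrative; CSD requires a dedicated style-descriptor model.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino = AutoModel.from_pretrained("facebook/dinov2-base")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def clip_i(generated: Image.Image, subject_ref: Image.Image) -> float:
    """CLIP-I: cosine similarity of CLIP image embeddings (subject consistency)."""
    feats = clip.get_image_features(
        **clip_proc(images=[generated, subject_ref], return_tensors="pt"))
    return F.cosine_similarity(feats[:1], feats[1:]).item()

@torch.no_grad()
def dino_sim(generated: Image.Image, subject_ref: Image.Image) -> float:
    """DINO: cosine similarity of DINOv2 CLS-token embeddings."""
    out = dino(**dino_proc(images=[generated, subject_ref], return_tensors="pt"))
    cls = out.last_hidden_state[:, 0]          # CLS token per image
    return F.cosine_similarity(cls[:1], cls[1:]).item()

@torch.no_grad()
def clip_t(generated: Image.Image, prompt: str) -> float:
    """CLIP-T: cosine similarity between the generated image and its text prompt."""
    img_feat = clip.get_image_features(
        **clip_proc(images=generated, return_tensors="pt"))
    txt_feat = clip.get_text_features(
        **clip_proc(text=[prompt], return_tensors="pt", padding=True))
    return F.cosine_similarity(img_feat, txt_feat).item()
```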
3. Dataset Construction Methodology
The dataset underlying USO-Bench is assembled as a large-scale collection of triplets comprising a content image (subject reference), a style image (style reference), and a stylized output. The curation process follows a “subject-for-style” paradigm, orchestrated as follows:
- Aggregation of public datasets containing subject-driven samples (e.g., UNO-1M) and instruction-based editing samples (e.g., X2I2).
- Application of a stylization expert to synthesize the style image from the target, accurately capturing stylistic cues (e.g., brushwork, color palette).
- Use of a de-stylization expert to invert the stylized image, generating a photorealistic subject image for robust content capture.
- Employment of a VLM-based filter ensuring both high style similarity (between the stylized output and style reference) and subject consistency (between the target and de-stylized images).
The dataset includes layout-preserved and layout-shifted triplets, which broadens representational diversity and enables evaluation of models under both rigid and flexible subject positioning constraints.
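The following is a minimal, hypothetical sketch of this curation loop. The callables `stylize`, `destylize`, and `vlm_filter` stand in for the stylization expert, de-stylization expert, and VLM-based filter described above; their signatures and the filtering thresholds are assumptions, not taken from the released pipeline.

```python
# Hypothetical sketch of the "subject-for-style" triplet curation loop.
# stylize(), destylize(), and vlm_filter() are placeholders for the paper's experts;
# thresholds and signatures are illustrative.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Triplet:
    content_img: object   # photorealistic subject reference (de-stylized)
    style_img: object     # style reference (synthesized by the stylization expert)
    target_img: object    # stylized output used as the training target

def build_triplets(targets: Iterable,
                   stylize: Callable, destylize: Callable, vlm_filter: Callable,
                   style_thr: float = 0.8, subject_thr: float = 0.8) -> List[Triplet]:
    triplets = []
    for target in targets:                       # aggregated from public sources
        style_ref = stylize(target)              # capture brushwork / palette cues
        content_ref = destylize(target)          # recover a photorealistic subject
        scores = vlm_filter(target, style_ref, content_ref)
        # keep only triplets with high style similarity AND subject consistency
        if scores["style"] >= style_thr and scores["subject"] >= subject_thr:
            triplets.append(Triplet(content_ref, style_ref, target))
    return triplets
```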
4. Integration Into USO Model Training
USO-Bench directly influences the training and validation paradigm of the USO (Unified Style-Subject Optimized) model:
- In the Style-Alignment Training phase, SigLIP extractors replace standard VAEs to obtain multi-scale, semantically enriched style representations, which are projected via a lightweight Hierarchical Projector. The projected style tokens are concatenated with the text and latent tokens for integrated conditioning.
- In the Content–Style Disentanglement Training phase, the content image is additionally encoded by a frozen VAE, and the resulting content latent tokens are appended to form the full multi-modal input.
The triplet-based design and layout options encourage models to both isolate and recombine content and style without cross-domain leakage.
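A compact sketch of this conditioning scheme is given below, assuming pre-computed multi-scale SigLIP features, a frozen-VAE content latent, and a backbone that consumes one flat token sequence. `HierarchicalProjector` and the token ordering are assumptions for illustration, not USO's exact implementation.

```python
# Illustrative sketch of the multi-modal conditioning described above, assuming
# pre-computed multi-scale SigLIP features and a frozen-VAE content latent.
import torch
import torch.nn as nn

class HierarchicalProjector(nn.Module):
    """Project multi-scale SigLIP feature maps into the backbone's token space (sketch)."""
    def __init__(self, in_dims, token_dim):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, token_dim) for d in in_dims)

    def forward(self, feats):                    # feats: list of [B, N_i, d_i] tensors
        return torch.cat([p(f) for p, f in zip(self.projs, feats)], dim=1)

def build_condition(text_tokens, noisy_latent_tokens, style_feats, projector,
                    content_latent_tokens=None):
    """Concatenate text, noisy-latent, style (and optional content) tokens."""
    style_tokens = projector(style_feats)        # style-alignment conditioning
    seq = [text_tokens, noisy_latent_tokens, style_tokens]
    if content_latent_tokens is not None:        # content-style disentanglement stage
        seq.append(content_latent_tokens)        # from the frozen VAE encoder
    return torch.cat(seq, dim=1)                 # single token sequence for the backbone
```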
5. Style Reward-Learning Paradigm
USO’s training incorporates an explicit Style Reward-Learning (SRL) module:
- Computes a reward score based on style similarity between the generated and reference style images, via either a VLM-based filter or the CSD metric.
- Augments the standard training loss with a reward term, L_total = L_base + λ(t)·L_reward, where λ(t) transitions from 0 to 1 at a designated training step and controls the contribution of SRL.
- Alternates between a gradient-free inference step, in which rewards are collected, and back-propagation, in which the reward signal guides model updates toward higher style fidelity.
By sharpening the separation of style and content signals, SRL indirectly improves subject consistency as well.
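A minimal sketch of this objective follows, assuming a differentiable style scorer (for example, a CSD-like similarity network) so the reward term can be back-propagated directly; `model.sample`, `style_scorer`, `diffusion_loss`, and `switch_step` are hypothetical placeholders, and the alternation between gradient-free reward collection and back-propagation is simplified here.

```python
# Minimal sketch of the Style Reward-Learning (SRL) objective under the assumptions
# above; the paper's gradient-free reward collection step is simplified.
import torch

def srl_loss(model, batch, step, style_scorer, diffusion_loss, switch_step=10_000):
    """Standard denoising loss plus a style-reward term weighted by lambda(t)."""
    lam = 0.0 if step < switch_step else 1.0     # lambda(t): switches 0 -> 1 at switch_step
    base = diffusion_loss(model, batch)          # standard training objective
    if lam == 0.0:
        return base
    generated = model.sample(batch["prompt"], batch["style_ref"])   # inference pass
    reward = style_scorer(generated, batch["style_ref"])            # style similarity in [0, 1]
    return base + lam * (1.0 - reward.mean())    # penalize low style similarity
```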
6. Experimental Results and Benchmarks
Evaluation on USO-Bench demonstrates that USO attains state-of-the-art results along all measured axes:
- Subject-driven generation: Highest scores reported on subject consistency (CLIP-I and DINO metrics), with strong CLIP-T performance.
- Style-driven evaluation: Highest CSD scores achieved, along with robust text-image alignment.
- Joint style–subject tasks: Maintains subject identity and applies dense stylistic transfer under both layout-preserved and layout-shifted scenarios.
Comparisons in the benchmark tables indicate consistent outperformance over recent models such as UNO, REALCUSTOM++, and DEADiff across all relevant metrics. Qualitative studies further observe improvements in visual fidelity, detail retention, and adherence to textual prompts.
7. Accessibility and Implementation
USO-Bench and the USO model are available at https://github.com/bytedance/USO. Usage requires a deep-learning environment compatible with diffusion models, SigLIP, VAE backbones, and the optional SRL modules. With the accompanying documentation, users can reproduce benchmark evaluations, retrain or fine-tune models, and adapt the components to distinct generative modeling tasks. The repository includes explicit guidance on training regimes, hyperparameters, and inference, facilitating broad adoption and reproducibility.
USO-Bench thus provides a unified and rigorous toolset for developing and critically assessing models targeting simultaneous subject fidelity and style customization, establishing a new empirical foundation for research in disentangled visual generation (Wu et al., 26 Aug 2025).