OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

Published 16 Apr 2024 in cs.CV and cs.AI | (2404.10267v4)

Abstract: Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature hinders artists from creating consistent images of the same subject. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external restricted data or require expensive tuning of the diffusion model. For this issue, we propose a novel one-shot tuning paradigm, termed OneActor. It efficiently performs consistent subject generation solely driven by prompts via a learned semantic guidance to bypass the laborious backbone tuning. We lead the way to formalize the objective of consistent subject generation from a clustering perspective, and thus design a cluster-conditioned model. To mitigate the overfitting challenge shared by one-shot tuning pipelines, we augment the tuning with auxiliary samples and devise two inference strategies: semantic interpolation and cluster guidance. These techniques are later verified to significantly improve the generation quality. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory subject consistency, superior prompt conformity as well as high image quality. Our method is capable of multi-subject generation and compatible with popular diffusion extensions. Besides, we achieve a 4 times faster tuning speed than tuning-based baselines and, if desired, avoid increasing the inference time. Furthermore, our method can be naturally utilized to pre-train a consistent subject generation network from scratch, which will implement this research task into more practical applications. (Project page: https://johnneywang.github.io/OneActor-webpage/)

Summary

The paper introduces a cluster-conditioned guidance mechanism that ensures consistent character generation in diffusion models through targeted latent sub-cluster selection.
It tunes only a lightweight projector network while keeping the diffusion backbone frozen, giving about 4× faster tuning than tuning-based personalization baselines.
Empirical results demonstrate superior identity preservation and image quality while providing fine-grained control over consistency and diversity.

Motivation and Problem Formulation

Recent advances in text-to-image (T2I) generation with diffusion models have enabled high-quality visual synthesis from prompts. However, sampling in these models is stochastic, so the same character is rendered inconsistently across images, which constrains their use in multi-image or narrative visual tasks (e.g., storybooks, animation pipelines, advertising). Prior solutions either rely on external reference images for personalization (e.g., DreamBooth, Textual Inversion) or require costly tuning of the diffusion model, which limits scalability, generality, and the generation of novel characters. OneActor instead targets consistent character generation driven by prompts alone, decoupling the process from external data.

OneActor formalizes this task as finding a precise guidance mechanism such that denoising trajectories of the diffusion model are systematically driven to a particular identity sub-cluster within the feature space, ensuring that different samples, despite variable random seeds, always correspond to the same coherent character instance.

Figure 1: (a) Standard models generate heterogeneous "hobbits" from various identity sub-clusters under different prompts/noises. (b) OneActor achieves deterministic sampling to a target identity sub-cluster after minimal tuning.

Cluster-Conditioned Generative Framework

The OneActor paradigm rests on a mathematical formalization of character consistency: each character is associated with a latent sub-cluster within the generative space of the diffusion model. From a user-supplied prompt, $N$ base images are first generated. One preferred sample is selected as the target, while the remaining $N-1$ serve as auxiliary samples that keep the cluster guidance well constrained.
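The selection step can be illustrated with a minimal sketch. The function name and the file names are hypothetical and not taken from the paper; the only point is that one chosen image becomes the target and every other image becomes an auxiliary sample.

```python
from typing import List, Sequence, Tuple


def split_target_and_auxiliary(
    samples: Sequence[str], target_index: int
) -> Tuple[str, List[str]]:
    """Return the chosen target and the remaining samples as auxiliaries."""
    if not 0 <= target_index < len(samples):
        raise IndexError("target_index must point at one of the samples")
    target = samples[target_index]
    # Every sample except the chosen one is kept as an auxiliary sample.
    auxiliary = [s for i, s in enumerate(samples) if i != target_index]
    return target, auxiliary


# Hypothetical file names standing in for 4 images generated from one prompt.
base_images = ["hobbit_0.png", "hobbit_1.png", "hobbit_2.png", "hobbit_3.png"]
target, auxiliary = split_target_and_auxiliary(base_images, target_index=2)
```

With 4 base images this yields one target and 3 auxiliary samples.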

The pipeline employs a modular, lightweight guidance module: a projector network (ResNet-based, followed by linear and LayerNorm layers) that operates on features extracted by the U-Net. Only this projector is tuned; the diffusion backbone stays entirely frozen, which mitigates overfitting and preserves the manifold geometry of the latent space.

Figure 2: OneActor architecture: a latent encoder (frozen U-Net extractor) and projector jointly generate cluster guidance; batched tuning with target and auxiliary samples ensures robust cluster assignment.

The core generative update modifies the usual classifier-free guidance (CFG) formula to include cluster affinities. Explicitly, denoised predictions are biased towards the target sub-cluster and repelled from auxiliary clusters via a cluster-based score function:

$\epsilon_{\boldsymbol{\theta}}(z_t, t) + \eta_1 \left[ \epsilon_{\boldsymbol{\theta}}(z_t, t, S^{tar}) - \epsilon_{\boldsymbol{\theta}}(z_t, t)\right] - \eta_2 \sum_{i=1}^{N-1} \left[ \epsilon_{\boldsymbol{\theta}}(z_t, t, S^{aux}_i) - \epsilon_{\boldsymbol{\theta}}(z_t, t)\right]$

where $S^{tar}$ and $S^{aux}_i$ ($i = 1, \dots, N-1$) are the semantic representations produced by the projector network for the target and auxiliary samples, and $\eta_1, \eta_2$ control the strengths of the attraction and repulsion terms, respectively.
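As a rough illustration of how the terms in this formula combine, the sketch below applies it element-wise to plain Python lists. The three noise predictions are assumed to have been computed already by the denoising network; the function and argument names are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Sequence


def cluster_guided_noise(
    eps_base: Sequence[float],
    eps_target: Sequence[float],
    eps_aux: Sequence[Sequence[float]],
    eta1: float,
    eta2: float,
) -> List[float]:
    """Combine noise predictions following the cluster-guided formula.

    eps_base   : eps(z_t, t), prediction without a cluster condition
    eps_target : eps(z_t, t, S^tar)
    eps_aux    : eps(z_t, t, S^aux_i) for each auxiliary sample
    """
    guided = []
    for k, base in enumerate(eps_base):
        attract = eps_target[k] - base                  # pull toward the target
        repel = sum(aux[k] - base for aux in eps_aux)   # summed auxiliary offsets
        guided.append(base + eta1 * attract - eta2 * repel)
    return guided


# Toy two-element "noise predictions" with made-up values.
result = cluster_guided_noise(
    eps_base=[0.0, 1.0],
    eps_target=[1.0, 1.0],
    eps_aux=[[0.5, 1.0], [-0.5, 2.0]],
    eta1=2.0,
    eta2=0.5,
)
print(result)  # [2.0, 0.5]
```

Setting both strengths to zero returns the base prediction unchanged, which is a quick sanity check on the sign conventions.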

Because only the projector is tuned, tuning is reported to be about $4\times$ faster than conventional tuning-based pipelines.

Semantic Interpolation as Generative Control

A key theoretical contribution is the demonstration that the semantic embedding space entangled with the denoising network exhibits the same controllable interpolation properties as the latent space itself. By varying the semantic offset scaling applied to the cluster embedding, OneActor provides continuous control over both character consistency and generative diversity. This property is rigorously substantiated through controlled semantic and latent interpolations, which result in matched effects on output images.

Figure 3: Semantic and latent interpolation scales both yield predictable, monotonic adjustments to generated content, confirming the entanglement and controllability of the semantic space.

By adjusting the scale parameter $v$ for the offset in the base word embedding, practitioners can fine-tune the trade-off between strict identity consistency and creative diversity.
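The effect of $v$ can be sketched as a scaled offset added to the base word embedding. The additive form below is an assumption inferred from the description above, and the function name and toy values are hypothetical.

```python
from typing import List, Sequence


def interpolate_embedding(
    base: Sequence[float], offset: Sequence[float], v: float
) -> List[float]:
    """Return base + v * offset, element-wise.

    v = 0 keeps the plain word embedding; larger v moves further toward
    the learned cluster-specific embedding.
    """
    if len(base) != len(offset):
        raise ValueError("base and offset must have the same length")
    return [b + v * d for b, d in zip(base, offset)]


base_embedding = [1.0, 2.0]      # toy stand-in for the base word embedding
cluster_offset = [0.5, -1.0]     # toy stand-in for the learned semantic offset

no_guidance = interpolate_embedding(base_embedding, cluster_offset, v=0.0)
half_guidance = interpolate_embedding(base_embedding, cluster_offset, v=0.5)
full_guidance = interpolate_embedding(base_embedding, cluster_offset, v=1.0)
```

Intermediate values of `v` trace a straight line between the two endpoints, which is what makes the scale usable as a single consistency-versus-diversity knob.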

Empirical Validation

Comprehensive experiments are conducted on the SDXL backbone, benchmarking against Textual Inversion, DreamBooth, IP-Adapter, BLIP-Diffusion, and TheChosenOne. Evaluation spans visual inspection, CLIP-based identity/prompt similarity, and a large user study.

Results show that OneActor:

Establishes a new Pareto frontier on the character consistency vs. prompt conformity plane—surpassing all encoder-based or personalization baselines in balanced utility.
Outperforms TheChosenOne in maintaining fine-grained consistent features (such as clothing details), while being significantly more time-efficient (average 5 minutes vs. 20 minutes for TheChosenOne).

Figure 4: Qualitative comparison—OneActor maintains consistent identity, high image quality, and prompt alignment where other baselines falter due to weak tuning or overfitting.

Figure 5: Comparison with TheChosenOne—OneActor achieves superior detail fidelity and tuning efficiency.

Quantitative results on identity and prompt similarity metrics—using CLIP-based cosine similarity—confirm OneActor's overall dominance. A user study (N=500 evaluations) indicates clear preference for OneActor's results on consistency, diversity, and prompt adherence, aligned with quantitative outcomes.
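For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from precomputed embeddings. The aggregation (mean pairwise similarity between images for identity, mean image-to-prompt similarity for conformity) is an assumption, not a detail confirmed by the summary, and the embeddings are toy vectors rather than real CLIP outputs.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def identity_consistency(image_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over all pairs of image embeddings."""
    pairs = list(combinations(image_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two images")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def prompt_similarity(
    image_embeddings: Sequence[Sequence[float]],
    text_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between each image and its own prompt."""
    if not image_embeddings or len(image_embeddings) != len(text_embeddings):
        raise ValueError("need one text embedding per image")
    scores = [
        cosine_similarity(img, txt)
        for img, txt in zip(image_embeddings, text_embeddings)
    ]
    return sum(scores) / len(scores)


# Toy embeddings: two images that agree and one that is orthogonal to them.
images = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
prompts = [[2.0, 0.0], [0.0, 3.0], [0.0, 1.0]]
identity_score = identity_consistency(images)
prompt_score = prompt_similarity(images, prompts)
```

Higher values on both scores are better, and a method that raises one while lowering the other is trading consistency against prompt conformity.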

Figure 6: Left—OneActor (OA) dominates in both CLIP-based identity and prompt similarity. Right—user preference for OA across all task dimensions.

Ablation and Analysis

An ablation study demonstrates the importance of batch-based tuning with auxiliary samples and the average guidance strategy. The inclusion of all loss components is necessary to optimize simultaneously for stability, diversity, and consistency. Analysis of the semantic interpolation parameter $v$ corroborates the theoretically postulated consistency-diversity tradeoff, with $v=0.8$ empirically validated as optimal for SDXL.

Figure 7: Left—Progressive integration of loss terms results in monotonic improvements. Right—tuning semantic scale $v$ enables controlled adjustment of consistency and diversity.

Implications and Future Directions

OneActor provides an efficient, robust, and theoretically grounded solution for prompt-driven, consistent character image generation. The cluster-conditional score objective and minimal-tuning projector framework generalize well and avoid the pitfalls of model overfitting and identity drift endemic to traditional personalization.

Practical implications include deployment in high-throughput creative pipelines, story visualization, advertising, and interactive design tools, where both speed and semantic reliability are critical. The demonstrated semantic interpolation property opens new avenues in controllable generation and fine-grained style or attribute transfer. The lightweight tuning regime suggests compatibility with user-in-the-loop applications.

Theoretically, this work opens a promising line of research on leveraging latent cluster geometry for conditional generation tasks and establishes a blueprint for future methods that unify embedding-space manipulation with generative control in large-scale diffusion models.

Conclusion

OneActor advances the state of prompt-driven, consistent text-to-image generation by introducing a cluster-conditioned architecture with a formally grounded cluster-guided score function. Through efficient projector-based tuning and semantic space interpolation, it achieves superior character consistency, prompt conformity, and image quality, while tuning about 4× faster than tuning-based baselines. This paradigm sets a new operational standard for scalable and controllable character-consistent generation (2404.10267).