Cross-Character Augmentation (CCA) for Mixed Domain Modeling

Updated 8 October 2025
  • Cross-Character Augmentation (CCA) is a data-centric method that generates synthetic scenes where characters from different domains interact, overcoming non-coexistence and style delusion issues.
  • CCA extracts and composites segmented characters onto complementary backgrounds, ensuring identity invariance and behavioral consistency through precise annotation and filtering.
  • Empirical evaluations show that moderate augmentation ratios significantly improve identity, motion, style, and interaction fidelity in generative models.

Cross-Character Augmentation (CCA) is a data-centric methodology for enriching generative and analytic models with synthetic examples depicting interactions between otherwise non-coexistent characters, typically from diverse stylistic or semantic domains. Originating in the domain of text-to-video generation, CCA constructs training scenarios in which character-specific appearance, behavioral logic, and interaction sequences are preserved and controlled despite cross-domain compositing challenges. The technique specifically addresses the “non-coexistence” and “style delusion” problems in generative modeling, enabling systems to produce outputs in which heterogeneous characters interact naturally without compromising their distinct stylistic or semantic identities (Liao et al., 6 Oct 2025).

1. Motivation and Problem Statement

Text-to-video and related multimodal generation models trained on real-world media face a key limitation: their data contains no examples where characters from different source domains (e.g., cartoons and live-action series) interact. As a result, the default behavior of such models is to avoid generating mixed-domain interactions, or, when forced, to blend stylistic features in ways that degrade both identity and style fidelity (the “style delusion” phenomenon). The resulting generations are typically incoherent—characters adopt inappropriate styles or lose defining behaviors, rendering creative cross-universe storytelling or domain adaptation infeasible. CCA is devised to overcome this limitation by synthetically augmenting the training set with co-existence and mixed-style interaction data where each character’s identity and logic are preserved by construction (Liao et al., 6 Oct 2025).

2. Synthetic Cross-Domain Compositing

The central procedural step in CCA is the formation of synthetic training clips wherein characters, originally disjoint, are assembled into joint scenes. This is performed as follows:

  • Segmentation: Each character is isolated from its source media using a segmentation model such as SAM2. For live-action entities, reference-image matching ensures alignment to canonical appearances; for animated entities, multimodal models such as Gemini perform filtering and selection.
  • Compositing: The segmented character is composited into a background sampled from the opposing style domain (cartoon or realistic), using the segmentation mask to seamlessly place the character while preserving spatial context.
  • Filtering: Only those composite samples where both segmentation and style-matching pass strict filtering criteria are retained for subsequent training, thereby mitigating label noise and style confusion.

This augmentation pipeline enables the model to observe visual, geometric, and semantic juxtapositions absent from the original training corpus, including multi-character, multi-style, and cross-universe interactions (Liao et al., 6 Oct 2025).
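
As an illustration of the compositing step above, the following minimal Python sketch alpha-blends a segmented character onto a cross-domain background using its mask. The file names are hypothetical, the mask is assumed to come from an upstream segmentation model such as SAM2, and the simple alpha blend stands in for whatever compositing operation the paper uses.

import numpy as np
from PIL import Image

def composite_character(character_rgb, mask, background_rgb):
    """Paste a segmented character onto a cross-domain background.

    character_rgb, background_rgb: HxWx3 uint8 arrays of equal size.
    mask: HxW float array in [0, 1] from a segmentation model (e.g., SAM2).
    """
    alpha = mask[..., None].astype(np.float32)   # HxWx1 blending weight
    out = alpha * character_rgb + (1.0 - alpha) * background_rgb
    return out.astype(np.uint8)

# Hypothetical file names, for illustration only.
char = np.asarray(Image.open("cartoon_character.png").convert("RGB"))
bg = np.asarray(Image.open("realistic_background.png").convert("RGB"))
mask = np.load("sam2_mask.npy")                  # HxW, values in [0, 1]
Image.fromarray(composite_character(char, mask, bg)).save("composite.png")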

3. Caption and Context Tag Enrichment

To ensure that the model can leverage the synthetic compositing for identity preservation and style control, each augmented sample is annotated with a structured caption. The format is:

[Character: <name>], <action>.
[Character: <name>], <action>.
[scene-style: cartoon/realistic]

These explicit annotations allow downstream conditioning mechanisms to disentangle character identity from scene style and serve as supervisory signals that guide the generator towards faithful visual and behavioral outputs under domain-mixed conditions. The style tag (“[scene-style: ...]”) is especially important for controlling stylistic appearance and preventing spill-over of artistic traits between source and background (Liao et al., 6 Oct 2025).
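
For illustration, captions in this template can be assembled with a small helper like the one below; this is a hypothetical utility, not code from the paper.

def build_caption(characters, scene_style):
    """Build a structured CCA caption.

    characters: list of (name, action) pairs.
    scene_style: "cartoon" or "realistic".
    """
    lines = [f"[Character: {name}], {action}." for name, action in characters]
    lines.append(f"[scene-style: {scene_style}]")
    return "\n".join(lines)

# Example:
print(build_caption([("Alice", "waves at Bob"), ("Bob", "laughs")], "cartoon"))
# [Character: Alice], waves at Bob.
# [Character: Bob], laughs.
# [scene-style: cartoon]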

4. Identity and Behavior Preservation

One of the principal aims of CCA is to prevent degradation of core character features (e.g., a live-action face becoming cartoonish, or vice versa) during cross-domain interaction synthesis. By supervising on augmented scenes where identity, behavioral logic, and interaction events are defined by construction (e.g., via segmentation reference matching and explicit action annotation), the training procedure compels the model to:

  • Learn to maintain identity invariance even when the context violates training-time co-existence statistics.
  • Model behavioral consistency for each character under new interaction regimes.
  • Resist style delusion through exposure to both in-domain and out-of-domain contexts with precise ground-truth supervisory signals (Liao et al., 6 Oct 2025).

This mechanism generalizes to arbitrary combinations of domain, character, and context, making it possible to realize novel generative tasks such as blending hand-crafted characters with data-driven protagonists.

5. Training Process and Pseudo-Algorithm

CCA is implemented as a modular preprocessing and data-injection pipeline alongside the standard training loop for generative models. The process operates according to the following schema:

Algorithm: Cross-Character Augmentation (CCA)
Input: Video dataset V, segmentation model SAM2, background set B, filtering criteria F, augmentation ratio r
Output: Combined training set V ∪ V_aug

V_aug ← ∅
for each video v in V:
    for each character c detected in v:
        extract mask M_c = SAM2(c)
        choose background b ∈ B from the complementary domain
        composite c onto b using mask M_c → v_synth
        if v_synth passes filtering criteria F:
            generate caption with tags [Character: c], <action>; [scene-style: ...]
            with probability r, add v_synth to V_aug
return V ∪ V_aug

In training, the augmented set V_aug is blended with the original data at the chosen ratio. Fine-tuning updates only lightweight adapter parameters (e.g., LoRA modules) while the main backbone remains frozen, limiting overfitting and preserving generalization (Liao et al., 6 Oct 2025).
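
As a concrete sketch of this blending-and-freezing step, the following Python/PyTorch fragment draws synthetic CCA samples with probability r and trains only adapter parameters. It is an assumed reconstruction rather than the authors' code; the "lora" substring used to identify adapter weights is a hypothetical naming convention, and model, original_set, and augmented_set are assumed to exist.

import random
import torch

def mixed_batches(original, augmented, r):
    """Yield training samples, drawing a synthetic CCA sample with probability r."""
    while True:
        if augmented and random.random() < r:
            yield random.choice(augmented)   # synthetic cross-domain clip
        else:
            yield random.choice(original)    # real single-domain clip

def freeze_backbone_except_adapters(model):
    """Freeze all weights except LoRA-style adapter parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name.lower()  # assumed naming convention
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch (hypothetical training hook):
# trainable = freeze_backbone_except_adapters(model)
# opt = torch.optim.AdamW(trainable, lr=1e-4)
# for step, sample in zip(range(10_000), mixed_batches(original_set, augmented_set, r=0.1)):
#     loss = model.training_step(sample)
#     loss.backward(); opt.step(); opt.zero_grad()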

6. Empirical Evaluation and Ablation

Comprehensive experiments conducted on a mixed-domain video dataset confirm the effectiveness of CCA. Key results include:

  • Augmentation Ratio Tuning: Moderate synthetic ratio values (e.g., 5–10%) yield the best tradeoff between interaction quality and style/identity preservation; over-augmentation (e.g., 20%) can negatively impact performance by introducing excessive sampling bias.
  • Qualitative Impact: Ablation studies reveal that with no CCA (0%), style and identity are poorly preserved; with increasing CCA, both improve, but with diminishing returns at high augmentation levels.
  • Metrics: Models trained with CCA achieve substantial gains in Identity-P (character visual fidelity), Motion-P (behavioral realism), Style-P (stylistic consistency), and Interaction-P (multi-agent coherence) compared to non-augmented baselines.

The improvement is especially pronounced in generating videos of multi-character interactions where non-coexistence and style blending present the greatest challenges (Liao et al., 6 Oct 2025).

7. Implications and Generalization

CCA enables generative models to synthesize interactions between semantically or stylistically distinct entities without loss of fidelity or behavioral logic. While demonstrated for video generation and storytelling across live-action and cartoon domains, the core methodology generalizes to any context where cross-domain, cross-identity, or cross-style interaction is desired but not represented in the original corpus. Related cross-domain techniques include:

  • Multimodal retrieval and cross-domain annotation (via kernel mapping and regularized CCA (Mroueh et al., 2015)).
  • Machine reading comprehension and rare entity representation (via character-level composition (Zhang et al., 2018)).
  • Multi-view alignment and robust representation learning for data with non-overlapping domains (via adversarial and marginalization-based strategies (Shi et al., 2020)).

A plausible implication is the extension of CCA-like augmentation pipelines to non-visual domains (e.g., cross-script OCR, cross-speaker TTS) and to improving model robustness on combinatorial generalization tasks.


In summary, Cross-Character Augmentation (CCA) is a principled, effective scheme for addressing the combinatorial and stylistic gaps created by mutual exclusivity of training sources in generative and analytic modeling pipelines. By introducing synthetic, well-annotated, and style-robust co-existence samples, CCA unlocks new capabilities for coherent and identity-preserving cross-domain generation and interaction modeling (Liao et al., 6 Oct 2025).
