Controllable Generation (CAR) Overview
- Controllable Generation (CAR) is a framework where explicit control signals guide generative models to synthesize data across diverse modalities.
- It employs techniques such as direct attribute injection, cross-attention, and posterior regularization to ensure precise alignment with user-specified attributes.
- CAR enhances efficiency and local editability in applications including vision, text, and structured data, paving the way for scalable, plug-and-play control.
Controllable Generation (CAR) refers to a family of generative modeling paradigms in which the output is guided or constrained by explicit control signals, attributes, or properties supplied by the user or system. The goal is to endow deep generative models—across modalities such as images, text, music, graphs, and structured scenes—with the ability to reliably synthesize data matching desired attributes, often with precision, diversity, and interpretability exceeding what is achievable with standard unconditional or loosely conditioned models.
1. Foundations and Formal Definitions
CAR is formally situated in conditional generative modeling, where samples are drawn from a conditional distribution $x \sim p(x \mid c)$, with $c$ denoting the desired control (label, attribute, structure, style, or even latent prompt) (Wang et al., 2022). The practical objective is to learn $p_\theta(x \mid c)$ such that generated data approximates $p_{\text{data}}(x \mid c)$ for a variety of $c$, while allowing efficient sampling, precise attribute matching, and (when possible) interpretable or editable control.
Multiple statistical frameworks instantiate this principle:
- Conditional VAEs: $p_\theta(x \mid z, c)$, with encoder/decoder models fed the control $c$ (Wang et al., 2022).
- Conditional GANs: $G(z, c)$ and $D(x, c)$, with the control $c$ appended to input and auxiliary attribute classification (Wang et al., 2022, Tann et al., 2020).
- Conditional diffusion and flow models: Infuse the control $c$ into each denoising or flow step, often via concatenation, cross-attention, or FiLM modulation (Wang et al., 2022, Ma et al., 22 Jul 2025, Bokhovkin et al., 2024).
- RL-based approaches: Optimize a policy to maximize property-aligned reward (Wang et al., 2022).
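As a worked illustration of the first two frameworks, the standard textbook objectives can be written out. This is a sketch in assumed notation, not the formulation of any cited paper: $x$ is data, $c$ the control, $z$ a latent variable, $q_\phi$ an encoder, $p_\theta$ a decoder, and $G$, $D$ a generator and discriminator.

$$
\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x, c)\,\|\,p(z \mid c)\big)
$$

$$
\min_G \max_D \;\; \mathbb{E}_{(x, c)}\big[\log D(x, c)\big] \;+\; \mathbb{E}_{z,\, c}\big[\log\big(1 - D(G(z, c), c)\big)\big]
$$

In both cases the control $c$ enters every conditional term, which is what distinguishes these objectives from their unconditional counterparts.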
Evaluation targets controllability (attribute accuracy), quality (FID, BLEU, MOS), diversity, and realism (Wang et al., 2022, Ma et al., 22 Jul 2025).
2. Mechanisms and Architectures for Control
CAR systems implement control using varied architectural and algorithmic strategies, often tailored to the generator backbone:
2.1 Direct Attribute Injection
- Input fusion: Concatenate or combine the control $c$ with data or latent input (e.g., image, text, graph) at encoder, decoder, or all stages. Examples include concatenating $c$ at each scale or step in autoregressive models (e.g., CAR for images (Yao et al., 2024)) and across generator input and output in cGANs (Wang et al., 2022, Tann et al., 2020).
- FiLM and gating: Apply control via feature-wise linear modulation layers, adjusting scale and bias of intermediate activations as a function of the control $c$ (Ma et al., 22 Jul 2025).
- Cross-attention: Condition via multi-head attention using the projected control $c$ as keys/values (scene text, reference embeddings, attributes, etc.) (Bokhovkin et al., 2024, Yao et al., 2024, Wei et al., 14 Mar 2025).
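The FiLM and cross-attention mechanisms above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the implementation of any cited system: the function names, the linear parameterization of the FiLM scale and shift, and the single-head attention without learned projections are all hypothetical simplifications.

```python
import math


def film(features, control, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out[i] = gamma[i] * features[i] + beta[i].

    gamma and beta are assumed to be linear functions of the control vector.
    """
    gamma = [sum(w * c for w, c in zip(row, control)) + b
             for row, b in zip(w_gamma, b_gamma)]
    beta = [sum(w * c for w, c in zip(row, control)) + b
            for row, b in zip(w_beta, b_beta)]
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention.

    The query comes from the generator; keys and values are assumed to be
    control tokens that have already been projected.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# FiLM: a two-dimensional control rescales and shifts two features.
modulated = film(
    features=[1.0, 2.0],
    control=[1.0, 0.0],
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[1.0, 0.0], [0.0, 0.0]], b_beta=[0.0, 0.5],
)
print(modulated)  # [3.0, 2.5]

# Cross-attention: the query attends more strongly to the matching control token.
attended, attn_weights = cross_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(attn_weights)
```

FiLM changes activations globally per feature channel, while cross-attention lets each generator position select which control tokens matter, which is one reason the two are often combined.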
2.2 Posterior Regularization and Guidance
- Posterior-guided control: Use variational bounds and auxiliary classifiers to regularize output toward meeting control constraints (e.g., ELBO for CVAEs, classifier loss for conditional GANs and diffusion) (Wang et al., 2022, Zhou et al., 7 Oct 2025).
- Oracle-guided decoding: NADO converts sequence-level oracles into stepwise guidance scores, reweighting token-level predictions to enforce properties in black-box LLMs (Meng et al., 2022).
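The idea of reweighting token-level predictions with stepwise guidance scores can be illustrated with a toy example. This is a generic sketch, not NADO's actual algorithm: the function name `guided_distribution` and the hand-written guidance values are hypothetical, and in a real system the scores would be learned or derived from the oracle.

```python
def guided_distribution(base_probs, guidance):
    """Reweight next-token probabilities by guidance scores and renormalize.

    base_probs: dict mapping token -> probability under the base model.
    guidance:   dict mapping token -> non-negative score (assumed here to
                reflect how likely the constraint is to hold if the token
                is chosen); tokens missing from the dict get score 0.
    """
    weighted = {tok: p * guidance.get(tok, 0.0) for tok, p in base_probs.items()}
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("guidance rules out every candidate token")
    return {tok: w / total for tok, w in weighted.items()}


base = {"good": 0.5, "bad": 0.3, "fine": 0.2}
scores = {"good": 1.0, "bad": 0.0, "fine": 0.5}  # illustrative values only
steered = guided_distribution(base, scores)
print(steered)
```

Because the base model is only queried for its probabilities, this style of control leaves the base model's weights untouched.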
2.3 Masked and Spatial Control
- Spatially localized control (images, motion, scenes): Spatially-structured control signals (segmentation maps, pose, bounding boxes, flow fields) are injected as parallel or fused streams; spatial consistency is enforced via loss terms or in-network constraints (Pinyoanuntapong et al., 2024, Fang et al., 2023, Bokhovkin et al., 2024).
- Masked modeling: Control is achieved by masking and reconstructing only parts of the latent/codebook based on user-provided constraints, supporting rapid and targeted edits (Pinyoanuntapong et al., 2024).
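A minimal sketch of mask-based local editing follows, assuming a discrete token sequence and a hypothetical `sample_token` callback standing in for the generator. Real masked generative models typically condition each prediction on the unmasked context and may refine over several iterations; this example shows only the core property that unmasked positions are left untouched.

```python
import random


def masked_edit(tokens, edit_mask, sample_token, seed=0):
    """Resample only the positions flagged in edit_mask.

    tokens:       list of discrete codes (e.g., codebook indices).
    edit_mask:    list of booleans, True where the user requests an edit.
    sample_token: hypothetical callback (position, tokens, rng) -> new code.
    """
    if len(tokens) != len(edit_mask):
        raise ValueError("tokens and edit_mask must have the same length")
    rng = random.Random(seed)
    return [
        sample_token(i, tokens, rng) if masked else tok
        for i, (tok, masked) in enumerate(zip(tokens, edit_mask))
    ]


def toy_sampler(position, context, rng):
    """Stand-in for a generator: draws from a small fixed vocabulary."""
    return rng.choice([100, 101, 102])


original = [1, 2, 3, 4, 5]
mask = [False, True, True, False, False]
edited = masked_edit(original, mask, toy_sampler)
print(edited)
```

Positions outside the mask are copied verbatim, so the edit is local by construction.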
2.4 Modularization for Attribute Decoupling
- Decoupled modules: Text-to-motion or text-to-speech systems often split control by decoupling high-level attributes (style, trajectory, text) into distinct adapters or planners, each independently trainable, supporting generalization and zero-shot user specification (Wei et al., 14 Mar 2025, Zhou et al., 7 Oct 2025).
2.5 Symbolic or Programmatic Control
- Procedural and symbolic approaches: Compositional scene generators rely on interpretable, modular, and code-driven asset placement, often via an intermediary of API calls or semantic boxes for local editing (Bokhovkin et al., 2024).
3. Domains and Application-Specific Adaptations
CAR is instantiated differently across modalities.
3.1 Vision and Video
- Edge, depth, and sketch conditioning: Control signals include edge maps (Canny, HED), depth, or semantic sketches, injected at multiple scales or via peripheral branches in U-Net- or transformer-based generative models (Yao et al., 2024).
- Multi-modal video conditioning: State-of-the-art video CAR frameworks employ cross-attention, FiLM, and ControlNet-style adapters for conditioning on text, pose, depth, bounding boxes, camera trajectory, or structured audio (Ma et al., 22 Jul 2025, Gosselin et al., 30 May 2025).
- Factorized latent editing: SceneFactor enables region-wise, strictly local editing of large-scale 3D scenes by factorizing generation into a semantic layout diffusion and a geometric refinement diffusion, exposing the coarse semantic grid for click-driven user edits (Bokhovkin et al., 2024).
3.2 Text
- Module-wise text control: Attribute signals are inserted at initialization, stepwise input, core generator, output layer, or via custom control losses, with hybrid designs dominating (VAE+classifier+attention, adversarial regularization, latent code arithmetic) (Prabhumoye et al., 2020).
- Plug-and-play postprocessing: NADO and related approaches allow arbitrary base models to be "wrapped" for hard constraint satisfaction via stepwise token guidance, without retraining or fine-tuning (Meng et al., 2022).
- Attribute-regularized diffusion: RegDiff integrates attribute clustering in latent space at training, such that test-time generation achieves control without classifiers, sustaining both stylistic accuracy and content preservation (Zhou et al., 7 Oct 2025).
3.3 Structured Data
- Graph controllability: ShadowCast demonstrates control of global graph properties by conditioning generation on a Markov process over attribute sequences; the generator mimics attribute-driven structural dynamics, preserving global statistics under user-specified control (Tann et al., 2020).
- Line drawings with style control: Explicit injection of per-pixel “style maps” or continuous sliders shapes the continuity, thickness, and detail level of artistic sketches (Fang et al., 2023).
3.4 Music
- Disentangled factor control: SOTA unsupervised models strive for representation axes (e.g., timbre, structure) that are informative, invariant to irrelevant transformations, and equivariant to others, but leakage remains a core challenge for fine-grained controllability (Ibáñez-Martínez et al., 10 Feb 2026).
3.5 Physical Systems
- Power grid control: Distributed control architectures use local feedback laws, whose parameters are optimized centrally but deployed infrequently, to maintain global system stability and attribute control (e.g., frequency fidelity) under high-renewable regimes (Dvijotham et al., 2012).
4. Evaluation Protocols and Metrics
A comprehensive suite of metrics is employed in CAR, often tailored to the modality and the specificity of control:
- Control accuracy: Classifier- or metric-based precision for matching discrete/categorical attributes (e.g., style, sentiment, class) (Wang et al., 2022, Zhou et al., 7 Oct 2025).
- Property error: Regression-based error between generated and desired continuous attributes (Wang et al., 2022).
- Distributional realism: FID, IS, CLIP-SIM for images/videos; MOS and style metrics for speech; BLEU, ROUGE, SBERT-similarity for text (Wang et al., 2022, Ma et al., 22 Jul 2025, Zhou et al., 7 Oct 2025).
- Task-specific: Geometry (Chamfer, EMD), keypoint/trajectory accuracy, need/emotion consistency in stories (Bokhovkin et al., 2024, Xie et al., 2022).
- User studies: Evaluations of fidelity, controllability, and utility on real tasks (Yao et al., 2024, Bokhovkin et al., 2024, Ma et al., 22 Jul 2025).
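The first two metrics in the list can be computed directly once attribute predictions are available. The sketch below assumes an external attribute classifier or property estimator has already produced `predicted_attrs` and `generated_values`; the function names are illustrative, and mean absolute error is one common choice of regression error rather than the only one.

```python
def control_accuracy(predicted_attrs, target_attrs):
    """Fraction of generated samples whose predicted attribute matches the target."""
    if len(predicted_attrs) != len(target_attrs) or not target_attrs:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted_attrs, target_attrs))
    return matches / len(target_attrs)


def property_error(generated_values, target_values):
    """Mean absolute error between generated and desired continuous attributes."""
    if len(generated_values) != len(target_values) or not target_values:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(g - t) for g, t in zip(generated_values, target_values))
    return total / len(target_values)


acc = control_accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
mae = property_error([1.0, 2.5, 4.0], [1.0, 2.0, 5.0])
print(acc)  # 0.75
print(mae)  # 0.5
```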
5. Empirical Advances and Core Contributions
Across modalities and paradigms, several empirical themes recur:
- Plug-and-play control: Designs such as the CAR framework for visual AR models (Yao et al., 2024) and NADO for text (Meng et al., 2022) augment strong pretrained generators with negligible retraining, or with the base parameters kept frozen.
- Superior efficiency: CAR achieves faster inference (e.g., five-fold to twenty-fold speedups over diffusion in visual domains) and remains usable in limited-data regimes by building atop scalable AR models or leveraging modular adapters (Yao et al., 2024, Pinyoanuntapong et al., 2024).
- Generality and compositionality: CAR frameworks are increasingly supporting multi-attribute, multi-modal, and region/part-conditional control, often via hierarchical adapters, universal condition encoders, or modular planners (Ma et al., 22 Jul 2025, Wei et al., 14 Mar 2025, Xie et al., 2022).
- Explicit local editing: Factored and region-inpainted generations enable non-destructive local edits in high-dimensional structured scenes, with formal mechanisms ensuring invariance outside edited regions (Bokhovkin et al., 2024).
- Practical success: SOTA results are reported consistently across control tasks, e.g., attribute-regularized diffusion achieving superior style accuracy over classifier-guided and classifier-free baselines in text (Zhou et al., 7 Oct 2025), or robust structure preservation in controllable graph generation (Tann et al., 2020).
6. Challenges, Limitations, and Research Directions
Despite rapid progress, significant open challenges remain:
- Unified, disentangled, and interpretable controls: Many systems rely on ad hoc, dataset-specific attributes; robust disentanglement and semantic consistency across model classes and tasks remain an ongoing research frontier (Ibáñez-Martínez et al., 10 Feb 2026, Wang et al., 2022).
- Multi-conditioned and hierarchical constraints: Handling multiple, possibly conflicting controls (e.g., semantic + spatial + style) requires compositional, often hierarchical models and optimization methods (Ma et al., 22 Jul 2025).
- Scalable, high-dimensional control: While plug-in and adapter-based approaches have narrowed the gap, AR and diffusion models still encounter computational barriers at ultra-high resolution or long temporal extents (Yao et al., 2024).
- Evaluation and benchmarking: Joint metrics for realism, diversity, and control accuracy are lacking, and cross-modal benchmarks remain underdeveloped (Wang et al., 2022).
- Domain knowledge incorporation: Especially in domains with hard constraints (chemistry, power systems), enforcing symbolic rules and safety constraints remains a major direction (Wang et al., 2022, Dvijotham et al., 2012).
- Dynamic, interactive, and LLM-driven control: Emerging paradigms integrate large multimodal or reasoning models to process complex, user-driven specifications and context-adaptive control (Ma et al., 22 Jul 2025, Wei et al., 14 Mar 2025).
7. Representative Results and Model Comparisons
Quantitative results from diverse recent works demonstrate the maturity of CAR approaches:
| Domain | CAR Approach | Key Metric (Score) | Baseline (Score) | Paper |
|---|---|---|---|---|
| Text (style) | RegDiff (diffusion) | Style Acc. 0.95–0.96 | ParaGuide 0.81–0.86 | (Zhou et al., 7 Oct 2025) |
| Image | CAR (AR control) | FID ↓ 8.3–10.2 (Canny–Sketch) | ControlNet 11.6–15.3 | (Yao et al., 2024) |
| 3D Scene | SceneFactor | MMD 0.019, COV 0.421 | SDFusion 0.03/0.36 | (Bokhovkin et al., 2024) |
| Motion | ControlMM (masked) | FID 0.061, Error 0.0091 | TLControl 0.271/0.0108 | (Pinyoanuntapong et al., 2024) |
| Video | Ctrl-Crash (diffusion) | FVD 449.5, JEDi 0.1219 | Ctrl-V 517.1/0.2910 | (Gosselin et al., 30 May 2025) |
| Graph | ShadowCast (cGAN+Markov) | ΔCLUST≤0.00932 on Enron | GraphRNN, GVAE... | (Tann et al., 2020) |
A plausible implication is that the CAR paradigm, when appropriately tailored to the structure and semantics of the target domain, enables fine-grained, efficient, and reliable generation under both hard and soft control constraints, outperforming prior heuristics and poorly regularized conditional models. Ongoing advances are expected to further strengthen the theory, modularity, and usability of controllable generation frameworks across science, engineering, and creative domains.