Compositional Generalization in Multimodal Models

Updated 14 July 2025
  • Compositional Generalization in Multimodal Models is the ability to recombine learned primitives from various modalities into novel configurations, mimicking human-like understanding.
  • Architectural innovations, including syntactic attention and retrieval-augmented representations, enhance the effective recombination of modality-specific features.
  • Empirical benchmarks and regularization techniques like low-rank constraints reveal that precise data alignment and modular design are crucial for overcoming compositional challenges.

Compositional generalization in multimodal models refers to the capacity of machine learning systems to correctly process, recognize, or generate novel combinations of atomic concepts—where the system has experienced each component during training, but their particular combination is new. This capability is considered essential for robust real-world generalization, data efficiency, and human-like understanding across tasks involving mixtures of diverse input modalities, such as images, text, audio, and structured data.

1. Theoretical Underpinnings and Key Definitions

Compositional generalization is formalized as the extrapolation to novel combinations of seen primitives or factors (i.e., components such as object types, attributes, actions, or phrases) (Ram et al., 2 May 2024, Li et al., 29 May 2025). In multimodal settings, these primitives often originate from separate modalities—for instance, an image region may embody one primitive, while a phrase in a textual question embodies another.

A foundational perspective employs a neuro-symbolic definition: a function f(X) is compositional if it can be expressed as

f(X) = h \circ g^{\otimes D(X)}\left(e(x_1, 1), \ldots, e(x_L, L)\right)

where e is a modality-specific encoder, D(X) is a computation DAG specifying the composition structure, g is a composition operator, and h is a readout function (Ram et al., 2 May 2024). Compositional complexity is captured using the locus of influence (LoI), characterizing how deeply input factors affect the output through this architecture.
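
To make this decomposition concrete, the following is a minimal, illustrative Python sketch (not taken from the cited papers): toy hash-based encoders stand in for e, a hand-written chain DAG stands in for D(X), and simple vector operations stand in for g and h.

```python
# Illustrative sketch of the neuro-symbolic decomposition
# f(X) = h ∘ g^{⊗D(X)}(e(x_1, 1), ..., e(x_L, L)):
# modality-specific encoders e, a composition DAG D(X), a shared
# composition operator g, and a readout h. All components are toy stand-ins.
import numpy as np

DIM = 8

def e(x, position):
    """Modality-specific encoder: hash each atom and its slot into a vector."""
    seed = abs(hash((x, position))) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def g(left, right):
    """Composition operator applied along the edges of the DAG."""
    return np.tanh(left + right)

def h(z):
    """Readout function mapping the composed representation to an output."""
    return float(z.sum())

def f(X, dag):
    """Evaluate f by folding g over the computation DAG D(X).

    `dag` lists (i, j) node-index pairs to merge, in topological order;
    each merge appends a new node holding g(node_i, node_j).
    """
    nodes = [e(x, i) for i, x in enumerate(X)]
    for i, j in dag:
        nodes.append(g(nodes[i], nodes[j]))
    return h(nodes[-1])

# A right-branching chain DAG over three atoms from two modalities:
X = [("image", "red cube"), ("text", "left of"), ("image", "blue sphere")]
dag = [(1, 2), (0, 3)]  # compose atoms 1 and 2 first, then atom 0 with the result
print(f(X, dag))
```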

In multimodal models, compositional generalization thus includes the ability to process and combine atoms from different modalities and to extrapolate to previously unseen multimodal compositions (e.g., pairing a previously seen object with a previously seen attribute, but in a combination not present during training) (Li et al., 29 May 2025).
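
As a concrete illustration of such a split, the toy sketch below (attribute and object vocabularies are hypothetical) builds a training set that covers every primitive while holding out specific attribute-object combinations for evaluation.

```python
# Illustrative sketch of a compositional split: every attribute and every
# object appears somewhere in training, but the held-out test pairs are
# combinations never seen together during training. Vocabulary is toy.
from itertools import product

attributes = ["red", "blue", "green"]
objects = ["cube", "sphere", "cylinder"]

all_pairs = list(product(attributes, objects))
held_out = {("red", "sphere"), ("green", "cube")}  # unseen combinations

train_pairs = [p for p in all_pairs if p not in held_out]
test_pairs = sorted(held_out)

# Sanity check: each primitive is still covered by the training set.
assert {a for a, _ in train_pairs} == set(attributes)
assert {o for _, o in train_pairs} == set(objects)
print("train:", train_pairs)
print("test (novel compositions):", test_pairs)
```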

2. Architectures and Mechanisms Supporting Compositionality

Several architectural strategies have been developed to promote and assess compositional generalization in multimodal systems:

  • Variational and Generative Objectives: By designing joint generative models (e.g., multimodal VAEs or GAN-VAEs), where a shared latent variable z generates each modality conditionally, it is possible to ensure that the learned latent space supports translation and recombination between modalities (Wu et al., 2019). These frameworks encourage alignment of representations that capture abstract, compositional structure.
  • Syntactic and Structured Biases: The explicit integration of syntactic structure through attention masks derived from dependency or constituency parses enables transformer-based models to capture the relationship between linguistic tokens and associated visual/semantic content, thus promoting compositional generalization (Kamali et al., 2023). Syntactic masking restricts attention to connections grounded in language structure, preventing overfitting to spurious dataset artifacts and increasing parameter efficiency via weight sharing; a minimal attention-mask sketch follows this list.
  • Retrieval-Augmented Representation: Retrieval-based methods unify features of semantically equivalent primitives across modalities, decreasing the representation gap and fostering robust cross-modal alignment. This approach aggregates features retrieved from external databases containing semantic equivalents in visual and linguistic form, leading to more consistent representations and improved generalization to novel multimodal compositions (Li et al., 29 May 2025).
  • Model Composition and Parameter Decoupling: Constructing new multimodal models by merging modality-specific components from pretrained foundation models, with careful handling of parameter interference (e.g., via decoupling and adaptive weight adjustment), allows modality coverage to be expanded without catastrophic forgetting or reduced generalizability (Chen et al., 20 Feb 2024).
  • Low-Rank and Factorization Constraints: Forcing models to carry their latent factors all the way into the output representation space (rather than expressing them only at bottleneck layers), via architectural modifications or regularization such as low-rank embedding constraints and data augmented with isolated-factor examples, can dramatically improve compositional generalization and data efficiency (Liang et al., 30 Jan 2025); a low-rank projection sketch also follows this list.
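
The sketch below gives a minimal, self-contained illustration of syntactic attention masking: a hand-written dependency parse (purely hypothetical) is turned into a boolean mask that restricts single-head self-attention to syntactically linked token pairs. It illustrates the general idea only, not the implementation of Kamali et al.

```python
# Illustrative sketch of syntactic attention masking: attention logits are
# masked so each token may only attend to itself and to its neighbours in a
# hand-written, hypothetical dependency parse.
import numpy as np

tokens = ["the", "red", "cube", "left", "of", "sphere"]
# Hand-written dependency edges (head -> dependent), purely illustrative.
edges = [(2, 0), (2, 1), (2, 3), (3, 4), (4, 5)]

L = len(tokens)
mask = np.eye(L, dtype=bool)            # every token attends to itself
for head, dep in edges:
    mask[head, dep] = mask[dep, head] = True

def masked_self_attention(x, mask):
    """Single-head self-attention restricted to syntactically linked pairs."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -1e9)           # block non-syntactic links
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(0).standard_normal((L, 16))  # toy token embeddings
out = masked_self_attention(x, mask)
print(out.shape)  # (6, 16)
```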

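The following is a similarly minimal sketch of a low-rank output-embedding constraint: the projection from latent factors to the output space is factorized through a small inner dimension, so the output representation cannot exceed the chosen rank. Dimensions and data are toy, and the sketch is not the procedure of Liang et al.

```python
# Illustrative sketch of a low-rank output-embedding constraint: the map from
# latent factors to the output representation is forced through a rank-r
# bottleneck by factorizing the projection as W = U @ V. Dimensions are toy.
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_out, rank = 32, 64, 4       # rank << min(d_latent, d_out)

U = rng.standard_normal((d_latent, rank)) * 0.1
V = rng.standard_normal((rank, d_out)) * 0.1

def project(z):
    """Low-rank projection of latent factors into the output space."""
    return z @ U @ V                    # equivalent to z @ W with rank(W) <= rank

z = rng.standard_normal((8, d_latent))  # a batch of latent factor vectors
y = project(z)
print(y.shape, np.linalg.matrix_rank(U @ V))  # (8, 64) and rank <= 4
```
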
3. Empirical Findings and Evaluation Benchmarks

Experimental analyses have identified both strengths and limitations in current multimodal models' compositional capabilities:

  • Explicit Benchmarks: Multiple recent datasets and benchmarks have been developed for compositional generalization in multimodal contexts:
    • GQA-MSCG and GQA-CCG: Address multi-sourced and multi-level compositional generalization in visual question answering, partitioning evaluation according to modality source (language-language, vision-vision, language-vision) and compositional complexity (phrase–phrase, phrase–word, word–word) (Li et al., 18 Dec 2024, Li et al., 29 May 2025).
    • MCUB: Tests a model’s ability to identify shared commonalities across inputs from varying modalities (image, audio, point cloud, etc.) (Chen et al., 20 Feb 2024).
    • CG-Bench: Focuses on compositionality across domains and classes, with metrics targeting out-of-distribution combinations (Wang et al., 5 Feb 2024).
    • CompAct: Assesses sequential compositionality in instructional video settings, with held-out compound verb–noun sequences (Yagcioglu et al., 18 Apr 2024).
    • ViLPAct: Benchmarks future action prediction given text and video, emphasizing novel compositions of known actions (Zhuo et al., 2022).
  • Scaling and Data Coverage: Larger models and greater task coverage lead to improved compositional generalization, provided the training set maintains “compositional” and “connected” support (i.e., sufficient representation of each module/component and enough diverse pairs or combinations) (Redhardt et al., 9 Jul 2025). The minimal number of training tasks required for success scales sub-exponentially with the size of the compositional space under these conditions.
  • Pretraining Distribution Biases: Co-occurrence statistics in pretraining data play a decisive role; models such as CLIP and LMMs generalize poorly to rare or unobserved concept pairs (quantified via low pointwise mutual information, PMI), even when both individual concepts are common, evidencing a memorization bias toward typical seen combinations (Qu et al., 10 Jul 2025). This motivates research beyond mere data scaling, toward architectures that decouple individual concepts from their joint statistics; a PMI sketch follows this list.
  • Performance Gaps with Humans: While recent multimodal generative models and LLMs show improvement on simple relational and compositional tasks, they still lag behind humans, especially in complex scene understanding or when multiple objects and relations are involved, often exhibiting binding failures (incorrectly assigning relations between entities) (Fu et al., 29 Mar 2025).
  • Analysis of Hidden Representations: In models that successfully generalize compositionally, the constituent task modules or concepts can be linearly decoded from hidden activations, a property that correlates with compositional generation success rates—even in text-to-image settings (Redhardt et al., 9 Jul 2025).
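
The PMI quantity referenced above can be illustrated with a small, entirely hypothetical caption corpus; the sketch below estimates PMI(a, b) = log p(a, b) / (p(a) p(b)) from empirical co-occurrence counts, with low or undefined PMI marking the rare combinations on which models tend to fail.

```python
# Illustrative sketch of the pointwise-mutual-information view of concept
# co-occurrence, estimated from a toy, hypothetical caption corpus.
import math
from collections import Counter

captions = [
    {"dog", "grass"}, {"dog", "frisbee"}, {"dog", "grass"},
    {"cat", "sofa"}, {"cat", "keyboard"}, {"dog", "sofa"},
]

n = len(captions)
concept_counts = Counter(c for cap in captions for c in cap)
pair_counts = Counter(frozenset((a, b)) for cap in captions
                      for a in cap for b in cap if a < b)

def pmi(a, b):
    """PMI of two concepts under the empirical caption distribution."""
    p_ab = pair_counts[frozenset((a, b))] / n
    p_a, p_b = concept_counts[a] / n, concept_counts[b] / n
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

print(round(pmi("dog", "grass"), 3))    # frequently co-occurring pair
print(pmi("cat", "grass"))              # never co-occurring: -inf
```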

4. Limitations and Open Challenges

Despite notable advances, compositional generalization in multimodal models faces significant limitations:

  • Combinatorial Data Coverage: The exponential number of possible concept pairs or higher-order combinations means that no realistic dataset can cover the full compositional space, leaving standard empirical risk minimization susceptible to systematic failures on rare or unseen compositions (Qu et al., 10 Jul 2025).
  • Architectural Constraints: Multiple attention layers and cross-modal attention enable systematic recombination and robustness to distractors, but they do not guarantee productive compositional generalization to more complex, deeper structures without additional explicit structural biases or regularization (Ito et al., 26 Jan 2024).
  • Input-Dependent Fusion: Many models still employ fixed, input-agnostic fusion layers, which can lead to suboptimal alignment and increased compositional complexity; adaptive or data-dependent fusion mechanisms may be necessary for robust performance (Ram et al., 2 May 2024).
  • Linguistic Priors vs. Multimodal Reasoning: Benchmarks often overestimate compositional ability due to strong linguistic cues or priors in evaluation data, masking poor image-text integration. Metrics like the “linguistic gap” and “hard test accuracy” help clarify the true multimodal contribution (Wu et al., 2023).

5. Practical Implications and Future Research Directions

Designing robust, compositionally generalizable multimodal models requires both architectural innovations and careful data and evaluation methodology:

  • Architectural Innovations: Promising directions include explicit structural or symbolic compositional mechanisms (e.g., tree-structured processing, slot attention), weight sharing across layers (as in universal transformers), low-rank embedding regularization, and meta-learning curricula that progress from simple to complex compositions (Kamali et al., 2023, Ram et al., 2 May 2024, Li et al., 18 Dec 2024).
  • Data and Benchmarking Strategies: Data curation should maximize compositional and connected support for primitives across modalities, include augmentation for rare pairs, and introduce more hard negatives for rigorous compositionality assessment (Redhardt et al., 9 Jul 2025, Yagcioglu et al., 18 Apr 2024).
  • Representation Diagnostics: Developing reliable metrics (such as the linear decodability of hidden activations, Jacobian-based factorization measures, or the compositionality of feature subspaces) will help align model development with systematic generalization goals (Liang et al., 30 Jan 2025, Redhardt et al., 9 Jul 2025); a minimal linear-probe sketch follows this list.
  • Task Transfer and Real-World Deployment: Robust composition supports not only out-of-distribution generalization but also transfer learning to new domains, label-efficient training in data-sparse regimes, and broad applicability, from open-domain VQA to specialized areas such as medical imaging (Cai et al., 28 Dec 2024).
  • Algorithmic and Theoretical Research: Continued effort is required to clarify what inductive biases or explicit constraints best promote compositional feature learning in high-dimensional, multimodal tasks, including in the context of large foundation models and beyond supervised learning (e.g., self-supervised or contrastive learning) (Wang et al., 5 Feb 2024, Redhardt et al., 9 Jul 2025).
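
The linear-decodability diagnostic mentioned above can be sketched as a simple linear probe; the example below uses synthetic activations (so the outcome is known by construction) and scikit-learn's logistic regression, purely to illustrate the measurement procedure.

```python
# Illustrative sketch of a linear-decodability diagnostic: fit a linear probe
# on (synthetic) hidden activations to predict which primitive concept is
# present, and report held-out probe accuracy. Data and dimensions are toy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, d_hidden, n_concepts = 200, 64, 3

# Synthetic "hidden activations": each concept shifts the activations along
# its own random direction, mimicking a linearly decodable representation.
directions = rng.standard_normal((n_concepts, d_hidden))
X = np.concatenate([rng.standard_normal((n_per_class, d_hidden)) + directions[c]
                    for c in range(n_concepts)])
y = np.repeat(np.arange(n_concepts), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # near 1.0 if decodable
```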

6. Summary Table: Key Architectural Elements and Their Influence

| Architectural Element | Effect on Compositional Generalization | Reference |
| --- | --- | --- |
| Cross-modal attention | Promotes systematic recombination | (Ito et al., 26 Jan 2024) |
| Syntactic attention masking | Increases parameter efficiency and structural grounding | (Kamali et al., 2023) |
| Retrieval-based feature aggregation | Improves alignment across modalities | (Li et al., 29 May 2025) |
| Low-rank/factorized output embedding | Enforces factorization in outputs | (Liang et al., 30 Jan 2025) |
| Scaling data and model size | Improves generalization given sufficient compositional support | (Redhardt et al., 9 Jul 2025) |
| Weight sharing across layers | Reduces compositional complexity | (Ram et al., 2 May 2024) |

7. Conclusion

Compositional generalization remains a central open challenge for multimodal models. Current research underscores the necessity of designing architectures and evaluation methodologies that explicitly address the complexities of combining concepts across modalities and extrapolating to unseen compositions. Progress hinges on principled architectural motifs, diagnostics rooted in representation analysis, and the development of compositionality-aware datasets and benchmarks, ensuring models approach the flexibility and robustness of human combinatorial reasoning.