Transformer Module Networks (TMN)
- Transformer Module Networks are neural module networks that use small Transformer-encoder stacks to perform specialized, program-conditioned subtasks in VQA.
- They employ both stack and tree compositions, with module specialization critical for systematic generalization, as evidenced by significant improvements on benchmarks like CLOSURE.
- Empirical evaluations on datasets such as CLEVR, GQA-SGL, and CoGenT demonstrate that TMNs achieve robust performance with minimal pre-training compared to large-scale multimodal Transformers.
Transformer Module Networks (TMNs) are a class of Neural Module Networks for Visual Question Answering (VQA) in which modules are instantiated as small Transformer-encoder stacks. These networks are explicitly structured to promote systematic generalization by assigning specialized, parameter-distinct Transformer modules to each distinct subtask in a program derived from a question. TMNs integrate the flexible attention mechanisms of Transformers with the compositional, program-conditional computation of NMNs, achieving state-of-the-art generalization—particularly on novel combinations of linguistic or visual concepts—while requiring less extensive pre-training than large-scale multimodal Transformers (Yamada et al., 2022).
1. Architectural Framework
A TMN processes an input image $x$ and an associated program $P = ((s_1, a_1), \dots, (s_T, a_T))$, where each subtask $s_t$ (such as FILTER, COUNT, AND) is aligned with a corresponding argument $a_t$ (e.g., "sphere", "red"). Feature extraction is performed using a standard CNN or object detector, yielding region/grid features that are projected into $d$-dimensional visual tokens $v_1, \dots, v_N$ augmented with positional encodings. A special head token $h_0$ (commonly initialized as the mean over all $v_i$) is included.
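The tokenization step can be illustrated with a minimal numerical sketch; the dimensions, the projection matrix `W_proj`, and the random stand-ins for features and positional encodings below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N grid/region features of size d_feat,
# projected into d-dimensional visual tokens.
N, d_feat, d = 6, 16, 8
features = rng.standard_normal((N, d_feat))     # CNN grid/region features (stand-in)
W_proj = rng.standard_normal((d_feat, d)) / np.sqrt(d_feat)
pos = rng.standard_normal((N, d)) * 0.02        # positional encodings (stand-in)

tokens = features @ W_proj + pos                # d-dimensional visual tokens
head = tokens.mean(axis=0)                      # head token: mean over all visual tokens

sequence = np.vstack([head, tokens])            # head token prepended to visual tokens
print(sequence.shape)                           # (7, 8)
```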
A library of modules $\{M_1, \dots, M_K\}$ is maintained, where each module is a stack of Transformer-encoder layers with unique parameters. Program execution occurs in a stack- or tree-structured manner: modules are composed sequentially or in parallel (with merge modules such as AND/OR) according to the parsed program. At each time step $t$, the input to the next module is $X_t = [\,h_{t-1};\, e(a_t);\, v_1, \dots, v_N\,]$, with $e(a_t)$ being the embedding of the argument. Each module transforms its input as

$$h_t = M_{s_t}(X_t).$$
Within each module, every Transformer-encoder layer applies multi-head self-attention followed by a feed-forward sublayer. After all modules execute, the final head token $h_T$ is passed to a classifier to produce answer logits:

$$\hat{y} = \operatorname{softmax}(W h_T + b), \qquad W \in \mathbb{R}^{C \times d},$$

where $C$ is the number of answer classes.
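The program-conditioned forward pass can be sketched end to end. The stand-in modules below replace real Transformer-encoder stacks with per-module nonlinear maps (so each subtask still has its own parameters), and all names and sizes (`make_module`, the 28-way classifier) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def make_module():
    # Stand-in for a small Transformer-encoder stack: a distinct map per
    # module, so each subtask owns its parameters (as in TMNs).
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda x: np.tanh(x.mean(axis=0) @ W)   # returns the updated head token

library = {name: make_module() for name in ["FILTER", "COUNT", "AND"]}
embed = {arg: rng.standard_normal(d) for arg in ["sphere", "red"]}

visual = rng.standard_normal((6, d))   # visual tokens
h = visual.mean(axis=0)                # initial head token h_0

program = [("FILTER", "red"), ("FILTER", "sphere"), ("COUNT", None)]
for subtask, arg in program:
    arg_tok = embed[arg] if arg else np.zeros(d)
    x = np.vstack([h, arg_tok, visual])   # [head; argument; visual tokens]
    h = library[subtask](x)               # h_t = M_{s_t}(X_t)

W_cls = rng.standard_normal((d, 28))      # 28 answer classes (illustrative)
logits = h @ W_cls
print(logits.shape)                       # (28,)
```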
2. Mathematical Formalism
Composition is conditioned on the program, encoded as the sequence $P = ((s_1, a_1), \dots, (s_T, a_T))$. The sequential updates are:

$$h_t = M_{s_t}([\,h_{t-1};\, e(a_t);\, v_1, \dots, v_N\,]), \qquad t = 1, \dots, T.$$
The network is trained by minimizing cross-entropy over a batch of $B$ examples $\{(x^{(i)}, P^{(i)}, y^{(i)})\}_{i=1}^{B}$:

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \hat{y}^{(i)}_{y^{(i)}}.$$
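A minimal numpy sketch of this batched cross-entropy loss (the batch size and 28-way answer space are illustrative):

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the correct answer class over a batch.
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(2)
logits = rng.standard_normal((4, 28))   # batch of 4 examples, 28 answer classes
labels = np.array([3, 0, 27, 11])
loss = cross_entropy(logits, labels)
print(float(loss) > 0.0)
```

For uniform logits the loss reduces to $\log C$, a useful sanity check during training.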
Each module maintains independent weights—critical for subtask specialization—without explicit regularization. This implicitly drives each module to optimize its parameters for the unique distribution associated with its assigned subtask.
3. Implementation and Optimization Regimen
TMNs treat VQA as a classification task, using the cross-entropy loss. Visual features are extracted via ResNet-101 grid-maps (CLEVR, CLOSURE) or Faster R-CNN regions (GQA-SGL). The program decomposition is externally provided for each example; no program parsing is trained within the TMN. Optimization employs Adam with batch size 128; learning rates are chosen by dataset.
Two primary module composition strategies are implemented:
- Stack: serial execution of modules in program order.
- Tree: parallel execution, spawning sub-branches per program graph structure and merging via specialized modules (e.g., AND, OR).
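The two strategies can be contrasted with a toy interpreter; the "modules" below operate on sets of object ids rather than token sequences, purely to illustrate the composition orders, and every name here is illustrative:

```python
# Toy scene: object id -> attributes.
objects = {1: ("red", "sphere"), 2: ("blue", "cube"), 3: ("red", "cube")}

def filter_mod(inputs, attr):
    ids = inputs[0] if inputs else set(objects)
    return {i for i in ids if attr in objects[i]}

def and_mod(inputs, _arg):
    left, right = inputs
    return left & right

modules = {"FILTER": filter_mod, "AND": and_mod}

def run_stack(program, modules):
    # Stack: serial execution in program order; each module consumes the
    # previous module's output.
    state = None
    for subtask, arg in program:
        state = modules[subtask]([state] if state is not None else [], arg)
    return state

def run_tree(node, modules):
    # Tree: child branches execute independently, then a merge module
    # (e.g. AND/OR) combines their outputs.
    subtask, arg, children = node
    return modules[subtask]([run_tree(c, modules) for c in children], arg)

# Stack form: FILTER[red] -> FILTER[cube]
print(run_stack([("FILTER", "red"), ("FILTER", "cube")], modules))  # {3}
# Tree form: AND(FILTER[red], FILTER[cube])
tree = ("AND", None, [("FILTER", "red", []), ("FILTER", "cube", [])])
print(run_tree(tree, modules))  # {3}
```

Both forms reach the same answer here; they differ in whether intermediate results flow serially or are merged from parallel branches.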
4. Empirical Evaluation and Generalization
Evaluation is performed on three systematic generalization benchmarks:
- CLEVR-CoGenT (visual attribute generalization): Training on Condition A (e.g., cubes in gray/blue only); testing on Condition B (cubes in novel colors).
- CLOSURE (linguistic construct generalization): Testing on splits requiring novel subtask combinations.
- GQA-SGL (natural image generalization): Testing on question types with unseen argument pairings.
Key performance statistics (mean accuracy % ± SD, where reported):
| Dataset | Transformer | Transformer w/PR | TMN-Stack | TMN-Tree | Vector-NMN | NS-VQA | MDETR |
|---|---|---|---|---|---|---|---|
| CoGenT-A | 97.5 ±0.2 | 97.4 ±0.6 | 97.9 ±0.03 | 98.0 ±0.02 | 98.0 ±0.2 | 99.8 | 99.7 |
| CoGenT-B | 78.9 ±0.8 | 81.7 ±1.1 | 80.6 ±0.2 | 80.1 ±0.7 | 73.2 ±0.2 | 63.9 | 76.2 |
| CLEVR | 97.4 ±0.2 | 97.1 ±0.1 | 98.0 ±0.03 | 97.9 ±0.01 | 98.0 ±0.07 | 99.8 | 99.7 |
| CLOSURE | 57.4 ±1.6 | 64.5 ±2.5 | 90.9 ±0.5 | 95.4 ±0.2 | 94.4 | 76.4 | 53.3 |
| Overall sys-gen | 68.2 | 73.1 | 85.3 | 87.8 | 83.8 | 70.2 | 64.8 |
TMN-Tree achieves 95.4% on CLOSURE vs. 57.4% for the standard Transformer, a 38-point improvement. On GQA-SGL, TMN-Tree displays a minimal performance drop (1.8 pts) when moving from in-distribution to systematic-generalization splits, while standard Transformers drop by more than 7 points.
5. Analysis and Ablation Study
Module specialization is critical for generalization. Four library variants were evaluated: (a) individual modules for each subtask, (b) semantic group modules, (c) random groupings, and (d) position-based modules without specialization. On CLOSURE, “Individual” module libraries yielded 90.9%, “Semantic group” 93.7%, “Random group” 93.0%, whereas “Order” (no specialization) scored only 68.4%. Specialization per subtask, or at least within semantically coherent groups, is thus essential.
Beyond module structure, augmenting standard Transformers with variable depth and “split” token streams did not recover the systematic generalization behavior of TMNs: their best result was 60% on CLOSURE vs. TMN's 90.9%. Thus, module specialization, not just architectural depth or token partitioning, is necessary.
Switching from ResNet-derived features to object detector regional features (Visual Genome pretrained) substantially improved visual attribute generalization (increases of +6–8 points on CoGenT-B), but only modestly affected linguistic generalization (CLOSURE). TMNs achieved superior systematic generalization compared to heavily pretrained models such as LXMERT or MDETR, while requiring substantially less image-text pretraining.
6. Conclusion and Outlook
Transformer Module Networks integrate the compositional paradigm of Neural Module Networks with the representational power of Transformer encoders. Specialization at the module level and explicit program-conditional routing enable robust systematic generalization, with TMNs achieving gains exceeding 30 percentage points in novel linguistic composition tests relative to conventional Transformers. This demonstrates that modularity and parameter isolation, rather than just increased model capacity or depth, are principal enablers of systematic compositional reasoning in VQA. TMNs reach this performance without extensive image-text pretraining required by alternative multimodal architectures (Yamada et al., 2022).