
Transformer Module Networks (TMN)

Updated 31 March 2026
  • Transformer Module Networks are neural module networks that use small Transformer-encoder stacks to perform specialized, program-conditioned subtasks in VQA.
  • They employ both stack and tree compositions, with module specialization critical for systematic generalization, as evidenced by significant improvements on benchmarks like CLOSURE.
  • Empirical evaluations on datasets such as CLEVR, GQA-SGL, and CoGenT demonstrate that TMNs achieve robust performance with minimal pre-training compared to large-scale multimodal Transformers.

Transformer Module Networks (TMNs) are a class of Neural Module Networks for Visual Question Answering (VQA) in which modules are instantiated as small Transformer-encoder stacks. These networks are explicitly structured to promote systematic generalization by assigning specialized, parameter-distinct Transformer modules to each distinct subtask in a program derived from a question. TMNs integrate the flexible attention mechanisms of Transformers with the compositional, program-conditional computation of NMNs, achieving state-of-the-art generalization—particularly on novel combinations of linguistic or visual concepts—while requiring less extensive pre-training compared to large-scale multimodal Transformers (Yamada et al., 2022).

1. Architectural Framework

A TMN processes an input image $I$ and an associated program $P = ((s_1, \mathrm{arg}_1), \ldots, (s_L, \mathrm{arg}_L))$, where each subtask $s_t$ (such as FILTER, COUNT, AND) is paired with a corresponding argument $\mathrm{arg}_t$ (e.g., "sphere", "red"). Feature extraction is performed using a standard CNN or object detector $\phi$, yielding $N$ region/grid features $F = \phi(I) \in \mathbb{R}^{N \times D}$, which are projected into $D$-dimensional visual tokens $v_1, \ldots, v_N$ augmented with positional encodings. A special head token $h^0 \in \mathbb{R}^D$ (commonly initialized as the mean over all $v_i$) is included.
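The construction of the initial token sequence can be sketched in a few lines. This is an illustrative numpy stand-in, not the authors' code; the toy sizes, random features, and the choice of additive positional encodings are assumptions for the example.

```python
import numpy as np

# Illustrative sketch: build the initial token sequence X^0 from extracted
# visual features. Shapes follow the text: N region/grid features of
# dimension D, plus a head token and one argument embedding.
N, D = 4, 8                      # toy sizes for illustration
rng = np.random.default_rng(0)

F = rng.normal(size=(N, D))      # stands in for phi(I), the CNN/detector features
pos = rng.normal(size=(N, D))    # positional encodings (assumed additive here)
v = F + pos                      # visual tokens v_1, ..., v_N

h0 = v.mean(axis=0)              # head token h^0, initialized as the mean of the v_i
e_arg = rng.normal(size=(D,))    # embedding of the first argument, e(arg_1)

X0 = np.vstack([v, h0, e_arg])   # X^0 has shape (N + 2, D)
print(X0.shape)                  # (6, 8)
```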

A library of modules $\{M_s\}_{s \in S}$ is maintained, where each module $M_s$ is a stack of $K$ Transformer-encoder layers with its own parameters. Program execution occurs in a stack- or tree-structured manner: modules are composed sequentially or in parallel (with merge modules such as AND/OR) according to the parsed program. At each time step $t$, the input to the next module is $X^{t-1} = [v_1, \ldots, v_N; h^{t-1}; e(\mathrm{arg}_t)]$, with $e(\mathrm{arg}_t)$ being the embedding of the argument. Each module transforms its input as

$$X^t = M_{s_t}(X^{t-1}) \in \mathbb{R}^{(N+2) \times D}$$

Within each module, each of the $K$ Transformer-encoder layers applies multi-head self-attention followed by a feed-forward sublayer:

$$\begin{aligned}
Q &= XW_Q, \quad K = XW_K, \quad V = XW_V \\
A &= \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \\
X' &= \mathrm{LayerNorm}(X + A) \\
X'' &= \mathrm{LayerNorm}(X' + \mathrm{MLP}(X'))
\end{aligned}$$

After all modules execute, the final head token $h^L$ is passed to a classifier to produce answer logits:

$$\hat{y} = \mathrm{softmax}(W_c h^L) \in \mathbb{R}^C$$

where $C$ is the number of answer classes.
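The per-layer computation above can be written out directly. The sketch below is a simplified numpy version, assuming a single attention head, a ReLU MLP, and LayerNorm without learned scale/shift parameters (the paper's modules use standard multi-head Transformer-encoder layers):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm over the feature dimension (no learned scale/shift for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(X, Wq, Wk, Wv, W1, W2):
    # Single-head self-attention followed by a two-layer MLP, each
    # wrapped in a residual connection + LayerNorm, mirroring the
    # equations in the text.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d)) @ V
    X1 = layer_norm(X + A)
    X2 = layer_norm(X1 + np.maximum(X1 @ W1, 0) @ W2)  # ReLU MLP
    return X2

rng = np.random.default_rng(0)
T, D = 6, 8                       # N + 2 tokens of dimension D
X = rng.normal(size=(T, D))
Ws = [rng.normal(size=(D, D)) * 0.1 for _ in range(5)]
out = encoder_layer(X, *Ws)
print(out.shape)                  # (6, 8)
```

A module $M_s$ would stack $K$ such layers, each with its own weights.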

2. Mathematical Formalism

Composition is conditioned on the program $P$, encoded as $[s_1, \ldots, s_L]$. The sequential updates are:

$$\begin{aligned}
X^0 &= [v_1; \ldots; v_N; h^0; e(\mathrm{arg}_1)] \\
X^t &= M_{s_t}(X^{t-1}; e(\mathrm{arg}_t)), \quad t = 1, \ldots, L \\
h^L &= \text{head-token}(X^L) \\
\hat{y} &= \mathrm{softmax}(W_c h^L)
\end{aligned}$$
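The execution loop implied by these updates can be sketched end to end. The module functions, module names, and toy program below are illustrative stand-ins (each module here is a cheap residual map rather than a real Transformer stack):

```python
import numpy as np

# Minimal sketch of stack-style program execution. Each entry of
# `modules` stands in for a Transformer module M_s with its own
# parameters; `program` is a toy parsed program, not from the paper.
rng = np.random.default_rng(1)
N, D, C = 4, 8, 3

def make_module():
    W = rng.normal(size=(D, D)) * 0.1
    return lambda X: X + np.tanh(X @ W)   # stand-in for a K-layer encoder stack

modules = {s: make_module() for s in ["filter", "count", "exist"]}
embed = {a: rng.normal(size=(D,)) for a in ["red", "sphere", "none"]}
program = [("filter", "red"), ("filter", "sphere"), ("count", "none")]

v = rng.normal(size=(N, D))
h = v.mean(axis=0)                         # h^0
for s, arg in program:
    X = np.vstack([v, h, embed[arg]])      # X^{t-1} = [v; h^{t-1}; e(arg_t)]
    X = modules[s](X)                      # X^t = M_{s_t}(X^{t-1})
    v, h = X[:N], X[N]                     # carry tokens and head forward

Wc = rng.normal(size=(C, D)) * 0.1         # classifier on the final head token
logits = Wc @ h
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                       # softmax over C answer classes
print(y_hat.shape)                         # (3,)
```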

The network is trained by minimizing cross-entropy over a batch of $B$ examples $(I^{(i)}, P^{(i)}, y^{(i)})$:

$$\mathcal{L}(\theta) = -\frac{1}{B} \sum_{i=1}^{B} \sum_{c=1}^{C} \mathbf{1}\!\left[y^{(i)} = c\right] \log \hat{y}^{(i)}_c$$
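The double sum collapses because the indicator selects only the true class per example, so the loss is just the mean negative log-probability of the correct answers. A small numpy check with made-up predictions:

```python
import numpy as np

# Cross-entropy computed directly from predicted distributions
# y_hat (B, C) and integer labels y (B,). Toy values for illustration.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])

B = y.shape[0]
loss = -np.log(y_hat[np.arange(B), y]).mean()   # indicator picks the true class
print(round(loss, 4))                            # 0.2899
```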

Each module maintains independent weights—critical for subtask specialization—without explicit regularization. This implicitly drives each module to optimize its parameters for the unique distribution associated with its assigned subtask.

3. Implementation and Optimization Regimen

TMNs treat VQA as a classification task trained with cross-entropy loss. Visual features are extracted via ResNet-101 grid maps (CLEVR, CLOSURE) or Faster R-CNN regions (GQA-SGL). The program decomposition is provided externally for each example; no program parser is trained within the TMN. Optimization employs Adam with batch size 128; learning rates are tuned per dataset.

Two primary module composition strategies are implemented:

  • Stack: serial execution of modules Ms1,Ms2,...,MsLM_{s_1}, M_{s_2}, ..., M_{s_L} in program order.
  • Tree: parallel execution, spawning sub-branches per program graph structure and merging via specialized modules (e.g., AND, OR).
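The two strategies can be contrasted on a toy example. In this hedged sketch (illustrative functions and shapes, not the paper's implementation), the Stack case runs one serial chain over the head state, while the Tree case runs two branches independently and merges them with a stand-in AND module:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8

def module(h):                    # stand-in for a Transformer module's effect
    return np.tanh(h)

def run_branch(h, steps):
    # Serial chain M_{s1} -> ... -> M_{s_steps} over the head state
    for _ in range(steps):
        h = module(h)
    return h

h0 = rng.normal(size=(D,))

# Stack: one serial chain in program order
h_stack = run_branch(h0, steps=3)

# Tree: two parallel branches merged by an AND-style module that
# projects the concatenated branch outputs back to dimension D
left = run_branch(h0, steps=2)
right = run_branch(h0, steps=1)
W_and = rng.normal(size=(2 * D, D)) * 0.1
h_tree = module(np.concatenate([left, right]) @ W_and)

print(h_stack.shape, h_tree.shape)   # (8,) (8,)
```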

4. Empirical Evaluation and Generalization

Evaluation is performed on three systematic generalization benchmarks:

  • CLEVR-CoGenT (visual attribute generalization): Training on Condition A (e.g., cubes in gray/blue only); testing on Condition B (cubes in novel colors).
  • CLOSURE (linguistic construct generalization): Testing on splits requiring novel subtask combinations.
  • GQA-SGL (natural image generalization): Testing on question types with unseen argument pairings.

Key performance statistics (mean accuracy %, ± SD):

| Dataset | Transformer | Transformer w/PR | TMN-Stack | TMN-Tree | Vector-NMN | NS-VQA | MDETR |
|---|---|---|---|---|---|---|---|
| CoGenT-A | 97.5 ±0.2 | 97.4 ±0.6 | 97.9 ±0.03 | 98.0 ±0.02 | 98.0 ±0.2 | 99.8 | 99.7 |
| CoGenT-B | 78.9 ±0.8 | 81.7 ±1.1 | 80.6 ±0.2 | 80.1 ±0.7 | 73.2 ±0.2 | 63.9 | 76.2 |
| CLEVR | 97.4 ±0.2 | 97.1 ±0.1 | 98.0 ±0.03 | 97.9 ±0.01 | 98.0 ±0.07 | 99.8 | 99.7 |
| CLOSURE | 57.4 ±1.6 | 64.5 ±2.5 | 90.9 ±0.5 | 95.4 ±0.2 | 94.4 | 76.4 | 53.3 |
| Overall sys-gen | 68.2 | 73.1 | 85.3 | 87.8 | 83.8 | 70.2 | 64.8 |

TMN-Tree achieves 95.4% on CLOSURE vs. 57.4% for the standard Transformer, an improvement of more than 30 points. On GQA-SGL, TMN-Tree displays a minimal performance drop (≈1.8 points) when transitioning from in-distribution to systematic generalization, while standard Transformers exhibit a decrease exceeding 7 points.

5. Analysis and Ablation Study

Module specialization is critical for generalization. Four library variants were evaluated: (a) individual modules for each subtask, (b) semantic group modules, (c) random groupings, and (d) position-based modules without specialization. On CLOSURE, “Individual” module libraries yielded 90.9%, “Semantic group” 93.7%, “Random group” 93.0%, whereas “Order” (no specialization) scored only 68.4%. Module specialization per subtask, or at least semantically coherent domains, is thus essential.

Beyond module structure, augmenting standard Transformers with variable depth and “split” token streams did not recover the systematic generalization behavior of TMNs: their best result was 60% on CLOSURE vs. TMN's 90.9%. Thus, module specialization, not just architectural depth or token partitioning, is necessary.

Switching from ResNet-derived features to object detector regional features (Visual Genome pretrained) substantially improved visual attribute generalization (increases of +6–8 points on CoGenT-B), but only modestly affected linguistic generalization (CLOSURE). TMNs achieved superior systematic generalization compared to heavily pretrained models such as LXMERT or MDETR, while requiring substantially less image-text pretraining.

6. Conclusion and Outlook

Transformer Module Networks integrate the compositional paradigm of Neural Module Networks with the representational power of Transformer encoders. Specialization at the module level and explicit program-conditional routing enable robust systematic generalization, with TMNs achieving gains exceeding 30 percentage points in novel linguistic composition tests relative to conventional Transformers. This demonstrates that modularity and parameter isolation, rather than just increased model capacity or depth, are principal enablers of systematic compositional reasoning in VQA. TMNs reach this performance without extensive image-text pretraining required by alternative multimodal architectures (Yamada et al., 2022).

References

1. Yamada, M., D'Amario, V., Takemoto, K., Boix, X., & Sasaki, T. (2022). Transformer Module Networks for Systematic Generalization in Visual Question Answering. arXiv:2201.11316.
