End-to-End Module Networks (N2NMN)
- The paper introduces a fully differentiable framework that jointly optimizes network layout prediction and modular computations for visual question answering.
- It employs a sequence-to-sequence attentional LSTM to generate instance-specific Reverse Polish Notation layouts and integrates a diverse inventory of neural modules.
- Empirical results on CLEVR show nearly a 48% error reduction, demonstrating improved compositional reasoning and transparency compared to traditional methods.
End-to-End Module Networks (N2NMN) are a compositional reasoning framework for visual question answering (VQA) that enables the direct prediction of instance-specific network layouts without dependence on external language parsers or hand-engineered pipeline elements. By integrating both layout prediction and modular neural computation in a fully differentiable and learnable system, N2NMN achieves substantial gains in compositional generalization, transparency, and accuracy, particularly on datasets designed to test multi-step reasoning such as CLEVR. The approach is characterized by a policy over network layouts, a discrete inventory of differentiable neural modules, and joint optimization of both layout structure and module parameters by mixed imitation and reinforcement learning strategies (Hu et al., 2017).
1. Architectural Framework
N2NMN receives as input a natural language question q and an image I, from which a convolutional feature map x_vis is extracted, optionally with additional spatial-coordinate channels. The architecture comprises:
- Neural Modules: Each module is a parameterized, differentiable function taking 0–2 attention inputs (each an attention map a ∈ R^{H×W}), the visual feature map x_vis ∈ R^{H×W×D}, and a module-specific text vector x_txt. Module types include find, relocate, filter, and, or, exist, count, describe, compare, eq_count, more, less, and others. Modules either produce new attention maps or yield output vectors for scoring candidate answers.
- Layout Policy p(l | q; θ): A sequence-to-sequence attentional LSTM defining a distribution over possible module sequences, generating a Reverse Polish Notation (RPN) layout specific to the question. The encoder LSTM ingests question word embeddings; the decoder LSTM samples module tokens using attention over the encoder’s hidden states, and produces the per-module text vectors x_txt via learned attention over the question words.
- Assembly and Execution: The selected module sequence is used to dynamically instantiate a computation graph for each question. Execution proceeds by connecting modules according to the sampled layout, culminating in a final output distribution over candidate answers.
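The assemble-and-execute step can be sketched as a stack machine over RPN tokens. The module bodies below are toy stand-ins (uniform attention, pass-throughs), not the learned modules described above; they exist only to show how arity-driven stack evaluation wires a sampled layout into a computation graph.

```python
import numpy as np

# Stack-machine sketch of RPN layout assembly. The module bodies are
# hypothetical stand-ins: `find` emits a uniform attention map,
# `relocate`/`filter` pass attention through, `count` sums it.
H, W = 4, 4

# arity = number of attention inputs each module pops off the stack
ARITY = {"find": 0, "relocate": 1, "filter": 1, "and": 2, "or": 2, "count": 1}

def run_module(name, att_inputs):
    if name == "find":
        return np.full((H, W), 1.0 / (H * W))          # uniform attention map
    if name in ("relocate", "filter"):
        return att_inputs[0]                           # pass attention through
    if name == "and":
        return np.minimum(att_inputs[0], att_inputs[1])
    if name == "or":
        return np.maximum(att_inputs[0], att_inputs[1])
    if name == "count":
        return np.array([att_inputs[0].sum()])         # scalar "answer" score
    raise ValueError(f"unknown module: {name}")

def execute_rpn(layout):
    stack = []
    for token in layout:
        args = [stack.pop() for _ in range(ARITY[token])][::-1]
        stack.append(run_module(token, args))
    assert len(stack) == 1, "a valid RPN layout leaves exactly one output"
    return stack[0]

# a layout of the form [find] -> [relocate] -> [count]
out = execute_rpn(["find", "relocate", "count"])
```

Because RPN is postfix, no explicit graph data structure is needed: the token sequence alone determines how module outputs feed into later modules.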
2. Layout Inference and Joint Training
Let θ denote all model parameters (encoder/decoder LSTM, attention, and module internals). Layout prediction and network training proceed as follows:
- Expected QA Loss: For a sampled layout l, the per-example loss is L̃(θ, l) = −log p(a* | l, q, I; θ), where p(a | l, q, I; θ) is the model’s answer distribution and a* is the ground-truth answer.
- The overall optimization objective minimizes the expected loss over layouts sampled from the policy:
  min_θ E_{l ∼ p(l | q; θ)} [ L̃(θ, l) ]
- Policy Gradient: The gradient of this objective involves both REINFORCE (for the discrete layout sampling) and standard backpropagation (through the continuous modules):
  ∇_θ E_{l ∼ p(l | q; θ)} [ L̃(θ, l) ] = E_{l ∼ p(l | q; θ)} [ L̃(θ, l) ∇_θ log p(l | q; θ) + ∇_θ L̃(θ, l) ]
  In practice, this is approximated per example with M sampled layouts, using a moving-average baseline b to reduce variance:
  ∇_θ ≈ (1/M) Σ_{m=1}^{M} [ (L̃(θ, l_m) − b) ∇_θ log p(l_m | q; θ) + ∇_θ L̃(θ, l_m) ]
- Mixed Imitation and Reinforcement Learning: Early in training, an “expert” layout policy derived from ground-truth functional programs supplies a target layout l_expert, and the model is trained by supervised behavior cloning, minimizing −log p(l_expert | q; θ) together with the QA loss L̃(θ, l_expert) under that layout. Post-cloning, the model transitions to reinforcement learning over both layout and module parameters by optimizing the expected-loss objective above. This two-phase curriculum accelerates convergence and enhances final accuracy.
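As a toy illustration of the gradient estimator (not the paper’s actual seq2seq policy), the sketch below replaces the layout LSTM with bare logits over two hypothetical candidate layouts that incur fixed QA losses, and applies the REINFORCE update with an exponential-moving-average baseline b.

```python
import numpy as np

# Toy REINFORCE-with-baseline sketch. `theta` are bare layout logits and the
# two fixed losses stand in for L~(theta, l); both are illustrative
# assumptions, not the paper's architecture.
rng = np.random.default_rng(0)
theta = np.zeros(2)            # logits over two candidate layouts
losses = np.array([1.0, 0.2])  # QA loss incurred by each layout
b, lr, decay = 0.0, 0.5, 0.9   # baseline, step size, EMA decay

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):
    p = softmax(theta)
    l = rng.choice(2, p=p)                   # sample a layout l ~ p(l; theta)
    advantage = losses[l] - b                # the (L~ - b) term
    grad_logp = -p
    grad_logp[l] += 1.0                      # grad of log p(l; theta) wrt logits
    theta -= lr * advantage * grad_logp      # descent step on expected loss
    b = decay * b + (1 - decay) * losses[l]  # moving-average baseline update
```

With losses 1.0 and 0.2, the policy mass shifts almost entirely onto the low-loss layout; subtracting the moving-average baseline keeps the update variance small once the policy has converged.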
3. Core Module Types and Their Parameterization
The primary modules in the inventory, with input/output signatures and parameterizations, are as follows:
| Module | Signature / Formula | Output |
|---|---|---|
| find | a_out = conv_2(conv_1(x_vis) ⊙ W x_txt) | Attention map |
| relocate | a_out = conv_2(conv_1(x_vis) ⊙ W_1 Σ(a ⊙ x_vis) ⊙ W_2 x_txt) | Attention map |
| filter | a_out = and(a, find[x_txt]()) | Attention map |
| and | a_out = min(a_1, a_2) | Attention map |
| or | a_out = max(a_1, a_2) | Attention map |
| exist | y = Wᵀ vec(a) | Answer score vector |
| count | As in exist | Answer score vector |
| describe | y = W_1ᵀ (W_2 Σ(a ⊙ x_vis) ⊙ W_3 x_txt) | Attribute score vector |
| eq_count (also more, less) | y = W_1ᵀ vec(a_1) + W_2ᵀ vec(a_2) | Binary score vector |
| compare | y = W_1ᵀ (W_2 Σ(a_1 ⊙ x_vis) ⊙ W_3 Σ(a_2 ⊙ x_vis) ⊙ W_4 x_txt) | Comparison score vector |
All modules are built from differentiable operations (elementwise products and minima/maxima, 1–2 convolutional layers, affine transforms), so that once a sampled layout fixes the network structure, the full computation graph supports end-to-end backpropagation.
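As a concrete illustration, a find-style module of the form a_out = conv_2(conv_1(x_vis) ⊙ W x_txt) can be sketched as below, with both convolutions reduced to 1×1 (per-location linear maps) and all weights drawn at random as stand-ins for learned parameters; dimensions are arbitrary choices for the example.

```python
import numpy as np

# Minimal sketch of a find-style module: fuse a text vector with every
# spatial location of the visual feature map, then project to one
# attention channel. All weights here are random stand-ins.
rng = np.random.default_rng(1)
H, W_SP, D_VIS, D_TXT, D_MAP = 6, 6, 16, 10, 8

x_vis = rng.normal(size=(H, W_SP, D_VIS))   # visual feature map
x_txt = rng.normal(size=(D_TXT,))           # module-specific text vector

conv1 = rng.normal(size=(D_VIS, D_MAP))     # 1x1 conv: D_VIS -> D_MAP
W_txt = rng.normal(size=(D_TXT, D_MAP))     # projects text into map space
conv2 = rng.normal(size=(D_MAP, 1))         # 1x1 conv: D_MAP -> 1 channel

def find(x_vis, x_txt):
    mapped_vis = x_vis @ conv1              # (H, W_SP, D_MAP)
    mapped_txt = x_txt @ W_txt              # (D_MAP,), broadcast spatially
    joint = mapped_vis * mapped_txt         # elementwise text-image fusion
    return (joint @ conv2)[..., 0]          # (H, W_SP) attention map

a_out = find(x_vis, x_txt)
```

The elementwise product is what makes the module question-conditioned: the same visual features yield different attention maps for different text vectors.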
4. Optimization and Gradient Propagation
Gradient flow is partitioned as follows:
- The module gradient ∇_θ L̃(θ, l) is obtained by backpropagation through the dynamically constructed network, updating both the parameters of the modules and the text-attention mechanisms instantiated per module.
- The policy-gradient term (L̃(θ, l) − b) ∇_θ log p(l | q; θ) propagates through the decoder LSTM and the word-attention parameters, tuning the layout policy via REINFORCE with the QA loss as negative reward.
- The combined optimization ensures that both the execution of the modules and the generation of layouts are tuned to maximize end-task performance.
5. Empirical Results on CLEVR
On the CLEVR benchmark, which is designed for compositional visual reasoning, N2NMN demonstrates substantial error reduction relative to state-of-the-art attentional and modular baselines. Performance metrics reported include:
- 68.5%: Stacked Attention baseline
- 72.1%: NMN with expert layouts
- 69.0%: N2NMN trained from scratch with pure policy search
- 78.9%: N2NMN after behavioral cloning from expert layouts
- 83.7%: N2NMN after subsequent RL-based layout search
These results correspond to a decrease in error rate from approximately 31.5% (CNN+LSTM+SA baseline) to 16.3% (N2NMN with cloning+RL), a nearly 48% relative error reduction. Ablation analyses across question categories (existence, counting, attribute comparisons and queries) indicate that end-to-end learned layouts yield significant and consistent gains, particularly for color and integer comparisons, for which expert layouts were less optimal (Hu et al., 2017).
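The headline figure follows directly from the reported accuracies; a quick check of the arithmetic:

```python
# Relative error reduction implied by the reported CLEVR accuracies.
baseline_err = 1 - 0.685        # CNN+LSTM+SA: 68.5% accuracy
n2nmn_err = 1 - 0.837           # N2NMN with cloning + RL: 83.7% accuracy
relative_reduction = (baseline_err - n2nmn_err) / baseline_err  # ~0.48
```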
6. Compositionality and Interpretability in Reasoning
The module-by-module execution of N2NMN layouts produces unconstrained, instance-specific architectures that transparently mirror the substructure of the input question. For example:
- For the question “How many other things are there of the same size as the matte green ball?” the predicted layout is [find] → [relocate] → [count], progressing from object localization (find) through relational object selection (relocate) to set cardinality estimation (count).
- For “Is there an equal number of cubes and spheres that are metal?” the layout incorporates multiple branches: metal object localization (find), filtering by shape (filter), counting each object set (count), and comparing the two set sizes to produce a yes/no answer (eq_count).
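Because RPN is postfix, a branching layout like the second example can be read back into a tree mechanically. A small sketch (the arities assigned here are illustrative assumptions consistent with the module inventory):

```python
# Turning an RPN layout token sequence into a nested expression string.
ARITY = {"find": 0, "filter": 1, "count": 1, "eq_count": 2}

def rpn_to_tree(layout):
    stack = []
    for token in layout:
        # pop this module's inputs, then push its nested representation
        args = [stack.pop() for _ in range(ARITY[token])][::-1]
        stack.append(f"{token}({', '.join(args)})")
    return stack[0]

# two find -> filter -> count branches merged by eq_count
tree = rpn_to_tree(["find", "filter", "count",
                    "find", "filter", "count", "eq_count"])
```

Printing such trees alongside the per-module attention maps is one way to inspect which reasoning structure the policy actually selected for a given question.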
Intermediate attention maps generated by each module can be visualized, allowing analysis of the model’s internal reasoning steps and facilitating error diagnosis as well as insight into compositional generalization capabilities.
End-to-End Module Networks unify structured reasoning and neural learning by directly optimizing both computation graph structure and module representations, bypassing fixed parser dependencies. On compositional VQA tasks, this yields interpretable, question-tailored networks and demonstrable improvements in accuracy (Hu et al., 2017).