Stack-NMN: Differentiable Modular Reasoning
- Stack-NMN is a differentiable neural architecture that composes a fixed library of modules via a soft LIFO stack to enable interpretable compositional reasoning.
- A layout controller softly selects modules at each step; all modules run in parallel on a differentiable stack, and their outputs are merged according to the selection weights.
- The model achieves competitive performance on benchmarks like CLEVR VQA and offers transparent intermediate traces, fostering human-understandable neural reasoning.
Stack Neural Module Networks (Stack-NMN) are fully differentiable, end-to-end trainable neural architectures designed for compositional question answering and grounding. Stack-NMN addresses the dual challenges of enabling compositional reasoning and producing interpretable decision traces, all while eliminating the requirement for strong supervision over task decomposition. At each time step, Stack-NMN "softly" selects among a fixed library of neural modules, invoking all modules in parallel on a differentiable last-in-first-out (LIFO) stack. The outputs are merged according to the selection weights, assembling a soft, differentiable computation graph that provides step-by-step human-readable intermediate results (Hu et al., 2018).
1. Architectural Overview
Stack-NMN comprises three key interacting components:
- Layout Controller: This module processes the input question (or referring expression) via a BiLSTM encoder, generating for each time step both a soft module-selection probability distribution and a "textual parameter" vector. The selection distribution ("layout") determines the relative contribution of each module at each reasoning step.
- Module Inventory: The architecture includes a fixed set $M$ of small neural network modules, such as Find, Answer, and Compare. These modules accept stack inputs and textual parameters, returning updated stack states and, in the case of answer modules, answer score vectors. Crucially, module parameterization is shared across tasks (e.g., Visual Question Answering [VQA] and grounding).
- Differentiable Stack: Intermediate attention maps are stored in a differentiable memory stack $(A, p)$, where $A$ is an array of feature maps and $p$ is a (nearly) one-hot pointer indicating the stack position. Modules operate via pop and push primitives, and stack operations remain differentiable for gradient-based learning.
The system is initialized with a uniform image attention and proceeds for a fixed number of steps. At each step, all modules execute in parallel, and outputs are linearly combined according to the soft module selection weights.
2. Mathematical Formulation
Question Encoding and Controller
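Let the question $q$ of $S$ words be encoded by a 2-layer BiLSTM into $[h_1, \ldots, h_S]$, $h_s \in \mathbb{R}^d$. A summary vector is $\bar{q} = \frac{1}{S}\sum_{s=1}^{S} h_s$. The controller maintains a context $c_{t-1}$ and computes:
$$u = W_2\left[W_1^{(t)} \bar{q} + b_1;\; c_{t-1}\right] + b_2,$$
where $W_1^{(t)}$ is a step-specific projection, $W_2$ is shared across steps, and $b_1$, $b_2$ are bias vectors.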
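The module-selection distribution is:
$$w^{(t)} = \mathrm{softmax}\left(\mathrm{MLP}(u)\right).$$
Word-level attention over question words yields the textual parameter:
$$\alpha_{t,s} = \mathrm{softmax}_s\left(W_3\,(u \odot h_s)\right), \qquad c_t = \sum_{s=1}^{S} \alpha_{t,s}\, h_s.$$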
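The controller step described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the function names, the use of plain lists, and the single linear layer `W_mlp` standing in for the MLP are all assumptions made for the sketch.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def controller_step(h, q_bar, c_prev, W1_t, b1, W2, b2, W_mlp, w3):
    """One controller step; returns (module weights w, word attention alpha, new context c)."""
    # u = W2 [W1_t q_bar + b1 ; c_prev] + b2
    proj = [v + b for v, b in zip(matvec(W1_t, q_bar), b1)]
    u = [v + b for v, b in zip(matvec(W2, proj + c_prev), b2)]
    # Soft module selection; one linear layer stands in for the MLP (assumption).
    w = softmax(matvec(W_mlp, u))
    # Word attention: score each word by w3 . (u * h_s), then normalize over words.
    scores = [sum(a * u_i * h_i for a, u_i, h_i in zip(w3, u, h_s)) for h_s in h]
    alpha = softmax(scores)
    # Textual parameter: attention-weighted sum of the word encodings.
    c = [sum(alpha[s] * h[s][i] for s in range(len(h))) for i in range(len(h[0]))]
    return w, alpha, c
```

The returned `c` serves both as the textual parameter passed to the modules at this step and as the context fed back into the controller at the next step.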
Module Operations
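Each module $m$ is a differentiable function of the current stack and the textual parameter:
$$(A_m, p_m, y_m) = m(A, p;\, c_t),$$
where the answer scores $y_m$ are produced only by answer modules.
Modules that take attention maps as input obtain them by popping from the stack, and modules that return an attention map push it back onto the stack. For example, the Find module produces an attention map over the image conditioned on $c_t$, while the Answer and Compare modules compute attended image features and map them to answer logits.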
Differentiable Stack
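The stack maintains attention maps $A = \{A_i\}$ and a pointer $p$ over the stack slots:
- Push($z$): $p_i \leftarrow p_{i-1}$ (shift the pointer up by one slot), $A_i \leftarrow z\, p_i + A_i\,(1 - p_i)$
- Pop(): $z \leftarrow \sum_i A_i\, p_i$, $p_i \leftarrow p_{i+1}$ (shift the pointer down by one slot)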
Stack operations are soft and fully differentiable.
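A soft stack of this kind can be sketched as follows. The sketch assumes attention maps are flattened into plain lists and that pointer mass shifted past either end of the stack is dropped; the helper names `shift`, `push`, and `pop` are illustrative and not taken from the original work.

```python
def shift(p, step):
    """Move pointer mass by `step` slots; mass shifted past either end is dropped (assumption)."""
    out = [0.0] * len(p)
    for i, mass in enumerate(p):
        j = i + step
        if 0 <= j < len(p):
            out[j] += mass
    return out

def push(A, p, z):
    """Advance the pointer, then blend z into each slot in proportion to the pointer mass."""
    p_new = shift(p, +1)
    A_new = [[p_i * z_k + (1.0 - p_i) * a_k for z_k, a_k in zip(z, slot)]
             for p_i, slot in zip(p_new, A)]
    return A_new, p_new

def pop(A, p):
    """Read the pointer-weighted average of the slots, then move the pointer back."""
    z = [sum(p_i * slot[k] for p_i, slot in zip(p, A)) for k in range(len(A[0]))]
    return z, shift(p, -1)
```

With a one-hot pointer these reduce to an ordinary stack; with a soft pointer every slot is read and written in proportion to its pointer mass, which is what keeps the operations differentiable.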
Program Execution
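At each time step $t$, all modules process the same stack. Merged outputs are:
$$A^{(t+1)} = \sum_{m \in M} w_m^{(t)} A_m^{(t)}, \qquad p^{(t+1)} = \mathrm{softmax}\Big(\sum_{m \in M} w_m^{(t)} p_m^{(t)}\Big).$$
Final answer logit aggregation:
$$y = \sum_{t=0}^{T-1} \sum_{m \in \{\text{Answer},\, \text{Compare}\}} w_m^{(t)}\, y_m^{(t)}.$$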
3. Training Regimes and Objectives
Stack-NMN is trainable end-to-end by gradient descent, requiring no strong supervision over program traces. For VQA, a cross-entropy loss is applied to the aggregated answer logits $y$. For grounding, the final stack-popped attention is used as a classifier over grid cells, with an additional bounding-box regression term trained via a smooth-$L_1$ loss.
When expert module layouts are available (i.e., labeled program traces), an auxiliary cross-entropy loss can be added to regularize the module-selection distribution toward the gold module at each step. Optimization is performed with Adam at a learning rate of $1$e.
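The loss terms named in this section can be sketched as below. This is an illustration under stated assumptions: `aux_weight` is a hypothetical weighting factor not given in the text, the smooth-L1 threshold of 1 is the common default rather than a value from the source, and all function names are made up for the sketch.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using the log-sum-exp form."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def layout_loss(selection_weights, gold_modules, eps=1e-12):
    """Mean negative log-likelihood of the gold module index at each step."""
    nll = [-math.log(w[g] + eps) for w, g in zip(selection_weights, gold_modules)]
    return sum(nll) / len(nll)

def smooth_l1(pred, target):
    """Smooth-L1 summed over coordinates: quadratic below 1, linear above (assumed threshold)."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def vqa_loss(answer_logits, answer, selection_weights=None, gold_modules=None, aux_weight=1.0):
    """Answer loss, plus the optional layout term when gold layouts are supplied."""
    loss = cross_entropy(answer_logits, answer)
    if selection_weights is not None and gold_modules is not None:
        loss += aux_weight * layout_loss(selection_weights, gold_modules)
    return loss
```

Calling `vqa_loss` without `gold_modules` corresponds to training without layout supervision; supplying them adds the auxiliary term.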
4. Forward Pass Workflow
Pseudocode for one forward pass:
```
h = BiLSTM(q)                        # Question encoding
q̄ = mean(h)
c = 0                                # Initialize context
A[0][1] = uniform_attention          # Stack init
p[0] = one_hot(1)
y_accum = 0
for t in range(T):
    u = W2([W1[t] * q̄ + b1; c]) + b2
    w = softmax(MLP(u))              # Module selection
    α = softmax(W3(u ∘ h))
    c = sum(α * h)
    for m in M:                      # In parallel
        (A_m, p_m; y_m) = run_module_m(A, p; c)
    A = sum(w * A_m)
    p = softmax(sum(w * p_m))
    y_accum += sum_{m ∈ {Answer, Compare}} w_m * y_m
```
This workflow embodies soft program assembly and execution through fully differentiable control and memory mechanisms.
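As a runnable toy version of the execution loop, the sketch below replaces the controller with precomputed selection weights and textual parameters, and uses two hypothetical modules, `toy_attend` and `toy_answer`, which are stand-ins rather than modules from the original inventory. For simplicity it also assumes the number of answers equals the number of attention cells.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def shift(p, step):
    """Move pointer mass by `step` slots, dropping mass that leaves the stack."""
    out = [0.0] * len(p)
    for i, mass in enumerate(p):
        if 0 <= i + step < len(p):
            out[i + step] += mass
    return out

def toy_attend(A, p, c):
    """Hypothetical module: push softmax(c) as a new attention map."""
    z, p2 = softmax(c), shift(p, +1)
    A2 = [[p_i * z_k + (1.0 - p_i) * a_k for z_k, a_k in zip(z, slot)]
          for p_i, slot in zip(p2, A)]
    return A2, p2, None

def toy_answer(A, p, c):
    """Hypothetical module: pop the top map and use it directly as answer logits."""
    top = [sum(p_i * slot[k] for p_i, slot in zip(p, A)) for k in range(len(A[0]))]
    return A, shift(p, -1), top

def run(weights_per_step, params_per_step, n_slots=3, n_cells=2):
    modules = [toy_attend, toy_answer]
    A = [[0.0] * n_cells for _ in range(n_slots)]
    A[0] = [1.0 / n_cells] * n_cells            # uniform attention in the first slot
    p = [1.0] + [0.0] * (n_slots - 1)
    y = [0.0] * n_cells
    for w, c in zip(weights_per_step, params_per_step):
        outs = [m(A, p, c) for m in modules]    # every module sees the same stack
        A = [[sum(w_m * o[0][i][k] for w_m, o in zip(w, outs)) for k in range(n_cells)]
             for i in range(n_slots)]
        p = softmax([sum(w_m * o[1][i] for w_m, o in zip(w, outs)) for i in range(n_slots)])
        for w_m, o in zip(w, outs):
            if o[2] is not None:                # only answer-type modules emit logits
                y = [y_k + w_m * o_k for y_k, o_k in zip(y, o[2])]
    return A, p, y
```

For example, `run([[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [0.0, 0.0]])` selects `toy_attend` at the first step and `toy_answer` at the second. Because the merged pointer is passed through a softmax, it stays soft rather than exactly one-hot, so the popped map is a blend of neighbouring slots.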
5. Empirical Results and Comparative Performance
Stack-NMN demonstrates high performance on both synthetic and real-world VQA tasks without requiring expert layout supervision:
| Dataset/Task | Expert Layouts | Accuracy (%) | Reference Comparison |
|---|---|---|---|
| CLEVR VQA | Yes | 96.5 | N2NMN: 97.7 |
| CLEVR VQA | No | 93.0 | N2NMN: ~69, PG+EE/TbD: no convergence |
| CLEVR VQA+REF | No | 93.9 / 95.4 | - |
| VQAv2 (real) | No | 64.1 | N2NMN: 63.3 |
Prior modular models experience severe performance drops in the absence of layout supervision, while Stack-NMN maintains strong results and end-to-end trainability (Hu et al., 2018).
6. Interpretability and Human Studies
Stack-NMN provides intermediate attention maps and textual attentions at each step. Human evaluators rate Stack-NMN as substantially more interpretable than several non-modular baselines. On a 4-point Likert scale, subjective clarity for Stack-NMN (without layouts) is 3.33, compared to 2.46 for MAC (12 steps). In a forward-prediction experiment—where humans attempt to predict model correctness from intermediate traces—Stack-NMN achieves 62.5% accuracy (chance=50%), exceeding MAC's 56.5%. This suggests that Stack-NMN's soft stack traces offer both subjective and objective advances in human-explainable neural reasoning (Hu et al., 2018).
7. Significance, Limitations, and Context
Stack-NMN advances explainable neural computation by enabling compositional reasoning with transparent execution traces—without depending on expensive subtask supervision. Shared parameterization across tasks and modules supports broader generalization. In contrast, prior modular approaches, such as N2NMN, exhibit degraded performance without structured supervision, while others (PG+EE, TbD) fail to converge end-to-end. A plausible implication is that Stack-NMN's architecture, based on soft module selection and differentiable stack memory, provides a robust compromise between interpretability and model flexibility for visual reasoning applications (Hu et al., 2018).