Stack-NMN: Differentiable Modular Reasoning

Updated 31 March 2026

Stack-NMN is a differentiable neural architecture that composes a fixed library of modules via a soft LIFO stack to enable interpretable compositional reasoning.
A layout controller assigns soft weights to the modules at each step; all modules run in parallel on the stack, and their outputs are merged according to those weights.
The model is competitive on benchmarks such as CLEVR VQA and exposes intermediate traces (attention maps and word-level attention) that humans can inspect.

Stack Neural Module Networks (Stack-NMN) are fully differentiable, end-to-end trainable neural architectures designed for compositional question answering and grounding. Stack-NMN addresses the dual challenges of enabling compositional reasoning and producing interpretable decision traces, all while eliminating the requirement for strong supervision over task decomposition. At each time step, Stack-NMN "softly" selects among a fixed library of neural modules, invoking all modules in parallel on a differentiable last-in-first-out (LIFO) stack. The outputs are merged according to the selection weights, assembling a soft, differentiable computation graph that provides step-by-step human-readable intermediate results (Hu et al., 2018).

1. Architectural Overview

Stack-NMN comprises three key interacting components:

Layout Controller: This module processes the input question (or referring expression) via a BiLSTM encoder, generating for each time step both a soft module-selection probability distribution and a "textual parameter" vector. The selection distribution ("layout") determines the relative contribution of each module at each reasoning step.
Module Inventory: The architecture includes a fixed set $M$ of small neural network modules: $\{\mathrm{Find}, \mathrm{Transform}, \mathrm{And}, \mathrm{Or}, \mathrm{Filter}, \mathrm{Scene}, \mathrm{Answer}, \mathrm{Compare}, \mathrm{NoOp}\}$. These modules accept stack inputs and textual parameters, returning updated stack states and, in the case of answer modules, answer score vectors. Crucially, module parameterization is shared across tasks (e.g., Visual Question Answering [VQA] and grounding).
Differentiable Stack: Intermediate attention maps are stored in a differentiable memory stack $(A,p)$, where $A$ is an array of $H \times W$ feature maps and $p$ is a (nearly) one-hot pointer indicating stack position. Modules operate via pop and push primitives, and stack operations remain differentiable for gradient-based learning.

The system is initialized with a uniform image attention and proceeds for a fixed number of steps. At each step, all modules execute in parallel, and outputs are linearly combined according to the soft module selection weights.

2. Mathematical Formulation

Question Encoding and Controller

Let the question of $S$ words be encoded by a 2-layer BiLSTM into $[h_1, \ldots, h_S]$, $h_s \in \mathbb{R}^d$. A summary vector is $q = \frac{1}{S} \sum_{s=1}^S h_s$. The controller maintains a context $c_{t-1} \in \mathbb{R}^d$ and computes:

$u^{(t)} = W_2 \begin{bmatrix} W_1^{(t)} q + b_1 \\ c_{t-1} \end{bmatrix} + b_2$

where $W_1^{(t)} \in \mathbb{R}^{d \times d}$, $W_2 \in \mathbb{R}^{d \times 2d}$, and $b_1, b_2 \in \mathbb{R}^d$.

The module-selection distribution is:

$w^{(t)} = \mathrm{softmax}(\mathrm{MLP}(u^{(t)})) \in \Delta^{|M|}$

Word-level attention over the question words then yields the updated context $c_t$, which also serves as the textual parameter passed to the modules:

$\alpha_{t,s} = \mathrm{softmax}_s(W_3(u^{(t)} \circ h_s)), \quad c_t = \sum_{s=1}^S \alpha_{t,s} h_s$
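The controller equations above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `controller_step`, the random weights, and the use of a single linear layer `W_mlp` in place of the MLP are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def controller_step(h, c_prev, W1_t, b1, W2, b2, W_mlp, W3):
    """One controller step: module weights w, word attention alpha, new context c_t."""
    q = h.mean(axis=0)                                      # summary vector, shape (d,)
    u = W2 @ np.concatenate([W1_t @ q + b1, c_prev]) + b2   # u^(t), shape (d,)
    w = softmax(W_mlp @ u)                                  # soft module selection, shape (|M|,)
    alpha = softmax((u * h) @ W3)                           # attention over the S words, shape (S,)
    c_t = alpha @ h                                         # attended question vector, shape (d,)
    return w, alpha, c_t

# Toy sizes (assumed): 6 words, hidden size 8, 9 modules.
rng = np.random.default_rng(0)
S, d, num_modules = 6, 8, 9
h = rng.normal(size=(S, d))
w, alpha, c_t = controller_step(
    h, np.zeros(d),
    W1_t=rng.normal(size=(d, d)), b1=np.zeros(d),
    W2=rng.normal(size=(d, 2 * d)), b2=np.zeros(d),
    W_mlp=rng.normal(size=(num_modules, d)),  # linear stand-in for the MLP (assumption)
    W3=rng.normal(size=d),
)
```

Both `w` and `alpha` are probability vectors, so every module and every word contributes a weighted share rather than being selected discretely.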

Module Operations

Each module is a differentiable function: $(A_{\mathrm{out}}, p_{\mathrm{out}}; y_{\mathrm{out}}) = \mathrm{run\_module}_m(A_{\mathrm{in}}, p_{\mathrm{in}}; c_t)$

Modules that take attention maps as input obtain them by popping the stack: $a = \sum_{i=1}^L p_i A_i$. For example, the Find module computes an attention map from the image features $x$ and the textual parameter $c_t$:

$z = \mathrm{conv}_1(x) \circ (W_f c_t), \quad a_{\mathrm{out}} = \mathrm{conv}_2(z) \in \mathbb{R}^{H \times W}$

The Answer and Compare modules compute attended features and answer logits:

$v = \sum_{i,j} a_{i,j} x_{i,j}, \quad y = W_a^T[W_v v \circ W_c c_t]$
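As a rough sketch of the Find and Answer formulas, the NumPy code below treats both convolutions as 1x1 (a per-pixel linear map). The kernel sizes, feature dimensions, and all weight names are assumptions chosen for illustration; the source does not specify them.

```python
import numpy as np

def find_module(x, c_t, K1, W_f, k2):
    """Find: z = conv1(x) * (W_f c_t); a_out = conv2(z). Both convs are 1x1 here (assumption)."""
    z = (x @ K1) * (W_f @ c_t)   # (H, W, D), modulated by the textual parameter
    return z @ k2                # (H, W) attention map

def answer_module(x, a, c_t, W_v, W_c, W_a):
    """Answer: v = sum_ij a_ij x_ij; y = W_a^T [W_v v * W_c c_t]."""
    v = np.einsum("ij,ijc->c", a, x)           # attended image feature, shape (C,)
    return W_a.T @ ((W_v @ v) * (W_c @ c_t))   # answer logits, shape (n_ans,)

# Toy sizes (assumed): 4x5 feature grid, 16 channels, 3 candidate answers.
rng = np.random.default_rng(0)
H, W, C, D, d, n_ans = 4, 5, 16, 12, 8, 3
x = rng.normal(size=(H, W, C))
c_t = rng.normal(size=d)
K1, W_f, k2 = rng.normal(size=(C, D)), rng.normal(size=(D, d)), rng.normal(size=D)
W_v, W_c, W_a = rng.normal(size=(D, C)), rng.normal(size=(D, d)), rng.normal(size=(D, n_ans))

a = find_module(x, c_t, K1, W_f, k2)
y = answer_module(x, a, c_t, W_v, W_c, W_a)
```

In both modules the textual parameter enters through an elementwise product, which is how the same module weights can respond to different words of the question.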

Differentiable Stack

The stack maintains attention maps $A_i$ and pointer $p$:

Push(z): $p' = \mathrm{conv1d}(p; [0,0,1])$, $A'_i = A_i (1 - p'_i) + z p'_i$
Pop(): $z = \sum_{i=1}^L A_i p_i$, $p' = \mathrm{conv1d}(p; [1,0,0])$

Stack operations are soft and fully differentiable.
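The push and pop rules above can be tried out with a small NumPy sketch. It assumes that `conv1d` denotes a true convolution with zero padding, which `np.convolve(..., mode="same")` provides; under that reading the kernel `[0, 0, 1]` shifts the pointer up one slot and `[1, 0, 0]` shifts it down. Function names and toy sizes are illustrative.

```python
import numpy as np

def push(A, p, z):
    """Soft push: advance the pointer, then blend z into the slot it now points to."""
    p_new = np.convolve(p, [0, 0, 1], mode="same")   # p'_i = p_{i-1}
    gate = p_new[:, None, None]                      # broadcast over the H x W maps
    A_new = A * (1 - gate) + z[None] * gate
    return A_new, p_new

def pop(A, p):
    """Soft pop: read the pointer-weighted map, then move the pointer back one slot."""
    z = np.tensordot(p, A, axes=1)                   # sum_i p_i A_i, shape (H, W)
    p_new = np.convolve(p, [1, 0, 0], mode="same")   # p'_i = p_{i+1}
    return z, p_new

# Toy stack (assumed sizes): 4 slots of 2x3 maps, pointer starting at the bottom slot.
L, H, W = 4, 2, 3
A = np.zeros((L, H, W))
p = np.eye(L)[0]
A, p = push(A, p, np.full((H, W), 7.0))   # pointer moves to slot index 1
z, p_after = pop(A, p)                    # reads the pushed map, pointer returns to index 0
```

Because both operations are built from multiplications, sums, and a fixed convolution, they stay differentiable with respect to `A`, `p`, and `z`, and they behave like an ordinary stack when `p` is exactly one-hot. In this sketch the zero padding means a push with the pointer at the top slot shifts the pointer mass out of range.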

Program Execution

At each time step $t$, all modules process the same stack $(A^{(t)}, p^{(t)})$. The merged outputs are:

$A^{(t+1)} = \sum_{m \in M} w_m^{(t)} A_m^{(t)}, \quad p^{(t+1)} = \mathrm{softmax} \left(\sum_{m \in M} w_m^{(t)} p_m^{(t)} \right)$

The final answer logits aggregate the outputs of the answer-producing modules over all steps:

$y_{\mathrm{final}} = \sum_{t=0}^{T-1} \sum_{m\in \{\mathrm{Answer},\mathrm{Compare}\}} w_m^{(t)} y_m^{(t)}$
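A small NumPy sketch of the merge and aggregation rules above follows. The array layouts (modules stacked along the first axis), the toy sizes, and the choice of which module index plays the role of an answer module are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def merge_step(w, A_m, p_m):
    """Average the per-module stacks by w, and renormalize the averaged pointer with softmax."""
    A_next = np.tensordot(w, A_m, axes=1)            # (L, H, W)
    p_next = softmax(np.tensordot(w, p_m, axes=1))   # (L,)
    return A_next, p_next

def aggregate_answers(w_steps, y_steps, answer_idx):
    """y_final = sum over steps t and answer modules m of w_m^(t) * y_m^(t)."""
    return np.einsum("tm,tmk->k", w_steps[:, answer_idx], y_steps[:, answer_idx])

# Toy sizes (assumed): 3 modules, 4 stack slots, 2x2 maps, 2 steps, 5 answers.
rng = np.random.default_rng(0)
M, L, H, W, T, n_ans = 3, 4, 2, 2, 2, 5
w = np.array([0.7, 0.2, 0.1])
A_m = rng.normal(size=(M, L, H, W))   # stack contents produced by each module
p_m = np.eye(L)[[1, 1, 0]]            # stack pointer produced by each module
A_next, p_next = merge_step(w, A_m, p_m)

w_steps = rng.dirichlet(np.ones(M), size=T)   # one weight vector per step
y_steps = rng.normal(size=(T, M, n_ans))      # logits per step and module
y_final = aggregate_answers(w_steps, y_steps, answer_idx=[2])  # index 2 stands in for Answer
```

Note that the softmax in the merge rule yields a pointer that sums to one but is only approximately one-hot, even when every module returns an exactly one-hot pointer.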

3. Training Regimes and Objectives

Stack-NMN is trainable end-to-end by gradient descent, requiring no strong supervision over program traces. For VQA, a cross-entropy loss is applied to $y_{\mathrm{final}}$. For grounding, the final stack-popped attention is used as a classifier over grid cells with an additional bounding-box regression term via a smooth-$\ell_1$ loss.

When expert module layouts are available (i.e., labeled program traces), an auxiliary cross-entropy loss can be added to regularize $w^{(t)}$ toward gold modules at each step. Optimization is performed with Adam at a learning rate of $10^{-4}$.
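The VQA objective with the optional layout term can be sketched as below. The averaging over steps and the weighting coefficient `layout_weight` are assumptions; the source only states that an auxiliary cross-entropy on $w^{(t)}$ can be added. The grounding losses are not covered here.

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    shifted = logits - logits.max()
    return float(np.log(np.exp(shifted).sum()) - shifted[target])

def layout_loss(w_steps, gold_modules):
    """Cross-entropy between w^(t) and the gold module at each step, averaged over steps (assumption)."""
    picked = w_steps[np.arange(len(gold_modules)), gold_modules]
    return float(-np.log(picked + 1e-12).mean())

def total_loss(y_final, answer, w_steps=None, gold_modules=None, layout_weight=1.0):
    """Answer loss, plus the layout term when gold layouts are provided."""
    loss = cross_entropy_from_logits(y_final, answer)
    if gold_modules is not None:
        loss += layout_weight * layout_loss(w_steps, gold_modules)  # layout_weight is hypothetical
    return loss

# Toy example: uniform answer logits over 4 answers, 2 steps over 2 modules.
y_final = np.zeros(4)
w_steps = np.array([[0.5, 0.5], [1.0, 0.0]])
loss_no_layout = total_loss(y_final, answer=2)
loss_with_layout = total_loss(y_final, answer=2, w_steps=w_steps, gold_modules=np.array([0, 0]))
```

Without gold layouts the second term is simply omitted, and the module weights are then trained only through their effect on the answer loss.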

4. Forward Pass Workflow

Pseudocode for one forward pass:

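h = BiLSTM(question)             # Encode the S question words: h[1..S]
q = mean(h)                      # Summary vector
c = 0                            # Initialize controller context
A[1] = uniform_attention         # Stack init: uniform image attention in slot 1
p = one_hot(1)                   # Pointer at slot 1
y_accum = 0

for t in range(T):
    u = W2 @ concat(W1[t] @ q + b1, c) + b2
    w = softmax(MLP(u))              # Module selection weights
    α = softmax_s(W3(u ∘ h[s]))      # Attention over question words
    c = sum_s(α[s] * h[s])           # Updated context / textual parameter
    for m in M:                      # All modules run in parallel
        A_m, p_m, y_m = run_module_m(A, p, c)
    A = sum_m(w[m] * A_m)            # Merge stacks by selection weights
    p = softmax(sum_m(w[m] * p_m))   # Merge and renormalize pointers
    y_accum += sum over m in {Answer, Compare} of w[m] * y_m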

This workflow embodies soft program assembly and execution through fully differentiable control and memory mechanisms.

5. Empirical Results and Comparative Performance

Stack-NMN performs well on both synthetic and real-world VQA tasks, including when trained without expert layout supervision:

Dataset/Task	Expert Layouts	Accuracy (%)	Reference Comparison
CLEVR VQA	Yes	96.5	N2NMN: 97.7
CLEVR VQA	No	93.0	N2NMN: ~69, PG+EE/TbD: no convergence
CLEVR VQA+REF	No	93.9 / 95.4	-
VQAv2 (real)	No	64.1	N2NMN: 63.3

Prior modular models experience severe performance drops in the absence of layout supervision, while Stack-NMN maintains strong results and end-to-end trainability (Hu et al., 2018).

6. Interpretability and Human Studies

Stack-NMN provides intermediate attention maps and textual attentions at each step. Human evaluators rate Stack-NMN as substantially more interpretable than several non-modular baselines. On a 4-point Likert scale, subjective clarity for Stack-NMN (without layouts) is 3.33, compared to 2.46 for MAC (12 steps). In a forward-prediction experiment—where humans attempt to predict model correctness from intermediate traces—Stack-NMN achieves 62.5% accuracy (chance=50%), exceeding MAC's 56.5%. This suggests that Stack-NMN's soft stack traces offer both subjective and objective advances in human-explainable neural reasoning (Hu et al., 2018).

7. Significance, Limitations, and Context

Stack-NMN advances explainable neural computation by enabling compositional reasoning with transparent execution traces—without depending on expensive subtask supervision. Shared parameterization across tasks and modules supports broader generalization. In contrast, prior modular approaches, such as N2NMN, exhibit degraded performance without structured supervision, while others (PG+EE, TbD) fail to converge end-to-end. A plausible implication is that Stack-NMN's architecture, based on soft module selection and differentiable stack memory, provides a robust compromise between interpretability and model flexibility for visual reasoning applications (Hu et al., 2018).

References

Hu et al. (2018). Explainable Neural Computation via Stack Neural Module Networks.
