Neuro-Symbolic Concept Learner
- Neuro-Symbolic Concept Learner (NS-CL) is a unified model that combines object-level visual concept learning, natural language parsing, and symbolic program execution.
- It uses an object-centric scene representation with Mask R-CNN and ResNet-34 to derive deep features and attribute quantizations for differentiable reasoning.
- NS-CL achieves state-of-the-art performance on visual question answering benchmarks and exhibits robust generalization with minimal supervision.
The Neuro-Symbolic Concept Learner (NS-CL) is a unified model for jointly inducing object-level visual concepts, word representations, and a semantic parser that maps natural language to symbolic reasoning programs, using only natural supervision from paired images, questions, and answers. NS-CL constructs an explicit object-centric scene representation, composes a symbolic program in a restricted domain-specific language (DSL) for each input question, and executes that program in a differentiable neuro-symbolic reasoning module to produce answers. The entire system is trained end-to-end; all supervision is derived from final answers, with no annotations for concept labels or programs and no explicit mapping between objects and words (Mao et al., 2019).
1. System Architecture
NS-CL consists of three interconnected modules:
- Perception module (Perception): Maps an input image $S$ to object-centric deep features $o_i$, one per detected object, and to attribute-specific features derived from them. Mask R-CNN proposes bounding boxes; each crop is RoI-aligned, processed by ResNet-34, and concatenated with a global image feature to give the object feature $o_i$.
- Semantic parsing module: Converts a question $Q$ into a distribution over symbolic programs $P$ in a compact DSL. Its architecture includes a bi-directional GRU encoder and a recursive, sequence-to-tree decoder that emits program operators and concept parameters, producing compositional trees with nodes such as Filter, Relate, Count, and Query.
- Neuro-symbolic reasoning module (EXEC): Executes programs on sets of object representations to return an answer $A$. Execution is implemented as a deterministic interpreter built from differentiable functional operations, which enables joint training of the upstream modules via answer-level loss gradients, with reinforcement or off-policy updates for the parser.
All modules are optimized end-to-end, with answer cross-entropy gradients flowing into the perception stack and REINFORCE or off-policy search driving the parser (Mao et al., 2019).
2. Perception and Concept Quantization
NS-CL forms an explicit object-based scene structure, supporting composable concept learning:
- Object proposals: Mask R-CNN generates candidate bounding boxes. Each is RoI-aligned, processed by ResNet-34, and concatenated with a global image feature to form the object vector $o_i$.
- Attribute operators: For each attribute $a$, a dedicated MLP $u^a(\cdot)$ maps $o_i$ into that attribute's embedding subspace. Each concept $c$ (e.g., "Red," "Cube," "LeftOf") is represented by a learned vector $v^c$ in the corresponding attribute space.
- Concept quantization: For object $o_i$ and concept $c$ of attribute $a$, the probability is computed as
$$\Pr[o_i \text{ is } c] = \sigma\!\left(\frac{\cos\big(u^a(o_i),\, v^c\big) - \gamma}{\tau}\right),$$
where $\sigma$ is the sigmoid, $\cos(\cdot,\cdot)$ is cosine similarity, and $\gamma$, $\tau$ are scalar shift and scale constants. For relations (e.g., LeftOf), concatenated object pairs $(o_i, o_j)$ are similarly scored.
- Supervision: No attribute labels are available; all supervision comes via the cross-entropy loss on generated answers, implicitly sculpting the concept embeddings during curriculum-based training.
This structure allows learning both attribute and relation concepts and grounding language directly in perception without explicit annotation (Mao et al., 2019).
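The concept quantization step can be illustrated with a minimal sketch in plain Python. The three-dimensional embeddings, the function names, and the default values of the shift `gamma` and scale `tau` are assumptions made for illustration only; setting `gamma=0` and `tau=1` gives a plain sigmoid of the cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def concept_probability(attr_embedding, concept_vector, gamma=0.2, tau=0.1):
    """Pr[object is concept] = sigmoid((cos(u, v) - gamma) / tau).

    gamma and tau are illustrative defaults, not values taken from the text.
    """
    return sigmoid((cosine(attr_embedding, concept_vector) - gamma) / tau)

# Hypothetical concept vectors in a toy 3-d "color" space.
red = [1.0, 0.0, 0.0]
blue = [0.0, 1.0, 0.0]

# Hypothetical output of the color operator for one object.
obj_color = [0.9, 0.1, 0.0]

p_red = concept_probability(obj_color, red)
p_blue = concept_probability(obj_color, blue)
print(f"Pr[red]={p_red:.3f}  Pr[blue]={p_blue:.3f}")
```

Because each concept is scored independently with a sigmoid, the probabilities for different concepts of the same attribute need not sum to one.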
3. Semantic Parsing and Program Induction
The semantic parsing module grounds questions into symbolic programs using weak supervision:
- Encoding: A 2-layer bi-GRU encodes the question $Q$ into a final hidden state $h$.
- Decoding: A recursive, sequence-to-tree decoder (composed of an operator decoder, a concept decoder, and small per-node RNNs) emits operator–concept pairs that form the parse tree $P$.
- Domain-specific language (DSL): Programs are constructed from operators over sets and objects:
- Scene() → ObjectSet
- Filter(ObjectSet, ObjConcept) → ObjectSet
- Relate(Object, RelConcept) → ObjectSet
- Intersection, Union, Query, Exist, Count, and various attribute and relational queries
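- Learning: Programs are not directly supervised. Instead, REINFORCE or off-policy search optimizes the expected reward
$$J(\Theta_s) = \mathbb{E}_{P \sim \Pr[\,\cdot \mid Q;\, \Theta_s]}\big[\, r(P) \,\big],$$
with $r(P) = 1$ if $\mathrm{EXEC}(P, \mathrm{Perception}(S)) = A$, else 0, where $\Theta_s$ denotes the parser parameters, $Q$ the question, $S$ the input image, and $A$ the ground-truth answer. Variance is reduced by enumerating all correct-answer-producing small programs and maximizing their aggregate probability.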
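The typed operator signatures listed for the DSL can be sketched as a small tree data structure with a type check. This is only an illustration: the `Node` class and the `check` and `show` helpers are hypothetical names, and only the three operators whose types are spelled out above are included.

```python
from dataclasses import dataclass
from typing import Tuple

# Operator name -> (types of program arguments, output type).
# The concept parameter (e.g. "Red") is stored separately on the node.
SIGNATURES = {
    "Scene": ((), "ObjectSet"),
    "Filter": (("ObjectSet",), "ObjectSet"),
    "Relate": (("Object",), "ObjectSet"),
}

@dataclass(frozen=True)
class Node:
    op: str                            # operator name, a key of SIGNATURES
    concept: str = ""                  # concept parameter; empty for Scene
    children: Tuple["Node", ...] = ()  # sub-programs feeding this operator

def check(node: Node) -> str:
    """Return the output type of a program tree, or raise TypeError if ill-typed."""
    arg_types, out_type = SIGNATURES[node.op]
    if len(node.children) != len(arg_types):
        raise TypeError(f"{node.op} expects {len(arg_types)} program argument(s)")
    for child, expected in zip(node.children, arg_types):
        actual = check(child)
        if actual != expected:
            raise TypeError(f"{node.op} expects {expected}, got {actual}")
    return out_type

def show(node: Node) -> str:
    """Render a program tree in the functional notation used above."""
    args = [show(child) for child in node.children]
    if node.concept:
        args.append(node.concept)
    return f"{node.op}({', '.join(args)})"

# "the red cubes": Filter(Filter(Scene(), Red), Cube)
red_cubes = Node("Filter", "Cube", (Node("Filter", "Red", (Node("Scene"),)),))
print(show(red_cubes), "->", check(red_cubes))
```

Applying Relate directly to `red_cubes` fails the check, because Relate takes a single Object while Filter returns an ObjectSet.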
This approach entangles semantic parsing with perceptual grounding and compositional reasoning, all from answer-level signals (Mao et al., 2019).
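The answer-level learning signal for the parser can be sketched as follows, assuming a softmax distribution over a small enumerated set of candidate programs. The candidate programs and the answers they produce are hypothetical and hard-coded here; in the full system the answers would come from executing each program on the perceived scene. With a 0/1 reward, the expected reward equals the aggregate probability of the answer-consistent programs, and its gradient with respect to logit $k$ is $\pi_k (r_k - \mathbb{E}[r])$.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """E[r] under the softmax distribution over candidate programs."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def reward_gradient(logits, rewards):
    """Exact gradient of E[r] w.r.t. the logits: pi_k * (r_k - E[r])."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]

# Hypothetical candidates: (program, answer obtained by executing it).
candidates = [
    ("Count(Filter(Scene(), Red))", 2),
    ("Count(Filter(Scene(), Cube))", 3),
    ("Count(Filter(Filter(Scene(), Red), Cube))", 2),
]
gold_answer = 2
rewards = [1.0 if answer == gold_answer else 0.0 for _, answer in candidates]

logits = [0.0, 0.0, 0.0]  # the parser starts out uniform over the candidates
before = expected_reward(logits, rewards)

grad = reward_gradient(logits, rewards)
step_size = 1.0  # illustrative value
logits = [x + step_size * g for x, g in zip(logits, grad)]  # one ascent step
after = expected_reward(logits, rewards)

print(f"aggregate probability of correct programs: {before:.3f} -> {after:.3f}")
```

Two of the three candidates receive reward 1 in this toy setup: an answer-level reward cannot distinguish programs that happen to produce the same answer on a given scene.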
4. Neuro-Symbolic Reasoning and Execution
The reasoning module interprets induced programs over the soft, object-centric scene graph:
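- Data structures: Sets of objects are represented as "soft masks" $m \in [0, 1]^N$, with $m_i$ the probability that object $i$ is in the set. Masks that denote a single object are normalized with a softmax for sharpness.
- Operators: The DSL is executed via differentiable operators. Writing $p^c_i = \Pr[o_i \text{ is } c]$ for object concepts and $p^r_{ij}$ for the probability that relation $r$ holds between objects $i$ and $j$, examples are:
- Filter$(m, c)$: $\text{out}_i = \min(m_i,\, p^c_i)$
- Relate$(m, r)$: $\text{out}_j = \sum_i m_i \, p^r_{ij}$
- Count$(m)$: $\text{out} = \sum_i m_i$
- Query$(m, a)$: For each concept $c$ of attribute $a$,
$$\Pr[\text{answer} = c] \;\propto\; \sum_i m_i \, p^c_i$$
- Differentiability: All operations (min, sum, sigmoid, softmax) are differentiable with respect to the masks and concept probabilities. Thus, answer losses backpropagate through execution to perceptual and concept parameters.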
This hybrid interpreter guarantees deterministic, fully differentiable neuro-symbolic reasoning with transparent execution traces (Mao et al., 2019).
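A minimal sketch of the soft-mask operators, written in plain Python on a toy three-object scene. All probabilities are made-up illustrative values, the index convention for relations (`rel_probs[i][j]` scores object `j` relative to reference object `i`) is an assumption, and the normalization in `query` is one plausible choice rather than a confirmed detail of the model.

```python
def filter_(mask, concept_probs):
    """Filter: soft AND (elementwise min) of set membership and concept match."""
    return [min(m, p) for m, p in zip(mask, concept_probs)]

def relate(obj_mask, rel_probs):
    """Relate: out[j] = sum_i obj_mask[i] * rel_probs[i][j]."""
    n = len(obj_mask)
    return [sum(obj_mask[i] * rel_probs[i][j] for i in range(n)) for j in range(n)]

def count(mask):
    """Count: expected number of objects in the soft set."""
    return sum(mask)

def query(obj_mask, probs_by_concept):
    """Query: mask-weighted score per concept, normalized to sum to one."""
    scores = {
        concept: sum(m * p for m, p in zip(obj_mask, probs))
        for concept, probs in probs_by_concept.items()
    }
    total = sum(scores.values())
    return {concept: s / total for concept, s in scores.items()}

# Hypothetical per-object concept probabilities for a 3-object scene.
is_red = [0.9, 0.1, 0.8]
is_blue = [0.05, 0.8, 0.1]
is_cube = [0.95, 0.9, 0.05]
left_of = [[0.0, 0.9, 0.2],   # row i: how likely each object j is left of i
           [0.1, 0.0, 0.1],
           [0.8, 0.9, 0.0]]

scene = [1.0, 1.0, 1.0]                       # Scene(): every object is present
red_cubes = filter_(filter_(scene, is_red), is_cube)
n_red_cubes = count(red_cubes)                # soft count, close to 1

first_object = [1.0, 0.0, 0.0]                # a sharp single-object mask
left_of_first = relate(first_object, left_of)
color_of_first = query(first_object, {"Red": is_red, "Blue": is_blue})

print(red_cubes, n_red_cubes)
print(left_of_first, color_of_first)
```

Every step is built from min, products, and sums, so the same computation expressed in an automatic-differentiation framework would pass gradients back to the concept probabilities.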
5. Staged Learning and Optimization
NS-CL uses curriculum learning and end-to-end optimization:
Curriculum phases:
- Stage 1 ("Object concepts"): Few ($\leq 3$) objects per scene, only basic attribute or counting questions; builds initial concept space.
- Stage 2 ("Relational concepts"): Adds relational and compositional queries (e.g., "cube to the left of the sphere"); perception module frozen.
- Stage 3 ("Full complexity"): All CLEVR questions, up to 10-object scenes, arbitrary program depth; after a period of fixed parser and executor, joint fine-tuning is performed.
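- Optimization objective: The joint loss is the expected answer cross-entropy
$$\mathcal{L}(\Theta_v, \Theta_s) = -\,\mathbb{E}_{P \sim \Pr[\,\cdot \mid Q;\, \Theta_s]}\Big[\log \Pr\big[A \mid \mathrm{EXEC}\big(P, \mathrm{Perception}(S; \Theta_v)\big)\big]\Big],$$
where $\Theta_v$ denotes the perception and concept parameters, $\Theta_s$ the parser parameters, $Q$ the question, $S$ the image, and $A$ the ground-truth answer.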
This obviates the need for separated concept or parse losses, relying solely on end-to-end QA supervision.
The curriculum strategy aligns with human-like concept acquisition, facilitating efficient and stable training (Mao et al., 2019).
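The staged curriculum can be sketched as a small configuration table plus a predicate that decides which training examples a stage admits. The `Stage` class, the `eligible` helper, and the question-kind labels are hypothetical; a value of `None` marks a limit that the description above does not state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Stage:
    name: str
    max_objects: Optional[int]                   # None: no limit stated above
    question_kinds: Optional[Tuple[str, ...]]    # None: all question kinds

CURRICULUM = (
    Stage("object concepts", 3, ("attribute", "count")),
    Stage("relational concepts", None, ("attribute", "count", "relational")),
    Stage("full complexity", 10, None),
)

def eligible(stage: Stage, num_objects: int, kind: str) -> bool:
    """True if a (scene, question) pair may be used for training in this stage."""
    size_ok = stage.max_objects is None or num_objects <= stage.max_objects
    kind_ok = stage.question_kinds is None or kind in stage.question_kinds
    return size_ok and kind_ok

# Hypothetical examples: (number of objects in the scene, question kind).
examples = [(2, "attribute"), (3, "relational"), (7, "count"), (10, "relational")]
for stage in CURRICULUM:
    admitted = [ex for ex in examples if eligible(stage, *ex)]
    print(stage.name, "->", admitted)
```

A data loader built on such a predicate would switch stages by swapping the active `Stage`, leaving the model and loss unchanged.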
6. Empirical Performance and Generalization
NS-CL achieves state-of-the-art results on visual question answering, compositional generalization, and transfer:
| Task / Domain | NS-CL Performance | Comparative Baselines |
|---|---|---|
| CLEVR val (Color/Shape/Mat./Size) | ≈99% classification | – |
| CLEVR VQA, 10% training data | 98.9% accuracy | FiLM, MAC, TbD: 55–68% |
| CLEVR VQA, full training (no programs) | 98.9% accuracy | TbD (w/ 700k programs): 99.1% |
| CLEVR Compositional splits | ≈99% in all 4 splits | Implicit models degrade 4–8% |
| CLEVR-CoGenT (novel combos) | 98.8/98.9% (A/B) | – |
| Incremental concept: "Purple" | 93.9% | IEP: 89.3%, TbD: 87.8% |
| Cross-DSL image–caption retrieval | 97% | CNN-LSTM: 68.9% |
| Minecraft domain (no program labels) | 93.3% | NS-VQA: 87.7% |
| Real-image VQA (VQS) | 44.3% | MLP: 43.9%, MAC: 46.2% |
NS-CL demonstrates robust generalization to novel attribute compositions (CoGenT), compositional scene splits (scene size, program depth), new domains (image–caption retrieval, Minecraft), and rapid incremental learning of new concepts ("Purple") from minimal supervision. The model yields transparent execution traces via the neuro-symbolic interpreter, aiding interpretability—a property not generally available in purely neural baselines (Mao et al., 2019).
7. Implementation and System Parameters
- Framework: PyTorch, publicly available.
- Optimization: Adam optimizer with an annealed learning rate; batch size ≈32.
- Perceptual backbone: Mask R-CNN (pretrained on 4k CLEVR bounding boxes), ResNet-34 (ImageNet-pretrained).
- Parser architecture: 2-layer bi-GRU (hidden 512), word embedding (256 + DPOS 128), decoding via 2-layer MLPs (OpDecoder and ConceptDecoder).
- Curriculum control: Data loader switches and task replay between stages.
- Semantic parsing: Beam/off-policy search, beam size ≈5; all small programs yielding correct answer are aggregated for parser updates.
NS-CL is engineered for modularity and compositionality, combining established vision and language models with a novel neuro-symbolic integration, and it is supervised exclusively via QA signals (Mao et al., 2019).