Neuro-Symbolic Concept Learner (NSCL)
NSCL first uses a visual perception module to construct an object-based representation of a scene, and runs a semantic parsing module to translate a question into an executable program. A quasi-symbolic program executor then infers the answer from the scene representation and the program.
Visual perception. Given the input image, a Mask R-CNN generates object proposals for all objects. The bounding box of each object, paired with the original image, is then fed to a ResNet-34 to extract region-based features (via RoI Align) and image-based features, respectively.
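As a minimal sketch of this step, the snippet below uses a crude average-pool over a 2D feature-map grid as a stand-in for RoI Align, and plain concatenation as a stand-in for pairing region-based with image-based features; the function names and pooling scheme are illustrative, not the paper's implementation.

```python
def roi_pool(feature_map, box):
    # Crude stand-in for RoI Align: average the feature-map cells
    # inside an object's bounding box (x0, y0, x1, y1), inclusive.
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x]
             for y in range(y0, y1 + 1)
             for x in range(x0, x1 + 1)]
    return sum(cells) / len(cells)

def object_representation(region_feat, image_feat):
    # NSCL pairs region-based and image-based (context) features;
    # simple concatenation stands in for that pairing here.
    return list(region_feat) + list(image_feat)
```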
Concept quantization. The authors assume each visual attribute (e.g., shape) contains a set of visual concepts (e.g., Cube). In NSCL, visual attributes are implemented as neural operators that map the object representation into an attribute-specific embedding space. The figure below shows the inference of an object's shape. This can also be extended to relational concepts (e.g., Left) between a pair of objects by concatenating the visual representations of both objects.
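A toy sketch of scoring a concept in the attribute embedding space: a cosine similarity between the object's attribute embedding and a concept embedding, shifted and scaled before a sigmoid. The `gamma` (shift) and `tau` (temperature) values here are illustrative, not the trained parameters.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_probability(obj_emb, concept_emb, gamma=0.2, tau=0.1):
    # Shifted, scaled similarity squashed through a sigmoid gives
    # the probability that the object instantiates the concept.
    return 1.0 / (1.0 + math.exp(-(cosine(obj_emb, concept_emb) - gamma) / tau))
```

An object embedding aligned with the "Cube" concept embedding yields a probability near 1; an orthogonal one falls below 0.5.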
DSL and semantic parsing. The semantic parsing module translates a natural language question into an executable program with a hierarchy of primitive operations, represented in a domain-specific language (DSL) designed for VQA. Some examples are given below. The semantic parser generates the hierarchy of latent programs in a sequence-to-tree manner, using a bidirectional GRU to encode the input question before decoding the program.
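To make the program hierarchy concrete, here is a hypothetical nested representation of the program for "How many cubes are left of the sphere?"; the operation names mirror typical DSL primitives (Scene, Filter, Relate, Count) but the exact structure is an assumption for illustration.

```python
# Hypothetical tree-structured program for
# "How many cubes are left of the sphere?"
program = (
    "Count",
    ("Filter", "Cube",
        ("Relate", "Left",
            ("Filter", "Sphere",
                ("Scene",)))))

def operations(prog):
    # Flatten the program tree into its sequence of primitive operations.
    ops = [prog[0]]
    for arg in prog[1:]:
        if isinstance(arg, tuple):
            ops += operations(arg)
    return ops
```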
Quasi-symbolic program execution. The program executor is a collection of deterministic functional modules designed to realize all logic operations specified in the DSL. The figure below shows an illustrative execution trace of a program. To make the execution differentiable w.r.t. visual representations, intermediate results are represented in a probabilistic manner.
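The probabilistic intermediate results can be sketched as a soft mask over objects, with each module mapping masks to masks. The fuzzy-logic choices below (min as AND, max as OR, sum as expected count) are a simplified illustration of how differentiability is kept.

```python
def filter_op(mask, concept_probs):
    # Intersect the current soft object mask with per-object concept
    # probabilities; elementwise min acts as a fuzzy logical AND.
    return [min(m, p) for m, p in zip(mask, concept_probs)]

def count_op(mask):
    # Expected number of selected objects.
    return sum(mask)

def exist_op(mask):
    # Probability that at least one object is selected (fuzzy OR via max).
    return max(mask)
```

For a three-object scene with Cube probabilities `[0.9, 0.1, 0.8]`, filtering from an all-ones mask gives `[0.9, 0.1, 0.8]`, an expected count of 1.8, and an existence probability of 0.9.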
Model training. The goal is to find $\Theta_v$ of the visual perception module and $\Theta_s$ of the semantic parsing module, to maximize the likelihood of answering the question $Q$ correctly:
\[ \arg\max_{\Theta_v, \Theta_s} \mathbb{E}_{P \sim \text{SemanticParse}(Q;\, \Theta_s)}\big[ \Pr[A = \text{Executor}(\text{Perception}(S; \Theta_v), P)] \big] \]
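The quantity inside the arg max can be illustrated on a toy, fully enumerated program distribution; the `expected_correctness` function and the toy executor below are hypothetical names for illustration only.

```python
def expected_correctness(program_dist, execute, answer):
    # Exact expectation, over an explicit program distribution, of
    # the executor producing the correct answer -- the quantity
    # maximized in the training objective.
    return sum(prob * (1.0 if execute(prog) == answer else 0.0)
               for prog, prob in program_dist)

# Toy distribution over two candidate parses of one question.
dist = [("count_cubes", 0.7), ("count_spheres", 0.3)]
execute = lambda prog: "2" if prog == "count_cubes" else "3"
```

With the correct answer "2", only the first parse is rewarded, so the expected correctness equals its probability, 0.7.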
The authors also employ curriculum learning to aid the joint optimization, which is essential to the training of NSCL.