Neuro-Symbolic Concept Learner (NSCL)
NSCL first uses a visual perception module to construct an object-based representation of a scene, and runs a semantic parsing module to translate a question into an executable program. A quasi-symbolic program executor then infers the answer from the scene representation and the program.
Visual perception. Given the input image, a Mask R-CNN generates object proposals for all objects. The bounding box of each object, paired with the original image, is then fed to a ResNet-34 to extract region-based features (via RoI Align) and image-based features, respectively.
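As a minimal sketch of this step, the snippet below uses a crude average-pool over a 2D feature-map grid as a stand-in for RoI Align, and plain concatenation as a stand-in for pairing region-based with image-based features; the function names and pooling scheme are illustrative, not the paper's implementation.

```python
def roi_pool(feature_map, box):
    # Crude stand-in for RoI Align: average the feature-map cells
    # inside an object's bounding box (x0, y0, x1, y1), inclusive.
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x]
             for y in range(y0, y1 + 1)
             for x in range(x0, x1 + 1)]
    return sum(cells) / len(cells)

def object_representation(region_feat, image_feat):
    # NSCL pairs region-based and image-based (context) features;
    # simple concatenation stands in for that pairing here.
    return list(region_feat) + list(image_feat)
```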
Concept quantization. The authors assume each visual attribute (e.g., shape) contains a set of visual concepts (e.g., Cube). In NSCL, visual attributes are implemented as neural operators that map the object representation into an attribute-specific embedding space. The figure below shows the inference of an object's shape. This can also be extended to relational concepts (e.g., Left) between a pair of objects by concatenating the visual representations of both objects.
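A toy sketch of scoring a concept in the attribute embedding space: a cosine similarity between the object's attribute embedding and a concept embedding, shifted and scaled before a sigmoid. The `gamma` (shift) and `tau` (temperature) values here are illustrative, not the trained parameters.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_probability(obj_emb, concept_emb, gamma=0.2, tau=0.1):
    # Shifted, scaled similarity squashed through a sigmoid gives
    # the probability that the object instantiates the concept.
    return 1.0 / (1.0 + math.exp(-(cosine(obj_emb, concept_emb) - gamma) / tau))
```

An object embedding aligned with the "Cube" concept embedding yields a probability near 1; an orthogonal one falls below 0.5.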
DSL and semantic parsing. The semantic parsing module translates a natural language question into an executable program with a hierarchy of primitive operations, represented in a domain-specific language (DSL) designed for VQA. Some examples are given below. The semantic parser generates the hierarchy of latent programs in a sequence-to-tree manner, using a bidirectional GRU to encode the input question before decoding the program.
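To make the program hierarchy concrete, here is a hypothetical nested representation of the program for "How many cubes are left of the sphere?"; the operation names mirror typical DSL primitives (Scene, Filter, Relate, Count) but the exact structure is an assumption for illustration.

```python
# Hypothetical tree-structured program for
# "How many cubes are left of the sphere?"
program = (
    "Count",
    ("Filter", "Cube",
        ("Relate", "Left",
            ("Filter", "Sphere",
                ("Scene",)))))

def operations(prog):
    # Flatten the program tree into its sequence of primitive operations.
    ops = [prog[0]]
    for arg in prog[1:]:
        if isinstance(arg, tuple):
            ops += operations(arg)
    return ops
```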
Quasi-symbolic program execution. The program executor is a collection of deterministic functional modules designed to realize all logic operations specified in the DSL. The figure below shows an illustrative execution trace of a program. To make the execution differentiable w.r.t. visual representations, intermediate results are represented in a probabilistic manner.
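The probabilistic intermediate results can be sketched as a soft mask over objects, with each module mapping masks to masks. The fuzzy-logic choices below (min as AND, max as OR, sum as expected count) are a simplified illustration of how differentiability is kept.

```python
def filter_op(mask, concept_probs):
    # Intersect the current soft object mask with per-object concept
    # probabilities; elementwise min acts as a fuzzy logical AND.
    return [min(m, p) for m, p in zip(mask, concept_probs)]

def count_op(mask):
    # Expected number of selected objects.
    return sum(mask)

def exist_op(mask):
    # Probability that at least one object is selected (fuzzy OR via max).
    return max(mask)
```

For a three-object scene with Cube probabilities `[0.9, 0.1, 0.8]`, filtering from an all-ones mask gives `[0.9, 0.1, 0.8]`, an expected count of 1.8, and an existence probability of 0.9.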
Model training. The goal is to find $\Theta_v$ of the visual perception module and $\Theta_s$ of the semantic parsing module, to maximize the likelihood of answering the question $Q$ correctly:
\[ \arg\max_{\Theta_v, \Theta_s} \mathbb{E}_{P \sim \text{SemanticParse}(Q;\, \Theta_s)}\big[ \Pr[A = \text{Executor}(\text{Perception}(S; \Theta_v), P)] \big] \]
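The quantity inside the arg max can be illustrated on a toy, fully enumerated program distribution; the `expected_correctness` function and the toy executor below are hypothetical names for illustration only.

```python
def expected_correctness(program_dist, execute, answer):
    # Exact expectation, over an explicit program distribution, of
    # the executor producing the correct answer -- the quantity
    # maximized in the training objective.
    return sum(prob * (1.0 if execute(prog) == answer else 0.0)
               for prog, prob in program_dist)

# Toy distribution over two candidate parses of one question.
dist = [("count_cubes", 0.7), ("count_spheres", 0.3)]
execute = lambda prog: "2" if prog == "count_cubes" else "3"
```

With the correct answer "2", only the first parse is rewarded, so the expected correctness equals its probability, 0.7.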
The authors also employ curriculum learning to aid the joint optimization, which is essential to the training of NSCL.