CONTEXTUALIZED SCENE IMAGINATION FOR GENERATIVE COMMONSENSE REASONING

Overview

  • leverage scene knowledge graphs (SKGs) to unify scene knowledge from different resources.
  • pre-train a contextualized imagination module to construct an SKG for a set of concepts, based on the collected SKG instances.
  • At inference time, our verbalization module realizes the generated SKG into natural language.
  • Contribution:

    • SKGs extracted from visual captions and story datasets are more helpful than other resources
    • our model can learn faster (with less training data) with the help of scene imagination
    • the imagination module with a larger backbone LM demonstrates a larger capacity in encoding commonsense knowledge.

Approach

  • input: a list of concept sets $\{c_1, \dots, c_K\}$ and a textual context $x$

    • each concept set consists of multiple concept words (objects or actions)
  • goal: generate $K$ sequences $\{y_1, \dots, y_K\}$, each describing a plausible situation that follows human common sense for a concept set $c_i$
  • Imagine and Verbalize: $P(y_i \mid c_i, x_i) = \sum_{G} P(y_i \mid G, c_i, x_i)\,P(G \mid c_i, x_i)$
    • in each step: take the most probable scene representation $\hat{G}_i$ that maximizes $P(G \mid c_i, x_i)$, then apply the verbalization module by sampling from $P(y_i \mid \hat{G}_i, c_i, x_i)$
    • the context $x_i$ includes the given context $x$ and the previously generated sentences $y_{<i}$
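The alternating inference procedure above can be sketched as a simple loop. `imagine` and `verbalize` below are hypothetical stand-ins for the paper's two modules (in practice, decoding with fine-tuned LMs); only the control flow reflects the method.

```python
def imagine(concepts, context):
    # stand-in for the imagination module: return a trivial SKG
    # that just chains the given concepts together
    return [(a, "related-to", b) for a, b in zip(concepts, concepts[1:])]

def verbalize(skg, concepts, context):
    # stand-in for the verbalization module: realize a flat sentence
    return " ".join(concepts) + "."

def imagine_and_verbalize(concept_sets, context=""):
    """Iterate over concept sets, alternating imagination and verbalization."""
    sentences = []
    for concepts in concept_sets:
        # x_i = given context + previously generated sentences y_{<i}
        x_i = " ".join([context] + sentences).strip()
        skg = imagine(concepts, x_i)          # most probable G given (c_i, x_i)
        y_i = verbalize(skg, concepts, x_i)   # sample y_i given (G, c_i, x_i)
        sentences.append(y_i)
    return sentences

print(imagine_and_verbalize([["dog", "ball", "chase"], ["dog", "rest"]]))
# → ['dog ball chase.', 'dog rest.']
```

The key design point is that each step's context grows with the previously generated sentences, so later scenes stay consistent with earlier ones.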
  • IMAGINATION VIA GENERATING SKG
    • SKG: a graph $G = (V, E)$; the node set $V$ includes both given and implicit concepts, and the edge set $E$ denotes how two concepts should be related.
    • Collecting Diverse SKGs: from visual captions and narrative stories
      • Textual Modality (stories and captions): parse each sentence into a graph with an AMR parsing tool; also keep the sentences preceding the one corresponding to the SKG as the context $x$
      • Visual Modality: map the relations in scene graphs from Visual Genome to those used in the textual SKGs
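Unifying the visual scene graphs with the textual SKGs amounts to a predicate-mapping step. The mapping below is purely illustrative (the paper's actual relation table is not reproduced in these notes); it only shows the shape of the conversion.

```python
# Hypothetical mapping from Visual Genome scene-graph predicates
# to AMR-style relations used in textual SKGs (illustrative entries only).
VG_TO_AMR = {
    "on": ":location",
    "in": ":location",
    "holding": ":ARG1",
    "wearing": ":ARG1",
}

def map_triple(subj, pred, obj):
    """Convert one Visual Genome triple into an AMR-style SKG edge,
    or drop it when the predicate has no textual counterpart."""
    rel = VG_TO_AMR.get(pred)
    if rel is None:
        return None
    return (subj, rel, obj)

print(map_triple("cup", "on", "table"))  # → ('cup', ':location', 'table')
```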
  • LEARNING THE SCENE IMAGINATION MODULE: formulate SKG construction as an auto-regressive sequence generation task (using LMs)
    • Linearized SKG Generation: concatenate the input concepts into a sequence $c$, preceded by the context $x$. For the output (ground truth), first convert the graph into a spanning tree, then conduct a DFS and linearize it following the PENMAN format (some training tricks included)
    • Continual Pre-training and Fine-tuning: continually pre-train on the collected SKG instances, then fine-tune on downstream datasets (obtaining silver-standard SKGs from the training examples)
      • maximize $\log P(G \mid c, x)$
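A minimal sketch of the linearization step: DFS over a (tree-shaped) SKG, emitting a PENMAN-style string to serve as the generation target. The node/relation names are illustrative, and real PENMAN linearizers (e.g. the `penman` library) handle variables and re-entrancies that this sketch omits.

```python
def linearize(node, graph):
    """Depth-first traversal of a spanning tree, emitting
    a PENMAN-style linearization of the SKG."""
    children = graph.get(node, [])
    if not children:
        return node  # leaf concept
    parts = [f"{rel} {linearize(child, graph)}" for rel, child in children]
    return f"( {node} " + " ".join(parts) + " )"

# spanning tree for a scene like "a dog chases a ball in the park"
skg = {
    "chase": [(":ARG0", "dog"), (":ARG1", "ball"), (":location", "park")],
}
print(linearize("chase", skg))
# → ( chase :ARG0 dog :ARG1 ball :location park )
```

The resulting flat string is what the LM is trained to generate auto-regressively from the concatenated concepts and context.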
  • SCENE-AWARE VERBALIZATION
    • Iterative Imagine-and-Verbalize: iteratively generate the most plausible SKG for each given concept set, $\hat{G}_i = \arg\max_{G} P(G \mid c_i, x_i)$; the verbalization module then generates the $i$-th sentence by sampling from $P(y_i \mid \hat{G}_i, c_i, x_i)$. Multiple sentences are generated iteratively by alternating between scene imagination (to construct the SKG) and verbalization
    • Model Training: maximize $P(y_i \mid G_i, c_i, x_i)$
      • For each training instance $(c_i, x_i, y_i)$, construct two types of SKG instances as the input $G_i$: (1) perform AMR parsing on $y_i$ to obtain a silver-standard SKG; (2) apply the trained imagination module to generate an SKG $\hat{G}_i$, where $x_i$ includes the given context and the ground-truth prefix sentences
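The two input types for verbalizer training can be sketched as follows. `amr_parse` and `imagine` are hypothetical stand-ins for the AMR parser and the trained imagination module; only the pairing of each SKG with the same target sentence reflects the described procedure.

```python
def amr_parse(sentence):
    # stand-in for an AMR parser producing a silver-standard SKG
    return {"source": "silver", "parsed_from": sentence}

def imagine(concepts, context):
    # stand-in for the trained imagination module
    return {"source": "imagined", "concepts": concepts, "context": context}

def make_verbalizer_instances(concepts, context, target):
    """Build the two (SKG, target) training pairs for one instance."""
    silver = amr_parse(target)             # type 1: parse the target y_i
    imagined = imagine(concepts, context)  # type 2: generate from (c_i, x_i)
    return [(silver, target), (imagined, target)]

pairs = make_verbalizer_instances(
    ["dog", "ball"], "In the park.", "A dog chases a ball.")
```

Training on both the silver-standard and imagined SKGs makes the verbalizer robust to the imperfect graphs the imagination module produces at inference time.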