AGQA Benchmark


For more details, see our paper.

Updated 08/19/21

We have released a new version of the balanced dataset that fixes small bugs in data formatting.

There is one known error in the programs: for instances in which the object in an action is referred to indirectly as ‘the last thing they were [relationship]’, the program should include IterateUntil(backward, …), but instead states IterateUntil(forward, …).

Download videos and questions.


Download videos (“Data”) from Charades.


Download our Question-Answer pairs from our website.

Training Question Format

{...
 'question_id': {
     'question': 'Did they contact a blanket?',
     'answer': 'No',
     'video_id': 'YSKX3',
     'global': ['exists', 'obj-rel'],
     'local': 'yes-no-o4',
     'ans_type': 'binary',
     'steps': 1,
     'semantic': 'object',
     'structural': 'verify',
     'novel_comp': 0,
     'more_steps': 0,
     'sg_grounding': {(start char, end char): [scene graph vertices]},
     'program': 'program string',
     }
 ...}

o4 in the question’s local value is an identifier for blanket. English translations of these identifiers can be found here.
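A minimal loading sketch in Python, assuming the question-answer pairs are distributed as a JSON file keyed by question id (the filename below is hypothetical and may differ from the actual download):

    import json

    # Hypothetical filename; use the file from the Question-Answer download.
    with open("AGQA_train_questions.json") as f:
        train_questions = json.load(f)

    # Each key is a question_id; each value holds the fields listed above.
    for question_id, q in list(train_questions.items())[:3]:
        print(question_id, q["question"], "->", q["answer"])
        print("  reasoning categories:", q["global"])
        print("  structural / semantic:", q["structural"], "/", q["semantic"])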

Testing Question Format

{...
 'question_id': {
     'question': 'Did they contact a blanket?',
     'answer': 'No',
     'video_id': 'YSKX3',
     'global': ['exists', 'obj-rel'],
     'local': 'yes-no-o4',
     'ans_type': 'binary',
     'steps': 1,
     'semantic': 'object',
     'structural': 'verify',
     'novel_comp': 0,
     'nc_seq': 0,
     'nc_sup': 0,
     'nc_dur': 0,
     'nc_objrel': 0,
     'indirect': 0,
     'i_obj': 0,
     'i_rel': 0,
     'i_act': 0,
     'i_temp': 0,
     'more_steps': 0,
     'direct_equiv': 'question_id',
     'sg_grounding': {(start char, end char): [scene graph vertices]},
     'program': 'program string',
     }
 ...}

o4 in the question’s local value is an identifier for blanket. English translations of these identifiers can be found here.

Splitting test set by categories


These question attributes are described in more detail in Section 3.2.

1. Reasoning


The reasoning categories are listed under 'global'. If a question has multiple reasoning categories, include it in the accuracy calculation for each of those categories.

2. Semantic


Split the test set by the values of 'semantic'.

3. Structural


Split the test set by the values of 'structural'.

4. Binary vs open


Split the test set by the values of 'ans_type'.
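
As a sketch of all four splits, the per-category accuracies can be computed as follows, assuming a hypothetical `predictions` dictionary mapping question ids to predicted answer strings:

    from collections import defaultdict

    def accuracy_by_category(test_questions, predictions):
        """Sketch of per-category accuracy; `predictions` maps question_id
        to a predicted answer string (a hypothetical interface)."""
        reasoning = defaultdict(lambda: [0, 0])   # category -> [correct, total]
        semantic = defaultdict(lambda: [0, 0])
        structural = defaultdict(lambda: [0, 0])
        ans_type = defaultdict(lambda: [0, 0])

        for qid, q in test_questions.items():
            correct = int(predictions.get(qid) == q["answer"])
            # A question with several reasoning categories counts toward each one.
            for cat in q["global"]:
                reasoning[cat][0] += correct
                reasoning[cat][1] += 1
            for table, key in ((semantic, "semantic"),
                               (structural, "structural"),
                               (ans_type, "ans_type")):
                table[q[key]][0] += correct
                table[q[key]][1] += 1

        def report(table):
            return {cat: c / t for cat, (c, t) in table.items() if t}

        return {"reasoning": report(reasoning), "semantic": report(semantic),
                "structural": report(structural), "binary_vs_open": report(ans_type)}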

Additional Metrics


We describe these metrics in more detail in Section 3.4.

1. Novel compositions metric

To run the Novel Compositions metric, train on questions without a novel composition ('novel_comp' == 0) and test on questions with novel compositions ('novel_comp' == 1).

For a more detailed analysis (Table 4):


Sequencing compositions: 'nc_seq' == 1

Superlative compositions: 'nc_sup' == 1

Duration compositions: 'nc_dur' == 1

Object-Relationship compositions: 'nc_objrel' == 1
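
A minimal sketch of this split, assuming `train_questions` and `test_questions` are the loaded question dictionaries (as in the loading sketch above):

    # Train on questions without novel compositions, test on those with them.
    nc_train = {qid: q for qid, q in train_questions.items() if q["novel_comp"] == 0}
    nc_test  = {qid: q for qid, q in test_questions.items() if q["novel_comp"] == 1}

    # For the more detailed analysis (Table 4), split the novel-composition
    # test questions by composition type, e.g. sequencing compositions:
    nc_seq_test = {qid: q for qid, q in nc_test.items() if q["nc_seq"] == 1}

    # The More Compositional Steps metric (below) uses the same filtering
    # pattern with the 'more_steps' field instead of 'novel_comp'.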

2. Indirect References metric


To use the Indirect References metric, train and test on all questions.

Split by reference type


Indirect object reference: 'i_obj' == 1

Indirect relationship reference: 'i_rel' == 1

Indirect action reference: 'i_act' == 1

Temporal localization phrase: 'i_temp' == 1

To calculate Recall scores


The Recall score for each reference category is the accuracy on the questions containing that reference type.

To calculate Precision scores


Take the subset of questions that:

  1. contain an indirect reference:
    • 'indirect' == 1, AND
  2. have a correctly-answered equivalent question:
    • 'direct_equiv' holds the question id of the equivalent direct question, AND
    • that equivalent direct question was answered correctly

If no such equivalent question is included in the dataset, 'direct_equiv' == None.

The Precision scores are the accuracy on each of these subsets. Split this subset by reference type for a more detailed representation.
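
A sketch of the Recall and Precision computation for one reference type, assuming `test_questions` is the loaded test dictionary and `predictions` is a hypothetical mapping from question id to predicted answer:

    def indirect_reference_scores(test_questions, predictions, ref_field="i_obj"):
        """Recall and Precision for one reference type (e.g. ref_field='i_obj')."""
        def correct(qid):
            return predictions.get(qid) == test_questions[qid]["answer"]

        # Recall: accuracy over all questions containing this reference type.
        recall_qids = [qid for qid, q in test_questions.items() if q[ref_field] == 1]
        recall = sum(correct(qid) for qid in recall_qids) / max(len(recall_qids), 1)

        # Precision: restrict to questions with an indirect reference whose
        # direct equivalent exists in the dataset and was answered correctly.
        precision_qids = [
            qid for qid in recall_qids
            if test_questions[qid]["indirect"] == 1
            and test_questions[qid]["direct_equiv"] is not None
            and test_questions[qid]["direct_equiv"] in test_questions
            and correct(test_questions[qid]["direct_equiv"])
        ]
        precision = sum(correct(qid) for qid in precision_qids) / max(len(precision_qids), 1)
        return recall, precision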

3. Compositional Steps metric


To run the More Compositional Steps metric, train on questions in which 'more_steps' == 0 and test on questions in which 'more_steps' == 1.

Program formatting


Each question has a program of interlocking functions that breaks down the reasoning steps needed to answer the question. Note that Localize(‘between’, […]) takes two action arguments. We list the program functions below. An ‘item’ can refer to any type of data structure.

Function | Input | Output
--- | --- | ---
AND | bool 1, bool 2 | True if bool 1 AND bool 2, else False
Choose | item 1, item 2, [items] | item, iff item 1 OR item 2 exists in [items]
Compare | [items], bool func | the item that returns True from the function
Equals | item 1, item 2 | boolean
Exists | item, [items] | boolean
Filter | dict, [keys] | mapping of [keys] to dictionary (dict[keys[0]][keys[1]]…)
HasItem | [items] | True if length of list > 0, else False
Iterate | [items], function | list with function mapped to each item in [items]
IterateUntil | ‘forward/backward’, [items], bool func, secondary func | output of secondary function on the first item to satisfy the bool func when iterating forwards or backwards
Localize | temporal phrase, action | [frames]
OnlyItem | [items] | item in list, if length of list is 1
Query | attribute, item | value of attribute within item
Superlative | ‘min/max’, [items], func | item with min or max value when func applied
XOR | bool 1, bool 2 | True if bool 1 XOR bool 2, else False
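
For intuition only, a rough Python rendering of a few of these functions might look like the following. This is a sketch of the semantics in the table above, not the released evaluation code, and the item representations are hypothetical:

    def AND(bool_1, bool_2):
        # True if bool 1 AND bool 2, else False
        return bool_1 and bool_2

    def Exists(item, items):
        # True if item is present in [items]
        return item in items

    def HasItem(items):
        # True if length of list > 0, else False
        return len(items) > 0

    def Iterate(items, function):
        # list with function mapped to each item in [items]
        return [function(item) for item in items]

    def IterateUntil(direction, items, bool_func, secondary_func):
        # Output of the secondary function on the first item that satisfies
        # bool_func when iterating forwards or backwards.
        ordered = items if direction == "forward" else list(reversed(items))
        for item in ordered:
            if bool_func(item):
                return secondary_func(item)
        return None

    def Superlative(mode, items, func):
        # item with min or max value when func applied
        return (min if mode == "min" else max)(items, key=func)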

Scene graph grounding


Each question’s grounding associates parts of the question with nodes in the scene graph. The grounding maps keys of character indices to lists of node indices: the character indices give the ‘start-end’ span of the phrase, and the node indices refer to vertices in the scene graph.

  • We only include the relevant node indices (e.g. if the question asks ‘Did they contact a cup after walking through a doorway?’, the reference to ‘a cup’ only includes vertices with that cup during frames after they walked through a doorway. If the list is empty, there are no such vertices so the answer is ‘No’). Relevancy is determined by the highest level in the program.
  • If the keys are the same number “X-X”, then they refer to all the frames in the video. Some questions do not have a phrase for temporal localization (e.g. “Do they contact a cup”), so the reference to relevant frames has the same start and end index.
  • Some scene graphs currently have negative values. This is a bug with one template and one indirect reference. We are fixing it now, and will re-upload shortly.

As an example, the question “Does someone contact a paper before drinking from a cup?” may have the following scene graph grounding:

{
    # refers to the nodes for 'a paper'. 'o23' is the idx reference for 'a paper' (see IDX.pkl)
    "21-28": ["o23/000083", "o23/000084" ... "o23/000813", "o23/000826"],
    # refers to the node for 'drinking from a cup'. 'c106' is the idx reference for 'drinking from a cup' (see IDX.pkl)
    "36-55": ["c106/1"],
    # refers to the frames localized by the phrase 'before drinking from a cup'
    "29-55": ["000083", "000084" ... "000869", "000900"]
}
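
A small sketch for recovering the grounded phrases, assuming the keys are 'start-end' strings as in this example (with the end index treated here as exclusive, which matches the spans above):

    def grounded_phrases(question, sg_grounding):
        """Map each grounded span of the question to its scene graph vertices."""
        phrases = {}
        for span, vertices in sg_grounding.items():
            start, end = (int(c) for c in span.split("-"))
            # 'X-X' keys refer to all frames in the video, not to a phrase.
            label = question[start:end] if start != end else "<all frames>"
            phrases[label] = vertices
        return phrases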

Using Scene Graphs


The scene graph files map video ids to scene graph dictionaries. Each scene graph maps vertex ids to vertex information.

Each vertex contains information about the other vertices to which it is connected. The type of information included depends on vertex type.

{
    'frameid': {
        'id': '000105'
        'type': 'frame'
        'secs': 6.3
        'objects': [connecting object nodes]
        'attention': [connecting attention nodes]
        'contact': [connecting contact nodes]
        'spatial': [connecting spatial nodes]
        'verb': [connecting verb nodes]
        'actions': [connecting action nodes]
        'metadata': test / train
        'next': next frame
        'prev': previous frame
        }
}
{
    'actionid': {
        'id': 'c076/1'
        'charades': 'c076'
        'phrase': 'holding a pillow'
        'type': 'act'
        'start': 14.56
        'end': 17.5
        'length': 2.94
        'objects': [connecting object nodes]
        'attention': [connecting attention nodes]
        'contact': [connecting contact nodes]
        'spatial': [connecting spatial nodes]
        'verb': [connecting verb nodes]
        'metadata': test / train
        'all_f': [list of frame ids]
        'subject': 'o9'
        'verb': 'v25'
        'next_discrete': next non-overlapping action
        'prev_discrete': previous non-overlapping action
        'next_instance': next c076 action
        'prev_instance': previous c076 action
        'while': [list of co-occurring action objects]
        }
}
{
    'objectid': {
        'id': 'o4/000105'
        'type': 'object'
        'class': 'o4'
        'attention': [connecting attention nodes]
        'contact': [connecting contact nodes]
        'spatial': [connecting spatial nodes]
        'verb': [connecting verb nodes]
        'visible': True if visible, False otherwise
        'bbox': [list of bbox values from Action Genome]
        'metadata': test / train
        'frame_num': 000105
        'secs': 6.3
        'next': next o4 object
        'prev': previous o4 object
        }
}
{
    'relationid': {
        'id': 'r22/000209'
        'type': 'contact'
        'class': 'r22'
        'objects': [connecting object nodes]
        'metadata': test / train
        'frame_num': 000209
        'secs': 12.6
        'next': next r22 relation
        'prev': previous r22 relation
        }
}


{
    'actionid': {
        'id': 'c076/1'
        'phrase': 'holding a pillow'
        'start': 14.56
        'end': 17.5
        'metadata': test / train
        'all_f': [list of frame ids]
        }
}
{
    'frameid': {
        'id': '000105'
        'attention': [connecting attention nodes]
        'contact': [connecting contact nodes]
        'spatial': [connecting spatial nodes]
        'verb': [connecting verb nodes]
        }
}
{
    'relationid': {
        'id': 'r22/000209'
        'objects': [connecting object nodes]
        }
}
{
    'objectid': {
        'id': 'o4/000105'
        'bbox': [list of bbox values from Action Genome]
        }
}

{
    "video_id": xxx,
    "frame_ids": xxx,
    "graph_list": xxx,
    "action_list": xxx
}
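
As a sketch of how these vertex fields can be traversed, the following walks from a frame vertex to the object vertices connected to it. The filename and the assumption that the scene graphs are distributed as a single pickle keyed by video id are hypothetical; the actual release may package them differently:

    import pickle

    # Hypothetical filename; use the scene graph file from the download.
    with open("AGQA_scene_graphs.pkl", "rb") as f:
        scene_graphs = pickle.load(f)

    # video id -> scene graph dictionary (vertex id -> vertex information)
    video = scene_graphs["YSKX3"]

    # Pick an arbitrary frame vertex and list the object vertices attached to it.
    frame_id, frame = next((vid, v) for vid, v in video.items()
                           if v.get("type") == "frame")
    print("frame", frame["id"], "at", frame["secs"], "seconds")
    for obj_id in frame["objects"]:
        obj = video[obj_id]
        print("  object", obj["class"], "visible:", obj["visible"])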