AGQA Benchmark
For more details, see our paper.
Updated 08/19/21
We have released a new version of the balanced dataset that fixes small bugs in data formatting.
There is one error in programs for the instances in which the object in an action is referred to indirectly as ‘the last thing they were [relationship]’. This program should include IterateUntil(backward, …), but instead states IterateUntil(forward, …).
Download videos and questions.
Download videos (“Data”) from Charades.
Download our Question-Answer pairs from our website
Training Question Format
{...
'question_id': {
'question': 'Did they contact a blanket?',
'answer': 'No',
'video_id': 'YSKX3',
'global': ['exists', 'obj-rel'],
'local': 'yes-no-o4',
'ans_type': 'binary',
'steps': 1,
'semantic': 'object',
'structural': 'verify',
'novel_comp': 0,
'more_steps': 0,
'sg_grounding': {(start char, end char): [scene graph vertices]},
'program': 'program string',
}
...}
o4 in the question’s local value is an identifier for blanket. English translations of these identifiers can be found here.
Testing Question Format
{...
'question_id': {
'question': 'Did they contact a blanket?',
'answer': 'No',
'video_id': 'YSKX3',
'global': ['exists', 'obj-rel'],
'local': 'yes-no-o4',
'ans_type': 'binary',
'steps': 1,
'semantic': 'object',
'structural': 'verify',
'novel_comp': 0,
'nc_seq': 0,
'nc_sup': 0,
'nc_dur': 0,
'nc_objrel': 0,
'indirect': 0,
'i_obj': 0,
'i_rel': 0,
'i_act': 0,
'i_temp': 0,
'more_steps': 0,
'direct_equiv': 'question_id',
'sg_grounding': {(start char, end char): [scene graph vertices]},
'program': 'program string',
}
...}
o4 in the question’s local value is an identifier for blanket. English translations of these identifiers can be found here.
Splitting test set by categories
These question attributes are described in more detail in Section 3.2 of the paper.
1. Reasoning
The reasoning categories are included in the list under 'global'. If a question has multiple reasoning categories, it should be included in the accuracy for all categories.
2. Semantic
Split the test set by the values of 'semantic'.
3. Structural
Split the test set by the values of 'structural'.
4. Binary vs open
Split the test set by the values of 'ans_type'.
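The four splits above can be sketched with one helper. This is a minimal sketch, not the authors' evaluation code: the function name `accuracy_by` and the `predictions` dictionary (question id to predicted answer string) are assumptions made for illustration. List-valued attributes such as 'global' count a question once in every category it lists, while single-valued attributes such as 'semantic', 'structural', and 'ans_type' count it once.

```python
from collections import defaultdict


def accuracy_by(questions, predictions, attribute):
    """Accuracy per value of `attribute`.

    questions:   dict of question_id -> question entry (format shown above)
    predictions: dict of question_id -> predicted answer (hypothetical input)
    """
    hits = defaultdict(list)
    for qid, q in questions.items():
        is_correct = predictions.get(qid) == q['answer']
        value = q[attribute]
        # 'global' holds a list of categories; other attributes hold one value.
        categories = value if isinstance(value, list) else [value]
        for category in categories:
            hits[category].append(is_correct)
    return {category: sum(h) / len(h) for category, h in hits.items()}
```

For example, `accuracy_by(test_questions, predictions, 'global')` would give one accuracy per reasoning category, and `accuracy_by(test_questions, predictions, 'ans_type')` would give binary versus open accuracy.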
Additional Metrics
We describe these metrics in more detail in Section 3.4.
1. Novel compositions metric
To run the Novel Compositions metric, train on questions without a novel composition ('novel_comp' == 0) and test on questions with novel compositions ('novel_comp' == 1).
For a more detailed analysis (Table 4):
Sequencing compositions: 'nc_seq' == 1
Superlative compositions: 'nc_sup' == 1
Duration compositions: 'nc_dur' == 1
Object-Relationship compositions: 'nc_objrel' == 1
2. Indirect References metric
To use the Indirect References metric, train and test on all questions.
Split by reference type
Indirect object reference: 'i_obj' == 1
Indirect relationship reference: 'i_rel' == 1
Indirect action reference: 'i_act' == 1
Temporal localization phrase: 'i_temp' == 1
To calculate Recall scores
The Recall scores are the accuracy of each reference category.
To calculate Precision scores
Take the subset of questions that:
- contain an indirect reference: 'indirect' == 1, AND
- have a correctly answered equivalent question: 'direct_equiv' == 'direct_equiv_question_id', AND 'direct_equiv_question_id' was answered correctly.
If no such equivalent question is included in the dataset, 'direct_equiv' == None.
The Precision scores are the accuracy on each of these subsets. Split this subset by reference type for a more detailed representation.
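The Precision computation can be sketched as follows. This is an illustrative reading of the description above rather than the official scoring script: the function name `indirect_precision` and the `predictions` dictionary (question id to predicted answer) are assumptions, and the sketch assumes that the direct equivalent question appears in the same dictionary as the indirect one.

```python
REFERENCE_FLAGS = ['i_obj', 'i_rel', 'i_act', 'i_temp']


def indirect_precision(test_questions, predictions):
    """Accuracy on indirect questions whose direct equivalent was answered correctly."""

    def is_correct(qid):
        return (qid in test_questions
                and predictions.get(qid) == test_questions[qid]['answer'])

    # Keep indirect questions that have a correctly answered direct equivalent.
    subset = [
        qid for qid, q in test_questions.items()
        if q['indirect'] == 1
        and q['direct_equiv'] is not None
        and is_correct(q['direct_equiv'])
    ]

    def accuracy(ids):
        # None marks an empty subset, for which accuracy is undefined.
        return sum(is_correct(i) for i in ids) / len(ids) if ids else None

    scores = {'overall': accuracy(subset)}
    for flag in REFERENCE_FLAGS:
        scores[flag] = accuracy([i for i in subset if test_questions[i][flag] == 1])
    return scores
```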
3. Compositional Steps metric
To run the More Compositional Steps metric, train on questions in which 'more_steps' == 0 and test on questions in which 'more_steps' == 1.
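Both the Novel Compositions and the More Compositional Steps metrics partition questions on a 0/1 flag. A minimal sketch of that partition, assuming the questions are already loaded as a dictionary in the format shown earlier (the helper name `split_by_flag` is hypothetical):

```python
def split_by_flag(questions, flag):
    """Partition questions into (flag == 0, flag == 1) dictionaries."""
    zeros = {qid: q for qid, q in questions.items() if q[flag] == 0}
    ones = {qid: q for qid, q in questions.items() if q[flag] == 1}
    return zeros, ones
```

Under this sketch, one would train on the first element of `split_by_flag(train_questions, 'more_steps')` and evaluate on the second element of `split_by_flag(test_questions, 'more_steps')`. The same call with 'novel_comp' covers the Novel Compositions metric.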
Program formatting
Each question has a program of interlocking functions that break down the reasoning steps needed to answer the question. Note that Localize(‘between’ []) takes two action arguments. We specify the program names below. An ‘item’ can refer to any type of data structure.
Function | Input | Output |
---|---|---|
AND | bool 1, bool 2 | True if bool 1 AND bool 2, else False |
Choose | item 1, item 2, [items] | item iff item 1 OR item 2 exists in [items] |
Compare | [items], bool func | The item that returns True from function |
Equals | item 1, item 2 | boolean |
Exists | item, [items] | boolean |
Filter | dict, [keys] | mapping of [keys] to dictionary (dict[keys[0]][keys[1]]…) |
HasItem | [items] | True if length of list > 0, else False |
Iterate | [items], function | list with function mapped to each item in [items] |
IterateUntil | ‘forward/backward’, [items], bool func, secondary func | output of secondary function on the first item to satisfy the bool func when iterating forwards or backwards |
Localize | temporal phrase, action | [frames] |
OnlyItem | [item] | item in list, if length of list is 1 |
Query | attribute, item | value of attribute within item |
Superlative | ‘min/max’, [items], func | item with min or max value when func applied |
XOR | bool 1, bool 2 | True if bool 1 XOR bool 2, else False |
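To make the table concrete, here is one possible Python reading of a few of these functions. These are illustrative interpretations of the Input and Output columns, not the authors' implementation. The lowercase names, the error handling, and returning None when no item matches are assumptions.

```python
def has_item(items):
    """HasItem: True if the list is non-empty."""
    return len(items) > 0


def only_item(items):
    """OnlyItem: the single item of a one-element list."""
    if len(items) != 1:
        raise ValueError('expected exactly one item')
    return items[0]


def iterate(items, func):
    """Iterate: apply func to every item."""
    return [func(item) for item in items]


def iterate_until(direction, items, bool_func, secondary_func):
    """IterateUntil: secondary_func of the first item that satisfies bool_func."""
    if direction not in ('forward', 'backward'):
        raise ValueError("direction must be 'forward' or 'backward'")
    ordered = items if direction == 'forward' else reversed(items)
    for item in ordered:
        if bool_func(item):
            return secondary_func(item)
    return None  # assumption: no match yields None


def superlative(mode, items, func):
    """Superlative: the item with the min or max value of func."""
    if mode not in ('min', 'max'):
        raise ValueError("mode must be 'min' or 'max'")
    chooser = min if mode == 'min' else max
    return chooser(items, key=func)


def xor(bool_1, bool_2):
    """XOR: True if exactly one argument is true."""
    return bool(bool_1) != bool(bool_2)
```

With a made-up list of actions ordered in time, `iterate_until('backward', actions, is_long, get_id)` scans from the last action toward the first, which is the direction that a phrase such as ‘the last thing they were [relationship]’ calls for.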
Scene graph grounding
Each question associates parts of the question to nodes in the scene graph. This grounding maps keys of character indices to lists of node indices. The character indices refer to the ‘start-end’ characters of the phrase. These node indices refer to items in the scene graph.
- We only include the relevant node indices (e.g. if the question asks ‘Did they contact a cup after walking through a doorway?’, the reference to ‘a cup’ only includes vertices with that cup during frames after they walked through a doorway. If the list is empty, there are no such vertices so the answer is ‘No’). Relevancy is determined by the highest level in the program.
- If the keys are the same number “X-X”, then they refer to all the frames in the video. Some questions do not have a phrase for temporal localization (e.g. “Do they contact a cup”), so the reference to relevant frames has the same start and end index.
- Some scene graphs currently have negative values. This is a bug with one template and one indirect reference. We are fixing it now, and will re-upload shortly.
As an example, the question “Does someone contact a paper before drinking from a cup?” may have the following scene graph grounding:
{
# refers to the nodes for 'a paper'. 'o23' is the idx reference for 'a paper' (see IDX.pkl)
"21-28": [ "o23/000083", "o23/000084" ... "o23/000813", "o23/000826"],
# refers to the node for 'drinking from a cup'. 'c106' is the idx reference for 'drinking from a cup' (see IDX.pkl)
"36-55": ["c106/1"],
# refers to the frames localized by the phrase 'before drinking from a cup'
"29-55": ["000083", "000084" ... "000869", "000900"]
}
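The character-index keys in the example above can be mapped back to the phrases they ground with ordinary string slicing. This is a small sketch that assumes keys are "start-end" strings with non-negative indices and an exclusive end, which matches the example. The helper name `grounded_phrases` is hypothetical.

```python
def grounded_phrases(question, sg_grounding):
    """Map each 'start-end' key to (phrase, vertices).

    Assumes non-negative indices, so the negative values mentioned
    above are not handled. A key of the form 'X-X' yields an empty phrase.
    """
    phrases = {}
    for key, vertices in sg_grounding.items():
        start, end = (int(part) for part in key.split('-'))
        phrases[key] = (question[start:end], vertices)
    return phrases


question = 'Does someone contact a paper before drinking from a cup?'
grounding = {
    '21-28': ['o23/000083', 'o23/000084'],
    '36-55': ['c106/1'],
    '29-55': ['000083', '000084'],
}
for key, (phrase, vertices) in grounded_phrases(question, grounding).items():
    print(key, repr(phrase), len(vertices))
```

The vertex lists here are shortened versions of the ones in the example. The full lists are elided in the example as well.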
Using Scene Graphs
The scene graph files map video ids to scene graph dictionaries. Each scene graph maps vertex ids to vertex information.
Each vertex contains information about the other vertices to which it is connected. The type of information included depends on vertex type.
{
'frameid': {
'id': '000105'
'type': 'frame'
'secs': 6.3
'objects': [connecting object nodes]
'attention': [connecting attention nodes]
'contact': [connecting contact nodes]
'spatial': [connecting spatial nodes]
'verb': [connecting verb nodes]
'actions': [connecting action nodes]
'metadata': test / train
'next': next frame
'prev': previous frame
}
}
{
'actionid': {
'id': 'c076/1'
'charades': 'c076'
'phrase': 'holding a pillow'
'type': 'act'
'start': 14.56
'end': 17.5
'length': 2.94
'objects': [connecting object nodes]
'attention': [connecting attention nodes]
'contact': [connecting contact nodes]
'spatial': [connecting spatial nodes]
'verb': [connecting verb nodes]
'metadata': test / train
'all_f': [list of frame ids]
'subject': 'o9'
'verb': 'v25'
'next_discrete': next non-overlapping action
'prev_discrete': previous non-overlapping action
'next_instance': next c076 action
'prev_instance': previous c076 action
'while': [list of co-occurring action objects]
}
}
{
'objectid': {
'id': 'o4/000105'
'type': 'object'
'class': 'o4'
'attention': [connecting attention nodes]
'contact': [connecting contact nodes]
'spatial': [connecting spatial nodes]
'verb': [connecting verb nodes]
'visible': True if visible, False otherwise
'bbox': [list of bbox values from Action Genome]
'metadata': test / train
'frame_num': 000105
'secs': 6.3
'next': next o4 object
'prev': previous o4 object
}
}
{
'relationid': {
'id': 'r22/000209'
'type': 'contact'
'class': 'r22'
'objects': [connecting object nodes]
'metadata': test / train
'frame_num': 000209
'secs': 12.6
'next': next r22 relation
'prev': previous r22 relation
}
}
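Given the vertex layouts above, a scene graph can be queried with plain dictionary operations. This is a minimal sketch that assumes a single video's scene graph is a dictionary of vertex id to vertex dictionary and relies only on the 'type', 'start', and 'end' fields shown above. The helper names are hypothetical.

```python
def vertices_of_type(scene_graph, vertex_type):
    """Ids of all vertices whose 'type' matches (e.g. 'frame', 'act', 'object')."""
    return [vid for vid, v in scene_graph.items() if v.get('type') == vertex_type]


def actions_at(scene_graph, secs):
    """Ids of action vertices whose [start, end] interval contains `secs`."""
    return [
        vid for vid in vertices_of_type(scene_graph, 'act')
        if scene_graph[vid]['start'] <= secs <= scene_graph[vid]['end']
    ]
```

For instance, `actions_at(scene_graph, 15.0)` would include 'c076/1' for the action vertex shown above, since 15.0 lies between its 'start' of 14.56 and its 'end' of 17.5.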
{
'actionid': {
'id': 'c076/1'
'phrase': 'holding a pillow'
'start': 14.56
'end': 17.5
'metadata': test / train
'all_f': [list of frame ids]
}
}
{
'frameid': {
'id': '000105'
'attention': [connecting attention nodes]
'contact': [connecting contact nodes]
'spatial': [connecting spatial nodes]
'verb': [connecting verb nodes]
}
}
{
'relationid': {
'id': 'r22/000209'
'objects': [connecting object nodes]
}
}
{
'objectid': {
'id': 'o4/000105'
'bbox': [list of bbox values from Action Genome]
}
}
{
"video_id":xxx,
"frame_ids":xxx,
"graph_list":xxx,
"action_list":xxx
}