# ConvLab-2

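Tsinghua CoAI: https://github.com/thu-coai/Convlab-2

Focuses mainly on task-oriented dialogue.

Training: each task/model gets a data interface designed for it, and every dataset is preprocessed to fit that interface.
Evaluation: a unified interface is designed for each dataset; the format is a dict.

Overall the design is rather messy: the code contains if/else statements for selecting the dataset, which is partly because the project currently supports only 4 datasets.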

# ParlAI

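Facebook: https://github.com/facebookresearch/ParlAI

Aims to be a unified platform for building dialogue systems.

Each dataset is called a task. A task that follows the provided interface can be used with every model. Adding a task takes three steps:
- Implement build.py to download and build any needed data
- Implement agents.py, with at least a DefaultTeacher which extends Teacher or one of its children
- Add the task to the task list

Each example is wrapped in a Message and passed between the agents and the environment.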

## build.py
Implement the build method:
- download and store the data
- optional: preprocess and split into train/dev/test
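
The build steps above can be sketched without ParlAI installed. Everything in this block is a hypothetical stand-in for illustration: `fetch_raw`, the `.built` marker file, the task name, and the 80/10/10 split are assumptions, not ParlAI's API. ParlAI provides its own download and bookkeeping helpers in `parlai.core.build_data`.

```python
import os


def fetch_raw():
    """Hypothetical stand-in for a download step: returns raw examples."""
    return [f"question {i}\tanswer {i}" for i in range(10)]


def build(datapath, task="toy_task", version="v1.0"):
    """Download, store, and split a dataset once; later calls are no-ops."""
    dpath = os.path.join(datapath, task)
    marker = os.path.join(dpath, ".built")

    # Skip the work if this version was already built.
    if os.path.isfile(marker):
        with open(marker, encoding="utf-8") as f:
            if f.read().strip() == version:
                return dpath

    os.makedirs(dpath, exist_ok=True)
    rows = fetch_raw()

    # Optional step: split into train/dev/test (80/10/10 is an assumption).
    n_train = int(len(rows) * 0.8)
    n_dev = int(len(rows) * 0.1)
    splits = {
        "train": rows[:n_train],
        "dev": rows[n_train:n_train + n_dev],
        "test": rows[n_train + n_dev:],
    }
    for name, part in splits.items():
        path = os.path.join(dpath, f"{name}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(part) + "\n")

    # Record the version so the build is not repeated.
    with open(marker, "w", encoding="utf-8") as f:
        f.write(version)
    return dpath
```

Calling `build` twice with the same data path does the work only on the first call; the marker file makes the second call return immediately.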

## agents.py
Implement a DefaultTeacher.

### Text files
- `ParlAIDialogTeacher(DialogTeacher)`
   - data in the ParlAI Dialog format
   - example:

```
text:Sam went to the kitchen. Pat gave Sam the milk. Where is the milk? labels:kitchen reward:1 label_candidates:hallway|kitchen|bathroom
text:Sam went to the hallway. Pat went to the bathroom. Where is the milk? labels:hallway reward:1 label_candidates:hallway|kitchen|bathroom episode_done:True
```

   - each line is a set of key:value pairs, separated by tabs (shown as spaces in the example)
   - supported fields: text (str), labels (list, written as a `|`-joined string), label_candidates (written as a `|`-joined string), episode_done (bool), and any other field you like, which is kept as plain text
- `DialogTeacher(FixedDialogTeacher)`
   - an iterable with each call returning a tuple in the form `((x, y, r, c, i), new_episode?)` 
      - supports query, label, reward, label candidates, image, and anything else (put it in a str or an iterable as your format requires; there is no limit)
      - `x` (str) is a query and possibly context
      - `y` (iter) is an iterable of label(s) for that query
      - `r` (str) is the str reward for getting that query correct, optional
      - `c` (iter) is an iterable of label candidates that the student can choose from, optional
      - `i` (str) is a path to an image on disk, which the data class loads at request time. It should always point to the raw image file. Optional.
      - `new_episode?` (bool) specifies whether that example is the start of a new episode. If you don't use episodes, set this to `True` every time.
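
The two text formats above fit together: a ParlAI Dialog line can be parsed into fields and then emitted in the `((x, y, r, c, i), new_episode?)` shape that `DialogTeacher` expects. The sketch below is a minimal illustration under that assumption; `parse_line` and `iter_examples` are hypothetical helper names, not ParlAI functions, and the image slot is always `None` here.

```python
def parse_line(line):
    """Parse one tab-separated `key:value` line into a dict."""
    msg = {}
    for field in line.rstrip("\n").split("\t"):
        if not field:
            continue
        # Split on the first colon only, so the value may contain colons.
        key, _, value = field.partition(":")
        if key in ("labels", "label_candidates"):
            msg[key] = value.split("|")   # lists are |-joined in the file
        elif key == "episode_done":
            msg[key] = value == "True"
        else:
            msg[key] = value              # text, reward, extra fields stay str
    return msg


def iter_examples(lines):
    """Yield ((x, y, r, c, i), new_episode) tuples from ParlAI Dialog lines."""
    new_episode = True
    for line in lines:
        m = parse_line(line)
        example = (
            m["text"],
            m.get("labels"),
            m.get("reward"),
            m.get("label_candidates"),
            None,  # no image in text-only data
        )
        yield example, new_episode
        # The next line starts a new episode only if this one ended the episode.
        new_episode = m.get("episode_done", False)


LINES = [
    "text:Sam went to the kitchen. Pat gave Sam the milk. Where is the milk?"
    "\tlabels:kitchen\treward:1\tlabel_candidates:hallway|kitchen|bathroom",
    "text:Sam went to the hallway. Pat went to the bathroom. Where is the milk?"
    "\tlabels:hallway\treward:1\tlabel_candidates:hallway|kitchen|bathroom"
    "\tepisode_done:True",
]
examples = list(iter_examples(LINES))
```

Here the first example opens an episode and the second continues it, because only the second line carries `episode_done:True`.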
### Json
`ConversationTeacher(DialogTeacher)` 
- jsonl
```json
{
  'possible_conversation_level_info': True,
  'dialog':
  [   
    [
      {
        'id': 'speaker_1',
        'text': <first utterance>,
      },
      {
        'id': 'speaker_2',
        'text': <second utterance>,
      },
      ...
    ],
    ...
  ]
    ...
}
```

Only `id` and `text` are supported inside `dialog`.
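
As a rough illustration of consuming this format, the sketch below reads a JSONL string and pairs consecutive utterances into (text, label) turns. The pairing rule and the name `iter_turn_pairs` are assumptions made for the example, not a description of how `ConversationTeacher` is implemented. Note that a real file must contain valid JSON (double quotes, lowercase `true`), unlike the schematic above.

```python
import json

# One conversation per line; the utterances are made-up sample data.
JSONL = "\n".join([
    json.dumps({"dialog": [[
        {"id": "speaker_1", "text": "Hi, how are you?"},
        {"id": "speaker_2", "text": "Fine, thanks. And you?"},
        {"id": "speaker_1", "text": "Doing well."},
        {"id": "speaker_2", "text": "Glad to hear it."},
    ]]}),
])


def iter_turn_pairs(jsonl_text):
    """Yield (text, label, new_episode) by pairing consecutive utterances."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        conversation = json.loads(line)
        for episode in conversation["dialog"]:
            # Pair utterance 0 with 1, 2 with 3, and so on;
            # an unpaired final utterance is dropped.
            for k in range(0, len(episode) - 1, 2):
                yield episode[k]["text"], episode[k + 1]["text"], k == 0


pairs = list(iter_turn_pairs(JSONL))
```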

### Others
- ChunkTeacher: for datasets that do not fit in memory
- From scratch: for other cases, such as non-fixed data

Additional dataset options can be specified through the command-line `-t` argument:
- `-t babi` sets up the DefaultTeacher in `parlai/tasks/babi/agents.py`.
- `-t babi:task1k` sets up the Task1kTeacher in the babi/agents.py file, which allows you to specify specific settings for certain tasks. For bAbI, this refers to the setting where there are only 1000 unique training examples per task.
- `-t babi:task1k:1` provides 1 as a parameter to Task1kTeacher, which is interpreted by the Task1kTeacher to mean "I want task 1" (as opposed to the 19 other bAbI tasks).
- `-t babi,squad` sets up the DefaultTeacher for both babi and squad. Any number of tasks can be chained together with commas to load up each one of them.
- `-t #qa` specifies the `qa` category, loading up all tasks with that category in the `parlai/tasks/task_list.py` file.
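
The `-t` grammar listed above (commas between tasks, colons between task, teacher, and parameters, `#` for a category) can be made concrete with a small parser. This is only an illustrative sketch: `parse_task_arg` and the dict layout it returns are hypothetical, and ParlAI's own loader resolves these strings to teacher classes in its own way.

```python
def parse_task_arg(arg):
    """Split a `-t` value into one spec per task or category."""
    specs = []
    for item in arg.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("#"):
            # `#qa` selects every task tagged with that category.
            specs.append({"category": item[1:]})
            continue
        task, *rest = item.split(":")
        specs.append({
            "task": task,
            # No teacher given means the task's DefaultTeacher.
            "teacher": rest[0] if rest else "default",
            # Anything after the teacher is passed to it as parameters.
            "params": rest[1:],
        })
    return specs


spec = parse_task_arg("babi:task1k:1,squad")
```

With this layout, `babi:task1k:1` becomes task `babi`, teacher `task1k`, and parameters `["1"]`, while `squad` falls back to the default teacher.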
## Message
Message is the primary medium for information flow in ParlAI: it carries the messages exchanged between agents and the environment.
It is a subclass of Python's dict that contains the actions of an agent (observable by other agents or the environment).
The primary function of the Message object is to ensure that agents do not unintentionally edit the fields within observations and actions. To edit a field of a Message object, one must call `message.force_set(key, new_value)`.
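
The guarding behaviour described above can be illustrated with a small dict subclass. `GuardedMessage` is a hypothetical stand-in written for this note, assuming that overwriting an existing key through normal assignment is what gets rejected; it is not ParlAI's `Message` class, although it borrows the `force_set` name from it.

```python
class GuardedMessage(dict):
    """A dict that refuses to silently overwrite an existing key."""

    def __setitem__(self, key, value):
        if key in self:
            raise RuntimeError(
                f"Message already contains key {key!r}; "
                "use force_set(key, value) to overwrite it on purpose."
            )
        super().__setitem__(key, value)

    def force_set(self, key, value):
        """Overwrite a key deliberately, bypassing the guard."""
        super().__setitem__(key, value)


msg = GuardedMessage({"text": "Where is the milk?"})
msg["labels"] = ["kitchen"]            # adding a new key is allowed
msg.force_set("text", "Where is Sam?")  # explicit overwrite is allowed
```

A plain `msg["text"] = ...` on the existing key would raise `RuntimeError` in this sketch, which is the accidental edit the guard is meant to catch.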
