ConvLab-2

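Tsinghua CoAI https://github.com/thu-coai/Convlab-2

Focuses mainly on task-oriented dialogue.

  • Training: a data interface is designed for each task/model, and all datasets are then preprocessed to fit that interface
  • Evaluation: a unified interface is designed for each dataset, with dict as the format

Overall the data handling is fairly messy: the code contains if/else statements tied to dataset selection, which is partly a consequence of the project currently supporting only 4 datasets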

ParlAI

Facebook https://github.com/facebookresearch/ParlAI

Aims to provide a unified platform for building dialogue systems

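  • A dataset is called a task; once it follows the provided interface, it can be used with any model. Adding a task takes three steps:
    • Implement build.py to download and build any needed data
    • Implement agents.py, with at least a DefaultTeacher which extends Teacher or one of its children
    • Add the task to the task list
  • Each example is wrapped in a Message and passed between the agent and the environment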

    build.py

  • Implement the build method:
    • download and store the data
    • optional: preprocess and split into train/dev/test
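
The build steps above can be sketched with the standard library alone. This is not ParlAI's own build helper API: `build`, `fetch_raw`, the `.built` marker file and the 80/10/10 split are all assumptions made for illustration, and the download is replaced by a stand-in that returns a few in-memory rows.

```python
import os
import random


def fetch_raw():
    """Hypothetical stand-in for the download step; returns raw examples."""
    return [f"question {i}\tanswer {i}" for i in range(10)]


def build(datapath, name="my_task", seed=0):
    """Store a dataset under datapath/name once, split into train/dev/test."""
    dpath = os.path.join(datapath, name)
    marker = os.path.join(dpath, ".built")  # hypothetical "already built" flag
    if os.path.exists(marker):
        return dpath  # already built, nothing to do

    os.makedirs(dpath, exist_ok=True)
    examples = fetch_raw()

    # Optional preprocessing: shuffle deterministically, then split 80/10/10.
    random.Random(seed).shuffle(examples)
    n_train = int(len(examples) * 0.8)
    n_dev = int(len(examples) * 0.1)
    splits = {
        "train": examples[:n_train],
        "dev": examples[n_train:n_train + n_dev],
        "test": examples[n_train + n_dev:],
    }
    for split, rows in splits.items():
        path = os.path.join(dpath, f"{split}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(rows) + "\n")

    # Mark the build as complete so later calls skip the work.
    with open(marker, "w", encoding="utf-8") as f:
        f.write("v1.0\n")
    return dpath
```

Checking a marker before doing any work keeps repeated calls cheap, which matters when every training run calls the build step first.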

    agents.py

    Implement a DefaultTeacher

    Text files

  • ParlAIDialogTeacher(DialogTeacher)

    • data in the format of ParlAI Dialog
    • example (one example per line):

```
text:Sam went to the kitchen. Pat gave Sam the milk. Where is the milk? labels:kitchen reward:1 label_candidates:hallway|kitchen|bathroom
text:Sam went to the hallway. Pat went to the bathroom. Where is the milk? labels:hallway reward:1 label_candidates:hallway|kitchen|bathroom episode_done:True
```

    • each field is a key:value pair; fields are separated by tabs
    • supported fields: text (str), labels (list, written in the file as a string joined by |), label_candidates (written in the file as a string joined by |), episode_done (bool), and any other field you like, as long as its value is plain text
  • `DialogTeacher(FixedDialogTeacher)`
    • an iterable with each call returning a tuple in the form `((x, y, r, c, i), new_episode?)`
    • supports query, label, reward, label candidates, image and anything else (you can store it as a str or an iterable according to your format, no limit)
    • `x` (str) is a query and possibly context
    • `y` (iter) is an iterable of label(s) for that query
    • `r` (str) is the str reward for getting that query correct, optional
    • `c` (iter) is an iterable of label candidates that the student can choose from, optional
    • `i` (str) is a str path to an image on disk, which will be loaded by the data class at request-time. It should always point to the raw image file, optional
    • `new_episode?` (bool) is a boolean value specifying whether that example is the start of a new episode. If you don't use episodes, set this to `True` every time.
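
The two text formats above can be tied together in a short sketch: parse tab-separated `key:value` lines and yield tuples in the `((x, y, r, c, i), new_episode?)` shape. The helper names `parse_line` and `iter_examples` are hypothetical, and this is a standalone illustration under those assumptions, not ParlAI's teacher implementation.

```python
def parse_line(line):
    """Split one tab-separated line of key:value fields into a dict."""
    fields = {}
    for item in line.rstrip("\n").split("\t"):
        if not item:
            continue
        key, _, value = item.partition(":")  # split on the first colon only
        fields[key] = value
    # labels and label_candidates are stored as |-joined strings in the file.
    for key in ("labels", "label_candidates"):
        if key in fields:
            fields[key] = fields[key].split("|")
    fields["episode_done"] = fields.get("episode_done") == "True"
    return fields


def iter_examples(lines):
    """Yield ((x, y, r, c, i), new_episode) tuples from ParlAI Dialog lines."""
    new_episode = True
    for line in lines:
        if not line.strip():
            continue
        msg = parse_line(line)
        example = (
            msg["text"],                  # x: query and possibly context
            msg.get("labels"),            # y: list of labels
            msg.get("reward"),            # r: reward kept as a str, optional
            msg.get("label_candidates"),  # c: candidates, optional
            None,                         # i: no image in this text-only sketch
        )
        yield example, new_episode
        # The next example starts a new episode only if this one ended it.
        new_episode = msg["episode_done"]
```

Here an episode ends only when a line carries `episode_done:True`, so consecutive lines without that field are treated as turns of the same episode.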
    Json

  • `ConversationTeacher(DialogTeacher)`
    • jsonl

```json
{
    'possible_conversation_level_info': True,
    'dialog':
        [
            [
                {
                    'id': 'speaker_1',
                    'text': <first utterance>,
                },
                {
                    'id': 'speaker_2',
                    'text': <second utterance>,
                },
                ...
            ],
            ...
        ]
    ...
}
```
    • only `id` and `text` are supported inside `dialog`
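
One plausible way to consume such jsonl records is to pair each utterance with the reply that follows it. The sketch below assumes valid JSON with double quotes, one record per line, and the nested `dialog` layout shown above; `iter_turn_pairs` is a hypothetical helper, not part of ParlAI.

```python
import json


def iter_turn_pairs(jsonl_lines):
    """Yield (query, label, new_episode) from jsonl conversation records."""
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # "dialog" is a list of conversations, each a list of {"id", "text"} turns.
        for turns in record["dialog"]:
            new_episode = True
            # Pair turn 0 with turn 1, turn 2 with turn 3, and so on.
            for first, second in zip(turns[0::2], turns[1::2]):
                yield first["text"], second["text"], new_episode
                new_episode = False
```

A trailing unpaired utterance is dropped by `zip`, which is a design choice of this sketch: a query without a reply has no label to learn from.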

    Others

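  • ChunkTeacher: for cases where the data does not fit in memory
  • From scratch: for other cases, such as non-fixed data
  • Extended options for a dataset can be specified through command-line arguments: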

    • `-t babi` sets up the DefaultTeacher in `parlai/tasks/babi/agents.py`.
    • `-t babi:task1k` sets up the Task1kTeacher in the babi/agents.py file, which allows you to specify specific settings for certain tasks. For bAbI, this refers to the setting where there are only 1000 unique training examples per task.
    • `-t babi:task1k:1` provides 1 as a parameter to Task1kTeacher, which is interpreted by the Task1kTeacher to mean "I want task 1" (as opposed to the 19 other bAbI tasks).
    • `-t babi,squad` sets up the DefaultTeacher for both babi and squad. Any number of tasks can be chained together with commas to load up each one of them.
    • `-t #qa` specifies the `qa` category, loading up all tasks with that category in the `parlai/tasks/task_list.py` file.
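
The structure of the `-t` value can be illustrated with a small parser. `parse_task_arg` and the dictionary keys it returns are hypothetical; ParlAI resolves these strings to teacher classes with its own logic, so this sketch only shows how the comma, colon and `#` separators divide the string.

```python
def parse_task_arg(task_arg):
    """Split a -t value into one spec dict per comma-separated entry."""
    specs = []
    for part in task_arg.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("#"):
            # "#qa" selects a whole category instead of a single task.
            specs.append({"category": part[1:]})
            continue
        # "babi:task1k:1" -> task "babi", teacher "task1k", params ["1"]
        task, *rest = part.split(":")
        specs.append({
            "task": task,
            "teacher": rest[0] if rest else "default",
            "params": rest[1:],
        })
    return specs
```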

      Message

      [Figure 1]
  • The primary medium for information flow (messages between agents and the environment) in ParlAI
  • A subclass of a Python dict containing the actions of an agent (observable by other agents or the environment)
  • The primary function of the Message object is to ensure that agents do not unintentionally edit the fields within observations and actions. To edit a field of a Message object, one must call `message.force_set(key, new_value)`.
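
The guarding behaviour described above can be sketched as a dict subclass that rejects overwrites unless `force_set` is called. `GuardedMessage` is a hypothetical re-implementation for illustration; the error type and wording are assumptions, and ParlAI's own Message class may differ in detail.

```python
class GuardedMessage(dict):
    """A dict that refuses to overwrite existing keys unless force_set is used."""

    def __setitem__(self, key, value):
        # Adding a new key is fine; silently replacing an existing one is not.
        if key in self:
            raise RuntimeError(
                f"key {key!r} is already set; "
                "use force_set(key, value) to overwrite it"
            )
        super().__setitem__(key, value)

    def force_set(self, key, value):
        """Overwrite a field deliberately, bypassing the guard."""
        super().__setitem__(key, value)


# Note: the dict constructor and update() do not go through __setitem__,
# so this sketch only guards item assignment such as msg["text"] = ...
msg = GuardedMessage({"text": "Where is the milk?"})
msg["labels"] = ["kitchen"]           # new key: allowed
msg.force_set("text", "Where is it?")  # explicit overwrite: allowed
```

Making the overwrite an explicit method call turns an accidental edit of a shared observation into an error at the point where it happens.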