This page describes the different types of training data that go into a Rasa assistant and how this training data is structured.

1. Overview

Rasa Open Source uses YAML as a unified and extendable way to manage all training data, including NLU data, stories and rules.

You can split the training data over any number of YAML files, and each file can contain any combination of NLU data, stories, and rules. The training data parser determines the training data type using top level keys.

The domain uses the same YAML format as the training data and can also be split across multiple files or combined in one file. The domain includes the definitions for responses and forms. See the documentation for the domain for information on how to format your domain file.

:::info

💡 LEGACY FORMATS

Looking for Rasa Open Source 1.x data formats? They are now deprecated, but you can still find the documentation for Markdown NLU data and Markdown stories. :::

1.1 High-Level Structure

Each file can contain one or more of the following keys, each holding the corresponding type of training data. A file can combine multiple keys, but each key may appear only once per file. The available keys are:

  • version
  • nlu
  • stories
  • rules

You should specify the version key in all YAML training data files. If you don’t specify a version key in a training data file, Rasa assumes you are using the latest training data format specification supported by your installed version of Rasa Open Source. Training data files with a version greater than the latest version your installation supports will be skipped. Currently, the latest training data format specification for Rasa 2.x is 2.0.

1.2 Example

Here’s a short example which keeps all training data in a single file:

```yaml
version: "2.0"

nlu:
- intent: greet
  examples: |
    - Hey
    - Hi
    - hey there [Sara](name)

- intent: faq/language
  examples: |
    - What language do you speak?
    - Do you only handle english?

stories:
- story: greet and faq
  steps:
  - intent: greet
  - action: utter_greet
  - intent: faq
  - action: utter_faq

rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet
```

To specify your test stories, you need to put them into a separate file:

:::success 📑 tests/test_stories.yml :::

```yaml
stories:
- story: greet and ask language
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_greet
  - user: |
      what language do you speak
    intent: faq/language
  - action: utter_faq
```

Test stories use the same format as the story training data and should be placed in a separate file with the prefix test_.

:::info

💡 THE | SYMBOL

As shown in the above examples, the user and examples keys are followed by the | (pipe) symbol. In YAML, | denotes a multi-line string with preserved indentation. This keeps special symbols like ", ' and others available in the training examples. :::
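As an illustration of why this matters, the | block scalar lets training examples contain quote characters without any escaping (the intent name ask_quote and these examples are hypothetical):

```yaml
nlu:
- intent: ask_quote
  examples: |
    - what does "APR" stand for?
    - tell me about the 'premium' plan
```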

2. NLU Training Data

NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user’s message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly.

NLU training data is defined under the nlu key. Items that can be added under this key are:

  • Training examples grouped by user intent, optionally with annotated entities:

```yaml
nlu:
- intent: check_balance
  examples: |
    - What's my credit balance?
    - What's the balance on my [credit card account]{"entity": "account", "value": "credit"}
```

  • Synonyms:

```yaml
nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account
```

  • Regular expressions:

```yaml
nlu:
- regex: account_number
  examples: |
    - \d{10,12}
```

  • Lookup tables:

```yaml
nlu:
- lookup: banks
  examples: |
    - JPMC
    - Comerica
    - Bank of America
```

2.1 Training Examples

Training examples are grouped by intent and listed under the examples key. Usually, you’ll list one example per line as follows:

```yaml
nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
```

However, it’s also possible to use an extended format if you have a custom NLU component and need metadata for your examples:

```yaml
nlu:
- intent: greet
  examples:
  - text: |
      hi
    metadata:
      sentiment: neutral
  - text: |
      hey there!
```

The metadata key can contain arbitrary key-value data that is tied to an example and accessible by the components in the NLU pipeline. In the example above, the sentiment metadata could be used by a custom component in the pipeline for sentiment analysis.

You can also specify this metadata at the intent level:

```yaml
nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples:
  - text: |
      hi
  - text: |
      hey there!
```

In this case, the content of the metadata key is passed to every intent example.

If you want to specify retrieval intents, then your NLU examples will look as follows:

```yaml
nlu:
- intent: chitchat/ask_name
  examples: |
    - What is your name?
    - May I know your name?
    - What do people call you?
    - Do you have a name for yourself?

- intent: chitchat/ask_weather
  examples: |
    - What's the weather like today?
    - Does it look sunny outside today?
    - Oh, do you mind checking the weather for me please?
    - I like sunny days in Berlin.
```

All retrieval intents have a suffix added to them which identifies a particular response key for your assistant. In the above example, ask_name and ask_weather are the suffixes. The suffix is separated from the retrieval intent name by a / delimiter.

:::info

💡 SPECIAL MEANING OF /

As shown in the above examples, the / symbol is reserved as a delimiter to separate retrieval intents from their associated response keys. Make sure not to use it in the name of your intents. :::

2.2 Entities

Entities are structured pieces of information that can be extracted from a user’s message.

Entities are annotated in training examples with the entity’s name. In addition to the entity name, you can annotate an entity with synonyms, roles, or groups.

In training examples, entity annotation would look like this:

```yaml
nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [savings](account) account
    - how much money is in my [checking]{"entity": "account"} account
    - What's the balance on my [credit card account]{"entity": "account", "value": "credit"}
```

The full possible syntax for annotating an entity is:

```yaml
[<entity-text>]{"entity": "<entity name>", "role": "<role name>", "group": "<group name>", "value": "<entity synonym>"}
```

The keywords role, group, and value are optional in this notation. The value field refers to synonyms. To understand what the labels role and group are for, see the section on entity roles and groups.
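For illustration, roles can distinguish two entities of the same type in a single message (the intent name book_flight and the city entity below are hypothetical):

```yaml
nlu:
- intent: book_flight
  examples: |
    - book a flight from [Berlin]{"entity": "city", "role": "departure"} to [Rome]{"entity": "city", "role": "destination"}
```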

2.3 Synonyms

Synonyms normalize your training data by mapping an extracted entity to a value other than the literal text extracted. You can define synonyms using the format:

```yaml
nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account
```

You can also define synonyms in-line in your training examples by specifying the value of the entity:

```yaml
nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [credit card account]{"entity": "account", "value": "credit"}
    - how much do I owe on my [credit account]{"entity": "account", "value": "credit"}
```

Read more about synonyms on the NLU Training Data page.

2.4 Regular Expressions

You can use regular expressions to improve intent classification and entity extraction using the RegexFeaturizer and RegexEntityExtractor components.

The format for defining a regular expression is as follows:

```yaml
nlu:
- regex: account_number
  examples: |
    - \d{10,12}
```

Here account_number is the name of the regular expression. When used as features for the RegexFeaturizer, the name of the regular expression does not matter. When using the RegexEntityExtractor, the name of the regular expression should match the name of the entity you want to extract.
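To put these components to use, your model configuration needs to list them in the pipeline. A minimal sketch of the relevant part of config.yml (the rest of the pipeline is omitted; a working pipeline contains additional components such as a tokenizer and an intent classifier):

```yaml
pipeline:
- name: RegexFeaturizer
- name: RegexEntityExtractor
```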

Read more about when and how to use regular expressions with each component on the NLU Training Data page.

2.5 Lookup Tables

Lookup tables are lists of words used to generate case-insensitive regular expression patterns. The format is as follows:

```yaml
nlu:
- lookup: banks
  examples: |
    - JPMC
    - Bank of America
```

When you supply a lookup table in your training data, the contents of that table are combined into one large regular expression. This regex is used to check each training example to see if it contains matches for entries in the lookup table.
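Conceptually, the combination works like this (a simplified sketch for illustration, not Rasa's actual implementation):

```python
import re

# Entries of the "banks" lookup table, as in the example above.
lookup_entries = ["JPMC", "Bank of America"]

# Combine all entries into one alternation pattern. Each entry is
# escaped so it matches literally, and matching is case-insensitive.
pattern = re.compile(
    "|".join(re.escape(entry) for entry in lookup_entries),
    flags=re.IGNORECASE,
)

# The combined regex is then checked against each message/example.
message = "how do I open an account with bank of america?"
match = pattern.search(message)
print(match.group(0))  # -> "bank of america"
```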

Lookup table regexes are processed identically to the regular expressions directly specified in the training data and can be used either with the RegexFeaturizer or with the RegexEntityExtractor. The name of the lookup table is subject to the same constraints as the name of a regex feature.

Read more about using lookup tables on the NLU Training Data page.

3. Conversation Training Data

Stories and rules are both representations of conversations between a user and a conversational assistant. They are used to train the dialogue management model. Stories are used to train a machine learning model to identify patterns in conversations and generalize to unseen conversation paths. Rules describe small pieces of conversations that should always follow the same path and are used to train the RulePolicy.

3.1 Stories

Stories are composed of:

  • story: The story’s name. The name is arbitrary and not used in training; you can use it as a human-readable reference for the story.
  • metadata: Arbitrary and optional data that is not used in training; you can use it to store relevant information about the story, such as the author
  • a list of steps: The user messages and actions that make up the story

For example: