Requirement for Translation

The human translation output is a literal, word-for-word translation of the human transcription. We need translations from Chinese to English and from Chinese to Russian.

Requirement for Annotation

General Requirement

The human translation output carries the following annotations:
- Key phrases: product name, product feature, product brand, product origin, coupon, shipping date, shipping location, promotion info.
- Disfluency: unintelligible utterances, repeated words/phrases, incomplete words.

Disfluency and key-phrase annotations are required for both the source and target languages.

Input data

Taobao evaluation clips are provided by the ASR team in srbp.ts format (with timestamps).

See the attachments for examples.

Annotation guidelines

Key-phrases:

Keyphrases are the important words or phrases in a sentence. For AE live shows we are interested in the following keyphrase types:

  • Product name: iPhone 11 Pro
  • Brand/Company name: Lancome, Chanel, Apple
  • Quantity:
    + Money/price: 幸运的朋友获得外援免单也就意味着你买五千块钱的东西直接年五千块六千免除九九年期间九千、一万免一万。但是你到了一万一千块钱的话,你自己需要租。
      Money quantities such as 5000, 6000, 9000, and 10000 yuan are keyphrases.
    + Number: "we only have 20 items available"
      20 is a keyphrase.
  • Events: Double Eleven, Black Friday, Boxing Day, Spring Festival, …

  • Coupons: ten-dollar coupon, 25% off
  • Date and time
  • Shipping information

Example: 这个是在十五号链接的,cg的纯羊毛的帽子帽子怎么拍了一张十块钱优惠券,到手价一百三十一一百三十一、纯羊毛的三幺幺三幺、二、一百三十一块钱。
Expected keyphrases:
十五号链接 (link No. 15)
cg的纯羊毛的帽子 (pure wool hat from cg)
十块钱优惠券 (10-yuan coupon)
到手价一百三十一 (final price of 131 yuan)

We only need a single tag, KEY_PHRASE (written KEYP in the BIO files), to represent all types of key phrases; the individual type is not annotated.
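To illustrate how a list of expected keyphrases maps onto a single tag, here is a minimal sketch that labels a token sequence in BIO style. It assumes character-level tokens for Chinese text and uses the KEYP tag name from the acceptance criteria; the function name `tag_key_phrases` is hypothetical and not part of any required tooling.

```python
def tag_key_phrases(tokens, phrases, tag="KEYP"):
    """Return one BIO label per token, marking every occurrence of each phrase.

    tokens  -- list of tokens (here: single characters, an assumption)
    phrases -- list of key phrases, each given as a list of tokens
    """
    labels = ["O"] * len(tokens)
    for phrase in phrases:
        n = len(phrase)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            # Skip matches that overlap a span that is already tagged.
            untouched = all(label == "O" for label in labels[start:start + n])
            if window == phrase and untouched:
                labels[start] = f"B-{tag}"
                for i in range(start + 1, start + n):
                    labels[i] = f"I-{tag}"
    return labels


# A fragment of the example sentence above, split into characters.
tokens = list("拍了一张十块钱优惠券")
labels = tag_key_phrases(tokens, [list("十块钱优惠券")])
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The phrase type (coupon, price, and so on) never appears in the output, which is the point of using one tag for all keyphrases.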

Disfluency: We follow the “Disfluency Annotation Stylebook for the Switchboard Corpus” from the Switchboard corpus guidelines.

Acceptance Criteria

We prefer BIO tagging in the CoNLL plain-text format. BIO stands for Beginning/Inside/Outside of a key phrase or disfluency.
We use two tags: KEYP for key phrases and DISFL for disfluencies.
The file format is tab-separated values, with one token and its tag per line. A blank line is required at the end of each sentence.
A special token is required between two segments. A segment is a paragraph in the human transcription file.
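The criteria above can be sketched as a small serializer. This is only an illustration under stated assumptions: the document does not name the special segment token, so `<SEG>` is a hypothetical placeholder, and placing it on its own line followed by a blank line is likewise an assumption. The function name `to_conll` is also hypothetical.

```python
def to_conll(segments, segment_token="<SEG>"):
    """Serialize annotated segments to tab-separated CoNLL-style text.

    segments      -- list of segments; each segment is a list of sentences;
                     each sentence is a list of (token, label) pairs
    segment_token -- placeholder separator between segments (assumed value)
    """
    out = []
    for index, segment in enumerate(segments):
        if index > 0:
            # Separator between two segments (layout is an assumption).
            out.append(f"{segment_token}\n\n")
        for sentence in segment:
            for token, label in sentence:
                out.append(f"{token}\t{label}\n")
            out.append("\n")  # blank line closes the sentence
    return "".join(out)


segments = [
    [[("we", "O"), ("love", "O"), ("Apple", "B-KEYP")]],
    [[("for", "B-DISFL"), ("for", "O"), ("everyone", "O")]],
]
print(to_conll(segments), end="")
```

Once the actual separator token is agreed on, only the `segment_token` argument needs to change.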

BIO Annotation Example

Example 1:

```
We B-DISFL
have I-DISFL
well I-DISFL
we O
love O
Apple B-KEYP
iPhone I-KEYP
10 I-KEYP
Pro I-KEYP
. O

```

Example 2:

```
FREE O
entry O
English B-KEYP
Wine I-KEYP
Festival I-KEYP
for B-DISFL
for O
everyone O
who O
buy O
the O
product O
in O
next O
10 B-KEYP
minutes I-KEYP
. O

```

Example 3:

```
今 O
天 O
我 O
们 O
直 O
播 O
的 O
产 O
品 O
是 O
iPhone B-KEYP
手 I-KEYP
机 I-KEYP
壳 I-KEYP
```
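A delivered file can be sanity-checked by reading the labels back into spans. The sketch below is one possible check, not a prescribed tool: it parses one sentence of token/label lines, rejects any I- label that does not continue a span of the same tag, and returns the tagged spans. The function names `parse_sentence` and `extract_spans` are hypothetical, and splitting on the last run of whitespace is an assumption so that both tab- and space-separated lines are accepted.

```python
def parse_sentence(text):
    """Turn 'token<whitespace>label' lines into (token, label) pairs."""
    rows = []
    for line in text.strip().splitlines():
        token, label = line.rsplit(None, 1)  # split on the last whitespace run
        rows.append((token, label))
    return rows


def extract_spans(rows):
    """Return [(tag, [tokens...]), ...] and raise ValueError on invalid BIO."""
    spans = []
    current = None  # the span being extended, if any
    for index, (token, label) in enumerate(rows):
        if label == "O":
            current = None
        elif label.startswith("B-"):
            current = (label[2:], [token])
            spans.append(current)
        elif label.startswith("I-"):
            if current is None or current[0] != label[2:]:
                raise ValueError(f"row {index}: {label} does not continue a span")
            current[1].append(token)
        else:
            raise ValueError(f"row {index}: unknown label {label!r}")
    return spans


example_1 = """We B-DISFL
have I-DISFL
well I-DISFL
we O
love O
Apple B-KEYP
iPhone I-KEYP
10 I-KEYP
Pro I-KEYP
. O"""
for tag, tokens in extract_spans(parse_sentence(example_1)):
    print(tag, " ".join(tokens))
```

Run on Example 1, this yields one DISFL span and one KEYP span; a stray I- label with no preceding B- raises an error instead of being silently accepted.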