corpus

Kashgari provides several built-in corpora for testing.

Chinese Daily Ner Corpus

The Chinese Daily NER corpus contains 20864 training samples, 4636 test samples and 2318 validation samples.

Usage:

  from kashgari.corpus import ChineseDailyNerCorpus
  train_x, train_y = ChineseDailyNerCorpus.load_data('train')
  test_x, test_y = ChineseDailyNerCorpus.load_data('test')
  valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')

Data Sample:

  >>> x[0]
  ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
  >>> y[0]
  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
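
The labels in the sample above follow the BIO scheme: `B-LOC` opens a location entity, `I-LOC` continues it, and `O` marks tokens outside any entity. As a minimal sketch of how such a tag sequence can be read back into entities, the helper below groups tagged characters into spans. `extract_entities` is a hypothetical name used for illustration and is not part of Kashgari.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration; not a Kashgari API.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                entities.append((''.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O' tag or an inconsistent 'I-' tag: close the open entity.
            if current_tokens:
                entities.append((''.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((''.join(current_tokens), current_type))
    return entities


tokens = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
print(extract_entities(tokens, tags))  # [('厦门', 'LOC')]
```

Tokens are joined without a separator because this corpus is tagged per Chinese character; a word-level corpus would need a space-joined variant.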

SMP2018 ECDT Human-Computer Dialogue Classification Corpus

https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/

This Chinese human-computer dialogue dataset was released for Task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by iFLYTEK Corporation.

     label     query
  0  weather   今天东莞天气如何
  1  map       从观音桥到重庆市图书馆怎么走
  2  cookbook  鸭蛋怎么腌?
  3  health    怎么治疗牛皮癣
  4  chat      唠什么

Usage:

  from kashgari.corpus import SMP2018ECDTCorpus
  train_x, train_y = SMP2018ECDTCorpus.load_data('train')
  test_x, test_y = SMP2018ECDTCorpus.load_data('test')
  valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
  # Switch the cutter to jieba (jieba must be installed first)
  train_x, train_y = SMP2018ECDTCorpus.load_data('train', cutter='jieba')
  test_x, test_y = SMP2018ECDTCorpus.load_data('test', cutter='jieba')
  valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid', cutter='jieba')

Data Sample:

  # Cut by character
  >>> x[0]
  [['给', '周', '玉', '发', '短', '信']]
  >>> y[0]
  ['message']
  # Cut with jieba
  >>> x[0]
  [['给', '周玉', '发短信']]
  >>> y[0]
  ['message']
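
To make the difference between the two samples concrete, the sketch below reproduces the character-level cut in plain Python. `char_cut` is a hypothetical name for illustration, and Kashgari's own cutter may be implemented differently. The jieba variant is not reproduced here because its output depends on the jieba dictionary.

```python
def char_cut(query):
    """Split a query into single characters (hypothetical helper, not a Kashgari API)."""
    return list(query)


print(char_cut('给周玉发短信'))  # ['给', '周', '玉', '发', '短', '信']
```

Character-level cutting needs no extra dependency and never produces out-of-vocabulary words, while a word-level cutter such as jieba yields shorter sequences made of larger units.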