corpus
Kashgari provides several build-in corpus for testing.
Chinese Daily Ner Corpus
Chinese Ner corpus cotains 20864 train samples, 4636 test samples and 2318 valid samples.
Usage:
from kashgari.corpus import ChineseDailyNerCorpustrain_x, train_y = ChineseDailyNerCorpus.load_data('train')test_x, test_y = ChineseDailyNerCorpus.load_data('test')valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
Data Sample:
>>> x[0]['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']>>> y[0]['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
SMP2018 ECDT Human-Computer Dialogue Classification Corpus
https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/
This dataset is released by the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) task 1 and is provided by the iFLYTEK Corporation, which is a Chinese human-computer dialogue dataset.
label query0 weather 今天东莞天气如何1 map 从观音桥到重庆市图书馆怎么走2 cookbook 鸭蛋怎么腌?3 health 怎么治疗牛皮癣4 chat 唠什么
Usage:
from kashgari.corpus import SMP2018ECDTCorpustrain_x, train_y = SMP2018ECDTCorpus.load_data('train')test_x, test_y = SMP2018ECDTCorpus.load_data('test')valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')# Change cutter to jieba, need to install jieba firsttrain_x, train_y = SMP2018ECDTCorpus.load_data('train', cutter='jieba')test_x, test_y = SMP2018ECDTCorpus.load_data('test', cutter='jieba')valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid', cutter='jieba')
Data Sample:
# char cutted>>> x[0][['给', '周', '玉', '发', '短', '信']]>>> y[0]['message']# jieba cutted>>> x[0][['给', '周玉', '发短信']]>>> y[0]['message']
