Tutorial

Preparation

First, install the [synonyms](https://github.com/huyingxi/Synonyms) library. synonyms supports many natural language understanding tasks: text alignment, recommendation, similarity computation, semantic shift, keyword extraction, concept extraction, automatic summarization, search engines, and more.

```bash
pip install synonyms
```

The first installation downloads a fairly large word-vector file, and the download may fail. The project page offers a workaround, but here I simply downloaded the file locally and moved it into the `synonyms/data` folder of the current Python environment (after which you will see the downloaded vocab.txt in that folder) to finish the installation.
Gitee mirror for the local download: word-vector file Gitee link
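If you are unsure where the `synonyms/data` folder lives in your environment, a minimal sketch like the one below can print it (assuming a standard pip installation; it deliberately avoids importing synonyms, which would otherwise try to load the still-missing vectors):

```python
import importlib.util
import os

# Locate the installed synonyms package without importing it
spec = importlib.util.find_spec("synonyms")
package_dir = os.path.dirname(spec.origin)      # e.g. .../site-packages/synonyms
print(os.path.join(package_dir, "data"))        # move the downloaded vocab file into this folder
```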
The synonyms word vectors were upgraded after V3.12.0 and now cover 400k+ entries, so loading them takes some time and a fair amount of memory.
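Once the vectors are in place, a quick smoke test shows what the library returns; `synonyms.nearby` yields a pair of lists, the candidate words and their similarity scores (the query word here is arbitrary):

```python
import synonyms

# nearby() returns ([candidate words], [similarity scores])
candidate_words, scores = synonyms.nearby("搜救")
print(candidate_words[:5])
print(scores[:5])
```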

Code

Loading the HIT stopword list

```python
import jieba
import synonyms
import random
from random import shuffle

random.seed(2019)

# Stopword list; the HIT (Harbin Institute of Technology) stopword list is used by default
f = open('./hit_stopwords.txt', 'r', encoding='utf-8')
stop_words = list()
for stop_word in f.readlines():
    stop_words.append(stop_word[:-1])  # strip the trailing newline
```
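As a quick, purely illustrative check, you can tokenize a short phrase with jieba and filter it against the loaded stopword list; this is the same filtering that `synonym_replacement` below uses to decide which tokens may be replaced:

```python
# Tokenize with jieba and drop stopwords (illustrative check)
tokens = jieba.lcut("经过千余人连续多日紧张搜救")
print([t for t in tokens if t not in stop_words])
```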

Random synonym replacement

Function: replace n words in a sentence with their synonyms
Input: words is the tokenized word list; n controls how many words to replace
Output: a new tokenized word list

```python
# Random synonym replacement
def synonym_replacement(words, n):
    new_words = words.copy()
    # Candidate words: unique tokens that are not stopwords
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)  # local list; shadows the module name only inside this function
        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')
    return new_words

# Get synonyms of a word from the synonyms package
def get_synonyms(word):
    return synonyms.nearby(word)[0]
```
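A minimal usage sketch (assuming the imports and stopword list above are loaded; the example phrase is arbitrary):

```python
# Replace up to 2 non-stopword tokens with synonyms
words = jieba.lcut("在云南哀牢山失联的工作人员被找到")
print(synonym_replacement(words, 2))
```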

Random insertion

Function: randomly insert n words into the sentence
Input: words is the tokenized word list; n controls how many words to insert
Output: a new tokenized word list

```python
# Random insertion
def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)
    return new_words

# Pick a random word, get one of its synonyms, and insert it at a random position
def add_word(new_words):
    synonyms = []
    counter = 0
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words) - 1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return  # give up after 10 attempts with no synonyms found
    random_synonym = random.choice(synonyms)
    random_idx = random.randint(0, len(new_words) - 1)
    new_words.insert(random_idx, random_synonym)
```
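A minimal usage sketch (same assumptions as above):

```python
# Insert 2 synonyms of randomly chosen tokens at random positions
words = jieba.lcut("经过千余人连续多日紧张搜救")
print(random_insertion(words, 2))
```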

Random swap

Function: randomly swap two words in the sentence, n times
Input: words is the tokenized word list; n is the number of swaps
Output: a new tokenized word list

```python
# Random swap
def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)
    return new_words

# Swap two randomly chosen positions
def swap_word(new_words):
    random_idx_1 = random.randint(0, len(new_words) - 1)
    random_idx_2 = random_idx_1
    counter = 0
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words) - 1)
        counter += 1
        if counter > 3:
            return new_words  # give up if a distinct second index is not found quickly
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words
```
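A minimal usage sketch:

```python
# Swap two randomly chosen positions, twice
words = jieba.lcut("4名工作人员被找到")
print(random_swap(words, 2))
```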

Random deletion

Function: delete each word in the sentence with probability p
Input: words is the tokenized word list; p is the deletion probability
Output: a new tokenized word list

```python
# Random deletion
def random_deletion(words, p):
    # If there is only one word, do not delete it
    if len(words) == 1:
        return words
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)
    # If everything was deleted, keep one random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return [words[rand_int]]
    return new_words
```
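A minimal usage sketch; note that a larger p shrinks the sentence more aggressively:

```python
# Delete each token independently with probability 0.2
words = jieba.lcut("但均已无生命体征,不幸遇难")
print(random_deletion(words, 0.2))
```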

EDA

The eda function combines the four operations above.
sentence: the input sentence, as a string (str)
alpha_sr: proportion of words to replace with synonyms, default 0.1
alpha_ri: proportion of words to randomly insert, default 0.1
alpha_rs: proportion of words to randomly swap, default 0.1
p_rd: probability of random deletion, default 0.1
num_aug: number of augmented sentences to generate, default 9
The returned list holds num_aug augmented sentences plus the segmented original sentence.

```python
# EDA function
def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9):
    seg_list = jieba.cut(sentence)
    seg_list = " ".join(seg_list)
    words = list(seg_list.split())
    num_words = len(words)
    augmented_sentences = []
    num_new_per_technique = int(num_aug / 4) + 1
    n_sr = max(1, int(alpha_sr * num_words))
    n_ri = max(1, int(alpha_ri * num_words))
    n_rs = max(1, int(alpha_rs * num_words))

    # Synonym replacement (SR)
    for _ in range(num_new_per_technique):
        a_words = synonym_replacement(words, n_sr)
        augmented_sentences.append(' '.join(a_words))
    # Random insertion (RI)
    for _ in range(num_new_per_technique):
        a_words = random_insertion(words, n_ri)
        augmented_sentences.append(' '.join(a_words))
    # Random swap (RS)
    for _ in range(num_new_per_technique):
        a_words = random_swap(words, n_rs)
        augmented_sentences.append(' '.join(a_words))
    # Random deletion (RD)
    for _ in range(num_new_per_technique):
        a_words = random_deletion(words, p_rd)
        augmented_sentences.append(' '.join(a_words))

    shuffle(augmented_sentences)
    # Trim (or subsample) down to num_aug augmented sentences
    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]
    # Append the segmented original sentence as well
    augmented_sentences.append(seg_list)
    return augmented_sentences
```

Practice

Take a recent breaking-news text as the example.
Input sentence: "11月22日上午,经过千余人连续多日紧张搜救,在云南哀牢山失联的4名中国地质调查局昆明自然资源综合调查中心工作人员被找到,但均已无生命体征,不幸遇难。"
Output: 10 sentences in total, the 9 augmented sentences plus the segmented original.

```python
if __name__ == '__main__':
    sentence = "11月22日上午,经过千余人连续多日紧张搜救,在云南哀牢山失联的4名中国地质调查局昆明自然资源综合调查中心工作人员被找到,但均已无生命体征,不幸遇难。"
    augmented_sentences = eda(sentence=sentence)
    print("原句:", sentence)
    for idx, aug_sentence in enumerate(augmented_sentences):
        print("增强句{}:{}".format(idx + 1, aug_sentence))
```

The results are shown in the figure below.
(image: augmented sentences printed by the script)

In addition, nlpcda is another handy data-augmentation tool for Chinese; nlpcda will be covered in a later post.
References:
[1] An implement of the paper of EDA for Chinese corpus (an EDA data-augmentation tool for Chinese corpora)
[2] EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks