1 Theory

(1) Data preprocessing

  • Text correction, i.e., normalizing the text to a standard form.
  • Text generalization. For example, a phone number should not get its own feature, since there are tens of millions of distinct numbers; it is better to map every phone number to one shared feature. The same applies to emoticons, person names, addresses, URLs, and other named entities. The granularity of generalization depends on the task: an address, for instance, can be generalized to the country level or to the province level.
  • When normalizing texts to a uniform length, use the mean or the median of the lengths, never the maximum; whether to truncate from the front or from the back also depends on the task. A minimal sketch of these ideas follows this list.
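
As a concrete illustration of the generalization and length-normalization points above, here is a minimal Python sketch; the regex patterns, the `<URL>`/`<PHONE>` placeholder tokens, and the helper names are illustrative assumptions, not the exact rules used in these experiments.

```python
import re
import statistics

def generalize(text: str) -> str:
    """Map high-cardinality spans to one shared feature each (assumed patterns)."""
    text = re.sub(r"https?://\S+", " <URL> ", text)   # all URLs -> one token
    text = re.sub(r"\b1\d{10}\b", " <PHONE> ", text)  # e.g. CN mobile numbers -> one token
    return re.sub(r"\s+", " ", text).strip()

def pick_uniform_len(texts: list) -> int:
    """Choose the shared length from the median (not the maximum) of all lengths."""
    return int(statistics.median(len(t.split()) for t in texts))

def truncate(tokens: list, max_len: int, from_front: bool = False) -> list:
    """Truncate from the front or the back, depending on the task."""
    return tokens[-max_len:] if from_front else tokens[:max_len]
```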

(2) Learning rate
The top layers of BERT are more effective for text classification tasks.
Appropriately decaying the learning rate layer by layer improves text-classification performance, as the sketch below illustrates.
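
A minimal sketch of layer-wise learning-rate decay in PyTorch, assuming Hugging Face's `bert-base-uncased` parameter naming (`encoder.layer.N.`); the base learning rate, the 0.95 decay factor, and the label count are illustrative assumptions.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=20)  # num_labels: placeholder for the task

base_lr, decay = 2e-5, 0.95  # assumed values
groups = []
for name, param in model.named_parameters():
    if "encoder.layer." in name:
        layer = int(name.split("encoder.layer.")[1].split(".")[0])
        lr = base_lr * decay ** (11 - layer)  # lower layers get smaller rates
    elif "embeddings" in name:
        lr = base_lr * decay ** 12            # embeddings: smallest rate
    else:
        lr = base_lr                          # pooler/classifier: full rate
    groups.append({"params": [param], "lr": lr})

optimizer = torch.optim.AdamW(groups, lr=base_lr)
```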

(3) Pretraining
Further pretraining within the task, or within a domain whose data distribution is similar to the task's, improves text-classification performance; a sketch follows.
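
A sketch of within-task further pretraining with the masked-LM objective via Hugging Face's `Trainer`; the corpus file, the hyperparameters, and the use of the (since-deprecated) `LineByLineTextDataset` helper are assumptions, not the exact setup used here.

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One unlabeled in-domain/in-task text per line (hypothetical file).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="task_corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-further-pretrained",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()  # then fine-tune this checkpoint on the classification task
```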

2 Experiments

(1) Data preprocessing

  • Lowercase conversion only (local); online score 0.792

batch_size = 10

Epoch 1: train loss 1.87, train acc 0.66; val loss 1.47, val acc 0.7506
Epoch 2: train loss 1.40, train acc 0.78; val loss 1.34, val acc 0.7718
Epoch 3: train loss 1.20, train acc 0.83; val loss 1.28, val acc 0.7802
Epoch 4: train loss 1.06, train acc 0.87; val loss 1.24, val acc 0.7858
Epoch 5: train loss 0.98, train acc 0.89; val loss 1.23, val acc 0.79

  • Stop-word removal + generalization + POS replacement + word replacement (cloud)

batch_size = 20

Epoch 1: train loss 1.99, train acc 0.6356; val loss 1.57, val acc 0.7296
Epoch 2: train loss 1.52, train acc 0.76; val loss 1.43, val acc 0.7536
Epoch 3: train loss 1.35, train acc 0.80; val loss 1.37, val acc 0.766

  • Special-character removal, processed with the custom Process_text (local)

batch_size = 10

Epoch 1: train loss 1.90, train acc 0.65; val loss 1.51, val acc 0.7374
Epoch 2: train loss 1.43, train acc 0.77; val loss 1.37, val acc 0.7614
Epoch 3: train loss 1.24, train acc 0.82; val loss 1.32, val acc 0.7668
Epoch 4: train loss 1.11, train acc 0.85; val loss 1.28, val acc 0.7772
Epoch 5: train loss 1.034, train acc 0.87; val loss 1.26, val acc 0.7794

  • Generalization + POS replacement + word replacement, without stop-word removal (cloud)

batch_size = 20

Epoch 1: train loss 1.98, train acc 0.64; val loss 1.54, val acc 0.7384
Epoch 2: train loss 1.50, train acc 0.76; val loss 1.401, val acc 0.7684
Epoch 3: train loss 1.32, train acc 0.81; val loss 1.34, val acc 0.774
Epoch 4: train loss 1.21, train acc 0.84; val loss 1.31, val acc 0.7818
Epoch 5: train loss 1.15, train acc 0.86; val loss 1.30, val acc 0.7824

  • POS replacement + lowercasing (cloud)

batch_size = 40

Epoch 1: train loss 2.14, train acc 0.60; val loss 1.62, val acc 0.7154

batch_size = 20

Epoch 1: train loss 1.97, train acc 0.645; val loss 1.54, val acc 0.7418
Epoch 2: train loss 1.49, train acc 0.77; val loss 1.40, val acc 0.7634

batch_size = 10

Epoch 1: train loss 1.86, train acc 0.669; val loss 1.46, val acc 0.7532
Epoch 2: train loss 1.39, train acc 0.78; val loss 1.33, val acc 0.7734

batch_size = 5; online score 0.7908

Epoch 1: train loss 1.78, train acc 0.68; val loss 1.4, val acc 0.761
Epoch 2: train loss 1.30, train acc 0.79; val loss 1.28, val acc 0.77
Epoch 3: train loss 1.07, train acc 0.84; val loss 1.21, val acc 0.78
Epoch 4: train loss 0.91, train acc 0.88; val loss 1.18, val acc 0.7884
Epoch 5: train loss 0.82, train acc 0.90; val loss 1.16, val acc 0.7924

max_len = 150

Epoch 1: train loss 1.90, train acc 0.65; val loss 1.50, val acc 0.7376

max_len = 200

Epoch 1: train loss 1.88, train acc 0.669; val loss 1.49, val acc 0.74

(2) Cross-validation training
batch_size = 10, epochs = 2, cross-validation + lemmatization; online score 0.7991. A sketch of the fold-averaging scheme follows.
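
A sketch of the fold-averaging scheme; `train_model` and `predict_proba` are hypothetical stand-ins for the actual BERT training and inference code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_val_predict_avg(texts, labels, test_texts, n_splits=5):
    """Train one model per fold, then average the predicted probabilities."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    fold_probs = []
    for train_idx, _ in skf.split(texts, labels):
        model = train_model([texts[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        fold_probs.append(predict_proba(model, test_texts))
    return np.mean(fold_probs, axis=0).argmax(axis=1)  # averaged class labels
```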

(3) batch_size = 10, character removal + lemmatization

This did not perform well.

(4) Generalization + lemmatization (cloud)

Epoch 1: train loss 1.87, train acc 0.66; val loss 1.46, val acc 0.752
Epoch 2: train loss 1.40, train acc 0.78; val loss 1.34, val acc 0.773
Epoch 3: train loss 1.20, train acc 0.83; val loss 1.28, val acc 0.779

Generalization + lemmatization brought no improvement.

(5) Contraction replacement (local)
Used together with special-character removal; a minimal sketch follows the results below.

Epoch 1: train loss 1.87, train acc 0.59; val loss 1.46, val acc 0.7516
Epoch 2: train loss 1.40, train acc 0.70; val loss 1.32, val acc 0.775
Epoch 3: train loss 1.20, train acc 0.75; val loss 1.27, val acc 0.7826

No improvement.
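
For reference, a minimal sketch of dictionary-based contraction replacement; the lookup table is a tiny assumed sample, not the full list used in the experiment.

```python
import re

# Assumed sample table; order matters ("can't" must precede the generic "n't").
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not",
                "'re": " are", "'ve": " have", "'ll": " will", "'m": " am"}

def expand_contractions(text: str) -> str:
    for pat, repl in CONTRACTIONS.items():
        text = re.sub(re.escape(pat), repl, text, flags=re.IGNORECASE)
    return text

print(expand_contractions("I can't believe they're late"))
# -> "I cannot believe they are late"
```
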
(6) Training on the abstract only

(7) Summary

  • Comparing max_len = 150, 200, and 300: 300 is best.
  • Comparing batch_size = 5, 10, 20, and 40: 5 and 10 are close, but 10 is best.
  • Lemmatization helps (a minimal sketch follows this list).
  • Averaging predictions over cross-validation folds helps.
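
A minimal lemmatization sketch using NLTK's WordNet lemmatizer, as one possible implementation of the lemmatization step (whether these experiments used NLTK is an assumption).

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg)

_lemmatizer = WordNetLemmatizer()
# Map Penn Treebank tag prefixes to WordNet POS categories.
_POS_MAP = {"J": wordnet.ADJ, "V": wordnet.VERB, "N": wordnet.NOUN, "R": wordnet.ADV}

def lemmatize(text: str) -> str:
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(_lemmatizer.lemmatize(tok, _POS_MAP.get(tag[0], wordnet.NOUN))
                    for tok, tag in tagged)

print(lemmatize("the cats were running"))  # -> "the cat be run"
```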

bert_base_cross, with special symbols preprocessed:

Train loss 2.2093, train acc 0.6647; val loss 1.8474, val acc 0.7285

Train loss 1.2738, train acc 0.8845; val loss 1.0764, val acc 0.8863

3 References

[Some experience and tricks in text classification] http://wulc.me/2019/02/06/%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E4%B8%AD%E7%9A%84%E4%B8%80%E4%BA%9B%E7%BB%8F%E9%AA%8C%E5%92%8C%20tricks/
[When BERT meets text classification: how to use BERT well] https://posts.careerengine.us/p/5fc8e77a291d510c6d6eb98b?from=latest-posts-panel&type=previewimage
[Text classification tips and tricks from Kaggle competitions] https://neptune.ai/blog/text-classification-tips-and-tricks-kaggle-competitions