Preface

Data source:
2019 CCF Internet News Sentiment Analysis Competition
https://www.datafountain.cn/competitions/350/

Dataset attachment:

Link: https://pan.baidu.com/s/1ePKyHyE8AGN3vW_1vg9-yg
Extraction code: 2021

Tools used:

  1. Deep learning framework: tensorflow 2.4.0
  2. NLP library: gensim 3.8.3
  3. Tokenizer: jieba 0.42.1

If anything in the code is unclear, the Runoob Python tutorial works well as a quick reference manual:
https://www.runoob.com/python3/python3-tutorial.html

Workflow

(Figure 1: overall workflow of sentiment analysis with BiLSTM)

1. Data preprocessing

1.1 Reading the data

The datasets are loaded with read_csv. Because train_data and train_label are stored in separate files, they are merged with pd.merge().

  • pd.merge(x, y, how="left", on="id"): left join on the id column (a toy illustration follows the code block below)
  • notnull(): returns a boolean mask indicating which values are not null
  • fillna(value): replaces null values with value

```python
import pandas as pd
import numpy as np

# Read the training data
train_data = pd.read_csv('Train_DataSet.csv')
train_label = pd.read_csv('Train_DataSet_Label.csv')
test = pd.read_csv('Test_DataSet.csv')

# Merge the training data with the training labels
train = pd.merge(train_data, train_label, how='left', on='id')

# Use boolean masks to drop rows whose label or content is null
train = train[(train.label.notnull()) & (train.content.notnull())]

# Replace remaining NaN values with empty strings
train['title'] = train['title'].fillna('')
train['content'] = train['content'].fillna('')
test['title'] = test['title'].fillna('')
test['content'] = test['content'].fillna('')
```
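To see why how='left' matters here, a minimal sketch with made-up ids and labels: a left join keeps every row of the left frame and fills label with NaN where no matching id exists, which is exactly what the notnull() filter above removes.

```python
import pandas as pd

# Hypothetical toy frames: id 'c' has no label, so a left join gives it a NaN label
toy_data = pd.DataFrame({'id': ['a', 'b', 'c'], 'content': ['x', 'y', 'z']})
toy_label = pd.DataFrame({'id': ['a', 'b'], 'label': [0, 2]})

toy = pd.merge(toy_data, toy_label, how='left', on='id')
print(toy)                          # row 'c' has label NaN
print(toy[toy.label.notnull()])     # the NaN row is dropped, just like the filter above
```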

1.2 Filtering invalid characters and HTML tags

The text_filter(text) function can be kept in a utility library for reuse.

  • re.sub(pattern, replacement, string): substitutes every match of pattern in string with replacement
  • str.strip(): removes leading and trailing whitespace

```python
import re

# Text filtering function
def text_filter(text):
    # re.sub(pattern, replacement, string)
    text = re.sub("[A-Za-z0-9\!\=\?\%\[\]\,\(\)\>\<:&lt;\/#\. -----\_]", "", text)
    text = text.replace('图片', '')
    text = text.replace('\xa0', '')  # remove &nbsp;
    # Remove HTML tags
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, ' ', text)
    # Remove other punctuation characters
    r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+""'?@|:~{}#]+|[——!\\\,。=?、:“”‘’¥……()《》【】]"
    text = re.sub(r1, '', text)
    # Strip leading and trailing whitespace
    text = text.strip()
    return text

# Text cleaning function
def clean_text(data):
    # Title text
    data['title'] = data['title'].apply(lambda x: text_filter(x))
    # Body text
    data['content'] = data['content'].apply(lambda x: text_filter(x))
    return data

# Run clean_text
train = clean_text(train)
test = clean_text(test)
```
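A quick sanity check of text_filter on a made-up dirty string (the sample text is invented for illustration and assumes the function above has already been defined):

```python
# Made-up dirty string: HTML tag, English letters/digits, nbsp and Chinese punctuation
sample = '<p>图片 Hello2021! 这是一条“测试”新闻……(含HTML标签)</p>\xa0'
print(text_filter(sample))  # prints the cleaned, Chinese-only text
```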

1.3 Tokenization and stop words

  • str.maketrans(x, y, z): takes three arguments; every character in the third argument z is mapped to None, i.e. deleted. Even if a character in z also appears in x, it is still deleted in the end: whenever z is given, all of its characters are removed (a small demo follows the code block below).
  • string.punctuation: all ASCII punctuation characters
  • [token for token in tokens if token not in stop_words]: list comprehension that keeps only the tokens not found in stop_words

```python
import jieba
import string

# Load the stop-word list
stop_words = pd.read_table('stop.txt', header=None)[0].tolist()

# Build a translation table, used later to strip English punctuation
table = str.maketrans("", "", string.punctuation)

def cut_text(sentence):
    tokens = list(jieba.cut(sentence))
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    # Remove English punctuation (translate is used together with maketrans)
    tokens = [w.translate(table) for w in tokens]
    return tokens

# Tokenize the titles and bodies of the training and test sets
train_title = [cut_text(sent) for sent in train.title.values]
train_content = [cut_text(sent) for sent in train.content.values]
test_title = [cut_text(sent) for sent in test.title.values]
test_content = [cut_text(sent) for sent in test.content.values]

# Concatenate all tokenized documents to prepare for word-vector training
all_doc = train_title + train_content + test_title + test_content
```
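A minimal illustration of how str.maketrans and translate work together (the sample tokens are made up): with empty first and second arguments, every character of the third argument is simply deleted.

```python
import string

table = str.maketrans("", "", string.punctuation)
print("word2vec!?".translate(table))   # -> 'word2vec': ASCII punctuation is stripped
print("你好,世界".translate(table))     # full-width punctuation is not in string.punctuation, so it stays
```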

1.4 Training word vectors with gensim

This code can be used as-is to train your own word vectors; in my tests, the more samples, the better the result. After tokenization the vocab_size for this competition is roughly 29,244.

```python
import gensim
import time

class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    '''Saves the model and prints the loss after every epoch.'''
    def __init__(self, save_path):
        self.save_path = save_path    # model save path
        self.epoch = 0                # epoch counter
        self.pre_loss = 0             # cumulative loss at the end of the previous epoch
        self.best_loss = 999999999.9  # best epoch loss so far
        self.since = time.time()      # start time of the current epoch

    def on_epoch_end(self, model):
        self.epoch += 1
        cum_loss = model.get_latest_training_loss()  # cumulative loss since the first epoch
        epoch_loss = cum_loss - self.pre_loss        # epoch loss = current cumulative loss - previous cumulative loss
        time_taken = time.time() - self.since        # elapsed time
        print("Epoch %d, loss: %.2f, time: %dmin %ds" %
              (self.epoch, epoch_loss, time_taken // 60, time_taken % 60))
        # Track best_loss; it can also be used for early stopping
        if self.best_loss > epoch_loss:
            self.best_loss = epoch_loss
            print("Better model. Best loss: %.2f" % self.best_loss)
            model.save(self.save_path)  # save the model
            print("Model %s save done!" % self.save_path)
        self.pre_loss = cum_loss
        self.since = time.time()

# The line below reloads previously trained word vectors
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
```

Create the word2vec model and import the words into the vocabulary with build_vocab:

```python
model_word2vec = gensim.models.Word2Vec(min_count=1,
                                        window=5,
                                        size=256,
                                        workers=4,
                                        batch_words=1000)
since = time.time()
model_word2vec.build_vocab(all_doc, progress_per=2000)
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
```

  • Train the word vectors and save the model:

```python
since = time.time()
model_word2vec.train(all_doc, total_examples=model_word2vec.corpus_count,
                     epochs=20, compute_loss=True, report_delay=60*10,
                     callbacks=[EpochSaver('./final_word2vec_model')])
time_elapsed = time.time() - since
print('Time to train: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
```
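After training, the saved model can be reloaded and queried, e.g. for nearest neighbours. A short sketch; '花都区' is only an example token and may not be in your vocabulary:

```python
import gensim

# Reload the model saved by EpochSaver and query it
model_word2vec = gensim.models.Word2Vec.load('./final_word2vec_model')
print(model_word2vec.wv['花都区'].shape)               # (256,) if the word is in the vocabulary
print(model_word2vec.wv.most_similar('花都区', topn=5))  # 5 most similar words with cosine scores
```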
1.5 Encoding with the tf.keras Tokenizer

Tokenizer is the tokenizer/indexer in tensorflow.keras; it maintains the full word-to-index dictionary. Reference blog: https://dengbocong.blog.csdn.net/article/details/108038858

```python
# Convert to a Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_title + test_title)
# tokenizer.fit_on_texts(train_content + test_content)
```
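A tiny illustration of what fit_on_texts and texts_to_sequences do with already-tokenized input (the sample lists are made up); note that the Tokenizer here receives lists of tokens rather than raw strings, so no further splitting happens:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Made-up, already-tokenized sentences (lists of tokens, like train_title above)
toy_docs = [['花都区', '学位', '招生'], ['招生', '计划', '班级']]

toy_tok = Tokenizer()
toy_tok.fit_on_texts(toy_docs)
print(toy_tok.word_index)                    # e.g. {'招生': 1, '花都区': 2, ...}; frequent words get small indices
print(toy_tok.texts_to_sequences(toy_docs))  # each token replaced by its integer index
```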

1.6 Building the embedding_matrix

```python
from tqdm import tqdm

# Build the word-vector matrix from the newly trained word2vec model
vocab_size = len(tokenizer.word_index)  # vocabulary size
error_count = 0
embedding_matrix = np.zeros((vocab_size + 1, 256))
for word, i in tqdm(tokenizer.word_index.items()):
    if word in model_word2vec.wv:
        embedding_matrix[i] = model_word2vec.wv[word]
    else:
        error_count += 1
```
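It can be worth checking how much of the tokenizer vocabulary was actually covered by the word2vec model. A small follow-up using error_count from the loop above:

```python
# Rough coverage check: how many tokenizer words were missing from the word2vec vocabulary
total = len(tokenizer.word_index)
print('missing from word2vec: %d / %d (%.2f%%)' % (error_count, total, 100.0 * error_count / total))
```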

1.7 Padding

  • padding pads shorter sequences and truncates longer ones (a toy example follows the code block below)

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequence = tokenizer.texts_to_sequences(train_title)
traintitle = pad_sequences(sequence, maxlen=30)
sequence = tokenizer.texts_to_sequences(test_title)
testtitle = pad_sequences(sequence, maxlen=30)

sequence = tokenizer.texts_to_sequences(train_content)
traincontent = pad_sequences(sequence, maxlen=512)
sequence = tokenizer.texts_to_sequences(test_content)
testcontent = pad_sequences(sequence, maxlen=512)
```
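What pad_sequences does, on a made-up example: shorter sequences are left-padded with 0 and longer ones are truncated from the front by default (maxlen=30 and 512 above are design choices for titles and bodies):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_seqs = [[5, 8, 2], [1, 2, 3, 4, 5, 6]]
print(pad_sequences(toy_seqs, maxlen=5))
# [[0 0 5 8 2]
#  [2 3 4 5 6]]   <- zero-padded in front / truncated from the front
```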

2. Building the models

  • https://zhuanlan.zhihu.com/p/95293440 — a summary of the accuracy metrics in Keras.metrics

2.1 BiLSTM

```python
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers

model = Sequential([
    layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
                     output_dim=256,
                     input_length=30,
                     weights=[embedding_matrix]),
    layers.Bidirectional(LSTM(32, return_sequences=True)),
    layers.GlobalMaxPool1D(),
    layers.Dense(20, activation="relu"),
    layers.Dropout(0.05),
    layers.Dense(3, activation="softmax"),
])
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['categorical_accuracy'])
model.summary()
```
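One design choice worth noting: with weights=[embedding_matrix] the pretrained vectors are still fine-tuned during training. If you would rather keep them fixed, a small variant is to pass trainable=False to the Embedding layer (a sketch, same shapes as above):

```python
# Variant: freeze the pretrained word2vec vectors instead of fine-tuning them
frozen_embedding = layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
                                    output_dim=256,
                                    input_length=30,
                                    weights=[embedding_matrix],
                                    trainable=False)
```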

2.2 TextCNN

Attention layer (code written by someone else)

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Input, Model, backend as K
from tensorflow.keras.layers import Embedding, Dense, Attention, Bidirectional, LSTM
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            # 1
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention())
            # next add a Dense layer (for classification/regression) or whatever...
            # 2
            hidden = LSTM(64, return_sequences=True)(words)
            sentence = Attention()(hidden)
            # next add a Dense layer (for classification/regression) or whatever...
        """
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim
        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                            K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
        if self.bias:
            e += self.b
        e = K.tanh(e)
        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim
```

TextCNN

```python
# from keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout

class TextCNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=1,
                 last_activation='sigmoid'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

    def get_model(self):
        input = Input((self.maxlen,))
        # Embedding part can try multichannel as same as origin paper
        embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen,
                              weights=[embedding_matrix])(input)
        convs = []
        for kernel_size in [3, 4, 5]:
            c = Conv1D(128, kernel_size, activation='relu')(embedding)
            c = GlobalMaxPooling1D()(c)
            convs.append(c)
        x = Concatenate()(convs)
        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=input, outputs=output)
        return model

model = TextCNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
                embedding_dims=256, class_num=3, last_activation='softmax').get_model()
# metric_F1score is defined below in section 3.1
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy', metric_F1score])
model.summary()
```

2.3 Attention-BiLSTM

Attention-BiLSTM

```python
class TextAttBiRNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=1,
                 last_activation='sigmoid'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

    def get_model(self):
        input = Input((self.maxlen,))
        embedding = Embedding(self.max_features, self.embedding_dims,
                              input_length=self.maxlen, weights=[embedding_matrix])(input)
        x = Bidirectional(LSTM(128, return_sequences=True))(embedding)  # LSTM or GRU
        x = Attention(self.maxlen)(x)
        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=input, outputs=output)
        return model

model = TextAttBiRNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
                     embedding_dims=256, class_num=3, last_activation='softmax').get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()
```

3. Model training

3.1 Evaluation metric

```python
import tensorflow as tf

# F1-score metric
def metric_F1score(y_true, y_pred):
    TP = tf.reduce_sum(y_true * tf.round(y_pred))
    TN = tf.reduce_sum((1 - y_true) * (1 - tf.round(y_pred)))
    FP = tf.reduce_sum((1 - y_true) * tf.round(y_pred))
    FN = tf.reduce_sum(y_true * (1 - tf.round(y_pred)))
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    F1score = 2 * precision * recall / (precision + recall)
    return F1score
```
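A quick eager-mode sanity check of metric_F1score on a made-up one-hot batch (TF2 executes this directly, assuming the function above is defined):

```python
import tensorflow as tf

y_true = tf.constant([[1., 0., 0.], [0., 1., 0.]])
y_pred = tf.constant([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]])
print(float(metric_F1score(y_true, y_pred)))  # -> 1.0: rounding the probabilities recovers the labels exactly
```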

3.2 Splitting the training set

  • Input: traintitle, the sequences padded in the previous step
  • Output: label, taken from the original csv
  • Split ratio: training set : validation set = 4 : 1 (test_size=0.2)

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

label = train['label'].astype(int)

train_X, val_X, train_Y, val_Y = train_test_split(traintitle, label, shuffle=True,
                                                  test_size=0.2, random_state=42)

# to_categorical is tf's one-hot encoder; it is needed because the loss is categorical_crossentropy.
# With sparse_categorical_crossentropy as the loss, this conversion would not be necessary.
train_Y = tf.keras.utils.to_categorical(train_Y)
```
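What to_categorical does, on a made-up label vector; with sparse_categorical_crossentropy the integer labels could be used directly instead:

```python
import tensorflow as tf

print(tf.keras.utils.to_categorical([0, 2, 1]))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```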

3.3 Training the model

  • Set the remaining hyperparameters yourself

```python
# Train the model
history = model.fit(train_X, train_Y,
                    batch_size=128,
                    epochs=10,
                    validation_split=0.1,
                    validation_freq=1,
                    )
```


3.4 Validating the model

```python
from sklearn.metrics import f1_score

pred_val = model.predict(val_X)
print(f1_score(val_Y, np.argmax(pred_val, axis=1), average='macro'))
```


3.5 Visualizing the loss and accuracy

```python
import matplotlib.pyplot as plt

# Plot the loss and accuracy curves
def show_loss_acc_img(history):
    # Loss
    plt.plot(history.history['loss'], label="$Loss$")
    plt.plot(history.history['val_loss'], label='$val_loss$')
    plt.title('Loss')
    plt.xlabel('epoch')
    plt.ylabel('num')
    plt.legend()
    plt.show()
    # Accuracy
    plt.plot(history.history['categorical_accuracy'], label="$categorical_accuracy$")
    plt.plot(history.history['val_categorical_accuracy'], label='$val_categorical_accuracy$')
    plt.title('Accuracy')
    plt.xlabel('epoch')
    plt.ylabel('num')
    plt.legend()
    plt.show()

show_loss_acc_img(history)
```


3.6 Predicting sentiment polarity on the test set

```python
# Predict polarity on the test set
pred_val = model.predict(testtitle)
# Save the submission file
submission = pd.DataFrame(test.id.values, columns=["id"])
submission["label"] = np.argmax(pred_val, axis=1)
submission.to_csv("submission.csv", index=False)
```

Ready-to-use snippets

1. Removing HTML tags and other symbols from text with regular expressions

```python
import re

# Text filtering function
def text_filter(text):
    # re.sub(pattern, replacement, string)
    text = re.sub("[A-Za-z0-9\!\=\?\%\[\]\,\(\)\>\<:&lt;\/#\. -----\_]", "", text)
    text = text.replace('图片', '')
    text = text.replace('\xa0', '')  # remove &nbsp;
    # Remove HTML tags
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, ' ', text)
    # Remove other punctuation characters
    r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+""'?@|:~{}#]+|[——!\\\,。=?、:“”‘’¥……()《》【】]"
    text = re.sub(r1, '', text)
    # Strip leading and trailing whitespace
    text = text.strip()
    return text
```

2. Training your own word vectors with gensim

Reference blogs:
[1] https://www.jianshu.com/p/5f04e97d1b27 — plotting word vectors after TSNE dimensionality reduction
[2] https://www.cnblogs.com/johnnyzen/p/10900040.html — detailed explanation of the gensim.models.Word2Vec parameters

```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File : word2vecgensim.py
@Contact : htkstudy@163.com
@Modify Time @Author @Version @Desciption
------------ ------- -------- -----------
2021/3/9 8:55 Armor(htk) 1.0 None
'''
import gensim
import time
from sklearn.manifold import TSNE
from matplotlib.font_manager import *
import matplotlib.pyplot as plt


class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    '''Saves the model and prints the loss after every epoch.'''
    def __init__(self, save_path):
        self.save_path = save_path    # model save path
        self.epoch = 0                # epoch counter
        self.pre_loss = 0             # cumulative loss at the end of the previous epoch
        self.best_loss = 999999999.9  # best epoch loss so far
        self.since = time.time()      # start time of the current epoch

    def on_epoch_end(self, model):
        self.epoch += 1
        cum_loss = model.get_latest_training_loss()  # cumulative loss since the first epoch
        epoch_loss = cum_loss - self.pre_loss        # epoch loss = current cumulative loss - previous cumulative loss
        time_taken = time.time() - self.since        # elapsed time
        print("Epoch %d, loss: %.2f, time: %dmin %ds" %
              (self.epoch, epoch_loss, time_taken // 60, time_taken % 60))
        # Track best_loss; it can also be used for early stopping
        if self.best_loss > epoch_loss:
            self.best_loss = epoch_loss
            print("Better model. Best loss: %.2f" % self.best_loss)
            model.save(self.save_path)  # save the model
            print("Model %s save done!" % self.save_path)
        self.pre_loss = cum_loss
        self.since = time.time()


# Load previously trained word vectors
def load_model_word2vec(save_path):
    # e.g. model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
    model_word2vec = gensim.models.Word2Vec.load(save_path)
    return model_word2vec


def print_since_time(since):
    time_elapsed = time.time() - since
    print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))


def show_word2vec_2D(model_word2vec, random_word):
    # Fit TSNE; note that random_word passed to model_word2vec.wv[random_word] must be a list of strings
    X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(model_word2vec.wv[random_word])
    # fix the minus sign '-' rendering as a box
    plt.figure(figsize=(14, 8))
    myfont = FontProperties(fname=r'C:\Windows\Fonts\simsun.ttc')  # load a Chinese font
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1])  # scatter plot
    for i in range(len(X_tsne)):
        x = X_tsne[i][0]
        y = X_tsne[i][1]
        plt.text(x, y, random_word[i], fontproperties=myfont, size=16)  # label each point with its word
    plt.show()


if __name__ == "__main__":
    # Prepare your own input; note that input_doc must not be one-dimensional, it has the form input_doc = [[...]]
    input_doc = [['`', '广告', '联系', '微', '信号', '花都区', '租房', '满', '一年', '有望', '确保', '学位', '信息', '时报讯', '记者', '崔小远', '近日',
                  '发布', '下称', '了解', '今年', '花都区', '公办', '小学', '计划', '招收', '班级', '公办', '初中', '计划', '招收', '班级', ';', '民办小学', '花都区',
                  '班级', '民办', '初中', '计划', '招收', '班级', '对比', '年', '招生', '细则', '今年', '招生', '规模', '总体', '变化', '不', '大', '计划', '招收',
                  '了解', '花都区', '招生', '时间', '安排', '月', '日', '~', '月', '日', '花都区', '公办', '小学', '网上', '报名', ';', '教育局',
                  '~', '积分', '入学', '网上', '报名', ';', '月', '日', '~', '月', '日', '花都区', '民办小学', '网上', '报名', ';',
                  '~', '花都区', '小区', '配套', '业主', '非', '广州市', '户籍', '适龄', '子女', '报名', '保障', '区内', '明确', '未来', '十年',
                  '承租人', '子女', '入学', '方面', '提出', '具有', '广州市', '户籍', '含', '政策性', '照顾', '生', '广州市', '无', '自有', '产权', '住房', '含', '城乡',
                  '自建房', '租赁', '房屋', '所在地', '唯一', '居住地', '房屋', '租赁', '合同', '登记', '备案', '连续', '满', '一年', '截止', '日期', '申请', '入学', '内在',
                  '年月日', '以上', '申请', '时', '租赁', '合同', '有效', '状态', '承租人', '适龄', '子女', '花都区', '教育局', '确保', '学位', '供给', '年', '当中', '已经',
                  '增城', '花都', '从化', '张', '病床', '条件', '建立', '专业', '精神病', '医院', '来源', '花都', '早晨', '区卫计局', '广州', '花都', '发布', '今日', '花都',
                  '花都', '求职', '招聘', '群', '添加', '时', '注明', '招定', '求职', '小编', '工资', '大拇指', '挂钩', '点', '一下', '一分钱', '求', '打赏', '记得', '加']]

    # Train the model -----------------------------------------------
    model_word2vec = gensim.models.Word2Vec(min_count=1,
                                            window=5,
                                            size=256,
                                            workers=4,
                                            batch_words=1000)
    since = time.time()  # start timing
    model_word2vec.build_vocab(input_doc, progress_per=2000)  # build the vocabulary; progress_per controls how often progress is reported
    print_since_time(since)  # stop timing and print the elapsed time
    since = time.time()
    model_word2vec.train(input_doc,
                         total_examples=model_word2vec.corpus_count,
                         epochs=20,
                         compute_loss=True,
                         report_delay=60 * 10,
                         callbacks=[EpochSaver('./final_word2vec_model')])  # model_word2vec is saved here
    print_since_time(since)  # stop timing and print the elapsed time

    # Plot the word vectors
    show_word2vec_2D(model_word2vec, input_doc[0])

    # model_word2vec = load_model_word2vec('./final_word2vec_model')
    # print(model_word2vec)
    # # Similarity between two words
    # y2 = model_word2vec.wv.similarity(u"租赁", u"承租人")
    # print(y2)
    # # Print the words most similar to a query word
    # for i in model_word2vec.wv.most_similar(u"建立"):
    #     print(i[0], i[1])
```

(Figure: TSNE visualization of the trained word vectors)