- Preface
- Workflow
   - 1. Data preprocessing
   - 2. Model construction
   - 3. Model training
- Ready-to-use snippets
# Preface
Data source:
2019 CCF Internet News Sentiment Analysis Competition
https://www.datafountain.cn/competitions/350/
Dataset attachment:
Link: https://pan.baidu.com/s/1ePKyHyE8AGN3vW_1vg9-yg
Extraction code: 2021
Tools used:
Deep learning framework: tensorflow 2.4.0
NLP library: gensim 3.8.3
Tokenizer: jieba 0.42.1
If any code is unclear, the runoob Python 3 tutorial works well as a quick reference:
https://www.runoob.com/python3/python3-tutorial.html
# Workflow
## 1. Data preprocessing
### 1.1 Read the data
Read the datasets with read_csv. Because train_data and train_label are stored separately, merge them with pd.merge().
- pd.merge(x, y, how="left", on="id"): left-join y onto x using the id column (a tiny merge example follows the code block below)
- notnull(): returns a boolean Series marking which values are not null
- fillna(value): replaces null values with value
```python
import pandas as pd
import numpy as np

# Read the training data
train_data = pd.read_csv('Train_DataSet.csv')
train_label = pd.read_csv('Train_DataSet_Label.csv')
test = pd.read_csv('Test_DataSet.csv')
# Merge the training data with the training labels
train = pd.merge(train_data, train_label, how='left', on='id')
# Use boolean indexing to drop rows whose label or content is null
train = train[(train.label.notnull()) & (train.content.notnull())]
# Replace remaining NaN values with empty strings
train['title'] = train['title'].fillna('')
train['content'] = train['content'].fillna('')
test['title'] = test['title'].fillna('')
test['content'] = test['content'].fillna('')
```
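A quick toy illustration of what how='left' does (made-up ids, not the competition data): rows of the left frame are always kept, and ids with no match get a NaN label.
```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'title': ['a', 'b', 'c']})   # hypothetical data
right = pd.DataFrame({'id': [1, 3], 'label': [0, 2]})
print(pd.merge(left, right, how='left', on='id'))
# roughly:
#    id title  label
# 0   1     a    0.0
# 1   2     b    NaN   <- unmatched id keeps its row, label becomes NaN
# 2   3     c    2.0
```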
### 1.2 Filter invalid characters and HTML tags
The text_filter(text) function below is worth keeping in your utility library for reuse.
- re.sub(pattern, repl, string): replaces every match of pattern in string with repl
- str.strip(): removes leading and trailing whitespace from a string
```python
import re
# Text filtering function
def text_filter(text):
    # re.sub(pattern, repl, string)
    text = re.sub("[A-Za-z0-9\!\=\?\%\[\]\,\(\)\>\<:<\/#\. -----\_]", "", text)
    text = text.replace('图片', '')
    text = text.replace('\xa0', '')  # remove &nbsp;
    # Strip HTML tags
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, ' ', text)
    # Remove other punctuation characters
    r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+""'?@|:~{}#]+|[——!\\\,。=?、:“”‘’¥……()《》【】]"
    text = re.sub(r1,'',text)
    # Strip leading/trailing whitespace
    text = text.strip()
    return text
# Text cleaning function: apply text_filter to the title and content columns
def clean_text(data):
    # Title text
    data['title'] = data['title'].apply(lambda x: text_filter(x))
    # Body text
    data['content'] = data['content'].apply(lambda x: text_filter(x))
    return data
# run clean_text
train = clean_text(train)
test = clean_text(test)
```
### 1.3 Tokenization and stop words
- str.maketrans(x, y, z): takes three arguments; the third argument z must be a string, and every character in z is mapped to None, i.e. deleted. Even if a character in z also appears in x, it still ends up deleted; whenever z is supplied, its characters are removed. (A short maketrans/translate demo follows the code below.)
- string.punctuation: all ASCII punctuation characters
- [token for token in tokens if token not in stop_words]: list comprehension that keeps a token only if it is not in stop_words
```python
import jieba
import string
# Load the stop words
stop_words = pd.read_table('stop.txt', header=None)[0].tolist()
# Build a translation table, used later to strip English punctuation
table = str.maketrans("", "", string.punctuation)

def cut_text(sentence):
    tokens = list(jieba.cut(sentence))
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    # Strip English punctuation (str.translate works together with str.maketrans)
    tokens = [w.translate(table) for w in tokens]
    return tokens

# Tokenize the titles and contents of the train and test sets
train_title = [cut_text(sent) for sent in train.title.values]
train_content = [cut_text(sent) for sent in train.content.values]
test_title = [cut_text(sent) for sent in test.title.values]
test_content = [cut_text(sent) for sent in test.content.values]
# Concatenate all tokenized documents, to be used later for training word vectors
all_doc = train_title + train_content + test_title + test_content
```
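A minimal illustration of the maketrans/translate pair (standard library only, not competition code): only ASCII punctuation is removed, full-width Chinese punctuation is untouched.
```python
import string

table = str.maketrans("", "", string.punctuation)
print("word!?".translate(table))    # -> word
print("中文,标点".translate(table))  # -> 中文,标点  (the full-width comma is not in string.punctuation)
```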
### 1.4 Train word vectors with gensim
This block can be used as-is to train your own word vectors; in my tests, the larger the corpus, the better the result. After tokenization, the vocab_size for this competition is roughly 29244.
```python
import gensim
import time
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    '''Callback used to save the model and print the training loss.'''
    def __init__(self, save_path):
        self.save_path = save_path          # model save path
        self.epoch = 0                      # epoch counter
        self.pre_loss = 0                   # cumulative loss up to the previous epoch
        self.best_loss = 999999999.9        # best (lowest) epoch loss so far
        self.since = time.time()            # start time of the current epoch

    def on_epoch_end(self, model):
        self.epoch += 1
        cum_loss = model.get_latest_training_loss()  # cumulative loss since the first epoch
        epoch_loss = cum_loss - self.pre_loss        # this epoch's loss = cumulative - previous cumulative
        time_taken = time.time() - self.since        # epoch duration
        print("Epoch %d, loss: %.2f, time: %dmin %ds" %
              (self.epoch, epoch_loss, time_taken//60, time_taken%60))
        # Track best_loss and save the model whenever the epoch loss improves
        if self.best_loss > epoch_loss:
            self.best_loss = epoch_loss
            print("Better model. Best loss: %.2f" % self.best_loss)
            model.save(self.save_path)
            print("Model %s save done!" % self.save_path)
        self.pre_loss = cum_loss
        self.since = time.time()

# To reload a trained word2vec model later:
# model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
```
- Create the word2vec trainer and use build_vocab to load the tokens into its vocabulary
```python
model_word2vec = gensim.models.Word2Vec(min_count=1,
window=5,
size=256,
workers=4,
batch_words=1000)
since = time.time()
model_word2vec.build_vocab(all_doc, progress_per=2000)
time_elapsed = time.time() - since
print('Time to build vocab: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
```
- Train the word vectors and save the model
```python
since = time.time()
model_word2vec.train(all_doc, total_examples=model_word2vec.corpus_count,
                     epochs=20, compute_loss=True, report_delay=60*10,
                     callbacks=[EpochSaver('./final_word2vec_model')])
time_elapsed = time.time() - since
print('Time to train: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
```
### 1.5 Encode with the tf.keras Tokenizer
Tokenizer is the text tokenizer that ships with TensorFlow; it encapsulates the complete word-to-index dictionary and related utilities.<br />Reference blog: [https://dengbocong.blog.csdn.net/article/details/108038858](https://dengbocong.blog.csdn.net/article/details/108038858)
```python
# Fit the Tokenizer (here the word index is built from the titles only)
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_title + test_title)
# tokenizer.fit_on_texts(train_content + test_content)
```
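A tiny, self-contained sketch (toy sentences, not the competition data) of what fit_on_texts and texts_to_sequences do:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

toy = [['我', '喜欢', '新闻'], ['新闻', '很', '多']]   # already-tokenized toy sentences
tk = Tokenizer()
tk.fit_on_texts(toy)                  # builds the word -> index dictionary
print(tk.word_index)                  # e.g. {'新闻': 1, '我': 2, ...} (the most frequent word gets index 1)
print(tk.texts_to_sequences(toy))     # e.g. [[2, 3, 1], [1, 4, 5]]
```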
### 1.6 Build the embedding matrix
```python
from tqdm import tqdm
# Build the embedding matrix from the trained word2vec model
vocab_size = len(tokenizer.word_index)  # vocabulary size
error_count = 0
embedding_matrix = np.zeros((vocab_size + 1, 256))
for word, i in tqdm(tokenizer.word_index.items()):
    if word in model_word2vec.wv:
        embedding_matrix[i] = model_word2vec.wv[word]
    else:
        error_count += 1
```
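Optionally (my own sanity check, not in the original code), you can report how many tokenizer words had no pretrained vector; those rows of embedding_matrix simply stay all-zero.
```python
# Words missing from the word2vec vocabulary keep an all-zero embedding row
print(f"words without a pretrained vector: {error_count} / {vocab_size}")
print(f"embedding coverage: {1 - error_count / vocab_size:.2%}")
```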
### 1.7 Padding
- padding pads shorter texts with zeros and truncates longer ones to a fixed length (a small demo follows this block)
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequence = tokenizer.texts_to_sequences(train_title)
traintitle = pad_sequences(sequence, maxlen=30)
sequence = tokenizer.texts_to_sequences(test_title)
testtitle = pad_sequences(sequence, maxlen=30)
sequence = tokenizer.texts_to_sequences(train_content)
traincontent = pad_sequences(sequence, maxlen=512)
sequence = tokenizer.texts_to_sequences(test_content)
testcontent = pad_sequences(sequence, maxlen=512)
```
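A minimal pad_sequences example with toy sequences (default 'pre' padding and truncation):
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2], [3, 4, 5, 6]], maxlen=3))
# [[0 1 2]    <- the shorter sequence is left-padded with 0
#  [4 5 6]]   <- the longer sequence is truncated from the front
```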
## 2. Model construction
- [https://zhuanlan.zhihu.com/p/95293440](https://zhuanlan.zhihu.com/p/95293440): a summary of the accuracy metrics in Keras.metrics
### 2.1 BiLSTM
```python
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers
model = Sequential([
layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
output_dim=256,
input_length=30,
weights=[embedding_matrix]),
layers.Bidirectional(LSTM(32, return_sequences = True)),
layers.GlobalMaxPool1D(),
layers.Dense(20, activation="relu"),
layers.Dropout(0.05),
layers.Dense(3, activation="softmax"),
])
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['categorical_accuracy'])
model.summary()
```
### 2.2 TextCNN
Attention layer (adapted from someone else's code, also used by the Attention-BiLSTM model below):
```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Input, Model,backend as K
from tensorflow.keras.layers import Embedding, Dense, Attention, Bidirectional, LSTM
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer
class Attention(Layer):
def __init__(self, step_dim,
W_regularizer=None, b_regularizer=None,
W_constraint=None, b_constraint=None,
bias=True, **kwargs):
"""
Keras Layer that implements an Attention mechanism for temporal data.
Supports Masking.
Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
# Input shape
3D tensor with shape: `(samples, steps, features)`.
# Output shape
2D tensor with shape: `(samples, features)`.
:param kwargs:
Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
The dimensions are inferred based on the output shape of the RNN.
Example:
# 1
model.add(LSTM(64, return_sequences=True))
model.add(Attention())
# next add a Dense layer (for classification/regression) or whatever...
# 2
hidden = LSTM(64, return_sequences=True)(words)
sentence = Attention()(hidden)
# next add a Dense layer (for classification/regression) or whatever...
"""
self.supports_masking = True
self.init = initializers.get('glorot_uniform')
self.W_regularizer = regularizers.get(W_regularizer)
self.b_regularizer = regularizers.get(b_regularizer)
self.W_constraint = constraints.get(W_constraint)
self.b_constraint = constraints.get(b_constraint)
self.bias = bias
self.step_dim = step_dim
self.features_dim = 0
super(Attention, self).__init__(**kwargs)
def build(self, input_shape):
assert len(input_shape) == 3
self.W = self.add_weight(shape=(input_shape[-1],),
initializer=self.init,
name='{}_W'.format(self.name),
regularizer=self.W_regularizer,
constraint=self.W_constraint)
self.features_dim = input_shape[-1]
if self.bias:
self.b = self.add_weight(shape=(input_shape[1],),
initializer='zero',
name='{}_b'.format(self.name),
regularizer=self.b_regularizer,
constraint=self.b_constraint)
else:
self.b = None
self.built = True
def compute_mask(self, input, input_mask=None):
# do not pass the mask to the next layers
return None
def call(self, x, mask=None):
features_dim = self.features_dim
step_dim = self.step_dim
e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim)) # e = K.dot(x, self.W)
if self.bias:
e += self.b
e = K.tanh(e)
a = K.exp(e)
# apply mask after the exp. will be re-normalized next
if mask is not None:
# cast the mask to floatX to avoid float64 upcasting in theano
a *= K.cast(mask, K.floatx())
# in some cases especially in the early stages of training the sum may be almost zero
# and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
a = K.expand_dims(a)
c = K.sum(a * x, axis=1)
return c
def compute_output_shape(self, input_shape):
return input_shape[0], self.features_dim
```
TextCNN:
```python
# from keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout
class TextCNN(object):
def __init__(self, maxlen, max_features, embedding_dims,
class_num=1,
last_activation='sigmoid'):
self.maxlen = maxlen
self.max_features = max_features
self.embedding_dims = embedding_dims
self.class_num = class_num
self.last_activation = last_activation
def get_model(self):
input = Input((self.maxlen,))
# Embedding part can try multichannel as same as origin paper
embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen,
weights=[embedding_matrix])(input)
convs = []
for kernel_size in [3, 4, 5]:
c = Conv1D(128, kernel_size, activation='relu')(embedding)
c = GlobalMaxPooling1D()(c)
convs.append(c)
x = Concatenate()(convs)
output = Dense(self.class_num, activation=self.last_activation)(x)
model = Model(inputs=input, outputs=output)
return model
model = TextCNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
embedding_dims=256, class_num=3, last_activation='softmax').get_model()
# metric_F1score is defined below in section 3.1
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy',metric_F1score])
model.summary()
```
### 2.3 Attention-BiLSTM
Attention-BiLSTM:
```python
class TextAttBiRNN(object):
def __init__(self, maxlen, max_features, embedding_dims,
class_num=1,
last_activation='sigmoid'):
self.maxlen = maxlen
self.max_features = max_features
self.embedding_dims = embedding_dims
self.class_num = class_num
self.last_activation = last_activation
def get_model(self):
input = Input((self.maxlen,))
embedding = Embedding(self.max_features, self.embedding_dims,
input_length=self.maxlen, weights=[embedding_matrix])(input)
x = Bidirectional(LSTM(128,return_sequences=True))(embedding) # LSTM or GRU
x = Attention(self.maxlen)(x)
output = Dense(self.class_num, activation=self.last_activation)(x)
model = Model(inputs=input, outputs=output)
return model
pass
model = TextAttBiRNN(maxlen=30, max_features=len(tokenizer.word_index) + 1,
embedding_dims=256, class_num=3, last_activation='softmax').get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()
```
## 3. Model training
### 3.1 Evaluation metric
```python
import tensorflow as tf
# F1-score metric, computed element-wise on the one-hot labels and the rounded predictions
def metric_F1score(y_true, y_pred):
    TP = tf.reduce_sum(y_true * tf.round(y_pred))
    TN = tf.reduce_sum((1 - y_true) * (1 - tf.round(y_pred)))
    FP = tf.reduce_sum((1 - y_true) * tf.round(y_pred))
    FN = tf.reduce_sum(y_true * (1 - tf.round(y_pred)))
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    F1score = 2 * precision * recall / (precision + recall)
    return F1score
```
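A quick sanity check of the metric on my own toy tensors (assuming one-hot labels, as used with categorical_crossentropy):
```python
y_true = tf.constant([[1., 0., 0.], [0., 1., 0.]])
y_pred = tf.constant([[0.9, 0.05, 0.05], [0.2, 0.2, 0.6]])  # second sample is predicted as class 2 instead of 1
print(float(metric_F1score(y_true, y_pred)))  # 0.5: TP=1, FP=1, FN=1, so precision = recall = 0.5
```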
### 3.2 Train/validation split
- Input: traintitle, the padded sequences produced in section 1.7
- Output: the label column from the original CSV
- Split ratio: train : validation = 4 : 1 (test_size=0.2); a short to_categorical demo follows this block
```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

label = train['label'].astype(int)
train_X, val_X, train_Y, val_Y = train_test_split(traintitle, label, shuffle=True, test_size=0.2, random_state=42)
# to_categorical is TF's one-hot conversion; it is needed because the loss is categorical_crossentropy.
# With sparse_categorical_crossentropy the labels could stay as integers and no conversion would be needed.
train_Y = tf.keras.utils.to_categorical(train_Y)
```
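A tiny to_categorical illustration with toy labels:
```python
import tensorflow as tf

print(tf.keras.utils.to_categorical([0, 2, 1]))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```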
### 3.3 Train the model
- Set the remaining hyperparameters as you see fit
```python
# Train the model
history = model.fit(train_X,train_Y,
batch_size=128,
epochs=10,
validation_split=0.1,
validation_freq=1,
)
```
### 3.4 Validate the model
```python
from sklearn.metrics import f1_score
pred_val = model.predict(val_X)
print(f1_score(val_Y, np.argmax(pred_val, axis=1), average='macro'))
```
### 3.5 Visualize the loss and accuracy
```python
import matplotlib.pyplot as plt

# Plot the loss and accuracy curves
def show_loss_acc_img(history):
    # Loss
    plt.plot(history.history['loss'], label="$Loss$")
    plt.plot(history.history['val_loss'], label='$val_loss$')
    plt.title('Loss')
    plt.xlabel('epoch')
    plt.ylabel('num')
    plt.legend()
    plt.show()
    # Accuracy
    plt.plot(history.history['categorical_accuracy'], label="$categorical_accuracy$")
    plt.plot(history.history['val_categorical_accuracy'], label='$val_categorical_accuracy$')
    plt.title('Accuracy')
    plt.xlabel('epoch')
    plt.ylabel('num')
    plt.legend()
    plt.show()

show_loss_acc_img(history)
```
### 3.6 Predict test-set sentiment polarity
```python
# Predict the polarity of the test set
pred_val = model.predict(testtitle)
# Save the submission file
submission = pd.DataFrame(test.id.values, columns=["id"])
submission["label"] = np.argmax(pred_val, axis=1)
submission.to_csv("submission.csv", index=False)
```
# Ready-to-use snippets
1. Use a regular expression to strip HTML tags and other symbols from text
```python
import re
# Text filtering function
def text_filter(text):
    # re.sub(pattern, repl, string)
    text = re.sub("[A-Za-z0-9\!\=\?\%\[\]\,\(\)\>\<:<\/#\. -----\_]", "", text)
    text = text.replace('图片', '')
    text = text.replace('\xa0', '')  # remove &nbsp;
    # Strip HTML tags
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, ' ', text)
    # Remove other punctuation characters
    r1 = "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+""'?@|:~{}#]+|[——!\\\,。=?、:“”‘’¥……()《》【】]"
    text = re.sub(r1,'',text)
    # Strip leading/trailing whitespace
    text = text.strip()
    return text
```
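A hedged usage example with my own sample string (not from the dataset). Note that the first character class already strips ASCII letters, digits, angle brackets and slashes, so HTML tags are mostly shredded before the cleanr step.
```python
sample = "<p>今日头条ABC 123!(图片)</p>"   # hypothetical input
print(text_filter(sample))                 # -> 今日头条
```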
2. Train your own word vectors with gensim
Reference blogs:
[1] https://www.jianshu.com/p/5f04e97d1b27  plotting word vectors after TSNE dimensionality reduction
[2] https://www.cnblogs.com/johnnyzen/p/10900040.html  detailed explanation of the gensim.models.Word2Vec parameters
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File : word2vecgensim.py
@Contact : htkstudy@163.com
@Modify Time @Author @Version @Description
------------ ------- -------- -----------
2021/3/9 8:55 Armor(htk) 1.0 None
'''
import gensim
import time
from sklearn.manifold import TSNE
from matplotlib.font_manager import *
import matplotlib.pyplot as plt
class EpochSaver(gensim.models.callbacks.CallbackAny2Vec):
    '''Callback used to save the model and print the training loss.'''
    def __init__(self, save_path):
        self.save_path = save_path          # model save path
        self.epoch = 0                      # epoch counter
        self.pre_loss = 0                   # cumulative loss up to the previous epoch
        self.best_loss = 999999999.9        # best (lowest) epoch loss so far
        self.since = time.time()            # start time of the current epoch

    def on_epoch_end(self, model):
        self.epoch += 1
        cum_loss = model.get_latest_training_loss()  # cumulative loss since the first epoch
        epoch_loss = cum_loss - self.pre_loss        # this epoch's loss = cumulative - previous cumulative
        time_taken = time.time() - self.since        # epoch duration
        print("Epoch %d, loss: %.2f, time: %dmin %ds" %
              (self.epoch, epoch_loss, time_taken // 60, time_taken % 60))
        # Track best_loss and save the model whenever the epoch loss improves
        if self.best_loss > epoch_loss:
            self.best_loss = epoch_loss
            print("Better model. Best loss: %.2f" % self.best_loss)
            model.save(self.save_path)
            print("Model %s save done!" % self.save_path)
        self.pre_loss = cum_loss
        self.since = time.time()
# Load a previously trained word2vec model
def load_model_word2vec(save_path):
    # e.g. model_word2vec = gensim.models.Word2Vec.load('final_word2vec_model')
    model_word2vec = gensim.models.Word2Vec.load(save_path)
    return model_word2vec

def print_since_time(since):
    time_elapsed = time.time() - since
    print('Elapsed time: {:.0f}min {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
def show_word2vec_2D(model_word2vec, random_word):
    # Reduce the word vectors to 2D with TSNE; random_word must be a list of strings
    X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(model_word2vec.wv[random_word])
    plt.figure(figsize=(14, 8))
    # Load a Chinese font so the word labels render correctly
    myfont = FontProperties(fname='C:\Windows\Fonts\simsun.ttc')
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1])  # scatter plot of the 2D embeddings
    for i in range(len(X_tsne)):
        x = X_tsne[i][0]
        y = X_tsne[i][1]
        plt.text(x, y, random_word[i], fontproperties=myfont, size=16)  # annotate each point with its word
    plt.show()
if __name__=="__main__":
    # Prepare your own input; note that input_doc must be 2-D, i.e. a list of token lists: input_doc = [[...]]
input_doc = [['`', '广告', '联系', '微', '信号', '花都区', '租房', '满', '一年', '有望', '确保', '学位', '信息', '时报讯', '记者', '崔小远', '近日',
'发布', '下称', '了解', '今年', '花都区', '公办', '小学', '计划', '招收', '班级', '公办', '初中', '计划', '招收', '班级', ';', '民办小学', '花都区',
'班级', '民办', '初中', '计划', '招收', '班级', '对比', '年', '招生', '细则', '今年', '招生', '规模', '总体', '变化', '不', '大', '计划', '招收',
'了解', '花都区', '招生', '时间', '安排', '月', '日', '~', '月', '日', '花都区', '公办', '小学', '网上', '报名', ';','教育局',
'~', '积分', '入学', '网上', '报名', ';', '月', '日', '~', '月', '日', '花都区', '民办小学', '网上', '报名', ';',
'~', '花都区', '小区', '配套', '业主', '非', '广州市', '户籍', '适龄', '子女', '报名', '保障', '区内', '明确', '未来', '十年',
'承租人', '子女', '入学', '方面', '提出', '具有', '广州市', '户籍', '含', '政策性', '照顾', '生', '广州市', '无', '自有', '产权', '住房', '含', '城乡',
'自建房', '租赁', '房屋', '所在地', '唯一', '居住地', '房屋', '租赁', '合同', '登记', '备案', '连续', '满', '一年', '截止', '日期', '申请', '入学', '内在',
'年月日', '以上', '申请', '时', '租赁', '合同', '有效', '状态', '承租人', '适龄', '子女', '花都区', '教育局', '确保', '学位', '供给', '年', '当中', '已经',
'增城', '花都', '从化', '张', '病床', '条件', '建立', '专业', '精神病', '医院', '来源', '花都', '早晨', '区卫计局', '广州', '花都', '发布', '今日', '花都',
'花都', '求职', '招聘', '群', '添加', '时', '注明', '招定', '求职', '小编', '工资', '大拇指', '挂钩', '点', '一下', '一分钱', '求', '打赏', '记得', '加']]
    # Train the model -----------------------------------------------
model_word2vec = gensim.models.Word2Vec(min_count=1,
window=5,
size=256,
workers=4,
batch_words=1000)
    since = time.time()  # start timing
    model_word2vec.build_vocab(input_doc, progress_per=2000)  # build the vocabulary from the sentences; progress_per controls how often progress is reported
    print_since_time(since)  # stop timing and print the elapsed time
since = time.time()
model_word2vec.train(input_doc,
total_examples=model_word2vec.corpus_count,
epochs=20,
compute_loss=True,
report_delay=60 * 10,
                         callbacks=[EpochSaver('./final_word2vec_model')])  # the EpochSaver callback saves the model
    print_since_time(since)  # stop timing and print the elapsed time
    # Plot the word vectors in 2D
    show_word2vec_2D(model_word2vec, input_doc[0])
    # model_word2vec = load_model_word2vec('./final_word2vec_model')
    # print(model_word2vec)
    # # Compute the similarity between two words
    # y2 = model_word2vec.wv.similarity(u"租赁", u"承租人")
    # print(y2)
    # # Print the words most similar to a given word
    # for i in model_word2vec.wv.most_similar(u"建立"):
    #     print(i[0], i[1])
```
TSNE visualization of the trained word vectors (figure not reproduced here).