1-3, An End-to-End Modeling Example for Text Data

1. Preparing the Data

The goal with the IMDB dataset is to predict a review's sentiment label from the text of the movie review.
The training set contains 20,000 reviews and the test set 5,000, with positive and negative reviews each making up half of both.
Preprocessing text data is relatively tedious: it involves Chinese word segmentation (not needed in this example), building a vocabulary, encoding tokens as integers, padding sequences, and constructing a data pipeline.
There are two common approaches to text preprocessing in TensorFlow. The first uses the Tokenizer vocabulary-building utility in tf.keras.preprocessing together with tf.keras.utils.Sequence to build a text data generator pipeline.
The second uses tf.data.Dataset together with the tf.keras.layers.experimental.preprocessing.TextVectorization preprocessing layer.
The first approach is fairly involved; a usage example can be found in the following article.
https://zhuanlan.zhihu.com/p/67697840
The second approach is the TensorFlow-native way and is comparatively simpler. It is the one introduced here.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import models, layers, preprocessing, optimizers, losses, metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re, string

train_data_path = "./data/imdb/train.csv"
test_data_path = "./data/imdb/test.csv"

MAX_WORDS = 10000  # only consider the 10000 most frequent words
MAX_LEN = 200      # keep 200 words per sample
BATCH_SIZE = 20


# Build the data pipeline
def split_line(line):
    arr = tf.strings.split(line, "\t")
    label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]), tf.int32), axis=0)
    text = tf.expand_dims(arr[1], axis=0)
    return (text, label)

ds_train_raw = tf.data.TextLineDataset(filenames=[train_data_path]) \
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .shuffle(buffer_size=1000).batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)

ds_test_raw = tf.data.TextLineDataset(filenames=[test_data_path]) \
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)


# Build the vocabulary
def clean_text(text):
    lowercase = tf.strings.lower(text)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    cleaned_punctuation = tf.strings.regex_replace(stripped_html,
        '[%s]' % re.escape(string.punctuation), '')
    return cleaned_punctuation

vectorize_layer = TextVectorization(
    standardize=clean_text,
    split='whitespace',
    max_tokens=MAX_WORDS - 1,  # one token is reserved for the placeholder
    output_mode='int',
    output_sequence_length=MAX_LEN)

ds_text = ds_train_raw.map(lambda text, label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[0:100])


# Encode the words as integers
ds_train = ds_train_raw.map(lambda text, label: (vectorize_layer(text), label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text, label: (vectorize_layer(text), label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
```
```
[b'the', b'and', b'a', b'of', b'to', b'is', b'in', b'it', b'i', b'this', b'that', b'was', b'as', b'for', b'with', b'movie', b'but', b'film', b'on', b'not', b'you', b'his', b'are', b'have', b'be', b'he', b'one', b'its', b'at', b'all', b'by', b'an', b'they', b'from', b'who', b'so', b'like', b'her', b'just', b'or', b'about', b'has', b'if', b'out', b'some', b'there', b'what', b'good', b'more', b'when', b'very', b'she', b'even', b'my', b'no', b'would', b'up', b'time', b'only', b'which', b'story', b'really', b'their', b'were', b'had', b'see', b'can', b'me', b'than', b'we', b'much', b'well', b'get', b'been', b'will', b'into', b'people', b'also', b'other', b'do', b'bad', b'because', b'great', b'first', b'how', b'him', b'most', b'dont', b'made', b'then', b'them', b'films', b'movies', b'way', b'make', b'could', b'too', b'any', b'after', b'characters']
```
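The vocabulary above maps integer ids back to words. A minimal, self-contained round trip on a toy corpus (not the IMDB data) shows how the ids produced by TextVectorization can be decoded again; note that recent TensorFlow versions place the padding token and the OOV token at indices 0 and 1 of `get_vocabulary()`:

```python
import numpy as np
import tensorflow as tf
try:
    from tensorflow.keras.layers import TextVectorization
except ImportError:  # older TF releases keep it under the experimental namespace
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Toy corpus, not the IMDB data from this section
corpus = np.array(["the movie was great", "the film was bad"])
layer = TextVectorization(max_tokens=20, output_mode='int', output_sequence_length=6)
layer.adapt(corpus)

vocab = layer.get_vocabulary()          # in recent TF: index 0 = padding, 1 = OOV
ids = layer(np.array(["the movie was bad"]))
# Drop the zero padding ids, then look each id up in the vocabulary
words = [vocab[i] for i in ids.numpy()[0] if i != 0]
print(words)
```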

2. Defining the Model

The Keras API offers three ways to build models: the Sequential API for layer-by-layer stacks, the functional API for arbitrary architectures, and subclassing the Model base class for fully custom models.

Here we build a custom model by subclassing the Model base class.
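For comparison, the same architecture can also be written with the Sequential API in a few lines. This is a sketch with the same layer hyperparameters as the subclassed model defined next, not code from the original text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_WORDS, MAX_LEN = 10000, 200

model_seq = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(MAX_WORDS, 7),
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.MaxPool1D(),
    layers.Conv1D(128, kernel_size=2, activation="relu"),
    layers.MaxPool1D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
print(model_seq.count_params())
out = model_seq(tf.zeros((2, MAX_LEN), dtype=tf.int32))
print(out.shape)  # (2, 1)
```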

```python
# Demonstrates a custom model; in practice, prefer Sequential or the functional API
tf.keras.backend.clear_session()

class CnnModel(models.Model):
    def __init__(self):
        super(CnnModel, self).__init__()

    def build(self, input_shape):
        self.embedding = layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN)
        self.conv_1 = layers.Conv1D(16, kernel_size=5, name="conv_1", activation="relu")
        self.pool_1 = layers.MaxPool1D(name="pool_1")
        self.conv_2 = layers.Conv1D(128, kernel_size=2, name="conv_2", activation="relu")
        self.pool_2 = layers.MaxPool1D(name="pool_2")
        self.flatten = layers.Flatten()
        self.dense = layers.Dense(1, activation="sigmoid")
        super(CnnModel, self).build(input_shape)

    def call(self, x):
        x = self.embedding(x)
        x = self.conv_1(x)
        x = self.pool_1(x)
        x = self.conv_2(x)
        x = self.pool_2(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x

    # Overridden so that summary() displays the Output Shape column
    def summary(self):
        x_input = layers.Input(shape=MAX_LEN)
        output = self.call(x_input)
        model = tf.keras.Model(inputs=x_input, outputs=output)
        model.summary()

model = CnnModel()
model.build(input_shape=(None, MAX_LEN))
model.summary()
```
```
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 200)]             0
_________________________________________________________________
embedding (Embedding)        (None, 200, 7)            70000
_________________________________________________________________
conv_1 (Conv1D)              (None, 196, 16)           576
_________________________________________________________________
pool_1 (MaxPooling1D)        (None, 98, 16)            0
_________________________________________________________________
conv_2 (Conv1D)              (None, 97, 128)           4224
_________________________________________________________________
pool_2 (MaxPooling1D)        (None, 48, 128)           0
_________________________________________________________________
flatten (Flatten)            (None, 6144)              0
_________________________________________________________________
dense (Dense)                (None, 1)                 6145
=================================================================
Total params: 80,945
Trainable params: 80,945
Non-trainable params: 0
_________________________________________________________________
```
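The Param # column of the summary can be verified by hand with the standard counting formulas:

```python
# Hand-check of the Param # column in the summary above
embedding = 10000 * 7            # MAX_WORDS vocabulary entries x 7-dim embedding
conv_1 = (5 * 7 + 1) * 16        # (kernel_size * in_channels + bias) per filter
conv_2 = (2 * 16 + 1) * 128
dense = 48 * 128 * 1 + 1         # flattened 48*128 features -> 1 unit, plus bias
total = embedding + conv_1 + conv_2 + dense
print(embedding, conv_1, conv_2, dense, total)  # -> 70000 576 4224 6145 80945
```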

3. Training the Model

There are generally three ways to train a model: the built-in fit method, the built-in train_on_batch method, and a custom training loop. Here we train the model with a custom training loop.
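For contrast with the custom loop below, the built-in fit route would look like this. This is a self-contained sketch on synthetic data (the toy dataset and tiny model are stand-ins, not the IMDB pipeline from above):

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for the ds_train pipeline built earlier
x = np.random.randint(0, 100, size=(64, 20))
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(100, 4),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# compile attaches optimizer/loss/metrics; fit then runs the whole loop for you
model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(ds, epochs=2, verbose=0)
print(sorted(history.history.keys()))
```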

```python
# Print a timestamped divider line (the +8 shifts UTC to the UTC+8 time zone)
@tf.function
def printbar():
    today_ts = tf.timestamp() % (24 * 60 * 60)
    hour = tf.cast(today_ts // 3600 + 8, tf.int32) % tf.constant(24)
    minute = tf.cast((today_ts % 3600) // 60, tf.int32)
    second = tf.cast(tf.floor(today_ts % 60), tf.int32)

    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}", m)) == 1:
            return tf.strings.format("0{}", m)
        else:
            return tf.strings.format("{}", m)

    timestring = tf.strings.join([timeformat(hour), timeformat(minute),
                                  timeformat(second)], separator=":")
    tf.print("==========" * 8 + timestring)
```
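The formatting logic of printbar is easier to check outside a tf.function. A plain-Python equivalent (the +8 offset converts UTC to UTC+8, as in the original):

```python
import time

def timebar(ts=None):
    """Pure-Python version of printbar's time formatting."""
    ts = time.time() if ts is None else ts
    today = ts % (24 * 60 * 60)
    hour = int(today // 3600 + 8) % 24   # shift UTC to UTC+8
    minute = int((today % 3600) // 60)
    second = int(today % 60)
    return "==========" * 8 + "{:02d}:{:02d}:{:02d}".format(hour, minute, second)

print(timebar(0))  # midnight UTC formats as 08:00:00 in UTC+8
```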
```python
optimizer = optimizers.Nadam()
loss_func = losses.BinaryCrossentropy()

train_loss = metrics.Mean(name='train_loss')
train_metric = metrics.BinaryAccuracy(name='train_accuracy')
valid_loss = metrics.Mean(name='valid_loss')
valid_metric = metrics.BinaryAccuracy(name='valid_accuracy')

@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_func(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)

@tf.function
def valid_step(model, features, labels):
    predictions = model(features, training=False)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)

def train_model(model, ds_train, ds_valid, epochs):
    for epoch in tf.range(1, epochs + 1):
        for features, labels in ds_train:
            train_step(model, features, labels)
        for features, labels in ds_valid:
            valid_step(model, features, labels)

        # Adapt this logs template to whichever metrics you actually use
        logs = 'Epoch={},Loss:{},Accuracy:{},Valid Loss:{},Valid Accuracy:{}'
        if epoch % 1 == 0:
            printbar()
            tf.print(tf.strings.format(logs,
                (epoch, train_loss.result(), train_metric.result(),
                 valid_loss.result(), valid_metric.result())))
            tf.print("")

        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()

train_model(model, ds_train, ds_test, epochs=6)
```
```
================================================================================13:54:08
Epoch=1,Loss:0.442317516,Accuracy:0.7695,Valid Loss:0.323672801,Valid Accuracy:0.8614

================================================================================13:54:20
Epoch=2,Loss:0.245737702,Accuracy:0.90215,Valid Loss:0.356488883,Valid Accuracy:0.8554

================================================================================13:54:32
Epoch=3,Loss:0.17360799,Accuracy:0.93455,Valid Loss:0.361132562,Valid Accuracy:0.8674

================================================================================13:54:44
Epoch=4,Loss:0.113476314,Accuracy:0.95975,Valid Loss:0.483677238,Valid Accuracy:0.856

================================================================================13:54:57
Epoch=5,Loss:0.0698405355,Accuracy:0.9768,Valid Loss:0.607856631,Valid Accuracy:0.857

================================================================================13:55:15
Epoch=6,Loss:0.0366807655,Accuracy:0.98825,Valid Loss:0.745884955,Valid Accuracy:0.854
```

4. Evaluating the Model

A model trained with a custom training loop has not been compiled, so the model.evaluate(ds_valid) method cannot be used on it directly.
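Alternatively, compile can also be called after the fact to attach a loss and metrics, after which model.evaluate works as usual. A self-contained sketch with a toy model standing in for one trained by a custom loop:

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(32, 4).astype("float32")
y = np.random.randint(0, 2, size=(32, 1)).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
_ = model(x)  # builds the weights (stand-in for training with a custom loop)

# Attach a loss and metrics after the fact, then evaluate as usual
model.compile(loss="binary_crossentropy", metrics=["accuracy"])
loss, acc = model.evaluate(x, y, verbose=0)
print(loss, acc)
```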

```python
def evaluate_model(model, ds_valid):
    for features, labels in ds_valid:
        valid_step(model, features, labels)
    logs = 'Valid Loss:{},Valid Accuracy:{}'
    tf.print(tf.strings.format(logs, (valid_loss.result(), valid_metric.result())))
    valid_loss.reset_states()
    valid_metric.reset_states()

evaluate_model(model, ds_test)
```

```
Valid Loss:0.745884418,Valid Accuracy:0.854
```

5. Using the Model

Any of the following can be used for inference:

  • model.predict(ds_test)
  • model(x_test)
  • model.call(x_test)
  • model.predict_on_batch(x_test)

model.predict(ds_test) is the recommended choice, since it works on both a Dataset and a Tensor.
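predict returns sigmoid probabilities, not class labels. To turn them into 0/1 predictions, apply a threshold (0.5 is the usual convention for binary classification, not something stated in the text):

```python
import numpy as np

# A few probabilities of the kind predict returns, thresholded at 0.5
probs = np.array([[0.7864823], [0.9999901], [0.13382755]])
labels = (probs > 0.5).astype("int32").ravel()
print(labels)  # -> [1 1 0]
```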

```python
model.predict(ds_test)
```

```
array([[0.7864823 ],
       [0.9999901 ],
       [0.99944776],
       ...,
       [0.8498302 ],
       [0.13382755],
       [1.        ]], dtype=float32)
```
```python
for x_test, _ in ds_test.take(1):
    print(model(x_test))
    # The following calls are equivalent:
    # print(model.call(x_test))
    # print(model.predict_on_batch(x_test))
```

```
tf.Tensor(
[[7.8648227e-01]
 [9.9999011e-01]
 [9.9944776e-01]
 [3.7153201e-09]
 [9.4462049e-01]
 [2.3522753e-04]
 [1.2044354e-04]
 [9.3752089e-07]
 [9.9996352e-01]
 [9.3435925e-01]
 [9.8746723e-01]
 [9.9908626e-01]
 [4.1563155e-08]
 [4.1808244e-03]
 [8.0184749e-05]
 [8.3910513e-01]
 [3.5167937e-05]
 [7.2113985e-01]
 [4.5228912e-03]
 [9.9942589e-01]], shape=(20, 1), dtype=float32)
```

6. Saving the Model

Saving the model in the TensorFlow-native SavedModel format is recommended.

```python
model.save('./data/tf_model_savedmodel', save_format="tf")
print('export saved model.')

model_loaded = tf.keras.models.load_model('./data/tf_model_savedmodel')
model_loaded.predict(ds_test)
```

```
array([[0.7864823 ],
       [0.9999901 ],
       [0.99944776],
       ...,
       [0.8498302 ],
       [0.13382755],
       [1.        ]], dtype=float32)
```
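The SavedModel format written by save_format="tf" is TensorFlow's generic serialization mechanism and also works for plain tf.Module objects. A minimal self-contained round trip (a toy module, not the CnnModel from this example):

```python
import os, tempfile
import numpy as np
import tensorflow as tf

class Toy(tf.Module):
    def __init__(self):
        self.w = tf.Variable(tf.ones((3, 2)))

    # A fixed input signature lets SavedModel export a callable graph
    @tf.function(input_signature=[tf.TensorSpec(shape=(None, 3), dtype=tf.float32)])
    def __call__(self, x):
        return tf.matmul(x, self.w)

toy = Toy()
path = os.path.join(tempfile.mkdtemp(), "toy_savedmodel")
tf.saved_model.save(toy, path)          # same on-disk format as save_format="tf"

reloaded = tf.saved_model.load(path)
x = tf.ones((2, 3), dtype=tf.float32)
print(reloaded(x).numpy())              # each entry is 1+1+1 = 3.0
```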