Original: tensor-flow-exercises

Translator: 飞龙

License: CC BY-NC-SA 4.0

2.1 TensorFlow Not MNIST

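Credits: derived from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 1

The objective of this exercise is to learn about simple data curation practices, and to become familiar with some of the data we will reuse later.

This notebook uses the notMNIST dataset for the Python experiments. The dataset is designed to look like the classic MNIST dataset while looking a little more like real data: it is a harder task, and the data is a lot less "clean" than MNIST.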

    # These are all the modules we'll be using later.
    # Make sure you can import them before proceeding further.
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    import pickle
    import tarfile
    from urllib.request import urlretrieve
    from IPython.display import display, Image
    from scipy import ndimage
    from sklearn.linear_model import LogisticRegression

First, we download the dataset to the local machine. The data consists of characters rendered in a variety of fonts on 28x28 images. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k labelled examples and the test set about 19,000. Given these sizes, it should be possible to train models quickly on any machine.

    url = 'http://yaroslavvb.com/upload/notMNIST/'

    def maybe_download(filename, expected_bytes):
        """Download a file if not present, and make sure it's the right size."""
        if not os.path.exists(filename):
            filename, _ = urlretrieve(url + filename, filename)
        statinfo = os.stat(filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            raise Exception(
                'Failed to verify ' + filename + '. Can you get to it with a browser?')
        return filename

    train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
    test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)
    '''
    Found and verified notMNIST_large.tar.gz
    Found and verified notMNIST_small.tar.gz
    '''

Extract the dataset from the compressed .tar.gz file. This should give you a set of directories, labelled A through J.

    num_classes = 10

    def extract(filename):
        tar = tarfile.open(filename)
        tar.extractall()
        tar.close()
        root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
        data_folders = [os.path.join(root, d) for d in sorted(os.listdir(root))]
        if len(data_folders) != num_classes:
            raise Exception(
                'Expected %d folders, one per class. Found %d instead.' % (
                    num_classes, len(data_folders)))
        print(data_folders)
        return data_folders

    train_folders = extract(train_filename)
    test_folders = extract(test_filename)
    '''
    ['notMNIST_large/A', 'notMNIST_large/B', 'notMNIST_large/C', 'notMNIST_large/D', 'notMNIST_large/E', 'notMNIST_large/F', 'notMNIST_large/G', 'notMNIST_large/H', 'notMNIST_large/I', 'notMNIST_large/J']
    ['notMNIST_small/A', 'notMNIST_small/B', 'notMNIST_small/C', 'notMNIST_small/D', 'notMNIST_small/E', 'notMNIST_small/F', 'notMNIST_small/G', 'notMNIST_small/H', 'notMNIST_small/I', 'notMNIST_small/J']
    '''

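Problem 1

Let's take a peek at some of the data to make sure it looks sensible. Each example should be an image of a character A through J rendered in a different font. Display a sample of the images that we just downloaded. Hint: you can use the package IPython.display.

Now let's load the data in a more manageable format. We'll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and a standard deviation of about 0.5, which makes training easier. The labels will be stored in a separate array of integers 0 through 9. A few images might not be readable; we simply skip them.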

    image_size = 28  # pixel width and height
    pixel_depth = 255.0  # number of levels per pixel

    def load(data_folders, min_num_images, max_num_images):
        dataset = np.ndarray(
            shape=(max_num_images, image_size, image_size), dtype=np.float32)
        labels = np.ndarray(shape=(max_num_images), dtype=np.int32)
        label_index = 0
        image_index = 0
        for folder in data_folders:
            print(folder)
            for image in os.listdir(folder):
                if image_index >= max_num_images:
                    raise Exception('More images than expected: %d >= %d' % (
                        image_index, max_num_images))
                image_file = os.path.join(folder, image)
                try:
                    # Note: ndimage.imread was removed in SciPy 1.2;
                    # this call needs an older SciPy.
                    image_data = (ndimage.imread(image_file).astype(float) -
                                  pixel_depth / 2) / pixel_depth
                    if image_data.shape != (image_size, image_size):
                        raise Exception('Unexpected image shape: %s' % str(image_data.shape))
                    dataset[image_index, :, :] = image_data
                    labels[image_index] = label_index
                    image_index += 1
                except IOError as e:
                    print('Could not read:', image_file, ':', e, "- it's ok, skipping.")
            label_index += 1
        num_images = image_index
        dataset = dataset[0:num_images, :, :]
        labels = labels[0:num_images]
        if num_images < min_num_images:
            raise Exception('Many fewer images than expected: %d < %d' % (
                num_images, min_num_images))
        print('Full dataset tensor:', dataset.shape)
        print('Mean:', np.mean(dataset))
        print('Standard deviation:', np.std(dataset))
        print('Labels:', labels.shape)
        return dataset, labels

    train_dataset, train_labels = load(train_folders, 450000, 550000)
    test_dataset, test_labels = load(test_folders, 18000, 20000)
    '''
    notMNIST_large/A
    Could not read: notMNIST_large/A/SG90IE11c3RhcmQgQlROIFBvc3Rlci50dGY=.png : cannot identify image file - it's ok, skipping.
    Could not read: notMNIST_large/A/RnJlaWdodERpc3BCb29rSXRhbGljLnR0Zg==.png : cannot identify image file - it's ok, skipping.
    Could not read: notMNIST_large/A/Um9tYW5hIEJvbGQucGZi.png : cannot identify image file - it's ok, skipping.
    notMNIST_large/B
    Could not read: notMNIST_large/B/TmlraXNFRi1TZW1pQm9sZEl0YWxpYy5vdGY=.png : cannot identify image file - it's ok, skipping.
    notMNIST_large/C
    notMNIST_large/D
    Could not read: notMNIST_large/D/VHJhbnNpdCBCb2xkLnR0Zg==.png : cannot identify image file - it's ok, skipping.
    notMNIST_large/E
    notMNIST_large/F
    notMNIST_large/G
    notMNIST_large/H
    notMNIST_large/I
    notMNIST_large/J
    Full dataset tensor: (529114, 28, 28)
    Mean: -0.0816593
    Standard deviation: 0.45423
    Labels: (529114,)
    notMNIST_small/A
    Could not read: notMNIST_small/A/RGVtb2NyYXRpY2FCb2xkT2xkc3R5bGUgQm9sZC50dGY=.png : cannot identify image file - it's ok, skipping.
    notMNIST_small/B
    notMNIST_small/C
    notMNIST_small/D
    notMNIST_small/E
    notMNIST_small/F
    Could not read: notMNIST_small/F/Q3Jvc3NvdmVyIEJvbGRPYmxpcXVlLnR0Zg==.png : cannot identify image file - it's ok, skipping.
    notMNIST_small/G
    notMNIST_small/H
    notMNIST_small/I
    notMNIST_small/J
    Full dataset tensor: (18724, 28, 28)
    Mean: -0.0746364
    Standard deviation: 0.458622
    Labels: (18724,)
    '''
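
The normalization used when loading can be checked without reading any image files. The sketch below applies the same formula, (pixel - pixel_depth / 2) / pixel_depth, to a small made-up array (the `raw` values and the `normalize` helper are assumptions for illustration) and shows that 8-bit pixel values end up in the range [-0.5, 0.5].

```python
import numpy as np

pixel_depth = 255.0  # same constant as in the loading code

def normalize(raw):
    """Shift and scale raw pixel values into the range [-0.5, 0.5]."""
    return (raw.astype(float) - pixel_depth / 2) / pixel_depth

# Hypothetical 8-bit pixel values: black, mid-grey, white.
raw = np.array([0, 128, 255], dtype=np.uint8)
normalized = normalize(raw)
print(normalized.min(), normalized.max())
```

The mean is only close to zero when dark and bright pixels are roughly balanced, which is why the reported means above are slightly negative rather than exactly zero.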

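Problem 2

Let's verify that the data still looks good. Display a sample of the labels and images from the ndarray. Hint: you can use matplotlib.pyplot.

Next, we'll randomize the data. It is important to have the labels well shuffled so that the training and test distributions match.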

    np.random.seed(133)

    def randomize(dataset, labels):
        permutation = np.random.permutation(labels.shape[0])
        shuffled_dataset = dataset[permutation, :, :]
        shuffled_labels = labels[permutation]
        return shuffled_dataset, shuffled_labels

    train_dataset, train_labels = randomize(train_dataset, train_labels)
    test_dataset, test_labels = randomize(test_dataset, test_labels)

Problem 3

Convince yourself that the data is still good after shuffling!
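
One possible way to do this, sketched on synthetic data rather than the real arrays: write each image's label into its pixels before shuffling, then check that every image still sits next to its own label afterwards. The toy shapes and the local copy of `randomize` are assumptions for illustration.

```python
import numpy as np

def randomize(dataset, labels):
    """Apply one shared random permutation to images and labels."""
    permutation = np.random.permutation(labels.shape[0])
    return dataset[permutation, :, :], labels[permutation]

np.random.seed(133)
# Synthetic stand-in: 50 tiny "images", each filled with its own label value.
labels = np.arange(50) % 10
dataset = np.ones((50, 4, 4), dtype=np.float32) * labels[:, None, None]

shuffled_dataset, shuffled_labels = randomize(dataset, labels)

# Every image must still carry its own label, and no sample may be lost.
still_aligned = bool(np.all(shuffled_dataset[:, 0, 0] == shuffled_labels))
same_counts = np.array_equal(np.bincount(labels), np.bincount(shuffled_labels))
print(still_aligned, same_counts)
```

On the real data, plotting a handful of shuffled images with their labels as titles serves the same purpose.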

Problem 4

Another check: we expect the data to be balanced across classes. Verify that.

Prune the training data as needed. Depending on your computer setup, you might not be able to fit it all in memory, and you can tune train_size as needed. Also create a validation dataset for hyperparameter tuning.

    train_size = 200000
    valid_size = 10000

    valid_dataset = train_dataset[:valid_size, :, :]
    valid_labels = train_labels[:valid_size]
    train_dataset = train_dataset[valid_size:valid_size + train_size, :, :]
    train_labels = train_labels[valid_size:valid_size + train_size]
    print('Training', train_dataset.shape, train_labels.shape)
    print('Validation', valid_dataset.shape, valid_labels.shape)
    '''
    Training (200000, 28, 28) (200000,)
    Validation (10000, 28, 28) (10000,)
    '''
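
A possible way to check class balance is to count the integer labels with np.bincount. The labels below are randomly generated stand-ins, not the real train_labels, and `balance_ratio` is a name introduced here for illustration.

```python
import numpy as np

num_classes = 10
rng = np.random.RandomState(0)
# Synthetic labels standing in for train_labels (integers 0..9).
labels = rng.randint(0, num_classes, size=20000)

counts = np.bincount(labels, minlength=num_classes)
fractions = counts / counts.sum()
# Ratio of rarest to most common class; 1.0 means perfectly balanced.
balance_ratio = counts.min() / counts.max()
print(counts)
print('balance ratio: %.2f' % balance_ratio)
```

A ratio far below 1 would suggest subsampling the larger classes before training.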

Finally, let's save the data for later reuse:

    pickle_file = 'notMNIST.pickle'

    try:
        f = open(pickle_file, 'wb')
        save = {
            'train_dataset': train_dataset,
            'train_labels': train_labels,
            'valid_dataset': valid_dataset,
            'valid_labels': valid_labels,
            'test_dataset': test_dataset,
            'test_labels': test_labels,
        }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
        f.close()
    except Exception as e:
        print('Unable to save data to', pickle_file, ':', e)
        raise

    statinfo = os.stat(pickle_file)
    print('Compressed pickle size:', statinfo.st_size)
    # Compressed pickle size: 718193801

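Problem 5

By construction, this dataset might contain a lot of overlapping samples, including training data that is also contained in the validation and test sets! Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but is actually fine if you expect to see training samples recur when you use it. Measure how much overlap there is between the training, validation and test samples.

Optional questions:

  • What about near duplicates between datasets? (images that are almost identical)
  • Create a sanitized validation and test set, and compare your accuracy on those in subsequent exercises.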
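
One way to measure exact overlap, sketched under the assumption that a hash of each image's raw bytes is an adequate fingerprint: identical images give identical digests, so a set lookup counts the duplicates. Near duplicates are not detected by this approach. The arrays are synthetic and the helper names are made up for illustration.

```python
import hashlib
import numpy as np

def image_hash(image):
    """SHA-1 digest of the raw bytes of one image (exact-match fingerprint)."""
    return hashlib.sha1(image.tobytes()).hexdigest()

def count_overlap(dataset_a, dataset_b):
    """Count images of dataset_b that also appear, bit for bit, in dataset_a."""
    hashes_a = {image_hash(image) for image in dataset_a}
    return sum(image_hash(image) in hashes_a for image in dataset_b)

rng = np.random.RandomState(0)
train = rng.rand(100, 28, 28).astype(np.float32)
# Synthetic test set: 5 copies of training images plus 15 new ones.
test = np.concatenate([train[:5], rng.rand(15, 28, 28).astype(np.float32)])

overlap = count_overlap(train, test)
print('overlapping test images:', overlap)
```

Building the hash set once keeps the cost linear in the number of images, which matters for a training set of this size.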

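Problem 6

Let's get an idea of what an off-the-shelf classifier can give you on this data. It is always good to check that there is something to learn, and that the problem is not so trivial that a canned solution solves it.

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.

Optional question: train an off-the-shelf model on all the data!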
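
A minimal sketch of such an experiment, assuming scikit-learn is installed. It runs on synthetic, well-separated data (the `make_synthetic` helper and `centers` array are made up), so the accuracies it prints say nothing about notMNIST itself; the point is the flattening step and the loop over training set sizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
num_classes, image_size = 10, 28
# One random "prototype" image per class (purely synthetic).
centers = rng.randn(num_classes, image_size, image_size)

def make_synthetic(n):
    """Synthetic stand-in for notMNIST: noisy images around one centre per class."""
    labels = rng.randint(0, num_classes, size=n)
    images = centers[labels] + 0.5 * rng.randn(n, image_size, image_size)
    return images.astype(np.float32), labels

train_x, train_y = make_synthetic(1000)
test_x, test_y = make_synthetic(500)

scores = {}
for n in (50, 100, 1000):
    model = LogisticRegression(max_iter=1000)
    # scikit-learn expects 2D input, so flatten each 28x28 image to 784 values.
    model.fit(train_x[:n].reshape(n, -1), train_y[:n])
    scores[n] = model.score(test_x.reshape(len(test_x), -1), test_y)
    print(n, 'training samples -> test accuracy %.3f' % scores[n])
```

With the real arrays, the same loop would slice train_dataset and train_labels and score against test_dataset and test_labels.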

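2.2 TensorFlow Fully Connected

Credits: derived from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 2

Previously in 1_notmnist.ipynb, we created a pickle with formatted datasets for training, development and testing on the notMNIST dataset (http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html). The goal of this exercise is to progressively train deeper and more accurate models using TensorFlow.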

    # These are all the modules we'll be using later.
    # Make sure you can import them before proceeding further.
    import pickle
    import numpy as np
    import tensorflow as tf

First reload the data we generated in 1_notmnist.ipynb.

    pickle_file = 'notMNIST.pickle'

    with open(pickle_file, 'rb') as f:
        save = pickle.load(f)
        train_dataset = save['train_dataset']
        train_labels = save['train_labels']
        valid_dataset = save['valid_dataset']
        valid_labels = save['valid_labels']
        test_dataset = save['test_dataset']
        test_labels = save['test_labels']
        del save  # hint to help gc free up memory
        print('Training set', train_dataset.shape, train_labels.shape)
        print('Validation set', valid_dataset.shape, valid_labels.shape)
        print('Test set', test_dataset.shape, test_labels.shape)
    '''
    Training set (200000, 28, 28) (200000,)
    Validation set (10000, 28, 28) (10000,)
    Test set (18724, 28, 28) (18724,)
    '''

Reformat into a shape that is better adapted to the models we are going to train:

  • data as a flat matrix,
  • labels as float one-hot encodings.

    image_size = 28
    num_labels = 10

    def reformat(dataset, labels):
        dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
        # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
        labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
        return dataset, labels

    train_dataset, train_labels = reformat(train_dataset, train_labels)
    valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
    test_dataset, test_labels = reformat(test_dataset, test_labels)
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)
    '''
    Training set (200000, 784) (200000, 10)
    Validation set (10000, 784) (10000, 10)
    Test set (18724, 784) (18724, 10)
    '''
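
The one-hot line in reformat relies on NumPy broadcasting, which is easier to see on a tiny input. In this sketch `num_labels` is set to 4 only to keep the output small.

```python
import numpy as np

num_labels = 4  # small value chosen for illustration (the notebook uses 10)
labels = np.array([0, 2, 3])

# labels[:, None] has shape (3, 1); np.arange(num_labels) has shape (4,).
# Broadcasting compares every label against every class index: shape (3, 4).
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
print(one_hot)
```

Each row contains a single 1.0 in the column given by the label, and np.argmax along axis 1 recovers the original integer labels.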

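We are first going to train a multinomial logistic regression using simple gradient descent. TensorFlow works like this: first you describe the computation that you want performed, that is, what the inputs, the variables and the operations look like. These get created as nodes in a computation graph. This description is all contained within the block below:

    with graph.as_default():
        ...

Then you can run the operations on this graph as many times as you want by calling session.run(), providing the inputs and fetching the outputs returned from the graph. This runtime operation is all contained in the block below:

    with tf.Session(graph=graph) as session:
        ...

Let's load all the data into TensorFlow and build the computation graph corresponding to our training: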

    # With gradient descent training, even this much data is prohibitive.
    # Subset the training data for faster turnaround.
    train_subset = 10000

    graph = tf.Graph()
    with graph.as_default():
        # Input data.
        # Load the training, validation and test data into constants
        # that are attached to the graph.
        tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
        tf_train_labels = tf.constant(train_labels[:train_subset])
        tf_valid_dataset = tf.constant(valid_dataset)
        tf_test_dataset = tf.constant(test_dataset)

        # Variables.
        # These are the parameters that we are going to train.
        # The weight matrix is initialized with random values following
        # a (truncated) normal distribution. The biases are initialized to zero.
        weights = tf.Variable(
            tf.truncated_normal([image_size * image_size, num_labels]))
        biases = tf.Variable(tf.zeros([num_labels]))

        # Training computation.
        # We multiply the inputs with the weight matrix and add the biases.
        # We compute the softmax and cross-entropy (it is one operation in
        # TensorFlow, because it is very common and can be optimized), and
        # take the average of this cross-entropy across all training
        # examples: that is our loss.
        logits = tf.matmul(tf_train_dataset, weights) + biases
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=tf_train_labels, logits=logits))

        # Optimizer.
        # We are going to find the minimum of this loss using gradient descent.
        optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

        # Predictions for the training, validation and test data.
        # These are not part of training; they are only here so that we can
        # report accuracy figures as we train.
        train_prediction = tf.nn.softmax(logits)
        valid_prediction = tf.nn.softmax(
            tf.matmul(tf_valid_dataset, weights) + biases)
        test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)
Let's run this computation and iterate:

    num_steps = 801

    def accuracy(predictions, labels):
        return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
                / predictions.shape[0])

    with tf.Session(graph=graph) as session:
        # This is a one-time operation which ensures the parameters get
        # initialized as we described in the graph: random weights for the
        # matrix, zeros for the biases.
        tf.global_variables_initializer().run()
        print('Initialized')
        for step in range(num_steps):
            # Run the computations. We tell .run() that we want to run the
            # optimizer, and get the loss value and the training predictions
            # returned as NumPy arrays.
            _, l, predictions = session.run([optimizer, loss, train_prediction])
            if (step % 100 == 0):
                print('Loss at step', step, ':', l)
                print('Training accuracy: %.1f%%' % accuracy(
                    predictions, train_labels[:train_subset, :]))
                # Calling .eval() on valid_prediction is basically like calling
                # run(), but just to get that one NumPy array. Note that it
                # recomputes all its graph dependencies.
                print('Validation accuracy: %.1f%%' % accuracy(
                    valid_prediction.eval(), valid_labels))
        print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
    '''
    Initialized
    Loss at step 0 : 17.2939
    Training accuracy: 10.8%
    Validation accuracy: 13.8%
    Loss at step 100 : 2.26903
    Training accuracy: 72.3%
    Validation accuracy: 71.6%
    Loss at step 200 : 1.84895
    Training accuracy: 74.9%
    Validation accuracy: 73.9%
    Loss at step 300 : 1.60701
    Training accuracy: 76.0%
    Validation accuracy: 74.5%
    Loss at step 400 : 1.43912
    Training accuracy: 76.8%
    Validation accuracy: 74.8%
    Loss at step 500 : 1.31349
    Training accuracy: 77.5%
    Validation accuracy: 75.0%
    Loss at step 600 : 1.21501
    Training accuracy: 78.1%
    Validation accuracy: 75.4%
    Loss at step 700 : 1.13515
    Training accuracy: 78.6%
    Validation accuracy: 75.4%
    Loss at step 800 : 1.0687
    Training accuracy: 79.2%
    Validation accuracy: 75.6%
    Test accuracy: 82.9%
    '''
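
The loss minimized above is the mean softmax cross-entropy. As a rough NumPy sketch of that quantity (not TensorFlow's implementation; the function names and the two example rows are made up for illustration):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_cross_entropy(logits, one_hot_labels):
    """Average of -sum(labels * log(softmax(logits))) over the batch."""
    probabilities = softmax(logits)
    per_example = -np.sum(one_hot_labels * np.log(probabilities), axis=1)
    return per_example.mean()

# Two made-up examples with three classes each.
logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = mean_cross_entropy(logits, labels)
print('loss: %.4f' % loss)
```

The second example has uniform logits, so its individual loss is log(3): the model is no better than guessing among three classes.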

Let's now switch to stochastic gradient descent training instead, which is much faster. The graph is similar, except that instead of holding all the training data in a constant node, we create a Placeholder node which is fed actual data at every call of session.run().

    batch_size = 128

    graph = tf.Graph()
    with graph.as_default():
        # Input data. For the training data, we use a placeholder that will
        # be fed at run time with a training minibatch.
        tf_train_dataset = tf.placeholder(tf.float32,
                                          shape=(batch_size, image_size * image_size))
        tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
        tf_valid_dataset = tf.constant(valid_dataset)
        tf_test_dataset = tf.constant(test_dataset)

        # Variables.
        weights = tf.Variable(
            tf.truncated_normal([image_size * image_size, num_labels]))
        biases = tf.Variable(tf.zeros([num_labels]))

        # Training computation.
        logits = tf.matmul(tf_train_dataset, weights) + biases
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=tf_train_labels, logits=logits))

        # Optimizer.
        optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

        # Predictions for the training, validation and test data.
        train_prediction = tf.nn.softmax(logits)
        valid_prediction = tf.nn.softmax(
            tf.matmul(tf_valid_dataset, weights) + biases)
        test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Let's run it:

    num_steps = 3001

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        for step in range(num_steps):
            # Pick an offset within the training data, which has been randomized.
            # Note: we could use better randomization across epochs.
            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
            # Generate a minibatch.
            batch_data = train_dataset[offset:(offset + batch_size), :]
            batch_labels = train_labels[offset:(offset + batch_size), :]
            # Prepare a dictionary telling the session where to feed the minibatch.
            # The key of the dictionary is the placeholder node of the graph to
            # be fed, and the value is the NumPy array to feed to it.
            feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
            _, l, predictions = session.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
            if (step % 500 == 0):
                print('Minibatch loss at step', step, ':', l)
                print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
                print('Validation accuracy: %.1f%%' % accuracy(
                    valid_prediction.eval(), valid_labels))
        print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
    '''
    Initialized
    Minibatch loss at step 0 : 16.8091
    Minibatch accuracy: 12.5%
    Validation accuracy: 14.0%
    Minibatch loss at step 500 : 1.75256
    Minibatch accuracy: 77.3%
    Validation accuracy: 75.0%
    Minibatch loss at step 1000 : 1.32283
    Minibatch accuracy: 77.3%
    Validation accuracy: 76.6%
    Minibatch loss at step 1500 : 0.944533
    Minibatch accuracy: 83.6%
    Validation accuracy: 76.5%
    Minibatch loss at step 2000 : 1.03795
    Minibatch accuracy: 78.9%
    Validation accuracy: 77.8%
    Minibatch loss at step 2500 : 1.10219
    Minibatch accuracy: 80.5%
    Validation accuracy: 78.0%
    Minibatch loss at step 3000 : 0.758874
    Minibatch accuracy: 82.8%
    Validation accuracy: 78.8%
    Test accuracy: 86.1%
    '''
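
The offset formula in the loop above walks through the shuffled training set in steps of batch_size and wraps around before a batch could run past the end. A small sketch with a made-up dataset size (`num_examples = 1000` is an assumption chosen so the wrap-around shows up quickly):

```python
batch_size = 128
num_examples = 1000  # small made-up training set size, for illustration

def batch_offset(step):
    """Start index of the minibatch used at a given step."""
    return (step * batch_size) % (num_examples - batch_size)

offsets = [batch_offset(step) for step in range(10)]
print(offsets)
```

Because the modulus is num_examples - batch_size, every slice offset:offset + batch_size stays inside the array, so each minibatch has exactly batch_size rows, as the fixed-shape placeholders require.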

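Problem:

Turn the logistic regression example with SGD into a 1-hidden-layer neural network with rectified linear units (nn.relu()) and 1024 hidden nodes. This model should improve your validation / test accuracy.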

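2.3 TensorFlow Regularization

Credits: derived from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 3

Previously in 2_fullyconnected.ipynb, you trained a logistic regression and a neural network model. The goal of this exercise is to explore regularization techniques.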

    # These are all the modules we'll be using later.
    # Make sure you can import them before proceeding further.
    import pickle
    import numpy as np
    import tensorflow as tf

First reload the data we generated in 1_notmnist.ipynb.

    pickle_file = 'notMNIST.pickle'

    with open(pickle_file, 'rb') as f:
        save = pickle.load(f)
        train_dataset = save['train_dataset']
        train_labels = save['train_labels']
        valid_dataset = save['valid_dataset']
        valid_labels = save['valid_labels']
        test_dataset = save['test_dataset']
        test_labels = save['test_labels']
        del save  # hint to help gc free up memory
        print('Training set', train_dataset.shape, train_labels.shape)
        print('Validation set', valid_dataset.shape, valid_labels.shape)
        print('Test set', test_dataset.shape, test_labels.shape)
    '''
    Training set (200000, 28, 28) (200000,)
    Validation set (10000, 28, 28) (10000,)
    Test set (18724, 28, 28) (18724,)
    '''

Reformat into a shape that is better adapted to the models we are going to train:

  • data as a flat matrix,
  • labels as float one-hot encodings.

    image_size = 28
    num_labels = 10

    def reformat(dataset, labels):
        dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
        # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
        labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
        return dataset, labels

    train_dataset, train_labels = reformat(train_dataset, train_labels)
    valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
    test_dataset, test_labels = reformat(test_dataset, test_labels)
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)
    '''
    Training set (200000, 784) (200000, 10)
    Validation set (10000, 784) (10000, 10)
    Test set (18724, 784) (18724, 10)
    '''

    def accuracy(predictions, labels):
        return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
                / predictions.shape[0])

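Problem 1

Introduce and tune L2 regularization for both the logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor t using nn.l2_loss(t). The right amount of regularization should improve your validation / test accuracy.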
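
As a rough NumPy illustration of the penalty: tf.nn.l2_loss(t) computes half the sum of the squared entries of t, and the regularized loss adds that quantity, scaled by a coefficient, to the data loss. The helper names and the value of `beta` below are placeholders, not values from the exercise.

```python
import numpy as np

def l2_loss(t):
    """Half the sum of squares, the quantity tf.nn.l2_loss(t) computes."""
    return np.sum(np.square(t)) / 2.0

def regularized_loss(data_loss, weights, beta):
    """Data loss plus beta times the L2 penalty on the weights."""
    return data_loss + beta * l2_loss(weights)

weights = np.array([[3.0, -4.0], [0.0, 0.0]])
# beta is a hyperparameter to tune; 1e-3 is only a placeholder value.
total = regularized_loss(data_loss=1.0, weights=weights, beta=1e-3)
print(total)
```

Only the weight matrices are usually penalized; whether to include the biases is a design choice left to the reader.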

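Problem 2

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

Problem 3

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it is only inserted during training. What happens to our extreme overfitting case?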
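
The training-versus-evaluation distinction can be sketched in NumPy. This is an illustration of inverted dropout, with the same 1 / keep_prob rescaling of surviving units that tf.nn.dropout applies; the `dropout` function and its `training` flag are hypothetical names, not part of TensorFlow.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training):
    """Inverted dropout: zero units at random and rescale the survivors.

    With training=False the input is returned unchanged, so evaluation
    stays deterministic.
    """
    if not training:
        return activations
    mask = rng.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((1000, 64))
train_out = dropout(hidden, keep_prob=0.5, rng=rng, training=True)
eval_out = dropout(hidden, keep_prob=0.5, rng=rng, training=False)
print('kept fraction: %.3f' % np.mean(train_out > 0))
print('mean activation while training: %.3f' % train_out.mean())
```

The rescaling keeps the expected activation the same in both modes, which is why no correction is needed at evaluation time.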

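Problem 4

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is 97.1%. One avenue you can explore is to add multiple layers. Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)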
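
For intuition, the schedule produced by exponential decay can be written in plain Python as learning_rate * decay_rate ** (global_step / decay_steps). The `decay_steps` and `decay_rate` values below are made up; the exercise leaves them for you to choose.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """learning_rate * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps  # decay in discrete jumps
    return learning_rate * decay_rate ** exponent

# decay_steps and decay_rate are made-up values, not taken from the exercise.
for step in (0, 1000, 2000):
    print(step, exponential_decay(0.5, step, decay_steps=1000, decay_rate=0.5))
```

With these values the learning rate halves every 1000 steps; the staircase variant holds it constant between those boundaries.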

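2.4 TensorFlow Convolutions

Credits: derived from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 4

Previously in 2_fullyconnected.ipynb and 3_regularization.ipynb, we trained fully connected networks to classify notMNIST characters. The goal of this exercise is to make the neural network convolutional.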

    # These are all the modules we'll be using later.
    # Make sure you can import them before proceeding further.
    import pickle
    import numpy as np
    import tensorflow as tf

    pickle_file = 'notMNIST.pickle'

    with open(pickle_file, 'rb') as f:
        save = pickle.load(f)
        train_dataset = save['train_dataset']
        train_labels = save['train_labels']
        valid_dataset = save['valid_dataset']
        valid_labels = save['valid_labels']
        test_dataset = save['test_dataset']
        test_labels = save['test_labels']
        del save  # hint to help gc free up memory
        print('Training set', train_dataset.shape, train_labels.shape)
        print('Validation set', valid_dataset.shape, valid_labels.shape)
        print('Test set', test_dataset.shape, test_labels.shape)
    '''
    Training set (200000, 28, 28) (200000,)
    Validation set (10000, 28, 28) (10000,)
    Test set (18724, 28, 28) (18724,)
    '''

Reformat into a TensorFlow-friendly shape:

  • convolutions need the image data formatted as a cube (width by height by number of channels),
  • labels as float one-hot encodings.

    image_size = 28
    num_labels = 10
    num_channels = 1  # grayscale

    def reformat(dataset, labels):
        dataset = dataset.reshape(
            (-1, image_size, image_size, num_channels)).astype(np.float32)
        labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
        return dataset, labels

    train_dataset, train_labels = reformat(train_dataset, train_labels)
    valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
    test_dataset, test_labels = reformat(test_dataset, test_labels)
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)
    '''
    Training set (200000, 28, 28, 1) (200000, 10)
    Validation set (10000, 28, 28, 1) (10000, 10)
    Test set (18724, 28, 28, 1) (18724, 10)
    '''

    def accuracy(predictions, labels):
        return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
                / predictions.shape[0])

Let's build a small network with two convolutional layers, followed by one fully connected layer. Convolutional networks are more expensive computationally, so we limit the depth and the number of fully connected nodes.

    batch_size = 16
    patch_size = 5
    depth = 16
    num_hidden = 64

    graph = tf.Graph()
    with graph.as_default():
        # Input data.
        tf_train_dataset = tf.placeholder(
            tf.float32, shape=(batch_size, image_size, image_size, num_channels))
        tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
        tf_valid_dataset = tf.constant(valid_dataset)
        tf_test_dataset = tf.constant(test_dataset)

        # Variables.
        layer1_weights = tf.Variable(tf.truncated_normal(
            [patch_size, patch_size, num_channels, depth], stddev=0.1))
        layer1_biases = tf.Variable(tf.zeros([depth]))
        layer2_weights = tf.Variable(tf.truncated_normal(
            [patch_size, patch_size, depth, depth], stddev=0.1))
        layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
        # Integer division: the two stride-2 convolutions shrink 28 to 7.
        layer3_weights = tf.Variable(tf.truncated_normal(
            [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
        layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
        layer4_weights = tf.Variable(tf.truncated_normal(
            [num_hidden, num_labels], stddev=0.1))
        layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

        # Model.
        def model(data):
            conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
            hidden = tf.nn.relu(conv + layer1_biases)
            conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
            hidden = tf.nn.relu(conv + layer2_biases)
            shape = hidden.get_shape().as_list()
            reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
            hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
            return tf.matmul(hidden, layer4_weights) + layer4_biases

        # Training computation.
        logits = model(tf_train_dataset)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=tf_train_labels, logits=logits))

        # Optimizer.
        optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

        # Predictions for the training, validation and test data.
        train_prediction = tf.nn.softmax(logits)
        valid_prediction = tf.nn.softmax(model(tf_valid_dataset))
        test_prediction = tf.nn.softmax(model(tf_test_dataset))

    num_steps = 1001

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        for step in range(num_steps):
            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
            batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
            batch_labels = train_labels[offset:(offset + batch_size), :]
            feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
            _, l, predictions = session.run(
                [optimizer, loss, train_prediction], feed_dict=feed_dict)
            if (step % 50 == 0):
                print('Minibatch loss at step', step, ':', l)
                print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
                print('Validation accuracy: %.1f%%' % accuracy(
                    valid_prediction.eval(), valid_labels))
        print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
    '''
    Initialized
    Minibatch loss at step 0 : 3.51275
    Minibatch accuracy: 6.2%
    Validation accuracy: 12.8%
    Minibatch loss at step 50 : 1.48703
    Minibatch accuracy: 43.8%
    Validation accuracy: 50.4%
    Minibatch loss at step 100 : 1.04377
    Minibatch accuracy: 68.8%
    Validation accuracy: 67.4%
    Minibatch loss at step 150 : 0.601682
    Minibatch accuracy: 68.8%
    Validation accuracy: 73.0%
    Minibatch loss at step 200 : 0.898649
    Minibatch accuracy: 75.0%
    Validation accuracy: 77.8%
    Minibatch loss at step 250 : 1.3637
    Minibatch accuracy: 56.2%
    Validation accuracy: 75.4%
    Minibatch loss at step 300 : 1.41968
    Minibatch accuracy: 62.5%
    Validation accuracy: 76.0%
    Minibatch loss at step 350 : 0.300648
    Minibatch accuracy: 81.2%
    Validation accuracy: 80.2%
    Minibatch loss at step 400 : 1.32092
    Minibatch accuracy: 56.2%
    Validation accuracy: 80.4%
    Minibatch loss at step 450 : 0.556701
    Minibatch accuracy: 81.2%
    Validation accuracy: 79.4%
    Minibatch loss at step 500 : 1.65595
    Minibatch accuracy: 43.8%
    Validation accuracy: 79.6%
    Minibatch loss at step 550 : 1.06995
    Minibatch accuracy: 75.0%
    Validation accuracy: 81.2%
    Minibatch loss at step 600 : 0.223684
    Minibatch accuracy: 100.0%
    Validation accuracy: 82.3%
    Minibatch loss at step 650 : 0.619602
    Minibatch accuracy: 87.5%
    Validation accuracy: 81.8%
    Minibatch loss at step 700 : 0.812091
    Minibatch accuracy: 75.0%
    Validation accuracy: 82.4%
    Minibatch loss at step 750 : 0.276302
    Minibatch accuracy: 87.5%
    Validation accuracy: 82.3%
    Minibatch loss at step 800 : 0.450241
    Minibatch accuracy: 81.2%
    Validation accuracy: 82.3%
    Minibatch loss at step 850 : 0.137139
    Minibatch accuracy: 93.8%
    Validation accuracy: 82.3%
    Minibatch loss at step 900 : 0.52664
    Minibatch accuracy: 75.0%
    Validation accuracy: 82.2%
    Minibatch loss at step 950 : 0.623835
    Minibatch accuracy: 87.5%
    Validation accuracy: 82.1%
    Minibatch loss at step 1000 : 0.243114
    Minibatch accuracy: 93.8%
    Validation accuracy: 82.9%
    Test accuracy: 90.0%
    '''
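
The size of the first fully connected weight matrix in the model above follows from how padding='SAME' convolutions with stride 2 shrink the image: the spatial output size is the input size divided by the stride, rounded up. A small sketch of that arithmetic (the helper name is made up for illustration):

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size of a convolution with padding='SAME'."""
    return math.ceil(input_size / stride)

image_size, depth = 28, 16
after_conv1 = same_padding_output_size(image_size, stride=2)   # 14
after_conv2 = same_padding_output_size(after_conv1, stride=2)  # 7
flattened = after_conv2 * after_conv2 * depth
print(after_conv1, after_conv2, flattened)
```

The flattened size, 7 * 7 * 16 = 784, is the number of rows the third layer's weight matrix needs.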

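Problem 1

The convolutional model above uses convolutions with stride 2 to reduce the dimensionality. Replace the strides with a max pooling operation (nn.max_pool()) of stride 2 and kernel size 2.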
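
To see what such a pooling step does, here is a NumPy sketch of 2x2 max pooling with stride 2 on a single channel. It illustrates the operation itself rather than the tf.nn.max_pool call, and assumes even height and width.

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a (height, width) array.

    Assumes even height and width.
    """
    height, width = image.shape
    # Split each axis into (number of blocks, 2) and take the max per block.
    blocks = image.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(image)
print(pooled)
```

Like a stride-2 convolution, this halves each spatial dimension, so a 28x28 input becomes 14x14; the convolutions themselves would then use stride 1.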

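Problem 2

Try to get the best performance you can using a convolutional net. Look, for example, at the classic LeNet5 architecture, adding Dropout and/or learning rate decay.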

2.5 TensorFlow word2vec

Credits: derived from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 5

The goal of this exercise is to train a skip-gram model over Text8 data.

    # These are all the modules we'll be using later.
    # Make sure you can import them before proceeding further.
    import collections
    import math
    import numpy as np
    import os
    import random
    import tensorflow as tf
    import zipfile
    from urllib.request import urlretrieve
    from matplotlib import pylab
    from sklearn.manifold import TSNE

Download the data from the source website if necessary.

    url = 'http://mattmahoney.net/dc/'

    def maybe_download(filename, expected_bytes):
        """Download a file if not present, and make sure it's the right size."""
        if not os.path.exists(filename):
            filename, _ = urlretrieve(url + filename, filename)
        statinfo = os.stat(filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception(
                'Failed to verify ' + filename + '. Can you get to it with a browser?')
        return filename

    filename = maybe_download('text8.zip', 31344016)
    # Found and verified text8.zip

Read the data into a list of words.

    def read_data(filename):
        # Read the first file in the archive and split it on whitespace.
        with zipfile.ZipFile(filename) as f:
            return f.read(f.namelist()[0]).decode('utf-8').split()

    words = read_data(filename)
    print('Data size', len(words))
    # Data size 17005207

Build the dictionary and replace rare words with the UNK token.

    vocabulary_size = 50000

    def build_dataset(words):
        count = [['UNK', -1]]
        count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
        dictionary = dict()
        for word, _ in count:
            dictionary[word] = len(dictionary)
        data = list()
        unk_count = 0
        for word in words:
            if word in dictionary:
                index = dictionary[word]
            else:
                index = 0  # dictionary['UNK']
                unk_count = unk_count + 1
            data.append(index)
        count[0][1] = unk_count
        reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
        return data, count, dictionary, reverse_dictionary

    data, count, dictionary, reverse_dictionary = build_dataset(words)
    print('Most common words (+UNK)', count[:5])
    print('Sample data', data[:10])
    del words  # hint to reduce memory
    '''
    Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
    Sample data [5243, 3083, 12, 6, 195, 2, 3136, 46, 59, 156]
    '''

Function to generate a training batch for the skip-gram model.

    data_index = 0

    def generate_batch(batch_size, num_skips, skip_window):
        global data_index
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        batch = np.ndarray(shape=(batch_size), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
        span = 2 * skip_window + 1  # [ skip_window target skip_window ]
        buffer = collections.deque(maxlen=span)
        for _ in range(span):
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        for i in range(batch_size // num_skips):
            target = skip_window  # target label at the center of the buffer
            targets_to_avoid = [skip_window]
            for j in range(num_skips):
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                batch[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j, 0] = buffer[target]
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        return batch, labels

    batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
    for i in range(8):
        print(batch[i], '->', labels[i, 0])
        print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])
    '''
    3083 -> 5243
    originated -> anarchism
    3083 -> 12
    originated -> as
    12 -> 3083
    as -> originated
    12 -> 6
    as -> a
    6 -> 12
    a -> as
    6 -> 195
    a -> term
    195 -> 6
    term -> a
    195 -> 2
    term -> of
    '''
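
The pairing logic is easier to follow on plain words. The sketch below is a simplified, deterministic variant: it lists every (centre, context) pair within skip_window positions, whereas generate_batch samples num_skips context words at random for each centre. The function name and the toy sentence are chosen for illustration.

```python
def skipgram_pairs(tokens, skip_window):
    """All (centre, context) pairs within skip_window positions of each centre."""
    pairs = []
    for i in range(skip_window, len(tokens) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset != 0:
                pairs.append((tokens[i], tokens[i + offset]))
    return pairs

tokens = 'anarchism originated as a term of'.split()
pairs = skipgram_pairs(tokens, skip_window=1)
for centre, context in pairs:
    print(centre, '->', context)
```

With skip_window=1 and num_skips=2 both neighbours of each centre word are always used, so the set of pairs matches the sample output above; the order within each centre word may differ in generate_batch because it is random.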

Train a skip-gram model.

    batch_size = 128
    embedding_size = 128  # dimension of the embedding vector
    skip_window = 1  # how many words to consider to the left and right
    num_skips = 2  # how many times to reuse an input to generate a label
    # We pick a random validation set to sample nearest neighbors. Here we
    # limit the validation samples to the words that have a low numeric ID,
    # which by construction are also the most frequent.
    valid_size = 16  # random set of words to evaluate similarity on
    valid_window = 100  # only pick dev samples from the head of the distribution
    valid_examples = np.array(random.sample(range(valid_window), valid_size))
    num_sampled = 64  # number of negative examples to sample

    graph = tf.Graph()
    with graph.as_default():
        # Input data.
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Variables.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Model.
        # Look up embeddings for inputs.
        embed = tf.nn.embedding_lookup(embeddings, train_dataset)
        # Compute the softmax loss, using a sample of the negative labels each time.
        # Keyword arguments are used because the positional order of `inputs`
        # and `labels` differs between TensorFlow releases.
        loss = tf.reduce_mean(
            tf.nn.sampled_softmax_loss(
                weights=softmax_weights, biases=softmax_biases, inputs=embed,
                labels=train_labels, num_sampled=num_sampled,
                num_classes=vocabulary_size))

        # Optimizer.
        optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

        # Compute the similarity between minibatch examples and all embeddings.
        # We use the cosine similarity:
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(
            normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

    num_steps = 100001

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(
                batch_size, num_skips, skip_window)
            feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
            _, l = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 2000 == 0:
                if step > 0:
                    average_loss = average_loss / 2000
                # The average loss is an estimate of the loss over the last 2000 batches.
                print('Average loss at step', step, ':', average_loss)
                average_loss = 0
            # Note that this is expensive (~20% slowdown if computed every 500 steps).
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8  # number of nearest neighbors
                    # Index 0 of the sorted result is the word itself, so skip it.
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
        final_embeddings = normalized_embeddings.eval()

    num_points = 400

    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

    def plot(embeddings, labels):
        assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
        pylab.figure(figsize=(15, 15))  # in inches
        for i, label in enumerate(labels):
            x, y = embeddings[i, :]
            pylab.scatter(x, y)
            pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                           ha='right', va='bottom')
        pylab.show()

    words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
    plot(two_d_embeddings, words)
    '''
    Initialized
    Average loss at step 0 : 8.58149623871
    Nearest to been: unfavourably, marmara, ancestral, legal, bogart, glossaries, worst, rooms,
    Nearest to time: conformist, strawberries, sindhi, waterfall, xia, nominates, psp, sensitivity,
    Nearest to over: overlord, panda, golden, semigroup, rawlings, involved, shreveport, handling,
    Nearest to not: hymenoptera, reintroducing, lamiaceae, because, davao, omnipotent, combustion, debilitating,
    Nearest to three: catalog, koza, gn, braque, holstein, postgresql, luddite, justine,
    ...
    Nearest to and: or, but, purview, thirst, sukkot, epr, including, honesty,
    Nearest to eight: seven, nine, six, four, five, three, zero, one,
    Nearest to they: we, there, you, he, she, prisons, it, these,
    Nearest to more: less, most, very, quite, faster, larger, rather, smaller,
    Nearest to other: various, different, tamara, theos, some, cope, many, others,
    '''

[Figure 1: t-SNE plot of the word embeddings, produced by the plot call above]
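The nearest-neighbor report in the training loop relies on one fact: for unit-length vectors, the dot product equals the cosine similarity. The following plain-Python sketch reproduces that lookup on a toy table. `normalize`, `nearest_neighbors`, and `toy_embeddings` are illustrative names, not part of the notebook.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest_neighbors(embeddings, query, top_k):
    """Return the top_k row indices most similar to row `query`."""
    unit = [normalize(row) for row in embeddings]
    # For unit vectors the dot product is the cosine similarity.
    sims = [sum(a * b for a, b in zip(unit[query], row)) for row in unit]
    ranked = sorted(range(len(sims)), key=lambda i: -sims[i])
    # ranked[0] is expected to be the query itself (similarity 1), so skip it.
    return ranked[1:top_k + 1]


toy_embeddings = [
    [1.0, 0.0],   # row 0
    [0.9, 0.1],   # row 1: almost parallel to row 0
    [0.0, 1.0],   # row 2: orthogonal to row 0
    [-1.0, 0.0],  # row 3: opposite to row 0
]
print(nearest_neighbors(toy_embeddings, query=0, top_k=2))
```

The slice `[1:top_k + 1]` plays the same role as in the training loop: the best match of any word is the word itself, so it is dropped from the report.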

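2.6 TensorFlow LSTM

Credits: derived from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 6

After training a skip-gram model in 5_word2vec.ipynb, the goal of this exercise is to train an LSTM character model over the Text8 data.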

    # These are all the modules we'll be using later.
    # Make sure you can import them before proceeding further.
    import os
    import numpy as np
    import random
    import string
    import tensorflow as tf
    import zipfile
    from urllib.request import urlretrieve  # Python 3 location of urlretrieve

    url = 'http://mattmahoney.net/dc/'

    def maybe_download(filename, expected_bytes):
        """Download a file if not present, and make sure it's the right size."""
        if not os.path.exists(filename):
            filename, _ = urlretrieve(url + filename, filename)
        statinfo = os.stat(filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception(
                'Failed to verify ' + filename + '. Can you get to it with a browser?')
        return filename

    filename = maybe_download('text8.zip', 31344016)
    # Found and verified text8.zip

    def read_data(filename):
        """Return the first file in the zip archive as a string."""
        # The context manager closes the archive; the original close() call
        # sat after the return statement and never ran.
        with zipfile.ZipFile(filename) as f:
            name = f.namelist()[0]
            return tf.compat.as_str(f.read(name))

    text = read_data(filename)
    print('Data size', len(text))
    # Data size 100000000

Create a small validation set.

    valid_size = 1000
    valid_text = text[:valid_size]
    train_text = text[valid_size:]
    train_size = len(train_text)
    print(train_size, train_text[:64])
    print(valid_size, valid_text[:64])
    '''
    99999000 ons anarchists advocate social relations based upon voluntary as
    1000 anarchism originated as a term of abuse first used against earl
    '''

Utility functions to map characters to vocabulary IDs and back.

    vocabulary_size = len(string.ascii_lowercase) + 1  # [a-z] + ' '
    first_letter = ord(string.ascii_lowercase[0])

    def char2id(char):
        if char in string.ascii_lowercase:
            return ord(char) - first_letter + 1
        elif char == ' ':
            return 0
        else:
            print('Unexpected character:', char)
            return 0

    def id2char(dictid):
        if dictid > 0:
            return chr(dictid + first_letter - 1)
        else:
            return ' '

    # With the print() function, all arguments are evaluated before anything
    # is printed, so the warning from char2id('ï') appears first.
    print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
    # The last character printed by the next line is the space that ID 0 maps to.
    print(id2char(1), id2char(26), id2char(0))
    '''
    Unexpected character: ï
    1 26 0 0
    a z
    '''
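As a quick check of the mapping above, the following self-contained sketch encodes a string, decodes it again, and builds the one-hot vector that the batch generator later feeds to the model. `encode`, `decode`, and `one_hot` are hypothetical helpers added for illustration, and `char2id` is restated here without the warning message so that the block runs on its own.

```python
import string

FIRST_LETTER = ord('a')
VOCABULARY_SIZE = len(string.ascii_lowercase) + 1  # [a-z] + ' '


def char2id(char):
    """Map 'a'..'z' to 1..26 and everything else, including ' ', to 0."""
    if len(char) == 1 and char in string.ascii_lowercase:
        return ord(char) - FIRST_LETTER + 1
    return 0


def id2char(dictid):
    """Inverse of char2id; ID 0 becomes a space."""
    return chr(dictid + FIRST_LETTER - 1) if dictid > 0 else ' '


def encode(text):
    return [char2id(c) for c in text]


def decode(ids):
    return ''.join(id2char(i) for i in ids)


def one_hot(char):
    """Vector of VOCABULARY_SIZE zeros with a single 1.0 at the character's ID."""
    vec = [0.0] * VOCABULARY_SIZE
    vec[char2id(char)] = 1.0
    return vec


print(encode('a z'))
print(decode(encode('the quick brown fox')))
print(one_hot('c').index(1.0))
```

The round trip is lossless only for the Text8 alphabet. Any other character collapses to ID 0 and decodes as a space.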

Function to generate a training batch for the LSTM model.

    batch_size = 64
    num_unrollings = 10

    class BatchGenerator(object):
        def __init__(self, text, batch_size, num_unrollings):
            self._text = text
            self._text_size = len(text)
            self._batch_size = batch_size
            self._num_unrollings = num_unrollings
            # Integer division: the cursors are used as string indices.
            segment = self._text_size // batch_size
            self._cursor = [offset * segment for offset in range(batch_size)]
            self._last_batch = self._next_batch()

        def _next_batch(self):
            """Generate a single batch from the current cursor position in the data."""
            batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
            for b in range(self._batch_size):
                batch[b, char2id(self._text[self._cursor[b]])] = 1.0
                self._cursor[b] = (self._cursor[b] + 1) % self._text_size
            return batch

        def next(self):
            """Generate the next array of batches from the data. The array consists of
            the last batch of the previous array, followed by num_unrollings new ones.
            """
            batches = [self._last_batch]
            for step in range(self._num_unrollings):
                batches.append(self._next_batch())
            self._last_batch = batches[-1]
            return batches

    def characters(probabilities):
        """Turn a 1-hot encoding or a probability distribution over the possible
        characters back into its (most likely) character representation."""
        return [id2char(c) for c in np.argmax(probabilities, 1)]

    def batches2string(batches):
        """Convert a sequence of batches back into their (most likely) string
        representation."""
        s = [''] * batches[0].shape[0]
        for b in batches:
            s = [''.join(x) for x in zip(s, characters(b))]
        return s

    train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
    valid_batches = BatchGenerator(valid_text, 1, 1)

    print(batches2string(train_batches.next()))
    print(batches2string(train_batches.next()))
    print(batches2string(valid_batches.next()))
    print(batches2string(valid_batches.next()))
    '''
    ['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
    ['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
    [' a']
    ['an']
    '''
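# The cursor logic of BatchGenerator is easier to follow on a tiny input.
# This is a hedged plain-Python sketch: toy_batches is a hypothetical helper
# that mirrors the same bookkeeping but returns strings instead of one-hot
# arrays. The text is cut into batch_size equal segments, one cursor walks
# through each segment, and every call starts with the last characters of
# the previous call.

```python
def toy_batches(text, batch_size, num_unrollings, num_calls):
    """Mimic the cursor logic of BatchGenerator, returning plain strings."""
    segment = len(text) // batch_size
    cursor = [offset * segment for offset in range(batch_size)]

    def next_chars():
        # Read one character per cursor, then advance every cursor by one.
        chars = [text[c] for c in cursor]
        for b in range(batch_size):
            cursor[b] = (cursor[b] + 1) % len(text)
        return chars

    last = next_chars()
    results = []
    for _ in range(num_calls):
        steps = [last] + [next_chars() for _ in range(num_unrollings)]
        last = steps[-1]
        # Transpose: one string per row of the batch.
        results.append([''.join(column) for column in zip(*steps)])
    return results


for strings in toy_batches('abcdefghijkl', batch_size=3, num_unrollings=2, num_calls=2):
    print(strings)
```

# Each string has num_unrollings + 1 characters, and its first character
# repeats the last character of the same row from the previous call. That
# overlap is what lets the labels be the inputs shifted by one time step
# without losing a transition between calls.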
    def logprob(predictions, labels):
        """Average negative log-probability of the true labels in a predicted batch."""
        predictions[predictions < 1e-10] = 1e-10
        return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

    def sample_distribution(distribution):
        """Sample one element from a distribution assumed to be an array of
        normalized probabilities."""
        r = random.uniform(0, 1)
        s = 0
        for i in range(len(distribution)):
            s += distribution[i]
            if s >= r:
                return i
        return len(distribution) - 1

    def sample(prediction):
        """Turn a (column) prediction into a 1-hot encoded sample."""
        p = np.zeros(shape=[1, vocabulary_size], dtype=float)
        p[0, sample_distribution(prediction[0])] = 1.0
        return p

    def random_distribution():
        """Generate a random column of probabilities."""
        b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
        return b / np.sum(b, 1)[:, None]
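# sample_distribution is an instance of inverse-CDF sampling: it walks the
# cumulative sum of the probabilities until the sum reaches a uniform random
# number. The sketch below restates it in plain Python. The extra parameter
# `r` is an assumption added here so that individual draws can be made
# deterministic; the original function has no such parameter.

```python
import random


def sample_distribution(distribution, r=None):
    """Draw an index from a list of normalized probabilities."""
    if r is None:
        r = random.uniform(0, 1)
    cumulative = 0.0
    for i, p in enumerate(distribution):
        cumulative += p
        if cumulative >= r:
            return i
    return len(distribution) - 1  # guard against rounding error in the sum


probs = [0.1, 0.6, 0.3]
print(sample_distribution(probs, r=0.05))  # lands in the first bucket
print(sample_distribution(probs, r=0.50))  # lands in the second bucket
print(sample_distribution(probs, r=0.95))  # lands in the third bucket

# Over many draws the counts should be roughly proportional to probs.
random.seed(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_distribution(probs)] += 1
print(counts)
```

# Sampling from the predicted distribution, rather than always taking the
# most likely character, is what makes the generated text vary between runs.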

Simple LSTM model.

    num_nodes = 64

    graph = tf.Graph()
    with graph.as_default():
        # Parameters:
        # Note: the positional arguments of tf.truncated_normal below are
        # mean=-0.1 and stddev=0.1.
        # Input gate: input, previous output, and bias.
        ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
        im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
        ib = tf.Variable(tf.zeros([1, num_nodes]))
        # Forget gate: input, previous output, and bias.
        fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
        fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
        fb = tf.Variable(tf.zeros([1, num_nodes]))
        # Memory cell: input, previous output, and bias.
        cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
        cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
        cb = tf.Variable(tf.zeros([1, num_nodes]))
        # Output gate: input, previous output, and bias.
        ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
        om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
        ob = tf.Variable(tf.zeros([1, num_nodes]))
        # Variables saving state across unrollings.
        saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
        saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
        # Classifier weights and biases.
        w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
        b = tf.Variable(tf.zeros([vocabulary_size]))

        # Definition of the cell computation.
        def lstm_cell(i, o, state):
            """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
            Note that in this formulation, we omit the various connections between
            the previous state and the gates."""
            input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
            forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
            update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
            state = forget_gate * state + input_gate * tf.tanh(update)
            output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
            return output_gate * tf.tanh(state), state

        # Input data.
        train_data = list()
        for _ in range(num_unrollings + 1):
            train_data.append(
                tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
        train_inputs = train_data[:num_unrollings]
        train_labels = train_data[1:]  # labels are inputs shifted by one time step.

        # Unrolled LSTM loop.
        outputs = list()
        output = saved_output
        state = saved_state
        for i in train_inputs:
            output, state = lstm_cell(i, output, state)
            outputs.append(output)

        # State saving across unrollings.
        with tf.control_dependencies([saved_output.assign(output),
                                      saved_state.assign(state)]):
            # Classifier.
            # As of TensorFlow 1.0, tf.concat takes (values, axis) and the loss
            # requires keyword arguments.
            logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
            loss = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    labels=tf.concat(train_labels, 0), logits=logits))

        # Optimizer.
        global_step = tf.Variable(0)
        learning_rate = tf.train.exponential_decay(
            10.0, global_step, 5000, 0.1, staircase=True)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        gradients, v = zip(*optimizer.compute_gradients(loss))
        gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
        optimizer = optimizer.apply_gradients(
            zip(gradients, v), global_step=global_step)

        # Predictions.
        train_prediction = tf.nn.softmax(logits)

        # Sampling and validation eval: batch 1, no unrolling.
        sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
        saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
        saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
        reset_sample_state = tf.group(
            saved_sample_output.assign(tf.zeros([1, num_nodes])),
            saved_sample_state.assign(tf.zeros([1, num_nodes])))
        sample_output, sample_state = lstm_cell(
            sample_input, saved_sample_output, saved_sample_state)
        with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                      saved_sample_state.assign(sample_state)]):
            sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
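# The optimizer above clips the gradients with tf.clip_by_global_norm before
# applying them. The arithmetic can be sketched in plain Python, assuming the
# usual definition: every gradient is multiplied by
# clip_norm / max(global_norm, clip_norm), where global_norm is the L2 norm
# of all gradients taken together. The function below is an illustration on
# nested lists, not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in gradients], global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # joint norm is sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.25)
print(norm, clipped)

small = [[0.3, 0.4]]  # joint norm is about 0.5, below the threshold
print(clip_by_global_norm(small, clip_norm=1.25)[0])
```

# Because all gradients share one scale factor, the direction of the update
# is preserved; only its length is capped. Gradients whose joint norm is
# already below the threshold pass through unchanged.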
    num_steps = 7001
    summary_frequency = 100

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        mean_loss = 0
        for step in range(num_steps):
            batches = train_batches.next()
            feed_dict = dict()
            for i in range(num_unrollings + 1):
                feed_dict[train_data[i]] = batches[i]
            _, l, predictions, lr = session.run(
                [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
            mean_loss += l
            if step % summary_frequency == 0:
                if step > 0:
                    mean_loss = mean_loss / summary_frequency
                # The mean loss is an estimate of the loss over the last few batches.
                print('Average loss at step', step, ':', mean_loss, 'learning rate:', lr)
                mean_loss = 0
                labels = np.concatenate(list(batches)[1:])
                print('Minibatch perplexity: %.2f' % float(
                    np.exp(logprob(predictions, labels))))
                if step % (summary_frequency * 10) == 0:
                    # Generate some samples.
                    print('=' * 80)
                    for _ in range(5):
                        feed = sample(random_distribution())
                        sentence = characters(feed)[0]
                        reset_sample_state.run()
                        for _ in range(79):
                            prediction = sample_prediction.eval({sample_input: feed})
                            feed = sample(prediction)
                            sentence += characters(feed)[0]
                        print(sentence)
                    print('=' * 80)
                # Measure validation set perplexity.
                reset_sample_state.run()
                valid_logprob = 0
                for _ in range(valid_size):
                    b = valid_batches.next()
                    predictions = sample_prediction.eval({sample_input: b[0]})
                    valid_logprob = valid_logprob + logprob(predictions, b[1])
                print('Validation set perplexity: %.2f' % float(np.exp(
                    valid_logprob / valid_size)))
    '''
    Initialized
    Average loss at step 0 : 3.29904174805 learning rate: 10.0
    Minibatch perplexity: 27.09
    ================================================================================
    srk dwmrnuldtbbgg tapootidtu xsciu sgokeguw hi ieicjq lq piaxhazvc s fht wjcvdlh
    lhrvallvbeqqquc dxd y siqvnle bzlyw nr rwhkalezo siie o deb e lpdg storq u nx o
    meieu nantiouie gdys qiuotblci loc hbiznauiccb cqzed acw l tsm adqxplku gn oaxet
    unvaouc oxchywdsjntdh zpklaejvxitsokeerloemee htphisb th eaeqseibumh aeeyj j orw
    ogmnictpycb whtup otnilnesxaedtekiosqet liwqarysmt arj flioiibtqekycbrrgoysj
    ================================================================================
    ...
    ================================================================================
    jague are officiencinels ored by film voon higherise haik one nine on the iffirc
    oshe provision that manned treatists on smalle bodariturmeristing the girto in s
    kis would softwenn mustapultmine truativersakys bersyim by s of confound esc bub
    ry of the using one four six blain ira mannom marencies g with fextificallise re
    one son vit even an conderouss to person romer i a lebapter at obiding are iuse
    ================================================================================
    Validation set perplexity: 4.25
    '''
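The perplexity figures in the log are the exponential of the average negative log-probability that the model assigns to the true next character. A minimal plain-Python sketch of that definition follows. `perplexity` is a hypothetical helper that takes the probabilities of the true labels directly, rather than the one-hot arrays used by `logprob`.

```python
import math


def perplexity(probabilities_of_true_labels):
    """exp of the average negative log-probability assigned to the true labels."""
    clipped = [max(p, 1e-10) for p in probabilities_of_true_labels]  # avoid log(0)
    mean_neg_logprob = -sum(math.log(p) for p in clipped) / len(clipped)
    return math.exp(mean_neg_logprob)


vocabulary_size = 27  # [a-z] + ' '
uniform = [1.0 / vocabulary_size] * 100
print(round(perplexity(uniform), 2))  # approximately 27
print(perplexity([1.0] * 5))          # a perfect model has perplexity 1
```

A model that guesses uniformly among the 27 characters has a perplexity of 27, which is close to the 27.09 reported at step 0, before any training. The final validation perplexity of 4.25 can be read as the model being about as uncertain as a uniform choice among roughly four characters.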

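Problem 1

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.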
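The idea behind this simplification can be checked on small matrices without TensorFlow: multiplying by four weight matrices placed side by side, then cutting the result into four column blocks, gives the same numbers as four separate products. The sketch below is only a plain-Python illustration. `matmul`, `hconcat`, `hsplit`, and the 2x2 weights are hypothetical, and the TensorFlow version is left as the exercise.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(matrices):
    """Concatenate matrices with equal row counts side by side."""
    return [sum(rows, []) for rows in zip(*matrices)]


def hsplit(matrix, parts):
    """Split a matrix into `parts` equally wide column blocks."""
    width = len(matrix[0]) // parts
    return [[row[k * width:(k + 1) * width] for row in matrix] for k in range(parts)]


x = [[1.0, 2.0]]  # one input row with 2 features
# Hypothetical 2x2 weights for the input, forget, update, and output paths.
ix = [[1.0, 0.0], [0.0, 1.0]]
fx = [[2.0, 0.0], [0.0, 2.0]]
cx = [[0.0, 1.0], [1.0, 0.0]]
ox = [[1.0, 1.0], [1.0, 1.0]]

separate = [matmul(x, w) for w in (ix, fx, cx, ox)]
fused = hsplit(matmul(x, hconcat([ix, fx, cx, ox])), parts=4)
print(fused == separate)
```

The fused form does the same arithmetic in one larger product, which is usually friendlier to the hardware than four small ones. The same reasoning applies to the four multiplications with the previous output.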

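Problem 2

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a) Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b) Write a bigram-based LSTM, modeled on the character LSTM above.

c) Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this article.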
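To get a feel for the sizes involved, here is a hedged plain-Python sketch of a bigram vocabulary and an embedding lookup. With 27 characters there are 27 * 27 = 729 bigrams, so a 1-hot input would have 729 entries with a single 1, whereas an embedding lookup simply selects one short row of a table. All names below, the embedding size of 4, and the choice of non-overlapping bigrams are assumptions made for illustration.

```python
import random
import string

ALPHABET = ' ' + string.ascii_lowercase  # ID 0 is the space, as in char2id
NUM_CHARS = len(ALPHABET)                # 27
NUM_BIGRAMS = NUM_CHARS * NUM_CHARS      # 729


def bigram2id(bigram):
    """Map a two-character string to an ID in [0, NUM_BIGRAMS)."""
    first, second = bigram
    return ALPHABET.index(first) * NUM_CHARS + ALPHABET.index(second)


def id2bigram(bigram_id):
    """Inverse of bigram2id."""
    first, second = divmod(bigram_id, NUM_CHARS)
    return ALPHABET[first] + ALPHABET[second]


EMBEDDING_SIZE = 4  # hypothetical, tiny for illustration
rng = random.Random(0)
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_SIZE)]
              for _ in range(NUM_BIGRAMS)]


def embed(text):
    """Look up the embedding row of each non-overlapping bigram in text."""
    ids = [bigram2id(text[i:i + 2]) for i in range(0, len(text) - 1, 2)]
    return [embeddings[i] for i in ids]


print(NUM_BIGRAMS, bigram2id('ab'), id2bigram(bigram2id('zz')))
print(len(embed('abcd')), len(embed('abcd')[0]))
```

In a trained model the table would be a learned variable rather than random numbers, and the lookup would replace the multiplication of a 1-hot vector by the input weights.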

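Problem 3

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox

the model should attempt to output:

    eht kciuq nworb xof

Reference: http://arxiv.org/abs/1409.3215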
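Training such a model needs (input, target) pairs, and the target is easy to compute directly. The sketch below shows one way to generate it. `mirror_words` is a hypothetical helper name, and how the pairs are batched and fed to the network is left open.

```python
def mirror_words(sentence):
    """Reverse the characters of each word while keeping the word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))


source = 'the quick brown fox'
target = mirror_words(source)
print(source, '->', target)
```

Splitting on a single space keeps runs of spaces intact, so the target always has the same length as the input.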