In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures.

In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a forward and a backward function. The forward function will receive inputs, weights, and other parameters and will return both an output and a cache object storing data needed for the backward pass, like this:

    def layer_forward(x, w):
        """ Receive inputs x and weights w """
        # Do some computations ...
        z = # ... some intermediate value
        # Do some more computations ...
        out = # the output

        cache = (x, w, z, out)  # Values we need to compute gradients

        return out, cache

The backward pass will receive upstream derivatives and the cache object, and will return gradients with respect to the inputs and weights, like this:

    def layer_backward(dout, cache):
        """
        Receive dout (derivative of loss with respect to outputs) and cache,
        and compute derivative with respect to inputs.
        """
        # Unpack cache values
        x, w, z, out = cache

        # Use values in cache to compute derivatives
        dx = # Derivative of loss with respect to x
        dw = # Derivative of loss with respect to w

        return dx, dw

After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.

In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce Dropout as a regularizer and Batch/Layer Normalization as a tool to more efficiently optimize deep networks.

    # As usual, a bit of setup
    from __future__ import print_function
    import time
    import numpy as np
    import matplotlib.pyplot as plt
    from cs231n.classifiers.fc_net import *
    from cs231n.data_utils import get_CIFAR10_data
    from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
    from cs231n.solver import Solver

    %matplotlib inline
    plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    # for auto-reloading external modules
    # see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
    %load_ext autoreload
    %autoreload 2

    def rel_error(x, y):
        """ returns relative error """
        return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
If the setup cell complains about a missing compiled extension, run the following from the cs231n directory and try again:

    python setup.py build_ext --inplace

You may also need to restart your iPython kernel.
    # Load the (preprocessed) CIFAR10 data.
    data = get_CIFAR10_data()
    for k, v in list(data.items()):
        print(('%s: ' % k, v.shape))
  1. ('X_train: ', (49000, 3, 32, 32))
  2. ('y_train: ', (49000,))
  3. ('X_val: ', (1000, 3, 32, 32))
  4. ('y_val: ', (1000,))
  5. ('X_test: ', (1000, 3, 32, 32))
  6. ('y_test: ', (1000,))

Affine layer: forward

Open the file cs231n/layers.py and implement the affine_forward function.

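For orientation, here is a minimal sketch of what an affine forward pass can look like (flatten each example to a row, then compute x·w + b); treat it as one possible implementation rather than the reference solution:

    def affine_forward(x, w, b):
        """
        Inputs:
        - x: array of shape (N, d_1, ..., d_k); each example is flattened to a row
        - w: weights of shape (D, M), where D = d_1 * ... * d_k
        - b: biases of shape (M,)

        Returns out of shape (N, M) and a cache for the backward pass.
        """
        N = x.shape[0]
        x_rows = x.reshape(N, -1)   # (N, D)
        out = x_rows.dot(w) + b     # (N, M); b broadcasts over rows
        cache = (x, w, b)
        return out, cache
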
Once you are done you can test your implementation by running the following:

    # Test the affine_forward function
    num_inputs = 2
    input_shape = (4, 5, 6)
    output_dim = 3

    input_size = num_inputs * np.prod(input_shape)
    weight_size = output_dim * np.prod(input_shape)

    x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)
    w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)
    b = np.linspace(-0.3, 0.1, num=output_dim)

    out, _ = affine_forward(x, w, b)
    correct_out = np.array([[ 1.49834967,  1.70660132,  1.91485297],
                            [ 3.25553199,  3.5141327,   3.77273342]])

    # Compare your output with ours. The error should be around e-9 or less.
    print('Testing affine_forward function:')
    print('difference: ', rel_error(out, correct_out))
  1. Testing affine_forward function:
  2. difference: 9.769849468192957e-10

Affine layer: backward

Now implement the affine_backward function and test your implementation using numeric gradient checking.

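A matching sketch of the backward pass, under the same assumptions as the forward sketch above, simply reverses the reshape and the matrix multiply:

    def affine_backward(dout, cache):
        """
        Inputs:
        - dout: upstream derivatives of shape (N, M)
        - cache: (x, w, b) stored by affine_forward

        Returns dx (shape of x), dw (shape of w), db (shape of b).
        """
        x, w, b = cache
        N = x.shape[0]
        x_rows = x.reshape(N, -1)             # (N, D)
        dx = dout.dot(w.T).reshape(x.shape)   # back to the original input shape
        dw = x_rows.T.dot(dout)               # (D, M)
        db = dout.sum(axis=0)                 # (M,)
        return dx, dw, db
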
    # Test the affine_backward function
    np.random.seed(231)
    x = np.random.randn(10, 2, 3)
    w = np.random.randn(6, 5)
    b = np.random.randn(5)
    dout = np.random.randn(10, 5)

    dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)
    dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)
    db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)

    _, cache = affine_forward(x, w, b)
    dx, dw, db = affine_backward(dout, cache)

    # The error should be around e-10 or less
    print('Testing affine_backward function:')
    print('dx error: ', rel_error(dx_num, dx))
    print('dw error: ', rel_error(dw_num, dw))
    print('db error: ', rel_error(db_num, db))
  1. Testing affine_backward function:
  2. dx error: 5.399100368651805e-11
  3. dw error: 9.904211865398145e-11
  4. db error: 2.4122867568119087e-11

ReLU activation: forward

Implement the forward pass for the ReLU activation function in the relu_forward function and test your implementation using the following:

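A ReLU forward pass can be as short as the sketch below (elementwise maximum with zero, caching the input for the backward pass); your version may differ in details:

    def relu_forward(x):
        """ Elementwise ReLU: out = max(0, x). """
        out = np.maximum(0, x)
        cache = x
        return out, cache
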
    # Test the relu_forward function
    x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)

    out, _ = relu_forward(x)
    correct_out = np.array([[ 0.,          0.,          0.,          0.,        ],
                            [ 0.,          0.,          0.04545455,  0.13636364,],
                            [ 0.22727273,  0.31818182,  0.40909091,  0.5,       ]])

    # Compare your output with ours. The error should be on the order of e-8
    print('Testing relu_forward function:')
    print('difference: ', rel_error(out, correct_out))
  1. Testing relu_forward function:
  2. difference: 4.999999798022158e-08

ReLU activation: backward

Now implement the backward pass for the ReLU activation function in the relu_backward function and test your implementation using numeric gradient checking:

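The corresponding backward sketch masks the upstream gradient wherever the cached input was not positive:

    def relu_backward(dout, cache):
        """ Pass the gradient through only where the input was positive. """
        x = cache
        dx = dout * (x > 0)
        return dx
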
    np.random.seed(231)
    x = np.random.randn(10, 10)
    dout = np.random.randn(*x.shape)

    dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)

    _, cache = relu_forward(x)
    dx = relu_backward(dout, cache)

    # The error should be on the order of e-12
    print('Testing relu_backward function:')
    print('dx error: ', rel_error(dx_num, dx))
  1. Testing relu_backward function:
  2. dx error: 3.2756349136310288e-12

Inline Question 1:

We’ve only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour?

  1. Sigmoid
  2. ReLU
  3. Leaky ReLU

Answer:

a. Sigmoid
b. Inputs with large magnitude (far into the positive or negative tail) push the sigmoid into its saturated regions, where its local gradient becomes nearly zero, so almost no gradient flows back through the unit.
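
A quick numeric check of this claim (plain NumPy, not part of the assignment code): the sigmoid's local gradient sigma(x)(1 - sigma(x)) is essentially zero once |x| is large.

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    sig = 1.0 / (1.0 + np.exp(-x))
    grad = sig * (1 - sig)   # local gradient of the sigmoid
    print(grad)              # ~4.5e-05 at x = +/-10, 0.25 at x = 0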

“Sandwich” layers

There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file cs231n/layer_utils.py.

For now take a look at the affine_relu_forward and affine_relu_backward functions, and run the following to numerically gradient check the backward pass:

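The convenience layers are just function composition; roughly, affine_relu_forward and affine_relu_backward in cs231n/layer_utils.py look like this:

    def affine_relu_forward(x, w, b):
        """ Affine transform followed by a ReLU. """
        a, fc_cache = affine_forward(x, w, b)
        out, relu_cache = relu_forward(a)
        cache = (fc_cache, relu_cache)
        return out, cache

    def affine_relu_backward(dout, cache):
        """ Backward pass for the affine-relu convenience layer. """
        fc_cache, relu_cache = cache
        da = relu_backward(dout, relu_cache)
        dx, dw, db = affine_backward(da, fc_cache)
        return dx, dw, db
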
    from cs231n.layer_utils import affine_relu_forward, affine_relu_backward
    np.random.seed(231)
    x = np.random.randn(2, 3, 4)
    w = np.random.randn(12, 10)
    b = np.random.randn(10)
    dout = np.random.randn(2, 10)

    out, cache = affine_relu_forward(x, w, b)
    dx, dw, db = affine_relu_backward(dout, cache)

    dx_num = eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)
    dw_num = eval_numerical_gradient_array(lambda w: affine_relu_forward(x, w, b)[0], w, dout)
    db_num = eval_numerical_gradient_array(lambda b: affine_relu_forward(x, w, b)[0], b, dout)

    # Relative error should be around e-10 or less
    print('Testing affine_relu_forward and affine_relu_backward:')
    print('dx error: ', rel_error(dx_num, dx))
    print('dw error: ', rel_error(dw_num, dw))
    print('db error: ', rel_error(db_num, db))
  1. Testing affine_relu_forward and affine_relu_backward:
  2. dx error: 2.299579177309368e-11
  3. dw error: 8.162011105764925e-11
  4. db error: 7.826724021458994e-12

Loss layers: Softmax and SVM

You implemented these loss functions in the last assignment, so we’ll give them to you for free here. You should still make sure you understand how they work by looking at the implementations in cs231n/layers.py.

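As a reference, a numerically stable softmax loss can be written as in the sketch below (scores are shifted before exponentiating); the provided implementation may be organized differently:

    def softmax_loss(x, y):
        """
        x: scores of shape (N, C); y: integer labels of shape (N,).
        Returns the average cross-entropy loss and the gradient dx.
        """
        N = x.shape[0]
        shifted = x - x.max(axis=1, keepdims=True)   # for numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        probs = np.exp(log_probs)
        loss = -log_probs[np.arange(N), y].mean()
        dx = probs.copy()
        dx[np.arange(N), y] -= 1
        dx /= N
        return loss, dx
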
You can make sure that the implementations are correct by running the following:

    np.random.seed(231)
    num_classes, num_inputs = 10, 50
    x = 0.001 * np.random.randn(num_inputs, num_classes)
    y = np.random.randint(num_classes, size=num_inputs)

    dx_num = eval_numerical_gradient(lambda x: svm_loss(x, y)[0], x, verbose=False)
    loss, dx = svm_loss(x, y)

    # Test svm_loss function. Loss should be around 9 and dx error should be around the order of e-9
    print('Testing svm_loss:')
    print('loss: ', loss)
    print('dx error: ', rel_error(dx_num, dx))

    dx_num = eval_numerical_gradient(lambda x: softmax_loss(x, y)[0], x, verbose=False)
    loss, dx = softmax_loss(x, y)

    # Test softmax_loss function. Loss should be close to 2.3 and dx error should be around e-8
    print('\nTesting softmax_loss:')
    print('loss: ', loss)
    print('dx error: ', rel_error(dx_num, dx))
  1. Testing svm_loss:
  2. loss: 8.999602749096233
  3. dx error: 1.4021566006651672e-09
  4. Testing softmax_loss:
  5. loss: 2.302545844500738
  6. dx error: 9.384673161989355e-09

Two-layer network

In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two-layer network using these modular implementations.

Open the file cs231n/classifiers/fc_net.py and complete the implementation of the TwoLayerNet class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation.

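Conceptually, TwoLayerNet.loss chains the modular layers together. The sketch below shows only the wiring (the helper name two_layer_loss and the explicit params/reg arguments are illustrative; the real class stores these on self):

    def two_layer_loss(params, X, y, reg):
        """ Sketch of affine - relu - affine - softmax with L2 regularization. """
        W1, b1, W2, b2 = params['W1'], params['b1'], params['W2'], params['b2']

        h, cache1 = affine_relu_forward(X, W1, b1)
        scores, cache2 = affine_forward(h, W2, b2)
        if y is None:
            return scores

        loss, dscores = softmax_loss(scores, y)
        loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

        dh, dW2, db2 = affine_backward(dscores, cache2)
        dX, dW1, db1 = affine_relu_backward(dh, cache1)
        grads = {'W1': dW1 + reg * W1, 'b1': db1,
                 'W2': dW2 + reg * W2, 'b2': db2}
        return loss, grads
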
    np.random.seed(231)
    N, D, H, C = 3, 5, 50, 7
    X = np.random.randn(N, D)
    y = np.random.randint(C, size=N)

    std = 1e-3
    model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=std)

    print('Testing initialization ... ')
    W1_std = abs(model.params['W1'].std() - std)
    b1 = model.params['b1']
    W2_std = abs(model.params['W2'].std() - std)
    b2 = model.params['b2']
    assert W1_std < std / 10, 'First layer weights do not seem right'
    assert np.all(b1 == 0), 'First layer biases do not seem right'
    assert W2_std < std / 10, 'Second layer weights do not seem right'
    assert np.all(b2 == 0), 'Second layer biases do not seem right'

    print('Testing test-time forward pass ... ')
    model.params['W1'] = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)
    model.params['b1'] = np.linspace(-0.1, 0.9, num=H)
    model.params['W2'] = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)
    model.params['b2'] = np.linspace(-0.9, 0.1, num=C)
    X = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T
    scores = model.loss(X)
    correct_scores = np.asarray(
      [[11.53165108,  12.2917344,   13.05181771,  13.81190102,  14.57198434,  15.33206765,  16.09215096],
       [12.05769098,  12.74614105,  13.43459113,  14.1230412,   14.81149128,  15.49994135,  16.18839143],
       [12.58373087,  13.20054771,  13.81736455,  14.43418138,  15.05099822,  15.66781506,  16.2846319 ]])
    scores_diff = np.abs(scores - correct_scores).sum()
    assert scores_diff < 1e-6, 'Problem with test-time forward pass'

    print('Testing training loss (no regularization)')
    y = np.asarray([0, 5, 1])
    loss, grads = model.loss(X, y)
    correct_loss = 3.4702243556
    assert abs(loss - correct_loss) < 1e-10, 'Problem with training-time loss'

    model.reg = 1.0
    loss, grads = model.loss(X, y)
    correct_loss = 26.5948426952
    assert abs(loss - correct_loss) < 1e-10, 'Problem with regularization loss'

    # Errors should be around e-7 or less
    for reg in [0.0, 0.7]:
        print('Running numeric gradient check with reg = ', reg)
        model.reg = reg
        loss, grads = model.loss(X, y)

        for name in sorted(grads):
            f = lambda _: model.loss(X, y)[0]
            grad_num = eval_numerical_gradient(f, model.params[name], verbose=False)
            print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
  1. Testing initialization ...
  2. Testing test-time forward pass ...
  3. Testing training loss (no regularization)
  4. Running numeric gradient check with reg = 0.0
  5. W1 relative error: 1.83e-08
  6. W2 relative error: 3.12e-10
  7. b1 relative error: 9.83e-09
  8. b2 relative error: 4.33e-10
  9. Running numeric gradient check with reg = 0.7
  10. W1 relative error: 2.53e-07
  11. W2 relative error: 2.85e-08
  12. b1 relative error: 1.56e-08
  13. b2 relative error: 7.76e-10

Solver

In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.

Open the file cs231n/solver.py and read through it to familiarize yourself with the API. After doing so, use a Solver instance to train a TwoLayerNet that achieves at least 50% accuracy on the validation set.

    model = TwoLayerNet()
    solver = None

    ##############################################################################
    # TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #
    # 50% accuracy on the validation set.                                        #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    lrs = np.linspace(1e-6, 1e-2, 8)
    solver = Solver(model, data,
                    update_rule='sgd',
                    optim_config={'learning_rate': 1e-3},
                    lr_decay=0.95,
                    num_epochs=8, batch_size=100,
                    print_every=100)
    for lr in lrs:
        solver.optim_config['learning_rate'] = lr
        solver.train()

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                             END OF YOUR CODE                               #
    ##############################################################################
  1. (Iteration 1 / 3920) loss: 2.306114
  2. (Epoch 0 / 8) train acc: 0.135000; val_acc: 0.132000
  3. (Iteration 101 / 3920) loss: 1.854126
  4. (Iteration 201 / 3920) loss: 1.851763
  5. (Iteration 301 / 3920) loss: 1.558956
  6. (Iteration 401 / 3920) loss: 1.486990
  7. (Epoch 1 / 8) train acc: 0.405000; val_acc: 0.411000
  8. (Iteration 501 / 3920) loss: 1.460269
  9. (Iteration 601 / 3920) loss: 1.411556
  10. (Iteration 701 / 3920) loss: 1.595890
  11. (Iteration 801 / 3920) loss: 1.524085
  12. (Iteration 901 / 3920) loss: 1.226461
  13. (Epoch 2 / 8) train acc: 0.518000; val_acc: 0.468000
  14. (Iteration 1001 / 3920) loss: 1.603831
  15. (Iteration 1101 / 3920) loss: 1.573270
  16. (Iteration 1201 / 3920) loss: 1.215742
  17. (Iteration 1301 / 3920) loss: 1.298395
  18. (Iteration 1401 / 3920) loss: 1.114452
  19. (Epoch 3 / 8) train acc: 0.496000; val_acc: 0.463000
  20. (Iteration 1501 / 3920) loss: 1.408867
  21. (Iteration 1601 / 3920) loss: 1.205041
  22. (Iteration 1701 / 3920) loss: 1.304048
  23. (Iteration 1801 / 3920) loss: 1.367574
  24. (Iteration 1901 / 3920) loss: 1.328690
  25. (Epoch 4 / 8) train acc: 0.541000; val_acc: 0.516000
  26. (Iteration 2001 / 3920) loss: 1.364002
  27. (Iteration 2101 / 3920) loss: 1.263270
  28. (Iteration 2201 / 3920) loss: 1.406228
  29. (Iteration 2301 / 3920) loss: 1.257448
  30. (Iteration 2401 / 3920) loss: 1.387278
  31. (Epoch 5 / 8) train acc: 0.557000; val_acc: 0.503000
  32. (Iteration 2501 / 3920) loss: 1.292922
  33. (Iteration 2601 / 3920) loss: 1.239152
  34. (Iteration 2701 / 3920) loss: 1.184182
  35. (Iteration 2801 / 3920) loss: 1.409515
  36. (Iteration 2901 / 3920) loss: 1.401696
  37. (Epoch 6 / 8) train acc: 0.575000; val_acc: 0.514000
  38. (Iteration 3001 / 3920) loss: 1.075956
  39. (Iteration 3101 / 3920) loss: 1.216456
  40. (Iteration 3201 / 3920) loss: 1.179340
  41. (Iteration 3301 / 3920) loss: 1.210690
  42. (Iteration 3401 / 3920) loss: 1.324619
  43. (Epoch 7 / 8) train acc: 0.594000; val_acc: 0.514000
  44. (Iteration 3501 / 3920) loss: 1.140519
  45. (Iteration 3601 / 3920) loss: 1.198608
  46. (Iteration 3701 / 3920) loss: 1.366061
  47. (Iteration 3801 / 3920) loss: 1.160887
  48. (Iteration 3901 / 3920) loss: 1.180904
  49. (Epoch 8 / 8) train acc: 0.589000; val_acc: 0.494000
  50. (Iteration 1 / 3920) loss: 1.213504
  51. (Epoch 8 / 8) train acc: 0.542000; val_acc: 0.528000
  52. (Iteration 101 / 3920) loss: 1.304551
  53. (Iteration 201 / 3920) loss: 1.321262
  54. (Iteration 301 / 3920) loss: 1.103654
  55. (Iteration 401 / 3920) loss: 1.104092
  56. (Epoch 9 / 8) train acc: 0.577000; val_acc: 0.517000
  57. (Iteration 501 / 3920) loss: 1.059394
  58. (Iteration 601 / 3920) loss: 1.501449
  59. (Iteration 701 / 3920) loss: 1.020301
  60. (Iteration 801 / 3920) loss: 1.098940
  61. (Iteration 901 / 3920) loss: 1.140753
  62. (Epoch 10 / 8) train acc: 0.606000; val_acc: 0.501000
  63. (Iteration 1001 / 3920) loss: 1.257351
  64. (Iteration 1101 / 3920) loss: 1.223436
  65. (Iteration 1201 / 3920) loss: 1.025524
  66. (Iteration 1301 / 3920) loss: 1.236936
  67. (Iteration 1401 / 3920) loss: 1.335805
  68. (Epoch 11 / 8) train acc: 0.603000; val_acc: 0.530000
  69. (Iteration 1501 / 3920) loss: 1.085880
  70. (Iteration 1601 / 3920) loss: 1.087423
  71. (Iteration 1701 / 3920) loss: 1.122185
  72. (Iteration 1801 / 3920) loss: 1.141183
  73. (Iteration 1901 / 3920) loss: 1.214175
  74. (Epoch 12 / 8) train acc: 0.586000; val_acc: 0.516000
  75. (Iteration 2001 / 3920) loss: 1.192277
  76. (Iteration 2101 / 3920) loss: 1.228408
  77. (Iteration 2201 / 3920) loss: 1.101908
  78. (Iteration 2301 / 3920) loss: 1.249030
  79. (Iteration 2401 / 3920) loss: 1.371962
  80. (Epoch 13 / 8) train acc: 0.617000; val_acc: 0.515000
  81. (Iteration 2501 / 3920) loss: 1.084948
  82. (Iteration 2601 / 3920) loss: 1.069335
  83. (Iteration 2701 / 3920) loss: 1.073005
  84. (Iteration 2801 / 3920) loss: 1.306515
  85. (Iteration 2901 / 3920) loss: 1.146512
  86. (Epoch 14 / 8) train acc: 0.614000; val_acc: 0.530000
  87. (Iteration 3001 / 3920) loss: 1.059701
  88. (Iteration 3101 / 3920) loss: 0.943784
  89. (Iteration 3201 / 3920) loss: 0.933784
  90. (Iteration 3301 / 3920) loss: 1.177530
  91. (Iteration 3401 / 3920) loss: 1.154607
  92. (Epoch 15 / 8) train acc: 0.642000; val_acc: 0.507000
  93. (Iteration 3501 / 3920) loss: 1.120586
  94. (Iteration 3601 / 3920) loss: 1.299895
  95. (Iteration 3701 / 3920) loss: 1.179488
  96. (Iteration 3801 / 3920) loss: 1.160027
  97. (Iteration 3901 / 3920) loss: 0.992143
  98. (Epoch 16 / 8) train acc: 0.634000; val_acc: 0.520000
  99. (Iteration 1 / 3920) loss: 1.173236
  100. (Epoch 16 / 8) train acc: 0.606000; val_acc: 0.539000
  101. (Iteration 101 / 3920) loss: 1.071621
  102. (Iteration 201 / 3920) loss: 1.135191
  103. (Iteration 301 / 3920) loss: 1.157115
  104. (Iteration 401 / 3920) loss: 1.188208
  105. (Epoch 17 / 8) train acc: 0.645000; val_acc: 0.542000
  106. (Iteration 501 / 3920) loss: 1.027983
  107. (Iteration 601 / 3920) loss: 1.071079
  108. (Iteration 701 / 3920) loss: 0.776308
  109. (Iteration 801 / 3920) loss: 1.178619
  110. (Iteration 901 / 3920) loss: 1.101992
  111. (Epoch 18 / 8) train acc: 0.575000; val_acc: 0.516000
  112. (Iteration 1001 / 3920) loss: 1.229142
  113. (Iteration 1101 / 3920) loss: 1.169607
  114. (Iteration 1201 / 3920) loss: 1.074905
  115. (Iteration 1301 / 3920) loss: 1.094570
  116. (Iteration 1401 / 3920) loss: 1.323550
  117. (Epoch 19 / 8) train acc: 0.627000; val_acc: 0.527000
  118. (Iteration 1501 / 3920) loss: 1.096040
  119. (Iteration 1601 / 3920) loss: 1.134923
  120. (Iteration 1701 / 3920) loss: 0.980066
  121. (Iteration 1801 / 3920) loss: 1.282441
  122. (Iteration 1901 / 3920) loss: 1.158977
  123. (Epoch 20 / 8) train acc: 0.654000; val_acc: 0.529000
  124. (Iteration 2001 / 3920) loss: 1.106014
  125. (Iteration 2101 / 3920) loss: 1.045532
  126. (Iteration 2201 / 3920) loss: 1.079075
  127. (Iteration 2301 / 3920) loss: 1.175802
  128. (Iteration 2401 / 3920) loss: 1.010878
  129. (Epoch 21 / 8) train acc: 0.656000; val_acc: 0.524000
  130. (Iteration 2501 / 3920) loss: 0.912243
  131. (Iteration 2601 / 3920) loss: 1.124658
  132. (Iteration 2701 / 3920) loss: 1.019369
  133. (Iteration 2801 / 3920) loss: 0.986130
  134. (Iteration 2901 / 3920) loss: 0.986593
  135. (Epoch 22 / 8) train acc: 0.657000; val_acc: 0.522000
  136. (Iteration 3001 / 3920) loss: 1.019763
  137. (Iteration 3101 / 3920) loss: 0.848278
  138. (Iteration 3201 / 3920) loss: 1.035108
  139. (Iteration 3301 / 3920) loss: 1.027310
  140. (Iteration 3401 / 3920) loss: 0.889497
  141. (Epoch 23 / 8) train acc: 0.664000; val_acc: 0.527000
  142. (Iteration 3501 / 3920) loss: 0.952756
  143. (Iteration 3601 / 3920) loss: 0.706521
  144. (Iteration 3701 / 3920) loss: 1.074842
  145. (Iteration 3801 / 3920) loss: 1.284972
  146. (Iteration 3901 / 3920) loss: 1.129590
  147. (Epoch 24 / 8) train acc: 0.651000; val_acc: 0.546000
  148. (Iteration 1 / 3920) loss: 1.119762
  149. (Epoch 24 / 8) train acc: 0.643000; val_acc: 0.540000
  150. (Iteration 101 / 3920) loss: 1.043755
  151. (Iteration 201 / 3920) loss: 1.003167
  152. (Iteration 301 / 3920) loss: 0.999443
  153. (Iteration 401 / 3920) loss: 1.155654
  154. (Epoch 25 / 8) train acc: 0.654000; val_acc: 0.530000
  155. (Iteration 501 / 3920) loss: 0.780462
  156. (Iteration 601 / 3920) loss: 1.008519
  157. (Iteration 701 / 3920) loss: 0.937963
  158. (Iteration 801 / 3920) loss: 1.158664
  159. (Iteration 901 / 3920) loss: 1.168604
  160. (Epoch 26 / 8) train acc: 0.685000; val_acc: 0.545000
  161. (Iteration 1001 / 3920) loss: 1.139831
  162. (Iteration 1101 / 3920) loss: 0.884459
  163. (Iteration 1201 / 3920) loss: 0.934558
  164. (Iteration 1301 / 3920) loss: 0.972650
  165. (Iteration 1401 / 3920) loss: 0.946480
  166. (Epoch 27 / 8) train acc: 0.672000; val_acc: 0.531000
  167. (Iteration 1501 / 3920) loss: 0.806164
  168. (Iteration 1601 / 3920) loss: 0.869970
  169. (Iteration 1701 / 3920) loss: 0.877202
  170. (Iteration 1801 / 3920) loss: 0.859036
  171. (Iteration 1901 / 3920) loss: 0.829787
  172. (Epoch 28 / 8) train acc: 0.682000; val_acc: 0.538000
  173. (Iteration 2001 / 3920) loss: 0.888814
  174. (Iteration 2101 / 3920) loss: 0.978453
  175. (Iteration 2201 / 3920) loss: 0.936475
  176. (Iteration 2301 / 3920) loss: 0.711255
  177. (Iteration 2401 / 3920) loss: 0.983541
  178. (Epoch 29 / 8) train acc: 0.707000; val_acc: 0.532000
  179. (Iteration 2501 / 3920) loss: 1.164406
  180. (Iteration 2601 / 3920) loss: 0.810918
  181. (Iteration 2701 / 3920) loss: 0.941060
  182. (Iteration 2801 / 3920) loss: 0.955967
  183. (Iteration 2901 / 3920) loss: 0.882876
  184. (Epoch 30 / 8) train acc: 0.691000; val_acc: 0.527000
  185. (Iteration 3001 / 3920) loss: 0.831312
  186. (Iteration 3101 / 3920) loss: 0.855202
  187. (Iteration 3201 / 3920) loss: 0.888598
  188. (Iteration 3301 / 3920) loss: 0.757917
  189. (Iteration 3401 / 3920) loss: 0.823209
  190. (Epoch 31 / 8) train acc: 0.689000; val_acc: 0.527000
  191. (Iteration 3501 / 3920) loss: 0.747915
  192. (Iteration 3601 / 3920) loss: 0.829295
  193. (Iteration 3701 / 3920) loss: 0.822353
  194. (Iteration 3801 / 3920) loss: 0.800108
  195. (Iteration 3901 / 3920) loss: 0.934793
  196. (Epoch 32 / 8) train acc: 0.688000; val_acc: 0.509000
  197. (Iteration 1 / 3920) loss: 0.851312
  198. (Epoch 32 / 8) train acc: 0.714000; val_acc: 0.518000
  199. (Iteration 101 / 3920) loss: 0.791078
  200. (Iteration 201 / 3920) loss: 0.895301
  201. (Iteration 301 / 3920) loss: 0.965742
  202. (Iteration 401 / 3920) loss: 0.619245
  203. (Epoch 33 / 8) train acc: 0.690000; val_acc: 0.512000
  204. (Iteration 501 / 3920) loss: 0.773794
  205. (Iteration 601 / 3920) loss: 0.905245
  206. (Iteration 701 / 3920) loss: 0.827477
  207. (Iteration 801 / 3920) loss: 0.780678
  208. (Iteration 901 / 3920) loss: 0.705739
  209. (Epoch 34 / 8) train acc: 0.693000; val_acc: 0.539000
  210. (Iteration 1001 / 3920) loss: 0.900693
  211. (Iteration 1101 / 3920) loss: 1.042272
  212. (Iteration 1201 / 3920) loss: 0.930219
  213. (Iteration 1301 / 3920) loss: 0.846260
  214. (Iteration 1401 / 3920) loss: 0.833534
  215. (Epoch 35 / 8) train acc: 0.713000; val_acc: 0.508000
  216. (Iteration 1501 / 3920) loss: 0.589478
  217. (Iteration 1601 / 3920) loss: 0.883821
  218. (Iteration 1701 / 3920) loss: 0.832119
  219. (Iteration 1801 / 3920) loss: 0.787544
  220. (Iteration 1901 / 3920) loss: 0.815191
  221. (Epoch 36 / 8) train acc: 0.763000; val_acc: 0.533000
  222. (Iteration 2001 / 3920) loss: 0.937591
  223. (Iteration 2101 / 3920) loss: 0.716356
  224. (Iteration 2201 / 3920) loss: 0.862611
  225. (Iteration 2301 / 3920) loss: 1.027764
  226. (Iteration 2401 / 3920) loss: 0.863654
  227. (Epoch 37 / 8) train acc: 0.719000; val_acc: 0.520000
  228. (Iteration 2501 / 3920) loss: 1.029389
  229. (Iteration 2601 / 3920) loss: 0.598151
  230. (Iteration 2701 / 3920) loss: 0.691719
  231. (Iteration 2801 / 3920) loss: 0.831859
  232. (Iteration 2901 / 3920) loss: 0.838723
  233. (Epoch 38 / 8) train acc: 0.725000; val_acc: 0.527000
  234. (Iteration 3001 / 3920) loss: 0.868424
  235. (Iteration 3101 / 3920) loss: 0.838250
  236. (Iteration 3201 / 3920) loss: 0.823848
  237. (Iteration 3301 / 3920) loss: 0.726566
  238. (Iteration 3401 / 3920) loss: 0.705295
  239. (Epoch 39 / 8) train acc: 0.735000; val_acc: 0.512000
  240. (Iteration 3501 / 3920) loss: 0.743577
  241. (Iteration 3601 / 3920) loss: 0.792176
  242. (Iteration 3701 / 3920) loss: 0.697288
  243. (Iteration 3801 / 3920) loss: 0.675618
  244. (Iteration 3901 / 3920) loss: 0.865431
  245. (Epoch 40 / 8) train acc: 0.720000; val_acc: 0.517000
  246. (Iteration 1 / 3920) loss: 0.821200
  247. (Epoch 40 / 8) train acc: 0.739000; val_acc: 0.521000
  248. (Iteration 101 / 3920) loss: 0.897865
  249. (Iteration 201 / 3920) loss: 0.785206
  250. (Iteration 301 / 3920) loss: 0.708041
  251. (Iteration 401 / 3920) loss: 0.601831
  252. (Epoch 41 / 8) train acc: 0.741000; val_acc: 0.522000
  253. (Iteration 501 / 3920) loss: 0.682592
  254. (Iteration 601 / 3920) loss: 0.872770
  255. (Iteration 701 / 3920) loss: 0.804489
  256. (Iteration 801 / 3920) loss: 0.791230
  257. (Iteration 901 / 3920) loss: 0.701115
  258. (Epoch 42 / 8) train acc: 0.741000; val_acc: 0.511000
  259. (Iteration 1001 / 3920) loss: 0.668302
  260. (Iteration 1101 / 3920) loss: 0.845355
  261. (Iteration 1201 / 3920) loss: 0.754407
  262. (Iteration 1301 / 3920) loss: 0.681011
  263. (Iteration 1401 / 3920) loss: 0.813246
  264. (Epoch 43 / 8) train acc: 0.755000; val_acc: 0.525000
  265. (Iteration 1501 / 3920) loss: 0.730083
  266. (Iteration 1601 / 3920) loss: 0.526022
  267. (Iteration 1701 / 3920) loss: 0.905424
  268. (Iteration 1801 / 3920) loss: 0.794723
  269. (Iteration 1901 / 3920) loss: 0.666263
  270. (Epoch 44 / 8) train acc: 0.738000; val_acc: 0.509000
  271. (Iteration 2001 / 3920) loss: 0.784094
  272. (Iteration 2101 / 3920) loss: 0.953461
  273. (Iteration 2201 / 3920) loss: 0.733237
  274. (Iteration 2301 / 3920) loss: 0.741551
  275. (Iteration 2401 / 3920) loss: 0.809936
  276. (Epoch 45 / 8) train acc: 0.746000; val_acc: 0.509000
  277. (Iteration 2501 / 3920) loss: 0.706718
  278. (Iteration 2601 / 3920) loss: 0.993605
  279. (Iteration 2701 / 3920) loss: 0.765338
  280. (Iteration 2801 / 3920) loss: 0.792885
  281. (Iteration 2901 / 3920) loss: 0.805664
  282. (Epoch 46 / 8) train acc: 0.777000; val_acc: 0.532000
  283. (Iteration 3001 / 3920) loss: 0.892900
  284. (Iteration 3101 / 3920) loss: 0.868157
  285. (Iteration 3201 / 3920) loss: 0.666570
  286. (Iteration 3301 / 3920) loss: 0.829526
  287. (Iteration 3401 / 3920) loss: 0.768001
  288. (Epoch 47 / 8) train acc: 0.779000; val_acc: 0.520000
  289. (Iteration 3501 / 3920) loss: 0.855577
  290. (Iteration 3601 / 3920) loss: 0.750030
  291. (Iteration 3701 / 3920) loss: 0.653861
  292. (Iteration 3801 / 3920) loss: 0.664095
  293. (Iteration 3901 / 3920) loss: 0.828190
  294. (Epoch 48 / 8) train acc: 0.735000; val_acc: 0.519000
  295. (Iteration 1 / 3920) loss: 0.662947
  296. (Epoch 48 / 8) train acc: 0.743000; val_acc: 0.523000
  297. (Iteration 101 / 3920) loss: 0.745197
  298. (Iteration 201 / 3920) loss: 0.773770
  299. (Iteration 301 / 3920) loss: 0.706227
  300. (Iteration 401 / 3920) loss: 0.629369
  301. (Epoch 49 / 8) train acc: 0.739000; val_acc: 0.516000
  302. (Iteration 501 / 3920) loss: 0.662874
  303. (Iteration 601 / 3920) loss: 0.753527
  304. (Iteration 701 / 3920) loss: 0.904063
  305. (Iteration 801 / 3920) loss: 0.599661
  306. (Iteration 901 / 3920) loss: 0.793521
  307. (Epoch 50 / 8) train acc: 0.770000; val_acc: 0.517000
  308. (Iteration 1001 / 3920) loss: 0.671293
  309. (Iteration 1101 / 3920) loss: 0.678262
  310. (Iteration 1201 / 3920) loss: 0.736108
  311. (Iteration 1301 / 3920) loss: 0.639588
  312. (Iteration 1401 / 3920) loss: 0.581877
  313. (Epoch 51 / 8) train acc: 0.757000; val_acc: 0.525000
  314. (Iteration 1501 / 3920) loss: 0.622416
  315. (Iteration 1601 / 3920) loss: 0.775816
  316. (Iteration 1701 / 3920) loss: 0.719395
  317. (Iteration 1801 / 3920) loss: 0.715135
  318. (Iteration 1901 / 3920) loss: 0.744467
  319. (Epoch 52 / 8) train acc: 0.768000; val_acc: 0.516000
  320. (Iteration 2001 / 3920) loss: 0.648257
  321. (Iteration 2101 / 3920) loss: 0.570483
  322. (Iteration 2201 / 3920) loss: 0.880970
  323. (Iteration 2301 / 3920) loss: 0.770496
  324. (Iteration 2401 / 3920) loss: 0.797153
  325. (Epoch 53 / 8) train acc: 0.744000; val_acc: 0.521000
  326. (Iteration 2501 / 3920) loss: 0.669477
  327. (Iteration 2601 / 3920) loss: 0.715992
  328. (Iteration 2701 / 3920) loss: 0.793173
  329. (Iteration 2801 / 3920) loss: 0.668304
  330. (Iteration 2901 / 3920) loss: 0.694873
  331. (Epoch 54 / 8) train acc: 0.770000; val_acc: 0.511000
  332. (Iteration 3001 / 3920) loss: 0.686666
  333. (Iteration 3101 / 3920) loss: 0.663038
  334. (Iteration 3201 / 3920) loss: 0.749805
  335. (Iteration 3301 / 3920) loss: 0.757684
  336. (Iteration 3401 / 3920) loss: 0.675762
  337. (Epoch 55 / 8) train acc: 0.771000; val_acc: 0.514000
  338. (Iteration 3501 / 3920) loss: 0.614942
  339. (Iteration 3601 / 3920) loss: 0.670961
  340. (Iteration 3701 / 3920) loss: 0.649142
  341. (Iteration 3801 / 3920) loss: 0.605455
  342. (Iteration 3901 / 3920) loss: 0.730182
  343. (Epoch 56 / 8) train acc: 0.789000; val_acc: 0.524000
  344. (Iteration 1 / 3920) loss: 0.614868
  345. (Epoch 56 / 8) train acc: 0.783000; val_acc: 0.523000
  346. (Iteration 101 / 3920) loss: 0.808440
  347. (Iteration 201 / 3920) loss: 0.698936
  348. (Iteration 301 / 3920) loss: 0.702807
  349. (Iteration 401 / 3920) loss: 0.654820
  350. (Epoch 57 / 8) train acc: 0.771000; val_acc: 0.511000
  351. (Iteration 501 / 3920) loss: 0.573736
  352. (Iteration 601 / 3920) loss: 0.788944
  353. (Iteration 701 / 3920) loss: 0.676950
  354. (Iteration 801 / 3920) loss: 0.647669
  355. (Iteration 901 / 3920) loss: 0.651173
  356. (Epoch 58 / 8) train acc: 0.768000; val_acc: 0.527000
  357. (Iteration 1001 / 3920) loss: 0.876323
  358. (Iteration 1101 / 3920) loss: 0.639902
  359. (Iteration 1201 / 3920) loss: 0.645207
  360. (Iteration 1301 / 3920) loss: 0.571653
  361. (Iteration 1401 / 3920) loss: 0.793754
  362. (Epoch 59 / 8) train acc: 0.767000; val_acc: 0.505000
  363. (Iteration 1501 / 3920) loss: 0.670442
  364. (Iteration 1601 / 3920) loss: 0.600570
  365. (Iteration 1701 / 3920) loss: 0.826705
  366. (Iteration 1801 / 3920) loss: 0.696892
  367. (Iteration 1901 / 3920) loss: 0.713566
  368. (Epoch 60 / 8) train acc: 0.782000; val_acc: 0.508000
  369. (Iteration 2001 / 3920) loss: 0.818164
  370. (Iteration 2101 / 3920) loss: 0.814385
  371. (Iteration 2201 / 3920) loss: 0.790078
  372. (Iteration 2301 / 3920) loss: 0.807461
  373. (Iteration 2401 / 3920) loss: 0.791152
  374. (Epoch 61 / 8) train acc: 0.808000; val_acc: 0.514000
  375. (Iteration 2501 / 3920) loss: 0.658992
  376. (Iteration 2601 / 3920) loss: 0.732733
  377. (Iteration 2701 / 3920) loss: 0.703825
  378. (Iteration 2801 / 3920) loss: 0.736217
  379. (Iteration 2901 / 3920) loss: 0.749870
  380. (Epoch 62 / 8) train acc: 0.789000; val_acc: 0.513000
  381. (Iteration 3001 / 3920) loss: 0.598580
  382. (Iteration 3101 / 3920) loss: 0.586201
  383. (Iteration 3201 / 3920) loss: 0.649068
  384. (Iteration 3301 / 3920) loss: 0.662564
  385. (Iteration 3401 / 3920) loss: 0.684387
  386. (Epoch 63 / 8) train acc: 0.787000; val_acc: 0.518000
  387. (Iteration 3501 / 3920) loss: 0.576055
  388. (Iteration 3601 / 3920) loss: 0.656433
  389. (Iteration 3701 / 3920) loss: 0.774953
  390. (Iteration 3801 / 3920) loss: 0.597454
  391. (Iteration 3901 / 3920) loss: 0.559453
  392. (Epoch 64 / 8) train acc: 0.769000; val_acc: 0.513000
    # Run this cell to visualize training loss and train / val accuracy
    plt.subplot(2, 1, 1)
    plt.title('Training loss')
    plt.plot(solver.loss_history, 'o')
    plt.xlabel('Iteration')

    plt.subplot(2, 1, 2)
    plt.title('Accuracy')
    plt.plot(solver.train_acc_history, '-o', label='train')
    plt.plot(solver.val_acc_history, '-o', label='val')
    plt.plot([0.5] * len(solver.val_acc_history), 'k--')
    plt.xlabel('Epoch')
    plt.legend(loc='lower right')
    plt.gcf().set_size_inches(15, 12)
    plt.show()

FullyConnectedNets_21_0.png

Multilayer network

Next you will implement a fully-connected network with an arbitrary number of hidden layers.

Read through the FullyConnectedNet class in the file cs231n/classifiers/fc_net.py.

Implement the initialization, the forward pass, and the backward pass. For the moment don’t worry about implementing dropout or batch/layer normalization; we will add those features soon.

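As a rough guide, the initialization and forward/backward structure can follow the pattern sketched below. The helper names init_params and fcnet_loss are illustrative, the W1/b1, W2/b2, ... naming matches what the gradient checks expect, and dropout/normalization hooks are omitted:

    # Sketch of the core of FullyConnectedNet (no dropout, no normalization).
    def init_params(hidden_dims, input_dim, num_classes, weight_scale):
        dims = [input_dim] + hidden_dims + [num_classes]
        params = {}
        for i in range(len(dims) - 1):
            params['W%d' % (i + 1)] = weight_scale * np.random.randn(dims[i], dims[i + 1])
            params['b%d' % (i + 1)] = np.zeros(dims[i + 1])
        return params

    def fcnet_loss(params, num_layers, X, y, reg):
        # Forward: (L-1) affine-relu blocks followed by a final affine layer.
        caches = []
        h = X
        for i in range(1, num_layers):
            h, cache = affine_relu_forward(h, params['W%d' % i], params['b%d' % i])
            caches.append(cache)
        scores, cache = affine_forward(h, params['W%d' % num_layers], params['b%d' % num_layers])
        caches.append(cache)
        if y is None:
            return scores

        loss, dout = softmax_loss(scores, y)
        loss += 0.5 * reg * np.sum(params['W%d' % num_layers] ** 2)
        grads = {}

        # Backward: last affine layer, then the affine-relu blocks in reverse.
        dout, dW, db = affine_backward(dout, caches.pop())
        grads['W%d' % num_layers] = dW + reg * params['W%d' % num_layers]
        grads['b%d' % num_layers] = db
        for i in range(num_layers - 1, 0, -1):
            loss += 0.5 * reg * np.sum(params['W%d' % i] ** 2)
            dout, dW, db = affine_relu_backward(dout, caches.pop())
            grads['W%d' % i] = dW + reg * params['W%d' % i]
            grads['b%d' % i] = db
        return loss, grads
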
Initial loss and gradient check

As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?

For gradient checking, you should expect to see errors around 1e-7 or less.

    np.random.seed(231)
    N, D, H1, H2, C = 2, 15, 20, 30, 10
    X = np.random.randn(N, D)
    y = np.random.randint(C, size=(N,))

    for reg in [0, 3.14]:
        print('Running check with reg = ', reg)
        model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                                  reg=reg, weight_scale=5e-2, dtype=np.float64)

        loss, grads = model.loss(X, y)
        print('Initial loss: ', loss)

        # Most of the errors should be on the order of e-7 or smaller.
        # NOTE: It is fine however to see an error for W2 on the order of e-5
        # for the check when reg = 0.0
        for name in sorted(grads):
            f = lambda _: model.loss(X, y)[0]
            grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
            print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
  1. Running check with reg = 0
  2. Initial loss: 2.3004790897684924
  3. W1 relative error: 1.48e-07
  4. W2 relative error: 2.21e-05
  5. W3 relative error: 3.53e-07
  6. b1 relative error: 5.38e-09
  7. b2 relative error: 2.09e-09
  8. b3 relative error: 5.80e-11
  9. Running check with reg = 3.14
  10. Initial loss: 7.052114776533016
  11. W1 relative error: 3.90e-09
  12. W2 relative error: 6.87e-08
  13. W3 relative error: 2.13e-08
  14. b1 relative error: 1.48e-08
  15. b2 relative error: 1.72e-09
  16. b3 relative error: 1.57e-10

As another sanity check, make sure you can overfit a small dataset of 50 images. First we will try a three-layer network with 100 units in each hidden layer. In the following cell, tweak the learning rate and weight initialization scale to overfit and achieve 100% training accuracy within 20 epochs.

    %%time
    # TODO: Use a three-layer Net to overfit 50 training examples by
    # tweaking just the learning rate and initialization scale.

    num_train = 50
    small_data = {
      'X_train': data['X_train'][:num_train],
      'y_train': data['y_train'][:num_train],
      'X_val': data['X_val'],
      'y_val': data['y_val'],
    }

    weight_scale = 1e-2   # Experiment with this!
    learning_rate = 1e-2  # Experiment with this!
    model = FullyConnectedNet([100, 100],
                              weight_scale=weight_scale, dtype=np.float64)
    solver = Solver(model, small_data,
                    print_every=10, num_epochs=20, batch_size=25,
                    update_rule='sgd',
                    optim_config={
                      'learning_rate': learning_rate,
                    }
             )
    solver.train()

    '''
    ### ways to find best lr & ws
    lrs = np.logspace(-6, -1, 6)
    wss = np.logspace(-6, -1, 6)
    methods = []
    for ws in wss:
        solver.model.weight_scale = ws
        for lr in lrs:
            solver.optim_config['learning_rate'] = lr
            solver.train()
            methods.append({
                'lr': lr,
                'ws': ws,
                'acc': solver.best_val_acc,
            })
    print('learning_rate\tweight_scale\taccuracy')
    for item in sorted(methods, key=lambda x: x['acc'], reverse=True):
        print("{0:4e},\t{1:4e},\t{2:4f}".format(item['lr'], item['ws'], item['acc']))
    '''

    plt.plot(solver.loss_history, 'o')
    plt.title('Training loss history')
    plt.xlabel('Iteration')
    plt.ylabel('Training loss')
    plt.show()
  1. (Iteration 1 / 40) loss: 2.283204
  2. (Epoch 0 / 20) train acc: 0.380000; val_acc: 0.140000
  3. (Epoch 1 / 20) train acc: 0.280000; val_acc: 0.134000
  4. (Epoch 2 / 20) train acc: 0.500000; val_acc: 0.163000
  5. (Epoch 3 / 20) train acc: 0.660000; val_acc: 0.166000
  6. (Epoch 4 / 20) train acc: 0.700000; val_acc: 0.168000
  7. (Epoch 5 / 20) train acc: 0.820000; val_acc: 0.182000
  8. (Iteration 11 / 40) loss: 0.922495
  9. (Epoch 6 / 20) train acc: 0.620000; val_acc: 0.141000
  10. (Epoch 7 / 20) train acc: 0.780000; val_acc: 0.181000
  11. (Epoch 8 / 20) train acc: 0.800000; val_acc: 0.158000
  12. (Epoch 9 / 20) train acc: 0.880000; val_acc: 0.173000
  13. (Epoch 10 / 20) train acc: 0.940000; val_acc: 0.178000
  14. (Iteration 21 / 40) loss: 0.259975
  15. (Epoch 11 / 20) train acc: 0.960000; val_acc: 0.188000
  16. (Epoch 12 / 20) train acc: 0.960000; val_acc: 0.178000
  17. (Epoch 13 / 20) train acc: 0.980000; val_acc: 0.181000
  18. (Epoch 14 / 20) train acc: 0.980000; val_acc: 0.179000
  19. (Epoch 15 / 20) train acc: 0.940000; val_acc: 0.193000
  20. (Iteration 31 / 40) loss: 0.402781
  21. (Epoch 16 / 20) train acc: 0.980000; val_acc: 0.206000
  22. (Epoch 17 / 20) train acc: 1.000000; val_acc: 0.198000
  23. (Epoch 18 / 20) train acc: 1.000000; val_acc: 0.198000
  24. (Epoch 19 / 20) train acc: 1.000000; val_acc: 0.193000
  25. (Epoch 20 / 20) train acc: 1.000000; val_acc: 0.191000

FullyConnectedNets_26_1.png

  1. Wall time: 1.65 s

Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again, you will have to adjust the learning rate and weight initialization scale, but you should be able to achieve 100% training accuracy within 20 epochs.

    %%time
    # TODO: Use a five-layer Net to overfit 50 training examples by
    # tweaking just the learning rate and initialization scale.

    num_train = 50
    small_data = {
      'X_train': data['X_train'][:num_train],
      'y_train': data['y_train'][:num_train],
      'X_val': data['X_val'],
      'y_val': data['y_val'],
    }

    learning_rate = 1e-3
    weight_scale = 1e-1
    model = FullyConnectedNet([100, 100, 100, 100],
                              weight_scale=weight_scale, dtype=np.float64)
    solver = Solver(model, small_data,
                    print_every=10, num_epochs=20, batch_size=25,
                    update_rule='sgd',
                    optim_config={
                      'learning_rate': learning_rate,
                    }
             )
    solver.train()

    '''ways to find best (lr, ws): (1e-3, 1e-1)
    plt.plot(solver.loss_history, 'o')
    plt.title('Training loss history')
    plt.xlabel('Iteration')
    plt.ylabel('Training loss')
    plt.show()

    lrs = np.logspace(-3, -1, 3)
    wss = np.logspace(-3, -1, 3)
    # lrs = [1e-3, 1e-2]
    # wss = [1e-3, 1e-2]
    methods = []
    for ws in wss:
        model = FullyConnectedNet([100, 100, 100, 100],
                                  weight_scale=ws, dtype=np.float64)
        for lr in lrs:
            solver = Solver(model, small_data,
                            print_every=50, num_epochs=20, batch_size=25,
                            update_rule='sgd',
                            optim_config={
                              'learning_rate': lr,
                            }
                     )
            solver.train()
            methods.append({
                'lr': lr,
                'ws': ws,
                'acc': solver.best_val_acc,
            })
            print('-----------end of one training----------')
    print('learning_rate\tweight_scale\taccuracy')
    for item in sorted(methods, key=lambda x: x['acc'], reverse=True):
        print("{0:4e},\t{1:4e},\t{2:4f}".format(item['lr'], item['ws'], item['acc']))
    '''

    plt.plot(solver.loss_history, 'o')
    plt.title('Training loss history')
    plt.xlabel('Iteration')
    plt.ylabel('Training loss')
    plt.show()
  1. (Iteration 1 / 40) loss: 248.434438
  2. (Epoch 0 / 20) train acc: 0.260000; val_acc: 0.112000
  3. (Epoch 1 / 20) train acc: 0.160000; val_acc: 0.107000
  4. (Epoch 2 / 20) train acc: 0.540000; val_acc: 0.107000
  5. (Epoch 3 / 20) train acc: 0.420000; val_acc: 0.099000
  6. (Epoch 4 / 20) train acc: 0.740000; val_acc: 0.153000
  7. (Epoch 5 / 20) train acc: 0.820000; val_acc: 0.143000
  8. (Iteration 11 / 40) loss: 3.960893
  9. (Epoch 6 / 20) train acc: 0.860000; val_acc: 0.131000
  10. (Epoch 7 / 20) train acc: 0.880000; val_acc: 0.147000
  11. (Epoch 8 / 20) train acc: 0.980000; val_acc: 0.141000
  12. (Epoch 9 / 20) train acc: 0.940000; val_acc: 0.139000
  13. (Epoch 10 / 20) train acc: 1.000000; val_acc: 0.149000
  14. (Iteration 21 / 40) loss: 0.001239
  15. (Epoch 11 / 20) train acc: 1.000000; val_acc: 0.149000
  16. (Epoch 12 / 20) train acc: 1.000000; val_acc: 0.149000
  17. (Epoch 13 / 20) train acc: 1.000000; val_acc: 0.149000
  18. (Epoch 14 / 20) train acc: 1.000000; val_acc: 0.149000
  19. (Epoch 15 / 20) train acc: 1.000000; val_acc: 0.149000
  20. (Iteration 31 / 40) loss: 0.000339
  21. (Epoch 16 / 20) train acc: 1.000000; val_acc: 0.149000
  22. (Epoch 17 / 20) train acc: 1.000000; val_acc: 0.148000
  23. (Epoch 18 / 20) train acc: 1.000000; val_acc: 0.148000
  24. (Epoch 19 / 20) train acc: 1.000000; val_acc: 0.148000
  25. (Epoch 20 / 20) train acc: 1.000000; val_acc: 0.148000

FullyConnectedNets_28_1.png

  1. Wall time: 1.56 s

Inline Question 2:

Did you notice anything about the comparative difficulty of training the three-layer net vs training the five-layer net? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?

Answer:

The deeper the network, the more sensitive it is to the initialization scale. In the five-layer net the activations and gradients are multiplied by a weight matrix at every layer, so a scale that is slightly too small or too large shrinks or blows up the signal roughly exponentially with depth; the three-layer net overfit the 50 examples with weight_scale = 1e-2, while the five-layer net needed a larger, more carefully tuned scale (1e-1 here) before it could reach 100% training accuracy.

Update rules

So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD.

SGD+Momentum

Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent. See the Momentum Update section at http://cs231n.github.io/neural-networks-3/#sgd for more information.

Open the file cs231n/optim.py and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function sgd_momentum and run the following to check your implementation. You should see errors less than e-8.

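One common way to write the update is sketched below: keep a velocity vector that decays by a momentum factor and accumulates the scaled negative gradient. The exact config keys and defaults should follow the docstring in cs231n/optim.py:

    def sgd_momentum(w, dw, config=None):
        """ SGD with momentum. config uses 'learning_rate' and 'momentum'
        and stores the running velocity under 'velocity'. """
        if config is None:
            config = {}
        config.setdefault('learning_rate', 1e-2)
        config.setdefault('momentum', 0.9)
        v = config.get('velocity', np.zeros_like(w))

        v = config['momentum'] * v - config['learning_rate'] * dw  # integrate velocity
        next_w = w + v                                             # integrate position

        config['velocity'] = v
        return next_w, config
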
    from cs231n.optim import sgd_momentum

    N, D = 4, 5
    w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
    dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
    v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

    config = {'learning_rate': 1e-3, 'velocity': v}
    next_w, _ = sgd_momentum(w, dw, config=config)

    expected_next_w = np.asarray([
      [ 0.1406,      0.20738947,  0.27417895,  0.34096842,  0.40775789],
      [ 0.47454737,  0.54133684,  0.60812632,  0.67491579,  0.74170526],
      [ 0.80849474,  0.87528421,  0.94207368,  1.00886316,  1.07565263],
      [ 1.14244211,  1.20923158,  1.27602105,  1.34281053,  1.4096    ]])
    expected_velocity = np.asarray([
      [ 0.5406,      0.55475789,  0.56891579,  0.58307368,  0.59723158],
      [ 0.61138947,  0.62554737,  0.63970526,  0.65386316,  0.66802105],
      [ 0.68217895,  0.69633684,  0.71049474,  0.72465263,  0.73881053],
      [ 0.75296842,  0.76712632,  0.78128421,  0.79544211,  0.8096    ]])

    # Should see relative errors around e-8 or less
    print('next_w error: ', rel_error(next_w, expected_next_w))
    print('velocity error: ', rel_error(expected_velocity, config['velocity']))
  1. next_w error: 8.882347033505819e-09
  2. velocity error: 4.269287743278663e-09

Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster.

    num_train = 4000
    small_data = {
      'X_train': data['X_train'][:num_train],
      'y_train': data['y_train'][:num_train],
      'X_val': data['X_val'],
      'y_val': data['y_val'],
    }

    solvers = {}
    for update_rule in ['sgd', 'sgd_momentum']:
        print('running with ', update_rule)
        model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)

        solver = Solver(model, small_data,
                        num_epochs=5, batch_size=100,
                        update_rule=update_rule,
                        optim_config={
                          'learning_rate': 5e-3,
                        },
                        verbose=True)
        solvers[update_rule] = solver
        solver.train()
        print()

    plt.subplot(3, 1, 1)
    plt.title('Training loss')
    plt.xlabel('Iteration')

    plt.subplot(3, 1, 2)
    plt.title('Training accuracy')
    plt.xlabel('Epoch')

    plt.subplot(3, 1, 3)
    plt.title('Validation accuracy')
    plt.xlabel('Epoch')

    for update_rule, solver in solvers.items():
        plt.subplot(3, 1, 1)
        plt.plot(solver.loss_history, 'o', label="loss_%s" % update_rule)

        plt.subplot(3, 1, 2)
        plt.plot(solver.train_acc_history, '-o', label="train_acc_%s" % update_rule)

        plt.subplot(3, 1, 3)
        plt.plot(solver.val_acc_history, '-o', label="val_acc_%s" % update_rule)

    for i in [1, 2, 3]:
        plt.subplot(3, 1, i)
        plt.legend(loc='upper center', ncol=4)
    plt.gcf().set_size_inches(15, 15)
    plt.show()
  1. running with sgd
  2. (Iteration 1 / 200) loss: 2.633875
  3. (Epoch 0 / 5) train acc: 0.111000; val_acc: 0.112000
  4. (Iteration 11 / 200) loss: 2.235894
  5. (Iteration 21 / 200) loss: 2.206218
  6. (Iteration 31 / 200) loss: 2.122974
  7. (Epoch 1 / 5) train acc: 0.240000; val_acc: 0.209000
  8. (Iteration 41 / 200) loss: 2.071314
  9. (Iteration 51 / 200) loss: 2.086420
  10. (Iteration 61 / 200) loss: 2.074775
  11. (Iteration 71 / 200) loss: 1.968858
  12. (Epoch 2 / 5) train acc: 0.274000; val_acc: 0.238000
  13. (Iteration 81 / 200) loss: 2.013417
  14. (Iteration 91 / 200) loss: 1.974977
  15. (Iteration 101 / 200) loss: 1.929298
  16. (Iteration 111 / 200) loss: 1.888262
  17. (Epoch 3 / 5) train acc: 0.302000; val_acc: 0.261000
  18. (Iteration 121 / 200) loss: 1.903221
  19. (Iteration 131 / 200) loss: 1.833502
  20. (Iteration 141 / 200) loss: 1.966820
  21. (Iteration 151 / 200) loss: 1.882028
  22. (Epoch 4 / 5) train acc: 0.356000; val_acc: 0.286000
  23. (Iteration 161 / 200) loss: 1.854492
  24. (Iteration 171 / 200) loss: 1.815712
  25. (Iteration 181 / 200) loss: 1.691079
  26. (Iteration 191 / 200) loss: 1.764298
  27. (Epoch 5 / 5) train acc: 0.374000; val_acc: 0.309000
  28. running with sgd_momentum
  29. (Iteration 1 / 200) loss: 2.556549
  30. (Epoch 0 / 5) train acc: 0.083000; val_acc: 0.109000
  31. (Iteration 11 / 200) loss: 2.041460
  32. (Iteration 21 / 200) loss: 1.948460
  33. (Iteration 31 / 200) loss: 1.980526
  34. (Epoch 1 / 5) train acc: 0.332000; val_acc: 0.291000
  35. (Iteration 41 / 200) loss: 1.855131
  36. (Iteration 51 / 200) loss: 1.744686
  37. (Iteration 61 / 200) loss: 1.691368
  38. (Iteration 71 / 200) loss: 1.864358
  39. (Epoch 2 / 5) train acc: 0.420000; val_acc: 0.312000
  40. (Iteration 81 / 200) loss: 1.732355
  41. (Iteration 91 / 200) loss: 1.713076
  42. (Iteration 101 / 200) loss: 1.500155
  43. (Iteration 111 / 200) loss: 1.349369
  44. (Epoch 3 / 5) train acc: 0.457000; val_acc: 0.360000
  45. (Iteration 121 / 200) loss: 1.446921
  46. (Iteration 131 / 200) loss: 1.367205
  47. (Iteration 141 / 200) loss: 1.270960
  48. (Iteration 151 / 200) loss: 1.446620
  49. (Epoch 4 / 5) train acc: 0.517000; val_acc: 0.351000
  50. (Iteration 161 / 200) loss: 1.556725
  51. (Iteration 171 / 200) loss: 1.667454
  52. (Iteration 181 / 200) loss: 1.317438
  53. (Iteration 191 / 200) loss: 1.380900
  54. (Epoch 5 / 5) train acc: 0.557000; val_acc: 0.360000
  55. d:\office\office\python\python3.6.2\lib\site-packages\matplotlib\cbook\deprecation.py:107: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance. In a future version, a new instance will always be created and returned. Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  56. warnings.warn(message, mplDeprecation, stacklevel=1)

FullyConnectedNets_34_2.png

RMSProp and Adam

RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.

In the file cs231n/optim.py, implement the RMSProp update rule in the rmsprop function and implement the Adam update rule in the adam function, and check your implementations using the tests below.

NOTE: Please implement the complete Adam update rule (with the bias correction mechanism), not the first simplified version mentioned in the course notes.

[1] Tijmen Tieleman and Geoffrey Hinton. “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, ICLR 2015.

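For orientation, hedged sketches of both rules follow. The config keys (cache, m, v, t) match those used by the tests below; exact defaults should be taken from the docstrings in cs231n/optim.py. Note that Adam increments t before using it in the bias correction:

    def rmsprop(w, dw, config=None):
        """ RMSProp: scale the step by a moving average of squared gradients. """
        if config is None:
            config = {}
        config.setdefault('learning_rate', 1e-2)
        config.setdefault('decay_rate', 0.99)
        config.setdefault('epsilon', 1e-8)
        config.setdefault('cache', np.zeros_like(w))

        cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2
        next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])
        config['cache'] = cache
        return next_w, config

    def adam(w, dw, config=None):
        """ Adam with bias correction. """
        if config is None:
            config = {}
        config.setdefault('learning_rate', 1e-3)
        config.setdefault('beta1', 0.9)
        config.setdefault('beta2', 0.999)
        config.setdefault('epsilon', 1e-8)
        config.setdefault('m', np.zeros_like(w))
        config.setdefault('v', np.zeros_like(w))
        config.setdefault('t', 0)

        config['t'] += 1
        config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
        config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw**2
        m_hat = config['m'] / (1 - config['beta1'] ** config['t'])   # bias correction
        v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
        next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
        return next_w, config
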
    # Test RMSProp implementation
    from cs231n.optim import rmsprop

    N, D = 4, 5
    w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
    dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
    cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

    config = {'learning_rate': 1e-2, 'cache': cache}
    next_w, _ = rmsprop(w, dw, config=config)

    expected_next_w = np.asarray([
      [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],
      [-0.132737,   -0.08078555, -0.02881884,  0.02316247,  0.07515774],
      [ 0.12716641,  0.17918792,  0.23122175,  0.28326742,  0.33532447],
      [ 0.38739248,  0.43947102,  0.49155973,  0.54365823,  0.59576619]])
    expected_cache = np.asarray([
      [ 0.5976,      0.6126277,   0.6277108,   0.64284931,  0.65804321],
      [ 0.67329252,  0.68859723,  0.70395734,  0.71937285,  0.73484377],
      [ 0.75037008,  0.7659518,   0.78158892,  0.79728144,  0.81302936],
      [ 0.82883269,  0.84469141,  0.86060554,  0.87657507,  0.8926    ]])

    # You should see relative errors around e-7 or less
    print('next_w error: ', rel_error(expected_next_w, next_w))
    print('cache error: ', rel_error(expected_cache, config['cache']))
  1. next_w error: 9.524687511038133e-08
  2. cache error: 2.6477955807156126e-09
    # Test Adam implementation
    from cs231n.optim import adam

    N, D = 4, 5
    w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
    dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
    m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)
    v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)

    config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}
    next_w, _ = adam(w, dw, config=config)

    expected_next_w = np.asarray([
      [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],
      [-0.1380274,  -0.08544591, -0.03286534,  0.01971428,  0.0722929 ],
      [ 0.1248705,   0.17744702,  0.23002243,  0.28259667,  0.33516969],
      [ 0.38774145,  0.44031188,  0.49288093,  0.54544852,  0.59801459]])
    expected_v = np.asarray([
      [ 0.69966,     0.68908382,  0.67851319,  0.66794809,  0.65738853,],
      [ 0.64683452,  0.63628604,  0.6257431,   0.61520571,  0.60467385,],
      [ 0.59414753,  0.58362676,  0.57311152,  0.56260183,  0.55209767,],
      [ 0.54159906,  0.53110598,  0.52061845,  0.51013645,  0.49966,   ]])
    expected_m = np.asarray([
      [ 0.48,        0.49947368,  0.51894737,  0.53842105,  0.55789474],
      [ 0.57736842,  0.59684211,  0.61631579,  0.63578947,  0.65526316],
      [ 0.67473684,  0.69421053,  0.71368421,  0.73315789,  0.75263158],
      [ 0.77210526,  0.79157895,  0.81105263,  0.83052632,  0.85      ]])

    # You should see relative errors around e-7 or less
    print('next_w error: ', rel_error(expected_next_w, next_w))
    print('v error: ', rel_error(expected_v, config['v']))
    print('m error: ', rel_error(expected_m, config['m']))
  1. next_w error: 0.20720703668629928
  2. v error: 4.208314038113071e-09
  3. m error: 4.214963193114416e-09

Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:

    learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}
    for update_rule in ['adam', 'rmsprop']:
        print('running with ', update_rule)
        model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)

        solver = Solver(model, small_data,
                        num_epochs=5, batch_size=100,
                        update_rule=update_rule,
                        optim_config={
                          'learning_rate': learning_rates[update_rule]
                        },
                        verbose=True)
        solvers[update_rule] = solver
        solver.train()
        print()

    plt.subplot(3, 1, 1)
    plt.title('Training loss')
    plt.xlabel('Iteration')

    plt.subplot(3, 1, 2)
    plt.title('Training accuracy')
    plt.xlabel('Epoch')

    plt.subplot(3, 1, 3)
    plt.title('Validation accuracy')
    plt.xlabel('Epoch')

    for update_rule, solver in list(solvers.items()):
        plt.subplot(3, 1, 1)
        plt.plot(solver.loss_history, 'o', label=update_rule)

        plt.subplot(3, 1, 2)
        plt.plot(solver.train_acc_history, '-o', label=update_rule)

        plt.subplot(3, 1, 3)
        plt.plot(solver.val_acc_history, '-o', label=update_rule)

    for i in [1, 2, 3]:
        plt.subplot(3, 1, i)
        plt.legend(loc='upper center', ncol=4)
    plt.gcf().set_size_inches(15, 15)
    plt.show()
  1. running with adam
  2. (Iteration 1 / 200) loss: 2.565073
  3. (Epoch 0 / 5) train acc: 0.117000; val_acc: 0.087000
  4. (Iteration 11 / 200) loss: 2.103384
  5. (Iteration 21 / 200) loss: 2.008961
  6. (Iteration 31 / 200) loss: 2.016690
  7. (Epoch 1 / 5) train acc: 0.304000; val_acc: 0.292000
  8. (Iteration 41 / 200) loss: 1.795322
  9. (Iteration 51 / 200) loss: 1.976308
  10. (Iteration 61 / 200) loss: 1.770154
  11. (Iteration 71 / 200) loss: 1.742921
  12. (Epoch 2 / 5) train acc: 0.369000; val_acc: 0.297000
  13. (Iteration 81 / 200) loss: 1.668088
  14. (Iteration 91 / 200) loss: 1.718752
  15. (Iteration 101 / 200) loss: 1.785251
  16. (Iteration 111 / 200) loss: 1.758603
  17. (Epoch 3 / 5) train acc: 0.388000; val_acc: 0.353000
  18. (Iteration 121 / 200) loss: 1.824496
  19. (Iteration 131 / 200) loss: 1.573791
  20. (Iteration 141 / 200) loss: 1.586312
  21. (Iteration 151 / 200) loss: 1.600374
  22. (Epoch 4 / 5) train acc: 0.414000; val_acc: 0.328000
  23. (Iteration 161 / 200) loss: 1.351332
  24. (Iteration 171 / 200) loss: 1.379117
  25. (Iteration 181 / 200) loss: 1.618205
  26. (Iteration 191 / 200) loss: 1.490465
  27. (Epoch 5 / 5) train acc: 0.431000; val_acc: 0.351000
  28. running with rmsprop
  29. (Iteration 1 / 200) loss: 2.789734
  30. (Epoch 0 / 5) train acc: 0.118000; val_acc: 0.113000
  31. (Iteration 11 / 200) loss: 2.134903
  32. (Iteration 21 / 200) loss: 2.058140
  33. (Iteration 31 / 200) loss: 1.810528
  34. (Epoch 1 / 5) train acc: 0.372000; val_acc: 0.312000
  35. (Iteration 41 / 200) loss: 1.795879
  36. (Iteration 51 / 200) loss: 1.725097
  37. (Iteration 61 / 200) loss: 1.757195
  38. (Iteration 71 / 200) loss: 1.537397
  39. (Epoch 2 / 5) train acc: 0.402000; val_acc: 0.336000
  40. (Iteration 81 / 200) loss: 1.581306
  41. (Iteration 91 / 200) loss: 1.743312
  42. (Iteration 101 / 200) loss: 1.459270
  43. (Iteration 111 / 200) loss: 1.454714
  44. (Epoch 3 / 5) train acc: 0.480000; val_acc: 0.344000
  45. (Iteration 121 / 200) loss: 1.606346
  46. (Iteration 131 / 200) loss: 1.688550
  47. (Iteration 141 / 200) loss: 1.579165
  48. (Iteration 151 / 200) loss: 1.411453
  49. (Epoch 4 / 5) train acc: 0.518000; val_acc: 0.365000
  50. (Iteration 161 / 200) loss: 1.607441
  51. (Iteration 171 / 200) loss: 1.428779
  52. (Iteration 181 / 200) loss: 1.507496
  53. (Iteration 191 / 200) loss: 1.309462
  54. (Epoch 5 / 5) train acc: 0.525000; val_acc: 0.347000
  55. d:\office\office\python\python3.6.2\lib\site-packages\matplotlib\cbook\deprecation.py:107: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance. In a future version, a new instance will always be created and returned. Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  56. warnings.warn(message, mplDeprecation, stacklevel=1)

[Figure FullyConnectedNets_39_2.png: training loss per iteration and training/validation accuracy per epoch, with one curve per update rule in solvers.]

Inline Question 3:

AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

  1. cache += dw**2
  2. w += - learning_rate * dw / (np.sqrt(cache) + eps)

John notices that when he trains a network with AdaGrad, the updates become very small and his network learns slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

Answer:

In AdaGrad the cache only ever grows: every step adds dw**2 and nothing ever decays it. As training proceeds, np.sqrt(cache) keeps increasing, so the effective step size learning_rate / (np.sqrt(cache) + eps) shrinks toward zero and the updates become tiny even if the current gradient is still large. Adam does not have the same issue, because it replaces the running sum with exponentially decaying averages of the gradient and its square (plus bias correction), so its denominator tracks the recent gradient magnitude rather than the entire training history.
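To make the shrinkage concrete, the toy loop below (illustrative only, no real network involved) mimics AdaGrad's running sum for a gradient of roughly constant magnitude and prints how the resulting update shrinks like 1 / sqrt(t).

    import numpy as np

    learning_rate, eps = 1e-2, 1e-8
    dw = 1.0                      # pretend the gradient magnitude stays around 1
    for t in [1, 10, 100, 1000, 10000]:
        cache = t * dw ** 2       # AdaGrad's running sum after t such steps
        step = learning_rate * dw / (np.sqrt(cache) + eps)
        print('after %5d steps the update magnitude is %.1e' % (t, step))

Each tenfold increase in t cuts the update by roughly a factor of sqrt(10), regardless of the current gradient.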

Train a good model!

Train the best fully-connected model that you can on CIFAR-10, storing your best model in the best_model variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.

If you are careful it should be possible to get accuracies above 55%, but we don’t require it for this part and won’t assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.

You might find it useful to complete the BatchNormalization.ipynb and Dropout.ipynb notebooks before completing this part, since those techniques can help you train powerful models.
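If you do pull in those techniques, one possible starting configuration is sketched below. Treat it strictly as a sketch: the normalization and dropout keyword names are assumptions about the FullyConnectedNet constructor in your copy of cs231n/classifiers/fc_net.py (some versions use use_batchnorm or dropout_keep_ratio instead), lr_decay is assumed to be supported by the assignment's Solver, and the hyperparameters are untuned starting values, not a recipe for 55%.

    # Sketch only: check your FullyConnectedNet and Solver signatures before running.
    model = FullyConnectedNet([256, 256, 256, 256],
                              weight_scale=5e-2,
                              normalization='batchnorm',  # or 'layernorm'; keyword name may differ
                              dropout=0.8,                # assumed to be the keep probability
                              reg=1e-4,
                              dtype=np.float64)
    solver = Solver(model, data,
                    num_epochs=20, batch_size=200,
                    update_rule='adam',
                    optim_config={'learning_rate': 1e-3},
                    lr_decay=0.95,
                    print_every=200, verbose=True)
    solver.train()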

  1. best_model = None
  2. ################################################################################
  3. # TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might #
  4. # find batch/layer normalization and dropout useful. Store your best model in #
  5. # the best_model variable. #
  6. ################################################################################
  7. # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
  8. data = get_CIFAR10_data()
  9. weight_scale = 1e-2
  10. learning_rate = 1e-3
  11. update_rule = 'adam'
  12. num_epochs = 20
  13. batch_size = 100
  14. model = FullyConnectedNet([100, 100, 100, 100],
  15.                           weight_scale=weight_scale, dtype=np.float64)
  16. solver = Solver(model, data,
  17.                 print_every=100,
  18.                 num_epochs=num_epochs,
  19.                 batch_size=batch_size,
  20.                 update_rule=update_rule,
  21.                 optim_config={
  22.                     'learning_rate': learning_rate,
  23.                 }
  24.                 )
  25. solver.train()
  26. best_model = solver.model
  27. # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
  28. ################################################################################
  29. # END OF YOUR CODE #
  30. ################################################################################
  1. (Iteration 1 / 9800) loss: 2.302519
  2. (Epoch 0 / 20) train acc: 0.123000; val_acc: 0.098000
  3. (Iteration 101 / 9800) loss: 1.900525
  4. (Iteration 201 / 9800) loss: 1.867200
  5. (Iteration 301 / 9800) loss: 1.771347
  6. (Iteration 401 / 9800) loss: 1.813210
  7. (Epoch 1 / 20) train acc: 0.362000; val_acc: 0.381000
  8. (Iteration 501 / 9800) loss: 1.869626
  9. (Iteration 601 / 9800) loss: 1.722641
  10. (Iteration 701 / 9800) loss: 1.514980
  11. (Iteration 801 / 9800) loss: 1.572666
  12. (Iteration 901 / 9800) loss: 1.539571
  13. (Epoch 2 / 20) train acc: 0.446000; val_acc: 0.433000
  14. (Iteration 1001 / 9800) loss: 1.402627
  15. (Iteration 1101 / 9800) loss: 1.592149
  16. (Iteration 1201 / 9800) loss: 1.491804
  17. (Iteration 1301 / 9800) loss: 1.510342
  18. (Iteration 1401 / 9800) loss: 1.463611
  19. (Epoch 3 / 20) train acc: 0.450000; val_acc: 0.457000
  20. (Iteration 1501 / 9800) loss: 1.428206
  21. (Iteration 1601 / 9800) loss: 1.410774
  22. (Iteration 1701 / 9800) loss: 1.419294
  23. (Iteration 1801 / 9800) loss: 1.554943
  24. (Iteration 1901 / 9800) loss: 1.276085
  25. (Epoch 4 / 20) train acc: 0.496000; val_acc: 0.487000
  26. (Iteration 2001 / 9800) loss: 1.508825
  27. (Iteration 2101 / 9800) loss: 1.514870
  28. (Iteration 2201 / 9800) loss: 1.378368
  29. (Iteration 2301 / 9800) loss: 1.333612
  30. (Iteration 2401 / 9800) loss: 1.369037
  31. (Epoch 5 / 20) train acc: 0.516000; val_acc: 0.480000
  32. (Iteration 2501 / 9800) loss: 1.569647
  33. (Iteration 2601 / 9800) loss: 1.474409
  34. (Iteration 2701 / 9800) loss: 1.238740
  35. (Iteration 2801 / 9800) loss: 1.469653
  36. (Iteration 2901 / 9800) loss: 1.368814
  37. (Epoch 6 / 20) train acc: 0.487000; val_acc: 0.502000
  38. (Iteration 3001 / 9800) loss: 1.390118
  39. (Iteration 3101 / 9800) loss: 1.232272
  40. (Iteration 3201 / 9800) loss: 1.059411
  41. (Iteration 3301 / 9800) loss: 1.333375
  42. (Iteration 3401 / 9800) loss: 1.245856
  43. (Epoch 7 / 20) train acc: 0.517000; val_acc: 0.508000
  44. (Iteration 3501 / 9800) loss: 1.299509
  45. (Iteration 3601 / 9800) loss: 1.515722
  46. (Iteration 3701 / 9800) loss: 1.377892
  47. (Iteration 3801 / 9800) loss: 1.335273
  48. (Iteration 3901 / 9800) loss: 1.499312
  49. (Epoch 8 / 20) train acc: 0.551000; val_acc: 0.529000
  50. (Iteration 4001 / 9800) loss: 1.377010
  51. (Iteration 4101 / 9800) loss: 1.375806
  52. (Iteration 4201 / 9800) loss: 1.250878
  53. (Iteration 4301 / 9800) loss: 1.326187
  54. (Iteration 4401 / 9800) loss: 1.367880
  55. (Epoch 9 / 20) train acc: 0.528000; val_acc: 0.504000
  56. (Iteration 4501 / 9800) loss: 1.285116
  57. (Iteration 4601 / 9800) loss: 1.112254
  58. (Iteration 4701 / 9800) loss: 1.201709
  59. (Iteration 4801 / 9800) loss: 1.238681
  60. (Epoch 10 / 20) train acc: 0.559000; val_acc: 0.521000
  61. (Iteration 4901 / 9800) loss: 1.049936
  62. (Iteration 5001 / 9800) loss: 1.080508
  63. (Iteration 5101 / 9800) loss: 1.202960
  64. (Iteration 5201 / 9800) loss: 1.155451
  65. (Iteration 5301 / 9800) loss: 1.159336
  66. (Epoch 11 / 20) train acc: 0.537000; val_acc: 0.517000
  67. (Iteration 5401 / 9800) loss: 1.382031
  68. (Iteration 5501 / 9800) loss: 1.111946
  69. (Iteration 5601 / 9800) loss: 1.142069
  70. (Iteration 5701 / 9800) loss: 1.148606
  71. (Iteration 5801 / 9800) loss: 1.178862
  72. (Epoch 12 / 20) train acc: 0.582000; val_acc: 0.520000
  73. (Iteration 5901 / 9800) loss: 1.290777
  74. (Iteration 6001 / 9800) loss: 1.166438
  75. (Iteration 6101 / 9800) loss: 1.259738
  76. (Iteration 6201 / 9800) loss: 1.120008
  77. (Iteration 6301 / 9800) loss: 0.921455
  78. (Epoch 13 / 20) train acc: 0.594000; val_acc: 0.504000
  79. (Iteration 6401 / 9800) loss: 1.115358
  80. (Iteration 6501 / 9800) loss: 1.053265
  81. (Iteration 6601 / 9800) loss: 1.208918
  82. (Iteration 6701 / 9800) loss: 1.028606
  83. (Iteration 6801 / 9800) loss: 1.033682
  84. (Epoch 14 / 20) train acc: 0.587000; val_acc: 0.540000
  85. (Iteration 6901 / 9800) loss: 1.013524
  86. (Iteration 7001 / 9800) loss: 1.187387
  87. (Iteration 7101 / 9800) loss: 1.165351
  88. (Iteration 7201 / 9800) loss: 1.032104
  89. (Iteration 7301 / 9800) loss: 1.008612
  90. (Epoch 15 / 20) train acc: 0.615000; val_acc: 0.520000
  91. (Iteration 7401 / 9800) loss: 1.158815
  92. (Iteration 7501 / 9800) loss: 1.144391
  93. (Iteration 7601 / 9800) loss: 0.828153
  94. (Iteration 7701 / 9800) loss: 1.156747
  95. (Iteration 7801 / 9800) loss: 1.137408
  96. (Epoch 16 / 20) train acc: 0.593000; val_acc: 0.514000
  97. (Iteration 7901 / 9800) loss: 0.961378
  98. (Iteration 8001 / 9800) loss: 1.160585
  99. (Iteration 8101 / 9800) loss: 1.067675
  100. (Iteration 8201 / 9800) loss: 1.011692
  101. (Iteration 8301 / 9800) loss: 1.130812
  102. (Epoch 17 / 20) train acc: 0.612000; val_acc: 0.512000
  103. (Iteration 8401 / 9800) loss: 1.009954
  104. (Iteration 8501 / 9800) loss: 1.316998
  105. (Iteration 8601 / 9800) loss: 1.033570
  106. (Iteration 8701 / 9800) loss: 1.136633
  107. (Iteration 8801 / 9800) loss: 1.166442
  108. (Epoch 18 / 20) train acc: 0.590000; val_acc: 0.523000
  109. (Iteration 8901 / 9800) loss: 1.185326
  110. (Iteration 9001 / 9800) loss: 1.111609
  111. (Iteration 9101 / 9800) loss: 1.142837
  112. (Iteration 9201 / 9800) loss: 1.153727
  113. (Iteration 9301 / 9800) loss: 1.288469
  114. (Epoch 19 / 20) train acc: 0.628000; val_acc: 0.526000
  115. (Iteration 9401 / 9800) loss: 1.233846
  116. (Iteration 9501 / 9800) loss: 0.979883
  117. (Iteration 9601 / 9800) loss: 1.080138
  118. (Iteration 9701 / 9800) loss: 1.035944
  119. (Epoch 20 / 20) train acc: 0.613000; val_acc: 0.520000

Test your model!

Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set. (When called without labels, the model's loss method runs a test-time forward pass and returns class scores, so taking the argmax over the scores gives the predicted labels, as in the cell below.)

  1. y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
  2. y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
  3. print('Validation set accuracy: ', (y_val_pred == data['y_val']).mean())
  4. print('Test set accuracy: ', (y_test_pred == data['y_test']).mean())
  1. Validation set accuracy: 0.54
  2. Test set accuracy: 0.487
  1. !jupyter nbconvert --to markdown FullyConnectedNets.ipynb