

hidden units: the number of neurons in each hidden layer
learning rate decay: the decay parameter for the learning rate
mini-batch size: the number of examples in each mini-batch


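In general, the learning rate α is the most important hyperparameter and the one that deserves the most tuning effort. The momentum coefficient β, the number of neurons per hidden layer (#hidden units), and the mini-batch size come next in importance. After those come the number of layers (#layers) and the learning rate decay. Finally, the three Adam parameters β1, β2, and ε are usually set to 0.9, 0.999, and 10⁻⁸ and do not need repeated tuning.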
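
To make the role of the three Adam parameters concrete, here is a minimal sketch of a single Adam update for one scalar parameter, using the default values quoted above. The function name `adam_step`, the learning rate, and the toy objective f(w) = w² are assumptions made for illustration; they are not part of the course code.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment (RMSprop-style) estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps guards against division by zero
    return w, m, v


# Toy usage (hypothetical): minimise f(w) = w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
```

Only `lr` is changed in the toy loop; β1, β2, and ε stay at their defaults, which matches the advice that they rarely need tuning.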



2. Using an appropriate scale to pick hyperparameters
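
The section title refers to choosing the scale on which hyperparameter values are sampled. As a hedged sketch: a learning rate is often searched uniformly on a log scale rather than a linear one, and a momentum coefficient β close to 1 can be searched by sampling 1 − β on a log scale. The helper names and the ranges used here (10⁻⁴ to 1 for α, 0.9 to 0.999 for β) are assumptions chosen for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Sample a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    r = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** r


def sample_momentum(low=0.9, high=0.999, rng=random):
    """Sample beta by drawing 1 - beta on a log scale, so values near 1 are well covered."""
    return 1 - sample_log_uniform(1 - high, 1 - low, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_candidates = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(5)]
beta_candidates = [sample_momentum(rng=rng) for _ in range(5)]
```

With linear sampling on [0.0001, 1], about 90% of the draws would land between 0.1 and 1; sampling the exponent instead spends roughly equal effort on each decade.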



3. Batch Normalization



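Normalizing inputs gives every input a mean of 0 and a variance of 1. Batch Normalization, by contrast, lets the inputs to each hidden layer take an arbitrary mean and variance. From the activation function's point of view, if the inputs to every hidden layer are concentrated near 0, they sit in the activation function's approximately linear region; this makes it harder to train a genuinely nonlinear network, and the resulting model tends to perform worse.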
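
The idea that Batch Normalization first standardizes and then allows an arbitrary mean and variance can be sketched for a single hidden unit as follows. The names `gamma` (scale) and `beta` (shift) follow the usual Batch Norm formulation and are assumptions here, as is the helper name `batch_norm_forward`; the sketch covers only the training-time forward pass.

```python
import math


def batch_norm_forward(z_batch, gamma, beta, eps=1e-8):
    """Normalise one unit's pre-activations over a mini-batch, then rescale and shift."""
    m = len(z_batch)
    mu = sum(z_batch) / m                          # mini-batch mean
    var = sum((z - mu) ** 2 for z in z_batch) / m  # mini-batch variance
    z_norm = [(z - mu) / math.sqrt(var + eps) for z in z_batch]  # mean 0, variance 1
    return [gamma * zn + beta for zn in z_norm]    # mean beta, std approximately gamma


z = [1.0, 2.0, 3.0, 4.0, 5.0]
z_tilde = batch_norm_forward(z, gamma=2.0, beta=3.0)
```

With `gamma=1` and `beta=0` the output is plainly standardized; other values move the distribution away from the region around 0.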

4. Why does Batch Norm work

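If covariate shift occurs, Batch Norm keeps the mean and variance of each hidden layer's output unchanged, so the remaining parameters stay more stable and the model also performs well on other examples.
From another angle, Batch Norm also has a slight regularization effect. Specifically:

  • Each mini-batch is normalized to mean 0 and variance 1 using its own statistics
  • Within each mini-batch, this adds random noise to Z[l] in every hidden layer, with an effect similar to Dropout
  • The smaller the mini-batch, the stronger the regularization effect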
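
The last point can be illustrated numerically: the mean computed from a mini-batch is only an estimate of the true mean, and that estimate fluctuates more when the mini-batch is small. The sketch below is an assumption-laden toy (synthetic Gaussian "activations", hypothetical helper name `batch_mean_spread`), not a measurement of any real network.

```python
import random
import statistics


def batch_mean_spread(population, batch_size, n_batches, rng):
    """Standard deviation of the mini-batch mean across n_batches random mini-batches."""
    means = []
    for _ in range(n_batches):
        batch = rng.sample(population, batch_size)
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)


rng = random.Random(0)
population = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in for one unit's Z values

spread_small = batch_mean_spread(population, batch_size=8, n_batches=500, rng=rng)
spread_large = batch_mean_spread(population, batch_size=512, n_batches=500, rng=rng)
# spread_small is noticeably larger than spread_large: small batches give noisier statistics.
```

Because the normalization uses these noisy statistics, each example is perturbed slightly differently depending on which mini-batch it lands in, which is the source of the Dropout-like noise.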

    5. Softmax regression


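    For a binary classification problem, the output layer has a single unit whose output ŷ is the predicted probability of the positive class, P(y=1|x). If ŷ > 0.5 the example is classified as positive; if ŷ < 0.5 it is classified as negative.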
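
For more than two classes, softmax regression replaces the single output unit with one unit per class. A minimal sketch in plain Python, with the helper names `softmax`, `sigmoid`, and `predict_binary` chosen for illustration:

```python
import math


def softmax(logits):
    """Map C real-valued scores to C probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(z, threshold=0.5):
    """Binary decision rule: positive class (1) when the predicted probability exceeds 0.5."""
    return 1 if sigmoid(z) > threshold else 0


probs = softmax([2.0, 1.0, 0.1])  # three-class example
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

With two classes and logits `[z, 0]`, the first softmax probability equals `sigmoid(z)`, so the binary case is a special case of softmax.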


  • Caffe
  • Keras
  • CNTK
  • DL4J
  • Lasagne
  • MXNet
  • PaddlePaddle
  • TensorFlow
  • Theano
  • Torch


  • ease of programming
  • running speed
  • truly open


  • create placeholder
    import tensorflow as tf  # this code uses the TensorFlow 1.x API

    def create_placeholders(n_x, n_y):
        """
        Creates the placeholders for the tensorflow session.

        Arguments:
        n_x -- scalar, size of an image vector (num_px * num_px * 3 = 64 * 64 * 3 = 12288)
        n_y -- scalar, number of classes (from 0 to 5, so -> 6)

        Returns:
        X -- placeholder for the data input, of shape [n_x, None] and dtype "float"
        Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float"

        Tips:
        - None is used for the second dimension so that the number of examples stays flexible.
          In fact, the number of examples during test/train is different.
        """
        ### START CODE HERE ### (approx. 2 lines)
        X = tf.placeholder(tf.float32, [n_x, None], name="X")
        Y = tf.placeholder(tf.float32, [n_y, None], name="Y")
        ### END CODE HERE ###
        return X, Y
  • initializing the parameters
    # GRADED FUNCTION: initialize_parameters
    def initialize_parameters():
        """
        Initializes parameters to build a neural network with tensorflow. The shapes are:
            W1 : [25, 12288]
            b1 : [25, 1]
            W2 : [12, 25]
            b2 : [12, 1]
            W3 : [6, 12]
            b3 : [6, 1]

        Returns:
        parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3
        """
        tf.set_random_seed(1)  # so that your "random" numbers match ours

        ### START CODE HERE ### (approx. 6 lines of code)
        W1 = tf.get_variable("W1", [25, 12288], initializer=tf.contrib.layers.xavier_initializer(seed=1))
        b1 = tf.get_variable("b1", [25, 1], initializer=tf.zeros_initializer())
        W2 = tf.get_variable("W2", [12, 25], initializer=tf.contrib.layers.xavier_initializer(seed=1))
        b2 = tf.get_variable("b2", [12, 1], initializer=tf.zeros_initializer())
        W3 = tf.get_variable("W3", [6, 12], initializer=tf.contrib.layers.xavier_initializer(seed=1))
        b3 = tf.get_variable("b3", [6, 1], initializer=tf.zeros_initializer())
        ### END CODE HERE ###

        parameters = {"W1": W1, "b1": b1,
                      "W2": W2, "b2": b2,
                      "W3": W3, "b3": b3}
        return parameters
  • forward propagation in tensorflow
    # GRADED FUNCTION: forward_propagation
    def forward_propagation(X, parameters):
        """
        Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX

        Arguments:
        X -- input dataset placeholder, of shape (input size, number of examples)
        parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
                      the shapes are given in initialize_parameters

        Returns:
        Z3 -- the output of the last LINEAR unit
        """
        # Retrieve the parameters from the dictionary "parameters"
        W1 = parameters['W1']
        b1 = parameters['b1']
        W2 = parameters['W2']
        b2 = parameters['b2']
        W3 = parameters['W3']
        b3 = parameters['b3']

        ### START CODE HERE ### (approx. 5 lines)   # Numpy equivalents:
        Z1 = tf.add(tf.matmul(W1, X), b1)            # Z1 = np.dot(W1, X) + b1
        A1 = tf.nn.relu(Z1)                          # A1 = relu(Z1)
        Z2 = tf.add(tf.matmul(W2, A1), b2)           # Z2 = np.dot(W2, A1) + b2
        A2 = tf.nn.relu(Z2)                          # A2 = relu(Z2)
        Z3 = tf.add(tf.matmul(W3, A2), b3)           # Z3 = np.dot(W3, A2) + b3
        ### END CODE HERE ###
        return Z3
  • compute cost
    # GRADED FUNCTION: compute_cost
    def compute_cost(Z3, Y):
        """
        Computes the cost

        Arguments:
        Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
        Y -- "true" labels vector placeholder, same shape as Z3

        Returns:
        cost -- Tensor of the cost function
        """
        # tf.nn.softmax_cross_entropy_with_logits expects shape (number of examples, num_classes),
        # so both tensors are transposed first
        logits = tf.transpose(Z3)
        labels = tf.transpose(Y)

        ### START CODE HERE ### (1 line of code)
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
        ### END CODE HERE ###
        return cost
  • backward propagation & parameter updates
    # For instance, for gradient descent the optimizer would be:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
    # To run one optimization step on a mini-batch you would do:
    _, c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
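
To see what the cost minimized by this optimizer represents, the quantity built in `compute_cost` can be sketched in plain Python without TensorFlow. This is an illustrative re-derivation under stated assumptions (nested lists of shape (number of classes, number of examples), one-hot label columns, hypothetical function name), not the library's implementation.

```python
import math


def softmax_cross_entropy_cost(Z, Y):
    """Mean softmax cross-entropy; Z and Y are nested lists of shape (n_classes, n_examples)."""
    n_classes, n_examples = len(Z), len(Z[0])
    total = 0.0
    for j in range(n_examples):
        # Column j holds one example, which is why compute_cost transposes before the TF call.
        logits = [Z[i][j] for i in range(n_classes)]
        labels = [Y[i][j] for i in range(n_classes)]
        shift = max(logits)  # log-sum-exp trick for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
        # Cross-entropy: minus the sum over classes of y_i * log(softmax_i)
        total += -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))
    return total / n_examples


# Two examples, three classes, all logits equal: each class gets probability 1/3.
Z_demo = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
Y_demo = [[1, 0], [0, 1], [0, 0]]
cost_demo = softmax_cross_entropy_cost(Z_demo, Y_demo)  # equals log(3)
```

Gradient descent then lowers this value by pushing the logit of the correct class up relative to the others.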