一、调试处理
二、 Using an appropriate scale to pick hyperparameters
- 非均匀刻度尺
三、 Batch Normalization
- 标准化输入
- batch norm
四、 Why does Batch Norm work
五、 softmax回归
- 介绍
- 损失函数
六、深度学习框架
七、编程作业

一、调试处理

超参数

α：学习因子
β：动量梯度下降因子
β1,β2,ε：Adam算法参数
layers：神经网络层数
hidden units：各隐藏层神经元个数
learning rate decay：学习因子下降参数
mini-batch size：批量训练样本包含的样本个数

重要性

通常来说，学习因子αα是最重要的超参数，也是需要重点调试的超参数。动量梯度下降因子β、各隐藏层神经元个数#hidden units和mini-batch size的重要性仅次于αα。然后就是神经网络层数#layers和学习因子下降参数learning rate decay。最后，Adam算法的三个参数β1,β2,ε一般常设置为0.9，0.999和10−810−8，不需要反复调试。

调整

传统的机器学习，对每个超参数等间隔选取，然后，分别使用不同点对应的参数组合进行训练，最后根据验证集上的表现好坏，来选定最佳的参数
2-3超参数调试、正则化以及优化 - 图1

深度网络模型中，一般随机选取。因为每个参数的重要程度不同。例如上面对每个参数只有5种取值，下面这个则有25种情况。
2-3超参数调试、正则化以及优化 - 图2

二、 Using an appropriate scale to pick hyperparameters

非均匀刻度尺

2-3超参数调试、正则化以及优化 - 图3
α、β
关于β的说明：假设β从0.9000变化为0.9005，那么1/1−β基本没有变化。但假设ββ从0.9990变化为0.9995，那么1/1−β前后差别1000。β越接近1，指数加权平均的个数越多，变化越大。所以对β接近1的区间，应该采集得更密集一些。

三、 Batch Normalization

标准化输入

在训练神经网络时，标准化输入可以提高训练的速度。方法是对训练数据集进行归一化的操作，即将原始数据减去其均值μ后，再除以其方差σ2。

batch norm

2-3超参数调试、正则化以及优化 - 图4
一般进一步处理
2-3超参数调试、正则化以及优化 - 图5
Normalizing inputs使所有输入的均值为0，方差为1。而Batch Normalization可使各隐藏层输入的均值和方差为任意值。实际上，从激活函数的角度来说，如果各隐藏层的输入均值在靠近0的区域即处于激活函数的线性区域，这样不利于训练好的非线性神经网络，得到的模型效果也不会太好。

四、 Why does Batch Norm work

如果发生covariate shift，因为batch norm的作用，隐藏层的输出均值和方差仍不变，则其他参数会更加稳定，模型在其他样本上也会有不错的表现。
从另一个方面来说，Batch Norm也起到轻微的正则化（regularization）效果。具体表现在：

每个mini-batch都进行均值为0，方差为1的归一化操作
每个mini-batch中，对各个隐藏层的Z[l]添加了随机噪声，效果类似于Dropout
mini-batch越小，正则化效果越明显
五、 softmax回归
介绍
对于二分类问题，网络的输出层只有一个神经单元，输出值表示预测输出为正类的概率，则判断为正类，则判断为负类。
对于多分类问题，用C表示种类个数，神经网络中输出层就有C个神经元，其中每个神经元的输出依次对应属于该类的概率。为了处理多分类问题，我们一般使用Softmax回归模型。其激活函数如下：

损失函数

六、深度学习框架

Caffe
Keras
CNTK
DL4J
Lasagne
mxnet
PaddlePaddle
Tensorflow
Theano
Torch

选择框架的标准：

ease of programming
running speed
truly open

七、编程作业

create placeholder

def create_placeholders(n_x, n_y):
    """
    Creates the placeholders for the tensorflow session.
    Arguments:
    n_x -- scalar, size of an image vector (num_px * num_px = 64 * 64 * 3 = 12288)
    n_y -- scalar, number of classes (from 0 to 5, so -> 6)
    Returns:
    X -- placeholder for the data input, of shape [n_x, None] and dtype "float"
    Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float"
    Tips:
    - You will use None because it let's us be flexible on the number of examples you will for the placeholders.
      In fact, the number of examples during test/train is different.
    """
    ### START CODE HERE ### (approx. 2 lines)
    X = tf.placeholder(tf.float32, [n_x, None], name="X")
    Y = tf.placeholder(tf.float32, [n_y, None], name='Y')
    ### END CODE HERE ###
    return X, Y

initializing the parameters

# GRADED FUNCTION: initialize_parameters
def initialize_parameters():
    """
    Initializes parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [25, 12288]
                        b1 : [25, 1]
                        W2 : [12, 25]
                        b2 : [12, 1]
                        W3 : [6, 12]
                        b3 : [6, 1]
    Returns:
    parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3
    """
    tf.set_random_seed(1)                   # so that your "random" numbers match ours
    ### START CODE HERE ### (approx. 6 lines of code)
    W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b1 = tf.get_variable("b1",[25,1],initializer=tf.zeros_initializer())
    W2 = tf.get_variable("W2", [12, 25], initializer = tf.contrib.layers.xavier_initializer(seed=1))
    b2 = tf.get_variable("b2", [12, 1], initializer = tf.zeros_initializer())
    W3 = tf.get_variable("W3", [6, 12], initializer = tf.contrib.layers.xavier_initializer(seed=1))
    b3 = tf.get_variable("b3", [6, 1], initializer = tf.zeros_initializer())
    ### END CODE HERE ###
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    return parameters

forward propagation in tensorflow

# GRADED FUNCTION: forward_propagation
def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
                  the shapes are given in initialize_parameters
    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']
    ### START CODE HERE ### (approx. 5 lines)              # Numpy Equivalents:
    Z1 = tf.add(tf.matmul(W1,X),b1)                                              # Z1 = np.dot(W1, X) + b1
    A1 = tf.nn.relu(Z1)                                              # A1 = relu(Z1)
    Z2 = tf.add(tf.matmul(W2,A1),b2)                                             # Z2 = np.dot(W2, a1) + b2s
    A2 = tf.nn.relu(Z2)                                              # A2 = relu(Z2)
    Z3 = tf.add(tf.matmul(W3,A2),b3)                                              # Z3 = np.dot(W3,Z2) + b3
    ### END CODE HERE ###
    return Z3

compute cost

# GRADED FUNCTION: compute_cost 
def compute_cost(Z3, Y):
    """
    Computes the cost
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    Returns:
    cost - Tensor of the cost function
    """
    # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)
    ### START CODE HERE ### (1 line of code)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=labels))
    ### END CODE HERE ###
    return cost

backward propagation & parameter updates

#For instance, for gradient descent the optimizer would be:
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)
#To make the optimization you would do:
_ , c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})

2-3超参数调试、正则化以及优化