- Non-linear Relationships
- Neural Networks: Bio-Inspired
- MLP Neural Network: Key Idea
- Notation
- Why use a non-linearity?
- Other non-linearities
- Notation
- Hidden Dimensions
- Network with 2 ReLU Neurons
- Network with 50 ReLU Neurons
- Recap
- Recap: Gradient Descent
- Summary
Non-linear Relationships
• What do we do if the target does not have a linear relationship with the input?
• We need to fit a non-linear function to our data.
Neural Networks: Bio-Inspired
Initially inspired by models of the brain.
We keep some of the ideas and terminology, but it is not a current model of the brain.
A better way to think of modern neural networks: differentiable vector cascades.
MLP Neural Network: Key Idea
• MLP = Multi-Layer Perceptron (a simple type of neural network, also called a feedforward network)
– Stack multiple linear models on top of each other.
– The output of one model, after a non-linearity, is used as the input to the next model (sketched in code below).
• Note: superscript [#] notation denotes the layer number.
• The intermediate models learn to output a 'useful representation' of the input.
• The final model still learns to predict the desired output.
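A minimal sketch of this stacking idea in NumPy; the layer sizes, weights, and the choice of ReLU as the non-linearity are illustrative, not taken from the slides:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                        # input features

# Layer 1: a linear model followed by a non-linearity
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
h1 = relu(W1 @ x + b1)                        # 'useful representation' of x

# Layer 2: another linear model, taking h1 (not x) as its input
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y_hat = W2 @ h1 + b2                          # final prediction
```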
Notation
Sometimes written without the bias, since we can absorb it into W by adding an extra dimension with value 1 to the x vector.
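Written out (the tilde notation for the augmented matrix and vector is my own):

```latex
Wx + b
= \begin{bmatrix} W & b \end{bmatrix}
  \begin{bmatrix} x \\ 1 \end{bmatrix}
= \tilde{W}\tilde{x},
\qquad
\tilde{W} = \begin{bmatrix} W & b \end{bmatrix},\quad
\tilde{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}.
```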
Why use a non-linearity?
If you compose two linear functions, you just get another linear function: we are still fitting a straight line.
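Concretely, using the superscript [#] layer notation from above:

```latex
W^{[2]}\big(W^{[1]}x + b^{[1]}\big) + b^{[2]}
= \big(W^{[2]}W^{[1]}\big)x + \big(W^{[2]}b^{[1]} + b^{[2]}\big),
```

which is again a single linear function of x, with weight matrix W^{[2]}W^{[1]} and bias W^{[2]}b^{[1]} + b^{[2]}.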
Another approach:
σ is some function that is non-linear and (mostly) differentiable.
σ is called the 'activation function'.
Rectified Linear Unit (ReLU)
• In principle we can use any non-linear, (mostly) differentiable function as the activation function.
• Commonly used in practice: ReLU(z) = max(0, z).
• Fast to compute.
• Works well in practice.
Other non-linearities
Sigmoid was popular for a long period of time.
Tanh was also often used.
Various smooth approximations to ReLU are common (e.g. softplus; sketched in code below).
For the output layer:
– For classification, use softmax.
– For regression, use a linear last layer.
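A quick NumPy sketch of these functions (the function names are mine, and softplus is used as one common smooth approximation to ReLU):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softplus(z):                 # one smooth approximation to ReLU
    return np.log1p(np.exp(z))

def softmax(z):                  # output layer for classification
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()
```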
Notation
For an MLP we typically think of each composition of an activation function with a linear function as a 'layer'.
• The input layer is the first layer.
• The output layer is the last layer.
• All other layers are 'hidden layers'.
In general:
Almost anything, regardless of complexity, can be called a layer, much like almost anything can be called a function. The term is used to conceptually distinguish between different components.
Hidden Dimensions
• The size of the input layer is defined by the number of input features.
• The size of the output layer is defined by the number of output targets.
• The size of the hidden layers can be anything; this is a hyper-parameter for you to choose (see the shape sketch below).
• The dimension of the output of a hidden layer is called the 'number of hidden neurons' in that layer.
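As an illustration of which sizes are fixed and which are free, for a one-hidden-layer MLP (all of the numbers here are made up):

```python
import numpy as np

n_features = 8    # fixed by the data: size of the input layer
n_targets = 1     # fixed by the task: size of the output layer
n_hidden = 50     # free to choose: a hyper-parameter

W1 = np.zeros((n_hidden, n_features))   # input layer -> hidden layer
W2 = np.zeros((n_targets, n_hidden))    # hidden layer -> output layer
```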
Network with 2 ReLU Neurons
• What does it look like if we fit a neural network with one layer of 2 hidden neurons with ReLU activations?
• Each hidden neuron learns a linear function of the input and then applies ReLU.
• The output is then a linear function of the hidden outputs.
• The resulting graph is equivalent to scaling, shifting and flipping two ReLU graphs and then adding them (sketched in code below).
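A small NumPy sketch of that picture for a scalar input x; the weight values are arbitrary examples:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def two_neuron_net(x, w1=(1.0, -2.0), b1=(0.0, 1.0), w2=(1.5, -0.5), b2=0.2):
    # Each hidden neuron: a linear function of x, then ReLU
    h = [relu(w1[0] * x + b1[0]), relu(w1[1] * x + b1[1])]
    # Output: a linear function of the two hidden outputs, i.e. a sum of
    # scaled / shifted / flipped ReLU graphs plus a constant
    return w2[0] * h[0] + w2[1] * h[1] + b2
```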
Network with 50 ReLU Neurons
• Increasing the number of hidden neurons means we add together more (scaled) versions of the activation function.
• As we add more neurons we can approximate more complicated functions.
• As we add more neurons, our model also becomes more capable of fitting noise in the training dataset: overfitting.
Recap
Multinomial Logistic Regression (Classification)
Multi-Layer Perceptron
This MLP has 2 hidden layers, with 4 and 3 neurons.
Feature Learning
Neural nets can be viewed as a way of learning features:
Recap: Gradient Descent
A simple linear regression model with a non-linearity and square loss:
For gradient descent we need numerical values of the following derivatives at the current w and b:
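Assuming the model is h = σ(wx + b) with square loss L = (h − y)², as written out in the backpropagation slides below, these derivatives are:

```latex
\frac{\partial L}{\partial w} = 2(h - y)\,\sigma'(wx + b)\,x,
\qquad
\frac{\partial L}{\partial b} = 2(h - y)\,\sigma'(wx + b).
```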
Backpropagation
We need the derivative of the loss with respect to the parameters. We want a systematic way of doing this:
- Forwards pass: compute the loss and store the output of each layer.
- Backwards pass: compute the derivatives layer by layer using the chain rule and the results stored from the forward pass.
Chain Rule
Given the expression:
We have the derivative:
When a function has two parameters, we will need to use the chain rule in this form:
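Spelled out in the notation used here (the symbols f, g and z are my own): for a composition L = f(g(w)), the chain rule gives dL/dw = f′(g(w)) g′(w). When an intermediate variable z depends on two parameters w and b, it reads:

```latex
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial w},
\qquad
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial b}.
```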
Backpropagation
Full model with loss function: L = (σ(wx + b) − y)²
We can introduce intermediate variables z and h to give: z = wx + b, h = σ(z), L = (h − y)².
Backpropagation: Forward Pass
Compute and store z, h, L.
Backpropagation: Backward Pass
Compute the derivative of the loss with respect to h.
Note that both h and y are known (h is stored during the forward pass, y is the ground truth).
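A minimal NumPy sketch of the full forward and backward pass for this model, assuming σ is the sigmoid and using arbitrary example numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.5, 0.1          # current parameters
x, y = 2.0, 1.0          # one training example

# Forward pass: compute and store z, h, L
z = w * x + b
h = sigmoid(z)
L = (h - y) ** 2

# Backward pass: chain rule, reusing the stored values
dL_dh = 2 * (h - y)          # derivative of the loss with respect to h
dh_dz = h * (1 - h)          # sigmoid'(z), written in terms of h
dL_dz = dL_dh * dh_dz
dL_dw = dL_dz * x            # since dz/dw = x
dL_db = dL_dz * 1.0          # since dz/db = 1
```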
Using Backpropagation Algorithm
Computation Graph
• We can diagram out the computations using a computation graph.
• The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.
Backpropagation Algorithm
Full backpropagation (BP) algorithm:
Let v_1, …, v_N be a topological ordering of the computation graph (i.e. parents come before children).
v_N denotes the variable we're trying to compute derivatives of (e.g. the loss).
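A hedged Python sketch of this recursion; children and local_derivative are hypothetical helpers standing in for the graph structure and the per-edge derivatives:

```python
def backprop(nodes, children, local_derivative):
    """nodes: topological order v_1, ..., v_N with the loss v_N last;
    children(v) returns the nodes computed directly from v;
    local_derivative(c, v) returns d c / d v (both are assumed helpers)."""
    grad = {nodes[-1]: 1.0}                      # dL/dL = 1
    for v in reversed(nodes[:-1]):
        # Chain rule: sum the contributions flowing back from every child
        grad[v] = sum(grad[c] * local_derivative(c, v) for c in children(v))
    return grad                                  # grad[v] = dL/dv
```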
Automatic Differentiation: Pytorch
Calculus is hard! Use autograd tools such as PyTorch (more in next week's lab), TensorFlow, or JAX:
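For example, the toy model above with PyTorch's autograd doing the backward pass (a sketch, not the lab code):

```python
import torch

x, y = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

h = torch.sigmoid(w * x + b)      # forward pass builds the computation graph
loss = (h - y) ** 2
loss.backward()                   # backward pass fills w.grad and b.grad
print(w.grad, b.grad)
```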
Pytorch: Neural Network
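A minimal sketch of what such a network can look like in PyTorch, with sizes chosen to match the recap architecture above (2 hidden layers with 4 and 3 neurons); the input and output sizes are illustrative:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 4),   # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(4, 3),   # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(3, 1),   # second hidden layer -> output layer
)
```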
Stochastic Gradient Descent
• For standard GD, the gradients are computed on the loss over the entire training dataset.
– This is slow if the dataset is very large.
• For SGD, at each update we randomly sample a batch of data points and compute gradients only of the loss on these points (see the loop sketched below).
– The batch size is a hyper-parameter.
– It is usually one of 32, 64, …, 512.
• SGD is much faster to compute per step.
• SGD also acts as a regularizer: models trained with SGD usually generalize better.
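A minimal sketch of an SGD training loop in PyTorch; the data, model sizes, batch size and learning rate are all placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model (shapes are illustrative)
X, Y = torch.randn(1000, 2), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(10):
    for xb, yb in loader:              # a randomly sampled batch of points
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)  # loss on this batch only
        loss.backward()                # gradients via backpropagation
        optimizer.step()               # SGD parameter update
```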
Summary
• A neural network is a stack of linear and non-linear functions
• Neural networks build up intermediate feature representations
• Neural networks are commonly trained using backpropagation and SGD