Backpropagation Algorithm
In detail, here is the backpropagation algorithm:
:1. Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and so on up to the output layer <math>L_{n_l}</math>.
:2. For each output unit <math>i</math> in layer <math>n_l</math> (the output layer), set
::<math>\begin{align}
\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \; \frac{1}{2} \left\| y - h_{W,b}(x) \right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)
\end{align}</math>
:3. For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math>
::For each node <math>i</math> in layer <math>l</math>, set
:::<math>
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)
</math>
:4. Compute the desired partial derivatives, which are given as:
::<math>\begin{align}
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.
\end{align}</math>
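
To make the indexing in step 3 concrete, here is a minimal NumPy sketch of the hidden-layer error term with explicit loops; the function and variable names are illustrative rather than part of the tutorial's code. Note that the inner sum runs over <math>W^{(l)}_{ji}</math>, i.e. over the ''rows'' of <math>W^{(l)}</math>:

<pre>
import numpy as np

def hidden_layer_delta(W_l, delta_next, fprime_l):
    # delta^{(l)}_i = ( sum_j W^{(l)}_{ji} * delta^{(l+1)}_j ) * f'(z^{(l)}_i)
    # W_l[j, i] plays the role of W^{(l)}_{ji}, delta_next[j] of delta^{(l+1)}_j,
    # and fprime_l[i] of f'(z^{(l)}_i).
    s_next, s_l = W_l.shape
    delta = np.zeros(s_l)
    for i in range(s_l):
        total = 0.0
        for j in range(s_next):
            total += W_l[j, i] * delta_next[j]
        delta[i] = total * fprime_l[i]
    return delta
</pre>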
+ | |||
+ | Finally, we can also re-write | ||
+ | the algorithm using matrix-vectorial notation. | ||
+ | We will use "<math>\textstyle \bullet</math>" to denote the element-wise product | ||
+ | operator (denoted ``{\tt .*}'' in Matlab or Octave, and also called the Hadamard product), | ||
+ | so | ||
+ | that if <math>\textstyle a = b \bullet c</math>, then <math>\textstyle a_i = b_ic_i</math>. | ||
+ | Similar to how we extended the definition of <math>\textstyle f(\cdot)</math> to apply element-wise to vectors, we also | ||
+ | do the same for <math>\textstyle f'(\cdot)</math> (so that <math>\textstyle f'([z_1, z_2, z_3]) = | ||
+ | [\frac{\partial}{\partial z_1} f(z_1), | ||
+ | \frac{\partial}{\partial z_2} f(z_2), | ||
+ | \frac{\partial}{\partial z_3} f(z_3)]</math>). | ||
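
For example, in NumPy (used here only for illustration) the element-wise product corresponds to the ordinary <tt>*</tt> operator on arrays:

<pre>
import numpy as np

b = np.array([1.0, 2.0, 3.0])
c = np.array([4.0, 5.0, 6.0])
a = b * c    # element-wise (Hadamard) product: a_i = b_i * c_i, giving [4., 10., 18.]
</pre>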
The algorithm can then be written:
: 1. Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using the equations defining the forward propagation steps.
: 2. For the output layer (layer <math>\textstyle n_l</math>), set
::<math>\begin{align}
\delta^{(n_l)}
= - (y - a^{(n_l)}) \bullet f'(z^{(n_l)})
\end{align}</math>
: 3. For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math>
::Set
:::<math>\begin{align}
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})
\end{align}</math>
: 4. Compute the desired partial derivatives:
::<math>\begin{align}
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.
\end{align}</math>
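
To make the matrix-vector form concrete, here is a minimal NumPy sketch of steps 1-4 for a single training example <math>\textstyle (x, y)</math>, assuming a sigmoid activation function <math>\textstyle f</math>. The function and variable names are illustrative rather than part of the tutorial's code, and <tt>W[l]</tt>, <tt>b[l]</tt> stand for <math>\textstyle W^{(l+1)}</math>, <math>\textstyle b^{(l+1)}</math> because Python lists are 0-indexed:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_example(W, b, x, y):
    # Step 1: feedforward pass, caching the activations of every layer.
    a = [x]
    for W_l, b_l in zip(W, b):
        a.append(sigmoid(W_l @ a[-1] + b_l))

    # Step 2: output-layer error term,
    # delta = -(y - a) .* f'(z), using f'(z) = a .* (1 - a) for the sigmoid.
    delta = -(y - a[-1]) * a[-1] * (1.0 - a[-1])

    grad_W = [None] * len(W)
    grad_b = [None] * len(W)
    # Step 4 for the topmost weight matrix and bias vector.
    grad_W[-1] = np.outer(delta, a[-2])
    grad_b[-1] = delta

    # Steps 3 and 4, working backwards through the hidden layers.
    for l in range(len(W) - 2, -1, -1):
        # Step 3: delta = (W^T delta_next) .* f'(z), with indices shifted by
        # one relative to the text because of the 0-indexed lists.
        delta = (W[l + 1].T @ delta) * a[l + 1] * (1.0 - a[l + 1])
        # Step 4: nabla_W = delta (a)^T and nabla_b = delta for this layer.
        grad_W[l] = np.outer(delta, a[l])
        grad_b[l] = delta
    return grad_W, grad_b
</pre>

Note that the activations cached in the forward pass are reused to evaluate <math>\textstyle f'(z^{(l)})</math>, which is exactly the implementation note below.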
+ | |||
+ | <br> | ||
+ | '''Implementation note:''' In steps 2 and 3 above, we need to compute <math>\textstyle f'(z^{(l)}_i)</math> for each value of <math>\textstyle i</math>. | ||
+ | Assuming <math>\textstyle f(z)</math> is the sigmoid activation function, we would already have <math>\textstyle a^{(l)}_i</math> stored away from the | ||
+ | forward pass through the network. Thus, using the expression that we worked out earlier for <math>\textstyle f'(z)</math>, | ||
+ | we can compute this as <math>\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)</math>. | ||
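
In NumPy-style code this is a single element-wise expression on the stored activations (with <tt>a_l</tt> standing for <math>\textstyle a^{(l)}</math>; the name is illustrative):

<pre>
import numpy as np

a_l = np.array([0.2, 0.7, 0.9])   # example stored activations a^{(l)}
fprime_l = a_l * (1.0 - a_l)      # f'(z^{(l)}_i) = a^{(l)}_i (1 - a^{(l)}_i)
</pre>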
<br>
+ | |||
+ | |||
+ | Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code | ||
+ | below, <math>\textstyle \Delta W^{(l)}</math> is a matrix (of the same dimension as <math>\textstyle W^{(l)}</math>), and | ||
+ | <math>\textstyle \Delta b^{(l)}</math> is a vector (of the same dimension as <math>\textstyle b^{(l)}</math>). Note that in this notation, | ||
+ | ``<math>\textstyle \Delta W^{(l)}</math>'' is a matrix, and in particular it isn't ``<math>\textstyle \Delta</math> times <math>\textstyle W^{(l)}</math>.'' | ||
+ | We implement one iteration of batch gradient descent as follows: | ||
+ | |||
: 1. Set <math>\textstyle \Delta W^{(l)} := 0</math>, <math>\textstyle \Delta b^{(l)} := 0</math> (matrix/vector of zeros) for all <math>\textstyle l</math>.
: 2. For <math>\textstyle i = 1</math> to <math>\textstyle m</math>,
:: 2a. Use backpropagation to compute <math>\textstyle \nabla_{W^{(l)}} J(W,b;x^{(i)},y^{(i)})</math> and <math>\textstyle \nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})</math>.
:: 2b. Set <math>\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x^{(i)},y^{(i)})</math>.
:: 2c. Set <math>\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})</math>.
: 3. Update the parameters:
:<math>\begin{align}
W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\
b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
\end{align}</math>
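
As an illustration, one iteration of this procedure might look as follows in NumPy, reusing the hypothetical <tt>backprop_single_example()</tt> from the sketch above; <tt>alpha</tt> and <tt>lam</tt> stand for the learning rate <math>\textstyle \alpha</math> and the weight decay parameter <math>\textstyle \lambda</math>, and the names are illustrative:

<pre>
import numpy as np

def gradient_descent_step(W, b, X, Y, alpha, lam):
    m = len(X)
    # Step 1: initialize the accumulators Delta W^{(l)} and Delta b^{(l)} to zero.
    DW = [np.zeros_like(W_l) for W_l in W]
    Db = [np.zeros_like(b_l) for b_l in b]
    # Step 2: accumulate the per-example gradients computed by backpropagation.
    for x, y in zip(X, Y):
        grad_W, grad_b = backprop_single_example(W, b, x, y)
        for l in range(len(W)):
            DW[l] += grad_W[l]
            Db[l] += grad_b[l]
    # Step 3: update the parameters; weight decay is applied to W but not to b.
    for l in range(len(W)):
        W[l] = W[l] - alpha * (DW[l] / m + lam * W[l])
        b[l] = b[l] - alpha * (Db[l] / m)
    return W, b
</pre>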
+ | |||
+ | To train our neural network, we can now repeatedly take steps of gradient | ||
+ | descent to reduce our cost function <math>\textstyle J(W,b)</math>. |
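
For instance, repeated steps with the hypothetical <tt>gradient_descent_step()</tt> above would look like the following, assuming <tt>W</tt> and <tt>b</tt> have been randomly initialized and <tt>X</tt>, <tt>Y</tt> hold the training set:

<pre>
# Repeatedly take batch gradient descent steps to reduce J(W, b).
for iteration in range(1000):
    W, b = gradient_descent_step(W, b, X, Y, alpha=0.1, lam=1e-4)
</pre>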