Gradient checking and advanced optimization
Backpropagation is a notoriously difficult algorithm to debug and get right,
especially since many subtly buggy implementations of it (for example, one
that has an off-by-one error in the indices and thus only trains some of
the layers of weights, or an implementation that omits the bias term) will
manage to learn something that can look surprisingly reasonable
(while performing less well than a correct implementation). Thus, even with a
buggy implementation, it may not at all be apparent that anything is amiss.
<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.
(There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but
we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>,
as that would lead to numerical roundoff errors.)
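As a concrete illustration (not part of the original notes), the following Python sketch compares a claimed derivative <math>\textstyle g(\theta)</math> against the two-sided approximation for a simple one-dimensional objective; the particular function <math>\textstyle J</math> used here is just an illustrative stand-in.

<pre>
EPSILON = 1e-4  # small, but not so small that floating-point roundoff dominates

def J(theta):
    # Illustrative objective; in the exercise this would be the network's cost.
    return theta ** 3 + 2.0 * theta

def g(theta):
    # Claimed analytic derivative of J, i.e. what backpropagation would return.
    return 3.0 * theta ** 2 + 2.0

theta = 1.5
numeric = (J(theta + EPSILON) - J(theta - EPSILON)) / (2.0 * EPSILON)
print(g(theta), numeric)  # the two numbers should agree to several significant digits
</pre>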
and "0"s everywhere else). So, | and "0"s everywhere else). So, | ||
<math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented | <math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented | ||
- | by {\rm EPSILON}. Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the | + | by <math>{\rm EPSILON}</math>. Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the |
- | corresponding vector with the <math>\textstyle i</math>-th element decreased by {\rm EPSILON}. | + | corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>. |
We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>,
that:
<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
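The sketch below (Python/NumPy, again not part of the original notes) implements this coordinate-by-coordinate check; the names <math>\textstyle J</math> and <math>\textstyle grad</math> and the example objective are assumptions used only for illustration.

<pre>
import numpy as np

EPSILON = 1e-4

def check_gradient(J, grad, theta):
    # Compare the analytic gradient grad(theta) against two-sided finite
    # differences, perturbing one coordinate (one basis vector e_i) at a time.
    analytic = grad(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0
        theta_plus = theta + EPSILON * e_i   # theta^(i+)
        theta_minus = theta - EPSILON * e_i  # theta^(i-)
        numeric[i] = (J(theta_plus) - J(theta_minus)) / (2.0 * EPSILON)
    # Relative difference between the two gradients; for a correct
    # implementation this is typically far below 1e-6.
    return np.linalg.norm(analytic - numeric) / np.linalg.norm(analytic + numeric)

# Example: J(theta) = ||theta||^2 has gradient 2*theta, so the check passes.
theta0 = np.random.randn(5)
print(check_gradient(lambda t: np.dot(t, t), lambda t: 2.0 * t, theta0))
</pre>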
<math>\begin{align}
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed
implementing gradient descent.
To make sure your implementation of gradient descent is correct, it is
a good idea to use the method described above to numerically check the derivatives computed by backpropagation.
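One practical way to apply the check to these <math>\textstyle W^{(l)}</math> and <math>\textstyle b^{(l)}</math> gradients is to unroll all parameters into a single vector <math>\textstyle \theta</math> and reuse the same checking routine; the following Python/NumPy sketch (with illustrative shapes for a single layer, not taken from the original notes) shows the unrolling step.

<pre>
import numpy as np

# Illustrative shapes for one layer; a real network would concatenate the
# (W, b) pair of every layer in the same way.
W_SHAPE, B_SHAPE = (3, 4), (3,)

def pack(W, b):
    # Unroll the parameters into a single vector theta, so that the cost and
    # its gradient become functions of one vector and can be gradient-checked.
    return np.concatenate([W.ravel(), b.ravel()])

def unpack(theta):
    n_w = W_SHAPE[0] * W_SHAPE[1]
    return theta[:n_w].reshape(W_SHAPE), theta[n_w:].reshape(B_SHAPE)

# The checker from the previous sketch can then be applied to
#   J(theta) = cost(*unpack(theta))
# and to the packed vector of (1/m) * Delta_W and (1/m) * Delta_b values
# accumulated by backpropagation.
</pre>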
Hessian matrix, so that it can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of these
algorithms is beyond the scope of these notes, but one example is
the '''L-BFGS''' algorithm. (Another example is the '''conjugate gradient''' algorithm.) You will use one of
these algorithms in the programming exercise.
The main thing you need to provide to these advanced optimization algorithms is the ability, for any <math>\textstyle \theta</math>,
to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>. These optimization algorithms then use that information
to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>. Algorithms
such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
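As an illustration of this contract (the programming exercise supplies its own optimizer, so this is only an analogy), the sketch below uses SciPy's off-the-shelf L-BFGS implementation with a single function that returns both <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>; the quadratic objective is a stand-in for the real cost.

<pre>
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Stand-in objective J(theta) = ||theta - 1||^2 with gradient 2*(theta - 1).
    # In practice this would compute the network's cost and the gradient
    # produced by backpropagation.
    residual = theta - 1.0
    return np.dot(residual, residual), 2.0 * residual

theta0 = np.zeros(10)
# jac=True tells the optimizer that cost_and_grad returns (cost, gradient);
# method="CG" would select conjugate gradient instead.
result = minimize(cost_and_grad, theta0, jac=True, method="L-BFGS-B")
print(result.x)  # should be close to a vector of ones
</pre>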
+ | |||
+ | |||
+ | {{Sparse_Autoencoder}} | ||
+ | |||
+ | |||
+ | {{Languages|梯度检验与高级优化|中文}} |