Gradient Checking and Advanced Optimization

Initial translation: @pocketwalker

First review: 王方, email: fangkey@gmail.com, Sina Weibo: @GuitarFang

Second review: @大黄蜂的思索

Wiki uploader: 王方, email: fangkey@gmail.com, Sina Weibo: @GuitarFang

Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices, so that only some of the layers of weights get trained, or one that omits the bias term) will manage to learn something that can look surprisingly reasonable, while performing less well than a correct implementation.  Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss.  In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct.  Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>.  For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>.  In this one-dimensional case, one iteration of gradient descent is given by

:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>

Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>.  How can we check if our implementation of <math>\textstyle g</math> is correct?

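As a concrete illustration (not part of the original tutorial), here is a minimal Python sketch of gradient descent driven by a user-supplied routine <code>g</code> that purportedly computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>; the example cost function, the learning rate <code>alpha</code>, and the iteration count are arbitrary choices made for this illustration.

<pre>
def gradient_descent(g, theta0, alpha=0.01, num_iters=1000):
    # Repeatedly apply the update theta := theta - alpha * g(theta).
    theta = float(theta0)
    for _ in range(num_iters):
        theta = theta - alpha * g(theta)
    return theta

# Example: J(theta) = (theta - 3)^2, whose derivative is 2*(theta - 3).
g = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(g, theta0=0.0))   # converges toward 3.0
</pre>
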
Recall the mathematical definition of the derivative as

:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0} \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>

Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative as follows:

:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>

In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.  (There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>, as that would lead to numerical roundoff errors.)

Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness by checking that

:<math>\begin{align}
g(\theta) \approx \frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>

The degree to which these two values should approximate each other will depend on the details of <math>\textstyle J</math>.  But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).

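As an illustration, the following Python sketch (not part of the original text) performs this check for a scalar <math>\textstyle \theta</math>; the cost function <code>J</code> and the candidate derivative <code>g</code> are made-up examples.

<pre>
EPSILON = 1e-4   # the order of magnitude suggested above

def check_scalar_gradient(J, g, theta):
    # Two-sided numerical approximation of dJ/dtheta, compared with g(theta).
    numeric = (J(theta + EPSILON) - J(theta - EPSILON)) / (2.0 * EPSILON)
    return g(theta), numeric

# Example: J(theta) = theta^3, whose true derivative is 3*theta^2.
J = lambda theta: theta ** 3
g = lambda theta: 3.0 * theta ** 2   # candidate implementation of dJ/dtheta
print(check_scalar_gradient(J, g, theta=2.0))   # both values close to 12.0
</pre>
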
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>.  In our neural network example we used <math>\textstyle J(W,b)</math>, but one can imagine "unrolling" the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>.  We now generalize our derivative checking procedure to the case where <math>\textstyle \theta</math> may be a vector.

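For instance, one minimal (illustrative) way to "unroll" the parameters of a small network into a single vector and recover them again is sketched below in Python/NumPy; the layer shapes and helper names are arbitrary choices, not something prescribed by the tutorial.

<pre>
import numpy as np

def unroll(*params):
    # Flatten all parameter arrays into one long vector theta.
    return np.concatenate([p.ravel() for p in params])

def reroll(theta, shapes):
    # Recover the individual parameter arrays from theta, given their shapes.
    params, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(theta[offset:offset + size].reshape(shape))
        offset += size
    return params

# Example: parameters of a tiny 3-5-2 network packed into one vector theta.
W1, b1 = np.zeros((5, 3)), np.zeros(5)
W2, b2 = np.zeros((2, 5)), np.zeros(2)
theta = unroll(W1, b1, W2, b2)
W1r, b1r, W2r, b2r = reroll(theta, [(5, 3), (5,), (2, 5), (2,)])
</pre>
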
Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes <math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math> is outputting correct derivative values.  Let <math>\textstyle \theta^{(i+)} = \theta + {\rm EPSILON} \times \vec{e}_i</math>, where <math>\textstyle \vec{e}_i</math> is the <math>\textstyle i</math>-th basis vector: a vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position and "0"s everywhere else.  So, <math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented by <math>{\rm EPSILON}</math>.  Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>.  We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>, that:

:<math>\begin{align}
g_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>

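A sketch of this component-wise check in Python/NumPy (an illustration, not the tutorial's reference code) is given below; the example cost function is made up.

<pre>
import numpy as np

EPSILON = 1e-4

def numerical_gradient(J, theta):
    # Approximate each partial derivative of J at theta with a two-sided difference.
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                          # i-th basis vector
        theta_plus = theta + EPSILON * e_i    # theta^(i+)
        theta_minus = theta - EPSILON * e_i   # theta^(i-)
        numgrad[i] = (J(theta_plus) - J(theta_minus)) / (2.0 * EPSILON)
    return numgrad

# Example: J(theta) = sum(theta^2) has gradient 2*theta.
J = lambda theta: np.sum(theta ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(J, theta))   # close to [2.0, -4.0, 1.0]
print(2.0 * theta)                    # the analytic gradient g_i(theta)
</pre>
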
When implementing backpropagation to train a neural network, in a correct implementation we will have that

:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}, \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>

This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed implementing gradient descent.  To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of <math>\textstyle J(W,b)</math>, and thereby verify that your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are indeed giving the derivatives you want.

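One common way to summarize such a comparison (an illustrative convention, not prescribed by the tutorial) is the relative error between the unrolled backpropagation gradient and the numerical gradient.  In the sketch below, <code>cost</code> and <code>backprop_grad</code> are hypothetical placeholders for your own implementation, and <code>numerical_gradient</code> is the helper from the sketch above.

<pre>
import numpy as np

def relative_error(analytic_grad, numeric_grad):
    # A scalar summary of how closely two gradient vectors agree.
    return (np.linalg.norm(numeric_grad - analytic_grad)
            / np.linalg.norm(numeric_grad + analytic_grad))

# Sketch of the check, using hypothetical helpers:
#   cost(theta)          -- J(W,b) with W,b unrolled into theta
#   backprop_grad(theta) -- the (1/m) Delta W^(l) + lambda W and (1/m) Delta b^(l)
#                           terms, unrolled into a vector in the same order
# numgrad = numerical_gradient(cost, theta)   # from the sketch above
# grad = backprop_grad(theta)
# print(relative_error(grad, numgrad))        # a correct implementation typically yields a very small value
</pre>
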
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>.  If you have implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>.  For example, there are algorithms that automatically tune the step size, and algorithms such as L-BFGS that build an approximation to the Hessian so that larger, better-directed steps can be taken toward a local optimum (similar to Newton's method).  A full discussion of these algorithms is beyond the scope of these notes, but one useful fact is that they are available as mature, off-the-shelf implementations: once you have a routine that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math> for any <math>\textstyle \theta</math>, you can hand it to such a library as a black box, and the library will do its own internal tuning of the step size to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>.  Algorithms such as L-BFGS and conjugate gradient can often be much faster than gradient descent.

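As a concrete, illustrative example of calling such an off-the-shelf routine, the Python sketch below hands a toy cost function and its gradient to SciPy's L-BFGS-B implementation; <code>cost_and_grad</code> is a stand-in for your own computation of <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>.

<pre>
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Toy stand-in: J(theta) = ||theta - 1||^2 with gradient 2*(theta - 1).
    J = np.sum((theta - 1.0) ** 2)
    grad = 2.0 * (theta - 1.0)
    return J, grad

theta0 = np.zeros(5)
# jac=True tells SciPy that cost_and_grad returns both J(theta) and its gradient.
result = minimize(cost_and_grad, theta0, jac=True, method='L-BFGS-B')
print(result.x)   # close to the minimizer (a vector of ones)
</pre>
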
{{Sparse_Autoencoder}}
