Backpropagation Algorithm

From Ufldl

Jump to: navigation, search
 
Line 19: Line 19:
The first term in the definition of <math>J(W,b)</math> is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
The first term in the definition of <math>J(W,b)</math> is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
-
[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>.  Applying weight decay to the bias units usually makes only a small different to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]
+
[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>.  Applying weight decay to the bias units usually makes only a small difference to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]
The '''weight decay parameter''' <math>\lambda</math> controls the relative importance of the two terms. Note also the slightly overloaded notation: <math>J(W,b;x,y)</math> is the squared error cost with respect to a single example; <math>J(W,b)</math> is the overall cost function, which includes the weight decay term.
The '''weight decay parameter''' <math>\lambda</math> controls the relative importance of the two terms. Note also the slightly overloaded notation: <math>J(W,b;x,y)</math> is the squared error cost with respect to a single example; <math>J(W,b)</math> is the overall cost function, which includes the weight decay term.
Line 131: Line 131:
To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function <math>\textstyle J(W,b)</math>.
To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function <math>\textstyle J(W,b)</math>.
 +
 +
 +
{{Sparse_Autoencoder}}
 +
 +
 +
{{Languages|反向传导算法|中文}}

Latest revision as of 12:50, 7 April 2013

Personal tools