Stacked Autoencoders

From Ufldl

(Training)
 
Line 8: Line 8:
\begin{align}
a^{(l)} = f(z^{(l)}) \\
-
z^{(l + 1)} = W^{(l, 1)}a^{(l)} + b^{(l, l)}
+
z^{(l + 1)} = W^{(l, 1)}a^{(l)} + b^{(l, 1)}
\end{align}
</math>
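The encoding equations in this hunk can be sketched in NumPy as follows. This is a minimal illustration, not code from the original page: it assumes a sigmoid for the activation <math>f</math>, and the names `sigmoid`, `encode`, `weights` and `biases` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation f; the equations do not fix a particular choice."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, weights, biases):
    """Forward pass through the encoding layers of a stacked autoencoder.

    weights[l] and biases[l] stand in for W^{(l,1)} and b^{(l,1)}.
    The input x plays the role of the first activation.
    """
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b      # z^{(l+1)} = W^{(l,1)} a^{(l)} + b^{(l,1)}
        a = sigmoid(z)     # a^{(l+1)} = f(z^{(l+1)})
    return a

# Usage: a 4-dimensional input passed through two encoding layers (4 -> 3 -> 2).
x = np.ones(4)
weights = [np.zeros((3, 4)), np.zeros((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
code = encode(x, weights, biases)
```

With all-zero parameters every pre-activation is zero, so each unit outputs `sigmoid(0) = 0.5`, which makes the example easy to check by hand.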
Line 26: Line 26:
===Training===
===Training===
-
A good way to obtain good parameters for a stacked autoencoder is to use greedy layer-wise training. To do this, first train the first layer on raw input to obtain parameters W1, W2, b1 and b2. Use the first layer to transform the raw input into a vector consisting of the activations of the hidden units, A. Train the second layer on this vector to obtain parameters W1, W2, b1 and b2. Repeat for subsequent layers, using the output of each layer as input for the subsequent layer.
+
A good way to obtain good parameters for a stacked autoencoder is to use greedy layer-wise training. To do this, first train the first layer on raw input to obtain parameters <math>W^{(1,1)}, W^{(1,2)}, b^{(1,1)}, b^{(1,2)}</math>. Use the first layer to transform the raw input into a vector consisting of the activations of the hidden units, A. Train the second layer on this vector to obtain parameters <math>W^{(2,1)}, W^{(2,2)}, b^{(2,1)}, b^{(2,2)}</math>. Repeat for subsequent layers, using the output of each layer as input for the subsequent layer.
-
This method trains the parameters of each layer individually while freezing parameters for the remainder of the model. After this phase of training is complete, fine-tuning using backpropagation can be used to improve the results by tuning the parameters of all layers at the same time.
+
This method trains the parameters of each layer individually while freezing parameters for the remainder of the model. After this phase of training is complete, [[Fine-tuning Stacked AEs | fine-tuning]] using backpropagation can be used to improve the results by tuning the parameters of all layers at the same time.
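The greedy layer-wise procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tutorial's implementation: each layer is a plain (non-sparse) autoencoder with sigmoid units, trained by batch gradient descent on squared reconstruction error, and the names `train_autoencoder` and `greedy_layerwise` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=200, lr=0.5, seed=0):
    """Fit one autoencoder to X (n_features x n_examples).

    Assumption: plain squared-error autoencoder, no sparsity penalty.
    Returns (W1, b1, W2, b2): encoder and decoder parameters.
    """
    rng = np.random.default_rng(seed)
    n_in, m = X.shape
    W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
    b1 = np.zeros((n_hidden, 1))
    W2 = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b2 = np.zeros((n_in, 1))
    for _ in range(steps):
        A = sigmoid(W1 @ X + b1)                    # hidden activations
        X_hat = sigmoid(W2 @ A + b2)                # reconstruction
        d_out = (X_hat - X) * X_hat * (1 - X_hat)   # error at output pre-activation
        d_hid = (W2.T @ d_out) * A * (1 - A)        # error at hidden pre-activation
        W2 -= lr * (d_out @ A.T) / m
        b2 -= lr * d_out.mean(axis=1, keepdims=True)
        W1 -= lr * (d_hid @ X.T) / m
        b1 -= lr * d_hid.mean(axis=1, keepdims=True)
    return W1, b1, W2, b2

def greedy_layerwise(X, hidden_sizes):
    """Train one autoencoder per layer, feeding each layer's activations forward."""
    params, inputs = [], X
    for n_hidden in hidden_sizes:
        W1, b1, W2, b2 = train_autoencoder(inputs, n_hidden)
        params.append((W1, b1, W2, b2))
        # Earlier layers stay frozen; their activations A become the next "raw input".
        inputs = sigmoid(W1 @ inputs + b1)
    return params, inputs

# Usage on random stand-in data: 8 input features, 20 examples, hidden sizes 5 and 3.
X = np.random.default_rng(1).random((8, 20))
params, features = greedy_layerwise(X, [5, 3])
```

Each call to `train_autoencoder` sees only the output of the layers below it, which is what makes the procedure greedy: no layer's parameters are revisited until the later fine-tuning phase.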
<!-- In practice, fine-tuning should be used when the parameters have been brought close to convergence through layer-wise training. Attempting to use fine-tuning with the weights initialized randomly will lead to poor results due to local optima. -->
{{Quote|
-
If one is only interested in finetuning for the purposes of classification, the common practice is to then discard the "decoding" layers of the stacked autoencoder and link the last hidden layer <math>a^(n)</math> to the softmax classifier. The gradients from the (softmax) classification error will then be backpropagated into the encoding layers.
+
If one is only interested in finetuning for the purposes of classification, the common practice is to then discard the "decoding" layers of the stacked autoencoder and link the last hidden layer <math>a^{(n)}</math> to the softmax classifier. The gradients from the (softmax) classification error will then be backpropagated into the encoding layers.
}}
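The fine-tuning step described in the quote can be sketched as a single gradient update. This is a hedged illustration rather than the tutorial's code: it assumes sigmoid encoding layers, a cross-entropy loss on one-hot labels, and plain batch gradient descent. The function `finetune_step` and its arguments are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # subtract column max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def finetune_step(X, Y, enc, W_s, b_s, lr=0.1):
    """One gradient step on the softmax cross-entropy loss.

    enc: list of (W, b) encoder parameters only (decoding layers discarded).
    Y:   one-hot labels, shape (n_classes, n_examples).
    All parameter arrays are updated in place; returns the loss before the update.
    """
    m = X.shape[1]
    acts = [X]
    for W, b in enc:                        # forward pass through encoding layers
        acts.append(sigmoid(W @ acts[-1] + b))
    P = softmax(W_s @ acts[-1] + b_s)       # class probabilities from a^{(n)}
    loss = -np.sum(Y * np.log(P + 1e-12)) / m

    delta = (P - Y) / m                     # gradient at the softmax input
    grad_a = W_s.T @ delta                  # gradient reaching the last hidden layer
    W_s -= lr * (delta @ acts[-1].T)
    b_s -= lr * delta.sum(axis=1, keepdims=True)
    for l in range(len(enc) - 1, -1, -1):   # backpropagate into the encoding layers
        W, b = enc[l]
        a = acts[l + 1]
        delta = grad_a * a * (1 - a)
        grad_a = W.T @ delta                # computed before W is modified
        W -= lr * (delta @ acts[l].T)
        b -= lr * delta.sum(axis=1, keepdims=True)
    return loss
```

In practice `enc` would hold the encoder weights obtained from layer-wise training, so that fine-tuning starts from pre-trained parameters instead of a random initialization.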
-
===Motivation===
+
===Concrete example===
-
A stacked autoencoder inherits all the benefits of any deep network: greater expressive power and greater statistical efficiency. In addition, its purpose can be described in an intuitive sense as follows.
+
To give a concrete example, suppose you wished to train a stacked autoencoder with 2 hidden layers for classification of MNIST digits, as you will be doing in [[Exercise: Implement deep networks for digit classification | the next exercise]].  
-
Recall that an autoencoder tends to learn features that form a good representation of its input. The first layer of a stacked autoencoder tends to learn first-order features in the raw input. The second layer of a stacked autoencoder tends to learn second-order features corresponding to patterns in the appearance of first-order features. Higher layers of the stacked autoencoder tend to learn even higher-order features.
+
First, you would train a sparse autoencoder on the raw inputs <math>x^{(k)}</math> to learn primary features <math>h^{(1)(k)}</math> on the raw input.
 +
[[File:Stacked_SparseAE_Features1.png|400px]]
 +
 +
Next, you would feed the raw input into this trained sparse autoencoder, obtaining the primary feature activations <math>h^{(1)(k)}</math> for each of the inputs <math>x^{(k)}</math>. You would then use these primary features as the "raw input" to another sparse autoencoder to learn secondary features <math>h^{(2)(k)}</math> on these primary features.
 +
 +
[[File:Stacked_SparseAE_Features2.png|400px]]
 +
 +
Following this, you would feed the primary features into the second sparse autoencoder to obtain the secondary feature activations <math>h^{(2)(k)}</math> for each of the primary features <math>h^{(1)(k)}</math> (which correspond to the primary features of the corresponding inputs <math>x^{(k)}</math>). You would then treat these secondary features as "raw input" to a softmax classifier, training it to map secondary features to digit labels.
 +
 +
[[File:Stacked_Softmax_Classifier.png|400px]]
 +
 +
Finally, you would combine all three layers together to form a stacked autoencoder with 2 hidden layers and a final softmax classifier layer capable of classifying the MNIST digits as desired.
 +
 +
[[File:Stacked_Combined.png|500px]]
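The combined network from the example above can be sketched as a single prediction function. The layer sizes here (784 inputs for 28x28 MNIST images, two hidden layers of 200 units, 10 classes) are assumptions for illustration only, since the text does not specify hidden-layer sizes, and the random parameters merely stand in for weights that would come from the training steps described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, encoders, W_s, b_s):
    """Classify the columns of X with the stacked encoders plus a softmax layer."""
    a = X
    for W, b in encoders:          # primary features, then secondary features
        a = sigmoid(W @ a + b)
    scores = W_s @ a + b_s         # softmax pre-activations
    # The argmax of the scores equals the argmax of the softmax probabilities.
    return scores.argmax(axis=0)

rng = np.random.default_rng(0)
sizes = [784, 200, 200]            # assumed sizes: input, hidden 1, hidden 2
encoders = [(rng.normal(0, 0.01, (n_out, n_in)), np.zeros((n_out, 1)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]
W_s = rng.normal(0, 0.01, (10, sizes[-1]))   # 10 digit classes
b_s = np.zeros((10, 1))

X = rng.random((784, 5))           # five stand-in "images", one per column
labels = predict(X, encoders, W_s, b_s)
```

Only the encoding half of each autoencoder appears in `encoders`; the decoding weights are needed during layer-wise training but play no role in the final classifier.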
 +
 +
===Discussion===
 +
 +
A stacked autoencoder enjoys the main benefit of any deep network: greater expressive power.
 +
 +
Further, it often captures a useful "hierarchical grouping" or "part-whole decomposition" of the input.  To see this, recall that an autoencoder tends to learn features that form a good representation of its input. The first layer of a stacked autoencoder tends to learn first-order features in the raw input (such as edges in an image). The second layer of a stacked autoencoder tends to learn second-order features corresponding to patterns in the appearance of first-order features (e.g., in terms of what edges tend to occur together--for example, to form contour or corner detectors). Higher layers of the stacked autoencoder tend to learn even higher-order features.
 +
 +
 +
{{CNN}}
 +
 +
<!--
For instance, in the context of image input, the first layer usually learns to recognize edges. The second layer usually learns features that arise from combinations of the edges, such as corners. With certain types of network configuration and input modes, the higher layers can learn meaningful combinations of features. For instance, if the input set consists of images of faces, higher layers may learn features corresponding to parts of the face such as eyes, noses or mouths.
 +
!-->
 +
 +
 +
{{Languages|栈式自编码算法|中文}}

Latest revision as of 13:33, 7 April 2013
