Self-Taught Learning to Deep Networks

== From Self-Taught Learning to Deep Networks ==
In the previous section, you used an autoencoder to learn features that were then fed as input to a softmax or logistic regression classifier.  In that method, the features were learned using only unlabeled data.  In this section, we describe how you can '''fine-tune''' and further improve the learned features using labeled data.  When you have a large amount of labeled training data, this can significantly improve your classifier's performance.

In self-taught learning, we first trained a sparse autoencoder on the unlabeled data.  Then, given a new example <math>\textstyle x</math>, we used the hidden layer to extract features <math>\textstyle a</math>.  This is illustrated in the following diagram:

[[File:STL_SparseAE_Features.png|300px]]

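To make this step concrete, here is a minimal NumPy sketch of the feature-extraction computation.  The names <code>W1</code>, <code>b1</code>, and <code>extract_features</code> are illustrative placeholders, and we assume the weights come from a sparse autoencoder you have already trained:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, X):
    """Forward-propagate inputs through the trained autoencoder's hidden layer.

    W1 : (hidden_size, input_size) weights from a trained sparse autoencoder
    b1 : (hidden_size,) hidden-layer bias
    X  : (input_size, m) matrix whose columns are the examples x^(i)
    Returns the (hidden_size, m) matrix of hidden activations a^(i).
    """
    return sigmoid(W1 @ X + b1[:, None])
</pre>
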
We are interested in solving a classification task, where our goal is to predict labels <math>\textstyle y</math>.  We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.  We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(i)}</math> computed by the sparse autoencoder (the "replacement" representation).  This gives us a training set <math>\textstyle \{ (a^{(1)}, y^{(1)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}</math>.  Finally, we train a logistic classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.  To illustrate this step, as in [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:

::::[[File:STL_Logistic_Classifier.png|380px]]

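Continuing the sketch above, this second training stage can be written as plain gradient descent on the log loss, treating the extracted features as fixed inputs (the learning rate and iteration count are illustrative, not prescribed by these notes):

<pre>
def train_logistic(A, y, lr=0.1, iters=1000):
    """Fit a logistic classifier on the replacement features.

    A : (hidden_size, m) feature matrix from extract_features
    y : (m,) binary labels in {0, 1}
    Returns the weights W2 (hidden_size,) and bias b2.
    """
    hidden_size, m = A.shape
    W2, b2 = np.zeros(hidden_size), 0.0
    for _ in range(iters):
        p = sigmoid(W2 @ A + b2)       # predicted P(y = 1 | a) per example
        W2 -= lr * (A @ (p - y)) / m   # gradient of the average log loss
        b2 -= lr * np.mean(p - y)
    return W2, b2
</pre>
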
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned using this method.  In particular, let us examine the function that our classifier uses to map from a new test example <math>\textstyle x</math> to a new prediction <math>\textstyle p(y=1|x)</math>.  We can draw a representation of this function by putting together the two pictures from above.  In particular, the final classifier looks like this:

[[File:STL_CombinedAE.png|500px]]

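In code, this input-output mapping is just the composition of the two stages sketched above (again using the same illustrative names):

<pre>
def predict(W1, b1, W2, b2, x):
    """The overall classifier: map a new input x to P(y = 1 | x).

    The first layer computes the autoencoder features; the second
    layer is the logistic regression unit trained on those features.
    """
    a = sigmoid(W1 @ x + b1)     # hidden activations (learned features)
    return sigmoid(W2 @ a + b2)  # logistic unit's prediction
</pre>
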
The parameters of this model were trained in two stages: the first layer of weights <math>\textstyle W^{(1)}</math> mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> was trained as part of the sparse autoencoder training process.  The second layer of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was trained using logistic regression (or softmax regression).

But the form of our overall/final classifier is clearly just a whole big neural network.  So, having trained up an initial set of parameters for our model (training the first layer using an autoencoder, and the second layer via logistic/softmax regression), we can further modify all the parameters in our model to try to further reduce the training error.  In particular, we can '''fine-tune''' the parameters, meaning that we perform gradient descent (or use L-BFGS) from the current setting of the parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math>.

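As a minimal sketch of one fine-tuning step, we can backpropagate the classifier's error through both layers and update ''all'' of the parameters, including the pre-trained weights <math>W^{(1)}</math>.  (In practice you might instead hand these gradients to L-BFGS; the names and learning rate below are illustrative.)

<pre>
def finetune_step(W1, b1, W2, b2, X, y, lr=0.1):
    """One backpropagation step that fine-tunes all parameters.

    X : (input_size, m) labeled inputs x_l^(i);  y : (m,) labels in {0, 1}.
    """
    m = X.shape[1]
    A = sigmoid(W1 @ X + b1[:, None])            # forward: hidden activations
    p = sigmoid(W2 @ A + b2)                     # forward: P(y = 1 | x)
    delta2 = (p - y) / m                         # error at the output unit
    delta1 = np.outer(W2, delta2) * A * (1 - A)  # backprop through hidden layer
    W2 -= lr * (A @ delta2)
    b2 -= lr * delta2.sum()
    W1 -= lr * (delta1 @ X.T)
    b1 -= lr * delta1.sum(axis=1)
    return W1, b1, W2, b2
</pre>
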
When fine-tuning is used, sometimes the original unsupervised feature learning steps (i.e., training the autoencoder and the logistic classifier) are called '''pre-training'''.  The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as well, so that adjustments can be made to the features <math>a</math> extracted by the layer of hidden units.

So far, we have described this process assuming that you used the "replacement" representation, where the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>, rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.  It is also possible to perform fine-tuning using the "concatenation" representation.  (This corresponds to a neural network where the input units <math>x_i</math> also feed directly to the logistic classifier in the output layer.  You can draw this using a slightly different type of neural network diagram than the ones we have seen so far; in particular, you would have edges that go directly from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.)  However, so long as we are using fine-tuning, the "concatenation" representation usually has little advantage over the "replacement" representation.  Thus, if we are using fine-tuning, we will usually do so with a network built using the replacement representation.  (If you are not using fine-tuning, however, then the concatenation representation can sometimes give much better performance.)

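For comparison, here is what the concatenation version of the classifier looks like in the same sketch style: the logistic unit now has one weight per input unit ''and'' per hidden unit (the name <code>W2c</code> is illustrative):

<pre>
def predict_concat(W1, b1, W2c, b2, x):
    """Logistic unit on the concatenated representation (x, a).

    W2c : (input_size + hidden_size,) weights; the input nodes feed the
    output unit directly, "skipping over" the hidden layer.
    """
    a = sigmoid(W1 @ x + b1)
    return sigmoid(W2c @ np.concatenate([x, a]) + b2)
</pre>
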
When should we use fine-tuning?  It is typically used only if you have a large labeled training set; in this setting, fine-tuning can significantly improve the performance of your classifier.  However, if you have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and only a relatively small labeled training set, then fine-tuning is significantly less likely to help.

{{CNN}}

{{Languages|从自我学习到深层网络|中文}}