Self-Taught Learning

== Overview ==
In machine learning, sometimes it's not who has the best algorithm that wins.  It's who has the most data.

== Learning features ==
We have already seen how an autoencoder can be used to learn features from unlabeled data.

Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
We can now find a better representation for the inputs.  In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder, and obtain the corresponding hidden unit activations <math>\textstyle a_l^{(1)}</math>.  We can then use either <math>\textstyle a_l^{(1)}</math> on its own, or the concatenated pair <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>, as the new representation of the first training example.

Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}^{(1)}</math>.  Then, feed
either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.  
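
To make this procedure concrete, here is a minimal Python/NumPy sketch (not part of the original notes).  The weights <code>W1</code>, <code>b1</code> stand in for the first layer of an already-trained sparse autoencoder, and the toy dimensions and random inputs are assumptions purely for illustration.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(X, W1, b1):
    # Hidden-layer activations a = f(W1 x + b1), one column per example.
    return sigmoid(W1 @ X + b1[:, None])

rng = np.random.default_rng(0)
n_input, n_hidden, m_labeled = 64, 25, 10              # toy sizes (assumed)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_input))   # stand-in for learned autoencoder weights
b1 = np.zeros(n_hidden)

X_labeled = rng.normal(size=(n_input, m_labeled))      # labeled inputs x_l^(i), one per column
A_labeled = extract_features(X_labeled, W1, b1)        # "replacement" representation a_l^(i)
XA_labeled = np.vstack([X_labeled, A_labeled])         # "concatenation" representation (x_l^(i), a_l^(i))

x_test = rng.normal(size=(n_input, 1))
a_test = extract_features(x_test, W1, b1)              # same procedure at test time
</pre>

Either <code>A_labeled</code> or <code>XA_labeled</code>, together with the labels <math>\textstyle y^{(i)}</math>, can then be handed to whatever classifier we like, and <code>a_test</code> is passed to that same classifier to get a prediction.
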
{{Quote|
'''An important note about preprocessing.'''  During the feature learning
stage where we were learning from the unlabeled training set, we may have computed
various pre-processing parameters (for example, a data mean used for mean normalization).
These parameters should be saved, and the exact same transformation applied to the
labeled training set and to the test data.  Re-estimating the parameters on the
labeled training set could give a dramatically different
pre-processing transformation, which would make the input distribution to
the autoencoder very different from what it was actually trained on.
}}
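
As a concrete illustration of this note, here is a small sketch of our own, with mean normalization standing in for whatever preprocessing is actually used: the parameters are estimated once on the unlabeled data and then reused, unchanged, on the labeled and test data.

<pre>
import numpy as np

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(loc=3.0, size=(64, 1000))   # toy unlabeled data, one example per column
X_labeled   = rng.normal(loc=3.0, size=(64, 10))     # toy labeled inputs
x_test      = rng.normal(loc=3.0, size=(64, 1))      # toy test input

# Estimate the preprocessing parameters on the UNLABELED set only, and save them.
data_mean = X_unlabeled.mean(axis=1, keepdims=True)

# Apply the exact same transformation at every later stage.
X_unlabeled_pp = X_unlabeled - data_mean   # used to train the autoencoder
X_labeled_pp   = X_labeled   - data_mean   # used when extracting features for the labeled set
x_test_pp      = x_test      - data_mean   # used at test time

# Re-estimating the mean on X_labeled instead would shift the inputs the
# autoencoder sees away from the distribution it was actually trained on.
</pre>
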
===Image classification===

[[BLAH]]

===Fine-tuning===

Suppose we are doing self-taught learning, and have trained a sparse
autoencoder on our unlabeled data.  Given a new example <math>\textstyle x</math>, we can use the
hidden layer to extract features <math>\textstyle a</math>.  This is shown as follows:

[[PICTURE]]

Now, we are interested in solving a classification task, where our goal is to
predict labels <math>\textstyle y</math>.  We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
Suppose we replace the original features <math>\textstyle x^{(i)}</math> with the features <math>\textstyle a^{(i)}</math>
computed by the sparse autoencoder.  This gives us a training set <math>\textstyle \{(a^{(1)},
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>.  Finally, we train a logistic
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y</math>.
As before, we can draw our logistic unit (shown in red for illustration) as
follows:

[[PICTURE--use different color for node, for clarity?]]
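
The following sketch (Python/NumPy, not from the original notes; the toy features and labels are assumptions) trains such a logistic unit on the features <math>\textstyle a^{(i)}</math> by gradient descent on the average cross-entropy loss:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_hidden, m_labeled = 25, 200
A = rng.normal(size=(n_hidden, m_labeled))   # features a^(i) from the autoencoder (toy stand-in)
y = (A.mean(axis=0) > 0).astype(float)       # toy binary labels y^(i)

W2 = np.zeros(n_hidden)                      # logistic-unit weights
b2 = 0.0
alpha = 0.1                                  # learning rate

for _ in range(500):
    p = sigmoid(W2 @ A + b2)                 # predicted P(y = 1 | a)
    grad_W2 = A @ (p - y) / m_labeled        # gradient of the average cross-entropy loss
    grad_b2 = (p - y).mean()
    W2 -= alpha * grad_W2
    b2 -= alpha * grad_b2

print("training accuracy:", ((sigmoid(W2 @ A + b2) > 0.5) == y).mean())
</pre>

After training, <code>W2</code> and <code>b2</code> play the role of the second layer of weights <math>\textstyle W^{(2)}</math> in the stacked network described next.
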
If we now look at the final classifier that we've learned, in terms
of what function it computes given a new test example <math>\textstyle x</math>, we
see that it can be drawn by putting the two pictures above together.  In
particular, the final classifier looks like this:

[[PICTURE]]

This model was trained in two stages.  The first layer of weights <math>\textstyle W^{(1)}</math>
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> was trained
as part of the sparse autoencoder training process.  The second layer
of weights <math>\textstyle W^{(2)}</math> mapping from the activations to the output <math>\textstyle y</math> was
trained using logistic regression.
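
Written as code, the combined classifier is simply a feedforward pass through both sets of weights.  A minimal sketch (the function and argument names are ours, and a sigmoid activation is assumed in both layers):

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_classifier(x, W1, b1, W2, b2):
    a = sigmoid(W1 @ x + b1)      # first layer: weights W^(1) from the sparse autoencoder
    return sigmoid(W2 @ a + b2)   # second layer: weights W^(2) from logistic regression
</pre>
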
But the final algorithm is clearly just a whole big neural network.  So,
we can also carry out further '''fine-tuning''' of the weights to
improve the overall classifier's performance.  In particular,
having trained the first layer using an autoencoder and the second layer
via logistic regression (this process is sometimes called '''pre-training''',
and sometimes more generally unsupervised feature learning),
we can now further perform gradient descent from the current value of
the weights to try to further drive down training error.
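
Here is a minimal sketch of this fine-tuning step (Python/NumPy, not from the original notes; the toy data, the stand-in initial weights, and the squared-error objective are assumptions for illustration).  We treat the first-layer weights <math>\textstyle W^{(1)}</math>, the second-layer weights <math>\textstyle W^{(2)}</math>, and their bias terms as one network and run gradient descent on the training error:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_input, n_hidden, m = 64, 25, 200
X = rng.normal(size=(n_input, m))        # labeled inputs x^(i), one per column
y = (X[0] > 0).astype(float)             # toy binary labels y^(i)

# Pre-trained parameters (stand-ins here): W1, b1 would come from the sparse
# autoencoder, W2, b2 from the logistic-regression step.
W1 = rng.normal(scale=0.1, size=(n_hidden, n_input))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=n_hidden)
b2 = 0.0

alpha = 0.5                              # learning rate
for _ in range(200):
    # Forward pass through the whole stacked network h_W(x).
    A = sigmoid(W1 @ X + b1[:, None])    # hidden activations a^(i)
    p = sigmoid(W2 @ A + b2)             # network output

    # Backpropagate the training error (1/m) * sum_i (h_W(x^(i)) - y^(i))^2.
    delta2 = 2.0 * (p - y) * p * (1 - p) / m
    grad_W2 = A @ delta2
    grad_b2 = delta2.sum()
    delta1 = (W2[:, None] * delta2) * A * (1 - A)
    grad_W1 = delta1 @ X.T
    grad_b1 = delta1.sum(axis=1)

    # Update ALL the weights jointly, starting from their pre-trained values.
    W1 -= alpha * grad_W1; b1 -= alpha * grad_b1
    W2 -= alpha * grad_W2; b2 -= alpha * grad_b2
</pre>

The point of the sketch is only that every layer is now updated jointly, starting from the values the pre-training stages produced; in practice one would use whatever supervised loss (for example, softmax cross-entropy) and optimizer the final classifier calls for.
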
===Discussion===

Given that the whole algorithm is just a big neural network, why don't we just
carry out the fine-tuning step, without doing any pre-training/unsupervised
feature learning?  There are several reasons:

<ul>
<li> First and most important, labeled data is often scarce, and unlabeled
data is cheap and plentiful.  The promise of self-taught learning is that by
exploiting the massive amount of unlabeled data, we can learn much better
models.  The fine-tuning step can be done only using labeled data.  In
contrast, by using unlabeled data to learn a good initial value for the
first layer of weights <math>\textstyle W^{(1)}</math>, we usually get much better classifiers
after fine-tuning.

<li> Second, training a neural network using supervised learning involves
solving a highly non-convex optimization problem (say, minimizing the training
error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters
<math>\textstyle W</math>).
The optimization problem can therefore be rife with local optima, and training
with gradient descent (or methods like conjugate gradient and L-BFGS) does not
work well.  In contrast, by first initializing the parameters using an
unsupervised feature learning/pre-training step, we can end up at much better
solutions.  (Actually, pre-training has benefits beyond just helping to
get out of local optima.  In particular, it has been shown to also have
a useful "regularization" effect (Erhan et al., 2010), but a full discussion
is beyond the scope of these notes.)
</ul>
