Self-Taught Learning to Deep Networks
From Ufldl
== From Self-Taught Learning to Deep Networks ==

Recall that in self-taught learning, we first train a sparse autoencoder on our unlabeled data. Then, given a new example <math>\textstyle x</math>, we can use the hidden layer to extract features <math>\textstyle a</math>. This is shown as follows:

[[File:STL_SparseAE_Features.png|300px]]

Now, we are interested in solving a classification task, where our goal is to predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples. Suppose we replace the original features <math>\textstyle x_l^{(i)}</math> with the features <math>\textstyle a^{(i)}</math> computed by the sparse autoencoder. This gives us a training set <math>\textstyle \{ (a^{(1)}, y^{(1)}), (a^{(2)}, y^{(2)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y</math>. As before, we can draw our logistic unit (shown in orange) as follows:

[[File:STL_Logistic_Classifier.png|400px]]

If we now look at the final classifier that we've learned, in terms of the function it computes given a new test example <math>\textstyle x</math>, we see that it can be drawn by putting the two pictures above together. In particular, the final classifier looks like this:

[[File:STL_CombinedAE.png|500px]]

This model was trained in two stages. The first layer of weights <math>\textstyle W^{(1)}</math>, mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math>, was trained as part of the sparse autoencoder training process. The second layer of weights <math>\textstyle W^{(2)}</math>, mapping from the activations to the output <math>\textstyle y</math>, was trained using logistic regression. But the final algorithm is clearly just one large neural network.
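The two-stage model described above can be sketched in code as a forward pass through the stacked network. This is a minimal sketch: the layer sizes, parameter values, and function names here are illustrative, not from the tutorial; in practice <code>W1</code>, <code>b1</code> would come from sparse autoencoder training and <code>W2</code>, <code>b2</code> from logistic regression.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-trained parameters (random placeholders for illustration):
# W1, b1 would come from the sparse autoencoder; W2, b2 from logistic regression.
rng = np.random.default_rng(0)
n_input, n_hidden = 64, 25
W1 = rng.standard_normal((n_hidden, n_input)) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal(n_hidden) * 0.01
b2 = 0.0

def extract_features(x):
    """Hidden-layer activations a of the sparse autoencoder."""
    return sigmoid(W1 @ x + b1)

def classify(x):
    """Stacked model: autoencoder features fed into a logistic unit."""
    a = extract_features(x)
    return sigmoid(W2 @ a + b2)  # interpreted as P(y = 1 | x)

x = rng.standard_normal(n_input)
p = classify(x)
```

The key point the sketch illustrates is that the composed function is just an ordinary one-hidden-layer neural network, even though its two weight layers were trained separately.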
So, we can also carry out further '''fine-tuning''' of the weights to improve the overall classifier's performance. In particular, having trained the first layer using an autoencoder and the second layer via logistic regression (this process is sometimes called '''pre-training''', and sometimes more generally unsupervised feature learning), we can now perform gradient descent from the current value of the weights to drive the training error down further.

=== Discussion ===

Given that the whole algorithm is just a big neural network, why don't we carry out only the fine-tuning step, without doing any pre-training/unsupervised feature learning? There are several reasons:

<ul>
<li> First and most important, labeled data is often scarce, while unlabeled data is cheap and plentiful. The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models. The fine-tuning step can be done only using labeled data. In contrast, by using unlabeled data to learn a good initial value for the first layer of weights <math>\textstyle W^{(1)}</math>, we usually obtain much better classifiers after fine-tuning. </li>
<li> Second, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>). The optimization problem can therefore be rife with local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) may not work well. In contrast, by first initializing the parameters using an unsupervised feature learning/pre-training step, we often end up at much better solutions. (Actually, pre-training has benefits beyond helping to escape local optima; in particular, it has also been shown to have a useful "regularization" effect (Erhan et al., 2010). A full discussion, however, is beyond the scope of these notes.) </li>
</ul>
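Fine-tuning itself is ordinary backpropagation through the stacked network, starting from the pre-trained weights rather than a random initialization. Below is a minimal sketch assuming a single logistic output unit and a cross-entropy error (the tutorial's discussion uses squared error; cross-entropy is a common substitution for logistic outputs). The function name, learning rate, and epoch count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune(W1, b1, W2, b2, X, y, lr=0.1, epochs=200):
    """Gradient descent on all weights of the stacked network.

    W1, b1: pre-trained autoencoder layer; W2, b2: logistic layer.
    X: (m, n_input) labeled inputs; y: (m,) binary labels.
    Illustrative batch gradient descent, not an optimized implementation.
    """
    W1, b1, W2 = W1.copy(), b1.copy(), W2.copy()
    m = len(y)
    for _ in range(epochs):
        A = sigmoid(X @ W1.T + b1)   # hidden features, (m, n_hidden)
        p = sigmoid(A @ W2 + b2)     # logistic output, (m,)
        # Output-layer error for sigmoid + cross-entropy: delta2 = p - y
        d2 = p - y
        gW2 = A.T @ d2 / m
        gb2 = d2.mean()
        # Backpropagate through the hidden layer: delta1 = delta2 * W2 * a(1-a)
        d1 = np.outer(d2, W2) * A * (1 - A)
        gW1 = d1.T @ X / m
        gb1 = d1.mean(axis=0)
        W2 -= lr * gW2
        b2 -= lr * gb2
        W1 -= lr * gW1
        b1 -= lr * gb1
    return W1, b1, W2, b2
```

Because every weight, including the pre-trained first layer, receives gradient updates, the features themselves adapt to the labeled task; this is what distinguishes fine-tuning from simply training a classifier on top of frozen autoencoder features.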