== Overview ==

In this section, we will improve upon the features learned from self-taught learning by ''fine-tuning'' them for our classification objective.

Recall that in self-taught learning, we first train a sparse autoencoder on our unlabeled data. Then, given a new example <math>\textstyle x</math>, we can use the
hidden layer to extract features <math>\textstyle a</math>. This is shown as follows:

[[File:STL_SparseAE_Features.png|200px]]
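As a rough illustration of this feature-extraction step, here is a minimal Python/NumPy sketch (not part of the original notes); it assumes a sigmoid hidden layer, and the names <code>W1</code> and <code>b1</code> stand in for the sparse autoencoder's learned first-layer weights and biases:

<pre>
import numpy as np

def sigmoid(z):
    # Logistic activation assumed for the autoencoder's hidden layer.
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(X, W1, b1):
    """Hidden-layer activations of a trained sparse autoencoder.

    X  : (n_examples, n_inputs) array of raw inputs x
    W1 : (n_hidden, n_inputs)   learned first-layer weights
    b1 : (n_hidden,)            learned first-layer biases
    Returns A : (n_examples, n_hidden) array of features a.
    """
    return sigmoid(X @ W1.T + b1)
</pre>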

Now, we are interested in solving a classification task, where our goal is to predict the labels <math>\textstyle y</math>.
Since the autoencoder's first layer and the classifier trained on top of its features together form a single neural network,
we can now further perform gradient descent from the current value of
the weights to try to further drive down the training error.
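The following is a minimal sketch of one such fine-tuning step, written in Python/NumPy rather than taken from these notes; it assumes a softmax classifier with parameters <code>W2</code>, <code>b2</code> has been trained on top of the features, and that <code>W1</code>, <code>b1</code> are the autoencoder's first-layer parameters:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(X, y, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent step on the whole network (feature layer + classifier).

    X : (n, n_inputs) labeled inputs;  y : (n,) integer class labels.
    Starting from the current weights, all parameters are updated in place
    to reduce the training error (here, a softmax cross-entropy loss).
    """
    n = X.shape[0]
    # Forward pass through the full network.
    A1 = sigmoid(X @ W1.T + b1)                # features a (the former autoencoder layer)
    P = softmax(A1 @ W2.T + b2)                # predicted class probabilities

    # Backward pass (backpropagation through both layers).
    Y = np.eye(W2.shape[0])[y]                 # one-hot encoding of the labels
    dZ2 = (P - Y) / n
    dA1 = dZ2 @ W2
    dZ1 = dA1 * A1 * (1.0 - A1)

    # Gradient-descent update of every layer, including the pre-trained one.
    W2 -= lr * (dZ2.T @ A1);  b2 -= lr * dZ2.sum(axis=0)
    W1 -= lr * (dZ1.T @ X);   b1 -= lr * dZ1.sum(axis=0)
    return W1, b1, W2, b2
</pre>

The essential point is that <code>W1</code> and <code>b1</code> start from the values learned by the sparse autoencoder rather than from random values, and are then adjusted jointly with the classifier on the labeled data.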

===Discussion===

Given that the whole algorithm is just a big neural network, why don't we just
carry out the fine-tuning step on its own, without doing any pre-training/unsupervised
feature learning? There are several reasons:

<ul>
<li> First and most important, labeled data is often scarce, whereas unlabeled
data is cheap and plentiful. The promise of self-taught learning is that by
exploiting the massive amount of unlabeled data, we can learn much better
models. The fine-tuning step can use only the labeled data; but by first using
the unlabeled data to learn a good initial value for the first layer of weights
<math>\textstyle W^{(1)}</math>, we usually get much better classifiers after fine-tuning.

<li> Second, training a neural network using supervised learning involves
solving a highly non-convex optimization problem (say, minimizing the training
error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters
<math>\textstyle W</math>).
This optimization problem can be rife with local optima, so training
with gradient descent (or methods like conjugate gradient and L-BFGS) may not
work well. In contrast, by first initializing the parameters using an
unsupervised feature learning/pre-training step, we can end up at much better
solutions. (Pre-training also has benefits beyond helping to escape
local optima; in particular, it has been shown to have a useful
"regularization" effect (Erhan et al., 2010). A full discussion
is beyond the scope of these notes.)
</ul>