Self-Taught Learning
the labeled data, but ignoring the labels).

== Learning features ==

[[File:STL_SparseAE.png]]

Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,
given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of
activations <math>\textstyle a</math> of the hidden units. As we saw previously, this often gives a

Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
(The subscript "l" stands for "labeled.")
We can now find a better representation for the inputs. In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values.
Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>. Then, feed
either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.
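As a concrete sketch, this test-time procedure might look as follows in Python/NumPy. The names <code>W1</code>, <code>b1</code>, and <code>encode</code> are illustrative, not part of the tutorial's code; the "trained" weights here are random placeholders standing in for a real trained sparse autoencoder with a sigmoid hidden layer:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation assumed for the sparse autoencoder's hidden layer.
    return 1.0 / (1.0 + np.exp(-z))

def encode(W1, b1, x):
    # Hidden-unit activations a = f(W^(1) x + b^(1)) for a single input x.
    return sigmoid(W1 @ x + b1)

# Placeholder "trained" parameters: 25 hidden units, 64-dimensional inputs.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(25, 64))
b1 = np.zeros(25)

x_test = rng.normal(size=64)
a_test = encode(W1, b1, x_test)

# The classifier then receives either a_test alone,
# or the concatenation (x_test, a_test) as its feature vector.
features = np.concatenate([x_test, a_test])
```

The same <code>encode</code> step would be applied to every labeled training example <math>\textstyle x_l^{(i)}</math> before training the classifier, so that training and test inputs are represented identically.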

== On pre-processing the data ==

During the feature learning stage, where we were learning from the unlabeled training set
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
various pre-processing parameters. For example, one may have computed
during the labeled training phase and the test phase, so as to make sure
we are always transforming the data the same way to feed into the autoencoder.
In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
labeled examples and the test data. We should '''not''' re-estimate a
different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the
labeled training set, since that might result in a dramatically different
pre-processing transformation, which would make the input distribution to
the autoencoder very different from what it was actually trained on.
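To make the rule concrete, here is a small NumPy sketch with made-up data and a hypothetical <code>preprocess</code> helper. The mean and PCA matrix <math>\textstyle U</math> are estimated once from the unlabeled set and then reused, unchanged, for every later phase:

```python
import numpy as np

rng = np.random.default_rng(1)
X_unlabeled = rng.normal(size=(1000, 64))  # stand-in for {x_u^(1), ..., x_u^(m_u)}
X_labeled = rng.normal(size=(50, 64))      # stand-in for the labeled inputs x_l^(i)

# Estimate pre-processing parameters ONCE, from the unlabeled data only.
mean = X_unlabeled.mean(axis=0)
cov = np.cov(X_unlabeled - mean, rowvar=False)
_, U = np.linalg.eigh(cov)  # columns of U form the PCA basis

def preprocess(X):
    # Apply the SAME saved mean and matrix U in every phase:
    # unlabeled training, labeled training, and test.
    return (X - mean) @ U

Z_unlabeled = preprocess(X_unlabeled)  # used to train the autoencoder
Z_labeled = preprocess(X_labeled)      # NOT re-estimated from the labeled set
```

Re-fitting <code>mean</code> or <code>U</code> on the 50 labeled examples would yield a noticeably different transformation, which is exactly the mistake the paragraph above warns against.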

== On the terminology of unsupervised feature learning ==

There are two common unsupervised feature learning settings, depending on what type of
unlabeled data you have. The more powerful setting is the '''self-taught learning'''
setting, which does not assume that your unlabeled data <math>x_u</math> has to
be drawn from the same distribution as your labeled data <math>x_l</math>. The
more restrictive setting, where the unlabeled data comes from exactly the same
distribution as the labeled data, is sometimes called the '''semi-supervised learning'''
setting. This distinction is best explained with an example, which we now give.

Suppose your goal is a computer vision task where you'd like
to distinguish between images of cars and images of motorcycles; so, each labeled
example in your training set is either an image of a car or an image of a motorcycle.
Where can we get lots of unlabeled data? The easiest way would be to obtain some
random collection of images, perhaps downloaded off the internet. We could then
train the autoencoder on this large collection of images, and obtain useful features
from them. Because here the unlabeled data is drawn from a different distribution
than the labeled data (i.e., perhaps some of our unlabeled images may contain
cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we
call this self-taught learning.

In contrast, if we happen to have lots of unlabeled images lying around
that are all images of ''either'' a car or a motorcycle, but where the data
is just missing its label (so you don't know which ones are cars and which
ones are motorcycles), then we could use this form of unlabeled data to
learn the features. This setting, where each unlabeled example is drawn from the same
distribution as your labeled examples, is sometimes called the '''semi-supervised'''
setting. In practice, we rarely have this sort of unlabeled data (where would you
get a database of images where every image is either a car or a motorcycle, but
is just missing its label?), and so in the context of learning features from unlabeled
data, the self-taught learning setting is much more broadly applicable.