Self-Taught Learning

== Overview ==

Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable
ways to get better performance is to give the algorithm more data. This has led to the
aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins;
it's who has the most data."

supervised learning on that labeled data to solve the classification task.

These ideas probably have the most powerful effects in problems where we have a lot of
unlabeled data, and a smaller amount of labeled data. However, they typically give good
results even if we have only labeled data (in which case we usually perform the feature
learning step using the labeled data, but ignoring the labels).

== Learning features ==

(perhaps with appropriate whitening or other pre-processing):

[[File:STL_SparseAE.png|350px]]

Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,
given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of
activations <math>\textstyle a</math> of the hidden units. As we saw previously, this often gives a
neural network:

[[File:STL_SparseAE_Features.png|300px]]

This is just the sparse autoencoder that we previously had, with the final
Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
(The subscript "l" stands for "labeled.")

We can now find a better representation for the inputs. In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the
<math>\textstyle i</math>-th training example), or <math>\textstyle \{
((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,
((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (if we use the concatenated
representation). In practice, the concatenated representation often works
regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values.
Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>. Then, feed
either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.
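
The following is a brief sketch of this pipeline in Python/NumPy. It is only an illustration and not part of the original exercises: the variable names and the <tt>train_classifier</tt> step are placeholders, and the autoencoder parameters <math>\textstyle W^{(1)}, b^{(1)}</math> are assumed to have already been trained on the unlabeled data.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward_autoencoder(W1, b1, X):
    # Hidden-unit activations a = f(W1 x + b1) for each column of X.
    # W1: (hidden_size, input_size), b1: (hidden_size, 1), X: (input_size, m).
    return sigmoid(W1 @ X + b1)

# Hypothetical usage (the names below are illustrative assumptions):
# A_labeled = feed_forward_autoencoder(W1, b1, X_labeled)        # features a_l
# features_replacement  = A_labeled                               # use a_l in place of x_l
# features_concatenated = np.vstack([X_labeled, A_labeled])       # use (x_l, a_l)
# classifier = train_classifier(features_concatenated, y)         # e.g. softmax regression
#
# a_test = feed_forward_autoencoder(W1, b1, x_test.reshape(-1, 1))
# prediction = classifier.predict(np.vstack([x_test.reshape(-1, 1), a_test]))
</pre>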

== On pre-processing the data ==

During the feature learning stage where we were learning from the unlabeled training set
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
various pre-processing parameters. For example, one may have computed
a mean value of the data and subtracted off this mean to perform mean normalization,
or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used PCA
whitening or ZCA whitening). If this is the case, then it is important to
save away these preprocessing parameters, and to use the ''same'' parameters
during the labeled training phase and the test phase, so as to make sure
we are always transforming the data the same way to feed into the autoencoder.
In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
labeled examples and the test data. We should '''not''' re-estimate a
different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the
labeled training set, since that might result in a dramatically different
pre-processing transformation, which would make the input distribution to
the autoencoder very different from what it was actually trained on.
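
As a small illustrative sketch (again in Python/NumPy, and not part of the original tutorial), the following shows this workflow: the mean and the PCA whitening parameters are estimated once from the unlabeled data, saved, and then the ''same'' saved parameters are applied to the labeled training examples and to the test examples.

<pre>
import numpy as np

def fit_preprocessing(X_unlabeled):
    # Estimate preprocessing parameters from the *unlabeled* data only.
    # X_unlabeled: (input_size, m_u). Returns the data mean, the PCA basis U,
    # and the eigenvalues S needed for PCA whitening.
    mean = X_unlabeled.mean(axis=1, keepdims=True)
    Xc = X_unlabeled - mean
    sigma = Xc @ Xc.T / Xc.shape[1]                       # covariance of the centered data
    U, S, _ = np.linalg.svd(sigma)
    return mean, U, S

def apply_preprocessing(X, mean, U, S, epsilon=1e-5):
    # Apply the *same* saved parameters to any dataset (labeled train or test);
    # do not re-estimate the mean or U from the labeled data.
    Xc = X - mean
    return np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ Xc  # PCA-whitened data

# Hypothetical usage:
# mean, U, S = fit_preprocessing(X_unlabeled)               # estimated once
# X_train_pre = apply_preprocessing(X_labeled, mean, U, S)
# X_test_pre  = apply_preprocessing(X_test,    mean, U, S)
</pre>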

== On the terminology of unsupervised feature learning ==

There are two common unsupervised feature learning settings, depending on what type of
unlabeled data you have. The more general and powerful setting is the '''self-taught learning'''
setting, which does not assume that your unlabeled data <math>x_u</math> has to
be drawn from the same distribution as your labeled data <math>x_l</math>. The
more restrictive setting where the unlabeled data comes from exactly the same
distribution as the labeled data is sometimes called the '''semi-supervised learning'''
setting. This distinction is best explained with an example, which we now give.

Suppose your goal is a computer vision task where you'd like
to distinguish between images of cars and images of motorcycles; so, each labeled
example in your training set is either an image of a car or an image of a motorcycle.
Where can we get lots of unlabeled data? The easiest way would be to obtain some
random collection of images, perhaps downloaded off the internet. We could then
train the autoencoder on this large collection of images, and obtain useful features
from them. Because here the unlabeled data is drawn from a different distribution
than the labeled data (i.e., perhaps some of our unlabeled images may contain
cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we
call this self-taught learning.

In contrast, if we happen to have lots of unlabeled images lying around
that are all images of ''either'' a car or a motorcycle, but where the data
is just missing its label (so you don't know which ones are cars, and which
ones are motorcycles), then we could use this form of unlabeled data to
learn the features. This setting---where each unlabeled example is drawn from the same
distribution as your labeled examples---is sometimes called the semi-supervised
setting. In practice, we often do not have this sort of unlabeled data (where would you
get a database of images where every image is either a car or a motorcycle, but
just missing its label?), and so in the context of learning features from unlabeled
data, the self-taught learning setting is more broadly applicable.


{{STL}}


{{Languages|自我学习|中文}}