Self-Taught Learning

== Overview ==

Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable
ways to get better performance is to give the algorithm more data. This has led to the
aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins;
it's who has the most data."

supervised learning on that labeled data to solve the classification task.

These ideas probably have the most powerful effects in problems where we have a lot of
unlabeled data, and a smaller amount of labeled data. However, they typically give good
results even if we have only labeled data (in which case we usually perform the feature
learning step using the labeled data, but ignoring the labels).

== Learning features ==

(perhaps with appropriate whitening or other pre-processing):

[[File:STL_SparseAE.png|350px]]

Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,
given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of
activations <math>\textstyle a</math> of the hidden units. As we saw previously, this often gives a
neural network:

[[File:STL_SparseAE_Features.png|300px]]

This is just the sparse autoencoder that we previously had, with the final
Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
(The subscript "l" stands for "labeled.")

We can now find a better representation for the inputs. In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the
<math>\textstyle i</math>-th training example), or <math>\textstyle \{
((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,
((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (if we use the concatenated
representation). In practice, the concatenated representation often works
regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values.
Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>. Then, feed
either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.
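
The following is a brief sketch of this pipeline in Python/NumPy. It is only an illustration and not part of the original exercises: the variable names and the <tt>train_classifier</tt> step are placeholders, and the autoencoder parameters <math>\textstyle W^{(1)}, b^{(1)}</math> are assumed to have already been trained on the unlabeled data.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward_autoencoder(W1, b1, X):
    # Hidden-unit activations a = f(W1 x + b1) for each column of X.
    # W1: (hidden_size, input_size), b1: (hidden_size, 1), X: (input_size, m).
    return sigmoid(W1 @ X + b1)

# Hypothetical usage (the names below are illustrative assumptions):
# A_labeled = feed_forward_autoencoder(W1, b1, X_labeled)        # features a_l
# features_replacement  = A_labeled                               # use a_l in place of x_l
# features_concatenated = np.vstack([X_labeled, A_labeled])       # use (x_l, a_l)
# classifier = train_classifier(features_concatenated, y)         # e.g. softmax regression
#
# a_test = feed_forward_autoencoder(W1, b1, x_test.reshape(-1, 1))
# prediction = classifier.predict(np.vstack([x_test.reshape(-1, 1), a_test]))
</pre>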

== On pre-processing the data ==

During the feature learning stage where we were learning from the unlabeled training set
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
various pre-processing parameters. For example, one may have computed
a mean value of the data and subtracted off this mean to perform mean normalization,
or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used PCA
whitening or ZCA whitening). If this is the case, then it is important to
save away these preprocessing parameters, and to use the ''same'' parameters
during the labeled training phase and the test phase, so as to make sure
we are always transforming the data the same way to feed into the autoencoder.
In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
labeled examples and the test data. We should '''not''' re-estimate a
different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the
labeled training set, since that might result in a dramatically different
pre-processing transformation, which would make the input distribution to
the autoencoder very different from what it was actually trained on.
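
As a small illustrative sketch (again in Python/NumPy, and not part of the original tutorial), the following shows this workflow: the mean and the PCA whitening parameters are estimated once from the unlabeled data, saved, and then the ''same'' saved parameters are applied to the labeled training examples and to the test examples.

<pre>
import numpy as np

def fit_preprocessing(X_unlabeled):
    # Estimate preprocessing parameters from the *unlabeled* data only.
    # X_unlabeled: (input_size, m_u). Returns the data mean, the PCA basis U,
    # and the eigenvalues S needed for PCA whitening.
    mean = X_unlabeled.mean(axis=1, keepdims=True)
    Xc = X_unlabeled - mean
    sigma = Xc @ Xc.T / Xc.shape[1]                       # covariance of the centered data
    U, S, _ = np.linalg.svd(sigma)
    return mean, U, S

def apply_preprocessing(X, mean, U, S, epsilon=1e-5):
    # Apply the *same* saved parameters to any dataset (labeled train or test);
    # do not re-estimate the mean or U from the labeled data.
    Xc = X - mean
    return np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ Xc  # PCA-whitened data

# Hypothetical usage:
# mean, U, S = fit_preprocessing(X_unlabeled)               # estimated once
# X_train_pre = apply_preprocessing(X_labeled, mean, U, S)
# X_test_pre  = apply_preprocessing(X_test,    mean, U, S)
</pre>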

== On the terminology of unsupervised feature learning ==

There are two common unsupervised feature learning settings, depending on what type of
unlabeled data you have. The more general and powerful setting is the '''self-taught learning'''
setting, which does not assume that your unlabeled data <math>x_u</math> has to
be drawn from the same distribution as your labeled data <math>x_l</math>. The
more restrictive setting where the unlabeled data comes from exactly the same
distribution as the labeled data is sometimes called the '''semi-supervised learning'''
setting. This distinction is best explained with an example, which we now give.

Suppose your goal is a computer vision task where you'd like
to distinguish between images of cars and images of motorcycles; so, each labeled
example in your training set is either an image of a car or an image of a motorcycle.
Where can we get lots of unlabeled data? The easiest way would be to obtain some
random collection of images, perhaps downloaded off the internet. We could then
train the autoencoder on this large collection of images, and obtain useful features
from them. Because here the unlabeled data is drawn from a different distribution
than the labeled data (i.e., perhaps some of our unlabeled images may contain
cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we
call this self-taught learning.

In contrast, if we happen to have lots of unlabeled images lying around
that are all images of ''either'' a car or a motorcycle, but where the data
is just missing its label (so you don't know which ones are cars, and which
ones are motorcycles), then we could use this form of unlabeled data to
learn the features. This setting---where each unlabeled example is drawn from the same
distribution as your labeled examples---is sometimes called the semi-supervised
setting. In practice, we often do not have this sort of unlabeled data (where would you
get a database of images where every image is either a car or a motorcycle, but
just missing its label?), and so in the context of learning features from unlabeled
data, the self-taught learning setting is more broadly applicable.


{{STL}}


{{Languages|自我学习|中文}}