Self-Taught Learning
From Ufldl
== Overview ==
In machine learning, one of the most reliable ways to get better performance is
to give your algorithms more data. This has led to the aphorism that in machine
learning, "sometimes it's not who has the best algorithm that wins; it's who
has the most data."
One can always try to get more labeled data, but this can be expensive. In
particular, researchers have already gone to extraordinary lengths to use tools
such as AMT (Amazon Mechanical Turk) to get large training sets. While having
humans hand-label large training sets is a step forward, it would be better
still if our algorithms could learn from ''unlabeled'' data, since then we
could easily obtain and learn from massive amounts of it. Even though a single
unlabeled example is less informative than a single labeled example, if we can
get tons of the former---for example, by downloading random unlabeled
images/audio clips/text documents off the internet---and if our algorithms can
exploit this unlabeled data effectively, then we might be able to achieve
better performance than the massive hand-engineering and massive hand-labeling
approaches.
In Self-taught learning and Unsupervised feature learning, we will give our
algorithms a large amount of unlabeled data with which to learn a good feature
representation of the input. If we are trying to solve a specific
classification task, then we take this learned feature representation and
whatever (perhaps small amount of) labeled data we have for that classification
task, and apply supervised learning on that labeled data to solve the
classification task.
These ideas are probably most powerful in settings where we have a lot of
unlabeled data, and a relatively smaller amount of labeled data. However, these
models often give good results even if we have only labeled data (in which case
we usually perform the feature learning step using the labeled data, but
ignoring the labels).
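The two-stage pipeline just described can be sketched as follows. This is a minimal illustration on synthetic data, using PCA as a stand-in for the feature-learning step and a nearest-centroid classifier for the supervised step; both are illustrative choices for this sketch, not methods prescribed by the tutorial itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: learn a feature representation from plentiful unlabeled data.
# Here "feature learning" is plain PCA (a stand-in for a richer learner):
# find the directions of maximal variance in the unlabeled set.
unlabeled = rng.normal(size=(1000, 20))      # 1000 unlabeled examples
unlabeled -= unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(unlabeled, full_matrices=False)
W = Vt[:5]                                   # top-5 principal directions

def features(x):
    """Map raw inputs to the learned feature representation."""
    return x @ W.T

# Stage 2: supervised learning on a (much smaller) labeled set,
# using the learned features instead of the raw inputs.
X_labeled = rng.normal(size=(40, 20))
y = (X_labeled[:, 0] > 0).astype(int)        # synthetic binary labels
F = features(X_labeled)

# Simple classifier: assign each input to the nearest class centroid
# in the learned feature space.
centroids = np.stack([F[y == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    d = np.linalg.norm(features(x)[:, None, :] - centroids, axis=2)
    return d.argmin(axis=1)

train_acc = (predict(X_labeled) == y).mean()
```

Note that Stage 1 never looks at the labels, so the unlabeled set could be far larger than, and even drawn from a different source than, the labeled set; only Stage 2 requires labels.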
<!--
In terms of terminology, there are two common unsupervised feature learning
settings, depending on what type of unlabeled data you have. Let's explain this
with an example. Suppose your goal is to distinguish between images of cars and
images of motorcycles, so that each labeled example in your training set is an
image of a car or a motorcycle. If you also have a collection of similar images
where each image is just missing its label (so you don't know which ones are cars, and which
ones are motorcycles), then you could use that data to learn the features.
This setting---where each unlabeled example is drawn from the same
distribution as your labeled examples (and thus can be labeled either "car"
or "motorcycle")---is usually called the '''semi-supervised''' setting. The
setting where the unlabeled data does not need to come from the same
distribution as the labeled data---for example, random unlabeled images
downloaded off the internet---is usually called the
'''self-taught learning''' setting. In the self-taught learning setting, it is far easier to
obtain large amounts of unlabeled data, and thus leverage the potential of
learning from massive amounts of data.
-->
== Learning features ==