Softmax Regression
== Introduction ==
In these notes, we describe the '''Softmax regression''' model. This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>. We had considered the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>. Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^T x)},
\end{align}
</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function. In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only
two. Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>. I.e., we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values. Thus, our hypothesis will output a <math>k</math>-dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities. Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} }
\end{bmatrix}
\end{align}
</math>
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model. Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution, so that it sums to one.
For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k \times (n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{--} \theta_1^T \mbox{--} \\
\mbox{--} \theta_2^T \mbox{--} \\
\vdots \\
\mbox{--} \theta_k^T \mbox{--}
\end{bmatrix}
</math>
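To make this concrete, here is a minimal numpy sketch (ours, not part of the original notes) of the hypothesis, with <math>\theta</math> stored as a <math>k \times (n+1)</math> matrix as above; the name <code>softmax_hypothesis</code> is our own. Subtracting the largest score before exponentiating leaves the result unchanged (the factor cancels in the ratio) but avoids numerical overflow:

<pre>
import numpy as np

def softmax_hypothesis(theta, x):
    """Compute h_theta(x): a vector of the k class probabilities.

    theta : (k, n+1) array whose j-th row is theta_j^T
    x     : (n+1,) input vector (including the intercept term)
    """
    scores = theta @ x               # scores[j] = theta_j^T x
    scores = scores - scores.max()   # stabilizer; cancels in the ratio below
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # normalize so the entries sum to 1
</pre>

For example, with <math>k=3</math> classes and inputs in <math>\Re^5</math>, <code>softmax_hypothesis(np.zeros((3, 5)), np.ones(5))</code> returns the uniform distribution <code>[1/3, 1/3, 1/3]</code> over the three classes.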
== Cost Function ==
We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is the '''indicator function''', so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>. For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
\end{align}
</math>
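As another sketch (again ours, with the hypothetical name <code>softmax_cost</code>, and labels assumed to be integers in <math>\{1, \ldots, k\}</math>), the cost can be computed directly from this formula; note that for each example the indicator picks out exactly one term of the inner sum:

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    """Compute J(theta) for softmax regression.

    theta : (k, n+1) parameter matrix, row j is theta_j^T
    X     : (m, n+1) matrix with one example x^(i) per row
    y     : (m,) integer labels in {1, ..., k}
    """
    m = X.shape[0]
    scores = X @ theta.T                                  # (m, k); entry (i, j) is theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilizer; cancels in the log-ratio
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # 1{y^(i) = j} selects one log-probability per example; average and negate:
    return -np.mean(log_probs[np.arange(m), y - 1])
</pre>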
Notice that the softmax cost function generalizes the logistic regression cost function, which could also have been written:
<math>
\begin{align}
J(\theta) &= - \frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
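As a quick sanity check of this equivalence (ours, reusing the <code>softmax_cost</code> sketch above), for <math>k=2</math> the softmax cost with parameters <math>\theta_1, \theta_2</math> agrees with the logistic cost evaluated with the single parameter vector <math>\theta_2 - \theta_1</math>, once the labels <math>\{0, 1\}</math> are mapped to the classes <math>\{1, 2\}</math>:

<pre>
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.standard_normal((m, 5))       # 100 examples, inputs in R^5
y01 = rng.integers(0, 2, size=m)      # logistic-style labels in {0, 1}
theta = rng.standard_normal((2, 5))   # theta_1 and theta_2 stacked in rows

# Softmax cost with k = 2, mapping label 0 -> class 1 and label 1 -> class 2:
J_softmax = softmax_cost(theta, X, y01 + 1)

# Logistic cost with the single parameter vector theta_2 - theta_1:
h = 1.0 / (1.0 + np.exp(-X @ (theta[1] - theta[0])))
J_logistic = -np.mean(y01 * np.log(h) + (1 - y01) * np.log(1 - h))

assert np.isclose(J_softmax, J_logistic)   # the two costs agree
</pre>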