Softmax Regression

== Introduction ==

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values.  This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  (In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be <math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.)  With logistic regression, we were in the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:
<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>

If you have seen (perhaps in a different course) the justification of logistic regression as maximum likelihood estimation, you may have recognized minimizing this cost as equivalent to maximizing the log-likelihood <math>\textstyle \sum_i \log p(y^{(i)}|x^{(i)};\theta)</math>.

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two.  Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that our convention will be to index the classes starting from 1, rather than from 0.)  For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.  I.e., we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values.  Thus, our hypothesis will output a <math>k</math>-dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities.  Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model.  Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model.  When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
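
As a concrete illustration (this sketch is not part of the original notes; NumPy and the variable names below are assumptions made for the example), the hypothesis can be evaluated by exponentiating the <math>k</math> scores <math>\theta_j^T x</math> and normalizing.  Subtracting the maximum score first does not change the result (this uses the parameter-shift invariance discussed in the parameterization section below) but avoids numerical overflow:

<pre>
import numpy as np

# Evaluate the softmax hypothesis h_theta(x) for a single example.
# theta: k-by-(n+1) matrix whose rows are theta_1, ..., theta_k
# x: (n+1)-vector with x[0] = 1 as the intercept term
def h(theta, x):
    scores = theta.dot(x)              # theta_j^T x for each class j
    scores = scores - scores.max()     # shift for numerical stability; predictions unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Tiny example with k = 3 classes and n = 2 features.
theta = np.array([[ 0.1, -0.2,  0.3],
                  [ 0.0,  0.5, -0.1],
                  [ 0.2,  0.1,  0.4]])
x = np.array([1.0, 2.0, -1.0])
probs = h(theta, x)
print(probs, probs.sum())              # k probabilities that sum to 1
</pre>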
== Cost Function ==

We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.  For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>

Notice that this generalizes the logistic regression cost function, which could also have been written:
<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m  (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label.  Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.

There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS.  Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>
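
To see where this formula comes from (a brief sketch of the calculation, which the notes state without derivation): since only the <math>j = y^{(i)}</math> term survives the inner sum of the cost function,

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ \theta_{y^{(i)}}^T x^{(i)} - \log \sum_{l=1}^k e^{\theta_l^T x^{(i)}} \right],
\end{align}
</math>

and differentiating with respect to <math>\theta_j</math> gives

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ 1\{y^{(i)} = j\}\, x^{(i)} - \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}}}\, x^{(i)} \right],
\end{align}
</math>

which is exactly the expression above once the second term is recognized as <math>p(y^{(i)} = j | x^{(i)}; \theta)\, x^{(i)}</math>.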

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>.  For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
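
To make the update concrete, here is a small vectorized sketch (not part of the original notes; NumPy is assumed, the variable names are illustrative, and classes are indexed <tt>0, ..., k-1</tt> in the code rather than <tt>1, ..., k</tt> as in the text).  It computes <math>J(\theta)</math> and <math>\nabla_{\theta_j} J(\theta)</math> for all <math>j</math> at once, then runs a few iterations of gradient descent:

<pre>
import numpy as np

def softmax_probs(theta, X):
    # theta: k-by-(n+1); X: m-by-(n+1) design matrix with an intercept column of ones
    scores = X.dot(theta.T)                          # m-by-k matrix of theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)      # stability shift
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cost_and_grad(theta, X, y):
    m, k = X.shape[0], theta.shape[0]
    probs = softmax_probs(theta, X)                  # p(y = j | x; theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0                 # 1{y^(i) = j}
    cost = -np.sum(indicator * np.log(probs)) / m
    grad = -(indicator - probs).T.dot(X) / m         # k-by-(n+1); row j is the gradient wrt theta_j
    return cost, grad

# A few iterations of gradient descent: theta_j := theta_j - alpha * grad_j
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])   # intercept + 2 features
y = rng.integers(0, 3, size=100)                                 # 3 classes
theta = np.zeros((3, 3))
alpha = 0.5
for _ in range(200):
    J, grad = cost_and_grad(theta, X, y)
    theta -= alpha * grad
print("final cost:", J)
</pre>

In practice one would typically hand a function like <tt>cost_and_grad</tt> to an off-the-shelf optimizer such as L-BFGS rather than use the fixed-step loop above.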

When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay.  We describe the motivation and details below.

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters.  To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>).  Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>

In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all!  This shows that softmax regression's parameters are "redundant."  More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.

Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the minimizer of <math>J(\theta)</math> is not unique.  (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems.  But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's), without affecting the hypothesis.  Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>), without harming the representational power of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without arbitrarily setting one of them to zero.  But we will make one change to the cost function: adding weight decay.  This will take care of the numerical problems associated with softmax regression's overparameterized representation.
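
A quick numerical check of this invariance (an illustrative sketch, not part of the original notes; NumPy and the particular values are arbitrary assumptions):

<pre>
import numpy as np

# Subtracting the same psi from every theta_j leaves the predicted probabilities unchanged.
def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(1)
theta = rng.normal(size=(4, 3))        # k = 4 classes, n + 1 = 3 parameters per class
x = np.array([1.0, -0.5, 2.0])         # x_0 = 1 intercept term
psi = rng.normal(size=3)

p_original = softmax(theta.dot(x))
p_shifted = softmax((theta - psi).dot(x))
print(np.allclose(p_original, p_shifted))   # True
</pre>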

== Weight Decay ==

We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters.  Our cost function is now
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
              + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>

With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now strictly convex, and is guaranteed to have a unique solution.  The Hessian is now invertible, and because <math>J(\theta)</math> is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>.  One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j
\end{align}
</math>

By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
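
A corresponding sketch of the weight-decayed cost and gradient (again an illustration, not the notes' own code; NumPy is assumed, and, as in the formula above, the penalty is applied to every entry of <math>\theta</math>, including the intercept column):

<pre>
import numpy as np

def softmax_probs(theta, X):
    # theta: k-by-(n+1); X: m-by-(n+1) with an intercept column of ones
    scores = X.dot(theta.T)
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def cost_and_grad_decay(theta, X, y, lam):
    # y holds labels 0..k-1; lam is the weight decay parameter lambda
    m, k = X.shape[0], theta.shape[0]
    probs = softmax_probs(theta, X)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0
    # J(theta) = data term + (lambda / 2) * sum of squared parameters
    cost = -np.sum(indicator * np.log(probs)) / m + (lam / 2.0) * np.sum(theta ** 2)
    # each row j of the gradient gains the extra lambda * theta_j term
    grad = -(indicator - probs).T.dot(X) / m + lam * theta
    return cost, grad
</pre>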
== Relationship to Logistic Regression ==

In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.  This shows that softmax regression is a generalization of logistic regression.  Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x}  + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us
<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\end{bmatrix}
\end{align}
</math>

Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as <math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>, and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>, the same as logistic regression.
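
The reduction can also be checked numerically (an illustrative sketch, not part of the original notes; NumPy assumed): for <math>k=2</math>, the softmax probability of the second class equals the logistic sigmoid applied to <math>(\theta_2-\theta_1)^T x</math>.

<pre>
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
theta1, theta2 = rng.normal(size=3), rng.normal(size=3)
x = np.array([1.0, 0.7, -1.3])                      # intercept plus two features

p = softmax(np.array([theta1.dot(x), theta2.dot(x)]))
print(np.allclose(p[1], sigmoid((theta2 - theta1).dot(x))),        # True
      np.allclose(p[0], 1.0 - sigmoid((theta2 - theta1).dot(x))))  # True
</pre>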

== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are <math>k</math> types of music that you are trying to recognize.  Should you use a softmax classifier, or should you build <math>k</math> separate binary classifiers using logistic regression?

This will depend on whether your classes are ''mutually exclusive.''  For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with <math>k=4</math>.  (If there are also some examples that are none of the above four classes, then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)

If, however, your categories are has_vocals, dance, soundtrack, and pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals.  In this case, it would be more appropriate to build 4 binary logistic regression classifiers.  This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.

Now, consider a computer vision example, where you're trying to classify images into three different classes.  (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene.  Would you use softmax regression or three logistic regression classifiers?  (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people.  Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate.  In the second case, it would be more appropriate to build three separate logistic regression classifiers.

{{Softmax}}

{{Languages|Softmax回归|中文}}