Softmax Regression

(Properties of softmax regression parameterization)
 
Line 73: Line 73:
For convenience, we will also write
<math>\theta</math> to denote all the
- parameters of our model.  When you implement softmax regression, is is usually
+ parameters of our model.  When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that
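
A minimal numpy sketch of this parameter layout might look as follows (the sizes and variable names are made up purely for illustration):

<pre>
import numpy as np

# Hypothetical sizes, just for illustration.
k, n = 10, 8

# theta stored as a k-by-(n+1) matrix: row j holds theta_j
# (including the intercept term).
theta = np.zeros((k, n + 1))

# For a single input x (with a leading 1 for the intercept), all k
# inner products theta_j^T x become one matrix-vector product.
x = np.concatenate(([1.0], np.random.randn(n)))
scores = theta @ x          # shape (k,)
</pre>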
Line 202: Line 202:
regression's parameters are "redundant."  More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
- fit to the data, there're multiple parameter settings that give rise to exactly
+ fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.
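
This redundancy is easy to check numerically; the short sketch below (illustrative names, not taken from the tutorial) subtracts the same vector <math>\psi</math> from every <math>\theta_j</math> and confirms that the predicted probabilities do not change:

<pre>
import numpy as np

def softmax_probs(theta, x):
    # h_theta(x): vector of class probabilities for a single input x.
    scores = theta @ x
    scores = scores - scores.max()      # subtract a constant for stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
k, n = 4, 5                             # made-up sizes
theta = rng.normal(size=(k, n + 1))
x = rng.normal(size=n + 1)
psi = rng.normal(size=n + 1)

# Shifting every theta_j by the same psi changes the parameters but not the
# hypothesis: the common factor cancels in the softmax ratio.
assert np.allclose(softmax_probs(theta, x),
                   softmax_probs(theta - psi, x))
</pre>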
Line 241: Line 241:
We will modify the cost function by adding a weight decay term
- <math>\frac{\lambda}{2} \sum_{i=1}^k \sum_{j=1}^{n+1} \theta_{ij}^2</math>
+ <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>
which penalizes large values of the parameters.  Our cost function is now
<math>
\begin{align}
- J(\theta) = - \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{\theta_j^T x^{(i)}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
+ J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
-                + \frac{\lambda}{2} \sum_{i} \sum_{j} \theta_{ij}^2
+                + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
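
One possible numpy implementation of this weight-decayed cost, assuming labels coded <code>0..k-1</code> and a design matrix whose rows already include the intercept term (the name <code>softmax_cost</code> is just illustrative):

<pre>
import numpy as np

def softmax_cost(theta, X, y, lam):
    # theta: (k, n+1) parameter matrix; X: (m, n+1) inputs with an intercept
    # column of ones; y: (m,) labels coded 0..k-1; lam: weight decay lambda.
    m = X.shape[0]
    scores = X @ theta.T                                   # theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)    # stability only
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -log_probs[np.arange(m), y].sum() / m
    decay_term = 0.5 * lam * np.sum(theta ** 2)            # sum over all theta_ij
    return data_term + decay_term
</pre>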
Line 257: Line 257:
to converge to the global minimum.
- To implement these optimization algorithms, we also need the derivative of this
+ To apply an optimization algorithm, we also need the derivative of this
new definition of <math>J(\theta)</math>.  One can show that the derivative is:
<math>
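
A matching gradient sketch, assuming the standard derivative of this cost, <math>\textstyle \nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ x^{(i)} \left( 1\{y^{(i)} = j\} - P(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_j</math>, with the same conventions as the cost sketch above:

<pre>
import numpy as np

def softmax_grad(theta, X, y, lam):
    # Gradient of the weight-decayed cost with respect to theta:
    # grad_j = -(1/m) sum_i x^{(i)} (1{y^{(i)}=j} - P(y^{(i)}=j | x^{(i)})) + lam * theta_j
    m = X.shape[0]
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    probs = e / e.sum(axis=1, keepdims=True)       # (m, k) class probabilities
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0               # 1{y^{(i)} = j}
    return -(indicator - probs).T @ X / m + lam * theta   # (k, n+1)
</pre>

Comparing this against a finite-difference estimate of <math>J(\theta)</math> is a useful sanity check before handing both functions to an optimizer.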
Line 301: Line 301:
== Relationship to Logistic Regression ==
- In the special case where <math>k = 2</math>, one can also show that softmax regression reduces to logistic regression.
+ In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
- This shows that softmax regression is a generalization of logistic regression.  Concretely, our hypothesis outputs
+ This shows that softmax regression is a generalization of logistic regression.  Concretely, when <math>k=2</math>,
+ the softmax regression hypothesis outputs
<math>
\begin{align}
- h(x) &=
+ h_\theta(x) &=
\frac{1}{ e^{\theta_1^Tx}  + e^{ \theta_2^T x^{(i)} } }
Line 317: Line 318:
Taking advantage of the fact that this hypothesis
- is overparameterized and setting <math>\psi - =\theta_1</math>,
+ is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us
Line 352: Line 353:
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
same as logistic regression.
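
This equivalence can be confirmed numerically; the sketch below (illustrative, with <math>\theta' = \theta_2 - \theta_1</math> as in the derivation) checks that the softmax probability of the second class matches the logistic sigmoid of <math>(\theta')^T x</math>:

<pre>
import numpy as np

# Numerical check of the k = 2 reduction: the softmax probability of class 2
# equals the logistic sigmoid of (theta')^T x with theta' = theta_2 - theta_1.
rng = np.random.default_rng(0)
n = 5                                        # made-up input size
theta1, theta2 = rng.normal(size=(2, n + 1))
x = rng.normal(size=n + 1)

scores = np.array([theta1 @ x, theta2 @ x])
p2_softmax = np.exp(scores[1]) / np.exp(scores).sum()

theta_prime = theta2 - theta1
p2_logistic = 1.0 / (1.0 + np.exp(-(theta_prime @ x)))

assert np.isclose(p2_softmax, p2_logistic)
</pre>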
-
== Softmax Regression vs. k Binary Classifiers ==
Line 380: Line 380:
or three logistic regression classifiers?  (ii) Now suppose your classes are
indoor_scene, black_and_white_image, and image_has_people.  Would you use softmax
- regression of multiple logistic regression classifiers?
+ regression or multiple logistic regression classifiers?
In the first case, the classes are mutually exclusive, so a softmax regression
classifier would be appropriate.  In the second case, it would be more appropriate to build
three separate logistic regression classifiers.
+
+
+ {{Softmax}}
+
+
+ {{Languages|Softmax回归|中文}}

Latest revision as of 13:24, 7 April 2013
