Softmax Regression

Properties of softmax regression parameterization
Softmax regression has an unusual property: it has a "redundant" set of parameters.  To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>).  Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all.
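To see this invariance concretely, here is a minimal NumPy sketch (not part of the original notes): it builds a random parameter matrix whose rows are <math>\theta_1, \ldots, \theta_k</math>, a random input <math>x^{(i)}</math>, and a random shift <math>\psi</math>, and checks that the predicted probabilities are identical before and after the shift.  All names and the random data are illustrative.

<pre>
import numpy as np

# Rows of Theta are theta_1, ..., theta_k; x is one (intercept-augmented) input,
# so each theta_j lives in R^{n+1}.  All values below are random placeholders.
rng = np.random.default_rng(0)
k, n_plus_1 = 3, 5
Theta = rng.normal(size=(k, n_plus_1))
x = rng.normal(size=n_plus_1)
psi = rng.normal(size=n_plus_1)          # an arbitrary fixed vector psi

def softmax_probs(Theta, x):
    """p(y = j | x; theta) for j = 1, ..., k."""
    scores = Theta @ x
    scores = scores - scores.max()       # shift for numerical stability only
    e = np.exp(scores)
    return e / e.sum()

p_original = softmax_probs(Theta, x)
p_shifted = softmax_probs(Theta - psi, x)   # every theta_j replaced by theta_j - psi

print(np.allclose(p_original, p_shifted))   # True: the predictions are unchanged
</pre>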
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the minimizer of <math>J(\theta)</math> is not unique.  (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems.  But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
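The first claim in this paragraph can also be checked numerically.  The following sketch (again not part of the original notes; the dataset is synthetic and all names are illustrative) evaluates the unregularized softmax cost <math>J(\theta)</math>, i.e. the negative average log-likelihood, at <math>\theta</math> and at <math>\theta - \psi</math> and confirms that the two values agree, which is why the minimizer cannot be unique.

<pre>
import numpy as np

# Synthetic data: m examples with n+1 features (intercept folded into X),
# labels in {0, ..., k-1}.
rng = np.random.default_rng(1)
m, n_plus_1, k = 20, 5, 3
X = rng.normal(size=(m, n_plus_1))
y = rng.integers(0, k, size=m)

def softmax_cost(Theta, X, y):
    """Unregularized softmax cost: negative average log-likelihood."""
    scores = X @ Theta.T                                 # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(m), y].mean()

Theta = rng.normal(size=(k, n_plus_1))
psi = rng.normal(size=n_plus_1)

# J(theta) and J(theta - psi) are equal, so any minimizer of J remains a
# minimizer after the shift.
print(np.isclose(softmax_cost(Theta, X, y), softmax_cost(Theta - psi, X, y)))
</pre>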
Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's), without affecting the hypothesis.  Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>), without harming the representational power of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
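As an illustration of this reduced parameterization (a sketch, not from the original notes), the snippet below takes an arbitrary full parameter matrix, subtracts <math>\psi = \theta_1</math> so that the first row becomes <math>\vec{0}</math>, keeps only the remaining <math>(k-1)(n+1)</math> entries as the free parameters, and verifies that the rebuilt hypothesis makes exactly the same predictions as the original one.

<pre>
import numpy as np

rng = np.random.default_rng(2)
k, n_plus_1 = 3, 5
x = rng.normal(size=n_plus_1)

def softmax_probs(Theta, x):
    scores = Theta @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

Theta_full = rng.normal(size=(k, n_plus_1))   # all k(n+1) parameters

# Choose psi = theta_1: after the shift the first row is the zero vector.
Theta_shifted = Theta_full - Theta_full[0]

# Only these (k-1)(n+1) numbers would actually be optimized.
free_params = Theta_shifted[1:]

# Rebuild the full matrix with theta_1 fixed at the zero vector.
Theta_rebuilt = np.vstack([np.zeros(n_plus_1), free_params])

print(np.allclose(softmax_probs(Theta_full, x),
                  softmax_probs(Theta_rebuilt, x)))   # True: same hypothesis
</pre>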
