Softmax Regression
Properties of softmax regression parameterization
Softmax regression has an unusual property: its parameterization is "redundant." To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math> and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j = 1, \ldots, k</math>). Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)}; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all.
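This identity is easy to verify numerically. The sketch below (a NumPy illustration; the names Theta, psi, and softmax_probs are our own, not from the text) checks that shifting every <math>\theta_j</math> by the same <math>\psi</math> leaves the predicted probabilities unchanged:

<pre>
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 5                            # k classes, n features (+1 for the intercept term)
Theta = rng.normal(size=(k, n + 1))    # row j is theta_j^T
x = rng.normal(size=n + 1)             # one example x^(i), intercept included
psi = rng.normal(size=n + 1)           # an arbitrary fixed shift vector

def softmax_probs(Theta, x):
    scores = Theta @ x
    scores -= scores.max()             # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p_original = softmax_probs(Theta, x)
p_shifted = softmax_probs(Theta - psi, x)   # theta_j - psi for every j (broadcast over rows)
print(np.allclose(p_original, p_shifted))   # True
</pre>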
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
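One can see the singularity concretely by checking that every "shift" direction <math>(\psi, \psi, \ldots, \psi)</math> lies in the null space of the Hessian. The sketch below assumes the standard single-example Hessian of the softmax cost, whose <math>(j, l)</math> block is <math>(p_j \delta_{jl} - p_j p_l)\, x x^T</math>; that block structure follows from differentiating the hypothesis, but it is not derived in the text:

<pre>
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 2
Theta = rng.normal(size=(k, n + 1))
x = rng.normal(size=n + 1)

scores = Theta @ x
p = np.exp(scores - scores.max())
p /= p.sum()                           # class probabilities for this example

# Assemble the full k(n+1) x k(n+1) Hessian via a Kronecker product.
S = np.diag(p) - np.outer(p, p)        # k x k class-coupling matrix
H = np.kron(S, np.outer(x, x))         # (j, l) block is S[j, l] * x x^T

psi = rng.normal(size=n + 1)
d = np.tile(psi, k)                    # direction (psi, psi, ..., psi)
print(np.allclose(H @ d, 0))           # True: d is in the null space
print(np.linalg.matrix_rank(H), H.shape[0])  # rank < k(n+1), so H is singular
</pre>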
Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's) without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>) without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
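A minimal sketch of this reparameterization (again with illustrative names): choosing <math>\psi = \theta_1</math> zeroes out the first parameter vector while leaving every prediction intact, so only the <math>(k-1)(n+1)</math> entries in the remaining rows are free:

<pre>
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 5
Theta = rng.normal(size=(k, n + 1))
X = rng.normal(size=(10, n + 1))       # 10 examples, intercept included

def softmax_probs(Theta, X):
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

Theta_pinned = Theta - Theta[0]        # psi = theta_1, so row 0 becomes the zero vector
assert np.allclose(Theta_pinned[0], 0)
print(np.allclose(softmax_probs(Theta, X), softmax_probs(Theta_pinned, X)))  # True
</pre>

Since every parameter setting has an equivalent representative with <math>\theta_1 = \vec{0}</math>, an optimizer that updates only the remaining <math>k-1</math> rows loses nothing in representational power.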