Softmax Regression
Properties of softmax regression parameterization
Softmax regression has an unusual property: its parameterization is "redundant." To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math> and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j = 1, \ldots, k</math>). Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)}; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all.
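This identity is easy to verify numerically. The sketch below (a NumPy illustration; the names Theta, psi, and softmax_probs are our own, not from the text) checks that shifting every <math>\theta_j</math> by the same <math>\psi</math> leaves the predicted probabilities unchanged:

<pre>
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 5                            # k classes, n features (+1 for the intercept term)
Theta = rng.normal(size=(k, n + 1))    # row j is theta_j^T
x = rng.normal(size=n + 1)             # one example x^(i), intercept included
psi = rng.normal(size=n + 1)           # an arbitrary fixed shift vector

def softmax_probs(Theta, x):
    scores = Theta @ x
    scores -= scores.max()             # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p_original = softmax_probs(Theta, x)
p_shifted = softmax_probs(Theta - psi, x)   # theta_j - psi for every j (broadcast over rows)
print(np.allclose(p_original, p_shifted))   # True
</pre>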
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
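One can see the singularity concretely by checking that every "shift" direction <math>(\psi, \psi, \ldots, \psi)</math> lies in the null space of the Hessian. The sketch below assumes the standard single-example Hessian of the softmax cost, whose <math>(j, l)</math> block is <math>(p_j \delta_{jl} - p_j p_l)\, x x^T</math>; that block structure follows from differentiating the hypothesis, but it is not derived in the text:

<pre>
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 2
Theta = rng.normal(size=(k, n + 1))
x = rng.normal(size=n + 1)

scores = Theta @ x
p = np.exp(scores - scores.max())
p /= p.sum()                           # class probabilities for this example

# Assemble the full k(n+1) x k(n+1) Hessian via a Kronecker product.
S = np.diag(p) - np.outer(p, p)        # k x k class-coupling matrix
H = np.kron(S, np.outer(x, x))         # (j, l) block is S[j, l] * x x^T

psi = rng.normal(size=n + 1)
d = np.tile(psi, k)                    # direction (psi, psi, ..., psi)
print(np.allclose(H @ d, 0))           # True: d is in the null space
print(np.linalg.matrix_rank(H), H.shape[0])  # rank < k(n+1), so H is singular
</pre>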
Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's) without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>) without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
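A minimal sketch of this reparameterization (again with illustrative names): choosing <math>\psi = \theta_1</math> zeroes out the first parameter vector while leaving every prediction intact, so only the <math>(k-1)(n+1)</math> entries in the remaining rows are free:

<pre>
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 5
Theta = rng.normal(size=(k, n + 1))
X = rng.normal(size=(10, n + 1))       # 10 examples, intercept included

def softmax_probs(Theta, X):
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

Theta_pinned = Theta - Theta[0]        # psi = theta_1, so row 0 becomes the zero vector
assert np.allclose(Theta_pinned[0], 0)
print(np.allclose(softmax_probs(Theta, X), softmax_probs(Theta_pinned, X)))  # True
</pre>

Since every parameter setting has an equivalent representative with <math>\theta_1 = \vec{0}</math>, an optimizer that updates only the remaining <math>k-1</math> rows loses nothing in representational power.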