Softmax Regression

With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
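
As a rough illustration (a sketch that is not part of the original text: it uses Python with NumPy and SciPy's L-BFGS-B routine as a stand-in for the MATLAB minFunc setup, on hypothetical toy data), maximizing <math>\ell(\theta)</math> amounts to minimizing the negative log-likelihood with an off-the-shelf L-BFGS optimizer:

<pre>
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta_flat, X, y, n_classes):
    """Negative softmax log-likelihood, -l(theta), with theta stored one row per class."""
    m, d = X.shape
    theta = theta_flat.reshape(n_classes, d)
    scores = X @ theta.T                                   # (m, n_classes)
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(m), y].sum()

# Hypothetical toy data: m examples with d features and n_classes classes.
rng = np.random.default_rng(0)
m, d, n_classes = 200, 5, 3
X = rng.normal(size=(m, d))
y = rng.integers(0, n_classes, size=m)

result = minimize(neg_log_likelihood, np.zeros(n_classes * d),
                  args=(X, y, n_classes), method="L-BFGS-B")
theta_hat = result.x.reshape(n_classes, d)                 # fitted parameters
</pre>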

=== Weight Regularization ===

When using softmax regression in practice, it is important to use weight regularization. In particular, if there exists a linear separator that perfectly classifies all the data points, then the softmax objective has no finite maximizer: given any <math>\theta</math> that separates the data perfectly, one can always scale <math>\theta</math> to be larger and obtain a strictly better objective value, so the weights grow without bound. With weight regularization, one penalizes large weights and thus avoids these degenerate situations.
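
To make the degenerate case concrete (again a sketch, not from the original text, using a hypothetical two-point dataset), one can check numerically that scaling a perfectly separating <math>\theta</math> keeps improving the log-likelihood:

<pre>
import numpy as np

# Hypothetical toy data: two points, two classes, perfectly separated by theta.
X = np.array([[ 1.0, 1.0],
              [-1.0, 1.0]])          # second feature acts as a bias term
y = np.array([0, 1])
theta = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])      # classifies both points correctly

for c in [1, 2, 5, 10]:
    scores = X @ (c * theta).T
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    print(c, log_probs[np.arange(2), y].sum())   # l(c * theta)
# The log-likelihood strictly increases toward 0 as c grows,
# so no finite theta maximizes the unregularized objective.
</pre>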

Weight regularization is also important because it often results in models that generalize better. In particular, one can view weight regularization as placing a (Gaussian) prior on <math>\theta</math>, so as to prefer <math>\theta</math> with smaller values.

In practice, we often use L2 weight regularization, in which we penalize the squared value of each element of <math>\theta</math>. Formally, we use:

<math>
\frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
</math>

This regularization term is combined with the log-likelihood to give a cost function, <math>J(\theta)</math>, which we want to '''minimize''' (note that we have negated the log-likelihood, so that minimizing the cost function corresponds to maximizing the log-likelihood):

<math>
\begin{align}
J(\theta) = -\ell(\theta) + \frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
\end{align}
</math>

Taking derivatives, the gradient of the cost function with respect to <math>\theta_k</math> is:

<math>
\begin{align}
\frac{\partial J(\theta)}{\partial \theta_k}
&= -\sum_{i}{ x^{(i)} \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) } + \lambda \theta_k
\end{align}
</math>

Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
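
A minimal sketch of this cost and gradient (not from the original text; it is written in Python/NumPy rather than the MATLAB/minFunc setting, and the array shapes are assumptions):

<pre>
import numpy as np

def softmax_cost_and_grad(theta, X, y, lam):
    """Regularized cost J(theta) = -l(theta) + (lam/2) * sum(theta**2) and its gradient.

    theta: (n_classes, d) parameters; X: (m, d) inputs; y: (m,) labels in {0, ..., n_classes-1}.
    """
    m = X.shape[0]
    scores = X @ theta.T                                  # (m, n_classes)
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)             # P(y = k | x) for every example and class

    log_lik = np.log(probs[np.arange(m), y]).sum()        # l(theta)
    cost = -log_lik + 0.5 * lam * np.sum(theta ** 2)      # J(theta)

    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0                      # I{y^(i) = k}
    grad = -(indicator - probs).T @ X + lam * theta       # row k holds dJ/d(theta_k)
    return cost, grad
</pre>

The returned (cost, gradient) pair can be handed directly to a gradient-based optimizer such as the L-BFGS routine mentioned earlier, and checked against a finite-difference approximation.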

== Parameterization ==

We noted earlier that we actually only need <math>n - 1</math> parameters to model <math>n</math> classes. To see why this is so, consider our hypothesis again:
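
The key observation (stated here with a numerical check that is not part of the original text) is that the hypothesis is overparameterized: subtracting the same vector <math>\psi</math> from every <math>\theta_j</math> shifts all the scores <math>\theta_j^T x</math> by the same amount and therefore leaves the predicted probabilities unchanged, so one of the <math>\theta_j</math> can be fixed (say, to zero) without losing any expressive power.

<pre>
import numpy as np

def softmax_probs(theta, X):
    """P(y = k | x) under the softmax hypothesis, for theta of shape (n_classes, d)."""
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical parameters and inputs: 3 classes, 4 features, 5 examples.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
X = rng.normal(size=(5, 4))
psi = rng.normal(size=4)           # arbitrary vector subtracted from every theta_j

print(np.allclose(softmax_probs(theta, X), softmax_probs(theta - psi, X)))   # True
</pre>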

=== Logistic regression ===

In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression.
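
One way to see this (a sketch of the standard argument; the notation, but not necessarily the exact derivation, follows the rest of the article): with two classes the hypothesis can be written as

<math>
\begin{align}
h_\theta(x) = \frac{1}{ e^{\theta_1^T x} + e^{\theta_2^T x} }
\begin{bmatrix}
e^{\theta_1^T x} \\
e^{\theta_2^T x}
\end{bmatrix}
= \begin{bmatrix}
\frac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \\
\frac{e^{(\theta_2 - \theta_1)^T x}}{1 + e^{(\theta_2 - \theta_1)^T x}}
\end{bmatrix}
\end{align}
</math>

so, writing <math>\theta' = \theta_2 - \theta_1</math>, the probability of one class is the logistic (sigmoid) function <math>1/(1 + e^{\theta'^T x})</math> and the probability of the other is its complement, which is exactly the logistic regression hypothesis with parameter vector <math>\theta'</math>.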
