Softmax Regression
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
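minFunc is a MATLAB package; as a rough sketch of the same idea (with made-up toy data, and SciPy's L-BFGS implementation standing in for minFunc), one can hand the negative log-likelihood <math>-\ell(\theta)</math> to an off-the-shelf optimizer:

<pre>
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta_flat, X, y, n_classes):
    """Negative softmax log-likelihood, -l(theta), over the whole training set."""
    m, n_features = X.shape
    theta = theta_flat.reshape(n_classes, n_features)
    scores = X @ theta.T                              # one score per example and class
    scores -= scores.max(axis=1, keepdims=True)       # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(m), y].sum()

# Hypothetical toy problem: 100 examples, 3 features, 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 4, size=100)

result = minimize(neg_log_likelihood, np.zeros(4 * 3),
                  args=(X, y, 4), method='L-BFGS-B')
theta_hat = result.x.reshape(4, 3)                    # parameters maximizing l(theta)
</pre>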
=== Weight Regularization ===
When using softmax regression in practice, it is important to use weight regularization. In particular, if there exists a linear separator that perfectly classifies all the data points, then the softmax objective is unbounded (given any <math>\theta</math> that separates the data perfectly, one can always scale <math>\theta</math> to be larger and obtain a better objective value). With weight regularization, one penalizes the weights for being large and thus avoids these degenerate situations.
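As a small, made-up illustration of this failure mode, the snippet below evaluates the (unregularized) log-likelihood of a perfectly separable two-class toy set at <math>c \, \theta</math> for increasing <math>c</math>; every rescaling improves the objective, so it has no finite maximizer:

<pre>
import numpy as np

def log_likelihood(theta, X, y):
    """Softmax log-likelihood l(theta); theta holds one weight vector per class."""
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()

# Perfectly separable 1-D data (second column is a constant bias feature).
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])
theta = np.array([[-1.0, 0.0],    # class-0 weights
                  [ 1.0, 0.0]])   # class-1 weights: a perfect separator

for c in [1.0, 10.0, 100.0]:
    print(c, log_likelihood(c * theta, X, y))   # keeps increasing towards 0 as c grows
</pre>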
Weight regularization is also important as it often results in models that generalize better. In particular, one can view weight regularization as placing a (Gaussian) prior on <math>\theta</math> so as to prefer <math>\theta</math> with smaller values.
In practice, we often use L2 weight regularization, in which we penalize the squared value of each element of <math>\theta</math>. Formally, we use:
<math>
\frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
</math>
This regularization term is added to the negative log-likelihood to give a cost function, <math>J(\theta)</math>, which we want to '''minimize''' (minimizing the negative log-likelihood corresponds to maximizing the log-likelihood):
<math>
\begin{align}
J(\theta) = -\ell(\theta) + \frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
\end{align}
</math>
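For concreteness, this cost might be transcribed into NumPy as in the sketch below (the names <code>softmax_cost</code> and <code>lam</code> for <math>\lambda</math>, and the shape conventions, are illustrative choices rather than part of the text):

<pre>
import numpy as np

def softmax_cost(theta, X, y, lam):
    """J(theta) = -l(theta) + (lam / 2) * sum_{i,j} theta_{ij}^2."""
    scores = X @ theta.T                               # (m, n_classes)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    neg_log_lik = -log_probs[np.arange(len(y)), y].sum()
    return neg_log_lik + 0.5 * lam * np.sum(theta ** 2)
</pre>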
Taking derivatives, one finds that the gradient of <math>J(\theta)</math> with respect to <math>\theta_k</math> is:

<math>
\begin{align}
\frac{\partial J(\theta)}{\partial \theta_k}
&= -\sum_{i}{ x^{(i)} \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) } + \lambda \theta_k
\end{align}
</math>
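In vectorized form this gradient is short to write down, and it is worth checking against finite differences of the cost; the sketch below reuses the hypothetical <code>softmax_cost</code> and shape conventions from above:

<pre>
import numpy as np

def softmax_grad(theta, X, y, lam):
    """Gradient of J(theta); row k is the partial derivative with respect to theta_k."""
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)          # P(y = k | x) for every example and class
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(y)), y] = 1.0                 # indicator I{y^(i) = k}
    return (probs - onehot).T @ X + lam * theta        # = -sum_i x^(i) (I - P) + lam * theta

# Finite-difference check on tiny random data (uses softmax_cost from the sketch above).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3)); y = rng.integers(0, 4, size=5)
theta = rng.normal(size=(4, 3)); lam, eps = 0.1, 1e-5
numeric = np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    d = np.zeros_like(theta); d[idx] = eps
    numeric[idx] = (softmax_cost(theta + d, X, y, lam) -
                    softmax_cost(theta - d, X, y, lam)) / (2 * eps)
print(np.max(np.abs(numeric - softmax_grad(theta, X, y, lam))))   # should be tiny (~1e-9)
</pre>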
Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
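Putting the pieces together, a minimal sketch of the regularized fit (reusing the hypothetical <code>softmax_cost</code>, <code>softmax_grad</code>, and toy data above, with the parameters flattened so the optimizer sees a single vector) might look like:

<pre>
from scipy.optimize import minimize
import numpy as np

def cost_and_grad(theta_flat, X, y, lam, n_classes):
    """Return J(theta) and its gradient in the flat layout L-BFGS expects."""
    theta = theta_flat.reshape(n_classes, X.shape[1])
    return softmax_cost(theta, X, y, lam), softmax_grad(theta, X, y, lam).ravel()

result = minimize(cost_and_grad, np.zeros(4 * 3), args=(X, y, 0.1, 4),
                  jac=True, method='L-BFGS-B')
theta_hat = result.x.reshape(4, 3)    # the regularized softmax regression solution
</pre>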
== Parameterization ==
We noted earlier that we actually only need <math>n - 1</math> parameters to model <math>n</math> classes. To see why this is so, consider our hypothesis again:
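For a brief sketch, recall that the hypothesis assigns <math>P(y = k | x) = \frac{\exp(\theta_k^T x)}{\sum_{j=1}^{n}{\exp(\theta_j^T x)}}</math> (the standard softmax form assumed here). For any fixed vector <math>\psi</math>, replacing every <math>\theta_j</math> by <math>\theta_j - \psi</math> leaves these probabilities unchanged:

<math>
\begin{align}
P(y = k | x)
&= \frac{ \exp( (\theta_k - \psi)^T x ) }{ \sum_{j=1}^{n}{ \exp( (\theta_j - \psi)^T x ) } }
= \frac{ \exp( \theta_k^T x ) \exp( -\psi^T x ) }{ \exp( -\psi^T x ) \sum_{j=1}^{n}{ \exp( \theta_j^T x ) } }
= \frac{ \exp( \theta_k^T x ) }{ \sum_{j=1}^{n}{ \exp( \theta_j^T x ) } }
\end{align}
</math>

In particular, taking <math>\psi = \theta_n</math> sets the <math>n</math>-th parameter vector to zero, so only <math>n - 1</math> of the parameter vectors carry any information.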
=== Logistic regression ===
In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression:
<math>