Softmax Regression

== Introduction ==

'''Softmax regression''', also known as '''multinomial logistic regression''', is a generalisation of logistic regression to problems where there are more than 2 class labels. 

Recall that in logistic regression, our hypothesis was of the form:

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>

where we trained the logistic regression weights to optimize the (conditional) log-likelihood of the dataset using <math> p(y = 1 | x) = h_\theta(x) </math>. In softmax regression, we are interested in multi-class problems where each example (for instance, an input image) is assigned one of <math>n</math> labels. One example of a multi-class classification problem is classifying digits in the MNIST dataset, where each example has one of 10 possible labels (i.e., the digit 0, 1, ..., or 9).


To extend the logistic regression framework, which outputs only a single probability value, we consider a hypothesis that outputs one value per class (summing to 1), representing the predicted probability distribution over the classes. Formally, consider the classification problem where we have <math>m</math> inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math>, each <math>k</math>-dimensional, with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes.


Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that

<math>
\begin{align} 
h(x^{(i)}) = 
\begin{bmatrix} 
P(y^{(i)} = 1 | x^{(i)}) \\ 
P(y^{(i)} = 2 | x^{(i)}) \\ 
\vdots \\ 
P(y^{(i)} = n | x^{(i)}) 
\end{bmatrix}
= 
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix} 
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution so that it sums to one. 
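The hypothesis can be sketched in a few lines of plain Python. This is only an illustration: the function name <tt>hypothesis</tt>, the list-of-lists layout for the parameters, and the example numbers are assumptions, not part of the text. Subtracting the largest score before exponentiating is a common numerical safeguard that is also not in the text; it cancels in the ratio, so the probabilities are unchanged.

```python
import math

def hypothesis(theta, x):
    """Return [P(y=1|x), ..., P(y=n|x)] for a single input x.

    theta is a list of n parameter vectors (one per class), each with the
    same length as x.
    """
    # Scores theta_j^T x, one per class.
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    # Shift by the largest score to avoid overflow in exp; the shift cancels
    # between numerator and denominator.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)  # the normalizing term
    return [e / total for e in exps]

# Hypothetical example: n = 3 classes, 2-dimensional input.
theta = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
x = [0.5, 2.0]
probs = hypothesis(theta, x)
print(probs, sum(probs))
```

The printed probabilities are all positive and sum to one (up to rounding), which is the effect of the normalizing term.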

''Strictly speaking, we only need <math>n - 1</math> parameter vectors for <math>n</math> classes, but for convenience, we use <math>n</math> parameter vectors in our derivation.''

This hypothesis defines a predicted probability distribution over the labels given an input <math>x^{(i)}</math>, namely <math>P(y | x^{(i)}) = h(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }</math>.

{{Quote|
Motivation: One reason for selecting this form of hypotheses comes from connections to linear discriminant analysis. In particular, if one assumes a generative model for the data in the form <math>p(x,y) = p(y) \times p(x | y)</math> and selects for <math>p(x | y)</math> a member of the exponential family (which includes Gaussians, Poissons, etc.) it is possible to show that the conditional probability <math>p(y | x)</math> has the same form as our chosen hypotheses <math>h(x)</math>. For more details, see the [http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2 Notes].
}}

== Optimizing Softmax Regression ==

Expanding the log-likelihood expression, we find that:

<math>
\begin{align}
\ell(\theta) &= \ln L(\theta; x, y) \\
&= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\
&= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\
&= \sum_{i=1}^{m}{\left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }}\right]}
\end{align}
</math>
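The last line of this expansion translates directly into code. The following is a minimal sketch in plain Python, assuming 1-based labels as in the text; the function name <tt>log_likelihood</tt> and the toy data are made up for illustration. The log of the normalizer is computed with the usual log-sum-exp shift, which is a numerical detail not discussed in the text.

```python
import math

def log_likelihood(theta, xs, ys):
    """Sum over examples of theta_y^T x - ln sum_j exp(theta_j^T x).

    Labels in ys are 1-based (1..n), matching the text.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
        top = max(scores)  # log-sum-exp shift for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[y - 1] - log_norm
    return total

# Hypothetical toy data: m = 3 examples, n = 3 classes, 2-dimensional inputs.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1, 2, 3]
zero_theta = [[0.0, 0.0] for _ in range(3)]
better_theta = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
print(log_likelihood(zero_theta, xs, ys))
print(log_likelihood(better_theta, xs, ys))
```

With all parameters at zero every class receives probability 1/3, so the log-likelihood equals <math>m \ln(1/n)</math>; parameters that favour the correct classes give a larger value.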

Unfortunately, there is no closed-form solution to this optimization problem (although the objective is concave), so we usually use an off-the-shelf optimization method (e.g., L-BFGS or stochastic gradient descent) to find the optimal parameters. These methods require the gradient of <math>\ell(\theta)</math> with respect to each <math>\theta_{k}</math> (here <math>k</math> indexes a class), which can be derived as follows:

<math>
\begin{align}
\frac{\partial \ell(\theta)}{\partial \theta_k} &= \frac{\partial}{\partial \theta_k} \sum_{i=1}^{m}{\left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }}\right]} \\

&= \sum_{i=1}^{m}{ \left[ I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\cdot
e^{ \theta_k^T x^{(i)} } 
\cdot
x^{(i)} \right]}
\qquad \text{(where } I_{ \{ y^{(i)} = k\}  } \text{ is 1 when } y^{(i)} = k \text{ and 0 otherwise) }  \\

&= \sum_{i=1}^{m}{ \left[ x^{(i)} ( I_{ \{ y^{(i)} = k\} }  - P(y^{(i)} = k | x^{(i)}) ) \right]  }
\end{align}
</math>
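A useful habit when implementing this gradient is to compare it against finite differences of the log-likelihood. The sketch below does that in plain Python; the helper names (<tt>softmax</tt>, <tt>dot</tt>, <tt>gradient</tt>), the toy data, and the step size <tt>eps</tt> are all illustrative assumptions.

```python
import math

def softmax(scores):
    top = max(scores)  # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log_likelihood(theta, xs, ys):
    return sum(math.log(softmax([dot(t, x) for t in theta])[y - 1])
               for x, y in zip(xs, ys))

def gradient(theta, xs, ys):
    """Gradient of the log-likelihood; grad[k] has the same shape as theta[k]."""
    grad = [[0.0] * len(theta[0]) for _ in theta]
    for x, y in zip(xs, ys):
        probs = softmax([dot(t, x) for t in theta])
        for k in range(len(theta)):
            indicator = 1.0 if y == k + 1 else 0.0  # labels are 1-based
            for d, xd in enumerate(x):
                grad[k][d] += xd * (indicator - probs[k])
    return grad

# Hypothetical toy data: 3 examples, 3 classes, 2-dimensional inputs.
xs = [[1.0, 0.5], [-0.3, 2.0], [0.8, -1.2]]
ys = [1, 3, 2]
theta = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]

# Central finite differences as an independent check on the analytic gradient.
eps = 1e-6
analytic = gradient(theta, xs, ys)
max_error = 0.0
for k in range(len(theta)):
    for d in range(len(theta[0])):
        plus = [row[:] for row in theta]
        minus = [row[:] for row in theta]
        plus[k][d] += eps
        minus[k][d] -= eps
        numeric = (log_likelihood(plus, xs, ys)
                   - log_likelihood(minus, xs, ys)) / (2 * eps)
        max_error = max(max_error, abs(numeric - analytic[k][d]))
print(max_error)
```

The largest discrepancy should be tiny compared with the gradient entries themselves; a large value usually points to a sign or indexing mistake.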

With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.


=== Weight Regularization ===

When using softmax regression in practice, it is important to use weight regularization. In particular, if there exists a linear separator that perfectly classifies all the data points, then the log-likelihood has no finite maximizer: given any <math>\theta</math> that separates the data perfectly, one can always scale <math>\theta</math> to be larger and obtain a better objective value, so the weights grow without bound. With weight regularization, one penalizes the weights for being large and thus avoids these degenerate situations.
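The scaling argument is easy to observe numerically. The following sketch uses a made-up one-dimensional, two-class dataset that is separated at zero, and evaluates the unregularized log-likelihood for increasingly scaled copies of a separating <tt>theta</tt>; all names and numbers are illustrative assumptions.

```python
import math

def log_likelihood(theta, xs, ys):
    """Unregularized log-likelihood with 1-based labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
        top = max(scores)  # log-sum-exp shift for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[y - 1] - log_norm
    return total

# Hypothetical separable data: class 1 for positive x, class 2 for negative x.
xs = [[1.0], [2.0], [-1.0], [-2.0]]
ys = [1, 1, 2, 2]
theta = [[1.0], [-1.0]]  # classifies every point correctly

values = []
for scale in (1, 10, 100):
    scaled = [[scale * t for t in theta_j] for theta_j in theta]
    values.append(log_likelihood(scaled, xs, ys))
    print(scale, values[-1])
```

Each larger scale moves the log-likelihood closer to zero from below, so an optimizer without a penalty on the weights has no reason to stop growing them.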

Weight regularization is also important as it often results in models that generalize better. In particular, one can view weight regularization as placing a (Gaussian) prior on <math>\theta</math> so as to prefer <math>\theta</math> with smaller values. 

In practice, we often use L2 regularization, which penalizes the squared value of each element of <math>\theta</math>. Formally, we use:

<math>
\begin{align}
w(\theta) = \frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
\end{align}
</math>

This regularization term is added to the negative log-likelihood to give a cost function, <math>J(\theta)</math>, which we want to '''minimize''' (minimizing the negative log-likelihood corresponds to maximizing the log-likelihood):

<math>
\begin{align}
J(\theta) = -\ell(\theta) + \frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
\end{align}
</math>

The gradient of the cost function with respect to each <math>\theta_k</math> must then account for both the sign change and the weight decay term:

<math>
\begin{align}
\frac{\partial J(\theta)}{\partial \theta_k}
&= -\sum_{i=1}^{m}{ \left[ x^{(i)} ( I_{ \{ y^{(i)} = k\} }  - P(y^{(i)} = k | x^{(i)}) ) \right] } + \lambda \theta_k
\end{align}
</math>

Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
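As a rough end-to-end illustration, the sketch below computes <math>J(\theta)</math> and its gradient in plain Python and minimizes the cost with fixed-step gradient descent. Gradient descent is used here only to keep the example dependency-free; it stands in for the L-BFGS optimizer mentioned above. The function name <tt>cost_and_gradient</tt>, the toy data, the value of <tt>lam</tt>, the step size, and the iteration count are all assumptions chosen for the example.

```python
import math

def softmax(scores):
    top = max(scores)  # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cost_and_gradient(theta, xs, ys, lam):
    """Return J(theta) = -loglik + (lam/2) * sum(theta^2) and its gradient."""
    cost = 0.5 * lam * sum(t * t for row in theta for t in row)
    grad = [[lam * t for t in row] for row in theta]  # weight decay term
    for x, y in zip(xs, ys):
        probs = softmax([dot(row, x) for row in theta])
        cost -= math.log(probs[y - 1])
        for k, row in enumerate(grad):
            indicator = 1.0 if y == k + 1 else 0.0  # labels are 1-based
            for d, xd in enumerate(x):
                row[d] -= xd * (indicator - probs[k])
    return cost, grad

# Hypothetical toy data: 6 examples, 3 classes, 2-dimensional inputs.
xs = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.1], [0.1, 1.0], [0.0, 1.2]]
ys = [1, 1, 2, 2, 3, 3]
theta = [[0.0, 0.0] for _ in range(3)]
lam, step = 0.1, 0.1

initial_cost, _ = cost_and_gradient(theta, xs, ys, lam)
for _ in range(2000):
    _, grad = cost_and_gradient(theta, xs, ys, lam)
    theta = [[t - step * g for t, g in zip(row, grow)]
             for row, grow in zip(theta, grad)]
final_cost, final_grad = cost_and_gradient(theta, xs, ys, lam)
print(initial_cost, final_cost)
```

Because of the penalty term the cost has a finite minimizer, so the gradient shrinks towards zero instead of the weights growing indefinitely.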

== Parameterization ==

We noted earlier that we actually only need <math>n - 1</math> parameter vectors to model <math>n</math> classes. To see why this is so, consider our hypothesis again:

<math>
\begin{align} 
h(x^{(i)}) &= 

\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix} 
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix} \\

&= 

\frac{e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\cdot
\frac{1}{e^{ \theta_n^T x^{(i)} } }
\begin{bmatrix} 
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix} \\

&= 

\frac{1}{ \sum_{j=1}^{n}{e^{ (\theta_j^T  - \theta_n^T) x^{(i)} }} }
\begin{bmatrix} 
e^{ (\theta_1^T - \theta_n^T) x^{(i)} } \\
e^{ (\theta_2^T - \theta_n^T) x^{(i)} } \\
\vdots \\
e^{ (\theta_n^T - \theta_n^T) x^{(i)} } \\
\end{bmatrix} \\
\end{align}
</math>

Letting <math>\Theta_j = \theta_j - \theta_n</math> for <math>j = 1, 2, \ldots, n - 1</math>, and noting that the last entry becomes <math>e^{0} = 1</math>, gives

<math>

\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + \sum_{j=1}^{n-1}{e^{ \Theta_j^T x^{(i)} }} }
\begin{bmatrix} 
e^{ \Theta_1^T x^{(i)} } \\
e^{ \Theta_2^T x^{(i)} } \\
\vdots \\
1 \\
\end{bmatrix} \\

\end{align}
</math>

This shows that only <math>n-1</math> parameter vectors are required.

In practice, however, it is often easier to implement the over-parametrized version, although both methods will lead to approximately the same classifier.
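The equivalence of the two parameterizations can be checked numerically. This sketch compares the over-parametrized hypothesis with the reduced form built from <math>\Theta_j = \theta_j - \theta_n</math>; the function names and the example numbers are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hypothesis(theta, x):
    """Over-parametrized form: one parameter vector per class."""
    exps = [math.exp(dot(row, x)) for row in theta]
    total = sum(exps)
    return [e / total for e in exps]

def reduced_hypothesis(big_theta, x):
    """Reduced form: n - 1 vectors Theta_j; class n has an implicit score of 0."""
    exps = [math.exp(dot(row, x)) for row in big_theta]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]

# Hypothetical example: n = 3 classes, 2-dimensional input.
theta = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
x = [0.5, 2.0]

# Theta_j = theta_j - theta_n for j = 1, ..., n - 1.
big_theta = [[t - tn for t, tn in zip(row, theta[-1])] for row in theta[:-1]]

full = hypothesis(theta, x)
reduced = reduced_hypothesis(big_theta, x)
print(full)
print(reduced)
```

The two printed vectors agree up to rounding, even though the reduced form stores one parameter vector fewer.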

=== Binary Logistic Regression ===

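In the special case where <math>n = 2</math>, softmax regression reduces to logistic regression. The first component of the final expression below is the logistic hypothesis <math>\frac{1}{1+\exp(-\Theta_1^T x^{(i)})}</math>, with <math>\Theta_1</math> playing the role of <math>\theta</math>: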

<math>

\begin{align}
h(x^{(i)}) &= 

\frac{1}{ 1 + e^{ \Theta_1^T x^{(i)} } }
\begin{bmatrix} 
e^{ \Theta_1^T x^{(i)} } \\
1 \\
\end{bmatrix} \\

&= 

\frac{e^{ \Theta_1^T x^{(i)} } }{ 1 + e^{ \Theta_1^T x^{(i)} } }
\cdot
\frac{1}{e^{ \Theta_1^T x^{(i)} } }
\begin{bmatrix} 
e^{ \Theta_1^T x^{(i)} } \\
1 \\
\end{bmatrix} \\

&= 

\frac{1}{ e^{ -\Theta_1^T x^{(i)} } + 1 }
\begin{bmatrix} 
1 \\
e^{ -\Theta_1^T x^{(i)} } \\
\end{bmatrix} \\


\end{align}
</math>
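The two-class reduction can also be verified numerically. In this sketch, the probability that a two-class softmax assigns to class 1 is compared with the logistic function applied to <math>\Theta_1^T x</math>, where <math>\Theta_1 = \theta_1 - \theta_2</math>; the function names and the example numbers are made up for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    exps = [math.exp(dot(row, x)) for row in theta]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical two-class parameters and a 2-dimensional input.
theta = [[0.7, -1.2], [-0.4, 0.5]]
x = [1.5, 0.3]

# Theta_1 = theta_1 - theta_2
big_theta_1 = [a - b for a, b in zip(theta[0], theta[1])]

p = hypothesis(theta, x)
logistic = sigmoid(dot(big_theta_1, x))
print(p[0], logistic)
print(p[1], 1.0 - logistic)
```

Both lines print matching pairs (up to rounding): the softmax output for class 1 is the logistic probability, and the output for class 2 is its complement.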