Softmax Regression
== Introduction ==
In these notes, we describe the '''Softmax regression''' model. This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>. (In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be <math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>. Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}
</math>
and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>
In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two. Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>. I.e., we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values. Thus, our hypothesis will output a <math>k</math> dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities. Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model. Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
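To make the hypothesis concrete, here is a minimal NumPy sketch of computing <math>h_\theta(x)</math> with <math>\theta</math> stored as such a <math>k</math>-by-<math>(n+1)</math> matrix. (This code is our own illustration, not part of the original notes; the name <tt>softmax_probs</tt> and the toy numbers are made up. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and, as discussed later, does not change the probabilities.)

<pre>
import numpy as np

def softmax_probs(theta, x):
    # theta: (k, n+1) matrix whose row j holds theta_j.
    # x:     (n+1,) feature vector with x[0] = 1 (the intercept term).
    scores = theta @ x                        # theta_j^T x for each class j
    scores = scores - np.max(scores)          # stability trick; probabilities unchanged
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)    # length-k vector summing to 1

# Toy example: k = 3 classes, n = 2 features (all numbers made up).
theta = np.array([[ 0.1,  0.5, -0.2],
                  [ 0.0, -0.3,  0.8],
                  [ 0.2,  0.1,  0.1]])
x = np.array([1.0, 2.0, -1.0])                # x_0 = 1 is the intercept
print(softmax_probs(theta, x))                # three probabilities summing to 1
</pre>
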
== Cost Function ==

We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>. For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label. Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.
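As an illustration, here is a minimal NumPy sketch of evaluating this cost on a dataset. The code and names (<tt>softmax_cost</tt>, etc.) are our own, and for convenience it indexes classes from 0 rather than from 1 as in the notes; the indicator <math>1\{y^{(i)} = j\}</math> becomes a row-indexing operation that picks out the log-probability of each example's true class.

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    # theta: (k, n+1) parameter matrix; X: (m, n+1) inputs with x_0 = 1;
    # y: (m,) integer labels in {0, ..., k-1} (0-indexed here).
    m = X.shape[0]
    scores = X @ theta.T                                 # (m, k): entry (i, j) is theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The indicator 1{y^(i) = j} selects each example's true-class log-probability.
    return -np.mean(log_probs[np.arange(m), y])
</pre>
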
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.
Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
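Continuing the sketch above (same hypothetical helper names and 0-indexed labels), the gradient formula and the gradient descent update might look like:

<pre>
def softmax_grad(theta, X, y):
    # Returns a (k, n+1) matrix whose row j is the gradient of J w.r.t. theta_j.
    m = X.shape[0]
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)     # p(y = j | x; theta)
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0                     # one-hot encoding of 1{y^(i) = j}
    return -(indicator - probs).T @ X / m

# Gradient descent: repeat the update below for some number of iterations,
# with a made-up learning rate alpha, e.g. alpha = 0.1:
#     theta = theta - alpha * softmax_grad(theta, X, y)
</pre>
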
When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>). Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.
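This invariance is easy to check numerically. Reusing the hypothetical <tt>softmax_probs</tt> and the toy <tt>theta</tt>, <tt>x</tt> from the Introduction sketch:

<pre>
psi = np.array([0.7, -1.2, 0.4])    # any fixed vector in R^(n+1); numbers made up

# Subtracting psi from every row theta_j leaves the predictions unchanged.
print(np.allclose(softmax_probs(theta, x),
                  softmax_probs(theta - psi, x)))        # prints True

# In particular, psi = theta[0] zeroes out the first parameter vector
# without changing the hypothesis (see the next paragraphs).
print(np.allclose(softmax_probs(theta, x),
                  softmax_probs(theta - theta[0], x)))   # prints True
</pre>
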
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's), without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.

== Weight Decay ==

We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters. Our cost function is now
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because <math>J(\theta)</math> is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>. One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) ) \right] } + \lambda \theta_j
\end{align}
</math>
By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
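Putting the pieces together, here is a sketch of the weight-decayed cost and gradient (again with our own hypothetical names and 0-indexed labels). Note that, like the formula above, it penalizes every entry of <math>\theta</math> including the intercept column; some implementations instead exclude the intercept from the penalty.

<pre>
import numpy as np

def softmax_cost_and_grad(theta, X, y, lam):
    # theta: (k, n+1); X: (m, n+1) with x_0 = 1; y: (m,) labels in {0,...,k-1};
    # lam: the weight decay parameter lambda > 0.
    m = X.shape[0]
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0
    cost = (-np.mean(np.log(probs[np.arange(m), y]))
            + 0.5 * lam * np.sum(theta ** 2))             # weight decay term
    grad = -(indicator - probs).T @ X / m + lam * theta   # + lambda * theta_j per row
    return cost, grad
</pre>

One could then hand <tt>softmax_cost_and_grad</tt> to an off-the-shelf optimizer (for example, an L-BFGS routine) or run the gradient descent update from earlier.
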
== Relationship to Logistic Regression ==

In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\end{bmatrix}
\end{align}
</math>

Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as <math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>, and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>, the same as logistic regression.
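This reduction is easy to verify numerically; the following self-contained sketch (with made-up numbers) checks that the two-class softmax probability matches the logistic sigmoid evaluated at <math>\theta' = \theta_2 - \theta_1</math>:

<pre>
import numpy as np

theta_1 = np.array([ 0.3, -0.5, 1.0])   # made-up parameters, n = 2 features
theta_2 = np.array([-0.2,  0.4, 0.6])
x = np.array([1.0, 0.5, -1.5])          # x_0 = 1 is the intercept

# Two-class softmax probability of the second class ...
scores = np.array([theta_1 @ x, theta_2 @ x])
p_softmax = np.exp(scores)[1] / np.exp(scores).sum()

# ... equals the logistic sigmoid evaluated at theta' = theta_2 - theta_1.
theta_prime = theta_2 - theta_1
p_logistic = 1.0 / (1.0 + np.exp(-(theta_prime @ x)))
print(np.isclose(p_softmax, p_logistic))   # prints True
</pre>
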

== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are <math>k</math> types of music that you are trying to recognize. Should you use a softmax classifier, or should you build <math>k</math> separate binary classifiers using logistic regression?

This will depend on whether the classes are ''mutually exclusive.'' For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with <math>k=4</math>. (If there are also some examples that are none of the above four classes, then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)

If, however, your categories are has_vocals, dance, soundtrack, and pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 binary logistic regression classifiers. This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.

Now, consider a computer vision example, where you're trying to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, it would be more appropriate to build three separate logistic regression classifiers.

{{Softmax}}

{{Languages|Softmax回归|中文}}