Softmax Regression
'''Softmax regression''', also known as '''multinomial logistic regression''', is a generalisation of logistic regression to problems where there are more than two class labels. An example would be classifying the digits from the MNIST data set, where each input can be labelled with one of 10 possible class labels.

== Mathematical form ==

Formally, we consider the classification problem where we have <math>m</math> <math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes.

Our hypothesis <math>h(x^{(i)})</math> estimates the probability of each of the <math>n</math> class labels:

<math>
h(x^{(i)}) =
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix}
</math>

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Note that '''strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes''', but for convenience, we use <math>n</math> parameters in our derivation. We will explore this further in the later section on parameters.

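As a concrete illustration, here is a minimal NumPy sketch of this hypothesis (the function and variable names are our own, and subtracting the maximum before exponentiating is a standard numerical-stability trick that cancels in the ratio, so it does not change the output). The parameters are stored as an <math>n \times k</math> matrix whose rows are the vectors <math>\theta_j</math>.

<pre>
import numpy as np

def hypothesis(theta, x):
    """Estimated class probabilities h(x).

    theta -- (n, k) array whose rows are theta_1, ..., theta_n
    x     -- (k,) input vector
    """
    logits = theta @ x        # theta_j^T x for each class j
    logits -= logits.max()    # numerical stability; cancels in the ratio
    exps = np.exp(logits)
    return exps / exps.sum()  # entries are the class probabilities, summing to 1
</pre>
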
Our objective is to maximise the likelihood of the data, <math>L(\theta; x, y) = \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}; \theta) }</math>.

Equivalently, we can maximise the log-likelihood

<math>
\ell(\theta) = \log L(\theta; x, y) = \sum_{i=1}^{m}{ \log P(y^{(i)} | x^{(i)}; \theta) } = \sum_{i=1}^{m}{ \left( \theta_{y^{(i)}}^T x^{(i)} - \log \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) }
</math>

With this, we can now find a set of parameters that maximises <math>\ell(\theta)</math>, for instance by using gradient ascent.

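Concretely, differentiating <math>\ell(\theta)</math> gives <math>\nabla_{\theta_c} \ell(\theta) = \sum_{i=1}^{m}{ \left( 1\{y^{(i)} = c\} - P(y^{(i)} = c \,|\, x^{(i)}; \theta) \right) x^{(i)} }</math>, which leads to the following batch gradient-ascent sketch (a NumPy illustration assuming zero-indexed labels and a fixed learning rate; all names are ours):

<pre>
import numpy as np

def gradient_ascent_step(theta, X, y, lr=0.1):
    """One batch gradient-ascent update on l(theta).

    theta -- (n, k) parameter matrix
    X     -- (m, k) inputs, one row per example
    y     -- (m,) integer labels in {0, ..., n - 1}
    """
    logits = X @ theta.T                         # (m, n)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)    # P(y = j | x_i) for all i, j
    indicator = np.eye(theta.shape[0])[y]        # one-hot rows, 1{y_i = j}
    grad = (indicator - probs).T @ X             # gradient of l, shape (n, k)
    return theta + lr * grad
</pre>
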
+ | |||
+ | == Parameters == | ||
+ | |||
+ | We noted earlier that we actually only need <math>n - 1</math> parameters to model <math>n</math> classes. To see why this is so, consider our hypothesis again: | ||
+ | |||
+ | <math> | ||
+ | \begin{align} | ||
+ | h(x^{(i)}) &= | ||
+ | |||
+ | \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } | ||
+ | \begin{bmatrix} | ||
+ | e^{ \theta_1^T x^{(i)} } \\ | ||
+ | e^{ \theta_2^T x^{(i)} } \\ | ||
+ | \vdots \\ | ||
+ | e^{ \theta_n^T x^{(i)} } \\ | ||
+ | \end{bmatrix} \\ | ||
+ | |||
+ | &= | ||
+ | |||
+ | \frac{e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } | ||
+ | \cdot | ||
+ | \frac{1}{e^{ \theta_n^T x^{(i)} } } | ||
+ | \begin{bmatrix} | ||
+ | e^{ \theta_1^T x^{(i)} } \\ | ||
+ | e^{ \theta_2^T x^{(i)} } \\ | ||
+ | \vdots \\ | ||
+ | e^{ \theta_n^T x^{(i)} } \\ | ||
+ | \end{bmatrix} \\ | ||
+ | |||
+ | &= | ||
+ | |||
+ | \frac{1}{ \sum_{j=1}^{n}{e^{ (\theta_j^T - \theta_n^T) x^{(i)} }} } | ||
+ | \begin{bmatrix} | ||
+ | e^{ (\theta_1^T - \theta_n^T) x^{(i)} } \\ | ||
+ | e^{ (\theta_2^T - \theta_n^T) x^{(i)} } \\ | ||
+ | \vdots \\ | ||
+ | e^{ (\theta_n^T - \theta_n^T) x^{(i)} } \\ | ||
+ | \end{bmatrix} \\ | ||
+ | \end{align} | ||
+ | </math> | ||
+ | |||
+ | Letting <math>\Theta_j = \theta_j - \theta_n</math> for <math>j = 1, 2 \ldots n - 1</math> gives | ||
+ | |||
<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + \sum_{j=1}^{n-1}{e^{ \Theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \Theta_1^T x^{(i)} } \\
e^{ \Theta_2^T x^{(i)} } \\
\vdots \\
1 \\
\end{bmatrix}
\end{align}
</math>

+ | |||
+ | Showing that only <math>n-1</math> parameters are required. | ||
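To see this redundancy numerically, here is a small NumPy check (a sketch with arbitrary made-up values) that subtracting <math>\theta_n</math> from every parameter vector leaves the hypothesis unchanged:

<pre>
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                      # 4 classes, 3-dimensional inputs
theta = rng.normal(size=(n, k))  # rows are theta_1, ..., theta_n
x = rng.normal(size=k)

def h(theta, x):
    exps = np.exp(theta @ x)
    return exps / exps.sum()

shifted = theta - theta[-1]      # rows become theta_j - theta_n;
                                 # the last row is now all zeros
print(np.allclose(h(theta, x), h(shifted, x)))  # True: same probabilities
</pre>
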
+ | |||
+ | === Logistic regression === | ||
+ | |||
+ | In the special case where <math>n = 2</math>, softmax regression reduces to logistic regression: | ||
+ | |||
<math>
\begin{align}
h(x^{(i)}) &=
\frac{1}{ 1 + e^{ \Theta_1^T x^{(i)} } }
\begin{bmatrix}
e^{ \Theta_1^T x^{(i)} } \\
1 \\
\end{bmatrix} \\
&=
\frac{e^{ \Theta_1^T x^{(i)} } }{ 1 + e^{ \Theta_1^T x^{(i)} } }
\cdot
\frac{1}{e^{ \Theta_1^T x^{(i)} } }
\begin{bmatrix}
e^{ \Theta_1^T x^{(i)} } \\
1 \\
\end{bmatrix} \\
&=
\frac{1}{ 1 + e^{ -\Theta_1^T x^{(i)} } }
\begin{bmatrix}
1 \\
e^{ -\Theta_1^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

The first entry is exactly the logistic (sigmoid) function <math>\frac{1}{1 + e^{ -\Theta_1^T x^{(i)} }}</math> used as the hypothesis in logistic regression.
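As a quick numerical sanity check (again a sketch with arbitrary made-up values), the two-class softmax probability coincides with the logistic function of the reparameterised vector:

<pre>
import numpy as np

rng = np.random.default_rng(1)
k = 3
theta = rng.normal(size=(2, k))  # theta_1 and theta_2
x = rng.normal(size=k)

# Two-class softmax probability of the first class
exps = np.exp(theta @ x)
softmax_p1 = exps[0] / exps.sum()

# Logistic function of Theta_1 = theta_1 - theta_2
Theta1 = theta[0] - theta[1]
sigmoid_p1 = 1.0 / (1.0 + np.exp(-Theta1 @ x))

print(np.allclose(softmax_p1, sigmoid_p1))  # True
</pre>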