Softmax Regression
== Introduction ==
In these notes, we describe the '''Softmax regression''' model. This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>. We had considered the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>. Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^T x)},
\end{align}
</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function. In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only
two. Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>. I.e., we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values. Thus, our hypothesis will output a <math>k</math>-dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities. Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} }
\end{bmatrix}
\end{align}
</math>
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model. Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution, so that it sums to one.
For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k \times (n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{--} \theta_1^T \mbox{--} \\
\mbox{--} \theta_2^T \mbox{--} \\
\vdots \\
\mbox{--} \theta_k^T \mbox{--}
\end{bmatrix}
</math>
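To make this concrete, here is a minimal numpy sketch (ours, not part of the original notes) of the hypothesis, with <math>\theta</math> stored as a <math>k \times (n+1)</math> matrix as above; the name <code>softmax_hypothesis</code> is our own. Subtracting the largest score before exponentiating leaves the result unchanged (the factor cancels in the ratio) but avoids numerical overflow:

<pre>
import numpy as np

def softmax_hypothesis(theta, x):
    """Compute h_theta(x): a vector of the k class probabilities.

    theta : (k, n+1) array whose j-th row is theta_j^T
    x     : (n+1,) input vector (including the intercept term)
    """
    scores = theta @ x               # scores[j] = theta_j^T x
    scores = scores - scores.max()   # stabilizer; cancels in the ratio below
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # normalize so the entries sum to 1
</pre>

For example, with <math>k=3</math> classes and inputs in <math>\Re^5</math>, <code>softmax_hypothesis(np.zeros((3, 5)), np.ones(5))</code> returns the uniform distribution <code>[1/3, 1/3, 1/3]</code> over the three classes.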
== Cost Function ==
We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is the '''indicator function''', so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>. For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
\end{align}
</math>
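As another sketch (again ours, with the hypothetical name <code>softmax_cost</code>, and labels assumed to be integers in <math>\{1, \ldots, k\}</math>), the cost can be computed directly from this formula; note that for each example the indicator picks out exactly one term of the inner sum:

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    """Compute J(theta) for softmax regression.

    theta : (k, n+1) parameter matrix, row j is theta_j^T
    X     : (m, n+1) matrix with one example x^(i) per row
    y     : (m,) integer labels in {1, ..., k}
    """
    m = X.shape[0]
    scores = X @ theta.T                                  # (m, k); entry (i, j) is theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilizer; cancels in the log-ratio
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # 1{y^(i) = j} selects one log-probability per example; average and negate:
    return -np.mean(log_probs[np.arange(m), y - 1])
</pre>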
Notice that the softmax cost function generalizes the logistic regression cost function, which could also have been written:
<math>
\begin{align}
J(\theta) &= - \frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
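As a quick sanity check of this equivalence (ours, reusing the <code>softmax_cost</code> sketch above), for <math>k=2</math> the softmax cost with parameters <math>\theta_1, \theta_2</math> agrees with the logistic cost evaluated with the single parameter vector <math>\theta_2 - \theta_1</math>, once the labels <math>\{0, 1\}</math> are mapped to the classes <math>\{1, 2\}</math>:

<pre>
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.standard_normal((m, 5))       # 100 examples, inputs in R^5
y01 = rng.integers(0, 2, size=m)      # logistic-style labels in {0, 1}
theta = rng.standard_normal((2, 5))   # theta_1 and theta_2 stacked in rows

# Softmax cost with k = 2, mapping label 0 -> class 1 and label 1 -> class 2:
J_softmax = softmax_cost(theta, X, y01 + 1)

# Logistic cost with the single parameter vector theta_2 - theta_1:
h = 1.0 / (1.0 + np.exp(-X @ (theta[1] - theta[0])))
J_logistic = -np.mean(y01 * np.log(h) + (1 - y01) * np.log(1 - h))

assert np.isclose(J_softmax, J_logistic)   # the two costs agree
</pre>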