Softmax Regression
== Introduction ==
In these notes, we describe the '''Softmax regression''' model. This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>. (In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be <math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>. Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}
</math>
and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>
In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two. Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>. I.e., we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values. Thus, our hypothesis will output a <math>k</math> dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities. Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model. Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
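To make the hypothesis concrete, here is a minimal NumPy sketch of computing <math>h_\theta(x)</math> with <math>\theta</math> stored as such a <math>k</math>-by-<math>(n+1)</math> matrix. (This code is our own illustration, not part of the original notes; the name <tt>softmax_probs</tt> and the toy numbers are made up. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and, as discussed later, does not change the probabilities.)

<pre>
import numpy as np

def softmax_probs(theta, x):
    # theta: (k, n+1) matrix whose row j holds theta_j.
    # x:     (n+1,) feature vector with x[0] = 1 (the intercept term).
    scores = theta @ x                        # theta_j^T x for each class j
    scores = scores - np.max(scores)          # stability trick; probabilities unchanged
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)    # length-k vector summing to 1

# Toy example: k = 3 classes, n = 2 features (all numbers made up).
theta = np.array([[ 0.1,  0.5, -0.2],
                  [ 0.0, -0.3,  0.8],
                  [ 0.2,  0.1,  0.1]])
x = np.array([1.0, 2.0, -1.0])                # x_0 = 1 is the intercept
print(softmax_probs(theta, x))                # three probabilities summing to 1
</pre>
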
== Cost Function ==

We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>. For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label. Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.
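As an illustration, here is a minimal NumPy sketch of evaluating this cost on a dataset. The code and names (<tt>softmax_cost</tt>, etc.) are our own, and for convenience it indexes classes from 0 rather than from 1 as in the notes; the indicator <math>1\{y^{(i)} = j\}</math> becomes a row-indexing operation that picks out the log-probability of each example's true class.

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    # theta: (k, n+1) parameter matrix; X: (m, n+1) inputs with x_0 = 1;
    # y: (m,) integer labels in {0, ..., k-1} (0-indexed here).
    m = X.shape[0]
    scores = X @ theta.T                                 # (m, k): entry (i, j) is theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The indicator 1{y^(i) = j} selects each example's true-class log-probability.
    return -np.mean(log_probs[np.arange(m), y])
</pre>
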
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.
Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
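Continuing the sketch above (same hypothetical helper names and 0-indexed labels), the gradient formula and the gradient descent update might look like:

<pre>
def softmax_grad(theta, X, y):
    # Returns a (k, n+1) matrix whose row j is the gradient of J w.r.t. theta_j.
    m = X.shape[0]
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)     # p(y = j | x; theta)
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0                     # one-hot encoding of 1{y^(i) = j}
    return -(indicator - probs).T @ X / m

# Gradient descent: repeat the update below for some number of iterations,
# with a made-up learning rate alpha, e.g. alpha = 0.1:
#     theta = theta - alpha * softmax_grad(theta, X, y)
</pre>
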
When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>). Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.
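This invariance is easy to check numerically. Reusing the hypothetical <tt>softmax_probs</tt> and the toy <tt>theta</tt>, <tt>x</tt> from the Introduction sketch:

<pre>
psi = np.array([0.7, -1.2, 0.4])    # any fixed vector in R^(n+1); numbers made up

# Subtracting psi from every row theta_j leaves the predictions unchanged.
print(np.allclose(softmax_probs(theta, x),
                  softmax_probs(theta - psi, x)))        # prints True

# In particular, psi = theta[0] zeroes out the first parameter vector
# without changing the hypothesis (see the next paragraphs).
print(np.allclose(softmax_probs(theta, x),
                  softmax_probs(theta - theta[0], x)))   # prints True
</pre>
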
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's), without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.

== Weight Decay ==

We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters. Our cost function is now
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because <math>J(\theta)</math> is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>. One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) ) \right] } + \lambda \theta_j
\end{align}
</math>
By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
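Putting the pieces together, here is a sketch of the weight-decayed cost and gradient (again with our own hypothetical names and 0-indexed labels). Note that, like the formula above, it penalizes every entry of <math>\theta</math> including the intercept column; some implementations instead exclude the intercept from the penalty.

<pre>
import numpy as np

def softmax_cost_and_grad(theta, X, y, lam):
    # theta: (k, n+1); X: (m, n+1) with x_0 = 1; y: (m,) labels in {0,...,k-1};
    # lam: the weight decay parameter lambda > 0.
    m = X.shape[0]
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0
    cost = (-np.mean(np.log(probs[np.arange(m), y]))
            + 0.5 * lam * np.sum(theta ** 2))             # weight decay term
    grad = -(indicator - probs).T @ X / m + lam * theta   # + lambda * theta_j per row
    return cost, grad
</pre>

One could then hand <tt>softmax_cost_and_grad</tt> to an off-the-shelf optimizer (for example, an L-BFGS routine) or run the gradient descent update from earlier.
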
== Relationship to Logistic Regression ==

In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\end{bmatrix}
\end{align}
</math>

Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as <math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>, and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>, the same as logistic regression.
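This reduction is easy to verify numerically; the following self-contained sketch (with made-up numbers) checks that the two-class softmax probability matches the logistic sigmoid evaluated at <math>\theta' = \theta_2 - \theta_1</math>:

<pre>
import numpy as np

theta_1 = np.array([ 0.3, -0.5, 1.0])   # made-up parameters, n = 2 features
theta_2 = np.array([-0.2,  0.4, 0.6])
x = np.array([1.0, 0.5, -1.5])          # x_0 = 1 is the intercept

# Two-class softmax probability of the second class ...
scores = np.array([theta_1 @ x, theta_2 @ x])
p_softmax = np.exp(scores)[1] / np.exp(scores).sum()

# ... equals the logistic sigmoid evaluated at theta' = theta_2 - theta_1.
theta_prime = theta_2 - theta_1
p_logistic = 1.0 / (1.0 + np.exp(-(theta_prime @ x)))
print(np.isclose(p_softmax, p_logistic))   # prints True
</pre>
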

== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are <math>k</math> types of music that you are trying to recognize. Should you use a softmax classifier, or should you build <math>k</math> separate binary classifiers using logistic regression?

This will depend on whether the classes are ''mutually exclusive.'' For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with <math>k=4</math>. (If there are also some examples that are none of the above four classes, then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)

If, however, your categories are has_vocals, dance, soundtrack, and pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 binary logistic regression classifiers. This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.

Now, consider a computer vision example, where you're trying to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, it would be more appropriate to build three separate logistic regression classifiers.

{{Softmax}}

{{Languages|Softmax回归|中文}}