Softmax Regression
From Ufldl
where we trained the logistic regression weights to optimize the log-likelihood of the dataset using <math> p(y|x) = h_\theta(x) </math>. In softmax regression, we are interested in multi-class problems where each example is assigned to one of <tt>K</tt> labels. One example of a multi-class classification problem is classifying digits on the MNIST dataset, where each example is assigned 1 of 10 possible labels (i.e., whether it is the digit 0, 1, ..., or 9).
To extend the logistic regression framework, which only outputs a single probability value, we consider a hypothesis that outputs K values (summing to 1) that represent the predicted probability distribution. Formally, let us consider the classification problem where we have <math>m</math> <math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes.
Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that
<math>
h_\theta(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 \,|\, x^{(i)}; \theta) \\
P(y^{(i)} = 2 \,|\, x^{(i)}; \theta) \\
\vdots \\
P(y^{(i)} = n \,|\, x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} }
\end{bmatrix}
</math>
where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that the factor <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution so that it sums to one.
''Strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes, but for convenience, we use <math>n</math> parameters in our derivation.''
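As a concrete illustration, the hypothesis above can be sketched in a few lines of NumPy. This code is not part of the tutorial (which uses MATLAB with minFunc); the function name and the max-subtraction trick for numerical stability are our own additions:

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Evaluate h_theta(x): the predicted probability of each of the n classes.

    Theta : (n, k) array whose rows are the class parameter vectors theta_1..theta_n.
    x     : (k,) input vector.
    """
    scores = Theta @ x          # theta_j^T x for each class j
    scores -= scores.max()      # subtract the max for numerical stability (result unchanged)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalization makes the outputs sum to one
```

By construction, the returned vector is a valid probability distribution: every entry is positive and the entries sum to one.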
Now, this hypothesis defines a predicted probability distribution given some <tt>x</tt>: <math>P(y | x^{(i)}) = h_\theta(x^{(i)}) </math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }</math>.
+ | |||
+ | == Optimizing Softmax Regression == | ||
+ | |||
+ | Expanding the log-likelihood expression, we find that: | ||
<math>
\ell(\theta) = \sum_{i=1}^{m} \ln { \frac{ e^{ \theta_{y^{(i)}}^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } }
= \sum_{i=1}^{m} \left( \theta_{y^{(i)}}^T x^{(i)} - \ln { \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \right)
</math>
Unfortunately, there is no closed-form solution to this optimization problem (although it is concave), so we usually use an off-the-shelf optimization method (e.g., L-BFGS, stochastic gradient descent) to find the optimal parameters. Using these optimization methods requires computing the gradient of <math>\ell(\theta)</math> with respect to <math>\theta_{k}</math>, which can be derived as follows:
<math>
\nabla_{\theta_k} \ell(\theta) = \sum_{i=1}^{m} x^{(i)} \left( 1\{ y^{(i)} = k \} - P(y^{(i)} = k \,|\, x^{(i)}) \right)
</math>

where <math>1\{\cdot\}</math> denotes the indicator function, equal to 1 when its argument is true and 0 otherwise.
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
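The whole procedure can be sketched in NumPy as follows. This is illustrative only — the tutorial itself uses MATLAB with minFunc, and we substitute plain gradient ascent for L-BFGS and use 0-based labels for indexing convenience; all names here are our own:

```python
import numpy as np

def log_likelihood_and_grad(Theta, X, y):
    """Log-likelihood ell(theta) and its gradient for softmax regression.

    Theta : (n, k) parameters, X : (m, k) inputs, y : (m,) labels in {0, ..., n-1}.
    Gradient for class k: sum_i x^(i) * (1{y^(i) = k} - P(y^(i) = k | x^(i))).
    """
    scores = X @ Theta.T                          # (m, n) matrix of theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # P[i, j] = P(y = j | x^(i))
    m = X.shape[0]
    Y = np.zeros_like(P)
    Y[np.arange(m), y] = 1.0                      # one-hot indicator 1{y^(i) = k}
    ll = np.log(P[np.arange(m), y]).sum()         # sum_i ln P(y^(i) | x^(i))
    grad = (Y - P).T @ X                          # (n, k), same shape as Theta
    return ll, grad

def fit(X, y, n_classes, lr=0.1, steps=500):
    """Plain gradient ascent on the log-likelihood (the tutorial suggests L-BFGS)."""
    Theta = np.zeros((n_classes, X.shape[1]))
    for _ in range(steps):
        _, grad = log_likelihood_and_grad(Theta, X, y)
        Theta += lr * grad / X.shape[0]           # ascend, since we maximize ell(theta)
    return Theta
```

Note that we add the gradient rather than subtract it, because we are maximizing the log-likelihood; an off-the-shelf minimizer would instead be given the negated objective and gradient.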
=== Weight decay ===
When using softmax in practice, you might find that the weights sometimes balloon up to very large numbers. This can create numerical difficulties and other issues during training or when the trained weights are used in other settings (as in a stacked autoencoder).
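A common remedy is to penalize large weights by subtracting an L2 penalty from the objective. As a hedged sketch (the form below is standard L2 weight decay; the tutorial's exact expression may differ, and the function name is our own):

```python
import numpy as np

def decayed_objective_and_grad(Theta, ll, grad, lam):
    """Apply weight decay: subtract (lam/2) * ||Theta||^2 from the log-likelihood
    and lam * Theta from its gradient, so the objective and gradient stay
    consistent for an off-the-shelf optimizer. lam is the decay strength;
    larger values push the learned weights toward smaller magnitudes.
    """
    obj = ll - (lam / 2.0) * np.sum(Theta ** 2)
    dgrad = grad - lam * Theta
    return obj, dgrad
```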