Softmax Regression
where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution so that it sums to one.
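To make the normalization concrete, here is a minimal NumPy sketch of evaluating the hypothesis (the function name <tt>softmax_hypothesis</tt>, the <math>(n, k)</math> parameter layout, and the max-subtraction stability trick are our own illustration choices, not part of the article):

<pre>
import numpy as np

def softmax_hypothesis(theta, x):
    # theta: (n, k) array, row j holds theta_j; x: (k,) input vector
    scores = theta @ x                    # theta_j^T x for each class j
    scores = scores - np.max(scores)      # shift for numerical stability; h(x) is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # dividing by the sum normalizes to a distribution

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))           # e.g. n = 4 classes, k = 3 features
x = rng.normal(size=3)
p = softmax_hypothesis(theta, x)
print(p, p.sum())                         # the entries sum to one
</pre>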
+ | |||
''Strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes, but for convenience, we use <math>n</math> parameters in our derivation.''

Now, this hypothesis defines a predicted probability distribution given an input <math>x^{(i)}</math>: <math>P(y | x^{(i)}) = h(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta; x, y) = \sum_{i=1}^{m} \ln P(y^{(i)} | x^{(i)})</math>.
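Under the convention that labels are zero-indexed (our assumption for this sketch), the log-likelihood can be evaluated as follows; the vectorized <math>(m, n)</math> score matrix is again an illustration choice:

<pre>
import numpy as np

def log_likelihood(theta, X, y):
    # theta: (n, k) parameters; X: (m, k) inputs; y: (m,) labels in {0, ..., n-1}
    scores = X @ theta.T                                  # entry (i, j) is theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift; cancels in the ratio
    log_norm = np.log(np.exp(scores).sum(axis=1))         # log of the normalizing sum for each i
    return (scores[np.arange(len(y)), y] - log_norm).sum()  # sum_i ln P(y^(i) | x^(i))
</pre>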
+ | |||
+ | {{Quote| | ||
+ | Motivation: One motivation for selecting this form of hypotheses comes from linear discriminant analysis. In particular, if one assumes a generative model for the data in the form <math>p(x,y) = p(y) \times p(x | y)</math> and selects for <math>p(x | y)</math> a member of the exponential family (which includes Gaussians, Poissons, etc.) it is possible to show that the conditional probability <math>p(y | x)</math> has the same form as our chosen hypotheses <math>h(x)</math>. For more details, we defer to the [http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2 Notes]. | ||
+ | }} | ||
+ | |||
== Optimizing Softmax Regression ==
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
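minFunc is a MATLAB package; as a rough, hedged analogue in Python, the sketch below hands SciPy's L-BFGS-B the ''negative'' log-likelihood and its gradient (optimizers minimize). It uses the standard softmax gradient <math>\nabla_{\theta_j} \ell = \sum_{i=1}^{m} \left( 1\{ y^{(i)} = j \} - P(y^{(i)} = j | x^{(i)}) \right) x^{(i)}</math>; the helper name and the flattened parameter layout are our own:

<pre>
import numpy as np
from scipy.optimize import minimize

def neg_ll_and_grad(theta_flat, X, y, n):
    # Negative log-likelihood and its gradient; minimizing this maximizes l(theta).
    m, k = X.shape
    theta = theta_flat.reshape(n, k)
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)   # P(y = j | x^(i)), shape (m, n)
    ll = np.log(probs[np.arange(m), y]).sum()
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0                   # 1{ y^(i) = j }
    grad = (indicator - probs).T @ X                   # row j: sum_i (1{y=j} - P(j|x)) x^(i)
    return -ll, -grad.ravel()

# Hypothetical usage, given data X (m x k) and labels y in {0, ..., n-1}:
# res = minimize(neg_ll_and_grad, np.zeros(n * k), args=(X, y, n),
#                jac=True, method="L-BFGS-B")
</pre>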
+ | |||
=== Weight Regularization === | === Weight Regularization === | ||
Line 105: | Line 113: | ||
Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
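Assuming a weight-decay term of the form <math>\frac{\lambda}{2} \sum_{i,j} \theta_{ij}^2</math> (the exact scaling in the definition of <math>J(\theta)</math> above may differ), the earlier sketch extends to the regularized objective with two one-line changes:

<pre>
import numpy as np

def regularized_cost_and_grad(theta_flat, X, y, n, lam):
    # J(theta) = -l(theta) + (lam / 2) * sum of squared parameters, and its gradient.
    # NOTE: assumed weight-decay scaling; the article's exact J(theta) may differ.
    neg_ll, neg_grad = neg_ll_and_grad(theta_flat, X, y, n)   # helper sketched above
    cost = neg_ll + 0.5 * lam * np.dot(theta_flat, theta_flat)
    grad = neg_grad + lam * theta_flat
    return cost, grad
</pre>

One design note: the decay term makes <math>J(\theta)</math> strictly convex, so the regularized problem has a unique minimizer even though the unregularized parameterization is redundant (see the next section).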
+ | |||
== Parameterization == | == Parameterization == | ||
Line 164: | Line 173: | ||
Showing that only <math>n-1</math> parameters are required.

In practice, however, it is often easier to implement the over-parameterized version, although both approaches lead to the same classifier.
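This redundancy is easy to verify numerically: subtracting one common vector <math>\psi</math> from every <math>\theta_j</math> multiplies the numerator and denominator of the hypothesis by the same factor <math>e^{-\psi^T x}</math>, leaving the predictions unchanged. A small sketch with made-up shapes:

<pre>
import numpy as np

def softmax_probs(theta, X):
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
theta = rng.normal(size=(4, 3))    # n = 4 classes, k = 3 features (illustrative)
X = rng.normal(size=(5, 3))
psi = rng.normal(size=3)           # one common shift vector

# Shifting every theta_j by the same psi leaves h(x) unchanged:
print(np.allclose(softmax_probs(theta, X), softmax_probs(theta - psi, X)))  # True
</pre>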
+ | |||
+ | === Binary Logistic Regression === | ||
In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression: | In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression: |