Softmax Regression

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution so that it sums to one.
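
For concreteness, here is a minimal NumPy sketch of this hypothesis (the tutorial's own exercises use MATLAB and minFunc; the name <tt>softmax_h</tt> is only illustrative):

<pre>
import numpy as np

def softmax_h(theta, x):
    # theta: n x k matrix whose rows are the class parameter vectors theta_j
    # x:     k-dimensional input vector
    # returns the length-n vector of class probabilities h(x)
    scores = theta @ x                    # theta_j^T x for every class j
    scores = scores - scores.max()        # shift for numerical stability; does not change the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # the normalizer makes the entries sum to one

theta = np.random.randn(3, 4)             # e.g. n = 3 classes, k = 4 features
x = np.random.randn(4)
p = softmax_h(theta, x)
print(p, p.sum())                         # a probability vector whose entries sum to one
</pre>
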
''Strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes, but for convenience, we use <math>n</math> parameters in our derivation.''

Now, this hypothesis defines a predicted probability distribution given an input <math>x^{(i)}</math>, <math>P(y | x^{(i)}) = h(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta; x, y) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }</math>.
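
As a hedged sketch of this objective in NumPy, assuming labels <math>y^{(i)} \in \{0, \ldots, n-1\}</math> (names and toy data are illustrative):

<pre>
import numpy as np

def log_likelihood(theta, X, y):
    # theta: n x k matrix of class parameter vectors
    # X:     m x k matrix whose rows are the inputs x^{(i)}
    # y:     length-m vector of labels in {0, ..., n-1}
    scores = X @ theta.T                                   # m x n matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()           # sum over examples of ln P(y^{(i)} | x^{(i)})

X = np.random.randn(100, 4)
y = np.random.randint(3, size=100)
theta = np.zeros((3, 4))
print(log_likelihood(theta, X, y))   # equals 100 * ln(1/3) when theta is all zeros
</pre>
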
{{Quote|
Motivation: One motivation for selecting this form of hypothesis comes from linear discriminant analysis. In particular, if one assumes a generative model for the data in the form <math>p(x,y) = p(y) \times p(x | y)</math> and selects for <math>p(x | y)</math> a member of the exponential family (which includes Gaussians, Poissons, etc.), it is possible to show that the conditional probability <math>p(y | x)</math> has the same form as our chosen hypothesis <math>h(x)</math>. For more details, we defer to the [http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2 Notes].
}}
== Optimizing Softmax Regression ==
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
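
The tutorial's exercises do this with minFunc in MATLAB. As a rough stand-in, here is a SciPy sketch that minimizes the negative log-likelihood with L-BFGS (equivalent to maximizing <math>\ell(\theta)</math>); the toy data and names are only illustrative:

<pre>
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta_flat, X, y, n):
    # L-BFGS works on a flat parameter vector, so reshape it into the n x k matrix
    m, k = X.shape
    theta = theta_flat.reshape(n, k)
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(m), y].sum()    # negated: minimizing this maximizes l(theta)

# toy data just to make the sketch runnable
m, k, n = 200, 5, 3
X = np.random.randn(m, k)
y = np.random.randint(n, size=m)

result = minimize(neg_log_likelihood, np.zeros(n * k), args=(X, y, n), method='L-BFGS-B')
theta_hat = result.x.reshape(n, k)              # fitted parameters
</pre>
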
=== Weight Regularization ===
Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
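
The regularized objective itself is not reproduced in this excerpt; the following sketch assumes the usual weight-decay form, i.e. the negative log-likelihood plus <math>\frac{\lambda}{2} \sum_{i,j} \theta_{ij}^2</math> (the exact scaling of the tutorial's <math>J(\theta)</math> may differ):

<pre>
import numpy as np

def regularized_cost(theta_flat, X, y, n, lam):
    # negative log-likelihood plus an assumed L2 weight-decay term (lambda / 2) * sum(theta ** 2)
    m, k = X.shape
    theta = theta_flat.reshape(n, k)
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -log_probs[np.arange(m), y].sum()
    decay_term = (lam / 2.0) * np.sum(theta ** 2)   # penalizes large weights; makes the objective strictly convex
    return data_term + decay_term
</pre>

Passing <tt>regularized_cost</tt> to the same L-BFGS call as above would then perform regularized softmax regression.
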
== Parameterization ==
Showing that only <math>n-1</math> parameters are required.

In practice, however, it is often easier to implement the over-parameterized version, although both parameterizations lead to the same classifier.
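
A quick numerical illustration of this redundancy: subtracting the same fixed vector <math>\psi</math> from every <math>\theta_j</math> shifts all the scores <math>\theta_j^T x</math> by the same constant, so the predicted probabilities are unchanged (illustrative NumPy):

<pre>
import numpy as np

def softmax_h(theta, x):
    scores = theta @ x
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

theta = np.random.randn(3, 4)                # n = 3 classes, k = 4 features
x = np.random.randn(4)
psi = np.random.randn(4)                     # any fixed vector

p_original = softmax_h(theta, x)
p_shifted = softmax_h(theta - psi, x)        # subtract psi from every row theta_j
print(np.allclose(p_original, p_shifted))    # True: the hypothesis is unchanged
</pre>
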
=== Binary Logistic Regression ===
In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression:
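
A hedged numerical check of this reduction (illustrative NumPy): with <math>n = 2</math>, the softmax probability of the first class equals the logistic sigmoid applied to <math>(\theta_1 - \theta_2)^T x</math>.

<pre>
import numpy as np

def softmax_h(theta, x):
    scores = theta @ x
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

theta = np.random.randn(2, 4)                                    # n = 2 classes
x = np.random.randn(4)

p_softmax = softmax_h(theta, x)[0]                               # softmax probability of class 1
p_logistic = 1.0 / (1.0 + np.exp(-(theta[0] - theta[1]) @ x))    # sigmoid((theta_1 - theta_2)^T x)
print(np.isclose(p_softmax, p_logistic))                         # True
</pre>
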
