Softmax Regression
where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution so that it sums to one.
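To make the normalization concrete, here is a minimal NumPy sketch of evaluating the hypothesis (the function name <tt>softmax_hypothesis</tt>, the <math>(n, k)</math> parameter layout, and the max-subtraction stability trick are our own illustration choices, not part of the article):

<pre>
import numpy as np

def softmax_hypothesis(theta, x):
    # theta: (n, k) array, row j holds theta_j; x: (k,) input vector
    scores = theta @ x                    # theta_j^T x for each class j
    scores = scores - np.max(scores)      # shift for numerical stability; h(x) is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # dividing by the sum normalizes to a distribution

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))           # e.g. n = 4 classes, k = 3 features
x = rng.normal(size=3)
p = softmax_hypothesis(theta, x)
print(p, p.sum())                         # the entries sum to one
</pre>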
+ | |||
''Strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes, but for convenience, we use <math>n</math> parameters in our derivation.''

Now, this hypothesis defines a predicted probability distribution given an input <math>x^{(i)}</math>: <math>P(y | x^{(i)}) = h(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta; x, y) = \sum_{i=1}^{m} \ln P(y^{(i)} | x^{(i)})</math>.
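Under the convention that labels are zero-indexed (our assumption for this sketch), the log-likelihood can be evaluated as follows; the vectorized <math>(m, n)</math> score matrix is again an illustration choice:

<pre>
import numpy as np

def log_likelihood(theta, X, y):
    # theta: (n, k) parameters; X: (m, k) inputs; y: (m,) labels in {0, ..., n-1}
    scores = X @ theta.T                                  # entry (i, j) is theta_j^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift; cancels in the ratio
    log_norm = np.log(np.exp(scores).sum(axis=1))         # log of the normalizing sum for each i
    return (scores[np.arange(len(y)), y] - log_norm).sum()  # sum_i ln P(y^(i) | x^(i))
</pre>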
+ | |||
+ | {{Quote| | ||
+ | Motivation: One motivation for selecting this form of hypotheses comes from linear discriminant analysis. In particular, if one assumes a generative model for the data in the form <math>p(x,y) = p(y) \times p(x | y)</math> and selects for <math>p(x | y)</math> a member of the exponential family (which includes Gaussians, Poissons, etc.) it is possible to show that the conditional probability <math>p(y | x)</math> has the same form as our chosen hypotheses <math>h(x)</math>. For more details, we defer to the [http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2 Notes]. | ||
+ | }} | ||
+ | |||
== Optimizing Softmax Regression ==
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
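minFunc is a MATLAB package; as a rough, hedged analogue in Python, the sketch below hands SciPy's L-BFGS-B the ''negative'' log-likelihood and its gradient (optimizers minimize). It uses the standard softmax gradient <math>\nabla_{\theta_j} \ell = \sum_{i=1}^{m} \left( 1\{ y^{(i)} = j \} - P(y^{(i)} = j | x^{(i)}) \right) x^{(i)}</math>; the helper name and the flattened parameter layout are our own:

<pre>
import numpy as np
from scipy.optimize import minimize

def neg_ll_and_grad(theta_flat, X, y, n):
    # Negative log-likelihood and its gradient; minimizing this maximizes l(theta).
    m, k = X.shape
    theta = theta_flat.reshape(n, k)
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)   # P(y = j | x^(i)), shape (m, n)
    ll = np.log(probs[np.arange(m), y]).sum()
    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0                   # 1{ y^(i) = j }
    grad = (indicator - probs).T @ X                   # row j: sum_i (1{y=j} - P(j|x)) x^(i)
    return -ll, -grad.ravel()

# Hypothetical usage, given data X (m x k) and labels y in {0, ..., n-1}:
# res = minimize(neg_ll_and_grad, np.zeros(n * k), args=(X, y, n),
#                jac=True, method="L-BFGS-B")
</pre>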
+ | |||
=== Weight Regularization === | === Weight Regularization === | ||
Line 105: | Line 113: | ||
Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
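Assuming a weight-decay term of the form <math>\frac{\lambda}{2} \sum_{i,j} \theta_{ij}^2</math> (the exact scaling in the definition of <math>J(\theta)</math> above may differ), the earlier sketch extends to the regularized objective with two one-line changes:

<pre>
import numpy as np

def regularized_cost_and_grad(theta_flat, X, y, n, lam):
    # J(theta) = -l(theta) + (lam / 2) * sum of squared parameters, and its gradient.
    # NOTE: assumed weight-decay scaling; the article's exact J(theta) may differ.
    neg_ll, neg_grad = neg_ll_and_grad(theta_flat, X, y, n)   # helper sketched above
    cost = neg_ll + 0.5 * lam * np.dot(theta_flat, theta_flat)
    grad = neg_grad + lam * theta_flat
    return cost, grad
</pre>

One design note: the decay term makes <math>J(\theta)</math> strictly convex, so the regularized problem has a unique minimizer even though the unregularized parameterization is redundant (see the next section).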
+ | |||
== Parameterization == | == Parameterization == | ||
Line 164: | Line 173: | ||
Showing that only <math>n-1</math> parameters are required.

In practice, however, it is often easier to implement the over-parameterized version, although both approaches lead to the same classifier.
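This redundancy is easy to verify numerically: subtracting one common vector <math>\psi</math> from every <math>\theta_j</math> multiplies the numerator and denominator of the hypothesis by the same factor <math>e^{-\psi^T x}</math>, leaving the predictions unchanged. A small sketch with made-up shapes:

<pre>
import numpy as np

def softmax_probs(theta, X):
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
theta = rng.normal(size=(4, 3))    # n = 4 classes, k = 3 features (illustrative)
X = rng.normal(size=(5, 3))
psi = rng.normal(size=3)           # one common shift vector

# Shifting every theta_j by the same psi leaves h(x) unchanged:
print(np.allclose(softmax_probs(theta, X), softmax_probs(theta - psi, X)))  # True
</pre>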
+ | |||
+ | === Binary Logistic Regression === | ||
In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression: | In the special case where <math>n = 2</math>, one can also show that softmax regression reduces to logistic regression: |