Softmax Regression

Introduction

Recall that in (binary) logistic regression, our hypothesis took the form

<math>\begin{align}
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}},
\end{align}</math>

where we trained the logistic regression weights to optimize the (conditional) log-likelihood of the dataset using <math> p(y|x) = h_\theta(x) </math>. In softmax regression, we are interested in multi-class problems where each example (e.g., an input image) is assigned to one of <math>n</math> labels. One example of a multi-class classification problem is classifying digits on the MNIST dataset, where each example is assigned 1 of 10 possible labels (i.e., whether it is the digit 0, 1, ..., or 9).
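
To make the recalled binary case concrete, here is a minimal NumPy sketch of this hypothesis (the function and variable names below are illustrative and not part of the original notes):

<pre>
import numpy as np

def logistic_hypothesis(theta, x):
    # Binary logistic regression: h_theta(x) = 1 / (1 + exp(-theta^T x)),
    # interpreted as the predicted probability that y = 1 given x.
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

# Example usage with arbitrary numbers:
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 0.5])
print(logistic_hypothesis(theta, x))   # a single probability in (0, 1)
</pre>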
 
To extend the logistic regression framework, which only outputs a single probability value, we consider a hypothesis that outputs <math>n</math> values (summing to 1) that represent the predicted probability distribution over the classes. Formally, let us consider the classification problem where we have <math>m</math> <math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes.
Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that
<math>\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 | x^{(i)}; \theta) \\
P(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
P(y^{(i)} = n | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix}
\end{align}</math>
where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution so that it sums to one.  
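
As an illustration of the hypothesis above, the following NumPy sketch computes the vector of class probabilities, assuming the parameter vectors <math>\theta_1, \ldots, \theta_n</math> are stacked as the rows of a matrix <tt>Theta</tt> of shape <tt>(n, k)</tt>; the max-subtraction step is a standard numerical-stability trick, not part of the derivation above:

<pre>
import numpy as np

def softmax_hypothesis(Theta, x):
    # Theta: (n, k) matrix whose j-th row is theta_j; x: k-dimensional input.
    scores = Theta.dot(x)                   # theta_j^T x for each class j
    scores = scores - np.max(scores)        # shift for numerical stability (leaves the result unchanged)
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)  # normalize so the probabilities sum to one
</pre>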
 
''Strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes, but for convenience, we use <math>n</math> parameters in our derivation.''
 
Now, this hypothesis defines a predicted probability distribution over the classes given an input <math>x^{(i)}</math>: <math>P(y | x^{(i)}) = h_{\theta}(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>l(\theta; x, y) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }</math>.
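
The objective can be computed directly from the hypothesis; the sketch below reuses the illustrative <tt>softmax_hypothesis</tt> function from the earlier sketch and assumes labels are stored 0-indexed:

<pre>
import numpy as np

def log_likelihood(Theta, X, y):
    # X: (m, k) matrix of inputs; y: length-m array of labels in {0, ..., n-1}.
    # Returns l(theta; x, y) = sum_i ln P(y^(i) | x^(i)).
    total = 0.0
    for x_i, y_i in zip(X, y):
        probs = softmax_hypothesis(Theta, x_i)  # defined in the earlier sketch
        total += np.log(probs[y_i])
    return total
</pre>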
{{Quote|
Motivation: One reason for selecting this form of hypothesis comes from connections to linear discriminant analysis. In particular, if one assumes a generative model for the data in the form <math>p(x,y) = p(y) \times p(x | y)</math> and selects for <math>p(x | y)</math> a member of the exponential family (which includes Gaussians, Poissons, etc.), it is possible to show that the conditional probability <math>p(y | x)</math> has the same form as our chosen hypothesis <math>h(x)</math>. For more details, we defer to the [http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2 Notes].
}}
