Exercise:Softmax Regression

== Softmax regression ==

In this problem set, you will use softmax regression on pixels to classify MNIST images. However, since you will also be using softmax regression in a later exercise, your code should be general enough to work on data sets of arbitrary size, with any number of classes.
=== Step 0: Initialise constants ===

Two constants, <tt>inputSize</tt> and <tt>outputSize</tt>, corresponding to the size of each input vector and the number of class labels respectively, have been defined in the starter code. This will allow you to reuse your code on a different data set in a later exercise.
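
For the MNIST data used in this exercise, these constants work out as follows (illustrative only; the starter code already defines them):

% Illustrative values for MNIST (already defined by the starter code)
inputSize = 28 * 28;   % each image is a 28x28 grid of pixels
outputSize = 10;       % ten digit classes, with label 0 remapped to 10
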
=== Step 1: Load data ===
The starter code loads the MNIST images and labels into <tt>inputData</tt> and <tt>outputData</tt> respectively. The images are pre-processed to scale the pixel values to the range <math>[0, 1]</math>, and the label 0 is remapped to 10 for convenience of implementation. You will not need to change any code in this step for this exercise, but note that your code should be general enough to operate on data of arbitrary size belonging to any number of classes.
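
If you were to perform this pre-processing yourself, it would look roughly like the following. This is only a sketch: the starter code already does it, raw MNIST pixels are assumed to range over 0-255, and <tt>rawImages</tt> and <tt>rawLabels</tt> are illustrative names.

% Rough sketch of the pre-processing described above (already performed by the
% starter code); rawImages and rawLabels are illustrative names.
inputData = double(rawImages) / 255;   % scale raw 0-255 pixel values to [0, 1]
outputData = double(rawLabels);
outputData(outputData == 0) = 10;      % remap label 0 to 10
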
=== Step 2: Implement softmaxCost ===
In <tt>softmaxCost.m</tt>, implement code to compute the softmax cost function. Since <tt>minFunc</tt> minimises this cost, we set it to the '''negative''' of the log-likelihood, <math>-\ell(\theta)</math>, in order to maximise <math>\ell(\theta)</math>. Your code should also compute the appropriate gradients, as well as the predictions for the input data (which will be used in the cross-validation step later).
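
For reference (this is the standard softmax-regression form, not spelled out in the starter code), writing <math>m</math> for the number of training examples and <math>1\{\cdot\}</math> for the indicator function, the cost and its gradient with respect to the parameter vector <math>\theta_k</math> of class <math>k</math> are

<math>
\begin{align}
-\ell(\theta) &= - \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta)
              = - \sum_{i=1}^{m} \log \frac{ e^{ \theta_{y^{(i)}}^T x^{(i)} } }{ \sum_{j=1}^{n}{ e^{ \theta_j^T x^{(i)} } } } \\
\nabla_{\theta_k} \left( -\ell(\theta) \right) &= - \sum_{i=1}^{m} x^{(i)} \left( 1\{ y^{(i)} = k \} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right)
\end{align}
</math>

where <math>P(y^{(i)} = k \mid x^{(i)}; \theta)</math> is the <math>k</math>-th entry of the hypothesis <math>h(x^{(i)})</math> given in the overflow tip below.
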
'''Implementation tip: computing the output matrix''' - in your code, you may need to compute the output matrix <tt>M</tt>, such that <tt>M(r, c)</tt> is 1 if <math>y^{(c)} = r</math> and 0 otherwise. This can be done quickly, without a loop, using the MATLAB functions <tt>sparse</tt> and <tt>full</tt>. <tt>sparse(r, c, v)</tt> creates a sparse matrix such that <tt>M(r(i), c(i)) = v(i)</tt> for all <tt>i</tt>. That is, the vectors <tt>r</tt> and <tt>c</tt> give the positions of the elements whose values we wish to set, and <tt>v</tt> the corresponding values of those elements. Running <tt>full</tt> on a sparse matrix gives its full (dense) representation for use. Note that the code for using <tt>sparse</tt> and <tt>full</tt> to compute the output matrix has already been included in <tt>softmaxCost.m</tt>; an illustrative sketch is shown below.
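
For example, a minimal sketch of how <tt>sparse</tt> and <tt>full</tt> can build this matrix, where <tt>labels</tt> is an illustrative name for the vector of class labels <math>y^{(c)}</math>:

% labels is an illustrative name for the vector of class labels y^(c).
% Column c of M gets a single 1 in row labels(c), so M(r, c) = 1 iff y^(c) = r.
M = full(sparse(labels, 1:numel(labels), 1));
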
'''Implementation tip: preventing overflows''' - in softmax regression, you will have to compute the hypothesis

<math>
\begin{align}
h(x^{(i)}) =
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
When the products <math>\theta_j^T x^{(i)}</math> are large, the exponential function <math>e^{\theta_j^T x^{(i)}}</math> will become very large and possibly overflow. When this happens, you will not be able to compute your hypothesis. However, there is an easy solution - observe that we can multiply the top and bottom of the hypothesis by some constant without changing the output:

<math>
\begin{align}
h(x^{(i)}) &=
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix} \\
&=
\frac{ e^{-\alpha} }{ e^{-\alpha} \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix} \\
&=
\frac{ 1 }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} - \alpha }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} - \alpha } \\
e^{ \theta_2^T x^{(i)} - \alpha } \\
\vdots \\
e^{ \theta_n^T x^{(i)} - \alpha } \\
\end{bmatrix}
\end{align}
</math>
Hence, to prevent overflow, simply subtract some large constant value from each of the <math>\theta_j^T x^{(i)}</math> terms before computing the exponential. In practice, for each example, you can use the maximum of the <math>\theta_j^T x^{(i)}</math> terms as the constant. Assuming you have a matrix <tt>M</tt> containing these terms, such that <tt>M(r, c)</tt> is <math>\theta_r^T x^{(c)}</math>, you can use the following code to accomplish this:

% M is the matrix as described in the text
M = bsxfun(@minus, M, max(M));

<tt>max(M)</tt> yields a row vector in which each element is the maximum value of the corresponding column of <tt>M</tt>. <tt>bsxfun</tt> (short for binary singleton expansion function) expands that row vector to the size of <tt>M</tt> and applies <tt>minus</tt> element-wise, hence subtracting the maximum of each column from every element in that column.

You may also find <tt>bsxfun</tt> useful in computing your predictions - if you have a matrix <tt>M</tt> containing the <math>e^{\theta_j^T x^{(i)}}</math> terms, such that <tt>M(r, c)</tt> contains the <math>e^{\theta_r^T x^{(c)}}</math> term, you can use the following code to compute the hypothesis (by dividing all elements in each column by their column sum):

% M is the matrix as described in the text
M = bsxfun(@rdivide, M, sum(M));

The operation of <tt>bsxfun</tt> in this case is analogous to the earlier example. A sketch combining both tips is shown below.
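
Putting the two tips together, here is a minimal sketch of how the full hypothesis matrix might be computed for all examples at once. It assumes <tt>theta</tt> is a matrix whose r-th row is <math>\theta_r^T</math> and <tt>data</tt> is a matrix whose c-th column is <math>x^{(c)}</math>; check these shapes against your starter code.

% Sketch only: assumes theta is (numClasses x inputSize) and data is
% (inputSize x numExamples); verify these shapes against the starter code.
M = theta * data;                         % M(r, c) = theta_r' * x^(c)
M = bsxfun(@minus, M, max(M));            % subtract each column's maximum to prevent overflow
M = exp(M);                               % M(r, c) = e^(theta_r' * x^(c) - alpha_c)
hypothesis = bsxfun(@rdivide, M, sum(M)); % divide each column by its sum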
=== Step 3: Gradient checking ===
Once you have written the softmax cost function, you should check your gradients numerically. In general, whenever implementing any learning algorithm, you should always check your gradients numerically before proceeding to train the model. The norm of the difference between the numerical gradient and your analytical gradient should be small, on the order of <math>10^{-9}</math>.
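
For concreteness, a minimal sketch of such a finite-difference check, assuming <tt>costFun(theta)</tt> returns just the cost and <tt>grad</tt> holds your analytical gradient (both names are illustrative; the starter code may provide its own checking routine):

% Illustrative finite-difference check; costFun and grad are assumed names.
EPSILON = 1e-4;
numGrad = zeros(size(theta));
for i = 1:numel(theta)
    e = zeros(size(theta));
    e(i) = EPSILON;
    numGrad(i) = (costFun(theta + e) - costFun(theta - e)) / (2 * EPSILON);
end
fprintf('Norm of difference: %g\n', norm(numGrad(:) - grad(:)));
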
'''Implementation tip - faster gradient checking''' - when debugging, you can speed up gradient checking by reducing the number of parameters your model uses. In this case, we have included code for reducing the size of the input data, using the first 8 pixels of the images instead of the full 28x28 images. This code can be used by setting the variable <tt>DEBUG</tt> to true, as described in step 1 of the code.
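
The reduction itself is roughly of the following form; it is already included in the starter code and is shown here only to illustrate what <tt>DEBUG</tt> does.

% Roughly what the DEBUG reduction does (already present in the starter code).
DEBUG = true;
if DEBUG
    inputSize = 8;                         % use only the first 8 pixels of each image
    inputData = inputData(1:inputSize, :); % shrink the input data accordingly
end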
=== Step 4: Learning parameters ===
Now that you've verified that your gradients are correct, you can train your softmax model using the L-BFGS algorithm, as implemented in the function <tt>minFunc</tt>. Training the model on the entire MNIST training set of 60000 28x28 images should be rather quick, taking less than 3 minutes for 100 iterations.
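
A rough sketch of what that training call might look like; the exact <tt>softmaxCost</tt> argument list and <tt>minFunc</tt> option names should be taken from your starter code rather than from this example.

% Sketch only: the softmaxCost argument order and the option names are
% assumptions based on typical minFunc usage; follow the starter code's interface.
% theta is the initial parameter vector set up by the starter code.
options.Method = 'lbfgs';   % use L-BFGS updates
options.maxIter = 100;      % run for 100 iterations
[optTheta, cost] = minFunc(@(p) softmaxCost(p, outputSize, inputSize, inputData, outputData), ...
                           theta, options);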
=== Step 5: Cross-validation ===
Now that you've trained your model, you will cross-validate it against the MNIST test set, comprising 10000 28x28 images. Code has been provided to compute the accuracy (the proportion of correctly classified images) and the squared error of your model. Our implementation achieved an accuracy of 92%. If your model's accuracy is significantly lower (less than 90%), check your code, ensure that you are using the trained weights, and ensure that you are training your model on the full 60000 training images. Conversely, if your accuracy is too high (99-100%), ensure that you have not accidentally trained your model on the test set as well.
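
The accuracy computation amounts to something like the following, assuming <tt>pred</tt> holds your model's predicted labels for the test images and <tt>testLabels</tt> the true labels (names illustrative; the provided code may differ):

% pred and testLabels are illustrative names for the predicted and true labels.
accuracy = mean(pred(:) == testLabels(:));   % proportion of correctly classified images
fprintf('Accuracy: %0.2f%%\n', accuracy * 100);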
