Exercise:Convolution and Pooling
== Convolution and Pooling ==

This problem set is divided into two parts. In the first part, you will implement a [[Linear Decoders | linear decoder]] to learn features on color images from the STL10 dataset. In the second part, you will use these learned features in convolution and pooling for classifying STL10 images.

In the file <tt>[http://ufldl.stanford.edu/wiki/resources/cnn_exercise.zip cnn_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated by "YOUR CODE HERE" in the files.

For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from your earlier exercise. You will also need to modify '''<tt>cnnConvolve.m</tt>''' and '''<tt>cnnPool.m</tt>''' from this exercise.

=== Dependencies ===

The following additional files are required for this exercise:
* STL10 dataset

You will also need:
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]
* <tt>softmaxTrain.m</tt> (and related functions) from [[Exercise:Softmax Regression]]

''If you have not completed the exercises listed above, we strongly suggest you complete them first.''

=== Part I: Linear decoder on color images ===

In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get the opportunity to work with RGB color images for the first time. Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder: you can just combine the intensities from all the color channels of the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image.

=== Step 0: Initialization ===

In this step, we initialize some parameters used in the exercise.

=== Step 1: Modify sparseAutoencoderCost.m to use a linear decoder ===

Copy <tt>sparseAutoencoderCost.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinearCost.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder; a minimal sketch of this change appears at the end of Step 2 below. After making this change, check your gradients to ensure that they are correct.

=== Step 2: Learn features on small patches ===

You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL10 images. (The STL10 dataset comprises 5000 training and 8000 test 96x96 labelled color images, each belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.)

Note that you will later need to apply the exact same preprocessing steps to the convolved images as you apply to the patches used for training the autoencoder (you have to subtract the same mean patch and use the exact same whitening matrix), so working from a fixed set of sampled patches means that these matrices can be recomputed whenever necessary. Code to load the sampled patches has already been provided, so no additional changes are required on your part.

In this step, you will train a sparse autoencoder (with linear decoder) on the sampled patches. The code provided trains your sparse autoencoder for 800 iterations with the default parameters initialized in step 0. This should take less than 15 minutes.
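As a concrete illustration of the Step 1 change, here is a minimal sketch of how the output layer differs between the sigmoid and linear decoders. The variable names (<tt>W2</tt>, <tt>b2</tt>, <tt>a2</tt>, <tt>data</tt>, <tt>m</tt>) are illustrative only and need not match those in your own code:

<syntaxhighlight lang="matlab">
% a2 is the hidden-layer activation, data is the input (and reconstruction
% target) matrix, and m is the number of training examples.
z3 = W2 * a2 + repmat(b2, 1, m);

% Sigmoid decoder (original sparseAutoencoderCost):
%   a3     = sigmoid(z3);
%   delta3 = -(data - a3) .* a3 .* (1 - a3);   % f'(z3) = a3 .* (1 - a3)

% Linear decoder: f(z) = z, so f'(z) = 1 and the derivative factor vanishes.
a3     = z3;
delta3 = -(data - a3);

% The reconstruction cost and the gradients for W2 and b2 are then computed
% from a3 and delta3 exactly as before; the hidden (sigmoid) layer is unchanged.
</syntaxhighlight>

A common cause of gradient-check failures in Step 1 is forgetting to drop the sigmoid derivative factor from the output-layer error term.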
Your sparse autoencoder should learn features which, when visualized, look like edges and opponent colors, as in the figure below.

[[File:cnn_Features_Good.png|480px]]

If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:

<table>
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr>
</table>

=== Part II: Convolution and pooling ===

=== Step 3: Convolution and pooling ===

Now that you have learned features for small patches, you will convolve these learned features with the large images, and pool the convolved features for use in a classifier later.

==== Step 3a: Convolution ====

Implement convolution, as described in [[feature extraction using convolution]], in the function <tt>cnnConvolve</tt> in <tt>cnnConvolve.m</tt>. Implementing convolution is somewhat involved, so we will guide you through the process below.

First of all, what we want to compute is <math>\sigma(Wx_{(r,c)} + b)</math> for all ''valid'' <math>(r, c)</math> (''valid'' meaning that the entire 8x8 patch is contained within the image; this is as opposed to a ''full'' convolution, which allows the patch to extend outside the image, with the area outside the image assumed to be 0), where <math>W</math> and <math>b</math> are the learned weights and biases from the input layer to the hidden layer, and <math>x_{(r,c)}</math> is the 8x8 patch with its upper left corner at <math>(r, c)</math>. To accomplish this, we could loop over all such patches and compute <math>\sigma(Wx_{(r,c)} + b)</math> for each of them. In theory, this is correct. In practice, however, the convolution is usually done in three small steps to take advantage of MATLAB's optimized convolution functions.

Observe that the convolution above can be broken down into the following three small steps. First, compute <math>Wx_{(r,c)}</math> for all <math>(r, c)</math>. Next, add <math>b</math> to all the computed values. Finally, apply the sigmoid function to the resulting values. This doesn't seem to buy you anything, since the first step still requires a loop. However, you can replace the loop in the first step with one of MATLAB's optimized convolution functions, <tt>conv2</tt>, speeding up the process significantly.

There are a few complications in using <tt>conv2</tt>. First, <tt>conv2</tt> performs a 2-D convolution, but you have 5 "dimensions" - image number, feature number, row of image, column of image, and channel of image - that you want to convolve over. Because of this, you will have to convolve each feature and image channel separately for each image, using the row and column of the image as the 2 dimensions you convolve over. This means that you will need three outer loops over the image number <tt>imageNum</tt>, the feature number <tt>featureNum</tt>, and the channel number of the image <tt>channel</tt>, with the 2-D convolution of the weight matrix for the <tt>featureNum</tt>-th feature and <tt>channel</tt>-th channel with the image matrix for the <tt>imageNum</tt>-th image going inside.
More concretely, your code will look something like the following, filling in an array <tt>convolvedFeatures(featureNum, imageNum, r, c)</tt>:

<syntaxhighlight lang="matlab">
for imageNum = 1:numImages
  for featureNum = 1:hiddenSize

    % Obtain the feature matrix for this feature
    Wt = W(featureNum, :);
    Wt = reshape(Wt, patchDim, patchDim, 3);

    % Get convolution of image with feature matrix for each channel
    convolvedTemp = zeros(imageDim - patchDim + 1, imageDim - patchDim + 1, 3);
    for channel = 1:3
      % Flip the feature matrix because of the definition of convolution,
      % as explained later
      Wt(:, :, channel) = flipud(fliplr(squeeze(Wt(:, :, channel))));
      convolvedTemp(:, :, channel) = conv2(squeeze(images(:, :, channel, imageNum)), ...
                                           squeeze(Wt(:, :, channel)), 'valid');
    end

    % The convolved feature is the sum of the convolved values for all channels
    convolvedFeatures(featureNum, imageNum, :, :) = sum(convolvedTemp, 3);
  end
end
</syntaxhighlight>

One detail in the above code needs to be explained - observe that we "flip" the feature matrix about its rows and columns before passing it into <tt>conv2</tt>. This is necessary because the mathematical definition of convolution involves "flipping" the matrix that is convolved with, as explained in more detail in the implementation tip below.

<div style="border:1px solid black; padding: 5px">
'''Implementation tip:''' Using <tt>conv2</tt> and <tt>convn</tt>

Because the mathematical definition of convolution involves "flipping" the matrix to convolve with, to use MATLAB's convolution functions you must first "flip" the weight matrix so that when MATLAB "flips" it according to the mathematical definition, the entries will be in the correct place. For example, suppose you wanted to convolve two matrices <math>image</math> (a large image) and <math>W</math> (the feature) using <tt>conv2(image, W)</tt>, and <math>W</math> is a 3x3 matrix as below:

<math>
W =
\begin{pmatrix}
 1 & 2 & 3 \\
 4 & 5 & 6 \\
 7 & 8 & 9 \\
\end{pmatrix}
</math>

If you use <tt>conv2(image, W)</tt>, MATLAB will first "flip" <math>W</math>, reversing its rows and columns, before convolving it with <math>image</math>, as below:

<math>
\begin{pmatrix}
 1 & 2 & 3 \\
 4 & 5 & 6 \\
 7 & 8 & 9 \\
\end{pmatrix}
\xrightarrow{flip}
\begin{pmatrix}
 9 & 8 & 7 \\
 6 & 5 & 4 \\
 3 & 2 & 1 \\
\end{pmatrix}
</math>

If the original layout of <math>W</math> was correct, it would be incorrect after flipping. For the layout to be correct after MATLAB flips <math>W</math> inside <tt>conv2</tt>, you have to flip <math>W</math> yourself before passing it in. For <tt>conv2</tt>, this means reversing the rows and columns, which can be done with <tt>flipud</tt> and <tt>fliplr</tt>, as we did in the example code above. The same holds for the general convolution function <tt>convn</tt>, in which case MATLAB reverses every dimension. In general, you can flip the matrix <math>W</math> with the following code snippet, which works for <math>W</math> of any dimension:

<syntaxhighlight lang="matlab">
% Flip W for use in conv2 / convn
temp = W(:);
temp = flipud(temp);
temp = reshape(temp, size(W));
</syntaxhighlight>
</div>

To each of the <tt>convolvedFeatures</tt>, you should then add <tt>b</tt>, the corresponding bias for the <tt>featureNum</tt>-th feature. If you had done no preprocessing of the patches, you could then apply the sigmoid function to obtain the convolved features.
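In that no-preprocessing case, the last two of the three small steps could be folded into the loop above along the following lines; <tt>sigmoid</tt> here is assumed to be the same helper function you defined in the earlier sparse autoencoder exercises:

<syntaxhighlight lang="matlab">
% Add the bias for this feature, then apply the sigmoid elementwise.
% (Only valid if the patches were not preprocessed; see below.)
convolvedFeatures(featureNum, imageNum, :, :) = sigmoid( ...
    convolvedFeatures(featureNum, imageNum, :, :) + b(featureNum));
</syntaxhighlight>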
However, because you preprocessed the patches before learning features on them, you must also apply the same preprocessing steps to the convolved patches to get the correct feature activations. In particular, you did the following to the patches:
<ol>
<li> divide by 255 to normalize them into the range <math>[0, 1]</math>
<li> subtract the mean patch, <tt>meanPatch</tt>, to zero the mean of the patches
<li> ZCA whiten using the whitening matrix <tt>ZCAWhite</tt>.
</ol>

These same three steps must also be applied to the convolved patches. Taking the preprocessing steps into account, the feature activations that you should compute are <math>\sigma(W(T(x-\bar{x})) + b)</math>, where <math>T</math> is the whitening matrix and <math>\bar{x}</math> is the mean patch. Expanding this, you obtain <math>\sigma(WTx - WT\bar{x} + b)</math>, which suggests that you should convolve the images with <math>WT</math> rather than <math>W</math> as earlier, and that you should add <math>(b - WT\bar{x})</math>, rather than just <math>b</math>, to the resulting matrix <tt>C</tt> before finally applying the sigmoid function.

==== Step 3b: Checking ====

We have provided some code for you to check that you have done the convolution correctly. The code randomly checks the convolved values for a number of (feature, row, column) tuples by computing the feature activations for the selected features and patches directly using the sparse autoencoder.

==== Step 3c: Pooling ====

Implement [[pooling]] in the function <tt>cnnPool</tt> in <tt>cnnPool.m</tt>. (A sketch of mean pooling appears at the end of this section.)

=== Step 4: Use pooled features for classification ===

Once you have implemented pooling, you will use the pooled features to train a softmax classifier that maps them to the class labels. The code in this section uses <tt>softmaxTrain</tt> from the softmax exercise to train a softmax classifier on the pooled features for 500 iterations, which should take less than 5 minutes.

=== Step 5: Test classifier ===

Now that you have a trained softmax classifier, you can see how well it performs on the test set. This section contains code that loads the test set (a smaller part of the STL10 dataset, specifically, 3200 rescaled 64x64 images from 4 different classes) and obtains the pooled convolved features for the images using the functions <tt>cnnConvolve</tt> and <tt>cnnPool</tt> which you wrote earlier, as well as the preprocessing matrices <tt>ZCAWhite</tt> and <tt>meanPatch</tt> computed earlier when preprocessing the training images. These pooled features are then run through the softmax classifier, and the accuracy of the predictions is computed. Because object recognition is a difficult task, the accuracy will be relatively low - we obtained an accuracy of around XX%.
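For reference, the following is a minimal sketch of the mean pooling described in Step 3c. It assumes non-overlapping <tt>poolDim</tt> x <tt>poolDim</tt> pooling regions and a <tt>cnnPool(poolDim, convolvedFeatures)</tt> signature; adapt it if your starter code is organized differently:

<syntaxhighlight lang="matlab">
function pooledFeatures = cnnPool(poolDim, convolvedFeatures)
% Sketch of mean pooling over non-overlapping poolDim x poolDim regions.
% Assumes poolDim evenly divides the convolved dimension.

numFeatures  = size(convolvedFeatures, 1);
numImages    = size(convolvedFeatures, 2);
convolvedDim = size(convolvedFeatures, 3);
numRegions   = floor(convolvedDim / poolDim);

pooledFeatures = zeros(numFeatures, numImages, numRegions, numRegions);

for featureNum = 1:numFeatures
  for imageNum = 1:numImages
    for r = 1:numRegions
      for c = 1:numRegions
        % Average the convolved feature over one poolDim x poolDim region
        region = convolvedFeatures(featureNum, imageNum, ...
                     (r-1)*poolDim+1 : r*poolDim, ...
                     (c-1)*poolDim+1 : c*poolDim);
        pooledFeatures(featureNum, imageNum, r, c) = mean(region(:));
      end
    end
  end
end

end
</syntaxhighlight>

This sketch uses mean pooling; if you implement max pooling instead, replace the mean with a max over each region.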