Feature extraction using convolution

=== Convolution ===
So far, you have been working with relatively small images (either 8x8 patches for the sparse autoencoder assignment, or 28x28 images for the MNIST dataset). With these small images, it is computationally feasible to learn features on the entire image, and to use these features for classification of the entire image. But what if you want to work with larger images, for instance 96x96 images? With a 96x96 image, learning features on the entire image is no longer feasible - you would have <math>10^4</math> visible units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be at least on the order of <math>10^2</math> times slower, compared to 28x28 images.
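To spell out the arithmetic behind these counts: a 96x96 image has <math>96 \times 96 = 9216 \approx 10^4</math> pixels (visible units), so the weight matrix from the visible units to 100 hidden units alone has roughly <math>9216 \times 100 \approx 10^6</math> entries to learn, compared with roughly <math>784 \times 100 \approx 10^5</math> for a 28x28 image.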
Furthermore, given what we know about the features learned for natural images so far, it doesn't seem to make sense to try to learn features on the entire 96x96 image. For natural images, we found that the sparse autoencoder learns edges at different orientations and at different locations in the image. For a larger image, we would expect the same result, except with the edges now appearing at even more locations in the image. This suggests that we should try re-using the features we learn on small patches for large images, by translating them around the large image.
Indeed, this intuition leads us to the method of '''feature extraction using convolution''' for large images. The idea is to first learn some features on smaller patches (say, 8x8 patches) sampled from the large image, and then to '''convolve''' these features with the larger image to get the feature activations at various points in the image. Convolution corresponds precisely to the intuitive notion of translating the features. To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots, (89, 89)</math>, you would extract the 8x8 patch and run it through your trained sparse autoencoder to get the feature activations. This would result in 100 sets of 89x89 convolved features. These convolved features can then be '''[[#Pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification.
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. These convolved features can then be [[#Pooling | pooled]] for classification, as described below.
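As an illustration, here is a minimal NumPy sketch of this convolution step. It assumes a single greyscale image, that the learned weights are stored as a <math>k \times ab</math> matrix whose rows act on patches flattened in the same order used when training the autoencoder, and the names (<tt>convolve_features</tt>, <tt>W1</tt>, <tt>b1</tt>) are just illustrative placeholders, not part of the assignment code:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convolve_features(image, W1, b1, a, b):
    # image: (r, c) large image; W1: (k, a*b) weights; b1: (k,) biases.
    # Returns a (k, r-a+1, c-b+1) array of convolved feature activations.
    r, c = image.shape
    k = W1.shape[0]
    convolved = np.zeros((k, r - a + 1, c - b + 1))
    for i in range(r - a + 1):
        for j in range(c - b + 1):
            # Flatten the a x b patch starting at (i, j) and feed it
            # through the trained autoencoder's hidden layer.
            patch = image[i:i + a, j:j + b].reshape(-1)
            convolved[:, i, j] = sigmoid(W1 @ patch + b1)
    return convolved
</pre>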
=== Pooling ===
Now that you have obtained an array of convolved features, you might try using these features for classification. However, thinking about why we decided to obtain convolved features suggests a further step that could improve our classification performance. Recall that we decided to obtain convolved features because we thought that the features for the large image would simply be the features for smaller patches translated around the large image. This suggests to us that what we might really be interested in are the feature activations independent of some small translations. You can see why this might be so intuitively - if you were to take an MNIST digit and translate it left or right, you would want your classifier to still accurately classify it as the same digit regardless of its final position.
Hence, what we are really interested in is the '''translation-invariant''' feature activation - we want to know whether there is an edge, regardless of whether it is at <math>(1, 1), (3, 3)</math> or <math>(5, 5)</math>, though perhaps if it is at <math>(50, 50)</math> we might want to treat it as a separate edge. This suggests that what we should do is to take the maximum (or perhaps mean) activation of the convolved features around a certain small region, hence making our resultant pooled features less sensitive to small translations.
Formally, after obtaining our convolved features as described earlier, we decide the size of the region, say <math>m \times n</math>, to pool our convolved features over. Then, we divide our convolved features into disjoint <math>m \times n</math> regions, and take the maximum (or mean) feature activation over these regions to obtain the pooled convolved features. These pooled features can then be used for classification.
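Similarly, here is a minimal sketch of the pooling step, under the same assumptions as the convolution sketch above; it additionally assumes that the pooling region size divides the convolved feature dimensions evenly (any leftover border is simply discarded here), and <tt>pool_features</tt> is again just an illustrative name:

<pre>
import numpy as np

def pool_features(convolved, m, n, pool_fn=np.max):
    # convolved: (k, rows, cols) array of convolved features.
    # Pools over disjoint m x n regions using pool_fn (np.max or np.mean),
    # returning a (k, rows // m, cols // n) array of pooled features.
    k, rows, cols = convolved.shape
    pooled = np.zeros((k, rows // m, cols // n))
    for i in range(rows // m):
        for j in range(cols // n):
            region = convolved[:, i * m:(i + 1) * m, j * n:(j + 1) * n]
            pooled[:, i, j] = pool_fn(region.reshape(k, -1), axis=1)
    return pooled
</pre>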
