Feature extraction using convolution

From Ufldl

(Convolution)
 
Line 1: Line 1:
-
=== Convolution ===
+
== Overview ==
-
So far, you have been working with relatively small images (either 8x8 patches for the sparse autoencoder assignment, or 28x28 images for the MNIST dataset). With these small images, it is computationally feasible to learn features on the entire image, and to use these features for classification of the entire image. But what if you want to work with larger images instead, for instance, 96x96 images instead? With a 96x96 image, learning features on for the entire image is no longer feasible - you would have to have <math>10^4</math> visible units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be on the order of at least <math>10^2</math> slower, compared to 28x28 images.
+
In the previous exercises, you worked through problems that involved relatively low-resolution images, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale these techniques up to more realistic datasets with larger images.
-
Furthermore, given what we know about the features learned for natural images so far, it doesn't seem to make sense to try to learn features on the entire 100x100 image. For natural images, we found that the sparse autoencoder learns edges at different orientations and at different locations in the image. For a larger image, we would expect the same result, except with the edges now transposed to even more locations in the image. This suggests to us that we should try re-using the features we learn on small patches for large images by translating them around the large image.
+
== Fully Connected Networks ==
-
Indeed, this intuition leads us to the method of '''feature extraction using convolution''' for large images. The idea is to first learn some features on smaller patches (say 8x8 patches) sampled from the large image, and then to '''convolve''' these features with the larger image to get the feature activations at various points in the image. Convolution corresponds precisely to the intuitive notion of translating the features. To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. These convolved features can then be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification.  
+
In the sparse autoencoder, one design choice we made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images), learning features that span the entire image (fully connected networks) is very computationally expensive: you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.
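The arithmetic behind these estimates can be checked with a short Python sketch. It assumes single-channel (grayscale) images and counts one weight per input-hidden pair plus one bias per hidden unit; the function name <code>fully_connected_params</code> is made up for this illustration.

```python
def fully_connected_params(height, width, num_hidden):
    """Count weights plus biases for one fully connected hidden layer.

    Assumes a single-channel image, so there is one input unit per pixel.
    """
    num_inputs = height * width
    return num_inputs * num_hidden + num_hidden


# Image sizes mentioned in the text, each with 100 hidden units (features).
for side in (8, 28, 96):
    print(side, side * side, fully_connected_params(side, side, 100))
# prints:
# 8 64 6500
# 28 784 78500
# 96 9216 921700
```

For the 96x96 case this gives 9216 input units (about <math>10^4</math>) and 921700 parameters (on the order of <math>10^6</math>).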
 +
 
 +
== Locally Connected Networks ==
 +
 
 +
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.)
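As a rough sketch of what "a small contiguous region of pixels" means in practice, the Python snippet below lists which input units a single locally connected hidden unit would see. It assumes the image is flattened in row-major order with 0-based indices; the helper name <code>receptive_field</code> and this flattening order are choices made for this example only.

```python
def receptive_field(top, left, patch, width):
    """Flat input indices of the patch x patch region whose top-left pixel
    is (top, left), for an image of the given width stored in row-major order.
    """
    return [(top + i) * width + (left + j)
            for i in range(patch)
            for j in range(patch)]


# A hidden unit looking at an 8x8 region of a 96x96 image connects to
# 64 input units instead of all 9216.
indices = receptive_field(top=3, left=5, patch=8, width=96)
print(len(indices))   # 64
print(indices[:3])    # [293, 294, 295]
```

Each such hidden unit then needs only 64 weights, rather than one weight per pixel of the full image.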
 +
 
 +
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology.  Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).
 +
 
 +
== Convolutions ==
 +
 
 +
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as those of any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations.
 +
<!--
 +
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.
 +
 
 +
== Fast Feature Learning and Extraction ==
 +
 
 +
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute.
 +
!-->
 +
 
 +
 
 +
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image.  Specifically, we can take the learned 8x8 features and
 +
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image.
 +
 
 +
 
 +
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. Suppose further that this was done with an autoencoder that has 100 hidden units. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots, (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in 100 sets of 89x89 convolved features.
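The count of patch positions in this example can be verified with a few lines of Python. The sketch uses 1-based coordinates to match the text, and the helper name <code>patch_origins</code> is hypothetical.

```python
def patch_origins(image_side, patch_side):
    """List the 1-based (row, col) top-left corners of every patch that
    fits entirely inside a square image."""
    last = image_side - patch_side + 1
    return [(row, col)
            for row in range(1, last + 1)
            for col in range(1, last + 1)]


origins = patch_origins(96, 8)
print(origins[0], origins[1], origins[-1])  # (1, 1) (1, 2) (89, 89)
print(len(origins))                         # 7921, i.e. 89 * 89
print(100 * len(origins))                   # 792100 activations for 100 features
```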
 +
 
 +
 
 +
<!--
 +
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification.  
 +
!-->
[[File:Convolution_schematic.gif]]
-
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. These convolved features can then be [[#pooling | pooled]] for classification, as described below.
+
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features.  
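The formal description above might be sketched in plain Python as follows. This is a minimal, unoptimized illustration rather than the exercise's reference implementation: the function names are made up, images are nested lists, and each patch is flattened in row-major order (an assumption; the order must match whatever was used when training <math>W^{(1)}</math>). The weights are applied to each patch directly, without flipping, which mirrors the description of running every patch through the trained autoencoder.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def convolve_features(image, W, bias, a, b):
    """Compute f_convolved for one image.

    image: r x c nested list of pixel values
    W:     k rows, each of length a * b (weights for one feature)
    bias:  list of k biases
    Returns a k x (r - a + 1) x (c - b + 1) nested list of activations.
    """
    r, c = len(image), len(image[0])
    k = len(W)
    out = [[[0.0] * (c - b + 1) for _ in range(r - a + 1)] for _ in range(k)]
    for i in range(r - a + 1):
        for j in range(c - b + 1):
            # Flatten the a x b patch with top-left corner (i, j), row-major.
            patch = [image[i + di][j + dj] for di in range(a) for dj in range(b)]
            for f in range(k):
                z = sum(w * x for w, x in zip(W[f], patch)) + bias[f]
                out[f][i][j] = sigmoid(z)
    return out


# Tiny usage example: a 4x5 image of zeros, k = 2 features on 2x2 patches.
image = [[0.0] * 5 for _ in range(4)]
W = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 1.0, -1.0]]
bias = [0.0, 0.0]
features = convolve_features(image, W, bias, a=2, b=2)
print(len(features), len(features[0]), len(features[0][0]))  # 2 3 4
```

The output shape 2 x 3 x 4 agrees with <math>k \times (r - a + 1) \times (c - b + 1)</math> for <math>k = 2</math>, <math>r = 4</math>, <math>c = 5</math>, <math>a = b = 2</math>.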
 +
 
 +
 
 +
 
 +
In the next section, we further describe how to "pool" these features together to get even better features for classification.
 +
 
 +
 
 +
{{Languages|卷积特征提取|中文}}

Latest revision as of 04:11, 8 April 2013
