PCA
that you retained 120 (or whatever other number of) components.

== PCA on Images ==

For PCA to work, usually we want each of the features <math>\textstyle x_1, x_2, \ldots, x_n</math>
to have a similar range of values to the others (and to have a mean close to
zero). If you've used PCA on other applications before, you may therefore have
separately pre-processed each feature to have zero mean and unit variance, by
separately estimating the mean and variance of each feature <math>\textstyle x_j</math>
(see the sketch after this paragraph). However, this isn't the pre-processing
that we will apply to most types of images. Specifically, suppose we are
training our algorithm on '''natural images''', so that <math>\textstyle x_j</math> is
the value of pixel <math>\textstyle j</math>. By "natural images," we informally mean the type of image that
a typical animal or person might see over their lifetime. (Usually we use
images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image
patches randomly from these to train the algorithm. In practice, though, most
feature learning algorithms are extremely robust to the exact type of image
they are trained on, so most images taken with a normal camera, so long as they
aren't excessively blurry and don't have strange artifacts, should work.)
In this case, it makes little sense to estimate a separate mean and
variance for each pixel, because the statistics in one part
of the image should (theoretically) be the same as in any other.
This property of images is called '''stationarity'''.

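For contrast, that familiar per-feature standardization would look something
like the following numpy sketch (the data here is synthetic, and this is the
preprocessing we will ''not'' be using for natural images):

<pre>
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))   # synthetic data: 1000 examples, 50 features

# Estimate a separate mean and variance for each feature x_j ...
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# ... and rescale so every feature has zero mean and unit variance.
X = (X - mu) / sigma
</pre>
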
In detail, in order for PCA to work well, informally we require that (i) the
features have approximately zero mean, and (ii) the different features have
similar variances to each other. With natural images, (ii) is already
satisfied even without variance normalization, and so we won't perform any
variance normalization.
(If you are training on audio data, say spectrograms, or on text data, say
bag-of-words vectors, we will usually not perform variance normalization
either.)
In fact, PCA is invariant to the scaling of
the data, and will return the same eigenvectors regardless of the scaling of
the input. More formally, if you multiply each feature vector <math>\textstyle x</math> by some
positive number (thus scaling every feature in every training example by the
same number), PCA's output eigenvectors will not change.

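As a quick sanity check of this invariance claim, here is a minimal numpy
sketch; the data is synthetic and the helper <code>pca_eigenvectors</code> is
our own illustration, not a library routine:

<pre>
import numpy as np

def pca_eigenvectors(X):
    # Covariance matrix of mean-normalized data, one example per row.
    sigma = X.T @ X / X.shape[0]
    # eigh is appropriate for the symmetric matrix sigma; it returns
    # eigenvalues in ascending order, so reverse the columns to put the
    # largest-variance directions first.
    _, U = np.linalg.eigh(sigma)
    return U[:, ::-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))   # 500 synthetic examples, 16 features
X = X - X.mean(axis=0)           # mean-normalize the features

U1 = pca_eigenvectors(X)
U2 = pca_eigenvectors(7.3 * X)   # scale every feature of every example

# The eigenvectors match up to sign, as claimed.
print(np.allclose(np.abs(U1), np.abs(U2)))   # True
</pre>
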
So, we won't use variance normalization. The only normalization we need to
perform then is mean normalization, to ensure that the features have a mean
around zero. In many applications, we are not interested
in how bright the overall input image is. For example, in object recognition
tasks, the overall brightness of the image doesn't affect what objects
are in the image. More formally, we are not interested in the
mean intensity value of an image patch; thus, we can subtract out this value,
as a form of mean normalization.

Concretely, if <math>\textstyle x^{(i)} \in \Re^{n}</math> are the (grayscale) intensity values of
a 16x16 image patch (<math>\textstyle n=256</math>), we might normalize the intensity of each image
<math>\textstyle x^{(i)}</math> as follows:
\begin{align}
\mu^{(i)} &:= \frac{1}{n} \sum_{j=1}^n x^{(i)}_j \\
x^{(i)}_j &:= x^{(i)}_j - \mu^{(i)} \quad \text{for all } j
\end{align}
Note that the two steps above are done separately for each image <math>\textstyle x^{(i)}</math>,
and that <math>\textstyle \mu^{(i)}</math> here is the mean intensity of the image <math>\textstyle x^{(i)}</math>. In particular,
this is not the same thing as estimating a mean value separately for each pixel <math>\textstyle x_j</math>.

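To make the distinction concrete, here is a minimal numpy sketch of this
per-image mean normalization (the patch matrix is synthetic; with real data
each row would be one flattened 16x16 patch):

<pre>
import numpy as np

# Synthetic stand-in for real data: 1000 grayscale patches, each
# flattened into a row of n = 256 pixel intensities.
rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(1000, 256))

# Per-image mean normalization: compute each patch's own mean
# intensity mu^(i), then subtract it from all of that patch's pixels.
mu = patches.mean(axis=1, keepdims=True)   # shape (1000, 1)
patches = patches - mu

# Contrast with per-feature normalization (NOT what we do for natural
# images), which would estimate a separate mean per pixel position:
#   patches - patches.mean(axis=0)
</pre>
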
If you are training your algorithm on images other than natural images (for
example, images of handwritten characters, or images of single isolated objects
centered against a white background), other types of normalization might be
worth considering, and the best choice may be application-dependent. But
when training on natural images, the per-image mean normalization
given in the two steps above would be a reasonable default.