PCA
as follows:
:<math>\begin{align}
\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T.
\end{align}</math>
If <math>\textstyle x</math> has zero mean, then <math>\textstyle \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>. (The symbol "<math>\textstyle \Sigma</math>", pronounced "Sigma", is the standard notation for denoting the covariance matrix. Unfortunately it looks just like the summation symbol, as in <math>\sum_{i=1}^n i</math>; but these are two different things.)
It can then be shown that <math>\textstyle u_1</math>---the principal direction of variation of the data---is the top (principal) eigenvector of <math>\textstyle \Sigma</math>, and <math>\textstyle u_2</math> is the second eigenvector.

Note: If you are interested in seeing a more formal mathematical derivation/justification of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at bottom of this page). You won't need to do so to follow along this course, however.

You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).
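For instance, using NumPy (a sketch with made-up data and our own variable names; any numerical linear algebra package with a symmetric eigensolver works the same way):

```python
import numpy as np

# Sketch: m zero-mean examples stored as the columns of an n x m matrix x.
rng = np.random.default_rng(0)
n, m = 2, 1000
x = rng.standard_normal((n, m))
x = x - x.mean(axis=1, keepdims=True)      # make each feature zero mean

Sigma = (x @ x.T) / m                      # Sigma = (1/m) sum_i x^(i) x^(i)^T

# Sigma is symmetric, so np.linalg.eigh applies; it returns eigenvalues in
# ascending order, so flip to get u_1 (the principal direction) first.
eigvals, eigvecs = np.linalg.eigh(Sigma)
lam = eigvals[::-1]                        # lambda_1 >= lambda_2 >= ...
U = eigvecs[:, ::-1]                       # columns u_1, ..., u_n
```

The columns of <code>U</code> are then the eigenvectors <math>\textstyle u_1, \ldots, u_n</math> sorted by decreasing eigenvalue.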
This is the training set rotated into the <math>\textstyle u_1</math>,<math>\textstyle u_2</math> basis. In the general case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis <math>\textstyle u_1</math>,<math>\textstyle u_2</math>, ...,<math>\textstyle u_n</math>.

One of the properties of <math>\textstyle U</math> is that it is an "orthogonal" matrix, which means that it satisfies <math>\textstyle U^TU = UU^T = I</math>. So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the original data <math>\textstyle x</math>, you can compute
:<math>\begin{align}
x = U x_{\rm rot}
\end{align}</math>
because <math>\textstyle U x_{\rm rot} = UU^T x = x</math>.
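As a quick sketch of the rotation and its inverse (synthetic data and variable names are ours, not from the text):

```python
import numpy as np

# Rotate zero-mean data into the eigenvector basis of its covariance
# matrix, then rotate back using the orthogonality of U.
rng = np.random.default_rng(1)
n, m = 2, 500
x = rng.standard_normal((n, m))
x = x - x.mean(axis=1, keepdims=True)

Sigma = (x @ x.T) / m
_, eigvecs = np.linalg.eigh(Sigma)
U = eigvecs[:, ::-1]       # columns u_1, ..., u_n

x_rot = U.T @ x            # training set rotated into the u_1, ..., u_n basis
x_back = U @ x_rot         # U U^T = I, so this recovers x exactly
```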
== Reducing the Data Dimension ==

We see that the principal direction of variation of the data is the first dimension <math>\textstyle x_{{\rm rot},1}</math> of this rotated data. Thus, if we want to reduce this data to one dimension, we can set
:<math>\begin{align}
\tilde{x}^{(i)} = x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)} \in \Re.
\end{align}</math>
More generally, if we want a <math>\textstyle k</math> dimensional representation (with <math>\textstyle k < n</math>), we can take the first <math>\textstyle k</math> components of <math>\textstyle x_{\rm rot}</math>, which correspond to the top <math>\textstyle k</math> directions of variation.
Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation of the original <math>\textstyle x \in \Re^n</math>. Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to the original value of <math>\textstyle x</math>? From an [[#Rotating the Data|earlier section]], we know that <math>\textstyle x = U x_{\rm rot}</math>. Further, we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math>, where we have set the last <math>\textstyle n-k</math> components to zeros. Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can
pad it out with <math>\textstyle n-k</math> zeros to get our approximation to <math>\textstyle x_{\rm rot} \in \Re^n</math>, and then compute
:<math>\begin{align}
\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}
= \sum_{i=1}^k u_i \tilde{x}_i.
\end{align}</math>
The final equality above comes from the definition of <math>\textstyle U</math> [[#Example and Mathematical Background|given earlier]].
(In a practical implementation, we wouldn't actually zero pad <math>\textstyle \tilde{x}</math> and then multiply by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just multiply <math>\textstyle \tilde{x}</math> with the first <math>\textstyle k</math> columns of <math>\textstyle U</math>, as in the final expression above.)

Applying this to our dataset, we get the following plot for <math>\textstyle \hat{x}</math>:

[[File:PCA-xhat.png | 600px]]

We are thus using a 1 dimensional approximation to the original dataset.
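In code, the compression and the zero-padding-free reconstruction can be sketched as follows (synthetic data and variable names are ours):

```python
import numpy as np

# Keep only the top k components, then reconstruct x_hat by multiplying
# with the first k columns of U, rather than zero padding x_tilde.
rng = np.random.default_rng(2)
n, m, k = 2, 300, 1
x = rng.standard_normal((n, m))
x = x - x.mean(axis=1, keepdims=True)

Sigma = (x @ x.T) / m
_, eigvecs = np.linalg.eigh(Sigma)
U = eigvecs[:, ::-1]        # columns u_1, ..., u_n

x_tilde = U[:, :k].T @ x    # compressed k-dimensional representation
x_hat = U[:, :k] @ x_tilde  # approximation to x, i.e. sum_i u_i x_tilde_i
```

With <math>\textstyle k=n</math> the reconstruction is exact, since then <math>\textstyle \hat{x} = UU^Tx = x</math>.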
approximation to the data. | approximation to the data. | ||
To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance retained''' for different values of <math>\textstyle k</math>. Concretely, if <math>\textstyle k=n</math>, then we have an exact approximation to the data, and we say that 100% of the variance is retained. I.e., all of the variation of the original data is retained. Conversely, if <math>\textstyle k=0</math>, then we are approximating all the data with the zero vector, and thus 0% of the variance is retained.
More generally, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the eigenvalues of <math>\textstyle \Sigma</math> (sorted in decreasing order), so that <math>\textstyle \lambda_j</math> is the eigenvalue corresponding to the eigenvector <math>\textstyle u_j</math>. Then if we retain <math>\textstyle k</math> principal components, the percentage of variance retained is given by:
:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}.
\end{align}</math>
In our simple 2D example above, <math>\textstyle \lambda_1 = 7.29</math>, and <math>\textstyle \lambda_2 = 0.69</math>. Thus, by keeping only <math>\textstyle k=1</math> principal components, we retained <math>\textstyle 7.29/(7.29+0.69) = 0.913</math>, or 91.3% of the variance.
A more formal definition of percentage of variance retained is beyond the scope of these notes. Intuitively, however, the principal components are the ones that are more variable and that take on larger values, and for which we would incur a greater approximation error if we were to set them to zero.
In the case of images, one common heuristic is to choose <math>\textstyle k</math> so as to retain 99% of the variance. In other words, we pick the smallest value of <math>\textstyle k</math> that satisfies
:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99.
\end{align}</math>
Depending on the application, if you are willing to incur some additional error, values in the 90-98% range are also sometimes used. When you describe to others how you applied PCA, saying that you chose <math>\textstyle k</math> to retain 95% of the variance will also be a much more easily interpretable description than saying that you retained 120 (or whatever other number of) components.
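As an illustration, a small helper (the function name is ours, not from the text) that picks the smallest <math>\textstyle k</math> meeting a variance-retained target, given the eigenvalues:

```python
import numpy as np

def choose_k(eigvals, target=0.99):
    """Smallest k whose top-k eigenvalues retain `target` of the variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # decreasing order
    retained = np.cumsum(lam) / lam.sum()  # variance retained for k = 1..n
    # index of the first k whose retained fraction reaches the target
    return int(np.searchsorted(retained, target) + 1)

# The 2D example from the text: lambda_1 = 7.29, lambda_2 = 0.69.
print(choose_k([7.29, 0.69], target=0.90))  # k=1 retains 91.3% -> 1
print(choose_k([7.29, 0.69], target=0.99))  # need both components -> 2
```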
== PCA on Images ==

For PCA to work, usually we want each of the features <math>\textstyle x_1, x_2, \ldots, x_n</math> to have a similar range of values to the others (and to have a mean close to zero). Specifically, suppose we are training our algorithm on '''natural images''', so that <math>\textstyle x_j</math> is the value of pixel <math>\textstyle j</math>. By "natural images," we informally mean the type of image that a typical animal or person might see over their lifetime.
Note: Usually we use images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image patches randomly from these to train the algorithm. But in practice most feature learning algorithms are extremely robust to the exact type of image they are trained on, so most images taken with a normal camera, so long as they aren't excessively blurry and don't have strange artifacts, should work.

When training on natural images, it makes little sense to estimate a separate mean and variance for each pixel, because the statistics in one part of the image should (theoretically) be the same as any other. This property of images is called '''stationarity'''.
In detail, in order for PCA to work well, informally we require that (i) The features have approximately zero mean, and (ii) The different features have similar variances to each other. With natural images, (ii) is already approximately satisfied even without variance normalization, so the only normalization we need to perform is mean normalization, to ensure that the features have a mean around zero. Concretely, if <math>\textstyle x^{(i)}</math> are the (grayscale) intensity values of a 16x16 image patch (<math>\textstyle n=256</math>), we might normalize the intensity of each image <math>\textstyle x^{(i)}</math> as follows:
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | + | <math>\mu^{(i)} := \frac{1}{n} \sum_{j=1}^n x^{(i)}_j</math> | |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | <math>x^{(i)}_j := x^{(i)}_j - \mu^{(i)}</math>, for all <math>\textstyle j</math> | |
Note that the two steps above are done separately for each image <math>\textstyle x^{(i)}</math>, and that <math>\textstyle \mu^{(i)}</math> here is the mean intensity of the image <math>\textstyle x^{(i)}</math>. In particular, this is not the same thing as estimating a mean value separately for each pixel <math>\textstyle x_j</math>.
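A sketch of this per-image normalization (array names are ours), emphasizing that the mean is taken over the pixels within each image, not over images for each pixel:

```python
import numpy as np

# Per-image mean normalization for natural image patches: subtract each
# patch's own mean intensity mu^(i) from all of its pixels.
rng = np.random.default_rng(3)
m, n = 100, 256                     # 100 patches of 16x16 = 256 pixels
X = rng.uniform(0.0, 1.0, (m, n))   # one patch per row

mu = X.mean(axis=1, keepdims=True)  # mu^(i): mean intensity of patch i
X = X - mu                          # each patch now has zero mean intensity
# (axis=0 here would instead compute a per-pixel mean across images,
# which is exactly what the text says NOT to do for natural images)
```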
If you are training your algorithm on images other than natural images (for example, images of handwritten characters, or images of single isolated objects centered against a white background), other types of normalization might be worth considering, and the best choice may be application dependent. But when training on natural images, using the per-image mean normalization method as given in the equations above would be a reasonable default.
== References ==

http://cs229.stanford.edu
{{PCA}}
{{Languages|主成分分析|中文}}