Principal Component Analysis (PCA)

[[File:PCA-u1.png | 600px]]
I.e., the data varies much more in the direction <math>\textstyle u_1</math> than <math>\textstyle u_2</math>.  
To more formally find the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, we first compute the matrix <math>\textstyle \Sigma</math>
as follows:
:<math>\begin{align}
\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T.
\end{align}</math>

If <math>\textstyle x</math> has zero mean, then <math>\textstyle \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>.  (The symbol "<math>\textstyle \Sigma</math>", pronounced "Sigma", is the standard notation for denoting the covariance matrix.  Unfortunately it looks just like the summation symbol, as in <math>\sum_{i=1}^n i</math>; but these are two different things.)  

It can then be shown that <math>\textstyle u_1</math>---the principal direction of variation of the data---is the top (principal) eigenvector of <math>\textstyle \Sigma</math>, and <math>\textstyle u_2</math> is the second eigenvector.

Note: If you are interested in seeing a more formal mathematical derivation/justification of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at bottom of this page).  You won't need to do so to follow along with this course, however.

You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).
Concretely, let us compute the eigenvectors of <math>\textstyle \Sigma</math>, and stack
the eigenvectors in columns to form the matrix <math>\textstyle U</math>:

:<math>\begin{align}
U =
\begin{bmatrix}
| & | & & |  \\
u_1 & u_2 & \cdots & u_n  \\
| & | & & |
\end{bmatrix}
\end{align}</math>

Here, <math>\textstyle u_1</math> is the principal eigenvector (corresponding to the largest eigenvalue),
<math>\textstyle u_2</math> is the second eigenvector, and so on.  
Also, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the corresponding eigenvalues.  
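
As a concrete illustration (not part of the original notes), the following Python/numpy sketch builds <math>\textstyle \Sigma</math> from a hypothetical zero-mean data matrix whose columns are the examples <math>\textstyle x^{(i)}</math>, and stacks the sorted eigenvectors into <math>\textstyle U</math>.  The names <code>X</code>, <code>U</code> and <code>lam</code> are our own, and the random data is only a stand-in for a real training set.

<source lang="python">
import numpy as np

# Hypothetical zero-mean dataset: m = 1000 examples x^(i) in R^2, stored as the columns of X.
m = 1000
X = np.random.randn(2, m) * np.array([[3.0], [0.5]])   # more variation along the first coordinate

Sigma = X @ X.T / m                    # Sigma = (1/m) sum_i x^(i) (x^(i))^T

# Sigma is symmetric, so eigh applies; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]      # re-sort into decreasing order
lam = eigvals[order]                   # lambda_1 >= lambda_2 >= ... >= lambda_n
U = eigvecs[:, order]                  # columns u_1, u_2, ..., u_n (the principal directions)
</source>
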
The vectors <math>\textstyle u_1</math> and <math>\textstyle u_2</math> in our example form a new basis in which we  
can represent the data.  Concretely, let <math>\textstyle x \in \Re^2</math> be some training example.  Then <math>\textstyle u_1^Tx</math>
is the length (magnitude) of the projection of <math>\textstyle x</math> onto the vector <math>\textstyle u_1</math>.   
Similarly, <math>\textstyle u_2^Tx</math> is the magnitude of <math>\textstyle x</math> projected onto the vector <math>\textstyle u_2</math>.

== Rotating the Data ==

We can represent the data in the <math>\textstyle u_1, u_2</math> basis by computing

:<math>\begin{align}
x_{\rm rot} = U^Tx = \begin{bmatrix} u_1^Tx \\ u_2^Tx \end{bmatrix}
\end{align}</math>

Plotting this transformed data <math>\textstyle x_{\rm rot}</math>, we get:

[[File:PCA-rotated.png|600px]]
This is the training set rotated into the <math>\textstyle u_1</math>,<math>\textstyle u_2</math> basis. In the general
case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis  
<math>\textstyle u_1</math>,<math>\textstyle u_2</math>, ...,<math>\textstyle u_n</math>.  
So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the  
original data <math>\textstyle x</math>, you can compute  
:<math>\begin{align}
x = U x_{\rm rot}  ,
\end{align}</math>

because <math>\textstyle U U^T = U^T U = I</math>.

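As an illustrative numpy sketch (our own, reusing the hypothetical <code>X</code> and <code>U</code> from the earlier snippet, recomputed here so it runs on its own), the rotation and its inverse are each a single matrix product:

<source lang="python">
import numpy as np

# Hypothetical setup, as in the earlier sketch: examples are the columns of X.
X = np.random.randn(2, 1000) * np.array([[3.0], [0.5]])
_, U = np.linalg.eigh(X @ X.T / X.shape[1])
U = U[:, ::-1]                         # columns sorted by decreasing eigenvalue

x_rot = U.T @ X                        # every example rotated into the u_1, ..., u_n basis
X_back = U @ x_rot                     # x = U x_rot
assert np.allclose(X_back, X)          # no information is lost, since U U^T = U^T U = I
</source>
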
== Reducing the Data Dimension ==

The principal direction of variation of the data is the first dimension <math>\textstyle x_{{\rm rot},1}</math> of this rotated data.  Thus, if we want to reduce this data to one dimension, we can set

:<math>\begin{align}
\tilde{x}^{(i)} = x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)} \in \Re.
\end{align}</math>
More generally, if <math>\textstyle x \in \Re^n</math> and we want to reduce it to  
a <math>\textstyle k</math> dimensional representation <math>\textstyle \tilde{x} \in \Re^k</math> (where <math>\textstyle k < n</math>), we would
take the first <math>\textstyle k</math> components of <math>\textstyle x_{\rm rot}</math>, which correspond to
the top <math>\textstyle k</math> directions of variation.

Another way of explaining PCA is that <math>\textstyle x_{\rm rot}</math> is an <math>\textstyle n</math> dimensional
vector, where the first few components are likely to  
be large (e.g., in our example, we saw that <math>\textstyle x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)}</math> takes
reasonably large values for most examples <math>\textstyle i</math>), and the later components are likely to be small (e.g., in our example, <math>\textstyle x_{{\rm rot},2}^{(i)} = u_2^Tx^{(i)}</math> was more likely to be small).
What PCA does is it drops the later (smaller) components of <math>\textstyle x_{\rm rot}</math>, and just approximates them with 0's.  Concretely, our definition of <math>\textstyle \tilde{x}</math> can also be arrived at by using an approximation to <math>\textstyle x_{\rm rot}</math> where
all but the first
<math>\textstyle k</math> components are zeros.  In other words, we have:  
:<math>\begin{align}
\tilde{x} =  
\begin{bmatrix}
x_{{\rm rot},1} \\
\vdots \\
x_{{\rm rot},k} \\
0 \\
\vdots \\
0 \\
\end{bmatrix}
\approx
\begin{bmatrix}
x_{{\rm rot},1} \\
\vdots \\
x_{{\rm rot},k} \\
x_{{\rm rot},k+1} \\
\vdots \\
x_{{\rm rot},n}
\end{bmatrix}
= x_{\rm rot}  
\end{align}</math>
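
A minimal numpy sketch of this truncation (our own illustration, with the same hypothetical setup as the earlier snippets); the zero-padded array is built only to mirror the equation above:

<source lang="python">
import numpy as np

# Hypothetical setup, as in the earlier sketches.
X = np.random.randn(2, 1000) * np.array([[3.0], [0.5]])
_, U = np.linalg.eigh(X @ X.T / X.shape[1])
U = U[:, ::-1]

k = 1
x_rot = U.T @ X
x_tilde = x_rot[:k, :]                 # k-dimensional representation: the top k components
x_rot_approx = np.vstack([x_tilde,     # zero-padded approximation of x_rot, as in the equation
                          np.zeros((X.shape[0] - k, X.shape[1]))])
</source>
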
In our example, this gives us the following plot of <math>\textstyle \tilde{x}</math> (using <math>\textstyle n=2, k=1</math>):
[[File:PCA-xtilde.png | 600px]]
However, since the final <math>\textstyle n-k</math> components of <math>\textstyle \tilde{x}</math> as defined above would
always be zero, there is no need to keep these zeros around, and so we
define <math>\textstyle \tilde{x}</math> as a <math>\textstyle k</math>-dimensional vector with just the first <math>\textstyle k</math> (non-zero) components.  
This also explains why we wanted to express our data in the <math>\textstyle u_1, u_2, \ldots, u_n</math> basis:
Deciding which components to keep becomes just keeping the top <math>\textstyle k</math> components.  When we
do this, we also say that we are "retaining the top <math>\textstyle k</math> PCA (or principal) components."
== Recovering an Approximation of the Data ==

Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation of the original <math>\textstyle x \in \Re^n</math>.  Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to the original value of <math>\textstyle x</math>?  From the section on rotating the data, we know that <math>\textstyle x = U x_{\rm rot}</math>.  Further, we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math> in which the last <math>\textstyle n-k</math> components have been set to zero and dropped.  Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can pad it out with <math>\textstyle n-k</math> zeros to get our approximation to <math>\textstyle x_{\rm rot} \in \Re^n</math>, and finally pre-multiply by <math>\textstyle U</math> to get our approximation to <math>\textstyle x</math>.  Concretely, we get
:<math>\begin{align}
\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}
= \sum_{i=1}^k u_i \tilde{x}_i.
\end{align}</math>

The final equality above comes from the definition of <math>\textstyle U</math> [[#Example and Mathematical Background|given earlier]].
(In a practical implementation, we wouldn't actually zero pad <math>\textstyle \tilde{x}</math> and then multiply
by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just  
multiply <math>\textstyle \tilde{x} \in \Re^k</math> with the first <math>\textstyle k</math> columns of <math>\textstyle U</math>, as in the final expression above.)
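
A short numpy sketch of this recovery (our own, using the hypothetical names from the earlier snippets); note that it multiplies <math>\textstyle \tilde{x}</math> by the first <math>\textstyle k</math> columns of <math>\textstyle U</math> rather than zero padding:

<source lang="python">
import numpy as np

# Hypothetical setup, as in the earlier sketches.
X = np.random.randn(2, 1000) * np.array([[3.0], [0.5]])
_, U = np.linalg.eigh(X @ X.T / X.shape[1])
U = U[:, ::-1]

k = 1
x_tilde = U[:, :k].T @ X               # k-dimensional representation of every example
x_hat = U[:, :k] @ x_tilde             # hat{x} = sum_{i=1}^k u_i * tilde{x}_i, without zero padding
</source>
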
Applying this to our dataset, we get the following plot for <math>\textstyle \hat{x}</math>:
[[File:PCA-xhat.png | 600px]]
We are thus using a 1 dimensional approximation to the original dataset.  
If you are training an autoencoder or other unsupervised feature learning algorithm,
the running time of your algorithm will depend on the dimension of the input.  If you feed <math>\textstyle \tilde{x} \in \Re^k</math> into your learning algorithm instead of <math>\textstyle x</math>, then you'll be training on a lower-dimensional input, and so your algorithm might run significantly faster.  For many datasets, the lower dimensional <math>\textstyle \tilde{x}</math> representation can be an extremely good approximation to the original, and using PCA this way can significantly speed up your algorithm while
introducing very little approximation error.
== Number of components to retain ==

How do we set <math>\textstyle k</math>; i.e., how many PCA components should we retain?  In our simple 2 dimensional example, it seemed natural to retain 1 of the 2 components, but for higher dimensional data, this decision is less trivial.  If <math>\textstyle k</math> is too large, then we won't be compressing the data much; in the limit of <math>\textstyle k=n</math>, we're just using the original data (but rotated into a different basis).  Conversely, if <math>\textstyle k</math> is too small, then we might be using a very bad
approximation to the data.  
To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance retained'''  
for different values of <math>\textstyle k</math>.  Concretely, if <math>\textstyle k=n</math>, then we have
an exact approximation to the data, and we say that 100% of the variance is
retained.  I.e., all of the variation of the original data is retained.  Conversely, if <math>\textstyle k=0</math>, then we are approximating all the data with the zero vector,
and thus 0% of the variance is retained.  
More generally, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the eigenvalues  
of <math>\textstyle \Sigma</math> (sorted in decreasing order), so that <math>\textstyle \lambda_j</math> is the eigenvalue
corresponding to the eigenvector <math>\textstyle u_j</math>.  Then if we retain <math>\textstyle k</math> principal components,  
the percentage of variance retained is given by:

:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}.
\end{align}</math>

In our simple 2D example above, <math>\textstyle \lambda_1 = 7.29</math>, and <math>\textstyle \lambda_2 = 0.69</math>.  Thus,
by keeping only <math>\textstyle k=1</math> principal components, we retained <math>\textstyle 7.29/(7.29+0.69) = 0.913</math>,
or 91.3% of the variance.
A more formal definition of percentage of variance retained is beyond the scope
of these notes.  However, it is possible to show that <math>\textstyle \lambda_j =
\frac{1}{m} \sum_{i=1}^m \left( x_{{\rm rot},j}^{(i)} \right)^2</math>.  Thus, if <math>\textstyle \lambda_j \approx 0</math>, that shows that <math>\textstyle x_{{\rm rot},j}</math> is usually near 0 anyway, and we lose relatively little by approximating it with a constant 0.
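
This identity is easy to check numerically; the following sketch (ours, using the hypothetical data conventions of the earlier snippets) compares each eigenvalue with the mean squared value of the corresponding rotated coordinate:

<source lang="python">
import numpy as np

# Hypothetical setup, as in the earlier sketches.
X = np.random.randn(2, 1000) * np.array([[3.0], [0.5]])
m = X.shape[1]
eigvals, eigvecs = np.linalg.eigh(X @ X.T / m)
lam, U = eigvals[::-1], eigvecs[:, ::-1]

x_rot = U.T @ X
assert np.allclose(lam, np.mean(x_rot ** 2, axis=1))   # lambda_j = (1/m) sum_i (x_rot,j^(i))^2
</source>
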
In the case of images, one common heuristic is to choose <math>\textstyle k</math> so as to retain 99% of
the variance.  In other words, we pick the smallest value of <math>\textstyle k</math> that satisfies  
:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99.
\end{align}</math>

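As an illustrative sketch (ours, not from the notes), picking the smallest such <math>\textstyle k</math> from the sorted eigenvalues takes a few lines of numpy; here <code>lam</code> simply holds the two eigenvalues of the 2-D example above:

<source lang="python">
import numpy as np

lam = np.array([7.29, 0.69])                         # eigenvalues of Sigma, decreasing order
frac_retained = np.cumsum(lam) / np.sum(lam)         # variance retained for k = 1, ..., n
k = int(np.searchsorted(frac_retained, 0.99)) + 1    # smallest k with at least 99% retained
print(frac_retained, k)                              # [0.9135.. 1.0] and k = 2 for this 2-D example
</source>
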
== PCA on Images ==

Note: Usually we use images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image patches randomly from these to train the algorithm.  But in practice most feature learning algorithms are extremely robust to the exact type of image they are trained on, so most images taken with a normal camera, so long as they aren't excessively blurry or have strange artifacts, should work.
When training on natural images, it makes little sense to estimate a separate mean and
variance for each pixel, because the statistics in one part
of the image should (theoretically) be the same as any other.   
In detail, in order for PCA to work well, informally we require that (i) The
features have approximately zero mean, and (ii) The different features have
similar variances to each other.  With natural images, (ii) is already
satisfied even without variance normalization.

So, we won't use variance normalization.  The only normalization we need to
perform then is mean normalization, to ensure that the features have a mean
around zero.  Depending on the application, very often we are not interested
in how bright the overall input image is.  For example, in object recognition tasks, the overall brightness of the image doesn't affect what objects are present, and so we can subtract out the mean intensity value of each image patch, as a form of mean normalization.

Concretely, if <math>\textstyle x^{(i)} \in \Re^{n}</math> are the (grayscale) intensity values of
a 16x16 image patch (<math>\textstyle n=256</math>), we might normalize the intensity of each image
<math>\textstyle x^{(i)}</math> as follows:  

<math>\mu^{(i)} := \frac{1}{n} \sum_{j=1}^n x^{(i)}_j</math>

<math>x^{(i)}_j := x^{(i)}_j - \mu^{(i)}</math>, for all <math>\textstyle j</math>

Note that the two steps above are done separately for each image <math>\textstyle x^{(i)}</math>,
and that <math>\textstyle \mu^{(i)}</math> here is the mean intensity of the image <math>\textstyle x^{(i)}</math>.  In particular,
this is not the same thing as estimating a mean value separately for each pixel <math>\textstyle x_j</math>.
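
A minimal numpy sketch of this per-image mean normalization (our own; the array <code>X</code>, holding flattened 16x16 patches as columns, is a hypothetical stand-in for real image data):

<source lang="python">
import numpy as np

# Hypothetical batch of m flattened 16x16 grayscale patches: X has shape (n, m) with n = 256.
m = 10000
X = np.random.rand(256, m)             # stand-in for real image patches

mu = X.mean(axis=0, keepdims=True)     # mu^(i): the mean intensity of each individual patch
X = X - mu                             # subtract each patch's own mean from all of its pixels
# Note: this is per-image mean normalization, not X - X.mean(axis=1, keepdims=True),
# which would instead estimate a separate mean for every pixel.
</source>
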
If you are training your algorithm on images other than natural images (for example, images of handwritten characters, or images of single isolated objects centered against a white background), other types of normalization might be worth considering, and the best choice may be application dependent. But when training on natural images, using the per-image mean normalization method as given in the equations above would be a reasonable default.