PCA

== Recovering an Approximation of the Data ==

to the original, and using PCA this way can significantly speed up your algorithm while
introducing very little approximation error.

== Number of components to retain ==

How do we set <math>\textstyle k</math>; i.e., how many PCA components should we retain?  In our simple 2-dimensional example, it seemed natural to retain 1 out of the 2 components, but for higher-dimensional data this decision is less trivial.  If <math>\textstyle k</math> is too large, then we won't be compressing the data much; in the limit of <math>\textstyle k=n</math>, we are just using the original data (but rotated into a different basis).  Conversely, if <math>\textstyle k</math> is too small, then we might be using a very bad approximation to the data.

To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance retained''' for different values of <math>\textstyle k</math>.  Concretely, if <math>\textstyle k=n</math>, then we have an exact approximation to the data, and we say that 100% of the variance is retained; i.e., all of the variation of the original data is retained.  Conversely, if <math>\textstyle k=0</math>, then we are approximating all the data with the zero vector, and thus 0% of the variance is retained.

More generally, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the eigenvalues of <math>\textstyle \Sigma</math> (sorted in decreasing order), so that <math>\textstyle \lambda_j</math> is the eigenvalue corresponding to the eigenvector <math>\textstyle u_j</math>.  Then if we retain <math>\textstyle k</math> principal components, the percentage of variance retained is given by:

:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}.
\end{align}</math>

In our simple 2D example above, <math>\textstyle \lambda_1 = 7.29</math> and <math>\textstyle \lambda_2 = 0.69</math>.  Thus, by keeping only <math>\textstyle k=1</math> principal component, we retained <math>\textstyle 7.29/(7.29+0.69) = 0.913</math>, or 91.3% of the variance.
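
As a minimal sketch (not part of the original notes), the following Python/NumPy snippet computes this quantity from the sorted eigenvalues; the function name <code>variance_retained</code> and the hard-coded eigenvalues from the 2D example are purely illustrative.

<pre>
import numpy as np

# Eigenvalues of Sigma from the 2D example above, sorted in decreasing order.
eigenvalues = np.array([7.29, 0.69])

def variance_retained(eigenvalues, k):
    """Fraction of variance retained by keeping the top k principal components."""
    return eigenvalues[:k].sum() / eigenvalues.sum()

print(variance_retained(eigenvalues, k=1))  # 0.9135..., i.e. 91.3% of the variance
</pre>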

A more formal definition of percentage of variance retained is beyond the scope of these notes.  However, it is possible to show that <math>\textstyle \lambda_j = \frac{1}{m} \sum_{i=1}^m \left( x_{{\rm rot},j}^{(i)} \right)^2</math>.  Thus, if <math>\textstyle \lambda_j \approx 0</math>, that shows that <math>\textstyle x_{{\rm rot},j}</math> is usually near 0 anyway, and we lose relatively little by approximating it with a constant 0.  This also explains why we retain the top principal components (corresponding to the larger values of <math>\textstyle \lambda_j</math>) instead of the bottom ones.  The top principal components <math>\textstyle x_{{\rm rot},j}</math> are the ones that are more variable and that take on larger values, and for which we would incur a greater approximation error if we were to set them to zero.
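
As a quick numerical check (again not from the original notes), the sketch below verifies this identity on synthetic data, assuming <math>\textstyle \Sigma</math> is the covariance matrix <math>\textstyle \frac{1}{m} \sum_{i=1}^m x^{(i)} (x^{(i)})^T</math> of zero-mean data; the variable names are hypothetical.

<pre>
import numpy as np

# Synthetic zero-mean data: n-dimensional examples stored as the m columns of x.
m, n = 1000, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((n, m)) * np.arange(1, n + 1)[:, None]
x = x - x.mean(axis=1, keepdims=True)

sigma = (x @ x.T) / m            # covariance matrix (1/m) sum_i x_i x_i^T
lam, u = np.linalg.eigh(sigma)   # eigh returns eigenvalues in ascending order
lam, u = lam[::-1], u[:, ::-1]   # re-sort in decreasing order, as in the text

x_rot = u.T @ x                  # rotate the data into the eigenvector basis
# lambda_j equals the mean of the squared j-th rotated coordinate:
print(np.allclose(lam, (x_rot ** 2).sum(axis=1) / m))  # True
</pre>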

In the case of images, one common heuristic is to choose <math>\textstyle k</math> so as to retain 99% of the variance.  In other words, we pick the smallest value of <math>\textstyle k</math> that satisfies

:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99.
\end{align}</math>

Depending on the application, if you are willing to incur some additional error, values in the 90–98% range are also sometimes used.  When you describe to others how you applied PCA, saying that you chose <math>\textstyle k</math> to retain 95% of the variance will also be a much more easily interpretable description than saying that you retained 120 (or whatever other number of) components.
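
One simple way to find this smallest <math>\textstyle k</math> is a cumulative sum over the sorted eigenvalues.  The sketch below is illustrative (the function name <code>smallest_k</code> is hypothetical) and assumes the eigenvalues are already sorted in decreasing order.

<pre>
import numpy as np

def smallest_k(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues retain at least `threshold` of the variance.

    Assumes `eigenvalues` is sorted in decreasing order.
    """
    fraction = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(fraction, threshold) + 1)

# With the 2D example's eigenvalues, k=1 already retains 91.3% of the variance:
print(smallest_k(np.array([7.29, 0.69]), threshold=0.90))  # 1
</pre>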

== References ==
http://cs229.stanford.edu
