PCA

to the original, and using PCA this way can significantly speed up your algorithm while
introducing very little approximation error.

== Number of components to retain ==

How do we set <math>\textstyle k</math>; i.e., how many PCA components should we retain? In our
simple 2-dimensional example, it seemed natural to retain 1 of the 2
components, but for higher-dimensional data this decision is less trivial. If <math>\textstyle k</math> is
too large, then we won't be compressing the data much; in the limit of <math>\textstyle k=n</math>,
we are just using the original data (but rotated into a different basis).
Conversely, if <math>\textstyle k</math> is too small, then we might be using a very bad
approximation to the data.

To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance
retained''' for different values of <math>\textstyle k</math>. Concretely, if <math>\textstyle k=n</math>, then we have
an exact approximation to the data, and we say that 100% of the variance is
retained; i.e., all of the variation of the original data is retained.
Conversely, if <math>\textstyle k=0</math>, then we are approximating all the data with the zero vector,
and thus 0% of the variance is retained.

More generally, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the eigenvalues
of <math>\textstyle \Sigma</math> (sorted in decreasing order), so that <math>\textstyle \lambda_j</math> is the eigenvalue
corresponding to the eigenvector <math>\textstyle u_j</math>. Then if we retain <math>\textstyle k</math> principal components,
the percentage of variance retained is given by:
:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}.
\end{align}</math>
In our simple 2D example above, <math>\textstyle \lambda_1 = 7.29</math> and <math>\textstyle \lambda_2 = 0.69</math>. Thus,
by keeping only <math>\textstyle k=1</math> principal component, we retained <math>\textstyle 7.29/(7.29+0.69) = 0.913</math>,
or 91.3% of the variance.
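
To make this concrete, here is a minimal numpy sketch (ours, not part of the original notes) that computes the percentage of variance retained from the eigenvalues of the covariance matrix; the toy data matrix <code>X</code> is made up for illustration and is assumed to hold one zero-mean example per row:

<pre>
import numpy as np

# Illustrative zero-mean data: m examples in rows, n features in columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.3]])
X = X - X.mean(axis=0)

Sigma = (X.T @ X) / X.shape[0]           # covariance matrix of the data
lam = np.linalg.eigvalsh(Sigma)[::-1]    # eigenvalues, sorted in decreasing order

k = 1
retained = lam[:k].sum() / lam.sum()     # percentage of variance retained
print(f"variance retained with k={k}: {retained:.1%}")
</pre>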

A more formal definition of percentage of variance retained is beyond the scope
of these notes. However, it is possible to show that <math>\textstyle \lambda_j =
\frac{1}{m} \sum_{i=1}^m \left( x_{{\rm rot},j}^{(i)} \right)^2</math>. Thus, if <math>\textstyle \lambda_j \approx 0</math>,
then <math>\textstyle x_{{\rm rot},j}</math> is usually near 0 anyway, and we lose relatively little by
approximating it with a constant 0. This also explains why we retain the top principal
components (corresponding to the larger values of <math>\textstyle \lambda_j</math>) instead of the bottom
ones: the top principal components
<math>\textstyle x_{{\rm rot},j}</math> are the ones that are more variable and take on larger values,
and for which we would incur a greater approximation error if we were to set them to zero.
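
As a quick numerical sanity check of this identity (a sketch of ours, assuming <math>\textstyle \Sigma = \frac{1}{m} \sum_{i=1}^m x^{(i)} (x^{(i)})^T</math> for zero-mean data), each eigenvalue should equal the mean squared value of the corresponding rotated coordinate:

<pre>
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 3)) @ rng.standard_normal((3, 3))
X = X - X.mean(axis=0)                   # zero-mean data, one example per row
m = X.shape[0]

Sigma = (X.T @ X) / m                    # covariance matrix
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]           # sort into decreasing eigenvalue order

X_rot = X @ U                            # each row is x_rot = U^T x for one example
print(np.allclose(lam, (X_rot ** 2).sum(axis=0) / m))   # prints True
</pre>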

In the case of images, one common heuristic is to choose <math>\textstyle k</math> so as to retain 99% of
the variance. In other words, we pick the smallest value of <math>\textstyle k</math> that satisfies
:<math>\begin{align}
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99.
\end{align}</math>
Depending on the application, if you are willing to incur some
additional error, values in the 90-98% range are also sometimes used. When you
describe to others how you applied PCA, saying that you chose <math>\textstyle k</math> to retain 95% of
the variance is also a much more easily interpretable description than saying
that you retained 120 (or whatever other number of) components.
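
Putting this heuristic into code, here is a minimal sketch (the helper name <code>choose_k</code> is ours, not from these notes) that returns the smallest <math>\textstyle k</math> meeting a given variance threshold:

<pre>
import numpy as np

def choose_k(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues retain >= threshold of the variance."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # decreasing order
    cum = np.cumsum(lam) / lam.sum()     # variance retained for k = 1, ..., n
    return int(np.searchsorted(cum, threshold) + 1)

# With the 2D example above:
print(choose_k([7.29, 0.69], threshold=0.90))   # 1  (91.3% >= 90%)
print(choose_k([7.29, 0.69], threshold=0.99))   # 2
</pre>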

== References ==

http://cs229.stanford.edu