Data Preprocessing
From Ufldl
Line 42: | Line 42: | ||
In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.) | In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.) | ||
- | Next, one needs to select the value of < | + | Next, one needs to select the value of <tt>epsilon</tt> to use when performing [[Whitening | PCA/ZCA whitening]] (recall that this was the regularization term that has an effect of ''low-pass filtering'' the data). It turns out that selecting this value can also play an important role for feature learning, we discuss two cases for selecting <tt>epsilon</tt>: |
=== Reconstruction Based Models === | === Reconstruction Based Models === | ||
- | In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data. | + | In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for <tt>epsilon</tt>, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if <tt>epsilon</tt> is set too high, you will see a "blurred" version of the original data. A good way to get a feel for the magnitude of <tt>epsilon</tt> to try is to plot the eigenvalues on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high frequency noise components. You will want to choose <tt>epsilon</tt> such that most of the "long tail" is filtered out, i.e. choose <tt>epsilon</tt> such that it is greater than most of the small eigenvalues corresponding to the noise. |
+ | |||
+ | [[File:ZCA_Eigenvalues_Plot.png]] | ||
In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered. | In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered. | ||
Line 53: | Line 55: | ||
Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>. | Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>. | ||
}} | }} | ||
+ | |||
=== ICA-based Models (with orthogonalization) === | === ICA-based Models (with orthogonalization) === | ||
Line 65: | Line 68: | ||
{{quote| | {{quote| | ||
Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters used be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters used be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | ||
- | |||
== Large Images == | == Large Images == | ||
Line 86: | Line 88: | ||
=== Audio (MFCC/Spectrograms) === | === Audio (MFCC/Spectrograms) === | ||
- | For audio data (MFCC and Spectrograms), each dimension usually have different scales (variances). This is especially so when one includes the temporal derivatives (a common practice in audio processing). As a result, the preprocessing usually starts with simple data standardization (zero-mean, unit-variance per data dimension), followed by PCA/ZCA whitening (with an appropriate <tt>epsilon</tt>). | + | For audio data (MFCC and Spectrograms), each dimension usually have different scales (variances); the first component of MFCCs, for example, is the DC component and usually has a larger magnitude than the other components. This is especially so when one includes the temporal derivatives (a common practice in audio processing). As a result, the preprocessing usually starts with simple data standardization (zero-mean, unit-variance per data dimension), followed by PCA/ZCA whitening (with an appropriate <tt>epsilon</tt>). |
=== MNIST Handwritten Digits === | === MNIST Handwritten Digits === | ||
The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. In practice, removing the mean-value per example can also help feature learning. ''Note: While one could also elect to use PCA/ZCA whitening on MNIST if desired, this is not often done in practice.'' | The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. In practice, removing the mean-value per example can also help feature learning. ''Note: While one could also elect to use PCA/ZCA whitening on MNIST if desired, this is not often done in practice.'' | ||
+ | |||
+ | |||
+ | |||
+ | {{Languages|数据预处理|中文}} |