Data Preprocessing

From Ufldl

Jump to: navigation, search
(PCA/ZCA Whitening)
 
Line 48: Line 48:
In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for <tt>epsilon</tt>, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if <tt>epsilon</tt> is set too high, you will see a "blurred" version of the original data. A good way to get a feel for the magnitude of <tt>epsilon</tt> to try is to plot the eigenvalues on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high frequency noise components. You will want to choose <tt>epsilon</tt> such that most of the "long tail" is filtered out, i.e. choose <tt>epsilon</tt> such that it is greater than most of the small eigenvalues corresponding to the noise.
In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for <tt>epsilon</tt>, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if <tt>epsilon</tt> is set too high, you will see a "blurred" version of the original data. A good way to get a feel for the magnitude of <tt>epsilon</tt> to try is to plot the eigenvalues on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high frequency noise components. You will want to choose <tt>epsilon</tt> such that most of the "long tail" is filtered out, i.e. choose <tt>epsilon</tt> such that it is greater than most of the small eigenvalues corresponding to the noise.
-
[[File::ZCA_Eigenvalue_Plot.png]]
+
[[File:ZCA_Eigenvalues_Plot.png]]
In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered.
In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered.
Line 93: Line 93:
The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. In practice, removing the mean-value per example can also help feature learning. ''Note: While one could also elect to use PCA/ZCA whitening on MNIST if desired, this is not often done in practice.''
The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. In practice, removing the mean-value per example can also help feature learning. ''Note: While one could also elect to use PCA/ZCA whitening on MNIST if desired, this is not often done in practice.''
 +
 +
 +
 +
{{Languages|数据预处理|中文}}

Latest revision as of 04:22, 8 April 2013

Personal tools