Data Preprocessing

Revision as of 06:42, 29 April 2011 (view source)

Jngiam (Talk | contribs)

Revision as of 06:48, 29 April 2011 (view source)

Jngiam (Talk | contribs)

Line 46:

=== Reconstruction Based Models ===

-

In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data.

+

In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data.
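The effect of <tt>epsilon</tt> described above can be checked numerically before looking at any images. The following is only a minimal sketch, assuming NumPy and a synthetic stand-in for image patches; the function name <tt>zca_whiten</tt> and the data are illustrative and are not part of the tutorial's own code.

```python
import numpy as np

def zca_whiten(X, epsilon):
    """ZCA-whiten X (rows are examples); epsilon regularizes small eigenvalues."""
    Xc = X - X.mean(axis=0)                      # zero-mean each feature
    cov = Xc.T @ Xc / Xc.shape[0]                # feature covariance
    eigvals, U = np.linalg.eigh(cov)             # cov = U diag(eigvals) U^T
    eigvals = np.maximum(eigvals, 0.0)           # guard against tiny negative round-off
    W = (U / np.sqrt(eigvals + epsilon)) @ U.T   # U diag(1/sqrt(eigvals + epsilon)) U^T
    return Xc @ W

# Hypothetical data: a few strong directions plus weak noise in every direction.
rng = np.random.default_rng(0)
signal = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 16))
X = signal + 0.01 * rng.normal(size=(2000, 16))

for epsilon in (1e-8, 1e-2, 1e2):
    variances = zca_whiten(X, epsilon).var(axis=0)
    print(f"epsilon={epsilon:g}: mean per-feature variance = {variances.mean():.4f}")
```

A direction with eigenvalue much smaller than <tt>epsilon</tt> is scaled by roughly <tt>1/sqrt(epsilon)</tt> instead of being amplified to unit variance, which is why a very small <tt>epsilon</tt> lets weak noise directions dominate and a very large one shrinks everything.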

+

{{Quote|

Line 57:

Line 59:

{{quote|

Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the mean vector that was used to zero-mean the data, and (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }}
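The note above can be sketched as a fit/apply pair. This is an illustration under stated assumptions, not the tutorial's implementation: it assumes NumPy, examples stored as rows, and the hypothetical helper names <tt>fit_zca</tt> and <tt>apply_zca</tt>.

```python
import numpy as np

def fit_zca(X_train, epsilon):
    """Estimate the mean vector and ZCA matrix from the training set only."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, U = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 0.0)           # guard against round-off
    W = (U / np.sqrt(eigvals + epsilon)) @ U.T   # symmetric ZCA whitening matrix
    return mean, W

def apply_zca(X, mean, W):
    """Preprocess any split with the saved training-set mean and matrix."""
    return (X - mean) @ W

# Synthetic example: both splits are transformed with the same saved values.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5)) + 3.0
X_test = rng.normal(size=(50, 5)) + 3.0
mean, W = fit_zca(X_train, epsilon=1e-5)
X_train_white = apply_zca(X_train, mean, W)
X_test_white = apply_zca(X_test, mean, W)   # no statistics are recomputed here
```

Keeping the estimation step separate from the transformation step makes it hard to recompute the mean or the whitening matrix on the test set by accident.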

+

== Large Images ==

Line 64:

Line 67:

== Standard Pipeline ==

+

In this section, we describe several "standard pipelines" that have worked well for some datasets:

+

=== Natural Grey-scale Images ===

+

Since grey-scale images have the stationarity property, we usually first remove the mean-component from each data example separately (remove DC). After this step, PCA/ZCA whitening is often employed with a value of <tt>epsilon</tt> set large enough to low-pass filter the data.
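The per-example mean removal described above might look as follows. This is a minimal sketch assuming NumPy and flattened patches stored one per row; <tt>remove_dc</tt> is an illustrative name, not a function from the tutorial.

```python
import numpy as np

def remove_dc(patches):
    """Subtract each example's own mean intensity (rows are flattened patches)."""
    return patches - patches.mean(axis=1, keepdims=True)

# Two toy "patches": the second is a constant patch and becomes all zeros.
patches = np.array([[1.0, 2.0, 3.0],
                    [10.0, 10.0, 10.0]])
centered = remove_dc(patches)
print(centered)
```

After this step every row sums to zero, so the covariance of the data has a zero eigenvalue along the constant direction. This is one reason the subsequent PCA/ZCA whitening needs a nonzero <tt>epsilon</tt>.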

+

=== Color Images ===

+

-

== ~~Model Idiosyncrasies~~ ==

+

=== Audio (MFCC/Spectrograms) ===

-

~~=== Sparse Autoencoder ===~~

-

~~==== Sigmoid Decoders ====~~

-

~~==== Linear Decoders ====~~

+

=== MNIST Handwritten Digits ===

-

~~=== Independent Component Analysis ===~~

+

The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. A sparse autoencoder often works well after this simple normalization. While one could also elect to use PCA/ZCA whitening if desired, this is not often done in practice. ''Note: Since the 0 value is meaningful in MNIST, we do ''not'' perform per-example mean normalization.''
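The rescaling step is a single division. The sketch below assumes NumPy and that the images have already been loaded as an unsigned 8-bit array; <tt>rescale_mnist</tt> is a hypothetical helper name and loading the dataset is not shown.

```python
import numpy as np

def rescale_mnist(images_uint8):
    """Map pixel values from [0, 255] to [0, 1] without shifting the zero level."""
    return images_uint8.astype(np.float64) / 255.0

# Toy stand-in for a batch of MNIST pixels.
images = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = rescale_mnist(images)
print(scaled.min(), scaled.max())
```

Because the data is only divided by a constant, background pixels stay exactly 0, in line with the note above that no per-example mean normalization is applied.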

Data Preprocessing

From Ufldl

Revision as of 06:48, 29 April 2011


@@ Line 46: / Line 46: @@
 === Reconstruction Based Models ===
 In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data.
 {{Quote|
@@ Line 57: / Line 59: @@
 {{quote|
 Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the mean vector that was used to zero-mean the data, and (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }}
 == Large Images ==
@@ Line 64: / Line 67: @@
 == Standard Pipeline ==
+In this section, we describe several "standard pipelines" that have worked well for some datasets:
+=== Natural Grey-scale Images ===
+Since grey-scale images have the stationarity property, we usually first remove the mean-component from each data example separately (remove DC). After this step, PCA/ZCA whitening is often employed with a value of <tt>epsilon</tt> set large enough to low-pass filter the data.
+=== Color Images ===
-== Model Idiosyncrasies ==
+=== Audio (MFCC/Spectrograms) ===
-=== Sparse Autoencoder ===
-==== Sigmoid Decoders ====
-==== Linear Decoders ====
+=== MNIST Handwritten Digits ===
-=== Independent Component Analysis ===
+The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. A sparse autoencoder often works well after this simple normalization. While one could also elect to use PCA/ZCA whitening if desired, this is not often done in practice. ''Note: Since the 0 value is meaningful in MNIST, we do ''not'' perform per-example mean normalization.''