数据预处理 (Data Preprocessing)

Revision as of 18:01, 13 March 2013

Revision as of 16:40, 14 March 2013

From Ufldl

@@ Line 27: / Line 27: @@
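 '''Example:''' For images, this normalization removes the average brightness (intensity) of the image. In many cases we are not interested in the illumination of an image but in its content, so removing the mean pixel value from each data point makes sense. '''Note:''' Although this method is widely used for images, take extra care when applying it to color images, because pixels in different color channels are not all stationary with respect to one another.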
-=== Feature Standardization/特征标准化 ===
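+=== 特征标准化 (Feature Standardization) ===
-[Original text]
+Feature standardization means (independently) making each dimension of the data have zero mean and unit variance. It is the most common normalization method and is widely used (for example, feature standardization is often recommended as a preprocessing step when using support vector machines (SVMs)). In practice, first compute the mean of each dimension over the whole dataset and subtract it from that dimension; then divide each dimension by the standard deviation of the data in that dimension.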
-Feature standardization refers to (independently) setting each dimension of the data to have zero-mean and unit-variance. This is the most common method for normalization and is generally used widely (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracts this from each dimension. Next, each dimension is divided by its standard deviation.
-'''Example: ''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadow the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently.
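+'''Example:''' When working with audio data, Mel-frequency cepstral coefficients ([http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs]) are commonly used to represent the data. However, the first component of the MFCC features (the DC component) takes such large values that it often overshadows the other components. In this case, to balance the influence of the components, each component of the features is usually standardized independently.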
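The standardization procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the tutorial: the function name <tt>standardize</tt>, the use of the population standard deviation, and the handling of constant dimensions are all assumptions made here, and the sample frames are made-up numbers rather than real MFCCs.

```python
from statistics import fmean, pstdev


def standardize(data):
    """Give every dimension (column) of `data` zero mean and unit variance.

    `data` is a list of equal-length rows, one row per example.
    Assumption: the population standard deviation is used; the text
    does not say whether population or sample statistics are meant.
    """
    columns = list(zip(*data))
    # Mean of each dimension, computed across the whole dataset.
    means = [fmean(col) for col in columns]
    # Standard deviation of each dimension around that mean.
    stds = [pstdev(col, mu) for col, mu in zip(columns, means)]
    # Subtract the mean, then divide by the standard deviation.
    # Assumption: a constant dimension (std == 0) is mapped to 0.0
    # instead of dividing by zero.
    return [
        [(x - mu) / sd if sd > 0 else 0.0 for x, mu, sd in zip(row, means, stds)]
        for row in data
    ]


# Hypothetical feature frames whose first component dwarfs the others,
# loosely mimicking the MFCC situation described above.
frames = [[950.0, 1.2, -0.3], [1010.0, 0.8, 0.1], [980.0, 1.0, -0.1]]
balanced = standardize(frames)
```

After standardization every component has the same scale, so the large first component no longer dominates distance computations or gradient steps.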
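-[First-draft translation]
+== PCA/ZCA白化 (PCA/ZCA Whitening) ==
-Feature standardization means (independently) making each dimension of the data zero-mean and unit-variance. It is the most commonly used normalization method (e.g., feature standardization is often recommended as part of preprocessing when using support vector machines). In practice, first compute the mean of each dimension and subtract it from that dimension, then divide each dimension by its standard deviation.
+After the simple normalizations, whitening is often applied as the next preprocessing step, and it makes our algorithms work better. In practice, many deep learning algorithms rely on whitening to obtain good features.
-'''Example:''' When working with audio data, [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] are commonly used to represent the data. However, the first component of the MFCC features (representing the DC) often overshadows the other components. One way to rebalance the components is therefore to standardize each component independently.
+When performing PCA/ZCA whitening, it is necessary to first make the features zero-mean, which ensures that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. In particular, this step must be done before computing the covariance matrix. (The only exception is when per-example mean subtraction has already been performed and the data is stationary across dimensions or pixels.)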
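To make the role of the zero-mean step and of <tt>epsilon</tt> concrete, the whitening computation can be written out as follows. This is a sketch in the notation of the [[Whitening]] page; the symbols <math>\Sigma</math>, <math>U</math> and <math>\lambda_i</math> are assumed from that page and are not defined in this section.

:<math>\Sigma = \frac{1}{m} \sum_{i=1}^m x^{(i)} (x^{(i)})^T</math>

This expression equals the covariance matrix only if <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>, which is why the mean must be removed first. With <math>U</math> holding the eigenvectors of <math>\Sigma</math> and <math>\lambda_i</math> the corresponding eigenvalues:

:<math>x_{\rm rot} = U^T x, \qquad x_{{\rm PCAwhite},i} = \frac{x_{{\rm rot},i}}{\sqrt{\lambda_i + \epsilon}}, \qquad x_{\rm ZCAwhite} = U x_{\rm PCAwhite}</math>

When <math>\lambda_i</math> is much smaller than <math>\epsilon</math>, the divisor is dominated by <math>\epsilon</math>, so low-variance directions are not amplified.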
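-[First review]
+Next, for PCA/ZCA whitening we need to choose a suitable <tt>epsilon</tt> (recall that this is the regularization term, which has a low-pass filtering effect on the data). Choosing a suitable value of <tt>epsilon</tt> matters a great deal for feature learning; below we discuss how to choose <tt>epsilon</tt> in two different settings:
-Feature standardization means (independently) making each dimension of the data have zero mean and unit variance. It is the most commonly used normalization method (it is usually recommended to standardize the training data first when using an SVM). In practice, first compute the sample mean of the training set and subtract it from every sample, then divide each dimension of the samples by the sample standard deviation of that dimension.
-'''Example:''' When working with audio data, Mel-frequency cepstral coefficients ([http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs]) are commonly used to represent the data. However, the first component of the MFCC features (the DC component) takes such large values that it often overshadows the other components. In this case, to balance the influence of the components, each component of the features is usually standardized.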
-== PCA/ZCA Whitening/PCA/ZCA白化==
-[Original text]
-After doing the simple normalizations, whitening is often the next preprocessing step employed that helps make our algorithms work better. In practice, many deep learning algorithms rely on whitening to learn good features.
-In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.)
-Next, one needs to select the value of <tt>epsilon</tt> to use when performing [[Whitening | PCA/ZCA whitening]] (recall that this was the regularization term that has an effect of ''low-pass filtering'' the data). It turns out that selecting this value can also play an important role for feature learning, we discuss two cases for selecting <tt>epsilon</tt>:
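-[First-draft translation]
-After the simple normalizations, whitening is often used as the next preprocessing step to improve algorithm performance. In practice, many deep learning algorithms rely on whitening to obtain good features.
-When performing PCA/ZCA whitening, first zero-mean the features to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. In particular, this must be done before computing the covariance matrix (the only exception is when mean removal has already been done and the data is stationary across dimensions/pixels).
-Next, for PCA/ZCA whitening we need to choose a suitable <tt>epsilon</tt> (this is the regularization parameter, which has a low-pass filtering effect on the data). Choosing a suitable value matters a great deal for feature learning; we discuss two examples of choosing epsilon <tt>epsilon</tt>:
-[First review]
-To improve algorithm performance, features are often also whitened after the simple normalizations. In practice, many deep learning algorithms rely on whitening to obtain good features.
-When performing PCA/ZCA whitening, first subtract the sample mean from the features so that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this step must be carried out before computing the sample covariance (the only exception is when the per-component means of the sample data have already been set to zero and the data is stationary across dimensions).
-Next, for PCA/ZCA whitening we need to choose a suitable <tt>epsilon</tt> (this is the regularization parameter, which has a low-pass filtering effect on the data). Choosing a suitable value of <tt>epsilon</tt> matters a great deal for feature learning; below we discuss how to choose <tt>epsilon</tt> in two different settings: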
 === Reconstruction Based Models/基于重构的模型 ===