Data Preprocessing (数据预处理)

Revision as of 17:00, 14 March 2013 (view source)

Kandeng (Talk | contribs)

Revision as of 17:36, 14 March 2013 (view source)

Kandeng (Talk | contribs)


From Ufldl


@@ Line 44: / Line 44: @@
 [[File:ZCA_Eigenvalues_Plot.png]]
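-[Original]
+In reconstruction-based models, the loss function includes a term that penalizes reconstructions that differ greatly from the original input (translator's note: for an autoencoder, for example, the input should be recoverable as closely as possible after encoding and decoding). If <tt>epsilon</tt> is too small, the whitened data will contain a lot of noise, which the model then has to fit in order to reconstruct well. It is therefore very important for reconstruction-based models that the original data be low-pass filtered.
-In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered.
-[First translation]
-In reconstruction-based models, the loss function contains a term that penalizes reconstructions differing greatly from the original data. If <tt>epsilon</tt> is too small, the data will contain more noise, which the model needs to reconstruct. Low-pass filtering the data is therefore important in reconstruction-based models.
-[First review]
-The loss function of reconstruction-based methods includes a term that penalizes reconstructions that differ greatly from the original data (translator's note: for an autoencoder, for example, the input should be recoverable as closely as possible after encoding and decoding). If <tt>epsilon</tt> is too small, the whitened data will contain a lot of noise, which the model then has to fit in order to reconstruct well. It is therefore very important for reconstruction-based models that the original data be low-pass filtered.
-[Original]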
 {{Quote|
-Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>.
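+Tip: If the data has been scaled to a reasonable range (e.g., <math>[0, 1]</math>), you can start tuning <tt>epsilon</tt> from <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>.
-}}
-[First translation]
-{{Quote|
-Tip: If the data has been scaled to a reasonable range (e.g., <math>[0, 1]</math>), tune <tt>epsilon</tt> starting from <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>.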
 }}
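As a rough illustration of why the choice of <tt>epsilon</tt> matters for reconstruction-based models, the sketch below compares how strongly whitening rescales a high-variance direction and a low-variance, noise-like direction. It assumes the usual PCA/ZCA scaling factor 1/sqrt(lambda + epsilon); the function names and the two variances are hypothetical values chosen only for illustration.

```python
import math


def whitening_gain(eigenvalue, epsilon):
    """Scale factor whitening applies to a component whose variance is `eigenvalue`."""
    return 1.0 / math.sqrt(eigenvalue + epsilon)


# Hypothetical variances: one signal-like direction, one noise-like direction.
signal_var = 0.05
noise_var = 1e-8


def noise_to_signal(epsilon):
    """Ratio of whitened noise std to whitened signal std for a given epsilon."""
    noise_std = math.sqrt(noise_var) * whitening_gain(noise_var, epsilon)
    signal_std = math.sqrt(signal_var) * whitening_gain(signal_var, epsilon)
    return noise_std / signal_std


for epsilon in (1e-6, 0.01, 0.1):
    print(
        f"epsilon={epsilon:g}: "
        f"gain on noise={whitening_gain(noise_var, epsilon):.1f}, "
        f"noise/signal={noise_to_signal(epsilon):.5f}"
    )
```

With a very small <tt>epsilon</tt> the noise-like direction is amplified by a large factor, whereas a larger <tt>epsilon</tt> keeps it suppressed relative to the signal, which is the low-pass effect described above.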
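+=== ICA-based Models (with orthogonalization) ===
 [First review]
-{{Quote|
+For models based on ICA with orthogonalization, it is very important that the input data be as close to white (i.e., the covariance matrix is the identity matrix) as possible. This is because such models orthogonalize the learned features to remove correlations between dimensions (see [[Independent Component Analysis | ICA]] for details). In this case <tt>epsilon</tt> should therefore be sufficiently small (e.g., <math>epsilon = 1e-6</math>).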
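The effect of <tt>epsilon</tt> on how white the data ends up can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data in <math>[0, 1]</math>; <tt>zca_whiten</tt> and <tt>deviation_from_identity</tt> are hypothetical helper names, not part of the tutorial.

```python
import numpy as np


def zca_whiten(X, epsilon):
    """ZCA-whiten the rows of X (examples x features) with regularizer epsilon."""
    Xc = X - X.mean(axis=0)                      # zero-mean each feature
    sigma = Xc.T @ Xc / Xc.shape[0]              # covariance matrix
    eigvals, U = np.linalg.eigh(sigma)           # sigma is symmetric
    W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
    return Xc @ W


def deviation_from_identity(X_white):
    """Largest absolute difference between the covariance and the identity."""
    cov = X_white.T @ X_white / X_white.shape[0]
    return np.abs(cov - np.eye(cov.shape[0])).max()


rng = np.random.default_rng(0)
X = rng.random((500, 8))          # synthetic data scaled to [0, 1)
X[:, 1] += 0.5 * X[:, 0]          # introduce some correlation between features

dev_small = deviation_from_identity(zca_whiten(X, epsilon=1e-6))
dev_large = deviation_from_identity(zca_whiten(X, epsilon=0.1))
print(f"epsilon=1e-6: {dev_small:.2e}   epsilon=0.1: {dev_large:.2e}")
```

On this data the covariance after whitening with <math>epsilon = 1e-6</math> is essentially the identity, while <math>epsilon = 0.1</math> leaves it visibly short of it.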
-If the data has been scaled to a reasonable range (e.g., <math>[0, 1]</math>), you can start tuning <tt>epsilon</tt> from <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>.
-}}
-=== ICA-based Models (with orthogonalization)/基于正交化ICA的模型 ===
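-[Original]
-For ICA-based models with orthogonalization, it is ''very'' important for the data to be as close to white (identity covariance) as possible. This is a side-effect of using orthogonalization to decorrelate the features learned (more details in [[Independent Component Analysis | ICA]]). Hence, in this case, you will want to use an <tt>epsilon</tt> that is as small as possible (e.g., <math>epsilon = 1e-6</math>).
-[First translation]
-For orthogonalized ICA-based models, the closer the data is to white (same covariance) the better; decorrelating the features through orthogonalization is a side effect (see [[Independent Component Analysis | ICA]] for details). In this case an <tt>epsilon</tt> that is as small as possible should therefore be used (e.g., <math>epsilon = 1e-6</math>).
-[First review]
-For models based on ICA with orthogonalization, it is very important that the input data be as close to white (i.e., the covariance matrix is the identity) as possible. This is because such models orthogonalize the learned features to remove correlations between dimensions (see [[Independent Component Analysis | ICA]] for details). In this case <tt>epsilon</tt> should therefore be sufficiently small (e.g., <math>epsilon = 1e-6</math>).
-[Original]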
 {{Quote|
-Tip: In PCA whitening, one also has the option of performing dimension reduction while whitening the data. This is usually an excellent idea since it can greatly speed up the algorithms (less computation and less parameters). A simple rule of thumb to choose how many principle components to retain is to keep enough components to have 99% of the variance retained (more details at [[PCA#Number_of_components_to_retain | PCA]])
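+Tip: We can also reduce the dimensionality of the data during PCA whitening. This is a good idea because it can greatly speed up the algorithms (less computation and fewer parameters). A rule of thumb for choosing how many principal components to retain: keep enough components that their total variance reaches at least 99% of the total sample variance. (See [[PCA#Number_of_components_to_retain | PCA]] for details.)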
 }}
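The 99% rule of thumb in the tip above can be sketched as follows. This assumes the eigenvalues of the covariance matrix are already available; the function name <tt>components_to_retain</tt> and the example eigenvalues are hypothetical.

```python
def components_to_retain(eigenvalues, retained=0.99):
    """Smallest k such that the top-k eigenvalues hold at least `retained`
    of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running >= retained * total:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a 7-dimensional covariance matrix.
eigenvalues = [5.0, 3.0, 1.5, 0.3, 0.15, 0.03, 0.02]
k = components_to_retain(eigenvalues)
print(k)
```

Here the first four components hold 98% of the variance and the first five hold 99.5%, so five components are kept.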
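-[First translation]
 {{Quote|
-Tip: In principal component analysis whitening, the dimensionality of the data can also be reduced while whitening it. This is a good idea because it will greatly speed up the algorithms (less computation and fewer parameters). A simple rule for choosing the number of principal components to retain is to make the remaining variance reach at least 99%. (See [[PCA#Number_of_components_to_retain | PCA]] for details.)
+Note: When working in a classification framework, we should compute the PCA/ZCA whitening matrix based only on the training set. The following two parameters need to be saved for use with the test set: (a) the mean vector used to zero-mean the data; (b) the whitening matrix. The test set must undergo the same preprocessing using these two saved parameters.}}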
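A minimal sketch of the train/test procedure in the note above, assuming NumPy and ZCA whitening; <tt>fit_zca</tt>, <tt>apply_zca</tt>, the synthetic data, and the value of <tt>epsilon</tt> are hypothetical and only meant to show which quantities are saved and reused.

```python
import numpy as np


def fit_zca(X_train, epsilon):
    """Compute the two saved parameters from the training set only:
    (a) the mean vector, (b) the whitening matrix."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    sigma = Xc.T @ Xc / Xc.shape[0]
    eigvals, U = np.linalg.eigh(sigma)
    W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
    return mean, W


def apply_zca(X, mean, W):
    """Preprocess any data set with parameters saved from training."""
    return (X - mean) @ W


rng = np.random.default_rng(0)
X_train = rng.random((400, 6))   # synthetic stand-in for training data
X_test = rng.random((100, 6))    # synthetic stand-in for test data

mean, W = fit_zca(X_train, epsilon=0.01)
X_train_white = apply_zca(X_train, mean, W)
X_test_white = apply_zca(X_test, mean, W)   # no statistics taken from X_test
```

The point of the design is that <tt>apply_zca</tt> never recomputes a mean or covariance, so the test set cannot leak into the preprocessing.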
-}}
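-[First review]
+== Large Images ==
-{{Quote|
+For large images, PCA/ZCA-based whitening is impractical because the covariance matrix is too large. In these cases we fall back on 1/f whitening (more details to come).
-Tip: We can reduce the dimensionality of the data during PCA whitening. This is a good idea because it can greatly speed up the algorithms (less computation and fewer parameters). There is a simple rule for choosing how many principal components to retain: keep enough components that their total variance reaches at least 99% of the total sample variance. (See [[PCA#Number_of_components_to_retain | PCA]] for details.)
-}}
-[Original]
+== Standard Pipelines ==
-{{quote|
+In this section, we describe several standard preprocessing pipelines that have worked well on some datasets: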
-Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters used be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values.  }}
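-[First translation]
+=== Natural Grey-scale Images ===
-{{quote|
+Grey-scale images have the stationarity property. As a first step we usually subtract the mean from each data example separately (i.e., remove the DC component), and then apply PCA/ZCA whitening with <tt>epsilon</tt> set large enough to act as a low-pass filter.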
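The grey-scale pipeline described above might look roughly like this. It is a sketch under stated assumptions: NumPy, random numbers standing in for flattened 4x4 image patches, ZCA as the whitening variant, and <math>epsilon = 0.1</math> as the "large enough" value; none of these specifics come from the tutorial itself.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.random((1000, 16))   # stand-in for 1000 flattened 4x4 grey-scale patches

# Step 1: remove the DC component, i.e. subtract each example's own mean.
patches = patches - patches.mean(axis=1, keepdims=True)

# Step 2: ZCA whitening with a relatively large epsilon (assumed value).
epsilon = 0.1
sigma = patches.T @ patches / patches.shape[0]
eigvals, U = np.linalg.eigh(sigma)
W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
patches_white = patches @ W
```

Note that the mean is taken along <tt>axis=1</tt> (per example), not per feature; removing the DC component makes the covariance matrix singular, so a non-zero <tt>epsilon</tt> is also what keeps the scaling finite.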
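-Note: When handling a classification framework, the PCA/ZCA whitening matrix needs to be computed on the training set, and the following two parameters need to be saved for use with the test set: (a) the mean vector used to zero-mean the data; (b) the whitening matrix. The test set needs to undergo the same preprocessing using the saved parameters.}}
-[First review]
-{{quote|
-Note: In classification problems, the PCA/ZCA whitening matrix is computed on the training set, and the following two parameters need to be saved for use with the test set: (a) the sample mean; (b) the whitening matrix. The test set must undergo the same preprocessing using these two saved parameters.}}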
-== Large Images/大图像 ==
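-[Original]
-For large images, PCA/ZCA based whitening methods are impractical as the covariance matrix is too large. For these cases, we defer to 1/f-whitening methods. (more details to come)
-[First translation]
-For large images, PCA/ZCA-based whitening is not practical, because the covariance matrix is too large. In these cases we recommend 1/f whitening (more to be covered later).
-[First review]
-For large images, PCA/ZCA-based whitening is impractical because the covariance matrix is too large. In these cases we recommend 1/f whitening (more to be covered later).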
-== Standard Pipelines/标准流程 ==
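-[Original]
-In this section, we describe several "standard pipelines" that have worked well for some datasets:
-[First translation]
-In this section we describe several standard pipelines that are effective on some datasets
-[First review]
-In this section we describe several standard preprocessing pipelines that have worked well on some datasets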
-=== Natural Grey-scale Images/自然灰度图像 ===
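-[Original]
-Since grey-scale images have the stationarity property, we usually first remove the mean-component from each data example separately (remove DC). After this step, PCA/ZCA whitening is often employed with a value of <tt>epsilon</tt> set large enough to low-pass filter the data.
-[First translation]
-Because grey-scale images have the stationarity property, as a first step we usually remove the mean term from each example separately, and then apply PCA/ZCA whitening, with <tt>epsilon</tt> large enough to low-pass filter the data.
-[First review]
-Grey-scale images have the stationarity property. We usually zero the mean of each example (i.e., remove the DC component), and then apply PCA/ZCA whitening with <tt>epsilon</tt> set large enough to act as a low-pass filter.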
-=== Color Images/彩色图像 ===
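-[Original]
-For color images, the stationarity property does not hold across color channels. Hence, we usually start by rescaling the data (making sure it is in <math>[0, 1]</math>) ad then applying PCA/ZCA with a sufficiently large <tt>epsilon</tt>. Note that it is important to perform feature mean-normalization before computing the PCA transformation.
-[First translation]
-For color images, the stationarity property does not hold across color channels. We therefore usually first rescale the data (so that it lies in <math>[0, 1]</math>), and then apply PCA/ZCA with a sufficiently large <tt>epsilon</tt>. Note that the features need to be mean-normalized before the PCA transformation.
-[First review]
-For color images, the stationarity property does not hold across color channels. We therefore usually first apply feature scaling to the data (so that pixel values lie in <math>[0, 1]</math>), and then apply PCA/ZCA with a sufficiently large <tt>epsilon</tt>. Note that the mean of each feature component must be set to zero before the PCA transformation.
-[Note]
+=== Color Images ===
-The mean-normalization mentioned in the original is not discussed anywhere else in the document. I take it to mean zeroing the component mean, so that the average pixel value of the image becomes 0. That is: color images should first have their pixel values mapped to [0,1], and then be preprocessed in the same way as grey-scale images.
+For color images, the stationarity property does not hold across color channels. We therefore usually first apply feature scaling to the data (so that pixel values lie in <math>[0, 1]</math>), and then apply PCA/ZCA with a sufficiently large <tt>epsilon</tt>. Note that the mean of each feature component must be set to zero before the PCA transformation.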
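A comparable sketch for the color pipeline follows. It assumes NumPy, random 8-bit values standing in for flattened 4x4 RGB patches, ZCA whitening, and <math>epsilon = 0.1</math>. It also reads "feature mean-normalization" as subtracting each feature's mean over the data set, which is one interpretation of the English original and differs from the per-example DC removal used for grey-scale images.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 500 flattened 4x4 RGB patches with 8-bit pixel values.
raw = rng.integers(0, 256, size=(500, 48)).astype(np.float64)

# Step 1: rescale pixel values to [0, 1].
X = raw / 255.0

# Step 2: per-feature mean normalization (mean over examples, per pixel/channel).
feature_mean = X.mean(axis=0)
Xc = X - feature_mean

# Step 3: ZCA whitening with a sufficiently large epsilon (assumed value).
epsilon = 0.1
sigma = Xc.T @ Xc / Xc.shape[0]
eigvals, U = np.linalg.eigh(sigma)
W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
X_white = Xc @ W
```

If this is used for classification, <tt>feature_mean</tt> and <tt>W</tt> are the two quantities that would be saved from the training set and reapplied to the test set.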
 === Audio (MFCC/Spectrograms)/音频 (MFCC/频谱图) ===