Self-Taught Learning

Revision as of 04:06, 9 March 2013 (view source)

Kandeng (Talk | contribs)

(→On pre-processing the data)

← Older edit

Latest revision as of 05:35, 8 April 2013 (view source)

Wikiroot (Talk | contribs)

Line 1:

-

~~'''Self-Taught Learning'''~~

+

==Overview==

-

~~'''[初译]'''：自我学习(by '''@Call_Me_Zero''')~~

+

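Given a sufficiently powerful learning algorithm, one of the most reliable ways to get better performance is to give the algorithm more data. Machine learning even has a saying: "Sometimes it's not who has the best algorithm that wins; it's who has the most data."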

-

~~'''[一审]'''：自学习 (by '''@晓风_机器学习''')~~

+

-

~~== Overview ==~~

-

~~'''[初译]'''：综述~~

-

~~'''[一审]'''：综述~~

-

~~'''[原文]'''~~

+

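One can always try to obtain more labeled data, but doing so is often expensive. For example, researchers have already put considerable effort into tools such as AMT (Amazon Mechanical Turk) in order to get larger training sets. Having many people hand-label data through crowdsourcing is a step forward compared with having many researchers hand-engineer features, but we can do better. Specifically, if an algorithm can learn from unlabeled data, then we can easily obtain large amounts of unlabeled data and learn from it. Self-taught learning and unsupervised feature learning are algorithms of this kind. Although a single unlabeled example carries less information than a single labeled example, if we can obtain large amounts of unlabeled data (for example, by downloading random unlabeled images, audio clips, or text documents from the internet) and the algorithm can exploit it effectively, then the algorithm may achieve better performance than large-scale hand-engineering of features and hand-labeling of data.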

-

Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable ways to get better performance is to give the algorithm more data. This has led to the that aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins; it's who has the most data."

+

-

~~'''[初译]'''~~

-

假如我们拥有足够强大的机器学习算法，那么，为了获得更好的性能，最靠谱的一种方法就是给予学习算法更多数据。机器学习界有句格言：有时候效果最好的，不是最优的算法，而是那些拥有最多数据的。

-

~~'''[一审]'''~~

+

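In self-taught learning and unsupervised feature learning, we give the algorithm a large amount of unlabeled data from which to learn a good feature representation. When solving a specific classification problem, we take this learned feature representation together with whatever (perhaps small amount of) labeled data we have, and apply a supervised learning method to perform the classification.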

-

如果已经有一个足够强大的机器学习算法，为了获得更好的性能，最靠谱的方法之一是给这个算法以更多的数据。机器学习界甚至有个说法：“胜出的往往不是最好的算法，而是尽可能多的数据。”

+

-

~~'''[原文]'''~~

+

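These ideas are probably most effective in settings with a lot of unlabeled data and a small amount of labeled data. Even when only labeled data is available (in which case we usually run feature learning on the training data while ignoring its labels), they typically still give good results.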

-

One can always try to get more labeled data, but this can be expensive. In particular, researchers have already gone to extraordinary lengths to use tools such as AMT (Amazon Mechanical Turk) to get large training sets. While having large numbers of people hand-label lots of data is probably a step forward compared to having large numbers of researchers hand-engineer features, it would be nice to do better. In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from unlabeled data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former---for example, by downloading random unlabeled images/audio clips/text documents off the internet---and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

+

-

~~'''[初译]'''~~

-

有人总是尝试获得更多标记过的数据，这样做耗费巨大。典型的一种场景是，为了获得大量的训练集，学者们花费很长时间来使用诸如AMT(Amazon Mechanical Turk)之类的工具。相比起human hand-engineering，虽然使用大量人力手工标注数据已经是一个进步，但是我们可以做的更好。自我学习以及非监督特征学习能够做到：如果我们有能够从未被标记数据中学习的算法，那么就可以用来轻易地获取数据，并且从这些数据中进行大量学习。即便针对比起被标记的样本信息量小很多的未被标记样本，，这样做也能行的通。如果我们能够获取一系列未被标记样本（比如，通过从互联网随机下载未被标记的图像/音频片段/文本文件），同时使用的算法能够有效地挖掘这些未被标注数据，那么比起大量的human hand-engineering方法以及手工标注的方法，将获得更好的性能。

-

~~'''[一审]'''~~

+

==Learning features==

-

在解决很多问题上，总是可以尝试获取更多的带类标数据，但是成本往往很高。典型地，研究人员已经花了相当的精力在使用类似AMT(Amazon Mechanical Turk，一个基于互联网的众包市场)这样的工具上，以期获取更大的训练数据集。相比大量的研究人员手工构建特征，用众包的方式让多人手工标数据是一个进步，而且期望着可以做的更好。特别是自学习和无监督特征学习，预示着如果算法能够从无类标数据中进行学习，就可以轻而易举的获取大量这样的数据供算法学习。尽管一个单一的无类标数据样例蕴含的信息比一个带类标的数据样例要少，但是如果能大量的获取无类标数据（比如从互联网上下载随机的、无类标的图像、音频剪辑或者是文本），并且算法能够有效的利用它们，相比大规模的手工构建特征和标数据，最终将会有更好的性能。

+

-

+

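We have already seen how an autoencoder can be used to learn features from unlabeled data. Concretely, suppose we have an unlabeled training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math> (the subscript <math>\textstyle u</math> stands for "unlabeled"). We can then train a sparse autoencoder on this data (possibly after whitening or other appropriate pre-processing).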

-

~~'''[原文]'''~~

+

-

In Self-taught learning and Unsupervised feature learning, we will give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input. If we are trying to solve a specific classification task, then we take this learned feature representation and whatever (perhaps small amount of) labeled data we have for that classification task, and apply supervised learning on that labeled data to solve the classification task.'''

+

-

+

-

~~'''[初译]'''~~

+

-

在自我学习和非监督特征学习领域，我们给予算法大量未标注的数据，通过它们来学习更好的特征重现形式。在尝试解决具体的分类任务的时候，通过这些学习来的特征重现形式，同时，应用监督学习方法于任意数量（可能是很少量）的被标注数据，两者一起来完成分类任务。

+

-

+

-

~~'''[一审]'''~~

+

-

在自学习和无监督特征学习问题上，可以给算法以大量的无类标数据，学习出较好的特征描述。如果面对一个具体的分类问题，就可以基于这些学习出的特征描述和任意的（可能比较少的）带类标数据，使用有监督学习方法解决。

+

-

+

-

+

-

~~'''[原文]'''~~

+

-

These ideas probably have the most powerful effects in problems where we have a lot of unlabeled data, and a smaller amount of labeled data. However, they typically give good results even if we have only labeled data (in which case we usually perform the feature learning step using the labeled data, but ignoring the labels).

+

-

+

-

~~'''[初译]'''~~

+

-

以上想法对于以下场景最有效--同时拥有大量未被标记数据和小部分已标记数据。即便是，我们只拥有已标记数据（在这种情况下，我们常常在特征学习阶段使用被标注数据，但是忽略标记，仅关注数据本身），以上想法也能给出很好的结果。

+

-

+

-

~~'''[一审]'''~~

+

-

~~在一些拥有大量无类标数据和少量的带类标数据的场景中，甚至是只有带类标数据的场景中（丢掉类标进行特征学习），以上想法都可能十分凑效。~~

+

-

+

-

+

-

~~== Learning features ==~~

+

-

~~'''[初译]'''：特征学习~~

+

-

~~'''[一审]'''：特征学习~~

+

-

+

-

+

-

~~'''[原文]'''~~

+

-

~~We have already seen how an autoencoder can be used to learn features from~~

+

-

~~unlabeled data. Concretely, suppose we have an unlabeled~~

+

-

~~training set~~ <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>

+

-

~~with <math>\textstyle m_u</math> unlabeled examples. (The subscript "u" stands for~~

+

-

~~"unlabeled.") We can then train a sparse autoencoder on this data~~

+

-

~~(perhaps with appropriate whitening or other pre-processing):~~

+

-

+

-

~~'''[初译]'''~~

+

-

我们已经了解自编码神经网络（autoencoder）怎么用来从未被标记数据中学习特征。具体来说，假设我们有<math>\textstyle m_u</math>个未被标记的训练集合<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>（下标u代表“未被标记的”）。现在用它们来训练一个稀疏的自编码神经网络。（可以使用合适的白化以及其他预操作）

+

-

+

-

~~'''[一审]'''~~

+

-

我们已经了解到如何使用一个自编码神经网络（autoencoder）来从无类标数据中学习特征。具体来说，假定有一个无类标的训练数据集<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>~~（下标u代表“不带类标”）。现在用它们训练一个稀疏自编码神经网络（可以使用合适的whitening及其他预处理工作）。~~

+

[[File:STL_SparseAE.png|350px]]

-

~~'''[原文]'''~~

+

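Using the trained model parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>, given any input <math>\textstyle x</math> we can compute the activations <math>\textstyle a</math> of the hidden units. As noted earlier, <math>\textstyle a</math> may be a better feature representation than the raw input <math>\textstyle x</math>. The neural network in the figure below depicts how the features (the activations <math>\textstyle a</math>) are computed.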

-

~~Having trained the parameters~~ <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> ~~of this model,~~

+

-

~~given any new input~~ <math>\textstyle x</math>~~, we can now compute the corresponding vector of~~

+

-

~~activations~~ <math>\textstyle a</math> ~~of the hidden units. As we saw previously, this often gives a~~

+

-

~~better representation of the input than the original raw input~~ <math>\textstyle x</math>~~. We can also~~

+

-

~~visualize the algorithm for computing the features~~/~~activations~~ <math>\textstyle a</math> ~~as the following~~

+

-

~~neural network:~~

+

-

~~'''[初译]'''~~

+

[[File:STL_SparseAE_Features.png|300px]]

-

训练得到的模型参数<math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>，对于任何新的输入<math>\textstyle x</math>，可以计算隐藏单元对应的activations <math>\textstyle a</math>向量。正如前面看到的，这种方法常能给出比原始输入<math>\textstyle x</math>更好的表达重现。如下的神经网络图可视化地阐释了特征/activations <math>\textstyle a</math>的计算:

+

-

~~'''[一审]'''~~

-

利用训练得到的模型参数<math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>，给定任意的输入数据<math>\textstyle x</math>，可以计算隐藏单元对应的激活值（activations）向量<math>\textstyle a</math>。如前所述，相比原始输入<math>\textstyle x</math>来说，这样做可以得到一个更好的特征描述。下图的神经网络描述了特征/激活值向量<math>\textstyle a</math>的计算。

-

~~[[File:STL_SparseAE_Features.png|300px]]~~

+

This is just the sparse autoencoder we had previously, with the final layer removed.
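As a minimal sketch of this feature computation, the snippet below evaluates the hidden-layer activations for a single input. It assumes a sigmoid activation function, which the text here does not specify; the function names and the toy parameter values are hypothetical and are not taken from a trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(W1, b1, x):
    """Return a = f(W1 x + b1), the hidden-layer activations for input x.

    W1 is a list of rows (one row per hidden unit) and b1 a list of biases.
    The decoder parameters W2, b2 are not needed to compute the features,
    which is why the final layer can be dropped.
    """
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]

# Toy parameters (hypothetical values, chosen only for illustration).
W1 = [[1.0, -1.0, 0.0],
      [0.5, 0.5, 0.5]]
b1 = [0.0, -1.0]
x = [2.0, 2.0, 0.0]

a = hidden_activations(W1, b1, x)  # one activation per hidden unit
```

The output has one entry per hidden unit, so the dimension of the learned representation is set by the size of the hidden layer rather than by the size of the input.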

-

~~'''[原文]'''~~

+

Suppose we have a labeled training set of size <math>\textstyle m_l</math>, <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),

-

~~This is just the sparse autoencoder that we previously had~~, ~~with with the final layer removed.~~

+

(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> (the subscript <math>\textstyle l</math> stands for "labeled"). We can now find a better feature representation for the inputs. For example, we can feed <math>\textstyle x_l^{(1)}</math> into the sparse autoencoder to obtain the hidden-unit activations <math>\textstyle a_l^{(1)}</math>. We can then use <math>\textstyle a_l^{(1)}</math> directly in place of the raw data <math>\textstyle x_l^{(1)}</math> (the "replacement representation"). Alternatively, we can combine the two and use the new vector <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math> in place of <math>\textstyle x_l^{(1)}</math> (the "concatenation representation").

-

~~'''[初译]'''~~

-

~~这是之前得到的移除了最终层次的稀疏自编码神经网络。~~

-

~~'''[一审]'''~~

+

After this transformation, the training set becomes <math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots, (a_l^{(m_l)}, y^{(m_l)})

-

~~这实际上就是之前得到的稀疏自编码神经网络，在这里去掉了最后一层。~~

+

\}</math> or <math>\textstyle \{

+

((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,

+

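((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (depending on whether we replace <math>\textstyle x_l^{(i)}</math> with <math>\textstyle a_l^{(i)}</math> or concatenate the two). In practice, concatenating <math>\textstyle a_l^{(i)}</math> and <math>\textstyle x_l^{(i)}</math> often works better; but to save memory or computation, the replacement representation can also be used.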
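The two representations can be sketched as follows. This is an illustration only: `featurize` is a hypothetical stand-in for the trained encoder (any function mapping an input to its activations would do), and the helper names are assumptions rather than part of the tutorial.

```python
def replacement_representation(x, a):
    """Use the learned features alone; the raw input is dropped."""
    return list(a)

def concatenation_representation(x, a):
    """Stack the raw input and the learned features into one vector."""
    return list(x) + list(a)

def transform_training_set(labeled, featurize, represent):
    """Map each (x, y) pair to (representation, y); labels are unchanged."""
    return [(represent(x, featurize(x)), y) for x, y in labeled]

# Hypothetical stand-in for the trained encoder, for illustration only.
def featurize(x):
    return [0.5 * sum(x)]

labeled = [([1.0, 3.0], 0), ([2.0, 2.0], 1)]

replaced = transform_training_set(labeled, featurize, replacement_representation)
concatenated = transform_training_set(labeled, featurize, concatenation_representation)
```

Note that each example is paired with its own activations: the i-th input goes with the i-th activation vector. The concatenated vectors are longer than either part alone, which is the memory and computation cost mentioned above.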

-

~~'''[原文]'''~~

+

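Finally, we can train a supervised learning algorithm (such as an SVM or logistic regression) to obtain a discriminant function that predicts the <math>\textstyle y</math> values. Prediction works as follows: given a test example <math>\textstyle x_{\rm test}</math>, repeat the same procedure and feed it to the sparse autoencoder to obtain <math>\textstyle a_{\rm test}</math>. Then feed <math>\textstyle a_{\rm test}</math> (or <math>\textstyle (x_{\rm test}, a_{\rm test})</math>) to the classifier to get a prediction.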
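A rough end-to-end sketch of this last step is given below, using the replacement representation and logistic regression trained by plain batch gradient descent. Everything here is an assumption made for illustration: the one-unit `featurize` function stands in for a trained encoder, the two-example data set is a toy, and the learning rate and step count are arbitrary choices that happen to suffice for this toy problem.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, lr=1.0, steps=2000):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    n, m = len(features[0]), len(features)
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n, 0.0
        for a, y in zip(features, labels):
            err = sigmoid(sum(wj * aj for wj, aj in zip(w, a)) + b) - y
            for j in range(n):
                grad_w[j] += err * a[j]
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

def predict(x, featurize, w, b):
    """Featurize x exactly as during training, then apply the classifier."""
    a = featurize(x)
    return 1 if sum(wj * aj for wj, aj in zip(w, a)) + b > 0 else 0

# Hypothetical one-unit encoder standing in for the trained autoencoder.
def featurize(x):
    return [sigmoid(x[0] - x[1])]

train_x = [[2.0, 0.0], [0.0, 2.0]]
train_y = [1, 0]

w, b = train_logistic_regression([featurize(x) for x in train_x], train_y)
prediction = predict([4.0, 0.0], featurize, w, b)
```

The point of the sketch is the order of operations: the classifier never sees a raw test input, only the output of the same `featurize` step that produced its training data.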

-

~~Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),~~

+

-

~~(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.~~

+

-

~~(The subscript "l" stands for "labeled.")~~

+

-

~~We can now find a better representation for the inputs. In particular, rather~~

+

-

~~than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed~~

+

-

~~<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder, and obtain the corresponding~~

+

-

~~vector of activations <math>\textstyle a_l^{(1)}</math>. To represent this example, we can either~~

+

-

~~just '''replace''' the original feature vector with <math>\textstyle a_l^{(1)}</math>.~~

+

-

~~Alternatively, we can '''concatenate''' the two feature vectors together,~~

+

-

~~getting a representation <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>.~~

+

-

~~'''[初译]'''~~

-

~~现在，假设有一组大小为<math>\textstyle m_l</math>个的被标记训练集<math>\textstyle \{ (x_l^{(1)}, y^{(1)}),~~

-

(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>。（下标l代表被“标记的”），我们可以找到更好的输入的重新表达形式。相比起去重现第一个训练样本<math>\textstyle x_l^{(1)}</math>，我们将<math>\textstyle x_l^{(1)}</math>作为自编码神经网络的输入，以此获得对应的activations <math>\textstyle a_l^{(1)}</math>向量。为了重新表达这个样本，用<math>\textstyle a_l^{(1)}</math>来替换原始的特征向量<math>\textstyle x_l^{(1)}</math>。或者，将两个特征向量合并起来，得到重新表达形式<math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>。

-

~~'''[一审]'''~~

+

==On pre-processing the data==

-

~~一审：假定有大小为<math>\textstyle m_l</math>的带类标训练数据集 <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),~~

+

-

(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>（下表l表示“带类标”），对于输入数据，可以找到更好的特征描述。相比原始的数据特征描述，可以将<math>\textstyle x_l^{(1)}</math>输入到稀疏自编码神经网络，得到隐藏单元激活值向量<math>\textstyle a_l^{(1)}</math>。接下来，可以直接使用来代替<math>\textstyle a_l^{(1)}</math>描述原始数据<math>\textstyle x_l^{(1)}</math>。也可以合二为一，使用新的向量<math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>来描述。

+

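During the feature learning stage, where we learn from the unlabeled training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed various pre-processing parameters. For example, we may have computed the mean of the data and subtracted it to perform mean normalization, or applied principal component analysis (PCA) and represented the data as <math>\textstyle U^Tx</math> (or used PCA whitening or ZCA whitening). In that case, it is important to save these parameters and to use the same parameters during the later labeled training and test phases, so that the data always undergoes the same transformation before entering the sparse autoencoder. In particular, if we pre-processed the unlabeled data with PCA, we must save the resulting matrix <math>\textstyle U</math> and apply it to the labeled training set and the test set. We must not re-estimate a different matrix <math>\textstyle U</math> from the labeled training set (nor recompute the mean for mean normalization), since that could produce an inconsistent pre-processing transformation, making the distribution of the data fed into the autoencoder very different from the distribution it was trained on.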
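The "estimate once, reuse everywhere" rule can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: examples are stored one per row, the synthetic data and the function names `fit_preprocessing` and `apply_preprocessing` are hypothetical, and whitening is left out to keep the example short.

```python
import numpy as np

def fit_preprocessing(X_unlabeled):
    """Estimate the mean and the PCA basis U from the unlabeled data only."""
    mean = X_unlabeled.mean(axis=0)
    centered = X_unlabeled - mean
    cov = centered.T @ centered / centered.shape[0]
    eigvals, U = np.linalg.eigh(cov)       # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1]      # put the principal directions first
    return {"mean": mean, "U": U[:, order]}

def apply_preprocessing(params, X):
    """Apply the saved transform U^T (x - mean) to every row of X."""
    return (X - params["mean"]) @ params["U"]

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2]) + 5.0
X_labeled = rng.normal(size=(10, 3)) + 5.0
X_test = rng.normal(size=(4, 3)) + 5.0

params = fit_preprocessing(X_unlabeled)               # estimated once
Z_unlabeled = apply_preprocessing(params, X_unlabeled)
Z_labeled = apply_preprocessing(params, X_labeled)    # reused, not re-estimated
Z_test = apply_preprocessing(params, X_test)          # reused again
```

Calling `fit_preprocessing` a second time on `X_labeled` would yield a different mean and a different basis, which is exactly the mismatch the paragraph above warns against.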

-

~~'''[原文]'''~~

-

~~Thus, our training set now becomes~~

-

~~<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})~~

-

~~\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the~~

-

~~<math>\textstyle i</math>-th training example), or <math>\textstyle \{~~

-

~~((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(1)}), y^{(2)}), \ldots,~~

-

~~((x_l^{(m_l)}, a_l^{(1)}), y^{(m_l)}) \}</math> (if we use the concatenated~~

-

~~representation). In practice, the concatenated representation often works~~

-

~~better; but for memory or computation representations, we will sometimes use~~

-

~~the replacement representation as well.~~

-

~~'''[初译]'''~~

+

==On the terminology of unsupervised feature learning==

-

~~因此，现在训练集合变成<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})~~

+

-

\}</math>（使用上述的替换表达形式，同时使用<math>\textstyle a_l^{(i)}</math>来表达第<math>\textstyle i</math>个训练样本）。训练集合也可以表示为 <math>\textstyle \{

+

-

~~((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(1)}), y^{(2)}), \ldots,~~

+

-

((x_l^{(m_l)}, a_l^{(1)}), y^{(m_l)}) \}</math> （使用上述的连接表达形式)。在实践中，这种连接表达形式常常有更好的效果。但是，考虑到内存或者计算表达形式，有些时候，需要使用替换的表达形式。

+

-

+

-

~~'''[一审]'''~~

+

-

~~经过变换后，训练数据集就变成<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})~~

+

-

~~\}</math>或者是<math>\textstyle \{~~

+

-

~~((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(1)}), y^{(2)}), \ldots,~~

+

-

((x_l^{(m_l)}, a_l^{(1)}), y^{(m_l)}) \}</math>（决定于使用<math>\textstyle a_l^{(1)}</math>替换<math>\textstyle x_l^{(1)}</math>还是将二者合并）。在实践中，将<math>\textstyle a_l^{(1)}</math>和<math>\textstyle x_l^{(1)}</math>合并通常表现的更好。但是考虑到内存和计算的成本，也可以使用替换操作。

+

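There are two common unsupervised feature learning settings, depending on what kind of unlabeled data you have. Self-taught learning is the more general and more powerful setting: it does not require the unlabeled data <math> \textstyle x_u</math> and the labeled data <math> \textstyle x_l</math> to come from the same distribution. The more restrictive setting, sometimes called semi-supervised learning, requires <math> \textstyle x_u</math> and <math> \textstyle x_l</math> to follow the same distribution. The distinction is best explained with an example.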

-

~~'''[原文]'''~~

-

~~Finally, we can train a supervised learning algorithm such as an SVM, logistic~~

-

~~regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values.~~

-

~~Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:~~

-

~~For feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>. Then, feed~~

-

~~either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.~~

-

~~'''[初译]'''~~

+

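Suppose we have a computer vision task whose goal is to distinguish images of cars from images of motorcycles; that is, each labeled example in the training set is either an image of a car or an image of a motorcycle. Where can we get lots of unlabeled data? The easiest way is probably to download a random collection of images from the internet, train a sparse autoencoder on them, and obtain useful features. In this example, the unlabeled data comes from a different distribution than the labeled data (some of the unlabeled images may contain cars or motorcycles, but not every image does). This setting is called self-taught learning.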

-

最终，我们能够使用一个监督学习算法来训练，比如，SVM，logistic回归，等等，来获得对<math>\textstyle y</math>值的预测。对于一个测试样例<math>\textstyle x_{\rm test}</math>，遵守这样的过程：首先，把它送入自编码神经网络得到<math>\textstyle a_{\rm test}</math>。然后，将<math>\textstyle a_{\rm test}</math>或者<math>\textstyle (x_{\rm test}, a_{\rm test})</math>送到分类器得到预测值。

+

-

~~'''[一审]'''~~

-

最终，可以训练出一个有监督学习算法（例如svm,logistic regression等），得到一个判别函数对<math>\textstyle y</math>值就行预测。预测过程如下：给定一个测试样例<math>\textstyle x_{\rm test}</math>,重复之前的过程，送入稀疏自编码神经网络，得到<math>\textstyle a_{\rm test}</math>。然后将<math>\textstyle a_{\rm test}</math>或者（<math>\textstyle (x_{\rm test}, a_{\rm test})</math>）送入训练出的分类器中，得到预测值。

+

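In contrast, suppose we have lots of unlabeled images, every one of which is either a car or a motorcycle, and only the class labels are missing (we do not know which images are cars and which are motorcycles). We can also use this unlabeled data to learn features. This setting, in which the unlabeled examples are required to follow the same distribution as the labeled examples, is sometimes called semi-supervised learning. In practice, unlabeled data meeting this requirement is often unavailable (where would you find a database of images in which every image is either a car or a motorcycle, just missing its label?), so for learning features from unlabeled data, self-taught learning is more broadly applicable.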

-

~~== On pre-processing the data ==~~

-

~~'''[初译]'''~~

-

~~有关数据预处理~~

-

~~'''[一审]'''~~

-

~~数据预处理~~

-

~~'''[原文]'''~~

-

~~During the feature learning stage where we were learning from the unlabeled training set~~

-

~~<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed~~

-

~~various pre-processing parameters. For example, one may have computed~~

-

~~a mean value of the data and subtracted off this mean to perform mean normalization,~~

-

~~or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used~~

-

~~PCA~~

-

~~whitening or ZCA whitening). If this is the case, then it is important to~~

-

~~save away these preprocessing parameters, and to use the ''same'' parameters~~

-

~~during the labeled training phase and the test phase, so as to make sure~~

-

~~we are always transforming the data the same way to feed into the autoencoder.~~

-

~~In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,~~

-

~~we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the~~

-

~~labeled examples and the test data. We should '''not''' re-estimate a~~

-

~~different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the~~

-

~~labeled training set, since that might result in a dramatically different~~

-

~~pre-processing transformation, which would make the input distribution to~~

-

~~the autoencoder very different from what it was actually trained on.~~

-

~~'''[初译]'''~~

+

==Chinese-English glossary==

-

在特征学习阶段，我们从未被标记的样本集合<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>中学习,在这之前它们已经进行过大量的参数预处理。举个例子，假如我们已经计算出一组数据的平均值，并且进行均值化，或者使用主成分分析计算出矩阵U来表达这组数据（或者使用PCA白化或者ZCA白化）将数据表示为<math>\textstyle U^Tx</math>。在这种情况下，保存预处理的参数是很重要的，需要在被标注数据的训练阶段和测试阶段使用同样的参数。这样能保证我们总是使用相同的方式来转化数据，进入自编码神经网络的时候也能使用相同的方式。尤其是，如果已经使用了未被标记的数据和主成分分析得到矩阵<math>\textstyle U</math>，我们必须保持同样的矩阵，并且使用他们进行被标记样本以及测试数据的预处理，而不能使用标记过的训练样本，重新预估一个不同的<math>\textstyle U</math>矩阵（或者使用均值化得到的均值，等等）。其原因是，这样可能导致显著不同的预处理变化，这变化将使得自编码神经网络的输入分布迥异于实际。

+

-

~~'''[一审]'''~~

+

:自我学习/自学习 self-taught learning

-

~~在特征学习阶段，我们从无类标训练数据集<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}<~~/math>中进行学习，这一过程中可能计算了各种数据预处理参数。例如计算数据均值并且对数据做均值标准化（mean normalization）；或者对原始数据做主成分分析（PCA），然后将原始数据表示为<math>\textstyle U^Tx</math>(又或者使用PCA白化或ZCA白化)。这样的话，有必要将这些参数保存起来，并且在后面的训练和测试阶段使用同样的参数，以保证数据进入稀疏自编码神经网络之前经过了同样的变换。例如，如果对无类标数据集进行PCA，就必须将得到矩阵<math>\textstyle U</math>保存起来，并且应用到带类标训练数据集和测试数据集上去；而不能使用带类标训练数据集重新估计出一个不同的矩阵<math>\textstyle U</math>出来（也不能重新计算均值并做均值标准化），否则的话可能得到一个完全不一致的数据预处理操作，最终导致进入自编码神经网络的数据分布也迥异于训练阶段。

+

-

~~== On the terminology of unsupervised feature learning ==~~

+

:无监督特征学习 unsupervised feature learning

-

~~'''[初译]'''~~

+

-

~~有关非监督特征学习的术语~~

+

-

~~'''[一审]'''~~

+

-

~~非监督特征学习术语~~

+

:自编码器 autoencoder

-

~~'''[原文]'''~~

+

:白化 whitening

-

~~There are two common unsupervised feature learning settings, depending on what type of~~

+

-

~~unlabeled data you have. The more general and powerful setting is the '''self-taught learning'''~~

+

-

~~setting, which does not assume that your unlabeled data <math>x_u</math> has to~~

+

-

~~be drawn from the same distribution as your labeled data <math>x_l</math>. The~~

+

-

~~more restrictive setting where the unlabeled data comes from exactly the same~~

+

-

~~distribution as the labeled data is sometimes called the '''semi-supervised learning'''~~

+

-

~~setting. This distinctions is best explained with an example, which we now give.~~

+

-

~~'''[初译]'''~~

+

:激活量 activation

-

有两种常见的非监督特征学习设置，区别在于你拥有什么样的未标记数据。最为广泛应用的强大是自主学习设置，它不假设未标记数据<math>x_u</math>与被标记的数据<math>x_l</math>有着相同的分布。另一种有限制的设置是未被标记的数据与被标记的数据有着完全相同的分布，我们叫它半监督学习设置。现在我们来解释一下这种差别。

+

-

~~'''[一审]'''~~

+

:稀疏自编码器 sparse autoencoder

-

有两种常见的无监督特征学习方式，区别在于你有什么样的无类标数据。自学习(self-taught learning)是其中一般的、强大的学习方式，它不要求无类标数据<math>x_u</math>和带类标数据<math>x_l</math>来自同样的分布。另外一种带限制性的方式也被称为半监督学习，它要求<math>x_u</math>和<math>x_l</math>服从同样的分布。下面通过例子解释二者的区别。

+

:半监督学习 semi-supervised learning

-

~~'''[原文]'''~~

-

~~Suppose your goal is a computer vision task where you'd like~~

-

~~to distinguish between images of cars and images of motorcycles; so, each labeled~~

-

~~example in your training set is either an image of a car or an image of a motorcycle.~~

-

~~Where can we get lots of unlabeled data? The easiest way would be to obtain some~~

-

~~random collection of images, perhaps downloaded off the internet. We could then~~

-

~~train the autoencoder on this large collection of images, and obtain useful features~~

-

~~from them. Because here the unlabeled data is drawn from a different distribution~~

-

~~than the labeled data (i.e., perhaps some of our unlabeled images may contain~~

-

~~cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we~~

-

~~call this self-taught learning.~~

-

~~'''[初译]'''~~

-

假设你的目标是区分汽车或者摩托车图像。即，训练集的每个被标记样本要么是汽车的图像，要么是摩托车的图像。哪里可以得到这么多未被标记数据？最简便的方法是获取一些图像的随机集合，或者从互联网下载一些。接着可以将这些大量的图像集合用于自编码神经网络训练，以获得有用的特征。因为未标记的数据与标注过的数据有着不同的分布（未标记的图像可能包含汽车/摩托车，下载的每张图像都是汽车或者摩托车），所以，称其自我学习算法。

-

~~'''[一审]'''~~

+

==Chinese translators==

-

假定有一个计算机视觉方面的任务，目标是区分汽车和摩托车图像；也即训练样本里面要么是汽车的图像，要么是摩托车的图像。哪里获取大量的无类标数据呢？最简单的方式可能是到互联网上下载一些随机的图像数据集，这这些数据上训练出一个稀疏自编码神经网络，从中得到有用的特征。这个例子里，无类标数据完全来自于一个和带类标数据不同的分布（无类标数据集中，或许其中一些图像包含汽车或者摩托车，但是不是所有的图像都如此）。这种情形被称为自学习。

+

张灵（lingzhang001@outlook.com），晓风（xiaofeng.zhb@alibaba-inc.com），王文中（wangwenzhong@ymail.com）

-

~~'''[原文]'''~~

-

~~In contrast, if we happen to have lots of unlabeled images lying around~~

-

~~that are all images of ''either'' a car or a motorcycle, but where the data~~

-

~~is just missing its label (so you don't know which ones are cars, and which~~

-

~~ones are motorcycles), then we could use this form of unlabeled data to~~

-

~~learn the features. This setting---where each unlabeled example is drawn from the same~~

-

~~distribution as your labeled examples---is sometimes called the semi-supervised~~

-

~~setting. In practice, we often do not have this sort of unlabeled data (where would you~~

-

~~get a database of images where every image is either a car or a motorcycle, but~~

-

~~just missing its label?), and so in the context of learning features from unlabeled~~

-

~~data, the self-taught learning setting is more broadly applicable.~~

-

~~'''[初译]'''~~

+

-

相反的，如果恰好有成千上万张图像，它们要么是汽车，要么是摩托车，只是它们缺少标记（你不知道那张是汽车，哪张是摩托车），我们可以用这种未标记的数据来学习特征。对于这些设置--每个未被标记的样例与你标记过的样例有着相同的分布--有时候称它是半监督学习。在实践中，我们常常没有这种未标记数据（你可以得到这样的图像数据库，其中每张图像是汽车或者摩托车，只是丢失了标记）。综上，在针对未标记数据的特征学习上，自我学习设置能够被更广泛的使用。

+

-

~~'''[一审]'''~~

-

相反，如果有大量的无类标图像数据，要么是汽车图像，要么是摩托车图像，仅仅是缺失了类标（没有标注每张图片到底是汽车还是摩托车）。也可以用这些无类标数据来学习特征。这种方式，即要求无类标样本和带类标样本服从相同的分布，有时候被称为半监督学习。在实践中，常常无法找到满足这种要求的无类标数据（到哪里找到一个每张图像不是汽车就是摩托车，只是丢失了类标的图像数据库？）因此，自学习被广泛的应用于从无类标数据集中学习特征。

-

+

From Ufldl


@@ Line 1: / Line 1: @@
-'''Self-Taught Learning'''
+==综述==
-'''[初译]'''：自我学习(by '''@Call_Me_Zero''')
+如果已经有一个足够强大的机器学习算法，为了获得更好的性能，最靠谱的方法之一是给这个算法以更多的数据。机器学习界甚至有个说法：“有时候胜出者并非有最好的算法，而是有更多的数据。”
-'''[一审]'''：自学习  (by '''@晓风_机器学习''')
-== Overview ==
-'''[初译]'''：综述
-'''[一审]'''：综述
-'''[原文]'''
+人们总是可以尝试获取更多的已标注数据，但是这样做成本往往很高。例如研究人员已经花了相当的精力在使用类似 AMT(Amazon Mechanical Turk) 这样的工具上，以期获取更大的训练数据集。相比大量研究人员通过手工方式构建特征，用众包的方式让多人手工标数据是一个进步，但是我们可以做得更好。具体的说，如果算法能够从未标注数据中学习，那么我们就可以轻易地获取大量无标注数据，并从中学习。自学习和无监督特征学习就是这种的算法。尽管一个单一的未标注样本蕴含的信息比一个已标注的样本要少，但是如果能获取大量无标注数据（比如从互联网上下载随机的、无标注的图像、音频剪辑或者是文本），并且算法能够有效的利用它们，那么相比大规模的手工构建特征和标数据，算法将会取得更好的性能。
-Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable ways to get better performance is to give the algorithm more data. This has led to the that aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins; it's who has the most data."
-'''[初译]'''
-假如我们拥有足够强大的机器学习算法，那么，为了获得更好的性能，最靠谱的一种方法就是给予学习算法更多数据。机器学习界有句格言：有时候效果最好的，不是最优的算法，而是那些拥有最多数据的。
-'''[一审]'''
+在自学习和无监督特征学习问题上，可以给算法以大量的未标注数据，学习出较好的特征描述。在尝试解决一个具体的分类问题时，可以基于这些学习出的特征描述和任意的（可能比较少的）已标注数据，使用有监督学习方法完成分类。
-如果已经有一个足够强大的机器学习算法，为了获得更好的性能，最靠谱的方法之一是给这个算法以更多的数据。机器学习界甚至有个说法：“胜出的往往不是最好的算法，而是尽可能多的数据。”
-'''[原文]'''
+在一些拥有大量未标注数据和少量的已标注数据的场景中，上述思想可能是最有效的。即使在只有已标注数据的情况下（这时我们通常忽略训练数据的类标号进行特征学习），以上想法也能得到很好的结果。
-One can always try to get more labeled data, but this can be expensive. In particular, researchers have already gone to extraordinary lengths to use tools such as AMT (Amazon Mechanical Turk) to get large training sets. While having large numbers of people hand-label lots of data is probably a step forward compared to having large numbers of researchers hand-engineer features, it would be nice to do better. In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from unlabeled data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former---for example, by downloading random unlabeled images/audio clips/text documents off the internet---and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.
-'''[初译]'''
-有人总是尝试获得更多标记过的数据，这样做耗费巨大。典型的一种场景是，为了获得大量的训练集，学者们花费很长时间来使用诸如AMT(Amazon Mechanical Turk)之类的工具。相比起human hand-engineering，虽然使用大量人力手工标注数据已经是一个进步，但是我们可以做的更好。自我学习以及非监督特征学习能够做到：如果我们有能够从未被标记数据中学习的算法，那么就可以用来轻易地获取数据，并且从这些数据中进行大量学习。即便针对比起被标记的样本信息量小很多的未被标记样本，，这样做也能行的通。如果我们能够获取一系列未被标记样本（比如，通过从互联网随机下载未被标记的图像/音频片段/文本文件），同时使用的算法能够有效地挖掘这些未被标注数据，那么比起大量的human hand-engineering方法以及手工标注的方法，将获得更好的性能。
-'''[一审]'''
+==特征学习==
-在解决很多问题上，总是可以尝试获取更多的带类标数据，但是成本往往很高。典型地，研究人员已经花了相当的精力在使用类似AMT(Amazon Mechanical Turk，一个基于互联网的众包市场)这样的工具上，以期获取更大的训练数据集。相比大量的研究人员手工构建特征，用众包的方式让多人手工标数据是一个进步，而且期望着可以做的更好。特别是自学习和无监督特征学习，预示着如果算法能够从无类标数据中进行学习，就可以轻而易举的获取大量这样的数据供算法学习。尽管一个单一的无类标数据样例蕴含的信息比一个带类标的数据样例要少，但是如果能大量的获取无类标数据（比如从互联网上下载随机的、无类标的图像、音频剪辑或者是文本），并且算法能够有效的利用它们，相比大规模的手工构建特征和标数据，最终将会有更好的性能。
+我们已经了解到如何使用一个自编码器（autoencoder）从无标注数据中学习特征。具体来说，假定有一个无标注的训练数据集 <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>（下标 <math>\textstyle u</math> 代表“不带类标”）。现在用它们训练一个稀疏自编码器（可能需要首先对这些数据做白化或其它适当的预处理）。
-'''[原文]'''
-In Self-taught learning and Unsupervised feature learning, we will give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input. If we are trying to solve a specific classification task, then we take this learned feature representation and whatever (perhaps small amount of) labeled data we have for that classification task, and apply supervised learning on that labeled data to solve the classification task.'''
-'''[初译]'''
-在自我学习和非监督特征学习领域，我们给予算法大量未标注的数据，通过它们来学习更好的特征重现形式。在尝试解决具体的分类任务的时候，通过这些学习来的特征重现形式，同时，应用监督学习方法于任意数量（可能是很少量）的被标注数据，两者一起来完成分类任务。
-'''[一审]'''
-在自学习和无监督特征学习问题上，可以给算法以大量的无类标数据，学习出较好的特征描述。如果面对一个具体的分类问题，就可以基于这些学习出的特征描述和任意的（可能比较少的）带类标数据，使用有监督学习方法解决。
-'''[原文]'''
-These ideas probably have the most powerful effects in problems where we have a lot of unlabeled data, and a smaller amount of labeled data. However, they typically give good results even if we have only labeled data (in which case we usually perform the feature learning step using the labeled data, but ignoring the labels).
-'''[初译]'''
-以上想法对于以下场景最有效--同时拥有大量未被标记数据和小部分已标记数据。即便是，我们只拥有已标记数据（在这种情况下，我们常常在特征学习阶段使用被标注数据，但是忽略标记，仅关注数据本身），以上想法也能给出很好的结果。
-'''[一审]'''
-在一些拥有大量无类标数据和少量的带类标数据的场景中，甚至是只有带类标数据的场景中（丢掉类标进行特征学习），以上想法都可能十分凑效。
-== Learning features ==
-'''[初译]'''：特征学习
-'''[一审]'''：特征学习
-'''[原文]'''
-We have already seen how an autoencoder can be used to learn features from
-unlabeled data.  Concretely, suppose we have an unlabeled
-training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>
-with <math>\textstyle m_u</math> unlabeled examples.  (The subscript "u" stands for
-"unlabeled.")  We can then train a sparse autoencoder on this data
-(perhaps with appropriate whitening or other pre-processing):
-'''[初译]'''
-我们已经了解自编码神经网络（autoencoder）怎么用来从未被标记数据中学习特征。具体来说，假设我们有<math>\textstyle m_u</math>个未被标记的训练集合<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>（下标u代表“未被标记的”）。现在用它们来训练一个稀疏的自编码神经网络。（可以使用合适的白化以及其他预操作）
-'''[一审]'''
-我们已经了解到如何使用一个自编码神经网络（autoencoder）来从无类标数据中学习特征。具体来说，假定有一个无类标的训练数据集<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>（下标u代表“不带类标”）。现在用它们训练一个稀疏自编码神经网络（可以使用合适的whitening及其他预处理工作）。
 [[File:STL_SparseAE.png|350px]]
-'''[原文]'''
+利用训练得到的模型参数 <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>，给定任意的输入数据 <math>\textstyle x</math>，可以计算隐藏单元的激活量（activations） <math>\textstyle a</math>。如前所述，相比原始输入 <math>\textstyle x</math> 来说，<math>\textstyle a</math> 可能是一个更好的特征描述。下图的神经网络描述了特征（激活量 <math>\textstyle a</math>）的计算。
-Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,
-given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of
-activations <math>\textstyle a</math> of the hidden units.  As we saw previously, this often gives a
-better representation of the input than the original raw input <math>\textstyle x</math>.  We can also
-visualize the algorithm for computing the features/activations <math>\textstyle a</math> as the following
-neural network:
-'''[初译]'''
+[[File:STL_SparseAE_Features.png|300px]]
-训练得到的模型参数<math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>，对于任何新的输入<math>\textstyle x</math>，可以计算隐藏单元对应的activations <math>\textstyle a</math>向量。正如前面看到的，这种方法常能给出比原始输入<math>\textstyle x</math>更好的表达重现。如下的神经网络图可视化地阐释了特征/activations <math>\textstyle a</math>的计算:
-'''[一审]'''
-利用训练得到的模型参数<math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>，给定任意的输入数据<math>\textstyle x</math>，可以计算隐藏单元对应的激活值（activations）向量<math>\textstyle a</math>。如前所述，相比原始输入<math>\textstyle x</math>来说，这样做可以得到一个更好的特征描述。下图的神经网络描述了特征/激活值向量<math>\textstyle a</math>的计算。
-[[File:STL_SparseAE_Features.png|300px]]
+这实际上就是之前得到的稀疏自编码器，在这里去掉了最后一层。
-'''[原文]'''
+假定有大小为 <math>\textstyle m_l</math> 的已标注训练集 <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
-This is just the sparse autoencoder that we previously had, with with the final layer removed.
+(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>（下标 <math>\textstyle l</math> 表示“带类标”），我们可以为输入数据找到更好的特征描述。例如，可以将 <math>\textstyle x_l^{(1)}</math> 输入到稀疏自编码器，得到隐藏单元激活量 <math>\textstyle a_l^{(1)}</math>。接下来，可以直接使用 <math>\textstyle a_l^{(1)}</math> 来代替原始数据 <math>\textstyle x_l^{(1)}</math> （“替代表示”,Replacement Representation）。也可以合二为一，使用新的向量 <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math> 来代替原始数据 <math>\textstyle x_l^{(1)}</math> （“级联表示”,Concatenation Representation）。
-'''[初译]'''
-这是之前得到的移除了最终层次的稀疏自编码神经网络。
-'''[一审]'''
+经过变换后，训练集就变成 <math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})
-这实际上就是之前得到的稀疏自编码神经网络，在这里去掉了最后一层。
+\}</math>或者是<math>\textstyle \{
+((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(1)}), y^{(2)}), \ldots,
+((x_l^{(m_l)}, a_l^{(1)}), y^{(m_l)}) \}</math>（取决于使用 <math>\textstyle a_l^{(1)}</math> 替换 <math>\textstyle x_l^{(1)}</math> 还是将二者合并）。在实践中，将 <math>\textstyle a_l^{(1)}</math> 和 <math>\textstyle x_l^{(1)}</math> 合并通常表现的更好。但是考虑到内存和计算的成本，也可以使用替换操作。
-'''[原文]'''
+最终，可以训练出一个有监督学习算法（例如 svm, logistic regression 等），得到一个判别函数对 <math>\textstyle y</math> 值进行预测。预测过程如下：给定一个测试样本 <math>\textstyle x_{\rm test}</math>，重复之前的过程，将其送入稀疏自编码器，得到 <math>\textstyle a_{\rm test}</math>。然后将 <math>\textstyle a_{\rm test}</math> （或者 <math>\textstyle (x_{\rm test}, a_{\rm test})</math> ）送入分类器中，得到预测值。
-Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
-(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
-(The subscript "l" stands for "labeled.")
-We can now find a better representation for the inputs.  In particular, rather
-than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
-<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder, and obtain the corresponding
-vector of activations <math>\textstyle a_l^{(1)}</math>.  To represent this example, we can
-just '''replace''' the original feature vector with <math>\textstyle a_l^{(1)}</math>.
-Alternatively, we can '''concatenate''' the two feature vectors together,
-getting a representation <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>.
-'''[初译]'''
-现在，假设有一组大小为<math>\textstyle m_l</math>个的被标记训练集<math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
-(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>。（下标l代表被“标记的”），我们可以找到更好的输入的重新表达形式。相比起去重现第一个训练样本<math>\textstyle x_l^{(1)}</math>，我们将<math>\textstyle x_l^{(1)}</math>作为自编码神经网络的输入，以此获得对应的activations <math>\textstyle a_l^{(1)}</math>向量。为了重新表达这个样本，用<math>\textstyle a_l^{(1)}</math>来替换原始的特征向量<math>\textstyle x_l^{(1)}</math>。或者，将两个特征向量合并起来，得到重新表达形式<math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>。
-'''[一审]'''
+==Data preprocessing==
-一审：假定有大小为<math>\textstyle m_l</math>的带类标训练数据集 <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
-(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>（下标l表示“带类标”），对于输入数据，可以找到更好的特征描述。相比原始的数据特征描述，可以将<math>\textstyle x_l^{(1)}</math>输入到稀疏自编码神经网络，得到隐藏单元激活值向量<math>\textstyle a_l^{(1)}</math>。接下来，可以直接使用<math>\textstyle a_l^{(1)}</math>来代替原始数据<math>\textstyle x_l^{(1)}</math>。也可以合二为一，使用新的向量<math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>来描述。
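+During the feature learning stage, we learn from the unlabeled training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, and in doing so we may have computed various preprocessing parameters. For example, we may have computed the data mean and used it for mean normalization, or run principal component analysis (PCA) on the raw data and represented the data as <math>\textstyle U^Tx</math> (or used PCA whitening or ZCA whitening). If so, it is important to save these parameters and to use the same parameters in the later training and test phases, so that the data is always transformed the same way before it enters the sparse autoencoder. For example, if PCA was applied to the unlabeled data, the resulting matrix <math>\textstyle U</math> must be saved and applied to the labeled training set and the test set. We must not re-estimate a different matrix <math>\textstyle U</math> from the labeled training set (nor recompute the mean for mean normalization), since that could produce an inconsistent preprocessing transformation and make the input distribution to the autoencoder very different from the one it was trained on.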
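A minimal sketch of the "fit once, reuse everywhere" rule follows, using mean normalization as the simplest case. The helper names `fit_mean` and `apply_mean_normalization` and the toy data are hypothetical; a PCA matrix <math>\textstyle U</math> or whitening parameters would be stored and reused in the same way.

```python
def fit_mean(examples):
    """Per-feature mean over a list of equal-length feature vectors."""
    m = len(examples)
    return [sum(column) / m for column in zip(*examples)]


def apply_mean_normalization(examples, mean):
    """Subtract a previously saved mean from every example."""
    return [[value - mu for value, mu in zip(x, mean)] for x in examples]


# Made-up data: unlabeled set for feature learning, then labeled and test sets.
x_unlabeled = [[0.0, 10.0], [2.0, 14.0], [4.0, 12.0]]
x_labeled = [[1.0, 11.0], [3.0, 15.0]]
x_test = [[2.0, 12.0]]

# Estimate the parameter once, on the unlabeled data, and keep it.
mean_u = fit_mean(x_unlabeled)

# Correct: reuse the saved mean for every later phase.
unlabeled_in = apply_mean_normalization(x_unlabeled, mean_u)
labeled_ok = apply_mean_normalization(x_labeled, mean_u)
test_ok = apply_mean_normalization(x_test, mean_u)

# Incorrect: re-estimating the mean on the labeled set changes the transform,
# so the autoencoder would see inputs shifted differently from its training data.
labeled_wrong = apply_mean_normalization(x_labeled, fit_mean(x_labeled))
```

Here `labeled_ok` and `labeled_wrong` differ in the second feature, which is the kind of mismatch the paragraph above warns about.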
-'''[原文]'''
-Thus, our training set now becomes
-<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})
-\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the
-<math>\textstyle i</math>-th training example), or <math>\textstyle \{
-((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,
-((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (if we use the concatenated
-representation).  In practice, the concatenated representation often works
-better; but for memory or computation reasons, we will sometimes use
-the replacement representation as well.
-'''[初译]'''
+==Terminology of unsupervised feature learning==
-因此，现在训练集合变成<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})
-\}</math>（使用上述的替换表达形式，同时使用<math>\textstyle a_l^{(i)}</math>来表达第<math>\textstyle i</math>个训练样本）。训练集合也可以表示为 <math>\textstyle \{
-((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,
-((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> （使用上述的连接表达形式）。在实践中，这种连接表达形式常常有更好的效果。但是，考虑到内存或者计算开销，有些时候，需要使用替换的表达形式。
-'''[一审]'''
-经过变换后，训练数据集就变成<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})
-\}</math>或者是<math>\textstyle \{
-((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,
-((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math>（取决于使用<math>\textstyle a_l^{(1)}</math>替换<math>\textstyle x_l^{(1)}</math>还是将二者合并）。在实践中，将<math>\textstyle a_l^{(1)}</math>和<math>\textstyle x_l^{(1)}</math>合并通常表现得更好。但是考虑到内存和计算的成本，也可以使用替换操作。
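+There are two common unsupervised feature learning settings, depending on what kind of unlabeled data you have. Self-taught learning is the more general and more powerful setting: it does not require the unlabeled data <math> \textstyle x_u</math> and the labeled data <math> \textstyle x_l</math> to come from the same distribution. The other, more restrictive setting is sometimes called semi-supervised learning; it requires <math> \textstyle x_u</math> and <math> \textstyle x_l</math> to follow the same distribution. The difference between the two is explained below with an example.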
-'''[原文]'''
-Finally, we can train a supervised learning algorithm such as an SVM, logistic
-regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values.
-Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
-First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>.  Then, feed
-either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.
-'''[初译]'''
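+Suppose we have a computer vision task whose goal is to distinguish images of cars from images of motorcycles; that is, every labeled example in the training set is an image of either a car or a motorcycle. Where can we get lots of unlabeled data? The easiest way is probably to download a random collection of images from the internet, train a sparse autoencoder on it, and obtain useful features. In this example, the unlabeled data comes from a different distribution than the labeled data (some of the unlabeled images may contain cars or motorcycles, but not all of them do). This setting is called self-taught learning.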
-最终，我们能够使用一个监督学习算法来训练，比如，SVM，logistic回归，等等，来获得对<math>\textstyle y</math>值的预测。对于一个测试样例<math>\textstyle x_{\rm test}</math>，遵守这样的过程：首先，把它送入自编码神经网络得到<math>\textstyle a_{\rm test}</math>。然后，将<math>\textstyle a_{\rm test}</math>或者<math>\textstyle (x_{\rm test}, a_{\rm test})</math>送到分类器得到预测值。
-'''[一审]'''
-最终，可以训练出一个有监督学习算法（例如svm,logistic regression等），得到一个判别函数对<math>\textstyle y</math>值进行预测。预测过程如下：给定一个测试样例<math>\textstyle x_{\rm test}</math>，重复之前的过程，送入稀疏自编码神经网络，得到<math>\textstyle a_{\rm test}</math>。然后将<math>\textstyle a_{\rm test}</math>（或者<math>\textstyle (x_{\rm test}, a_{\rm test})</math>）送入训练出的分类器中，得到预测值。
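+In contrast, suppose we have a large number of unlabeled images, each of which is either a car or a motorcycle and is merely missing its label (nothing says whether a given image is a car or a motorcycle). We can also use this unlabeled data to learn features. This setting, which requires the unlabeled examples to follow the same distribution as the labeled examples, is sometimes called semi-supervised learning. In practice, we often cannot find unlabeled data that meets this requirement (where would you find a database of images in which every image is either a car or a motorcycle, just missing its label?). Self-taught learning is therefore more broadly applicable when learning features from unlabeled data.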
-== On pre-processing the data ==
-'''[初译]'''
-有关数据预处理
-'''[一审]'''
-数据预处理
-'''[原文]'''
-During the feature learning stage where we were learning from the unlabeled training set
-<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
-various pre-processing parameters.  For example, one may have computed
-a mean value of the data and subtracted off this mean to perform mean normalization,
-or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used
-PCA
-whitening or ZCA whitening).  If this is the case, then it is important to
-save away these preprocessing parameters, and to use the ''same'' parameters
-during the labeled training phase and the test phase, so as to make sure
-we are always transforming the data the same way to feed into the autoencoder.
-In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
-we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
-labeled examples and the test data.  We should '''not''' re-estimate a
-different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the
-labeled training set, since that might result in a dramatically different
-pre-processing transformation, which would make the input distribution to
-the autoencoder very different from what it was actually trained on.
-'''[初译]'''
+==Chinese-English glossary==
-在特征学习阶段，我们从未被标记的样本集合<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>中学习,在这之前它们已经进行过大量的参数预处理。举个例子，假如我们已经计算出一组数据的平均值，并且进行均值化，或者使用主成分分析计算出矩阵U来表达这组数据（或者使用PCA白化或者ZCA白化）将数据表示为<math>\textstyle U^Tx</math>。在这种情况下，保存预处理的参数是很重要的，需要在被标注数据的训练阶段和测试阶段使用同样的参数。这样能保证我们总是使用相同的方式来转化数据，进入自编码神经网络的时候也能使用相同的方式。尤其是，如果已经使用了未被标记的数据和主成分分析得到矩阵<math>\textstyle U</math>，我们必须保持同样的矩阵，并且使用他们进行被标记样本以及测试数据的预处理，而不能使用标记过的训练样本，重新预估一个不同的<math>\textstyle U</math>矩阵（或者使用均值化得到的均值，等等）。其原因是，这样可能导致显著不同的预处理变化，这变化将使得自编码神经网络的输入分布迥异于实际。
-'''[一审]'''
+:自我学习/自学习	self-taught learning
-在特征学习阶段，我们从无类标训练数据集<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>中进行学习，这一过程中可能计算了各种数据预处理参数。例如计算数据均值并且对数据做均值标准化（mean normalization）；或者对原始数据做主成分分析（PCA），然后将原始数据表示为<math>\textstyle U^Tx</math>(又或者使用PCA白化或ZCA白化)。这样的话，有必要将这些参数保存起来，并且在后面的训练和测试阶段使用同样的参数，以保证数据进入稀疏自编码神经网络之前经过了同样的变换。例如，如果对无类标数据集进行PCA，就必须将得到矩阵<math>\textstyle U</math>保存起来，并且应用到带类标训练数据集和测试数据集上去；而不能使用带类标训练数据集重新估计出一个不同的矩阵<math>\textstyle U</math>出来（也不能重新计算均值并做均值标准化），否则的话可能得到一个完全不一致的数据预处理操作，最终导致进入自编码神经网络的数据分布也迥异于训练阶段。
-== On the terminology of unsupervised feature learning ==
+:无监督特征学习	unsupervised feature learning
-'''[初译]'''
-有关非监督特征学习的术语
-'''[一审]'''
-非监督特征学习术语
+:自编码器	autoencoder
-'''[原文]'''
+:白化	whitening
-There are two common unsupervised feature learning settings, depending on what type of
-unlabeled data you have.  The more general and powerful setting is the '''self-taught learning'''
-setting, which does not assume that your unlabeled data <math>x_u</math> has to
-be drawn from the same distribution as your labeled data <math>x_l</math>.  The
-more restrictive setting where the unlabeled data comes from exactly the same
-distribution as the labeled data is sometimes called the '''semi-supervised learning'''
-setting.  This distinction is best explained with an example, which we now give.
-'''[初译]'''
+:激活量	activation
-有两种常见的非监督特征学习设置，区别在于你拥有什么样的未标记数据。最为广泛应用的强大是自主学习设置，它不假设未标记数据<math>x_u</math>与被标记的数据<math>x_l</math>有着相同的分布。另一种有限制的设置是未被标记的数据与被标记的数据有着完全相同的分布，我们叫它半监督学习设置。现在我们来解释一下这种差别。
-'''[一审]'''
+:稀疏自编码器	sparse autoencoder
-有两种常见的无监督特征学习方式，区别在于你有什么样的无类标数据。自学习(self-taught learning)是其中一般的、强大的学习方式，它不要求无类标数据<math>x_u</math>和带类标数据<math>x_l</math>来自同样的分布。另外一种带限制性的方式也被称为半监督学习，它要求<math>x_u</math>和<math>x_l</math>服从同样的分布。下面通过例子解释二者的区别。
+:半监督学习	semi-supervised learning
-'''[原文]'''
-Suppose your goal is a computer vision task where you'd like
-to distinguish between images of cars and images of motorcycles; so, each labeled
-example in your training set is either an image of a car or an image of a motorcycle.
-Where can we get lots of unlabeled data?  The easiest way would be to obtain some
-random collection of images, perhaps downloaded off the internet.  We could then
-train the autoencoder on this large collection of images, and obtain useful features
-from them.  Because here the unlabeled data is drawn from a different distribution
-than the labeled data (i.e., perhaps some of our unlabeled images may contain
-cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we
-call this self-taught learning.
-'''[初译]'''
-假设你的目标是区分汽车或者摩托车图像。即，训练集的每个被标记样本要么是汽车的图像，要么是摩托车的图像。哪里可以得到这么多未被标记数据？最简便的方法是获取一些图像的随机集合，或者从互联网下载一些。接着可以将这些大量的图像集合用于自编码神经网络训练，以获得有用的特征。因为未标记的数据与标注过的数据有着不同的分布（未标记的图像可能包含汽车/摩托车，下载的每张图像都是汽车或者摩托车），所以，称其自我学习算法。
-'''[一审]'''
+==Chinese translators==
-假定有一个计算机视觉方面的任务，目标是区分汽车和摩托车图像；也即训练样本里面要么是汽车的图像，要么是摩托车的图像。哪里获取大量的无类标数据呢？最简单的方式可能是到互联网上下载一些随机的图像数据集，在这些数据上训练出一个稀疏自编码神经网络，从中得到有用的特征。这个例子里，无类标数据完全来自于一个和带类标数据不同的分布（无类标数据集中，或许其中一些图像包含汽车或者摩托车，但是不是所有的图像都如此）。这种情形被称为自学习。
+张灵（lingzhang001@outlook.com），晓风（xiaofeng.zhb@alibaba-inc.com），王文中（wangwenzhong@ymail.com）
-'''[原文]'''
-In contrast, if we happen to have lots of unlabeled images lying around
-that are all images of ''either'' a car or a motorcycle, but where the data
-is just missing its label (so you don't know which ones are cars, and which
-ones are motorcycles), then we could use this form of unlabeled data to
-learn the features.  This setting---where each unlabeled example is drawn from the same
-distribution as your labeled examples---is sometimes called the semi-supervised
-setting.  In practice, we often do not have this sort of unlabeled data (where would you
-get a database of images where every image is either a car or a motorcycle, but
-just missing its label?), and so in the context of learning features from unlabeled
-data, the self-taught learning setting is more broadly applicable.
-'''[初译]'''
+{{自我学习与无监督特征学习}}
-相反的，如果恰好有成千上万张图像，它们要么是汽车，要么是摩托车，只是它们缺少标记（你不知道哪张是汽车，哪张是摩托车），我们可以用这种未标记的数据来学习特征。对于这些设置--每个未被标记的样例与你标记过的样例有着相同的分布--有时候称它是半监督学习。在实践中，我们常常没有这种未标记数据（你可以得到这样的图像数据库，其中每张图像是汽车或者摩托车，只是丢失了标记）。综上，在针对未标记数据的特征学习上，自我学习设置能够被更广泛的使用。
-'''[一审]'''
-相反，如果有大量的无类标图像数据，要么是汽车图像，要么是摩托车图像，仅仅是缺失了类标（没有标注每张图片到底是汽车还是摩托车）。也可以用这些无类标数据来学习特征。这种方式，即要求无类标样本和带类标样本服从相同的分布，有时候被称为半监督学习。在实践中，常常无法找到满足这种要求的无类标数据（到哪里找到一个每张图像不是汽车就是摩托车，只是丢失了类标的图像数据库？）因此，自学习被广泛的应用于从无类标数据集中学习特征。
-{{STL}}
+{{Languages|Self-Taught_Learning|English}}