'''Self-Taught Learning'''

'''[Initial translation]''': 自我学习 (by '''@Call_Me_Zero''')
'''[First review]''': 自学习 (by '''@晓风_机器学习''')

==Overview==

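If we already have a sufficiently powerful machine learning algorithm, one of the most reliable ways to get better performance is to give it more data. This has led to the aphorism in machine learning that "sometimes it's not who has the best algorithm that wins; it's who has the most data."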


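One can always try to get more labeled data, but doing so is often expensive. For example, researchers have already put considerable effort into using tools such as AMT (Amazon Mechanical Turk) to obtain larger training sets. Having many people hand-label data through crowdsourcing is a step forward compared with having many researchers hand-engineer features, but we can do better. Specifically, if an algorithm can learn from unlabeled data, we can easily obtain large amounts of it and learn from it. Self-taught learning and unsupervised feature learning are algorithms of this kind. Although a single unlabeled example carries less information than a single labeled one, if we can obtain unlabeled data in large quantities (for example, by downloading random unlabeled images, audio clips, or text documents from the internet) and the algorithm can exploit it effectively, the algorithm may outperform approaches based on large-scale hand-engineering of features and hand-labeling of data.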


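In self-taught learning and unsupervised feature learning, we give the algorithm a large amount of unlabeled data from which it learns a good feature representation. When we then try to solve a specific classification problem, we take this learned feature representation together with whatever labeled data we have (possibly only a small amount) and apply a supervised learning method to perform the classification.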


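These ideas are probably most effective in settings with a large amount of unlabeled data and a small amount of labeled data. They can also give good results when only labeled data is available; in that case we usually run the feature learning step on the training data while ignoring its class labels.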

==Learning Features==

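We have already seen how an autoencoder can be used to learn features from unlabeled data. Concretely, suppose we have an unlabeled training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math> (the subscript <math>\textstyle u</math> stands for "unlabeled"). We can train a sparse autoencoder on this data (perhaps after whitening or other appropriate pre-processing).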

[[File:STL_SparseAE.png|350px]]


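Using the trained model parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math>, given any input <math>\textstyle x</math>, we can compute the vector of hidden unit activations <math>\textstyle a</math>. As discussed earlier, <math>\textstyle a</math> is often a better feature representation than the raw input <math>\textstyle x</math>. The neural network below shows how the features (the activations <math>\textstyle a</math>) are computed.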

[[File:STL_SparseAE_Features.png|300px]]


This is just the sparse autoencoder obtained earlier, with the final layer removed.
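The feature computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes a sigmoid activation for the hidden layer and a data layout with one example per column, neither of which is stated on this page, and the names <code>sigmoid</code> and <code>feedforward_features</code> are hypothetical. The weights below are random stand-ins for a trained <math>\textstyle W^{(1)}, b^{(1)}</math>.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed hidden-layer activation)."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_features(W1, b1, X):
    """Compute hidden activations a = f(W1 x + b1) for every column of X.

    W1: (n_hidden, n_inputs) first-layer weights of the autoencoder
    b1: (n_hidden,) first-layer biases
    X:  (n_inputs, m) data matrix, one example per column
    Returns an (n_hidden, m) matrix of features.
    """
    return sigmoid(W1 @ X + b1[:, None])

# Toy stand-ins for trained parameters and input data.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 5))
b1 = np.zeros(3)
X = rng.normal(size=(5, 4))

A = feedforward_features(W1, b1, X)
print(A.shape)  # (3, 4)
```

Only <math>\textstyle W^{(1)}</math> and <math>\textstyle b^{(1)}</math> are used here; the decoder parameters <math>\textstyle W^{(2)}, b^{(2)}</math> belong to the removed final layer.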


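Now suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples (the subscript <math>\textstyle l</math> stands for "labeled"). We can now find a better feature representation for the inputs. For example, we can feed <math>\textstyle x_l^{(1)}</math> into the sparse autoencoder and obtain the hidden unit activations <math>\textstyle a_l^{(1)}</math>. We can then either use <math>\textstyle a_l^{(1)}</math> directly in place of the raw input <math>\textstyle x_l^{(1)}</math>, or concatenate the two and use the new vector <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math> in place of <math>\textstyle x_l^{(1)}</math>.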


After this transformation, the training set becomes <math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots, (a_l^{(m_l)}, y^{(m_l)})
\}</math> or <math>\textstyle \{
((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots, 
((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math>, depending on whether we replace <math>\textstyle x_l^{(i)}</math> with <math>\textstyle a_l^{(i)}</math> or concatenate the two. In practice, concatenating <math>\textstyle a_l^{(i)}</math> and <math>\textstyle x_l^{(i)}</math> usually works better, but the replacement representation can be used instead when memory or computation cost is a concern.
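As a small illustration of the two representations, the sketch below builds both from toy arrays. It assumes one example per column; <code>X_l</code> and <code>A_l</code> are hypothetical stand-ins for the raw labeled inputs and their autoencoder activations.

```python
import numpy as np

# Hypothetical toy data: 5 raw input dimensions, 3 hidden units,
# 4 labeled examples stored as columns.
X_l = np.arange(20, dtype=float).reshape(5, 4)   # raw inputs x_l^(i)
A_l = np.full((3, 4), 0.5)                       # stand-in activations a_l^(i)

# Replacement representation: use the activations alone.
features_replace = A_l

# Concatenated representation: stack raw inputs on top of activations,
# so column i holds (x_l^(i), a_l^(i)).
features_concat = np.vstack([X_l, A_l])

print(features_replace.shape, features_concat.shape)  # (3, 4) (8, 4)
```

The concatenated matrix has as many rows as the input and hidden dimensions combined, which is where the extra memory and computation cost comes from.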


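Finally, we can train a supervised learning algorithm (such as an SVM or logistic regression) to obtain a function that predicts the <math>\textstyle y</math> values. Prediction works as follows: given a test example <math>\textstyle x_{\rm test}</math>, we repeat the same procedure and feed it into the sparse autoencoder to obtain <math>\textstyle a_{\rm test}</math>. We then feed <math>\textstyle a_{\rm test}</math> (or <math>\textstyle (x_{\rm test}, a_{\rm test})</math>) into the classifier to obtain a prediction.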
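The training and prediction steps might look like the following sketch. It is not the tutorial's implementation: it assumes a sigmoid hidden layer, binary labels, the replacement representation, and a plain batch-gradient-descent logistic regression. All function names, the learning rate, and the toy data are hypothetical, and <code>W1</code>, <code>b1</code> are random stand-ins for trained autoencoder parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, X):
    """Hidden activations for each column of X (assumed sigmoid units)."""
    return sigmoid(W1 @ X + b1[:, None])

def logistic_loss(theta, bias, F, y):
    """Mean cross-entropy of a logistic regression on features F."""
    p = sigmoid(theta @ F + bias)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def train_logistic(F, y, lr=0.5, n_steps=500):
    """Fit binary logistic regression by batch gradient descent."""
    theta = np.zeros(F.shape[0])
    bias = 0.0
    for _ in range(n_steps):
        err = sigmoid(theta @ F + bias) - y      # prediction error per example
        theta -= lr * (F @ err) / y.size         # gradient w.r.t. weights
        bias -= lr * err.mean()                  # gradient w.r.t. bias
    return theta, bias

def predict(theta, bias, W1, b1, X_test):
    """Send test inputs through the same feature extractor, then classify."""
    A_test = extract_features(W1, b1, X_test)
    return (sigmoid(theta @ A_test + bias) >= 0.5).astype(int)

# Toy stand-ins: random "trained" autoencoder weights and synthetic labels.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))
b1 = np.zeros(3)
X_l = rng.normal(size=(5, 40))
y_l = (X_l[0] > 0).astype(float)

A_l = extract_features(W1, b1, X_l)              # replacement representation
theta, bias = train_logistic(A_l, y_l)

X_test = rng.normal(size=(5, 10))
y_pred = predict(theta, bias, W1, b1, X_test)
print(y_pred.shape)  # (10,)
```

The point of the sketch is that <code>predict</code> reuses exactly the same <code>W1</code> and <code>b1</code> as training, so test inputs are mapped into the same feature space the classifier was fitted on.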

== On pre-processing the data ==
During the feature learning stage where we were learning from the unlabeled training set 
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
various pre-processing parameters.  For example, one may have computed
a mean value of the data and subtracted off this mean to perform mean normalization,
or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used 
PCA 
whitening or ZCA whitening).  If this is the case, then it is important to
save away these preprocessing parameters, and to use the ''same'' parameters
during the labeled training phase and the test phase, so as to make sure
we are always transforming the data the same way to feed into the autoencoder. 
In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
labeled examples and the test data.  We should '''not''' re-estimate a
different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the
labeled training set, since that might result in a dramatically different
pre-processing transformation, which would make the input distribution to
the autoencoder very different from what it was actually trained on.
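The advice above can be sketched as a pair of helper functions: one that estimates the pre-processing parameters from the unlabeled data only, and one that applies those saved parameters to any other data set. This is a minimal sketch assuming per-feature mean normalization followed by a PCA rotation, with one example per column; the function names and toy data are hypothetical.

```python
import numpy as np

def fit_pca_preprocessing(X_u):
    """Estimate the mean and PCA basis U from unlabeled data only.

    X_u: (n, m_u) matrix, one unlabeled example per column.
    Returns (mean, U), the parameters to save for later phases.
    """
    mean = X_u.mean(axis=1, keepdims=True)
    Xc = X_u - mean
    sigma = Xc @ Xc.T / X_u.shape[1]     # covariance of the centered data
    U, _, _ = np.linalg.svd(sigma)       # columns of U are principal directions
    return mean, U

def apply_pca_preprocessing(X, mean, U):
    """Apply the saved transformation; never re-estimate mean or U here."""
    return U.T @ (X - mean)

rng = np.random.default_rng(0)
X_u = rng.normal(size=(4, 200))   # unlabeled set
X_l = rng.normal(size=(4, 20))    # labeled training set
X_test = rng.normal(size=(4, 5))  # test set

mean_u, U = fit_pca_preprocessing(X_u)

# The same saved parameters are used in all three phases.
Z_u = apply_pca_preprocessing(X_u, mean_u, U)
Z_l = apply_pca_preprocessing(X_l, mean_u, U)
Z_test = apply_pca_preprocessing(X_test, mean_u, U)
```

Keeping fitting and applying in separate functions makes it harder to re-estimate the parameters on the labeled or test data by accident.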


== On the terminology of unsupervised feature learning ==
There are two common unsupervised feature learning settings, depending on what type of 
unlabeled data you have.  The more general and powerful setting is the '''self-taught learning'''
setting, which does not assume that your unlabeled data <math>x_u</math> has to
be drawn from the same distribution as your labeled data <math>x_l</math>.  The 
more restrictive setting where the unlabeled data comes from exactly the same 
distribution as the labeled data is sometimes called the '''semi-supervised learning''' 
setting.  This distinction is best explained with an example, which we now give. 

Suppose your goal is a computer vision task where you'd like
to distinguish between images of cars and images of motorcycles; so, each labeled
example in your training set is either an image of a car or an image of a motorcycle.  
Where can we get lots of unlabeled data?  The easiest way would be to obtain some
random collection of images, perhaps downloaded off the internet.  We could then 
train the autoencoder on this large collection of images, and obtain useful features
from them.  Because here the unlabeled data is drawn from a different distribution
than the labeled data (i.e., perhaps some of our unlabeled images may contain
cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we
call this self-taught learning. 

In contrast, if we happen to have lots of unlabeled images lying around
that are all images of ''either'' a car or a motorcycle, but where the data
is just missing its label (so you don't know which ones are cars, and which
ones are motorcycles), then we could use this form of unlabeled data to
learn the features.  This setting---where each unlabeled example is drawn from the same
distribution as your labeled examples---is sometimes called the semi-supervised 
setting.  In practice, we often do not have this sort of unlabeled data (where would you
get a database of images where every image is either a car or a motorcycle, but
just missing its label?), and so in the context of learning features from unlabeled
data, the self-taught learning setting is more broadly applicable.


{{STL}}