Self-Taught Learning
From Ufldl
(→Learning features) |
|||
Line 154: | Line 154: | ||
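'''[First review]''' | '''[First review]''' | ||
Finally, a supervised learning algorithm (for example an SVM or logistic regression) can be trained to obtain a decision function that predicts <math>\textstyle y</math>. Prediction works as follows: given a test example <math>\textstyle x_{\rm test}</math>, repeat the earlier procedure and feed it into the sparse autoencoder to obtain <math>\textstyle a_{\rm test}</math>. Then feed either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> into the trained classifier to obtain the prediction. | Finally, a supervised learning algorithm (for example an SVM or logistic regression) can be trained to obtain a decision function that predicts <math>\textstyle y</math>. Prediction works as follows: given a test example <math>\textstyle x_{\rm test}</math>, repeat the earlier procedure and feed it into the sparse autoencoder to obtain <math>\textstyle a_{\rm test}</math>. Then feed either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> into the trained classifier to obtain the prediction. | ||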
+ | |||
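The prediction step described above can be sketched as follows. This is only an illustration, assuming the encoder computes sigmoid activations and the classifier is a binary logistic regression; the names `encode`, `predict`, `W1`, `b1` and `theta` are hypothetical, and the random values merely stand in for parameters that would come from training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(W1, b1, x):
    """Hidden-layer activations a = sigmoid(W1 x + b1) of the autoencoder."""
    return sigmoid(W1 @ x + b1)

def predict(theta, W1, b1, x_test, concatenate=False):
    """Classify x_test from a_test, or from (x_test, a_test) if concatenate=True."""
    a_test = encode(W1, b1, x_test)
    features = np.concatenate([x_test, a_test]) if concatenate else a_test
    # Hypothetical logistic-regression classifier with weight vector theta.
    return int(sigmoid(theta @ features) >= 0.5)

# Random stand-ins for trained parameters (assumption: these would be learned).
rng = np.random.default_rng(0)
n_input, n_hidden = 4, 3
W1 = rng.normal(size=(n_hidden, n_input))
b1 = rng.normal(size=n_hidden)
x_test = rng.normal(size=n_input)

theta_a = rng.normal(size=n_hidden)             # classifier trained on a
theta_xa = rng.normal(size=n_input + n_hidden)  # classifier trained on (x, a)

y_from_a = predict(theta_a, W1, b1, x_test)
y_from_xa = predict(theta_xa, W1, b1, x_test, concatenate=True)
```

Whichever feature representation the classifier was trained on, `a` alone or `(x, a)`, is the one that must be supplied at test time.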
+ | |||
+ | == On pre-processing the data == | ||
+ | '''[First translation]''' | ||
+ | About data pre-processing | ||
+ | '''[First review]''' | ||
+ | Data pre-processing | ||
+ | |||
+ | |||
+ | '''[Original]''' | ||
+ | During the feature learning stage where we were learning from the unlabeled training set | ||
+ | <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed | ||
+ | various pre-processing parameters. For example, one may have computed | ||
+ | a mean value of the data and subtracted off this mean to perform mean normalization, | ||
+ | or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used | ||
+ | PCA whitening or ZCA whitening). If this is the case, then it is important to | ||
+ | save away these preprocessing parameters, and to use the ''same'' parameters | ||
+ | during the labeled training phase and the test phase, so as to make sure | ||
+ | we are always transforming the data the same way to feed into the autoencoder. | ||
+ | In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA, | ||
+ | we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the | ||
+ | labeled examples and the test data. We should '''not''' re-estimate a | ||
+ | different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the | ||
+ | labeled training set, since that might result in a dramatically different | ||
+ | pre-processing transformation, which would make the input distribution to | ||
+ | the autoencoder very different from what it was actually trained on. | ||
+ | |||
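+ | '''[First translation]''' | ||
+ | During the feature learning stage we learn from the unlabeled sample set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, and in doing so we may have computed a number of pre-processing parameters. For example, we may have computed the mean of the data and subtracted it (mean normalization), or used principal component analysis to compute a matrix <math>\textstyle U</math> and represented the data as <math>\textstyle U^Tx</math> (or used PCA whitening or ZCA whitening). In that case it is important to save the pre-processing parameters and to use the same parameters in the labeled training stage and the test stage. This guarantees that the data is always transformed in the same way before it enters the autoencoder. In particular, if the matrix <math>\textstyle U</math> was obtained from the unlabeled data with PCA, we must keep that same matrix and use it to pre-process both the labeled examples and the test data; we must not re-estimate a different <math>\textstyle U</math> (or a different mean for mean normalization, and so on) from the labeled training examples. The reason is that doing so could yield a markedly different pre-processing transformation, which would make the input distribution of the autoencoder very different from the one it was trained on. | ||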
+ | |||
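+ | '''[First review]''' | ||
+ | During the feature learning stage we learn from the unlabeled training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, and in the process we may compute various data pre-processing parameters. For example, we may compute the data mean and apply mean normalization, or run principal component analysis (PCA) on the raw data and represent it as <math>\textstyle U^Tx</math> (or apply PCA whitening or ZCA whitening). If so, these parameters need to be saved, and the same parameters used in the later training and test stages, so that the data undergoes the same transformation before entering the sparse autoencoder. For example, if PCA is run on the unlabeled data set, the resulting matrix <math>\textstyle U</math> must be saved and applied to the labeled training set and the test set; a different matrix <math>\textstyle U</math> must not be re-estimated from the labeled training set (nor may the mean be recomputed for mean normalization). Otherwise the pre-processing could be completely inconsistent, and the distribution of the data entering the autoencoder would differ sharply from what it saw during training. | ||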
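A minimal sketch of this rule, assuming a per-feature mean, plain PCA without whitening, and data matrices whose columns are examples. The helper names `fit_preprocessing` and `apply_preprocessing`, the choice of `k`, and the synthetic data are all hypothetical and serve only to show the parameters being estimated once and then reused.

```python
import numpy as np

def fit_preprocessing(x_unlabeled, k):
    """Estimate the mean and the top-k PCA basis from the unlabeled data only."""
    mean = x_unlabeled.mean(axis=1, keepdims=True)
    centered = x_unlabeled - mean
    sigma = centered @ centered.T / centered.shape[1]  # covariance matrix
    U, _, _ = np.linalg.svd(sigma)                     # columns are eigenvectors
    return {"mean": mean, "U": U[:, :k]}

def apply_preprocessing(params, x):
    """Apply the saved transformation U^T (x - mean); nothing is re-estimated."""
    return params["U"].T @ (x - params["mean"])

# Synthetic stand-in data; the labeled and test sets are deliberately shifted.
rng = np.random.default_rng(0)
x_u = rng.normal(size=(5, 200))              # unlabeled set
x_l = rng.normal(loc=3.0, size=(5, 20))      # labeled training set
x_test = rng.normal(loc=3.0, size=(5, 10))   # test set

params = fit_preprocessing(x_u, k=3)         # computed once, then saved
z_u = apply_preprocessing(params, x_u)       # input for autoencoder training
z_l = apply_preprocessing(params, x_l)       # same mean and U, not re-fit
z_test = apply_preprocessing(params, x_test)
```

Calling `fit_preprocessing` again on `x_l` would produce a different mean and basis, which is exactly the mismatch the paragraph warns against.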
+ | |||
+ | |||
+ | == On the terminology of unsupervised feature learning == | ||
+ | '''[First translation]''' | ||
+ | About the terminology of unsupervised feature learning | ||
+ | '''[First review]''' | ||
+ | Unsupervised feature learning terminology | ||
+ | |||
+ | |||
+ | '''[Original]''' | ||
+ | There are two common unsupervised feature learning settings, depending on what type of | ||
+ | unlabeled data you have. The more general and powerful setting is the '''self-taught learning''' | ||
+ | setting, which does not assume that your unlabeled data <math>x_u</math> has to | ||
+ | be drawn from the same distribution as your labeled data <math>x_l</math>. The | ||
+ | more restrictive setting where the unlabeled data comes from exactly the same | ||
+ | distribution as the labeled data is sometimes called the '''semi-supervised learning''' | ||
+ | setting. This distinction is best explained with an example, which we now give. | ||
+ | |||
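+ | '''[First translation]''' | ||
+ | There are two common unsupervised feature learning settings, and they differ in the kind of unlabeled data you have. The more general and more powerful one is the self-taught learning setting, which does not assume that the unlabeled data <math>x_u</math> has the same distribution as the labeled data <math>x_l</math>. The other, more restrictive setting, in which the unlabeled data has exactly the same distribution as the labeled data, is called the semi-supervised learning setting. We now explain the difference with an example. | ||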
+ | |||
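+ | '''[First review]''' | ||
+ | There are two common approaches to unsupervised feature learning, distinguished by the kind of unlabeled data available. Self-taught learning is the more general and more powerful of the two; it does not require the unlabeled data <math>x_u</math> and the labeled data <math>x_l</math> to come from the same distribution. The other, more restrictive approach is known as semi-supervised learning; it requires <math>x_u</math> and <math>x_l</math> to follow the same distribution. The example below illustrates the difference. | ||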
+ | |||
+ | |||
+ | '''[Original]''' | ||
+ | Suppose your goal is a computer vision task where you'd like | ||
+ | to distinguish between images of cars and images of motorcycles; so, each labeled | ||
+ | example in your training set is either an image of a car or an image of a motorcycle. | ||
+ | Where can we get lots of unlabeled data? The easiest way would be to obtain some | ||
+ | random collection of images, perhaps downloaded off the internet. We could then | ||
+ | train the autoencoder on this large collection of images, and obtain useful features | ||
+ | from them. Because here the unlabeled data is drawn from a different distribution | ||
+ | than the labeled data (i.e., perhaps some of our unlabeled images may contain | ||
+ | cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we | ||
+ | call this self-taught learning. | ||
+ | |||
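+ | '''[First translation]''' | ||
+ | Suppose your goal is to distinguish images of cars from images of motorcycles; that is, every labeled example in the training set is an image of either a car or a motorcycle. Where can we get a large amount of unlabeled data? The simplest way is to obtain a random collection of images, perhaps downloaded from the internet. This large collection can then be used to train the autoencoder and obtain useful features. Because the unlabeled data has a different distribution from the labeled data (some unlabeled images may contain cars or motorcycles, but not every downloaded image is a car or a motorcycle), we call this self-taught learning. | ||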
+ | |||
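+ | '''[First review]''' | ||
+ | Suppose we have a computer vision task whose goal is to distinguish images of cars from images of motorcycles; that is, every training example is an image of either a car or a motorcycle. Where can we obtain a large amount of unlabeled data? The simplest way may be to download a random collection of images from the internet, train a sparse autoencoder on this data, and obtain useful features from it. In this example the unlabeled data comes from a different distribution than the labeled data (some images in the unlabeled set may contain cars or motorcycles, but not all of them do). This situation is called self-taught learning. | ||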
+ | |||
+ | |||
+ | '''[Original]''' | ||
+ | In contrast, if we happen to have lots of unlabeled images lying around | ||
+ | that are all images of ''either'' a car or a motorcycle, but where the data | ||
+ | is just missing its label (so you don't know which ones are cars, and which | ||
+ | ones are motorcycles), then we could use this form of unlabeled data to | ||
+ | learn the features. This setting---where each unlabeled example is drawn from the same | ||
+ | distribution as your labeled examples---is sometimes called the semi-supervised | ||
+ | setting. In practice, we often do not have this sort of unlabeled data (where would you | ||
+ | get a database of images where every image is either a car or a motorcycle, but | ||
+ | just missing its label?), and so in the context of learning features from unlabeled | ||
+ | data, the self-taught learning setting is more broadly applicable. | ||
+ | |||
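+ | '''[First translation]''' | ||
+ | In contrast, if we happen to have a large number of images that are all either cars or motorcycles and merely lack labels (you do not know which are cars and which are motorcycles), we can use this unlabeled data to learn features. This setting, in which every unlabeled example has the same distribution as the labeled examples, is sometimes called semi-supervised learning. In practice we often do not have this kind of unlabeled data (where would you get an image database in which every image is a car or a motorcycle but the labels are missing?). Therefore, for learning features from unlabeled data, the self-taught learning setting is more broadly applicable. | ||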
+ | |||
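+ | '''[First review]''' | ||
+ | In contrast, suppose we have a large amount of unlabeled image data in which every image is either a car or a motorcycle and only the labels are missing (nothing marks whether each image is a car or a motorcycle). This unlabeled data can also be used to learn features. This approach, which requires the unlabeled examples and the labeled examples to follow the same distribution, is sometimes called semi-supervised learning. In practice it is often impossible to find unlabeled data that meets this requirement (where would one find an image database in which every image is either a car or a motorcycle but the labels are missing?). For learning features from unlabeled data, self-taught learning is therefore more broadly applicable. | ||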
+ | |||
+ | {{STL}} |