Self-Taught Learning

From Ufldl

'''Self-Taught Learning'''

'''[First draft]''': 自我学习 (by '''@Call_Me_Zero''')

'''[First review]''': 自学习 (by '''@晓风_机器学习''')

== Overview ==

'''[First draft]''': Overview

'''[First review]''': Overview

'''Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable ways to get better performance is to give the algorithm more data. This has led to the aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins; it's who has the most data."'''

'''[First draft]''': Suppose we have a sufficiently powerful machine learning algorithm. Then one of the most reliable ways to obtain better performance is to give the learning algorithm more data. There is a saying in machine learning: sometimes what wins is not the best algorithm, but the most data.

'''[First review]''': If we already have a sufficiently powerful machine learning algorithm, one of the most reliable ways to obtain better performance is to feed the algorithm more data. The machine learning community even has a saying: "what often wins is not the best algorithm, but the most data."

'''One can always try to get more labeled data, but this can be expensive. In particular, researchers have already gone to extraordinary lengths to use tools such as AMT (Amazon Mechanical Turk) to get large training sets. While having large numbers of people hand-label lots of data is probably a step forward compared to having large numbers of researchers hand-engineer features, it would be nice to do better. In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from unlabeled data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former---for example, by downloading random unlabeled images/audio clips/text documents off the internet---and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.'''

'''[First draft]''': One can always try to obtain more labeled data, but doing so is very costly. In a typical scenario, researchers spend a long time using tools such as AMT (Amazon Mechanical Turk) in order to obtain large training sets. Although hand-labeling large amounts of data with a large workforce is already a step forward compared with hand-engineering features, we can do better. Self-taught learning and unsupervised feature learning make this possible: if we have algorithms that can learn from unlabeled data, then such data can be obtained easily and learned from in large quantities. This works even though a single unlabeled example carries much less information than a labeled one. If we can obtain a large collection of unlabeled examples (for example, by randomly downloading unlabeled images, audio clips, or text documents from the internet), and our algorithms can mine this unlabeled data effectively, then we will obtain better performance than with massive hand-engineering and hand-labeling.

'''[First review]''': For many problems, one can always try to obtain more labeled data, but the cost is often high. Typically, researchers have already spent considerable effort on tools such as AMT (Amazon Mechanical Turk, an internet-based crowdsourcing marketplace) in order to obtain larger training sets. Compared with having many researchers hand-engineer features, crowdsourcing the hand-labeling of data is a step forward, and one hopes to do even better. In particular, self-taught learning and unsupervised feature learning promise that if an algorithm can learn from unlabeled data, then large amounts of such data can easily be obtained for the algorithm to learn from. Although a single unlabeled example carries less information than a labeled one, if unlabeled data can be obtained in large quantities (for example random unlabeled images, audio clips, or text downloaded from the internet) and the algorithm can exploit it effectively, the final performance will be better than with large-scale hand-engineered features and hand-labeled data.
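The pipeline described above (learn a feature representation from plentiful unlabeled data, then train a supervised classifier on those features using a small labeled set) can be sketched in code. This is only a minimal illustration on synthetic data: it uses truncated SVD as a stand-in for the sparse autoencoder developed elsewhere in the UFLDL tutorial, and all variable names and constants below are assumptions for the example, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. "Unlabeled" data: plentiful, drawn from the input distribution.
#    Synthetic here: 20-dim inputs generated from 2 hidden factors.
latent = rng.normal(size=(1000, 2))           # hidden factors
mixing = rng.normal(size=(2, 20))             # factors -> 20-dim inputs
X_unlabeled = latent @ mixing + 0.1 * rng.normal(size=(1000, 20))

# 2. Unsupervised feature learning on the unlabeled data.
#    (SVD stands in for the sparse autoencoder used in the tutorial.)
mean = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
W = Vt[:2].T                                  # learned 20x2 feature map

def features(X):
    """Replace raw inputs with the learned representation."""
    return (X - mean) @ W

# 3. A small labeled set; labels depend on the first hidden factor.
latent_l = rng.normal(size=(100, 2))
X_labeled = latent_l @ mixing + 0.1 * rng.normal(size=(100, 20))
y = (latent_l[:, 0] > 0).astype(float)

# 4. Supervised learning on the learned features: logistic regression
#    trained by plain gradient descent.
Z = features(X_labeled)
w = np.zeros(Z.shape[1])
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))        # predicted probabilities
    w -= 0.1 * Z.T @ (p - y) / len(y)         # gradient step

accuracy = np.mean(((Z @ w) > 0) == (y > 0.5))
```

Because the features are learned only from unlabeled data, the small labeled set is needed just for the final classifier, which is the point of the passage: unlabeled examples are individually less informative but far easier to collect.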

Revision as of 02:52, 9 March 2013
