Autoencoders and Sparsity
From Ufldl
\end{align}</math>
where <math>\textstyle \rho</math> is a '''sparsity parameter''', typically a small value close to zero (for example <math>\textstyle \rho = 0.05</math>). In other words, we would like the average activation of each hidden neuron <math>\textstyle j</math> to be close to 0.05. To satisfy this constraint, the hidden unit's activations must mostly be near 0.
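The average activation constraint can be illustrated with a short NumPy sketch. The weights, shapes, and the use of sigmoid hidden units here are illustrative assumptions, not values from these notes: the point is only that <math>\textstyle \hat\rho_j</math> is the mean of hidden unit <math>\textstyle j</math>'s activation over the training set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small example: W1, b1 are the hidden layer's weights and bias,
# and X holds one training example per column.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))   # 3 hidden units, 4 inputs
b1 = np.zeros((3, 1))
X = rng.normal(size=(4, 100))             # 100 training examples

A = sigmoid(W1 @ X + b1)    # hidden activations, shape (3, 100)
rho_hat = A.mean(axis=1)    # average activation of each hidden unit
print(rho_hat.shape)        # (3,)
```

Sparsity then amounts to pushing each entry of `rho_hat` toward the target value <math>\textstyle \rho</math>.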
To achieve this, we will add an extra penalty term to our optimization objective that penalizes <math>\textstyle \hat\rho_j</math> deviating significantly from <math>\textstyle \rho</math>. Many choices of the penalty term give reasonable results. We will choose the following:
:<math>\begin{align}
\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.
\end{align}</math>
Here, <math>\textstyle s_2</math> is the number of neurons in the hidden layer, and the index <math>\textstyle j</math> sums over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written
:<math>\begin{align}
\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),
\end{align}</math>
where <math>\textstyle {\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}</math> is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean <math>\textstyle \rho</math> and a Bernoulli random variable with mean <math>\textstyle \hat\rho_j</math>. KL-divergence is a standard function for measuring how different two distributions are. (If you've not seen KL-divergence before, don't worry; everything you need to know about it is contained in these notes.)
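The penalty above can be sketched directly in NumPy. The function name and the use of NumPy are illustrative choices, not part of the original notes; the body is a term-by-term transcription of the sum of KL divergences:

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """Sum over hidden units j of KL(rho || rho_hat_j), the KL divergence
    between Bernoulli(rho) and Bernoulli(rho_hat_j)."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# The penalty is zero exactly when every average activation equals rho,
# and grows as the rho_hat_j move away from rho.
print(kl_sparsity_penalty(0.05, [0.05, 0.05]))    # 0.0
print(kl_sparsity_penalty(0.05, [0.2, 0.5]) > 0)  # True
```

Note that each term blows up as <math>\textstyle \hat\rho_j</math> approaches 0 or 1, which is what forces the optimizer to keep the average activations strictly inside the interval and close to <math>\textstyle \rho</math>.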