Sparse Coding: Autoencoder Interpretation

== Sparse coding ==

In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) in order to re-represent the input data <math>x</math> in terms of the sparse features <math>\sigma(Wx + b)</math>.

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a variant of the sparse autoencoder method in which we try to learn the set of features for representing the data directly. Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Formally, in sparse coding we have some data <math>x</math> that we would like to learn features on. In particular, we would like to learn a sparse set of features <math>s</math> for representing the data, together with a basis <math>A</math> for transforming the features from the feature space back to the data space. Our objective function is hence:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1
</math>

(If the notation is unfamiliar, <math>\lVert x \rVert_k</math> is the L<math>k</math> norm of <math>x</math>, equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of the absolute values of the elements of the vector.)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse.

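To make the two terms concrete, here is a minimal sketch in Python/NumPy (not part of the original tutorial; the function name and the penalty weight <code>lam</code> are illustrative only) of how this objective might be evaluated:

<pre>
import numpy as np

def sparse_coding_objective(A, s, x, lam):
    # ||As - x||_2^2 : error in reconstructing x from the features s via the basis A
    reconstruction_error = np.sum((A @ s - x) ** 2)
    # lambda * ||s||_1 : sparsity penalty encouraging most entries of s to be zero
    sparsity_penalty = lam * np.sum(np.abs(s))
    return reconstruction_error + sparsity_penalty

# Example: a 64-dimensional data point represented with 121 features.
x = np.random.randn(64)
A = np.random.randn(64, 121)
s = np.zeros(121)
print(sparse_coding_objective(A, s, x, lam=0.1))
</pre>
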
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the reconstruction error. Hence, we include the additional constraint that for every column <math>A_j</math> of <math>A</math>, <math>A_j^TA_j \le 1</math>. Our problem is then:

:<math>
\begin{array}{rcl}
{\rm minimize}_{A,s} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\
\end{array}
</math>

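As a concrete check of the scaling issue described above (a worked identity, not from the original text): for any constant <math>c > 1</math>, replacing <math>A</math> by <math>cA</math> and <math>s</math> by <math>s/c</math> leaves the reconstruction term unchanged while shrinking the sparsity term,

:<math>
\lVert (cA)(s/c) - x \rVert_2^2 = \lVert As - x \rVert_2^2 , \qquad \lambda \lVert s/c \rVert_1 = \frac{\lambda}{c} \lVert s \rVert_1 ,
</math>

so letting <math>c</math> grow without bound would drive the penalty toward zero; the constraint on each column <math>A_j</math> rules this out.
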
Unfortunately, this objective function is not convex, and so it cannot be optimized directly using gradient-based methods. However, given <math>A</math>, the problem of finding the <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding the <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests alternately fixing one of <math>s</math> and <math>A</math> and optimizing over the other. It turns out that this strategy works very well in practice.

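For reference, the two convex subproblems mentioned above are, with <math>A</math> held fixed,

:<math>
\min_s \; \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 ,
</math>

and, with <math>s</math> held fixed (the sparsity term is then a constant and can be dropped),

:<math>
\min_A \; \lVert As - x \rVert_2^2 \quad {\rm s.t.} \;\; A_j^TA_j \le 1 \; \forall j .
</math>
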
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2
</math>

(Note that the third term, <math>\lVert A \rVert_2^2</math>, is equal to <math>\sum_r{\sum_c{A_{rc}^2}}</math>, the sum of the squares of the entries of <math>A</math>.)

This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x^2 + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x^2</math>, the value of <math>x^2 + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below.

Hence, the final objective function is:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2
</math>

(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)

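The smoothed penalty and its gradient are straightforward to write down; a minimal sketch in Python/NumPy (illustrative helpers, not part of the original tutorial) follows:

<pre>
import numpy as np

def smoothed_l1(s, epsilon=1e-2):
    # Smooth approximation of ||s||_1: sum_k sqrt(s_k^2 + epsilon).
    return np.sum(np.sqrt(s ** 2 + epsilon))

def smoothed_l1_grad(s, epsilon=1e-2):
    # Elementwise derivative s_k / sqrt(s_k^2 + epsilon); well defined at s_k = 0.
    return s / np.sqrt(s ** 2 + epsilon)
</pre>
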
This objective function can then be optimized iteratively, using the following procedure (a brief code sketch of the loop follows the list):

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence:
  <ol>
    <li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
    <li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
  </ol>
</ol>

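A minimal sketch of this alternating loop in Python/NumPy follows; the function name, step sizes, and iteration counts are illustrative assumptions, and the tuned procedure used in the exercise differs in detail:

<pre>
import numpy as np

def sparse_coding(X, n_features, lam=0.1, gamma=1e-2, epsilon=1e-2,
                  n_outer=50, n_inner=100, lr=1e-3):
    # X holds one example per column; S holds the corresponding feature vectors.
    n_dims, n_examples = X.shape
    A = np.random.randn(n_dims, n_features)      # step 1: random basis
    S = np.zeros((n_features, n_examples))
    for _ in range(n_outer):                     # step 2: repeat until convergence
        # 2a: minimize over S for fixed A, by gradient descent on the smoothed objective
        for _ in range(n_inner):
            grad_S = 2 * A.T @ (A @ S - X) + lam * S / np.sqrt(S ** 2 + epsilon)
            S -= lr * grad_S
        # 2b: minimize over A for fixed S; the objective is quadratic in A,
        #     with closed-form solution A = X S^T (S S^T + gamma I)^{-1}
        A = X @ S.T @ np.linalg.inv(S @ S.T + gamma * np.eye(n_features))
    return A, S
</pre>
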
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.

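For a single example <math>x</math>, one way the matrix-calculus derivation might go (a sketch, writing <math>I</math> for the identity matrix): setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives

:<math>
\nabla_A J(A; s) = 2(As - x)s^T + 2\gamma A = 0
\quad\Rightarrow\quad
A = x s^T \left( s s^T + \gamma I \right)^{-1} .
</math>

With a full data matrix <math>X</math> and feature matrix <math>S</math> (one example per column), the same calculation yields <math>A = XS^T(SS^T + \gamma I)^{-1}</math>.
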
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using the backpropagation idea | using the backpropagation intuition]] can be helpful.

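For reference when deriving those gradients, the gradient of the smoothed objective with respect to <math>s</math> (with <math>A</math> held fixed, and with the square root and division taken element-wise) works out to

:<math>
\nabla_s J(A, s) = 2A^T(As - x) + \lambda \frac{s}{\sqrt{s^2 + \epsilon}} ,
</math>

which is what the gradient-descent step for <math>s</math> uses.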