Sparse Coding: Autoencoder Interpretation

== Sparse coding ==
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>.  

In sparse coding, we instead learn a set of basis vectors (the columns of a matrix <math>A</math>) and a feature vector <math>s</math> such that <math>As \approx x</math>; an L1 penalty encourages <math>s</math> to be sparse, and a penalty on the entries of <math>A</math> rules out the degenerate solution in which the entries of <math>A</math> grow arbitrarily large while <math>s</math> shrinks. This gives the objective:
:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2
</math>
(note that the third term, <math>\lVert A \rVert_2^2</math>, is the sum of squares of the entries of <math>A</math>, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)

This objective function presents one last problem: the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using methods that do not rely on gradient descent, we will instead "smooth out" the L1 norm using an approximation that allows us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x^2 + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x^2</math>, the value of <math>x^2 + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below.
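
To see the effect of the smoothing numerically, here is a small sketch in Python/NumPy (not part of the original tutorial; the sample values and choices of <math>\epsilon</math> are arbitrary) comparing <math>\left| x \right|</math> with <math>\sqrt{x^2 + \epsilon}</math>:

<pre>
import numpy as np

# Compare the exact L1 term |x| with the smooth approximation sqrt(x^2 + epsilon).
x = np.linspace(-1.0, 1.0, 5)                 # a few illustrative activation values
for epsilon in (1e-6, 1e-2, 1.0):             # arbitrary smoothing parameters
    smoothed = np.sqrt(x ** 2 + epsilon)      # smoothed absolute value
    print("epsilon =", epsilon)
    print("  |x|      :", np.abs(x))
    print("  smoothed :", smoothed)
# For small epsilon the two nearly coincide; for large epsilon the smoothed
# value flattens toward sqrt(epsilon), which is the "sparsity parameter"
# interpretation described above.
</pre>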
 
Our final objective function is hence:
 
:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2
</math>
 
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)
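
As an illustration, the sketch below (Python/NumPy; the dimensions and the values of <math>\lambda</math>, <math>\gamma</math> and <math>\epsilon</math> are arbitrary assumptions, not values prescribed by the tutorial) evaluates this objective for a given basis matrix <math>A</math>, feature vector <math>s</math> and input <math>x</math>:

<pre>
import numpy as np

def sparse_coding_objective(A, s, x, lam, gamma, epsilon):
    """J(A, s) = ||As - x||_2^2 + lambda * sum_k sqrt(s_k^2 + eps) + gamma * ||A||_2^2."""
    reconstruction = np.sum((A @ s - x) ** 2)            # squared reconstruction error
    sparsity = lam * np.sum(np.sqrt(s ** 2 + epsilon))   # smoothed L1 penalty on s
    weight_decay = gamma * np.sum(A ** 2)                # sum of squares of entries of A
    return reconstruction + sparsity + weight_decay

# Arbitrary example sizes: 64-dimensional inputs, 121 basis vectors.
rng = np.random.default_rng(0)
A = rng.normal(size=(64, 121))
s = rng.normal(size=121)
x = rng.normal(size=64)
print(sparse_coding_objective(A, s, x, lam=0.1, gamma=0.01, epsilon=0.01))
</pre>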
 
Optimizing for this objective function will yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder.
 
== Topographic sparse coding ==
 
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors could detect edges oriented slightly off the horizontal, and, moving further along the hypercolumn, the neurons detect edges oriented further and further off the horizontal.
 
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent.
 
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to be similar. The way this is accomplished is to group adjacent features together in the smoothed L1 penalty: if we group features in 3x3 regions, then instead of, say, <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use, say, <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{2,3}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math>. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.
 
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:
 
:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2
</math>
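
To make the grouping concrete, the following Python/NumPy sketch (illustrative only; the 11x11 grid size and the value of <math>\epsilon</math> are arbitrary assumptions) lays the features out in a square, forms every overlapping 3x3 group with wrap-around, and sums the smoothed L1 penalty over the groups:

<pre>
import numpy as np

def topographic_penalty(s, grid_size, epsilon, group_size=3):
    """Sum over all overlapping, wrap-around groups of sqrt(sum of s^2 in group + epsilon)."""
    S = s.reshape(grid_size, grid_size)          # features arranged in a square
    total = 0.0
    for r in range(grid_size):                   # one group per starting position
        for c in range(grid_size):
            rows = [(r + dr) % grid_size for dr in range(group_size)]
            cols = [(c + dc) % grid_size for dc in range(group_size)]
            group = S[np.ix_(rows, cols)]        # 3x3 block, wrapping around the edges
            total += np.sqrt(np.sum(group ** 2) + epsilon)
    return total

s = np.random.default_rng(0).normal(size=121)    # 11x11 grid of features
print(topographic_penalty(s, grid_size=11, epsilon=0.01))
</pre>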
 
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group: <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>, and <math>V_{r, c} = 0</math> otherwise. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:
 
:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{V s^2 + \epsilon} } + \gamma \lVert A \rVert_2^2
</math>
 
(where <math>s^2</math> denotes the element-wise square of <math>s</math>, so that <math>V s^2</math> is a vector with one entry per group, namely <math>\sum_{c \in g}{s_c^2}</math> for group <math>g</math>; the square root is taken element-wise, and <math>\sum{ \sqrt{V s^2 + \epsilon} }</math> is <math>\sum_r{ D_r }</math> if we let <math>D = \sqrt{V s^2 + \epsilon}</math>)
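
Continuing the sketch above, one way to build such a grouping matrix and compute the penalty in matrix form is shown below (Python/NumPy, with the same arbitrary 11x11 grid); it should agree with the direct group-by-group computation:

<pre>
import numpy as np

def grouping_matrix(grid_size, group_size=3):
    """V[r, c] = 1 if group r contains feature c (overlapping, wrap-around 3x3 groups)."""
    k = grid_size * grid_size
    V = np.zeros((k, k))                              # one group per feature position
    for r in range(grid_size):
        for c in range(grid_size):
            g = r * grid_size + c                     # index of this group
            for dr in range(group_size):
                for dc in range(group_size):
                    feature = ((r + dr) % grid_size) * grid_size + (c + dc) % grid_size
                    V[g, feature] = 1.0
    return V

grid_size, epsilon = 11, 0.01
s = np.random.default_rng(0).normal(size=grid_size * grid_size)
V = grouping_matrix(grid_size)
penalty = np.sum(np.sqrt(V @ (s ** 2) + epsilon))     # matches the group-by-group sum
print(penalty)
</pre>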
 
This objective function can be optimized using the iterative method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.
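
The earlier section's procedure is not reproduced here; as a rough sketch under simplifying assumptions (alternating plain gradient steps on the non-topographic smoothed objective, with arbitrary learning rate, iteration counts, and dimensions), the alternation between updating <math>s</math> with <math>A</math> fixed and updating <math>A</math> with <math>s</math> fixed might look like this:

<pre>
import numpy as np

def alternating_sparse_coding(x, n_features=121, lam=0.1, gamma=0.01,
                              epsilon=0.01, lr=1e-3, n_outer=50, n_inner=20):
    """Toy alternating gradient descent on the (non-topographic) smoothed objective.
    Illustrative only; hyperparameters and update rule are arbitrary choices."""
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(x.size, n_features))
    s = np.zeros(n_features)
    for _ in range(n_outer):
        for _ in range(n_inner):                      # update features s, holding A fixed
            residual = A @ s - x
            grad_s = 2 * A.T @ residual + lam * s / np.sqrt(s ** 2 + epsilon)
            s -= lr * grad_s
        for _ in range(n_inner):                      # update basis A, holding s fixed
            residual = A @ s - x
            grad_A = 2 * np.outer(residual, s) + 2 * gamma * A
            A -= lr * grad_A
    return A, s

A, s = alternating_sparse_coding(np.random.default_rng(1).normal(size=64))
</pre>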
