Sparse Coding: Autoencoder Interpretation

Observe that with our modified objective function, the objective given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>), is simply quadratic in <math>A</math>, and hence has an easily derivable analytic solution for <math>A</math>. A quick way to derive this solution is to use matrix calculus; some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step has to be carried out using gradient descent or similar optimization methods.
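For example, treating <math>x</math> and <math>s</math> as matrices whose columns correspond to the examples (and reading the norms above as sums of squares over all entries), setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives <math>2(As - x)s^T + 2\gamma A = 0</math>, which can be solved to obtain

<math>A = x s^T \left( s s^T + \gamma I \right)^{-1}.</math>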
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.
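As a rough illustration, the gradient of the terms that depend on <math>s</math> can be computed as in the sketch below (Python/NumPy, with hypothetical variable names). It assumes the L1 penalty has been replaced by the smooth approximation <math>\sqrt{s^2 + \epsilon}</math> used in the modified objective, and it drops the weight decay term <math>\gamma \lVert A \rVert_2^2</math>, which does not depend on <math>s</math>.

<pre>
import numpy as np

def objective_and_grad_s(A, s, x, lam, epsilon):
    # A: (n, k) basis matrix, s: (k, m) feature matrix, x: (n, m) patches.
    # Returns the s-dependent part of the objective and its gradient in s,
    # holding A fixed.
    residual = A @ s - x                    # reconstruction error As - x
    smooth_l1 = np.sqrt(s ** 2 + epsilon)   # smooth approximation to |s|
    J = np.sum(residual ** 2) + lam * np.sum(smooth_l1)
    grad_s = 2.0 * A.T @ residual + lam * s / smooth_l1
    return J, grad_s
</pre>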
== Topographic sparse coding ==

=== Batching examples into mini-batches ===
If you try running the simple iterative algorithm on a large dataset of, say, 10,000 patches in one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches. To do this, instead of running the algorithm on all 10,000 patches, in each iteration select a mini-batch, that is, a (different) random subset of 2,000 patches out of the 10,000, and run the algorithm on that mini-batch instead. This accomplishes two things: firstly, it speeds up each iteration, since each iteration now operates on 2,000 rather than 10,000 patches; secondly, and more importantly, it increases the rate of convergence ([[TODO]]: explain why).
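A minimal sketch of the mini-batch selection (Python/NumPy, hypothetical names, with the patches assumed to be stored one per column of a matrix):

<pre>
import numpy as np

def sample_minibatch(patches, batch_size=2000, rng=None):
    # patches: (n, num_patches) matrix, one patch per column.
    # Returns batch_size randomly chosen columns, drawn without replacement.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(patches.shape[1], size=batch_size, replace=False)
    return patches[:, idx]
</pre>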
=== Good initialization of <math>s</math> ===
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optimum is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:
<ol>
<li>Set <math>s \leftarrow W^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)
<li>For each feature in <math>s</math> (i.e. each row of <math>s</math>), divide the feature by the norm of the corresponding basis vector in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_r</math> is the <math>r</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_r \rVert }.</math>
</ol>

Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>Ws \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)
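In code, the two initialization steps might look like the following sketch (Python/NumPy, hypothetical names; here the weight matrix <math>W</math> is taken to be the basis matrix <math>A</math> from the objective above, with one basis vector per column):

<pre>
import numpy as np

def initialize_features(A, x):
    # A: (n, k) basis matrix, one basis vector per column.
    # x: (n, m) mini-batch of patches, one patch per column.
    s = A.T @ x                              # step 1: s <- A^T x, shape (k, m)
    basis_norms = np.linalg.norm(A, axis=0)  # ||A_r|| for each basis vector
    return s / basis_norms[:, np.newaxis]    # step 2: divide feature r by ||A_r||
</pre>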
=== The practical algorithm ===

  <ol>
    <li>Select a random mini-batch of 2000 patches
    <li>Initialize <math>s</math> as described above
    <li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
    <li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
  </ol>
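Putting the pieces together, one possible sketch of this loop is shown below (Python/NumPy with hypothetical names). It reuses the helper functions sketched earlier (sample_minibatch, initialize_features, objective_and_grad_s), uses L-BFGS from SciPy for the minimization over <math>s</math>, and uses the closed-form solution derived above for <math>A</math>; these are illustrative choices rather than the only way to implement the algorithm.

<pre>
import numpy as np
from scipy.optimize import minimize

def sparse_coding(patches, num_features, lam, gamma, epsilon,
                  num_iterations=200, batch_size=2000, rng=None):
    # Alternating minimization: initialize A randomly, then repeatedly
    # optimize s with A fixed and solve for A with s fixed.
    rng = np.random.default_rng() if rng is None else rng
    n = patches.shape[0]
    A = rng.standard_normal((n, num_features))           # random initial basis

    for _ in range(num_iterations):
        x = sample_minibatch(patches, batch_size, rng)   # step 1
        s = initialize_features(A, x)                    # step 2
        m = x.shape[1]

        # Step 3: minimize J over s (A fixed) with a gradient-based optimizer.
        def fun(s_flat):
            J, g = objective_and_grad_s(A, s_flat.reshape(num_features, m),
                                        x, lam, epsilon)
            return J, g.ravel()

        result = minimize(fun, s.ravel(), jac=True, method='L-BFGS-B',
                          options={'maxiter': 20})
        s = result.x.reshape(num_features, m)

        # Step 4: A has a closed form given s (see the derivation above).
        A = x @ s.T @ np.linalg.inv(s @ s.T + gamma * np.eye(num_features))

    return A
</pre>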
