Exercise:Sparse Autoencoder

==Sparse autoencoder implementation==

In this problem set, you will implement the sparse autoencoder
algorithm, and show how it discovers that edges are a good
representation for natural images. (Images provided by
Bruno Olshausen.) The sparse autoencoder algorithm is described in
the lecture notes found on the course website.

In the file [http://ufldl.stanford.edu/wiki/resources/sparseae_exercise.zip sparseae_exercise.zip], we have provided some starter code in
Matlab. You should write your code at the places indicated
in the files ("<tt>YOUR CODE HERE</tt>"). You have to complete the following files:
<tt>sampleIMAGES.m, sparseAutoencoderCost.m, computeNumericalGradient.m</tt>. 
The starter code in <tt>train.m</tt> shows how these functions are used.

Specifically, in this exercise you will implement a sparse autoencoder, 
trained with 8&times;8 image patches using the L-BFGS optimization algorithm.

'''A note on the software:''' The provided .zip file includes a subdirectory
<tt>minFunc</tt> with third-party software implementing L-BFGS, which 
is licensed under a Creative Commons Attribution-NonCommercial license.  
If you need an L-BFGS implementation for commercial purposes, you can 
download and use a different function (fminlbfgs) that serves the same
purpose, but runs ~3x slower for this exercise (and thus is less recommended). 
You can read more about this on the [[Fminlbfgs_Details]] page. 



===Step 1: Generate training set===

The first step is to generate a training set.   To get a single training 
example <math>x</math>, randomly pick one of the 10 images, then randomly sample 
an 8&times;8 image patch from the selected image, and convert the image patch (either 
in row-major order or column-major order; it doesn't matter) into a 64-dimensional 
vector to get a training example <math>x \in \Re^{64}.</math>

Complete the code in <tt>sampleIMAGES.m</tt>.  Your code should sample 10000 image 
patches and concatenate them into a 64&times;10000 matrix. 

To make sure your implementation is working, run the code in "Step 1" of <tt>train.m</tt>.
This should result in a plot of a random sample of 200 patches from the dataset. 

'''Implementational tip:''' When we run our implementation of
<tt>sampleIMAGES()</tt>, it takes under 5 seconds.  If your implementation
takes over 30 seconds, it may be because you are accidentally making a
copy of an entire 512&times;512 image each time you pick a random
image.  Copying a 512&times;512 image 10000 times is wasteful.  The cost
is tolerable for this exercise (because we have only 10000
examples), but when we scale to much larger problems later this quarter
with <math>10^6</math> or more examples, it will significantly slow down your
code.  Please implement <tt>sampleIMAGES</tt> so that you aren't making a
copy of an entire 512&times;512 image each time you need to cut out an 8&times;8
image patch.
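
The sampling procedure and the tip above can be sketched as follows. This is a minimal illustration, not the reference solution: it assumes the images are stored as a 512&times;512&times;10 array, uses a random placeholder array named <tt>IMAGES</tt> in place of the real data, and all variable names are our own.

```matlab
% Minimal sketch of patch sampling. A random array stands in for the real
% images; the 512x512x10 layout and all variable names are assumptions.
IMAGES = rand(512, 512, 10);          % placeholder data, not the real images
patchsize  = 8;
numpatches = 10000;

[rows, cols, numimages] = size(IMAGES);
patches = zeros(patchsize * patchsize, numpatches);   % preallocate 64 x 10000

for i = 1:numpatches
    k = randi(numimages);               % pick one of the images
    r = randi(rows - patchsize + 1);    % top-left row of the patch
    c = randi(cols - patchsize + 1);    % top-left column of the patch
    % Index the 3-D array directly so that only the 8x8 block is copied,
    % never the whole 512x512 image.
    patch = IMAGES(r:r+patchsize-1, c:c+patchsize-1, k);
    patches(:, i) = patch(:);           % column-major unrolling into 64 x 1
end
```

Preallocating <tt>patches</tt> and slicing the array in place keeps the work per patch proportional to the 64 copied pixels.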

===Step 2: Sparse autoencoder objective===

Implement code to compute the sparse autoencoder cost function <math>J_{\rm sparse}(W,b)</math> 
(Section 3 of the lecture notes)
and the corresponding derivatives of <math>J_{\rm sparse}</math> with respect to 
the different parameters.  Use the sigmoid function for the activation function, 
<math>f(z) = \frac{1}{{1+e^{-z}}}</math>. 
In particular, complete the code in <tt>sparseAutoencoderCost.m</tt>.

The sparse autoencoder is parameterized by matrices 
<math>W^{(1)} \in \Re^{s_2\times s_1}</math>,
<math>W^{(2)} \in \Re^{s_3\times s_2}</math> 
and vectors 
<math>b^{(1)} \in \Re^{s_2}</math>, 
<math>b^{(2)} \in \Re^{s_3}</math>.
However, for subsequent notational convenience, we will "unroll" all of these parameters
into a very long parameter vector <math>\theta</math> with <math>s_1s_2 + s_2s_3 + s_2 + s_3</math> elements.  The
code for converting between the <math>(W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)})</math> and the <math>\theta</math> parameterization 
is already provided in the starter code.
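
For intuition, the conversion between the two parameterizations might look like the sketch below. It assumes the convention <math>z = Wa + b</math>, so that each weight matrix has one row per unit in the layer it feeds into; the variable names and the ordering of the blocks inside <math>\theta</math> are assumptions and may differ from the starter code.

```matlab
% Sketch of the two parameterizations. Layer sizes follow this exercise
% (64-25-64); variable names and the ordering inside theta are assumptions.
s1 = 64; s2 = 25; s3 = 64;
W1 = rand(s2, s1);  W2 = rand(s3, s2);
b1 = rand(s2, 1);   b2 = rand(s3, 1);

% Unroll everything into one long column vector.
theta = [W1(:); W2(:); b1(:); b2(:)];

% Recover the matrices and vectors from theta.
n1 = s2 * s1;                        % number of entries in W1
n2 = s3 * s2;                        % number of entries in W2
W1r = reshape(theta(1:n1), s2, s1);
W2r = reshape(theta(n1+1 : n1+n2), s3, s2);
b1r = theta(n1+n2+1 : n1+n2+s2);
b2r = theta(n1+n2+s2+1 : end);
```

For the 64-25-64 network used here, <math>\theta</math> has 1600 + 1600 + 25 + 64 = 3289 elements.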

'''Implementational tip:''' The objective <math>J_{\rm sparse}(W,b)</math> contains 3 terms, corresponding
to the squared error term, the weight decay term, and the sparsity penalty.  You're welcome
to implement this however you want, but for ease of debugging,
you might implement the cost function and derivative computation (backpropagation) only for the 
squared error term first (this corresponds to setting <math>\lambda = \beta = 0</math>), and implement 
the gradient checking method in the next section to verify that this code is correct.  Only
after you have verified that the objective and derivative calculations corresponding to the squared error 
term are working should you add code to compute the weight decay and sparsity penalty terms and their corresponding derivatives. 
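
To make the three terms concrete, here is a sketch of the cost computation alone (no derivatives) on toy data. It assumes the form of the objective given in the lecture notes, with the sparsity penalty written as a sum of KL divergences between <math>\rho</math> and the average hidden activations. The network sizes, hyperparameter values, and variable names are illustrative assumptions.

```matlab
% Sketch of the three cost terms on toy data. All names, sizes, and
% hyperparameter values here are illustrative assumptions.
visibleSize = 4; hiddenSize = 3; m = 5;
data = rand(visibleSize, m);                  % one example per column
W1 = 0.1 * randn(hiddenSize, visibleSize);   b1 = zeros(hiddenSize, 1);
W2 = 0.1 * randn(visibleSize, hiddenSize);   b2 = zeros(visibleSize, 1);
lambda = 1e-4; beta = 3; rho = 0.01;

sigmoid = @(z) 1 ./ (1 + exp(-z));

% Forward pass over all examples at once.
a2 = sigmoid(W1 * data + repmat(b1, 1, m));   % hidden activations
a3 = sigmoid(W2 * a2   + repmat(b2, 1, m));   % reconstruction

% Term 1: average squared reconstruction error.
squaredError = sum(sum((a3 - data) .^ 2)) / (2 * m);

% Term 2: weight decay on the weights only (not the biases).
weightDecay = (lambda / 2) * (sum(W1(:) .^ 2) + sum(W2(:) .^ 2));

% Term 3: sparsity penalty, a KL divergence per hidden unit.
rhoHat = mean(a2, 2);                         % average activation per unit
kl = rho * log(rho ./ rhoHat) + (1 - rho) * log((1 - rho) ./ (1 - rhoHat));
sparsity = beta * sum(kl);

cost = squaredError + weightDecay + sparsity;
```

Setting <tt>lambda</tt> and <tt>beta</tt> to zero in this sketch leaves only the squared error term, which matches the debugging order suggested above.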

===Step 3: Gradient checking===

Following Section 2.3 of the lecture notes, implement code for gradient checking.  
Specifically, complete the code in <tt>computeNumericalGradient.m</tt>.  Please 
use <tt>EPSILON</tt> = 10<sup>-4</sup> as described in the lecture notes. 

We've also provided code in <tt>checkNumericalGradient.m</tt> for you to test your code. 
This code defines a simple quadratic function <math>h: \Re^2 \to \Re</math> given by 
<math>h(x) = x_1^2 + 3x_1 x_2</math>, and evaluates it at the point <math>x = (4, 10)^T</math>.  It allows you
to verify that your numerically evaluated gradient is very close to the true (analytically
computed) gradient.  
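
As an illustration of the idea, the sketch below applies a central-difference approximation to the quadratic <math>h</math> above and compares it with the analytic gradient <math>(2x_1 + 3x_2,\ 3x_1)^T</math>. It is a standalone sketch with our own variable names, not the code from <tt>checkNumericalGradient.m</tt>.

```matlab
% Sketch of a central-difference gradient check on h(x) = x1^2 + 3*x1*x2.
% Variable names are assumptions; this is not the starter code.
h = @(x) x(1)^2 + 3 * x(1) * x(2);
x = [4; 10];
EPSILON = 1e-4;

numgrad = zeros(size(x));
for i = 1:numel(x)
    e = zeros(size(x));
    e(i) = EPSILON;                      % perturb only the i-th coordinate
    numgrad(i) = (h(x + e) - h(x - e)) / (2 * EPSILON);
end

grad = [2 * x(1) + 3 * x(2); 3 * x(1)];  % analytic gradient, (38, 12) here
relErr = norm(numgrad - grad) / norm(numgrad + grad);
```

The same loop works for any function of a parameter vector, at the price of two function evaluations per parameter.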

After using <tt>checkNumericalGradient.m</tt> to make sure your implementation is correct, 
next use <tt>computeNumericalGradient.m</tt> to make sure that your <tt>sparseAutoencoderCost.m</tt>
is computing derivatives correctly.  For details, see Step 3 in <tt>train.m</tt>.  We strongly
encourage you not to proceed to the next step until you've verified that your derivative
computations are correct. 

'''Implementational tip:''' If you are debugging your code, performing gradient checking on smaller models 
and smaller training sets (e.g., using only 10 training examples and 1-2 hidden 
units) may speed things up.

===Step 4: Train the sparse autoencoder===

Now that you have code that computes 
<math>J_{\rm sparse}</math> and its derivatives, we're ready to minimize 
<math>J_{\rm sparse}</math> with respect to its parameters, and thereby train our
sparse autoencoder.

We will use the L-BFGS algorithm.  This is provided to you in a function called
<tt>minFunc</tt> (code provided by Mark Schmidt) included in the starter code.  (For the purpose of this
assignment, you only need to call minFunc with the default parameters. You do
not need to know how L-BFGS works.)  We have already provided code in <tt>train.m</tt>
(Step 4) to call <tt>minFunc</tt>.  The <tt>minFunc</tt> code assumes that the parameters
to be optimized are a long parameter vector; so we will use the "<math>\theta</math>" parameterization
rather than the "<math>(W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)})</math>" parameterization when passing our parameters
to it.

Train a sparse autoencoder with 64 input units, 25 hidden units, and 64 output units.
In our starter code, we have provided a function for initializing the parameters.
We initialize the biases <math>b^{(l)}_i</math> to zero, and the weights <math>W^{(l)}_{ij}</math>
to random numbers drawn uniformly from the interval 
<math>\left[-\sqrt{\frac{6}{n_{\rm in}+n_{\rm out}+1}},\sqrt{\frac{6}{n_{\rm in}+n_{\rm out}+1}}\,\right]</math>, where <math>n_{\rm in}</math> is the fan-in
(the number of inputs feeding into a node) and <math>n_{\rm out}</math> is the fan-out (the number of
units that a node feeds into).
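
A sketch of this initialization for the 64-25-64 network is shown below. The variable names are assumptions and the provided initialization function may be organized differently. Because <math>n_{\rm in} + n_{\rm out} = 89</math> for both weight matrices, a single interval half-width <tt>r</tt> serves for both.

```matlab
% Sketch of the weight initialization described above.
% Variable names are assumptions, not necessarily those of the starter code.
visibleSize = 64; hiddenSize = 25;

% Half-width of the uniform interval [-r, r].
r = sqrt(6 / (hiddenSize + visibleSize + 1));

% rand draws from (0, 1); scale and shift to cover (-r, r).
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

% Biases start at zero.
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
```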

The values we provided for the various parameters (<math>\lambda, \beta, \rho</math>, etc.)
should work, but feel free to play with different settings of the parameters as
well.

===Step 5: Visualization===

After training the autoencoder, use <tt>display_network.m</tt> to visualize the learned
weights.  (See <tt>train.m</tt>, Step 5.)  Run "<tt>print -djpeg weights.jpg</tt>" to save
the visualization to a file "<tt>weights.jpg</tt>" (which you will submit together with
your code). 

==Results==

To successfully complete this assignment, you should demonstrate your sparse
autoencoder algorithm learning a set of edge detectors.  For example, this
was the visualization we obtained: 


[[File:Gabor.jpg]]


Our implementation took around 10 minutes to run on a fast computer.
In case you end up needing to try out multiple implementations or 
different parameter values, be sure to budget enough time for debugging 
and to run the experiments you'll need. 

Also, by way of comparison, here are some visualizations from implementations
that we do not consider successful (either a buggy implementation, or where
the parameters were poorly tuned):


[[File:badfilter1.jpg|240 px]] [[File:badfilter2.jpg|240 px]] [[File:badfilter3.jpg|240 px]]

[[File:badfilter4.jpg|240 px]] [[File:badfilter5.jpg|240 px]] [[File:badfilter6.jpg|240 px]]


[[Category:Exercises]]