http://deeplearning.stanford.edu/wiki/index.php?title=Special:Contributions/Cyfoo&feed=atom&limit=50&target=Cyfoo&year=&month=Ufldl - User contributions [en]2024-03-29T10:10:53ZFrom UfldlMediaWiki 1.16.2http://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-19T17:44:47Z<p>Cyfoo: </p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the ICA cost function:<br />
<tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4.<br />
<br />
When you have implemented the cost function, you should check the gradients numerically.<br />
<br />
'''Hint''' - if you are having difficulties deriving the gradients, you may wish to consult the page on [[deriving gradients using the backpropagation idea]].<br />
<br />
==== Step 4: Optimization ====<br />
<br />
In step 4, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (the code for which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation | sparse coding]]) may be preferable.<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-18T21:26:22Z<p>Cyfoo: /* Step 4a: Orthonormal ICA */</p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''', '''<tt>ReconstructionICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the two ICA cost functions:<br />
<ol><br />
<li><tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4a.<br />
<li><tt>reconstructionICACost</tt> in <tt>reconstructionICACost.m</tt>, which computes the cost and gradient for the reconstruction ICA objective.<br />
</ol><br />
<br />
When you have implemented the cost functions, you should check the gradients numerically.<br />
<br />
'''Hint''' - if you are having difficulties deriving the gradients, you may wish to consult the page on [[deriving gradients using the backpropagation idea]].<br />
<br />
=== Step 4: Optimization ===<br />
<br />
This step is broken down into two substeps. In each substep, you will (separately) optimize for one of the two ICA objective functions.<br />
<br />
==== Step 4a: Orthonormal ICA ====<br />
<br />
In step 4a, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (the code for which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, reconstruction ICA or other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation | sparse coding]]) may be preferable.<br />
<br />
==== Step 4b: Reconstruction ICA ====<br />
<br />
In step 4b, you will optimize for the reconstruction ICA objective using <tt>minFunc</tt>. Code has already been provided to do so, so all that is left is for you to run the code. 600 iterations of <tt>minFunc</tt> should take 20-25 minutes, and produce a basis which looks like the following:<br />
<br />
[[File:ReconstructionICAFeatures.png | 350px]]<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-18T21:25:42Z<p>Cyfoo: /* Step 4a: Orthonormal ICA */</p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''', '''<tt>ReconstructionICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the two ICA cost functions:<br />
<ol><br />
<li><tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4a.<br />
<li><tt>reconstructionICACost</tt> in <tt>reconstructionICACost.m</tt>, which computes the cost and gradient for the reconstruction ICA objective.<br />
</ol><br />
<br />
When you have implemented the cost functions, you should check the gradients numerically.<br />
<br />
'''Hint''' - if you are having difficulties deriving the gradients, you may wish to consult the page on [[deriving gradients using the backpropagation idea]].<br />
<br />
=== Step 4: Optimization ===<br />
<br />
This step is broken down into two substeps. In each substep, you will (separately) optimize for one of the two ICA objective functions.<br />
<br />
==== Step 4a: Orthonormal ICA ====<br />
<br />
In step 4a, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (the code for which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, reconstruction ICA or other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation]] sparse coding) may be preferable.<br />
<br />
==== Step 4b: Reconstruction ICA ====<br />
<br />
In step 4b, you will optimize for the reconstruction ICA objective using <tt>minFunc</tt>. Code has already been provided to do so, so all that is left is for you to run the code. 600 iterations of <tt>minFunc</tt> should take 20-25 minutes, and produce a basis which looks like the following:<br />
<br />
[[File:ReconstructionICAFeatures.png | 350px]]<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-18T21:24:49Z<p>Cyfoo: /* Step 3: Implement and check ICA cost functions */</p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''', '''<tt>ReconstructionICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the two ICA cost functions:<br />
<ol><br />
<li><tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4a.<br />
<li><tt>reconstructionICACost</tt> in <tt>reconstructionICACost.m</tt>, which computes the cost and gradient for the reconstruction ICA objective.<br />
</ol><br />
<br />
When you have implemented the cost functions, you should check the gradients numerically.<br />
<br />
'''Hint''' - if you are having difficulties deriving the gradients, you may wish to consult the page on [[deriving gradients using the backpropagation idea]].<br />
<br />
=== Step 4: Optimization ===<br />
<br />
This step is broken down into two substeps. In each substep, you will (separately) optimize for one of the two ICA objective functions.<br />
<br />
==== Step 4a: Orthonormal ICA ====<br />
<br />
In step 4a, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, reconstruction ICA or other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation]] sparse coding) may be preferable.<br />
<br />
==== Step 4b: Reconstruction ICA ====<br />
<br />
In step 4b, you will optimize for the reconstruction ICA objective using <tt>minFunc</tt>. Code has already been provided to do so, so all that is left is for you to run the code. 600 iterations of <tt>minFunc</tt> should take 20-25 minutes, and produce a basis which looks like the following:<br />
<br />
[[File:ReconstructionICAFeatures.png | 350px]]<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-18T21:24:32Z<p>Cyfoo: /* Step 3: Implement and check ICA cost functions */</p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''', '''<tt>ReconstructionICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the two ICA cost functions:<br />
<ol><br />
<li><tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4a.<br />
<li><tt>reconstructionICACost</tt> in <tt>reconstructionICACost.m</tt>, which computes the cost and gradient for the reconstruction ICA objective.<br />
</ol><br />
<br />
When you have implemented the cost functions, you should check the gradients numerically.<br />
<br />
'''Hint''' - If you are having difficulties deriving the gradients, you may wish to consult the page on [[Deriving gradients using the backpropagation idea]].<br />
<br />
=== Step 4: Optimization ===<br />
<br />
This step is broken down into two substeps. In each substep, you will (separately) optimize for one of the two ICA objective functions.<br />
<br />
==== Step 4a: Orthonormal ICA ====<br />
<br />
In step 4a, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, reconstruction ICA or other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation]] sparse coding) may be preferable.<br />
<br />
==== Step 4b: Reconstruction ICA ====<br />
<br />
In step 4b, you will optimize for the reconstruction ICA objective using <tt>minFunc</tt>. Code has already been provided to do so, so all that is left is for you to run the code. 600 iterations of <tt>minFunc</tt> should take 20-25 minutes, and produce a basis which looks like the following:<br />
<br />
[[File:ReconstructionICAFeatures.png | 350px]]<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-18T21:18:36Z<p>Cyfoo: /* Dependencies */</p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''', '''<tt>ReconstructionICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the two ICA cost functions:<br />
<ol><br />
<li><tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4a.<br />
<li><tt>reconstructionICACost</tt> in <tt>reconstructionICACost.m</tt>, which computes the cost and gradient for the reconstruction ICA objective.<br />
</ol><br />
<br />
When you have implemented the cost functions, you should check the gradients numerically.<br />
<br />
=== Step 4: Optimization ===<br />
<br />
This step is broken down into two substeps. In each substep, you will (separately) optimize for one of the two ICA objective functions.<br />
<br />
==== Step 4a: Orthonormal ICA ====<br />
<br />
In step 4a, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, reconstruction ICA or other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation]] sparse coding) may be preferable.<br />
<br />
==== Step 4b: Reconstruction ICA ====<br />
<br />
In step 4b, you will optimize for the reconstruction ICA objective using <tt>minFunc</tt>. Code has already been provided to do so, so all that is left is for you to run the code. 600 iterations of <tt>minFunc</tt> should take 20-25 minutes, and produce a basis which looks like the following:<br />
<br />
[[File:ReconstructionICAFeatures.png | 350px]]<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Independent_Component_AnalysisExercise:Independent Component Analysis2011-06-18T21:17:23Z<p>Cyfoo: Created page with "== Independent Component Analysis == In this exercise, you will implement Independent Component Analysis on color images from the STL-10 dataset. In the file <tt>[http://uf..."</p>
<hr />
<div>== Independent Component Analysis ==<br />
<br />
In this exercise, you will implement [[Independent Component Analysis]] on color images from the STL-10 dataset.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/independent_component_analysis_exercise.zip independent_component_analysis_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>OrthonormalICACost.m</tt>''', '''<tt>ReconstructionICACost.m</tt>''' and '''<tt>ICAExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>displayColorNetwork.m</tt> from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we load and use a portion of the 8x8 patches from the STL-10 dataset (which you first saw in the exercise on [[Exercise:Learning color features with Sparse Autoencoders | linear decoders]]).<br />
<br />
=== Step 2: ZCA whiten patches ===<br />
<br />
In this step, we ZCA whiten the patches as required by orthonormal ICA.<br />
<br />
=== Step 3: Implement and check ICA cost functions ===<br />
<br />
In this step, you should implement the two ICA cost functions:<br />
<ol><br />
<li><tt>orthonormalICACost</tt> in <tt>orthonormalICACost.m</tt>, which computes the cost and gradient for the orthonormal ICA objective. Note that the orthonormality constraint is '''not''' enforced in the cost function. It will be enforced by a projection in the gradient descent step, which you will have to complete in step 4a.<br />
<li><tt>reconstructionICACost</tt> in <tt>reconstructionICACost.m</tt>, which computes the cost and gradient for the reconstruction ICA objective.<br />
</ol><br />
<br />
When you have implemented the cost functions, you should check the gradients numerically.<br />
<br />
=== Step 4: Optimization ===<br />
<br />
This step is broken down into two substeps. In each substep, you will (separately) optimize for one of the two ICA objective functions.<br />
<br />
==== Step 4a: Orthonormal ICA ====<br />
<br />
In step 4a, you will optimize for the orthonormal ICA objective using gradient descent with backtracking line search (which has already been provided for you. For more details on the backtracking line search, you may wish to consult the [[Exercise:Independent Component Analysis#Appendix| appendix ]] of this exercise). The orthonormality constraint should be enforced with a projection, which you should fill in.<br />
<br />
Once you have filled in the code for the projection, check that it is correct by using the verification code provided. Once you have verified that your projection is correct, comment out the verification code and run the optimization. 10 000 iterations of gradient descent should take around 2 hours, and produce a basis which looks like the following:<br />
<br />
[[File:OrthonormalICAFeatures.png | 350px]]<br />
<br />
Observe that few of the bases have been completely learned even after 10 000 iterations, highlighting a weakness of orthonormal ICA - it is difficult to optimize for the objective while enforcing the orthonormality constraint using gradient descent, and convergence can be very slow. Hence, in situations where an orthonormal basis is not required, reconstruction ICA or other faster methods of learning bases (such as [[Sparse Coding: Autoencoder Interpretation]] sparse coding) may be preferable.<br />
<br />
==== Step 4b: Reconstruction ICA ====<br />
<br />
In step 4b, you will optimize for the reconstruction ICA objective using <tt>minFunc</tt>. Code has already been provided to do so, so all that is left is for you to run the code. 600 iterations of <tt>minFunc</tt> should take 20-25 minutes, and produce a basis which looks like the following:<br />
<br />
[[File:ReconstructionICAFeatures.png | 350px]]<br />
<br />
=== Appendix ===<br />
<br />
==== Backtracking line search ====<br />
<br />
The backtracking line search used in the exercise is based off that in [http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization by Boyd and Vandenbergh]. In the backtracking line search, given a descent direction <math>\vec{u}</math> (in this exercise we use <math>\vec{u} = -\nabla f(\vec{x})</math>), we want to find a good step size <math>t</math> that gives us a steep descent. The general idea is to use a linear approximation (the first order Taylor approximation) to the function <math>f</math> at the current point <math>\vec{x}</math>, and to search for a step size <math>t</math> such that we can decrease the function's value by more than <math>\alpha</math> times the decrease predicted by the linear approximation (<math>\alpha \in (0, 0.5)</math>. For more details, you may wish to consult [http://www.stanford.edu/~boyd/cvxbook/ the book].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:OrthonormalICAFeatures.pngFile:OrthonormalICAFeatures.png2011-06-18T19:15:24Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:ReconstructionICAFeatures.pngFile:ReconstructionICAFeatures.png2011-06-18T19:15:17Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-06-18T19:01:41Z<p>Cyfoo: /* Dependencies */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in an later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
The following additional file is also required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoderCost.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which when visualized, look like edges and "opponent colors," as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might instead get images that look like one of the following:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-06-18T18:59:34Z<p>Cyfoo: /* Dependencies */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in an later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional file is required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoderCost.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which when visualized, look like edges and "opponent colors," as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might instead get images that look like one of the following:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-06-18T18:56:22Z<p>Cyfoo: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. <br />
<br />
'''Miscellaneous'''<br />
* [[MATLAB Modules]]<br />
* [[Style Guide]]<br />
* [[Useful Links]]<br />
<br />
'''Miscellaneous Topics'''<br />
* [[Data Preprocessing]]<br />
* [[Deriving gradients using the backpropagation idea]]<br />
<br />
'''Advanced Topics''':<br />
<br />
'''Sparse Coding'''<br />
* [[Sparse Coding]]<br />
* [[Sparse Coding: Autoencoder Interpretation]]<br />
* [[Exercise:Sparse Coding]]<br />
<br />
'''ICA Style Models'''<br />
* [[Independent Component Analysis]]<br />
* [[Exercise:Independent Component Analysis]]<br />
<br />
'''Others'''<br />
* [[Convolutional training]] <br />
* [[Restricted Boltzmann Machines]]<br />
* [[Deep Belief Networks]]<br />
* [[Denoising Autoencoders]]<br />
* [[K-means]]<br />
* [[Spatial pyramids / Multiscale]]<br />
* [[Slow Feature Analysis]]<br />
* [[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-06-18T00:47:00Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do. Further, in ICA, we want to learn not just any linearly independent basis, but an '''orthonormal''' basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are '''sparse'''; and secondly, our basis is an '''orthonormal''' basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint). <br />
<br />
In practice, optimizing for the objective function while enforcing the orthonormality constraint (as described in [[Independent Component Analysis#Orthonormal ICA | Orthonormal ICA]] section below) is feasible but slow. Hence, the use of orthonormal ICA is limited to situations where it is important to obtain an orthonormal basis ([[TODO]]: what situations) . In other situations, the orthonormality constraint is replaced by a reconstruction cost instead to give [[Independent Component Analysis#Reconstruction ICA | Reconstruction ICA]], which is very similar to [[Sparse Coding: Autoencoder Interpretation | sparse coding]], but no longer strictly finds independent components.<br />
<br />
== Orthonormal ICA ==<br />
<br />
The orthonormal ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
Observe that the constraint <math>WW^T = I</math> implies two other constraints. <br />
<br />
Firstly, since we are learning an orthonormal basis, the number of basis vectors we learn must be less than the dimension of the input. In particular, this means that we cannot learn over-complete bases as we usually do in [[Sparse Coding: Autoencoder Interpretation | sparse coding]]. <br />
<br />
Secondly, the data must be [[Whitening | ZCA whitened]] with no regularization (that is, with <math>\epsilon</math> set to 0). ([[TODO]] Why must this be so?)<br />
<br />
Hence, before we even begin to optimize for the orthonormal ICA objective, we must ensure that our data has been '''whitened''', and that we are learning an '''under-complete''' basis. <br />
<br />
Following that, to optimize for the objective, we can use gradient descent, interspersing gradient descent steps with projection steps to enforce the orthonormality constraint. Hence, the procedure will be as follows:<br />
<br />
Repeat until done:<br />
<ol><br />
<li><math>W \leftarrow W - \alpha \nabla_W \lVert Wx \rVert_1</math><br />
<li><math>W \leftarrow \operatorname{proj}_U W</math> where <math>U</math> is the space of matrices satisfying <math>WW^T = I</math><br />
</ol><br />
<br />
In practice, the learning rate <math>\alpha</math> is varied using a line-search algorithm to speed up the descent, and the projection step is achieved by setting <math>W \leftarrow (WW^T)^{-\frac{1}{2}} W</math>, which can actually be seen as ZCA whitening ([[TODO]] explain how it is like ZCA whitening).<br />
<br />
== Reconstruction ICA ==<br />
<br />
In reconstruction ICA, we drop the constraint that we want an orthonormal basis, replacing it with a reconstruction error term. Hence, the new reconstruction ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 + \lVert W^TWx \rVert_2^2 \\<br />
\end{array} <br />
</math><br />
<br />
where <math>W^T(Wx)</math> is the reconstruction of the input from the features (i.e. <math>W^T</math> is supposed to play the role of <math>W^{-1}</math>). If you recall the [[Sparse Coding: Autoencoder Interpretation | sparse coding]] objective,<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
you will notice that these two objectives are identical, except that whereas in sparse coding, we are trying to learn a matrix <math>A</math> that maps the feature space to the input space, in reconstruction ICA, we are trying to learn a matrix <math>W</math> that maps the input space to the feature space instead. <br />
<br />
As such, it is obvious that reconstruction ICA does not learn independent or orthonormal components.<br />
<br />
== Topographic ICA ==<br />
<br />
Just like [[Sparse Coding: Autoencoder Interpretation | sparse coding]], independent component analysis can be modified to give a topographic variant by adding a topographic cost term.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-06-18T00:46:38Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do. Further, in ICA, we want to learn not just any linearly independent basis, but an '''orthonormal''' basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are '''sparse'''; and secondly, our basis is an '''orthonormal''' basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint). <br />
<br />
In practice, optimizing for the objective function while enforcing the orthonormality constraint (as described in [[Independent Component Analysis#Orthonormal ICA | Orthonormal ICA]] section below) is feasible but slow. Hence, the use of orthonormal ICA is limited to situations where it is important to obtain an orthonormal basis ([[TODO]]: what situations) . In other situations, the orthonormality constraint is replaced by a reconstruction cost instead to give [[Independent Component Analysis#Reconstruction ICA | Reconstruction ICA]], which is very similar to [[Sparse Coding: Autoencoder Interpretation | sparse coding]], but no longer strictly finds independent components.<br />
<br />
=== Orthonormal ICA ===<br />
<br />
The orthonormal ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
Observe that the constraint <math>WW^T = I</math> implies two other constraints. <br />
<br />
Firstly, since we are learning an orthonormal basis, the number of basis vectors we learn must be less than the dimension of the input. In particular, this means that we cannot learn over-complete bases as we usually do in [[Sparse Coding: Autoencoder Interpretation | sparse coding]]. <br />
<br />
Secondly, the data must be [[Whitening | ZCA whitened]] with no regularization (that is, with <math>\epsilon</math> set to 0). ([[TODO]] Why must this be so?)<br />
<br />
Hence, before we even begin to optimize for the orthonormal ICA objective, we must ensure that our data has been '''whitened''', and that we are learning an '''under-complete''' basis. <br />
<br />
Following that, to optimize for the objective, we can use gradient descent, interspersing gradient descent steps with projection steps to enforce the orthonormality constraint. Hence, the procedure will be as follows:<br />
<br />
Repeat until done:<br />
<ol><br />
<li><math>W \leftarrow W - \alpha \nabla_W \lVert Wx \rVert_1</math><br />
<li><math>W \leftarrow \operatorname{proj}_U W</math> where <math>U</math> is the space of matrices satisfying <math>WW^T = I</math><br />
</ol><br />
<br />
In practice, the learning rate <math>\alpha</math> is varied using a line-search algorithm to speed up the descent, and the projection step is achieved by setting <math>W \leftarrow (WW^T)^{-\frac{1}{2}} W</math>, which can actually be seen as ZCA whitening ([[TODO]] explain how it is like ZCA whitening).<br />
<br />
=== Reconstruction ICA ===<br />
<br />
In reconstruction ICA, we drop the constraint that we want an orthonormal basis, replacing it with a reconstruction error term. Hence, the new reconstruction ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 + \lVert W^TWx \rVert_2^2 \\<br />
\end{array} <br />
</math><br />
<br />
where <math>W^T(Wx)</math> is the reconstruction of the input from the features (i.e. <math>W^T</math> is supposed to play the role of <math>W^{-1}</math>). If you recall the [[Sparse Coding: Autoencoder Interpretation | sparse coding]] objective,<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
you will notice that these two objectives are identical, except that whereas in sparse coding, we are trying to learn a matrix <math>A</math> that maps the feature space to the input space, in reconstruction ICA, we are trying to learn a matrix <math>W</math> that maps the input space to the feature space instead. <br />
<br />
As such, it is obvious that reconstruction ICA does not learn independent or orthonormal components.<br />
<br />
=== Topographic ICA ===<br />
<br />
Just like [[Sparse Coding: Autoencoder Interpretation | sparse coding]], independent component analysis can be modified to give a topographic variant by adding a topographic cost term.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-06-18T00:42:13Z<p>Cyfoo: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do. Further, in ICA, we want to learn not just any linearly independent basis, but an '''orthonormal''' basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are '''sparse'''; and secondly, our basis is an '''orthonormal''' basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint). <br />
<br />
In practice, optimizing for the objective function while enforcing the orthonormality constraint (as described in [[Orthonormal ICA | Independent Component Analysis#Orthonormal ICA]] section below) is feasible but slow. Hence, the use of orthonormal ICA is limited to situations where it is important to obtain an orthonormal basis ([[TODO]]: what situations) . In other situations, the orthonormality constraint is replaced by a reconstruction cost instead to give [[Reconstruction ICA | Independent Component Analysis#Reconstruction ICA]], which is very similar to [[sparse coding | Sparse Coding: Autoencoder Interpretation]], but no longer strictly finds independent components.<br />
<br />
=== Orthonormal ICA ===<br />
<br />
The orthonormal ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
Observe that the constraint <math>WW^T = I</math> implies two other constraints. <br />
<br />
Firstly, since we are learning an orthonormal basis, the number of basis vectors we learn must be less than the dimension of the input. In particular, this means that we cannot learn over-complete bases as we usually do in [[sparse coding | Sparse Coding: Autoencoder Interpretation]]. <br />
<br />
Secondly, the data must be [[ZCA whitened | Whitening]] with no regularization (that is, with <math>\epsilon</math> set to 0). ([[TODO]] Why must this be so?)<br />
<br />
Hence, before we even begin to optimize for the orthonormal ICA objective, we must ensure that our data has been '''whitened''', and that we are learning an '''under-complete''' basis. <br />
<br />
Following that, to optimize for the objective, we can use gradient descent, interspersing gradient descent steps with projection steps to enforce the orthonormality constraint. Hence, the procedure will be as follows:<br />
<br />
<ol><br />
Repeat until done:<br />
<li><math>W \leftarrow W - \alpha \nabla_W \lVert Wx \rVert_1</math><br />
<li><math>W \leftarrow proj_U W</math> where <math>U</math> is the space of matrices satisfying <math>WW^T = I</math><br />
</ol><br />
<br />
In practice, the learning rate <math>\alpha</math> is varied using a line-search algorithm to speed up the descent, and the projection step is achieved by setting <math>W \leftarrow (WW^T)^{-frac{1}{2}} W</math>, which can actually be seen as ZCA whitening ([[TODO]] explain how it is like ZCA whitening).<br />
<br />
=== Reconstruction ICA ===<br />
<br />
In reconstruction ICA, we drop the constraint that we want an orthonormal basis, replacing it with a reconstruction error term. Hence, the new reconstruction ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 + \lvert W^TWx \rVert_2^2 \\<br />
\end{array} <br />
</math><br />
<br />
where <math>W^T(Wx)<math> is the reconstruction of the input from the features (i.e. <math>W^T</math> is supposed to play the role of <math>W^{-1}</math). If you recall the [[sparse coding | Sparse Coding: Autoencoder Interpretation]] objective,<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
you will notice that these two objectives are identical, except that whereas in sparse coding, we are trying to learn a matrix <math>A</math> that maps the feature space to the input space, in reconstruction ICA, we are trying to learn a matrix <math>W</math> that maps the input space to the feature space instead. <br />
<br />
As such, it is obvious that reconstruction ICA does not learn independent or orthonormal components.<br />
<br />
=== Topographic ICA ===<br />
<br />
Just like [[sparse coding | Sparse Coding: Autoencoder Interpretation]], independent component analysis can be modified to give a topographic variant by adding a topographic cost term.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-06-18T00:41:52Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do. Further, in ICA, we want to learn not just any linearly independent basis, but an '''orthonormal''' basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are '''sparse'''; and secondly, our basis is an orthonormal basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint). <br />
<br />
In practice, optimizing for the objective function while enforcing the orthonormality constraint (as described in [[Orthonormal ICA | Independent Component Analysis#Orthonormal ICA]] section below) is feasible but slow. Hence, the use of orthonormal ICA is limited to situations where it is important to obtain an orthonormal basis ([[TODO]]: what situations) . In other situations, the orthonormality constraint is replaced by a reconstruction cost instead to give [[Reconstruction ICA | Independent Component Analysis#Reconstruction ICA]], which is very similar to [[sparse coding | Sparse Coding: Autoencoder Interpretation]], but no longer strictly finds independent components.<br />
<br />
=== Orthonormal ICA ===<br />
<br />
The orthonormal ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
Observe that the constraint <math>WW^T = I</math> implies two other constraints. <br />
<br />
Firstly, since we are learning an orthonormal basis, the number of basis vectors we learn must be less than the dimension of the input. In particular, this means that we cannot learn over-complete bases as we usually do in [[sparse coding | Sparse Coding: Autoencoder Interpretation]]. <br />
<br />
Secondly, the data must be [[ZCA whitened | Whitening]] with no regularization (that is, with <math>\epsilon</math> set to 0). ([[TODO]] Why must this be so?)<br />
<br />
Hence, before we even begin to optimize for the orthonormal ICA objective, we must ensure that our data has been '''whitened''', and that we are learning an '''under-complete''' basis. <br />
<br />
Following that, to optimize for the objective, we can use gradient descent, interspersing gradient descent steps with projection steps to enforce the orthonormality constraint. Hence, the procedure will be as follows:<br />
<br />
<ol><br />
Repeat until done:<br />
<li><math>W \leftarrow W - \alpha \nabla_W \lVert Wx \rVert_1</math><br />
<li><math>W \leftarrow proj_U W</math> where <math>U</math> is the space of matrices satisfying <math>WW^T = I</math><br />
</ol><br />
<br />
In practice, the learning rate <math>\alpha</math> is varied using a line-search algorithm to speed up the descent, and the projection step is achieved by setting <math>W \leftarrow (WW^T)^{-frac{1}{2}} W</math>, which can actually be seen as ZCA whitening ([[TODO]] explain how it is like ZCA whitening).<br />
<br />
=== Reconstruction ICA ===<br />
<br />
In reconstruction ICA, we drop the constraint that we want an orthonormal basis, replacing it with a reconstruction error term. Hence, the new reconstruction ICA objective is:<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 + \lvert W^TWx \rVert_2^2 \\<br />
\end{array} <br />
</math><br />
<br />
where <math>W^T(Wx)<math> is the reconstruction of the input from the features (i.e. <math>W^T</math> is supposed to play the role of <math>W^{-1}</math). If you recall the [[sparse coding | Sparse Coding: Autoencoder Interpretation]] objective,<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
you will notice that these two objectives are identical, except that whereas in sparse coding, we are trying to learn a matrix <math>A</math> that maps the feature space to the input space, in reconstruction ICA, we are trying to learn a matrix <math>W</math> that maps the input space to the feature space instead. <br />
<br />
As such, it is obvious that reconstruction ICA does not learn independent or orthonormal components.<br />
<br />
=== Topographic ICA ===<br />
<br />
Just like [[sparse coding | Sparse Coding: Autoencoder Interpretation]], independent component analysis can be modified to give a topographic variant by adding a topographic cost term.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-05-30T07:24:27Z<p>Cyfoo: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do. Further, in ICA, we want to learn not just any linearly independent basis, but an '''orthonormal''' basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are sparse; and secondly, our basis is an orthonormal basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint).</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-05-30T07:23:50Z<p>Cyfoo: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do - to learn not any linearly independent basis, but an '''orthonormal''' basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are sparse; and secondly, our basis is an orthonormal basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint).</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Independent_Component_AnalysisIndependent Component Analysis2011-05-30T07:23:25Z<p>Cyfoo: Created page with "== Introduction == If you recall, in sparse coding, we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis ..."</p>
<hr />
<div>== Introduction ==<br />
<br />
If you recall, in [[Sparse Coding | sparse coding]], we wanted to learn an '''over-complete''' basis for the data. In particular, this implies that the basis vectors that we learn in sparse coding will not be linearly independent. While this may be desirable in certain situations, sometimes we want to learn a linearly independent basis for the data. In independent component analysis (ICA), this is exactly what we want to do - to learn not any linearly independent basis, but an orthonormal basis for the data. (An orthonormal basis is a basis <math>(\phi_1, \ldots \phi_n)</math> such that <math>\phi_i \cdot \phi_j = 0</math> if <math>i \ne j</math> and <math>1</math> if <math>i = j</math>).<br />
<br />
Like sparse coding, independent component analysis has a simple mathematical formulation. Given some data <math>x</math>, we would like to learn a set of basis vectors which we represent in the columns of a matrix <math>W</math>, such that, firstly, as in sparse coding, our features are sparse; and secondly, our basis is an orthonormal basis. (Note that while in sparse coding, our matrix <math>A</math> was for mapping '''features''' <math>s</math> to '''raw data''', in independent component analysis, our matrix <math>W</math> works in the opposite direction, mapping '''raw data''' <math>x</math> to '''features''' instead). This gives us the following objective function:<br />
<br />
:<math><br />
J(W) = \lVert Wx \rVert_1 <br />
</math><br />
<br />
This objective function is equivalent to the sparsity penalty on the features <math>s</math> in sparse coding, since <math>Wx</math> is precisely the features that represent the data. Adding in the orthonormality constraint gives us the full optimization problem for independent component analysis:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert Wx \rVert_1 \\<br />
{\rm s.t.} & WW^T = I \\<br />
\end{array} <br />
</math><br />
<br />
As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint).</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-30T06:46:02Z<p>Cyfoo: /* Example 3: ICA reconstruction cost */</p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For each output unit <math>i</math> in layer <math>n_l</math> (the final layer), set<br />
:<math><br />
\delta^{(n_l)}_i<br />
= \frac{\partial}{\partial z^{(n_l)}_i} \;\;<br />
J(z^{(n_l)})<br />
</math><br />
where <math>J(z)</math> is our "objective function" (explained below).<br />
<li>For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient with respect to <math>X</math> of <math>F</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea. <br />
<br />
To do this, we will set our "objective function" to be the function <math>J(z)</math> that when applied to the outputs of the neurons in the last layer yields the value <math>F(X)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.<br />
<br />
Using this method, we can easily compute derivatives with respect to the inputs <math>X</math>, as well as derivatives with respect to any of the weights in the network, as we shall see later.<br />
<br />
== Examples ==<br />
<br />
To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]], in examples 1 and 2. In example 3, we use a function from [[Independent Component Analysis | independent component analysis]] to illustrate the use of this idea to compute derivates with respect to weights, and in this specific case, what to do in the case of tied or repeated weights.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall for [[Sparse Coding: Autoencoder Interpretation | sparse coding]], the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>A</math></td><br />
<td><math>f(z_i) = z_i</math> (identity)</td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>I</math> (identity)</td><br />
<td><math>f(z_i) = z_i - x_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(3)}) = F(x)</math>, we can set <math>J(z^{(3)}) = \sum_k J(z^{(3)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>As - x</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>As</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( A^T \delta^{(2)} \right) \bullet 1</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = A^T I^T 2(As - x) \\<br />
& = A^T 2(As - x)<br />
\end{align}<br />
</math> <br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in [[Sparse Coding: Autoencoder Interpretation | sparse coding]]:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.<br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>V</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i + \epsilon</math></td><br />
</tr><br />
<tr><br />
<td>4</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^{\frac{1}{2}}</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math><br />
</th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>4</td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>(Vss^T + \epsilon)</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td><br />
<td><math>Vss^T</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( V^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>ss^T</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>\left( I^T \delta^{(2)} \right) \bullet 2s</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = I^T V^T I^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T (Vss^T + \epsilon)^{-\frac{1}{2}} \bullet s<br />
\end{align}<br />
</math><br />
<br />
=== Example 3: ICA reconstruction cost ===<br />
<br />
Recall the [[Independent Component Analysis | independent component analysis (ICA)]] reconstruction cost term:<br />
<math>\lVert W^TWx - x \rVert_2^2</math><br />
where <math>W</math> is the weight matrix and <math>x</math> is the input.<br />
<br />
We would like to find <math>\nabla_W \lVert W^TWx - x \rVert_2^2</math> - the derivative of the term with respect to the '''weight matrix''', rather than the '''input''' as in the earlier two examples. We will still proceed similarly though, seeing this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 3.png | 400px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>W</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>W^T</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i - x_i</math></td><br />
</tr><br />
<tr><br />
<td>4</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.<br />
<br />
Now that we can see <math>F</math> as a neural network, we can try to compute the gradient <math>\nabla_W F</math>. However, we now face the difficulty that <math>W</math> appears twice in the network. Fortunately, it turns out that if <math>W</math> appears multiple times in the network, the gradient with respect to <math>W</math> is simply the sum of gradients for each instance of <math>W</math> in the network (you may wish to work out a formal proof of this fact to convince yourself). With this in mind, we will proceed to work out the deltas first:<br />
<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math><br />
</th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>4</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>(W^TWx - x)</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td><br />
<td><math>W^TWx</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( (W^T)^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>Wx</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( W^T \delta^{(2)} \right) \bullet 1</math></td><br />
<td><math>x</math></td><br />
</tr><br />
</table><br />
<br />
To find the gradients with respect to <math>W</math>, first we find the gradients with respect to each instance of <math>W</math> in the network.<br />
<br />
With respect to <math>W^T</math>:<br />
:<math><br />
\begin{align}<br />
\nabla_{W^T} F & = \delta^{(3)} a^{(2)T} \\<br />
& = 2(W^TWx - x) (Wx)^T<br />
\end{align}<br />
</math><br />
<br />
With respect to <math>W</math>:<br />
:<math><br />
\begin{align}<br />
\nabla_{W} F & = \delta^{(2)} a^{(1)T} \\<br />
& = (W^T)(2(W^TWx -x)) x^T<br />
\end{align}<br />
</math><br />
<br />
Taking sums, noting that we need to transpose the gradient with respect to <math>W^T</math> to get the gradient with respect to <math>W</math>, yields the final gradient with respect to <math>W</math> (pardon the slight abuse of notation here):<br />
<br />
:<math><br />
\begin{align}<br />
\nabla_{W} F & = \nabla_{W} F + (\nabla_{W^T} F)^T \\<br />
& = (W^T)(2(W^TWx -x)) x^T + 2(Wx)(W^TWx - x)^T<br />
\end{align}<br />
</math></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-30T06:44:03Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For each output unit <math>i</math> in layer <math>n_l</math> (the final layer), set<br />
:<math><br />
\delta^{(n_l)}_i<br />
= \frac{\partial}{\partial z^{(n_l)}_i} \;\;<br />
J(z^{(n_l)})<br />
</math><br />
where <math>J(z)</math> is our "objective function" (explained below).<br />
<li>For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient with respect to <math>X</math> of <math>F</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea. <br />
<br />
To do this, we will set our "objective function" to be the function <math>J(z)</math> that when applied to the outputs of the neurons in the last layer yields the value <math>F(X)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.<br />
<br />
Using this method, we can easily compute derivatives with respect to the inputs <math>X</math>, as well as derivatives with respect to any of the weights in the network, as we shall see later.<br />
<br />
== Examples ==<br />
<br />
To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]], in examples 1 and 2. In example 3, we use a function from [[Independent Component Analysis | independent component analysis]] to illustrate the use of this idea to compute derivates with respect to weights, and in this specific case, what to do in the case of tied or repeated weights.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall for [[Sparse Coding: Autoencoder Interpretation | sparse coding]], the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>A</math></td><br />
<td><math>f(z_i) = z_i</math> (identity)</td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>I</math> (identity)</td><br />
<td><math>f(z_i) = z_i - x_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(3)}) = F(x)</math>, we can set <math>J(z^{(3)}) = \sum_k J(z^{(3)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>As - x</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>As</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( A^T \delta^{(2)} \right) \bullet 1</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = A^T I^T 2(As - x) \\<br />
& = A^T 2(As - x)<br />
\end{align}<br />
</math> <br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in [[Sparse Coding: Autoencoder Interpretation | sparse coding]]:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.<br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>V</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i + \epsilon</math></td><br />
</tr><br />
<tr><br />
<td>4</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^{\frac{1}{2}}</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math><br />
</th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>4</td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>(Vss^T + \epsilon)</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td><br />
<td><math>Vss^T</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( V^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>ss^T</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>\left( I^T \delta^{(2)} \right) \bullet 2s</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = I^T V^T I^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T (Vss^T + \epsilon)^{-\frac{1}{2}} \bullet s<br />
\end{align}<br />
</math><br />
<br />
=== Example 3: ICA reconstruction cost ===<br />
<br />
Recall the [[Independent Component Analysis | independent component analysis (ICA)]] reconstruction cost term:<br />
<math>\lVert W^TWx - x \rVert_2^2</math><br />
where <math>W</math> is the weight matrix and <math>x</math> is the input.<br />
<br />
We would like to find <math>\nabla_W \lVert W^TWx - x \rVert_2^2</math> - the derivative of the term with respect to the '''weight matrix''', rather than the '''input''' as in the earlier two examples. We will still proceed similarly though, seeing this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 3.png | 400px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>W</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>W^T</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i - x_i</math></td><br />
</tr><br />
<tr><br />
<td>4</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.<br />
<br />
Now that we can see <math>F</math> as a neural network, we can try to compute the gradient <math>\nabla_W F</math>. However, we now face the difficulty that <math>W</math> appears twice in the network. Fortunately, it turns out that if <math>W</math> appears multiple times in the network, the gradient with respect to <math>W</math> is simply the sum of gradients for each <math>W</math> in the network (you may wish to work out a formal proof of this fact to convince yourself). With this in mind, we can proceed to work out the deltas first:<br />
<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math><br />
</th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>4</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>(W^TWx - x)</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td><br />
<td><math>W^TWx</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( (W^T)^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>Wx</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( W^T \delta^{(2)} \right) \bullet 1</math></td><br />
<td><math>x</math></td><br />
</tr><br />
</table><br />
<br />
First we find the gradients with respect to each <math>W</math>.<br />
<br />
With respect to <math>W^T</math>:<br />
:<math><br />
\begin{align}<br />
\nabla_{W^T} F & = \delta^{(3)} a^{(2)T} \\<br />
& = 2(W^TWx - x) (Wx)^T<br />
\end{align}<br />
</math><br />
<br />
With respect to <math>W</math>:<br />
:<math><br />
\begin{align}<br />
\nabla_{W} F & = \delta^{(2)} a^{(1)T} \\<br />
& = (W^T)(2(W^TWx -x)) x^T<br />
\end{align}<br />
</math><br />
<br />
Taking sums, noting that we need to transpose the gradient with respect to <math>W^T</math> to get the gradient with respect to <math>W</math>, yields the final gradient with respect to <math>W</math> (pardon the slight abuse of notation here):<br />
<br />
:<math><br />
\begin{align}<br />
\nabla_{W} F & = \nabla_{W} F + (\nabla_{W^T} F)^T \\<br />
& = (W^T)(2(W^TWx -x)) x^T + 2(Wx)(W^TWx - x)^T<br />
\end{align}<br />
</math></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:Backpropagation_Method_Example_3.pngFile:Backpropagation Method Example 3.png2011-05-30T06:28:49Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-30T06:04:58Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For each output unit <math>i</math> in layer <math>n_l</math> (the final layer), set<br />
:<math><br />
\delta^{(n_l)}_i<br />
= \frac{\partial}{\partial z^{(n_l)}_i} \;\;<br />
J(z^{(n_l)})<br />
</math><br />
where <math>J(z)</math> is our "objective function" (explained below).<br />
<li>For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient with respect to <math>X</math> of <math>F</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea. <br />
<br />
To do this, we will set our "objective function" to be the function <math>J(z)</math> that when applied to the outputs of the neurons in the last layer yields the value <math>F(x)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.<br />
<br />
== Examples ==<br />
<br />
We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate the method of computing gradients of functions on matrices using the backpropagation idea.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>A</math></td><br />
<td><math>f(z_i) = z_i</math> (identity)</td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>I</math> (identity)</td><br />
<td><math>f(z_i) = z_i - x_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(3)}) = F(x)</math>, we can set <math>J(z^{(3)}) = \sum_k J(z^{(3)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>As - x</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>As</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( A^T \delta^{(2)} \right) \bullet 1</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = A^T I^T 2(As - x) \\<br />
& = A^T 2(As - x)<br />
\end{align}<br />
</math> <br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.<br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>V</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i + \epsilon</math></td><br />
</tr><br />
<tr><br />
<td>4</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^{\frac{1}{2}}</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math><br />
</th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>4</td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>(Vss^T + \epsilon)</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>Vss^T</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( V^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>ss^T</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>\left( I^T \delta^{(2)} \right) \bullet 2s</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = I^T V^T I^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T (Vss^T + \epsilon)^{-\frac{1}{2}} \bullet s<br />
\end{align}<br />
</math></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-30T06:04:27Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For each output unit <math>i</math> in layer <math>n_l</math> (the final layer), set<br />
:<math><br />
\delta^{(n_l)}_i<br />
= \frac{\partial}{\partial z^{(n_l)}_i} \;\;<br />
J(z^{(n_l)})<br />
</math><br />
where <math>J(z)</math> is our "objective function" (explained below).<br />
<li>For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient with respect to <math>X</math> of <math>F</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea. <br />
<br />
To do this, we will set our "objective function" to be the function <math>J(z)</math> that when applied to the outputs of the neurons in the last layer yields the value <math>F(x)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.<br />
<br />
== Examples ==<br />
<br />
We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate the method of computing gradients of functions on matrices using the backpropagation idea.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>A</math></td><br />
<td><math>f(z_i) = z_i (identity)</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>I</math> (identity)</td><br />
<td><math>f(z_i) = z_i - x_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(3)}) = F(x)</math>, we can set <math>J(z^{(3)}) = \sum_k J(z^{(3)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>As - x</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>As</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( A^T \delta^{(2)} \right) \bullet 1</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = A^T I^T 2(As - x) \\<br />
& = A^T 2(As - x)<br />
\end{align}<br />
</math> <br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.<br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]<br />
<br />
The weights and activation functions of this network are as follows:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr><br />
<tr><br />
<td>1</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i^2</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>V</math></td><br />
<td><math>f(z_i) = z_i</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>I</math></td><br />
<td><math>f(z_i) = z_i + \epsilon</math></td><br />
</tr><br />
<tr><br />
<td>4</td><br />
<td>N/A</td><br />
<td><math>f(z_i) = z_i^{\frac{1}{2}}</math></td><br />
</tr><br />
</table><br />
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>.<br />
<br />
Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields:<br />
<table align="center"><br />
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math><br />
</th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr><br />
<tr><br />
<td>4</td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td><br />
<td><math>(Vss^T + \epsilon)</math></td><br />
</tr><br />
<tr><br />
<td>3</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>Vss^T</math></td><br />
</tr><br />
<tr><br />
<td>2</td><br />
<td><math>f'(z_i) = 1</math></td><br />
<td><math>\left( V^T \delta^{(3)} \right) \bullet 1</math></td><br />
<td><math>ss^T</math></td><br />
</tr><br />
<tr><br />
<td>1</td><br />
<td><math>f'(z_i) = 2z_i</math></td><br />
<td><math>\left( I^T \delta^{(2)} \right) \bullet 2s</math></td><br />
<td><math>s</math></td><br />
</tr><br />
</table><br />
<br />
Hence, <br />
:<math><br />
\begin{align}<br />
\nabla_X F & = I^T V^T I^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\<br />
& = V^T (Vss^T + \epsilon)^{-\frac{1}{2}} \bullet s<br />
\end{align}<br />
</math></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-29T08:08:57Z<p>Cyfoo: /* Example 2: Smoothed topographic L1 sparsity penalty in sparse coding */</p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For <math>l = n_l, n_l-1, n_l-2, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Notice that we don't consider an objective function in this case, and we allow each layer to have a different activation function <math>f^{(l)}</math>. This will be useful in allowing us to compute the gradients of functions of matrices.<br />
<br />
== The method ==<br />
<br />
To compute the gradient with respect to some matrix <math>X</math> of a complicated function of matrices, it may be helpful to consider the function as a complicated multi-layer neural network, if possible. We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate this.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>J</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A J(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.<br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:Backpropagation_Method_Example_2.pngFile:Backpropagation Method Example 2.png2011-05-29T08:06:29Z<p>Cyfoo: uploaded a new version of &quot;File:Backpropagation Method Example 2.png&quot;</p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-29T08:05:26Z<p>Cyfoo: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For <math>l = n_l, n_l-1, n_l-2, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Notice that we don't consider an objective function in this case, and we allow each layer to have a different activation function <math>f^{(l)}</math>. This will be useful in allowing us to compute the gradients of functions of matrices.<br />
<br />
== The method ==<br />
<br />
To compute the gradient with respect to some matrix <math>X</math> of a complicated function of matrices, it may be helpful to consider the function as a complicated multi-layer neural network, if possible. We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate this.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>J</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A J(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-29T08:05:03Z<p>Cyfoo: </p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>.<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For <math>l = n_l, n_l-1, n_l-2, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Notice that we don't consider an objective function in this case, and we allow each layer to have a different activation function <math>f^{(l)}</math>. This will be useful in allowing us to compute the gradients of functions of matrices.<br />
<br />
== The method ==<br />
<br />
To compute the gradient with respect to some matrix <math>X</math> of a complicated function of matrices, it may be helpful to consider the function as a complicated multi-layer neural network, if possible. We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate this.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>J</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A J(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png | 400px]]<br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png | 600px]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:Backpropagation_Method_Example_2.pngFile:Backpropagation Method Example 2.png2011-05-29T08:03:27Z<p>Cyfoo: uploaded a new version of &quot;File:Backpropagation Method Example 2.png&quot;</p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:Backpropagation_Method_Example_2.pngFile:Backpropagation Method Example 2.png2011-05-29T08:02:29Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:Backpropagation_Method_Example_1.pngFile:Backpropagation Method Example 1.png2011-05-29T08:02:16Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Deriving_gradients_using_the_backpropagation_ideaDeriving gradients using the backpropagation idea2011-05-29T08:02:05Z<p>Cyfoo: Created page with "== Introduction == In the section on the backpropagation algorithm, you were briefly introduced to backpropagation as a means of deriving gradien..."</p>
<hr />
<div>== Introduction ==<br />
<br />
In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>.<br />
<br />
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:<br />
<ol><br />
<li>For <math>l = n_l, n_l-1, n_l-2, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives,<br />
:<math><br />
\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Quick notation recap: <br />
<ul><br />
<li><math>l</math> is the number of layers in the neural network<br />
<li><math>n_l</math> is the number of neurons in the <math>l</math>th layer<br />
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer<br />
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer<br />
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{r, c} = A_{r, c} \cdot B_{r, c}</math><br />
<li><math>f^{(l})</math> is the activation function for units in the <math>l</math>th layer<br />
</ul><br />
<br />
Notice that we don't consider an objective function in this case, and we allow each layer to have a different activation function <math>f^{(l)}</math>. This will be useful in allowing us to compute the gradients of functions of matrices.<br />
<br />
== The method ==<br />
<br />
To compute the gradient with respect to some matrix <math>X</math> of a complicated function of matrices, it may be helpful to consider the function as a complicated multi-layer neural network, if possible. We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate this.<br />
<br />
=== Example 1: Objective for weight matrix in sparse coding ===<br />
<br />
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:<br />
:<math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math><br />
<br />
We would like to find the gradient of <math>J</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A J(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. <br />
<br />
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:<br />
<br />
<ol><br />
<li>Apply <math>A</math> as the weights from the first layer to the second layer.<br />
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.<br />
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.<br />
<li>Sum all the activations of the third layer.<br />
</ol><br />
<br />
[[File:Backpropagation Method Example 1.png]]<br />
<br />
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===<br />
<br />
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:<br />
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math><br />
<br />
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:<br />
<br />
[[File:Backpropagation Method Example 2.png]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-29T07:05:44Z<p>Cyfoo: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. <br />
<br />
'''Miscellaneous ''':<br />
* [[MATLAB Modules]]<br />
* [[Style Guide]]<br />
* [[Useful Links]]<br />
<br />
'''Miscellaneous Topics'''<br />
* [[Data Preprocessing]]<br />
* [[Deriving gradients using the backpropagation idea]]<br />
<br />
'''Advanced Topics''':<br />
<br />
'''Sparse Coding'''<br />
* [[Sparse Coding]]<br />
* [[Sparse Coding: Autoencoder Interpretation]]<br />
* [[Exercise:Sparse Coding]]<br />
<br />
'''ICA Style Models'''<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
* [[Convolutional training]] <br />
* [[Restricted Boltzmann Machines]]<br />
* [[Deep Belief Networks]]<br />
* [[Denoising Autoencoders]]<br />
* [[K-means]]<br />
* [[Spatial pyramids / Multiscale]]<br />
* [[Slow Feature Analysis]]<br />
* [[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T07:04:20Z<p>Cyfoo: /* Sparse coding */</p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using the backpropagation idea | using the backpropagation intuition]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.<br />
<br />
== Sparse coding in practice ==<br />
<br />
As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.<br />
<br />
Recall the simple iterative algorithm proposed earlier:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence: <br />
<ol><br />
<li>Batching examples into "mini-batches"<br />
<li>Good initialization of <math>s</math><br />
</ol><br />
<br />
=== Batching examples into mini-batches ===<br />
<br />
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch - a (different) random subset of say 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration instead. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence [[(TODO]]: explain why).<br />
<br />
=== Good initialization of <math>s</math> ===<br />
<br />
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optima is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:<br />
<ol><br />
<li>Set <math>s \leftarrow W^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)<br />
<li>For each feature in <math>s</math> (i.e. each column of <math>s</math>), divide the feature by the norm of the corresponding basis vector in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_c</math> is the <math>c</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_c \rVert }.</math><br />
</ol><br />
<br />
Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>Ws \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)<br />
<br />
=== The practical algorithm ===<br />
<br />
With the above two tricks, the algorithm for sparse coding then becomes:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Select a random mini-batch of 2000 patches<br />
<li>Initialize <math>s</math> as described above<br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
With this method, you should be able to reach a good local optima relatively quickly.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-29T06:59:51Z<p>Cyfoo: /* Sparse Coding */</p>
<hr />
<div>== Sparse Coding ==<br />
<br />
In this exercise, you will implement [[Sparse Coding: Autoencoder Interpretation | sparse coding]] and [[Sparse Coding: Autoencoder Interpretation | topographic sparse coding]] on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(x + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be wildly inaccurate.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data, as described in the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]]. Mini-batching and initialization of the features <math>s</math> has already been done for you. However, you need to still need to fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. <br />
<br />
Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
[[File:normalSparseCodingFeatures.png]]<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:<br />
<br />
[[File:topographicSparseCodingFeatures.png]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-29T06:54:56Z<p>Cyfoo: /* Step 3: Iterative optimization */</p>
<hr />
<div>== Sparse Coding ==<br />
<br />
In this exercise, you will implement [[sparse coding]] and topographic sparse coding on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(x + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be wildly inaccurate.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data, as described in the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]]. Mini-batching and initialization of the features <math>s</math> has already been done for you. However, you need to still need to fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. <br />
<br />
Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
[[File:normalSparseCodingFeatures.png]]<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:<br />
<br />
[[File:topographicSparseCodingFeatures.png]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-29T06:54:35Z<p>Cyfoo: /* Step 3: Iterative optimization */</p>
<hr />
<div>== Sparse Coding ==<br />
<br />
In this exercise, you will implement [[sparse coding]] and topographic sparse coding on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(x + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be wildly inaccurate.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data, as described in the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding ]]. Mini-batching and initialization of the features <math>s</math> has already been done for you. However, you need to still need to fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. <br />
<br />
Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
[[File:normalSparseCodingFeatures.png]]<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:<br />
<br />
[[File:topographicSparseCodingFeatures.png]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T06:52:24Z<p>Cyfoo: /* Good initialization of s */</p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.<br />
<br />
== Sparse coding in practice ==<br />
<br />
As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.<br />
<br />
Recall the simple iterative algorithm proposed earlier:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence: <br />
<ol><br />
<li>Batching examples into "mini-batches"<br />
<li>Good initialization of <math>s</math><br />
</ol><br />
<br />
=== Batching examples into mini-batches ===<br />
<br />
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch - a (different) random subset of say 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration instead. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence [[(TODO]]: explain why).<br />
<br />
=== Good initialization of <math>s</math> ===<br />
<br />
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optima is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:<br />
<ol><br />
<li>Set <math>s \leftarrow W^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)<br />
<li>For each feature in <math>s</math> (i.e. each column of <math>s</math>), divide the feature by the norm of the corresponding basis vector in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_c</math> is the <math>c</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_c \rVert }.</math><br />
</ol><br />
<br />
Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>Ws \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)<br />
<br />
=== The practical algorithm ===<br />
<br />
With the above two tricks, the algorithm for sparse coding then becomes:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Select a random mini-batch of 2000 patches<br />
<li>Initialize <math>s</math> as described above<br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
With this method, you should be able to reach a good local optima relatively quickly.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T06:51:15Z<p>Cyfoo: /* Batching examples into mini-batches */</p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.<br />
<br />
== Sparse coding in practice ==<br />
<br />
As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.<br />
<br />
Recall the simple iterative algorithm proposed earlier:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence: <br />
<ol><br />
<li>Batching examples into "mini-batches"<br />
<li>Good initialization of <math>s</math><br />
</ol><br />
<br />
=== Batching examples into mini-batches ===<br />
<br />
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch - a (different) random subset of say 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration instead. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence [[(TODO]]: explain why).<br />
<br />
=== Good initialization of <math>s</math> ===<br />
<br />
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optima is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:<br />
<ol><br />
<li>Set <math>s \leftarrow W^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)<br />
<li>For each feature in <math>s</math> (i.e. each column of <math>s</math>), divide the feature by the norm of the corresponding basis in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_c</math> is the <math>c</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_c \rVert }.</math><br />
</ol><br />
<br />
Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>Ws \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)<br />
<br />
=== The practical algorithm ===<br />
<br />
With the above two tricks, the algorithm for sparse coding then becomes:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Select a random mini-batch of 2000 patches<br />
<li>Initialize <math>s</math> as described above<br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
With this method, you should be able to reach a good local optima relatively quickly.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T06:50:25Z<p>Cyfoo: </p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.<br />
<br />
== Sparse coding in practice ==<br />
<br />
As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.<br />
<br />
Recall the simple iterative algorithm proposed earlier:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence: <br />
<ol><br />
<li>Batching examples into "mini-batches"<br />
<li>Good initialization of <math>s</math><br />
</ol><br />
<br />
=== Batching examples into mini-batches ===<br />
<br />
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch - a (different) random subset of 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration instead. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence [[(TODO]]: explain why).<br />
<br />
=== Good initialization of <math>s</math> ===<br />
<br />
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optima is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:<br />
<ol><br />
<li>Set <math>s \leftarrow W^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)<br />
<li>For each feature in <math>s</math> (i.e. each column of <math>s</math>), divide the feature by the norm of the corresponding basis in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_c</math> is the <math>c</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_c \rVert }.</math><br />
</ol><br />
<br />
Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>Ws \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)<br />
<br />
=== The practical algorithm ===<br />
<br />
With the above two tricks, the algorithm for sparse coding then becomes:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Select a random mini-batch of 2000 patches<br />
<li>Initialize <math>s</math> as described above<br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
With this method, you should be able to reach a good local optima relatively quickly.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T06:34:19Z<p>Cyfoo: </p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[Sparse Coding: Autoencoder Interpretation#Sparse coding in practice]] | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.<br />
<br />
== Sparse coding in practice ==<br />
<br />
As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.<br />
<br />
Recall the simple iterative algorithm proposed earlier:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence: <br />
<ol><br />
<li>Batching examples into "mini-batches"<br />
<li>Good initialization of <math>s</math><br />
</ol><br />
<br />
=== Batching examples into mini-batches ===<br />
<br />
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch - a (different) random subset of 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration instead. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence ([[(TODO]]: explain why).<br />
<br />
=== Good initialization of <math>s</math> ===<br />
<br />
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optima is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is to set <math>s \leftarrow W^Tx</math>, where <math>x</math> is the patches in the mini-batch, and then normalize each column of <math>s</math> so that all the feature vectors have norm 1. <br />
<br />
=== The practical algorithm ===<br />
<br />
With the above two tricks, the algorithm for sparse coding then becomes:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Select a random mini-batch of 2000 patches<br />
<li>Initialize <math>s</math> to <math>W^Tx</math>, and normalize each feature (each column of <math>s</math>) to unit norm<br />
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
With this method, you should be able to reach a good local optima relatively quickly.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T05:47:54Z<p>Cyfoo: </p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Find <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in [[Exercise:Sparse Coding | the sparse coding exercise]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition)]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T05:46:07Z<p>Cyfoo: </p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Find <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
Optimizing for this objective function using the iterative method as above, will yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in [[Exercise:Sparse Coding | the sparse coding exercise]]. The gradient computations may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition)]] can be helpful.<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_InterpretationSparse Coding: Autoencoder Interpretation2011-05-29T05:42:47Z<p>Cyfoo: </p>
<hr />
<div>== Sparse coding ==<br />
<br />
In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>. <br />
<br />
[[File:STL_SparseAE.png | 240px]]<br />
<br />
Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.<br />
<br />
Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1<br />
</math><br />
<br />
(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of the <math>x</math> which is equal to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector)<br />
<br />
The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse. <br />
<br />
However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that that for every column <math>A_j</math> of <math>A</math>, <br />
<math>A_j^TA_j \le 1</math>. Our problem is thus:<br />
<br />
:<math><br />
\begin{array}{rcl}<br />
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\<br />
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\<br />
\end{array} <br />
</math><br />
<br />
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we might try alternately optimizing for <math>A</math> for a fixed <math>s</math>, and then optimizing for <math>s</math> given a fixed <math>A</math>. It turns out that this works quite well in practice.<br />
<br />
However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(note that the third term, <math>\lVert A \rVert_2^2</math> is the sum of squares of the entries of A, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>)<br />
<br />
This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, the <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. <br />
<br />
Our final objective function is hence:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)<br />
<br />
This objective function can then be optimized iteratively, using the following procedure:<br />
<ol><br />
<li>Initialize <math>A</math> randomly<br />
<li>Repeat until convergence<br />
<ol><br />
<li>Find <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step<br />
<li>Find <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step <br />
</ol><br />
</ol><br />
<br />
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.<br />
<br />
Optimizing for this objective function using the iterative method as above, will yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. For more practical tips on implementing sparse coding, you may wish to refer to [[Exercise:Sparse Coding | the sparse coding exercise]]. For assistance with deriving the gradients, you may wish to refer to [[Deriving gradients using the backpropagation idea]].<br />
<br />
== Topographic sparse coding ==<br />
<br />
With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal. <br />
<br />
Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent. <br />
<br />
Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of say <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use say <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{3,2}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math> instead, if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.<br />
<br />
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:<br />
<br />
:<math><br />
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2<br />
</math><br />
<br />
(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)<br />
<br />
This objective function can be optimized using the iterated method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-28T07:22:45Z<p>Cyfoo: /* Step 2: Implement and check sparse coding cost functions */</p>
<hr />
<div>== Sparse Coding ==<br />
<br />
In this exercise, you will implement [[sparse coding]] and topographic sparse coding on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(x + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be wildly inaccurate.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data. However, you need to first fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
[[File:normalSparseCodingFeatures.png]]<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:<br />
<br />
[[File:topographicSparseCodingFeatures.png]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-28T07:19:31Z<p>Cyfoo: /* Step 3: Iterative optimization */</p>
<hr />
<div>== Sparse Coding ==<br />
<br />
In this exercise, you will implement [[sparse coding]] and topographic sparse coding on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(s + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be totally incorrect.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data. However, you need to first fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
[[File:normalSparseCodingFeatures.png]]<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:<br />
<br />
[[File:topographicSparseCodingFeatures.png]]</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:NormalSparseCodingFeatures.pngFile:NormalSparseCodingFeatures.png2011-05-28T07:18:31Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/File:TopographicSparseCodingFeatures.pngFile:TopographicSparseCodingFeatures.png2011-05-28T07:18:19Z<p>Cyfoo: </p>
<hr />
<div></div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-28T07:14:31Z<p>Cyfoo: </p>
<hr />
<div>== Sparse Coding ==<br />
<br />
In this exercise, you will implement [[sparse coding]] and topographic sparse coding on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(s + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be totally incorrect.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data. However, you need to first fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/Exercise:Sparse_CodingExercise:Sparse Coding2011-05-28T07:14:18Z<p>Cyfoo: Created page with "Sparse Coding == In this exercise, you will implement sparse coding and topographic sparse coding on black-and-white natural images. In the file <tt>[http://ufldl.stanford..."</p>
<hr />
<div>Sparse Coding ==<br />
<br />
In this exercise, you will implement [[sparse coding]] and topographic sparse coding on black-and-white natural images. <br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/sparse_coding_exercise.zip sparse_coding_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>sparseCodingWeightCost.m</tt>''', '''<tt>sparseCodingFeatureCost.m</tt>''' and '''<tt>sparseCodingExercise.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
You will need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
* <tt>display_network.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used for the exercise.<br />
<br />
=== Step 1: Sample patches ===<br />
<br />
In this step, we sample some patches from the <tt>IMAGES.mat</tt> dataset comprising 10 black-and-white pre-whitened natural images.<br />
<br />
=== Step 2: Implement and check sparse coding cost functions ===<br />
<br />
In this step, you should implement the two sparse coding cost functions: <br />
<br />
<ol><br />
<li><tt>sparseCodingWeightCost</tt> in <tt>sparseCodingWeightCost.m</tt>, which is used for optimizing the weight cost given the features<br />
<li><tt>sparseCodingFeatureCost</tt> in <tt>sparseCodingFeatureCost.m</tt>, which is used for optimizing the feature cost given the weights<br />
</ol><br />
<br />
Each of these functions should compute the appropriate cost and gradient. You may wish to implement the non-topographic version of <tt>sparseCodingFeatureCost</tt> first, ignoring the grouping matrix and assuming that none of the features are grouped. You can then extend this to the topographic version later. Alternatively, you may implement the topographic version directly - using the non-topographic version will then involve setting the grouping matrix to the identity matrix.<br />
<br />
Once you have implemented these functions, you should check the gradients numerically. <br />
<br />
'''Implementation tip''' - gradient checking the feature cost. One particular point to note is that when checking the gradient for the feature cost, <tt>epsilon</tt> should be set to a larger value, for instance <tt>1e-2</tt> (as has been done for you in the checking code provided), to ensure that checking the gradient numerically makes sense. This is necessary because as <tt>epsilon</tt> becomes smaller, the function <tt>sqrt(s + epsilon)</tt> becomes "sharper" and more "pointed", making the numerical gradient computed near 0 less and less accurate. To see this, consider what would happen if the numerical gradient was computed by using a point with x less than 0 and a point with x greater than 0 - the computed numerical slope would be totally incorrect.<br />
<br />
=== Step 3: Iterative optimization ===<br />
<br />
In this step, you will iteratively optimize for the weights and features to learn a basis for the data. However, you need to first fill in the analytic solution to the the optimization problem with respect to the weight matrix, given the feature matrix. Once that is done, you should check that your solution is correct using the given checking code, which checks that the gradient at the point determined by your analytic solution is close to 0. Once your solution has been verified, comment out the checking code, and run the iterative optimization code. 200 iterations should take less than 45 minutes to run, and by 100 iterations you should be able to see bases that look like edges, similar to those you learned in [[Exercise:Sparse Autoencoder | the sparse autoencoder exercise]]. <br />
<br />
For the non-topographic case, these features will not be "ordered", and will look something like the following:<br />
<br />
For the topographic case, the features will be "ordered topographically", and will look something like the following:</div>Cyfoohttp://deeplearning.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-28T06:52:45Z<p>Cyfoo: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. <br />
<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
[[Useful Links]]<br />
<br />
'''Advanced Topics''':<br />
<br />
'''Sparse Coding'''<br />
* [[Sparse Coding]]<br />
* [[Sparse Coding: Autoencoder Interpretation]]<br />
* [[Exercise:Sparse Coding]]<br />
<br />
'''ICA Style Models'''<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Cyfoo