== Sparse Coding ==
Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:

:<math>\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i}
\end{align}</math>

While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an '''over-complete''' set of basis vectors to represent input vectors <math>\mathbf{x}\in\mathbb{R}^n</math> (i.e. such that <math>k > n</math>). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients <math>a_i</math> are no longer uniquely determined by the input vector <math>\mathbf{x}</math>. Therefore, in sparse coding, we introduce the additional criterion of '''sparsity''' to resolve the degeneracy introduced by over-completeness.

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients <math>a_i</math> be sparse means that given an input vector, we would like as few of our coefficients as possible to be far from zero. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications, such as comparisons to the properties of the primary visual cortex, have also been advanced.

We define the sparse coding cost function on a set of <math>m</math> input vectors as

:<math>\begin{align}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}</math>

where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math>, and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant that determines the relative importance of these two contributions.

Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>S(a_i)=\left|a_i\right|_1</math> and the log penalty <math>S(a_i)=\log(1+a_i^2)</math>.
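
To make these penalty choices concrete, here is a minimal NumPy sketch of the three sparsity measures just discussed. The function names are ours, chosen for illustration; they do not come from any particular library.

<pre>
import numpy as np

def l0_penalty(a, tol=1e-8):
    # "L0" norm: counts coefficients that are not (numerically) zero.
    return np.sum(np.abs(a) > tol)

def l1_penalty(a):
    # L1 penalty: sum of |a_i|; convex but non-differentiable at zero.
    return np.sum(np.abs(a))

def log_penalty(a):
    # Log penalty: sum of log(1 + a_i^2); smooth, so gradient methods apply.
    return np.sum(np.log1p(a ** 2))
</pre>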

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is

:<math>\begin{array}{rc}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\\
\text{subject to} & \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k
\\
\end{array}</math>

== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==
So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model.

Consider the problem of modelling natural images as the linear superposition of <math>k</math> independent source features <math>\mathbf{\phi}_i</math> with some additive noise <math>\nu</math>:
:<math>\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
\end{align}</math>
Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, where the KL divergence is defined as:

:<math>\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
\end{align}</math>

Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.
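
This equivalence is easy to check numerically on a toy discrete example, since <math>D(P^*||Q) = -H(P^*) - E_{P^*}[\log Q]</math> and the entropy term does not depend on <math>Q</math>. The distributions below are arbitrary toy choices.

<pre>
import numpy as np

p_star = np.array([0.5, 0.3, 0.2])          # "empirical" distribution

def kl(p, q):
    return np.sum(p * np.log(p / q))

candidates = [np.array([0.6, 0.2, 0.2]), np.array([0.5, 0.25, 0.25])]
kls = [kl(p_star, q) for q in candidates]
lls = [np.sum(p_star * np.log(q)) for q in candidates]
# The candidate with the smaller KL divergence has the larger
# expected log-likelihood under p_star.
assert np.argmin(kls) == np.argmax(lls)
</pre>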

Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that
:<math>\begin{align}
P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)
\end{align}</math>

In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as

:<math>\begin{align}
P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
\end{align}</math>

At this point, we would like to incorporate our sparsity assumption: the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of <math>a_i</math> to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is

:<math>\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}</math>

where <math>S(a_i)</math> is a function determining the shape of the prior distribution.

Having defined <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> and <math>P(\mathbf{a})</math>, we can write the probability of the data <math>\mathbf{x}</math> under the model defined by <math>\mathbf{\phi}</math> as

:<math>\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}</math>

and our problem reduces to finding

:<math>\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}</math>

where <math><.></math> denotes expectation over our input data.

Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note, though, that if the distribution <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is sufficiently peaked (with respect to <math>\mathbf{a}</math>), we can approximate its integral with the maximum value of the integrand and obtain the approximate solution

:<math>\begin{align}
\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}</math>

As before, we may increase the estimated probability by scaling down <math>a_i</math> and scaling up <math>\mathbf{\phi}</math> (since <math>P(a_i)</math> peaks about zero), so we impose a norm constraint on our features <math>\mathbf{\phi}</math> to prevent this.

Finally, we can recover our original cost function by defining the energy function of this linear generative model

:<math>\begin{array}{rl}
E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a})P(\mathbf{a}) \right) \\
 &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{array}</math>

where <math>\lambda = 2\sigma^2\beta</math> and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:

:<math>\begin{align}
\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}</math>

Using a probabilistic approach, it can also be seen that the choices of the <math>L_1</math> penalty <math>\left|a_i\right|_1</math> and the log penalty <math>\log(1+a_i^2)</math> for <math>S(.)</math> correspond to the use of the Laplacian <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> and the Cauchy prior <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> respectively.
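
This correspondence can be checked directly: up to additive constants, the negative log of each prior equals the matching penalty (scaled by <math>\beta</math> in the Laplacian case). A small sketch, with <math>\beta</math> chosen arbitrarily:

<pre>
import numpy as np

beta = 2.0
a = np.linspace(-3, 3, 7)

# Laplacian prior: -log P(a) = beta * |a| + const  ->  L1 penalty.
neg_log_laplace = beta * np.abs(a)

# Cauchy prior: -log P(a) = log(1 + a^2) + const  ->  log penalty.
neg_log_cauchy = np.log1p(a ** 2)

# Each differs from the corresponding penalty only by an additive constant,
# which does not change the location of the minimizer.
</pre>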

== Learning ==
Learning a set of basis vectors <math>\mathbf{\phi}</math> using sparse coding consists of performing two separate optimizations: the first is an optimization over the coefficients <math>a_i</math> for each training example <math>\mathbf{x}</math>, and the second is an optimization over the basis vectors <math>\mathbf{\phi}</math> across many training examples at once.

Assuming an <math>L_1</math> sparsity penalty, learning <math>a^{(j)}_i</math> reduces to solving an <math>L_1</math> regularized least squares problem, which is convex in <math>a^{(j)}_i</math> and for which several techniques have been developed (convex optimization software such as CVX can also be used to perform <math>L_1</math> regularized least squares). Assuming a differentiable <math>S(.)</math> such as the log penalty, gradient-based methods such as conjugate gradient methods can also be used.

Learning a set of basis vectors with an <math>L_2</math> norm constraint also reduces to a least squares problem with quadratic constraints, which is convex in <math>\mathbf{\phi}</math>. Standard convex optimization software (e.g. CVX) or other iterative methods can be used to solve for <math>\mathbf{\phi}</math>, although significantly more efficient methods such as solving the Lagrange dual have also been developed. A sketch of this alternating procedure is given below.

As described above, a significant limitation of sparse coding is that even after a set of basis vectors has been learnt, optimization must still be performed to "encode" each new data example, that is, to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive even at test time, especially compared to typical feedforward architectures.
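
For concreteness, here is a minimal NumPy sketch of the alternating optimization described above, assuming the <math>L_1</math> penalty. It uses a few ISTA (proximal gradient) steps for the coefficients and a projected gradient step for the basis; the function name, step sizes, and iteration counts are illustrative assumptions rather than tuned values, and a production implementation would use the more efficient solvers mentioned above.

<pre>
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def learn_sparse_code(X, k, lam=0.1, C=1.0, outer_iters=50, inner_iters=30):
    # X: (n, m) data matrix; returns Phi: (n, k) basis and A: (k, m) codes.
    n, m = X.shape
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((n, k))
    Phi /= np.linalg.norm(Phi, axis=0)
    A = np.zeros((k, m))
    for _ in range(outer_iters):
        # (1) Coefficient step: ISTA on the L1-regularized least squares,
        #     with step size set by the spectral norm of Phi.
        L = np.linalg.norm(Phi, 2) ** 2
        for _ in range(inner_iters):
            grad = Phi.T @ (Phi @ A - X)
            A = soft_threshold(A - grad / L, lam / (2 * L))
        # (2) Basis step: one gradient step on the reconstruction error,
        #     then project columns back onto ||phi_i||^2 <= C.
        grad_Phi = (Phi @ A - X) @ A.T
        L_phi = np.linalg.norm(A, 2) ** 2 + 1e-12
        Phi = Phi - grad_Phi / L_phi
        norms_sq = np.sum(Phi ** 2, axis=0)
        Phi *= np.minimum(1.0, np.sqrt(C / np.maximum(norms_sq, 1e-12)))
    return Phi, A
</pre>

Given image patches stacked as columns of <code>X</code>, a call such as <code>Phi, A = learn_sparse_code(X, k=128)</code> returns the learnt basis and the sparse codes; note that encoding any new example still requires rerunning the coefficient optimization, which is exactly the runtime cost discussed above.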