Independent Component Analysis

As is usually the case in deep learning, this problem has no simple analytic solution, and to make matters worse, the orthonormality constraint makes it slightly more difficult to optimize for the objective using gradient descent - every iteration of gradient descent must be followed by a step that maps the new basis back to the space of orthonormal bases (hence enforcing the constraint).

In practice, optimizing for the objective function while enforcing the orthonormality constraint (as described in the [[Independent Component Analysis#Orthonormal ICA | Orthonormal ICA]] section below) is feasible but slow. Hence, the use of orthonormal ICA is limited to situations where it is important to obtain an orthonormal basis ([[TODO]]: what situations). In other situations, the orthonormality constraint is instead replaced by a reconstruction cost, giving [[Independent Component Analysis#Reconstruction ICA | Reconstruction ICA]], which is very similar to [[Sparse Coding: Autoencoder Interpretation | sparse coding]] but no longer strictly finds independent components.
=== Orthonormal ICA ===
Observe that the constraint <math>WW^T = I</math> implies two other constraints.

Firstly, since we are learning an orthonormal basis, the number of basis vectors we learn must be less than the dimension of the input. In particular, this means that we cannot learn over-complete bases as we usually do in [[Sparse Coding: Autoencoder Interpretation | sparse coding]].

Secondly, the data must be [[Whitening | ZCA whitened]] with no regularization (that is, with <math>\epsilon</math> set to 0). ([[TODO]] Why must this be so?)

Hence, before we even begin to optimize for the orthonormal ICA objective, we must ensure that our data has been '''whitened''', and that we are learning an '''under-complete''' basis.
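For concreteness, a minimal sketch of such whitening in Python/NumPy might look as follows (assuming each column of <code>X</code> is one example, and with <math>\epsilon</math> simply omitted rather than set to a small positive value):

<pre>
import numpy as np

def zca_whiten(X):
    # ZCA whitening with epsilon = 0 (no regularization), as required above.
    # X is n-by-m: each column is one example.
    X = X - X.mean(axis=1, keepdims=True)             # zero-mean each feature
    sigma = X @ X.T / X.shape[1]                      # covariance matrix
    U, S, _ = np.linalg.svd(sigma)                    # eigenvectors U, eigenvalues S
    return U @ np.diag(1.0 / np.sqrt(S)) @ U.T @ X    # x_ZCAwhite = U S^(-1/2) U^T x
</pre>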
Following that, to optimize for the objective, we can use gradient descent, interspersing gradient descent steps with projection steps to enforce the orthonormality constraint. Hence, the procedure will be as follows:

Repeat until done:
<ol>
<li><math>W \leftarrow W - \alpha \nabla_W \lVert Wx \rVert_1</math>
<li><math>W \leftarrow \operatorname{proj}_U W</math> where <math>U</math> is the space of matrices satisfying <math>WW^T = I</math>
</ol>
In practice, the learning rate <math>\alpha</math> is varied using a line-search algorithm to speed up the descent, and the projection step is achieved by setting <math>W \leftarrow (WW^T)^{-\frac{1}{2}} W</math>, which can actually be seen as ZCA whitening ([[TODO]] explain how it is like ZCA whitening).
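A minimal sketch of this procedure in Python/NumPy might look as follows (here a small fixed step size <code>alpha</code> stands in for the line search mentioned above, and <code>X</code> is assumed to be ZCA-whitened data with one example per column):

<pre>
import numpy as np

def project_orthonormal(W):
    # Projection step: W <- (W W^T)^(-1/2) W, which maps W back onto
    # the set of matrices satisfying W W^T = I.
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ W

def orthonormal_ica(X, n_features, alpha=0.01, n_iters=1000):
    # X is ZCA-whitened data, one example per column; the basis must be
    # under-complete, so n_features may not exceed the input dimension.
    n_inputs, n_examples = X.shape
    assert n_features <= n_inputs
    W = project_orthonormal(np.random.randn(n_features, n_inputs))
    for _ in range(n_iters):
        # (Sub)gradient of ||W X||_1, averaged over examples for stability.
        grad = np.sign(W @ X) @ X.T / n_examples
        # Gradient step followed by the projection step.
        W = project_orthonormal(W - alpha * grad)
    return W
</pre>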
=== Reconstruction ICA ===
:<math>
\begin{array}{rcl}
     {\rm minimize} & \lVert Wx \rVert_1 + \lVert W^TWx - x \rVert_2^2 \\
\end{array}
</math>
where <math>W^T(Wx)</math> is the reconstruction of the input from the features (i.e. <math>W^T</math> is supposed to play the role of <math>W^{-1}</math>).
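One possible way to evaluate this objective and its gradient is sketched below in Python/NumPy; the smoothing constant <code>eps</code>, used to make the <math>L_1</math> term differentiable, is an illustrative choice rather than part of the objective above:

<pre>
import numpy as np

def rica_cost_grad(W, X, eps=1e-2):
    # Cost and gradient of  ||W X||_1 + ||W^T W X - X||_2^2,  with the L1 term
    # smoothed as sqrt((W X)^2 + eps) so that it is differentiable everywhere.
    # X holds one whitened example per column; eps is an illustrative choice.
    Z = W @ X                        # features W x
    R = W.T @ Z - X                  # reconstruction residual W^T(W x) - x
    smooth = np.sqrt(Z ** 2 + eps)   # smooth approximation of |W x|
    cost = smooth.sum() + (R ** 2).sum()
    grad = (Z / smooth) @ X.T + 2.0 * (Z @ R.T + W @ R @ X.T)
    return cost, grad
</pre>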
If you recall the [[Sparse Coding: Autoencoder Interpretation | sparse coding]] objective,
:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1
</math>
=== Topographic ICA ===

Just like [[Sparse Coding: Autoencoder Interpretation | sparse coding]], independent component analysis can be modified to give a topographic variant by adding a topographic cost term.
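For illustration only, one plausible form of such a cost term (borrowing the grouping-matrix construction from topographic sparse coding; the matrix <code>V</code> and the constant <code>eps</code> are assumptions, not definitions from this page) pools the squared features over neighbouring groups:

<pre>
import numpy as np

def topographic_cost(W, X, V, eps=1e-2):
    # Illustrative topographic penalty (hypothetical construction): pool the
    # squared features (W x)^2 over the neighbourhoods encoded by a grouping
    # matrix V, then take a square root, mirroring the grouping-matrix term
    # used for topographic sparse coding.
    Z2 = (W @ X) ** 2                    # squared features, one column per example
    return np.sqrt(V @ Z2 + eps).sum()   # summed group-sparsity cost
</pre>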
