Deep Networks: Overview
The main learning algorithm that researchers were using was to randomly initialize
the weights of a deep network, and then train it using a labeled
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math>
using a supervised learning objective, for example by applying gradient descent to try to
drive down the training error. However, this usually did not work well.
There were several reasons for this.
With the method described above, one relies only on
labeled data for training. However, labeled data is often scarce, and thus for many
problems it is difficult to get enough examples to fit the parameters of a
complex model. Moreover, given the high degree of expressive power of deep networks,
training on insufficient data would also result in overfitting.
===Local optima===
Training a shallow network (with 1 hidden layer) using
supervised learning usually resulted in the parameters converging to reasonable values;
but when we are training a deep network, this works much less well.
In particular, training a neural network using supervised learning
involves solving a highly non-convex optimization problem (say, minimizing the
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a
function of the network parameters <math>\textstyle W</math>).
In a deep network, this problem turns out to be rife with bad local optima, and
training with gradient descent (or methods like conjugate gradient and L-BFGS)
no longer works well.
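As a concrete illustration of this objective (this sketch is not part of the original tutorial, and the layer sizes and data are invented), the following Python code minimizes the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> for a tiny one-hidden-layer network by plain batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((20, 3))        # 20 examples, 3 input features (made up)
Y = rng.standard_normal((20, 2))        # 2 output units (made up)

W1 = 0.5 * rng.standard_normal((3, 4))  # input -> hidden weights
W2 = 0.5 * rng.standard_normal((4, 2))  # hidden -> output weights

def forward(X, W1, W2):
    H = np.tanh(X @ W1)                 # hidden-layer activations
    return H, H @ W2                    # H and the network output h_W(x)

def training_error(X, Y, W1, W2):
    _, out = forward(X, W1, W2)
    return np.sum((out - Y) ** 2)       # sum_i ||h_W(x^(i)) - y^(i)||^2

lr = 0.01
err_before = training_error(X, Y, W1, W2)
for _ in range(200):                    # batch gradient descent
    H, out = forward(X, W1, W2)
    d_out = 2 * (out - Y)               # d(error)/d(output)
    gW2 = H.T @ d_out
    d_hidden = (d_out @ W2.T) * (1 - H ** 2)  # tanh'(z) = 1 - tanh(z)^2
    gW1 = X.T @ d_hidden
    W1 -= lr * gW1
    W2 -= lr * gW2
err_after = training_error(X, Y, W1, W2)
```

For a network this shallow, the descent reliably drives the error down; the point of this section is that the same procedure becomes much less reliable as layers are added.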
===Diffusion of gradients===
There is an additional technical reason,
pertaining to the gradients becoming very small, that explains why gradient
descent (and related algorithms like L-BFGS) do not work well on a deep network
with randomly initialized weights. Specifically, when using backpropagation to
compute the derivatives, the gradients that are propagated backwards (from the
output layer to the earlier layers of the network) rapidly diminish in
magnitude as the depth of the network increases. As a result, the derivative of
the overall cost with respect to the weights in the earlier layers is very small.
randomly initialized ends up giving similar performance to training a
shallow network (the last few layers) on corrupted input (the result of
the processing done by the earlier layers).
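The shrinkage is easy to see numerically. The sketch below (not from the original tutorial; the depth, width, and weight scale are invented) pushes a gradient backwards through a stack of sigmoid layers and records its norm at each layer; since <math>\textstyle \sigma'(z) \le 0.25</math>, each layer multiplies the backpropagated signal by a factor well below 1:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 8, 10
x = rng.standard_normal(width)
Ws = [0.5 * rng.standard_normal((width, width)) for _ in range(depth)]

# Forward pass, remembering each layer's pre-activation.
a, zs = x, []
for W in Ws:
    z = W @ a
    zs.append(z)
    a = sigmoid(z)

# Backward pass: start from a unit-norm gradient at the output and record
# the gradient's norm after passing through each earlier layer.
g = np.ones(width) / np.sqrt(width)
norms = []
for W, z in zip(reversed(Ws), reversed(zs)):
    s = sigmoid(z)
    g = (g * s * (1 - s)) @ W      # chain rule through one sigmoid layer
    norms.append(np.linalg.norm(g))

# norms[0] is the gradient norm near the output; norms[-1] near the input.
```

With these (arbitrary) settings the gradient reaching the first layer is orders of magnitude smaller than the one at the last layer, which is why the earliest layers barely move under gradient descent.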
== Greedy layer-wise training ==
How can we train a deep network? One method that has seen some
success is the '''greedy layer-wise training''' method. We describe this
method in detail in later sections, but briefly, the main idea is to train the
layers of the network one at a time, so that we first train a network with 1
hidden layer, and only after that is done, train a network with 2 hidden layers,
and so on. At each step, we take the old network with <math>k-1</math> hidden
layers, and add an additional <math>k</math>-th hidden layer (that takes as
input the previous hidden layer <math>k-1</math> that we had just
trained). Training each layer can be
supervised (say, with classification error as the objective function on each
step), but more frequently it is
unsupervised (as in an autoencoder; details to be provided later).
The weights from training the layers individually are then used to initialize the weights
in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e.,
trained together to optimize the labeled training set error).
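The procedure can be sketched in a few lines of Python. This is only a schematic illustration, not the method described later in the tutorial: each "autoencoder" here is a simplified stand-in (a linear, tied-weight autoencoder fit by gradient descent), and all sizes and data are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_autoencoder(data, hidden_size, steps=200, lr=0.01):
    """Fit W so that data @ W @ W.T approximately reconstructs data."""
    n_in = data.shape[1]
    W = 0.1 * rng.standard_normal((n_in, hidden_size))
    for _ in range(steps):
        code = data @ W                 # encode
        recon = code @ W.T              # decode with tied weights
        err = recon - data
        # Gradient (up to a constant factor) of the squared
        # reconstruction error with respect to the tied weights W.
        grad = data.T @ (err @ W) + (err.T @ data) @ W
        W -= lr * grad / len(data)
    return W

X = rng.standard_normal((50, 8))        # unlabeled training data (made up)

layer_sizes = [8, 6, 4]                 # input, hidden layer 1, hidden layer 2
weights, features = [], X
for k in range(1, len(layer_sizes)):
    # Greedy step: train the k-th layer alone on the previous layer's output.
    W = train_autoencoder(features, layer_sizes[k])
    weights.append(W)
    features = features @ W             # its codes feed the next layer

# `weights` would now initialize the deep network before fine-tuning.
```

The key structural point is that each call to `train_autoencoder` sees only the features produced by the layers already trained, so no gradient ever has to travel through the whole stack during pretraining.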
+ | |||
+ | The success of greedy | ||
layer-wise training has been attributed to a number of factors: | layer-wise training has been attributed to a number of factors: | ||
classification layer that maps to the outputs/predictions), our algorithm is
able to learn and discover patterns from far more data than
purely supervised approaches. This often results in much better classifiers
being learned.
===Better local optima===
After having trained the network
on the unlabeled data, the weights are now starting at a better location in
parameter space than if they had been randomly initialized. We can then
further fine-tune the weights starting from this location. Empirically, it
turns out that gradient descent from this location is much more likely to
lead to a good local minimum, because the unlabeled data has already provided
a significant amount of "prior" information about what patterns there
are in the input data.
In the next section, we will describe the specific details of how to go about
implementing greedy layer-wise training.

{{CNN}}
+ | |||
+ | |||
+ | {{Languages|深度网络概览|中文}} |