Deep Networks: Overview
== Overview ==
In the earlier sections, you constructed a 3-layer neural network comprising an input, a hidden, and an output layer. In general, a neural network can be viewed as a way to compute a complicated function of its inputs; for example, computing the predicted labels (e.g., 0-9) of some inputs (e.g., MNIST digits). By adding more hidden layers, the neural network gains more expressive power, allowing us to represent more complicated functions. A neural network with multiple layers of nonlinearities is often referred to as a '''deep network'''.
It is important for deep networks to have non-linear functions at each layer, since a composition of multiple linear layers is itself just a single linear function.
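To see why, here is a quick numerical check (a minimal NumPy sketch; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two stacked linear layers with no nonlinearity in between...
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
two_layers = W2 @ (W1 @ x)

# ...compute exactly the same function as one linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x

assert np.allclose(two_layers, one_layer)
```

No matter how many linear layers are stacked, the composition collapses to one matrix, so depth buys nothing without nonlinearities.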
== Advantages of deep networks ==
===Expressiveness and compactness===
Deep networks can model a greater range of functions than shallow networks. Further, with deep networks these functions can be modeled with fewer components (neurons, in the case of neural networks) than in equivalent shallow networks. In fact, there are functions that a k-layer network can represent compactly (with a number of components ''polynomial'' in the number of inputs), but that a (k-1)-layer network cannot (requiring a number of components ''exponential'' in the number of inputs).
For example, in a boolean network in which alternate layers implement the logical OR and logical AND of the preceding layer, the parity function requires an exponential number of components to be represented in a 2-layer network, but only a polynomial number of components in a network of sufficient depth.
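As a rough illustration (a toy Python sketch, not a circuit-complexity proof), parity becomes easy once depth is allowed, since each two-input XOR expands into a constant-size AND/OR/NOT sub-network:

```python
from functools import reduce

def xor(a, b):
    # A two-input XOR built from OR, AND and NOT, i.e. a constant-size
    # sub-network of the alternating AND/OR layers described above
    return (a or b) and not (a and b)

def parity(bits):
    # Chaining XORs gives a deep network whose size grows only
    # linearly in the number of inputs
    return int(reduce(xor, bits))

assert parity([1, 0, 1, 1]) == 1  # odd number of 1s
assert parity([1, 0, 0, 1]) == 0  # even number of 1s
```

Flattening this chain into only two AND/OR layers, by contrast, forces an exponential blow-up in the number of components.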
Informally, one way a deep network helps in representing functions compactly is through ''factorization''. Factorization, as the name suggests, occurs when the network represents at lower layers functions of the input that are then reused multiple times at higher layers. To gain some intuition for this, consider an arithmetic network for computing the values of polynomials, in which alternate layers implement addition and multiplication. In this network, an intermediate layer could compute the values of terms that are then used repeatedly in the next higher layer, the results of which are used repeatedly in the layer above that, and so on.
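A toy example of such reuse (the variable names and the particular polynomial are invented for this sketch):

```python
# Toy arithmetic "network": layer 1 computes a shared term once,
# and higher layers reuse it instead of recomputing it.
x, y, z = 2.0, 3.0, 4.0

t = x * y              # layer 1 (multiplication): shared term

s1 = t + z             # layer 2 (addition): t is reused...
s2 = t + x             # ...in two different sums

p = s1 * s2            # layer 3 (multiplication): p = (x*y + z) * (x*y + x)

# Expanding p into monomials instead would recompute x*y several times
assert p == (x * y + z) * (x * y + x)
```

A shallow (fully expanded) representation of the same polynomial duplicates the shared term in every monomial, which is exactly the compactness the deep network avoids losing.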
== Difficulty of training deep architectures ==
While the benefits of deep networks in terms of their compactness and expressive power have been appreciated for many decades, before 2006 researchers had little success in training deep architectures. Training a randomly initialized deep network to optimize a classification objective (e.g., softmax regression) often led to poor results.
===Why random initialization fails===

====Availability of data====
With random initialization, one relies only on labeled data for training. However, labeled data is often scarce, and thus it is easy to overfit the training data and obtain a model which does not generalize well.
====Local optima====
Training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i \left\| h_W(x^{(i)}) - y^{(i)} \right\|^2</math> as a function of the network parameters <math>\textstyle W</math>). The optimization problem can therefore be rife with local optima, and training with gradient descent (or methods such as conjugate gradient and L-BFGS) does not work well.
+ | |||
+ | <!-- The optimization landscape for the objective function used in deep architectures is likely to be non-convex and filled with local optima. A gradient-based training method starting from a random location is hence likely to end up in an undesirable local optima in the neighborhood of the starting location. --> | ||
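One simple symptom of this non-convexity can be seen directly (a minimal NumPy sketch of the objective above; the network sizes are arbitrary): permuting the hidden units changes the parameters but not the function, so the objective has many symmetric optima and cannot be convex with a unique minimum.

```python
import numpy as np

def h_W(W1, W2, x):
    """Forward pass of a small two-layer network with sigmoid hidden units
    (a minimal sketch of h_W from the objective above)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    return W2 @ sigmoid(W1 @ x)

def training_error(W1, W2, X, Y):
    """Sum over examples of ||h_W(x) - y||^2."""
    return sum(np.sum((h_W(W1, W2, x) - y) ** 2) for x, y in zip(X, Y))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(5, 2))

# Permuting hidden units (rows of W1, matching columns of W2) changes the
# parameters but leaves the function, and hence the objective, unchanged.
perm = [1, 0, 3, 2]
assert np.isclose(training_error(W1, W2, X, Y),
                  training_error(W1[perm], W2[:, perm], X, Y))
```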
====Diffusion of gradients====
In the case of deep neural networks using backpropagation for training, the gradient propagated backwards to earlier layers rapidly diminishes as the depth of the network increases. As a result, the weights of the earlier layers change slowly, and the earlier layers fail to learn much.
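A small NumPy simulation makes this concrete (the depth, width, and weight scale are arbitrary choices for the sketch):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)
depth, width = 20, 10

# Forward pass through `depth` sigmoid layers, keeping each layer's output
Ws, acts = [], [rng.normal(size=width)]
for _ in range(depth):
    W = rng.normal(scale=0.3, size=(width, width))
    Ws.append(W)
    acts.append(sigmoid(W @ acts[-1]))

# Backward pass: the chain rule multiplies by W^T and sigma'(z) = a*(1-a)
# at every layer, so the gradient shrinks as it travels backwards
grad = np.ones(width)
norms = []
for W, a in zip(reversed(Ws), reversed(acts[1:])):
    grad = W.T @ (grad * a * (1.0 - a))
    norms.append(np.linalg.norm(grad))

# norms[0] is just below the output; norms[-1] is at the earliest layer
assert norms[-1] < norms[0]
```

Since the sigmoid derivative is at most 0.25, every layer the gradient passes through can shrink it further, and with 20 layers the signal reaching the earliest weights is many orders of magnitude smaller.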
Related to the problem of diffusion of gradients, if the last few layers in a neural network have a large enough number of neurons, it may be possible for them to model the target function alone, without the help of the earlier layers. Hence, training the entire network at once with all the layers randomly initialized is similar to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers). When the last layer is used for classification, the training error may be low while the test error remains high, suggesting that the last few layers are over-fitting the training data.
===Greedy layer-wise training===
How should deep architectures be trained then? One method that has seen some success is the '''greedy layer-wise training''' method. In this method, the layers of the architecture are trained one at a time, with the input being the output of the previous layer (which has been trained). Training can either be supervised (say, with classification error as the objective function), or unsupervised (say, with the error of the layer in reconstructing its input as the objective function, as in an autoencoder). The weights from training the layers individually are then used to initialize the weights in the deep architecture, and only then is the entire architecture '''fine-tuned''', that is, trained together. The success of greedy layer-wise training has been attributed to a number of factors:
+ | |||
+ | ====Availability of data==== | ||
+ | |||
While labeled data can be expensive to obtain, unlabeled data is cheap and plentiful. The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models. Only the fine-tuning step requires labeled data. In practice, by using unlabeled data to learn a good initial value for the weights <math>\textstyle W^{(l)}</math> in all the layers, we usually obtain much better classifiers after fine-tuning.
====Regularization and better local optima====
- | |||
- | + | Furthermore, training starts at a better location than when the weights are randomly initialized, vastly increasing the likelihood of obtaining a better local optima. Specifically, since the weights of the layers have already been initialized to reasonable values, the final solution tends to be near the good initial solution, forming a useful "regularization" effect. (more details in Erhan et al., 2010). | |
- | + | ||
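The greedy layer-wise procedure described above can be sketched in outline (a hypothetical skeleton using unsupervised autoencoder pre-training; `train_autoencoder` is a stub whose optimization details are elided):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden):
    """Placeholder for training one autoencoder layer on inputs X
    (the gradient steps on reconstruction error are elided);
    returns the layer's encoder weights."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))
    # ... optimize W to minimize reconstruction error of X ...
    return W

def greedy_pretrain(X, layer_sizes):
    """Train layers one at a time, feeding each layer the output of the
    previously trained layer; the returned weights then initialize the
    deep network before supervised fine-tuning."""
    weights, inputs = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(inputs, n_hidden)
        weights.append(W)
        inputs = sigmoid(inputs @ W.T)   # this layer's output feeds the next
    return weights
```

The key structural point is the loop: each layer is trained only on the output of the already-trained layers beneath it, and supervised fine-tuning of the whole stack happens only after this initialization.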
== References ==
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning? AISTATS 2010. [http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]