Deep learning has attracted a great deal of attention over the last five years. Its popularity has been explained by different factors, most often summarized as the development of GPUs and very large supervised datasets. Like Big Data, deep learning needs a big theory too. In this blog post, we focus on the lack of theoretical explanations for deep nets, and on some sources of inspiration for understanding more deeply why deep works.
Deep Learning advances
Deep learning enthusiasm is built on the applied research of a huge community, coming from both academia and industry. In this ecosystem, three researchers have become very popular: Yoshua Bengio (Université de Montréal), Geoffrey Hinton (Google) and Yann LeCun (Facebook). These three authors co-published a review on deep learning in the journal Nature (part of a "Nature Insight" supplement on Machine Intelligence, together with another interesting paper by Michael Littman about Reinforcement Learning).
This increasing attention has made neural networks, and convolutional neural nets in particular, very popular and manipulated by a huge community of coders. Test-and-learn engineering has been carried out in several places around the world, from Toronto and New York to Paris and San Francisco. Nowadays, particular architectures (stacks of convolutions and ReLUs, or recurrent networks) are widely used and a kind of genericity has emerged: the same architecture, or the same cascade of operators, can be used for very different tasks, ranging from image classification to text mining or time series forecasting. Another astonishing property of these deep nets is their transferability: a deep net trained for image classification can be reused to perform deep art. In other words, after five years of intensive research and engineering, complex tasks (see below) have become tractable with a few lines of code.
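To make the transferability idea concrete, here is a minimal NumPy sketch (a toy stand-in, not an actual image pipeline): a one-hidden-layer net is "pretrained" on task A, its hidden layer is then frozen, and only a new linear head is fitted for a different task B. All data, sizes, and learning rates below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Task A and task B: two different labeling rules over the same inputs,
# toy stand-ins for "image classification" and a new downstream task.
n, dim, width = 500, 20, 64
X = rng.normal(size=(n, dim))
w_a, w_b = rng.normal(size=dim), rng.normal(size=dim)
y_a = np.sign(X @ w_a)
y_b = np.sign(X @ w_b)

# "Pretrain" a one-hidden-layer net on task A with plain gradient descent
# on the squared loss (tanh activations keep the gradients simple).
W1 = 0.1 * rng.normal(size=(dim, width))
w2 = np.zeros(width)
lr = 0.05
for _ in range(300):
    H = np.tanh(X @ W1)
    err = H @ w2 - y_a
    w2 -= lr * (H.T @ err) / n
    W1 -= lr * (X.T @ ((err[:, None] * w2) * (1 - H**2))) / n

# Transfer: freeze the learned representation W1 and train only a new
# linear head for task B (here by least squares).
H = np.tanh(X @ W1)
head_b, *_ = np.linalg.lstsq(H, y_b, rcond=None)
acc_b = np.mean(np.sign(H @ head_b) == y_b)
print(f"task-B training accuracy with frozen features: {acc_b:.2f}")
```

The same mechanism, at a much larger scale, is what lets a net trained on ImageNet serve as a feature extractor for style transfer or other downstream tasks.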
Old stuff about Machine Learning: bias, variance, and overfitting
The classical understanding of machine learning algorithms comes from statistical learning theory and generalization bounds. This theory was initiated by Vladimir Vapnik in the 1970s. In The Nature of Statistical Learning Theory, the generalization error of machine learning algorithms is studied. For algorithms based on empirical risk minimization (like SGD for deep nets), the generalization error is decomposed into two parts: a bias term and a variance term. The bias is purely deterministic, whereas the variance is stochastic. The bias decreases with the size of the model and is independent of the set of observations. Conversely, the variance increases with the size of the model but decreases as the number of observations grows. The vanilla bias-variance trade-off consists in selecting the best model size given a finite number of observations. This trade-off is meant to avoid overfitting, i.e. selecting a complex model with no error on the training set but poor generalization power. In practice, the trade-off is calibrated by splitting the dataset into a training set and a test set.
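The classical trade-off is easy to see on a toy problem. In this sketch (the task, noise level, and degrees are arbitrary illustrative choices), we fit polynomials of increasing degree to noisy samples of a sine: training error keeps falling as the model grows, while test error eventually rises — the textbook overfitting picture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression task: y = sin(2*pi*x) + noise.
def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def errors(degree):
    # Empirical risk minimization: least-squares polynomial fit.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for d in [1, 3, 9, 15]:
    tr, te = errors(d)
    print(f"degree={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

Degree 1 has high bias (it cannot represent the sine), while degree 15 has high variance on only 30 points; an intermediate degree wins on the test set.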
Weird properties of deep nets: VC theory fails
When you train a deep net, the previous considerations no longer apply exactly. In many deep learning problems, you usually need to reach perfect accuracy on the training sample in order to get good performance on the test set. Overfitting may still occur, but it is handled differently. Dropout, pooling, and more recently batch normalization (BN, see here) are very popular ways to avoid it. These techniques are not really comparable, however: batch normalization was introduced to accelerate training and avoid the saturation regime of the gradient; dropout randomly reduces the complexity of the network by zeroing out some neurons with a certain probability; pooling regularizes in a certain sense, since it reduces the size of the network. Nowadays, pooling tends to be replaced by larger strides and small filter sizes, so there are fewer parameters to learn. Yet even though we ask ourselves every day how best to train a deep net, there is no definitive rule. Test and learn.
In the classical study of ML algorithms, generalization bounds are based on different measures of complexity, such as the VC dimension (Vapnik-Chervonenkis dimension), Rademacher complexities, or other data-dependent complexities. Recently, several authors have shown that these capacity measures are not satisfactory for explaining the strong generalization power of deep nets. The potential complexity of these nets is huge, and standard VC-type bounds are prohibitive.
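To see why, recall the shape of a classical VC-type bound (this is the form popularized by Vapnik; constants vary across textbooks). With probability at least $1-\delta$ over the draw of $n$ training samples, every classifier $f$ in a class of VC dimension $d$ satisfies

$$
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}.
$$

For a modern deep net, $d$ scales with the (millions of) parameters and can exceed $n$ by orders of magnitude, so the square-root term is larger than 1 and the bound is vacuous: it cannot distinguish a net that generalizes from one that does not.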
Moreover, explicit regularization (such as L2 penalization, or weight decay) is not necessary. In the standard statistical learning paradigm, the main strategy to generalize well is to consider a huge set of candidate models and penalize complex solutions. In deep learning, SGD (i.e. estimating the gradient on a minibatch) and batch normalization act as powerful implicit regularizers. The mathematical mechanism behind them remains unclear, or even counterintuitive.
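For readers who have only seen full-batch gradient descent, here is what "estimating the gradient on a minibatch" means in practice — a minimal NumPy sketch on a linear model, with an illustrative step size and batch size. The gradient at each step is a noisy estimate computed from 32 random samples rather than all 1000; this injected noise is precisely what is conjectured to act as an implicit regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression problem: y = X @ w_true + noise.
n, dim = 1000, 5
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(dim)
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.choice(n, size=batch_size, replace=False)  # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size         # noisy gradient estimate
    w -= lr * grad

print("max parameter error:", np.abs(w - w_true).max())
```

On this convex toy problem SGD simply converges near the true weights; the interesting (and unexplained) part is that the same noisy dynamics, run on a non-convex deep net, tends to select solutions that generalize well.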
Empirical understanding, layer after layer
Recently, several researchers have carried out empirical studies. In "Understanding deep learning requires rethinking generalization", the capacity of deep architectures is studied, and zero training error is reached over a set of 1M images with random labels: deep learning easily fits random labels. In his PhD, EO developed an experimental study in two steps: (1) build a simple but state-of-the-art architecture for a classification task, and (2) study this architecture layer after layer. Figure 1 below shows empirically the ability of the network to decrease the complexity of the classification frontier as depth increases. He also sheds light on the existence of a margin in the stack of representations, thanks to the introduction of local support vectors at each layer. Another interesting property is the progressive contraction mechanism of deep nets, leading to the linear separability exploited by the last layer. Below, Figure 2 illustrates the decrease of the cumulative variance within a class as depth increases.
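The random-labels phenomenon is easy to reproduce in miniature. The sketch below (a deliberately simplified stand-in for the paper's experiments: a one-hidden-layer net whose first layer is random and fixed, with the head fitted by least squares) memorizes 200 purely random labels. Because the hidden layer is wider than the number of examples, the model interpolates the training set exactly — capacity alone guarantees it, with nothing to "learn".

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 random points in 10-D with purely random binary labels:
# there is no signal, only memorization.
n, dim, width = 200, 10, 1000
X = rng.normal(size=(n, dim))
y = rng.integers(0, 2, size=n) * 2 - 1  # random labels in {-1, +1}

# Overparameterized one-hidden-layer net: random ReLU features
# (fixed first layer) + a linear head fitted by least squares.
W = rng.normal(size=(dim, width))
H = np.maximum(X @ W, 0.0)
beta, *_ = np.linalg.lstsq(H, y.astype(float), rcond=None)

train_acc = np.mean(np.sign(H @ beta) == y)
print(train_acc)  # 1.0: the model fits random labels perfectly
```

Since such a model can reach zero training error on arbitrary labels, zero training error by itself cannot explain generalization — which is exactly the paper's point.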
This empirical study seems to show that neural networks learn, layer after layer, a progressive separation and contraction of the problem, and thus a better representation of it. Other authors study deep nets by projecting filters down to pixel space. In this paper, the properties of filters at different depths are studied by visualizing the filters and their strongest activations across all training examples.
Figure 1 (left): Classification error of a k-NN over local support vectors, as a function of k (depth from 2 to 13)
Figure 2 (right): Cumulative variance of the principal components of a given class (from 1 to 30)
Conclusion: why it is important
The theoretical understanding of deep nets will be a major breakthrough in AI and in science in general: a way to calibrate these algorithms better and faster, and perhaps also a way to achieve more stability and interpretability for these black boxes.
The existing attempts above are important steps toward understanding more deeply why deep works. However, these results are purely empirical, and no solid theorem or hypothesis is available to guarantee the stability, generalization, or convergence of deep nets. In recent months, a theoretical foundation for deep nets, called the Information Bottleneck, has been emerging. This principle, introduced in 1999 by Naftali Tishby, Fernando C. Pereira, and William Bialek, claims that deep nets compress the information through a bottleneck, retaining only the features most relevant to general concepts (see this polemical article in Quanta Magazine). But many researchers remain skeptical, and recent work seems to show that this bottleneck is not a necessary condition for learning representations that generalize well.