Recently Kaggle hosted a competition on the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60k 32x32 colour images in 10 classes. This dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
Many contestants used convolutional nets to tackle this competition. Some achieved scores that beat human performance on this classification task. In this blog series we will interview three contestants and also a founding father of convolutional nets: Yann LeCun.
Example images from the CIFAR-10 dataset
Yann LeCun
Yann LeCun is currently Director of AI Research at Facebook and a professor at NYU.
Which other scientists should be named and celebrated for the successes of convolutional nets?
Certainly, Kunihiko Fukushima's work on the Neo-Cognitron was an inspiration. Although the early forms of convnets owed little to the Neo-Cognitron, the version we settled on (with pooling layers) did.
A schematic diagram illustrating the interconnections between layers in the Neo-Cognitron. From Fukushima K. (1980) Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position.
Can you recount an aha!-moment or theoretical breakthrough during your early research into convolutional nets?
Not really. It was the logical thing to do. I had been playing with multi-layer nets with local connections since 1982 or so (not having the right learning algorithm though. Backprop didn't exist). I started experimenting with shared weight nets while I was a postdoc in Toronto in 1988.
The reason for not trying earlier is simply that I had neither the software nor the data. Once I arrived at Bell Labs, I had access to a large dataset and fast computers (for the time). So I could try full-size convnets, and it worked amazingly well (though it required 2 weeks of training).
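To illustrate the terms used in this answer, here is a minimal pure-Python sketch; it is not LeCun's SN code. It contrasts a locally connected layer (each position has its own weights) with a shared-weight, i.e. convolutional, layer, and adds the kind of pooling step mentioned earlier. The function names, the toy signals and the edge-detecting kernel are all assumptions made for this example.

```python
def locally_connected_1d(signal, weights_per_position):
    """Local connections without sharing: every output position has its own kernel."""
    k = len(weights_per_position[0])
    return [
        sum(w * x for w, x in zip(weights_per_position[i], signal[i:i + k]))
        for i in range(len(signal) - k + 1)
    ]


def conv_1d(signal, kernel):
    """Shared weights: the same kernel is reused at every position."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(len(signal) - k + 1)
    ]


def max_pool_1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical toy data: the same bump, shifted by one step.
signal_a = [0, 1, 1, 0, 0, 0, 0, 0]
signal_b = [0, 0, 1, 1, 0, 0, 0, 0]
edge_kernel = [-1, 1]  # responds positively to a rising edge

pooled_a = max_pool_1d(conv_1d(signal_a, edge_kernel), 2)
pooled_b = max_pool_1d(conv_1d(signal_b, edge_kernel), 2)

# Number of trainable weights for 7 output positions and kernel width 2.
n_positions = len(signal_a) - len(edge_kernel) + 1
weights_locally_connected = n_positions * len(edge_kernel)
weights_shared = len(edge_kernel)
```

In this toy setting the locally connected layer needs 14 weights and the shared-weight layer needs 2. The two shifted inputs give the same pooled output because the shift stays inside one pooling window; a shift that crosses a window boundary would change it.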
What is your opinion on the recent popularity of convolutional nets for object recognition? Did you expect it?
Yes. I knew it had to happen. It was a matter of time until the datasets became large enough and the computers powerful enough for deep learning algorithms to become better than human engineers at designing vision systems.
There was a symposium entitled "frontiers in computer vision" at MIT in August 2011. The title of my talk was “5 years from now, everyone will learn their features (you might as well start now)”. David Lowe (the inventor of SIFT) said the same thing.
A slide from the talk LeCun Y. (2011) “5 years from now, everyone will learn their features (you might as well start now)”.
Still, I was surprised by how fast the revolution happened and how much better convnets are, compared to other approaches. I would have expected the transition to be more gradual. Also, I would have expected unsupervised learning to play a greater role.
The character recognition model at AT&T was more than a simple classifier, but a complete pipeline. Can you tell more about the implementation problems your team faced?
We had to implement our own programming language and write our own compiler to build this. Leon Bottou and I had written a neural net simulator called SN, back in 1987/1988. It was a Lisp interpreter with a numerical library (multidimensional arrays, neural net graphs…). We used this at Bell Labs to develop the first convnets.
Then in the early 90’s, we wanted to use our code in products. Initially, we hired a team of developers to convert our Lisp code to C/C++. But the resulting system could not be improved easily (it wasn’t a good platform for R&D). So Leon, Patrice Simard and I wrote a compiler for SN, which we used to develop the next generation OCR engine.
That system integrated a segmenter, a convnet, and a graphical model on top. The whole thing was trained end to end.
The graphical model was called a “graph transformer network”. It was conceptually similar to what we now call a conditional random field, or a structured perceptron (which it predates), but it allowed for non-linear scoring functions (CRFs and structured perceptrons can only have linear scoring functions).
The whole infrastructure was written in SN and compiled. This is the system that was deployed in ATM machines and check reading machines in 1996 and was reading 10 to 20% of all the checks in the US by the late 90’s.
An animation showing LeNet 5 in action. From "Invariance and Multiple Characters with SDNN (Multiple Characters Demo)".
In comparison with other methods, training convnets is pretty slow. How do you deal with the trade-off between experimentation and increased model training times? What does a typical development iteration look like?
In my experience, the best large-scale learning systems always take 2 or 3 weeks to train, regardless of the task, the method, the hardware, or the data.
I don’t know if convnets are “pretty slow”. Compared to what? They may be slow to train, but the alternative to “slow learning” is months of engineering efforts which doesn’t work as well in the end. Also, convnets are actually pretty fast to run (after training).
In a real application, no one really cares how long it takes to train. But people care a lot about how long it takes to run.
Which recent papers on convolutional nets are you most excited about? Any papers or ideas we should look out for?
There are lots and lots of ideas surrounding convnets and deep learning that have lived in relative obscurity for the last 20 years or so. No one cared about them, and getting papers published was always a struggle. So, lots of ideas were never properly tried, never published, or were tried and published but soundly ignored and quickly forgotten. Who remembers that the first learning-based face detector that actually worked was a convolutional net (back in 1993, eight years before Viola-Jones)?
A figure with predictions from Vaillant R., Monrocq C., LeCun Y. (1993) "An original approach for the localisation of objects in images".
Today, it’s really amazing to see so many young and bright people devoting so much creative energy to the topic and coming up with new ideas and new applications. The hardware / software infrastructure is getting better, and it’s becoming possible to train large networks in a few hours or a few days. So people can explore many more ideas than in the past.
One thing I’m excited about is the idea of “spectral convolutional nets”. This was a paper at ICLR 2014 by folks from my NYU lab about a generalization of convolutional nets that can be applied to any graph (regular convnets can be applied to 1D, 2D or 3D arrays, which can be seen as regular grids in graph terms). There are practical issues, but it opens the door to many more applications of convnets to unstructured data.
MNIST digits on a sphere. From Bruna J., Zaremba W., Szlam A., LeCun Y. (2013) "Spectral Networks and Deep Locally Connected Networks on Graphs".
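The general idea of filtering a signal on a graph in the spectral domain can be sketched as follows. This is a simplified illustration and not the method from the paper: it assumes NumPy, uses a small ring graph, and replaces the learned per-frequency multipliers with a fixed, hand-picked low-pass response. All function names are hypothetical.

```python
import numpy as np


def ring_adjacency(n):
    """Adjacency matrix of a ring graph: node i is linked to its two neighbours."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1.0
        A[(i + 1) % n, i] = 1.0
    return A


def graph_laplacian(A):
    """Combinatorial Laplacian L = D - A, with D the diagonal degree matrix."""
    return np.diag(A.sum(axis=1)) - A


def spectral_filter(x, L, response):
    """Filter signal x on the graph.

    The Laplacian eigenvectors play the role of a Fourier basis: project x
    onto them, scale each component by response(eigenvalue), project back.
    In a trainable model the scaling factors would be learned parameters.
    """
    eigvals, U = np.linalg.eigh(L)
    x_hat = U.T @ x
    return U @ (response(eigvals) * x_hat)


def low_pass(lam):
    """Assumed response: damp high graph frequencies (large eigenvalues)."""
    return np.exp(-lam)


n = 8
L = graph_laplacian(ring_adjacency(n))
x = np.zeros(n)
x[0] = 1.0  # an impulse on node 0
y = spectral_filter(x, L, low_pass)  # the impulse spread over nearby nodes
```

On a ring, which is a regular 1D grid with wrap-around, this filter commutes with shifts of the input, just like an ordinary circular convolution. The same code runs unchanged on an irregular adjacency matrix, which is the point of the generalization.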
I’m very excited about the application of convnets (and recurrent nets) to natural language understanding (following the seminal work of Collobert and Weston).
Since the error rate of a human is estimated to be around 6%, and Dr. Graham showed results of 4.47%, do you consider CIFAR-10 to be a solved problem?
It’s a solved problem in the same sense as MNIST is a solved problem. But frankly, people are more interested in ImageNet than in CIFAR-10 nowadays. In that sense, CIFAR-10 is not a “real” problem. But it’s not a bad benchmark for a new algorithm.
What would it take for convnets to see a much wider adoption in the industry? Will training convnets and the software to set them up become less challenging?
What are you talking about? Convnets are absolutely everywhere now (or about to be everywhere) in industry: Facebook, Google, Microsoft, IBM, Baidu, NEC, Twitter, Yahoo!….
That said, it’s true that all of these companies have significant R&D resources and that training convnets can still be challenging for smaller companies or companies that are less technically advanced.
It still requires quite a bit of experience and time investment to train a convnet if you don’t have prior training. Soon however, there will be several simple to use open source packages with efficient back-ends for that.
Are we close to the limit for convnets? Or could CIFAR-100 be "solved" next?
I don’t think it’s a good test. ImageNet is a much better test.
Shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional architectures. Do deep nets really need to be deep?
Yes, deep nets need to be deep. Try to train a shallow net to emulate a deep convnet trained on ImageNet. Come back when you have done that. In theory, a deep net can be approximated by a shallow one. But on complex tasks, the shallow net will have to be ridiculously large.
Most of your academic work is highly practical in nature. Is this something you purposefully aim for, or is this an artefact of being employed by companies? Can you tell about the distinction between theory and practice?
Hey, I’ve been in academia since 2003, and I’m still a part-time professor at NYU. I do theory when it helps me understand things. Theory often helps us understand what’s possible and what’s not possible. It helps suggest proper ways to do things.
But sometimes theory restricts our thinking. Some people will not work with some models because the theory about them is too difficult. But often, a technique works well before the reasons for it working well are fully understood theoretically.
By restricting yourself to work on stuff you fully understand theoretically, you are condemned to using conceptually simple methods.
Also, sometimes theory blinds us. For example, some people were dazzled by kernel methods because of the cute math that goes with it. But, as I’ve said in the past, in the end, kernel machines are shallow networks that perform “glorified template matching”. There is nothing wrong with that (SVM is a great method), but it has dire limitations that we should all be aware of.
A slide from LeCun Y. (2013) Learning Hierarchies from Invariant Features
What is your opinion on a well-performing convnet without any theoretical justifications for why it should work so well? Do you generally favor performance over theory? Where do you place the balance?
I don’t think there is a choice to make between performance and theory. If there is performance, there will be theory to explain it.
Also, what kind of theory are we talking about? Is it a generalization bound? Convnets have a finite VC dimension, hence they are consistent and admit the classical VC bounds. What more do you want? Do you want a tighter bound, like what you get for SVMs? No theoretical bound that I know of is tight enough to be useful in practice. So I really don’t understand the point. Sure, generic VC bounds are atrociously non tight, but non-generic bounds (like for SVMs) are only slightly less atrociously non tight.
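To get a feel for how loose such a bound is, here is a small numerical sketch of one common form of the classical VC bound: with probability at least 1 - delta, test error <= training error + sqrt((h (ln(2n/h) + 1) + ln(4/delta)) / n), where h is the VC dimension and n the number of training samples. The values of h and n below are invented for illustration; the interview does not give the VC dimension of any actual convnet.

```python
import math


def vc_confidence_term(h, n, delta=0.05):
    """Gap term of the classical VC bound.

    With probability at least 1 - delta:
        test_error <= training_error + vc_confidence_term(h, n, delta)
    h is the VC dimension, n the number of training samples (requires h < n).
    """
    if not 0 < h < n:
        raise ValueError("this form of the bound assumes 0 < h < n")
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)


# Hypothetical numbers, chosen only for illustration.
gap_small_data = vc_confidence_term(h=10_000, n=50_000)
gap_large_data = vc_confidence_term(h=10_000, n=1_000_000)
```

Under these assumed numbers the guaranteed gap between training and test error is about 0.8 with 50,000 samples and about 0.25 with a million, far looser than what is typically observed in practice.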
If what you desire are convergence proofs (or guarantees), that’s a little more complicated. The loss function of multi-layer nets is non convex, so the easy proofs that assume convexity are out the window. But we all know that in practice, a convnet will almost always converge to the same level of performance, regardless of the starting point (if the initialization is done properly). There is theoretical evidence that there are lots and lots of equivalent local minima and a very small number of “bad” local minima. Hence convergence is rarely a problem.
What is your opinion on AI hype? Which practices do you think are detrimental to the field (of AI in general and specifically convnets)?
AI hype is extremely dangerous. It killed various approaches to AI at least 4 times in the past. I keep calling out hype whenever I see it, whether it’s from the press, from startups looking for investors, from large companies looking for PR, or from academics looking for grants.
There is certainly quite a bit of hype around deep learning at the moment. I don’t see a particularly high level of hype around convnets specifically. There is more hype around “cortical this”, “spiking that”, and “neuromorphic blah”. Unlike many of these things, convnets actually yield good results on useful tasks and are widely deployed in industrial applications.
Any interesting projects at Facebook involving convnets that you could talk a little more about? Some basic stats about the size?
DeepFace: a convnet for face recognition. There are also convnets for image tagging. They are big.
A figure describing the architecture from the presentation "Taigman Y., Yang M., Ranzato M., Wolf L. (2014) DeepFace for Unconstrained Face Recognition".
Recently you posted about 4 types of serious researchers. How would you label yourself?
I’m a 3, with a bit of 1 and 4.
1. "People who want to explain/understand learning (and perhaps intelligence) at the fundamental/theoretical level.
2. People who want to solve practical problems and have no interest in neuroscience.
3. People who want to understand intelligence, build intelligent machines, and have a side interest in understanding how the brain works.
4. People whose primary interest is to understand how the brain works, but feel they need to build computer models that actually work in order to do so."
Anything you wish to say to the top contestants in the CIFAR-10 challenge? Anything you wish to say to (hobbyist) researchers studying convnets? Anything in general you wish to say about the CIFAR dataset/problem?
I’m impressed by how much creativity and engineering knack went into this. It’s nice that people have pushed the technique as far as it will go on this dataset.
But it’s going to get easier and easier for independent researchers and hobbyists to play with these things and apply them to larger datasets. I think the successor to CIFAR-10 should be ImageNet-1K-128x128. This would be a version of the 1000 category ImageNet classification task where the images have been normalized to 128x128. I see several advantages:
the networks are small enough to be trainable in a reasonable amount of time on a high-end gamer rig;
the network you get at the end can actually be used for useful applications (like robot vision);
the network can be run in real time on embedded platforms, like smart phones or the NVIDIA Jetson TK1.
Predictions on ImageNet. From "Krizhevsky A., Sutskever I., Hinton G.E. (2012) ImageNet Classification with Deep Convolutional Neural Networks".
The need to have large amounts of labeled data can be a problem. What is your opinion on nets trained on unlabeled data, or the automatic labeling of data through image search engines?
There are tasks like video understanding and natural language understanding where we are going to have to use unsupervised learning. But these modalities have a temporal dimension that changes how we can approach the problem.
Clearly, we need to devise algorithms that can learn the structure of the perceptual world without being told the name of everything. Many of us have been working on this for years (if not decades), but none of us has a perfect solution.
What is your latest research focusing on?
There are two answers to this question:
1. projects I’m personally involved in (enough that I would be co-author on the papers);
2. projects that I set the stage for, encourage others to work on, and advise at the conceptual level, but in which I am not involved enough to be co-author on a paper.
A lot of (1) is at NYU and a lot of (2) is at Facebook.
The general areas are:
unsupervised learning that discovers “invariant” features, the marriage of deep learning and structured prediction, the unification of supervised and unsupervised learning, solving the problem of learning long-term dependencies, building learning systems with short-term/scratchpad memory, learning plans and sequences of actions, ways to optimize functions other than following the gradient, the integration of representation learning with reasoning (read Leon Bottou’s excellent position paper “From Machine Learning to Machine Reasoning”), the use of learning to perform inference efficiently, and many other topics.
Continue the series next week! Read an interview with the CIFAR-10 - Object Recognition in Images competition's first place winner, Dr. Ben Graham.