On March 4th Jürgen Schmidhuber answered “ask me anything” questions on Reddit. Here are some of the answers we found interesting, grouped by topic.
Software | People | Resources | Machine learning | Recurrent neural networks | AI | The future
Software
Why doesn't your group post its code online for reproducing the results of competitions you've won, such as the ISBI Brain Segmentation Contest? Your results are impressive, but without the code they are rarely helpful for pushing the research forward. (This was the most popular question in the AMA.)
We did publish lots of open source code. Our PyBrain machine learning library is public and widely used, thanks to the efforts of Tom Schaul, Justin Bayer, Daan Wierstra, Sun Yi, Martin Felder, Frank Sehnke, and Thomas Rückstiess.
Here is the already mentioned code (RNNLIB) for the first competition-winning RNNs (2009), by my former PhD student and later postdoc Alex Graves. Many people are using it.
It is true though that we don’t publish all our code right away. In fact, some of our code gets tied up in industrial projects which make it hard to release.
Nevertheless, especially recently, we have published less code than we could have. I am a big fan of the open source movement, and we have already decided internally to contribute more to it. (…) There are also plans to release more of our recent recurrent network code soon, in particular a new open source library, a successor of PyBrain.
What is the future of PyBrain? Is your team still working with/on PyBrain? If not, what is your framework of choice? What do you think of Theano? Are you using something better?
My PhD students Klaus and Rupesh are working on a successor of PyBrain with many new features, which hopefully will be released later this year.
People
Do you have any interesting sources of inspiration (art, nature, scientific fields other than the obvious one, neuroscience) that have helped you think differently about approaches, methodology, and solutions in your work?
In my spare time, I am trying to compose music, and create visual art.
And while I am doing this, it seems obvious to me that art, science, and music are driven by the same basic principle.
I think the basic motivation (objective function) of artists and scientists and comedians is data compression progress, that is, the first derivative of data compression performance on the observed history. I have published extensively about this.
A physicist gets intrinsic reward for creating an experiment leading to observations obeying a previously unpublished physical law that allows for better compressing the data.
A composer gets intrinsic reward for creating a new but non-random, non-arbitrary melody with novel, unexpected but regular harmonies that also permit compression progress of the learner's data encoder.
A comedian gets intrinsic reward for inventing a novel joke with an unexpected punch line, related to the beginning of his story in an initially unexpected but quickly learnable way that also allows for better compression of the perceived data.
In a social context, all of them may later get additional extrinsic rewards, e.g., through awards or ticket sales.
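His formal papers pin this down roughly as follows (a paraphrase on my part; notation varies across the papers). Let C(h(≤t), w) be the number of bits an adaptive compressor with parameters w needs to encode the observed history h up to time t. The intrinsic reward at time t is then the compression progress, i.e., the improvement of the compressor after learning:

```latex
% Compression progress as intrinsic reward (paraphrased):
% improvement of the compressor on the history after learning.
r_{\mathrm{int}}(t) = C\bigl(h(\le t),\, w(t-1)\bigr) - C\bigl(h(\le t),\, w(t)\bigr)
```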
How do you recognize a promising machine learning PhD student?
They all have something in common: successful students are not only smart but also tenacious. While trying to solve a challenging problem, they run into a dead end, and backtrack. Another dead end, another backtrack. But they don’t give up. And suddenly there is this little insight into the problem which changes everything. And suddenly they are world experts in a particular aspect of the field, and then find it easy to churn out one paper after another, and create a great PhD thesis.
Do you know of any labs doing biotech/bioinformatics that you think are worth exploring?
I know a great biotech/bioinformatics lab for this: that of Deep Learning pioneer Sepp Hochreiter in Linz.
Sepp is back in the NN game, and his team promptly won 9 of the 15 challenges in the Tox21 data challenge, including the Grand Challenge, the nuclear receptor panel, and the stress response panel. Check out the NIH (NCATS) announcement of the winners and the leaderboard.
Sepp’s Deep Learning approach DeepTox is described here.
Why is there not much interaction and collaboration between the researchers of Recurrent NNs and the rest of the NN community, particularly Convolutional NNs? I always see Hinton, LeCun, and Bengio interacting at conferences, on panels, and on Google Plus, but never Schmidhuber. They also cite each other's papers more.
Maybe part of this is just a matter of physical distance. This trio of long-term collaborators has done great work in three labs near the Northeastern US/Canadian border, co-funded by the Canadian CIFAR organization, while our labs in Switzerland and Munich were over 6,000 km away, mostly funded by the Swiss National Science Foundation, the DFG, and EU projects. Also, I mostly stopped going to the important NIPS conference in Canada when it focused on non-neural approaches such as kernel methods during the most recent NN winter, and when cross-Atlantic flights became such a hassle after 9/11.
Nevertheless, there are quite a few connections across the big pond. For example, before he ended up at DeepMind, my former PhD student and postdoc Alex Graves went to Geoff Hinton’s lab, which is now using LSTM RNNs a lot for speech and other sequence learning problems. Similarly, my former PhD student Tom Schaul did a postdoc in Yann LeCun’s lab before he ended up at DeepMind (which has become some sort of retirement home for my former students :-). Yann LeCun also was on the PhD committee of Jonathan Masci, who did great work in our lab on fast image scans with max-pooling CNNs.
With Yoshua Bengio we even had a joint paper in 2001 on the vanishing gradient problem. The first author was Sepp Hochreiter, my very first student (now a professor), who identified and analysed this Fundamental Deep Learning Problem in 1991 in his diploma thesis.
There have been lots of other connections through common research interests. For example, Geoff Hinton’s deep stacks of unsupervised NNs (with Ruslan Salakhutdinov, 2006) are related to our deep stacks of unsupervised recurrent NNs (1992-1993); both systems were motivated by the desire to improve Deep Learning across many layers. His ImageNet contest-winning ensemble of GPU-based max-pooling CNNs (Krizhevsky et al., 2012) is closely related to our traffic sign contest-winning ensemble of GPU-based max-pooling CNNs (Ciresan et al., 2011a, 2011b). And all our CNN work builds on the work of Yann LeCun’s team, which first backpropagated errors (LeCun et al., 1989) through CNNs (Fukushima, 1979), and also first backpropagated errors (Ranzato et al., 2007) through max-pooling CNNs (Weng, 1992). (See also Scherer et al.’s important work (2010) in the lab of Sven Behnke.) At IJCAI 2011, we published a way of putting such MPCNNs on GPU (Ciresan et al., 2011a); this helped especially with the vision competitions. To summarise: there are lots of RNN/CNN-related links between our labs.
Resources
I just took my first machine learning course and I’m interested in learning more about the field. Where do you recommend I start? Do you have any books, tutorials, tips to recommend?
Here is a very biased list of books and links that I found useful for students entering our lab (other labs may emphasize different aspects though).
What are some of the most exciting papers that you have read (or written) in the past year?
Last year I got excited about industrial breakthroughs of our recurrent neural networks. They are now helping to revolutionize speech processing and other sequence learning domains, especially the Long Short-Term Memory (LSTM) developed in my research groups in the 1990s and 2000s (main PhD theses: Sepp Hochreiter 1999, Felix Gers 2001, Alex Graves 2008; main postdoc contributors: Fred Cummins, Santiago Fernandez, Faustino Gomez). Here are some recent benchmark records achieved with LSTM, often at big IT companies:
- Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
- Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
- Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
- Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
- Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
- English to French translation (Sutskever et al., Google, NIPS 2014)
- Audio onset detection (Marchi et al., ICASSP 2014)
- Social signal classification (Brueckner & Schuller, ICASSP 2014)
- Arabic handwriting recognition (Bluche et al., DAS 2014)
- TIMIT phoneme recognition (Graves et al., ICASSP 2013)
- Optical character recognition (Breuel et al., ICDAR 2013)
- Image caption generation (Vinyals et al., Google, 2014)
- Video-to-textual description (Donahue et al., 2014)
- Photo-real talking heads (Soong and Wang, Microsoft, 2014)
- Semantic representations (Tai et al., 2015)
- Learning video representations (Srivastava et al., 2015)
- Video description generation (Yao et al., 2015)
Also check out recent end-to-end speech recognition (Hannun et al., Baidu, 2014) with our CTC-based RNNs (Graves et al., 2006), without any HMMs etc.
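As an aside for practitioners: CTC has since become a standard building block in deep learning toolkits. Below is a minimal, hypothetical PyTorch sketch (my illustration; all shapes and numbers are arbitrary) of how a CTC loss ties an unsegmented label sequence to per-frame RNN outputs without any HMM alignment step:

```python
import torch
import torch.nn as nn

# Toy dimensions: T frames, batch of 1, C classes (index 0 = CTC blank).
T, N, C = 50, 1, 20
# Stand-in for per-frame RNN outputs (would normally come from an LSTM).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (N, 10))  # unaligned label sequence
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([10])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the RNN outputs
```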
Many of the references above can be found in my recent Deep Learning survey, whose write-up consumed quite some time, because I wanted to get the history right: who started Deep Learning, who invented backpropagation, what has been going on in Deep Reinforcement Learning, etc, etc. In the end I had 888 references on 88 pages.
Machine learning
In your opinion, which discoveries in ML were “game changers” in the last few decades?
I think Marcus Hutter's AIXI model of the early 2000s was a game changer. Until then, the field of Artificial General Intelligence (AGI) had been a collection of heuristics. But heuristics come and go, while theorems last for eternity. Building on Ray Solomonoff's earlier work on universal predictors, Marcus proved that there is a universal AI that is mathematically optimal in a certain sense. Not in a practical sense, otherwise we'd probably not even be discussing this here. But this work exposed the ultimate limits of both human and artificial intelligence, and brought mathematical soundness and theoretical credibility to the entire field for the first time. These results will still stand in a thousand years. More.
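For readers who want the flavor of the math, AIXI's action choice can be sketched roughly as follows (my paraphrase of Hutter's definition, not a verbatim quote). At time k, with horizon m, actions a, observations o, rewards r, a universal Turing machine U, and ℓ(q) the length of program q, the agent maximizes expected reward under a Solomonoff-style mixture over all programs consistent with the interaction history:

```latex
% AIXI (sketch): expected reward under a universal mixture over
% all programs q that reproduce the observed interaction history.
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       (r_k + \cdots + r_m)
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```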
From my extremely biased perspective, I'd say that there has also been a lot of important work on non-universal but still rather general and very practical recurrent neural networks. RNNs are general computers. RNNs are the deepest NNs. Some RNNs are biologically plausible. Some RNNs are compatible with physically efficient future hardware: lots of processors connected through many short and few long wires. In many ways, RNNs are the ultimate NNs. In recent decades, there has been lots of progress in both supervised learning RNNs and reinforcement learning RNNs. RNNs have started to revolutionize very important fields such as speech recognition. And that's just the beginning. Many researchers have collectively contributed to this RNN-based "game changer"; here are some relevant sections in my little survey, with lots of references: Sec. 2, 3, 5.5, 5.5.1, 5.6.1, 5.9, 5.10, 5.13, 5.16, 5.17, 5.20, 5.22, 6.1, 6.3, 6.4, 6.6, 6.7.
How do we get from supervised learning to fully unsupervised learning?
When we started explicit Deep Learning research in the early 1990s, we actually went the other way round, from unsupervised learning (UL) to supervised learning (SL)! To overcome the vanishing gradient problem, I proposed a generative model, namely, an unsupervised stack of RNNs: ftp://ftp.idsia.ch/pub/juergen/chunker.pdf (1992). The first RNN uses UL to predict its next input. Each higher level RNN tries to learn a compressed representation of the info in the RNN below, trying to minimise the description length (or negative log probability) of the data. The top RNN may then find it easy to classify the data by supervised learning. One can also “distill” a higher RNN (the teacher) into a lower RNN (the student) by forcing the lower RNN to predict the hidden units of the higher one (another form of unsupervised learning). Such systems could solve previously unsolvable deep learning tasks.
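For illustration only, here is a tiny PyTorch sketch of that idea (my own toy reconstruction, not the 1992 code; the surprise threshold and all names are invented). The level-1 RNN learns to predict its next input, and only badly predicted ("surprising") inputs are promoted to level 2, which therefore sees a shorter, compressed sequence:

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """An RNN that predicts its next input vector."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.rnn = nn.RNN(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):
        h, _ = self.rnn(x)  # hidden states at every time step
        return self.out(h)  # prediction of the next input

dim, T = 8, 100
level1, level2 = Predictor(dim, 32), Predictor(dim, 32)
opt = torch.optim.Adam(list(level1.parameters()) + list(level2.parameters()))

x = torch.randn(1, T, dim)                   # toy input sequence
pred = level1(x[:, :-1])                     # predict x[t+1] from x[..t]
err = ((pred - x[:, 1:]) ** 2).mean(dim=-1)  # per-step prediction error

# Only surprising inputs (badly predicted ones) are handed upwards,
# so the level-2 RNN operates on a compressed version of the history.
surprising = err[0] > err.mean()
x2 = x[:, 1:][:, surprising].detach()

loss = err.mean()
if x2.size(1) > 2:  # train level 2 on its shorter sequence, if any
    loss = loss + ((level2(x2[:, :-1]) - x2[:, 1:]) ** 2).mean()
loss.backward()
opt.step()
```

The "distilling" he mentions would add one more loss term, forcing the lower RNN to also predict the hidden units of the higher one.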
However, then came supervised LSTM, and that worked so well in so many applications that we shifted focus to that. On the other hand, LSTM can still be used in unsupervised mode as part of an RNN stack like above. This illustrates that the boundary between supervised and unsupervised learning is blurry. Often gradient-based methods such as backpropagation are used to optimize objective functions for both types of learning.
So how do we get back to fully unsupervised learning? First of all, what does that mean? The most general type of unsupervised learning comes up in the general reinforcement learning (RL) case. Which unsupervised experiments should an agent’s RL controller C conduct to collect data that quickly improves its predictive world model M, which could be an unsupervised RNN trained on the history of actions and observations so far? The simple formal theory of curiosity and creativity says: Use the learning progress of M (typically compression progress in the MDL sense) as the intrinsic reward or fun of C. I believe this general principle of active unsupervised learning explains all kinds of curious and creative behaviour in art and science, and we have built simple artificial “scientists” based on approximations thereof, using (un)supervised gradient-based learners as sub-modules.
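A heavily simplified numerical reading of that principle (my own toy sketch; the linear WorldModel merely stands in for M, and the controller C is left out entirely):

```python
import numpy as np

rng = np.random.default_rng(0)

class WorldModel:
    """Toy world model M: a linear predictor of the next observation."""
    def __init__(self, dim):
        self.W = np.zeros((dim, dim))

    def error(self, obs):  # prediction error as a proxy for coding cost
        return float(((obs[:-1] @ self.W - obs[1:]) ** 2).mean())

    def train(self, obs, lr=0.01, steps=50):  # plain gradient descent
        for _ in range(steps):
            grad = obs[:-1].T @ (obs[:-1] @ self.W - obs[1:]) / len(obs)
            self.W -= lr * grad

def intrinsic_reward(model, obs):
    """Curiosity reward for C = learning progress of M on C's data."""
    before = model.error(obs)
    model.train(obs)                  # M learns from the collected data
    return before - model.error(obs)  # compression/prediction progress

model = WorldModel(dim=4)
obs = rng.normal(size=(100, 4))      # data gathered by the controller C
obs[1:] += 0.9 * obs[:-1]            # make the stream partly predictable
print(intrinsic_reward(model, obs))  # positive: M improved, C has "fun"
```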
How will IBM's TrueNorth neurosynaptic chip affect the neural network community? Can we expect that the future of Deep Learning lies not in GPUs, but rather in dedicated hardware such as TrueNorth?
As already mentioned in another reply, current GPUs are much hungrier for energy than biological brains, whose neurons efficiently communicate by brief spikes (Hodgkin and Huxley, 1952; FitzHugh, 1961; Nagumo et al., 1962), and often remain quiet. Many computational models of such spiking neurons have been proposed and analyzed - see Sec. 5.26 of the Deep Learning survey. I like the TrueNorth chip because it indeed consumes relatively little energy (see Sec. 5.26 for related hardware). This will become more and more important in the future. It would be nice, though, to have a chip that is not only energy-efficient but also highly compatible with existing state-of-the-art NN learning methods, which are normally implemented on GPUs. I suspect the TrueNorth chip won't be the last word on this.
What do you think about Hierarchical Temporal Memory (HTM) and the Cortical Learning Algorithm (CLA) theory developed by Jeff Hawkins and others?
Jeff Hawkins had to endure a lot of criticism because he did not relate his method to much earlier similar methods, and because he did not compare its performance to that of other widely used methods.
HTM is a neural system that attempts to learn from temporal data in a hierarchical fashion. To my knowledge, the first neural hierarchical sequence-processing system was our hierarchical stack of recurrent neural networks (Neural Computation, 1992). Compare also hierarchical Hidden Markov Models (e.g., Fine, S., Singer, Y., and Tishby, N., 1998), and our widely used hierarchical stacks of LSTM recurrent networks: ftp://ftp.idsia.ch/pub/juergen/IJCAI07sequence.pdf.
At the moment I don’t see any evidence that Hawkins’ system can contribute “towards more powerful AI systems (or even AGI).”
Recurrent neural networks
Why do you think that RNNs are the ultimate NNs?
Because they are general computers. Like your laptop. The program of an RNN is its weight matrix. Unlike feedforward NNs, RNNs can implement while loops, recursion, you name it.
While FNNs are traditionally linked to concepts of statistical mechanics and traditional information theory, the programs of RNNs call for the framework of algorithmic information theory (or Kolmogorov complexity theory).
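A deliberately trivial illustration of "the program is the weight matrix" (my toy example, not from the AMA): with two hand-picked weights and no learning at all, a single linear recurrent unit executes a little program that counts the ones in its input stream:

```python
# The "program" is two weights: keep the old state, add the input.
W_hh, W_xh = 1.0, 1.0
h = 0.0                      # hidden state = the program's memory
for x in [1, 1, 0, 1]:       # input stream
    h = W_hh * h + W_xh * x  # one recurrent step
print(h)                     # -> 3.0: the RNN counted the ones
```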
How on earth did you and Hochreiter come up with LSTM units? They seem radically more complicated than any other "neuron" structure I've seen, and every time I see the figure, I'm shocked that you're able to train them.
In my first Deep Learning project ever, Sepp Hochreiter (1991) analysed the [vanishing gradient problem](http://people.idsia.ch/~juergen/fundamentaldeeplearningproblem.html). LSTM falls out of this almost naturally :-)
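Written out, the units are less frightening than the usual figures suggest. Here is a minimal NumPy sketch of one step of a standard LSTM cell (the common variant with a forget gate, added by Gers et al. around 2000; the 1997 original had none). The nearly linear update of the cell state c is the "constant error carousel" that keeps gradients from vanishing over long time lags:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W has shape (4*H, X+H), b has shape (4*H,)."""
    z = W @ np.concatenate([x, h]) + b  # all four gates in one matmul
    i, f, o, g = np.split(z, 4)         # input, forget, output, candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)          # gated, nearly linear memory path
    h = o * np.tanh(c)                  # gated output
    return h, c

X, H = 3, 5
h, c = np.zeros(H), np.zeros(H)
W, b = np.random.randn(4 * H, X + H) * 0.1, np.zeros(4 * H)
for x in np.random.randn(10, X):        # run over a toy sequence
    h, c = lstm_step(x, h, c, W, b)
```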
Your recent paper on Clockwork RNNs seems to provide an alternative to LSTMs for learning long-term temporal dependencies. Are there obvious reasons to prefer one approach over the other? Have you put thought into combining elements from each approach (e.g., Clockwork RNNs that make use of multiplicative gating in some fashion)?
We had lots of ideas about this. The Clockwork RNN is actually a simplification of our RNN stack-based history compressors (Neural Computation, 1992): ftp://ftp.idsia.ch/pub/juergen/chunker.pdf, where the clock rates are not fixed but depend on the predictability of the incoming sequence (and where a slowly clocking teacher net can be "distilled" into a fast clocking student net that imitates the teacher net's hidden units).
But we don’t know yet in general when to prefer which variant of plain LSTM over which variant of Clockwork RNNs or Clockwork LSTMs or history compressors. Clockwork RNNs so far are better only on the synthetic benchmarks presented in the ICML 2014 paper.
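For intuition, here is a rough sketch of the fixed-rate idea (my simplification; the actual Clockwork RNN of Koutnik et al., ICML 2014, also feeds slower modules into faster ones): hidden modules tick at fixed periods 1, 2, 4, ..., so slow modules cheaply preserve long-range information:

```python
import numpy as np

periods = [1, 2, 4, 8]                   # fixed clock rate per module
states = [np.zeros(4) for _ in periods]  # one hidden block per module
W = [np.random.randn(4, 4) * 0.1 for _ in periods]  # recurrent weights
U = [np.random.randn(4, 3) * 0.1 for _ in periods]  # input weights

def cw_step(t, x, states):
    for k, p in enumerate(periods):
        if t % p == 0:  # module k updates only every p-th step;
            states[k] = np.tanh(W[k] @ states[k] + U[k] @ x)
        # otherwise its state is carried over unchanged (cheap memory)
    return states

for t, x in enumerate(np.random.randn(16, 3)):  # toy input stream
    states = cw_step(t, x, states)
```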
What kind of a future do you see for recurrent neural networks, as considered against deep learning networks? What kinds of training algorithms, if any, work well for recurrent networks?
All deep feedforward networks are special cases of recurrent networks. In principle, recurrent neural networks are the deepest of them all. Deep Learning can occur in both recurrent and feedforward networks. See Sec. 3 of the Deep Learning survey.
What works well? Useful algorithms for supervised, unsupervised, and reinforcement learning recurrent networks are mentioned in the following sections of the Deep Learning survey: Sec. 5.5, 5.5.1, 5.6.1, 5.9, 5.10, 5.13, 5.16, 5.17, 5.20, 5.22, 6.1, 6.3, 6.4, 6.6, 6.7.
What are the next big things that you a) want to happen or b) predict will happen in the world of recurrent neural nets?
The world of RNNs is such a big world because RNNs (the deepest of all NNs) are general computers, and because efficient computing hardware in general is becoming more and more RNN-like, as dictated by physics: lots of processors connected through many short and few long wires. It does not take a genius to predict that in the near future, both supervised learning RNNs and reinforcement learning RNNs will be greatly scaled up. Current large, supervised LSTM RNNs have on the order of a billion connections; soon that will be a trillion, at the same price. (Human brains have maybe a thousand trillion, much slower, connections - to match this economically may require another decade of hardware development or so). In the supervised learning department, many tasks in natural language processing, speech recognition, automatic video analysis and combinations of all three will perhaps soon become trivial through large RNNs (the vision part augmented by CNN front-ends). The commercially less advanced but more general reinforcement learning department will see significant progress in RNN-driven adaptive robots in partially observable environments. Perhaps much of this won’t really mean breakthroughs in the scientific sense, because many of the basic methods already exist. However, much of this will SEEM like a big thing for those who focus on applications. (It also seemed like a big thing when in 2011 our team achieved the first superhuman visual classification performance in a controlled contest, although none of the basic algorithms was younger than two decades.)
So what will be the real big thing? I like to believe that it will be self-referential general purpose learning algorithms that improve not only a system's performance in a given domain, but also the way it learns, and the way it learns the way it learns, etc., limited only by the fundamental limits of computability. I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on the topic, but now I can see how it is starting to become a practical reality. Previous work on this is collected [here](http://people.idsia.ch/~juergen/metalearner.html).
What’s something exciting you’re working on right now, if it’s okay to be specific?
Among other things, we are working on the “RNNAIssance” - the birth of a Recurrent Neural Network-based Artificial Intelligence (RNNAI). This is about a reinforcement learning, RNN-based, increasingly general problem solver.
AI
From the AMA intro: Since age 15 or so, Jürgen Schmidhuber’s main scientific ambition has been to build an optimal scientist through self-improving Artificial Intelligence (AI), then retire.
What sparked you at age 15 to have that as your ambition?
When I was a boy, the nearby public library had all those popular science books. I got fascinated by the most fundamental of natural sciences, namely, physics. Even mathematics, the "queen of sciences," had a history of physics-driven discoveries. Einstein was my big hero. I wanted to become a physicist, too. But then I realized the much bigger potential impact of building an artificial scientist much smarter than myself, and letting it do the remaining work. This has been driving me ever since. I went on to study computer science with the ambition of building a general purpose artificial intelligence. This naturally led to self-modifying programs, reinforcement learning RNNs, etc.
What is your take on the threat posed by artificial super intelligence to mankind?
I guess there is no lasting way of controlling systems that are much smarter than humans, that pursue their own goals, and that are curious and creative in a way similar to humans and other mammals, but on a much grander scale.
But I think we may hope there won’t be too many goal conflicts between “us” and “them.” Let me elaborate on this.
Humans and others are interested in those they can compete and collaborate with. Politicians are interested in other politicians. Business people are interested in other business people. Scientists are interested in other scientists. Kids are interested in other kids of the same age. Goats are interested in other goats.
Supersmart AIs will be mostly interested in other supersmart AIs, not in humans, just like humans are mostly interested in other humans, not in ants. Aren't we much smarter than ants? But we don't try to exterminate them, except for the few that invade our homes. The weight of all ants is still comparable to the weight of all humans.
Human interests are mainly limited to a very thin film of biosphere around the third planet, full of poisonous oxygen that makes many robots rust. The rest of the solar system, however, is not made for humans, but for appropriately designed robots. Some of the most important explorers of the 20th century already were (rather stupid) robotic spacecraft. And they are getting smarter rapidly. Let’s go crazy. Imagine an advanced robot civilization in the asteroid belt, quite different from ours in the biosphere, with access to many more resources (e.g., the earth gets less than a billionth of the sun’s light). The belt contains lots of material for innumerable self-replicating robot factories. Robot minds or parts thereof will travel in the most elegant and fastest way (namely by radio from senders to receivers) across the solar system and beyond. There are incredible new opportunities for robots and software life in places hostile to biological beings. Why should advanced robots care much for our puny territory on the surface of planet number 3?
You see, I am an optimist :-)
You once said:
All attempts at making sure there will be only provably friendly AIs seem doomed. Once somebody posts the recipe for practically feasible self-improving Goedel machines or AIs in form of code into which one can plug arbitrary utility functions, many users will equip such AIs with many different goals, often at least partially conflicting with those of humans.
Do you still believe this? Secondly, if someone comes up with such a recipe, wouldn’t it be best if they didn’t publish it?
If there were such a recipe, would it be better if a single person kept it to himself, or would it be better if it got published?
I guess the biggest threat to humans will, as always, come from other humans, mostly because they share similar goals, which results in goal conflicts. Please see my optimistic answer above.
The future
In what field do you think machine learning will make the biggest impact in the next ~5 years?
I think it depends a bit on what you mean by “impact”. Commercial impact? If so, in a related answer I write: Both supervised learning recurrent neural networks (RNNs) and reinforcement learning RNNs will be greatly scaled up. In the commercially relevant supervised department, many tasks such as natural language processing, speech recognition, automatic video analysis and combinations of all three will perhaps soon become trivial through large RNNs (the vision part augmented by CNN front-ends).
“Symbol grounding” will be a natural by-product of this. For example, the speech or text-processing units of the RNN will be connected to its video-processing units, and the RNN will learn the visual meaning of sentences such as “the cat in the video fell from the tree”. Such RNNs should have many commercial applications.
I am not so sure when we will see the first serious applications of reinforcement learning RNNs to real world robots, but it might also happen within the next 5 years.
Where do you see the field of machine learning 5, 10, and 20 years from now?
Even (minor extensions of) existing machine learning and neural network algorithms will achieve many important superhuman feats. I guess we are witnessing the ignition phase of the field's explosion. But how does one predict the turbulent details of an explosion from within?
Earlier I tried to reply to questions about the next 5 years. You are also asking about the next 10 years. In 10 years we'll have 2025 - an interesting date, the centennial of the first transistor, patented by Julius Lilienfeld in 1925. But let me skip the 10-year question, which I find very difficult, and jump straight to the 20-year question, which I find much, much more difficult.
We are talking about 2035, which is also an interesting date, a century or so after modern theoretical computer science was created by Goedel (1931) and Church/Turing/Post (1936), and after the patent application for the first working general program-controlled computer was filed by Zuse (1936). Assuming Moore's law holds up, in 2035 computers will be more than 10,000 times faster than today, at the same price (a doubling every 18 months or so compounds to roughly a 10,000-fold speedup over two decades). That sounds more or less like the raw power of a human brain in a small portable device, or the brain power of a whole city in a larger computer.
Given such raw computational power, I expect huge (by today’s standards) recurrent neural networks on dedicated hardware to simultaneously perceive and analyse an immense number of multimodal data streams (speech, texts, video, many other modalities) from many sources, learning to correlate all those inputs and use the extracted information to achieve a myriad of commercial and non-commercial goals. Those RNNs will continually and quickly learn new skills on top of those they already know. This should have innumerable applications, although I am not even sure whether the word “application” still makes sense here.
This will change society in innumerable ways. What will be the cumulative effect of all those mutually interacting changes on our civilisation, which will depend on machine learning in so many ways? In [2012](http://people.idsia.ch/%7Ejuergen/newmillenniumai2012.pdf), I tried to illustrate how hard it is to answer such questions: a single human predicting the future of humankind is like a single neuron predicting what its brain will do.
I am supposed to be an expert, but my imagination is so biased and limited - I must admit that I have no idea what is going to happen. It just seems clear that everything will change. Sorry for completely failing to answer your question.
If you want more, here’s a recent interview with Jürgen.