Architects of Intelligence
A lot of current research is in areas where we’re not doing so well, such as unsupervised learning. This is where the computer can be more autonomous in the way that it acquires knowledge about the world. Another area of research is in causality, where the computer can not only observe data, like images or videos, but also act on it and see the effect of those actions in order to infer causal relationships in the world. The kinds of things that DeepMind, OpenAI, or Berkeley are doing with virtual agents, for example, are going in the right direction to answer those types of questions, and we’re also doing these kinds of things in Montreal.
MARTIN FORD: Are there any particular projects that you would point to as being really at the forefront of deep learning right now? The obvious one is AlphaZero, but what other projects really represent the leading edge of this technology?
YOSHUA BENGIO: There are a number of interesting projects, but the ones that I think are likely in the long run to have a big impact are those that involve virtual worlds in which an agent is trying to solve problems and to learn about its environment. We are working on this at MILA, and there are projects in the same area in progress at DeepMind, OpenAI, Berkeley, Facebook, and Google Brain. It’s the new frontier.
It’s important to remember, though, that this is not short-term research. We’re not working on a particular application of deep learning; instead, we’re looking into the future of how a learning agent makes sense of its environment and how a learning agent can learn to speak or to understand language, in particular what we call grounded language.
MARTIN FORD: Can you explain that term?
YOSHUA BENGIO: Sure, a lot of the previous effort in trying to make computers understand language has the computer just read lots and lots of text. That’s nice and all, but it’s hard for the computer to actually get the meaning of those words unless those sentences are associated with real things. You might link words to images or videos, for example, or for robots that might be objects in the real world.
There’s a lot of research in grounded language learning now trying to build an understanding of language, even if it’s a small subset of the language, where the computer actually understands what those words mean, and it can act in correspondence to those words. It’s a very interesting direction that could have a practical impact on things like language understanding for dialog, personal assistants, and so on.
MARTIN FORD: So, the idea there is basically to turn an agent loose in a simulated environment and have it learn like a child?
YOSHUA BENGIO: Exactly, in fact, we want to take inspiration from child development scientists who are studying how a newborn goes through a series of stages in the first few months of life where they gradually acquire more understanding about the world. We don’t completely understand which part of this is innate or really learned, and I think this understanding of what babies go through can help us design our own systems.
One idea I introduced a few years ago in machine learning that is very common in training animals is curriculum learning. The idea is that we don’t just show all the training examples as one big pile in an arbitrary order. Instead, we go through examples in an order that makes sense for the learner. We start with easy things, and once the easy things are mastered, we can use those concepts as the building blocks for learning slightly more complicated things. That’s why we go through school, and why when we are 6 years old we don’t go straight to university. This kind of learning is becoming more important in training computers as well.
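To make that concrete, here is a minimal Python sketch of the ordering idea behind curriculum learning; the difficulty scores, the number of stages, and the model’s training call are placeholders chosen purely for illustration:

```python
import numpy as np

# Minimal curriculum-learning sketch: train on "easy" examples first,
# then gradually mix in harder ones. "difficulty" is a stand-in score
# (e.g. sentence length or label noise) chosen by the designer.

def curriculum_batches(examples, difficulty, n_stages=3, batch_size=32):
    """Yield training batches ordered from easy to hard."""
    order = np.argsort(difficulty)            # easiest examples first
    stages = np.array_split(order, n_stages)  # split into difficulty stages
    seen = []
    for stage in stages:
        seen.extend(stage)                    # keep earlier (easier) examples in the pool
        pool = np.array(seen)
        np.random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield [examples[j] for j in pool[i:i + batch_size]]

# Hypothetical usage (model.train_step is a stand-in, not a real API):
# for batch in curriculum_batches(data, scores):
#     model.train_step(batch)
```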
MARTIN FORD: Let’s talk about the path to AGI. Obviously, you believe that unsupervised learning—essentially having a system learn like a person—is an important component of it. Is that enough to get to AGI, or are there other critical components and breakthroughs that have to happen for us to get there?
YOSHUA BENGIO: My friend Yann LeCun has a nice metaphor that describes this. We’re currently climbing a hill, and we are all excited because we have made a lot of progress on climbing the hill, but as we approach the top of the hill, we can start to see a series of other hills rising in front of us. That is what we see now in the development of AGI: some of the limitations of our current approaches. When we were climbing the first hill, when we were discovering how to train deeper networks, for example, we didn’t see the limitations of the systems we were building because we were just discovering how to go up a few steps.
As we reach this satisfying improvement that we are getting in our techniques—we reach the top of the first hill—we also see the limitations, and then we see another hill that we have to climb, and once we climb that one we’ll see another one, and so on. It’s impossible to tell how many more breakthroughs or significant advances are going to be needed before we reach human-level intelligence.
MARTIN FORD: How many hills are there? What’s the timescale for AGI? Can you give me your best guess?
YOSHUA BENGIO: You won’t be getting that from me, there’s no point. It’s useless to guess a date because we have no clue. All I can say is that it’s not going to happen in the next few years.
MARTIN FORD: Do you think that deep learning or neural networks generally are really the way forward?
YOSHUA BENGIO: Yes. What we have discovered about the scientific concepts behind deep learning, together with the years of progress made in this field, means that for the most part those concepts and neural networks are here to stay. Simply put, they are incredibly powerful. In fact, they are probably going to help us better understand how animal and human brains learn complex things. As I said, though, they’re not enough to get us to AGI. We’re at a point where we can see some of the limitations in what we currently have, and we’re going to improve and build on top of that.
MARTIN FORD: I know that the Allen Institute for AI is working on Project Mosaic, which is about building common sense into computers. Do you think that kind of thing is critical, or do you think that maybe common sense emerges as part of the learning process?
YOSHUA BENGIO: I’m sure common sense will emerge as part of the learning process. It won’t come about because somebody sticks little bits of knowledge into your head; that’s not how it works for humans.
MARTIN FORD: Is deep learning the primary way to get us to AGI, or do you think it’s going to require some sort of a hybrid system?
YOSHUA BENGIO: Classical AI was purely symbolic, and there was no learning. It focused on a really interesting aspect of cognition, which is how we sequentially reason and combine pieces of information. Deep learning neural networks, on the other hand, have always been about focusing on a sort of bottom-up view of cognition, where we start with perception and we anchor the machine’s understanding of the world in perception. From there, we build distributed representations and can capture the relationship between many variables.
I studied the relationships between such variables with my brother around 1999. That gave rise to a lot of the recent progress in natural language, such as word embeddings, or distributed representations for words and sentences. In these cases, a word is represented by a pattern of activity in your brain—or by a set of numbers. Those words that have a similar meaning are then associated with similar patterns of numbers.
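As a rough illustration of those distributed representations, the sketch below uses tiny, made-up vectors; real word embeddings are learned from data and typically have hundreds of dimensions:

```python
import numpy as np

# Illustrative sketch of distributed word representations: each word is a
# vector of numbers, and words with similar meanings end up with similar
# vectors. These 4-dimensional vectors are invented for the example.
embeddings = {
    "cat":   np.array([0.8, 0.1, 0.9, 0.2]),
    "dog":   np.array([0.7, 0.2, 0.8, 0.3]),
    "table": np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cat"], embeddings["dog"]))    # high: similar meaning
print(cosine(embeddings["cat"], embeddings["table"]))  # lower: dissimilar
```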
What’s going on now in the deep learning field is that people are building on top of these deep learning concepts and starting to try to solve the classical AI problems of reasoning and being able to understand, program, or plan. Researchers are trying to use the building blocks that we developed from perception and extend them towards these higher-level cognitive tasks (sometimes called System 2 by psychologists). I believe in part that’s the way that we’re going to move towards human-level AI. It’s not that it’s a hybrid system; it’s like we’re trying to solve some of the same problems that classical AI was trying to solve but using the building blocks coming from deep learning. It’s a very different way of doing it, but the objectives are very similar.
MARTIN FORD: Your prediction, then, is that it’s all going to be neural networks, but with different architectures?
YOSHUA BENGIO: Yes. Note that your brain is all neural networks. We have to come up with different architectures and different training frameworks that can do the kinds of things that classical AI was trying to do, like reasoning, inferring an explanation for what you’re seeing, and planning.
MARTIN FORD: Do you think it can all be done with learning and training or does there need to be some structure there?
YOSHUA BENGIO: There is structure there; it’s just that it’s not the kind of structure that we use to represent knowledge when we write an encyclopedia or a mathematical formula. The kind of structure that we put in corresponds to the architecture of the neural net, and to fairly broad assumptions about the world and the kind of task that we’re trying to solve. When we put in a special structure and architecture that allows the network to have an attention mechanism, it’s putting in a lot of prior knowledge. It turns out that this is central to the success of things like machine translation.
You need that kind of tool in your toolbox in order to solve some of those problems, in the same way that if you deal with images, you need to have something like a convolutional neural network structure in order to do a good job. If you don’t put in that structure, then performance is much worse. There are already a lot of domain-specific assumptions about the world and about the function you’re trying to learn, that are implicit in the kind of architectures and training objectives that are used in deep learning. This is what most of the research papers today are about.
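As an illustration of the kind of prior knowledge an architecture can encode, here is a toy convolution in Python: the same small filter is reused at every position, which builds in the assumption that a useful local pattern can appear anywhere in the image. The filter and image values below are arbitrary.

```python
import numpy as np

# Sketch of the assumptions a convolutional layer bakes in: one small
# filter (weight sharing) is applied at every location (locality).
def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with one filter."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1.0, -1.0]])    # responds to vertical intensity changes
image = np.random.rand(8, 8)             # toy "image"
print(conv2d(image, edge_filter).shape)  # (8, 7): a map of filter responses
```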
MARTIN FORD: What I was trying to get at with the question on structure was that, for example, a baby can recognize human faces right after it is born. Clearly, then, there is some structure in the human brain that allows the baby to do that. It’s not just raw neurons working on pixels.
YOSHUA BENGIO: You’re wrong! It is raw neurons working on pixels, except that there is a particular architecture in the baby’s brain that recognizes something circular with two dots inside it.
MARTIN FORD: My point is that the structure pre-exists.
YOSHUA BENGIO: Of course it does, but all the things that we’re designing in those neural networks also pre-exist. What deep learning researchers are doing is like the work of evolution, where we’re putting in the prior knowledge in the form of both the architecture and the training procedure.
If we wanted, we could hardwire something that would allow the network to recognize a face, but it’s useless for an AI because it can learn that very quickly. Instead, we put in the things that are really useful for solving the harder problems that we’re trying to deal with.
Nobody is saying that there is no innate knowledge in humans, babies, and animals; in fact, most animals have only innate knowledge. An ant doesn’t learn much; it’s all like a big, fixed program. But as you go higher up in the intelligence hierarchy, the share of learning keeps increasing. What makes humans different from many other animals is how much we learn versus how much is innate at the start.
MARTIN FORD: Let’s step back and define some of those concepts. In the 1980s, neural networks were a very marginalized subject and they were just one layer, so there was nothing deep about them. You were involved in transforming that into what we now call deep learning. Could you define, in relatively non-technical terms, what that is?
YOSHUA BENGIO: Deep learning is an approach to machine learning. While machine learning is trying to put knowledge into computers by allowing computers to learn from examples, deep learning is doing it in a way that is inspired by the brain.
Deep learning and machine learning are just a continuation of that earlier work on neural networks. They’re called “deep” because they added the ability to train deeper networks, meaning they have more layers, and each layer represents a different level of representation. We hope that as the network gets deeper, it can represent more abstract things, and so far, that does seem to be the case.
MARTIN FORD: When you say layers, do you mean layers of abstraction? So, in terms of a visual image, the first layer would be pixels, then it would be edges, followed by corners, and then gradually you would get all the way up to objects?
YOSHUA BENGIO: Yes, that’s correct.
MARTIN FORD: If I understand correctly, though, the computer still doesn’t understand what that object is, right?
YOSHUA BENGIO: The computer has some understanding, it’s not a black-and-white argument. A cat understands a door, but it doesn’t understand it as well as you do. Different people have different levels of understanding of the many things around them, and science is about trying to deepen our understanding of those many things. These networks have a level of understanding of images if they’ve been trained on images, but that level is still not as abstract and as general as ours. One reason for this is that we interpret images in the context of our three-dimensional understanding of the world, obtained thanks to our stereo vision and our movements and actions in the world. This gives us a lot more than just a visual model: it also gives us a physical model of objects. The current level of computer understanding of images is still primitive but it’s still good enough to be incredibly useful in many applications.
MARTIN FORD: Is it true that the thing that has really made deep learning possible is backpropagation? The idea that you can send the error information back through the layers, and adjust each layer based on the final outcome.
YOSHUA BENGIO: Indeed, backpropagation has been at the heart of the success of deep learning in recent years. It is a method to do credit assignment, that is, to figure out how internal neurons should change to make the bigger network behave properly. Backpropagation, at least in the context of neural networks, was discovered in the early 1980s, at the time when I started my own work. Yann LeCun independently discovered it around the same time as Geoffrey Hinton and David Rumelhart. It’s an old idea, but we didn’t practically succeed in training these deeper networks until around 2006, over a quarter of a century later.
Since then, we’ve been adding a number of other features to these networks, which are very exciting for our research into artificial intelligence, such as attention mechanisms, memory, and the ability to not just classify but also generate images.
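For readers who want to see the mechanics, here is a minimal backpropagation sketch for a two-layer network on toy data; the layer sizes, learning rate, and task are arbitrary choices made only for illustration:

```python
import numpy as np

# Minimal backpropagation sketch: a two-layer network trained on a toy
# task. Gradients are computed layer by layer, from the output error
# back toward the input (the "credit assignment" step).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy targets

W1 = rng.normal(scale=0.1, size=(3, 8))        # first layer weights
W2 = rng.normal(scale=0.1, size=(8, 1))        # second layer weights
lr = 0.1

for step in range(1000):
    # Forward pass
    h = np.maximum(0, X @ W1)                  # hidden layer (ReLU)
    pred = h @ W2                              # output layer (linear)
    loss = np.mean((pred - y) ** 2)

    # Backward pass: propagate the error back through each layer
    d_pred = 2 * (pred - y) / len(X)           # dLoss/dPred
    d_W2 = h.T @ d_pred                        # credit assigned to W2
    d_h = d_pred @ W2.T                        # error sent back to the hidden layer
    d_h[h <= 0] = 0                            # ReLU gradient
    d_W1 = X.T @ d_h                           # credit assigned to W1

    W1 -= lr * d_W1                            # gradient descent updates
    W2 -= lr * d_W2

    if step % 200 == 0:
        print(f"step {step}: loss {loss:.4f}")  # loss should shrink over time
```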
MARTIN FORD: Do we know if the brain does something similar to backpropagation?
YOSHUA BENGIO: That’s a good question. Neural nets are not trying to imitate the brain, but they are inspired by some of its computational characteristics, at least at an abstract level.
You have to realize that we don’t yet have a full picture of how the brain works. There are many aspects of the brain that are not yet understood by neuroscientists. There are tons of observations about the brain, but we don’t know how to connect the dots yet.
It may be that the work that we’re doing in machine learning with neural nets could provide a testable hypothesis for brain science. That’s one of the things that I’m interested in. In particular, backpropagation up to now has mostly been considered something that computers can do, but not realistic for brains.
The thing is, backpropagation is working incredibly well, and it suggests that maybe the brain is doing something similar—not exactly the same, but serving the same function. As a result of that, I’m currently involved in some very interesting research in that direction.
MARTIN FORD: I know that there was an “AI Winter” where most people had dismissed deep learning, but a handful of people, like yourself, Geoffrey Hinton, and Yann LeCun, kept it alive. How did that then evolve to the point where we find ourselves today?
YOSHUA BENGIO: By the end of the ‘90s and through the early 2000s, neural networks were not trendy, and very few groups were involved with them. I had a strong intuition that by throwing out neural networks, we were throwing out something really important.
Part of that was because of something that we now call compositionality: the ability of these systems to represent very rich information about the data in a compositional way, where you compose many building blocks that correspond to the neurons and the layers. That led me to language models, early neural networks that model text using word embeddings. Each word is associated with a set of numbers corresponding to different attributes that are learned autonomously by the machine. It didn’t really catch on at the time, but nowadays almost everything to do with modeling language from data uses these ideas.
The big question was how we could train deeper networks, and the breakthrough was made by Geoffrey Hinton and his work with Restricted Boltzmann Machines (RBMs). In my lab, we were working on autoencoders, which are very closely related to RBMs, and autoencoders have given rise to all kinds of models, such as generative adversarial networks. It turned out that by stacking these RBMs or autoencoders, we were able to train deeper networks than before.
MARTIN FORD: Could you explain what an autoencoder is?
YOSHUA BENGIO: There are two parts to an autoencoder, an encoder and a decoder. The idea is that the encoder part takes an image, for example, and tries to represent it in a compressed way, such as a verbal description. The decoder then takes that representation and tries to recover the original image. The autoencoder is trained to do this compression and decompression so that it is as faithful as possible to the original.
Autoencoders have changed quite a bit since that original vision. Now, we think of them in terms of taking raw information, like an image, and transforming it into a more abstract space where the important, semantic aspect of it will be easier to read. That’s the encoder part. The decoder works backwards, taking those high-level quantities—that you don’t have to define by hand—and transforming them into an image. That was the early deep learning work.
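Here is a minimal sketch of that encoder-decoder idea, with arbitrary toy dimensions and a plain gradient-descent training loop:

```python
import numpy as np

# Sketch of an autoencoder: an encoder maps the input to a smaller code,
# a decoder tries to reconstruct the input from that code, and both are
# trained to minimize reconstruction error. All sizes here are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                 # toy inputs with 20 features

W_enc = rng.normal(scale=0.1, size=(20, 5))    # encoder: 20 -> 5 (the code)
W_dec = rng.normal(scale=0.1, size=(5, 20))    # decoder: 5 -> 20
lr = 0.01

for step in range(2000):
    code = np.tanh(X @ W_enc)                  # compressed representation
    recon = code @ W_dec                       # attempted reconstruction
    err = recon - X
    loss = np.mean(err ** 2)                   # reconstruction error

    # Backpropagate the reconstruction error through both parts
    d_recon = 2 * err / X.size
    d_W_dec = code.T @ d_recon
    d_code = d_recon @ W_dec.T * (1 - code ** 2)   # tanh gradient
    d_W_enc = X.T @ d_code

    W_enc -= lr * d_W_enc
    W_dec -= lr * d_W_dec

    if step % 500 == 0:
        print(f"step {step}: reconstruction loss {loss:.4f}")
```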
Then a few years later, we discovered that we didn’t need these approaches to train deep networks; we could just change the nonlinearity. One of my students was working with neuroscientists, and we thought that we should try rectified linear units (ReLUs)—we called them rectifiers in those days—because they were more biologically plausible. This is an example of actually taking inspiration from the brain.
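For reference, the rectifier is simply zero for negative inputs and the identity for positive ones; here is a tiny sketch comparing it with a sigmoid unit:

```python
import numpy as np

# The rectifier (ReLU) nonlinearity: zero for negative inputs, identity
# for positive ones. Unlike a sigmoid, its gradient does not shrink
# toward zero for large positive inputs, which helps train deep networks.
def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(x))  # values squashed between 0 and 1
```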