Q1. What are some strategies for making your machine learning model work well when you don’t have much data?
One of the main deep learning algorithms for small data sets is called dropout, where every time you run the neural net you randomly turn off some of its neurons. Because it’s a different set of random neurons each time you run it, it’s kind of like having more data. There is also an approach called Bayesian inference, where instead of finding one neural net to explain the data, we think about all the possible neural nets that could explain the data in many different ways. In practice, you can’t enumerate all the possible neural nets, but you can write down a mathematical integral that describes what the voting would look like if every neural net were able to vote, weighted by how well it could explain the training set. And if you can approximate that integral, you can do really well even if you don’t have very much training data.
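The "randomly turn off some of the neurons" step can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function name `dropout`, the `keep_prob` parameter and the example activations are all assumptions, and the rescaling of surviving activations (often called inverted dropout) is one common variant rather than something stated in the answer.

```python
import random


def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob.

    Survivors are divided by keep_prob (inverted dropout, an assumption here)
    so that the expected value of each activation stays the same.
    """
    dropped = []
    for a in activations:
        if rng.random() < keep_prob:
            dropped.append(a / keep_prob)  # neuron kept, rescaled
        else:
            dropped.append(0.0)            # neuron turned off for this run
    return dropped


rng = random.Random(0)                       # fixed seed so the sketch is repeatable
hidden = [0.5, 1.2, 0.0, 2.0, 0.7, 1.5]      # hypothetical hidden-layer activations

# Two runs over the same activations draw two different random masks.
pass_one = dropout(hidden, keep_prob=0.5, rng=rng)
pass_two = dropout(hidden, keep_prob=0.5, rng=rng)
print(pass_one)
print(pass_two)
```

At test time the mask is normally switched off, which in this sketch corresponds to calling `dropout` with `keep_prob=1.0`.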
Q2. In what scenarios should you not use neural nets?
One answer would be if you have a very limited computational budget. If you are working at a high-frequency trading company and it’s really important to make your trades very, very fast, then you might use some kind of shallow algorithm. Another answer is if there just isn’t much structure in your data compared to how much noise there is. If there isn’t any highly complicated pattern you can reliably extract, then there isn’t any need to use a complicated model that describes those complicated patterns. For most tasks you would consider to be artificial intelligence tasks, you usually want to use a deep neural net. So if you are doing something like understanding speech, recognising objects, generating images, playing video games, making a robot cook, that kind of thing, those are usually tasks where you want to use deep learning.
Q3. Android speech recognition has to respond with low latency, but it still uses deep nets?
When I talk about latency, I am talking about latency that is not on the scale of a human mind, where microseconds in the trading algorithm will change how much profit you make. Humans will tolerate fractions of a second. If you asked your Android phone a question and it responded mere microseconds after you stopped speaking, it would be a little bit creepy and annoying. It’s okay for there to be several hundred milliseconds of delay for speech recognition.
Q4. Can you see mistakes that happen at different layers of the neural net, and can you go back to correct them?
The answer is that this is a much deeper question than you might realise. It’s really hard for us to understand exactly what each of the network’s layers means, and it’s really hard for us to tell exactly what the network is doing. So it’s possible that it is correcting mistakes at different layers of the network. There are some models that are designed explicitly to do that, but those aren’t the models that are the most popular right now. The models I described in this lecture just move in one direction through the network. Until maybe about 5 years ago, a very popular research direction was to understand whether information flows backwards as well. We know that in the human brain there are 10 times more backward connections than forward connections. We don’t necessarily know what those do; some of them might be for the learning process and not actually used to recognise new images, but they might also be used in the recognition process itself. So there are a lot of algorithms, such as the deep Boltzmann machine, where more abstract layers can go back and change more concrete layers, which could possibly fix the mistakes you are talking about. But so far these have not ended up being the most popular and effective algorithms, and we don’t really know why.
There is a lot of research and many different visualisation techniques for understanding what the intermediate layers are doing, but many of the analysis techniques show you very different results, and it’s kind of hard to know which ones we should take more seriously than others. Part of the issue is that it is such a complicated system that if you expect to find something, you can probably find what you are looking for somewhere, and you just don’t know if that’s the most common thing or if it happens only occasionally. An example I give people from biology is gene transcription: you have to read a gene out of the DNA, so how does the body actually decide where to start reading and where to stop reading? Well, the default mechanism we usually have is that there’s a start code and a stop code, but then there are other mechanisms. For example, as the DNA transcription enzyme starts to read the DNA and copy it into RNA, the RNA is designed to swing around, hit the enzyme and knock it off the DNA. So it’s a completely crazy mechanism, but the body actually decides to use it somewhere. I think neural nets are a little bit like that: if you can think of a mechanism, it probably happens somewhere in some neural net, and it can be hard to tell if you are finding a mechanism that happens only occasionally. This is a popular research area, but at the moment I would say techniques for analysing neural nets haven’t reached a solid conclusion yet, and a lot of them find almost contradictory things.
Q5. Do you need more than three layers in neural nets?
I think by three layers you mean an input layer, a hidden layer and an output layer. So basically, do you need more than one hidden layer or not? There are a few questions here: what functions can a neural net represent, and what functions can a neural net learn? It turns out that if you have just one hidden layer, you can represent any function with a neural net. You might need to give it lots and lots of neurons, but you can represent it. But when you actually start to learn from the training set, it might be really hard to learn with that size of hidden layer. Basically, if you have just one hidden layer you might be able to solve most problems in principle, but you might make it very difficult for yourself. You might be able to solve the problem with fewer neurons if you make the network deeper, or you might be able to generalize to the test set with fewer training examples if you make it deeper. And it changes a lot from one task to another. Partly it depends on the structure of the task you are solving. Using a deeper network is saying that the task has a recursive structure to it: objects are made of object parts, object parts are made of contours and corners, contours and corners are made of edges, and edges are made of pixels. So that kind of tells us you want to have several layers of processing. For some other tasks, like deciding whether you should give a patient a particular drug or not, the function might be very simple and you don’t need much depth to solve it.
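As a small illustration of the "represent" half of that distinction, here is a sketch of a network with a single hidden layer of two ReLU units that computes XOR, a function no network without a hidden layer can represent. The weights are picked by hand for the example rather than learned, and the names `relu` and `one_hidden_layer_xor` are hypothetical. It shows only that such a representation exists; it says nothing about how easy the weights would be to find by training.

```python
def relu(z):
    """Rectified linear unit: pass positive values through, clamp the rest to 0."""
    return max(0.0, z)


def one_hidden_layer_xor(x1, x2):
    """Input layer (x1, x2) -> two hidden ReLU units -> one linear output."""
    h1 = relu(x1 + x2)          # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)    # active only when both inputs are on
    return h1 - 2.0 * h2        # subtract the "both on" case twice


# Evaluate the network on all four binary input pairs.
outputs = {(a, b): one_hidden_layer_xor(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, value in sorted(outputs.items()):
    print(inputs, value)
```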
Q6. Can you tell how noisy your data is ahead of time?
A lot of the time you have a pretty good idea; you just guess based on your knowledge of where the data is coming from. There are many different sources of noise in the world. One source is that things really are very random: you have heat in the physical world that just scatters, and you are measuring variables that are related to that. You can imagine that some of the physical processes you are measuring have very random effects, or your understanding of the physics tells you they will be fairly noisy. Another source of randomness is very incomplete information. If you are trying to predict whether a user will buy a particular product but you don’t know much of anything about that user, you might know something about where they are located or what website they used to navigate to your website, but you don’t really have any idea whether they already have the shoes that you are selling or something like that. If you had complete information about them, you might have a better idea of whether it makes sense for them to click on a particular product. That’s why a lot of ad models kept using linear models for quite a long time. You can do things like measuring the randomness in your system, but it’s a little hard to know for sure if you are measuring it correctly. Basically, if you try to fit structure and you don’t gain anything from fitting that structure, then it’s a sign there might be just noise there. But you won’t know whether there is some structure you are failing to detect or whether it’s noisy because there’s no structure to start with.
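One rough way to act on the "fit structure and see whether you gain anything" idea is to compare a fitted model against a constant baseline on held-out data. The sketch below is only an assumption about how such a check might look: the helper names, the synthetic data and the choice of a straight-line fit are all hypothetical. As the answer warns, a gain near zero cannot tell you whether the data is pure noise or has structure that this particular model fails to capture.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def held_out_gain(xs, ys):
    """Fraction of the baseline's held-out error that the line fit removes."""
    half = len(xs) // 2
    train_x, test_x = xs[:half], xs[half:]
    train_y, test_y = ys[:half], ys[half:]
    baseline = sum(train_y) / len(train_y)            # predict the training mean
    baseline_error = mse([baseline] * len(test_y), test_y)
    slope, intercept = fit_line(train_x, train_y)
    line_error = mse([slope * x + intercept for x in test_x], test_y)
    return 1.0 - line_error / baseline_error


rng = random.Random(0)
xs = [rng.random() for _ in range(400)]
structured = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # linear signal, mild noise
pure_noise = [rng.gauss(0.0, 1.0) for _ in xs]            # no dependence on x at all

gain_structured = held_out_gain(xs, structured)
gain_noise = held_out_gain(xs, pure_noise)
print("gain on structured data:", gain_structured)
print("gain on pure noise:", gain_noise)
```

The first gain should be close to 1 and the second close to 0, since fitting a line to targets that do not depend on `x` removes almost none of the held-out error.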
Q7. How do you approach debugging?
That’s almost like my whole job. Debugging is probably the hardest thing; it’s the reason that I cannot just write a program that does machine learning for you. There are all kinds of bugs that can come up everywhere, from the way you prepare the data to the way you write the code for the machine learning algorithm to the way you choose the hyperparameters. You need a lot of experience, really. There is so much you can say about that, it is really hard to answer that question on the spot.
Q8. Is there an area of research that’s more interesting than the amount of hype it’s receiving?
Maybe it’s fairness in machine learning. That’s an issue I think a lot of people are not even aware exists. When you start using machine learning algorithms to make decisions that affect people’s lives, like whether to approve their mortgages or not, you need to think a lot about how that algorithm is actually working. It’s just a difficult technical problem. Nobody designs unfair machine learning algorithms because they are cruel, cold-hearted people; it’s just that the algorithms that work the best are really hard to understand, and the algorithms whose workings we do understand are not usually very effective. A lot of people are starting to dive into this area, but it has not become as white-hot as things like reinforcement learning or supervised learning. I am also working on another area, which may be appropriately hyped, which is machine learning security. Machine learning security is how you make sure that your machine learning algorithm will work correctly even if someone is intentionally trying to interfere with it. What if they are changing the training set examples to make it learn the wrong function? What if they are changing the input to make it mis-recognise things and recommend that you take a bad course of action? What if they are trying to study the parameters of your trained model in order to recover training examples that are sensitive information you don’t want to publish? That’s something that really just took off in the last year or so.
Q9. Do you want to make an end-to-end system or do you want to make a system that’s divided into several different components?
So say that you want to read a piece of text. In a system that’s divided into components, one component will find each of the different letters of the text and then another component will go through and recognise each letter individually, whereas an end-to-end system will just look at the text and output the whole sentence all at once. There’s always this debate about whether you can do end-to-end learning or whether you need to split things into components. There are some theoretical reasons that end-to-end learning could be hard for some problems, but it also requires a lot less engineering effort, so if you can get it to work, it’s great. There might be some problems for which this just doesn’t work. In practice, it seems end-to-end learning has been very successful for a lot of problems, and a lot of times we see papers that overstate how hard end-to-end learning is. There was a paper that said you can’t train a convolutional neural net to recognise a sequence of symbols, and then my co-authors and I did basically that at Google a year later. So sometimes you think things are really difficult, and then it turns out the problem just goes away if you make the network bigger or train it with more data or something like that. In other cases, you can actually prove theoretically that you can’t learn them end-to-end, but maybe the proof only applies to very weird problems that don’t resemble anything that comes up in real life very much.
Q10. Is there a good way for deep learning to deal with missing data?
Most deep learning models don’t have any good way to deal with missing data, but there are some that can. A lot of the reason people study generative models is that generative models give a good way to deal with missing data. That problem hasn’t been very popular lately.
Q11. Is there a difference between the way deep learning works and the way children and adults learn?
There is not a huge amount of literature on that topic. One thing I can think of is my friend Andrew Saxe’s paper, Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. That’s probably an under-hyped paper rather than an under-hyped field. That paper has influenced my thinking a lot for several years since then. One aspect of the paper is comparing the way that children learn to the way deep learning algorithms learn. Both deep learning and child learning have this funny thing where they learn very fast, so their error rate goes down really fast, then they kind of level off for a while, and then the error suddenly goes down really fast again. That turns out to be related to the shape of the surface of the cost function they are minimising. The learning algorithm can slow down when it comes near a saddle point. A saddle point is a point on the cost function that looks like a minimum in one cross section but looks like a maximum in another cross section. So when you come down the cross section that looks like a minimum, you get stuck at the bottom of it for a while, then you discover that you are actually on top of a maximum, and you start moving in the other direction.
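The fast drop, plateau, then second fast drop can be reproduced on a toy cost function. The sketch below is an assumption made for illustration and is not taken from the paper: the cost `x**2 + 0.25 * y**4 - 0.5 * y**2` was chosen only because it has a saddle at the origin (a minimum along `x`, a maximum along `y`) and true minima at `y = ±1`. The starting point and learning rate are likewise arbitrary.

```python
def cost(x, y):
    """Toy cost with a saddle at (0, 0) and minima at (0, 1) and (0, -1)."""
    return x ** 2 + 0.25 * y ** 4 - 0.5 * y ** 2


def gradient(x, y):
    """Partial derivatives of cost with respect to x and y."""
    return 2.0 * x, y ** 3 - y


x, y = 1.0, 1e-4          # start far out along x, almost exactly on the ridge in y
learning_rate = 0.1
losses = [cost(x, y)]     # losses[k] is the cost after k gradient steps

for _ in range(200):
    grad_x, grad_y = gradient(x, y)
    x -= learning_rate * grad_x
    y -= learning_rate * grad_y
    losses.append(cost(x, y))

# Fast initial drop, a long stretch near the saddle value of 0, then a second drop.
for step in (0, 10, 20, 50, 80, 100, 120, 200):
    print(step, round(losses[step], 5))
```

The first drop comes from sliding down the `x` direction toward the saddle. The plateau lasts while `y` is still tiny and its gradient is nearly zero, and the second drop begins once `y` has grown enough to roll off the maximum.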
Q12. What’s interesting about moving deep learning architectures forward? Is it just more layers and more layers?
One of the things that I think about the most now is adversarial examples.
A lot of the reason is that the machine learning algorithms we use today are very linear as a function of their input. Even though they have lots of layers, they still end up looking a lot like a linear function. They are a really nonlinear function of their parameters, but not of the input itself. I am really interested in designing new architectures that are less linear and more nonlinear, and in particular ones that are able to avoid being fooled by these tiny little changes. That’s the thing I am personally most excited about, and I spend something like 60-70 percent of my time working on that.
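To see why linearity in the input makes tiny changes dangerous, consider a purely linear score `w · x + b`. Nudging every input by a small `eps` in the direction of the sign of its weight moves the score by `eps` times the sum of the absolute weights, which grows with the number of inputs. The sketch below uses a hypothetical 1000-dimensional linear classifier with made-up weights; it is a simplified stand-in for the behaviour described in the answer, not a model of any particular deep network.

```python
def score(weights, inputs, bias):
    """Linear score: positive means class A, negative means class B."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sign(value):
    """Return 1, 0 or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def perturb_against(weights, inputs, eps):
    """Shift each input by at most eps in the direction that lowers the score."""
    return [x - eps * sign(w) for w, x in zip(weights, inputs)]


n = 1000
weights = [0.1 if i % 2 == 0 else -0.1 for i in range(n)]  # hypothetical weights
inputs = [0.5] * n                                         # hypothetical clean input
bias = 1.0

clean_score = score(weights, inputs, bias)
adversarial = perturb_against(weights, inputs, eps=0.02)
adversarial_score = score(weights, adversarial, bias)
largest_change = max(abs(a - x) for a, x in zip(adversarial, inputs))

print("clean score:", clean_score)              # positive: class A
print("adversarial score:", adversarial_score)  # negative: class B
print("largest per-input change:", largest_change)
```

No single input moves by more than about 0.02, yet the score shifts by 0.02 × 100 = 2 and changes sign, because the small shifts all push in the same direction and add up across 1000 dimensions.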