On the evolution of machine learning

From linear models to neural networks: an interview with Reza Zadeh.

Get notified when our free report, “Future of Machine Intelligence: Perspectives from Leading Practitioners,” is available for download. The following interview is one of many that will be included in the report.

As part of our ongoing series of interviews surveying the frontiers of machine intelligence, I recently interviewed Reza Zadeh. Reza is a Consulting Professor in the Institute for Computational and Mathematical Engineering at Stanford University and a Technical Advisor to Databricks. His work focuses on Machine Learning Theory and Applications, Distributed Computing, and Discrete Applied Mathematics.

Key Takeaways

  • Neural networks have made a comeback and are playing a growing role in new approaches to machine learning.
  • The greatest successes are being achieved via a supervised approach leveraging established algorithms.
  • Spark is an especially well-suited environment for distributed machine learning.

David Beyer: Tell us a bit about your work at Stanford

Reza Zadeh: At Stanford, I designed and teach distributed algorithms and optimization (CME 323) as well as a course called discrete mathematics and algorithms (CME 305). In the discrete mathematics course, I teach algorithms from a completely theoretical perspective, meaning that it is not tied to any programming language or framework, and we fill up whiteboards with many theorems and their proofs.

On the more practical side, in the distributed algorithms class, we work with the Spark cluster programming environment. I spend at least half my time on Spark. So all the theory that I teach in regard to distributed algorithms and machine learning gets implemented and made concrete by Spark, and then put in the hands of thousands of industry and academic folks who use commodity clusters.

I started running MapReduce jobs at Google back in 2006, before Hadoop was really popular or even known; but MapReduce was already mature at Google. I was 18 at the time, and even then I could see clearly that this is something that the world needs outside of Google. So I spent a lot of time building and thinking about algorithms on top of MapReduce, and always worked to stay current, long after leaving Google. When Spark came along, it was nice that it was open-source and one could see its internals, and contribute to it. I felt like it was the right time to jump on board because the idea of an RDD was the right abstraction for much of distributed computing.

DB: From your time at Google up to the present work you’re doing with Spark, you have had the chance to see some of the evolution of machine learning as it ties to distributed computing. Can you describe that evolution?

RZ: Machine learning has been through several transition periods starting in the mid-90’s. From 1995–2005, there was a lot of focus on natural language, search, and information retrieval. The machine learning tools were simpler than what we’re using today; they include things like logistic regression, SVMs (support vector machines), kernels with SVMs, and PageRank. Google became immensely successful using these technologies, building major success stories like Google News and the Gmail spam classifier using easy-to-distribute algorithms for ranking and text classification – using technologies that were already mature by the mid-90’s.

Then around 2005 neural networks started making a comeback. Neural networks are a technology from the 80’s – some would even date them back to the 60’s – and they’ve become “retrocool” thanks to their important recent advances in computer vision. Computer vision makes very productive use of (convolutional) neural networks. As that fact has become better established, neural networks are making their way into other applications, creeping into areas like natural language processing and machine translation.

But there’s a problem: neural networks are probably the most challenging of all the mentioned models to distribute. Those earlier models have all had their training successfully distributed. We can use 100 machines and train a logistic regression or SVM without much hassle. But developing a distributed neural network learning setup has been more difficult.

So guess who’s done it successfully? The only organization so far is Google; they are the pioneers, yet again. It’s very much like the scene back in 2005 when Google published the MapReduce paper, and everyone scrambled to build the same infrastructure. Google managed to distribute neural networks, get more bang for their buck, and now everyone is wishing they were in the same situation. But they’re not.

DB: Why is an SVM or logistic regression easier to distribute than a neural network?

RZ: First of all, evaluating an SVM is a lot easier. After you’ve learned an SVM model or logistic regression model — or any linear model — the actual evaluation is very fast. Say you built a spam classifier. A new email comes along; to classify it as spam or not it takes very little time, because it’s just one dot product (in linear algebra terms). When it comes to a neural network, you have to do a lot more computation — even after you have learned the model — to figure out the model’s output. And that’s not even the biggest problem. A typical SVM might be happy with just a million parameters, but the smallest successful neural networks I’ve seen have around 6 million–and that’s the absolutely smallest. Another problem is that the training algorithms don’t benefit from much of optimization theory. Most of the linear models that we use have mathematical guarantees on when training is finished. They can guarantee when you have found the best model you’re going to find. But the optimization algorithms that exist for neural networks don’t afford such guarantees. You don’t know after you’ve trained a neural network whether, given your setup, this is the best model you could have found. So you’re left wondering if you would have a better model if you kept on training.

DB: As neural networks become more powerful, do you see them subsuming more and more of the work that used to be the bread and butter of linear methods?

RZ: I think so, yes. Actually that’s happening right now. There’s always this issue that linear models can only discriminate linearly. In order to get non-linearities involved, you would have to add or change features, which involves a lot of work. For example, computer vision scientists spent a decade developing and tuning these things called SIFT features, which enable image classification and other vision tasks using linear methods. But then neural networks came along and SIFT features became unnecessary; the neural network approach is to make features automatically as part of the training.

But I think it’s asking for too much to say neural networks can replace all feature construction techniques. I don’t think that will happen. There will always be a place for linear models and good human-driven feature engineering. Having said that, pretty much any researcher who has been to the NIPS Conference is beginning to evaluate neural networks for their application. Everyone is testing whether their application can benefit from the non-linearities that neural networks bring.

It’s not like we never had non-linear models before. We have had them–many of them. It’s just that the neural network model happens to be particularly powerful. It can really work for some applications, and so it’s worth trying. That’s what a lot of people are doing. And when they see successes, they write papers about them. So far, I’ve seen successes in speech recognition, in computer vision, and in machine translation. It is a very wide array of difficult tasks, so there is good reason to be excited.

DB: Why is a neural network so powerful compared to the traditional linear and non-linear methods that have existed up until now?

RZ: When you have a linear model, every feature is either going to hurt or help whatever you are trying to score. That’s the assumption inherent in linear models. So the model might determine that if the feature is large, then it’s indicative of class 1; but if it’s small, it’s indicative of class 2. Even if you go all the way up to very large values of the feature, or down to very small values of the feature, you will never have a situation where you say, “In this interval, the feature is indicative of class 1; but in another interval it’s indicative of class 2.”

That’s too limited. Say you are analyzing images, looking for pictures of dogs. It might be that only a certain subset of a feature’s values indicate whether it is a picture of a dog, and the rest of the values for that pixel, or for that patch of an image, indicate another class. You can’t draw a line to define such a complex set of relationships. Non-linear models are much more powerful, but at the same time they’re much more difficult to train. Once again, you run into those hard problems from optimization theory. That’s why for a long while we thought that neural networks weren’t good enough, because they would over-fit, or they were too powerful.  We couldn’t do precise, guaranteed optimization on them. That’s why they (temporarily) vanished from the scene.

DB: Within neural network theory, there are multiple branches and approaches to computer learning. Can you summarize some of the key approaches?

RZ: By far the most successful approach has been a supervised approach where an older algorithm, called backpropagation, is used to build a neural network that has many different outputs.

Let’s look at a neural network construction that has become very popular, called Convolutional Neural Networks. The idea is that the machine learning researcher builds a model constructed of several layers, each of which handles connections from the previous layer in a different way.

In the first layer, you have a window that slides a patch across an image, which becomes the input for that layer. This is called a convolutional layer because the patch “convolves”, it overlaps with itself. Then several different types of layers follow. Each have different properties, and pretty much all of them introduce non-linearities.

The last layer has 10,000 potential neuron outputs; each one of those activations correspond to a particular label which identifies the image. The first class might be a cat; the second class might be a car; and so on for all the 10,000 classes that ImageNet has. If the first neuron is firing the most out of the 10,000 then the input is identified as belonging to the first class, a cat.

The drawback of the supervised approach is that you must apply labels to images while training. This is a car. This is a zoo. Etc.

DB: Right. And the unsupervised approach?

RZ: A less popular approach involves “autoencoders”, which are unsupervised neural networks. Here the neural network is not used to classify the image, but to compress it. You read the image in the same way I just described, by identifying a patch and feeding the pixels into a convolutional layer. Several other layers then follow, including a middle layer which is very small compared to the others.  It has relatively few neurons. Basically you’re reading the image, going through a bottleneck, and then coming out the other side and trying to reconstruct the image.
No labels are required for this training, because all you are doing is putting the image at both ends of the neural network and training the network to make the image fit, especially in the middle layer. Once you do that, you are in possession of a neural network that knows how to compress images. And it’s effectively giving you features that you can use in other classifiers. So if you have only a little bit of labeled training data, no problem — you always have a lot of images. Think of these images as non-labeled training data. You can use images to build an autoencoder, then from the autoencoder pull out features that are a good fit using a little bit of training data to find the neurons in your auto-encoded neural network that are susceptible to particular patterns.

DB: What got you into Spark? And where do you see that set of technologies heading?

RZ: I’ve known Matei Zaharia, the creator of Spark, since we were both undergraduates at Waterloo. And we actually interned at Google at the same time. He was working on developer productivity tools, completely unrelated to big data. He worked at Google and never touched MapReduce, which was my focus–kind of funny given where he ended up.

Then Matei went to Facebook, where he worked on Hadoop and became immensely successful. During that time, I kept thinking about distributing machine learning and none of the frameworks that were coming out — including Hadoop — looked exciting enough for me to build on top of because I knew from my time at Google what was really possible.

DB: Tell us a bit about what Spark is, how it works, and why it’s particularly useful for distributed machine learning.

RZ: Spark is a cluster computing environment that gives you a distributed vector that works similar to the vectors you’re used to programming with on a single machine. You can’t do everything you could with a regular vector; for example, you don’t have arbitrary random access via indices. But you can, for example, intersect two vectors; you can union; you can sort. You can do dmany things that you would expect from a regular vector.

One reason Spark makes machine learning easy is that it works by keeping some important parts of the data in memory as much as possible without writing to disk. In a distributed environment, a typical way to get fault resilience is to write to disk, to replicate a disk across the network three times using HDFS.

What makes this suitable for machine learning is that the data can come into memory and stay there. If it doesn’t fit in memory, that’s fine too. It will get paged on and off a disk as needed. But the point is while it can fit in memory, it will stay there. This benefits any process that will go through the data many times — and that’s most of machine learning. Almost every machine learning algorithm needs to go through the data tens, if not hundreds, of times.

DB: Where do you see Spark vis-a-vis MapReduce? Is there a place for both of them for different kinds of workloads and jobs?

RZ: To be clear, Hadoop as an ecosystem is going to thrive and be around for a long time. I don’t think the same is true for the MapReduce component of the Hadoop ecosystem.

With regard to MapReduce, to answer your question, no, I don’t think so. I honestly think that if you’re starting a new workload, it makes no sense to start in MapReduce unless you have an existing code base that you need to maintain. Other than that, there’s no reason. It’s kind of a silly thing to do MapReduce these days: it’s the difference between assembly and C++. It doesn’t make sense to write assembly code if you can write C++ code.

DB: Where is Spark headed?

RZ: Spark itself is pretty stable right now. The biggest changes and improvements that are happening right now and happening in the next couple years are in the libraries. The machine learning library, the graph processing library, the SQL library, and the streaming libraries are all being rapidly developed, and every single one of them has an exciting roadmap for the next two years at least. These are all features that I want, and it’s very nice to see that they can be easily implemented. I’m also excited about community-driven contributions that aren’t general enough to put into Spark itself, but that support Spark as a community-driven set of packages on http://spark-packages.org. I think those will also be very helpful to the long-tail of users.

Over time, I think Spark will become the de facto distribution engine on which we can build machine learning algorithms, especially at scale.

tags: , , , , , ,