One of the questions we software engineers ask all the time now is: how do we become deep learning experts, and how do we get in on all the action building the future with it? TensorFlow (TF), the open-source platform for machine learning from Google, is the most widely used tool to code it up. And the beauty of it is Google culture: code first, code always. There are many more software engineers out there than ML experts, and everything about TF helps developers approach ML through code.
I talked with Rajat Monga, the Engineering Director for TensorFlow, to learn more about Google’s plans for evolving TensorFlow.
Alexy: We’re here at Google and we’re very excited to talk to Rajat. Can you tell us a little bit about your career and how you became interested in this? And, basically, how you ended up being the engineering lead for such an important project?
Rajat: Sure. I’ve been involved with Google Brain pretty much since it started back in 2011. I joined Google almost 9 years ago now in Ads, I was leading a large team there. And I had spent a fair bit of time improving the infrastructure and so on. That was something that I’ve done for years, even before coming to Google and scaling up things. Of course, the scale at Google is exciting and interesting, but at some point, I wanted to do something exciting and new.
And so when Google Brain was starting off, it seemed like a very interesting opportunity to make something happen, which if it took off would be very exciting. Of course, now, 7 years in, it seems amazing to have been through that journey. So I really joined that team back in 2011. And I worked with Jeff and a few other engineers to build our first platform, DistBelief. And there were really no managers effectively for us at that point. Jeff was still an engineer and this was his 20% project.
And we built that in time. It became really successful. We got a number of successes. During this time, a couple of years in I guess, as the team was growing and Jeff had started to manage the overall Brain team, as the infrastructure part grew, he wanted somebody to take on management as well. And I’d done this before. I’d managed people and was happy to take on the responsibility and really help drive the team. So that’s how I got into this role.
Back in 2014, we decided to start building TensorFlow as a successor of DistBelief and then I just took that on and ran with it.
Alexy: So what’s the most exciting application of TensorFlow that you have seen so far? And how did you guys make that technology possible?
Rajat: When I think about things within Google, the different kinds of things that are happening, one of the things I’m very excited about is stuff that’s actually happening on the phones. For example, the Pixel 2 that launched this year had this bokeh effect, where you essentially have a portrait mode that separates the person from the background. And this was all done using ML running on TensorFlow: training a model with TensorFlow in the backend, of course, but also running it on the phone. I think this is a very exciting area: a flagship product for Google, and an amazing effect that lots of other companies have been doing with multiple cameras and so on, and we’ve been able to accomplish this with just software and deep learning.
Alexy: Interesting. And also I have seen there are healthcare applications. Do you see TensorFlow being embedded in medical devices and helping to cure people and save lives?
Rajat: Totally. So far there is a lot happening inside Google, as I said earlier. But in terms of what’s happening out there, seeing students pick up tools like TensorFlow and apply them to areas like medicine is really amazing. That’s an area that will have a huge impact for all of us as a society. And I see the need for improving the tools that doctors use, tools that are there to assist the doctors, so they don’t have to spend time looking at all the x-rays and so on, and they can spend time doing something more interesting and exciting to them.
And then also bringing these tools all the way to wherever they are, right? In some cases, it’s on the computers. In some cases, it might be the devices that they carry with them, or custom tools or whatever. So I definitely see a need for TensorFlow to run on this variety of devices.
Alexy: So far we see a lot of data science revolutions led by data scientists, who know how to do graphs and notebooks but are not as familiar with distributed systems or continuous delivery. That’s the bread and butter of hardcore software engineering. It’s really exciting to our communities, where we consider data pipelines at scale.
It seems like more and more software engineers are getting into this, and Google has this code-first culture where you have to be able to code up these large-scale systems and ingest all this data. Can you tell us how you see software engineers taking TensorFlow and pumping all this data through it?
Rajat: I think it’s interesting how things have been moving rapidly in this area. One thing I think we’ve learned in our group as well, putting people of different backgrounds together often ends up building really amazing things. That works across the board for all kinds of things. In our case at Google Brain, we have researchers and engineers who have really sat together to build these new tools, including TensorFlow. As you look at all kinds of applications and how engineers are doing things, in lots of cases, people spend a fair bit of time wrangling data itself and doing all kinds of things, and being able to code and being an engineer who understands the tools around, definitely helps you a lot in that process.
There is definitely the machine learning part and the stats part, and understanding that is important and useful. That said, both of these things are becoming easier and easier, so you can mix and match. A data scientist can do more things at scale because the tools allow them to do it. On the other hand, engineers can do more things with machine learning because most algorithms are already implemented. And there are basically tools like the TensorFlow libraries that they can use to do what they want.
Again, if you put both of them together, you can still gain because the data scientists will bring their expertise in improving the models maybe. And the engineer can help scale it up or help deploy it and so on. So I still see the need for putting them together as much as possible.
Alexy: One interesting thing about TensorFlow is that it builds on dataflow. It is basically an approach to programming with data, using graphs. And then there is also integration into Google Cloud Platform. So can you talk a little bit about the setup?
There are companies which happily move to Google Cloud Platform, like Spotify, one of our flagship members of the community that use Scala and Spark and Machine Learning and they moved everything to GCP. But we have kind of in between companies, which use Amazon or Google and then they have some on-prem. So can you talk a little bit about what are the optimum setups? Can TensorFlow be used optimally in the small startup setting, or is it really shining inside of GCP where there is actually DataFlow and BigQuery and all these data sources? What are some of the good setups and recommendations for TensorFlow?
Rajat: We built it with open source in mind, not just for cloud. The idea is that anybody can take it and run it. So it is not tied to the cloud in any way. You can run it pretty well on-prem. In fact, we see both. We see tons of users on-prem. We see tons of users on all kinds of clouds. And similarly with Google Cloud as well, it’s built a lot with open source, so it makes it easier to bridge the gap between on-prem and cloud. You’re not locked into any place. In terms of specifically running TensorFlow itself, you see people running it on a single machine, single device, just one CPU, maybe one GPU that they have in their machine. Often these are the kind of things that they use to get started. And then as they start to scale up, if they have a small cluster, they might run that with Kubernetes or Spark cluster and distribute it across a few machines. Often deployments are a cluster of machines or whatever and that’s fine.
And then if you really want to scale that up large, cloud is a great place to go. And often you can get more automation, more management, all the good stuff that you get with cloud. And we see, like you said, that there are lots of people still on-prem who take time to get to cloud, public or private or wherever they go. And we want to make sure we can meet them where they are, not necessarily push them in one direction or another.
Alexy: When I look at the TensorFlow and DataFlow connection, I found some very interesting blog posts from Google Cloud developer advocates. Justin Kestelyn is a well-known developer advocate in our community, now at Google, and he wrote a very good blog post about this. So what’s interesting to me is it’s not very easy to find a blueprint for complete data pipelines. How do you guys think about TensorFlow deployment in real life? Do you have, or do you plan to give people, an almost ready back end? I can tell you what we’re doing in traditional open source. So we have this SMACK Stack, which we actually started in 2015, which is SMACK 1.0. As you recall, it was Spark, Mesos, Akka, Cassandra, and Kafka.
The data pipeline starts with an API; it is Akka in our case. And we have kind of an Uber app with millions of iPhones banging on an API. And then you have Kafka, which is the message bus that connects the different systems. It empties into Spark, which is the compute engine. It persists into Cassandra, and the whole thing runs on Mesos, which is an operations layer. And you can replace every letter with an equivalent system. It can be YARN or it can be RabbitMQ or it can be something else with an API. And obviously Google has an equivalent for almost each of these. It has BigTable, and BigQuery, and it has its own ways to basically spin out the systems.
Is there an easy way for practitioners to take, not just TensorFlow, but a data pipeline with TensorFlow in it and connect a million iPhones and a web app and suddenly they have a backend, and they have a startup applying Machine Learning to its data in real time. How would you recommend going about it?
Rajat: Right. I would say there are a number of solutions out there today that let you take TensorFlow and integrate it into the kind of pipeline that you talked about, with SMACK Stack for example. TensorFlow has a good integration with Spark, where you can use the Spark cluster to train, etc., and there are a number of things you can do there. There are also integrations with Mesos and Kubernetes to allow you to do some of the cluster management stuff. And this is an area where I would say there is still lots more to be done. One recent effort that was kicked off in this area is called Kubeflow, which puts Kubernetes and TensorFlow together and looks at better integration between them.
So lots more to do there. When you talk about cloud, yes, we do have a lot of corresponding services in Cloud. In many cases, I think they offer a number of advantages for certain kinds of users. At the same time, most of those offer open APIs that can work with other platforms. So, for example, for Dataflow, we have Apache Beam, which basically works with Spark and with Dataflow on Cloud. You can use it on Google Cloud or outside. We want to make sure that we provide interoperable APIs for people to get the best of what they need and really plug and play for what makes the most sense for them, and not lock them in. I think this is an area we continue to invest in and make better as well.
Alexy: Maybe you can talk a little bit about how TensorFlow is kind of the icebreaker that is changing the whole Open Source culture at Google. So how do you see TensorFlow changing the culture?
Rajat: I think it’s interesting. Google has always in some ways been very Open Source-friendly. We’ve invested in things like Linux and the compilers and many other things. We’ve always contributed back to the community. It’s not like we’ve kept things internally. Protobuffer is a good example, lots of other pieces of libraries that you will see.
You’re absolutely right about a number of these very important papers that were published by Google over the last decade or two that have really created new industries, for example, Hadoop and many others. But we just published the papers and didn’t really publish the code. We decided to keep it internal. I think in this case, again, this was Jeff’s idea: let’s not stop at the papers, as we’ve done so far, given that he’s been pretty much one of the authors on all of those papers and he’s built most of the infrastructure here. We are seeing others take these ideas and implement them, so there’s clearly a need outside Google now that people are starting to scale up. It’s not the small-scale things anymore. So why don’t we actually start publishing code? Especially in a new area like machine learning where there is a need, we can do a much better job from where things are and really push the needle.
How does it impact the rest of us? I think given how well this has gone and how much interest we’ve seen from the external world, there’s definitely more interest across Google in opening more things as much as possible. And of course, as a company that is responsible to shareholders to make money as well, it’s not like every single thing that we do can be open; there are some trade-offs. But if you look at pieces of infrastructure that are more generally interesting, especially for a platform like cloud or for all our users, then it’s something that we should strongly consider. And we’re seeing that in more and more things. For example, gRPC went out, again, as an open thing. And we’re looking at many things that we do from that angle, basically.
Alexy: We actually found that our community is extremely receptive to this because people essentially put together open source SMACK Stack-like systems. And they just take things and put them together. So if Google really starts to provide this and also provide documentation, it’s really good because now people can actually create new kinds of systems much easier. I think with Istio, we see a lot of this.
Rajat: Totally. Another example of an area where we’ve done this is with Angular and the web stack where it’s really become part of some standards with MEAN stack and so on. So we are really excited about open-source and definitely see that happening more.
With TensorFlow and this whole Machine Learning in the AI area, I’m very interested in seeing how close we can bring it to the regular developers. You asked earlier about what can developers do with this and so on. I think there are a lot of developers who have jumped in, but for some of them, it’s been, “Okay, is it too hard for me? Do I need to be a data scientist or a researcher to do that?” And I really want us, with tools like TensorFlow, to bridge that gap and make it as easy as possible for more and more developers to get onboard. So I’m interested in learning more about where the developers are, what their needs are, how can we help them and really understand more about the ecosystem that they’re playing with? Like, you talked about the entire data pipeline and so on. So what would help more developers get on board? What would help us ensure that every product that’s built going forward can leverage machine learning in different ways?
Alexy: I noticed that at the first TensorFlow Summit keynote, you mentioned the IBM PowerAI integration. What’s interesting about IBM, it’s really present everywhere. They go into customers. So they’re not just online. They actually go inside all the companies around the world. And so they do actually have exposure to all kinds of developers. Their own developers and corporate developers. We have all kinds of applications. We have all the industries, IBM serves all of them. What would be some of the exciting places for you to look at?
Rajat: There are a couple that I think are very interesting to me. One is the whole IoT area, the devices. I see that starting to grow and take off. I can already see a number of startups that are trying to create places for the data collection and really bringing machine learning to the edge. So I think that’s a very interesting area, edge computing. So that’s one. I think another one, from a data scientist’s and an engineer’s perspective, is making that whole process easier today. Like, how do we cut down the friction that they have, going back to how do we make it easy for them to come on board? And today it’s happening on-prem where these people are. And companies like IBM are clearly seeing that first-hand.
Alexy: I really like the idea that you want to come and meet the developers. You want to understand their pain points. You want to see the opportunities. So, for me, what I see regularly is basically software engineers asking the question, “How do I become a machine learning expert?” And Google has done a great job. But, still, it’s a lot to learn.
So do you guys plan to do more in terms of training? Because, essentially, as good as the docs are, you need to come in front of people, you need to walk them through. So what do you plan in terms of developer education? Is it something logically following from this open source work?
Rajat: Right. I think that totally makes sense. There’s a lot to be done. We see lots of organizations stepping in. Companies that are doing MOOCs, for example, are building MOOCs around deep learning tools, TensorFlow and others, and really helping people understand what deep learning is about. There are other meet-ups like yours and others that do help bridge that gap in bringing developers closer to people who are practicing and doing things.
I think there’s still a long way to go. If you look at the number of developers, and I forget the numbers, it’s maybe tens of millions. But the number of developers who are doing machine learning is still very, very small. That number is growing, but it really needs to grow by an order of magnitude or more to get to the point where we can effectively use these technologies. So, exactly as you said, in some ways it’s kind of like learning to program. Of course, the good thing is, the developers know how to program, so they have a leg up.
That said, there is this new shift in mindset that they need to think about. Now, the program they’re building is actually starting to learn. It’s not just that whatever rules you write, that’s what it’s going to do. It’s going to learn, and it doesn’t always learn the right way if you make mistakes there. So education is definitely going to play a big role. We are looking at a lot of other providers as well to see how we can improve it. Google has always done a fair bit of education and we’ll continue to look at that as well to see what roles we can play as a company.
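The shift Rajat describes, from behavior a programmer writes down to behavior a program learns from examples, can be made concrete with a small sketch. This is plain Python rather than TensorFlow, and every name and number in it (`rule_based`, `fit_line`, the sample data, the learning rate) is an assumption made up for illustration. It fits a line by gradient descent, then shows how a single mislabeled example changes what gets learned.

```python
# Hypothetical illustration: names and data are invented for this sketch.

def rule_based(x):
    """A hand-written rule: the behavior is fixed by the programmer."""
    return 2.0 * x + 1.0


def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Learn slope w and intercept b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [rule_based(x) for x in xs]      # clean examples generated by the rule
w, b = fit_line(xs, ys)               # the learned line matches the rule

bad_ys = ys[:-1] + [0.0]              # one mislabeled example
w_bad, b_bad = fit_line(xs, bad_ys)   # the learned line drifts away from the rule
```

With clean examples the fit recovers a slope of about 2 and an intercept of about 1. With the one bad label the slope falls to about 0.2, which is the sense in which a learning program "doesn’t always learn the right way" when the data it is given contains mistakes.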
Alexy: Any plans for Google University?
Rajat: Not that I know of.
Alexy: If one of the organizations starts teaching developers, I think you guys are best positioned, right? Because I think that is a very logical step.
Rajat: Right. There are different things we tried to do in this space. And one program that we’ve had for the last couple of years is this AI Residency program, which was originally called the Google Brain Residency. Now, that’s still small scale, but that’s intending to really help with new researchers in the field, people who are interested, but don’t really have the background, and help them grow. So that’s one way we are trying to help education.
Alexy: That’s excellent. So it’s almost like entrepreneur in residence. You take people in and they learn. This is great for culture. How are the internal dynamics of TensorFlow’s success changing Google culture inside? Do people now want to move to your group? Is it changing the company inside-out?
Rajat: Open Source has always been exciting I think to lots of developers and Google’s no different. We have people who really love that core technology and they like the idea of things being more open and sharing things. One good thing about Google has always been the publication process. We want to publish and tell the world about what we’re doing, so others can gain from it, just as we gained from the publications from others.
This has been true across systems. What has changed is that we are now starting to put the code out as well, and that makes the whole process of sharing ideas much more efficient, because you don’t have to take a paper, re-implement it, maybe make a mistake, and spend a lot of time reimplementing. So that’s really speeding up that process. And I think we’re seeing that in how fast we can do new things as well, and that’s changing how developers play with this too.
Alexy: The last question I would ask goes a little bit into the whole question of the future of AI. A lot of people are asking questions about AI ethics. Is AI changing our society? It often sounds almost apocalyptic. But some serious folks in the field are concerned about it, and mention AI as an extinction-level event if you ask them what keeps them up at night. If AI is really this singularity event, an extinction-level event for humankind, and we’re engineers, I’m wondering: is there any code we can write that can quantify these things, for instance, ethics? People talk about biases. Obviously we’ll have systems that make decisions for us, so we need utility functions that are beneficial for us, not detrimental. If we talk about ethics, as engineers, we need to quantify what it means. So, eventually, we need to get into humans’ minds with our programs to understand the impact of AI. We put stuff in front of people and see what happens. Essentially, Google is already a giant behavioral machine because it puts stuff in front of people to see what they do, affecting the real world. At this point, they click on ads.
But for ethics, for utilities, for happiness, I think we need to do more. So how do you see that thinking developing inside Google? Do you guys think of the meta level? Where are these tools to make decisions? Now we need some tools to see how these decisions affect values, such as fairness, economic equality, happiness. E.g., I can make you sit in front of the computer all day. It isn’t going to be good for you. Maybe you need to take a break. So are you guys thinking on the meta track? How do you measure these things?
Rajat: Right. I think talking about many of these things, they are definitely on our minds. Ethics, fairness, making sure that the tools and products we build as Google and also that we help the world build, are as good as they can be and really improving value for everyone around us, where you’re making the world better. I wish there were simple rules that you could say, “This is exactly what you need to do,” and we could code it up. It’s not that simple. But there’s a lot of progress we are making. We are engaging with lots of other like-minded people across the industry and trying to come up with ideas and really push the state of the art forward in all of these areas.
Some of these problems are more real than others. For example, if you are building a product, you want to make sure that that’s fair to all the kinds of users that you have, not just one particular segment or it’s not negative for one particular segment of the users. And there are many reasons for that. So part of the thing right now I think is that the field is just getting started in terms of understanding what the problems are. Once you have a better understanding, I think you start to look at solutions and how you build them out, whether some of them just need coding or others need actual learning or utility functions as you said. I think there’s still a lot of work to be done.
Alexy: I don’t know if anybody at Google is writing code specifically for ethics or AI safety. Is anybody writing some code? Can we as a community help you guys write some code around it? Is there any kind of way to pitch in?
Rajat: I think right now it’s still the learning phase and there’s lots of learning that we’re doing. We have worked on AI safety and we’ve published a paper on that along with other groups and organizations. We are eventually going to do the same with fairness. We are looking to do that. We are involved with a number of organizations that aren’t just limited to Google. So really interacting with the community in building that. In certain cases, we have built tools to, again, understand what’s going on in the world. For example, with TensorFlow, we have this tool called Facets where you can understand where the data is, what the answers are, and really slice and dice and look at that in different ways.
Alexy: So you can see if your analysis has biases, for instance?
Rajat: Exactly.
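The kind of "slice and dice" check discussed here can be sketched in a few lines. This is not the Facets API; it is a plain-Python illustration under stated assumptions, and the function name `accuracy_by_segment`, the segment labels, and the evaluation records are all invented for the example. The idea is to compare a model’s accuracy per user segment instead of looking only at the overall number.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Return {segment: accuracy} for (segment, label, prediction) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, label, prediction in records:
        total[segment] += 1
        if label == prediction:
            correct[segment] += 1
    return {seg: correct[seg] / total[seg] for seg in total}


# Made-up evaluation records: (user segment, true label, model prediction).
records = [
    ("segment_a", 1, 1), ("segment_a", 0, 0), ("segment_a", 1, 1), ("segment_a", 0, 0),
    ("segment_b", 1, 0), ("segment_b", 0, 0), ("segment_b", 1, 0), ("segment_b", 0, 1),
]

per_segment = accuracy_by_segment(records)
# Overall accuracy hides the difference between the two segments.
overall = sum(label == pred for _, label, pred in records) / len(records)
# Gap between the best-served and worst-served segment.
gap = max(per_segment.values()) - min(per_segment.values())
```

On this toy data the overall accuracy is 0.625, while the per-segment view shows 1.0 for one segment and 0.25 for the other. A large gap like that is a prompt to look more closely at the data and the model, not a complete fairness analysis.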
Alexy: I was actually asking Peter Norvig whether there is code for AI ethics. There is a policy group. There is the Partnership on AI. So there are people talking policy at a very high level, at a government level, and there are engineers here. I want to really connect the two. If you have any code-first approach that you want the community to help with, I’d like to put it in front of our AI groups and help you guys out. Because I think it could be a really good way, now that you have open source, to have a hackathon, to have some kind of engagement, because I think that’s a topic that really resonates with a lot of people.
Rajat: That’s a great idea and something we should definitely explore more.
This article originally appeared on Alexy’s blog on Medium.