Holden Karau, Google | Flink Forward 2018

>> Narrator: Live from San Francisco, it's the Cube, covering Flink Forward, brought to you by Data Artisans. (tech music) >> Hi, this is George Gilbert, we're at Flink Forward, the user conference for the Apache Flink Community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very imminent guest, with a long pedigree, Holden Karau, formerly of IBM, and Apache Spark fame, putting Apache Spark and Python together. >> Yes. >> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors. >> Yeah. >> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced. Yes. >> So, Holden, tell us about the objective of putting these together. What type of use cases.... >> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark, with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job, to do similar data preparation for when they want to make real predictions. And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way, that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models. >> Okay, so help us tie that back to, in this case, Flink. >> Yes. >> And also to clarify, that data prep.... My impression was data prep was a different activity. It was like design time and serving was run time. But you're saying that they can be better integrated? >> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time. >> So, by having common Beam API and the common provider underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving. >> Yes, very true. >> So, tell us what type of customers want to write to the Beam API and have that portability? >> Yeah, so that's a really good question. So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies, but they don't want be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs. >> Does Spark Structured Streaming support the Beam API? >> So, right now, the Spark support for Beam is limited. It's in the old Dstream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing. Because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see, is like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of cool things that are being done nowadays, are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python, across these different engines. Flink supports Python, but it's maybe not a first class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go. >> In to support Go. >> Which is a little out of left field. >> So, would it be fair to say that the value of Beam, for potential Flink customers, they can work and start on Google Cloud platform. They can start on one of several stream processors. They can move to another one later, and they also inherit the better language support, or bindings from the Beam API? >> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally. >> So, what type of customers do you see who have moved into production with machine learning? >> So the.... >> And the streaming pipelines? >> The biggest customer that's in production is obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify. >> Without the names, the sort of use cases and the general industry.... >> I don't want to get in trouble. >> Okay. >> I'm just going to ... sorry. >> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to map produce? >> Yes, so.... >> And is that a competitor then to Flink? >> I think Flink and Dataflow can be used in some of the same cases. But, I think they're more complementary. Flink is something you can run on-prem. You can run it in different Defenders. And Dataflow is very much like, "I can run this on Google Cloud." And part of the idea with Beam is to make it so that people who want to write Dataflow jobs but maybe want the flexibility to go back to something else later can still have that. Yeah, we couldn't swap in Flink or Dataflow execution engines if we're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? >> George: Okay. >> It's probably one of those, sort of, friendly competitions. Where we both push each other to do better and add more features that the respective projects have. >> Okay, 30 second question. >> Cool. >> Do you see people building stream processing applications with machine learning as part of it to extend existing apps or for ground up new apps? >> Totally. I mostly see it as extending existing apps. This is obviously, possibly a bias, just for the people that I talk to. But, going ground up with both streaming and machine learning, at the same time, like, starting both of those projects fresh is a really big hurdle to get over. >> George: For skills. >> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been King for a long time and streaming is getting it's day in the sun. So, we could start seeing people becoming more adventurous and doing both, at the same time. >> Holden, on that note, we'll have to call it a day. That was most informative. >> It's really good to see you again. >> Likewise. So this is George Gilbert. We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)

Published Date : Apr 11 2018

SUMMARY :

Narrator: Live from San Francisco, it's the Cube, This is the second Flink Forward conference focused on the Beam API, which is an API And Holden has been working on integrating it So, Holden, tell us about the objective of the same preparation code available to use And also to clarify, that data prep.... I don't have to do that at serving time. and the common provider underneath it, in being able to switch streaming engines, And the Beam Python support is still a work in progress. or bindings from the Beam API? But if people are looking to play with it, I didn't have a chance to go through my customer list the sort of use cases and the general industry.... as their sort of strategic successor to map produce? And part of the idea with Beam is to make it so that and add more features that the respective projects have. at the same time, and streaming is getting it's day in the sun. Holden, on that note, we'll have to call it a day. We're on the ground at Flink Forward,

ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
George	PERSON	0.99+
San Francisco	LOCATION	0.99+
IBM	ORGANIZATION	0.99+
Holden Karau	PERSON	0.99+
Data Artisans	ORGANIZATION	0.99+
Python	TITLE	0.99+
Java	TITLE	0.99+
Holden	PERSON	0.99+
Google	ORGANIZATION	0.99+
Spotify	ORGANIZATION	0.99+
both	QUANTITY	0.99+
two paths	QUANTITY	0.99+
TensorFlow	TITLE	0.99+
One	QUANTITY	0.99+
Spark	TITLE	0.99+
GitHub	ORGANIZATION	0.98+
today	DATE	0.98+
Dataflow	TITLE	0.97+
Flink	ORGANIZATION	0.97+
one variable	QUANTITY	0.97+
a day	QUANTITY	0.97+
Go	TITLE	0.97+
Flink Forward	EVENT	0.96+
Flink	TITLE	0.96+
30 second question	QUANTITY	0.96+
one place	QUANTITY	0.95+
Beam	TITLE	0.95+
second	QUANTITY	0.95+
Google Cloud	TITLE	0.94+
Apache	ORGANIZATION	0.94+
one group	QUANTITY	0.94+
one	QUANTITY	0.93+
this afternoon	DATE	0.9+
Dstream	TITLE	0.88+
2018	DATE	0.87+
first	QUANTITY	0.79+
Beam API	TITLE	0.75+
Beam	ORGANIZATION	0.74+
Apache Flink Community	ORGANIZATION	0.72+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for Dstream: