
Holden Karau, Google | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. (tech music)

>> Hi, this is George Gilbert. We're at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very eminent guest, with a long pedigree: Holden Karau, formerly of IBM, and of Apache Spark fame, putting Apache Spark and Python together.

>> Yes.

>> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors.

>> Yeah.

>> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced.

>> Yes.

>> So, Holden, tell us about the objective of putting these together. What type of use cases....

>> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark, with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot of people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job, to do similar data preparation for when they want to make real predictions. And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way, that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we, like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models.

>> Okay, so help us tie that back to, in this case, Flink.

>> Yes.

>> And also to clarify, that data prep.... My impression was that data prep was a different activity. It was like design time, and serving was run time. But you're saying that they can be better integrated?

>> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time.

>> So, by having a common Beam API and the common provider underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving.

>> Yes, very true.
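To make the training/serving consistency point concrete, here is a minimal sketch of a tf.Transform preprocessing function, written against the tensorflow_transform Python API as we understand it; the feature names and specific transforms are our own illustration, not something from the interview. The same function is analyzed over the training data by a Beam job and then replayed inside the exported serving graph, so the tokenizing and hashing transforms Holden describes stay identical in both places.

```python
# A hedged sketch, not production code: one preprocessing_fn that
# tf.Transform applies both at training time (as a Beam analysis job)
# and at serving time (baked into the exported TensorFlow graph).
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Data prep defined once, reused for training and serving."""
    outputs = {}
    # Vocabulary lookup: the Beam pass computes the vocabulary over the
    # full training set; serving reuses the exact same mapping.
    outputs['terms_ids'] = tft.compute_and_apply_vocabulary(inputs['terms'])
    # Hashing-style transform on a raw string feature.
    outputs['query_hash'] = tf.strings.to_hash_bucket_fast(inputs['query'], 1000)
    # Normalization using dataset-wide statistics (mean/stddev from Beam).
    outputs['amount_scaled'] = tft.scale_to_z_score(inputs['amount'])
    return outputs
```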
>> So, tell us what type of customers want to write to the Beam API and have that portability?

>> Yeah, so that's a really good question. So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies, but they don't want to be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs.

>> Does Spark Structured Streaming support the Beam API?

>> So, right now, the Spark support for Beam is limited. It's on the old DStream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing that. Because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see is, like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of cool things that are being done nowadays are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python across these different engines. Flink supports Python, but it's maybe not a first-class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go.

>> To support Go.

>> Which is a little out of left field.

>> So, would it be fair to say that the value of Beam, for potential Flink customers, is that they can start on Google Cloud Platform, they can start on one of several stream processors, they can move to another one later, and they also inherit the better language support, or bindings, from the Beam API?

>> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally.

>> So, what type of customers do you see who have moved into production with machine learning?

>> So the....

>> And the streaming pipelines?

>> The biggest customer that's in production is obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify.

>> Without the names, the sort of use cases and the general industry....

>> I don't want to get in trouble.

>> Okay.

>> I'm just going to ... sorry.

>> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to MapReduce?

>> Yes, so....
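The engine-swapping Holden describes is visible in Beam's Python SDK. The sketch below is our own minimal word-count illustration, not code from the interview: the pipeline itself never names an engine, and only the runner option decides whether it executes locally, on a Flink cluster, or on Dataflow.

```python
# A minimal sketch of Beam's runner portability: the pipeline is written
# once; the execution engine is just a pipeline option.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap 'DirectRunner' for 'FlinkRunner' or 'DataflowRunner' without
# touching the pipeline code below (cluster/project flags would be
# supplied as additional options).
options = PipelineOptions(runner='DirectRunner')

with beam.Pipeline(options=options) as p:
    (p
     | 'Read'     >> beam.Create(['to be or not to be'])
     | 'Tokenize' >> beam.FlatMap(lambda line: line.split())
     | 'Pair'     >> beam.Map(lambda word: (word, 1))
     | 'Count'    >> beam.CombinePerKey(sum)
     | 'Print'    >> beam.Map(print))
```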
>> And is that a competitor then to Flink?

>> I think Flink and Dataflow can be used in some of the same cases. But I think they're more complementary. Flink is something you can run on-prem. You can run it with different vendors. And Dataflow is very much like, "I can run this on Google Cloud." And part of the idea with Beam is to make it so that people who want to write Dataflow jobs, but maybe want the flexibility to go back to something else later, can still have that. Yeah, you could swap in Flink or Dataflow execution engines if you're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right?

>> George: Okay.

>> It's probably one of those, sort of, friendly competitions. Where we both push each other to do better and add more features to the respective projects.

>> Okay, 30-second question.

>> Cool.

>> Do you see people building stream processing applications with machine learning as part of it, to extend existing apps or for ground-up new apps?

>> Totally. I mostly see it as extending existing apps. This is obviously, possibly a bias, just from the people that I talk to. But going ground-up with both streaming and machine learning at the same time, like, starting both of those projects fresh, is a really big hurdle to get over.

>> George: For skills.

>> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been king for a long time and streaming is getting its day in the sun. So, we could start seeing people becoming more adventurous and doing both at the same time.

>> Holden, on that note, we'll have to call it a day. That was most informative.

>> It's really good to see you again.

>> Likewise. So this is George Gilbert. We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)

Published Date : Apr 11 2018



Greg Benson, SnapLogic | Flink Forward 2018


 

>> Announcer: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans.

>> Hi, this is George Gilbert. We are at Flink Forward on the ground in San Francisco. This is the user conference for the Apache Flink community. It's the second one in the US, and this is sponsored by Data Artisans. We have with us Greg Benson, who's Chief Scientist at SnapLogic and also professor of computer science at the University of San Francisco.

>> Yeah, that's great, thanks for havin' me.

>> Good to have you. So, Greg, tell us a little bit about how SnapLogic currently sets up its, well, how it builds its current technology to connect different applications. And then talk about, a little bit, where you're headed and what you're trying to do.

>> Sure, sure. So SnapLogic is a data and app integration cloud platform. We provide a graphical interface that lets you drag and drop. You can take components that we call Snaps, and you kind of put them together like Lego pieces to define relatively sophisticated tasks, so that you don't have to write Java code. We use machine learning to help you build out these pipelines quickly, so we can anticipate, based on your data sources, what you are going to need next, and that lends itself to rapid building of these pipelines. We have a couple of different ways to execute these pipelines. You can think of it as sort of this specification of what the pipeline's supposed to do. We have a proprietary engine that we can execute on single nodes, either in the cloud or behind your firewall in your data center. We also have a mode which can translate these pipelines into Spark code and then execute those pipelines at scale. So, you can do sort of small, low-latency processing up to sort of larger, batch processing on very large data sets.

>> Okay, and so you were telling me before that you're evaluating Flink, or doing research with Flink, as another option. Tell us what use cases that would address that the first two don't.

>> Yeah, good question. I'd love to just back up a little bit. So, because I have this dual role of Chief Scientist and professor of Computer Science, I'm able to get graduate students to work on research projects for credit, and then eventually as interns at SnapLogic. A recent project that we've been working on since we started last fall, so working on about six or seven months now, is investigating Flink as a possible new back end for the SnapLogic platform. So this allows us to, you know, explore and prototype and just sort of figure out if there's going to be a good match between an emerging technology and our platform. So, to go back to your question. What would this address? Well, so, without going into too much of the technical differences between Flink and Spark, which I imagine has come up in some of your conversations, or it comes up here, because they can solve similar use cases, our experience with Flink is the code base is easy to work with, both from taking our specification of pipelines and then converting them into Flink code that can run. But there's another benefit that we see from Flink, and that is, whenever any product, whether it's our product or anybody else's product, uses something like Spark or Flink as a back end, there's this challenge, because you're converting something that your users understand into this target, right, this Spark API code or Flink API code.
And the challenge there is, if something goes wrong, how do you propagate that back to the users, so the user doesn't have to read log files or get into the nuts and bolts of how Spark really works.

>> It's almost like you've compiled the code, and now if something doesn't work right, you need to work at the source level.

>> That's exactly right, and that's what we don't want our users to do, right?

>> Right.

>> So one promising thing about Flink is that we're able to integrate the code base in such a way that we have a better understanding of what's happening in the failure conditions that occur. And we're working on ways to propagate those back to the user so they can take actionable steps to remedy those without having to understand the Flink API code itself.

>> And what is it, then, about Flink or its API that gives you that feedback about errors, or, you know, operational status, that gives you better visibility than you would get in something else like Spark?

>> Yeah, so without getting too deep on the subject, what we have found is, one thing nice about the Flink code base is the core is written in Scala, but all the IO and memory handling is written in Java, and that's where we need to do our primary interfacing, and the building blocks, sort of the core building blocks to get, for example, something that you build with the DataSet API to execution. We have found it easier to follow the transformation steps that Flink takes to end up with the resulting sort of optimized Flink pipeline. Now by understanding that transformation, like you were saying, the compilation step, by understanding it, then we can work backwards, and understand how, when something happens, how to trace it back to what the user was originally trying to specify.

>> The GUI specification.

>> Yeah. Right.

>> So, help me understand, though. It sounds like you're the one essentially building a compiler from a graphical specification language down to Spark as the, you know, sort of pseudo-compiled code,

>> Yep.

>> Or Flink. And, but if you're the one doing that compilation, I'm still struggling to understand why you would have better reverse engineering capabilities with one.

>> It just is a matter of getting visibility into the steps that the underlying frameworks are taking, and so, I'm not saying this is impossible to do in Spark, but we have found that it's been easier for us to get into the transformation steps that Flink is taking.

>> Almost like, for someone who's had as much programming as one semester in night school, like a variable inspector that's already there.

>> Yeah, that's a good, there you go, yeah, yeah, yeah.

>> Okay, so you don't have to go try and, you can't actually add it, and you don't have to then infer it from all this log data.

>> Now, I should add, there's another potential for Flink. You were asking about use cases and what does Flink address. As you know, Flink is a streaming platform, in addition to being a batch platform, and Flink does streaming differently than how Spark does. Spark takes a microbatch approach. What we're also looking at in my research effort is how to take advantage of Flink's streaming approach to allow the SnapLogic GUI to be used to specify streaming Flink applications.
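To illustrate the error-propagation idea Greg describes, here is a hypothetical sketch of our own (not SnapLogic's implementation) of a tiny pipeline "compiler": each compiled step carries the name of the GUI node it came from, so when execution fails, the user is shown the offending Snap rather than an engine stack trace.

```python
# A hypothetical sketch, not SnapLogic's code: compile a graphical pipeline
# spec into executable steps, tagging each step with its GUI node so that
# failures are reported against what the user drew.
SPEC = [
    {"node": "FileReader", "op": "source",
     "args": {"data": ["a,1", "b,2", "c,oops"]}},
    {"node": "CSVParser", "op": "map",
     "args": {"fn": lambda s: (s.split(",")[0], int(s.split(",")[1]))}},
]

class NodeError(Exception):
    """Carries the GUI node name so users never read raw engine logs."""

def run(spec):
    records = []
    for step in spec:
        try:
            if step["op"] == "source":
                records = list(step["args"]["data"])
            elif step["op"] == "map":
                records = [step["args"]["fn"](r) for r in records]
        except Exception as exc:
            # Translate the engine-level failure back to the source node.
            raise NodeError(f"node '{step['node']}' failed: {exc}") from exc
    return records

try:
    print(run(SPEC))
except NodeError as err:
    print(err)  # -> node 'CSVParser' failed: invalid literal for int() ...
```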
Initially we're just focused on the batch mode, but now we're also looking at the potential to convert these graphical pipelines into streaming Flink applications, which would be a great benefit to customers who want--

>> George: Real-time integration.

>> Want to do what Alibaba and all the other companies are doing, but take advantage of it without having to get into the nuts and bolts of the programming. Do it through the GUI.

>> Wow, so it's almost like, it's like, Flink, Beam, in terms of abstraction layers,

>> Sure.

>> And then SnapLogic.

>> Greg: Sure, yes.

>> Not that you would compile to Beam, but the idea that you would have prep and processing and a real-time pipeline.

>> Yes.

>> Okay. So that's actually interesting, so that would open up a whole new set of capabilities.

>> Yeah, and, you know, it follows our, you know, company's vision in allowing lots of users to do very sophisticated things without being, you know, Hadoop developers or Spark developers, or even Flink developers. We do a lot of the hard work of trying to give you a representation that's easier to work with, right, but also allow you to sort of evolve that and debug it, and also eventually get the performance out of these systems. One of the challenges, of course, of Spark and Flink is that they have to be tuned, and so what we're trying to do, using some of our machine learning, is eventually gather information that can help us identify how to tune different types of workflows in different environments. And if we're able to do that in its entirety, then we, you know, we take out a lot of the really hard work that goes into making a lot of these streaming applications both scalable and performing.

>> Performing. So this would be, but you would have, to do that, you would probably have to collect, well, what's the term? I guess data from the operations of many customers,

>> Right.

>> Because, as training data, just as the developer alone, you won't really have enough.

>> Absolutely, and that's, so you have to bootstrap that. For our machine learning that we currently use today, we leverage, you know, the thousands of pipelines, the trillions of documents that we now process on a monthly basis, and that allows us to provide good recommendations when you're building pipelines, because we have a lot of information.

>> Oh, so you are serving the runtime, these runtime compilations.

>> Yes.

>> Oh, they're not all hosted on the customer premises.

>> Oh, no, no, no, we do both. So it's interesting, we do both. So you can deploy completely in the cloud, we're a complete SaaS provider for you. Most of our customers, though, you know, banks, healthcare, want to run our engine behind their firewalls. Even when we do that, though, we still have metadata that we can get, sort of anonymized, introspection into how things are behaving.

>> Okay. That's very interesting. Alright, Greg, we're going to have to end it on that note, but, uh, you know, I guess everyone stay tuned. That sounds like a big step forward in sort of specification of real-time pipelines at a graphical level.

>> Yeah, well, I hope to be talking to you again soon with more results.

>> Looking forward to it. With that, this is George Gilbert. We are at Flink Forward, the user conference for the Apache Flink conference, sorry, for the Apache Flink user community, sponsored by Data Artisans. We will be back shortly. (upbeat music)

Published Date : Apr 11 2018

