Kenneth Knowles, Google - Flink Forward - #FFSF17 - #theCUBE
>> Welcome everybody, we're at the Flink Forward conference in San Francisco, at the Kabuki Hotel. Flink Forward U.S. is the first U.S. user conference for the Flink community, sponsored by data Artisans, the creators of Flink, and we're here with special guest Kenneth Knowles--
>> Hi.
>> Who works for Google and who heads up the Apache Beam team where, just to set context, Beam is the API, or SDK, on which developers can build stream processing apps that can be supported by Google's Dataflow, Apache Flink, Spark, Apex, among other future products that'll come along. Ken, why don't you tell us, what was the genesis of Beam, and why did Google open up, sort of, the API to it?
>> So, I can speak as an Apache Beam PMC member: the genesis came from a combined code donation to Apache of the Google Cloud Dataflow SDK, and there was also, already written by data Artisans, a Flink runner for that, which already included some portability hooks, and then there was also a runner for Spark that was written by some folks at PayPal. And so those three efforts pointed out that it was a good time to have a unified model for these DAG-based computations... I guess it's a DAG-based computational model.
>> Okay, so I want to pause you for a moment.
>> Yeah.
>> And generally, we try to avoid being rude and cutting off our guests but, in this case, help us understand what a DAG is, and why it's so important.
>> Okay, so a DAG is a directed acyclic graph, and, in some sense, if you draw a boxes-and-arrows diagram of your computation where you say "I read some data from here," and it goes through some filters and then I do a join and then I write it somewhere, these all end up looking like what they call a DAG, just because that is the structure, and all computation can more or less be modeled this way. In particular, these massively parallel computations profit a lot from being modeled this way, as opposed to MapReduce, because the fact that you have access to the entire DAG means you can perform transformations and optimizations, and you have more opportunities for executing it in different ways.
>> Oh, in other words, because you can see the big picture you can find, like, the shortest path, as opposed to "I've got to do this step, I've got to do this step and this step."
>> Yeah, it's exactly like that. You're not constrained to... the person writing the program knows what it is that they want to compute, and then, you know, you have very smart people writing the optimizer and the execution engine, so it may execute in an entirely different way. So for example, if you're doing a summation, right, rather than shuffling all your data to one place and summing there, maybe you do some partial summations, and then you just shuffle accumulators to one place, and finish the summation, right?
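To make the summation example concrete, here is a minimal sketch, my own illustration rather than anything shown in the interview, of how it looks in the Beam Python SDK. The pipeline author just asks for a sum per key; because the combine is associative and commutative, a runner is free to pre-combine partial sums on each worker and shuffle only the accumulators, exactly as Ken describes. It assumes the apache_beam Python package and runs on the local DirectRunner by default.

```python
# Illustrative sketch only (not from the interview): sum values per key with Beam.
# CombinePerKey lets the runner compute partial sums locally and shuffle only
# the accumulators to the workers that produce the final results.
import apache_beam as beam

with beam.Pipeline() as p:  # DirectRunner by default; any Beam runner could execute this DAG
    (
        p
        | "Create" >> beam.Create([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
        | "SumPerKey" >> beam.CombinePerKey(sum)   # runner may pre-combine before the shuffle
        | "Print" >> beam.Map(print)
    )
```

The pipeline is just a DAG description; whether a particular optimization actually happens is up to whichever backend executes it, which is the "smart people writing the optimizer and the execution engine" point above.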
>> Okay, now let me bump you up a couple levels.
>> Yeah.
>> And tell us, so, MapReduce was a trees-within-the-forest approach, you know, lots of seeing just what's a couple feet ahead of you. And now we have the big picture that allows you to find the best path, perhaps that's one way of saying it. Tell us though, with Google or with others who are using Beam-compatible applications, what new class of solutions can they build that you wouldn't have done with MapReduce before?
>> Well, I guess there's... there's two main aspects to Beam that I would emphasize. There's the portability, so you can write this application without having to commit to which backend you're going to run it on. And there's also the unification of streaming and batch, which is not present in a number of backends, and Beam, as this layer, makes it very easy to use batch-style computation and streaming-style computation in the same pipeline. And actually I said there were two things; the third thing that really opens things up is that Beam is not just a portability layer across backends, it's also a portability layer across languages. So, something that really only has preliminary support on a lot of systems is Python, so, for example, Beam has a Python SDK where you write a DAG description of your computation in Python, and via Beam's portability APIs, one of these usually Java-centric engines would be able to run that Python pipeline.
>> Okay, so--
>> So, did I answer your question?
>> Yes, yes, but let's go one level deeper, which is, if MapReduce, if its sweet spot was web crawl indexing in batch mode, what are some of the things that are now possible with a Beam-style platform that supports Beam, you know, underneath it, that can do this directed acyclic graph processing?
>> I guess... I'm still learning all the different things that you can do with this style of computation, and the truth is it's just extremely general, right? You can set up a DAG, and there's a lot of talks here at Flink Forward about using a stream processor to do high-frequency trading or fraud detection. And those are completely different, even though they're in the same model of computation, as, you know, you would still use it for things like crawling the web and doing PageRank over it. Actually, at the moment we don't have iterative computations, so we wouldn't do PageRank today.
>> So, is it considered a complete replacement, and then new use cases for older-style frameworks like MapReduce, or is it a complement for things where you want to do more with data in motion or lower latency?
>> It is absolutely intended as a full replacement for MapReduce, yes. Like, if you're thinking about writing a MapReduce pipeline, instead you should write a Beam pipeline, and then you should benchmark it on different Beam backends, right?
>> And, so, working with Spark, working with Flink, how are they, in terms of implementing the full richness of the Beam interface, relative to the Google product Dataflow, from which I assume Beam was derived?
>> So, all of the different backends exist in sort of different states as far as implementing the full model. One thing I really want to emphasize is that Beam is not trying to take the intersection of all of these, right? And I think that your question already shows that you know this: we keep a matrix on our website where we say, "Okay, there's all these different features you might want, and then there's all these backends you might want to run it on," and it's sort of, can you do it, can you do it sometimes, and notes about that. We want this whole matrix to be: yes, you can use all of the model on Flink, all of it on Spark, all of it on Google Cloud Dataflow. But they all have some gaps, and, yeah, we're really welcoming contributors in that space.
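The portability point has a concrete shape in code. Below is a small, hypothetical sketch, again my illustration rather than anything from the interview, of a Beam Python pipeline that can be pointed at different backends purely through launch-time options; the runner names are Beam's standard ones, the word-count logic is just a placeholder, and the project/region values would need to be filled in for a real Dataflow run.

```python
# Hypothetical illustration of Beam's backend portability: the same pipeline
# definition can be handed to different runners chosen at launch time, e.g.
#   --runner=DirectRunner                                  (local testing)
#   --runner=FlinkRunner                                   (Apache Flink)
#   --runner=DataflowRunner --project=<id> --region=<reg>  (Google Cloud Dataflow)
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    options = PipelineOptions(argv)  # runner and its flags come from the command line
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.Create(["the quick brown fox", "jumps over the lazy dog"])
            | "Split" >> beam.FlatMap(str.split)            # one element per word
            | "Count" >> beam.combiners.Count.PerElement()  # (word, count) pairs
            | "Print" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```

The pipeline definition itself does not change between backends; only the launch flags do, which is essentially the write-once-then-benchmark-across-backends workflow Ken recommends, subject to the capability matrix he mentions.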
>> So, for someone who's been around for a long time, you might think of it as an ODBC driver, where the capabilities of the databases behind it are different, and so the drivers can only support some subset of the full capability.
>> Yeah, I think that there's... so, I'm not familiar enough with ODBC to say absolutely yes, absolutely no, but yes, it's that sort of a thing. It's like the JVM has many languages on it, and ODBC provides this generic database abstraction.
>> Is Google's goal with the Beam API to make it so that customers demand a level of portability that goes not just for the on-prem products but for products that are in other public clouds, and sort of pry open the API lock-in?
>> So, I can't say what Google's goals are, but I can certainly say that Beam's goals are that nobody's going to be locked into a particular backend.
>> Okay.
>> I mean, I can't even say what Beam's goals are, sorry, those are my goals, I can speak for myself.
>> Is Beam seeing adoption so far by the sort of big consumer internet companies, or has it started to spread to mainstream enterprises, or is it still a little immature?
>> I think Beam's still a little bit less mature than that. We're heading into our first stable release, so, we began incubating it as an Apache project about a year ago, and then, around the beginning of the new year, actually right at the end of 2016, we graduated to be an Apache top-level project. So right now we're sort of on the road from we've become a top-level project, we're seeing contributions ramp up dramatically, and we're aiming for a stable release as soon as possible. Our next release we expect to be a stable API that we would encourage users and enterprises to adopt, I think.
>> Okay, and that's when we would see it in production form on the Google Cloud platform?
>> Well, so the thing is that the code and the backends behind it are all very mature, but, right now, we're still sort of, I don't know how to say it, we're polishing the edges, right? It's still got a lot of rough edges and you might encounter them if you're trying it out right now, and things might change out from under you before we make our stable release.
>> Understood.
>> Yep.
>> All right. Kenneth, thank you for joining us, and for the update on the Beam project, and we'll be looking for that and seeing its progress over the next few months.
>> Great. Thanks for having me.
>> With that, I'm George Gilbert, I'm with Kenneth Knowles, we're at the data Artisans Flink Forward user conference in San Francisco at the Kabuki Hotel, and we'll be back after a few minutes.