
Stephan Ewen, data Artisans | Flink Forward 2018



>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by data Artisans.

>> Hi, this is George Gilbert. We are at Flink Forward, the conference put on by data Artisans for the Apache Flink community. This is the second Flink Forward in San Francisco, and we are honored to have with us Stephan Ewen, co-founder of data Artisans, co-creator of Apache Flink, and CTO of data Artisans. Stephan, welcome.

>> Thank you, George.

>> Okay, so with others we were talking about the use cases they were trying to solve, but you put all the pieces together in your head first, and you are building out something that gets broader and broader in its applicability. Help us, maybe from the bottom up, think through the problems you were trying to solve. Let's start with the ones that you saw first, and then how the platform grows so that you can solve a broader and broader scale of problems.

>> Yes, happy to do that. I think we have to take a few steps back and look at the breadth of use cases that we're looking at. How did that influence some of the inherent decisions in how we built Flink? How does that relate to what we presented earlier today, the stream processing platform, and so on? Stream processing is an extremely general and broad paradigm. We've actually started to describe what Flink is underneath the hood: it's an engine to do stateful computations over data streams. It's a system that can process data streams the way a batch processor processes bounded data. It can process data streams the way a real-time stream processor processes real-time streams of events. And it can handle data streams with sophisticated event-by-event, stateful, timely logic, the way many applications that are implemented as data-driven microservices implement their logic. The basic idea behind Flink's approach is to start with the basic ingredients that you need, and to try not to impose constraints around how you use them. When I give presentations, I very often say the basic building blocks of Flink are these. First, flowing streams of data, streams being received from systems like Kafka, file systems, databases; you route them, and you may want to repartition them, organize them by key, or broadcast them, depending on what you need to do. Second, computation on these streams that can keep state almost as if it were a standalone Java application; you don't necessarily think in terms of writing state to a database, you think more in terms of maintaining your own variables. Third, sophisticated access to tracking time and the progress and completeness of data; that is in some sense what is behind the event-time streaming notion: you're tracking the completeness of the data as of a certain point in time. And then, to round this all up, we give it a really nice operational tool by introducing the concept of distributed consistent snapshots. Sticking with these basic primitives, you have streams that just flow, with no transactional barriers between operations and no microbatches; state variables that get updated; and fault tolerance happening as an asynchronous background process.
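To make those building blocks concrete, here is a minimal sketch in Flink's DataStream API. It is illustrative rather than anything shown at the conference: the socket source stands in for a real connector like Kafka, and the class and state names are made up. A stream is organized by key, per-key state is kept like a local variable, and enabling checkpointing turns fault tolerance into the asynchronous background process described above.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RunningCount {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fault tolerance as an asynchronous background process:
        // draw a distributed consistent snapshot of all state every 10 seconds.
        env.enableCheckpointing(10_000);

        env.socketTextStream("localhost", 9999) // stand-in for a Kafka source
           .keyBy(word -> word)                 // route the stream: organize it by key
           .flatMap(new CountPerKey())          // stateful, event-by-event logic
           .print();

        env.execute("running-count");
    }

    // State is maintained like a local variable; Flink snapshots it consistently.
    public static class CountPerKey extends RichFlatMapFunction<String, String> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String word, Collector<String> out) throws Exception {
            long updated = (count.value() == null ? 0L : count.value()) + 1;
            count.update(updated);
            out.collect(word + " -> " + updated);
        }
    }
}
```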
Now that is, in some sense, the core idea, and it is what helps Flink generalize from batch processing to real-time stream processing to event-driven applications. What we saw today, in the presentation that I gave earlier, is how we use that to build a platform for stream processing and event-driven applications. That takes some of these things, most prominently in that case the fourth aspect, the ability to draw a snapshot of the application at any point in time, and uses it as an extremely powerful operational tool. You can think of it as a tool to archive applications, migrate applications, fork applications, and modify them independently.

>> And these snapshots are essentially individual snapshots at the node level, and then you're organizing them into one big logical snapshot?

>> Yeah, each node takes its own snapshot, but they're consistently organized into a globally consistent snapshot, yes. That has a few very interesting and important implications. Just to give you one example where this makes things much easier: if you have an application that you want to upgrade, and you don't have a mechanism like that, what is the default way that many folks do these updates today? They try to do a rolling upgrade of all the individual nodes: you replace one, then the next, then the next. But that creates this interesting situation where, at some point in time, there are actually two versions of the application running at the same time.

>> And operating on the same sort of data stream.

>> Potentially, yeah. On some partitions of the data stream you have one version, and on some partitions you have the other version. You may be at the point where you have to maintain two wire formats, where all the pieces of your logic have to be written to understand both versions, or you try to use a data format that makes this a little easier. But it's inherently a thing that you don't even have to worry about if you have these consistent distributed snapshots. You just switch from one application to the other as if nothing were shared or in flight at any point in time. It gets many of these problems out of the way.

>> Okay, and does that snapshot apply to code and data?

>> In Flink's architecture itself, the snapshot applies, first of all, only to data. And that is very important.

>> George: Yeah.

>> Because what it actually allows you to do is decouple the snapshot from the code, if you want to.

>> George: Okay.

>> That allows you to do things like we showed earlier this morning. If you have an earlier snapshot where the data is correct, and you then changed the code but introduced a bug, you can just say, "Okay, let me actually change the code and apply different code to a different snapshot." So you can roll back or roll forward different versions of the code and different versions of the state independently, or you can say, "When I'm forking this application, I'm actually modifying it." That is a level of flexibility that, once you actually start to make use of it and practice it, is incredibly useful.
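A hedged sketch of what that upgrade-and-fork cycle looks like with Flink's standard command-line tool; the job IDs, paths, and jar names are placeholders. The savepoint is the globally consistent snapshot being discussed: the old version stops, the new version resumes from exactly that state, so two versions never process the stream at once, and the same snapshot can equally seed a rollback or a fork.

```sh
# Draw a globally consistent snapshot (savepoint) of the running application
flink savepoint <jobId> hdfs:///flink/savepoints

# Stop the old version; nothing is shared or in flight across the switch
flink cancel <jobId>

# Resume the upgraded (or rolled-back, or forked) code from that snapshot
flink run -s hdfs:///flink/savepoints/savepoint-<id> -d my-app-v2.jar
```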
It's actually been one of the maybe least obvious things when you start to look into stream processing, but once you actually take stream processing into production, this operational flexibility ranks, I would say, very high for a lot of users when they say, "Okay, this is why we took Flink to streaming production and not others." The ability to do, for example, exactly that.

>> This sounds, then, like what some stream processors call unbundling the database: you have derived data at different sync points, and that derived data is for analysis, views, whatever. But it sounds like what you're doing is taking derived data of what the application is working on in progress and creating an essentially logically consistent view, not really derived data for some other application's use, but for operational use.

>> Yeah, so--

>> Is that a fair way to explain it?

>> Yeah, let me try to rephrase it a bit.

>> Okay.

>> When you start to take this streaming-style approach to things, which has been called turning the database inside out, or unbundling the database, your input sequence of events is arguably the ground truth, and what the stream processor computes is a view of the state of the world. While this sounds super easy at first, you know, views, you can always recompute a view, right? In practice, this view of the world is not just some lightweight thing that's merely derived from the sequence of events. It's actually the state of the world that you want to use, and it might not be fully reproducible, either because the sequence of events has been truncated, or because the sequence of events is just plain too long to feasibly recompute in a reasonable time. So having a way to work with this state, in a way that cleanly complements this whole event-driven, log-driven architecture, is what this snapshot tool also gives you.

>> Okay, so that sounds like it was part of core Flink.

>> That is part of core Flink's inherent design, yes.

>> Okay. So then take us to the next level of abstraction: the scaffolding that you're building around it with the dA Platform, and how that makes stream processing more accessible, how it empowers a whole other generation of developers.

>> Yeah, so there are different angles to what the dA Platform does. One angle is, very pragmatically, easing the rollout of applications, by having one way to integrate the platform with your metrics, alerting, logging, and CI/CD pipeline, so that every application you deploy just inherits all of that. The application developer doesn't have to worry about anything; they just say, "This is my piece of code, I'm putting it there," and it's hooked in with everything else. That's not rocket science, but it's extremely valuable, because there are a lot of tedious bits here and there that otherwise eat up a significant amount of development time. The technologically more challenging part that this solves is where we're really integrating the application snapshots, the compute resources, the configuration management, and everything else into one model, where you don't have to think, "I'm running a Flink job here; that Flink job has created a snapshot that is sitting around here; there's also a snapshot here, which may come from that Flink application; and that Flink application over there is actually just a new version of this one, the, let's say, testing or acceptance run for the version we're about to deploy." The platform ties all of these things together.

>> So it's not just the artifacts from one program, it's how they all interrelate?

>> Exactly, it gives you the idea of how they all interrelate, because an application over its lifetime will correspond to different configurations, different code versions, different deployments, production, A/B testing, and so on, and you need to know how all of these things work together, how they interplay. Like I said before, Flink deliberately couples checkpoints and code in a rather loose way, to allow you to evolve the code and still be able to match a previous snapshot to a newer code version. We make heavy use of that, and we can now give you a good way of tracking all of these things together: how do they relate, when was which version running, which code version was that. Having the snapshots, we can always go back and reinstate earlier versions, and we always have the ability to move a deployment from here to there, fork it, drop it, and so on. That is one part of it. The other part is the tight integration with Kubernetes. Containers' initial sweet spot was stateless compute, but the way stream processing architectures work, the nodes are inherently not stateless: they have a view of the state of the world. That state is always recoverable, and you can also change the number of containers; with Flink and other frameworks you have the ability to adjust this, and so on--

>> Including repartitioning the--

>> Including repartitioning the state. But it's a thing where you often have to be quite careful about how you do it, so that it all stays exactly consistent: the right containers are running at the right point in time with exactly the right version; there's no split-brain situation where some other partitions happen to still be running at the same time; and when a container goes down, you can tell whether this is a situation where you're supposed to recover or to rescale. Figuring all of these things out together is what the idea of integrating these things in a very tight way gives you. So think of it the following way. Initially, you just start with Docker. Docker is a way to say: I'm packaging up everything that a process needs, all of its environment, to make sure that I can deploy it here, and here, and here, and it just always works. It's not like, "Oh, I'm missing the correct version of the library here," or "I'm interfering with that other process on a port." On top of Docker, people added things like Kubernetes, to orchestrate many containers together into an application. And then on top of Kubernetes there are things like Helm, or, for certain frameworks, Kubernetes Operators, which try to raise the abstraction and say, "Okay, we're taking care of the aspects that this application needs in addition to container orchestration." We're doing exactly that kind of thing: we're raising the abstraction one level up, to say we're not just thinking about the containers, the compute, and maybe some local persistent storage; we're looking at the entire stateful application, with its compute, its state, its archival storage, all of it together.

>> Okay, let me peel off with a question about more conventionally trained developers and admins, who are used to databases for batch and request/response type applications. Do you see them becoming potential developers of continuous stream processing apps, or do you see it mainly for a new generation of developers?

>> No, I would actually say that for a lot of the classic, call it request/response, or create-read-update-delete applications working against a database, there's huge potential for stream processing, and for that kind of event-driven architecture, to help change this view. There's actually a fascinating talk here by the folks from (mumbles) who implemented an entire social network in this stream processing architecture, so not against a database, but against a log and a stream processor instead. It comes with some really cool properties, like a very unique kind of operational flexibility: you can test, evolve, run, and do very rapid iterations over your--

>> Because of the decoupling?

>> Exactly, because of the decoupling. You don't always have to worry about, "Okay, I'm experimenting here with something; let me first create a copy of the database; and then, once I think this is working out well, how do I either migrate those changes back, or make sure the copy of the database I made is brought up to speed with the production database again, before I switch over to the new version?" So many of these things, the pieces just fall together easily in the streaming world.

>> I think I asked this of Kostas, but if a business analyst wants to query the current state of what's in the cluster, do they go through some sort of head node that knows where the partitions live, and then some sort of query optimizer figures out how to execute that with a cost model or something? In other words, if you wanted to do some sort of batch or interactive type...

>> So there are different answers to that, I think. First of all, there's the ability to look into the state of Flink, as in, you have the individual nodes that maintain the state they're doing the computation on, and you can look into this, but it's more of a lookup thing. You're not running a query, as in a SQL query, against that particular state. If you would like to do something like that, what Flink gives you is, as always...
There's a wide variety of connectors. So you can, for example, describe your streaming computation, you can describe it in SQL, and say: the result of this thing, I'm writing to a neatly queryable data store, an in-memory database or so, and then you actually run the dashboard-style, exploratory queries against that particular database. Flink's sweet spot at this point is not to run many small, fast, short-lived SQL queries against something that is running inside Flink at the moment. That's not what it is yet built and optimized for.

>> A more batch-oriented one would be against the derived data that's in the form of a materialized view.

>> Exactly, so these two sides play together very well, right? You have the more exploratory, batch-style queries that go against the view, and then you have the stream processor and streaming SQL used to continuously compute the view that you then explore.

>> Do you see scenarios where you have traditional OLTP databases that are capturing business transactions, but now you want to inform those transactions, or potentially automate them, with machine learning? So you capture a transaction, and then there's sort of ambient data, whether it's about the user interaction or it's machine data flowing in, and maybe you don't capture the transaction right away, but you're capturing data for the transaction and the ambient data. From the ambient data you calculate some sort of analytic result, could be a model score, and that informs the transaction that's running at the front end of this pipeline. Is that a model that you see in the future?

>> That sounds like a formal use case that has actually been run; it's not uncommon, yeah. In some sense, a model like that is behind many of the fraud detection applications. You have the transaction that you capture. You have a lot of contextual data that you receive, from which you either build a model in the stream processor, or you build a model offline and push it into the stream processor as, let's say, a stream of model updates. Then, from that stream of model updates and from the history of the previous transactions, you derive your classifiers, or your rule engines, or your predictor state, and you use that to attach a classification to the transaction. Once this is returned, this stream is fed back to the part of the computation that actually processes the transaction itself, to trigger the decision whether, for example, to hold it back or to let it go forward.

>> So this is an application where people who have built traditional architectures would add this capability on for low-latency analytics?

>> Yeah, that's one way to look at it, yeah.

>> As opposed to a rip-and-replace, like, "we're going to take out our request/response and our batch and put in stream processing."

>> Yeah, that is definitely a way that stream processing is used: you basically capture a change log of whatever is happening in a database, or you immediately capture the events, the interactions from users and devices, and then you let the stream processor run side by side with the old infrastructure, and compute exactly the additional information that even a mainframe database might, in the end, use to decide what to do with a certain transaction. So it's a way to complement legacy infrastructure with new infrastructure, without having to break away the legacy infrastructure.
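A hedged sketch of such a continuously computed view in the streaming SQL dialect Flink supported around this time; the table and column names are invented for illustration. The stream processor keeps the windowed aggregate up to date and writes it out to an external, queryable store, and the exploratory, batch-style queries then go against that store rather than against Flink itself.

```sql
-- Continuously maintain a per-account activity view and emit it to a sink
-- table registered over, e.g., an in-memory database or key-value store.
INSERT INTO account_activity_view
SELECT
  accountId,
  TUMBLE_START(txTime, INTERVAL '1' HOUR) AS windowStart,
  COUNT(*)    AS txCount,
  SUM(amount) AS txVolume
FROM transactions
GROUP BY
  accountId,
  TUMBLE(txTime, INTERVAL '1' HOUR);
```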
>> So let me ask in a different direction, more on the complexity that forms a tax on developers and administrators. Many of the open-source community products, slash projects, solve narrow functions within a broader landscape, and there's a tax on developers and admins in trying to make those work together, because of the different security models, data models, all that.

>> There is a zoo of systems and technologies out there, and also of different paradigms for doing things. Once systems have a similar paradigm, or tier, in mind, they usually work together well, but there are different philosophical takes--

>> Give me some examples of the different paradigms that don't fit together well.

>> For example... maybe one good example was early on, when streaming was a rather new thing. At that point in time, stream processors were very much thought of as a bit of an addition to, let's say, the batch stack, or whatever other stack you currently had; you looked at the stream processor as an auxiliary piece to do some approximate computation. A big reason why that was the case is that the way those stream processors thought of state was with a different consistency model, and the way they thought of time was actually different from the batch processors and the databases, which use timestamp fields, while the early stream processors--

>> They couldn't handle event time.

>> Exactly, they just used processing time. That's why you could maybe complement the stack with them, but it didn't really go well together. You couldn't just say, "Okay, I can actually take this batch job and interpret it also as a streaming job." Once the stream processors got a better interpretation--

>> The Lambda architecture.

>> Exactly. So once the stream processors adopted a stronger consistency model, and a time model that is more compatible with reprocessing and so on, all of these things all of a sudden fit together much better.

>> Okay, so do you see that vendors who are oriented around a single, unified paradigm will continue to broaden their footprint, so that they can take some of the complexity off the developer and the admin by providing one throat to choke, with pieces that were designed to work together out of the box, unlike some of the zoo of the former Hadoop community? In other words, a lot of vendors seem to be trying to build a broader footprint, so that it's something that's simpler to develop against and to operate.

>> There are a few good efforts happening in that space right now. One that I really like is the idea of standardizing on some APIs. APIs are hard to standardize on, but you can at least standardize on semantics, which is something that, for example, Flink and Beam have been very keen on: trying to have an open discussion and a roadmap that is very compatible in its thinking about streaming semantics. This has been taken to the next level, I would say, with the whole streaming SQL design. Beam is adding streaming SQL and Flink is adding streaming SQL, both in collaboration with the Apache Calcite project, so you get very similar standardized semantics and SQL compliance, and you start to get common interfaces, which is a very important first step, I would say. Standardizing on things like--

>> So SQL semantics across the products that would be within a stream processing architecture?

>> Yes, and I think this will become really powerful once other vendors start to adopt the same interpretation of streaming SQL and think of it as: yes, it's a way to take a changing data table here and project a view of this changing data table, a changing materialized view, into another system, and then use that as a starting point to maybe compute another derived view. You can actually start to think more high-level about things: think in terms of relational queries and dynamic tables across different pieces of infrastructure. Once you can do that, interplay between architectures becomes easier to handle, because even if things behave a bit differently at the runtime level, you at least start to establish a standardized model for thinking about how to compose your architecture. And even if you decide to change something along the way, you're frequently saved the problem of having to rip everything out and redesign everything, just because the next system you bring in follows a completely different paradigm.

>> Okay, this is helpful. To be continued offline, or back online on theCUBE. This is George Gilbert. We were having a very interesting and extended conversation with Stephan Ewen, CTO and co-founder of data Artisans and one of the creators of Apache Flink. We are at Flink Forward in San Francisco. We will be back after this short break.

Published Date : Apr 12 2018



Kostas Tzoumas, data Artisans | Flink Forward 2018


 

(techno music)

>> Announcer: Live from San Francisco, it's theCUBE. Covering Flink Forward, brought to you by data Artisans.

(techno music)

>> Hello again everybody, this is George Gilbert. We're at the Flink Forward conference, sponsored by data Artisans, the provider of both Apache Flink and the commercial distribution, the dA Platform, which supports the productionization and operationalization of Flink and makes it more accessible to mainstream enterprises. We're privileged to have Kostas Tzoumas, CEO of data Artisans, with us today. Welcome, Kostas.

>> Thank you. Thank you, George.

>> So tell us, let's start with an idealized application use case that is in the sweet spot of Flink, and then let's talk about how that's going to broaden over time.

>> Yeah, so, just a little bit of an umbrella above that. What we see very, very consistently, in modern tech companies and in traditional enterprises that are trying to move there, is a move towards a business that runs in real time: it runs 24/7, it is data-driven, so decisions are made based on data, and it is software-operated, so increasingly decisions are made by AI, by software, rather than by someone looking at something and making a decision. For example, some of the largest users of Apache Flink, companies like Uber, Netflix, Alibaba, and Lyft, are all working in this way.

>> Can you tell us about the size of their deployments, something in terms of records per day, or cluster size?

>> Yeah, sure. The latest I heard, Alibaba is powering Alibaba Search with more than a thousand nodes and terabytes of state; I'm pretty sure they will give us bigger numbers today. Netflix has reported doing about one trillion events per day.

>> George: Wow.

>> On Flink. So, pretty big sizes.

>> And is Netflix, I think I read, powering their real-time recommendation updates with it?

>> They are powering a bunch of things, a bunch of applications; there's a lot of routing of events internally. They had a talk, definitely at the last conference, where they talked about this. And it's really a variety of use cases. It's really about building a platform internally, and offering it to all sorts of departments in the company, be that for recommendations, be that for BI, be that for running stateful microservices, all sorts of things. And we also see the more traditional enterprise moving to this modus operandi. For example, ING is one of our biggest partners; it's a global consumer bank based in the Netherlands, and their CEO is saying that ING is not a bank: it's a tech company that happens to have a banking license. That's how they want to operate. So what we see is that stream processing is really the enabler for this kind of modern business, where they interact with the consumer in real time, they push notifications, they can change the pricing, et cetera, et cetera. So this is really the crux of stateful stream processing, for me.

>> Okay. So tell us, for those who have a passing understanding of how Kafka is evolving, and how Apache Spark and Structured Streaming are evolving, as distinct from Flink: what is it about having state management that's integrated that, for example, might make it easy to elastically change a cluster size by repartitioning? What can you assume about managing state internally that makes things easier?

>> Yeah, that's a very good question. I think the sweet spot of Flink is that, if you are looking for a stream processing engine, and a stateful stream processing engine for that matter, Flink is the definition of this. It's the definitive solution to this problem. It was created from scratch with this in mind; it was not a bolt-on on top of something else, so it's streaming from the get-go. And we have done a lot of work to make state a first-class citizen. What this means is that in Flink programs you can keep state that scales to terabytes, we have seen that, and you can manage this state together with your application. Flink has this model based on checkpoints, where you take a checkpoint of your application and state together, and you can restart at any time from there. So the core of Flink is really around state management.

>> And you manage exactly-once semantics across the checkpointing?

>> It's exactly once, application-level exactly once. We have also introduced end-to-end exactly once with Kafka, so Kafka-Flink-Kafka exactly once. So, fully consistent.

>> Okay, so let's drill down a little bit. What are some of the things customers would do with an application running on, let's say, a big cluster or a couple of clusters, where they want to operate both on the application logic and on the state, where having them integrated makes things much easier?

>> Yeah, so it is a lot about a flipped architecture, and about making operations and DevOps much, much easier. Traditionally, what you would do is create, let's say, a containerized stateless application, and have a centralized data store to keep all your state. What you do now is that the state becomes part of the application. This has several benefits: it has performance benefits, and it has organizational benefits in the company.

>> Autonomy.

>> Autonomy between teams. It gives you a lot of flexibility in what you can do with the applications, like, for example, scaling an application. What you can do with Flink is, if you have an application running with a parallelism of over 100, and you are getting higher volume and want to scale it to 500, you can simply take a snapshot of the state and the application together with Flink, and then restart it at 500, and Flink is going to redistribute the state. So, no need to do anything on a database.

>> And then it'll reshard; Flink will reshard it.

>> It will reshard and it will restart. And then, one step further, with the product that we have introduced, the dA Platform, which includes Flink, you can simply do this with one click, or with one REST command.

>> So the resharding was possible with core Flink, with Apache Flink, and the dA Platform just makes it that much easier, along with other operations.

>> Yeah, so what the dA Platform does is give you an API for common operational tasks that we observed everybody deploying Flink at a decent scale needs to do. It is based on Kubernetes, but it gives you a higher-level API than Kubernetes: you can manage the application and the state together, and it gives that to you in a REST API, in a UI, et cetera.

>> Okay. So, in other words, by abstracting even up from Kubernetes, you might have a cluster as a first-class citizen, but you're treating it almost like a single entity, and then under the covers you're managing the things that happen across the cluster.
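As a hedged sketch, the rescale cycle Kostas describes, going from parallelism 100 to 500, looks roughly like this with core Flink's CLI (the job ID and paths are placeholders); the dA Platform collapses this sequence into one click or one REST call. Keyed state is redistributed automatically on restart, up to the job's configured maximum parallelism.

```sh
# Snapshot the application and its state together, then stop the job
flink savepoint $JOB_ID s3://bucket/savepoints
flink cancel $JOB_ID

# Restart the same jar from the savepoint at the new parallelism;
# Flink reshards the state across the 500 new workers
flink run -s s3://bucket/savepoints/savepoint-<id> -p 500 -d my-app.jar
```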
>> So what we have in the dA Platform is a notion of a deployment, which, I think of it as a cluster, but it's basically based on containers. So you have this notion of deployments that you can manage, (coughs) sorry, and then you have a notion of an application. And an application is a Flink job that evolves over time. You have a bird's-eye view on this: when you update the code, this is the same application with updated code. You can travel through its history, you can visit the logs, and you can do common operational tasks, like, as I said, rescaling, updating the code, rollbacks, replays, migrating to a new deployment target, et cetera.

>> Let me ask you: outside of the big tech companies, who have built much of the application management scaffolding themselves, you can democratize access to stream processing, because the capabilities are not in the skill set of traditional, mainstream developers. So, a question. The first thing I hear from a lot of newbies, or people who want to experiment, is, "Well, it's so easy to manage the state in a shared database, even if I'm processing continuously." Where should they make the trade-off? When is it appropriate to use a shared database, maybe for real OLTP work, and when can you scale it out and manage the state integrally with the rest of the application?

>> So, when should we use a database and when should we use streaming, right?

>> Yeah, and even if it's streaming, with the embedded state.

>> Yeah, that's a very good question. I think it really depends on the use case. What we see in the market is that many enterprises start with a use case that either doesn't scale, or is not developer-friendly enough with this database/application level separation, and then it quickly spreads out in the whole company, and other teams start using it. So, for example, in the work we did with ING, they started with a fraud detection application, where the idea was to load models dynamically into the application, as the data scientists are creating new models, and have a scalable fraud detection system that can handle their load. And then we have seen other teams in the company adopting stream processing after that.

>> Okay, so that sounds like where the model becomes part of the application logic, and it's a version of the application logic, and then--

>> The version of the model--

>> Is associated with the checkpoint.

>> Correct.

>> So let me ask you then: what happens when you're managing, let's say, terabytes of state across a cluster, and someone wants to query across that distributed state? Is there in Flink a query manager that knows where all the shards are, and the statistics around the shards, to do a cost-based query?

>> So there is a feature in Flink called queryable state that gives you the ability to do, very simple for now, queries on the state. This feature is evolving, it's in progress, and it will get more sophisticated and more production-ready over time.

>> And that enables a different class of users.

>> Exactly. To be frank, I wouldn't use it for complex data warehousing scenarios. That still needs a data warehouse. But you can do point queries and a few slightly more sophisticated queries.

>> So this is different. This type of state would be different from, like, Kafka, where you can store the commit log for X amount of time and then replay it. This is in a database, I assume, not in log form, and so you have faster access.

>> Exactly, and it's placed together with a log. You can think of the state in Flink as the materialized view of the log at any given point in time, with various versions.

>> Okay.

>> And really, the way replay works is: roll back the state to a prior version, and roll back the log, the input log, to that same logical time.

>> Okay, so how do you see Flink spreading out, now that it's been proven with the most demanding customers, and now that we have to accommodate skills, where the developers and DevOps don't have quite the same distributed systems knowledge?

>> Yeah, we do a lot of work at data Artisans with financial services, insurance, very traditional companies, and it's definitely work in progress, in the sense that our product, the dA Platform, makes operations much easier. This was a common problem everywhere; it was something that tech companies solved for themselves, and we wanted to solve it for everyone else. Application development is yet another thing, and as we saw today in the last keynote, we are working together with Google and the Beam community to bring Python, Go, all sorts of languages into Flink.

>> Okay, so that'll help at the developer level, and you're also doing work at the operations level with the platform.

>> And of course there's SQL, right? Flink has Stream SQL, which is standard SQL.

>> And would you see, at some point, actually managing the platform for customers, either on-prem or in the cloud?

>> Yeah, so right now the platform is running on Kubernetes, which means that typically the customer installs it in their own Kubernetes clusters, which can be either their own machines, or a Kubernetes service from a cloud vendor. Moving forward, I think it will be very interesting, yes, to move to more hosted solutions, to make it even easier for people.

>> Do you see a breakpoint, or a transition, between the most sophisticated customers, who are either comfortable on their own premises or who were cloud-native from the beginning, and then the rest of the mainstream? What sort of applications might they move to the cloud, or might coexist between on-prem and the cloud?

>> Well, I think it's clear that every new business starts on the cloud, that's clear. There's a lot of enterprise that is not yet there, but there's a big willingness to move there. And there are a lot of hybrid cloud solutions as well.

>> Do you see mainstream customers rewriting applications because they would be so much more powerful with stream processing, or do you see them doing just new applications?

>> Both, we see both. It's always easier to start with a new application, but we do see a lot of legacy applications in big companies that are not working anymore, and we see those rewritten. And very core applications, very core to the business.

>> So could you be sort of the source, and the analytic processing, for the continuous data, and then that sort of feeds a transaction and some parameters that then feed a model?

>> Yeah.

>> Is that, is that a--

>> Yeah.

>> So in other words, you could augment existing OLTP applications with analytics, and then inform them in real time, essentially.

>> Absolutely.

>> Okay, because that sounds like something that people would build around what exists.

>> Yeah. You can think of stream processing, in a way, as transaction processing.
It's not a dedicated OLTP store, but you can think of it in this flipped architecture, right? The log is essentially the redo log, and then you create the materialized views; that's the write path. And then you have the read path, which is queryable state. This is this whole CQRS idea, right?

>> Yeah, Command Query Responsibility Segregation.

>> Exactly.

>> So, this is actually interesting, and I guess this is critical. It's sort of like a new way of doing distributed databases: the derived data, managed by the stream processor, coming off the state changes, goes through a single append-only log. And then, for reading: how do you manage consistency on the materialized views, the derived data?

>> Yeah, so we have seen Flink users implement that. We have seen companies really base their complete product on the CQRS pattern. I think this is a little bit further out. Consistency-wise, Flink gives you exactly-once consistency on the write path. What we see a lot more is an architecture where there are a lot of transactional stores running in the front end, and then there needs to be some kind of global, single source of truth across all of them. A very typical way to do that is to get these logs into a stream, and then have a Flink application that can actually scale to that, and create a single source of truth from all of these transactional stores.

>> And by feeding the transactional stores into this sort of hub, I presume some cluster as a hub, even if it's in the form of a log, how can you replay it with sufficient throughput? I guess not to be a data warehouse, but to have low latency for updating the derived data. And is that derived data, I assume, in non-Flink products?

>> Yeah, so the way it works is that you can get the change logs from the databases, you can use something like Kafka to buffer them up, and then you can use Flink for all the processing. Doing the reprocessing with Flink is really one of the core strengths of Flink: basically, you replay the Flink program together with the state, and you can get really, really high-throughput reprocessing there.

>> Where does the super high throughput come from? Is that because of the integration of state and logic?

>> Yeah, that is because Flink is a true streaming engine. It is a high-performance streaming engine, and it manages the state. There's no tier--

>> Crossing a boundary?

>> No tier crossing, and there's no boundary crossing when you access state. It's embedded in the Flink application.

>> Okay, so you can optimize the IO path?

>> Correct.

>> Okay, very, very interesting. So, it sounds like the Kafka guys, the Confluent folks, their aspirations, from the last time we talked to them, don't extend to analytics; I don't know whether they want partners to do that. It sounds like they have a similar topology, but I'm not clear how much of a first-class citizen state is, other than the log. How would you characterize the trade-offs between the two?

>> Yeah, so, obviously I cannot comment on Confluent, but what I think is that the state and the log are two very different things. You can think of the log as storage. It's a kind of hot storage, because it's the most recent data, but you cannot query it; it's not a materialized view. So for me, the separation is between processing state and storage. The log is a kind of storage, a kind of message queue. State is really the active data, the real-time active data that needs to have consistency guarantees, and that's a completely different thing.

>> Okay, and so you're managing, it's almost like you're managing under the covers a distributed database.

>> Yes, kind of. A distributed key-value store, if you wish.

>> Okay, okay, and then that's exposed through multiple interfaces: data stream, table.

>> Data stream, Table API, SQL, other languages in the future, et cetera.

>> Okay, so going further down the line, how do you see the use cases that are going to get you across the chasm, from the big tech companies into the mainstream?

>> Yeah, so we are already seeing that a lot. We're doing a lot of work with financial services, insurance companies, a lot of very traditional businesses. And it's really a lot about maintaining a single source of truth, and becoming more real-time in the way they interact with the outside world and the customer; they do see the need to transform. If we take financial services and investment banks, for example, there is a big push in this industry to modernize the IT infrastructure, to get rid of legacy, to adopt modern solutions, to become more real-time, et cetera.

>> And so they really need this, like the application platform, the dA Platform, because operationalizing what Netflix did is going to be very difficult for non-tech companies.

>> Yeah, I mean, it's always a trade-off, right? Some companies build, some companies buy, and for many companies it's much more sensible to buy. That's why we have software products. And really, our motivation was that we worked in the open-source Flink community with all the big tech companies. We saw their successes, we saw what they built, we saw their failures. We saw everything, and we decided to build this for everybody else, for everyone that is not Netflix, is not Uber, and cannot hire software developers so easily, or with such good quality.

>> Okay, alright, on that note, Kostas, we're going to have to end it, to be continued, with Stephan next, apparently.

>> Nice.

>> And then hopefully next year as well.

>> Nice. Thank you.

>> Alright, thanks, Kostas.

>> Thank you, George.

>> Alright, we're with Kostas Tzoumas, CEO of data Artisans, the company behind Apache Flink and now the application platform that makes Flink run for mainstream enterprises. We will be back after this short break.

(techno music)

Published Date : Apr 11 2018



Greg Fee, Lyft | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans.

>> This is George Gilbert. We are at Data Artisans' conference, Flink Forward. It is for the Apache Flink community, sponsored by Data Artisans, and all the work they're doing to move Flink forward and to surround it with additional value that makes building stream-processing applications accessible to mainstream companies. Right now, though, we are not talking to a mainstream company; we're talking to Greg Fee from Lyft. Not Uber. (laughs) Greg, tell us a little bit about what you're doing with Flink. What's the first use case that comes to mind that really exercises its capabilities?

>> Sure, yeah. The process of adopting Flink at Lyft really started with a use case, which was that we're trying to make machine learning more accessible across all of Lyft. We already use machine learning in quite a few applications, but we want to make sure that we use machine learning as much as possible; we really think that's the path forward. And one of the fundamental difficulties with that is having consistent feature generation between the offline, batch-y training scenarios and the online, real-time streaming scenarios. The unified processing engine of Flink really helps us bridge that gap.

>> When you say unified processing engine, are you saying that you can manage code and data as sort of an application version, and some of the code or data is part of the model, and so you're versioning?

>> That's even a step beyond what I'm talking about.

>> Okay.

>> Just the basic, fundamental ability to have one piece of business logic that you can apply at the batch bulk layer and in the real-time layer.

>> George: Yeah.

>> So that's sort of the core of what Flink gives you.

>> Are you running both batch and streaming on Flink?

>> Yes, that's right.

>> And are you using the windows? Or just periodic execution on a stream to simulate batch?

>> That's right. So feature generation crosses a broad spectrum of possible use cases in Flink.

>> George: Yeah.

>> And this is where we transition more into what dA platform can give us. We're looking to have thousands of different features across all of our machine learning models. So having a platform that can help us host many of these little programs, and help with the application life cycle of each of these features as we version them over time; we're very excited about what dA platform can do for us.

>> Can you tell us a little more about how the stream processing helps you with the feature selection and engineering? Is it that you're using streaming, or simulated batch, or batch with the same programming model to train these models, and you're picking up different derived data, is that how it's working?

>> So, the typical life cycle is, there's going to be a feature engineering stage: the data scientist is looking at their data, trying to figure out patterns in the data. The way you apply Flink there is, as you come up with potential algorithms for how you generate your feature, you can run them through Flink, generate some data, apply a machine learning model on top of it, and play around with that data, prototype things.

>> So what you're doing is offline, or out of the platform, you're doing the feature selection and the engineering.

>> Man: Right.

>> Then you attach a stream to it that has just the relevant features.

>> Man: Right.

>> And then that model gets, well, maybe not yet, but eventually versioned as part of the application, which includes the rest of the application logic and the data.

>> Right. Like some of the stuff that was touched on this morning at the keynotes, versioning and maintaining machine learning applications is a very complex ecosystem. So being able to go from the prototype stage, doing stuff in batch, to doing stuff in production and in real time, and then being able to version those over time, to move to better and better versions of the feature generation, is very important to us.

>> I don't know if this is the most politically correct thing, but you just explained it better than everyone else we have talked to.

>> Great. (laughs)

>> About how it all fits together with the machine learning. So, once you've got that in place, it sounds like you're using the dA platform, as well as perhaps some extensions for machine learning, to add that as a separate life cycle besides the application code. Then, is that going to be the enterprise-wide platform for developing and deploying machine learning applications?

>> Yes. Certainly we think there's probably a broad ecosystem to do machine learning; it's a very wide-open area. Certainly my agenda is to push it across the company and get as many things running in this system as possible. I think the real-time aspects of it, and the unifying aspect of what Flink can give us, and what the platform can give us in terms of the life cycles--

>> So are you set up essentially as a shared resource, a shared service, which is the platform group?

>> Man: Right.

>> And then all the business units adopt that platform and build their apps on it.

>> Right. My initiative is part of a greater data science platform at Lyft. We have hundreds of data scientists who are going to be looking at this data, giving me little features that they want to compute, and we're probably going to end up numbering in the thousands of features, so my goal is to be able to generate all those and maintain all those little programs.

>> And when you say generate all those little programs, that's the application logic and the models specific to that application?

>> That's right, well--

>> Or is it this?

>> There are features that are typically shared across many models.

>> Okay.

>> So there are, like, two layers of things happening.

>> So you're managing features separately from the models.

>> That's right.

>> Interesting. Okay, haven't heard that. And is the application manager tooling going to help address that, or is that custom stuff that you have to do?

>> I think there's a potential that that's the way we're going to manage the model stuff as well, but it's still a little new over there.

>> That you put it on the application platform?

>> Right.

>> Then that's sort of at the boundary of what you're doing right now, or what you will be doing shortly.

>> Right. It's a matter of use case, whether it's online or offline, and how it fits best with the rest of the Lyft engineering system.
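A hedged sketch of the "one piece of business logic at both layers" point from earlier in the conversation (the class, field, and feature here are invented, not Lyft's code): because Flink's batch DataSet API and streaming DataStream API share the same function interfaces, a single feature-generation function can be reused verbatim for offline training backfills and online serving, which is what keeps the features consistent between the two.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedFeatureLogic {

    // One piece of feature logic, written once against the common function interface.
    public static class RideDistanceFeature implements MapFunction<String, Double> {
        @Override
        public Double map(String rideCsv) {
            // Illustrative: parse "rideId,meters" and convert meters to kilometers.
            return Double.parseDouble(rideCsv.split(",")[1]) / 1000.0;
        }
    }

    public static void main(String[] args) throws Exception {
        // Batch path: backfill the feature over bounded, historical training data.
        ExecutionEnvironment batch = ExecutionEnvironment.getExecutionEnvironment();
        batch.fromElements("r1,5300", "r2,1200")
             .map(new RideDistanceFeature())
             .print();

        // Streaming path: the exact same class computes the feature in real time.
        StreamExecutionEnvironment streaming = StreamExecutionEnvironment.getExecutionEnvironment();
        streaming.socketTextStream("localhost", 9999) // stand-in for a real event source
                 .map(new RideDistanceFeature())
                 .print();
        streaming.execute("online-features");
    }
}
```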
>> Or, are they sort of more discrete, you know, artifacts, discrete programs, and then when do you keep, stay within the streaming processors, and when do you have it in a shared database? >> That's a, that's a lot of questions, kind of a deep question. So, the goal is to have a central hub, where sort of all of our event data passes through it, and that allows us to decouple. >> So, to be careful, that's not a database central hub, that's a, like a? >> An event hub. >> Event hub. >> Right. >> Yeah, okay. >> So, an event hub in the middle allows us to decompose the different, sort of smaller programs, which again are probably going to number in the thousands, so that being able to have different parts of the company maintain their own part of the overall system is very important to us. I think we'll probably see Flink as a major player in terms of how those programs run, but we'll probably be shooting things off to other systems like Druid, like Hive, like Presto, like Elasticsearch. >> As derived data? >> All as derived data, from these Flink jobs. And then also pushing data directly out into some of our production systems to feed into machine learning decisions. >> Okay, this is quite, it sounds like the most ambitious infrastructure that we've heard, in that it sounds pretty ubiquitous. >> We want to be a machine-learning first company. So, it's everywhere. >> So, now help clarify for me, when? Because this is, you know, for mainstream companies who've programmed with, you know, a DBMS as a shared state manager for decades, help explain to them when you would still use a DBMS for shared state, and when you would start using the distributed state that's embedded in Flink, and the derived data, you know, at the endpoints, at the sinks. >> So I mean, I guess this kind of gets into your exact, your use cases and, you know, your opinions and thoughts about how to use these things best, but. >> George: Your opinion is what we're interested in. >> Right. From where I'm coming, I see basically databases as potentially one sink for this data. They do things very well, right? They do structured queries very well. You can have indices built off that, aggregates, really feed into a lot of visualization stuff. >> George: Yeah. >> But, from where I am sitting, like we're really moving away from databases as something that feeds production data. We've got other stores to do that, that are sort of more tailored towards those scenarios. >> When you say to feed production data, this is transaction capture, or data capture. >> Right. So we don't have a lot of atomic transactions, outside of payments, at Lyft; most of the stuff is eventually consistent. So we have stores, more like Dynamo or Cassandra or HBase, that feed a lot of our production data. >> And those databases, are they for like ambient information, like influencing an interaction? It doesn't sound like automating a transaction. It sounds like context that helps with analytics, but very separate from the OLTP apps. >> That's right. So we have, you can kind of bifurcate the company into the data that's used in production to make decisions that are like facing the user, and then our analytics back end, that really helps business analysts and like the executives make decisions about how we proceed. >> And so that second part, that backend, is more like operational efficiency. >> Man: Right.
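The event-hub pattern described above amounts to many small Flink jobs, each reading the shared hub and writing its own derived stream. A hedged sketch (topic names, broker address, and the filter predicate are invented; the Kafka connector classes shown are the Flink 1.4-era ones):

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;

public class EventHubRouter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "hub-broker:9092");
        props.setProperty("group.id", "ride-events-router");

        // All event data flows through one shared hub topic.
        DataStream<String> allEvents = env.addSource(
            new FlinkKafkaConsumer011<>("all-events", new SimpleStringSchema(), props));

        // Each team's job keeps only the slice it owns and forwards it to a derived
        // topic, from which systems like Druid, Hive, or Elasticsearch can load it.
        allEvents
            .filter(event -> event.contains("\"type\":\"ride_completed\""))
            .addSink(new FlinkKafkaProducer011<>(
                "hub-broker:9092", "ride-completed-derived", new SimpleStringSchema()));

        env.execute("ride-events router");
    }
}
```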
>> And coding new business processes to support new ways of doing business, but the customer-facing stuff, specifically like with payments, that still needs a traditional OLTP. >> Man: Right. >> But those use cases aren't growing that much. >> That's right. So, basically we have very specific use cases for like a traditional database, but in terms of capturing the types of scale, and the type of growth, we're looking for at Lyft, we think some of the other storage engines suit those better. >> So in that use case, would the OLTP DBMS be at the front end, would it be a source, or a sink? It sounds like it's a source. >> So we actually do it both ways. Right, so, it's great to get our transactional data flowing through our streaming system, there's a lot of value in that, but also then pushing it out, back to some of the aggregate results to a DBMS, helps with our analytics pipeline. >> Okay, okay. Well this is actually really interesting. So, where do you see the dA platform helping, you know, going forward; is it something you don't really need because you've built all that scaffolding to help with sort of application life-cycle management, or do you see it as something that'll help sort of push Flink enterprise-wide? >> I think the dA platform really helps people sort of adopt Flink at an enterprise level. Maintaining the applications is a core part of what it means to run it as a business. And so we're looking at dA platform as a way of managing our applications, and I think, like I'm just talking about one, I'm mostly talking about one application we have for Flink at Lyft. >> Yeah. >> We have many other Flink programs actually running that are sort of unrelated to my project. >> What about managing non-Flink applications? Do you need an application manager? Is it okay that it's associated with one service, or platform like Flink, or is there a desire, you know, among bleeding edge customers to have an overall, sort of infrastructure management, application management kind of suite? >> Yes, for sure. You're touching on something I have started to push inside of Lyft, which is the need for an overall application life-cycle management product that's not technology specific. >> Would these sort of plug into the dA platform and whatever the Confluent, you know, equivalent is, or is it going to directly tie to the, you know, operational capabilities, or the functional capabilities, not the management capabilities? In other words, would it plug into like core Flink, core Kafka, core Spark, that sort of stuff? >> I think that's sort of largely to be determined. If you go back to sort of how a distributed system design typically works, we have a user plane, which is going to be our data users. Then you end up with the thing we're probably most familiar with, which is our data plane, technologies like Flink and Kafka and Hive, all those guys. What's missing in the middle right now is a control plane. It's a map from the user desire, from the user intention, to what we do with all of that data plane stuff. So launch a new program; maybe you need a new Kafka topic, maybe you need to provision it in Kafka. Then you need to get some Flink programs running, and whether that talks directly to Flink, and goes against Kubernetes, or something like that, or whether it talks to a higher level, like a more application-specific platform. >> Man: Yeah. >> I think, you know, it's certainly a lot easier if we have some of these platforms in the way.
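No such control plane is named in this conversation; purely as an illustration of the idea, a hypothetical interface that maps user intent onto data-plane actions might look like this (every name here is invented):

```java
import java.util.Map;

// Hypothetical control-plane surface: callers express intent ("launch this program"),
// and the implementation provisions the data plane (Kafka topics, Flink jobs, etc.).
public interface ControlPlane {

    // Provision a Kafka topic for a new program's input or output.
    void createTopic(String name, int partitions, short replicationFactor);

    // Launch a stream program, whether that maps to a raw Flink submission,
    // a Kubernetes deployment, or a higher-level application platform underneath.
    String launchStreamProgram(String artifactUri, Map<String, String> config);

    // Tear down a program and release its data-plane resources.
    void decommission(String programId);
}
```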
>> Because they give you better abstractions. >> That's right. >> To talk to the platforms. >> That's right. >> That's interesting. Okay, geesh, we learn something really, really interesting with each interview. I'm curious though, if you look out a couple years, how much of your application landscape will be continuous processing, and is that something you can see mainstream enterprises adopting, or has decades of work with, you know, batch and interactive sort of made it too difficult for people to learn something so radically new? >> I think it's all going to be driven by the business needs, and whether the value is there for people to make that transition, 'cause it is quite expensive to invest in new infrastructure. For companies like Lyft, where we're trying to make decisions very quickly, you know, even two seconds makes a difference for the customer, so we're trying to be as, you know, real-time as possible. I used to work at Salesforce. Salespeople are a little less sensitive to these things, and you know, it's a very, very traditional world. >> That's interesting. (background applauding) >> But even Salesforce is moving towards that style. >> Even Salesforce is moving? >> Is moving toward stream processing. >> Really? >> Greg: So like, I think we're going to see it slowly be adopted across the big enterprises. >> George: I imagine that's probably for their analytics. >> That's where they're starting, of course, yeah. >> Okay. So, this was a little more affirmation of how we're going to see the control plane evolve, and the interesting use cases that you're up to. I hope we can see you back next year. And you can tell us how far you've proceeded. >> I certainly hope so, yeah. >> This was really interesting. So, Greg Fee from Lyft. We will hopefully see you again. And this is George Gilbert. We're at the Data Artisans Flink Forward conference in San Francisco. We'll be back after this break. (techno music)

Published Date : Apr 12 2018


Enrico Canzonieri, Yelp, Inc. | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's The Cube, covering Flink Forward, brought to you by Data Artisans. (upbeat music) >> Hi this is George Gilbert, we are at Flink Forward, the conference for the Apache Flink community, sponsored by Data Artisans, which is the company commercializing Flink, and we have with us now Enrico Canzonieri, wait a minute I didn't get that right, Canzonieri, sorry, Enrico from Yelp. And he's going to tell us how Flink has taken Yelp by storm over the past year. Why don't we start off with where you were last year in terms of your data pipeline and what challenges you were facing. >> Yeah sure, so we're a Python company, in the sense that we developed most of our software in Python, so until last year most of our stream processing depended on Python. We had developed an in-house framework that was doing Python processing, and that was really what we had; there was no Flink running. Most of the applications were built around a very simple interface that was a process-message function; that was what we expected developers to use, so no real abstraction there. >> Okay so, in other words, it sounds like you had a discrete task, request response or a batch, and then hand it off to the next function, and is that what the pipeline looked like? >> The pipeline was more of a streaming pipeline, where we had a Kafka topic as input and we had these developers who would write this process-message function where each message would be individually processed; that was kind of the semantics of the pipeline there. And then we would get the result of that processing task into another Kafka topic and then run another processing function on top of that. So we could very easily have two or three processing tasks all connected by Kafka topics. Obviously there were big limitations to that kind of architecture, especially when you want to do something more advanced that can include windowing, aggregation, or especially state management. >> Well Kafka has several layers of abstraction and I guess you'd have to go pretty low level to get the windowing, all the windowing and state management capabilities. >> Yeah, it becomes really hard; you basically have to implement it by yourself, unless you're using a (mumbles) on the Confluent platform, or you're using what they call Kafka Streams, but we were not using that. >> Oh, not? Okay. >> Obviously we were trying to implement that on top of our Python, our simple framework, from zero. >> So tell us how the choice of Flink, sort of where did it hit your awareness and where did you start mapping it into this pipeline that was Python based? >> Yeah so we had really, I think, two main use cases, this was last year, that we were struggling to get right and to really get working. The first one was a connector, and the challenge there was to aggregate data locally, scale it to hundreds of streams, and then once we aggregated that locally, upload it to S3. So that was one application; we were really struggling to get that to work because in the framework we had, we had no real abstraction for windowing, so we had this process-message function where we were trying to implement all of that. And also, because we were using very low-level Kafka client primitives, getting scalability was not as straightforward. So that was one application that was pretty challenging; the other one was really a pure stateful application, where we needed to retain the state forever.
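For contrast with the per-message interface Enrico describes, this is roughly what the windowed aggregation his team had to hand-roll looks like when the engine supplies the abstraction. A minimal sketch, with the event tuples and window size assumed for illustration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; in practice this would be a Kafka topic.
        DataStream<Tuple2<String, Integer>> events = env.fromElements(
            Tuple2.of("review_created", 1),
            Tuple2.of("photo_uploaded", 1),
            Tuple2.of("review_created", 1));

        // Five-minute tumbling windows per event type: the engine buffers the
        // window contents as managed state, so none of this is hand-rolled.
        events
            .keyBy(0)
            .timeWindow(Time.minutes(5))
            .sum(1)
            .print();

        env.execute("windowed event counts");
    }
}
```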
It was doing a windowed join across streams, so obviously the challenges in that case are even harder, because you have to implement state management from the ground up. >> And all the time semantics. >> Yeah, we had basically no event-time semantics. We were not supporting that. >> Okay. >> So we looked at Flink because of event-time support, so then we could actually do event-time processing. State management support, already implemented, is way different than implementing it from the ground up. And then obviously the abstractions, so the streaming primitives: you have windows, you have a nice interface that you can use, which makes it easier for developers who are writing code. >> So let's start with the state management; help us walk through, like, what capabilities in state management does Flink have relative to the lowest level abstraction you were using in Kafka, or perhaps what Spark Structured Streaming might provide. >> Yeah, so I think the nice features there are really around the fact that the state management is implemented and fully supports the clusterized approach of Flink. So for example if you're using Kafka, Flink already, in the Kafka connector, provides a way to represent the Kafka state, the state of a Kafka consumer. Also, for operators, if you have a flat map or you have a window, state for windows is already fully supported, so if you are accumulating events in your window, you don't really need to do anything special; that state will be automatically maintained by the Flink framework. That means that if Flink is taking a snapshot, so a checkpoint or a savepoint, all the state that was there will get stored in the checkpoint, and then you will be able to recover. >> For the full window. >> Yeah. >> It's 'cause it understands the concept of the window when it does a checkpoint. >> Yeah, because there's native support in Flink for that. >> And what's the advantage of having state be integrated with the compute, as opposed to compute and then some sort of API to a separate state manager? >> It's definitely code clarity; it's a big simplification of how you implement your code, your streaming application. Because in the end, if for every simple streaming application you need to go ahead and define and implement the way your state gets stored, that really makes for a very complex application, especially for maintenance. So in Flink you kind of focus on the business logic. We actually did some tuning on the state manager that was necessary, but the tuning that we did applies in the same way across all the applications we built. When users want to build an application, they focus on the business logic that they want, and the state is, I would say, more kind of declarative: you say you want this map, you need this list and this state as part of the state, and Flink will take care of actually making sure that that gets into the checkpoint. >> So the sort of semantics of state management are built in at the compute layer, as opposed to going down to an API for a separate service in other implementations. >> Yeah, yeah. >> Okay. All right, we have just a minute left; tell us about some of the things you're looking forward to doing with Flink, and are they similar to what the dA platform that's coming out from Data Artisans offers, or do you still have a whole bunch of things on the data pipeline that you want to accomplish with just the core functionality?
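The declarative state Enrico mentions looks roughly like this in the DataStream API. This sketch follows the ValueState pattern from Flink's own documentation; the stream contents and names are invented:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RunningSum {
    // Per-key running sum: you declare the state, and Flink stores it,
    // checkpoints it, and restores it on failure. No external store to manage.
    public static class SumFn
            extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        private transient ValueState<Long> sum;

        @Override
        public void open(Configuration cfg) {
            sum = getRuntimeContext().getState(
                new ValueStateDescriptor<>("sum", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out)
                throws Exception {
            long current = (sum.value() == null) ? 0L : sum.value();
            current += in.f1;
            sum.update(current);
            out.collect(Tuple2.of(in.f0, current));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Snapshots of all declared state every 60 seconds: the checkpoints
        // and savepoints mentioned above.
        env.enableCheckpointing(60_000);

        env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L))
           .keyBy(0)
           .flatMap(new SumFn())
           .print();

        env.execute("running sum with managed state");
    }
}
```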
>> Yeah, we definitely, I will say one of the features that we are really excited about is the stream SQL. I see a lot of potential there for new applications; we actually use stream SQL at Yelp and we deploy that as a service, so it makes it easier for users to deploy and to develop stream processing applications. We definitely are planning to expand our Flink deployment into new apps. Especially, one of the things that we try to do is build reusable components, and trying to deploy those reusable components is very coupled with the way we think about our data pipeline. >> Okay, so would it be fair to say that, can you look at the dA platform and say, for companies that are not quite as sophisticated as you, that this is going to make it easier for, you know, mainstream companies to build and deploy, operate? >> I see good potential there. I was looking at the presentation in the morning; I like the integration with Kubernetes for sure, since that's where kind of the current trend for application deployment is going. So yeah, I definitely see potential. I think for Yelp we clearly have a complex enough deployment and service integration that it probably won't be a good fit for us. But companies that are approaching the world of Flink now and don't already have an existing deployment may well give it a try. >> Okay, all right Enrico, we've got to end it there, but that was very helpful, and thanks for stopping by. >> Thanks for having me here. >> Okay. And this is George Gilbert. We are at Flink Forward, the Data Artisans conference for the Apache Flink community, and we will be right back after this short break. (upbeat music)
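As a rough sketch of the stream SQL Enrico is excited about (table and field names are invented, and the Table API surface shown is the Flink 1.5-era one, so details may differ by version):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class StreamSqlExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        DataStream<Tuple2<String, String>> clicks = env.fromElements(
            Tuple2.of("alice", "/biz/123"),
            Tuple2.of("bob", "/biz/456"),
            Tuple2.of("alice", "/biz/789"));

        // Expose the stream as a table and query it with plain SQL.
        tableEnv.registerDataStream("clicks", clicks, "userName, url");
        Table counts = tableEnv.sqlQuery(
            "SELECT userName, COUNT(url) AS cnt FROM clicks GROUP BY userName");

        // A grouped aggregate over an unbounded stream produces updates,
        // hence a retract stream rather than append-only output.
        tableEnv.toRetractStream(counts, Row.class).print();
        env.execute("stream SQL");
    }
}
```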

Published Date : Apr 11 2018


Holden Karau, Google | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's the Cube, covering Flink Forward, brought to you by Data Artisans. (tech music) >> Hi, this is George Gilbert, we're at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very eminent guest, with a long pedigree, Holden Karau, formerly of IBM, and of Apache Spark fame, putting Apache Spark and Python together. >> Yes. >> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors. >> Yeah. >> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced. Yes. >> So, Holden, tell us about the objective of putting these together. What type of use cases.... >> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot of people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job to do similar data preparation for when they want to make real predictions. And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we, like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models. >> Okay, so help us tie that back to, in this case, Flink. >> Yes. >> And also to clarify, that data prep.... My impression was data prep was a different activity. It was like design time, and serving was run time. But you're saying that they can be better integrated? >> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time. >> So, by having a common Beam API and the common providers underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving. >> Yes, very true.
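The shape of Holden's point, sketched in Beam's Java SDK rather than the tf.Transform Python API she is describing (the Tokenize transform and file paths are invented): one PTransform holds the preparation logic, so the training pipeline and the serving path cannot drift apart.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SharedPrep {
    // One preprocessing definition, written once.
    static class Tokenize extends PTransform<PCollection<String>, PCollection<String>> {
        @Override
        public PCollection<String> expand(PCollection<String> input) {
            return input.apply(MapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> line.toLowerCase().replaceAll("[^a-z0-9 ]", "")));
        }
    }

    public static void main(String[] args) {
        // Training-time pipeline: batch text files in, prepared examples out.
        Pipeline training = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        training
            .apply(TextIO.read().from("gs://my-bucket/training/*.txt"))
            .apply(new Tokenize())   // the same transform...
            .apply(TextIO.write().to("gs://my-bucket/prepared/train"));
        training.run();

        // ...would also be applied at serving time to incoming records before they
        // reach the model, so the two code paths cannot diverge.
    }
}
```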
>> So, tell us what type of customers want to write to the Beam API and have that portability? >> Yeah, so that's a really good question. So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies, but they don't want to be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines; like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs. >> Does Spark Structured Streaming support the Beam API? >> So, right now, the Spark support for Beam is limited. It's in the old DStream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing. Because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see is, like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of cool things that are being done nowadays are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python across these different engines. Flink supports Python, but it's maybe not a first class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go. >> In to support Go. >> Which is a little out of left field. >> So, would it be fair to say that the value of Beam, for potential Flink customers, is they can work and start on Google Cloud Platform. They can start on one of several stream processors. They can move to another one later, and they also inherit the better language support, or bindings, from the Beam API? >> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally. >> So, what type of customers do you see who have moved into production with machine learning? >> So the.... >> And the streaming pipelines? >> The biggest customer that's in production, obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify. >> Without the names, the sort of use cases and the general industry.... >> I don't want to get in trouble. >> Okay. >> I'm just going to ... sorry. >> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to MapReduce? >> Yes, so....
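Runner portability in Beam is mostly a launch-time option rather than a code change, which is what makes the no-lock-in story credible. A minimal sketch (the pipeline body is deliberately trivial; bucket paths are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PortablePipeline {
    public static void main(String[] args) {
        // The runner is chosen at launch time, not in the code:
        //   --runner=FlinkRunner     (a self-managed Flink cluster)
        //   --runner=DataflowRunner  (Google Cloud Dataflow)
        //   --runner=DirectRunner    (local testing)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("gs://my-bucket/input/*.txt"))
         .apply(TextIO.write().to("gs://my-bucket/output/copy"));
        p.run().waitUntilFinish();
    }
}
```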
>> And is that a competitor then to Flink? >> I think Flink and Dataflow can be used in some of the same cases. But, I think they're more complementary. Flink is something you can run on-prem. You can run it with different vendors. And Dataflow is very much like, "I can run this on Google Cloud." And part of the idea with Beam is to make it so that people who want to write Dataflow jobs but maybe want the flexibility to go back to something else later can still have that. Yeah, we couldn't swap in Flink or Dataflow execution engines if we're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? >> George: Okay. >> It's probably one of those, sort of, friendly competitions, where we both push each other to do better and add more features to the respective projects. >> Okay, 30 second question. >> Cool. >> Do you see people building stream processing applications with machine learning as part of it to extend existing apps, or for ground-up new apps? >> Totally. I mostly see it as extending existing apps. This is possibly a bias, just from the people that I talk to. But, going ground-up with both streaming and machine learning at the same time, like, starting both of those projects fresh, is a really big hurdle to get over. >> George: For skills. >> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been king for a long time and streaming is getting its day in the sun. So, we could start seeing people becoming more adventurous and doing both at the same time. >> Holden, on that note, we'll have to call it a day. That was most informative. >> It's really good to see you again. >> Likewise. So this is George Gilbert. We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)

Published Date : Apr 11 2018


Greg Benson, SnapLogic | Flink Forward 2018


 

>> Announcer: Live from San Francisco, it's theCUBE covering Flink Forward brought to you by Data Artisans. (upbeat music) >> Hi this is George Gilbert. We are at Flink Forward, on the ground in San Francisco. This is the user conference for the Apache Flink community. It's the second one in the US, and this is sponsored by Data Artisans. We have with us Greg Benson, who's Chief Scientist at SnapLogic and also professor of computer science at the University of San Francisco. >> Yeah that's great, thanks for havin' me. >> Good to have you. So, Greg, tell us a little bit about how SnapLogic currently sets up its, well, how it builds its current technology to connect different applications. And then talk a little bit about where you're headed and what you're trying to do. >> Sure, sure, so SnapLogic is a data and app integration Cloud platform. We provide a graphical interface that lets you drag and drop. You can open components that we call Snaps and you kind of put them together like Lego pieces to define relatively sophisticated tasks, so that you don't have to write Java code. We use machine learning to help you build out these pipelines quickly, so we can anticipate, based on your data sources, what you are going to need next, and that lends itself to rapid building of these pipelines. We have a couple of different ways to execute these pipelines. You can think of it as sort of this specification of what the pipeline's supposed to do. We have a proprietary engine that we can execute on single nodes, either in the Cloud or behind your firewall in your data center. We also have a mode which can translate these pipelines into Spark code and then execute those pipelines at scale. So, you can do everything from sort of small, low latency processing to sort of larger, batch processing on very large data sets. >> Okay, and so you were telling me before that you're evaluating Flink or doing research with Flink as another option. Tell us what use cases that would address that the first two don't. >> Yeah, good question. I'd love to just back up a little bit. So, because I have this dual role of Chief Scientist and as a professor of Computer Science, I'm able to get graduate students to work on research projects for credit, and then eventually as interns at SnapLogic. A recent project that we've been working on since we started last fall, so we've been working on it for about six or seven months now, is investigating Flink as a possible new back end for the SnapLogic platform. So this allows us to, you know, explore and prototype and just sort of figure out if there's going to be a good match between an emerging technology and our platform. So, to go back to your question: what would this address? Well, so, without going into too much of the technical differences between Flink and Spark, which I imagine have come up in some of your conversations here, because they can solve similar use cases, our experience with Flink is that the code base is easy to work with, both from taking our specification of pipelines and then converting them into Flink code that can run. But there's another benefit that we see from Flink, and that is, whenever any product, whether it's our product or anybody else's product, uses something like Spark or Flink as a back end, there's this challenge, because you're converting something that your users understand into this target, right, this Spark API code or Flink API code. And the challenge there is, if something goes wrong, how do you propagate that back to the users, so the user doesn't have to read log files or get into the nuts and bolts of how Spark really works.
And the challenge there is if something goes wrong, how do you propagate that back to the users so the user doesn't have to read log files or get into the nuts and bolts of how Spark really works. >> It's almost like you've compiled the code, and now if something doesn't work right, you need to work at the source level. >> That's exactly right, and that's what we don't want our users to do, right? >> Right. >> So one promising thing about Flink is that we're able to integrate the code base in such a way that we have a better understanding of what's happening in the failure conditions that occur. And we're working on ways to propagate those back to the user so they can take actionable steps to remedy those without having to understand the Flink API code iself. >> And what is it, then, about Flink or its API that gives you that feedback about errors or you know, operational status that gives you better visibility than you would get in something else like Spark. >> Yeah, so without getting too too deep on the subject, what we have found is, one thing nice about the Flink code base is the core is written in Scala, but there's a lot of, all the IO and memory handling is written in Java and that's where we need to do our primary interfacing and the building blocks, sort of the core building blocks to get to, for example, something that you build with a dataset API to execution. We have found it easier to follow the transformation steps that Flink takes to end up with the resulting sort of optimized, optimized Flink pipeline. Now by understanding that transformation, like you were saying, the compilation step, by understanding it, then we can work backwards, and understand how, when something happens, how to trace it back to what the user was originally trying to specify. >> The GUI specification. >> Yeah. Right. >> So, help me understand though it sounds like you're the one essentially building a compiler from a graphical specification language down to Spark as the, you know, sort of, pseudo, you know, psuedo compile code, >> Yep. >> Or Flink. And, but if you're the one doing that compilation, I'm still struggling to understand why you would have better reverse engineering capabilities with one. >> It just is a matter of getting visibility into the steps that the underlying frameworks are taking and so, I'm not saying this is impossible to do in Spark, but we have found that we've had, it's been easier for us to get into the transformation steps that Flink is taking. >> Almost like, for someone who's had as much programming as a one semester in night school, like a variable and specter that's already there, >> Yeah, that's a good, there you go, yeah, yeah, yeah. >> Okay, so you don't have to go try and you can't actually add it, and you don't have to then infer it from all this log data. >> Now, I should add, there's another potential Flink. You were asking about use cases and what does Flink address. As you know, Flink is a streaming platform, in addition to being a batch platform, and Flink does streaming differently than how Spark does. Spark takes a microbatch approach. What we're also looking at in my research effort is how to take advantage of Flink's streaming approach to allow the SnapLogic GUI to be used to specify streaming Flink applications. 
Initially we're just focused on the batch mode, but now we're also looking at the potential to convert these graphical pipelines into streaming Flink applications, which would be a great benefit to customers who want-- >> George: Real time integration. >> Want to do what Alibaba and all the other companies are doing, but take advantage of it without having to get into the nuts and bolts of the programming. Do it through the GUI. >> Wow, so it's almost like, it's like, Flink, Beam, in terms of abstraction layers, >> Sure. >> And then SnapLogic. >> Greg: Sure, yes. >> Not that you would compile to Beam, but the idea that you would have prep and processing in a real-time pipeline. >> Yes. >> Okay. So that's actually interesting, so that would open up a whole new set of capabilities. >> Yeah and, you know, it follows our, you know, company's vision in allowing lots of users to do very sophisticated things without being, you know, Hadoop developers or Spark developers, or even Flink developers. We do a lot of the hard work of trying to give you a representation that's easier to work with, right, but also allow you to sort of evolve that and debug it, and also eventually get the performance out of these systems. One of the challenges, of course, of Spark and Flink is that they have to be tuned, and so what we're trying to do, using some of our machine learning, is eventually gather information that can help us identify how to tune different types of workflows in different environments. And if we're able to do that in its entirety, then we, you know, we take out a lot of the really hard work that goes into making a lot of these streaming applications both scalable and performing. >> Performing. So this would be, but you would have, to do that, you would probably have to collect, well, what's the term? I guess data from the operations of many customers, >> Right. >> Because, as training data, just as the developer alone, you won't really have enough. >> Absolutely, and that's, so you have to bootstrap that. For our machine learning that we currently use today, we leverage, you know, the thousands of pipelines, the trillions of documents that we now process on a monthly basis, and that allows us to provide good recommendations when you're building pipelines, because we have a lot of information. >> Oh, so you are serving the runtime, these runtime compilations. >> Yes. >> Oh, they're not all hosted on the customer premises. >> Oh, no no no, we do both. So it's interesting, we do both. So you can deploy completely in the cloud, we're a complete SaaS provider for you. Most of our customers though, you know, banks, healthcare, want to run our engine behind their firewalls. Even when we do that, though, we still have metadata that we can get introspection from, sort of anonymized, but we can get introspection into how things are behaving. >> Okay. That's very interesting. Alright, Greg, we're going to have to end it on that note, but uh, you know, I guess everyone stay tuned. That sounds like a big step forward in sort of specification of real time pipelines at a graphical level. >> Yeah, well, it's, I hope to be talking to you again soon with more results. >> Looking forward to it. With that, this is George Gilbert, we are at Flink Forward, the user conference for the Apache Flink, sorry, for the Apache Flink user community, sponsored by Data Artisans, we will be back shortly. (upbeat music)
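The tuning Greg refers to involves knobs like parallelism and the state backend. A minimal sketch of the kind of settings an auto-tuner would have to choose (the checkpoint URI is a placeholder, and the RocksDB backend requires its own dependency):

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TunedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Settings like these vary per workload and environment, which is why
        // recommending them from observed telemetry is attractive.
        env.setParallelism(8);           // how many parallel instances per operator
        env.enableCheckpointing(30_000); // snapshot interval in milliseconds
        env.setStateBackend(             // keep large state in RocksDB, snapshots in HDFS
            new RocksDBStateBackend("hdfs:///flink/checkpoints"));

        env.fromElements(1, 2, 3).print();
        env.execute("tuned job");
    }
}
```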

Published Date : Apr 11 2018


Steven Wu, Netflix | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's theCube, covering Flink Forward, brought to you by Data Artisans. >> Hi, this is George Gilbert. We're back at Flink Forward, the Flink conference sponsored by Data Artisans, the company that commercializes Apache Flink and provides additional application management platforms that make it easy to take stream processing to scale for commercial organizations. We have Steven Wu from Netflix, always a company that is pushing the edge of what's possible, and one of the early Flink users. Steven, welcome. >> Thank you. >> And tell us a little about the use case that was first, you know, applied to Flink. >> Sure, our first use case is a routing job for the Keystone data pipeline. The Keystone data pipeline processes over three trillion events per day, so we have a thousand routing jobs that do some simple filter and projection, but the scale of the routing jobs is a challenge for us, and we recently migrated our routing jobs to Apache Flink. >> And so is the function of a routing job, is it like an ETL pipeline? >> Not exactly an ETL pipeline, but more like, it's a data pipeline to deliver data from the producers to the data sinks where people can consume those data, like Elasticsearch, Kafka, or Hive. >> Oh, so almost like the source and sink with a hub in the middle? >> Yes, that is exactly- >> Okay. >> That's the one with our big use case. And the other thing is our data engineers, they also need some stream processing today to do data analytics, so their jobs can be stateless or stateful. If it's a stateful job, it can be as big as a terabyte of state for a single job. >> So tell me, what do these stateful jobs, what are some of the things that you use state for? >> So, for example, a session of user activity. Like, if you have clicked a video online, all those activities would need to be sessionalized into windows; yeah, those are the typical states. >> OK, and what sort of calculations might you be doing? And which of the Flink APIs are you using? >> So, right now we're using the DataStream API, so a little bit low level; we haven't used Flink SQL yet, but it's on our roadmap, yeah. >> OK, so what does the DataStream API, you know, down closer to the metal, what does that give you control over, right now, that is attractive? And will you have as much control with the SQL API? >> OK, yes, so the low-level DataStream API can give you the full feature set of everything. The high-level SQL is much easier to use, but obviously the feature set is more limited. Yeah, so that's a trade-off there. >> So, tell me about, for a stateful application, is there sort of scaffolding about managing this distributed cluster that you had to build, that you see coming down the pipe from Flink and Data Artisans, that might make it easier, either for you or for mainstream customers? >> Sure, I think internal state management, I think that is where Flink really shines compared to other stream processing engines. So they do a lot of work underneath already. I think the main thing we need from Flink in the near future is regarding the job recovery performance. But the state management API is very mature. Flink is, I think, more mature than most of the other stream processing engines. >> Meaning like Kafka, Spark.
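The sessionization Steven describes, in DataStream API terms: a hedged sketch using event-time session windows (the event shape, the 30-minute gap, and the watermark bound are all assumptions for illustration):

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionizeClicks {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // (userId, eventTimeMillis, count); a real job would read these from Kafka.
        DataStream<Tuple3<String, Long, Integer>> clicks = env.fromElements(
            Tuple3.of("user-1", 1_000L, 1),
            Tuple3.of("user-1", 2_000L, 1),
            Tuple3.of("user-2", 1_500L, 1));

        clicks
            // Event time drives the windows; tolerate events up to 10s out of order.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Integer>>(
                        Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(Tuple3<String, Long, Integer> e) {
                        return e.f1;
                    }
                })
            .keyBy(0)
            // A session closes after 30 minutes of inactivity for a given user.
            .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
            .sum(2) // events per session; this is the state Flink keeps and checkpoints
            .print();

        env.execute("sessionized user activity");
    }
}
```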
So, in the state management, can a business user or business analyst issue a SQL query across the cluster, and Flink figures out how to manage the distribution of the query and the filtering and presentation of the results transparently across the cluster? >> I'm not an expert on Flink SQL, but I think yes, essentially Flink SQL will convert to a Flink job which will be using the DataStream API, so it will manage the state, yes, but, >> So, when you're using the lower level DataStream API, you have to manage the distributed state and sort of retrieving and filtering, but that's something at a higher level abstraction, hopefully that'll be, >> No, I think that in either case, the state management is handled by Flink. >> Okay. >> Yeah. >> Distributed. >> All the state management, yes. >> Even if it's querying at the data stream level? >> Yeah, but if you query at the SQL level, you won't be able to deal with those state APIs directly. You can still do windowing; let's say you have a SQL app doing a window, with sessions by idle time. That would be translated to a Flink job, and Flink will manage those windows, manage that session state, so you do not need to worry about it. Either way you do not need to worry about state management; Apache Flink takes care of it. >> So tell me, some of the other products you might have looked at, is the issue that if they have a clean separation from the storage layer, for large scale state management, you know, as opposed to in-memory, is it that the large scale is almost treated like a second tier, and therefore you almost have a separate set, or a restricted set, of operations at the distributed state level versus at the compute level? Would that be a limitation of other stream processors? >> No, I don't see that. I think different stream engines have taken different approaches. You find, like, Google Cloud Dataflow, they are thinking about using Bigtable, for example. But those are external state management. Flink decided to take the approach of embedded state management inside of Flink. >> And when it's external, what's the trade-off? >> That's a good question. I think if it's external, the latency may be higher and your throughput might be a little lower, because you're going over the network. But the benefit of that external state management is that now your job becomes stateless, and that makes the recovery much faster after a job failure, so there's a trade-off there. >> OK. >> Yes. >> OK, got it. Alright, Steven, we're going to have to end it on that, but that was most enlightening, and thanks for joining. >> Sure, thank you. >> This is George Gilbert, for Wikibon and theCube; we're again at Flink Forward in San Francisco with Data Artisans, we'll be back after a short break. (techno music)

Published Date : Apr 11 2018


Vikram Bhambri, Dell EMC - Dell EMC World 2017


 

>> Narrator: Live from Las Vegas, it's theCUBE. Covering Dell EMC World 2017, brought to you by Dell EMC. >> Okay, welcome back everyone, we are live in Las Vegas for Dell EMC World 2017. This is theCUBE's eighth year of coverage of what was once EMC World, now it's Dell EMC World 2017. I'm John Furrier at SiliconANGLE, and also my cohost from SiliconANGLE, Paul Gillin. Our next guest is Vikram Bhambri, who is the Vice President of Product Management at Dell EMC. Formerly with Microsoft Azure, knows cloud, knows ViPR, knows the management, knows storage up and down, the Emerging Technologies Group, formerly of EMC. Good to see you on theCUBE again. >> Good to see you guys again. >> Okay, so Elastic Cloud Storage, this is going to be the game changer. We're so excited; one of our favorite interviews was your colleague we had on earlier. Unstructured data, object store, is becoming super valuable. And it was once the throwaway, "yeah, store it, deal with it later." Now, with apps and data-driven enterprises, having access to data is the value proposition that they're all driving towards. >> Absolutely. >> Where are you guys with making that happen and bringing that data to life? >> So, when I think about object storage in general, people talk about, it's the S3 protocol, or it's the object protocol versus the file protocol. I think the conversation is not about that. The conversation is about, the data of the universe is increasing, and it's increasing tremendously. We're talking about 44 zettabytes of data by 2020. You need an easier way to consume, store, that data in a meaningful way, and not only just that, but being able to derive meaningful insights out of it, either when the data is coming in or, when the data is stored, on a periodic basis being able to drive value. So having access to the data at any point of time, anywhere, is the most important aspect of it. And with ECS we've been able to actually attack the market from both sides. Whether it's talking about moving data from higher cost storage arrays or higher performance tiers down to a more accessible, cheaper storage that is available geographically, that's one market. And then also you have tons of data that's available on tape drives, but that data is so difficult to access, so not available. And if you want to go put that tape back on an actual active system, the turnaround time is so long. So being able to turn all of that storage into an active storage system that's accessible all the time is the real value proposition that we have to talk about. >> Well now help me understand this, because we have all these different ways to make sense of unstructured data now. We have NoSQL databases, we have JSON, we have HDFS, and we've got object storage. Where does it fit into the hierarchy of making sense of unstructured data? >> The simplest way to think about it is, we talk about a data ocean, with the amount of data that's growing. Having the capability to store data that is in a global content repository. That is accessible-- >> Meaning one massive repository. >> One massive repository. And not necessarily in one data center, right? It's spread across multiple data centers, it's accessible, available with a single, global namespace, regardless of whether you're trying to access data from location A or location B. But having that data be available through a single global namespace is the key value proposition that object storage brings to bear.
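Because ECS speaks the S3 protocol, standard S3 tooling can simply point at a private endpoint. A hedged sketch with the AWS SDK for Java (the endpoint, credentials, and bucket name are placeholders):

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class EcsS3Example {
    public static void main(String[] args) {
        // Point the standard S3 client at the on-premises ECS endpoint instead of AWS.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                "https://ecs.example.internal:9021", "us-east-1"))
            .withPathStyleAccessEnabled(true)
            .withCredentials(new AWSStaticCredentialsProvider(
                new BasicAWSCredentials("ecs-access-key", "ecs-secret-key")))
            .build();

        // The same global namespace answers whether the request lands at site A or B.
        s3.putObject("content-repository", "events/2017/05/08/event-001.json",
                     "{\"type\":\"example\"}");
        System.out.println(s3.getObjectAsString("content-repository",
                     "events/2017/05/08/event-001.json"));
    }
}
```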
The other part is the economics, which we're able to provide consistently better than what the public clouds offer. You're talking about anywhere between 30 to 48% cheaper TCO than the public clouds, in your own data center, with all the constraints you want applied to it -- whether it's regulated environments, whether you're talking about country-specific clouds and such. That's where it fits well together. But exposing that same data out, whether through HDFS or file, is where ECS differentiates itself from other cloud platforms. Yes, you can go to a Hadoop cluster and do separate data processing, but then you're creating more copies of the same data that you have in your primary storage. So things like that essentially help position object as the global content repository where you can just dump your data and forget about the storage needs.
>> Vikram, I want to ask you about Elastic Cloud Storage -- as you mentioned, ECS. It's been around for a couple of years, and you just announced the ECS Dedicated Cloud. Can you tell me what that is and more about it? Because some people hear "elastic" and think Amazon: "I'll just throw it in object storage in the cloud." What are you guys doing specifically, 'cause you have this hybrid offering.
>> Absolutely.
>> What is this about, can you explain that?
>> Yeah, so if you look at it, there are two extremes, or two paradigms, that people are attracted by. On one side you have public clouds, which give you ease of use: you just swipe your credit card and you're in business. You don't have to worry about the infrastructure, you don't have to worry about where your data is going to be stored. It's just there. And then on the other side you have regulated environments, or environments where you simply cannot move to public clouds, so customers end up putting in ECS, or other object storage for that matter, though ECS is the best.
>> John: Biased, but that's okay.
>> Yeah. Now we are starting to see customers saying, "Can I have the best of both worlds? Can I have a situation where I like the ease of use of the public cloud, but I don't want to be in a shared bathtub environment. I don't want to be in a public cloud environment. I like the privacy that you are able to provide me with this ECS in my own data center, but I don't want to take on the infrastructure management." So for those customers we have launched the ECS Dedicated Cloud service. This is specifically targeted at scenarios where customers have maybe one data center, two data centers, but they want to use the full strength and capabilities of ECS. What we're telling them is, we will actually put the ECS they bought in our data centers; the ECS team will operate and manage that environment for the customer, but they're the only dedicated customer on that cloud. So that means they have their own environment--
>> It's completely secure for their data.
>> Vikram: Exactly.
>> No multi-tenant issues at all.
>> No. And you can have either partial capabilities in our data center, or you can fully host in our data center. So you can do various permutations and combinations, giving customers a lot of flexibility to start at one point and move to the other. They can start with a private cloud and move to a hybrid version, or if they start from the hybrid and want to go back to their own data centers, they can do that as well.
>> Let's change gears and talk about IoT.
You guys launched Project Nautilus -- we also heard about that from your boss two days ago. What is that about? Explain, specifically, what is Project Nautilus?
>> So as I was mentioning earlier, there is a whole universe of data that is now being generated by these IoT devices. Whether you're talking about connected cars, wind sensors, anything that collects a piece of data -- that data needs to be not only stored, but people want to do realtime analysis on it. And today people end up using a combination of 10 different things -- Kafka, Spark, HDFS, Cassandra, DAS storage -- to build a makeshift solution that sort of works, but doesn't really. Or, if you're in the public cloud, you'll end up using some implementation of the Lambda Architecture. But the challenge there is that you're storing the same data in a few different places, and on top of that there is no consistent way of managing and processing the data effectively. So Project Nautilus is our attempt to streamline all of that: allow the streams of data coming from these IoT devices to be processed in realtime, or as batch, in the same solution. And once you've done that processing, you push that data down to a tier -- whether it's Isilon or ECS, depending on the use case. So it simplifies the whole story on realtime analytics. And you don't want to do it in a closed-source way. What we've done is create this new paradigm, or new primitive, called streaming storage, and we are open sourcing it as Project Pravega, which is in the Apache Foundation. Just like there is a common understanding of object and file, we want that same thing for streaming storage--
>> So you guys are active in open source. Explain quickly, many might not know that. Talk about that.
>> So, yeah, as I mentioned, Project Pravega is something we announced at the Flink Forward conference. It's a streaming storage layer which is completely open source, in the Apache Foundation, and we just open sourced it today. We're giving customers the capability to contribute code to it, take their own version, or do whatever they want, like building additional innovation on top. The goal is to make streaming storage a common paradigm just like everything else. In addition, we're partnering with another open source component. There is a company called data Artisans, based out of Berlin, Germany; they have a project called Flink, and we're working with them pretty closely to bring Nautilus to fruition.
>> theCUBE was there, by the way; we covered Flink Forward, one of the--
>> Paul: True streaming engine.
>> Very good, very big open source project.
>> Yeah, we were talking with Jeff Woodrow earlier about software-defined storage -- self-driving storage, as he calls it.
>> Where does ECS fit in the self-driving storage? Is this an important part of what you're doing right now, or is it a different use?
>> Yeah, our vision right from the beginning was that when we built this next generation object storage system, it had to be software first. Not only software first, where a customer can choose their own commodity hardware or we can supply the commodity hardware, but over time building intelligence into that layer of software, so that you can tier data off smartly -- from SSDs to more SATA-based drives.
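To ground the Nautilus-style pipeline described above -- one engine handling a live IoT stream, then tiering results down to long-term storage -- here is a minimal Flink sketch. It uses a Kafka source only because Kafka appears in the makeshift stack Bhambri lists; the topic name, CSV layout, broker address, and output path are assumptions for illustration:

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import java.util.Properties;

    public class IotTieringPipeline {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
            props.setProperty("group.id", "iot-demo");

            env.addSource(new FlinkKafkaConsumer<>(
                    "sensor-events", new SimpleStringSchema(), props))
               // assumed record layout: "deviceId,reading"; key by device
               .keyBy(line -> line.split(",")[0])
               .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
               .reduce((a, b) -> b) // stand-in aggregation: keep latest reading
               // tier processed results down, e.g. to an HDFS head on Isilon/ECS
               .writeAsText("hdfs:///tier/iot-aggregates");

            env.execute("nautilus-style-pipeline");
        }
    }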
Or you can bring in smarts around the metadata search capabilities we introduced recently. Because you now have billions and billions of records stored on ECS, you want ease of search for what specifically you're looking for, so we introduced metadata search. So we're making the storage system, and all of the data services that were usually outside the platform, part of the core platform itself.
>> Are you working with Elasticsearch?
>> Yes, we are using Elasticsearch, more to enable customers who want to get insights about ECS itself. And Nautilus, of course, is also going to integrate with Elasticsearch as well.
>> Vikram, let's wrap this up. Thank you for coming on theCUBE. Bottom line, what's the bottom line message? Quickly, summarize the value proposition. Why should customers be using ECS -- what's the big aha moment, what's the proposition?
>> I would say the value proposition is very simple. Sometimes people talk about lots of complex terms, but it's very simple: sustainably low-cost storage for a wide variety of content in a global content repository is the key value proposition.
>> And used for application developers to tap into? The whole DevOps, data-as-code, infrastructure-as-code movement.
>> Yeah. What we have seen in the majority of the use cases is that customers start with the one use case of archiving. And then they very quickly realize that it's like a Swiss Army knife. You start with archiving, then you move on to application development -- more modern applications, or cloud-native application development. And now with IoT and Nautilus, being able to leverage data from these IoT devices--
>> As I said two days ago, I think this is a huge, important area for agile developers. Having access to data in less than a hundred milliseconds, from any place in the world, is going to be table stakes.
>> ECS -- or, in general, object storage -- has to be part of every important conversation that is happening about digital IT transformation.
>> It sounds like eventually most of the data's going to end up there.
>> Absolutely.
>> Okay, so I'll put ya on the spot. When are we going to see data in less than a hundred milliseconds from any database, anywhere in the fabric of a company, for a developer to call a data ocean and get data back from any database, from any transaction, in less than a hundred milliseconds? Can we do that today?
>> We can do that today; it's available today. The challenge is how quickly enterprises are adopting the technology.
>> John: So they've got to architect it?
>> Yeah.
>> They have to architect it.
>> Paul: If it's all on Isilon.
>> They can pull it -- they can CloudPools it down from Isilon to ECS.
>> True.
>> Yeah.
>> Speed, low latency, is the key to success. Congratulations.
>> Thank you so much.
>> And I love this new object store, love this tier-two value proposition. It's so much more compelling for developers, certainly in cloud native.
>> Vikram: Absolutely.
>> Vikram, here on theCUBE, bringing you more action from Las Vegas. We'll be right back as day three coverage continues here at Dell EMC World 2017. I'm John Furrier with Paul Gillin, we'll be right back.

Published Date : May 10 2017


Stephan Ewen | Flink Forward 2017


 

(click)
>> Welcome, everyone, we're back at the Flink Forward user conference, sponsored by the data Artisans folks. This is the first U.S.-based Flink user conference, and we are on the ground at the Kabuki Hotel in San Francisco. We have a special guest, Stephan Ewen, who is one of the founders of data Artisans and one of the creators of Flink. He is CTO, and he is in a position to shed some unique light on the direction of the company and the product. Welcome, Stephan.
>> Yeah, so you were asking how stream processing -- how Flink and data Artisans -- can help companies, enterprises that want to adopt these kinds of technologies, actually do that. Because if we look at the big internet companies that first adopted these technologies, they had to go through this big process of productionizing them: integrating them with so many other systems, making sure everything fits together and works as one piece. What can we do there? I think there are a few interesting points. Let's maybe start with stream processing in general. Stream processing by itself actually has the potential to simplify many of these setups and infrastructures, per se. There are multiple dimensions to that. First of all, the ability to just more naturally fit what you're doing to what is actually happening. Let me qualify that a little bit. All these companies dealing with big data are dealing with data that is typically continuously produced -- from sensors, from user devices, from server logs, from all these things, right? Which is quite naturally a stream. Processing this with systems that give you the abstraction of a stream is a much more natural fit, so you eliminate big chunks of the pipeline that, for example, do periodic ingestion, turn that into files and data sets, and then run periodic processing over them -- you can get rid of a lot of these things. You get a paradigm that unifies the processing of real-time data and historic data. This by itself is an interesting development that I think many have recognized, and that's why they're excited about stream processing: it helps reduce a lot of that complexity. So that is one side to it. The other side is that there was always this interplay between processing the data and then wanting to do something with the insights, right? You don't process the data just for the fun of processing. Usually the outcome informs something. Sometimes it's just a report, but sometimes it's something that immediately affects how certain services react -- for example, how they classify transactions as fraud, how they send out alerts, how they trigger certain actions. The interesting thing -- and we're going to see a little more of that later in this conference -- is that in the stream processing paradigm there's a very natural way for these online, live applications and the analytical applications to merge, again reducing a bunch of this complexity. Another thing that is happening, which I think is very, very powerful in bringing these kinds of technologies to a broader ecosystem, is how the whole deployment stack is growing. We see more and more users converging onto resource management infrastructures.
YARN was an interesting first step in making it really easy to productionize these systems. But even beyond that -- the uptake of Mesos, the uptake of container engines like (mumbles) -- there's the ability to just package more functionality together out of the box: you pack what you need into a container, put it into a repository, and then various people can bring up these services without having to go through all of the setup and integration work. You get much better templated integration with this kind of technology. So those seem to be helping a lot toward much broader adoption: both stream processing as an easier paradigm with fewer moving parts, and developments in (mumbles) technologies.
>> So let me see if I can repeat back just a summary version, which is: stream processing is more natural to how the data is generated, and so we want to match the processing to how it originates and flows. At the same time, if we do more of that, it becomes a workload or an application pattern that then becomes more familiar to people who didn't grow up in a continuous processing environment. But also, it has a third capability, of reducing the latency between originating or ingesting the data and getting an analysis that informs a decision, whether by a person or a machine. Would that be a--
>> Yeah, you can even go one step further. It's not just about reducing the latency from the analysis to the decision. In many cases you can actually see that the part that does the analysis and the part that makes the decision just merge and become one thing, which means fewer moving parts, less integration work, less maintenance and complexity.
>> Okay, and this would be like, for example, how application databases are taking on the capabilities of analytic databases to some extent, or how stream processors can have machine learning -- whether they're doing online learning, or calling a model that they're going to score in real time, or even a pre-scored model. Is that another example?
>> You can think of those as examples, yeah. A nice way to think about it is to look at what a lot of the analytical applications do versus, let's say, online services that match offers and trades, or that generate alerts. A lot of those are, in some sense, different ways of just reacting to events, right? You receive some real-time data and want to process it, interact with some form of knowledge that you've accumulated over the past, or from some other inputs, and then react to that. That paradigm, which is at the core of stream processing for (mumbles), is so generic that it covers many of these use cases -- both building applications directly (we have actually seen users build a social network on Flink, where the events they receive are a user being created, a user joining a group, and so on) and also the analytics of saying: I have a stream of sensor data, and on certain outliers I want to raise alerts. Once you start thinking about both of them as just handling streams of events in this flexible fashion, they're so similar that it helps to bring together many things.
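A small sketch of the unification Ewen describes -- the same pipeline running over historic data or a live stream, with only the source swapped. The APIs are from Flink's DataStream layer (assumed 1.x-era); the file path, topic, broker address, and filter are illustrative:

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import java.util.Properties;

    public class UnifiedPipeline {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            boolean live = args.length > 0 && args[0].equals("--live");

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092"); // placeholder

            // Historic data is just a bounded stream; the rest of the
            // pipeline is identical either way.
            DataStream<String> events = live
                ? env.addSource(new FlinkKafkaConsumer<>(
                      "events", new SimpleStringSchema(), props))
                : env.readTextFile("hdfs:///archive/events");

            events.filter(e -> e.contains("ERROR")).print();
            env.execute("same-logic-historic-or-live");
        }
    }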
>> So, that sounds like it would play into the notion of microservices, where each service is responsible for its own state and they communicate with each other asynchronously, so you have a cooperating collection of components. Now, there are a lot of people who grew up with databases sharing the state among modules of applications. What might drive the growth of this new pattern, the microservices one, considering that there are millions of people who just know how to use databases to build apps?
>> The interesting part, I think, that drives this new adoption is that it's such a natural fit for the microservice world. How do you deploy microservices with state, right? You can have a central database with which you work -- and every time you create a new service you have to make sure it fits with the capacities and capabilities of the database, and that the group that runs the database is okay with the additional load. Or you can go to the other model, where each microservice comes with its own database -- but then, every time you deploy one, and that may be a new service or just an experiment with a different variation of the service you're testing, you'd have to bring up a completely new thing. In the world of stateful stream processing as done by Flink, the state is embedded directly in the processing application. So you don't worry about this thing separately; you just deploy that one thing, and it brings both together, tightly integrated. It's a natural fit: the working set of your application travels with your application. If it's deployed, if it's (mumbles), if you bring it down, these things go away. The central part in all this is nothing more than, if you wish, a backup store, which takes these snapshots of the microservices and stores them -- in order to recover from catastrophic failures, in order to have a historic version to look into if you figure out later that something happened ("was this introduced in the last week? let me look at what it looked like the week before"), or to just migrate the service to a different cluster.
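Operationally, the "snapshots as a backup store" idea corresponds to Flink's savepoints. A rough sketch of the workflow with Flink's command-line client -- the job ID, paths, and jar name are placeholders:

    # Take a consistent snapshot of a running stateful job
    bin/flink savepoint <jobId> hdfs:///flink/savepoints

    # Later -- after an upgrade, or on a different cluster -- resume from it
    bin/flink run -s hdfs:///flink/savepoints/<savepoint-dir> my-service.jar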
Handling (mumbles), and we're actually looking into what APIs users want in this area. Flink has, I think, pretty stellar stream processing APIs, and as you've seen in the last release, we've started adding more low-level APIs, where you don't think of streams as distributed collections and windows, but just in terms of the very basic ingredients: events, state, time and snapshots. So, more control and more flexibility, by taking the basic building blocks directly rather than the higher-level abstractions. I think you can expect more evolution on that layer, definitely in the near future.
>> Alright, Stephan, we have to leave it at that, and hopefully we'll pick up the conversation not too long in the future. We are at the Flink Forward conference at the Kabuki Hotel in San Francisco, and we will be back with more just after a few moments. (funky music)
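The low-level ingredients Ewen names -- events, state, time -- surface directly in Flink's ProcessFunction layer. A minimal sketch (a later Flink 1.x API; the 60-second threshold and names are illustrative) that raises an alert when a key goes quiet:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Emits an alert if a key receives no events for 60 seconds.
    public class InactivityAlert extends KeyedProcessFunction<String, String, String> {
        private transient ValueState<Long> lastSeen; // state, checkpointed by Flink

        @Override
        public void open(Configuration parameters) {
            lastSeen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastSeen", Long.class));
        }

        @Override
        public void processElement(String event, Context ctx,
                                   Collector<String> out) throws Exception {
            long now = ctx.timerService().currentProcessingTime();
            lastSeen.update(now); // event updates state...
            // ...and time: schedule future work relative to this event
            ctx.timerService().registerProcessingTimeTimer(now + 60_000);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx,
                            Collector<String> out) throws Exception {
            Long seen = lastSeen.value();
            if (seen != null && timestamp >= seen + 60_000) {
                out.collect("key " + ctx.getCurrentKey() + " quiet for 60s");
            }
        }
    }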

Published Date : Apr 15 2017


Jamie Grier | Flink Forward 2017


 

>> Welcome back, everyone, we're at the Flink Forward conference. This is the user conference for the Flink community, started by data Artisans and sponsored by data Artisans. We're at the Kabuki Hotel in San Francisco, and we have with us another special guest, Jamie Grier, who's Director of Applications Engineering at data Artisans. Jamie, welcome.
>> Thanks.
>> So we've seen an incredible pace of innovation in the Apache open source community, and as soon as one technology achieves mainstream acceptance, it sort of gets blown away by another one, like MapReduce and Spark. There's an energy building around Flink; help us understand where it fits relative to, not necessarily things it's replacing, so much as things it's complementing.
>> Sure. Really, what Flink is is a real stream processor -- a stateful stream processor. The reason I say it's a real stream processor is that the model, the computation model, the way the engine works, the semantics of the whole thing, are the continuous programming model. That means you consume events one at a time, you can update any sort of data structures you want -- which Flink manages, fault-tolerantly, at scale -- and you can do flexible things with processing with regard to time: scheduling things to happen at different times, when certain amounts of data are complete, et cetera. So it's not oriented strictly towards analytics -- a lot of stream processing in the past has been oriented towards analytics alone, or that's been the real sweet spot -- whereas Flink as a technology enables you to build much more complex event- and time-driven applications in a much more flexible way.
>> Okay, so let me unpack that a bit.
>> Sure.
>> So what we've seen in the Hadoop community for the last however-many years was really an analytic data pipeline: put the data into a data lake, and the hand-offs between the services made it a batch process. We tried to start adding data science and machine learning to it, and it remained pretty much a batch process, 'cause it's in the data lake. And then when we started to experiment with stream processors, their building blocks were all around analytics, so they were basically an analytic pipeline. If I'm understanding you, you handle not just the analytics but also the update-oriented, or CRUD-oriented, operations -- create, read, update, delete.
>> Yeah, exactly.
>> That you would expect from having a database as part of an application platform.
>> Yeah. I mean, that's all true, but it goes beyond that. Flink, as a stateful stream processor, has in a sense a simple embedded database as part of the stream processor. So yes, you can update that state -- like you said, the CRUD operations on that state -- but it's more than that: you can build any kind of logic you can think of that's driven by consuming events. Consuming events, doing calculations, and emitting events. Analytics is very easily built on top of something as powerful as that, but if you drop down below these higher-level analytics APIs, you truly can build anything you want that consumes events, updates state, and emits events. And especially when there's a time dimension to these things -- sometimes you consume some event, and it means that at some future time you want to schedule some processing to happen.
And these basic primitives really allow you to build -- I tell people all the time, Flink lets you do this consuming of events and updating of data structures of your own choosing, does it fault-tolerantly and at scale -- build whatever you want out of that. And what people are building are things that are truly not expressible as analytics jobs. It's more just building applications.
>> Okay, so let me drill down on that.
>> Sure.
>> Let's take an example app -- whatever it is, I'll let you pick it -- but one where you have to assume that you can update state and you can do analytics, and they're both in the same app, which is what we've come to expect from traditional apps, although they have their shared state in a database outside the application.
>> So a good example is -- I just got done doing a demo, literally just before this -- a trading application. You build a trading engine; it's consuming position information from upstream systems and it's consuming quotes. Quotes are all the bids and all the offers to buy stock at a given price. We have our own positions we're holding within the firm, if we're a bank, and those positions are the state we're talking about. So it says I own a million shares of Apple, I own this many shares of Google, this is the price I paid, et cetera. Then we have a series of complex rules that say: hey, I've been holding this position for a certain period of time -- I've been holding it for a day now -- so I want to more aggressively trade out of it, and I do that by modifying my state, driven by time. More time has gone past, so I lower my ask price; trades are streaming into the system, and I'm trying to make trades more aggressively by lowering the price I'm willing to trade for. These things are all just event-driven applications: the state is your positions in the market, and the time dimension is exactly that -- as you've been holding the position longer, you change your price or your trading strategy to liquidate a little more aggressively. You're using analytics along the way, but none of that is what you'd think of as a typical analytics job or analytics API. You need an API that allows you to build those sorts of flexible event-driven things.
>> And the persistence part, or maybe the transactional part, is: I need to make a decision, as a human or a machine, and record that decision -- and that's why there's a benefit to having the analytics and the database, whatever term we give it, in the same--
>> Co-located.
>> Co-located, yeah, in the same platform.
>> Yeah, there's a bunch of reasons why that's good. That's one of them. Another reason is that when you do things at high scale with high throughput -- say, in that trading system, we're consuming the entire options chain's worth of all the bids and asks, right? It's a load of data, so you want to use a bunch of machines, but you don't want to have to look up your state in some database for every single message. Instead you can shard the input streams -- both input streams -- by the same key, and you end up doing all of your lookup-join-type operations locally, on one machine. So at high scale it's a huge performance benefit. It also allows you to manage that state consistently -- consistent with the input streams.
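A sketch of that co-partitioned "lookup join": positions and quotes are keyed by the same field, so the quote handler reads our position from local Flink state instead of an external database. The types, fields, and trading rule are hypothetical, invented to mirror Grier's description:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    // Hypothetical event types.
    class Position { String symbol; long shares; double askPrice; }
    class Quote    { String symbol; double bid; }

    public class LocalLookupJoin
            extends RichCoFlatMapFunction<Position, Quote, String> {
        private transient ValueState<Position> position;

        @Override
        public void open(Configuration parameters) {
            position = getRuntimeContext().getState(
                new ValueStateDescriptor<>("position", Position.class));
        }

        @Override
        public void flatMap1(Position p, Collector<String> out) throws Exception {
            position.update(p); // state: our current holding for this symbol
        }

        @Override
        public void flatMap2(Quote q, Collector<String> out) throws Exception {
            Position p = position.value(); // local read, no database round trip
            if (p != null && q.bid >= p.askPrice) {
                out.collect("SELL " + p.shares + " " + p.symbol + " @ " + q.bid);
            }
        }
    }

    // Wiring: both streams sharded by symbol, so state stays local:
    // positions.connect(quotes)
    //          .keyBy(p -> p.symbol, q -> q.symbol)
    //          .flatMap(new LocalLookupJoin());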
If you have the data in an external database and a node fails, you need to back up in the input stream a little bit and replay some of the data -- and you also have to be able to back up your state to a point consistent with all of the inputs. If you don't manage that state, you cannot do it. So that's one of the core reasons why stream processors need to have state: so they can provide strong guarantees about correctness.
>> What about some of the other popular stream processors -- when they chose perhaps not to manage state to the same integrated degree that you do, what was their thinking? What trade-off did they make?
>> It was hard. I've worked on previous streaming systems, for a long time actually, and managing all this state in a consistent way is difficult, so the early-generation systems didn't do it, for exactly that reason: let's just put it in the database. But the problem with that is exactly what I just mentioned. In stream processing we tend to talk about exactly-once and at-least-once, and this is actually the source of the problem. If the database is storing your state, you can't really provide these exactly-once-type guarantees, because when you replay some data -- when you back up in the input -- you also have to back up the state, and that's not an operation a database normally offers. When you manage the state yourself in the stream processor, you can manage the input and the state consistently, so you get exactly-once semantics in the face of failure.
>> And what do you give up in not having a shared database, with its 40 years of maturity and scalability, versus having these micro databases distributed around? Is it the shuffling of--
>> You give up a robust external query interface, for one thing. You also give up some things you don't need, like multiple writers and transactions and all that -- you don't need any of that, because in a stream processor, for any given key there's always one writer, so you get a much simpler type of database to support. Those are the main things you really give up. But I would also like to draw a distinction here between state and storage. Flink state is not storage, not long-term storage. It holds the data that's currently in flight and mutable; once it's no longer being mutated, the best practice is to emit it as some sort of event, or sink it into a database, and then it's stored for the long term. So it's really useful to start thinking about the difference between what is state and what is storage -- does that make sense?
>> I think so.
>> So think of counting -- you're doing distributed counting, which is an analytics thing. You're counting by key; the count per key is your state until that window closes and it's not going to be mutated anymore, and then it heads into the database.
>> Got it.
>> Right?
>> Yeah.
>> But that internal, sort of in-flight state is what you need to manage in the stream processor.
>> Okay, so--
>> So it's not a total replacement for a database, it's not that.
>> No no no, but this opens up another thread that I don't think we've heard enough of. Jamie, we're going to pause it here.
>> Okay.
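Grier's counting example, sketched below: the per-key count lives as Flink-managed state only until the window closes, and the emitted result is what would go to long-term storage. Assumed Flink 1.x APIs; print() stands in for a real database sink:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class CountPerKey {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("a", "b", "a", "c", "a")
               .map(key -> Tuple2.of(key, 1L))
               .returns(Types.TUPLE(Types.STRING, Types.LONG))
               .keyBy(t -> t.f0)
               // in-flight state: the running count, held until the window closes
               .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
               .sum(1)
               // storage: the final, immutable count -- this is what you'd
               // persist in a database, not the intermediate state
               .print();

            env.execute("count-per-key");
        }
    }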
>> 'Cause I hope to pick this thread up with you again. The big surprise from the last two interviews, really, is that Flink is not just about low-latency, per-event processing; it's a new way of thinking about applications, beyond the traditional stream processors, where it manages state -- data you want to keep, not just transient data -- and it becomes a new way of building microservices.
>> Exactly, yeah.
>> So on that note, we're going to sign off from the data Artisans user conference, Flink Forward. We're here in San Francisco, on the ground at the Kabuki Hotel. (upbeat music)

Published Date : Apr 14 2017


Xiaowei Jiang | Flink Forward 2017


 

>> Welcome everyone, we're back at the first Flink Forward conference in the U.S. It's the Flink user conference sponsored by data Artisans, the creators of Apache Flink. We're on the ground at the Kabuki Hotel, and we've heard some very high-impact customer presentations this morning, including Uber and Netflix. And we have the great honor to have Xiaowei Jiang from Alibaba with us. He's Senior Director of Research, and what's so special about having him as our guest is that they have the largest Flink cluster in operation in the world that we know of -- and that the Flink folks know of as well. So welcome, Xiaowei.
>> Thanks for having me.
>> So we gather you have a 1,500-node cluster running Flink. Let's unpack how you got there. What were some of the use cases that drove you in the direction of Flink and the complementary technologies you built with?
>> Okay, I'll explain a few use cases. The first use case that prompted us to look into Flink is the classical search ETL case, where we basically need to process all the data that's necessary for the search service. We looked into Flink for that about two years ago. The next case is the A/B testing framework, which is used to evaluate how your machine learning models work. Today we have a few other very interesting cases: we use machine learning to adjust the ranking of search results, to personalize search results in real time and deliver the best results to our users. We also use it to do real-time anti-fraud detection for ads. So these are the typical use cases we are running.
>> Okay, this is very interesting. Because with the ads, and the one before that -- was it fraud?
>> Ads is anti-fraud. Before that is machine learning, real-time machine learning.
>> So for those, low latency is very important. Now, help unpack that. Are you doing the training for these models in a central location and then pushing the models out close to where they're used for the near-real-time decisions? Or is that all run in the same cluster?
>> Yeah, so basically we are doing two things. We use Flink to do real-time feature updates, which change the features in real time, within a few seconds. For example, when a user buys a product, the inventory needs to be updated, and such features get reflected in the ranking of search results in real time. We also use it to do real-time training of the model itself. This becomes important during special events -- for example, on China's Singles Day, which is the largest shopping holiday in China; it already generates more revenue than Black Friday in the United States. On such a day, because things go on sale for almost 50 percent off, user behavior changes a lot, so whatever model you trained before does not work reliably. So it's really nice to have a way to adjust the model in real time, to deliver the best experience to our users. All of this is actually running in the same cluster.
>> Okay, that's really interesting. So it's like you have a multi-tenant solution; that sounds rather resource intensive.
>> Yes.
>> When you're changing a feature, or features, in the models, how do you go through the process of evaluating them and finding out their efficacy before you put them into production?
>> Yeah, so this is exactly the A/B testing framework I mentioned earlier.
>> George: Okay.
>> So we also use Flink to track the metrics, the performance of these models, in real time.
Once the data is processed, we load it into our OLAP system, so we can see the performance of the models in real time.
>> Okay. Very, very impressive. So, explain perhaps why Flink was appropriate for those use cases. Is it because you really needed super low latency, or because you wanted a less resource-intensive streaming engine to support these? What made it fit that sweet spot?
>> Yeah, so search has lots of different products, with lots of different data processing needs. When we looked into all these needs, we quickly realized we needed a compute engine that can do both batch processing and stream processing. And in terms of stream processing, we have a few needs. For example, we really need super low latency. In some cases -- say a product is sold out but is still displayed in your search results -- when users click and try to buy it, they cannot. It's a bad experience. So the sooner you can get the data processed, the better.
>> So near-real-time for you means how many milliseconds?
>> It's usually like a second -- one second, something like that.
>> But that's one second end to end, talking to inventory.
>> That's right.
>> How much time would the model itself have to--
>> Oh, it's very short, yeah.
>> George: In the single-digit milliseconds?
>> It's probably around that. There are some scenarios that require single-digit milliseconds, like the security scenario; that's something we are currently looking into. When you do transactions on our site, we need to detect whether it's a fraudulent transaction, and we want to be able to block such transactions in real time. For that to happen, we really need latency below 10 milliseconds. So when we were looking at compute engines, this was also one of the requirements we thought about. We needed a compute engine able to deliver sub-second latency when necessary, and at the same time able to do batch efficiently. We were looking for a solution that could cover all our computation needs.
>> So one way of looking at it is: many vendors and customers talk about elasticity as in the size of the cluster, but you're talking about elasticity, or scaling, in terms of latency.
>> Yes, latency and the way of doing computation. You can view the security scenario as the strictest latency requirement, and view batch as the most relaxed latency requirement. We want the full spectrum -- each of these is part of the full spectrum. It's possible to use different engines for each scenario, but that means you have to maintain more code bases, which can be a headache. And we believe it's possible to have a single solution that works for all these use cases.
>> So, okay, last question. Help us understand -- for mainstream customers who don't hire the top Ph.D.s out of the Chinese universities, but who have skilled data scientists, just not an unending supply, and who aspire to build solutions like this: tell us some of the trade-offs they should consider, given that the skillset and bench strength is very deep at Alibaba and perhaps not as widely dispersed within a mainstream enterprise. How should they think about the trade-offs in terms of the building blocks for this type of system?
>> Yeah, that's a very good question. We actually thought about this. Initially, we were using the DataSet and DataStream APIs, which are relatively lower-level APIs.
Developing an application with these is reasonable, but it still requires some skill. We wanted a way to make it even simpler -- for example, to make it possible for data scientists to do this. So in the last half year, we spent a lot of time working on the Table API and SQL support, which basically lets you describe your computation logic, your data processing logic, using SQL. SQL is used widely, so a lot of people have experience with it. We are hoping this approach will greatly lower the threshold for people to use Flink. At the same time, SQL is also a nice way to unify stream processing and batch processing: with SQL, you only need to write your processing logic once, and you can run it in different modes.
>> So, okay, this is interesting, because some of the Flink folks say, you know, structured streaming -- which is a table construct with dataframes in Spark -- is not a natural way to think about streaming. And yet the Spark guys say, hey, that's what everyone's comfortable with; we'll live with probabilistic answers instead of deterministic answers, because we might have late arrivals in the data. But it sounds like there's a feeling in the Flink community that you really do want to work with tables, despite their shortcomings, because so many people understand them.
>> So ease of use is definitely one of the strengths of SQL, and the other strength is that it's very declarative. The user doesn't need to say exactly how to do the computation; they just say what they want to get. This gives the framework a lot of freedom in optimization, so users don't need to worry about the hard details of optimizing their code -- let the system do its work. At the same time, I think deterministic results can be achieved in SQL; it just means the framework needs to handle such things correctly in its implementation of SQL.
>> Okay.
>> When using SQL, you are not really sacrificing such determinism.
>> Okay. We'll have to save this for a follow-up conversation, because there's more to unpack there. But Xiaowei Jiang, thank you very much for joining us and imparting some of the wisdom from Alibaba. We are on the ground at Flink Forward, the data Artisans conference for the Flink community, at the Kabuki Hotel in San Francisco, and we'll be right back.
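A small sketch of the write-once, run-streaming-or-batch idea with Flink SQL. Note this uses Table API constructs from Flink releases well after this 2017 interview, and the "clicks" table and datagen connector are purely illustrative:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class SqlUnification {
        public static void main(String[] args) {
            // Swap inStreamingMode() for inBatchMode(): the query below is unchanged.
            TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

            // Illustrative source; in practice this would point at Kafka,
            // a file system, etc. via a real connector.
            tEnv.executeSql(
                "CREATE TABLE clicks (user_name STRING, url STRING) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

            // The same SQL text expresses the logic in either mode.
            Table counts = tEnv.sqlQuery(
                "SELECT user_name, COUNT(*) AS cnt FROM clicks GROUP BY user_name");
            counts.execute().print();
        }
    }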

Published Date : Apr 14 2017
