Tom Kaitchuck, Dell EMC - Flink Forward - #FFSF17 - #theCUBE
>> Hi everyone, welcome back. We are at the Flink Forward user conference sponsored by data Artisans, the creators of Flink. We're here at the Kabuki Hotel in San Francisco. We're on the ground and we're with Tom Kaitchuck, who is senior consulting engineer at Dell EMC. >> Yes. >> You had a fairly exciting announcement to make this morning. Why don't you give us a little background on that? >> Yes, so we're announcing Pravega, which is a new open-source product that Dell's been working on for the last year. We're opening the floodgates on May 10th, and it's going to act as a streaming storage system. >> Okay, so help us square a couple of circles. We all learned over the last couple of years, as Kafka took off and sort of swept large and small enterprises alike by storm, that we thought it was the way to communicate data between applications. But as you were telling me, it still makes assumptions about the conventional hardware it runs on that might be, perhaps, suboptimal. >> Yeah, so I think the difference between what we're doing and what Kafka does just fundamentally comes down to the model. Kafka is a messaging system and its model is built around messages. Ours is a streaming system and we operate fundamentally on a stream. So, when a client sends bytes over the wire, the server does not interpret them at all. It is opaque. It is analogous to a Unix pipe or an HTTP socket. What goes over isn't interpreted, and that gives us the ability to channel that data in. We end up piping it into a long-term archival system, which gives us advantages in terms of storage. So, whereas in a system like Kafka, where you need performance and you need to get high throughput, you're going to basically run on machines that are built for IOPS; they're built for capacity to get data in and get data out, and that works and it's fast, but what it doesn't give you is cheap long-term storage. So, usually what people do is they have a separate system for cheap, long-term storage, and that's usually something like HDFS. So, you end up running a Kafka job that reads out of your Kafka topic and ends up writing to HDFS. What we're doing is building a streaming system that directly takes the stream that's coming in from the user, holds it locally, gives you the ability to stream off of it, the ability to connect to it and listen to it in real time, and gives you strong consistency. But at the same time, the ultimate place where this is stored durably is in your long-term storage, your HDFS, and the advantage of that is that your storage becomes the cheap, dense storage that you're used to configuring for HDFS, so you can configure very long-term storage. So, you can use the same interface to back up, go to last year, and stream forward. The advantage of that is that you don't end up in what I refer to as this sort of accidental lambda architecture, where you built something like a Flink cluster and you say, oh well, this is great, and it connects to the streaming component for Kafka, and we can stream data and we get real-time analytics and we can do all this nice stuff, but then if we have a bug in our code and we need to go back, you actually need to flip to a different connector and deploy a different job to backfill from a different storage system. So, we want to solve that problem. >> Okay, so...
let's frame that. A customer today, a mainstream customer that's been working with Hadoop, would have their data lake, HDFS, and their data, which is sort of a big, sort of old archive, almost. >> Yes. >> And then they would be using perhaps Kafka, either to ingest additional data into the data lake or, perhaps, to extract it for an application that wants to process it with continuous processing or low latency, perhaps. Now, your solution comes in where you want an emphasis on speed and scale, and you're not reformatting the data essentially to hit the disk in a format that's understandable by the file system. Your data is trying to move along in the format of memory, if I'm understanding correctly. So there's a lot less translation going on, and partly because of that, and partly because you have, I'm understanding, higher-capacity storage, you don't have to spill to disk and exercise all that I/O that you would get from expensive disks. >> Right. >> So, HDFS, big data, the Dell EMC solution, much faster data than Kafka, and that makes it a good citizen in a world where you want to build more and more continuous applications where latency, every last bit of latency, is the enemy. >> Yes, yes. So, our goal is to get very low append latency, and that's important because, right now, you can't reasonably do something even analogous to streaming off of HDFS, because the write latency is just too high. You end up calling write with a small bit of data and you're talking 100-plus milliseconds, and then you need to go turn around and read, and your read performance will be very low if you do lots of tiny appends. So, what we give you is a system that lets you do lots of tiny appends very fast, very low latency, but at the same time, the data's ultimately being stored in HDFS. So, you still get the nice bulk storage capacity of HDFS, but without incurring the penalty of all those tiny appends. >> And, just to be clear, with those tiny appends, your system is absorbing whatever volume or velocity is thrown at it. So it handles the back pressure, and then, rather than HDFS sort of backing things up because of its high-latency write path, you're absorbing all that because you're not very resource-intensive, being optimized for speed and capacity. And then you can put it back into the long-term store, HDFS. >> Yes, we can aggregate all these tiny writes into one or two big writes and put them in, yeah. >> So, tell us some of the use cases that you're working on with design partners or... >> Right, so the big one we're working on with data Artisans is we want to get exactly-once semantics in Flink jobs that are derived from one another. So, for example, you have a job and it takes in, say, an order or something, and it processes it and it generates some derivative data. Today, if you want to have exactly-once semantics on a job that's running on that derivative data, it has to be co-located and run with the first job. And that's problematic for a number of reasons. Namely, because in a lot of companies, you don't want to have some secondary job impact the primary one. So, you want something in between that can operate as a buffer there. But right now, there's no way to do that with a streaming pipeline without giving up exactly-once semantics.
And exactly-once semantics is a really big deal for a lot of Flink applications, and so what we can let you do is have one Flink job that runs, produces some output, and then goes into Pravega as a sink, and then Pravega turns around and is a source for another Flink job, and you can still have exactly-once semantics end to end. >> Okay, so it sounds like, just the way Kafka was sort of the source and sink, consumer, producer, through a hub, but once it was handed off to another system, it lost that exactly-once guarantee and, as we said, wasn't necessarily optimized for throughput and capacity, so that's how you guys solved that problem. Okay, so if you were to pick some common applications that have been attacked by, or served by, Kafka and Flink, which ones have certain characteristics that would be most amenable to the Dell EMC solution? >> Anything that requires strong consistency. So the real difference that we have is that we have a strongly consistent application. So, we don't just have this one API that's dealing with events and so on. We actually have this low-level primitive and we're building a lot of APIs on top of it. So, let me give you an example. We have an API that lets you have what we call a state synchronizer. And what that is, is an object that you can hold in memory across a number of machines, and you can perform updates on that object. But it's guaranteed that every process that's performing an update is performing an update on the latest version of that object. So that object is coordinated across a fleet, and everyone sees the same sequence of updates and sees the same object at any given time. And that's a real advantage anywhere you're trying to do something that requires strong consistency. So you can do those sorts of applications, and you can also do things that require transactional semantics. So, one thing that we allow is that when you write data to our output, you can do it transactionally. So, you can have one Pravega stream and coordinate a transaction potentially across different areas of the keyspace that would end up actually on multiple Pravega hosts, and have that atomic consistency where you call commit and all of the writes across all of them go in simultaneously. And that's a big deal for a lot of applications, and you can sort of combine these two primitives, where you have a state object and you have a transaction object, to interlink transactionality with that of an external system. So, you could, for example, say I have a Flink sink that's going to have a couple of different outputs, but one of them is, say, a SQL database, right? And then you could say, I want this output to go to Pravega if, and only if, my transaction to SQL commits. >> Oh, it sounds like you get a freebie of distributed sort of transactions. >> Yes. >> That's very, very interesting, because that's something, that's a handoff that you would expect from a single-vendor solution. Very, very impressive. Alright, Tom, on that note, we're going to have to cut it off, because we are ending our coverage at Flink Forward, the data Artisans user conference, the first one held in the U.S., and we are at the Kabuki Hotel in San Francisco. I'm George Gilbert and we're signing off for this afternoon. Thanks for watching. (bright music)
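To make the chaining pattern Tom describes a bit more concrete, here is a minimal, hedged sketch of what the first half of such a pipeline can look like as a Flink job. It uses a stand-in source and a print sink so that it runs on its own; in the setup described in the interview, the source and sink would instead be the Pravega reader and writer from the Flink-Pravega connector, with Flink checkpointing plus the connector providing the exactly-once hand-off to a second, independently deployed job. The class and job names below are illustrative assumptions, not taken from the interview.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DerivedOrderJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints are what let a transactional sink give exactly-once end to end.
        env.enableCheckpointing(10_000L);

        // Stand-in source; in the described setup this would be a Pravega reader
        // over the stream of incoming orders.
        DataStream<String> orders = env.fromElements("order-1", "order-2", "order-3");

        // Derive something from each order.
        DataStream<String> derived = orders.map(o -> o + "|derived");

        // Stand-in sink; in the described setup this would be a Pravega writer, and a
        // second Flink job would consume that Pravega stream as its source, keeping
        // exactly-once semantics across the hand-off between the two pipelines.
        derived.print();

        env.execute("derived-order-job");
    }
}
```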
Kenneth Knowles, Google - Flink Forward - #FFSF17 - #theCUBE
>> Welcome everybody, we're at the Flink Forward conference in San Francisco, at the Kabuki Hotel. Flink Forward U.S. is the first U.S. user conference for the Flink community, sponsored by data Artisans, the creators of Flink, and we're here with special guest Kenneth Knowles-- >> Hi. >> Who works for Google and who heads up the Apache Beam team, where, just to set context, Beam is the API, or SDK, on which developers can build stream processing apps that can be supported by Google's Dataflow, Apache Flink, Spark, Apex, among other future products that'll come along. Ken, why don't you tell us, what was the genesis of Beam, and why did Google open up sort of the API to it? >> So, I can speak as an Apache Beam PMC member: the genesis came from a combined code donation to Apache from the Google Cloud Dataflow SDK, and there was also already, written by data Artisans, a Flink runner for that, which already included some portability hooks, and then there was also a runner for Spark that was written by some folks at PayPal. And so, sort of, those three efforts pointed out that it was a good time to have a unified model for these DAG-based computational... I guess it's a DAG-based computational model. >> Okay, so I want to pause you for a moment. >> Yeah. >> And generally, we try to avoid being rude and cutting off our guests but, in this case, help us understand what a DAG is, and why it's so important. >> Okay, so a DAG is a directed acyclic graph, and, in some sense, if you draw a boxes-and-arrows diagram of your computation, where you say "I read some data from here," and it goes through some filters, and then I do a join, and then I write it somewhere, these all end up looking like what they call a DAG, just because that is the structure, and all computation sort of can be modeled this way, and in particular, these massively parallel computations profit a lot from being modeled this way as opposed to MapReduce, because the fact that you have access to the entire DAG means you can perform transformations and optimizations and you have more opportunities for executing it in different ways. >> Oh, in other words, because you can see the big picture you can find, like, the shortest path, as opposed to I've got to do this step, I've got to do this step and this step. >> Yeah, it's exactly like that. You're not constrained to, sort of... the person writing the program knows what it is that they want to compute, and then, you know, you have very smart people writing the optimizer and the execution engine. So it may execute in an entirely different way; so for example, if you're doing a summation, right, rather than shuffling all your data to one place and summing there, maybe you do some partial summations, and then you just shuffle accumulators to one place, and finish the summation, right? >> Okay, now let me bump you up a couple levels. >> Yeah. >> And tell us, so, MapReduce was a trees within the forest approach, you know, lots of seeing just what's a couple feet ahead of you. And now we have the big picture that allows you to find the best path, perhaps, is one way of saying it. Tell us though, with Google or with others who are using Beam-compatible applications, what new class of solutions can they build that you wouldn't have done with MapReduce before? >> Well, I guess there's... There's two main aspects to Beam that I would emphasize. There's the portability, so you can write this application without having to commit to which backend you're going to run it on. And there's...
There's also the unification of streaming and batch, which is not present in a number of backends, and Beam as this layer sort of makes it very easy to use sort of batch-style computation and streaming-style computation in the same pipeline. And actually, I said there were two things; the third thing that actually really opens things up is that Beam is not just a portability layer across backends, it's also a portability layer across languages. So, something that really only has preliminary support on a lot of systems is Python, so, for example, Beam has a Python SDK where you write a DAG description of your computation in Python, and via Beam's portability APIs, one of these sort of usually Java-centric engines would be able to run that Python pipeline. >> Okay, so-- >> So, did I answer your question? >> Yes, yes, but let's go one level deeper, which is, if MapReduce, if its sweet spot was web crawl indexing in batch mode, what are some of the things that are now possible with a Beam-style platform that supports Beam, you know, underneath it, that can do this directed acyclic graph processing? >> I guess what I... I'm still learning all the different things that you can do with this style of computation, and the truth is it's just extremely general, right? You can set up a DAG, and there's a lot of talks here at Flink Forward about using a stream processor to do high frequency trading or fraud detection. And those are completely different, even though they're in the same model of computation as, you know, things like crawling the web and doing PageRank over it. Actually, at the moment we don't have iterative computations, so we wouldn't do PageRank today. >> So, is it considered a complete replacement, and then new use cases, for older-style frameworks like MapReduce, or is it a complement for things where you want to do more with data in motion or lower latency? >> It is absolutely intended as a full replacement for MapReduce, yes. Like, if you're thinking about writing a MapReduce pipeline, instead you should write a Beam pipeline, and then you should benchmark it on different Beam backends, right? >> And, so, working with Spark, working with Flink, how are they, in terms of implementing the full richness of the Beam interface, relative to the Google product Dataflow, from which I assumed Beam was derived? >> So, all of the different backends exist in sort of different states as far as implementing the full model. One thing I really want to emphasize is that Beam is not trying to take the intersection of all of these, right? And I think that your question already shows that you know this. We keep sort of a matrix on our website where we say, "Okay, there's all these different features you might want, and then there's all these backends you might want to run it on," and it's sort of, there's can you do it, can you do it sometimes, and notes about that. We want this whole matrix to be: yes, you can use all of the model on Flink, all of it on Spark, all of it on Google Cloud Dataflow. But they all have some gaps and, I guess, yeah, we're really welcoming contributors in that space. >> So, for someone who's been around for a long time, you might think of it as an ODBC driver, where the capabilities of the databases behind it are different, and so the drivers can only support some subset of a full capability.
>> Yeah, I think that there's... so, I'm not familiar enough with ODBC to say absolutely yes, absolutely no, but yes, it's that sort of a thing; it's like the JVM has many languages on it, and ODBC provides this generic database abstraction. >> Is Google's goal with the Beam API to make it so that customers demand a level of portability that goes not just for the on-prem products but for products that are in other public clouds, and sort of pry open the API lock-in? >> So, I can't say what Google's goals are, but I can certainly say that Beam's goals are that nobody's going to be locked into a particular backend. >> Okay. >> I mean, I can't even say what Beam's goals are, sorry, those are my goals, I can speak for myself. >> Is Beam seeing so far adoption by the sort of big consumer internet companies, or has it started to spread to mainstream enterprises, or is it still a little immature? >> I think Beam's still a little bit less mature than that. We're heading into our first stable release, so, we began incubating it as an Apache project about a year ago, and then, around the beginning of the new year, actually right at the end of 2016, we graduated to be an Apache top-level project. So right now we're sort of on the road from... we've become a top-level project, we're seeing contributions ramp up dramatically, and we're aiming for a stable release as soon as possible. Our next release we expect to be a stable API that we would encourage users and enterprises to adopt, I think. >> Okay, and that's when we would see it in production form on the Google Cloud platform? >> Well, so the thing is that the code and the backends behind it are all very mature, but, right now, we're still sort of, like, I don't know how to say it, we're polishing the edges, right? It's still got a lot of rough edges and you might encounter them if you're trying it out right now, and things might change out from under you before we make our stable release. >> Understood. >> Yep. All right. Kenneth, thank you for joining us, and for the update on the Beam project, and we'll be looking for that and seeing its progress over the next few months. >> Great. Thanks for having me. >> With that, I'm George Gilbert, I'm with Kenneth Knowles, we're at the data Artisans Flink Forward user conference in San Francisco at the Kabuki Hotel, and we'll be back after a few minutes.
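For readers who have not seen Beam, here is a rough sketch of the "write once, pick a backend later" idea Kenneth describes, modeled on the standard Beam Java WordCount example. The runner is chosen at launch time (for example --runner=FlinkRunner or --runner=DataflowRunner) rather than in the pipeline code; the output path and class name here are invented for illustration.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortableWordCount {
    public static void main(String[] args) {
        // The runner (Direct, Flink, Spark, Dataflow, ...) comes from the options,
        // e.g. --runner=FlinkRunner; the pipeline definition below does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of("to be or not to be"))
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split(" "))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("word-counts"));

        p.run().waitUntilFinish();
    }
}
```

The same DAG (read, split, count, format, write) can then be benchmarked on different Beam backends, which is exactly the exercise Kenneth recommends in place of writing a MapReduce job.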
Stephan Ewen | Flink Forward 2017
(click) >> Welcome, everyone, we're back at the Flink Forward user conference sponsored by the data Artisans folks. This is the first U.S.-based Flink user conference, and we are on the ground at the Kabuki Hotel in San Francisco. We have a special guest, Stephan Ewen, who is one of the founders of data Artisans and one of the creators of Flink. He is CTO, and he is in a position to shed some unique light on the direction of the company and the product. Welcome, Stephan. >> Yeah, so you were asking about how stream processing, or how Flink and data Artisans, can help companies, enterprises that want to adopt these kinds of technologies, actually do that, despite the fact that, if we look at what the big internet companies that first adopted these technologies had to do, they had to go through this big process of productionizing these things by integrating them with so many other systems, making sure everything fits together, everything kind of works as one piece. What can we do there? So I think there are a few interesting points to that. Let's maybe start with stream processing in general. Stream processing by itself actually has the potential to simplify many of these setups and infrastructures, per se. There are multiple dimensions to that. First of all, the ability to just more naturally fit what you're doing to what is actually happening. Let me qualify that a little bit. All these companies that are dealing with big data are dealing with data that is typically continuously produced, from sensors, from user devices, from server logs, from all these things, right? Which is quite naturally a stream. And processing this with systems that give you the abstraction of a stream is a much more natural fit, so you eliminate bunches of the pipeline that, for example, try to do periodic ingestion, grooming that into files and data sets and periodic processing of that, and you can, for example, get rid of a lot of these things. You kind of get a paradigm that unifies the processing of real-time data and also historic data. So this by itself is an interesting development that I think many have recognized, and that's why they're excited about stream processing, because it helps reduce a lot of that complexity. So that is one side to it. The other side to it is that there was always kind of an interplay between the processing of the data and then wanting to do something with these insights, right? You don't process the data just for the fun of processing. Usually the outcome leads to something. Sometimes it's just a report, but sometimes it's something that immediately affects how certain services react. For example, how they apply their decisions in classifying transactions as fraud, or how to send out alerts, how to trigger certain actions. The interesting thing, then, and we're going to see actually a little more of that later in this conference also, is that in this stream processing paradigm there's a very natural way for these online live applications and the analytical applications to merge together, again reducing a bunch of this complexity. Another thing that is happening that I think is very, very powerful, and is helping (mumbles) in bringing these kinds of technologies to a broader audience, is actually how the whole deployment stack is growing. So we see actually more and more users converging onto resource management infrastructures.
YARN was an interesting first step to make it really easy to productionize that part of those systems, but even beyond that, like the uptake of Mesos, the uptake of container engines like (mumbles), and the ability to just package more functionality together out of the box: you pack what you need into a container, put it into a repository, and then various people can bring up these services without having to go through all of the setup and integration work. It gives you kind of way better templated integration with systems with this kind of technology. So those seem to be helping a lot toward much broader adoption of these kinds of technologies: both stream processing as an easier paradigm, with fewer moving parts, and developments in (mumbles) technologies. >> So let me see if I can repeat back just a summary version, which is: stream processing is more natural to how the data is generated, and so we want to match the processing to how it originates, how it flows. At the same time, if we do more of that, that becomes a workload or an application pattern that then becomes more familiar to more people who didn't grow up in a continuous processing environment. But also, it has a third capability of reducing the latency between originating or ingesting the data and getting an analysis that informs a decision, whether by a person or a machine. Would that be a... >> Yeah, you can even go one step further. It's not just about reducing the latency from the analysis to the decision. In many cases you can actually see that the part that does the analysis and the decision just merge and become one thing, which makes for much fewer moving parts, less integration work, less, yeah, less maintenance and complexity. >> Okay, and this would be like, for example, how application databases are taking on the capabilities of analytic databases to some extent, or how stream processors can have machine learning, whether they're doing online learning or calling a model that they're going to score in real time, or even a pre-scored model, is that another example of where we put? >> You can think of those as examples, yeah. A nice way to think about it is that if you look at a lot of what the analytical applications do versus, let's say, just online services that make offers and trades, or that generate alerts, a lot of those are, in some sense, different ways of just reacting to events, right? You are receiving some real-time data and you just want to process it, interact with some form of knowledge that you accumulated over the past, or some form of knowledge that you've accumulated from some other inputs, and then react to that. That kind of paradigm, which is at the core of stream processing for (mumbles), is so generic that it covers many of these use cases, both building applications directly, as we have actually seen (we have seen users that directly build a social network on Flink, where the events that they receive are, you know, a user being created, a user joining a group and so on), and it also covers the analytics of just saying, you know, I have a stream of sensor data, and on certain outliers I want to raise alerts. It's so similar once you start thinking about both of them as just handling streams of events in this flexible fashion that it helps to just bring together many things.
>> So, that sounds like it would play into the notion of microservices, where the service is responsible for its own state, and they communicate with each other asynchronously, so you have a cooperating collection of components. Now, there are a lot of people who grew up with databases out here sharing the state among modules of applications. What might drive the growth of this new pattern, the microservices, considering that there's millions of people who just know how to use databases to build apps? >> The interesting part that I think drives this new adoption is that it's such a natural fit for the microservice world. So how do you deploy microservices with state, right? You can have a central database with which you work, and every time you create a new service you have to make sure that it fits with the capacities and capabilities of the database; you have to make sure that the group that runs this database is okay with the additional load. Or you can go to the different model where each microservice comes up with its own database, but then, every time you deploy one, and that may be a new service or it may just be experimenting with a different variation of the service to be tested, you'd have to bring out a completely new thing. In this interesting world of stateful stream processing as it's done by Flink, the state is embedded directly in the processing application. So, you actually don't worry about this thing separately; you just deploy that one thing, and it brings both together, tightly integrated, and it's a natural fit, right? The working set of your application goes with your application. If it's deployed, if it's (mumbles), if you bring it down, these things go away. The central part in this thing is nothing more than, if you wish, a backup store, where it would take these snapshots of microservices and store them in order to recover them from catastrophic failures, in order to just have a historic version to look into if you figure out later, you know, something happened, and was this introduced in the last week, let me look at what it looked like the week before, or to just migrate it to a different cluster. >> So, we're going to have to cut things short in a moment, but I wanted to ask you one last question: if, like, microservices are a sweet spot, and sort of near real-time decisions are also a sweet spot for Kafka, what might we expect to see in terms of a roadmap that helps make those, either that generalizes those cases, or that opens up new use cases? >> Yes, so, what we're immediately working on in Flink right now is definitely extending the support in this area for the ability to keep much larger state in these applications, so state that really goes into the multiple terabytes per service, functionality that allows us to manage this, even easier to evolve this, you know. If the application actually starts owning the state and it's not in a centralized database anymore, you start needing a little bit of tooling around this state, similar to the tooling you need in databases, a (mumbles) and all of that, so things that actually make that part easier.
Handling (mumbles), and we're actually looking into what are the APIs that users actually want in this area. So Flink has, I think, pretty stellar stream processing APIs, and if you've seen the last release, we've actually started adding more low-level APIs, one could even think of them as APIs in which you don't think of streams as distributed collections and windows, but just think about the very basic ingredients: events, state, time and snapshots. So, more control and more flexibility, by just taking directly the basic building blocks rather than the more high-level abstractions. I think you can expect more evolution on that layer, definitely in the near future. >> Alright, Stephan, we have to leave it at that, and hopefully pick up the conversation not too long in the future. We are at the Flink Forward Conference at the Kabuki Hotel in San Francisco, and we will be back with more just after a few moments. (funky music)
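As a rough illustration of the basic ingredients Stephan lists (events, state and time), here is a minimal sketch of a Flink ProcessFunction: it keeps a per-key count in managed keyed state and registers an event-time timer after each event, assuming event-time timestamps have been assigned upstream. The class and field names are invented for illustration; it would be applied to a keyed stream, for example stream.keyBy(t -> t.f0).process(new CountWithTimer()).

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Per-key event counter: state is updated on every event, and a timer set one
// minute after each event's timestamp emits the count seen so far when it fires.
public class CountWithTimer extends ProcessFunction<Tuple2<String, Long>, String> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<String> out) throws Exception {
        Long current = count.value();                       // keyed state, checkpointed by Flink
        count.update(current == null ? 1L : current + 1L);
        // time: ask to be called back one minute after this event, in event time
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect("events seen so far for this key: " + count.value());
    }
}
```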
Jamie Grier | Flink Forward 2017
>> Welcome back, everyone. We're at the Flink Forward conference, the user conference for the Flink community, started by data Artisans, the creators of Flink. We're at the Kabuki Hotel in San Francisco, and we have with us another special guest, Jamie Grier, who's Director of Applications Engineering at data Artisans. Jamie, welcome. >> Thanks. >> So we've seen an incredible pace of innovation in the Apache open source community, and as soon as one technology achieves mainstream acceptance, it sort of gets blown away by another one, like MapReduce and Spark. There's an energy building around Flink, so help us understand where it fits relative to, not necessarily things that it's replacing so much as things that it's complementing. >> Sure. Really what Flink is, is a real stream processor, a stateful stream processor. The reason that I say it's a real stream processor is because the model, the computation model, the way the engine works, the semantics of the whole thing are the continuous programming model, which means that, really, you just consume events one at a time, you can update any sort of data structures you want, which Flink manages, fault-tolerantly, at scale, and you can do flexible things with processing with regards to time, scheduling things to happen at different times, when certain amounts of data are complete, et cetera. So it's not oriented strictly towards... a lot of the stream processing in the past has been oriented sort of towards analytics alone, or that's the real sweet spot, whereas Flink as a technology enables you to build much more complex event- and time-driven applications in a much more flexible way. >> Okay, so let me unpack that a bit. >> Sure. >> So what we've seen in the Hadoop community for the last however many years was really an analytic data pipeline: put the data into a data lake, and the hand-offs between the services made it a batch process. We tried to start adding data science and machine learning to it, and it remained pretty much a batch process, 'cause it's in the data lake, and then when we started to experiment with stream processors, their building blocks were all around analytics, and so they were basically an analytic pipeline. If I'm understanding you, you handle not just the analytics but the update-oriented or the CRUD-oriented operations: create, read, update, delete. >> Correct. >> That you would expect from having a database as part of an application platform. >> Yeah. I mean, that's all true, but it goes beyond that. I mean, Flink as a stateful stream processor has, in a sense, a small, simple database as part of the stream processor. So yeah, you can update that state, like you said, the CRUD operations on that state, but it's more than that: you can build any kind of logic at all that you can think of that's driven by consuming events. Consuming events, doing calculations, and emitting events. Analytics is very easily built on top of something as powerful as that, but if you drop down below these higher-level analytics APIs, you truly can build anything you want that consumes events, updates state, and emits events. And especially when there's a time dimension to these things, like sometimes you consume some event and it means that at some future time, you want to schedule some processing to happen.
And these basic primitives really allow you to build... I tell people all the time, Flink allows you to do this consuming of events and updating data structures of your own choosing, does it fault-tolerantly and at scale; build whatever you want out of that. And what people are building are things that are truly not really expressible as analytics jobs. It's more just building applications. >> Okay, so let me drill down on that. >> Sure. >> Let's take an example app, whether it's, I'll let you pick it, but one where you have to assume that you can update state and you can do analytics and they're both in the same app, which is what we've come to expect from traditional apps, although they have their shared state in a database outside the application. >> So a good example is, I just got done doing a demo, literally just before this, and it's a trading application. So you build a trading engine, it's consuming position information from upstream systems and it's consuming quotes. Quotes are all the bids and all the offers to buy stock at a given price. We have our own positions we're holding within the firm if we're a bank, and those positions, that's our state we're talking about. So it says I own a million shares of Apple, I own this many shares of Google, this is the price I paid, et cetera. So then we have some series of complex rules that say, hey, I've been holding this position, say, for a certain period of time, I've been holding it for a day now, and so I want to more aggressively trade out of this position, and I do that by modifying my state, driven by time. So as more time has gone past, I'm going to lower my ask price; now trades are streaming in as well to the system, and I'm trying to more aggressively make trades by lowering the price I'm willing to trade for. So these things are all just event-driven applications: the state is your positions in the market, and the time dimension is exactly that, as you've been holding the position longer, you start to change your price or change your trading strategy in order to liquidate a little bit more aggressively. None of that is in the category of, I'd say... you're using analytics along the way, but none of that is what you'd think of as a typical analytics job or an analytics API. You need an API that allows you to build those sorts of flexible event-driven things. >> And the persistence part, or the maybe transactional part, is I need to make a decision as a human or the machine and record that decision, and so that's why there's benefit to having the analytics and the database, whatever term we give it, in the same... >> Co-located. >> Co-located, yeah, in the same platform. >> Yeah, there's a bunch of reasons why that's good. That's one of them; another reason is because when you do things at high scale and you have high throughput, say in that trading system, we're consuming the entire options chain's worth of all the bids and asks, right? It's a load of data, so you want to use a bunch of machines, but you don't want to have to look up your state in some database for every single message when instead you can shard the input streams, both input streams, by the same key, and you end up doing all of your lookup-join-type operations locally on one machine. So at high scale it's just a huge performance benefit. It also allows you to manage that state consistently, consistent with the input streams.
If you have the data in an external database and a node fails, then you need to sort of back up in the input stream a little bit, replay a little bit of the data, and you have to also be able to back up your state to a point consistent with all of the inputs, and if you don't manage that state, you cannot do it. So that's one of the core reasons why stream processors need to have state: so they can provide strong guarantees about correctness. >> What are some of the other popular stream processors, when they choose perhaps not to manage state to the same integrated degree that you guys do? What was their thinking in terms of, what trade-off did they make? >> It was hard. So I've also worked on previous streaming systems in the past, and for a long time, actually, and managing all this state in a consistent way is difficult, and so the early-generation systems didn't do it, for exactly that reason: let's just put it in the database. But the problem with that is exactly what I just mentioned, and in stream processing we tend to talk about exactly-once and at-least-once; this is actually the source of the problem. So if the database is storing your state, you can't really provide these exactly-once type guarantees, because when you replay some data, you back up in the input, and you also have to back up the state, and that's not really a database operation that's normally available. So when you manage the state yourself in the stream processor, you can consistently manage the input and the state. So you can get exactly-once semantics in the face of failure. >> And what do you trade in not having, what do you give up in not having a shared database that has 40 years of maturity and scalability behind it, versus having these micro databases distributed around? Is it the shuffling of? >> You give up a robust external query interface, for one thing. You give up some things you don't need, like the ability to have multiple writers and transactions and all that stuff; you don't need any of that, because in a stream processor, for any given key there's always one writer, and so you get a much simpler type of database you have to support. What else? Those are the main things you really give up, but I would like to also draw a distinction here between state and storage. Databases are still, obviously... Flink state is not storage, not long-term storage. It's to hold the data that's currently sort of in flight and mutable, until it's no longer being mutated, and then the best practice would be to emit that as some sort of event, or sink it into a database, and then it's stored for the long term. So it's really good to start to think about the difference between what is state and what is storage, does that make sense? >> I think so. >> So think of, you're counting, you're doing distributed counting, which is an analytics thing, you're counting by key; the count per key is your state until that window closes and it's not going to be mutated anymore, then it's headed into the database. >> Got it. >> Right? >> Yeah. >> But that internal, that sort of in-flight state is what you need to manage in the stream processor. >> Okay, so. >> So it's not a total replacement for a database, it's not that. >> No no no, but this opens up another thread that I don't think we've heard enough of. Jamie, we're going to pause it here. >> Okay.
>> 'Cause I hope to pick this thread up with you again, the big surprise from the last two interviews, really, is Flink is not just about being able to do low latency per event processing, it's that it's a new way of thinking about applications beyond the traditional stream processors where it manages state or data that you want to keep that's not just transient and that it becomes a new way of building micro services. >> Exactly, yeah. >> So on that note, we're going to sign off from the Data Artisans user conference, Flink Forward, we're here in San Francisco on the ground at the Kabuki Hotel. (upbeat music)
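To make Jamie's point about sharding both input streams by the same key a little more concrete, here is a minimal, hypothetical Flink sketch: quotes and positions are both keyed by symbol, so each parallel instance holds the positions for "its" symbols in local keyed state and never has to hit an external database per message. The stream contents, symbols and class names are made up for illustration; a real trading engine would obviously carry far more logic.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class LocalLookupJoin {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-ins for the two firehoses in the example: market quotes and our own positions.
        DataStream<Tuple2<String, Double>> quotes = env.fromElements(Tuple2.of("AAPL", 150.0));
        DataStream<Tuple2<String, Long>> positions = env.fromElements(Tuple2.of("AAPL", 1_000_000L));

        // Key both streams by symbol so the lookup join happens locally, per key.
        quotes.connect(positions)
              .keyBy(q -> q.f0, p -> p.f0)
              .process(new CoProcessFunction<Tuple2<String, Double>, Tuple2<String, Long>, String>() {
                  private transient ValueState<Long> position;

                  @Override
                  public void open(Configuration conf) {
                      position = getRuntimeContext().getState(
                              new ValueStateDescriptor<>("position", Long.class));
                  }

                  @Override
                  public void processElement1(Tuple2<String, Double> quote, Context ctx, Collector<String> out) throws Exception {
                      Long held = position.value();          // local, in-flight state for this symbol
                      if (held != null && held > 0) {
                          out.collect("quote " + quote.f1 + " for " + quote.f0 + ", holding " + held);
                      }
                  }

                  @Override
                  public void processElement2(Tuple2<String, Long> pos, Context ctx, Collector<String> out) throws Exception {
                      position.update(pos.f1);               // positions update the keyed state
                  }
              })
              .print();

        env.execute("local-lookup-join");
    }
}
```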
Xiaowei Jiang | Flink Forward 2017
>> Welcome everyone, we're back at the first Flink Forward Conference in the U.S. It's the Flink user conference sponsored by Data Artisans, the creators of Apache Flink. We're on the ground at the Kabuki Hotel, and we've heard some very high-impact customer presentations this morning, including Uber and Netflix. And we have the great honor to have Xiaowei Jiang from Alibaba with us. He's Senior Director of Research, and what's so special about having him as our guest is that they have the largest Flink cluster in operation in the world that we know of, and that the Flink folks know of as well. So welcome, Xiaowei. >> Thanks for having me. >> So we gather you have a 1,500-node cluster running Flink. Let's sort of unpack how you got there. What were some of the use cases that drove you in the direction of Flink and the complementary technologies to build with? >> Okay, I'll explain a few use cases. The first use case that prompted us to look into Flink is the classic search ETL case, where basically we need to process all the data that's necessary for the search service. So we looked into Flink about two years ago. The next case is the A/B testing framework, which is used to evaluate how your machine learning model works. Today we're using it in a few other very interesting cases. For example, we are using it to do machine learning to adjust the ranking of search results, to personalize your search results in real time to deliver the best search results for our users. We are also using it to do real-time anti-fraud detection for ads. So these are the typical use cases we are doing. >> Okay, this is very interesting, because with the ads, and the one before that, was it fraud? >> Ads is anti-fraud. Before that is machine learning, real-time machine learning. >> So for those, low latency is very important. Now, help unpack that. Are you doing the training for these models in a central location and then pushing the models out close to where they're going to be used for the near real-time decisions? Or is that all run in the same cluster? >> Yeah, so basically we are doing two things. We use Flink to do real-time feature updates, which change the features in real time, like in a few seconds. So for example, when a user buys a product, the inventory needs to be updated. Such features get reflected in the ranking of search results in real time. We also use it to do real-time training of the model itself. This becomes important in some special events. For example, on China Singles Day, which is the largest shopping holiday in China, it generates more revenue than Black Friday in the United States already. On such a day, because things go on sale for almost 50 percent off, the user's behavior changes a lot. So whatever model you trained before does not work reliably. So it's really nice to have a way to adjust the model in real time to deliver the best experience to our users. All this is actually running in the same cluster. >> OK, that's really interesting. So, it's like you have a multi-tenant solution; that sounds like it's rather resource-intensive. >> Yes. >> When you're changing a feature, or features, in the models, how do you go through the process of evaluating them and finding out their efficacy before you put them into production? >> Yeah, so this is exactly the A/B testing framework I just mentioned earlier. >> George: Okay. >> So, we also use Flink to track the metrics, the performance of these models, in real time.
Once these data are processed, we upload them into our OLAP system so we can see the performance of the models in real time. >> Okay. Very, very impressive. So, explain perhaps why Flink was appropriate for those use cases. Is it because you really needed super low latency, or that you wanted a less resource-intensive sort of streaming engine to support these? What made it fit that right sweet spot? >> Yeah, so search has lots of different products. They have lots of different data processing needs, so when we looked into all these needs, we quickly realized we actually need a compute engine that can do both batch processing and streaming processing. And in terms of streaming processing, we have a few needs. For example, we really need super low latency. So in some cases, for example, if a product is sold out and is still displayed in your search results, when users click and try to buy, they cannot buy it. It's a bad experience. So, the sooner you can get the data processed, the better. So with-- >> So near real-time for you means, how many milliseconds does the-- >> It's usually like a second. One second, something like that. >> But that's one second end to end, talking to inventory. >> That's right. >> How much time would the model itself have to-- >> Oh, it's very short. Yeah. >> George: In the single-digit milliseconds? >> It's probably around that. There are some scenarios that require single-digit milliseconds. Like a security scenario; that's something we are currently looking into. So when you do transactions on our site, we need to detect if it's a fraudulent transaction. We want to be able to block such transactions in real time. For that to happen, we really need a latency that's below 10 milliseconds. So when we're looking at compute engines, this is also one of the requirements we think about. So we really need a compute engine which is able to deliver sub-second latency if necessary, and at the same time can also do batch efficiently. So we are looking for solutions that can cover all our computation needs. >> So one way of looking at it is, many vendors and customers talk about elasticity as in the size of the cluster, but you're talking about elasticity or scaling in terms of latency. >> Yes, latency and the way of doing computation. So you can view the security scenario as super strict on the latency requirement, and view batch as the most relaxed version of the latency requirement. We want a full spectrum; it's a part of the full spectrum. It's possible that you can use different engines for each scenario, but that means you are required to maintain more code bases, which can be a headache. And we believe it's possible to have a single solution that works for all these use cases. >> So, okay, last question. Help us understand, for mainstream customers who don't hire the top Ph.D.s out of the Chinese universities but who have skilled data scientists, just not an unending supply, and who aspire to build solutions like this: tell us some of the trade-offs they should consider, given that, you know, the skillset and the bench strength is very deep at Alibaba, and it's perhaps not as widely disseminated or dispersed within a mainstream enterprise. How should they think about the trade-offs in terms of the building blocks for this type of system? >> Yeah, that's a very good question. So we actually thought about this. So, initially what we did is we were using the DataSet and DataStream APIs, which are relatively lower-level APIs.
So to develop an application with this is reasonable, but it still requires some skill. So we wanted a way to make it even simpler, for example, to make it possible for data scientists to do this. And so in the last half a year, we spent a lot of time working on the Table API and SQL support, which basically tries to describe your computation logic or data processing logic using SQL. SQL is used widely, so a lot of people have experience in it. So we are hoping that with this approach, it will greatly lower the threshold for people to use Flink. At the same time, SQL is also a nice way to unify stream processing and batch processing. With SQL, you only need to write your processing logic once. You can run it in different modes. >> So, okay, this is interesting, because some of the Flink folks say, you know, structured streaming, which is a table construct with dataframes in Spark, is not a natural way to think about streaming. And yet, the Spark guys say, hey, that's what everyone's comfortable with. We'll live with probabilistic answers instead of deterministic answers, because we might have late arrivals in the data. But it sounds like there's a feeling in the Flink community that you really do want to work with tables despite their shortcomings, because so many people understand them. >> So ease of use is definitely one of the strengths of SQL, and the other strength of SQL is that it's very descriptive. The user doesn't need to say exactly how to do the computation; they just say what they want to get. This gives the framework a lot of freedom in optimization. So users don't need to worry about hard details to optimize their code. It lets the system do its work. At the same time, I think that deterministic things can be achieved in SQL. It just means the framework needs to handle such kinds of things correctly in the implementation of SQL. >> Okay. >> When using SQL, you are not really sacrificing such determinism. >> Okay. This is... we'll have to save this for a follow-up conversation, because there's more to unpack there. But Xiaowei Jiang, thank you very much for joining us and imparting some of the wisdom from Alibaba. We are on the ground at Flink Forward, the Data Artisans conference for the Flink community, at the Kabuki Hotel in San Francisco; and we'll be right back.
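For a sense of what that looks like in practice, here is a small, hedged sketch of Flink SQL used over a stream from the Java Table API. The method names (StreamTableEnvironment.create, createTemporaryView, sqlQuery, toChangelogStream) follow a later version of the Table API than the one being built at the time of this interview, and the data, column names and class name are invented for illustration.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

public class SearchTermCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Stand-in stream of (term, clicks) events; in production this would come
        // from a real source such as a message queue.
        DataStream<Tuple2<String, Integer>> events = env.fromElements(
                Tuple2.of("red shoes", 1),
                Tuple2.of("red shoes", 1),
                Tuple2.of("headphones", 1));

        // Expose the stream as a table, then express the logic declaratively in SQL.
        // The same query text could also run in batch mode over historic data.
        tEnv.createTemporaryView("searches", events, $("term"), $("clicks"));
        Table totals = tEnv.sqlQuery(
                "SELECT term, SUM(clicks) AS total_clicks FROM searches GROUP BY term");

        // The result is a continuously updating (changelog) stream of per-term totals.
        tEnv.toChangelogStream(totals).print();
        env.execute("search-term-counts");
    }
}
```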
Sean Hester | Flink Forward 2017
>> Welcome back. We're at Flink Forward, the user conference for the Flink community, put on by data Artisans, the creators of Flink. We're on the ground at the Kabuki Hotel in Pacific Heights in San Francisco. And we have another special guest from BetterCloud, which is a management company. We have Sean Hester, Director of Engineering. And Sean, why don't you tell us, what brings you to Flink Forward? Give us some context for that. >> Sure, sure. So a little over a year ago we kind of started restructuring our application. We had a spike in our vision, we wanted to go a little bit bigger. And at that point we had done some things that were suboptimal, let's say, as far as our approach to the way we were generating operational intelligence. So we wanted to move to a streaming platform. We looked at a few different options and after pretty much a bake-off, Flink came out on top for us. And we've been using it ever since. It's been in production for us for about six months. We love it, we're big fans, we love their roadmap, so that's why we're here. >> Okay, so let's unpack that a little more. In the bake-off, what were the... So your use case is management. But within that bake-off, what were the criteria that surfaced as the highest priority? >> So for us we knew we wanted to be working with something that was kind of the latest generation of streaming technology. Something that had basically addressed all of the big problems from the Google MillWheel paper, things like managing back pressure, how do you manage checkpointing and restoring of state in a distributed streaming application? Things that we had no interest in writing ourselves after digging into the problem a little bit. So we wanted a solution that would solve those problems for us, and this seemed like it had a really solid community behind it. And again, Flink came out on top. >> Okay, so now, understanding sort of why you chose Flink, help us understand BetterCloud's service. What do you offer customers and how do you see that evolving over time? >> Sure, sure. So you've been calling us a management company, so we provide tooling for IT admins to manage their SaaS applications. So things like the Google Suite, or Zendesk, or Slack. And we give them kind of that single point of entry, the single pane of glass to see everything, see all their users in one place, what applications are provisioned to which users, et cetera. And so we literally go to the APIs of each of our partners that we provide support for, gather data, and from there it starts flowing through the stream as a set of change events, basically. Hey, this user's had a title update or a manager update. Is that meaningful for us in some way? Do we want to run a particular workflow based on that event, or is that something that we need to take into account for a particular piece of operational intelligence? >> Okay, so you dropped in there something really concrete. A change event for the role of an employee. That's a very application-specific piece of telemetry that's coming out of an app. Very different from saying, well, what's my CPU utilization, which'll be the same across all platforms. >> Correct. >> So how do you account for... applications that might have employees in one SaaS app and also employees in a completely different SaaS app, and they emit telemetry or events that mean different things? How do you bridge that? >> Exactly.
So we have a set of teams that's dedicated to just the role of getting data from the SaaS applications and emitting it into the overall BetterCloud system. After that there's another set of teams that's basically dedicated to providing that central, canonical view of a user or group or a... An asset, a document, et cetera. So all of those disparate models that might come in from any given SaaS app get normalized by that team into what we call our canonical model. And that's what flows downstream to the teams that I lead to have operational intelligence run on them. >> Okay, so just to be clear, for our mainstream customers who aren't rocket scientists like you-- (laughs) When they want to make sense of this, what you're telling them is they don't have to be locked into the management solution that comes from a cloud vendor, where they're going to harmonize all their telemetry and their management solutions to work seamlessly across their services and the third party services that are on that platform. What you're saying is you're putting that commonality across apps that you support on different clouds. >> Yes, exactly. We provide kind of the glue, or the homogenization necessary to make that possible. >> Now this may sound arcane, but being able to put in place that commonality implies that there is overlap, complete overlap, for that information, for how to take into account and manage an employee onboarding here and one over there. What happens when, in applications, unlike in the hardware where it's obviously the same no matter what you're doing, what happens in applications where you can't find a full overlap? >> Well, it's never a full overlap. But there is typically a very core set of properties for a user account, for example, that we can work with regardless of what SaaS application we might be integrating with. But we do have special areas, like metadata areas, within our events that are dedicated to the original data fresh from the SaaS application's API, and we can do one-off operations specifically on that SaaS app data. But yeah, in general there's a lot of commonality between the way people model a user account or a distribution group or a document. >> Okay, interesting. And so the role of streaming technology here is to get those events to you really quickly and then for you to apply your rules to identify a root cause or even to remediate, either by advising a person, an administrator, or automatically. >> Yes, exactly. >> And plans for adding machine learning to this going forward? >> Absolutely, yeah. So one of our big asks, as we started casting this vision in front of some of our core customers, was basically: I don't know what normal is. You figure out what normal is and then let me know when something abnormal happens. Which is a perfect use case for machine learning. So we definitely want to get there. >> Running steady state, learning the steady state, then finding anomalies. >> Exactly, exactly. >> Interesting, okay. >> Not there yet but it's definitely on our roadmap. >> And then what about management companies that might say, we're just going to target workloads of this variety, like a big data workload, where we're going to take Kafka, Spark, Hive, and maybe something that predicts and serves, and we're just going to manage that. What trade-offs do they get to make that are different from what you get to make? >> I'm not sure I quite understand the question you're getting at.
>> It's a case where they can narrow the scope of the processes or workloads they're going to model, where it's, say, just big data workloads with some batch and interactive stuff, and they only cover a certain number of products because those are the only ones that fit into that type of workload. >> Oh I gotcha, gotcha. So we kind of designed our roadmap from the get-go knowing that one of our competitive advantages was going to be how quickly we can support additional SaaS applications. So we've actually baked into most of our architecture stuff that's very configuration-driven, let's say, versus hard coded, so that allows us to very quickly kind of onboard new SaaS apps. So I think being able to manage, provision, and run workloads against the 20 different SaaS apps that an admin in a modern workplace might be working with is just so valuable that that's going to win the day eventually. >> Single pane of glass, not at the infrastructure level, but at the application level. >> Exactly, exactly. >> Okay. All right, we've been with Sean Hester of BetterCloud, and we will be right back. We're at the Flink Forward event, sponsored by data Artisans for the Flink user community. The first ever conference in the US for the Flink community. And we'll be back shortly. (electronic music)
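Sean's canonical-model idea, where every SaaS-specific change event is normalized into one shared shape before any operational intelligence runs on it, can be sketched roughly as follows. The class and field names are hypothetical stand-ins, not BetterCloud's actual schema; the point is that each integration owns a small normalizer and everything downstream sees only the canonical type.

```java
import org.apache.flink.api.common.functions.MapFunction;

// A minimal sketch of a canonical change event shared by all SaaS integrations.
// All names here are hypothetical, chosen only to illustrate the shape of the idea.
public class CanonicalUserEvent {
    public String source;       // e.g. "gsuite", "zendesk", "slack"
    public String userId;       // common core: every SaaS app has some user identifier
    public String changedField; // e.g. "title", "manager"
    public String newValue;
    public String rawPayload;   // original API payload kept for app-specific, one-off logic

    // Hypothetical normalizer for one source; each integration team would own one of these.
    public static class FromZendesk implements MapFunction<String, CanonicalUserEvent> {
        @Override
        public CanonicalUserEvent map(String rawJson) {
            CanonicalUserEvent e = new CanonicalUserEvent();
            e.source = "zendesk";
            e.rawPayload = rawJson;
            // Real parsing of the vendor payload would go here; elided in this sketch.
            return e;
        }
    }
}
```

Keeping the raw payload on the canonical event is what lets the special-case, per-app operations Sean mentions still reach the original data without breaking the shared model.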
Chinmay Soman | Flink Forward 2017
>> Welcome back, everyone. We are on the ground at the data Artisans user conference for Flink. It's called Flink Forward. We are at the Kabuki Hotel in lower Pacific Heights in San Francisco. The conference kicked off this morning with some great talks by Uber and Netflix. We have the privilege of having with us Chinmay Soman from Uber. >> Yes. >> Welcome, Chinmay, it's good to have you. >> Thank you. >> You gave a really, really interesting presentation about the pipelines you're building and where Flink fits, but you've also said there's a large deployment of Spark. Help us understand how Flink became a mainstream technology for you, where it fits, and why you chose it. >> Sure. About one year back, when we were starting to evaluate what technology makes sense for the problem space that we are trying to solve, which is near-real-time analytics, we observed that Spark's stream processing is actually more resource intensive than some of the other technologies we benchmarked. More specifically, it was using more memory and CPU, at that time. That's one reason. I actually came from the Apache Samza world; I was on the Samza team at LinkedIn before I came to Uber. We had in-house expertise on Samza and I think the reliability was the key motivation for choosing Samza. So we started building on top of Apache Samza for almost the last one and a half years. But then, we hit the scale where Samza, we felt, was lacking. So with Samza, it's actually tied into Kafka a lot. You need to make sure your Kafka scales in order for the stream processing to scale. >> In other words, the topics and the partitions of those topics, you have to keep the physical layout of those in mind at the message queue level, in line with the stream processing. >> That's right. The parallelism is actually tied to the number of partitions in Kafka. Furthermore, if you have a multi-stage pipeline, where one stage processes data and sends output to another stage, all these intermediate stages, today, again go back to Kafka. So if you want to do a lot of these use cases, you actually end up creating a lot of Kafka topics and the I/O overhead on the cluster shoots up exponentially. >> So when creating topics, or creating consumers that do something and then output to producers, if you do too many of those things, you defeat the purpose of low latency because you're storing everything. >> Yeah. The benefit of it is that it is more robust, because if you suddenly get a spike in your traffic, your system is going to handle it because Kafka buffers that spike. It gives you a very reliable platform, but it's not cheap. So that's why we're looking at Flink. In Flink, you can actually build a multi-stage pipeline and have in-memory queues instead of writing back to Kafka, so it is fast and you don't have to create multiple topics per pipeline. >> So, let me unpack that just a little bit to be clearer. The in-memory queues give you, obviously, better I/O. >> Yes. >> And if I understand correctly, that can absorb some of the backpressure? >> Yeah, so backpressure is interesting. If you have everything in Kafka and no in-memory queues, there is no backpressure because Kafka is a big buffer, it just keeps running. With in-memory queues, there is backpressure. Another question is, how do you handle this? So going back to Samza systems, they actually degrade and can't recover once they are in backpressure. But Flink, as you've seen, it slows down consuming from Kafka, but once the spike is over, once you're over that hill, it actually recovers quickly.
It is able to sustain heavy spikes. >> Okay, so this goes to your issues with keeping up with the growth of data... >> That's right. >> You know, the system, there's multiple levels of elasticity and then resource intensity. Tell us about that and the desire to get as many jobs as possible out of a certain level of resource. >> So, today, we are a platform where people come in and say, "Here's my code." Or, "Here's my SQL that I want to run on your platform." In the old days, they were telling us, "Oh, I need 10 gigabytes for a container," and that they need this many CPUs, and that really limited how many use cases we onboarded and made our hardware footprint pretty expensive. So we need the pipeline, the infrastructure, to be really memory efficient. What we have seen is memory is the bottleneck in our world, more so than CPU. A lot of applications, they consume from Kafka, they actually buffer locally in each container and they do that in the local memory, in the JVM memory. So we need the memory component to be very efficient and we can pack more jobs on the same cluster if everyone is using less memory. That's one motivation. The other thing, for example, that Flink does and Samza also does, is make use of a RocksDB store, which is a local persistent-- >> Oh, that's where it gets the state management. >> That's right, so you can offload from memory onto the disk-- >> Into a proper database. >> Into a proper database, and you don't have to cross a network to do that because it's sitting locally. >> Just to elaborate on what might be, what might seem like, an arcane topic, if it's residing locally, then anything it's going to join with has to also be residing locally. >> Yeah, that's a good point. You have to be able to partition your inputs and your state in the same way, otherwise there's no locality. >> Okay, and you'd have to shuffle stuff around the network. >> And more than that, you'd need to be able to recover if something happens, because there's no replication for this state. If the hard disk on that node crashes, you need to recreate that cache from somewhere. So either you go back and read from Kafka, or you store that cache somewhere. So Flink actually supports this out of the box and it snapshots the RocksDB state into HDFS. >> Got it, okay. It's more resilient--- >> Yes. >> And more resource efficient. So, let me ask one last question. Mainstream enterprises, they, or at least the very largest ones, have been trying to get their arms around some open-source projects. Very innovative, the pace of innovation is huge, but it demands a skillset that seems to be most resident in large consumer internet companies. What advice do you have for them where they aspire to use the same technologies that you're talking about to build new systems, but they might not have the skills? >> Right, that's a very good question. I'll try to answer in the way that I can. I think the first thing to do is understand your scale. Even if you're a big, large banking corporation, you need to understand where you fit in the industry ecosystem. If it turns out that your scale isn't that big and you're using it for internal analytics, then you can just pick the off-the-shelf pipelines and make it work. For example, if you don't care about multi-tenancy, if your hardware spend is not that much, almost anything might actually work. The real challenge is when you pick a technology and make it work for large use cases and you want to optimize for cost.
That's where you need a huge engineering organization. So in simpler words, if the extent of your use cases is not that big, pick something which has a lot of support from the community. Most common things just work out of the box, and that's good enough. But if you're doing a lot of complicated things, like real-time machine learning, or your scale is in billions of messages per day, or terabytes of data per day, then you really need to make a choice: whether you invest in an engineering organization that can really understand these use cases, or you go to companies like Databricks. Get support from Databricks, or... >> Or maybe a cloud vendor? >> Or a cloud vendor, or things like Confluent, which provides Kafka support, things like that. I don't think there is one answer. To me, our use case, for example, the reason we chose to build an engineering organization around that is because our use cases are immensely complicated and not really seen before, so we had to invest in this technology. >> Alright, Chinmay, we're going to leave it at that and hopefully keep the dialogue going-- >> Sure. >> offline. So, we'll be back shortly. We're at Flink Forward, the data Artisans user conference for Flink. We're on the ground at the Kabuki Hotel in downtown San Francisco and we'll be right back.
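The spill-to-disk state and HDFS snapshots Chinmay describes map onto Flink's RocksDB state backend and checkpointing configuration. The sketch below is a minimal, hypothetical setup, not Uber's code: the HDFS path and checkpoint interval are assumptions, and older Flink releases expose the same idea through a differently named RocksDBStateBackend class.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep operator state in a local, spill-to-disk RocksDB store instead of the JVM heap.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Periodically snapshot that local state to durable storage (the HDFS path is an assumption).
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
        env.enableCheckpointing(60_000); // checkpoint every 60 seconds

        // Trivial pipeline so the job graph is complete; a real job would read from Kafka.
        env.fromElements(1, 2, 3).print();
        env.execute("rocksdb-checkpoint-sketch");
    }
}
```

With this configuration, state lives in a local RocksDB instance rather than on the heap, and the periodic snapshots mean a failed task can be rebuilt from the last checkpoint instead of replaying everything from Kafka.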
Dean Wampler Ph.D | Flink Forward 2017
>> Welcome everyone to the first ever U.S. user conference of Apache Flink, sponsored by data Artisans, the creators of Flink. The conference kicked off this morning with some very high-profile customer use cases, including Netflix and Uber, which were quite impressive. We're on the ground at the Kabuki Hotel in San Francisco and our first guest is Dean Wampler, VP of fast data engineering at Lightbend. Welcome Dean. >> Thank you. Good to see you again George. >> So, big picture context setting, Spark exploded on the scene, blew away expectations, even of its creators, with the speed and the deeply integrated libraries, and essentially replaced MapReduce really quickly. >> Yeah. >> So what is behind Flink's rapid adoption? >> Right, I think it's an interesting story and if you'd asked me a year ago, I probably would've said, well I'm not sure we really need Flink, Spark seems to meet all our needs. But, I pretty quickly changed my mind as I got to know about Flink, because it is a broad ecosystem, there's a wide variety of problems people are trying to solve, and what Flink is doing very well is solving low latency streaming, but still at scale, like Spark. Where Spark is still primarily a micro-batch model, so it has longer latency. And Flink has been on the cutting edge too, of embracing some of the more advanced streaming scenarios, like proper handling of late arrival of data, windowing semantics, things like this. So it's really filling an important niche, but a fairly broad niche that people have. And also, not everybody needs the full-featured capabilities of Spark like batch analytics or whatever, and so having one tool that's focused just on processing streams is often a good idea. >> So would that relate to a smaller surface area to learn and to administer? >> I think it's a big part of it, yeah. I mean Spark is incredibly well engineered and it works very well, but it's a bigger system so there's going to be more to run. And there is something very attractive about having a more focused tool with, you know, fewer things to break, basically. >> You mention sort of lower latency and a few fewer bells and whistles. Can you give us some examples of use cases where you wouldn't need perhaps all of the integrated libraries of Spark or the big footprint that gives you all that resilience and, you know, the functional programming that lets you sort of, recreate lineage. Tell us sort of how a customer who's approaching this should pick the trade-offs. >> Right. Well normally when you have a low latency problem, it means you have less time to do work, so you tend to do simpler things, in that time frame. But, just to give you a really interesting example, I was talking with a development team at a bank recently that does credit card authorizations. You click buy on a website and there's maybe a few hundred milliseconds when the user is expecting a reply, right. But it turns out there's so many things going on in that loop, from browser to servers and back, that they only have about ten milliseconds, when they get the data, to make a decision about whether this looks fraudulent or it looks legit, and they make a decision. So ten milliseconds is fairly narrow, which means you have to have your models already done and ready to go. And a quick way to actually apply them, you know, take this data, ask the model is this okay, and get a response.
So, a lot of it kind of boils down to, I would say, one of two things: either I'm doing basic filtering and transforming of data, like raw data coming into my environment, or I have some maybe more sophisticated analytics that are running behind the scenes, and then, in real time, data is coming in and I'm asking questions against those models about this data, like authorizing credit cards. >> Okay, so to recap, the low latency means you have to have perhaps scored your models already. Okay, so trained and scored in the background and then, with this low-latency solution, you can do a look-up, a key-based look-up I guess, to an external store, okay. So how is Lightbend making it simple to put multiple products together seamlessly, which it appears any pipeline essentially has to do? >> That is the challenge. I mean it would be great if you could just deploy Flink, and that was the only thing you needed, or Kafka, or pick any one of them. But of course, the reality is, we always have to integrate a bunch of tools together, and it's that integration that's usually the hard part. How do I know why this thing's misbehaving, when maybe it's something upstream that's misbehaving? That sort of thing. So, we've been surveying the landscape to understand, first of all, what are the tools that seem to be most mature, most vibrant as a community, that address the variety of scenarios people are trying to deal with, some of which we just discussed. And what are the kind of integration problems that you have to solve to make these into reliable systems? So we've been building a platform, called the Fast Data Platform, that's approaching its first beta, that is designed to help solve a lot of those problems for you, so you can focus on your actual business problems. >> And from a customer point of view, would you take end-to-end ownership of that solution, so that if they chose you could manage it On-Prem or in the Cloud, and handle level three support across the stack? >> That's an interesting question. We think eventually we'll get to that point of more of a service offering, but right now most of the customers we're talking to are still more interested in managing things themselves, but without as much of the hassle of doing it all themselves. So what we're trying to balance is tooling that makes it easier to get started quickly and build applications, but also leverages some of the modern, like machine-learning, artificial intelligence stuff to automatically detect and correct for a lot of common problems, and other management scenarios. So at least it's not quite as "you're on your own" as it could be if you were just trying to glue everything together yourself. >> So if I understand, it sounds like the first stage in the journey is, help me rationalize what I'm trying to get to work together On-Prem, and part of that is using machine-learning now, as part of management. And then, over time, this management gets better and better at root-cause analysis and auto-remediation, and then it can move into the Cloud. And these disparate components become part of a single SaaS solution under that management. >> That's the long-term goal, definitely yeah. >> Looking out at where all this intense interest is right now in IoT applications. We can't really send all the data back to the Cloud, get an immediate answer, and then drive an action. How do you see that shaping up in terms of what's on the edge and what's on the Cloud?
>> Yeah, that's a really interesting question, and there are some particular challenges, because a lot of companies will migrate to the Cloud in a piecemeal fashion, so they've got a sort of hybrid deployment scenario with things On-Premise and in the Cloud, and so forth. One of the things you mentioned that's pretty important is: I've got all this data coming in, how do I capture it reliably? So, tools like Kafka are really good for that, and Pravega, that Strachan from EMC mentioned, is sort of filling the same need, that I need to capture stuff reliably, serve downstream consumers, make it easy to do analytics over this stream that looks a lot different than a traditional database, where it's kind of data at rest, it's not static, but it's not moving. So, that's one of the things you have to do well, and then figure out how to get that data to the right consumer, and account for all of the latencies, like if I needed that ten millisecond credit card authorization, but I had data split over my On-Premise and my Cloud environment, you know, that would not work very well. So that kind of data-flow architecture becomes really important. >> Do you see Lightbend offering that management solution that enforces SLAs, or do you see sourcing that technology from others and then integrating it tightly with the particular software building blocks that make up the pipeline? >> It's a little of both. We're sort of in the early stages of building services along those lines. Some of the technology we've had for a while, our Akka middleware system, and the streaming API on top of it would be a really good basis for that kind of platform, where you can think about SLA requirements and trading off performance, or whatever, versus getting answers in a reasonable time, good recovery and error scenarios, stuff like that. So it's all early days, but we are thinking very hard about that problem, because ultimately, at the end of the day, that's what customers care about. They don't care about Kafka versus Spark, or whatever. They just care about, I've got data coming in, I need an answer in ten milliseconds or I lose money, and that's the kind of thing that they want you to solve for them, so that's really what we have to focus on. >> So, last question before we have to go, do you see potentially a scenario where there's one type of technology on the edge, or many types, and then something more dominant in the Cloud, where basically you do more training, model training, and out on the edge you do the low latency predictions or prescriptions? >> That's pretty much the architecture that has emerged. I'm going to talk a little bit about this today, in my talk, where, like we said earlier, I may have a very short window in which I have to make a decision, but it's based on a model that I have been building for a while and I can build in the background, where I have more tolerance for the time it takes. >> Up in the Cloud? >> Up in the Cloud. Actually this is kind of independent of the deployment scenario, but it could be both like that, so you could have something that is closer to the consumer of the data, maybe in the Cloud, and deployed in Europe for European customers, but it might be working with systems back in the U.S.A. that are doing the heavy lifting of building these models and so forth. We live in such a world where you can put things where you want, you can move things around, you can glue things together, and a lot of times it's just knowing what's the right combination of stuff.
>> Alright Dean, it was great to see you and to hear the story. It sounds compelling. >> Thank you very much. >> So, this is George Gilbert. We are on the ground at Flink Forward, the data Artisans user conference for the Flink product, and we will be back after this short break.
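The pattern Dean sketches for the bank, train the model offline and have the streaming job only apply it so each event can be scored inside a roughly ten-millisecond budget, looks something like the following in Flink. This is a hedged, hypothetical illustration, not Lightbend's or the bank's code; the model here is a placeholder function loaded once per task so the per-event path stays entirely in memory.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class ScoreTransactions extends RichMapFunction<String, Double> {

    // Hypothetical model holder; a real job might wrap PMML, a serialized tree ensemble, etc.
    private transient java.util.function.ToDoubleFunction<String> model;

    @Override
    public void open(Configuration parameters) {
        // Load or rebuild the pre-trained model once per task, not per event,
        // so the per-event scoring path stays within the millisecond budget.
        model = txn -> txn.contains("suspicious") ? 0.9 : 0.1; // placeholder scoring logic
    }

    @Override
    public Double map(String rawTransaction) {
        return model.applyAsDouble(rawTransaction); // in-memory scoring, no network hop
    }
}
```

In a real pipeline this function would sit between a source carrying raw transactions and a sink that approves or blocks them, with the actual model artifact produced by a separate, slower training job that runs in the background, as the interview describes.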