Steven Wu, Netflix | Flink Forward 2018

>> Narrator: Live from San Francisco, it's theCube, covering Flink Forward, brought to you by Data Artisans.
>> Hi, this is George Gilbert. We're back at Flink Forward, the Flink conference sponsored by Data Artisans, the company that commercializes Apache Flink and provides additional application management platforms that make it easy to take stream processing to scale for commercial organizations. We have Steven Wu from Netflix, always a company that is pushing the edge of what's possible, and one of the early Flink users. Steven, welcome.
>> Thank you.
>> And tell us a little about the use case that was first, you know, applied to Flink.
>> Sure, our first use case is the routing job for the Keystone data pipeline. The Keystone data pipeline processes over three trillion events per day, and we have a thousand routing jobs that do some simple filtering and projection, but the scale of the routing jobs is a challenge for us, and we recently migrated our routing jobs to Apache Flink.
>> And so is the function of a routing job, is it like an ETL pipeline?
>> Not exactly an ETL pipeline. It's more like a data pipeline to deliver data from the producers to the data sinks where people can consume that data, like Elasticsearch, Kafka, or Hive.
>> Oh, so almost like the source and sink with a hub in the middle?
>> Yes, that is exactly-
>> Okay.
>> That's the one, our big use case. And the other thing is our data engineers, they also need stream processing today to do data analytics, so their jobs can be stateless or stateful. If it's a stateful job, it can be as big as a terabyte of state for a single job.
>> So tell me about these stateful jobs, what are some of the things that you use state for?
>> So, for example, a session of user activity: if you have clicked a video in the online UI, all that activity needs to be sessionized into windows. Those session windows are the typical state.
>> OK, and what sort of calculations might you be doing? And which of the Flink APIs are you using?
>> So, right now we're using the DataStream API, so a little bit low level. We haven't used Flink SQL yet, but it's on our road map, yeah.
>> OK, so what is the DataStream API, you know, down closer to the metal, what does that give you control over, right now, that is attractive? And will you have as much control with the SQL API?
>> OK, yes, so the low-level DataStream API gives you the full feature set. The high-level SQL is much easier to use, but obviously the feature set is more limited. Yeah, so that's the trade-off there.
>> So, tell me, for a stateful application, is there sort of scaffolding for managing this distributed cluster that you had to build, that you see coming down the pipe from Flink and Data Artisans, that might make it easier, either for you or for mainstream customers?
>> Sure, I think internal state management is where Flink really shines compared to other stream processing engines. They do a lot of work underneath already. I think the main thing we need from Flink in the near future is regarding job recovery performance. But the state management API is very mature. I think Flink is more mature than most of the other stream processing engines.
>> Meaning like Kafka, Spark. So, in the state management, can a business user or business analyst issue a SQL query across the cluster, and Flink figures out how to manage the distribution of the query and the filtering and presentation of the results transparently across the cluster?
>> I'm not an expert on Flink SQL, but I think yes. Essentially, Flink SQL will be converted to a Flink job that uses the DataStream API, so they will manage the state, yes, but-
>> So, when you're using the lower-level DataStream API, you have to manage the distributed state and sort of retrieving and filtering, but that's something that at a higher-level abstraction, hopefully that'll be-
>> No, I think that in either case the state management is handled by Flink.
>> Okay.
>> Yeah.
>> Distributed.
>> All the state management, yes.
>> Even if it's querying at the DataStream level?
>> Yeah, but if you query at the SQL level, you won't be able to deal with those state APIs directly. You can still do windowing. Let's say you have a SQL app doing windowing, with sessions defined by idle time. That would be translated to a Flink job, and Flink will manage those windows and that session state. So either way, you do not need to worry about state management. Apache Flink takes care of it.
>> So tell me, some of the other products you might have looked at, is the issue that if they have a clean separation from the storage layer for large-scale state management, you know, as opposed to in memory, is it that the large scale is almost treated like a second tier and therefore you almost have a separate set or a restricted set of operations at the distributed state level versus at the compute level? Would that be a limitation of other stream processors?
>> No, I don't see that. I think different stream processors have taken different approaches. For example, Google Cloud Dataflow, they are thinking about using Bigtable. But that is external state management. Flink decided to take the approach of embedded state management inside of Flink.
>> And when it's external, what's the trade-off?
>> That's a good question. I think if it's external, the latency may be higher and your throughput might be a little lower, because you're going over the network. But the benefit of external state management is that now your job becomes stateless, which makes recovery from job failure much faster. So there's a trade-off there.
>> OK.
>> Yes.
>> OK, got it. Alright, Steven, we're going to have to end it on that, but that was most enlightening, and thanks for joining.
>> Sure, thank you.
>> This is George Gilbert, for Wikibon and theCube. We're again at Flink Forward in San Francisco with Data Artisans, and we'll be back after a short break. (techno music)
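
The session windows defined by idle time that Steven mentions can be illustrated with a small sketch. The Python below is not Flink code and uses no Flink API. It is a batch approximation of what a session window with an idle gap computes, assuming each event is a hypothetical (user_id, timestamp_seconds) pair. The function name `sessionize` and the 30-second gap in the usage lines are illustrative choices, not details from the interview.

```python
from collections import defaultdict


def sessionize(events, gap_seconds):
    """Group (user_id, timestamp) events into per-user sessions.

    A session ends when the next event for the same user arrives more
    than gap_seconds after the previous one. Returns a sorted list of
    (user_id, session_start, session_end, event_count) tuples.
    """
    # Collect timestamps per user; events may arrive in any order.
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, stamps in by_user.items():
        stamps.sort()
        # start, prev, and count play the role of the per-key session state.
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev > gap_seconds:
                # Idle gap exceeded: close the current session, open a new one.
                sessions.append((user_id, start, prev, count))
                start, count = ts, 0
            prev = ts
            count += 1
        # Close the last open session for this user.
        sessions.append((user_id, start, prev, count))
    return sorted(sessions)


# Hypothetical click events: (user, seconds since some reference time).
clicks = [("alice", 0), ("alice", 10), ("alice", 100), ("bob", 5), ("alice", 105)]
result = sessionize(clicks, gap_seconds=30)
```

In a stream processor the open session for each user (its start, last-seen time, and event count) is the state the engine has to keep between events and restore after a failure. Here it sits in local variables only because all events are available at once.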

Published Date : Apr 11 2018


