Jim Walker, Cockroach Labs & Christian Hüning, finleap connect | KubeCon + CloudNativeCon EU 2022
>> (bright music) >> Narrator: theCUBE presents KubeCon + CloudNativeCon Europe 2022, brought to you by Red Hat, the Cloud Native Computing Foundation, and its ecosystem partners. >> Welcome to Valencia, Spain, and KubeCon + CloudNativeCon Europe 2022. I'm Keith Townsend, along with my host, Paul Gillin, who is the senior editor for architecture at SiliconANGLE. Paul. >> Keith, you've been asking me questions all these last two days. Let me ask you one. You're a traveling man. You go to a lot of conferences. What's different about this one? >> You know what, we were just talking about that pre-conference. Open source conferences are usually pretty intimate. This is big. 7,500 people talking about complex topics, all in one big area. And I got to say, it's overwhelming. It's way more. It's not focused on a single company's product or messaging. It is about a whole ecosystem. A very different show. >> And certainly some of the best t-shirts I've ever seen. And our first guest, Jim, has one of the better ones. >> I mean, a big cockroach, come on, right? >> Jim Walker, principal product evangelist at CockroachDB, and Christian Hüning, tech director of cloud technologies at Finleap Connect, a financial services company that's based out of Germany, now offering services in four countries. >> Basically all over Europe. >> Okay. >> But we are in three countries with offices. >> So you're a CockroachDB customer, and I got to ask the obvious question. Databases are hard. The company started in 2015, and you've been a customer since 2019, I understand. Why take the risk on a four-year-old database? I mean, that just sounds like a world of risk and trouble. >> So it was in 2018 when we joined the company, and we did this cloud native transformation; that was our task, basically.
We had a very limited amount of time, we were faced with a legacy infrastructure, and we needed something that would run in a cloud native way and just blend in with everything else we had. And the idea was to go all in with Kubernetes, though in those early days a lot of things were alpha or beta, and we were running on MySQL back then. >> Yeah. >> On a VM, kind of a small setup. And then we were looking for something that we could just deploy in Kubernetes, alongside everything else. And we had a stack, and we had to duplicate it many times. So, also to maintain that, we wanted to do it all the same way, with GitOps and everything, and Cockroach delivered that proposition. So that was why we evaluated the risk of adopting that solution relatively early: the proposition of having something that's truly cloud native and really blends in with everything else we do, in the same way, was something we considered, and then we jumped, the leap of faith and >> The fin leap of faith >> The fin leap of faith. Exactly. And we were not dissatisfied. >> So talk to me a little bit about the challenges, because when we think of MySQL, MySQL scales to amazing sizes; it is the de facto database for many cloud based architectures. What problems were you running into with MySQL? >> We were running into the problem that, essentially, as a fintech company, we are regulated, and we have customers that really value running things on-prem, private cloud; on-prem is a bit of a bad word, maybe. So it's private cloud, hybrid cloud, private cloud in our own data centers in Frankfurt. And we needed to run it in there. So we wanted to somehow manage that, and all of the managed solutions were off the table, so we couldn't use them. So we needed something that ran in Kubernetes, because we only wanted to maintain Kubernetes. We're a small team and didn't want to also run a full blown VM solution of sorts. So that was that.
And the other thing was, we needed something that was HA and distributable somehow. So we also looked into other solutions back at the time, like Vitess, which is also prominent for having a MySQL compliant interface, and a great solution. We also got it to work, but we figured, from the scale, and from the sheer amount of maintenance it would need, we couldn't deliver that; we were too small for that. So that's where Cockroach just fitted in nicely, by being able to be distributed, be HA, be resilient against failure, but also be able to scale out, because we had this problem with a single MySQL deployment: as the data amounts grew, we had trouble operationally keeping it under control. >> So Jim, every time someone comes to me and says, I have a new database, I think, we don't need it, yet another database. >> Right. >> What problem, or how does CockroachDB go about solving the types of problems that Christian had? >> Yeah. I mean, Christian laid out why it exists. I mean, look guys, building a database isn't easy. If it was easy, we'd have a database for every application, but you know, Michael Stonebraker, kind of the godfather of all databases, says it himself: it takes seven, eight years for a database to fully gestate, to be something that's enterprise ready and can be relied upon. We've been building for about seven, eight years. I mean, I'm thankful for people like Christian who joined us early on to help us troubleshoot and go through some things. We're building a database, it's not easy. You're right. But building a distributed system is also not easy. And so for us, if you look at what's going on in infrastructure in general, what's happening in Kubernetes, this whole space is Kubernetes. It's all about automation. How do I automate scale? How do I automate resilience out of the entire equation of what we're actually doing? I don't want to have to think about active-passive systems.
I don't want to think about sharding a database. Sure, you can scale MySQL. You know how many people it takes to run three or four shards of a MySQL database. That's not automation. And I tell you what, in this world right now, with the advances in data, it is hard to find people who actually understand infrastructure, to hire them. This is why this automation is happening: because our systems are more complex. So we started from the very beginning to be something that was very different. This is a cloud native database. This is built with the same exact principles that are in Kubernetes. In fact, like Kubernetes, it's kind of a spawn of Borg, the back end of Google. We are inspired by Spanner. I mean, this was started by three engineers who worked at Google and were frustrated they didn't have the tools they had at Google. So they built something like that outside of Google: how do we give that kind of Google-like infrastructure to everybody? And that's the advent of Cockroach, and kind of why we're doing what we're doing. >> As your database has matured, you're now beginning a transition, or you're in a transition, to a serverless version. How are you doing that without disrupting the experience for existing customers? And why go serverless at all? >> Yeah, it's interesting. So, you know, serverless was kind of an R&D project for us when we first started on that path, because I think, you know, ultimately what we would love to do for the database is, let's not even think about the database, Keith. Like, I don't want to think about the database. What we're building to is, we want a SQL API in the cloud. That's it. I don't want to think about scale. I don't want to think about upgrades. I literally, like, that stuff should just go away. That's what we need, right? As developers, I don't want to think about isolation levels; just give me DML and I want to be able to communicate.
And for us the realization of that vision is: if we're going to put a database on the planet for everybody to actually use, we have to be really, really efficient. And serverless, which I believe really should be called infrastructureless, because I don't think we should be thinking just about servers. We've got to think about, how do I take the context of regions out of this thing? How do I take the context of cloud providers out of what we're talking about? Let's just not think about that. Let's just code against something. Serverless was the answer. Now, we've been building it for about a year and a half. We launched a serverless version of Cockroach last October, and we did it so that everybody in the public could have a free version of a database. And that's what serverless allows us to do. It's all consumption based, up to certain limits, and then you pay. But I think ultimately, and we spoke a little bit about this at the very beginning, I think for ISVs, people who are building software today, the serverless vision gets really interesting, because I think what's on the mind of the CTO is, how do I drive down my cost to the cloud provider? And if we can drive down costs, through making things multi-tenant and super efficient, and then optimizing how much compute we use, spinning things down to zero and back up, and auto scaling these sorts of things in our software, we can start to change the way that people think about spend with the cloud provider. And ultimately we did that so we could do things for free. >> So, Jim, I think I disagree, Christian, I'm sorry, Jim. I think I disagree with you just a little bit. Christian, I think the biggest challenge facing CTOs is people. >> True. >> Getting the people to worry about cost and spend and implementation. So as you hear the concepts of CockroachDB moving to a serverless model, and you're a large customer, how does that make you think, or react, on the people side of your resources?
>> Well, I can say that from the people side of resources, luckily Cockroach is our least problem. We always said it's an operator's dream, because that was the part that just worked for us. >> And it's worked as you have scaled it? Without you having ... >> Yeah. I mean, we use it in a bit of a, we do not really scale out the Cockroach, like, really large. It's more that we use it with the enterprise features, with encryption in the stack, as our customers demand. If they do so, we have the SaaS offering, and we also do dedicated stacks. So by having a fully cloud native solution, with Kubernetes as the foundational layer, we can just use that and stamp it out and deploy it. >> How does that translate into services you can provide your customers? Are there services you can provide customers that you couldn't have if you were running, say, MySQL? >> So what we do is, we run this, so the SaaS offering runs in our hybrid private cloud. And the other thing that we offer is that we run the entire stack at a cloud provider of their choosing. So if they are on AWS, they give us an AWS account, we put it in there. Theoretically, we could then also talk about using the serverless variant, if they like, but it's not strictly required for us. >> So Christian, talk to me about that provisioning process, because if I had a MySQL deployment before, I can imagine how putting that into a cloud native type of repeatable CI/CD pipeline or Ansible script could be difficult. Talk to me about that. How does CockroachDB enable you to create new onboarding experiences for your customers? >> So what we do is, we use Helm charts all over the place, as probably everybody else does. And then each application team has their own set of services; they've packaged them into Helm charts, and they've wrapped those in a super chart that gets wrapped into the super, super chart for the entire stack.
And then at the right place, somewhere in between, Cockroach is added, where it's a dependency. And as they just offer a Helm chart, that's as easy as it gets. And then what the teams do is they have an init job, so that once you deploy all that, it would spin up. And as soon as Cockroach is ready, it's just the same reconcile loop as everything else. It will then provision users, set up database schemas, do all that, and initialize the initial data sets that might be required for a new setup. So with that setup, we can spin up a new cluster and then deploy that stack chart in there. And it takes some time, and then it's done. >> So talk to me about life cycle management. Because when I have one database, I have one schema. When I have a lot of databases, I have a lot of different schemas. How do you keep your stack consistent across customers? >> That is basically part of the same story. We have GitOps all over the place. So we have this repository where we track the super Helm chart versions, and we maintain, like, minus three versions, and ensure that we update the customers and keep them up to date. It's part of the contract sometimes, down to the schedule of the customer at times. And Cockroach also nicely supports these updates, with these schema migrations in the background. So we use, in our case, in that integration, SQLAlchemy, which is also nicely supported. So part of the story from MySQL to Postgres was supported by the ORM, these kinds of things. So that approach, together with the ease of Helm charts and the background migrations of the schema, makes for very seamless upgrade operations. Before that, we had to have downtime. >> That's right, you get online schema changes. Upgrading the database uses the same concept of rolling upgrades that you have in Kubernetes. It's just cloud native. It just fits that same context, I think. >> Christian: It became a no-brainer. >> Yeah. >> Yeah.
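The init-job pattern Christian describes can be sketched in a few lines: a provisioning step that is safe to re-run on every reconcile loop, so it behaves the same whether it's the first deploy or the hundredth. This is an illustrative sketch, not Finleap Connect's actual code; CockroachDB speaks the Postgres wire protocol, so a real job would use a Postgres driver, and sqlite3 stands in here only to keep the sketch self-contained.

```python
import sqlite3

# Each statement uses IF NOT EXISTS / OR IGNORE so the whole job is
# idempotent -- the property that lets it run inside a reconcile loop.
DDL = [
    "CREATE TABLE IF NOT EXISTS app_users (id INTEGER PRIMARY KEY, name TEXT)",
    "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
]
SEED = [
    "INSERT OR IGNORE INTO app_users (id, name) VALUES (1, 'admin')",
]

def provision(conn: sqlite3.Connection) -> None:
    """Apply schema and seed data; safe to call repeatedly."""
    for stmt in DDL:
        conn.execute(stmt)
    for stmt in SEED:
        conn.execute(stmt)
    conn.commit()
```

Running `provision` twice against the same database leaves it in the same state as running it once, which is exactly what an init job triggered by a Helm deploy needs.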
>> Jim, you mentioned the idea of a SQL API in the cloud; that's really interesting. Why does such a thing not exist? >> Because it's really difficult to build. You know, a SQL API, what does that mean? Like, okay, where does that endpoint live? Is there one in California, one on the east coast, one in Europe, one in Asia? Okay. And I'm asking that endpoint for data. Where does that data live? Can you control where data lives on the planet? Because ultimately what we're fighting in software today, in a lot of these situations, is the speed of light. And so how do you intelligently place data on this planet, so that, you know, when you're asking for data, when you're maybe home, it's a different latency than when you're here in Valencia? Does that data follow and move with you? These are really, really difficult problems to solve. And I think we're at this moment in time in software engineering where we're solving some really interesting things, because we are butting up against this speed of light problem. And ultimately that's one of the biggest challenges. But underneath, it has to have all this automation: the ease at which we can scale this database, the always-on resilience, the way that we can upgrade the entire thing with just rolling upgrades. The cloud native concepts are really what's enabling us to do things at global scale; it's automation. >> Let's talk about that speed of light and global scale. There's no better conference for speed of light, for scale, than KubeCon. Any predictions coming out of the show? >> It's less a prediction for me and more of an observation, you guys. Look at two years ago, when we were here in Barcelona at KubeCon EU: it was a lot of hype. A lot of hype, a lot of people walking around, curious, fascinated. This is reality. The conversations that I'm having with people today, there's a reality. There's people really doing it, they're becoming cloud native.
And to me, I think what we're going to see over the next two to three years is people starting to adopt this kind of distributed mindset. And it permeates not just within infrastructure, but it goes up into the stack. We'll start to see many more developers using Go and these kinds of threaded languages, because that distributed mindset, if it starts at the chip and goes all the way to the fingertip of the person clicking, and you're distributed everywhere in between, it is extremely powerful. And I think that's what Finleap, I mean, that's exactly what the team is doing. And I think there's a lot of value and a lot of power in that. >> Jim, Christian, thank you so much for coming on theCUBE and sharing your story. You know what, we're past the hype cycle of Kubernetes, I agree. I was a nonbeliever in Kubernetes two, three years ago. It was mostly hype. We're looking at customers, from Microsoft to Finleap and competitors, doing amazing things with this platform and cloud native in general. Stay tuned for more coverage of KubeCon from Valencia, Spain. I'm Keith Townsend, along with Paul Gillin, and you're watching theCUBE, the leader in high tech coverage. (bright music)
Shuyi Chen, Uber | Flink Forward 2018
>> Announcer: Live from San Francisco, it's theCUBE covering Flink Forward, brought to you by data Artisans. (upbeat music) >> This is George Gilbert. We are at Flink Forward, the user conference for the Apache Flink community, sponsored by data Artisans, the company behind Flink. And we are here with Shuyi Chen from Uber, and Shuyi works on a very important project, the Calcite query optimizer, the SQL query optimizer that's used in Apache Flink as well as several other projects. Why don't we start with, Shuyi, tell us where Calcite's used and its role. >> Calcite is basically used in the Flink Table and SQL API, as the SQL parser and the query optimizer and planner for Flink. >> OK. >> Yeah. >> So now let's go to Uber and talk about the pipeline or pipelines you guys have been building, and how you've been using Flink and Calcite to enable the SQL API and the Table API. What workloads are you putting on that platform, or on that pipeline? >> Yeah, so basically I'm the technical lead of the stream processing platform in Uber, and we use Apache Flink as the stream processing engine for Uber. Basically we build two different platforms. One is called AthenaX, which uses Flink SQL, so basically it enables users to use SQL to compose the stream processing logic. And we have a UI, and with one click, they can just deploy the stream processing job in production. >> When you say UI, did you build a custom UI to, essentially, turn it into a business intelligence tool, so you have a visual way of constructing your queries? Is that what you're describing, or? >> Yeah, so it's similar to how you compose, write a SQL query to query a database. We have a UI for you to write your SQL query, with all the syntax highlighting and all the hints, to write a SQL query so that even the data scientists, and non-engineers in general, can actually use that UI to compose stream processing jobs.
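As a self-contained illustration (not Uber's actual code) of the kind of logic such SQL jobs compose, here is a tumbling-window aggregation sketched in Python: events are assigned to fixed-size, non-overlapping windows by event time, and values are summed per key per window. The event shape and window size are assumptions for the sketch.

```python
from collections import defaultdict

def tumbling_window_sum(events, window_size_s=60):
    """events: iterable of (key, event_time_s, amount).
    Returns {(key, window_start): total} for tumbling windows."""
    totals = defaultdict(float)
    for key, ts, amount in events:
        # Align each event's timestamp down to its window boundary,
        # so every event lands in exactly one non-overlapping window.
        window_start = ts - (ts % window_size_s)
        totals[(key, window_start)] += amount
    return dict(totals)
```

In a real Flink SQL job this same grouping is written declaratively (a `GROUP BY` over a tumbling window), and the engine handles the keying, state, and event-time semantics.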
>> Okay, give us an example of some applications, 'cause this sounds like it's a high-level API, so it makes it more accessible to a wider audience. So what are some of the things they build? >> So for example, our Uber Eats team uses the SQL API, as the stream processing tool, to build their Restaurant Manager Dashboard. >> Okay. >> So basically, the data log lives in Kafka, gets streamed in real time into the Flink job, which is composed using the SQL API, and then that gets stored in our OLAP database, Pinot. Then when the restaurant owners open the Restaurant Manager, they will see the dashboard of their real-time earnings and everything. And with the SQL API, they no longer need to write the Flink job; they don't need to use Java or Scala code, or do any testing or debugging. It's all SQL. >> And then what's the SQL coverage, the SQL semantics that are implemented in the current Calcite engine? >> So it's about basic transformation, projection, window hopping and tumbling windows, and also joining, and group by, and having, and, not to mention, the event time and processing time support. >> And you can shuffle from anywhere? You don't have to have two partitions with the same join key on one node. The data placement can be arbitrary for the partitions? >> Well, the SQL is declarative, right? And so once the user composes the logic, the underlying planner will actually take care of the key-by and group-by, everything. >> Okay, 'cause the reason I ask is many of the early Hadoop based MPP SQL engines had the limitation where you had to co-locate the partitions that you were going to join. >> That's the same thing for Flink. >> Oh. >> But the SQL part just takes care of that. >> Okay. >> So you describe what you want, and underneath it gets translated into a Flink program that actually will do all the co-location.
>> Oh, it redoes it for you, okay. >> Yeah, yeah. So now they don't even need to learn Flink, they just need to learn the SQL, yeah. >> Now, you said there's a second platform that Uber is building on top of Flink. >> Yeah, the second platform is what we call the Flink as a service platform. The motivation is, we found that SQL actually cannot satisfy all the advanced needs in Uber for building stream processing. For example, they will need to call out to RPC services within their stream processing application, or even chain the RPC calls, which is hard to express in SQL; and also, when they have a complicated DAG, like a workflow, it's very difficult to debug individual stages. So they want the control to use the native Flink DataStream API or DataSet API to build their stream or batch job. >> Is the DataSet API the lowest level one? >> No, it's on the same level as the DataStream API; one is for streaming, one for batch. >> Okay, DataStream, and then the other was Table? >> DataSet. >> Oh, DataSet; DataStream, DataSet. >> Yeah. >> And there's one lower than that, right? >> Yeah, there's one lower API, but usually most people don't use that API. >> So that's for system programmers? >> Yeah, yeah. >> So then tell me, who is using, what type of programmer uses the DataStream or the DataSet API, and what do they build at Uber? >> So for example, in one of the talks later, there's a marketplace dynamics team that's actually using the platform to do online model update, machine learning model update, using Flink. Basically they need to take in the model that is trained offline, do a few group-bys, by time and location, then apply the model, and then incrementally update the model. >> And so are they taking a window of updates and then updating the model and then somehow promoting it as the candidate, or, >> Yeah, yeah, yeah. Something similar, yeah. >> Okay, that's interesting.
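The online model update pattern described here can be sketched as keyed state that is refined incrementally as events arrive, the way a keyed Flink DataStream job keeps per-key state. This is a hedged illustration, not the marketplace team's code: the "model" below is just a running mean per key, standing in for whatever incremental update their real model uses.

```python
class OnlineMean:
    """A stand-in 'model': an incrementally updated running mean."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        # Incremental update: new_mean = mean + (x - mean) / n,
        # so no history needs to be stored -- only the current state.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

def process(events):
    """events: iterable of (key, value), e.g. keyed by location.
    Keeps one model instance per key, updated event by event."""
    state = {}
    for key, value in events:
        state.setdefault(key, OnlineMean()).update(value)
    return {k: m.mean for k, m in state.items()}
```

In Flink terms, the `state` dict corresponds to keyed state after a `keyBy`, and the update runs inside the per-event processing function.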
And what type of, so are these the data scientists who are using this API? >> Well, data scientists, not really; it's not designed for data scientists. >> Oh, so they're just handing the models off; they're preparing the models offline, and then they're being updated inline on the stream processing platform. >> Yes. >> And so it's maybe data engineers who are essentially updating the features that get fed in and are continually training, or updating, the models. >> Basically it's an online model update. So as Kafka events come in, they continue to refine the model. >> Okay, and so as Uber looks out a couple of years, what sorts of things do you see adding to either of these pipelines, and do you see a shift away from the batch and request-response type workloads towards more continuous processing? >> Yes, actually we do see that trend. Before joining the stream processing platform team in Uber, I was in marketplace as well, and at that point we always saw there's a shift: people would love to use stream processing technology to actually replace some of the normal backend service applications. >> Tell me some examples. >> Yeah, for example... In our dispatch platform, we have the need to shard the workload, by, for example, riders, to different hosts to process. For example, to compute, say, ETA, or compute some time averages. Before, this was done in backend services, using our internal distributed systems tooling to do the sharding. But with Flink, this can be just done very easily, right? And so actually there's a shift: those people will also want to adopt stream processing technology, so long as it's not a request-response style application. >> So the key thing, just to make sure I understand, is that Flink can take care of the distributed joins, whereas when it was a database based workload, a DBA had to set up the sharding, and now it's sort of more transparent, more automated?
>> I think it's... more about the support. Before, people writing backend services had to write everything: the state management, the sharding, everything; they need to-- >> George: Oh, it's not even database based-- >> Yeah, it's not a database, it's real time. >> So they have to do the physical data management, and Flink takes care of that now? >> Yeah, yeah. >> Oh, got it, got it. >> For some of the applications it's real time, so we don't really need to store the data in a database all the time; it's usually kept in memory and somehow snapshotted. But for a normal backend service, the writer has to do everything. With Flink, there's already built-in support for state management, and all the sharding, partitioning, and time window aggregation primitives. It's all built in, and they don't need to worry about re-implementing the logic and re-architecting the system again and again. >> So it's a new platform for real time; it gives you a whole lot of services, a higher abstraction, for real time applications. >> Yeah, yeah. >> Okay. Alright, with that, Shuyi, we're going to have to call it a day. This was Shuyi Chen from Uber talking about how they're building more and more of their real time platforms on Apache Flink, using a whole bunch of services to complement it. We are at Flink Forward, the user conference of data Artisans for the Apache Flink community. We're in San Francisco; this is the second Flink Forward conference, and we'll be back in a couple minutes, thanks. (upbeat music)
Steven Wu, Netflix | Flink Forward 2018
>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. >> Hi, this is George Gilbert. We're back at Flink Forward, the Flink conference sponsored by Data Artisans, the company that commercializes Apache Flink and provides additional application management platforms that make it easy to take stream processing to scale for commercial organizations. We have Steven Wu from Netflix, always a company that is pushing the edge of what's possible, and one of the early Flink users. Steven, welcome. >> Thank you. >> And tell us a little about the use case that Flink was first, you know, applied to. >> Sure, our first use case is a routing job for the Keystone data pipeline. The Keystone data pipeline processes over three trillion events per day, so we have a thousand routing jobs in which we do some simple filtering and projection, but the routing job at that scale is a challenge for us, and we recently migrated our routing jobs to Apache Flink. >> And so is the function of a routing job, is it like an ETL pipeline? >> Not exactly an ETL pipeline; it's more a data pipeline to deliver data from the producers to the data sinks where people can consume the data, like Elasticsearch, Kafka, or Hive. >> Oh, so almost like the source and sink with a hub in the middle? >> Yes, that is exactly- >> Okay. >> That's the one, our big use case. And the other thing is, our data engineers also need some stream processing to do data analytics, so their jobs can be stateless or stateful; if it's a stateful job, it can be as big as a terabyte of state for a single job.
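The filter-and-projection routing Steven describes can be sketched very compactly: each routing job keeps the events matching a predicate and forwards only a projected subset of fields toward its sink. This is an illustrative sketch of the pattern, not Keystone's implementation; the event shape and field names are assumptions.

```python
def route(events, predicate, fields):
    """Filter events with `predicate`, then project them to `fields`.

    events:    iterable of dicts (one per event)
    predicate: function dict -> bool deciding whether to forward
    fields:    field names to keep in the forwarded record
    Yields the projected records destined for a sink.
    """
    for event in events:
        if predicate(event):
            # Projection: keep only the requested fields that exist.
            yield {f: event[f] for f in fields if f in event}
```

At Keystone's scale the interesting part is not this per-event logic but running thousands of such jobs reliably, which is what the migration to Flink addresses.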
>> OK, and what sort of calculations might you be doing? And which of the Flink APIs are you using? >> Right now we're using the DataStream API, so a little bit low-level. We haven't used Flink SQL yet, but it's on our roadmap. >> OK, so what does the DataStream API, being closer to the metal, give you control over right now that is attractive? And will you have as much control with the SQL API? >> The low-level DataStream API gives you the full feature set of everything. The high-level SQL is much easier to use, but the feature set is more limited. So that's the trade-off there. >> So tell me, for a stateful application, is there scaffolding for managing this distributed state that you had to build, and that you see coming down the pike from Flink and Data Artisans that might make it easier, either for you or for mainstream customers? >> Sure. Internal state management is where Flink really shines compared to other stream processing engines. They do a lot of the work underneath already. The main thing we need from Flink in the near future is around job recovery performance, but the state management API is very mature. Flink is, I think, more mature than most of the other stream processing engines. >> Meaning like Kafka, Spark. So, on state management, can a business user or business analyst issue a SQL query across the cluster, and Flink figures out how to manage the distribution of the query and the filtering and presentation of the results transparently across the cluster?
>> I'm not an expert on Flink SQL, but I think yes, essentially Flink SQL will be converted to a Flink job that uses the DataStream API underneath, so it will manage the state, yes. >> So when you're using the lower-level DataStream API, you have to manage the distributed state and the retrieving and filtering, but at a higher level of abstraction, hopefully that'll be- >> No, I think in either case the state management is handled by Flink. >> Okay. >> Yeah. >> Distributed. >> All the state management, yes. >> Even if it's querying at the DataStream level? >> Yeah, but if you query at the SQL level, you won't be able to work with those state APIs directly. You can still do windowing: say you have a SQL app doing session windows split by idle time; that would be translated into a Flink job, and Flink will manage those windows and that session state. So either way you do not need to worry about state management; Apache Flink takes care of it. >> So tell me, with some of the other products you might have looked at, is the issue that if they have a clean separation from the storage layer for large-scale state management, as opposed to keeping it in memory, the large-scale state is almost treated like a second tier, and therefore you have a separate or restricted set of operations at the distributed state level versus at the compute level? Would that be a limitation of other stream processors? >> No, I don't see that. Different stream engines have just taken different approaches. For example, Google Cloud Dataflow is thinking about using something like Bigtable; that is external state management. Flink decided to take the approach of embedded state management inside Flink. >> And when it's external, what's the trade-off?
>> That's a good question. If it's external, the latency may be higher and your throughput might be a little lower, because you're going over the network. But the benefit of external state management is that your job becomes stateless, which makes recovery much faster after a job failure. So there are trade-offs either way. >> OK. >> Yes. >> OK, got it. Alright, Steven, we're going to have to end it on that, but that was most enlightening, and thanks for joining. >> Sure, thank you. >> This is George Gilbert, for Wikibon and theCube. We're at Flink Forward in San Francisco with Data Artisans; we'll be back after a short break. (techno music)