Sean Hester | Flink Forward 2017
>> Welcome back. We're at Flink Forward, the user conference for the Flink community, put on by data Artisans, the creators of Flink. We're on the ground at the Kabuki Hotel in Pacific Heights in San Francisco. And we have another special guest from BetterCloud, which is a SaaS management company. We have Sean Hester, Director of Engineering. And Sean, why don't you tell us, what brings you to Flink Forward? Give us some context for that.

>> Sure, sure. So a little over a year ago we started restructuring our application. We had a spike in our vision, we wanted to go a little bit bigger. And at that point we had done some things that were suboptimal, let's say, as far as our approach to the way we were generating operational intelligence. So we wanted to move to a streaming platform. We looked at a few different options, and after pretty much a bake-off, Flink came out on top for us. And we've been using it ever since. It's been in production for us for about six months. We love it, we're big fans, we love their roadmap, so that's why we're here.

>> Okay, so let's unpack that a little more. In the bake-off, what were the... So your use case is SaaS management. But within that bake-off, what were the criteria that surfaced as the highest priority?

>> So for us we knew we wanted to be working with something that was the latest generation of streaming technology. Something that had basically addressed all of the big problems from the Google MillWheel paper: things like managing backpressure, and how do you manage checkpointing and restoring of state in a distributed streaming application? Things that we had no interest in writing ourselves after digging into the problem a little bit. So we wanted a solution that would solve those problems for us, and this seemed like it had a really solid community behind it. And again, Flink came out on top.

>> Okay, so now, understanding sort of why you chose Flink, help us understand BetterCloud's service. What do you offer customers, and how do you see that evolving over time?

>> Sure, sure. So you've been calling us a management company; we provide tooling for IT admins to manage their SaaS applications. So things like the Google Suite, or Zendesk, or Slack. And we give them that single point of entry, the single pane of glass to see everything: see all their users in one place, what applications are provisioned to which users, et cetera. And so we literally go to the APIs of each of the partners that we provide support for, gather data, and from there it starts flowing through the stream as a set of change events, basically. Hey, this user's had a title update or a manager update. Is that meaningful for us in some way? Do we want to run a particular workflow based on that event, or is that something we need to take into account for a particular piece of operational intelligence?

>> Okay, so you dropped in there something really concrete: a change event for the role of an employee. That's a very application-specific piece of telemetry that's coming out of an app. Very different from saying, well, what's my CPU utilization, which'll be the same across all platforms.

>> Correct.

>> So how do you account for... applications that might have employees in one SaaS app and also employees in a completely different SaaS app, and they emit telemetry or events that mean different things? How do you bridge that?

>> Exactly.
So we have a set of teams that's dedicated to just the role of getting data from the SaaS applications and emitting it into the overall BetterCloud system. After that there's another set of teams that's basically dedicated to providing that central, canonical view of a user, or a group, or... an asset, a document, et cetera. So all of those disparate models that might come in from any given SaaS app get normalized by that team into what we call our canonical model. And that's what flows downstream to the teams that I lead, to have operational intelligence run on it.

>> Okay, so just to be clear, for our mainstream customers who aren't rocket scientists like you-- (laughs) When they want to make sense of this, what you're telling them is they don't have to be locked into the management solution that comes from a cloud vendor, where they're going to harmonize all their telemetry and their management solutions to work seamlessly across their services and the third-party services that are on that platform. What you're saying is you're putting that commonality across apps that you support on different clouds.

>> Yes, exactly. We provide kind of the glue, or the homogenization, necessary to make that possible.

>> Now this may sound arcane, but being able to put in place that commonality implies that there is overlap, complete overlap, for that information, for how to take into account and manage an employee onboarding here and one over there. Unlike in hardware, where it's obviously the same no matter what you're doing, what happens in applications where you can't find a full overlap?

>> Well, it's never a full overlap. But there is typically a very core set of properties for a user account, for example, that we can work with regardless of what SaaS application we might be integrating with. But we do have special areas, like metadata areas, within our events that are dedicated to the original data, fresh from the SaaS application's API, and we can do one-off operations specifically on that SaaS app's data. But yeah, in general there's a lot of commonality between the way people model a user account or a distribution group or a document.

>> Okay, interesting. And so the role of streaming technology here is to get those events to you really quickly, and then for you to apply your rules to identify a root cause, or even to remediate, either by advising a person, an administrator, or automatically.

>> Yes, exactly.

>> And plans for adding machine learning to this going forward?

>> Absolutely, yeah. So one of our big asks, as we started casting this vision in front of some of our core customers, was basically: I don't know what normal is. You figure out what normal is and then let me know when something abnormal happens. Which is a perfect use case for machine learning. So we definitely want to get there.

>> Running steady state, learning the steady state, then finding anomalies.

>> Exactly, exactly.

>> Interesting, okay.

>> Not there yet, but it's definitely on our roadmap.

>> And then what about management companies that might say, we're just going to target workloads of this variety, like a big data workload, where we're going to take Kafka, Spark, Hive, and maybe something that predicts and serves, and we're just going to manage that. What trade-offs do they get to make that are different from what you get to make?

>> I'm not sure I quite understand the question you're getting at.
>> If there's a way they can narrow the scope of the processes they're going to model, or the workloads they're going to model, where it's, say, just big data workloads, and there's going to be some batch and interactive stuff, and they only have to cover a certain number of products because those are the only ones that fit into that type of workload.

>> Oh, I gotcha, gotcha. So we designed our roadmap from the get-go knowing that one of our competitive advantages was going to be how quickly we can support additional SaaS applications. So we've actually baked into most of our architecture stuff that's very configuration-driven, let's say, versus hard-coded, and that allows us to very quickly onboard new SaaS apps. So I think the ability to manage, provision, and run workloads against the 20 different SaaS apps that an admin in a modern workplace might be working with is just so valuable that I think that's going to win the day eventually.

>> Single pane of glass, not at the infrastructure level, but at the application level.

>> Exactly, exactly.

>> Okay. All right, we've been with Sean Hester of BetterCloud, and we will be right back. We're at the Flink Forward event, sponsored by data Artisans for the Flink user community. The first-ever conference in the US for the Flink community. And we'll be back shortly. (electronic music)
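The pipeline Sean describes, provider-specific change events normalized into a canonical model before any workflows or operational intelligence run on them, maps naturally onto a streaming job. Below is a minimal sketch in Flink's DataStream API of what that normalization stage could look like. Every class, field name, and per-provider mapping here is a hypothetical illustration, not BetterCloud's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CanonicalModelSketch {

    // Hypothetical raw change event, as fetched from one SaaS provider's API.
    public static class RawChangeEvent {
        public String provider;             // e.g. "gsuite", "zendesk", "slack"
        public Map<String, String> fields;  // provider-specific payload
        public RawChangeEvent() {}
        public RawChangeEvent(String provider, Map<String, String> fields) {
            this.provider = provider;
            this.fields = fields;
        }
    }

    // Hypothetical canonical user event that flows downstream.
    public static class CanonicalUserEvent {
        public String userId;
        public String title;
        public Map<String, String> rawMetadata; // original payload, kept for one-off logic
        public CanonicalUserEvent() {}
        public CanonicalUserEvent(String userId, String title, Map<String, String> rawMetadata) {
            this.userId = userId;
            this.title = title;
            this.rawMetadata = rawMetadata;
        }
        @Override
        public String toString() {
            return "CanonicalUserEvent(" + userId + ", " + title + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Two sample events standing in for what the per-provider teams would emit.
        Map<String, String> g = new HashMap<>();
        g.put("primaryEmail", "ada@example.com");
        g.put("organizations.title", "Director");
        Map<String, String> z = new HashMap<>();
        z.put("id", "agent-42");
        z.put("role", "admin");

        DataStream<RawChangeEvent> raw = env.fromElements(
            new RawChangeEvent("gsuite", g),
            new RawChangeEvent("zendesk", z));

        // One mapping per provider collapses every shape into the canonical model;
        // the untouched payload rides along as metadata for one-off operations.
        DataStream<CanonicalUserEvent> canonical =
            raw.map(new MapFunction<RawChangeEvent, CanonicalUserEvent>() {
                @Override
                public CanonicalUserEvent map(RawChangeEvent e) {
                    switch (e.provider) {
                        case "gsuite":
                            return new CanonicalUserEvent(e.fields.get("primaryEmail"),
                                    e.fields.get("organizations.title"), e.fields);
                        case "zendesk":
                            return new CanonicalUserEvent(e.fields.get("id"),
                                    e.fields.get("role"), e.fields);
                        default:
                            return new CanonicalUserEvent("unknown", null, e.fields);
                    }
                }
            });

        canonical.print(); // downstream: workflows, alerting, operational intelligence
        env.execute("Canonical model normalization (sketch)");
    }
}
```

The design choice mirrored here is the one Sean calls out: only the normalization layer knows about provider-specific shapes, so downstream teams program against a single model, and supporting a new SaaS app means adding one mapping rather than touching every consumer.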
Chinmay Soman | Flink Forward 2017
>> Welcome back, everyone. We are on the ground at the data Artisans user conference for Flink. It's called Flink Forward. We are at the Kabuki Hotel in lower Pacific Heights in San Francisco. The conference kicked off this morning with some great talks by Uber and Netflix. We have the privilege of having with us Chinmay Soman from Uber.

>> Yes.

>> Welcome, Chinmay, it's good to have you.

>> Thank you.

>> You gave a really, really interesting presentation about the pipelines you're building and where Flink fits, but you've also said there's a large deployment of Spark. Help us understand how Flink became a mainstream technology for you, where it fits, and why you chose it.

>> Sure. About one year back, when we were starting to evaluate what technology makes sense for the problem space we're trying to solve, which is near-real-time analytics, we observed that Spark's stream processing is actually more resource intensive than some of the other technologies we benchmarked. More specifically, it was using more memory and CPU at that time. That's one. I actually came from the Apache Samza world; I was on the Samza team at LinkedIn before I came to Uber. We had in-house expertise on Samza, and I think that reliability was the key motivation for choosing Samza. So we started building on top of Apache Samza for almost the last one and a half years. But then we hit the scale where Samza, we felt, was lacking. With Samza, it's actually tied into Kafka a lot. You need to make sure your Kafka scales in order for the stream processing to scale.

>> In other words, the topics and the partitions of those topics. You have to keep the physical layout of those in mind at the message queue level, in line with the stream processing.

>> That's right. The parallelism is actually tied to the number of partitions in Kafka. Furthermore, if you have a multi-stage pipeline, where one stage processes data and sends output to another stage, all these intermediate stages today go back through Kafka. So if you want to do a lot of these use cases, you actually end up creating a lot of Kafka topics, and the I/O overhead on the cluster shoots up exponentially.

>> So when creating topics, or creating consumers that do something and then output to producers, if you do too many of those things, you defeat the purpose of low latency, because you're storing everything.

>> The upside is that it's more robust, because if you suddenly get a spike in your traffic, your system is going to handle it: Kafka buffers that spike. It gives you a very reliable platform, but it's not cheap. So that's why we're looking at Flink. In Flink, you can actually build a multi-stage pipeline and have in-memory queues instead of writing back to Kafka, so it is fast and you don't have to create multiple topics per pipeline.

>> So, let me unpack that just a little bit to be clearer. The in-memory queues give you, obviously, better I/O.

>> Yes.

>> And if I understand correctly, that can absorb some of the backpressure?

>> Yeah, so backpressure is interesting. If you have everything in Kafka and no in-memory queues, there is no backpressure, because Kafka is a big buffer; it just keeps running. With in-memory queues, there is backpressure. Another question is, how do you handle this? Going back to Samza-style systems, they actually degrade and can't recover once they are in backpressure. But Flink, as you've seen, slows down consuming from Kafka, and once the spike is over, once you're over that hill, it actually recovers quickly. It is able to sustain heavy spikes.
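The contrast Chinmay draws, a Kafka topic between every stage versus Flink's in-memory exchanges, shows up directly in how a multi-stage Flink job is written: the intermediate stages are just chained operators inside one job, and only the ingress and egress touch Kafka. Here is a minimal sketch using the Kafka 0.10 connector of that era; the broker address, group id, and topic names are made up for illustration, and the stages are deliberately trivial.

```java
import java.util.Properties;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class MultiStagePipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.setProperty("group.id", "pipeline-sketch");

        // Stage 0: the only read from Kafka. "events-in" is a made-up topic name.
        DataStream<String> source = env.addSource(
            new FlinkKafkaConsumer010<>("events-in", new SimpleStringSchema(), props));

        // Stages 1..3: parse, filter, transform. In the topic-per-stage design
        // Chinmay describes, each hop would be its own job with a Kafka topic in
        // between; here the records move through Flink's in-memory exchanges
        // (and chained operators) inside a single job instead.
        DataStream<String> result = source
            .map(new MapFunction<String, String>() {       // stage 1: parse/clean
                @Override public String map(String s) { return s.trim(); }
            })
            .filter(s -> !s.isEmpty())                     // stage 2: drop empties
            .map(new MapFunction<String, String>() {       // stage 3: transform/enrich
                @Override public String map(String s) { return s.toUpperCase(); }
            });

        // Only the final output goes back to Kafka; no intermediate topics exist.
        result.addSink(new FlinkKafkaProducer010<>("events-out", new SimpleStringSchema(), props));

        env.execute("Multi-stage pipeline with in-memory exchanges (sketch)");
    }
}
```

The cost difference he mentions falls out of this structure: one topic in, one topic out, regardless of how many stages sit in the middle, instead of a topic (and its replication and I/O) per stage.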
>> Okay, so this goes to your issues with keeping up with the growth of data...

>> That's right.

>> You know, in the system there are multiple levels of elasticity, and then resource intensity. Tell us about that, and the desire to get as many jobs as possible out of a certain level of resources.

>> So, today, we are a platform where people come in and say, "Here's my code," or, "Here's my SQL that I want to run on your platform." In the old days, they were telling us, "Oh, I need 10 gigabytes for a container," and that they need this many CPUs, and that really limited how many use cases we onboarded and made our hardware footprint pretty expensive. So we need the pipeline, the infrastructure, to be really memory-efficient. What we have seen is that memory is the bottleneck in our world, more so than CPU. A lot of applications that consume from Kafka actually buffer locally in each container, in the local JVM memory. So we need the memory component to be very efficient, and we can pack more jobs on the same cluster if everyone is using less memory. That's one motivation. The other thing, for example, that Flink does, and Samza also does, is make use of a RocksDB store, which is a local persistent--

>> Oh, that's where it gets the state management.

>> That's right, so you can offload from memory onto the disk--

>> Into a proper database.

>> Into a proper database, and you don't have to cross the network to do that, because it's sitting locally.

>> Just to elaborate on what might seem like an arcane topic: if it's residing locally, then anything it's going to join with also has to reside locally.

>> Yeah, that's a good point. You have to be able to partition your inputs and your state in the same way, otherwise there's no locality.

>> Okay, and you'd have to shuffle stuff around the network.

>> And more than that, you'd need to be able to recover if something happens, because there's no replication for this state. If the hard disk on that node crashes, you need to recreate that cache from somewhere. So either you go back and read from Kafka, or you store that cache somewhere. Flink actually supports this out of the box: it snapshots the RocksDB state into HDFS.

>> Got it, okay. It's more resilient--

>> Yes.

>> And more resource-efficient. So, let me ask one last question. Mainstream enterprises, or at least the very largest ones, have been trying to get their arms around some open-source projects. Very innovative, the pace of innovation is huge, but it demands a skillset that seems to be most resident in large consumer internet companies. What advice do you have for them, where they aspire to use the same technologies you're talking about to build new systems, but they might not have the skills?

>> Right, that's a very good question. I'll try to answer in the way that I can. I think the first thing to do is understand your scale. Even if you're a big, large banking corporation, you need to understand where you fit in the industry ecosystem. If it turns out that your scale isn't that big and you're using it for internal analytics, then you can just pick the off-the-shelf pipelines and make them work. For example, if you don't care about multi-tenancy, and your hardware spend is not that much, almost anything might work. The real challenge is when you pick a technology and make it work for large use cases and you want to optimize for cost.
That's where you need a huge engineering organization. So in simpler words, if the extent of your use cases is not that big, pick something that has a lot of support from the community. The most common things just work out of the box, and that's good enough. But if you're doing a lot of complicated things, like real-time machine learning, or your scale is in billions of messages per day, or terabytes of data per day, then you really need to make a choice: whether you invest in an engineering organization that can really understand these use cases, or you go to companies like Databricks, get support from Databricks, or...

>> Or maybe a cloud vendor?

>> Or a cloud vendor, or things like Confluent, which provides Kafka support, things like that. I don't think there is one answer. For us, the reason we chose to build an engineering organization around this is that our use cases are immensely complicated and not really seen before, so we had to invest in this technology.

>> All right, Chinmay, we're going to leave it on that and hopefully keep the dialogue going--

>> Sure.

>> offline. So, we'll be back shortly. We're at Flink Forward, the data Artisans user conference for Flink. We're on the ground at the Kabuki Hotel in downtown San Francisco, and we'll be right back.
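For readers who want to see the state-management piece Chinmay described earlier, local RocksDB state with snapshots shipped to HDFS, the sketch below shows the two relevant pieces: selecting the RocksDB state backend with a checkpoint URI, and a keyed operator whose per-key counter lives in that backend rather than on the JVM heap. It assumes the flink-statebackend-rocksdb dependency is on the classpath; the HDFS path, job, and data are purely illustrative.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RocksDbStateSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keyed state lives in local RocksDB, spilling to disk instead of the heap;
        // periodic checkpoints snapshot it to HDFS so a lost node can be recovered
        // without replaying everything from Kafka.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")); // illustrative path
        env.enableCheckpointing(60_000); // snapshot every minute

        env.fromElements(Tuple2.of("user-a", 1L), Tuple2.of("user-b", 1L), Tuple2.of("user-a", 1L))
           .keyBy(0)                      // partition inputs and state the same way (locality)
           .flatMap(new RunningCount())
           .print();

        env.execute("RocksDB keyed state (sketch)");
    }

    // Per-key running count: the counter is stored in the state backend, not in a field.
    public static class RunningCount
            extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out)
                throws Exception {
            Long current = count.value();
            long next = (current == null ? 0L : current) + in.f1;
            count.update(next);
            out.collect(Tuple2.of(in.f0, next));
        }
    }
}
```

This is the trade-off discussed in the interview in miniature: the operator reads and writes state locally with no network hop, while durability comes from the checkpoint snapshots to HDFS rather than from replicating the state store itself.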