Chinmay Soman | Flink Forward 2017
>> Welcome back, everyone. We are on the ground at the data Artisans user conference for Flink. It's called Flink Forward. We are at the Kabuki Hotel in lower Pacific Heights in San Francisco. The conference kicked off this morning with some great talks by Uber and Netflix. We have the privilege of having with us Chinmay Soman from Uber. >> Yes. >> Welcome, Chinmay, it's good to have you. >> Thank you. >> You gave a really, really interesting presentation about the pipelines you're building and where Flink fits, but you've also said there's a large deployment of Spark. Help us understand how Flink became a mainstream technology for you, where it fits, and why you chose it. >> Sure. About one year back, when we were starting to evaluate what technology makes sense for the problem space we are trying to solve, which is real-time analytics, we observed that Spark's stream processing is actually more resource-intensive than some of the other technologies we benchmarked. More specifically, it was using more memory and CPU at that time. That's one reason. I actually came from the Apache Samza world; I was on the Samza team at LinkedIn before I came to Uber. We had in-house expertise on Samza, and I think reliability was the key motivation for choosing Samza. So we started building on top of Apache Samza for almost the last one and a half years. But then we hit the scale where Samza, we felt, was lacking. With Samza, it's actually tied into Kafka a lot. You need to make sure your Kafka scales in order for the stream processing to scale. >> In other words, the topics and the partitions of those topics, you have to keep the physical layout of those in mind at the message queue level, in line with the stream processing. >> That's right. The parallelism is actually tied to the number of partitions in Kafka. Furthermore, if you have a multi-stage pipeline, where one stage processes data and sends output to another stage, all these intermediate stages today again go back to Kafka. So if you want to do a lot of these use cases, you actually end up creating a lot of Kafka topics, and the I/O overhead on the cluster shoots up exponentially. >> So when creating topics, or creating consumers that do something and then output to producers, if you do too many of those things, you defeat the purpose of low latency because you're storing everything. >> Yeah. To its credit, it is more robust, because if you suddenly get a spike in your traffic, your system is going to handle it, because Kafka buffers that spike. It gives you a very reliable platform, but it's not cheap. So that's why we're looking at Flink. In Flink, you can actually build a multi-stage pipeline and have in-memory queues instead of writing back to Kafka, so it is fast and you don't have to create multiple topics per pipeline. >> So, let me unpack that just a little bit to be clearer. The in-memory queues give you, obviously, better I/O. >> Yes. >> And if I understand correctly, that can absorb some of the backpressure? >> Yeah, so backpressure is interesting. If you have everything in Kafka and no in-memory queues, there is no backpressure, because Kafka is a big buffer; it just keeps running. With in-memory queues, there is backpressure. Another question is, how do you handle it? Going back to Samza, those systems actually degrade and can't recover once they are in backpressure. But Flink, as we've seen, slows down consuming from Kafka, and once the spike is over, once you're over that hill, it actually recovers quickly.
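To make the multi-stage point concrete, here is a minimal sketch of a Flink job in which only the first stage reads from Kafka, and every subsequent stage is connected through Flink's in-memory exchanges rather than an intermediate Kafka topic. The broker address, topic name, and record format are placeholder assumptions for illustration, not Uber's actual pipeline.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

import java.util.Properties;

public class MultiStagePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder broker
        props.setProperty("group.id", "pipeline-demo");       // placeholder consumer group

        env
            // Stage 1: the only Kafka read in the entire pipeline.
            .addSource(new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props))
            // Stage 2: parse each record into a (key, 1) pair. The output flows
            // to the next operator through Flink's in-memory buffers, not Kafka.
            .map(new MapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(String line) {
                    return Tuple2.of(line.split(",")[0], 1);
                }
            })
            // Stage 3: keyBy repartitions the stream over Flink's own network
            // stack; a Kafka-centric framework would typically need an
            // intermediate topic for this shuffle.
            .keyBy(0)
            .sum(1)
            // Final stage: print in place of a real sink, for brevity.
            .print();

        env.execute("Multi-stage pipeline without intermediate Kafka topics");
    }
}
```

In an equivalent Kafka-centric deployment, the repartitioning step before the aggregation would need its own topic, which is exactly the per-pipeline topic count and I/O overhead described above.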
It is able to sustain heavy spikes. >> Okay, so this goes to your issues with keeping up with the growth of data... >> That's right. >> You know, in the system there are multiple levels of elasticity and then resource intensity. Tell us about that, and the desire to get as many jobs as possible out of a certain level of resources. >> So, today, we are a platform where people come in and say, "Here's my code," or, "Here's my SQL that I want to run on your platform." In the old days, they were telling us, "Oh, I need 10 gigabytes for a container," and that they need this many CPUs, and that really limited how many use cases we onboarded and made our hardware footprint pretty expensive. So we need the pipeline, the infrastructure, to be really memory efficient. What we have seen is that memory is the bottleneck in our world, more so than CPU. A lot of applications consume from Kafka and actually buffer locally in each container, in the local JVM memory. So we need the memory component to be very efficient, and we can pack more jobs on the same cluster if everyone is using less memory. That's one motivation. The other thing, for example, that Flink does and Samza also does, is make use of a RocksDB store, which is a local persistent-- >> Oh, that's where it gets the state management. >> That's right, so you can offload from memory onto the disk-- >> Into a proper database. >> Into a proper database, and you don't have to cross a network to do that because it's sitting locally. >> Just to elaborate on what might seem like an arcane topic: if it's residing locally, then anything it's going to join with has to also be residing locally. >> Yeah, that's a good point. You have to be able to partition your inputs and your state in the same way, otherwise there's no locality. >> Okay, and you'd have to shuffle stuff around the network. >> And more than that, you need to be able to recover if something happens, because there's no replication for this state. If the hard disk on that node crashes, you need to recreate that cache from somewhere. So either you go back and read from Kafka, or you store that cache somewhere. Flink actually supports this out of the box, and it snapshots the RocksDB state into HDFS. >> Got it, okay. It's more resilient-- >> Yes. >> And more resource efficient. So, let me ask one last question. Mainstream enterprises, or at least the very largest ones, have been trying to get their arms around some of these open-source projects. Very innovative, the pace of innovation is huge, but it demands a skill set that seems to be most resident in large consumer internet companies. What advice do you have for them when they aspire to use the same technologies you're talking about to build new systems, but might not have the skills? >> Right, that's a very good question. I'll try to answer as best I can. I think the first thing to do is understand your scale. Even if you're a large banking corporation, you need to understand where you fit in the industry ecosystem. If it turns out that your scale isn't that big and you're using it for internal analytics, then you can just pick the off-the-shelf pipelines and make them work. For example, if you don't care about multi-tenancy, and your hardware spend is not that much, almost anything might actually work. The real challenge is when you pick a technology and make it work for a large number of use cases, and you want to optimize for cost.
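As a sketch of the state discussion above, here is a minimal Flink job that keeps keyed state in a local RocksDB instance and periodically snapshots it to HDFS. The class names, the HDFS path, and the sample data are hypothetical, assumed only for illustration.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RocksDbStateSketch {

    // Counts events per key. Because the stream is keyed on the same field
    // that scopes this state, each task sees only its own keys: the "join"
    // between input and state is purely local, with no network hop.
    static class CountPerKey
            extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Long>> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Integer> in,
                            Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = count.value();
            long next = (current == null ? 0L : current) + in.f1;
            count.update(next);
            out.collect(Tuple2.of(in.f0, next));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // State lives in a local RocksDB instance, so large state spills from
        // JVM memory to local disk. The HDFS URI is a placeholder for your
        // cluster's checkpoint directory.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

        // Snapshot the RocksDB state to HDFS every 60 seconds. If a node's
        // disk is lost, the job restores from the last snapshot instead of
        // rebuilding its cache by replaying everything from Kafka.
        env.enableCheckpointing(60_000);

        env.fromElements(Tuple2.of("rider", 1), Tuple2.of("driver", 1), Tuple2.of("rider", 1))
           .keyBy(0)                 // partition input the same way as the state
           .flatMap(new CountPerKey())
           .print();

        env.execute("Keyed RocksDB state, checkpointed to HDFS");
    }
}
```

Because input and state share the same partitioning, each task's RocksDB instance holds exactly the keys that task processes, which is the locality point made in the conversation; the periodic checkpoint to HDFS is what allows recovery when a local disk is lost.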
That's where you need a huge engineering organization. So in simpler words, if the extent of your use cases is not that big, pick something which has a lot of support from the community. Most common things just work out of the box, and that's good enough. But if you're doing a lot of complicated things, like real-time machine learning, or your scale is in billions of messages per day, or terabytes of data per day, then you really need to make a choice: whether you invest in an engineering organization that can really understand these use cases, or you go to companies like Databricks and get support from Databricks, or... >> Or maybe a cloud vendor? >> Or a cloud vendor, or something like Confluent, which provides Kafka support, things like that. I don't think there is one answer. For our use case, for example, the reason we chose to build an engineering organization around this is because our use cases are immensely complicated and not really seen before, so we had to invest in this technology. >> Alright, Chinmay, we're going to leave it at that and hopefully keep the dialogue going-- >> Sure. >> offline. So, we'll be back shortly. We're at Flink Forward, the data Artisans user conference for Flink. We're on the ground at the Kabuki Hotel in downtown San Francisco, and we'll be right back.