Kamile Taouk, UNSW & Sabrina Yan, Children's Cancer Institute | DockerCon 2020
>> From around the globe, it's theCUBE, with digital coverage of DockerCon Live 2020, brought to you by Docker and its ecosystem partners.

>> Welcome to the special CUBE coverage of DockerCon 2020, a virtual digital event co-produced by Docker and theCUBE. Thanks for joining us. We have a great segment here. Precision cancer medicine is really evolving, where personalization of the data is going to be important to tailor treatments to the unique characteristics of each tumour. This has been a hot topic and focus area in the industry, and technology is here to help. We have two great guests who are using technology — Docker containers and a variety of other things — to push the process further along: Sabrina Yan, who's a bioinformatics research assistant, and Kamile Taouk, who's a student and intern. You've done some compelling work. Thanks for joining this virtual DockerCon, and thanks for coming on.

>> Thanks for having me.

>> So first, tell us about yourselves and what you're doing at the Children's Cancer Institute. That's where you're located — what's going on there?

>> Sure. So at the Children's Cancer Institute, as the name suggests, we do a lot of research specifically into children's cancer. Children are unique in the sense that a lot of the typical treatments we use for adults may or may not work, or will have adverse side effects. So we do all kinds of research, but our lab, which we call a dry lab, does its research in silico, using computers, and we develop pipelines in order to improve outcomes for children.

>> And what are some of the things you have to deal with? There's the tech side, but there's also the workflow for the patients — survival rates, capacity, the constraints you're dealing with. What specific outcomes are you trying to work through?

>> Well, with all the work that's been done over the past decade, we've made a substantial impact on the survivability of several high-risk cancers in paediatrics, and we've got a program, which we'll talk about in more depth, called the Zero Childhood Cancer Program. Essentially it aims to reduce deaths from childhood cancer to zero — in other words, to push survivability towards 100% so that hopefully no lives will be lost.

>> And what are you doing specifically? What's your job, what's your focus?

>> Yes, so part of our lab does computational biology. We run a processing pipeline on whole-genome and RNA sequencing data: given the sequencing information for the kids, we sequence the healthy cells and we sequence the tumour cells, and we analyse them together. What we do is find the mutations that are causing the cancer, which helps us determine what treatments — what clinical trials — might be most effective for each kid. Specifically, I work on that pipeline, where we run a whole bunch of bioinformatics tools — basically biology plus informatics — and we use the data generated by sequencing to extract the mutations that are likely the cancer-driving mutations, which hopefully we can target in order to treat the kids.
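[Editor's note: to give a rough sense of the shape of one step in that kind of tumour/normal analysis inside a container, here is a minimal, purely illustrative sketch. The image name, subcommand, and file paths are hypothetical — not the Institute's actual tooling.]

```
# Hypothetical example: one containerized step comparing germline and tumour data.
docker run --rm \
  -v /data/patient_001:/data \
  example.org/somatic-caller:2.1.0 \
  call-variants \
    --normal /data/germline.bam \
    --tumour /data/tumour.bam \
    --reference /data/grch38.fa \
    --output /data/somatic.vcf
```

Each real step in a pipeline like this would follow the same general pattern: patient data mounted in, one versioned tool image, results written back out to shared storage.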
>> You know, you hear about ad tech, you hear about Facebook personalization and recommendation engines — what to click on. You're really doing personalization around treatment recommendations. Can you share a little bit about what goes on there?

>> Well, as you mentioned, we're looking at the profile of the tumour itself, and that allows us to specialise the medication and the treatment for that patient. Essentially, that lets us improve the efficiency and the effectiveness of the treatment, which in turn has an impact on survivability.

>> What are some of the technical things? How did you get involved with Docker, and where does Docker fit into all this?

>> Yeah, I'm sure Kamile will have plenty to bring up on this as well, but yes, it's been quite a project. The pipeline we have was built on a specific platform, and it works great, but as with most tools — a lot of the things you develop as engineers — it's pretty easy for them to become platform-specific, and then they're kind of stuck there, and you have to re-engineer the whole thing, which is such a pain. So the project that Kamile and I have been working on was actually taking the individual tools we use in the pipeline and Dockerizing them individually — containerizing them with the dependencies they need — so that we can hook them up any way we want. We can configure the pipeline, not just customize it based off the data by running the same pipeline for everyone, but even change the pipeline to do different things for different kids, and do that easily, and run it on different platforms. The fact that we have that choice not only means we can save money if there's a cloud instance that will run it at lower cost; if there's a platform that wants to collaborate with us and says, "Oh, we have all this data we'd love for you to analyse" —

>> "Use my tool, it's really great."

>> Yeah. And so having portability is a big thing as well. I'm sure people can go on about some of the pain points of having to Dockerize all of the different tools, but even though there are often challenges associated with doing it, I think the payoff is massive.

>> Dig into this, because this is one of those things where you've got a problem statement and a real-world example — cancer patients, life or death, serious things going on here. You're a techie, you get in here and think, okay, this is going to be easy: just wrangle the data, throw some compute at it, done, right? Take us through what it's actually like — you're living it.

>> Right. So as Sabrina mentioned before, first and foremost we're on the scale of several hundred terabytes' worth of data for every single patient, so you can start to understand just how beneficial it is to move the pipeline to the data rather than the other way around — so much time is saved, and money as well. In terms of actually Dockerizing the programs that analyse the data, it was quite difficult, and I think Sabrina would agree with me on this point.
The primary issue was that almost all of the apps we encountered within the pipeline were very heavily dependent on very specific versions of their dependencies; they were built upon so many other different apps and were very heavily fine-tuned. So Dockerizing was quite difficult, because we had to preserve every single version of every single dependency in one image just to ensure it kept working. And these apps get updated quite regularly, so we had to make sure our containers would survive those updates.

>> So what does it really take to Dockerize your pipeline?

>> I mean, it was a whole project. Myself, Kamile, and a whole bunch of students who joined us over the summer, which was fantastic — we basically had a whole team, and it was, "Okay, here's another bioinformatics tool in the pipeline; you take this one, you take that one." We took them on individually, and you'd spend days on each, depending on the app — some are easier than others. Particularly with bioinformatics tools, some are very memory-hungry, some are very finicky, some are a little less stable than others. So you could spend one day characterizing a tool and it's done in a handful of hours; sometimes it could take a week just getting that one tool done. The idea behind the whole team working on it was that eventually you work through this process and you have a Dockerfile set up that lets anyone run the tool on any system, and we know we have an identical setup — which was not assured before. I remember when I started and was trying to get the pipeline running on my own machine, a lot of things just didn't work: oh, you don't have the very specific version of R that this developer has; oh, that's not working because you don't have this specific Perl file that actually has the bug fixes in it. It was just, well...

>> So you had a lot of limitations before Dockerizing — containerizing — it. It was tough. What was it like before and after?

>> Kamile can probably speak to this more fully, but it was basically days or weeks trying to set up and install everything needed to run the whole pipeline. It took a long time, and even then a lot depended on how you'd set up this particular configuration of the pipeline and all the environments for the different programs: you need this version of one application, but the new upgrade of another tool only works with that version of R — all kinds of issues you run into when the tools depend on entirely different things. Having to install, like, four different versions of Python, three different versions of R, and different versions of Java on the one machine just to run it is a bit of...

>> A hassle, basically. A nightmare. And now, after?

>> You're probably familiar with that.

>> Yeah. So what's it like after?

>> It's ridiculously efficient. It's incredible. As we mentioned before, as soon as we pinned down the versions of the dependencies, Docker keeps them in place — we can specify the versions within a Docker container, so we can absolutely guarantee that that application will run successfully and effectively every single time.
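[Editor's note: as a purely illustrative sketch of that version pinning — the tool name, download URL, package list, and version numbers below are hypothetical, not the Institute's actual images — a Dockerfile for one such tool might look roughly like this.]

```
# Hypothetical example: pin the base image and every dependency to an exact,
# known-good version so the tool runs identically on any machine.
FROM ubuntu:18.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip r-base-core openjdk-8-jre-headless \
        curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Pin the exact Python package versions the tool was tuned against.
RUN pip3 install numpy==1.16.4 pandas==0.24.2

# Fetch one specific, tested release of a (hypothetical) analysis tool.
RUN curl -fsSL https://example.org/releases/analysis-tool-1.4.2.tar.gz \
    | tar -xz -C /opt

ENTRYPOINT ["/opt/analysis-tool-1.4.2/bin/analysis-tool"]
```

Because the base image, system packages, language runtimes, and the tool release are all pinned, every machine that builds or pulls this image gets the same environment — which is the guarantee described above.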
>> Share with me how complicated these pipelines are — sounds like that's a key piece here. You had all the hassles, you've got it Dockerized and things work smoothly, but tell me about the pipelines. What's so complicated about them?

>> Honestly, the biggest complication is all of the connections. It's not as simple as run A, then B, then C — that would be nice, but that's not how these things work. You have a network of programs where the output of one is the input for another, and you have to run this program before that one, but some outputs become inputs for multiple programs, and by the time you hook the whole thing up it looks like a gigantic web of applications with all the connections — it almost looks like a massive net when you look at it. But having each of the individual tools containerized and working means we can hook them all up, and even though it looks complicated, it would be far more complicated if we had the entire pipeline in a single program — having to code that whole thing as one unit would be an absolute nightmare. Whereas having each of the tools as individual Docker containers means we just have to link the outputs to the inputs, which is the tough part, but once you've done that, you know each of the individual tools will run. And if an individual tool fails — whether it runs out of memory or hits other issues — you can rerun that one tool and hook its output into whatever the next program is and keep going, without having one massive program that fails midway through with nothing you can do about it.

>> Yeah, and as you unpack it, basically you put the good work in up front and a lot of goodness comes out of it. So this leads to the future of health: what are the key takeaways from this process, and how does it apply to things that might be helpful right around the corner, or today — like machine learning and deep learning, as more of those tools come out? We hope there are going to be some cool things coming. What do you see here, and what are the insights?

>> Well, we have a section of the computational biology team that is looking into doing more predictive work — working out, basically, the risk of kids developing cancer. That's something you can do when you have all of this data, but it requires a lot of analysis as well. One of the benefits of having these very movable pipelines and tools is that it makes it easier to run them on the cloud, easier to share your processing with other researchers and with the hospitals — it just makes collaboration easier. It means data sharing becomes a possibility, whereas before, if you have three different organizations with the data in three different places, how do you share it? Is moving the data really feasible? How else can you analyse it in a way that's practical? And another benefit of Docker, with all of these advanced tools coming out — if there's some amazing predictive tool that comes out using some kind of regression or deep learning, whatever — is that being able to Dockerize a complex tool into a single Docker image makes it far less complicated to add it to the pipeline in the future, if that's something we'd like to do.
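[Editor's note: to illustrate that "web of tools" wiring — and how a newly Dockerized tool would slot in — here is a minimal sketch of two containerized stages connected through a shared volume. Image names, subcommands, and paths are invented for illustration.]

```
# Hypothetical sketch: two pipeline stages as separate containers on a shared volume.
set -euo pipefail
DATA=/data/patient_001

# Stage 1: alignment, from a pinned image version.
docker run --rm -v "$DATA":/data example.org/aligner:0.7.17 \
    align --reads /data/reads.fastq.gz --out /data/aligned.bam

# Stage 2: variant calling, consuming stage 1's output.
docker run --rm -v "$DATA":/data example.org/caller:2.1.0 \
    call --in /data/aligned.bam --out /data/variants.vcf

# If stage 2 fails (e.g. runs out of memory), only that command is rerun;
# stage 1's output on the shared volume is untouched.
```

Because each stage is an independent container, a failed stage can be rerun on its own and its output rewired into the next step, as described above.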
>> Kamile, any thoughts on your end on this?

>> Actually, Sabrina read my mind on that last point. I was just thinking about scalability — it's a huge point, because any kind of new technology that we want to inject into the pipeline is now significantly easier to add with the use of Docker. You can just Dockerize that technology and plug it straight into the pipeline with minimal stress.

>> So productivity and agility really come home for you — does that resonate?

>> Yeah, definitely.

>> And you've got the collaboration, so there are business benefits and outcomes. Are there any proof points you can share — some results, some fruit from the tree, if you will, from all this goodness?

>> Well, one of the things we've been working on is actually a collaboration with BioCommons and Katica. They've built a platform specifically for developing pipelines, and they have support for Docker containers built into the platform, which makes it very easy to push our containers up to the platform, hook them up, and collaborate with them — not only to try a new platform, but also to help them improve it, and to access data that's been uploaded there as well. We wouldn't have been able to do any of that if we hadn't Dockerized everything — it just wouldn't have been possible. And now that we have, we've been able to collaborate with them on improving the platform, and also to share and run our pipelines on other data, which is pretty good.

>> Awesome. Well, it's great to have you on theCUBE here at DockerCon 2020 from down under — great internet connection down there. They're keeping us remote; we're sheltering in place here. Stay safe. Final question: could you each share, in your own words, from a developer and tech standpoint — you're in this core role, a super important role, and the outcomes are significant and have real impact — what has Dockerization done for you, for your work environment, and for the business? A lot of other developers are watching. What's your opinion?

>> Yeah, I mean, the really practical point is that we've massively increased the capacity of the pipeline. One thing that's been quite fantastic is the increased support for the Zero Childhood Cancer Program, which means that going forward we'll actually be able to open the program to every child in Australia who has cancer and add them to the program. Currently we're only able to enrol kids with low survivability — roughly the lowest 30% — but having a pipeline where we can just double the memory, double the amount of compute, and change the instances easily to double or triple the capacity means that, now that we have the support to enrol potentially every kid, once we've upgraded the whole pipeline we'll actually be able to cope with the number of children being enrolled, whereas the existing pipeline is currently at capacity. So doing the upgrade, in a really practical way, means we're actually going to be able to triple the number of kids in Australia we can add to the program, which wouldn't have been possible otherwise.
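[Editor's note: a rough, hypothetical sketch of what that kind of scaling can look like once every step is already an image — the flags, image names, and paths are illustrative only, not the team's actual configuration.]

```
# Give a single memory-hungry step more room on a bigger instance...
docker run --rm --memory=64g -v /data/patient_001:/data \
    example.org/caller:2.1.0 call --in /data/aligned.bam --out /data/variants.vcf

# ...or fan the same pinned image out across many patients, four at a time.
ls -d /data/patient_* | xargs -P4 -I{} \
    docker run --rm -v {}:/data example.org/caller:2.1.0 \
    call --in /data/aligned.bam --out /data/variants.vcf
```

Nothing about the tools themselves changes; extra capacity comes from giving containers more resources or running more of them side by side.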
>> Unleashing the limitations and making it totally scalable. Your thoughts as a developer — you're in there, your hands are dirty, you built it, and it's showing some traction. What's your take?

>> Well, first and foremost, it just feels fantastic knowing that what we're doing has a substantial and quantifiable impact on a subset of the population — we're literally saving lives. And the work we're doing, developing with this technology, is such a breeze, especially compared to — well, I've had minimal contact with what it was like without Docker, and from the horror stories I've heard, it's a godsend. It's really improved the quality of development.

>> Well, you have a great mission, and congratulations on the success — real impact right there. You're doing great work, and it must feel great. I'm happy for you, and it's great to connect with you and see you continue using technology to get the outcomes, not just using technology for its own sake. Fantastic story. Thank you for sharing.

>> Thanks for having me.

>> Thank you.

>> Okay, I'm John Furrier. We're here for DockerCon 2020 — DockerCon virtual, DockerCon digital. It's a digital event this year; we're all sheltering in place, and we're in the Palo Alto studios for DockerCon 2020. Stay with us for more coverage, go to dockercon.com to check out all the different sessions, and of course stay with us for this feed. Thank you very much.
How The Trade Desk Reports Against Two 320-node Clusters Packed with Raw Data
hi everybody thank you for joining us today for the virtual Vertica BBC 2020 today's breakout session is entitled Vertica and en mode at the trade desk my name is su LeClair director of marketing at Vertica and I'll be your host for this webinar joining me is Ron Cormier senior Vertica database engineer at the trade desk before we begin I encourage you to submit questions or comments during the virtual session you don't have to wait just type your question or comment in the question box below the slides and click submit there will be a Q&A session at the end of the presentation we'll answer as many questions as we're able to during that time any questions that we don't address we'll do our best to answer them offline alternatively you can visit vertical forums to post your questions there after the session our engineering team is planning to join the forums to keep the conversation going also a quick reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide and yes this virtual session is being recorded and will be available to view on demand this week we'll send you a notification as soon as it's ready so let's get started over to you run thanks - before I get started I'll just mention that my slide template was created before social distancing was a thing so hopefully some of the images will harken us back to a time when we could actually all be in the same room but with that I want to get started uh the date before I get started in thinking about the technology I just wanted to cover my background real quick because I think it's peach to where we're coming from with vertically on at the trade desk and I'll start out just by pointing out that prior to my time in the trade desk I was a tech consultant at HP HP America and so I traveled the world working with Vertica customers helping them configure install tune set up their verdict and databases and get them working properly so I've seen the biggest and the smallest implementations and everything in between and and so now I'm actually principal database engineer straight desk and and the reason I mentioned this is to let you know that I'm a practitioner I'm working with with the product every day or most days this is a marketing material so hopefully the the technical details in this presentation are are helpful I work with Vertica of course and that is most relative or relevant to our ETL and reporting stack and so what we're doing is we're taking about the data in the Vertica and running reports for our customers and we're an ad tech so I did want to just briefly describe what what that means and how it affects our implementation so I'm not going to cover the all the details of this slide but basically I want to point out that the trade desk is a DSP it's a demand-side provider and so we place ads on behalf of our customers or agencies and ad agencies and their customers that are advertised as brands themselves and the ads get placed on to websites and mobile applications and anywhere anywhere digital advertising happens so publishers are what we think ocean like we see here espn.com msn.com and so on and so every time a user goes to one of these sites or one of these digital places and an auction takes place and what people are bidding on is the privilege of showing and add one or more ads to users and so this is this is really important because it helps fund the internet ads can be annoying sometimes but they actually help help are incredibly helpful in how we get much much 
of our content and this is happening in real time at very high volumes so on the open Internet there is anywhere from seven to thirteen million auctions happening every second of those seven to thirteen million auctions happening every second the trade desk bids on hundreds of thousands per second um so that gives it and anytime we did we have an event that ends up in Vertica that's that's one of the main drivers of our data volume and certainly other events make their way into Vertica as well but that wanted to give you a sense of the scale of the data and sort of how it's impacting or how it is impacted by sort of real real people in the world so um the uh let's let's take a little bit more into the workload and and we have the three B's in spades late like many many people listening to a massive volume velocity and variety in terms of the data sizes I've got some information here some stats on on the raw data sizes that we deal with on a daily basis per day so we ingest 85 terabytes of raw data per day and then once we get it into Vertica we do some transformations we do matching which is like joins basically and we do some aggregation group buys to reduce the data and make it clean it up make it so it's more efficient to consume buy our reporting layer so that matching in aggregation produces about ten new terabytes of raw data per day it all comes from the it all comes from the data that was ingested but it's new data and so that's so it is reduced quite a bit but it's still pretty pretty high high volume and so we have this aggregated data that we then run reports on on behalf of our customers so we have about 40,000 reports per day oh that's probably that's actually a little bit old and older number it's probably closer to 50 or 55,000 reports per day at this point so it's I think probably a pretty common use case for for Vertica customers it's maybe a little different in the sense that most of the reports themselves are >> reports so they're not it's not a user sitting at a keyboard waiting for the result basically we have we we have a workflow where we do the ingest we do this transform and then and then once once all the data is available for a day we run reports on behalf of our customer to let me have our customers on that that daily data and then we send the reports out you via email or we drop them in a shared location and then they they look at the reports at some later point of time so it's up until yawn we did all this work on on enterprise Vertica at our peak we had four production enterprise clusters each which held two petabytes of raw data and I'll give you some details on on how those enterprise clusters were configured in the hardware but before I do that I want to talk about the reporting workload specifically so the the reporting workload is particularly lumpy and what I mean by that is there's a bunch of work that becomes available bunch of queries that we need to run in a short period of time after after the days just an aggregation is completed and then the clusters are relatively quiet for the remaining portion of the day that's not to say they are they're not doing anything as far as read workload but they certainly are but it's much less reactivity after that big spike so what I'm showing here is our reporting queue and the spike is is when all those reports become a bit sort of ailable to be processed we can't we can't process we can't run the report until we've done the full ingest and matching and aggregation for the day and so right around 1:00 or 2:00 
a.m. UTC time every day that's when we get this spike and the spike we affectionately called the UTC hump but basically it's a huge number of queries that need to be processed sort of as soon as possible and we have service levels that dictate what as soon as possible means but I think the spike illustrates our use case pretty pretty accurately and um it really as we'll see it's really well suited for pervert icky on and we'll see what that means so we've got our we had our enterprise clusters that I mentioned earlier and just to give you some details on what they look like there they were independent and mirrored and so what that means is all four clusters held the same data and we did this intentionally because we wanted to be able to run our report anywhere we so so we've got this big queue over port is big a number of reports that need to be run and we've got these we started we started with one cluster and then we got we found that it couldn't keep up so we added a second and we found the number of reports went up that we needed to run that short period of time and and so on so we eventually ended up with four Enterprise clusters basically with this with the and we'd say they were mirrored they all had the same data they weren't however synchronized they were independent and so basically we would run the the tailpipe line so to speak we would run ingest and the matching and the aggregation on all the clusters in parallel so they it wasn't as if each cluster proceeded to the next step in sync with which dump the other clusters they were run independently so it was sort of like each each cluster would eventually get get consistent and so this this worked pretty well for for us but it created some imbalances and there was some cost concerns that will dig into but just to tell you about each of these each of these clusters they each had 50 nodes they had 72 logical CPU cores a half half a terabyte of RAM a bunch of raid rated disk drives and 2 petabytes of raw data as I stated before so pretty big beefy nodes that are physical physical nodes that we held we had in our data centers we actually reached these nodes so so it was on our data center providers data centers and the these were these these were what we built our business on basically but there was a number of challenges that we ran into as we as we continue to build our business and add data and add workload and and the first one is is some in ceremony can relate to his capacity planning so we had to prove think about the future and try to predict the amount of work that was going to need to be done and how much hardware we were going to need to satisfy that work to meet that demand and that's that's just generally a hard thing to do it's very difficult to verdict the future as we can probably all attest to and how much the world has changed and even in the last month so it's a it's a very difficult thing to do to look six twelve eighteen eighteen months into the future and sort of get it right and and and what people what we tended to do is we reach or we tried to our art plans our estimates were very conservative so we overbought in a lot of cases and not only that we had to plan for the peak so we're planning for that that that point in time that those number of hours in the early morning when we had to we had all those reports to run and so that so so we ended up buying a lot of hardware and we actually sort of overbought at times and then and then as the hardware were days it would kind of come into it would come into maturity 
and we have our our our workload would sort of come approach matching the demand so that was one of the big challenges the next challenge is that we were running on disk you can we wanted to add data in sort of two dimensions the only dimensions that everybody can think about we wanted to add more columns to our big aggregates and we wanted to keep our big aggregates for for longer periods of time so both horizontally and vertically we wanted to expand the datasets but we basically were running out of disk there was no more disk in and it's hard to add a disc to Vertica in enterprise mode not not impossible but certainly hard and and one cannot add discs without adding compute because enterprise mode the disk is all local to each of the nodes for most most people you can do not exchange with sands and other external rays but that's there are a number of other challenges with that so um adding in order to add disk we had to add compute and that basically meant kept us out of balance we're adding more compute than we needed for the amount of disk so that was the problem certainly physical nodes getting them the order delivered racked cables even before we even start such Vertica there's lead times there and and so it's also long commitment since we like I mentioned me Lisa hardware so we were committing to these nodes these physical servers for two or three years at a time and I mentioned that can be a hard thing to do but we wanted to least to keep our capex down so we wanted to keep our aggregates for a long period of time we could have done crazy things or more exotic things to to help us with this if we had to in enterprise mode we could have started to like daisy chain clusters together and that would have been sort of a non-trivial engineering effort because we would need to then figure out how to migrate data source first to recharge the data across all the clusters and we had to migrate data from one cluster to another cluster hesitation and we would have to think about how to aggregate run queries across clusters so if you assured data set spans two clusters it would have had to sort of aggregated within each cluster maybe and then build something on top the aggregated the data from each of those clusters so not impossible things but certainly not easy things and luckily for us we started talking about two Vertica about separation of compute and storage and I know other customers were talking to Vertica as we were people had had these problems and so Vertica inyeon mode came to the rescue and what I want to do is just talk about nyan mode really briefly for for those in the audience who aren't familiar but it's basically Vertigo's answered to the separation of computing storage it allows one to scale compute and or storage separately and and this there's a number of advantages to doing that whereas in the old enterprise days when you add a compute you added stores and vice-versa now we can now we can add one or the other or both according to how we want to and so really briefly how this works this slide this figure was taken directly from the verdict and documentation and so just just to talk really briefly about how it works the taking advantage of the cloud and so in this case Amazon Web Services the elasticity in the cloud and basically we've got you seen two instances so elastic cloud compute servers that access data that's in an s3 bucket and so three three ec2 nodes and in a bucket or the the blue objects in this diagram and the difference is a couple of a couple of big 
differences one the data no longer the persistent storage of the data the data where the data lives is no longer on each of the notes the persistent stores of the data is in s3 bucket and so what that does is it basically solves one of our first big problems which is we were running out of disk the s3 has for all intensive purposes infinite storage so we can keep much more data there and that mostly solved one of our big problems so the persistent data lives on s3 now what happens is when a query runs it runs on one of the three nodes that you see here and assuming we'll talk about depo in a second but what happens in a brand new cluster where it's just just spun up the hardware is the query will will run on those ec2 nodes but there will be no data so those nodes will reach out to s3 and run the query on remote storage so that so the query that the nodes are literally reaching out to the communal storage for the data and processing it entirely without using any data on on the nodes themselves and so that that that works pretty well it's not as fast as if the data was local to the nodes but um what Vertica did is they built a caching layer on on each of the node and that's what the depot represents so the depot is some amount of disk that is relatively local to the ec2 node and so when the query runs on remote stores on the on the s3 data it then queues up the data for download to the nodes and so the data will get will reside in the Depot so that the next query or the subsequent subsequent queries can run on local storage instead of remote stores and that speeds things up quite a bit so that that's that's what the role of the Depot is the depot is basically a caching layer and we'll talk about the details of how we can see your in our Depot the other thing that I want to point out is that since this is the cloud another problem that helps us solve is the concurrency problem so you can imagine that these three nodes are one sort of cluster and what we can do is we can spit up another three nodes and have it point to the same s3 communal storage bucket so now we've got six nodes pointing to the same data but we've you isolated each of the three nodes so that they act as if they are their own cluster and so vertical calls them sub-clusters so we've got two sub clusters each of which has three nodes and what this has essentially done it is it doubled the concurrency doubled the number of queries that can run at any given time because we've now got this new place which new this new chunk of compute which which can answer queries and so that has given us the ability to add concurrency much faster and I'll point out that for since it's cloud and and there are on-demand pricing models we can have significant savings because when a sub cluster is not needed we can stop it and we pay almost nothing for it so that's that's really really important really helpful especially for our workload which I pointed out before was so lumpy so those hours of the day when it's relatively quiet I can go and stop a bunch of sub clusters and and I will pay for them so that that yields nice cost savings let's be on in a nutshell obviously engineers and the documentation can use a lot more information and I'm happy to field questions later on as well but I want to talk about how how we implemented beyond at the trade desk and so I'll start on the left hand side at the top the the what we're representing here is some clusters so there's some cluster 0 r e t l sub cluster and it is a our primary sub cluster so when you 
get into the world of eon there's primary Club questions and secondary sub classes and it has to do with quorum so primary sub clusters are the sub clusters that we always expect to be up and running and they they contribute to quorum they decide whether there's enough instances number a number of enough nodes to have the database start up and so these this is where we run our ETL workload which is the ingest the match in the aggregate part of the work that I talked about earlier so these nodes are always up and running because our ETL pipeline is always on we're internet ad tech company like I mentioned and so we're constantly getting costly running ad and there's always data flowing into the system and the matching is happening in the aggregation so that part happens 24/7 and we wanted so that those nodes will always be up and running and we need this we need that those process needs to be super efficient and so what that is reflected in our instance type so each of our sub clusters is sixty four nodes we'll talk about how we came at that number but the infant type for the ETL sub cluster the primary subclusters is I 3x large so that is one of the instance types that has quite a bit of nvme stores attached and we'll talk about that but on 32 cores 240 four gigs of ram on each node and and that what that allows us to do I should have put the amount of nvme but I think it's seven terabytes for anything me storage what that allows us to do is to basically ensure that our ETL everything that this sub cluster does is always in Depot and so that that makes sure that it's always fast now when we get to the secondary subclusters these are as mentioned secondary so they can stop and start and it won't affect the cluster going up or down so they're they're sort of independent and we've got four what we call Rhian subclusters and and they're not read by definition or technically they're not read only any any sub cluster can ingest and create your data within the database and that'll all get that'll all get pushed to the s3 bucket but logically for us they're read only like these we just most of these the work that they happen to do is read only which it is which is nice because if it's read only it doesn't need to worry about commits and we let we let the primary subclusters or ETL so close to worry about committing data and we don't have to we don't have to have the all nodes in the database participating in transaction commits so we've got a for read subclusters and we've got one EP also cluster so a total of five sub clusters each so plus they're running sixty-four nodes so that gives us a 320 node database all things counted and not all those nodes are up at the same time as I mentioned but often often for big chunks of the days most of the read nodes are down but they do all spin up during our during our busy time so for the reading so clusters we've got I three for Excel so again the I three incidents family type which has nvme stores these notes have I think three and a half terabytes of nvme per node we just rate it to nvme drives we raid zero them together and 16 cores 122 gigs of ram so these are smaller you'll notice but it works out well for us because the the read workload is is typically dealing with much smaller data sets than then the ingest or the aggregation workbook so we can we can run these workloads on on smaller instances and leave a little bit of money and get more granularity with how many sub clusters are stopped and started at any given time the nvme doesn't persist the 
data on it isn't persisted remember you stop and start this is an important detail but it's okay because the depot does a pretty good job in that in that algorithm where it pulls data in that's recently used and the that gets pushed out a victim is the data that's least reasons use so it was used a long time ago so it's probably not going to be used to get so we've got um five sub-clusters and we have actually got to two of those so we've got a 320 node cluster in u.s. East and a 320 node cluster in u.s. West so we've got a high availability region diversity so and their peers like I talked about before they're they're independent but but yours they are each run 128 shards and and so with that what that which shards are is basically the it's similar to segmentation when you take those dataset you divide it into chunks and though and each sub cluster can concede want the data set in its entirety and so each sub cluster is dealing with 128 shards it shows 128 because it'll give us even distribution of the data on 64 node subclusters 60 120 might evenly by 64 and so there's so there's no data skew and and we chose 128 because the sort of ginger proof in case we wanted to double the size of any of the questions we can double the number of notes and we still have no excuse the data would be distributed evenly the disk what we've done is so we've got a couple of raid arrays we've got an EBS based array that they're catalog uses so the catalog storage location and I think we take for for EBS volumes and raid 0 them together and come up with 128 gigabyte Drive and we wanted an EPS for the catalog because it we can stop and start nodes and that data will persist it will come back when the node comes up so we don't have to run a bunch of configuration when the node starts up basically the node starts it automatically joins the cluster and and very strongly there after it starts processing work let's catalog and EBS now the nvme is another raid zero as I mess with this data and is ephemeral so let me stop and start it goes away but basically we take 512 gigabytes of the nvme and we give it to the data temp storage location and then we take whatever is remaining and give it to the depot and since the ETL and the reading clusters are different instance types they the depot is is side differently but otherwise it's the same across small clusters also it all adds up what what we have is now we we stopped the purging data for some of our big a grits we added bunch more columns and what basically we at this point we have 8 petabytes of raw data in each Jian cluster and it is obviously about 4 times what we can hold in our enterprise classes and we can continue to add to this maybe we need to add compute maybe we don't but the the amount of data that can can be held there against can obviously grow much more we've also built in auto scaling tool or service that basically monitors the queue that I showed you earlier monitors for those spikes I want to see as low spikes it then goes and starts up instances one sub-collector any of the sub clusters so that's that's how that's how we we have compute match the capacity match that's the demand also point out that we actually have one sub cluster is a specialized nodes it doesn't actually it's not strictly a customer reports sub clusters so we had this this tool called planner which basically optimizes ad campaigns for for our customers and we built it it runs on Vertica uses data and Vertica runs vertical queries and it was it was wildly successful um so we 
wanted to have some dedicated compute and beyond witty on it made it really easy to basically spin up one of these sub clusters or new sub cluster and say here you go planner team do what you want you can you can completely maximize the resources on these nodes and it won't affect any of the other operations that were doing the ingest the matching the aggregation or the reports up so it gave us a great deal of flexibility and agility which is super helpful so the question is has it been worth it and without a doubt the answer is yes we're doing things that we never could have done before sort of with reasonable cost we have lots more data specialized nodes and more agility but how do you quantify that because I don't want to try to quantify it for you guys but it's difficult because each eon we still have some enterprise nodes by the way cost as you have two of them but we also have these Eon clusters and so they're there they're running different workloads the aggregation is different the ingest is running more on eon does the number of nodes is different the hardware is different so there are significant differences between enterprise and and beyond and when we combine them together to do the entire workload but eon is definitely doing the majority of the workload it has most of the data it has data that goes is much older so it handles the the heavy heavy lifting now the query performance is more anecdotal still but basically when the data is in the Depot the query performance is very similar to enterprise quite close when the data is not in Depot and it needs to run our remote storage the the query performance is is is not as good it can be multiples it's not an order not orders of magnitude worse but certainly multiple the amount of time that it takes to run on enterprise but the good news is after the data downloads those young clusters quickly catch up as the cache populates there of cost I'd love to be able to tell you that we're running to X the number of reports or things are finishing 8x faster but it's not that simple as you Iran is that you it is me I seem to have gotten to thank you you hear me okay I can hear you now yeah we're still recording but that's fine we can edit this so if I'm just talking to the person the support person he will extend our recording time so if you want to maybe pick back up from the beginning of the slide and then we'll just edit out this this quiet period that we have sir okay great I'm going to go back on mute and why don't you just go back to the previous slide and then come into this one again and I'll make sure that I tell the person who yep perfect and then we'll continue from there is that okay yeah sound good all right all right I'm going back on yet so the question is has it been worth it and for us the answer has been a resounding yes we're doing things that we never could have done at reasonable cost before and we got more data we've got this Y note this law has nodes and in work we're much more agile so how to quantify that um well it's not quite as simple and straightforward as you might hope I mean we still have enterprise clusters we've got to update the the four that we had at peak so we've still got two of those around and we got our two yawn clusters but they're running different workloads and they're comprised of entirely different hardware the dependence has I've covered the number of nodes is different for sub-clusters so 64 versus 50 is going to have different performance the the workload itself the aggregation is aggregating 
more columns on yon because that's where we have disk available the queries themselves are different they're running more more queries on more intensive data intensive queries on yon because that's where the data is available so in a sense it is Jian is doing the heavy lifting for the cluster for our workload in terms of query performance still a little anecdotal but like when the queries that run on the enterprise cluster the performance matches that of the enterprise cluster quite closely when the data is in the Depot when the data is not in a Depot and Vertica has to go out to the f32 to get the data performance degrades as you might expect it can but it depends on the curious all things like counts counts are is really fast but if you need lots of the data from the material others to realize lots of columns that can run slower I'm not orders of magnitude slower but certainly multiple of the amount of time in terms of costs anecdotal will give a little bit more quantifying here so what I try to do is I try to figure out multiply it out if I wanted to run the entire workload on enterprise and I wanted to run the entire workload on e on with all the data we have today all the queries everything and to try to get it to the Apple tab so for enterprise the the and estimate that we do need approximately 18,000 cores CPU cores all together and that's a big number but that's doesn't even cover all the non-trivial engineering work that would need to be required that I kind of referenced earlier things like starting the data among multiple clusters migrating the data from one culture to another the daisy chain type stuff so that's that's the data point now for eon is to run the entire workload estimate we need about twenty thousand four hundred and eighty CPU cores so more CPU cores uh then then enterprise however about half of those and partly ten thousand of both CPU cores would only run for about six hours per day and so with the on demand and elasticity of the cloud that that is a huge advantage and so we are definitely moving as fast as we can to being on all Aeon we have we have time left on our contract with the enterprise clusters or not we're not able to get rid of them quite yet but Eon is certainly the way of the future for us I also want to point out that uh I mean yawn is we found to be the most efficient MPP database on the market and what that refers to is for a given dollar of spend of cost we get the most from that zone we get the most out of Vertica for that dollar compared to other cloud and MPP database platforms so our business is really happy with what we've been able to deliver with Yan Yan has also given us the ability to begin a new use case which is probably this case is probably pretty familiar to folks on the call where it's UI based so we'll have a website that our customers can log into and on that website they'll be able to run reports on queries through the website and have that run directly on a separate row to get beyond cluster and so much more latent latency sensitive and concurrency sensitive so the workflow that I've described up until this point has been pretty steady throughout the day and then we get our spike and then and then it goes back to normal for the rest of the day this workload it will be potentially more variable we don't know exactly when our engineers are going to deliver some huge feature that is going to make a 1-1 make a lot of people want to log into the website and check how their campaigns are doing so we but Yohn really helps us with 
this because we can add a capacity so easily we cannot compute and we can add so we can scale that up and down as needed and it allows us to match the concurrency so beyond the concurrency is much more variable we don't need a big long lead time so we're really excited about about this so last slide here I just want to leave you with some things to think about if you're about to embark or getting started on your journey with vertically on one of the things that you'll have to think about is the no account in the shard count so they're kind of tightly coupled the node count we determined by figuring like spinning up some instances in a single sub cluster and getting performance smaller to finding an acceptable performance considering current workload future workload for the queries that we had when we started and so we went with 64 we wanted to you want to certainly want to increase over 50 but we didn't want to have them be too big because of course it costs money and so what you like to do things in power to so 64 nodes and then the shard count for the shards again is like the data segmentation is a new type of segmentation on the data and the start out we went with 128 it began the reason is so that we could have no skew but you know could process the same same amount of data and we wanted to future-proof it so that's probably it's probably a nice general recommendation doubleness account for the nodes the instance type and and how much people space those are certainly things you're going to consider like I was talking about we went for they I three for Excel I 3/8 Excel because they offer good good Depot stores which gives us a really consistent good performance and it is all in Depot the pretty good mud presentation and some information on on I think we're going to use our r5 or the are for instance types for for our UI cluster so much less the data smaller so much less enter this on Depot so we don't need on that nvm you stores the reader we're going to want to have a reserved a mix of reserved and on-demand instances if you're if you're 24/7 shop like we are like so our ETL subclusters those are reserved instances because we know we're going to run those 24 hours a day 365 days a year so there's no advantage of having them be on-demand on demand cost more than reserve so we get cost savings on on figuring out what we're going to run and have keep running and it's the read subclusters that are for the most part on on demand we have one of our each sub Buster's is actually on 24/7 because we keep it up for ad-hoc queries your analyst queries that we don't know when exactly they're going to hit and they want to be able to continue working whenever they want to in terms of the initial data load the initial data ingest what we had to do and now how it works till today is you've got to basically load all your data from scratch there isn't a great tooling just yet for data populate or moving from enterprise to Aeon so what we did is we exported all the data in our enterprise cluster into park' files and put those out on s3 and then we ingested them into into our first Eon cluster so it's kind of a pain we script it out a bunch of stuff obviously but they worked and the good news is that once you do that like the second yon cluster is just a bucket copy in it and so there's tools missions that can help help with that you're going to want to manage your fetches and addiction so this is the data that's in the cache is what I'm referring to here the data that's in the default and so like I 
You're also going to want to manage your fetches and evictions; what I'm referring to here is the data that's in the cache, the data that's in the depot. Like I talked about, we have our ETL cluster, which holds the most recent data, data that's just been ingested and just been aggregated, so really recent data. We wouldn't want anybody logging into that ETL cluster and running queries on big aggregates going back one to three years, because that would invalidate the cache: the depot would start pulling in the historical data, accessing the historical data and evicting the recent data, which would slow down the ETL pipelines. We didn't want that, so we need to make sure that users, whether they're service accounts or human users, are connecting to the right Eon cluster. We just manage users with IPs and target groups to point them at the right cluster; it was definitely something to think about.

Lastly, if you're like us and you're going to want to stop and start nodes, you're going to have to have a service that does that for you. We built a very simple tool that basically monitors the queue and stops and starts subclusters accordingly. We're hoping we can work with Vertica to have that be a little more driven by the cloud configuration itself; for us that's all Amazon, and we'd love to have it scale with what AWS can provide.

Two final things to watch out for when you're working with Eon. The first is system table queries on storage-layer metadata. The thing to be careful of is that the storage-layer metadata is replicated: it's kept as a copy for each of the subclusters that are out there. We have the ETL subcluster and our read subclusters, so for each of the five subclusters there is a copy of all the data in the storage_containers system table and all the data in the partitions system table. When you use these system tables to analyze how much data you have, or for any other analysis, make sure you filter your query on node name. For us, that's node name less than or equal to node 64, because each of our subclusters has 64 nodes, so we limit the query to the 64-node ETL subcluster; without this filter, we would get 5x the values for counts and that sort of thing.

And lastly, there is a problem we're still working on and thinking about, which is DC table data for subclusters that are stopped. When the instances are stopped, literally the operating system is down and there's no way to access it, so it takes the DC table data with it. After my subclusters scale up in the morning and then scale back down, I can't run DC table queries on what performed well and where, because that data is local to those nodes. So that's something to be aware of, and we're working on a solution, an implementation to try to pull that data out of all those read-only nodes that stop and start all the time and bring it into some other kind of repository, perhaps another Vertica cluster, so that we can run analysis and monitoring even when they're down. That's it; thanks for taking the time to listen to my presentation.
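As a hedged illustration of that node-name filter, here is a sketch assuming a database named analytics (so nodes are named like v_analytics_node0064) and the vertica-python client; the exact node-name pattern and system table columns should be checked against your own cluster before relying on the numbers.

```python
# Sketch of the metadata-counting pitfall: storage metadata is replicated per
# subcluster, so unfiltered aggregates over v_monitor.storage_containers
# overcount (5x here, with five subclusters). Node names are hypothetical.
import vertica_python

conn_info = {"host": "eon-db.example.com", "port": 5433, "user": "dbadmin",
             "password": "...", "database": "analytics"}

QUERY = """
    SELECT SUM(used_bytes)
    FROM v_monitor.storage_containers
    WHERE node_name <= 'v_analytics_node0064'  -- one 64-node subcluster only
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    print(cur.fetchone()[0])  # without the filter this would read ~5x too high
```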
>>Thank you, Ron, that was a tremendous amount of information; thank you for sharing it with everyone. We have some questions coming in that I'd like to present to you, Ron, if you have a couple of minutes. Let's jump right in with the first one: loading 85 terabytes of data per day is a pretty significant amount. What format does that data come in, and what does the load process look like?

>>Yeah, great question. The format is tab-separated files that are gzip-compressed, and the reason for that is basically historical: we don't have many tabs in our data, and this is how the data gets compressed and moved off of our bidders, the things that generate most of this data. So it's TSV, gzip-compressed. As for how we load it, I'd say we actually have kind of a Cadillac loader, in a couple of different respects. One is that we've got a homegrown orchestration layer managing the logs, the data that gets loaded into Vertica. We accumulate data, then take some files and push them out, distributing them among the ETL nodes in the cluster; so we're literally pushing the files to the nodes, then we run a copy statement to ingest the data into the database, and then we remove the files from the nodes themselves. That's a little bit of extra data movement, which we may think about changing in the future as we move more and more to Eon. The really nice thing about this, especially for the enterprise clusters, is that the copy statements are really fast. Copy statements use memory, like any other query, and the performance of a copy statement is really sensitive to the amount of available memory; since the data is local to the nodes, literally in the data directory I referenced earlier, the copy statement can read it from the NVMe stores and run very fast, and then that memory is available for something else. So we pay a little bit of cost in latency and in downloading the data to the nodes. As we move more and more to Eon we might start ingesting directly from S3 instead of copying to the nodes first; we'll see. But that's how we load the data.
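As a rough illustration of the per-file ingest step just described, here is a hedged sketch: a gzip-compressed TSV that the orchestration layer has already pushed onto an ETL node is ingested with a COPY statement reading the node-local file. The host, table, and file paths are hypothetical, and the COPY options should be checked against your Vertica version.

```python
# Sketch of one per-file ingest: COPY a node-local gzip'd TSV into Vertica.
# All names and paths are hypothetical placeholders.
import vertica_python

conn_info = {"host": "etl-node-01.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "analytics"}

COPY_SQL = """
    COPY public.impressions
    FROM '/data/incoming/impressions_20200401.tsv.gz' GZIP
    DELIMITER E'\\t'
    REJECTED DATA '/data/rejects/impressions_20200401.txt'
    DIRECT  -- write straight to disk storage, typical for large batch loads
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(COPY_SQL)  # fast when the file sits on node-local NVMe
    conn.commit()
```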
>>Interesting. Thanks, Ron. Another question: what was the biggest challenge you found when migrating from on-prem to AWS?

>>Yeah, a couple of things come to mind. The first was the bulk data load; it was kind of a pain, like I referenced on that last slide, only because we didn't have tools built to do this, so we had to script some stuff out. It wasn't overly complex, but it's just a lot of data to move, even starting with two petabytes: making sure there's no missed data, no gaps, while moving it off the enterprise cluster. What we did is we exported it to local disk on the enterprise clusters, pushed it up to S3, and then ingested it into Eon, again from the Parquet files. So it's a lot of data to move around, and you have to take an outage at some point, stop loading data while you do that final catch-up phase. So that was a challenge, a sort of one-time challenge. The other thing, not so much something we're dealing with now, but a challenge, is that Eon is still a relatively new product for Vertica. One of the big advantages of Eon is that it allows us to stop and start nodes, and recently Vertica has gotten quite good at stopping and starting nodes; for a while there it took a really long time to start a node back up, and it could be invasive, but we worked with the engineering team and others to really reduce that, and now it's not really an issue we think much about.

>>Thanks. Towards the end of the presentation you said that you've got 128 shards, but your subclusters are usually around 64 nodes, so you talked about a ratio of two to one. Why is that, and if you were to do it again, would you use 128 shards?

>>Ah, good question. The reason is that we wanted to future-proof ourselves. Basically, we wanted to make sure the number of shards was evenly divisible by the number of nodes; I could have done that with 64, with 128, or with any other multiple of 64, but we went with 128 to protect ourselves in the future, so that if we wanted to double the number of nodes in the ETL Eon cluster specifically, we could: doubling from 64 to 128 nodes, each node would have just one shard to deal with, so no skew. As for the second part of the question, if I had to do it over again, I think I would have stuck with 128. We've been running this cluster for more than 18 months now and we haven't needed to increase the number of nodes, so in that sense it's been a little bit of extra overhead having more shards, but it gives us the peace of mind that we can easily double and not have to worry about it. So I think two to one is a nice place to start, and you might even consider three to one or four to one if you're expecting really rapid growth, if you're just getting started with Eon and your business and your data are small now but you expect them to grow significantly.

>>Great, thank you, Ron. That's all the questions we have for today. If you do have others, please feel free to send them in and we will respond directly via email; our engineers will also be available on the Vertica forums, where you can continue the discussion with them. I want to thank Ron for the great presentation, and the audience for your participation and questions. Please note that a replay of today's event and a copy of the slides will be available on demand shortly, and of course we invite you to share this information with your colleagues. Again, thank you, and this concludes this webinar. Have a great day.