Ruairí McBride, Arrow ECS & Brian McCloskey, NetApp | NetApp Insight Berlin 2017
>> Narrator: Live from Berlin, Germany, it's the Cube, covering NetApp Insight 2017, brought to you by NetApp. Welcome back to the Cube's live coverage of NetApp Insight 2017, we're here in Berlin, Germany, I'm your host, Rebecca Knight, along with my cohost Peter Burris. We have two guests on the program now, we have Ruairí McBride, who is the technical account manager at Arrow, and Brian McCloskey, who is the vice president worldwide for hyperconverged infrastructure at NetApp. Brian, Ruairí, thanks so much for coming on the show. >> Thanks. >> Let me start with you, Brian. Talk a little bit, tell our viewers a little bit about the value that HCI delivers to customers, especially in terms of simplifying the data. >> In a nutshell, what NetApp HCI does is it takes what would normally be hours and hours to implement a solution and hundreds of inputs, generally over 400 inputs, and it simplifies it down to under 30 inputs in an installation that will be done within 45 minutes. Traditionally HCI solutions have similar implementation characteristics, but you lose some of the enterprise flexibility and scale that customers of NetApp have come to expect over the years. What we've done is we've provided that simplicity, while allowing customers to have the enterprise capabilities and flexibility that they've grown accustomed to. >> Is this something that you are talking with customers about, in terms of the simplicity, what were you hearing from customers? >> Most customers these days are challenged with this: everybody has to find a way to do more with less, or to do minimally a lot more with the same. If you think of NetApp, we've always been wonderful about giving customers a great production experience. When you buy a typical NetApp product, you're gonna own it for three, four or five years and it will continue. NetApp has always been great for that three, four and five year time frame, and what we've done with HCI is we've really simplified the beginning part of that curve, of how you get it from the time it lands on your dock to implemented and usable by your users in a short manner. That's what HCI has brought to the NetApp portfolio, that's incremental to what was there before. >> One of the advantages to third parties that work closely with NetApp is that by having a simpler approach of doing things, you can do more of them, but on the other hand, you want to ensure that you're also focused on the value add. In the field, when you're sitting down with a customer and working with them to ensure that they get the value that they want from these products, how do you effect that balance? As the product becomes simpler, the customer is now able to focus more on other things, other than configuration and implementation. >> Being able to get to doing something with your data is the key. You need a low bar of entry, which a lot of the software and hardware providers are trying to do today. I think HCI just has to pull all of that together, which is great. We're hearing from third party vendors that it's great that from day one they've been integrated into the overall portfolio message, and I think customers are just gonna be pretty excited with what they can do from zero with this hardware. >> When you think about ultimately how they're gonna spend their time, what are they going to be doing instead of all this configuration work? What is Arrow gonna be doing now that you're not doing that value added configuration work?
>> Hopefully, we'll be helping them to realize the full potential of what they bought. Rather than spending a lot of time trying to make the hardware work, they're concentrating more on delivering a service or an application back to the business that's gonna generate some revenue. At Arrow we're talking a lot to people about IoT, and it's gonna be the next wave of information that people are gonna have to deal with, and having a stable product that can support that and provide value from that information back to the business is gonna be key. >> Brian, HCI, as you noted, dramatically reduces the time to get to value, not only now, but it also sustains that level of simplicity over the life of the utilization of the product. How does it fit into the rest of the NetApp product set, the rest of the NetApp portfolio? What does it make better, what makes it better in addition to just the HCI product? >> NetApp has a really robust portfolio of offerings that we, at a high level, categorize into our next generation offerings, which are SolidFire, FlexPod SolidFire, StorageGRID and hyperconverged, and then the traditional NetApp ONTAP based offerings. The glue between the whole portfolio is the data fabric, and HCI is very tightly integrated into the data fabric. One of the innovations we are delivering is SnapMirror integration of the HCI platform into the traditional ONTAP family of products. You can seamlessly move data from our hyperconverged system to a traditional ONTAP based system, and it also gives you seamless mobility to either your own private cloud or to public cloud platforms. As a company with a wide portfolio, it gives us the ability to be consultative with our partners and our customers. What we want is, and we feel, customers are best served on NetApp and we want them to use NetApp, and if an ONTAP based system is a better solution for them than hyperconverged, then that's absolutely what we will recommend for them. To your earlier question about the partners, one of the interesting things with HCI is it's the first time as NetApp we're delivering an integrated system with compute and with a hypervisor, it comes preconfigured with VMware, and it's a wonderful opportunity for our partners to add incremental value through the sales cycle to what they've brought to NetApp in the past. Because as NetApp, we're really storage experts, where our partners have a much wider and deeper understanding of the whole ecosystem than we do. It's been interesting for us to have discussions with partners, cuz we're learning a lot, because we're now involved in layers and we're deeply involved at higher levels of the stack than we have been. >> I'm really interested in that, because you say that you have this consultative relationship with these customers, how are you able to learn from them, their best practices, and then do you transfer what you've learned to other partners and other customers? >> From the customer, and we try and disseminate the learning as much as we can, but we're a huge organization with many account teams. It all starts with what the customer wants to accomplish. Minimally they need a solution that's gonna plug in and do what they expect it to do today. What's the more important part is what their vision is for where they wanna be three years down the road, five years down the road, 10 years down the road. It's that vision piece that tends to drive more towards one part of the portfolio than the other. >> Take us through how this works.
You walk into an account, presumably Arrow ECS has a customer. The Arrow ECS customer says, "Well, we have an issue that's going to require some specialized capabilities in how we use our data." You can look at a lot of different options, but you immediately think NetApp. What is it that leads you to NetApp HCI versus ONTAP, versus SolidFire, is there an immediate characteristic where you say, "That's HCI"? >> I would say that the driving factor was the fact that they wanted something that's simple and easy to manage. They want to get a Mongo database up and running, or they've got some other application that really depends on their business. The underlying hardware needs to function. Brian was saying that it's got Element OS sitting underneath it, which is in its 10th iteration, and you've got VMware version six, which is the most adopted virtualization platform out there. These are two best-of-breed partnerships coming together and people are happy with that, and can move, and manage it from a single pane of glass moving forward from day one, right the way through to when they need to transition to a new platform, which is seamless for them. That's great from any application point, because you don't wanna worry about the health of things, you wanna be able to give an application back to the business. We talked about education, this event is geared towards bringing customers together with NetApp and understanding the messaging around HCI, which is great. >> What are the things that you keep hearing from customers, this need for data simplicity, this need for huge time saving products and services? What do you think, if you can think three to five years down the road, what will the next generation of concerns be, and how are you, I'm gonna use the word that we're hearing a lot, future proofing what you're doing now to serve those customers' needs of the future? >> Three to five years down the road. I can't predict three to five years out very reliably. >> But you can predict that they're gonna have more data, they're going to merge it in new and unseen ways, and they need to do it more cheaply. >> The future proofing really comes in from the data fabric. With the integration into the data fabric, you could have information that started on a NetApp system that was announced eight years ago, seamlessly moves into a SolidFire all-flash array, which seamlessly moves to a hyperconverged system, which seamlessly moves to your private cloud, which eventually moves off to a public cloud, and you can bring it back into any tiers, and wherever you want that data in six, seven, eight years, the data fabric will extend to it. Within each individual product, there are investment protection technologies within each one, but it's the data fabric that should make customers feel comfortable that no matter where they're gonna end up, taking their first step with NetApp is a step in the right direction. >> The value added ecosystem that NetApp and others use, and Arrow ECS has a big play around that, has historically been tied back into hardware assets. How does it feel to be moving more into worrying about your customers' data assets? >> I think it's an exciting time to be bringing those things together.
At the end of the day, it's what the customer wants. They want a solution that integrates seamlessly, whether that be from the rack right the way up to the application, they want something that they can get on their phone, they want something they can get on their tablet, they want the same experience regardless of whether they're in an airplane or right next to the data center. The demand on data is huge and will only get bigger over the next five years. I was looking at a cover of Forbes magazine, it was from a number of years ago, about Nokia and how can anybody ever catch them, and where are they now? I think you need to be able to spot the changes and adapt quickly, and to steal one of the comments from the keynote yesterday, moving from a survivor to a thriver with your data is gonna be key to those companies. >> In talking about the demands on data growing, it's also true that the demands on data professionals are growing too. How is that changing the way you recruit and retain top talent? >> For us, as NetApp, if you were to look at what we wanted in the CV five years ago, we wanted people that understood storage, we wanted people that knew about volumes, that knew about data layouts, that knew how to maximize performance by physical placement of data, and now what we're looking for is people that really understand the whole stack and that can talk to customers about their application needs, their business problems, can talk to developers. Because what we've done is we've taken those people that were good in all those other things I mentioned, and when you ask them what did you love about this product, none of them ever came back and said I loved the first week I spent installing it. We've taken that away and we've let them do more interesting work. A challenge for us, as a collective society, is to make sure we bring people forward from an education and skills enablement perspective, so they're capable of rising to that next level of demand, but we're taking a lot of the busy work out. >> Making sure that they have the skills to be able to take what they're seeing in the data and then take action. >> We want our customers to look at NetApp as a data expert that can work with them on their business problem, not a storage expert that can explain how an array works. >> Brian, Ruairí, thank you so much for coming on the show, it's been a great conversation. >> Thank you. >> Thank you very much. >> You are watching the Cube, we will have more from NetApp Insight, I'm Rebecca Knight for Peter Burris, in just a little bit.
Evan Kaplan, InfluxData | AWS re:invent 2022
>> Hey everyone. Welcome to Las Vegas. The Cube is here, live at the Venetian Expo Center for AWS re:Invent 2022. Amazing attendance. This is day one of our coverage. Lisa Martin here with Dave Vellante. David, it's great to see so many people back. We're gonna be talking, we've been having great conversations already. We have wall to wall coverage for the next three and a half days. When we talk to companies, customers, every company has to be a data company. And one of the things I think we learned in the pandemic is that access to real time data and real time analytics is no longer a nice to have, it's a differentiator and a competitive advantage. >> It's all about data. I mean, you know, I love the topic and it's got so many dimensions and such texture, can't get enough of data. >> I know we have a great guest joining us. One of our alumni is back, Evan Kaplan, the CEO of InfluxData. Evan, thank you so much for joining us. Welcome back to the Cube. >> Thanks for having me. It's great to be here. >> So here we are, day one. I was telling you before we went live, we're nice and fresh hosts. Talk to us about what's new at InfluxData since the last time we saw you at re:Invent. >> That's great. So first of all, we should acknowledge what's going on here. This is pretty exciting. Yeah, I know there was a show last year, but this feels like the first post-Covid show, a lot of energy, a lot of attention despite a difficult economy. In terms of, you know, you guys were commenting in the lead-in to big data, I think, you know, if we were to talk about big data five, six years ago, what would we be talking about? We'd be talking about Hadoop, we were talking about Cloudera, we were talking about Hortonworks, we were talking about big data lakes, data stores. I think what's happened is this interesting dynamic of, let's call it if you will, the secularization of data, in which it breaks into different fields, almost a taxonomy. You've got this set of search data, you've got this observability data, you've got graph data, you've got document data, and now you have time series data. >> And what you're seeing in the market is this incredible capability by developers, and a mostly open source dynamic driving this incredible capability of developers to assemble data platforms that aren't unicellular, that aren't just built on Hadoop or Oracle or Postgres or MySQL, but in fact represent different data types. So for us, what we care about is time series, we care about anything that happens in time, where time can be the primary measurement, which if you think about it, is a huge proportion of real data. Cuz when you think about what drives AI, you think about what happened, what happened, what happened, what happened, what's going to happen. That's the functional thing. But what happened is always defined by a period, a measurement, a time. And so what's new for us is we've developed this new open source engine called IOx. And so it's basically a refresh of the whole database, a columnar database that uses Apache Arrow, Parquet and DataFusion and turns it into a super powerful real time analytics platform. It was already pretty real time before, but it's increasingly so now, and it adds SQL capability and infinite cardinality. And so it handles bigger data sets, but importantly, not just bigger but faster, faster data. So that's primarily what we're talking about at the show. >> So how does that affect where you can play in the marketplace?
Is it, I mean, how does it affect your total available market, your customer opportunities? >> Yeah, great question. I think it's really an interesting market, in that you've got all of these different approaches to database. Whether you take data warehouses from Snowflake, or arguably Databricks also. And you take these individual database companies like Mongo, Influx, Neo4j, Elastic, and people like that. I think the commonality you see across the whole of them is, many of 'em, if not all of them, are based on some sort of open source dynamic. So I think that is an intractable trend that will continue on. But in terms of the broader database market, our total expanse, total available TAM, lots of these things are coming together in interesting ways. And so the wave that we wanna ride, because it's all big data and it's all increasingly fast data and it's all machine learning and AI, is really around that measurement issue. That instrumentation, the idea that if you're gonna build any sophisticated system, it starts with instrumentation, and the journey is defined by instrumentation. So we view ourselves as that instrumentation tooling for understanding complex systems. >> And how, a quick follow up. Why did you say arguably Databricks? I mean, open source ethos? >> Well, I was saying arguably Databricks cuz Spark, I mean it's a great company and it's based on Spark, but there's quite a gap between Spark and what Databricks is today. And in some ways Databricks, from the outside looking in, looks a lot like Snowflake to me, looks a lot like a really sophisticated data warehouse with a lot of post-processing capabilities >> And with an open source less >> Than a >> Core database. Yeah. Right, right, right. Yeah, I totally agree. Okay, thank you for that >> Part, that was not arguably like they're not a good company or >> No, no. They got great momentum and I'm just curious. Absolutely. You know, so, >> So talk a little bit about IOx and what it is enabling you guys to achieve from a competitive advantage perspective. The key differentiators, give us that scoop. >> So if you think about it, our old storage engine was called TSM, also open sourced, right? And IOx is open sourced, and the old storage engine was really built around these time series measurements, particularly metrics, lots of metrics, and handling those at scale and making it super easy for developers to use. But our old data engine only supported either a custom graphical UI that you'd build yourself on top of it, or a dashboarding tool like Grafana or Chronograf or things like that. With IOx, two or three interventions were important. One is we now support, we'll support things like Tableau, Microsoft BI, and so you're taking that same data that was available for instrumentation and now you're using it for business intelligence also. So that became super important, and it kind of answers your question about the expanded market, it expands the market. The second thing is, when you're dealing with time series data, you're dealing with this concept of cardinality, which is, and I don't know if you're familiar with it, but the idea that it's a multiplication of measurements in a table. And so the more measurements you want over the more series you have, you have this really expanding exponential set that can choke a database off.
And the way we've designed IOx, it handles what we call infinite cardinality, where you don't even have to think about that from a design point of view. And then lastly, the query performance is dramatically better. And so it's pretty exciting. >> So the unlimited cardinality, basically you could identify relationships between data and different databases. Is that right? Between >> The same database but different measurements, different tables, yeah. Yeah. Right. Yeah, yeah. So you can handle, so you could say, I wanna look at the way the noise levels perform in this room according to 400 different locations, on 25 different days, over seven months of the year. And each one is a measurement, each one adds to cardinality. And you can say, I wanna search on Tuesdays in December, what the noise level is at 2:21 PM, and you get a very quick response. That kind of instrumentation is critical to smarter systems. >> How are you able to process that data at a performance level that doesn't bring the database to its knees? What's the secret sauce behind that? >> It's a columnar database. It's built on Parquet and Apache Arrow. But, to say it without a much longer conversation, it's an architecture that's really built for pulling that kind of data. If you know the data is time series and you're looking for a time measurement, you already have the ability to optimize pretty dramatically. >> So it's that purpose built aspect of it. It's the >> Purpose built aspect. You couldn't take Postgres and do the same >> Thing. Right? Because a lot of vendors say, oh yeah, we have time series now. Yeah. Right. So yeah. Yeah. Right. >> And they >> Do. Yeah. But >> It's not, it's not. The founding of the company came because Paul Dix was working on Wall Street building time series databases on HBase, on MySQL, on other platforms, and realized every time we do it, we have to rewrite the code, we build a bunch of application logic to handle all this. We're talking about, we have customers that are adding hundreds of millions to billions of points a second. So you're talking about an ingest level, you know, you think about all those data points, you're talking about an ingest level that databases just aren't designed for. Right? And so it's not just us, our competitors also build good time series databases. And so the category is really emergent. Yeah. >> Sure. Talk about a favorite customer story you think really articulates the value of what Influx is doing, especially with IOx. >> Yeah, sure. And I love this story because, you know, Tesla may not be in favor because of the latest Elon Musk escapades, but we've had about a four year relationship with Tesla, where they built their Powerwall technology around recording, seeing your device, seeing the stuff, seeing the charging on your car. It's all captured in Influx databases that are reporting from Powerwalls and Megapacks all over the world. And they report to a central place at Tesla's headquarters, and it reports out to your phone, and so you can see it. And what's really cool about this to me is I've got two Tesla cars and I've got Tesla solar roof tiles. So I watch this data all the time. So it's a great customer story. And actually if you go on our website, you can see I did an hour interview with the engineer that designed the system, cuz the system is super impressive and I just think it's really cool.
Plus it's, you know, it's all the good green stuff that we really appreciate, supporting sustainability, right? Yeah. >> Right, right. Talk about, from a what's-in-it-for-me-as-a-customer perspective, what you guys have done, the change to IOx, what are some of the key features of it and the key values in it for customers like Tesla, like other industry customers as well? >> Well, so it's relatively new. It just arrived in our cloud product. So Tesla's not using it today. We have a first set of customers starting to use it. And it's in open source, so it's a very popular project in the open source world. But the key issues are really the stuff that we've kind of covered here, which is that broad SQL environment. So accessing all those SQL developers, the same people who code against Snowflake's data warehouse or Databricks or Postgres, can now code against Influx, opening up the BI market. It's the cardinality, it's the performance. It's really an architecture, it's the next gen. We've been doing this for six years, it's the next generation of everything we've seen about how you make time series be super performant. And that's only relevant because more and more things are becoming real time as we develop smarter and smarter systems. The journey is pretty clear. You instrument the system, you let it run, you watch for anomalies, you correct those anomalies, you re-instrument the system. You do that 4 billion times, you have a self-driving car. You do that 55 times, you have a better podcast that is handling its audio better, right? So everything is on that journey of getting smarter and smarter. >> So you guys are the big committers to IOx, right? Yes. And how, talk about how you support and develop the surrounding developer community, how you get that flywheel effect going. >> First, I mean it's actually a really kind of, let's call it, it's more art than science. Yeah. First of all, you come up with an architecture that really resonates for developers. And Paul Dix, our founder, really is a developer's developer. And so he started talking about this in the community, about an architecture that uses Apache Arrow and Parquet, which is, you know, now becoming the standard for file formats, that uses Apache Arrow for directing queries and things like that, and uses DataFusion, and said what this thing needs is a columnar database that sits behind all of this stuff and integrates it. And he started talking about it two years ago, and then he started publishing the IOx commits in GitHub. And slowly, but over time, in Hacker News and other places, people go, oh yeah, this is fundamentally right.
For developer, >>It comes from that original story where, where Paul would have to write six months of application logic and stuff to build a time series based applications. And so Paul's notion was, and this was based on the original Mongo, which was very successful because it was very easy to use relative to most databases. So Paul developed this commitment, this idea that I quickly joined on, which was, hey, it should be relatively quickly for a developer to build something of import to solve a problem, it should be able to happen very quickly. So it's got a schemaless background so you don't have to know the schema beforehand. It does some things that make it really easy to feel powerful as a developer quickly. And if you think about that journey, if you feel powerful with a tool quickly, then you'll go deeper and deeper and deeper and pretty soon you're taking that tool with you wherever you go, it becomes the tool of choice as you go to that next job or you go to that next application. And so that's a fundamental way we think about it. To be honest with you, we haven't always delivered perfectly on that. It's generally in our dna. So we do pretty well, but I always feel like we can do better. >>So if you were to put a bumper sticker on one of your Teslas about influx data, what would it >>Say? By the way, I'm not rich. It just happened to be that we have two Teslas and we have for a while, we just committed to that. The, the, so ask the question again. Sorry. >>Bumper sticker on influx data. What would it say? How, how would I >>Understand it be time to Awesome. It would be that that phrase his time to Awesome. Right. >>Love that. >>Yeah, I'd love it. >>Excellent time to. Awesome. Evan, thank you so much for joining David, the >>Program. It's really fun. Great thing >>On Evan. Great to, you're on. Haven't Well, great to have you back talking about what you guys are doing and helping organizations like Tesla and others really transform their businesses, which is all about business transformation these days. We appreciate your insights. >>That's great. Thank >>You for our guest and Dave Ante. I'm Lisa Martin, you're watching The Cube, the leader in emerging and enterprise tech coverage. We'll be right back with our next guest.
Anais Dotis Georgiou, InfluxData | Evolving InfluxDB into the Smart Data Platform
>> Okay, we're back. I'm Dave Vellante with The Cube, and you're watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData. Anais Dotis Georgiou is here. She's a developer advocate for InfluxData, and we're gonna dig into the rationale and value contribution behind several open source technologies that InfluxDB is leveraging to increase the granularity of time series analysis and bring the world of data into real-time analytics. Anais, welcome to the program. Thanks for coming on. >> Hi, thank you so much. It's a pleasure to be here. >> Oh, you're very welcome. Okay, so IOx is being touted as this next gen open source core for InfluxDB. And my understanding is that it leverages in memory, of course, for speed. It's a columnar store, so it gives you compression efficiency, it's gonna give you faster query speeds, it's gonna store files in object storage. So you've got a very cost effective approach. Are these the salient points on the platform? I know there are probably dozens of other features, but what are the high level value points that people should understand? >> Sure, that's a great question. So some of the main requirements that IOx is trying to achieve, and some of the most impressive ones to me, the first one is that it aims to have no limits on cardinality and also allow you to write any kind of event data that you want, whether that's a tag or a field. It also wants to deliver best in class performance on analytics queries, in addition to our already well served metrics queries. We also wanna have operator control over memory usage, so you should be able to define how much memory is used for buffering, caching and query processing. Some other really important parts are the ability to have bulk data export and import, super useful. Also, broader ecosystem compatibility: where possible we aim to use and embrace emerging standards in the data analytics ecosystem and have compatibility with things like SQL, Python, and maybe even pandas in the future. >> Okay, so a lot there. Now we talked to Brian about how you're using Rust, which is not a new programming language, and of course we had some drama around Rust during the pandemic with the Mozilla layoffs, but the formation of the Rust Foundation really addressed any of those concerns. You got big guns like Amazon and Google and Microsoft throwing their collective weights behind it. Adoption is really starting to get steep on the S-curve. So lots of platforms, lots of adoption with Rust, but why Rust as an alternative to, say, C++ for example? >> Sure, that's a great question. So Rust was chosen because of its exceptional performance and reliability. So while Rust is syntactically similar to C++ and it has similar performance, it also compiles to native code like C++. But unlike C++, it also has much better memory safety. So memory safety is protection against bugs or security vulnerabilities that lead to excessive memory usage or memory leaks. And Rust achieves this memory safety due to its innovative type system. Additionally, it doesn't allow for dangling pointers, and dangling pointers are the main classes of errors that lead to exploitable security vulnerabilities in languages like C++.
So Rust helps meet that requirement of having no limits on cardinality, for example, because we're also using the Rust implementation of Apache Arrow and this control over memory. And also Rust's packaging system, called crates.io, offers everything that you need out of the box to have features like async and await to fix race conditions, to protect against buffer overflows, and to ensure thread safe async caching structures as well. So essentially it just has all the fine grain control you need to take advantage of memory and all your resources as well as possible, so that you can handle those really, really high cardinality use cases. >> Yeah, and the more I learned about the new engine and the platform, IOx, et cetera, you know, you see things like, you know, in the old days, and even today, you do a lot of garbage collection in these systems, and there's an inverse impact relative to performance. So it looks like you're really, you know, the community is modernizing the platform. But I wanna talk about Apache Arrow for a moment. It's designed to address the constraints that are associated with analyzing large data sets. We know that, but please explain why, what is Arrow and what does it bring to InfluxDB? >> Sure, yeah. So Arrow is a framework for defining in-memory columnar data, and so much of the efficiency and performance of IOx comes from taking advantage of columnar data structures. And I will, if you don't mind, take a moment to kind of illustrate why columnar data structures are so valuable. Let's pretend that we are gathering field data about the temperature in our room and also maybe the temperature of our stove. And in our table we have those two temperature values, as well as maybe a measurement value, a timestamp value, maybe some other tag values that describe what room and what house, et cetera, we're getting this data from. And so you can picture this table where we have like two rows with the two temperature values for both our room and the stove. Well, usually our room temperature is regulated, so those values don't change very often. >> So when you have column oriented storage, essentially you take each column and group it together. And so if that's the case, and you're just taking temperature values from the room, and a lot of those temperature values are the same, then you might be able to imagine how equal values will then neighbor each other, and when they neighbor each other in the storage format, this provides a really perfect opportunity for cheap compression. And then this cheap compression enables high cardinality use cases. It also enables faster scan rates. So if you wanna find, like, the min and max value of the temperature in the room across a thousand different points, you only have to get those thousand different points in order to answer that question, and you have those immediately available to you. But let's contrast this with a row oriented storage solution instead, so that we can understand better the benefits of column oriented storage. >> So if you had row oriented storage, you'd first have to look at every field, like the temperature in the room and the temperature of the stove. You'd have to go across every tag value that maybe describes where the room is located or what model the stove is, and every timestamp. You'd then have to pluck out that one temperature value that you want at that one timestamp, and do that for every single row.
So you're scanning across a ton more data, and that's why row oriented doesn't provide the same efficiency as columnar, and Apache Arrow is an in-memory columnar data framework. So that's where a lot of the advantages come >> From. Okay. So you've basically described like a traditional database, a row approach, but I've seen a lot of traditional databases say, okay, now we can handle columnar format. Versus what you're talking about is really, you know, kind of native. Is it not as effective, is the format not as effective because it's largely a bolt-on? Can you elucidate on that front? >> Yeah, it's not as effective because you have more expensive compression and because you can't scan across the values as quickly. And so those are pretty much the main reasons why row oriented storage isn't as efficient as column oriented storage. >> Yeah. Got it. So let's talk about Arrow DataFusion. What is DataFusion? I know it's written in Rust, but what does it bring to the table here? >> Sure. So it's an extensible query execution framework, and it uses Arrow as its in-memory format. So the way that it helps InfluxDB IOx is that, okay, it's great if you can write an unlimited amount of cardinality into InfluxDB, but if you don't have a query engine that can successfully query that data, then I don't know how much value it is for you. So DataFusion helps enable the query process and transformation of that data. It also has a pandas API, so that you could take advantage of pandas data frames as well and all of the machine learning tools associated with pandas. >> Okay. You're also leveraging Parquet in the platform, of course. We heard a lot about Parquet in the middle of the last decade as a storage format to improve on Hadoop column stores. What are you doing with Parquet and why is it important? >> Sure. So Parquet is the column oriented durable file format. So it's important because it'll enable bulk import and bulk export. It has compatibility with Python and pandas, so it supports a broader ecosystem. Parquet files also take very little disk space and they're faster to scan because, again, they're column oriented. In particular, I think Parquet files are like 16 times cheaper than CSV files, just as kind of a point of reference. And so that's essentially a lot of the benefits of Parquet. >> Got it. Very popular. So what exactly is InfluxData focusing on as a committer to these projects? What is your focus? What's the value that you're bringing to the community? >> Sure. So InfluxData has contributed a lot of different things to the Apache ecosystem. For example, they contributed an implementation of Apache Arrow in Go, and that will support querying with Flux. Also, there have been quite a few contributions to DataFusion for things like memory optimization and support of additional SQL features, like support for timestamp arithmetic and support for EXISTS clauses and support for memory control. So yeah, Influx has contributed a lot to the Apache ecosystem and continues to do so. And I think kind of the idea here is that if you can improve these upstream projects, then the long term strategy here is that the more you contribute and build those up, then the more you will perpetuate that cycle of improvement, and the more we will invest in our own project as well. So it's just that kind of symbiotic relationship and appreciation of the open source community. >> Yeah. Got it.
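To make the columnar story above concrete, here is a small, generic pyarrow sketch (not InfluxDB code) of the room-and-stove example she uses: a column-grouped table, the cheap min/max scan over a single column, and the Parquet round trip she credits for bulk import and export. The field names are made up for illustration.

```python
# Hedged, generic pyarrow sketch of the columnar ideas discussed above.
# Field names (temp_room, temp_stove) are illustrative, not an InfluxDB schema.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# A small columnar table: each column is stored contiguously, so the
# mostly-constant room temperatures sit next to each other and compress well.
table = pa.table({
    "time": pa.array([1, 2, 3, 4], type=pa.int64()),
    "room": ["kitchen"] * 4,                   # tag-like column
    "temp_room": [21.0, 21.0, 21.0, 21.1],     # field that rarely changes
    "temp_stove": [20.0, 150.0, 180.0, 60.0],  # field that changes a lot
})

# The fast scan she describes: min/max over one column touches only that column.
print(pc.min_max(table["temp_room"]))

# Durable columnar persistence: write compressed, read back only needed columns.
pq.write_table(table, "temps.parquet", compression="zstd")
subset = pq.read_table("temps.parquet", columns=["time", "temp_room"])
print(subset)
```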
You got that virtuous cycle going, what people call the flywheel. Give us your last thoughts and kind of summarize, you know, what the big takeaways are from your perspective. >> So I think the big takeaway is that InfluxData is doing a lot of really exciting things with InfluxDB IOx. And I really encourage, if you are interested in learning more about the technologies that Influx is leveraging to produce IOx, the challenges associated with it and all of the hard work, and you just wanna learn more, then I would encourage you to go to the monthly tech talks and community office hours; they are on every second Wednesday of the month at 8:30 AM Pacific time. There's also a community forum and a community Slack channel. Look for the influxdb underscore iox channel specifically to learn more about how to join those office hours and those monthly tech talks, as well as ask any questions you have about IOx, what to expect, and what you'd like to learn more about. As a developer advocate, I wanna answer your questions. So if there's a particular technology or stack that you wanna dive deeper into and want more explanation about how InfluxDB leverages it to build IOx, I will be really excited to produce content on that topic for you. >> Yeah, that's awesome. You guys have a really rich community, collaborate with your peers, solve problems, and you guys are super responsive, so really appreciate that. All right, thank you so much, Anais, for explaining all this open source stuff to the audience and why it's important to the future of data. >> Thank you. I really appreciate it. >> All right, you're very welcome. Okay, stay right there, and in a moment I'll be back with Tim Yokum. He's the director of engineering for InfluxData, and we're gonna talk about how you update a SaaS engine while the plane is flying at 30,000 feet. You don't wanna miss this.
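Continuing the illustrative example, the DataFusion engine she references also ships Python bindings that can run SQL directly over that same Parquet file and hand the result to pandas. This is a hedged sketch based on the package's documented SessionContext interface; the table and column names carry over from the made-up example above.

```python
# Hedged sketch of SQL over Parquet via the DataFusion Python bindings.
# Assumes the `datafusion` package; table/column names are illustrative.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("temps", "temps.parquet")  # file from the sketch above

df = ctx.sql("""
    SELECT room,
           min(temp_room)  AS min_room,
           max(temp_stove) AS max_stove
    FROM temps
    GROUP BY room
""")

# DataFusion returns Arrow record batches; to_pandas() hands them to the
# pandas/machine-learning ecosystem she mentions.
print(df.to_pandas())
```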
Brian Gilmore, InfluxData | Evolving InfluxDB into the Smart Data Platform
>> This past May, The Cube, in collaboration with InfluxData, shared with you the latest innovations in time series databases. We talked at length about why a purpose built time series database, for many use cases, was a superior alternative to general purpose databases trying to do the same thing. Now, you may remember that time series data is any data that's stamped in time, and if it's stamped, it can be analyzed historically. And when we introduced the concept to the community, we talked about how, in theory, those time slices could be taken, you know, every hour, every minute, every second, down to the millisecond, and how the world was moving toward realtime or near realtime data analysis to support physical infrastructure like sensors and other devices and IoT equipment. Time series databases have had to evolve to efficiently support realtime data in emerging use cases in IoT and other areas. >> And to do that, new architectural innovations have to be brought to bear. As is often the case, open source software is the linchpin to those innovations. Hello and welcome to Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and produced by the Cube. My name is Dave Vellante and I'll be your host today. Now, in this program, we're going to dig pretty deep into what's happening with time series data generally, and specifically how InfluxDB is evolving to support new workloads and demands and data, specifically around data analytics use cases in real time. Now, first we're gonna hear from Brian Gilmore, who is the director of IoT and emerging technologies at InfluxData. And we're gonna talk about the continued evolution of InfluxDB and the new capabilities enabled by open source generally and specific tools. And in this program, you're gonna hear a lot about things like Rust, the implementation of Apache Arrow, the use of Parquet, and tooling such as DataFusion, which are powering a new engine for InfluxDB. >> Now, these innovations, they evolve the idea of time series analysis by dramatically increasing the granularity of time series data, by compressing the historical time slices, if you will, from, for example, minutes down to milliseconds. And at the same time, enabling real time analytics with an architecture that can process data much faster and much more efficiently. Now, after Brian, we're gonna hear from Anais Dotis Georgiou, who is a developer advocate at InfluxData. And we're gonna get into the why of these open source capabilities and how they contribute to the evolution of the InfluxDB platform. And then we're gonna close the program with Tim Yokum, he's the director of engineering at InfluxData, and he's gonna explain how the InfluxDB community actually evolved the data engine in mid-flight and which decisions went into the innovations that are coming to the market. Thank you for being here. We hope you enjoy the program. Let's get started. Okay, we're kicking things off with Brian Gilmore. He's the director of IoT and emerging technology at InfluxData. Brian, welcome to the program. Thanks for coming on. >> Thanks Dave. Great to be here. I appreciate the time. >> Hey, explain why InfluxDB, you know, needs a new engine. Was there something wrong with the current engine? What's going on there? >> No, no, not at all. I mean, I think it's, for us, it's been about staying ahead of the market.
I think, you know, if we think about what our customers are coming to us with now, you know, related to requests like SQL query support, things like that, we have to figure out a way to execute those for them in a way that will scale long term. And then we also wanna make sure we're innovating, we're sort of staying ahead of the market as well and sort of anticipating those future needs. So, you know, this is really a transparent change for our customers. I mean, I think we'll be adding new capabilities over time that sort of leverage this new engine, but you know, initially the customers who are using us are gonna see just great improvements in performance, you know, especially those that are working at the top end of the workload scale, you know, the massive data volumes and things like that. >> Yeah, and we're gonna get into that today, and the architecture and the like, but what was the catalyst for the enhancements? I mean, when and how did this all come about? >> Well, I mean, like three years ago we were primarily on premises, right? I mean, I think we had our open source, we had an enterprise product, you know, and sort of shifting that technology, especially the open source code base, to a service basis where we were hosting it through, you know, multiple cloud providers, that was a long journey, I guess. You know, phase one was, you know, we wanted to host enterprise for our customers, so we sort of created a service where we just managed and ran our enterprise product for them. You know, phase two of this cloud effort was to optimize for like multi-tenant, multi-cloud, be able to host it in a truly SaaS manner where we could use, you know, some type of customer activity or consumption as the pricing vector, you know. And that was sort of the birth of the first real InfluxDB Cloud, you know, which has been really successful. >> We've seen, I think, like 60,000 people sign up, and we've got tons and tons of both enterprises as well as like new companies, developers, and of course a lot of home hobbyists and enthusiasts who are using us on a daily basis, you know. And having that sort of big pool of very diverse customers to chat with as they're using the product, as they're giving us feedback, et cetera, has, you know, pointed us in a really good direction in terms of making sure we're continuously improving that, and then also making these big leaps as we're doing with this new engine. >> Right. So you've called it a transparent change for customers, so I'm presuming it's non-disruptive, but I really wanna understand how much of a pivot this is, and what does it take to make that shift from, you know, time series specialist to real time analytics, and being able to support both? >> Yeah, I mean, it's much more of an evolution, I think, than like a shift or a pivot. You know, time series data is always gonna be fundamental and sort of the basis of the solutions that we offer our customers, and then also the ones that they're building on the sort of raw APIs of our platform themselves. You know, the time series market is one that we've worked diligently to lead.
I mean, I think when it comes to like metrics, especially like sensor data and app and infrastructure metrics, if we're being honest though, I think our, our user base is well aware that the way we were architected was much more towards those sort of like backwards looking historical type analytics, which are key for troubleshooting and making sure you don't, you know, run into the same problem twice. But, you know, we had to ask ourselves like, what can we do to like better handle those queries from a performance and a, and a, you know, a time to response on the queries, and can we get that to the point where the results sets are coming back so quickly from the time of query that we can like limit that window down to minutes and then seconds. >>And now with this new engine, we're really starting to talk about a query window that could be like returning results in, in, you know, milliseconds of time since it hit the, the, the ingest queue. And that's, that's really getting to the point where as your data is available, you can use it and you can query it, you can visualize it, and you can do all those sort of magical things with it, you know? And I think getting all of that to a place where we're saying like, yes to the customer on, you know, all of the, the real time queries, the, the multiple language query support, but, you know, it was hard, but we're now at a spot where we can start introducing that to, you know, a a limited number of customers, strategic customers and strategic availability zones to start. But you know, everybody over time. >>So you're basically going from what happened to in, you can still do that obviously, but to what's happening now in the moment? >>Yeah, yeah. I mean, if you think about time, it's always sort of past, right? I mean, like in the moment right now, whether you're talking about like a millisecond ago or a minute ago, you know, that's, that's pretty much right now, I think for most people, especially in these use cases where you have other sort of components of latency induced by the, by the underlying data collection, the architecture, the infrastructure, the, you know, the, the devices and you know, the sort of highly distributed nature of all of this. So yeah, I mean, getting, getting a customer or a user to be able to use the data as soon as it is available is what we're after here. >>I always thought, you know, real, I always thought of real time as before you lose the customer, but now in this context, maybe it's before the machine blows up. >>Yeah, it's, it's, I mean it is operationally or operational real time is different, you know, and that's one of the things that really triggered us to know that we were, we were heading in the right direction, is just how many sort of operational customers we have. You know, everything from like aerospace and defense. We've got companies monitoring satellites, we've got tons of industrial users, users using us as a processes storing on the plant floor, you know, and, and if we can satisfy their sort of demands for like real time historical perspective, that's awesome. I think what we're gonna do here is we're gonna start to like edge into the real time that they're used to in terms of, you know, the millisecond response times that they expect of their control systems. Certainly not their, their historians and databases. >>I, is this available, these innovations to influx DB cloud customers only who can access this capability? >>Yeah. I mean, commercially and today, yes. 
You know, I think we want to emphasize that's a, for now our goal is to get our latest and greatest and our best to everybody over time. Of course. You know, one of the things we had to do here was like we double down on sort of our, our commitment to open source and availability. So like anybody today can take a look at the, the libraries in on our GitHub and, you know, can ex inspect it and even can try to, you know, implement or execute some of it themselves in their own infrastructure. You know, we are, we're committed to bringing our sort of latest and greatest to our cloud customers first for a couple of reasons. Number one, you know, there are big workloads and they have high expectations of us. I think number two, it also gives us the opportunity to monitor a little bit more closely how it's working, how they're using it, like how the system itself is performing. >>And so just, you know, being careful, maybe a little cautious in terms of, of, of how big we go with this right away. Just sort of both limits, you know, the risk of, of, you know, any issues that can come with new software rollouts. We haven't seen anything so far, but also it does give us the opportunity to have like meaningful conversations with a small group of users who are using the products, but once we get through that and they give us two thumbs up on it, it'll be like, open the gates and let everybody in. It's gonna be exciting time for the whole ecosystem. >>Yeah, that makes a lot of sense. And you can do some experimentation and, you know, using the cloud resources. Let's dig into some of the architectural and technical innovations that are gonna help deliver on this vision. What, what should we know there? >>Well, I mean, I think foundationally we built the, the new core on Rust. You know, this is a new very sort of popular systems language, you know, it's extremely efficient, but it's also built for speed and memory safety, which goes back to that us being able to like deliver it in a way that is, you know, something we can inspect very closely, but then also rely on the fact that it's going to behave well. And if it does find error conditions, I mean, we, we've loved working with Go and, you know, a lot of our libraries will continue to, to be sort of implemented in Go, but you know, when it came to this particular new engine, you know, that power performance and stability rust was critical. On top of that, like, we've also integrated Apache Arrow and Apache Parque for persistence. I think for anybody who's really familiar with the nuts and bolts of our backend and our TSI and our, our time series merged Trees, this is a big break from that, you know, arrow on the sort of in MI side and then Par K in the on disk side. >>It, it allows us to, to present, you know, a unified set of APIs for those really fast real time inquiries that we talked about, as well as for very large, you know, historical sort of bulk data archives in that PARQUE format, which is also cool because there's an entire ecosystem sort of popping up around Parque in terms of the machine learning community, you know, and getting that all to work, we had to glue it together with aero flight. That's sort of what we're using as our, our RPC component. You know, it handles the orchestration and the, the transportation of the Coer data. 
Now we're moving to like a true Coer database model for this, this version of the engine, you know, and it removes a lot of overhead for us in terms of having to manage all that serialization, the deserialization, and, you know, to that again, like blurring that line between real time and historical data. It's, you know, it's, it's highly optimized for both streaming micro batch and then batches, but true streaming as well. >>Yeah. Again, I mean, it's funny you mentioned Rust. It is, it's been around for a long time, but it's popularity is, is, you know, really starting to hit that steep part of the S-curve. And, and we're gonna dig into to more of that, but give us any, is there anything else that we should know about Bryan? Give us the last word? >>Well, I mean, I think first I'd like everybody sort of watching just to like, take a look at what we're offering in terms of early access in beta programs. I mean, if, if, if you wanna participate or if you wanna work sort of in terms of early access with the, with the new engine, please reach out to the team. I'm sure you know, there's a lot of communications going out and, you know, it'll be highly featured on our, our website, you know, but reach out to the team, believe it or not, like we have a lot more going on than just the new engine. And so there are also other programs, things we're, we're offering to customers in terms of the user interface, data collection and things like that. And, you know, if you're a customer of ours and you have a sales team, a commercial team that you work with, you can reach out to them and see what you can get access to because we can flip a lot of stuff on, especially in cloud through feature flags. >>But if there's something new that you wanna try out, we'd just love to hear from you. And then, you know, our goal would be that as we give you access to all of these new cool features that, you know, you would give us continuous feedback on these products and services, not only like what you need today, but then what you'll need tomorrow to, to sort of build the next versions of your business. Because, you know, the whole database, the ecosystem as it expands out into to, you know, this vertically oriented stack of cloud services and enterprise databases and edge databases, you know, it's gonna be what we all make it together, not just, you know, those of us who were employed by Influx db. And then finally, I would just say please, like watch in ice in Tim's sessions, Like these are two of our best and brightest. They're totally brilliant, completely pragmatic, and they are most of all customer obsessed, which is amazing. And there's no better takes, like honestly on the, the sort of technical details of this, then there's, especially when it comes to like the value that these investments will, will bring to our customers and our communities. So encourage you to, to, you know, pay more attention to them than you did to me, for sure. >>Brian Gilmore, great stuff. Really appreciate your time. Thank you. >>Yeah, thanks Dave. It was awesome. Look forward to it. >>Yeah, me too. Looking forward to see how the, the community actually applies these new innovations and goes, goes beyond just the historical into the real time, really hot area. As Brian said in a moment, I'll be right back with Anna East Dos Georgio to dig into the critical aspects of key open source components of the Influx DB engine, including Rust, Arrow, Parque, data fusion. Keep it right there. You don't want to miss this.
Evolving InfluxDB into the Smart Data Platform
>>This past May, theCUBE, in collaboration with InfluxData, shared with you the latest innovations in time series databases. We talked at length about why a purpose-built time series database is, for many use cases, a superior alternative to general-purpose databases trying to do the same thing. You may remember that time series data is any data that's stamped in time, and if it's stamped, it can be analyzed historically. When we introduced the concept to the community, we talked about how, in theory, those time slices could be taken every hour, every minute, every second, down to the millisecond, and how the world was moving toward real-time or near-real-time data analysis to support physical infrastructure like sensors, other devices, and IoT equipment. Time series databases have had to evolve to efficiently support real-time data in emerging IoT and other use cases. >>And to do that, new architectural innovations have to be brought to bear. As is often the case, open source software is the linchpin of those innovations. Hello, and welcome to Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and produced by theCUBE. My name is Dave Vellante and I'll be your host today. In this program we're going to dig pretty deep into what's happening with time series data generally, and specifically how InfluxDB is evolving to support new workloads, demands, and data, particularly around real-time data analytics use cases. First we're gonna hear from Brian Gilmore, who is the director of IoT and emerging technologies at InfluxData. We're gonna talk about the continued evolution of InfluxDB and the new capabilities enabled by open source generally and by specific tools. In this program you're gonna hear a lot about things like Rust, the implementation of Apache Arrow, the use of Parquet, and tooling such as DataFusion, which power a new engine for InfluxDB. >>These innovations evolve the idea of time series analysis by dramatically increasing the granularity of time series data, compressing the historical time slices, if you will, from, for example, minutes down to milliseconds, and at the same time enabling real-time analytics with an architecture that can process data much faster and much more efficiently. After Brian, we're gonna hear from Anais Dotis-Georgiou, who is a developer advocate at InfluxData. We're gonna get into the why of these open source capabilities and how they contribute to the evolution of the InfluxDB platform. Then we're gonna close the program with Tim Yokum, the director of engineering at InfluxData, who's gonna explain how the InfluxDB community actually evolved the data engine in mid-flight and which decisions went into the innovations that are coming to market. Thank you for being here. We hope you enjoy the program. Let's get started. Okay, we're kicking things off with Brian Gilmore. He's the director of IoT and emerging technology at InfluxData. Brian, welcome to the program. Thanks for coming on. >>Thanks Dave. Great to be here. I appreciate the time. >>Hey, explain why InfluxDB needs a new engine. Was there something wrong with the current engine? What's going on there? >>No, not at all. For us, it's been about staying ahead of the market.
I think if we look at what our customers are coming to us with now, requests like SQL query support and things like that, we have to figure out a way to execute those for them in a way that will scale long term. We also wanna make sure we're innovating and staying ahead of the market, anticipating those future needs. So this is really a transparent change for our customers. We'll be adding new capabilities over time that leverage this new engine, but initially the customers who are using us are gonna see just great improvements in performance, especially those that are working at the top end of the workload scale, with massive data volumes and things like that. >>Yeah, and we're gonna get into that today, and the architecture and the like, but what was the catalyst for the enhancements? When and how did this all come about? >>Well, three years ago we were primarily on premises. We had our open source product, we had an enterprise product, and shifting that technology, especially the open source code base, to a service basis where we were hosting it through multiple cloud providers was a long journey. Phase one was that we wanted to host Enterprise for our customers, so we created a service where we just managed and ran our enterprise product for them. Phase two of this cloud effort was to optimize for multi-tenant, multi-cloud, to be able to host it in a truly SaaS manner where we could use some type of customer activity or consumption as the pricing vector. And that was the birth of the first real InfluxDB Cloud, which has been really successful. >>We've seen, I think, around 60,000 people sign up, and we've got tons of enterprises as well as new companies, developers, and of course a lot of home hobbyists and enthusiasts using it on a daily basis. Having that big pool of very diverse customers to chat with as they're using the product and giving us feedback has pointed us in a really good direction in terms of making sure we're continuously improving, and also making these big leaps, as we're doing with this new engine. >>Right. So you've called it a transparent change for customers, so I'm presuming it's non-disruptive, but I really wanna understand how much of a pivot this is. What does it take to make that shift from time series specialist to real-time analytics and be able to support both? >>Yeah, it's much more of an evolution, I think, than a shift or a pivot. Time series data is always gonna be fundamental, the basis of the solutions that we offer our customers and also of the ones they're building on the raw APIs of our platform themselves. The time series market is one that we've worked diligently to lead.
When it comes to metrics, especially sensor data and application and infrastructure metrics, if we're being honest, our user base is well aware that the way we were architected was much more toward those backwards-looking, historical types of analytics, which are key for troubleshooting and making sure you don't run into the same problem twice. But we had to ask ourselves what we could do to better handle those queries from a performance and time-to-response standpoint, and whether we could get to the point where the result sets come back so quickly from the time of the query that we can shrink that window down to minutes and then seconds. >>And now with this new engine, we're really starting to talk about a query window that could be returning results in milliseconds from the time the data hit the ingest queue. That's getting to the point where, as soon as your data is available, you can use it, you can query it, you can visualize it, and you can do all those magical things with it. Getting all of that to a place where we're saying yes to the customer on all of the real-time queries and the multiple-language query support was hard, but we're now at a spot where we can start introducing it to a limited number of customers, strategic customers and strategic availability zones to start, but everybody over time. >>So you're basically going from what happened, which you can still do obviously, to what's happening now, in the moment? >>Yeah. If you think about time, it's always sort of in the past, right? In the moment right now, whether you're talking about a millisecond ago or a minute ago, that's pretty much "right now" for most people, especially in these use cases where you have other components of latency induced by the underlying data collection, the architecture, the infrastructure, the devices, and the highly distributed nature of all of this. So yeah, getting a customer or a user to be able to use the data as soon as it is available is what we're after here. >>I always thought of real time as "before you lose the customer," but in this context, maybe it's "before the machine blows up." >>Yeah, operational real time is different, and that's one of the things that really signaled to us that we were heading in the right direction: just how many operational customers we have. Everything from aerospace and defense, companies monitoring satellites, to tons of industrial users using us as a process historian on the plant floor. If we can satisfy their demands for a real-time historical perspective, that's awesome. What we're gonna do here is start to edge into the real time they're used to in terms of the millisecond response times they expect of their control systems, certainly not of their historians and databases. >>Is this available to InfluxDB Cloud customers only? Who can access this capability? >>Commercially and today, yes.
We want to emphasize that, for now, our goal is to get our latest and greatest and our best to everybody over time, of course. One of the things we had to do here was double down on our commitment to open source and availability. So anybody today can take a look at the libraries on our GitHub, inspect them, and even try to implement or execute some of it themselves in their own infrastructure. We're committed to bringing our latest and greatest to our cloud customers first for a couple of reasons. Number one, they are big workloads and they have high expectations of us. Number two, it also gives us the opportunity to monitor a little more closely how it's working, how they're using it, and how the system itself is performing. >>So we're being careful, maybe a little cautious, in terms of how big we go with this right away. That both limits the risk of any issues that can come with new software rollouts, and we haven't seen anything so far, and it also gives us the opportunity to have meaningful conversations with a small group of users who are using the product. Once we get through that and they give us two thumbs up on it, it'll be open the gates and let everybody in. It's gonna be an exciting time for the whole ecosystem. >>Yeah, that makes a lot of sense, and you can do some experimentation using the cloud resources. Let's dig into some of the architectural and technical innovations that are gonna help deliver on this vision. What should we know there? >>Well, foundationally we built the new core on Rust. This is a newer, very popular systems language. It's extremely efficient, but it's also built for speed and memory safety, which goes back to us being able to deliver it in a way that we can inspect very closely, but also rely on it to behave well and handle error conditions gracefully. We've loved working with Go, and a lot of our libraries will continue to be implemented in Go, but when it came to this particular new engine, for power, performance, and stability, Rust was critical. On top of that, we've also integrated Apache Arrow and Apache Parquet for persistence. For anybody who's really familiar with the nuts and bolts of our backend, our TSI index and our time-structured merge trees, this is a big break from that: Arrow on the in-memory side and Parquet on the on-disk side. >>It allows us to present a unified set of APIs both for those really fast real-time queries we talked about and for very large historical bulk data archives in that Parquet format, which is also cool because there's an entire ecosystem popping up around Parquet in the machine learning community. And to get that all to work, we had to glue it together with Arrow Flight. That's what we're using as our RPC component. It handles the orchestration and the transportation of the columnar data.
Now we're moving to a true columnar database model for this version of the engine, and it removes a lot of overhead for us in terms of having to manage all that serialization and deserialization. And to that point again, it blurs the line between real-time and historical data. It's highly optimized for both micro-batch and batch workloads, but for true streaming as well. >>Yeah. It's funny you mention Rust. It's been around for a while, but its popularity is really starting to hit the steep part of the S-curve. We're gonna dig into more of that, but is there anything else we should know about, Brian? Give us the last word. >>Well, first I'd like everybody watching to take a look at what we're offering in terms of early access and beta programs. If you wanna participate, or if you wanna work with the new engine in early access, please reach out to the team. There's a lot of communication going out, and it'll be highly featured on our website. But believe it or not, we have a lot more going on than just the new engine. There are also other programs and things we're offering to customers around the user interface, data collection, and so on. If you're a customer of ours and you have a sales team, a commercial team that you work with, reach out to them and see what you can get access to, because we can flip a lot of stuff on, especially in cloud, through feature flags. >>If there's something new that you wanna try out, we'd just love to hear from you. Our goal would be that as we give you access to all of these new features, you give us continuous feedback on these products and services: not only what you need today, but what you'll need tomorrow to build the next versions of your business. Because the whole database ecosystem, as it expands out into this vertically oriented stack of cloud services, enterprise databases, and edge databases, is gonna be what we all make it together, not just those of us who are employed by InfluxData. And finally, I would just say please watch Anais' and Tim's sessions. These are two of our best and brightest. They're totally brilliant, completely pragmatic, and most of all customer obsessed, which is amazing. There are no better takes, honestly, on the technical details of this, especially when it comes to the value that these investments will bring to our customers and our communities. So I encourage you to pay more attention to them than you did to me, for sure. >>Brian Gilmore, great stuff. Really appreciate your time. Thank you. >>Yeah, thanks Dave. It was awesome. Look forward to it. >>Yeah, me too. Looking forward to seeing how the community actually applies these new innovations and goes beyond just the historical into the real-time, really hot area. As Brian said, in a moment I'll be right back with Anais Dotis-Georgiou to dig into the critical aspects of key open source components of the InfluxDB engine, including Rust, Arrow, Parquet, and DataFusion. Keep it right there. You don't wanna miss this.
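To make the "Arrow in memory, Parquet on disk" split Brian describes a little more concrete, here is a minimal sketch using the open source pyarrow library. This is not InfluxDB's internal Rust engine, just the same Apache building blocks he names, exercised from Python with made-up temperature data:

```python
# Minimal sketch of the Arrow-in-memory / Parquet-on-disk idea using pyarrow.
# Illustrative only; InfluxDB IOx implements this in Rust inside the engine.
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow table is a columnar, in-memory structure: one contiguous array per column.
table = pa.table({
    "time": pa.array([1, 2, 3, 4], type=pa.timestamp("ms")),
    "sensor": ["room", "room", "stove", "stove"],
    "temp_c": [21.0, 21.0, 180.5, 181.0],
})

# Persist the same columns to Parquet, a compressed, columnar on-disk format.
pq.write_table(table, "temps.parquet")

# Reading it back yields an Arrow table again; the two formats are designed to pair up.
restored = pq.read_table("temps.parquet")
print(restored.schema)
print(restored.to_pandas())
```

Arrow Flight, which Brian mentions as the RPC component, is designed to move these same record batches over the network without re-serializing them into a row-based wire format.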
>>Time series data is everywhere. The number of sensors, systems, and applications generating time series data increases every day. All these data sources producing so much data can cause analysis paralysis. InfluxDB is an entire platform designed with everything you need to quickly build applications that generate value from time series data. InfluxDB Cloud is a serverless solution, which means you don't need to buy or manage your own servers. There's no need to worry about provisioning because you only pay for what you use. InfluxDB Cloud is fully managed, so you get the newest features and enhancements as they're added to the platform's code base. It also means you can spend your time building solutions and delivering value to your users instead of wasting time and effort managing something else. InfluxDB Cloud offers a range of security features to protect your data. Multiple layers of redundancy ensure you don't lose any data, and access controls ensure that only the people who should see your data can see it. >>Encryption protects your data at rest and in transit between any of our regions or cloud providers. InfluxDB uses a single API across the entire platform suite, so you can build on open source, deploy to the cloud, and then easily query data in the cloud, at the edge, or on prem using the same scripts. And InfluxDB is schemaless, automatically adjusting to changes in the shape of your data without requiring changes in your application logic. InfluxDB Cloud is production ready from day one. All it needs is your data and your imagination. Get started today at influxdata.com/cloud.
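As one concrete illustration of that "single API" claim, here is a minimal write-and-query sketch using the open source influxdb-client Python package. The URL, token, org, and bucket below are placeholders, not real credentials, and the same style of script can point at InfluxDB Cloud or a self-hosted open source instance:

```python
# Minimal sketch: write one point and query it back with the InfluxDB 2.x Python client.
# All connection details here are placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(
    url="https://us-east-1-1.aws.cloud2.influxdata.com",  # or http://localhost:8086 for OSS
    token="my-token",
    org="my-org",
)

# Write a single temperature reading.
write_api = client.write_api(write_options=SYNCHRONOUS)
point = Point("room_temperature").tag("room", "kitchen").field("temp_c", 21.3)
write_api.write(bucket="my-bucket", record=point)

# Query the last hour of readings with Flux; the same script works against cloud or OSS.
flux = '''
from(bucket: "my-bucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "room_temperature")
'''
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())

client.close()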
>>Okay, we're back. I'm Dave Vellante with theCUBE, and you're watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData. Anais Dotis-Georgiou is here. She's a developer advocate for InfluxData, and we're gonna dig into the rationale and value contribution behind several open source technologies that InfluxDB is leveraging to increase the granularity of time series analysis and bring the world of data into real-time analytics. Anais, welcome to the program. Thanks for coming on. >>Hi, thank you so much. It's a pleasure to be here. >>You're very welcome. Okay, so IOx is being touted as this next-gen open source core for InfluxDB. My understanding is that it leverages in-memory processing, of course, for speed; it's a columnar store, so it gives you compression efficiency and faster query speeds; and you store files in object storage, so you get a very cost-effective approach. Are these the salient points on the platform? I know there are probably dozens of other features, but what are the high-level value points that people should understand? >>Sure, that's a great question. Some of the main requirements that IOx is trying to achieve, and some of the most impressive ones to me: the first is that it aims to have no limits on cardinality, and to let you write any kind of event data you want, whether that's a tag or a field. It also wants to deliver best-in-class performance on analytics queries, in addition to our already well-served metrics queries. We also wanna have operator control over memory usage, so you should be able to define how much memory is used for buffering, caching, and query processing. Another really important part is the ability to do bulk data export and import, which is super useful. Also broader ecosystem compatibility: where possible we aim to use and embrace emerging standards in the data analytics ecosystem and have compatibility with things like SQL, Python, and maybe even Pandas in the future. >>Okay, so a lot there. Now, we talked to Brian about how you're using Rust, which is not a new programming language, and of course we had some drama around Rust during the pandemic with the Mozilla layoffs, but the formation of the Rust Foundation really addressed any of those concerns. You've got big guns like Amazon, Google, and Microsoft throwing their collective weight behind it, so lots of platforms, lots of adoption with Rust. But why Rust as an alternative to, say, C++? >>Sure, that's a great question. Rust was chosen because of its exceptional performance and reliability. While Rust is syntactically similar to C++ and has similar performance, since it also compiles to native code, unlike C++ it has much better memory safety. Memory safety is protection against bugs or security vulnerabilities that lead to excessive memory usage or memory leaks, and Rust achieves this memory safety through its innovative type system. Additionally, it doesn't allow dangling pointers, and dangling pointers are among the main classes of errors that lead to exploitable security vulnerabilities in languages like C++. So Rust helps meet that requirement of having no limits on cardinality, for example, because we're also using the Rust implementation of Apache Arrow together with this control over memory. And Rust's packaging ecosystem, crates.io, offers everything you need out of the box: features like async/await to fix race conditions, protection against buffer overflows, and thread-safe async caching structures as well. Essentially it gives you all the fine-grained control you need to use memory and all your resources as well as possible, so that you can handle those really high-cardinality use cases. >>Yeah, and the more I learn about the new engine and the platform, IOx, et cetera, the more I see things like this: in the old days, and even today, you do a lot of garbage collection in these systems, and there's an inverse impact on performance. So it looks like the community is really modernizing the platform. But I wanna talk about Apache Arrow for a moment. It's designed to address the constraints associated with analyzing large data sets. We know that, but please explain: what is Arrow, and what does it bring to InfluxDB? >>Sure. So Arrow is a framework for defining in-memory columnar data, and so much of the efficiency and performance of IOx comes from taking advantage of columnar data structures. If you don't mind, I'll take a moment to illustrate why columnar data structures are so valuable. Let's pretend we are gathering field data about the temperature in our room and also the temperature of our stove. In our table we have those two temperature values, as well as a measurement value, a timestamp value, and maybe some other tag values that describe what room and what house we're getting this data from.
So you can picture this table where we have rows with the temperature values for both our room and the stove. Usually our room temperature is regulated, so those values don't change very often. >>When you have column-oriented storage, you essentially take each column and group its values together. If that's the case, and you're just taking temperature values from the room, and a lot of those temperature values are the same, then you can imagine how equal values end up neighboring each other in the storage format. That provides a really perfect opportunity for cheap compression, and this cheap compression enables high-cardinality use cases. It also enables faster scan rates. So if you wanna find the min and max value of the temperature in the room across a thousand different points, you only have to read those thousand points in that one column to answer the question, and you have them immediately available to you. Let's contrast this with a row-oriented storage solution so we can better understand the benefits of column-oriented storage. >>With row-oriented storage, you'd first have to look at every field, like the temperature in the room and the temperature of the stove; you'd have to go across every tag value that describes where the room is located or what model the stove is, and every timestamp; and then you'd pluck out the one temperature value you want at that one timestamp, and do that for every single row. So you're scanning across a ton more data, and that's why row-oriented storage doesn't provide the same efficiency as columnar. Apache Arrow is an in-memory columnar data framework, so that's where a lot of the advantages come from. >>Okay. So you basically described a traditional database, a row approach, but a lot of traditional databases say, okay, we can now handle columnar format, versus what you're talking about, which is really native. Is the former not as effective because it's largely a bolt-on? Can you elucidate on that front? >>Yeah, it's not as effective, because you have more expensive compression and because you can't scan across the values as quickly. Those are pretty much the main reasons why row-oriented storage isn't as efficient as column-oriented storage.
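A small sketch of the column-at-a-time scan Anais is describing, again using the open source pyarrow library rather than InfluxDB's own Rust internals, and with made-up readings:

```python
# Illustrative only: computing min/max over one column of a columnar table.
# With a columnar layout, the kernel touches just the "room_temp_c" array,
# not the tags, timestamps, or other fields stored alongside it.
import pyarrow as pa
import pyarrow.compute as pc

readings = pa.table({
    "time": pa.array(range(1_000), type=pa.timestamp("s")),
    "room": ["kitchen"] * 1_000,          # low-variability tag compresses well
    "room_temp_c": [21.0] * 990 + [22.5] * 10,
    "stove_temp_c": [180.0 + i * 0.1 for i in range(1_000)],
})

# Scan a single column to answer the min/max question.
stats = pc.min_max(readings["room_temp_c"])
print(stats)  # min/max of the column, e.g. min=21.0, max=22.5

# Contrast: a row-oriented layout (list of dicts) forces you past every field of every row.
rows = readings.to_pylist()
row_min = min(r["room_temp_c"] for r in rows)
row_max = max(r["room_temp_c"] for r in rows)
print(row_min, row_max)
```

The repeated "kitchen" tag and the long runs of identical temperatures are also exactly the kind of neighboring equal values that make columnar compression cheap, which is her high-cardinality argument in miniature.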
>>Got it. So let's talk about Arrow DataFusion. What is DataFusion? I know it's written in Rust, but what does it bring to the table here? >>Sure. It's an extensible query execution framework, and it uses Arrow as its in-memory format. The way it helps InfluxDB IOx is that it's great if you can write an unlimited amount of cardinality into InfluxDB, but if you don't have a query engine that can successfully query that data, then I don't know how much value it is for you. So DataFusion helps enable the query processing and transformation of that data. It also has a Pandas API, so you can take advantage of Pandas DataFrames as well, and all of the machine learning tools associated with Pandas. >>Okay. You're also leveraging Parquet in the platform. We heard a lot about Parquet in the middle of the last decade as a storage format to improve on Hadoop column stores. What are you doing with Parquet, and why is it important? >>Sure. Parquet is the column-oriented, durable file format. It's important because it enables bulk import and bulk export, and it has compatibility with Python and Pandas, so it supports a broader ecosystem. Parquet files also take very little disk space, and they're faster to scan because, again, they're column oriented. In particular, I think Parquet files are something like 16 times cheaper than CSV files, just as a point of reference. So that's essentially a lot of the benefit of Parquet. >>Got it. Very popular. So what exactly is InfluxData focusing on as a committer to these projects? What is your focus? What's the value that you're bringing to the community? >>Sure. InfluxData has contributed a lot of different things to the Apache ecosystem. For example, we contributed an implementation of Apache Arrow in Go, which supports querying with Flux. There have also been quite a few contributions to DataFusion for things like memory optimization and support for additional SQL features, like timestamp arithmetic, EXISTS clauses, and memory control. So Influx has contributed a lot to the Apache ecosystem and continues to do so. The idea is that if you can improve these upstream projects, then the long-term strategy is that the more you contribute and build those up, the more you perpetuate that cycle of improvement and the more we invest in our own project as well. It's that kind of symbiotic relationship with, and appreciation of, the open source community. >>Yeah. Got it. You've got that virtuous cycle going, what people call the flywheel. Give us your last thoughts and summarize what the big takeaways are from your perspective. >>I think the big takeaway is that InfluxData is doing a lot of really exciting things with InfluxDB IOx, and if you're interested in learning more about the technologies Influx is leveraging to produce IOx, the challenges associated with it, and all of the hard questions, and you just wanna learn more, then I would encourage you to go to the monthly tech talks and community office hours; they're on every second Wednesday of the month at 8:30 AM Pacific time. There's also a community forum and a community Slack channel (look for the influxdb_iox channel specifically) where you can learn how to join those office hours and monthly tech talks, as well as ask any questions you have about IOx, what to expect, and what you'd like to learn more about. As a developer advocate, I wanna answer your questions, so if there's a particular technology or stack that you wanna dive deeper into and want more explanation about how InfluxDB leverages it to build IOx, I will be really excited to produce content on that topic for you. >>Yeah, that's awesome. You guys have a really rich community: collaborate with your peers, solve problems, and you guys are super responsive, so really appreciate that. All right, thank you so much, Anais, for explaining all this open source stuff to the audience and why it's important to the future of data. >>Thank you. I really appreciate it. >>All right, you're very welcome. Okay, stay right there, and in a moment I'll be back with Tim Yokum. He's the director of engineering for InfluxData, and we're gonna talk about how you update a SaaS engine while the plane is flying at 30,000 feet. You don't wanna miss this.
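To make the Parquet-versus-CSV point from Anais' segment tangible, here is a small, hedged experiment with pyarrow and synthetic sensor data. The exact size ratio depends heavily on the data, encodings, and compression codec, so treat the roughly 16x figure she cites as an illustration rather than a guarantee:

```python
# Rough illustration: the same synthetic sensor table written as CSV and as Parquet.
# The size ratio you observe will vary with the data and the compression codec.
import os
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

n = 100_000
table = pa.table({
    "time": pa.array(range(n), type=pa.timestamp("s")),
    "sensor": ["line-1"] * n,                      # repetitive tag: dictionary-encodes well
    "temp_c": [20.0 + (i % 10) * 0.1 for i in range(n)],
})

pacsv.write_csv(table, "sensor.csv")
pq.write_table(table, "sensor.parquet", compression="zstd")

csv_size = os.path.getsize("sensor.csv")
parquet_size = os.path.getsize("sensor.parquet")
print(f"CSV: {csv_size} bytes, Parquet: {parquet_size} bytes, "
      f"ratio: {csv_size / parquet_size:.1f}x")
```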
>>I'm really glad that we went with InfluxDB Cloud for our hosting, because it has saved us a ton of time. It's helped us move faster, it's saved us money, and InfluxDB has good support. My name's Alex Nauda. I am CTO at Nobl9. Nobl9 is a platform to measure and manage service level objectives, which are a great way of measuring the reliability of your systems. You can essentially think of an SLO, the product we're providing to our customers, as a bunch of time series, so we need a way to store that data and the corresponding time series related to it. The main reason we settled on InfluxDB as we were shopping around is that InfluxDB has a very flexible query language, and as a general-purpose time series database it basically had the set of features we were looking for. >>As our platform has grown, we've found InfluxDB Cloud to be a really scalable solution. We can quickly iterate on new features and functionality because InfluxDB Cloud is entirely managed; it has probably saved us at least a full additional person on our team. We also have the option of running InfluxDB Enterprise, which gives us the ability to host off the cloud or in a private cloud if that's preferred by a customer. InfluxData has been really flexible in adapting to the hosting requirements we have. They listened to the challenges we were facing and helped us solve them. As we've continued to grow, I'm really happy we have InfluxData by our side. >>Okay, we're back with Tim Yokum, who is the director of engineering at InfluxData. Tim, welcome. Good to see you. >>Good to see you. Thanks for having me. >>You're really welcome. Listen, we've been covering open source software on theCUBE for more than a decade, and we've watched the innovation from the big data ecosystem. The cloud has been built out on open source, mobile, social platforms, key databases, and of course InfluxDB, and InfluxData has been a big consumer of and contributor to open source software. So my question to you is, where have you seen the biggest bang for the buck from open source software? >>Influx really thrives at the intersection of commercial services and open source software. OSS keeps us on the cutting edge. We benefit from OSS in delivering our own service, from our core storage engine technologies to web services and templating engines. Our team stays lean and focused because we build on proven tools. We really build on the shoulders of giants, and, like you mentioned, even better, we contribute a lot back to the projects we use, as well as to our own product, InfluxDB. >>But I gotta ask you, Tim, because one of the challenges we've seen, and you saw this in the heyday of Hadoop, is that the innovations come so fast and furious, and as a software company you gotta place bets, you gotta commit people, and sometimes those bets can be risky and not pay off. How have you managed this challenge? >>Oh, it moves fast. That's a benefit, though, because the community moves so quickly that today's hot technology can be tomorrow's dinosaur. What we tend to do is fail fast and fail often. We try a lot of things. You look at Kubernetes, for example: that ecosystem is driven by thousands of intelligent developers, engineers, and builders adding value every day, so we have to really keep up with that.
And as the stack changes, we try different technologies and different methods, and at the end of the day we come up with a better platform as a result of the constant change in the environment. It is a challenge for us, but it's something we just do every day. >>So we have a survey partner down in New York City called Enterprise Technology Research, ETR, and they do these quarterly surveys of about 1,500 CIOs and IT practitioners, so they have a really good pulse on what's happening with spending. The data shows that containers generally, but Kubernetes specifically, is one of the areas that has been off the charts, seeing the most significant adoption and velocity, particularly along with cloud. Kubernetes is still up and to the right consistently, even with the macro headwinds and all the stuff we're sick of talking about. So, what are you doing with Kubernetes in the platform? >>Yeah, it's really central to our ability to run the product. When we first started out, we were just on AWS, and the way we were running was a little bit like "containers junior." Now we're running Kubernetes everywhere: at AWS, Azure, and Google Cloud. It allows us to have a consistent experience across three different cloud providers, and we can manage that in code, so our developers can focus on delivering services, not trying to learn the intricacies of Amazon, Azure, and Google and figure out how to deliver services on those three clouds with all of their differences. >>Just to follow up on that, I presume that means there's a PaaS layer there to allow you to have a consistent experience across clouds and out to the edge, wherever. Is that correct? >>Yeah, we've basically built more or less platform engineering, to use the new hot phrase. Kubernetes has made a lot of things easy for us because we've built a platform that our developers can lean on, and they only have to learn one way of deploying and managing their application. That gets all of the underlying infrastructure out of the way and lets them focus on delivering Influx Cloud. >>Yeah, and I know I'm taking a little bit of a tangent, but that PaaS layer, if I can use that term: are there attributes specific to InfluxDB, or is it generally off-the-shelf PaaS? Is there any purpose-built capability there that is value-add, or is it pretty much generic? >>We really look at things through a build-versus-buy lens. Some things we want to leverage cloud provider services for, for instance Postgres databases for metadata; perhaps we get that off our plate and let someone else run it. We're going to deploy a platform that our engineers can deliver on, that has consistency, that is all generated from code, that we as an SRE group, as an ops team, can manage with very few people, and we can stamp out clusters across multiple regions in no time. >>So sometimes you build, sometimes you buy. How do you make those decisions, and what does that mean for the platform and for customers? >>What we're doing is what everybody else does: we're looking for trade-offs that make sense. We really want to protect our customers' data.
So we look for services that support our own software with the most uptime, reliability, and durability we can get. Some things are just going to be easier to have a cloud provider take care of on our behalf. We make that transparent for our own team, and of course customers don't even see it, but we don't want to try to reinvent the wheel. Like I mentioned with SQL data stores for metadata, perhaps we build on top of what these three large cloud providers have already perfected, and we can then focus on our platform engineering and have our developers focus on the InfluxData software, the Influx Cloud software. >>So take it to the customer level. What does it mean for them? What's the value they're gonna get out of all these innovations we've been talking about today, and what can they expect in the future? >>First of all, people who use the OSS product are really gonna be at home on our cloud platform. You can run it on your desktop machine, on a single server, what have you, but then you want to scale up. We have some 270 terabytes of data across over 4 billion series keys that people have stored, so there's a proven ability to scale. Now, in terms of the open source software and how we've developed the platform, you're getting a highly available, high-cardinality time series platform. We manage it, and, as I mentioned earlier, we can keep up with the state of the art. We keep reinventing, we keep deploying things in real time. We deploy to our platform every day, repeatedly, all the time, and it's that continuous deployment that allows us to keep testing things in flight and rolling out changes: new features, better ways of doing deployments, safer ways of doing deployments. >>All of that happens behind the scenes, and, like we mentioned earlier, Kubernetes allows us to get that done. We couldn't do it without having that platform as a base layer for us to then put our software on. So we iterate quickly. When you're on the Influx Cloud platform, you really are able to take advantage of new features immediately. We roll things out every day, and as those things go into production, you have the ability to use them. In the end, we want you to focus on getting actual insights from your data instead of running infrastructure. Let us do that for you. >>And that makes sense, but are the innovations we're talking about in the evolution of InfluxDB a natural evolution for existing customers? I'm sure the answer is both, but is it opening up new territory for customers? Can you add some color to that? >>Yeah, it really is a little bit of both; any engineer will say, well, it depends. Cloud native technologies are really the hot thing. IoT, and industrial IoT especially: people want to just shove tons of data out there and be able to do queries immediately, and they don't wanna manage infrastructure. What we've started to see are people that use the cloud service as their data store backbone, and then use edge computing with our OSS product to ingest data from, say, multiple production lines, downsample that data, and send the rest of it off to Influx Cloud, where the heavy processing takes place.
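The edge pattern Tim describes, aggregate locally and ship the smaller series upstream, can be sketched in a few lines of pandas. This is a hedged, generic illustration with made-up readings, not InfluxDB's built-in downsampling tasks or Telegraf pipeline:

```python
# Hedged illustration of edge downsampling with pandas: collapse an hour of
# 1 Hz readings from one production line into 1-minute means before forwarding.
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "time": pd.date_range("2022-11-01", periods=3600, freq="s"),
    "line": "line-1",
    "temp_c": 180 + np.random.normal(0, 0.5, 3600),
})

# Downsample 3,600 points to 60 by averaging each minute, keeping the tag column.
downsampled = (
    raw.set_index("time")
       .groupby("line")
       .resample("1min")["temp_c"]
       .mean()
       .reset_index()
)

print(len(raw), "raw points ->", len(downsampled), "downsampled points")
# The downsampled frame is what an edge node might forward to the cloud,
# while the raw data stays local or is discarded after a retention window.
```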
So really, us being in all the different clouds, iterating on that, and being in all sorts of different regions lets people get out of the business of trying to manage that big data themselves and have us take care of it. And of course, as we change the platform, end users benefit from that immediately. >>So you're obviously taking away a lot of the heavy lifting for the infrastructure. Would you say the same thing about security, especially as you go out to IoT and the edge? How should we be thinking about the value that you bring from a security perspective? >>Yeah, we take security super seriously. It's built into our DNA. We do a lot of work to ensure that our platform is secure and that the data we store is kept private. It's of course always a concern; you see companies being compromised in the news all the time. That's something you can have an entire team working on, which we do, to make sure that the data you have, whether it's in transit or at rest, is always kept secure and is only viewable by you. You look at things like software bills of materials: if you're running this yourself, you have to go vet all sorts of different pieces of software, and we do that as we adopt new tools. That's just part of our job, to make sure the platform we're running has fully vetted software, and with open source especially, that's a lot of work. It's definitely new territory; supply chain attacks are happening at a higher clip than they used to, but that is really just part of a day in the life for folks like us who are building platforms. >>Yeah, and that's key. Especially when you start getting into IoT and the operations technologies, the engineers running that infrastructure, as you know, Tim, historically would air-gap everything. That's how they kept it safe. But that's not feasible anymore; everything's connected now, right? So you've gotta have a partner that, again, takes away that heavy lifting and R&D so you can focus on other activities. Give us the last word and the key takeaways from your perspective. >>Well, from my perspective I see it as a two-lane approach with Influx, with any time series data. You've got a lot of stuff that you're gonna run on-prem, like the air gapping you mentioned; sure, there's plenty of need for that. But at the end of the day, people who don't want to run big data centers, people who want to trust their data to a company that's got a full platform set up for them that they can build on, will send that data over to the cloud. The cloud is not going away. I think a more hybrid approach is where the future lives, and that's what we're prepared for. >>Tim, really appreciate you coming to the program. Great stuff. Good to see you. >>Thanks very much. Appreciate it. >>Okay, in a moment I'll be back to wrap up today's session. You're watching theCUBE. >>Are you looking for some help getting started with InfluxDB, Telegraf, or Flux? Check out InfluxDB University, where you can find our entire catalog of free training that will help you make the most of your time series data. Get started for free at influxdbu.com. We'll see you in class.
>>Okay, so we heard today from three experts on time series and data, and how the InfluxDB platform is evolving to support new ways of analyzing large data sets very efficiently and effectively in real time. And we learned that key open source components like Apache Arrow, the Rust programming language, DataFusion, and Parquet are being leveraged to support real-time data analytics at scale. We also learned about the contributions and importance of open source software, and how the InfluxDB community is evolving the platform with minimal disruption to support new workloads, new use cases, and the future of real-time data analytics. Now remember, these sessions are all available on demand. You can go to thecube.net to find those. Don't forget to check out siliconangle.com for all the news related to things enterprise and emerging tech. And you should also check out influxdata.com. There you can learn about the company's products. You'll find developer resources like free courses. You can join the developer community and work with your peers to learn and solve problems. And there are plenty of other resources around use cases and customer stories on the website. This is Dave Valante. Thank you for watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and brought to you by The Cube, your leader in enterprise and emerging tech coverage.
The Truth About MySQL HeatWave
>>When Oracle acquired MySQL via the Sun acquisition, nobody really thought the company would put much effort into the platform, preferring to put all the wood behind its leading Oracle database arrow, pun intended. But two years ago, Oracle surprised many folks by announcing MySQL HeatWave, a new database as a service with a massively parallel, hybrid columnar, in-memory architecture that brings together transactional and analytic data in a single platform. Welcome to our latest database power panel on the Cube. My name is Dave Valante, and today we're gonna discuss Oracle's MySQL HeatWave with a who's who of cloud database industry analysts. Holger Mueller is with Constellation Research. Mark Stammer is the Dragon Slayer and a Wikibon contributor. And Ron Westfall is with Futurum Research. Gentlemen, welcome back to the Cube. Always a pleasure to have you on. Thanks for having us. Great to be here. >>So we've had a number of deep dive interviews on the Cube with Nipun Agarwal. You guys know him? He's the senior vice president of MySQL HeatWave development at Oracle. I think you just saw him at Oracle CloudWorld, and he's come on to describe what I'll call shock and awe feature additions to HeatWave. You know, the company's clearly putting R&D into the platform, and I think at CloudWorld we saw the fifth major release since 2020, when they first announced MySQL HeatWave. So just listing a few: they've brought in analytics, machine learning, and Autopilot for machine learning, which is automation on top of the basic OLTP functionality of the database. And it's been interesting to watch Oracle's converged database strategy. We've contrasted that amongst ourselves. Love to get your thoughts on Amazon's right-tool-for-the-right-job approach. >>Are they gonna have to change that? You know, Amazon's got the specialized databases, and both companies are doing well. It just shows there are a lot of ways to skin a cat, cuz you see some traction in the market in both approaches. So today we're gonna focus on the latest HeatWave announcements, and we're gonna talk about multi-cloud, with a native MySQL HeatWave implementation that's available on AWS, and MySQL HeatWave for Azure via the Oracle-Microsoft interconnect. This kind of cool hybrid action that they got going, sometimes we call it super cloud. And then we're gonna dive into MySQL HeatWave Lakehouse, which allows users to process and query data across MySQL databases, HeatWave databases, as well as object stores. HeatWave has been announced on AWS and Azure, they're available now, and Lakehouse I believe is in beta and I think it's coming out the second half of next year. So again, all of our guests are fresh off of Oracle CloudWorld in Las Vegas, so they've got the latest scoop. Guys, I'm done talking. Let's get into it. Mark, maybe you could start us off. What's your opinion of MySQL HeatWave's competitive position? When you think about what AWS is doing, you know, Google, we heard Google Cloud Next recently, we heard about all their data innovations. You got, obviously, Azure's got a big portfolio, Snowflake's doing well in the market. What's your take? >>Well, first let's look at it from the point of view that AWS is the market leader in cloud and cloud services. They own somewhere between 30 to 50% of the market, depending on who you read. And then you have Azure as number two, and after that it falls off.
There's GCP, Google Cloud Platform, which is further down the list, and then Oracle and IBM and Alibaba. So when you look at AWS and Azure and say, hey, these are the market leaders in the cloud, then you start looking at it and saying, if I am going to provide a service that competes with the services they have, if I can make it available in their cloud, it means that I can be more competitive. And if I'm compelling, and compelling means at least twice the performance or functionality or both at half the price, I should be able to gain market share. >>And that's what Oracle's done. They've taken a superior product in MySQL HeatWave, which is faster and lower cost and does more for a lot less at the end of the day, and they make it available to the users of those clouds. You avoid this little thing called egress fees, you avoid the issue of having to migrate from one cloud to another, and suddenly you have a very compelling offer. So I look at what Oracle's doing with MySQL and it feels like, I'm gonna use a word, a flanking maneuver on their competition. They're offering a better service on the competitors' own platforms. >>All right, so thank you for that. Holger, we've seen this sort of cadence, I sort of referenced it up front a little bit. They sat on MySQL for a decade, then all of a sudden we see this rush of announcements. Why did it take so long? And more importantly, is Oracle developing the right features that cloud database customers are looking for, in your view? >>Yeah, great question. But first of all, in your intro you said they added analytics, right? Analytics is kind of a marketing buzzword. Reports can be analytics, right? The interesting thing is what they did: the first thing is they crossed the chasm between OLTP and OLAP, right? In the same database, right? So a major engineering feat, very much what customers want, and it's all about creating better value for customers, which I think is part of why they go into the multi-cloud and why they add these capabilities. And certainly with the AI capabilities it's kind of like getting into an autonomous, self-driving field now, with the Lakehouse capabilities, and meeting customers where they are, like Mark has talked about with the egress costs in the cloud. So that's a significant advantage, creating value for customers, and that's what at the end of the day matters. >>And I believe strongly that long term it's gonna be the ones who create better value for customers who will get more of their money. From that perspective, why did it take them so long? I think it's a great question. I think it's largely, you mentioned the gentleman, Nipun, it's largely down to who leads a product. I used to build products too, so maybe I'm fooling myself a little here, but that made the difference in my view, right? Since he's been in charge, he's been building things faster than the rest of the competition in the MySQL space, which in hindsight we thought was a hot and smoking innovation space. It was actually a little self-complacent when it comes to the traditional borders of where people think things are separated, between OLTP and OLAP, or, as an example, JSON support, structured documents versus unstructured documents or databases. All of that has been collapsed and brought together for building a more powerful database for customers.
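Holger's point about collapsing OLTP and OLAP into one database is easier to see with a small example. The sketch below, using the Python mysql-connector library, assumes a hypothetical "orders" table and made-up connection details; the ALTER TABLE statements follow MySQL HeatWave's documented flow for loading an InnoDB table into the in-memory analytics engine, after which the same connection serves both transactional and analytic queries with no ETL.

```python
# Rough sketch: running analytics on the same MySQL tables that serve OLTP traffic,
# by loading them into the HeatWave secondary engine. Connection details and the
# "orders" table are hypothetical; the statements follow the documented HeatWave flow.
import mysql.connector

conn = mysql.connector.connect(
    host="heatwave-db.example.com", user="app", password="***", database="shop")
cur = conn.cursor()

# Mark the InnoDB table for HeatWave and load it into the in-memory columnar engine.
cur.execute("ALTER TABLE orders SECONDARY_ENGINE = RAPID")
cur.execute("ALTER TABLE orders SECONDARY_LOAD")

# The same SQL now gets offloaded to HeatWave when the optimizer decides it is cheaper;
# there is no ETL step and no separate analytics copy of the data to maintain.
cur.execute("""
    SELECT customer_id, SUM(total) AS spend
    FROM orders
    WHERE order_date >= '2022-01-01'
    GROUP BY customer_id
    ORDER BY spend DESC
    LIMIT 10
""")
for customer_id, spend in cur:
    print(customer_id, spend)

cur.close()
conn.close()
```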
>>So I mean it's certainly, you know, when, when Oracle talks about the competitors, you know, the competitors are in the, I always say they're, if the Oracle talks about you and knows you're doing well, so they talk a lot about aws, talk a little bit about Snowflake, you know, sort of Google, they have partnerships with Azure, but, but in, so I'm presuming that the response in MySQL heatwave was really in, in response to what they were seeing from those big competitors. But then you had Maria DB coming out, you know, the day that that Oracle acquired Sun and, and launching and going after the MySQL base. So it's, I'm, I'm interested and we'll talk about this later and what you guys think AWS and Google and Azure and Snowflake and how they're gonna respond. But, but before I do that, Ron, I want to ask you, you, you, you can get, you know, pretty technical and you've probably seen the benchmarks. >>I know you have Oracle makes a big deal out of it, publishes its benchmarks, makes some transparent on on GI GitHub. Larry Ellison talked about this in his keynote at Cloud World. What are the benchmarks show in general? I mean, when you, when you're new to the market, you gotta have a story like Mark was saying, you gotta be two x you know, the performance at half the cost or you better be or you're not gonna get any market share. So, and, and you know, oftentimes companies don't publish market benchmarks when they're leading. They do it when they, they need to gain share. So what do you make of the benchmarks? Have their, any results that were surprising to you? Have, you know, they been challenged by the competitors. Is it just a bunch of kind of desperate bench marketing to make some noise in the market or you know, are they real? What's your view? >>Well, from my perspective, I think they have the validity. And to your point, I believe that when it comes to competitor responses, that has not really happened. Nobody has like pulled down the information that's on GitHub and said, Oh, here are our price performance results. And they counter oracles. In fact, I think part of the reason why that hasn't happened is that there's the risk if Oracle's coming out and saying, Hey, we can deliver 17 times better query performance using our capabilities versus say, Snowflake when it comes to, you know, the Lakehouse platform and Snowflake turns around and says it's actually only 15 times better during performance, that's not exactly an effective maneuver. And so I think this is really to oracle's credit and I think it's refreshing because these differentiators are significant. We're not talking, you know, like 1.2% differences. We're talking 17 fold differences, we're talking six fold differences depending on, you know, where the spotlight is being shined and so forth. >>And so I think this is actually something that is actually too good to believe initially at first blush. If I'm a cloud database decision maker, I really have to prioritize this. I really would know, pay a lot more attention to this. And that's why I posed the question to Oracle and others like, okay, if these differentiators are so significant, why isn't the needle moving a bit more? And it's for, you know, some of the usual reasons. One is really deep discounting coming from, you know, the other players that's really kind of, you know, marketing 1 0 1, this is something you need to do when there's a real competitive threat to keep, you know, a customer in your own customer base. 
Plus there is the usual fear and uncertainty about moving from one platform to another. But I think, you know, the traction, the momentum is, is shifting an Oracle's favor. I think we saw that in the Q1 efforts, for example, where Oracle cloud grew 44% and that it generated, you know, 4.8 billion and revenue if I recall correctly. And so, so all these are demonstrating that's Oracle is making, I think many of the right moves, publishing these figures for anybody to look at from their own perspective is something that is, I think, good for the market and I think it's just gonna continue to pay dividends for Oracle down the horizon as you know, competition intens plots. So if I were in, >>Dave, can I, Dave, can I interject something and, and what Ron just said there? Yeah, please go ahead. A couple things here, one discounting, which is a common practice when you have a real threat, as Ron pointed out, isn't going to help much in this situation simply because you can't discount to the point where you improve your performance and the performance is a huge differentiator. You may be able to get your price down, but the problem that most of them have is they don't have an integrated product service. They don't have an integrated O L T P O L A P M L N data lake. Even if you cut out two of them, they don't have any of them integrated. They have multiple services that are required separate integration and that can't be overcome with discounting. And the, they, you have to pay for each one of these. And oh, by the way, as you grow, the discounts go away. So that's a, it's a minor important detail. >>So, so that's a TCO question mark, right? And I know you look at this a lot, if I had that kind of price performance advantage, I would be pounding tco, especially if I need two separate databases to do the job. That one can do, that's gonna be, the TCO numbers are gonna be off the chart or maybe down the chart, which you want. Have you looked at this and how does it compare with, you know, the big cloud guys, for example, >>I've looked at it in depth, in fact, I'm working on another TCO on this arena, but you can find it on Wiki bod in which I compared TCO for MySEQ Heat wave versus Aurora plus Redshift plus ML plus Blue. I've compared it against gcps services, Azure services, Snowflake with other services. And there's just no comparison. The, the TCO differences are huge. More importantly, thefor, the, the TCO per performance is huge. We're talking in some cases multiple orders of magnitude, but at least an order of magnitude difference. So discounting isn't gonna help you much at the end of the day, it's only going to lower your cost a little, but it doesn't improve the automation, it doesn't improve the performance, it doesn't improve the time to insight, it doesn't improve all those things that you want out of a database or multiple databases because you >>Can't discount yourself to a higher value proposition. >>So what about, I wonder ho if you could chime in on the developer angle. You, you followed that, that market. How do these innovations from heatwave, I think you used the term developer velocity. I've heard you used that before. Yeah, I mean, look, Oracle owns Java, okay, so it, it's, you know, most popular, you know, programming language in the world, blah, blah blah. But it does it have the, the minds and hearts of, of developers and does, where does heatwave fit into that equation? >>I think heatwave is gaining quickly mindshare on the developer side, right? 
It's not the traditional NoSQL database crowd it grew up with; there's a traditional mistrust of Oracle among developers over what happens to open source when it gets acquired, like in the case of Oracle and Java and MySQL, right? But we know it's not a good competitive strategy to bank on Oracle screwing up, because that hasn't worked, not on Java, not on MySQL, right? And for developers, once you get to know a technology product and you can do more with it, it becomes kind of like a Swiss army knife: you can build more use cases, you can build more powerful applications. That's super important, because you don't have to get certified in multiple databases. You're fast at getting things done, you achieve higher developer velocity, and the managers are happy because they don't have to license more things, send you to more trainings, or take more risk of something not being delivered, right? >>So we really see the suite-versus-best-of-breed play happening here, which was already happening before with Oracle's flagship database versus Amazon, as an example, right? And the interesting thing is, Oracle was always a one-database company, there can be only one, and now they're genuinely talking about HeatWave as well, a two-database company with different market spaces but the same value proposition: integrating more things very, very quickly to have a universal database, what they call the converged database, for all the needs of an enterprise to run certain application use cases. And that's what's attractive to developers. >>It's ironic, isn't it? I mean, the rumor was that TK, Thomas Kurian, left Oracle cuz he wanted to put the Oracle database on other clouds and other places. And maybe that was the rift. I'm sure there were other things, but Oracle clearly is now trying to expand its TAM, Ron, with HeatWave into AWS, into Azure. How do you think Oracle's gonna do? You were at CloudWorld, what was the sentiment from customers and the independent analysts? Is this just Oracle trying to screw with the competition, create a little diversion? Or is this, you know, serious business for Oracle? What do you think? >>No, I think it has legs. I think it's definitely, again, a testament to Oracle's overall ability to differentiate not only MySQL HeatWave, but its overall portfolio. And I think the fact that they do have the alliance with Azure in place is definitely demonstrating their commitment to meeting the multi-cloud needs of their customers, as well as what we pointed to in terms of the fact that they're now offering MySQL capabilities within AWS natively, and that it can now outperform AWS's own offering. And I think this is all demonstrating that Oracle is not letting up, they're not resting on their laurels. Clearly we are living in a multi-cloud world, so why not just make it easier for customers to be able to use cloud databases according to their own specific needs. And I think, to Holger's point, that definitely aligns with being able to bring on more application developers to leverage these capabilities.
>>I think one important announcement that's related to all this was the JSON relational duality capabilities where now it's a lot easier for application developers to use a language that they're very familiar with a JS O and not have to worry about going into relational databases to store their J S O N application coding. So this is, I think an example of the innovation that's enhancing the overall Oracle portfolio and certainly all the work with machine learning is definitely paying dividends as well. And as a result, I see Oracle continue to make these inroads that we pointed to. But I agree with Mark, you know, the short term discounting is just a stall tag. This is not denying the fact that Oracle is being able to not only deliver price performance differentiators that are dramatic, but also meeting a wide range of needs for customers out there that aren't just limited device performance consideration. >>Being able to support multi-cloud according to customer needs. Being able to reach out to the application developer community and address a very specific challenge that has plagued them for many years now. So bring it all together. Yeah, I see this as just enabling Oracles who ring true with customers. That the customers that were there were basically all of them, even though not all of them are going to be saying the same things, they're all basically saying positive feedback. And likewise, I think the analyst community is seeing this. It's always refreshing to be able to talk to customers directly and at Oracle cloud there was a litany of them and so this is just a difference maker as well as being able to talk to strategic partners. The nvidia, I think partnerships also testament to Oracle's ongoing ability to, you know, make the ecosystem more user friendly for the customers out there. >>Yeah, it's interesting when you get these all in one tools, you know, the Swiss Army knife, you expect that it's not able to be best of breed. That's the kind of surprising thing that I'm hearing about, about heatwave. I want to, I want to talk about Lake House because when I think of Lake House, I think data bricks, and to my knowledge data bricks hasn't been in the sites of Oracle yet. Maybe they're next, but, but Oracle claims that MySQL, heatwave, Lakehouse is a breakthrough in terms of capacity and performance. Mark, what are your thoughts on that? Can you double click on, on Lakehouse Oracle's claims for things like query performance and data loading? What does it mean for the market? Is Oracle really leading in, in the lake house competitive landscape? What are your thoughts? >>Well, but name in the game is what are the problems you're solving for the customer? More importantly, are those problems urgent or important? If they're urgent, customers wanna solve 'em. Now if they're important, they might get around to them. So you look at what they're doing with Lake House or previous to that machine learning or previous to that automation or previous to that O L A with O ltp and they're merging all this capability together. If you look at Snowflake or data bricks, they're tacking one problem. You look at MyQ heat wave, they're tacking multiple problems. So when you say, yeah, their queries are much better against the lake house in combination with other analytics in combination with O ltp and the fact that there are no ETLs. So you're getting all this done in real time. So it's, it's doing the query cross, cross everything in real time. 
>>You're solving multiple user and developer problems, you're increasing their ability to get insight faster, you're getting shorter response times. So yeah, they really are solving urgent problems for customers. And by putting it where the customer lives, this is the brilliance of actually being multicloud. And I know I'm backing up here a second, but by making it work in AWS and Azure, where people already live, where they already have applications, what they're saying is, we're bringing it to you. You don't have to come to us to get these benefits, this value. Overall, I think it's a brilliant strategy. I give Nipun Agarwal huge kudos for what he's doing there. So yes, what they're doing with the Lakehouse is going to put Databricks and Snowflake and everyone else on notice, for that matter. >>Those are the guys, Holger, you and I have talked about this, those are the guys that are doing sort of the best of breed. You know, they're really focused, and they tend to do well, at least out of the gate. Now you've got Oracle's converged philosophy, obviously, with the Oracle database. We've seen that, and now it's kicking into gear with HeatWave. This whole thing of suites versus best of breed: long term, customers tend to migrate towards the suite, but the new shiny toy tends to get the growth. How do you think this is gonna play out in cloud database? >>Well, it's the forever, never-ending story, right? In software it's suites versus best of breed, and so far, in the long run, suites have always won, right? And sometimes they struggle, because the inherent problem of suites is you build something larger, it has more complexity, and that means your cycles to get everything working together, to integrate it, test it, roll it out, certify it, whatever it is, take you longer, right? And that's not the case here. The fascinating part of the effort around MySQL HeatWave is that the team is out-executing the previous best-of-breed players while bringing things together. Now whether they can maintain that pace remains to be seen. But the strategy, like what Mark was saying, bringing the software to the data, is of course interesting and unique, and was totally an Oracle issue in the past, right? >>Yeah, but it had to be in your database on OCI. And that's an interesting part. The interesting thing on the Lakehouse side is, there are three key benefits of a lakehouse. The first one is better reporting and analytics, bringing richer information together. Take the case of SiliconANGLE, right? We want to see engagement for this video, we want to know what's happening. That's a mixed transactional and video media use case, right? A typical lakehouse use case. The next one is to build richer applications, transactional applications which have video and these elements in there, which are the engaging ones. And the third one, and that's where I'm a little critical and concerned, is it's really the base platform for artificial intelligence, right? To run deep learning, to run things automatically, because you have all the data in one place and can create it in one way. >>And that's where, I know Ron talked about NVIDIA for a moment, but that's where Oracle doesn't have the strongest story. Nonetheless, the two other main use cases of the lakehouse are very strong. My only concern is the terabyte limit sounds low. It's an arbitrary limitation, and it doesn't sound that big.
So for the start, and it's the first word, they can make that bigger. You don't want your lake house to be limited and the terabyte sizes or any even petabyte size because you want to have the certainty. I can put everything in there that I think it might be relevant without knowing what questions to ask and query those questions. >>Yeah. And you know, in the early days of no schema on right, it just became a mess. But now technology has evolved to allow us to actually get more value out of that data. Data lake. Data swamp is, you know, not much more, more, more, more logical. But, and I want to get in, in a moment, I want to come back to how you think the competitors are gonna respond. Are they gonna have to sort of do a more of a converged approach? AWS in particular? But before I do, Ron, I want to ask you a question about autopilot because I heard Larry Ellison's keynote and he was talking about how, you know, most security issues are human errors with autonomy and autonomous database and things like autopilot. We take care of that. It's like autonomous vehicles, they're gonna be safer. And I went, well maybe, maybe someday. So Oracle really tries to emphasize this, that every time you see an announcement from Oracle, they talk about new, you know, autonomous capabilities. It, how legit is it? Do people care? What about, you know, what's new for heatwave Lakehouse? How much of a differentiator, Ron, do you really think autopilot is in this cloud database space? >>Yeah, I think it will definitely enhance the overall proposition. I don't think people are gonna buy, you know, lake house exclusively cause of autopilot capabilities, but when they look at the overall picture, I think it will be an added capability bonus to Oracle's benefit. And yeah, I think it's kind of one of these age old questions, how much do you automate and what is the bounce to strike? And I think we all understand with the automatic car, autonomous car analogy that there are limitations to being able to use that. However, I think it's a tool that basically every organization out there needs to at least have or at least evaluate because it goes to the point of it helps with ease of use, it helps make automation more balanced in terms of, you know, being able to test, all right, let's automate this process and see if it works well, then we can go on and switch on on autopilot for other processes. >>And then, you know, that allows, for example, the specialists to spend more time on business use cases versus, you know, manual maintenance of, of the cloud database and so forth. So I think that actually is a, a legitimate value proposition. I think it's just gonna be a case by case basis. Some organizations are gonna be more aggressive with putting automation throughout their processes throughout their organization. Others are gonna be more cautious. But it's gonna be, again, something that will help the overall Oracle proposition. And something that I think will be used with caution by many organizations, but other organizations are gonna like, hey, great, this is something that is really answering a real problem. And that is just easing the use of these databases, but also being able to better handle the automation capabilities and benefits that come with it without having, you know, a major screwup happened and the process of transitioning to more automated capabilities. >>Now, I didn't attend cloud world, it's just too many red eyes, you know, recently, so I passed. 
But one of the things I like to do at those events is talk to customers, you know, in the spirit of the truth. You'd have the hallway track and talk to customers, and they say, hey, here's the good, the bad, and the ugly. So did you guys talk to any MySQL HeatWave customers at CloudWorld? And what did you learn? I don't know, Mark, did you have any luck in having some private conversations? >>Yeah, I had quite a few private conversations. One thing before I get to that: I want to disagree with one point Ron made. I do believe there are customers out there buying the HeatWave service, the MySQL HeatWave service, because of Autopilot. Because Autopilot is really revolutionary in many ways for the MySQL developer, in that it auto-provisions, it auto parallel loads, it auto data places, it does auto shape prediction. It can tell you which machine learning models are going to give you your best results. And candidly, I've yet to meet a DBA who didn't wanna give up pedantic tasks that are a pain in the kahoo, which they'd rather not do, as long as it was done right for them. So yes, I do think people are buying it because of Autopilot, and that's based on some of the conversations I had with customers at Oracle CloudWorld.
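Mark's Autopilot point, the auto parallel load in particular, roughly corresponds to HeatWave's documented sys.heatwave_load procedure. The sketch below is illustrative only: the schema name and connection details are made up, and from any plain MySQL client session the equivalent is simply CALL sys.heatwave_load(JSON_ARRAY('shop'), NULL).

```python
# Rough sketch of the Autopilot capability Mark mentions (Auto Parallel Load): rather than
# hand-loading tables one by one, HeatWave's sys.heatwave_load procedure analyzes a schema,
# estimates memory, and loads the eligible tables with a suitable degree of parallelism.
# Connection details and the "shop" schema are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="heatwave-db.example.com", user="admin", password="***")
cur = conn.cursor()

# First argument is a JSON array of schema names, passed here as a JSON-formatted string;
# second argument (options) is left NULL to take the defaults.
cur.callproc("sys.heatwave_load", ('["shop"]', None))

# The procedure returns report rows: what it loaded (or would load), estimates, warnings.
for result in cur.stored_results():
    for row in result.fetchall():
        print(row)

cur.close()
conn.close()
```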
So that's gone when you start talking about machine learning, again, you may have an etl, you may not, depending on the circumstances, but again, with my SQL heat wave, you don't, and you don't have duplicate storage, you don't have to copy it from one storage container to another to be able to be used in a different database, which by the way, ultimately adds much more cost than just the other service. So yeah, I looked at the migration and again, the users I talked to said it was a non-event. It was literally moving from one physical machine to another. If they had a new version of MySEQ running on something else and just wanted to migrate it over or just hook it up or just connect it to the data, it worked just fine. >>Okay, so every day it sounds like you guys feel, and we've certainly heard this, my colleague David Foyer, the semi-retired David Foyer was always very high on heatwave. So I think you knows got some real legitimacy here coming from a standing start, but I wanna talk about the competition, how they're likely to respond. I mean, if your AWS and you got heatwave is now in your cloud, so there's some good aspects of that. The database guys might not like that, but the infrastructure guys probably love it. Hey, more ways to sell, you know, EC two and graviton, but you're gonna, the database guys in AWS are gonna respond. They're gonna say, Hey, we got Redshift, we got aqua. What's your thoughts on, on not only how that's gonna resonate with customers, but I'm interested in what you guys think will a, I never say never about aws, you know, and are they gonna try to build, in your view a converged Oola and o LTP database? You know, Snowflake is taking an ecosystem approach. They've added in transactional capabilities to the portfolio so they're not standing still. What do you guys see in the competitive landscape in that regard going forward? Maybe Holger, you could start us off and anybody else who wants to can chime in, >>Happy to, you mentioned Snowflake last, we'll start there. I think Snowflake is imitating that strategy, right? That building out original data warehouse and the clouds tasking project to really proposition to have other data available there because AI is relevant for everybody. Ultimately people keep data in the cloud for ultimately running ai. So you see the same suite kind of like level strategy, it's gonna be a little harder because of the original positioning. How much would people know that you're doing other stuff? And I just, as a former developer manager of developers, I just don't see the speed at the moment happening at Snowflake to become really competitive to Oracle. On the flip side, putting my Oracle hat on for a moment back to you, Mark and Iran, right? What could Oracle still add? Because the, the big big things, right? The traditional chasms in the database world, they have built everything, right? >>So I, I really scratched my hat and gave Nipon a hard time at Cloud world say like, what could you be building? Destiny was very conservative. Let's get the Lakehouse thing done, it's gonna spring next year, right? And the AWS is really hard because AWS value proposition is these small innovation teams, right? That they build two pizza teams, which can be fit by two pizzas, not large teams, right? And you need suites to large teams to build these suites with lots of functionalities to make sure they work together. 
They're consistent, they have the same UX on the administration side, they can consume the same way, they have the same API registry, can't even stop going where the synergy comes to play over suite. So, so it's gonna be really, really hard for them to change that. But AWS super pragmatic. They're always by themselves that they'll listen to customers if they learn from customers suite as a proposition. I would not be surprised if AWS trying to bring things closer together, being morely together. >>Yeah. Well how about, can we talk about multicloud if, if, again, Oracle is very on on Oracle as you said before, but let's look forward, you know, half a year or a year. What do you think about Oracle's moves in, in multicloud in terms of what kind of penetration they're gonna have in the marketplace? You saw a lot of presentations at at cloud world, you know, we've looked pretty closely at the, the Microsoft Azure deal. I think that's really interesting. I've, I've called it a little bit of early days of a super cloud. What impact do you think this is gonna have on, on the marketplace? But, but both. And think about it within Oracle's customer base, I have no doubt they'll do great there. But what about beyond its existing install base? What do you guys think? >>Ryan, do you wanna jump on that? Go ahead. Go ahead Ryan. No, no, no, >>That's an excellent point. I think it aligns with what we've been talking about in terms of Lakehouse. I think Lake House will enable Oracle to pull more customers, more bicycle customers onto the Oracle platforms. And I think we're seeing all the signs pointing toward Oracle being able to make more inroads into the overall market. And that includes garnishing customers from the leaders in, in other words, because they are, you know, coming in as a innovator, a an alternative to, you know, the AWS proposition, the Google cloud proposition that they have less to lose and there's a result they can really drive the multi-cloud messaging to resonate with not only their existing customers, but also to be able to, to that question, Dave's posing actually garnish customers onto their platform. And, and that includes naturally my sequel but also OCI and so forth. So that's how I'm seeing this playing out. I think, you know, again, Oracle's reporting is indicating that, and I think what we saw, Oracle Cloud world is definitely validating the idea that Oracle can make more waves in the overall market in this regard. >>You know, I, I've floated this idea of Super cloud, it's kind of tongue in cheek, but, but there, I think there is some merit to it in terms of building on top of hyperscale infrastructure and abstracting some of the, that complexity. And one of the things that I'm most interested in is industry clouds and an Oracle acquisition of Cerner. I was struck by Larry Ellison's keynote, it was like, I don't know, an hour and a half and an hour and 15 minutes was focused on healthcare transformation. Well, >>So vertical, >>Right? And so, yeah, so you got Oracle's, you know, got some industry chops and you, and then you think about what they're building with, with not only oci, but then you got, you know, MyQ, you can now run in dedicated regions. You got ADB on on Exadata cloud to customer, you can put that OnPrem in in your data center and you look at what the other hyperscalers are, are doing. I I say other hyperscalers, I've always said Oracle's not really a hyperscaler, but they got a cloud so they're in the game. 
But you can't get, you know, big query OnPrem, you look at outposts, it's very limited in terms of, you know, the database support and again, that that will will evolve. But now you got Oracle's got, they announced Alloy, we can white label their cloud. So I'm interested in what you guys think about these moves, especially the industry cloud. We see, you know, Walmart is doing sort of their own cloud. You got Goldman Sachs doing a cloud. Do you, you guys, what do you think about that and what role does Oracle play? Any thoughts? >>Yeah, let me lemme jump on that for a moment. Now, especially with the MyQ, by making that available in multiple clouds, what they're doing is this follows the philosophy they've had the past with doing cloud, a customer taking the application and the data and putting it where the customer lives. If it's on premise, it's on premise. If it's in the cloud, it's in the cloud. By making the mice equal heat wave, essentially a plug compatible with any other mice equal as far as your, your database is concern and then giving you that integration with O L A P and ML and Data Lake and everything else, then what you've got is a compelling offering. You're making it easier for the customer to use. So I look the difference between MyQ and the Oracle database, MyQ is going to capture market more market share for them. >>You're not gonna find a lot of new users for the Oracle debate database. Yeah, there are always gonna be new users, don't get me wrong, but it's not gonna be a huge growth. Whereas my SQL heatwave is probably gonna be a major growth engine for Oracle going forward. Not just in their own cloud, but in AWS and in Azure and on premise over time that eventually it'll get there. It's not there now, but it will, they're doing the right thing on that basis. They're taking the services and when you talk about multicloud and making them available where the customer wants them, not forcing them to go where you want them, if that makes sense. And as far as where they're going in the future, I think they're gonna take a page outta what they've done with the Oracle database. They'll add things like JSON and XML and time series and spatial over time they'll make it a, a complete converged database like they did with the Oracle database. The difference being Oracle database will scale bigger and will have more transactions and be somewhat faster. And my SQL will be, for anyone who's not on the Oracle database, they're, they're not stupid, that's for sure. >>They've done Jason already. Right. But I give you that they could add graph and time series, right. Since eat with, Right, Right. Yeah, that's something absolutely right. That's, that's >>A sort of a logical move, right? >>Right. But that's, that's some kid ourselves, right? I mean has worked in Oracle's favor, right? 10 x 20 x, the amount of r and d, which is in the MyQ space, has been poured at trying to snatch workloads away from Oracle by starting with IBM 30 years ago, 20 years ago, Microsoft and, and, and, and didn't work, right? Database applications are extremely sticky when they run, you don't want to touch SIM and grow them, right? So that doesn't mean that heat phase is not an attractive offering, but it will be net new things, right? 
And what works in my SQL heat wave heat phases favor a little bit is it's not the massive enterprise applications which have like we the nails like, like you might be only running 30% or Oracle, but the connections and the interfaces into that is, is like 70, 80% of your enterprise. >>You take it out and it's like the spaghetti ball where you say, ah, no I really don't, don't want to do all that. Right? You don't, don't have that massive part with the equals heat phase sequel kind of like database which are more smaller tactical in comparison, but still I, I don't see them taking so much share. They will be growing because of a attractive value proposition quickly on the, the multi-cloud, right? I think it's not really multi-cloud. If you give people the chance to run your offering on different clouds, right? You can run it there. The multi-cloud advantages when the Uber offering comes out, which allows you to do things across those installations, right? I can migrate data, I can create data across something like Google has done with B query Omni, I can run predictive models or even make iron models in different place and distribute them, right? And Oracle is paving the road for that, but being available on these clouds. But the multi-cloud capability of database which knows I'm running on different clouds that is still yet to be built there. >>Yeah. And >>That the problem with >>That, that's the super cloud concept that I flowed and I I've always said kinda snowflake with a single global instance is sort of, you know, headed in that direction and maybe has a league. What's the issue with that mark? >>Yeah, the problem with the, with that version, the multi-cloud is clouds to charge egress fees. As long as they charge egress fees to move data between clouds, it's gonna make it very difficult to do a real multi-cloud implementation. Even Snowflake, which runs multi-cloud, has to pass out on the egress fees of their customer when data moves between clouds. And that's really expensive. I mean there, there is one customer I talked to who is beta testing for them, the MySQL heatwave and aws. The only reason they didn't want to do that until it was running on AWS is the egress fees were so great to move it to OCI that they couldn't afford it. Yeah. Egress fees are the big issue but, >>But Mark the, the point might be you might wanna root query and only get the results set back, right was much more tinier, which been the answer before for low latency between the class A problem, which we sometimes still have but mostly don't have. Right? And I think in general this with fees coming down based on the Oracle general E with fee move and it's very hard to justify those, right? But, but it's, it's not about moving data as a multi-cloud high value use case. It's about doing intelligent things with that data, right? Putting into other places, replicating it, what I'm saying the same thing what you said before, running remote queries on that, analyzing it, running AI on it, running AI models on that. That's the interesting thing. Cross administered in the same way. Taking things out, making sure compliance happens. Making sure when Ron says I don't want to be American anymore, I want to be in the European cloud that is gets migrated, right? So tho those are the interesting value use case which are really, really hard for enterprise to program hand by hand by developers and they would love to have out of the box and that's yet the innovation to come to, we have to come to see. 
But the first step to get there is that your software runs in multiple clouds and that's what Oracle's doing so well with my SQL >>Guys. Amazing. >>Go ahead. Yeah. >>Yeah. >>For example, >>Amazing amount of data knowledge and, and brain power in this market. Guys, I really want to thank you for coming on to the cube. Ron Holger. Mark, always a pleasure to have you on. Really appreciate your time. >>Well all the last names we're very happy for Romanic last and moderator. Thanks Dave for moderating us. All right, >>We'll see. We'll see you guys around. Safe travels to all and thank you for watching this power panel, The Truth About My SQL Heat Wave on the cube. Your leader in enterprise and emerging tech coverage.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Mark | PERSON | 0.99+ |
Ron Holger | PERSON | 0.99+ |
Ron | PERSON | 0.99+ |
Mark Stammer | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Ron Westfall | PERSON | 0.99+ |
Ryan | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
Walmart | ORGANIZATION | 0.99+ |
Larry Ellison | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Alibaba | ORGANIZATION | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Holgar Mueller | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Constellation Research | ORGANIZATION | 0.99+ |
Goldman Sachs | ORGANIZATION | 0.99+ |
17 times | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
David Foyer | PERSON | 0.99+ |
44% | QUANTITY | 0.99+ |
1.2% | QUANTITY | 0.99+ |
4.8 billion | QUANTITY | 0.99+ |
Jason | PERSON | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Fu Chim Research | ORGANIZATION | 0.99+ |
Dave Ante | PERSON | 0.99+ |
Evolving InfluxDB into the Smart Data Platform Full Episode
>>This past May, The Cube, in collaboration with InfluxData, shared with you the latest innovations in time series databases. We talked at length about why a purpose-built time series database, for many use cases, was a superior alternative to general purpose databases trying to do the same thing. Now, you may remember that time series data is any data that's stamped in time, and if it's stamped, it can be analyzed historically. And when we introduced the concept to the community, we talked about how, in theory, those time slices could be taken, you know, every hour, every minute, every second, down to the millisecond, and how the world was moving toward realtime or near realtime data analysis to support physical infrastructure like sensors and other devices and IoT equipment. Time series databases have had to evolve to efficiently support realtime data in emerging use cases in IoT and other areas. >>And to do that, new architectural innovations have to be brought to bear. As is often the case, open source software is the linchpin to those innovations. Hello and welcome to Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and produced by the Cube. My name is Dave Vellante and I'll be your host today. Now in this program we're going to dig pretty deep into what's happening with time series data generally, and specifically how InfluxDB is evolving to support new workloads and demands and data, and specifically around data analytics use cases in real time. Now, first we're gonna hear from Brian Gilmore, who is the director of IoT and emerging technologies at InfluxData. And we're gonna talk about the continued evolution of InfluxDB and the new capabilities enabled by open source generally and specific tools. And in this program you're gonna hear a lot about things like Rust, the implementation of Apache Arrow, the use of Parquet, and tooling such as DataFusion, which are powering a new engine for InfluxDB. >>Now, these innovations, they evolve the idea of time series analysis by dramatically increasing the granularity of time series data by compressing the historical time slices, if you will, from, for example, minutes down to milliseconds. And at the same time, enabling real time analytics with an architecture that can process data much faster and much more efficiently. Now, after Brian, we're gonna hear from Anais Dotis-Georgiou, who is a developer advocate at InfluxData. And we're gonna get into the why of these open source capabilities and how they contribute to the evolution of the InfluxDB platform. And then we're gonna close the program with Tim Yocum, he's the director of engineering at InfluxData, and he's gonna explain how the InfluxDB community actually evolved the data engine in mid-flight and which decisions went into the innovations that are coming to the market. Thank you for being here. We hope you enjoy the program. Let's get started. Okay, we're kicking things off with Brian Gilmore. He's the director of IoT and emerging technology at InfluxData. Brian, welcome to the program. Thanks for coming on. >>Thanks Dave. Great to be here. I appreciate the time. >>Hey, explain why InfluxDB, you know, needs a new engine. Was there something wrong with the current engine? What's going on there? >>No, no, not at all. I mean, I think it's, for us, it's been about staying ahead of the market.
I think, you know, if we think about what our customers are coming to us with now, you know, related to requests like SQL query support, things like that, we have to figure out a way to execute those for them in a way that will scale long term. And then we also wanna make sure we're innovating, we're sort of staying ahead of the market as well and anticipating those future needs. So, you know, this is really a transparent change for our customers. I mean, I think we'll be adding new capabilities over time that leverage this new engine, but you know, initially the customers who are using us are gonna see just great improvements in performance, you know, especially those that are working at the top end of the workload scale, you know, the massive data volumes and things like that. >>Yeah, and we're gonna get into that today, and the architecture and the like, but what was the catalyst for the enhancements? I mean, when and how did this all come about? >>Well, I mean, like three years ago we were primarily on premises, right? I mean, I think we had our open source, we had an enterprise product, you know, and sort of shifting that technology, especially the open source code base, to a service basis where we were hosting it through, you know, multiple cloud providers, that was a long journey, I guess. You know, phase one was, you know, we wanted to host enterprise for our customers, so we sort of created a service where we just managed and ran our enterprise product for them. You know, phase two of this cloud effort was to optimize for multi-tenant, multi-cloud, to be able to host it in a truly SaaS manner where we could use, you know, some type of customer activity or consumption as the pricing vector, you know, and that was sort of the birth of the real first InfluxDB Cloud, you know, which has been really successful. >>We've seen, I think, like 60,000 people sign up, and we've got tons and tons of both enterprises as well as new companies, developers, and of course a lot of home hobbyists and enthusiasts who are using it on a daily basis, you know, and having that sort of big pool of very diverse customers to chat with as they're using the product, as they're giving us feedback, et cetera, has, you know, pointed us in a really good direction in terms of making sure we're continuously improving that and then also making these big leaps as we're doing with this new engine. >>Right. So you've called it a transparent change for customers, so I'm presuming it's non-disruptive, but I really wanna understand how much of a pivot this is, and what does it take to make that shift from, you know, time series specialist to real time analytics and being able to support both? >>Yeah, I mean, it's much more of an evolution, I think, than a shift or a pivot. You know, time series data is always gonna be fundamental and sort of the basis of the solutions that we offer our customers, and then also the ones that they're building on the sort of raw APIs of our platform themselves. You know, the time series market is one that we've worked diligently to lead.
I mean, I think when it comes to metrics, especially sensor data and app and infrastructure metrics, if we're being honest, I think our user base is well aware that the way we were architected was much more towards those sort of backwards looking, historical type analytics, which are key for troubleshooting and making sure you don't, you know, run into the same problem twice. But, you know, we had to ask ourselves, what can we do to better handle those queries from a performance and a, you know, a time to response standpoint, and can we get that to the point where the result sets are coming back so quickly from the time of query that we can limit that window down to minutes and then seconds? >>And now with this new engine, we're really starting to talk about a query window that could be, like, returning results in, you know, milliseconds of time since it hit the ingest queue. And that's really getting to the point where, as your data is available, you can use it and you can query it, you can visualize it, and you can do all those sort of magical things with it, you know? And I think getting all of that to a place where we're saying yes to the customer on, you know, all of the real time queries, the multiple language query support, you know, it was hard, but we're now at a spot where we can start introducing that to, you know, a limited number of customers, strategic customers and strategic availability zones to start. But you know, everybody over time. >>So you're basically going from what happened, and you can still do that obviously, to what's happening now, in the moment? >>Yeah, yeah. I mean, if you think about time, it's always sort of past, right? I mean, in the moment right now, whether you're talking about a millisecond ago or a minute ago, you know, that's pretty much right now, I think, for most people, especially in these use cases where you have other sort of components of latency induced by the underlying data collection, the architecture, the infrastructure, the, you know, the devices, and, you know, the sort of highly distributed nature of all of this. So yeah, I mean, getting a customer or a user to be able to use the data as soon as it is available is what we're after here. >>I always thought of real time as before you lose the customer, but now in this context, maybe it's before the machine blows up. >>Yeah, I mean, operational real time is different, you know, and that's one of the things that really triggered us to know that we were heading in the right direction, is just how many sort of operational customers we have. You know, everything from aerospace and defense, we've got companies monitoring satellites, we've got tons of industrial users, users using us as a process historian on the plant floor, you know, and if we can satisfy their sort of demands for real time historical perspective, that's awesome. I think what we're gonna do here is we're gonna start to edge into the real time that they're used to in terms of, you know, the millisecond response times that they expect of their control systems, certainly not their historians and databases. >>Is this available, these innovations, to InfluxDB Cloud customers only, who can access this capability? >>Yeah. I mean commercially and today, yes.
You know, I think we want to emphasize that, for now, our goal is to get our latest and greatest and our best to everybody over time, of course. You know, one of the things we had to do here was double down on sort of our commitment to open source and availability. So anybody today can take a look at the libraries on our GitHub and, you know, can inspect them and even try to, you know, implement or execute some of it themselves in their own infrastructure. You know, we're committed to bringing our sort of latest and greatest to our cloud customers first for a couple of reasons. Number one, you know, there are big workloads and they have high expectations of us. I think number two, it also gives us the opportunity to monitor a little bit more closely how it's working, how they're using it, like how the system itself is performing. >>And so just, you know, being careful, maybe a little cautious in terms of how big we go with this right away, just sort of limits, you know, the risk of, you know, any issues that can come with new software rollouts. We haven't seen anything so far, but also it does give us the opportunity to have meaningful conversations with a small group of users who are using the products, but once we get through that and they give us two thumbs up on it, it'll be like, open the gates and let everybody in. It's gonna be an exciting time for the whole ecosystem. >>Yeah, that makes a lot of sense. And you can do some experimentation, you know, using the cloud resources. Let's dig into some of the architectural and technical innovations that are gonna help deliver on this vision. What should we know there? >>Well, I mean, I think foundationally we built the new core on Rust. You know, this is a very popular systems language, you know, it's extremely efficient, but it's also built for speed and memory safety, which goes back to us being able to deliver it in a way that is, you know, something we can inspect very closely, but then also rely on the fact that it's going to behave well and surface error conditions if it does find them. I mean, we've loved working with Go, and, you know, a lot of our libraries will continue to be implemented in Go, but you know, when it came to this particular new engine, you know, that power, performance and stability, Rust was critical. On top of that, we've also integrated Apache Arrow and Apache Parquet for persistence. I think for anybody who's really familiar with the nuts and bolts of our backend and our TSI and our time-structured merge trees, this is a big break from that, you know, Arrow on the sort of in-memory side and then Parquet on the on-disk side. >>It allows us to present, you know, a unified set of APIs for those really fast real time inquiries that we talked about, as well as for very large, you know, historical sort of bulk data archives in that Parquet format, which is also cool because there's an entire ecosystem sort of popping up around Parquet in terms of the machine learning community, you know, and getting that all to work, we had to glue it together with Arrow Flight. That's sort of what we're using as our RPC component. You know, it handles the orchestration and the transportation of the columnar data.
Now we're moving to a true columnar database model for this version of the engine, you know, and it removes a lot of overhead for us in terms of having to manage all that serialization and deserialization, and, you know, to that again, blurring that line between real time and historical data. It's, you know, highly optimized for both streaming micro batch and then batches, but true streaming as well. >>Yeah. Again, I mean, it's funny you mentioned Rust. It's been around for a long time, but its popularity is, you know, really starting to hit that steep part of the S-curve. And we're gonna dig into more of that, but is there anything else that we should know about, Brian? Give us the last word. >>Well, I mean, I think first I'd like everybody watching just to take a look at what we're offering in terms of early access and beta programs. I mean, if you wanna participate, or if you wanna work sort of in terms of early access with the new engine, please reach out to the team. I'm sure you know, there's a lot of communications going out and, you know, it'll be highly featured on our website, you know, but reach out to the team. Believe it or not, we have a lot more going on than just the new engine, and so there are also other programs, things we're offering to customers in terms of the user interface, data collection and things like that. And, you know, if you're a customer of ours and you have a sales team, a commercial team that you work with, you can reach out to them and see what you can get access to, because we can flip a lot of stuff on, especially in cloud, through feature flags. >>But if there's something new that you wanna try out, we'd just love to hear from you. And then, you know, our goal would be that as we give you access to all of these new cool features that, you know, you would give us continuous feedback on these products and services, not only what you need today, but then what you'll need tomorrow to sort of build the next versions of your business. Because you know, the whole database, the ecosystem as it expands out into, you know, this vertically oriented stack of cloud services and enterprise databases and edge databases, you know, it's gonna be what we all make it together, not just, you know, those of us who are employed by InfluxData. And then finally I would just say please, watch Anais's and Tim's sessions. These are two of our best and brightest. They're totally brilliant, completely pragmatic, and they are most of all customer obsessed, which is amazing. And there's no better take, honestly, on the sort of technical details of this, especially when it comes to the value that these investments will bring to our customers and our communities. So I encourage you to, you know, pay more attention to them than you did to me, for sure. >>Brian Gilmore, great stuff. Really appreciate your time. Thank you. >>Yeah, thanks Dave. It was awesome. Look forward to it. >>Yeah, me too. Looking forward to seeing how the community actually applies these new innovations and goes beyond just the historical into the real time, really hot area. As Brian said, in a moment I'll be right back with Anais Dotis-Georgiou to dig into the critical aspects of key open source components of the InfluxDB engine, including Rust, Arrow, Parquet and DataFusion. Keep it right there. You don't wanna miss this. >>Time series data is everywhere.
The number of sensors, systems and applications generating time series data increases every day. All these data sources producing so much data can cause analysis paralysis. InfluxDB is an entire platform designed with everything you need to quickly build applications that generate value from time series data. InfluxDB Cloud is a serverless solution, which means you don't need to buy or manage your own servers. There's no need to worry about provisioning because you only pay for what you use. InfluxDB Cloud is fully managed, so you get the newest features and enhancements as they're added to the platform's code base. It also means you can spend time building solutions and delivering value to your users instead of wasting time and effort managing something else. InfluxDB Cloud offers a range of security features to protect your data. Multiple layers of redundancy ensure you don't lose any data. Access controls ensure that only the people who should see your data can see it. >>And encryption protects your data at rest and in transit between any of our regions or cloud providers. InfluxDB uses a single API across the entire platform suite, so you can build on open source, deploy to the cloud, and then easily query data in the cloud, at the edge or on prem using the same scripts. And InfluxDB is schemaless, automatically adjusting to changes in the shape of your data without requiring changes in your application logic. InfluxDB Cloud is production ready from day one. All it needs is your data and your imagination. Get started today at influxdata.com/cloud. >>Okay, we're back. I'm Dave Vellante with the Cube, and you're watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData. Anais Dotis-Georgiou is here, she's a developer advocate for InfluxData, and we're gonna dig into the rationale and value contribution behind several open source technologies that InfluxDB is leveraging to increase the granularity of time series analysis and bring the world of data into real-time analytics. Anais, welcome to the program. Thanks for coming on. >>Hi, thank you so much. It's a pleasure to be here. >>Oh, you're very welcome. Okay, so IOx is being touted as this next gen open source core for InfluxDB. And my understanding is that it leverages in-memory, of course, for speed. It's a column store, so it gives you compression efficiency, it's gonna give you faster query speeds, you store files in object storage, so you've got a very cost effective approach. Are these the salient points on the platform? I know there are probably dozens of other features, but what are the high level value points that people should understand? >>Sure, that's a great question. So some of the main requirements that IOx is trying to achieve, and some of the most impressive ones to me: the first one is that it aims to have no limits on cardinality and also to allow you to write any kind of event data that you want, whether that's a tag or a field. It also wants to deliver best in class performance on analytics queries, in addition to our already well served metrics queries. We also wanna have operator control over memory usage, so you should be able to define how much memory is used for buffering, caching and query processing. Some other really important parts are the ability to have bulk data export and import, super useful.
Also broader ecosystem compatibility: where possible we aim to use and embrace emerging standards in the data analytics ecosystem and have compatibility with things like SQL, Python, and maybe even pandas in the future. >>Okay, so a lot there. Now we talked to Brian about how you're using Rust, which is not a new programming language, and of course we had some drama around Rust during the pandemic with the Mozilla layoffs, but the formation of the Rust Foundation really addressed any of those concerns. You got big guns like Amazon and Google and Microsoft throwing their collective weights behind it. The adoption is really starting to get steep on the S-curve. So lots of platforms, lots of adoption with Rust, but why Rust as an alternative to, say, C++ for example? >>Sure, that's a great question. So Rust was chosen because of its exceptional performance and reliability. While Rust is syntactically similar to C++ and it has similar performance, and it also compiles to native code like C++, unlike C++ it also has much better memory safety. So memory safety is protection against bugs or security vulnerabilities that lead to excessive memory usage or memory leaks, and Rust achieves this memory safety due to its innovative type system. Additionally, it doesn't allow for dangling pointers, and dangling pointers are the main classes of errors that lead to exploitable security vulnerabilities in languages like C++. So Rust helps meet that requirement of having no limits on cardinality, for example, because we're also using the Rust implementation of Apache Arrow and this control over memory. And also Rust's packaging system, called crates.io, offers everything that you need out of the box to have features like async and await to fix race conditions, protection against buffer overflows, and thread safe async caching structures as well. So essentially it just has all the fine grained control you need to take advantage of memory and all your resources as well as possible, so that you can handle those really, really high cardinality use cases. >>Yeah, and the more I learn about the new engine and the platform, IOx et cetera, you know, you see things like, you know, in the old days and even today, you do a lot of garbage collection in these systems and there's an inverse, you know, impact relative to performance. So it looks like, you know, the community is really modernizing the platform. But I wanna talk about Apache Arrow for a moment. It's designed to address the constraints that are associated with analyzing large data sets. We know that, but please explain: what is Arrow and what does it bring to InfluxDB? >>Sure, yeah. So Arrow is a framework for defining in-memory columnar data, and so much of the efficiency and performance of IOx comes from taking advantage of columnar data structures. And I will, if you don't mind, take a moment to kind of illustrate why columnar data structures are so valuable. Let's pretend that we are gathering field data about the temperature in our room and also maybe the temperature of our stove. And in our table we have those two temperature values as well as maybe a measurement value, a timestamp value, maybe some other tag values that describe what room and what house, et cetera, we're getting this data from.
And so you can picture this table where we have like two rows with the two temperature values for both our room and the stove. Well, usually our room temperature is regulated, so those values don't change very often. >>So when you have column oriented storage, essentially you take each column and group it together. And so if that's the case, and you're just taking temperature values from the room, and a lot of those temperature values are the same, then you might be able to imagine how equal values will neighbor each other in the storage format, and this provides a really perfect opportunity for cheap compression. And then this cheap compression enables high cardinality use cases. It also enables faster scan rates. So if you wanna find, say, the min and max value of the temperature in the room across a thousand different points, you only have to get those thousand points in order to answer that question, and you have those immediately available to you. But let's contrast this with a row oriented storage solution instead, so that we can understand better the benefits of column oriented storage. >>So if you had row oriented storage, you'd first have to look at every field, like the temperature in the room and the temperature of the stove. You'd have to go across every tag value that maybe describes where the room is located or what model the stove is, and every timestamp. You'd then have to pluck out that one temperature value that you want at that one timestamp and do that for every single row. So you're scanning across a ton more data, and that's why row oriented doesn't provide the same efficiency as columnar, and Apache Arrow is an in-memory columnar data framework. So that's where a lot of the advantages come from. >>Okay. So you basically described like a traditional database, a row approach, but I've seen a lot of traditional databases say, okay, now we can handle columnar format, versus what you're talking about is really, you know, kind of native. Is it not as effective? Is the format not as effective because it's largely a bolt on? Can you elucidate on that front? >>Yeah, it's not as effective because you have more expensive compression and because you can't scan across the values as quickly. And so those are pretty much the main reasons why row oriented storage isn't as efficient as column oriented storage. Yeah. >>Got it. So let's talk about Arrow DataFusion. What is DataFusion? I know it's written in Rust, but what does it bring to the table here? >>Sure. So it's an extensible query execution framework, and it uses Arrow as its in-memory format. So the way that it helps in InfluxDB IOx is that, okay, it's great if you can write an unlimited amount of cardinality into InfluxDB, but if you don't have a query engine that can successfully query that data, then I don't know how much value it is for you. So DataFusion helps enable the query process and transformation of that data. It also has a Pandas API so that you could take advantage of Pandas data frames as well and all of the machine learning tools associated with Pandas. >>Okay. You're also leveraging Parquet in the platform, because we heard a lot about Parquet in the middle of the last decade as a storage format to improve on Hadoop column stores. What are you doing with Parquet and why is it important? >>Sure. So Parquet is the column oriented durable file format.
So it's important because it'll enable bulk import and bulk export, it has compatibility with Python and Pandas, so it supports a broader ecosystem. Parquet files also take very little disk space and they're faster to scan because, again, they're column oriented. In particular, I think Parquet files are like 16 times cheaper than CSV files, just as kind of a point of reference. And so that's essentially a lot of the benefits of Parquet. >>Got it. Very popular. So, Anais, what exactly is InfluxData focusing on as a committer to these projects? What is your focus? What's the value that you're bringing to the community? >>Sure. So InfluxData, first, has contributed a lot of different things to the Apache ecosystem. For example, they contributed an implementation of Apache Arrow in Go, and that will support querying with Flux. Also, there have been quite a few contributions to DataFusion for things like memory optimization and support for additional SQL features, like support for timestamp arithmetic, support for EXISTS clauses, and support for memory control. So yeah, Influx has contributed a lot to the Apache ecosystem and continues to do so. And I think kind of the idea here is that if you can improve these upstream projects, the long term strategy is that the more you contribute and build those up, then the more you will perpetuate that cycle of improvement and the more we will invest in our own project as well. So it's just that kind of symbiotic relationship and appreciation of the open source community. >>Yeah. Got it. You got that virtuous cycle going, what people call the flywheel. Give us your last thoughts and kind of summarize, you know, what the big takeaways are from your perspective. >>So I think the big takeaway is that InfluxData is doing a lot of really exciting things with InfluxDB IOx, and I really encourage, if you are interested in learning more about the technologies that Influx is leveraging to produce IOx, the challenges associated with it and all of the hard work and questions, and you just wanna learn more, then I would encourage you to go to the monthly tech talks and community office hours; they are on every second Wednesday of the month at 8:30 AM Pacific time. There are also community forums and a community Slack channel; look for the influxdb_iox channel specifically to learn more about how to join those office hours and those monthly tech talks, as well as to ask any questions you have about IOx, what to expect, and what you'd like to learn more about. As a developer advocate, I wanna answer your questions. So if there's a particular technology or stack that you wanna dive deeper into and want more explanation about how InfluxDB leverages it to build IOx, I will be really excited to produce content on that topic for you. >>Yeah, that's awesome. You guys have a really rich community, collaborate with your peers, solve problems, and you guys are super responsive, so really appreciate that. All right, thank you so much, Anais, for explaining all this open source stuff to the audience and why it's important to the future of data. >>Thank you. I really appreciate it. >>All right, you're very welcome. Okay, stay right there, and in a moment I'll be back with Tim Yocum, he's the director of engineering for InfluxData, and we're gonna talk about how you update a SaaS engine while the plane is flying at 30,000 feet. You don't wanna miss this.
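The columnar layout and Parquet benefits Anais walks through above can be sketched in a few lines of Python. This is a minimal illustration using the pyarrow bindings rather than the Rust crates the IOx engine itself is built on; the file name and sample values are invented for the example.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# A tiny columnar table like the room/stove example: each column is stored
# contiguously, so runs of repeated values (the regulated room temperature)
# compress cheaply, and scans only touch the columns a query needs.
table = pa.table({
    "time_ms":    pa.array([1_000, 2_000, 3_000, 4_000], type=pa.int64()),
    "room_temp":  pa.array([21.0, 21.0, 21.0, 21.1]),
    "stove_temp": pa.array([150.0, 180.0, 210.0, 240.0]),
})

# Min and max over one column read only that column's buffers.
print(pc.min_max(table["room_temp"]))

# Persist the same columnar layout durably as Parquet, then read back
# just the columns of interest and hand them to pandas.
pq.write_table(table, "temps.parquet", compression="zstd")
subset = pq.read_table("temps.parquet", columns=["time_ms", "room_temp"])
print(subset.to_pandas())
```

A row-oriented file would force a scan across every field, tag, and timestamp to answer the same min/max question, which is the contrast Anais draws with row storage.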
>>I'm really glad that we went with InfluxDB Cloud for our hosting, because it has saved us a ton of time. It's helped us move faster, it's saved us money, and InfluxDB also has good support. My name's Alex Nauda. I am CTO at Nobl9. Nobl9 is a platform to measure and manage service level objectives, which is a great way of measuring the reliability of your systems. You can essentially think of an SLO, the product we're providing to our customers, as a bunch of time series, so we need a way to store that data and the corresponding time series that are related to those. The main reason that we settled on InfluxDB as we were shopping around is that InfluxDB has a very flexible query language and, as a general purpose time series database, it basically had the set of features we were looking for. >>As our platform has grown, we found InfluxDB Cloud to be a really scalable solution. We can quickly iterate on new features and functionality because InfluxDB Cloud is entirely managed; it probably saved us at least a full additional person on our team. We also have the option of running InfluxDB Enterprise, which gives us the ability to even host off the cloud or in a private cloud if that's preferred by a customer. InfluxData has been really flexible in adapting to the hosting requirements that we have. They listened to the challenges we were facing and they helped us solve them. As we've continued to grow, I'm really happy we have InfluxData by our side. >>Okay, we're back with Tim Yocum, who is the director of engineering at InfluxData. Tim, welcome. Good to see you. >>Good to see you. Thanks for having me. >>You're really welcome. Listen, we've been covering open source software in the Cube for more than a decade, and we've kind of watched the innovation from the big data ecosystem. The cloud has been built out on open source, mobile, social platforms, key databases, and of course InfluxDB, and InfluxData has been a big consumer and contributor of open source software. So my question to you is, where have you seen the biggest bang for the buck from open source software? >>So yeah, you know, Influx really, we thrive at the intersection of commercial services and open source software. So OSS keeps us on the cutting edge. We benefit from OSS in delivering our own service, from our core storage engine technologies to web services and templating engines. Our team stays lean and focused because we build on proven tools. We really build on the shoulders of giants, and, like you've mentioned, even better, we contribute a lot back to the projects that we use, as well as to our own product, InfluxDB. >>You know, but I gotta ask you, Tim, because one of the challenges that we've seen in particular, you saw this in the heyday of Hadoop: the innovations come so fast and furious, and as a software company you gotta place bets, you gotta, you know, commit people, and sometimes those bets can be risky and not pay off. How have you managed this challenge? >>Oh, it moves fast. Yeah, that's a benefit though, because the community moves so quickly that today's hot technology can be tomorrow's dinosaur. And what we tend to do is we fail fast and fail often. We try a lot of things. You know, you look at Kubernetes for example; that ecosystem is driven by thousands of intelligent developers, engineers, builders, and they're adding value every day. So we have to really keep up with that.
And as the stack changes, we try different technologies, we try different methods, and at the end of the day, we come up with a better platform as a result of just the constant change in the environment. It is a challenge for us, but it's something that we just do every day. >>So we have a survey partner down in New York City called Enterprise Technology Research, ETR, and they do these quarterly surveys of about 1500 CIOs and IT practitioners, and they really have a good pulse on what's happening with spending. And the data shows that containers generally, but specifically Kubernetes, is one of the areas that has been off the charts and seen the most significant adoption and velocity, particularly, you know, along with cloud. But really Kubernetes is just, you know, still up and to the right consistently, even with, you know, the macro headwinds and all of the stuff that we're sick of talking about. So what are you doing with Kubernetes in the platform? >>Yeah, it's really central to our ability to run the product. When we first started out, we were just on AWS, and the way we were running was a little bit like containers junior. Now we're running Kubernetes everywhere: AWS, Azure, Google Cloud. It allows us to have a consistent experience across three different cloud providers, and we can manage that in code, so our developers can focus on delivering services, not trying to learn the intricacies of Amazon, Azure, and Google and figure out how to deliver services on those three clouds with all of their differences. >>Just to follow up on that: so I presume, it sounds like, there's a PaaS layer there to allow you guys to have a consistent experience across clouds and out to the edge, you know, wherever. Is that correct? >>Yeah, so we've basically built more or less platform engineering; this is the new hot phrase, you know. Kubernetes has made a lot of things easy for us because we've built a platform that our developers can lean on, and they only have to learn one way of deploying their application and managing their application. And so that just gets all of the underlying infrastructure out of the way and lets them focus on delivering Influx Cloud. >>Yeah, and I know I'm taking a little bit of a tangent, but is that, I'll call it a PaaS layer if I can use that term, are there specific attributes to InfluxDB, or is it kind of just generally off the shelf PaaS? You know, is there any purpose built capability there that is value add, or is it pretty much generic? >>So we really look at things through a build versus buy lens. Some things we want to leverage cloud provider services for, for instance Postgres databases for metadata; perhaps we'll get that off of our plate and let someone else run that. We're going to deploy a platform that our engineers can deliver on, that has consistency, that is all generated from code that we can, as an SRE group, as an ops team, manage with very few people really, and we can stamp out clusters across multiple regions in no time. >>So sometimes you build, sometimes you buy it. How do you make those decisions, and what does that mean for the platform and for customers? >>Yeah, so what we're doing is, like everybody else will do, we're looking for trade offs that make sense. You know, we really want to protect our customers' data.
So we look for services that support our own software with the most uptime, reliability, and durability we can get. Some things are just going to be easier to have a cloud provider take care of on our behalf. We make that transparent for our own team, and of course for customers you don't even see that. But we don't want to try to reinvent the wheel; like I had mentioned with SQL data stores for metadata, perhaps let's build on top of what these three large cloud providers have already perfected, and we can then focus on our platform engineering and have our developers focus on the InfluxData software, the Influx Cloud software. >>So take it to the customer level: what does it mean for them? What's the value that they're gonna get out of all these innovations that we've been talking about today, and what can they expect in the future? >>So first of all, people who use the OSS product are really gonna be at home on our cloud platform. You can run it on your desktop machine, on a single server, what have you, but then you want to scale up. We have some 270 terabytes of data across over 4 billion series keys that people have stored, so there's a proven ability to scale. Now, in terms of the open source software and how we've developed the platform, you're getting a highly available, high cardinality time series platform. We manage it, and really, as I mentioned earlier, we can keep up with the state of the art. We keep reinventing, we keep deploying things in real time. We deploy to our platform every day, repeatedly, all the time. And it's that continuous deployment that allows us to continue testing things in flight, rolling things out that change new features, better ways of doing deployments, safer ways of doing deployments. >>All of that happens behind the scenes. And like we had mentioned earlier, Kubernetes, I mean, that allows us to get that done. We couldn't do it without having that platform as a base layer for us to then put our software on. So we iterate quickly. When you're on the Influx Cloud platform, you really are able to take advantage of new features immediately. We roll things out every day, and as those things go into production, you have the ability to use them. And so in the end, we want you to focus on getting actual insights from your data instead of running infrastructure; you know, let us do that for you. >>And that makes sense. But are the innovations that we're talking about in the evolution of InfluxDB, do you see that as sort of a natural evolution for existing customers? I'm sure the answer is both, but is it opening up new territory for customers? Can you add some color to that? >>Yeah, it really is a little bit of both. Any engineer will say, well, it depends. So cloud native technologies are really the hot thing. IoT, and industrial IoT especially: people want to just shove tons of data out there and be able to do queries immediately, and they don't wanna manage infrastructure. What we've started to see are people that use the cloud service as their data store backbone, and then they use edge computing with our OSS product to ingest data from, say, multiple production lines and downsample that data, and send the rest of that data off to Influx Cloud where the heavy processing takes place.
So really, us being in all the different clouds and iterating on that, and being in all sorts of different regions, allows people to really get out of the business of trying to manage that big data, and have us take care of that. And of course, as we change the platform, end users benefit from that immediately. >>And so, obviously taking away a lot of the heavy lifting for the infrastructure, would you say the same thing about security, especially as you go out to IoT and the edge? How should we be thinking about the value that you bring from a security perspective? >>Yeah, we take security super seriously. It's built into our DNA. We do a lot of work to ensure that our platform is secure and that the data we store is kept private. It's of course always a concern; you see in the news all the time companies being compromised, you know. That's something that you can have an entire team working on, which we do, to make sure that the data that you have, whether it's in transit, whether it's at rest, is always kept secure and is only viewable by you. You know, you look at things like a software bill of materials: if you're running this yourself, you have to go vet all sorts of different pieces of software. And we do that, you know, as we use new tools. That's just part of our jobs, to make sure that the platform that we're running has fully vetted software, and with open source especially, that's a lot of work. And so it's definitely new territory. Supply chain attacks are definitely happening at a higher clip than they used to, but that is really just part of a day in the life for folks like us that are building platforms. >>Yeah, and that's key. I mean, especially when you start getting into, you know, we talk about IoT and the operations technologies, the engineers running that infrastructure: you know, historically, as you know, Tim, they would air gap everything. That's how they kept it safe. But that's not feasible anymore. Everything's connected now, right? And so you've gotta have a partner that, again, takes away that heavy lifting in R&D so you can focus on some of the other activities. Right. Give us the last word and the key takeaways from your perspective. >>Well, you know, from my perspective I see it as a two lane approach with Influx, with any time series data. You know, you've got a lot of stuff that you're gonna run on-prem; what you had mentioned, air gapping, sure, there's plenty of need for that. But at the end of the day, people that don't want to run big data centers, people that want to trust their data to a company that's got a full platform set up for them that they can build on, send that data over to the cloud; the cloud is not going away. I think a more hybrid approach is where the future lives, and that's what we're prepared for. >>Tim, really appreciate you coming to the program. Great stuff. Good to see you. >>Thanks very much. Appreciate it. >>Okay, in a moment I'll be back to wrap up today's session. You're watching The Cube. >>Are you looking for some help getting started with InfluxDB, Telegraf or Flux? >>Check out InfluxDB University, >>Where you can find our entire catalog of free training that will help you make the most of your time series data. >>Get started for free at influxdbu.com. >>We'll see you in class.
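The edge-to-cloud pattern Tim describes, ingest locally, downsample, then ship the rolled-up points to the cloud, can be sketched with the v2 Python client. This is only an illustrative sketch: the URL, token, org, bucket, measurement, and tag names are placeholders, and a production setup would more likely let Telegraf or a task in the OSS InfluxDB instance do the downsampling.

```python
import pandas as pd
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Pretend these one-second readings came from a production line at the edge.
raw = pd.DataFrame(
    {"temp_c": [72.1, 72.3, 72.2, 74.9, 75.1, 75.0]},
    index=pd.date_range("2022-11-01 00:00:00", periods=6, freq="1s", tz="UTC"),
)

# Downsample locally to one-minute means before shipping to the cloud.
rollup = raw.resample("1min").mean()

# Placeholder connection details: substitute your own region URL, token,
# org, and bucket.
client = InfluxDBClient(
    url="https://us-east-1-1.aws.cloud2.influxdata.com",
    token="MY_TOKEN",
    org="my-org",
)
write_api = client.write_api(write_options=SYNCHRONOUS)

# Write one rolled-up point per minute instead of every raw sample.
for ts, row in rollup.iterrows():
    point = (
        Point("line_metrics")
        .tag("line", "assembly-3")
        .field("temp_c_mean", float(row["temp_c"]))
        .time(ts.to_pydatetime(), WritePrecision.S)
    )
    write_api.write(bucket="factory_downsampled", record=point)

client.close()
```

The same client and scripts work against the cloud, the edge, or an on-prem instance, which is the single-API point made in the earlier product overview.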
>>Okay, so we heard today from three experts on time series and data how the InfluxDB platform is evolving to support new ways of analyzing large data sets very efficiently and effectively in real time. And we learned that key open source components like Apache Arrow, the Rust programming environment, DataFusion and Parquet are being leveraged to support realtime data analytics at scale. We also learned about the contributions and importance of open source software and how the InfluxDB community is evolving the platform with minimal disruption to support new workloads, new use cases, and the future of realtime data analytics. Now remember, these sessions are all available on demand. You can go to thecube.net to find those. Don't forget to check out siliconangle.com for all the news related to things enterprise and emerging tech. And you should also check out influxdata.com. There you can learn about the company's products, you'll find developer resources like free courses, you can join the developer community and work with your peers to learn and solve problems, and there are plenty of other resources around use cases and customer stories on the website. This is Dave Vellante. Thank you for watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and brought to you by the Cube, your leader in enterprise and emerging tech coverage.
SUMMARY :
Dave Vellante hosts three conversations on how InfluxDB is evolving into a smart data platform for realtime analytics. Brian Gilmore, director of IoT and emerging technologies at InfluxData, explains why the company built a new engine to stay ahead of the market, how the cloud service has grown, and why the change will be transparent for customers while opening up millisecond-scale, multi-language queries. Developer advocate Anais Dotis-Georgiou digs into the open source building blocks of InfluxDB IOx: Rust for performance and memory safety, Apache Arrow for in-memory columnar data, DataFusion for query execution, and Parquet for durable columnar storage, along with InfluxData's contributions back to those upstream projects. Director of engineering Tim Yocum closes the program with how the team runs the platform on Kubernetes across AWS, Azure, and Google Cloud, makes build-versus-buy decisions, deploys continuously, treats security as part of its DNA, and supports hybrid edge-to-cloud architectures.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Brian Gilmore | PERSON | 0.99+ |
Tim Yoakum | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Tim Yokum | PERSON | 0.99+ |
Dave Valante | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Tim | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
16 times | QUANTITY | 0.99+ |
two rows | QUANTITY | 0.99+ |
New York City | LOCATION | 0.99+ |
60,000 people | QUANTITY | 0.99+ |
Rust | TITLE | 0.99+ |
Influx | ORGANIZATION | 0.99+ |
Influx Data | ORGANIZATION | 0.99+ |
today | DATE | 0.99+ |
Influx Data | ORGANIZATION | 0.99+ |
Python | TITLE | 0.99+ |
three experts | QUANTITY | 0.99+ |
InfluxDB | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
each row | QUANTITY | 0.99+ |
two lane | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
Noble nine | ORGANIZATION | 0.99+ |
thousands | QUANTITY | 0.99+ |
Flux | ORGANIZATION | 0.99+ |
Influx DB | TITLE | 0.99+ |
each column | QUANTITY | 0.99+ |
270 terabytes | QUANTITY | 0.99+ |
cube.net | OTHER | 0.99+ |
twice | QUANTITY | 0.99+ |
Bryan | PERSON | 0.99+ |
Pandas | TITLE | 0.99+ |
c plus plus | TITLE | 0.99+ |
three years ago | DATE | 0.99+ |
two | QUANTITY | 0.99+ |
more than a decade | QUANTITY | 0.98+ |
Apache | ORGANIZATION | 0.98+ |
dozens | QUANTITY | 0.98+ |
free@influxdbu.com | OTHER | 0.98+ |
30,000 feet | QUANTITY | 0.98+ |
Rust Foundation | ORGANIZATION | 0.98+ |
two temperature values | QUANTITY | 0.98+ |
In Flux Data | ORGANIZATION | 0.98+ |
one time stamp | QUANTITY | 0.98+ |
tomorrow | DATE | 0.98+ |
Russ | PERSON | 0.98+ |
IOT | ORGANIZATION | 0.98+ |
Evolving InfluxDB | TITLE | 0.98+ |
first | QUANTITY | 0.97+ |
Influx data | ORGANIZATION | 0.97+ |
one | QUANTITY | 0.97+ |
first one | QUANTITY | 0.97+ |
Influx DB University | ORGANIZATION | 0.97+ |
SQL | TITLE | 0.97+ |
The Cube | TITLE | 0.96+ |
Influx DB Cloud | TITLE | 0.96+ |
single server | QUANTITY | 0.96+ |
Kubernetes | TITLE | 0.96+ |
Evolving InfluxDB into the Smart Data Platform Close
>> Okay, so we heard today from three experts on time series and data, how the InfluxDB platform is evolving to support new ways of analyzing large data sets very efficiently and effectively in realtime. And we learned that key open source components like Apache Arrow, the Rust programming environment, DataFusion and Parquet are being leveraged to support realtime data analytics at scale. We also learned about the contributions and importance of open source software and how the InfluxDB community is evolving the platform with minimal disruption to support new workloads, new use cases, and the future of realtime data analytics. Now remember these sessions, they're all available on demand. You can go to thecube.net to find those. Don't forget to check out siliconangle.com for all the news related to things enterprise and emerging tech. And you should also check out influxdata.com. There you can learn about the company's products, you'll find developer resources like free courses, you can join the developer community and work with your peers to learn and solve problems, and there are plenty of other resources around use cases and customer stories on the website. This is Dave Vellante. Thank you for watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and brought to you by theCUBE, your leader in enterprise and emerging tech coverage.
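As a concrete illustration of the pieces named in that wrap-up, here is a small sketch that queries a Parquet file with SQL through the Apache Arrow DataFusion Python bindings. It assumes the third-party `datafusion` package is installed and that a `temps.parquet` file like the one from the earlier columnar example exists; the exact API surface varies between releases, so treat this as a sketch rather than a reference.

```python
# pip install datafusion pyarrow   (assumed available; API may differ by version)
from datafusion import SessionContext

ctx = SessionContext()

# Register a Parquet file as a table and query it with plain SQL.
ctx.register_parquet("temps", "temps.parquet")
df = ctx.sql(
    """
    SELECT min(room_temp) AS room_min,
           max(room_temp) AS room_max,
           avg(stove_temp) AS stove_avg
    FROM temps
    """
)

# DataFusion plans and executes the query over Arrow record batches; the
# result can be collected as Arrow data or handed to pandas for analysis.
print(df.to_pandas())
```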
SUMMARY :
Dave Vellante wraps up the program, recapping how open source components such as Apache Arrow, Rust, DataFusion, and Parquet are powering realtime analytics at scale in InfluxDB, and points viewers to the on-demand sessions, siliconangle.com, and the resources and community at influxdata.com.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vellante | PERSON | 0.99+ |
three experts | QUANTITY | 0.99+ |
thecube.net | OTHER | 0.99+ |
siliconangle.com | OTHER | 0.99+ |
InfluxDB | TITLE | 0.99+ |
today | DATE | 0.99+ |
influxdata.com | OTHER | 0.98+ |
theCUBE | ORGANIZATION | 0.95+ |
InfluxData | ORGANIZATION | 0.85+ |
Evolving | TITLE | 0.79+ |
Rust | TITLE | 0.62+ |
Apache Arrow | ORGANIZATION | 0.54+ |
DataFusion | TITLE | 0.48+ |
Evolving InfluxDB into the Smart Data Platform Open
>> This past May, the Cube, in collaboration with Influx Data, shared with you the latest innovations in time series databases. We talked at length about why a purpose-built time series database, for many use cases, was a superior alternative to general purpose databases trying to do the same thing. Now, you may remember that time series data is any data that's stamped in time, and if it's stamped, it can be analyzed historically. And when we introduced the concept to the community, we talked about how in theory those time slices could be taken, you know, every hour, every minute, every second, down to the millisecond, and how the world was moving toward realtime or near realtime data analysis to support physical infrastructure like sensors and other devices and IoT equipment. Time series databases have had to evolve to efficiently support realtime data in emerging IoT and other use cases. And to do that, new architectural innovations have to be brought to bear. As is often the case, open source software is the linchpin to those innovations. Hello and welcome to Evolving InfluxDB into the Smart Data Platform, made possible by Influx Data and produced by the Cube. My name is Dave Vellante, and I'll be your host today. Now, in this program, we're going to dig pretty deep into what's happening with time series data generally, and specifically how InfluxDB is evolving to support new workloads, demands, and data, particularly around data analytics use cases in real time. Now, first we're going to hear from Brian Gilmore, who is the director of IoT and emerging technologies at Influx Data. And we're going to talk about the continued evolution of InfluxDB and the new capabilities enabled by open source generally and specific tools. And in this program, you're going to hear a lot about things like the Rust implementation of Apache Arrow, the use of Parquet, and tooling such as DataFusion, which are powering a new engine for InfluxDB. Now, these innovations, they evolve the idea of time series analysis by dramatically increasing the granularity of time series data by compressing the historical time slices, if you will, from, for example, minutes down to milliseconds. And at the same time, enabling real time analytics with an architecture that can process data much faster and much more efficiently. Now, after Brian, we're going to hear from Anais Dotis-Georgiou, who is a developer advocate at Influx Data. And we're going to get into the "why's" of these open source capabilities, and how they contribute to the evolution of the InfluxDB platform. And then we're going to close the program with Tim Yocum. He's the director of engineering at Influx Data, and he's going to explain how the InfluxDB community actually evolved the data engine in mid-flight and which decisions went into the innovations that are coming to the market. Thank you for being here. We hope you enjoy the program. Let's get started.
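To make the idea of time slices concrete, here is a minimal Python sketch. Pandas and the synthetic readings are assumptions for illustration only, not something used in the program; the point is simply that data stamped at one resolution can be viewed at coarser slices.

```python
# Illustrative sketch only: time-stamped readings downsampled into coarser
# "time slices" with pandas. Column names and values are hypothetical.
import numpy as np
import pandas as pd

# One reading per second for ten minutes, stamped in time
idx = pd.date_range("2022-11-01 00:00:00", periods=600, freq="1s")
readings = pd.DataFrame({"temperature": 20 + np.random.randn(600) * 0.1}, index=idx)

# The same series viewed at different granularities ("time slices")
per_minute = readings.resample("1min").mean()   # 600 rows -> 10 rows
per_hour = readings.resample("1h").mean()       # 600 rows -> 1 row

print(per_minute.head())
print(per_hour)
```

The granularity trade-off runs the other way too: the architectural work described in this program is about pushing usable slices down from minutes and seconds toward milliseconds.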
SUMMARY :
by compressing the historical time slices
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Brian Gilmore | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Tim Yocum | PERSON | 0.99+ |
Influx Data | ORGANIZATION | 0.99+ |
Anais Dotis-Georgiou | PERSON | 0.99+ |
Influx DB | TITLE | 0.99+ |
InfluxDB | TITLE | 0.94+ |
first | QUANTITY | 0.91+ |
today | DATE | 0.88+ |
second | QUANTITY | 0.85+ |
Time | TITLE | 0.82+ |
Parquet | TITLE | 0.76+ |
Apache | ORGANIZATION | 0.75+ |
past May | DATE | 0.75+ |
Influx | TITLE | 0.75+ |
IOT | ORGANIZATION | 0.69+ |
Cube | ORGANIZATION | 0.65+ |
influx | ORGANIZATION | 0.53+ |
Arrow | TITLE | 0.48+ |
Anais Dotis-Georgiou, InfluxData
(upbeat music) >> Okay, we're back. I'm Dave Vellante with The Cube and you're watching Evolving InfluxDB into the Smart Data Platform, made possible by Influx Data. Anais Dotis-Georgiou is here. She's a developer advocate for Influx Data, and we're going to dig into the rationale and value contribution behind several open source technologies that InfluxDB is leveraging to increase the granularity of time series analysis and bring the world of data into realtime analytics. Anais, welcome to the program. Thanks for coming on. >> Hi, thank you so much. It's a pleasure to be here. >> Oh, you're very welcome. Okay, so IOx is being touted as this next gen open source core for InfluxDB. And my understanding is that it leverages in memory, of course for speed. It's a columnar store, so it gives you compression efficiency, it's going to give you faster query speeds, and it lets you store files in object storage, so you've got a very cost effective approach. Are these the salient points on the platform? I know there are probably dozens of other features, but what are the high level value points that people should understand? >> Sure, that's a great question. So some of the main requirements that IOx is trying to achieve, and some of the most impressive ones to me, the first one is that it aims to have no limits on cardinality and also allow you to write any kind of event data that you want, whether that's a tag or a field. It also wants to deliver the best in class performance on analytics queries. In addition to our already well-served metric queries, we also want to have operator control over memory usage. So you should be able to define how much memory is used for buffering, caching and query processing. Some other really important parts are the ability to have bulk data export and import, super useful. Also, broader ecosystem compatibility: where possible we aim to use and embrace emerging standards in the data analytics ecosystem and have compatibility with things like SQL, Python and maybe even Pandas in the future. >> Okay, so a lot there. Now we talked to Brian about how you're using Rust, which is not a new programming language, and of course we had some drama around Rust during the pandemic with the Mozilla layoffs, but the formation of the Rust Foundation really addressed any of those concerns, and you got big guns like Amazon and Google and Microsoft throwing their collective weight behind it. Its adoption is really starting to get steep on the S-curve. So lots of platforms, lots of adoption with Rust, but why Rust as an alternative to say C++ for example? >> Sure, that's a great question. So Rust was chosen because of its exceptional performance and reliability. So while Rust is syntactically similar to C++ and has similar performance, it also compiles to native code like C++. But unlike C++, it also has much better memory safety. So memory safety is protection against bugs or security vulnerabilities that lead to excessive memory usage or memory leaks. And Rust achieves this memory safety due to its innovative type system. Additionally, it doesn't allow for dangling pointers, and dangling pointers are the main classes of errors that lead to exploitable security vulnerabilities in languages like C++. 
So Rust helps meet that requirement of having no limits on cardinality, for example, because we're also using the Rust implementation of Apache Arrow and this control over memory, and also Rust's packaging system, called Crates.io, offers everything that you need out of the box to have features like async and await to fix race conditions, to protect against buffer overflows, and to ensure thread-safe async caching structures as well. So essentially it just has all the fine grain control you need to take advantage of memory and all your resources as well as possible so that you can handle those really, really high cardinality use cases. >> Yeah, and the more I learn about the new engine and the platform, IOx et cetera, you see things like, in the old days and even today, you do a lot of garbage collection in these systems and there's an inverse impact relative to performance. So it looks like you're really, the community is modernizing the platform, but I want to talk about Apache Arrow for a moment. It's designed to address the constraints that are associated with analyzing large data sets. We know that, but please explain why, what is Arrow and what does it bring to InfluxDB? >> Sure. Yeah. So Arrow is a framework for defining in-memory column data. And so much of the efficiency and performance of IOx comes from taking advantage of column data structures. And I will, if you don't mind, take a moment to kind of illustrate why column data structures are so valuable. Let's pretend that we are gathering field data about the temperature in our room and also maybe the temperature of our stove. And in our table we have those two temperature values as well as maybe a measurement value, a timestamp value, and maybe some other tag values that describe what room and what house, et cetera, we're getting this data from. And so you can picture this table where we have like two rows with the two temperature values for both our room and the stove. Well, usually our room temperature is regulated, so those values don't change very often. So when you have column oriented storage, essentially you take each column and group its values together. And so if that's the case and you're just taking temperature values from the room and a lot of those temperature values are the same, then you might be able to imagine how equal values, when they neighbor each other in the storage format, provide a really perfect opportunity for cheap compression. And then this cheap compression enables high cardinality use cases. It also enables faster scan rates. So if you want to find like the min and max value of the temperature in the room across a thousand different points, you only have to get those thousand points in order to answer that question, and you have those immediately available to you. But let's contrast this with a row oriented storage solution instead so that we can understand better the benefits of column oriented storage. So if you had row oriented storage, you'd first have to look at every field, like the temperature in the room and the temperature of the stove. You'd have to go across every tag value that maybe describes where the room is located or what model the stove is. And at every timestamp you then have to pluck out that one temperature value that you want at that one time stamp, and do that for every single row. 
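As a hedged sketch of the column layout Anais describes, here is roughly what that hypothetical room/stove table looks like when built with pyarrow. The library choice and the values are illustrative assumptions only; IOx itself implements this in Rust on top of the same Arrow format.

```python
# Minimal sketch of a column-oriented table, assuming the pyarrow library.
# Field/tag names mirror the hypothetical room/stove example above.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "time":        pa.array([1, 2, 3, 4], type=pa.int64()),
    "location":    pa.array(["room", "room", "stove", "stove"]),
    "temperature": pa.array([21.0, 21.0, 180.5, 181.0]),
})

# Each column is stored contiguously, so repeated neighboring values
# (like the steady room temperature) compress cheaply, and a min/max
# scan only has to touch the one column it needs.
temps = table.column("temperature")
print(pc.min(temps), pc.max(temps))
```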
So you're scanning across a ton more data, and that's why row oriented doesn't provide the same efficiency as column oriented storage, and Apache Arrow is an in-memory column data framework. So that's where a lot of the advantages come from. >> Okay. So you've basically described a traditional database, a row approach, but I've seen a lot of traditional databases say, okay, now we can handle column format, versus what you're talking about, which is really kind of native. Is the former not as effective because it's largely a bolt-on? Can you elucidate on that front? >> Yeah, it's not as effective because you have more expensive compression and because you can't scan across the values as quickly. And so those are pretty much the main reasons why row oriented storage isn't as efficient as column oriented storage. >> Yeah. Got it. So let's talk about Arrow DataFusion. What is DataFusion? I know it's written in Rust, but what does it bring to the table here? >> Sure. So it's an extensible query execution framework and it uses Arrow as its in-memory format. So the way that it helps InfluxDB IOx is that, okay, it's great if you can write an unlimited amount of cardinality into InfluxDB, but if you don't have a query engine that can successfully query that data, then I don't know how much value it is for you. So DataFusion helps enable the querying and transformation of that data. It also has a Pandas API so that you could take advantage of Pandas data frames as well and all of the machine learning tools associated with Pandas. >> Okay. You're also leveraging Parquet in the platform, of course. We heard a lot about Parquet in the middle of the last decade as a storage format to improve on Hadoop column stores. What are you doing with Parquet and why is it important? >> Sure. So Parquet is the column oriented durable file format. So it's important because it'll enable bulk import and bulk export. It has compatibility with Python and Pandas, so it supports a broader ecosystem. Parquet files also take very little disk space and they're faster to scan because, again, they're column oriented; in particular, I think Parquet files are like 16 times cheaper than CSV files, just as kind of a point of reference. And so that's essentially a lot of the benefits of Parquet. >> Got it. Very popular. So with these, what exactly is Influx Data focusing on as a committer to these projects? What is your focus? What's the value that you're bringing to the community? >> Sure. So Influx Data first has contributed a lot of different things to the Apache ecosystem. For example, they contributed an implementation of Apache Arrow in Go, and that will support querying Influx. Also, there have been quite a few contributions to DataFusion for things like memory optimization and support for additional SQL features, like support for timestamp arithmetic, support for EXISTS clauses, and support for memory control. So yeah, Influx has contributed a lot to the Apache ecosystem and continues to do so. And I think kind of the idea here is that if you can improve these upstream projects, then the long term strategy here is that the more you contribute and build those up, the more you will perpetuate that cycle of improvement and the more we will invest in our own project as well. So it's just that kind of symbiotic relationship and appreciation of the open source community. >> Yeah. Got it. You've got that virtuous cycle going; people call it the flywheel. 
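Since Parquet's Python and Pandas compatibility comes up here, a brief sketch of the round trip, again assuming pyarrow purely for illustration and using a hypothetical file name. The exact size advantage over CSV depends on the data, so the figure quoted above should be read as a rough point of reference.

```python
# Hedged sketch of a Parquet round trip, assuming pyarrow and pandas.
# The file name is hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "time":        [1, 2, 3, 4],
    "location":    ["room", "room", "stove", "stove"],
    "temperature": [21.0, 21.0, 180.5, 181.0],
})

# Columnar and compressed on disk -- the source of the "far smaller than CSV"
# observation, though the ratio varies with the data.
pq.write_table(table, "temperatures.parquet", compression="zstd")

# Column projection: read back only the column a query needs
temps_only = pq.read_table("temperatures.parquet", columns=["temperature"])

# And the Pandas interop mentioned in the conversation
df = temps_only.to_pandas()
print(df.describe())
```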
Give us your last thoughts and kind of summarize what the big takeaways are from your perspective. >> So I think the big takeaway is that Influx Data is doing a lot of really exciting things with InfluxDB IOx, and if you are interested in learning more about the technologies that Influx is leveraging to produce IOx, the challenges associated with it, and all of the hard work behind it, or you just want to learn more, then I would encourage you to go to the monthly tech talks and community office hours, which are on every second Wednesday of the month at 8:30 AM Pacific time. There are also community forums and a community Slack channel. Look for the InfluxDB underscore IOx channel specifically to learn more about how to join those office hours and those monthly tech talks, as well as ask any questions you have about IOx, what to expect, and what you'd like to learn more about. As a developer advocate, I want to answer your questions. So if there's a particular technology or stack that you want to dive deeper into and want more explanation about how InfluxDB leverages it to build IOx, I will be really excited to produce content on that topic for you. >> Yeah, that's awesome. You guys have a really rich community, collaborate with your peers, solve problems, and you guys are super responsive, so really appreciate that. All right, thank you so much, Anais, for explaining all this open source stuff to the audience and why it's important to the future of data. >> Thank you. I really appreciate it. >> All right, you're very welcome. Okay, stay right there and in a moment I'll be back with Tim Yocum. He's the director of engineering for Influx Data, and we're going to talk about how you update a SaaS engine while the plane is flying at 30,000 feet. You don't want to miss this. (upbeat music)
SUMMARY :
and bring the world of data It's a pleasure to be here. it's going to give you and some of the most impressive ones to me and you got big guns and dangling pointers are the main classes Yeah, and the more I and the temperature of the store. is it not as effective as the former not and because you can't scan to to the table here? So the way that it helps Par-K in the platform course. and they're faster to scan So and these, what exactly is Influx data and appreciation of the and kind of summarize, of the hard work questions and you guys super responsive, I really appreciate it. and we're going to talk about
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Tim Yoakam | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Anais | PERSON | 0.99+ |
two rows | QUANTITY | 0.99+ |
16 times | QUANTITY | 0.99+ |
Influx Data | ORGANIZATION | 0.99+ |
each row | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
Rust | TITLE | 0.99+ |
C++ | TITLE | 0.99+ |
SQL | TITLE | 0.99+ |
Anais Dotis Georgiou | PERSON | 0.99+ |
InfluxDB | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
Rust Foundation | ORGANIZATION | 0.99+ |
30,000 feet | QUANTITY | 0.99+ |
first one | QUANTITY | 0.99+ |
Mozilla | ORGANIZATION | 0.99+ |
Pandas | TITLE | 0.98+ |
InfluxData | ORGANIZATION | 0.98+ |
Influx | ORGANIZATION | 0.98+ |
IOx | TITLE | 0.98+ |
each column | QUANTITY | 0.97+ |
one time stamp | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
Influx | TITLE | 0.96+ |
Anais Dotis-Georgiou | PERSON | 0.95+ |
Crates IO | TITLE | 0.94+ |
IOx | ORGANIZATION | 0.94+ |
two temperature values | QUANTITY | 0.93+ |
Apache | ORGANIZATION | 0.93+ |
today | DATE | 0.93+ |
8:30 AM Pacific time | DATE | 0.92+ |
Wednesday | DATE | 0.91+ |
one temperature | QUANTITY | 0.91+ |
two temperature values | QUANTITY | 0.91+ |
InfluxDB IOx | TITLE | 0.9+ |
influx | ORGANIZATION | 0.89+ |
last decade | DATE | 0.88+ |
single row | QUANTITY | 0.83+ |
a ton more data | QUANTITY | 0.81+ |
thousand | QUANTITY | 0.8+ |
dozens of other features | QUANTITY | 0.8+ |
a thousand different points | QUANTITY | 0.79+ |
Hadoop | TITLE | 0.77+ |
Par-K | TITLE | 0.76+ |
points | QUANTITY | 0.75+ |
each | QUANTITY | 0.75+ |
Slack | TITLE | 0.74+ |
Evolving InfluxDB | TITLE | 0.68+ |
kilometer | QUANTITY | 0.67+ |
Arrow | TITLE | 0.62+ |
The Cube | ORGANIZATION | 0.61+ |
Brian Gilmore, InfluxData
(soft upbeat music) >> Okay, we're kicking things off with Brian Gilmore. He's the director of IoT and emerging technology at InfluxData. Brian, welcome to the program. Thanks for coming on. >> Thanks, Dave, great to be here. I appreciate the time. >> Hey, explain why InfluxDB, you know, needs a new engine. Was there something wrong with the current engine? What's going on there? >> No, no, not at all. I mean, I think, for us it's been about staying ahead of the market. I think, you know, if we think about what our customers are coming to us sort of with now, you know, related to requests like SQL query support, things like that, we have to figure out a way to execute those for them in a way that will scale long term. And then we also want to make sure we're innovating, we're sort of staying ahead of the market as well, and sort of anticipating those future needs. So, you know, this is really a transparent change for our customers. I mean, I think we'll be adding new capabilities over time that sort of leverage this new engine. But, you know, initially, the customers who are using us are going to see just great improvements in performance, you know, especially those that are working at the top end of the workload scale, you know, the massive data volumes and things like that. >> Yeah, and we're going to get into that today and the architecture and the like. But what was the catalyst for the enhancements? I mean, when and how did this all come about? >> Well, I mean, like three years ago, we were primarily on premises, right? I mean, I think we had our open source, we had an enterprise product. And sort of shifting that technology, especially the open source code base, to a service basis where we were hosting it through, you know, multiple cloud providers. That was a long journey. (chuckles) I guess, you know, phase one was, we wanted to host enterprise for our customers, so we sort of created a service that we just managed and ran our enterprise product for them. You know, phase two of this cloud effort was to optimize for like multi-tenant, multi-cloud, be able to host it in a truly like SaaS manner where we could use, you know, some type of customer activity or consumption as the pricing vector. And that was sort of the birth of the real first InfluxDB cloud, you know, which has been really successful. We've seen, I think, like 60,000 people sign up. And we've got tons and tons of both enterprises as well as like new companies, developers, and of course a lot of home hobbyists and enthusiasts who are using it on a daily basis. And having that sort of big pool of very diverse and varied customers to chat with as they're using the product, as they're giving us feedback, et cetera, has, you know, pointed us in a really good direction in terms of making sure we're continuously improving that, and then also making these big leaps as we're doing with this new engine. >> All right, so you've called it a transparent change for customers, so I'm presuming it's non-disruptive, but I really want to understand how much of a pivot this is, and what does it take to make that shift from, you know, time series specialist to real time analytics and being able to support both? >> Yeah, I mean, it's much more of an evolution, I think, than like a shift or a pivot. Time series data is always going to be fundamental in sort of the basis of the solutions that we offer our customers, and then also the ones that they're building on the sort of raw APIs of our platform themselves. 
The time series market is one that we've worked diligently to lead. I mean, I think when it comes to like metrics, especially like sensor data and app and infrastructure metrics. If we're being honest though, I think our user base is well aware that the way we were architected was much more towards those sort of like backwards-looking historical type analytics, which are key for troubleshooting and making sure you don't, you know, run into the same problem twice. But, you know, we had to ask ourselves like, what can we do to like better handle those queries from a performance and a time to response on the queries, and can we get that to the point where the result sets are coming back so quickly from the time of query that we can like, limit that window down to minutes and then seconds? And now with this new engine, we're really starting to talk about a query window that could be like returning results in, you know, milliseconds of time since it hit the ingest queue. And that's really getting to the point where, as your data is available, you can use it and you can query it, you can visualize it, you can do all those sort of magical things with it. And I think getting all of that to a place where we're saying like, yes to the customer on, you know, all of the real time queries, the multiple language query support. But, you know, it was hard, but we're now at a spot where we can start introducing that to, you know, a limited number of customers, strategic customers and strategic availability zones to start, but, you know, everybody over time. >> So you're basically going from what happened to, and you can still do that, obviously, but to what's happening now in the moment? >> Yeah. Yeah. I mean, if you think about time, it's always sort of past, right? I mean, like in the moment right now, whether you're talking about like a millisecond ago or a minute ago, you know, that's pretty much right now, I think for most people, especially in these use cases where you have other sort of components of latency induced by the underlying data collection, the architecture, the infrastructure, the devices, and you know, the sort of highly distributed nature of all of this. So, yeah, I mean, getting a customer or a user to be able to use the data as soon as it is available is what we're after here. >> I always thought of real time as before you lose the customer, but now in this context, maybe it's before the machine blows up. >> Yeah, I mean, it is operationally, or operational real time is different. And that's one of the things that really triggered us to know that we were heading in the right direction is just how many sort of operational customers we have, you know, everything from like aerospace and defense. We've got companies monitoring satellites. We've got tons of industrial users using us as a process historian on the plant floor. And if we can satisfy their sort of demands for like real time historical perspective, that's awesome. I think what we're going to do here is we're going to start to like edge into the real time that they're used to in terms of, you know, the millisecond response times that they expect of their control systems, certainly not their historians and databases. >> Are these innovations available to InfluxDB Cloud customers only? Who can access this capability? >> Yeah, I mean, commercially and today, yes. I think we want to emphasize that for now; our goal is to get our latest and greatest and our best to everybody over time, of course. 
You know, one of the things we had to do here was like we doubled down on sort of our commitment to open source and availability. So, like, anybody today can take a look at the libraries on our GitHub and can inspect it and even can try to implement or execute some of it themselves in their own infrastructure. We are committed to bringing our sort of latest and greatest to our cloud customers first for a couple of reasons. Number one, you know, there are big workloads and they have high expectations of us. I think number two, it also gives us the opportunity to monitor a little bit more closely how it's working, how they're using it, like how the system itself is performing. And so just, you know, being careful, maybe a little cautious in terms of how big we go with this right away. It just sort of both limits, you know, the risk of any issues that can come with new software roll outs, we haven't seen anything so far, but also it does give us the opportunity to have like meaningful conversations with a small group of users who are using the products. But once we get through that and they give us two thumbs up on it, it'll be like, open the gates and let everybody in. It's going to be an exciting time for the whole ecosystem. >> Yeah, that makes a lot of sense. And you can do some experimentation, you know, using the cloud resources. Let's dig into some of the architectural and technical innovations that are going to help deliver on this vision. What should we know there? >> Well, I mean, I think, foundationally, we built the new core on Rust. This is a new, very sort of popular systems language. It's extremely efficient, but it's also built for speed and memory safety, which goes back to us being able to like deliver it in a way that is, you know, something we can inspect very closely, but then also rely on the fact that it's going to behave well, even if it does find error conditions. I mean, we've loved working with Go, and a lot of our libraries will continue to be sort of implemented in Go, but when it came to this particular new engine, that power, performance and stability of Rust was critical. On top of that, like, we've also integrated Apache Arrow and Apache Parquet for persistence. I think, for anybody who's really familiar with the nuts and bolts of our backend and our TSI and our time series merge trees, this is a big break from that. You know, Arrow on the sort of in-memory side and then Parquet on the on-disk side. It allows us to present, you know, a unified set of APIs for those really fast real time queries that we talked about, as well as for very large, you know, historical sort of bulk data archives in that Parquet format, which is also cool because there's an entire ecosystem sort of popping up around Parquet in terms of the machine learning community. And getting that all to work, we had to glue it together with Arrow Flight. That's sort of what we're using as our RPC component. It handles the orchestration and the transportation of the columnar data, now that we're moving to like a true columnar database model for this version of the engine. You know, and it removes a lot of overhead for us in terms of having to manage all that serialization and deserialization, and, you know, to that point again, blurring that line between real time and historical data, it's highly optimized for both streaming micro batch and then batches, but true streaming as well. >> Yeah, again, I mean, it's funny. You mentioned Rust. 
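Arrow Flight is the piece that moves those columnar batches over RPC. As a rough illustration only, a generic pyarrow Flight client looks something like the sketch below; the endpoint address and the ticket contents are hypothetical and are not a documented InfluxDB API.

```python
# Generic Arrow Flight client sketch, assuming the pyarrow library.
# The endpoint and ticket contents are hypothetical -- this shows the
# general Flight request pattern, not InfluxDB's actual interface.
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:8082")

# A ticket identifies the data stream the server should return;
# what goes inside it is entirely server-defined.
ticket = flight.Ticket(b"SELECT temperature FROM sensors LIMIT 100")

reader = client.do_get(ticket)
table = reader.read_all()   # record batches arrive already in Arrow's columnar form
print(table.schema)
```

The design point Brian is getting at is that what comes back is already Arrow record batches, so there is no per-row serialization for the client to undo.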
It's been around for a long time, but its popularity is, you know, really starting to hit that steep part of the S-curve. And we're going to dig into more of that, but give us, is there anything else that we should know about, Brian? Give us the last word. >> Well, I mean, I think first, I'd like everybody sort of watching, just to like, take a look at what we're offering in terms of early access and beta programs. I mean, if you want to participate or if you want to work sort of in terms of early access with the new engine, please reach out to the team. I'm sure, you know, there's a lot of communications going out and it'll be highly featured on our website. But reach out to the team. Believe it or not, like we have a lot more going on than just the new engine. And so there are also other programs, things we're offering to customers in terms of the user interface, data collection and things like that. And, you know, if you're a customer of ours and you have a sales team, a commercial team that you work with, you can reach out to them and see what you can get access to, because we can flip a lot of stuff on, especially in cloud through feature flags. But if there's something new that you want to try out, we'd just love to hear from you. And then, you know, our goal would be that, as we give you access to all of these new cool features, you know, you would give us continuous feedback on these products and services, not only like what you need today, but then what you'll need tomorrow to sort of build the next versions of your business. Because, you know, the whole database, the ecosystem as it expands out into this vertically-oriented stack of cloud services, and enterprise databases, and edge databases, you know, it's going to be what we all make it together, not just those of us who are employed by InfluxData. And then finally, I would just say, please, like, watch Anais' and Tim's sessions. Like, these are two of our best and brightest. They're totally brilliant, completely pragmatic, and they are most of all customer-obsessed, which is amazing. And there's no better takes, like honestly, on the sort of technical details of this than theirs, especially when it comes to the value that these investments will bring to our customers and our communities. So, I encourage you to, you know, pay more attention to them than you did to me, for sure. >> Brian Gilmore, great stuff. Really appreciate your time. Thank you. >> Yeah, thanks David, it was awesome. Looking forward to it. >> Yeah, me too. I'm looking forward to seeing how the community actually applies these new innovations and goes beyond just the historical into the real time. Really hot area. As Brian said, in a moment, I'll be right back with Anais Dotis-Georgiou to dig into the critical aspects of key open source components of the InfluxDB engine, including Rust, Arrow, Parquet, and DataFusion. Keep it right there. You don't want to miss this. (soft upbeat music)
SUMMARY :
He's the director of IoT, I appreciate the time. you know, needs a new engine. sort of with now, you know, and the architecture and the like. I guess, you know, phase one was, that the way we were architected the devices, and you know, in terms of, you know, the And so just, you know, being careful, experimentation and, you know, in a way that is, you know, but it's popularity is, you know, And then, you know, our goal would be, Really appreciate your time. Looking forward to it. and goes beyond just the
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
David | PERSON | 0.99+ |
Brian Gilmore | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Tim | PERSON | 0.99+ |
60,000 people | QUANTITY | 0.99+ |
InfluxData | ORGANIZATION | 0.99+ |
two | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
three years ago | DATE | 0.99+ |
twice | QUANTITY | 0.99+ |
Parquet | TITLE | 0.99+ |
both | QUANTITY | 0.98+ |
Anais' | PERSON | 0.98+ |
first | QUANTITY | 0.98+ |
tomorrow | DATE | 0.98+ |
Rust | TITLE | 0.98+ |
one | QUANTITY | 0.98+ |
a minute ago | DATE | 0.95+ |
two thumbs | QUANTITY | 0.95+ |
Arrow | TITLE | 0.94+ |
Anais Dotis-Georgiou | PERSON | 0.92+ |
tons | QUANTITY | 0.9+ |
InfluxDB | TITLE | 0.85+ |
Bri | PERSON | 0.82+ |
Apache | ORGANIZATION | 0.82+ |
InfluxDB | ORGANIZATION | 0.8+ |
GitHub | ORGANIZATION | 0.78+ |
phase one | QUANTITY | 0.73+ |
both enterprises | QUANTITY | 0.69+ |
SAS | ORGANIZATION | 0.68+ |
phase two | QUANTITY | 0.67+ |
Go | TITLE | 0.65+ |
Gilmore | PERSON | 0.63+ |
millisecond ago | DATE | 0.62+ |
Arrow | ORGANIZATION | 0.59+ |
Flight | ORGANIZATION | 0.52+ |
Data Fusion | TITLE | 0.46+ |
Go | ORGANIZATION | 0.41+ |
Cracking the Code: Lessons Learned from How Enterprise Buyers Evaluate New Startups
(bright music) >> Welcome back to the CUBE presents the AWS Startup Showcase The Next Big Thing in cloud startups with AI security and life science tracks, 15 hottest growing startups are presented. And we had a great opening keynote with luminaries in the industry. And now our closing keynote is to get a deeper dive on cracking the code in the enterprise, how startups are changing the game and helping companies change. And they're also changing the game of open source. We have a great guest, Katie Drucker, Head of Business Development, Madrona Venture Group. Katie, thank you for coming on the CUBE for this special closing keynote. >> Thank you for having me, I appreciate it. >> So one of the topics we talked about with Soma from Madrona on the opening keynote, as well as Ali from Databricks is how startups are seeing success faster. So that's the theme of the Cloud speed, agility, but the game has changed in the enterprise. And I want to really discuss with you how growth changes and growth strategy specifically. They talk, go to market. We hear things like good sales to enterprise sales, organic, freemium, there's all kinds of different approaches, but at the end of the day, the most successful companies, the ones that might not be known that just come out of nowhere. So the economics are changing and the buyers are thinking differently. So let's explore that topic. So take us through your view 'cause you have a lot of experience. But first talk about your role at Madrona, what you do. >> Absolutely all great points. So my role at Madrona, I think I have personally one of the more enviable jobs and that my job is to... I get the privilege of working with all of these fantastic entrepreneurs in our portfolio and doing whatever we can as a firm to harness resources, knowledge, expertise, connections, to accelerate their growth. So my role in setting up business development is taking a look at all of those tools in the tool chest and partnering with the portfolio to make it so. And in our portfolio, we have a wide range of companies, some rely on enterprise sales, some have other go to markets. Some are direct to consumer, a wide range. >> Talk about the growth strategies that you see evolving because what's clear with the pandemic. And as we come out of it is that there are growth plays happening that don't look a little bit differently, more obvious now because of the Cloud scale, we're seeing companies like Databricks, like Snowflake, like other companies that have been built on the cloud or standalone. What are some of the new growth techniques, or I don't want to say growth hacking, that is a pejorative term, but like just a way for companies to quickly describe their value to an enterprise buyer who's moving away from the old RFP days of vendor selection. The game has changed. So take us through how you see secret key and unlocking that new equation of how to present value to an enterprise and how you see enterprises evaluating startups. >> Yes, absolutely. Well, and that's got a question, that's got a few components nestled in what I think are some bigger trends going on. AWS of course brought us the Cloud first. I think now the Cloud is more and more a utility. And so it's incumbent upon thinking about how an enterprise 'cause using the Cloud is going to go up the value stack and partner with its cloud provider and other service providers. 
I think also with that agility of operations, you have thinning, if you will, the systems of record and a lot of new entrance into this space that are saying things like, how can we harness AIML and other emerging trends to provide more value directly around work streams that were historically locked into those systems of record? And then I think you also have some price plans that are far more flexible around usage based as opposed to just flat subscription or even these big clunky annual or multi-year RFP type stuff. So all of those trends are really designed in ways that favor the emerging startup. And I think if done well, and in partnership with those underlying cloud providers, there can be some amazing benefits that the enterprise realizes an opportunity for those startups to grow. And I think that's what you're seeing. I think there's also this emergence of a buyer that's different than the CIO or the site the CISO. You have things with low code, no code. You've got other buyers in the organization, other line of business executives that are coming to the table, making software purchase decisions. And then you also have empowered developers that are these citizen builders and developer buyers and personas that really matter. So lots of inroads in places for a startup to reach in the enterprise to make a connection and to bring value. That's a great insight. I want to ask that just if you don't mind follow up on that, you mentioned personas. And what we're seeing is the shift happens. There's new roles that are emerging and new things that are being reconfigured or refactored if you will, whether it's human resources or AI, and you mentioned ML playing a role in automation. These are big parts of the new value proposition. How should companies posture to the customer? Because I don't want to say pivot 'cause that means it's not working but mostly extending our iterating around their positioning because as new things have not yet been realized, it might not be operationalized in a company or maybe new things need to be operationalized, it's a new solution for that. Positioning the value is super important and a lot of companies often struggle with that, but also if they get it right, that's the key. What's your feeling on startups in their positioning? So people will dismiss it like, "Oh, that's marketing." But maybe that's important. What's your thoughts on the great positioning question? >> I've been in this industry a long time. And I think there are some things that are just tried and true, and it is not unique to tech, which is, look, you have to tell a story and you have to reach the customer and you have to speak to the customer's need. And what that means is, AWS is a great example. They're famous for the whole concept of working back from the customer and thinking about what that customer's need is. I think any startup that is looking to partner or work alongside of AWS really has to embody that very, very customer centric way of thinking about things, even though, as we just talked about those personas are changing who that customer really is in the enterprise. And then speaking to that value proposition and meeting that customer and creating a dialogue with them that really helps to understand not only what their pain points are, but how you were offering solves those pain points. And sometimes the customer doesn't realize that that is their pain point and that's part of the education and part of the way in which you engage that dialogue. 
That doesn't change a lot, just generation to generation. I think the modality of how we have that dialogue, the methods in which we choose to convey that change, but that basic discussion is what makes us human. >> What's your... Great, great, great insight. I want to ask you on the value proposition question again, the question I often get, and it's hard to answer is am I competing on value or am I competing on commodity? And depending on where you're in the stack, there could be different things like, for example, land is getting faster, smaller, cheaper, as an example on Amazon. That's driving down to low cost high value, but it shifts up the stack. You start to see in companies this changing the criteria for how to evaluate. So an enterprise might be struggling. And I often hear enterprises say, "I don't know how to pick who I need. I buy tools, I don't buy many platforms." So they're constantly trying to look for that answer key, if you will, what's your thoughts on the changing requirements of an enterprise? And how to do vendor selection. >> Yeah, so obviously I don't think there's a single magic bullet. I always liked just philosophically to think about, I think it's always easier and frankly more exciting as a buyer to want to buy stuff that's going to help me make more revenue and build and grow as opposed to do things that save me money. And just in a binary way, I like to think which side of the fence are you sitting on as a product offering? And the best ways that you can articulate that, what opportunities are you unlocking for your customer? The problems that you're solving, what kind of growth and what impact is that going to lead to, even if you're one or two removed from that? And again, that's not a new concept. And I think that the companies that have that squarely in mind when they think about their go-to market strategy, when they think about the dialogue they're having, when they think about the problems that they're solving, find a much faster path. And I think that also speaks to why we're seeing so many explosion in the line of business, SAS apps that are out there. Again, that thinning of the systems of record, really thinking about what are the scenarios and work streams that we can have happened that are going to help with that revenue growth and unlocking those opportunities. >> What's the common startup challenge that you see when they're trying to do business development? Usually they build the product first, product led value, you hear that a lot. And then they go, "Okay, we're ready to sell, hire a sales guy." That seems to be shifting away because of the go to markets are changing. What are some of the challenges that startups have? What are some that you're seeing? >> Well, and I think the point that you're making about the changes are really almost a result of the trends that we're talking about. The sales organization itself is becoming... These work streams are becoming instrumented. Data is being collected, insights are being derived off of those things. So you see companies like Clary or Highspot or two examples or tutorial that are in our portfolio that are looking at that action and making the art of sales and marketing far more sophisticated overall, which then leads to the different growth hacking and the different insights that are driven. I think the common mistakes that I see across the board, especially with earlier stage startups, look you got to find product market fit. I think that's always... 
You start with a thesis or a belief and a passion that you're building something that you think the market needs. And it's a lot of dialogue you have to have to make sure that you do find that. I think once you find that another common problem that I see is leading with an explanation of technology. And again, not focusing on the buyer or the... Sorry, the buyer about solving a problem and focusing on that problem as opposed to focusing on how cool your technology is. Those are basic and really, really simple. And then I think setting a set of expectations, especially as it comes to business development and partnering with companies like AWS. The researching that you need to adequately meet the demand that can be turned on. And then I'm sure you heard about from Databricks, from an organization like AWS, you have to be pragmatic. >> Yeah, Databricks gone from zero a software sales a few years ago to over a billion. Now it looks like a Snowflake which came out of nowhere and they had a great product, but built on Amazon, they became the data cloud on top of Amazon. And now they're growing just whole new business models and new business development techniques. Katie, thank you for sharing your insight here. The CUBE's closing keynote. Thanks for coming on. >> Appreciate it, thank you. >> Okay, Katie Drucker, Head of Business Development at Madrona Venture Group. Premier VC in the Seattle area and beyond they're doing a lot of cloud action. And of course they know AWS very well and investing in the ecosystem. So great, great stuff there. Next up is Peter Wagner partner at Wing.VX. Love this URL first of all 'cause of the VC domain extension. But Peter is a long time venture capitalist. I've been following his career. He goes back to the old networking days, back when the internet was being connected during the OSI days, when the TCP IP open systems interconnect was really happening and created so much. Well, Peter, great to see you on the CUBE here and congratulations with success at Wing VC. >> Yeah, thanks, John. It's great to be here. I really appreciate you having me. >> Reason why I wanted to have you come on. First of all, you had a great track record in investing over many decades. You've seen many waves of innovation, startups. You've seen all the stories. You've seen the movie a few times, as I say. But now more than ever, enterprise wise it's probably the hottest I've ever seen. And you've got a confluence of many things on the stack. You were also an early seed investor in Snowflake, well-regarded as a huge success. So you've got your eye on some of these awesome deals. Got a great partner over there has got a network experience as well. What is the big aha moment here for the industry? Because it's not your classic enterprise startups anymore. They have multiple things going on and some of the winners are not even known. They come out of nowhere and they connect to enterprise and get the lucrative positions and can create a moat and value. Like out of nowhere, it's not the old way of like going to the airport and doing an RFP and going through the stringent requirements, and then you're in, you get to win the lucrative contract and you're in. Not anymore, that seems to have changed. What's your take on this 'cause people are trying to crack the code here and sometimes you don't have to be well-known. >> Yeah, well, thank goodness the game has changed 'cause that old thing was (indistinct) So I for one don't miss it. 
There was some modernization movement in the enterprise and the modern enterprise is built on data powered by AI infrastructure. That's an agile workplace. All three of those things are really transformational. There's big investments being made by enterprises, a lot of receptivity and openness to technology to enable all those agendas, and that translates to good prospects for startups. So I think as far as my career goes, I've never seen a more positive or fertile ground for startups in terms of penetrating enterprise, it doesn't mean it's easy to do, but you have a receptive audience on the other side and that hasn't necessarily always been the case. >> Yeah, I got to ask you, I know that you're a big sailor and your family and Franks Lubens also has a boat and sailing metaphor is always good to have 'cause you got to have a race that's being run and they have tactics. And this game that we're in now, you see the successes, there's investment thesises, and then there's also actually bets. And I want to get your thoughts on this because a lot of enterprises are trying to figure out how to evaluate startups and starts also can make the wrong bet. They could sail to the wrong continent and be in the wrong spot. So how do you pick the winners and how should enterprises understand how to pick winners too? >> Yeah, well, one of the real important things right now that enterprise is facing startups are learning how to do and so learning how to leverage product led growth dynamics in selling to the enterprise. And so product led growth has certainly always been important consumer facing companies. And then there's a few enterprise facing companies, early ones that cracked the code, as you said. And some of these examples are so old, if you think about, like the ones that people will want to talk about them and talk about Classy and want to talk about Twilio and these were of course are iconic companies that showed the way for others. But even before that, folks like Solar Winds, they'd go to market model, clearly product red, bottom stuff. Back then we didn't even have those words to talk about it. And then some of the examples are so enormous if think about them like the one right in front of your face, like AWS. (laughing) Pretty good PLG, (indistinct) but it targeted builders, it targeted developers and flipped over the way you think about enterprise infrastructure, as a result some how every company, even if they're harnessing relatively conventional sales and marketing motion, and you think about product led growth as a way to kick that motion off. And so it's not really an either word even more We might think OPLJ, that means there's no sales keep one company not true, but here's a way to set the table so that you can very efficiently use your sales and marketing resources, only have the most attractive targets and ones that are really (indistinct) >> I love the product led growth. I got to ask you because in the networking days, I remember the term inevitability was used being nested in a solution that they're just going to Cisco off router and a firewall is one you can unplug and replace with another vendor. Cisco you'd have to go through no switching costs were huge. So when you get it to the Cloud, how do you see the competitiveness? Because we were riffing on this with Ali, from Databricks where the lock-in might be value. The more value provider is the lock-in. Is their nestedness? Is their intimate ability as a competitive advantage for some of these starts? 
How do you look at that? Because startups, they're using open source. They want to have a land position in an enterprise, but how do they create that sustainable competitive advantage going forward? Because again, this is what you do. You bet on ones that you can see that could establish a model whatever we want to call it, but a competitive advantage and ongoing nested position. >> Sometimes it has to do with data, John, and so you mentioned Snowflake a couple of times here, a big part of Snowflake's strategy is what they now call the data cloud. And one of the reasons you go there is not to just be able to process data, to actually get access to it, exchange with the partners. And then that of course is a great reason for the customers to come to the Snowflake platform. And so the more data it gets more customers, it gets more data, the whole thing start spinning in the right direction. That's a really big example, but all of these startups that are using ML in a fundamental way, applying it in a novel way, the data modes are really important. So getting to the right data sources and training on it, and then putting it to work so that you can see that in this process better and doing this earlier on that scale. That's a big part of success. Another company that I work with is a good example that I call (indistinct) which works in sales technology space, really crushing it in terms of building better sales organizations both at performance level, in terms of the intelligence level, and just overall revenue attainment using ML, and using novel data sources, like the previously lost data or phone calls or Zoom calls as you already know. So I think the data advantages are really big. And smart startups are thinking through it early. >> It's interest-- >> And they're planning by the way, not to ramble on too much, but they're betting that PLG strategy. So their land option is designed not just to be an interesting way to gain usage, but it's also a way to gain access to data that then enables the expand in a component. >> That is a huge call-out point there, I was going to ask another question, but I think that is the key I see. It's a new go to market in a way. product led with that kind of approach gets you a beachhead and you get a little position, you get some data that is a cloud model, it means variable, whatever you want to call it variable value proposition, value proof, or whatever, getting that data and reiterating it. So it brings up the whole philosophical question of okay, product led growth, I love that with product led growth of data, I get that. Remember the old platform versus a tool? That's the way buyers used to think. How has that changed? 'Cause now almost, this conversation throws out the whole platform thing, but isn't like a platform. >> It looks like it's all. (laughs) you can if it is a platform, though to do that you can reveal that later, but you're looking for adoption, so if it's down stock product, you're looking for adoption by like developers or DevOps people or SOEs, and they're trying to solve a problem, and they want rapid gratification. So they don't want to have an architectural boomimg, placed in front of them. And if it's up stock product and application, then it's a user or the business or whatever that is, is adopting the application. And again, they're trying to solve a very specific problem. You need instant and immediate obvious time and value. 
And now you have a ticket to the dance and build on that and maybe a platform strategy can gradually take shape. But you know who's not in this conversation is the CIO, it's like, "I'm always the last to know." >> That's the CISO though. And they got him there on the firing lines. CISOs are buying tools like it's nobody's business. They need everything. They'll buy anything or you go meet with sand, they'll buy it. >> And you make it sound so easy. (laughing) We do a lot of security investment if only (indistinct) (laughing) >> I'm a little bit over the top, but CISOs are under a lot of pressure. I would talk to the CISO at Capital One and he was saying that he's on Amazon, now he's going to another cloud, not as a hedge, but he doesn't want to focus development teams. So he's making human resource decisions as well. Again, back to what IT used to be back in the old days where you made a vendor decision, you built around it. So again, clouds play that way. I see that happening. But the question is that I think you nailed this whole idea of cross hairs on the target persona, because you got to know who you are and then go to the market. So if you know you're a problem solving and the lower in the stack, do it and get a beachhead. That's a strategy, you can do that. You can't try to be the platform and then solve a problem at the same time. So you got to be careful. Is that what you were getting at? >> Well, I think you just understand what you're trying to achieve in that line of notion. And how those dynamics work and you just can't drag it out. And they could make it too difficult. Another company I work with is a very strategic cloud data platform. It's a (indistinct) on systems. We're not trying to foist that vision though (laughs) or not adopters today. We're solving some thorny problems with them in the short term, rapid time to value operational needs in scale. And then yeah, once they found success with (indistinct) there's would be an opportunity to be increasing the platform, and an obstacle for those customers. But we're not talking about that. >> Well, Peter, I appreciate you taking the time and coming out of a board meeting, I know that you're super busy and I really appreciate you making time for us. I know you've got an impressive partner in (indistinct) who's a former Sequoia, but Redback Networks part of that company over the years, you guys are doing extremely well, even a unique investment thesis. I'd like you to put the plug in for the firm. I think you guys have a good approach. I like what you guys are doing. You're humble, you don't brag a lot, but you make a lot of great investments. So could you take them in to explain what your investment thesis is and then how that relates to how an enterprise is making their investment thesis? >> Yeah, yeah, for sure. Well, the concept that I described earlier that the modern enterprise movement as a workplace built on data powered by AI. That's what we're trying to work with founders to enable. And also we're investing in companies that build the products and services that enable that modern enterprise to exist. And we do it from very early stages, but with a longterm outlook. So we'll be leading series and series, rounds of investment but staying deeply involved, both operationally financially throughout the whole life cycle of the company. And then we've done that a bunch of times, our goal is always the big independent public company and they don't always make it but enough for them to have it all be worthwhile. 
An interesting special case of this, and by the way, I think it intersects with some of the startup showcase here, is in the life sciences. And I know you were highlighting a lot of healthcare startups and deals, and that's a vertical being disrupted by the tremendous impact of data, both new data availability and new ways to put it to use. I know several of my partners are very focused on that. They call it bio-X data. It's a transformation all on its own. >> That's awesome. And I think that the reason why we're focusing on these verticals is that if you have a cloud horizontal scale view and you're vertically specialized with machine learning, every vertical is impacted by data. It's so interesting. I think it's probably the best time to be a cloud startup right now. I really am bullish on it. So I appreciate you taking the time, Peter, to come in from your board meeting, popping out. Thanks for-- (indistinct) Go back in and approve those stock options for all the employees. Yeah, thanks for coming on. Appreciate it. >> All right, thank you John, it's a pleasure. >> Okay, Peter Wagner, premier VC, very humble. Wing.VC is a great firm, really respect them. They do a lot of great investments, Snowflake among them, and we have Dave Vellante back, who knows a lot about Snowflake and has been covering it like a blanket, and Sarbjeet Johal, cloud influencer, friend of the CUBE, cloud commentator with cloud experience; built clouds, runs clouds, now invests. So V. Dave, thanks for coming back on. You heard Peter Wagner at Wing VC. These guys have their roots in networking, which networking back in the day was the thing, V. Dave. You remember the internet Cisco days, remember Cisco, Wellfleet routers. I think Peter invested in ArrowPoint, remember ArrowPoint, that was about in the 495 belt where you were. >> Lynch's company. >> That was Chris Lynch's company. I think, was he a sales guy there? (indistinct) >> That was his first big hit I think. >> All right, well guys, let's wrap this up. We've got a great program here. Sarbjeet, thank you for coming on. >> No worries. Glad to be here today. >> Hey, Sarbjeet. >> First of all, really appreciate the Twitter activity lately on the commentary; the observability piece on Jeremy Burton's launch, Dave, was phenomenal. But Peter was talking about this dynamic, and I think it ties this cracking the code thing together, which is there's a product led strategy that feels like a platform, but it's also a tool. In other words, it's not mutually exclusive, the old methods aren't thrown out the window. Land in an account, know what problem you're solving. If you're lower in the stack, nail it, get data and go from there. If you're a process improvement up the stack, you have a much more platform-like, longer-term sale, more business oriented, different motions, different mechanics. What do you think about that? What's your reaction? >> Yeah, I was thinking about this when I was listening to some of the startups pitching, if you will, or talking about what they bring to the table in this cloud scale or cloud era, if you will. There are tools, there are applications, and then there are big monolithic platforms, if you will. And then there's being part of the ecosystem. So I think the companies need to know where they play. A startup cannot be a platform from the get-go, I believe. Now many aspire to be, but they have to start with tooling.
I believe that, especially on the B2B side of things, one way is to then go into the applications, into the application area, if you will, with very precise use cases for certain verticals and stuff like that. And others go into the platform, which is a horizontal play, if you will, in technology. So I think they have to understand their age, like how old they are, how new they are, how small they are, because their size matters: when you are procuring as a big business, your technology vendor's size matters, the economic viability matters, and their proximity to other vendors matters as well. So I think we'll jump into that in other discussions later, but I think that's key, as you said. >> I would agree with that. I would phrase it in my mind somewhat differently from Sarbjeet, which is you have product led growth, and that's your early phase: you get product market fit, you get product led growth, and then you expand, and there are many, many examples of this. And that's when, as part of your team expansion strategy, you're going to get into the platform discussion. There's so many examples of that. You take a look at Ali Ghodsi today with what's happening at Databricks; Snowflake is another good example. They started with product led growth. And now they're like, "Okay, we've got to expand the team." Okta is another example, they just acquired Auth0. That's about building out the platform, versus more of a point product. And there's just many, many examples of that, but you cannot, to your point... It's very hard to start with a platform. Arm did it, but that was like a one in a million chance. >> It's just harder, especially if it's new and it's not operationalized yet. So one of the things, Dave, that we've observed in the Cloud is that some of the best known successes were barely known at all at first. Databricks, we've been covering from the beginning 'cause we were close to that movement when they came out of Berkeley. But they were still misunderstood, and they really only started generating serious revenue in the last year or so. So again, only a few years ago, zero software revenue, now they're approaching a billion dollars. So it's not easy to make these vendor selections anymore. And if you're new and you don't have someone to operate it, or there's no department, and the department's changing, that's another problem. These are all like enterprisey problems. What's your thoughts on that, Dave? >> Well, I think there's a big discussion right now, and we've been talking all day about how enterprises should think about startups. Most of these startups are software companies, and software is a very capital efficient business. At the same time, these companies are raising hundreds of millions, sometimes over a billion dollars, before they go to IPO. Why is that? A lot of it goes to promotion. I look at it as... And there's a big discussion going on about, well, maybe sales can be more efficient and more direct and so forth. I really think it comes down to the golden rule. Two things really matter in the early days of a startup: it's sales and engineering. And really, I should probably say engineering and sales, and start with engineering. And then you've got to figure out your go to market. Everything else is peripheral to those two, and if you don't get those two things right, you struggle. And I think that's what some of these successful startups are proving. >> Sarbjeet, what's your take on that point? >> Could you repeat the point again?
Sorry, I lost-- >> As cloud scale comes into this whole idea of competing, the roles are changing. So look at IoT, look at the Edge, for instance. You've got all kinds of new use cases that no one actually knows is a problem to solve. It's just pure opportunity. So no one's operational yet; I could have a product, but they don't know how to buy it yet. That's a problem. >> Yeah, I think the solutions have to be point solutions, and the startups need to focus on the practitioners, number one, not the big buyers, not the IT, if you will, but the line of business, and even within that sphere, just focus on the practitioners who are going to use that technology. I talked to, I think it wasn't Fiddler, no, it was CoreLogics. I think that story was great earlier today, in how they kind of struggled in the beginning; they were trying to do a big bang approach as a startup, but then they almost stumbled. And then they found their mojo, if you will. They went down the market, actually. That's very classic disruption theory, like what we study from Harvard Business School: you go down the market, go to the non-consumers, because you can't compete head to head with the big guys. Most of the big guys have a lot of features and functionality, especially at the platform level. And if you're trying to innovate in that space, you have to go to the practitioners and solve their core problems, and then learn and expand, kind of thing. So I think you have to focus on practitioners a lot more than the traditional Oracle-style buyers. >> Sarbjeet, we had a great thread last night on Twitter, on observability, that you started. And there's a couple of examples there. ChaosSearch is a relatively small company right now, though they just raised money. And they're part of this startup showcase. And they could've said, "Hey, we're going to go after Splunk." But they chose not to. They said, "Okay, let's kind of disrupt the ELK stack and simplify that." Another example is a company, Observe; you've mentioned Jeremy Burton's company, John. They're focused really on SaaS companies. They're not going after these complicated enterprise deals initially, because they've got to get it right or else they'll get churn, and churn is that silent killer of software companies. >> The other interesting company that was on the showcase was TetraScience. I don't know if you noticed that one in the life science track, and again, Peter Wagner pointed out the life sciences. That's an under-recognized-in-the-press vertical that's exploding. Certainly during the pandemic you saw it. TetraScience is an R&D cloud, Dave, an R&D data cloud. So pharmaceuticals, they need to do their research. So the pandemic has brought to life this notion of tapping into data resources, not just data lakes, but the real deal. >> Yeah, you and Natalie and I were talking about that this morning, and that's one of the opportunities for R&D: you have all these different data sources, and yeah, it's not just about the data lake. It's about the ecosystem that you're building around them. And it's really interesting to juxtapose what Databricks is doing and what Snowflake is doing. They've got different strategies, but they play a part there. You can see how ecosystems can build that system. It's not like one company is going to solve all these problems. It's really going to have to be connections across these various companies. And that's what the Cloud enables, and ecosystems have all this data flowing that can really drive new insights.
>> And I want to call your attention to a tweet, Sarbjeet, you wrote about Splunk's earnings; they're a data company as well. They've got Teresa Carlson there now from AWS as the president, working with Doug; that should change the game a little bit more. But there was a thread underneath there. Andy Thry replies to Dave, or Sarbjeet: if you're on AWS, they're a fine solution. The world doesn't just revolve around AWS, smiley face. Well, a lot of it does actually. So (laughing) nice point, Andy. But he brings up this thing, and Ali brought it up too: hybrid is now the new operating system, and now the Edge comes into it. So we've got Mobile World Congress happening this month in person. This whole Telco 5G thing brings up a whole other piece of the Cloud puzzle. Jeff Barr pointed it out in his keynote, Dave. Guys, I want to get your reaction. The Edge now is... I'm calling it the super Edge, because it's not just the Edge as we knew it before. You're going to have these pops, these points of presence, that are going to have Wavelength, or whatever Azure has, I think that's their solution. So you're going to have all this new cloud power for low latency applications: self-driving, delivery, VR, AR, gaming, telemetry data from Teslas, you name it, it's happening. This is huge, what's your thoughts? Sarbjeet, we'll start with you. >> Yeah, I think Edge is bound to happen, and for many reasons: the volume of data is increasing, and our use cases are also expanding, if you will, with the democratization of compute. The specialization of compute, actually; Dave wrote extensively about how Intel and other chip players are gearing up for that future, if you will. Most of the inference in the AI world will happen in the field, close to the workloads, if you will; that can be mobility, the self-driving car, that can be AR, VR. It can be healthcare. It can be gaming, you name it. Those are the few use cases which are at the forefront, and a lot more use cases will come into play, I believe. I've said this many times: Edge, I think, will be dominated by the hyperscalers, mainly because they're building their Metro data centers now. And with very low latency in the Metro areas where the population is, we're serving the people still, not the machines yet, or the empty areas where there is no population. So wherever the population is, all these big players are putting their data centers there. And I think they will dominate the Edge. And I know some Edge lovers. (indistinct) >> Edge huggers. >> Edge huggers, yeah. They don't like the hyperscalers story, but I think that's the way we're going. Why would we go backwards? >> I think you're right. First of all, I agree on the hyperscaler point; you look at the top three clouds right now, they're all in on the Edge, hardcore. It's a huge competitive battleground, Dave. And I think the missing piece, that's going to be uncovered at Mobile World Congress. Maybe they'll miss it this year, but it's the developer traction: whoever wins the developer market, or wins the loyalty, wins over the market and gets the adoption. The applications will drive the Edge. >> And I would add the fourth cloud is Alibaba. Alibaba is actually bigger than Google and they're crushing it as well. But I would say this: first of all, it's popular to say, "Oh, not everything's going to move into the Cloud, John, Dave, Sarbjeet." But the fact is that AWS, they're the trend setter. They are crushing it in terms of features.
And you look at what they're doing in the plumbing with Annapurna. Everybody's following suit. So you can't just ignore that, number one. Second thing is, what is the Edge? Well, the Edge is... Where's the logical place to process the data? That's what the Edge is. And I think to your point, both Sarbjeet and John, the Edge is going to be won by developers. It's going to be won by programmability, and it's going to be low cost and really super efficient. And most of the data is going to stay at the Edge. And so who is in the best position to actually create that? Is it going to be somebody who takes an x86 box, throws it over the fence, gives it a fancy name with Edge in it and says, "Here's our Edge box"? No, that's not what's going to win the Edge. And so I think, first of all, it's huge, it's wide open. And where's the innovation coming from? I agree with you, it's the hyperscalers. >> I think the developers, as John said, developers are the kingmakers. They build the solutions. And in that context, I always talk about the skills gravity: a lot of people are educated in certain technologies and they will keep using those technologies. Their proximity to that technology is huge and they don't want to learn something new. So as humans, we just tend to go with what we know how to use. So on that front, I usually talk about the consumption economics of cloud and Edge. It has to focus on the practitioners. And in this case, the practitioners are developers, because they're the ones cooking up those solutions right now. We're not serving them in huge quantity right now, but-- >> Well, let's unpack that, Sarbjeet, let's unpack that, 'cause I think you're right on the money on that. The consumption of the tech, and also the consumption of the application, the end use and end user. And I think the reason why hyperscalers will continue to dominate, besides the fact that they have all the resources and they're going to bring that to the Edge, is that the developers are going to be driving the applications at the Edge. So if you've got low latency at the Edge, that's going to open up new applications, not just the obvious ones I did mention: gaming, VR, AR, metaverse and other things that are obvious. There are going to be non-obvious things that are going to be huge, that are going to come from the developers. But the Cloud native aspect of the hyperscalers, to me, is where the scales are tipping. Let me explain. IT was built to supply resources to the businesses, who were writing business applications. Mostly driven by IBM and the mainframe in the old days, Dave, and then IT became IT. Telcos have been OT, closed: "This is our thing, that's it." Now they have to open up. And the Cloud native technologies are the fastest way to value. And I think that path, Sarbjeet, is going to be defined by this new developer and this new super Edge concept. So I think it's going to be wide open. I don't know what to say. I can't guess, but it's going to be creative. >> Let me ask you a question. You said years ago data's the new development kit. Does low code and no code, to Sarbjeet's point, change the equation? In other words, putting data in the hands of those OT professionals, those practitioners who have the context, does low-code and no-code enable more of those practitioners? I know it's a bromide, but the citizen developer. And what impact does that have? And who's in the best position? >> Well, I think that anything that reduces friction to getting stuff out there that can be automated will increase the value.
And then the question is... that's not even a debate. That's just a fact, and it's going to be on a massive rise. Then the issue comes down to who has the best asset: the software asset that's eating the world, or the tower and the physical infrastructure? So if the physical infrastructure, aka the Telcos, can't generate value fast enough, in my opinion, the private equity will come in and take it over, and then refactor that business model to take advantage of the over the top software model. That to me is the big staredown competition between the Telco world and this new cloud native world; whichever one fails to yield value is going to blink first, if you will. And I think the Cloud native side wins this one hands down, because the assets are valuable, but only if they enable the new model. If the old model tries to hang on to the old hog, the old model, the Edge hugger, as Sarbjeet says, they're just going to slowly milk that cow dry. So it's like, it's over. So to me, they have to move. And at this Mobile World Congress, we will see; we will be looking for that. >> Yeah, I think in the Mobile World Congress context, Telcos should partner with the hyperscalers very closely, like everybody else has. And they have to cave in. (laughs) I usually say that to them; like, IBM came in, tried to fight, and they caved in. Other second tier vendors tried to fight the big cloud vendors, the top three or four, and then they caved in: okay, we will serve our stuff through your cloud. And that's where all the buyers are congregating. They're going to buy stuff along with the skills gravity, the feature proximity. I've got another term I'll coin: it matters a lot when you're doing one thing and you want to do another thing. When you're doing all this transactional stuff and regular stuff, and now you want to do data science, where do you go? You go next to it, wherever you have been. Your skills are in that same bucket. And then also you don't have to write a new contract with a new vendor, you just go there. So in order to serve, and this is a lesson for startups as well, you need to prepare yourself for being in the Cloud marketplaces. You cannot go it alone independently to fight. >> Cloud marketplaces are going to replace procurement, for sure, we know that. And this brings up the point, Dave, we talked about years ago, remember, on the CUBE. We said there's going to be Tier two clouds. I used that word in quotes 'cause nothing... What does it even mean, Tier two? And we were talking about, like, Amazon versus Microsoft and Google. We said at the time, and Alibaba, but they're in China, put that aside for a second, but the big three, they're going to win it all. And they're all going to be successful in relative terms, but whoever can enable that second tier... And it ended up happening: Snowflake is that example, as is Databricks, as are others. So Google and Microsoft, as fast as they can replicate the success of AWS by enabling someone to build their business on their cloud in a way that allows the customer to refactor their business, will win. They will win most of the lion's share, in my opinion. So I think that applies to the Edge as well. So whoever can come in and say... Whichever cloud says, "I'm going to enable the next Snowflake, the next enterprise solution," I think takes it. >> Well, I think that it comes back... Every conversation comes back to the data.
And if you think about the prevailing way in which we treated data, with the exception of the true data-driven companies, in quotes, it's that we've shoved all the data into some single repository and tried to come up with a single version of the truth, and it's adjudicated by a centralized team with hyper-specialized roles. And then guess what? The line of business, there's no context for the business in that data architecture or data corpus, if you will. And then the time it takes to go from an idea for a data product or data service to monetization is way too long. And that's changing. And the winners are going to be the ones who are able to exploit this notion of leaving data where it is, the point about data gravity, or, coining a new term, I liked that, I think you said skills gravity, and then enabling the business lines to have access to their own data teams. That's exactly what Ali Ghodsi was saying this morning. And really having the ability to create their own data products without having to go bow down to an ivory tower. That is an emerging model. >> All right, well guys, I really appreciate the wrap up here, Dave and Sarbjeet. I'd love to get your final thoughts. I'll just start by saying that one of the highlights for me was the luminary guests, alongside the 15 great companies, the luminary guests we had from our community on our keynotes today. But Ali Ghodsi said, "Don't listen to what everyone's saying in the press." That was his position. He says you've got to figure out where the puck's going. He didn't say that, but I'm paraphrasing what he said. And I love how he brought up Sky Cloud. I call it Skynet. That's an interesting philosophy. And then he also brought up that machine learning, auto ML, has got to be table stakes. So to me, that's the highlight walkaway. And the second one is this idea that the enterprises have to have a new way to procure, and not just the consumption, but the vendor selection. I think it's going to be very interesting as value can be proved with data. So maybe the procurement process becomes: here's a beachhead, here's a little bit of data, let me see what it can do. >> I would say... Again, I said it this morning, the big four have given us a gift. Last year they spent over a hundred billion dollars on CapEx. To me, that's a gift. And so many companies, especially those focused on trying to hang onto the legacy business, are saying, "Well, not everything's going to move to the Cloud." Whatever; the narrative should change to, "Hey, thank you for that gift. We're now going to build value on top of the Cloud." Ali Ghodsi laid that out, how Databricks is doing it. And it's clearly what Snowflake's doing with the data cloud: basically a layer that abstracts all that underlying complexity and adds value on top, eventually going out to the Edge. That's a value-added model that's enabled by the hyperscalers. And that to me, if I have to evaluate where I'm going to place my bets as a CIO or IT practitioner, I'm going to look at who are the ones that are actually embracing that investment that's been made and adding value on top in a way that can drive my data-driven, my digital business, or whatever buzzword you want to throw out. >> Yeah, I think we were talking about the startups in today's sessions. I think for startups, my advice is to be as close as you can be to the hyperscalers; anybody who avoids them will cave in at the end of the day, because that's where the whole center of gravity is.
That's where the innovation gravity is; everybody's gravitating towards that. And I would say, I've said it quite a few times in the last couple of years, that the rate of innovation happening in non-cloud companies, and by non-cloud I mean not the public cloud companies, is diminishing, if you will, as compared to the cloud, where there's a lot of innovation. The cloud companies are not powered by brute-force people anymore; they have sophisticated platforms and leverage those, and they also leverage the marketplaces and leverage their buyers. And the key will be how you highlight yourself in that cloud marketplace, if you will. It's like a grocery store: where your product is placed matters, and you have to market around it, and you have to have a good storytelling team in place as well after you do the product market fit. I think that's key. I think just being close to the cloud providers, that's the way to go for startups. >> Real, real quick. Each of you talk about what it takes to crack the code for the enterprise in the modern era now. Dave, we'll start with you. What's it take? (indistinct) >> You've got to be solving a problem that is 10X better at one-tenth the cost of anybody else, if you're a small company; that's rule number one. Number two is you obviously have to get product market fit. You've got to then figure out... And I think, again, in your early phases, you have to be almost process builders, figure out... Your KPIs should all be built around retention. How do I define customer success? How do I keep customers and how do I make them loyal, so that I know that my cost of acquisition is going to be at most one-third of the lifetime value of that customer? So you've got to nail that. And then once you nail that, you've got to codify that process in the next phase, which probably gets into your platform discussion. And that's really where you can start to standardize and scale and figure out your go to market and the relationship between marketing spend and sales productivity. And then when you get that, then you've got to move on to figure out your moat. Your moat might just be a brand. It might be some secret sauce, but more often than not, it's going to be the relationship that you build. And I think you've got to think about those phases, and in today's world, you've got to move really fast. Sarbjeet, real quick. What's the secret to cracking the code? >> I think the secret to cracking the code is partnerships and alliances. As a small company selling to the bigger enterprises, the vendor's size will be one of the big objections. Even if they don't say it, it's in the back of their mind: "What if these guys disappear tomorrow? What would we do if we pick this technology?" And another thing is, if you're building on the left side, which is the developer side, not on the right side, which is the operations or production side, if you will, you have to understand that the sales cycles are longer on the right side, and the left side is easier to get to; that's why we see a lot more startups on the left side, in the DevOps space, if you will, because it's easier to sell to practitioners and market to them and then show the value correctly. And also understand that on the left side the developers are very know-how hungry; on the right side, people are very cost-conscious. So understanding the traits of these different personas, these buyers if you will, will, I think, set you apart.
And as Dave said, you have to solve a problem; focus on practitioners first, because you're small. You have to solve political problems very well. And then you can expand. >> Well, guys, I really appreciate the time. Dave, we're going to do more of these. Sarbjeet, we're going to do more of these. We're going to add more community to it. We're going to add our community rooms next time. We're going to do these quarterly and try to do them more frequently; we learned a lot, and we've still got a lot more to learn. There's a lot more contribution out in the community that we're going to tap into. Certainly the CUBE Club, as we call it, Dave. We're going to build this actively around Cloud. This is another 20 years. The Edge brings more life to the Cloud, it's really exciting. And again, the enterprise is no longer just the enterprise; it's the whole world now. So great companies here, the next Databricks, the next IPO, the next big thing is in this list, Dave. >> Hey, John, we'll see you in Barcelona. Looking forward to that. Sarbjeet, I know in the second half we're going to run into each other. So (indistinct) thank you John. >> Trouble has started. Great talking to you guys today, and have fun in Barcelona, and keep us informed. >> Thanks for coming on. I want to thank Natalie Erlich, who's in Rome right now. She's probably well past her bedtime, but she kicked it off and has been emceeing and hosting with Dave and me for this AWS startup showcase. This is batch two, episode two, day two; what do we call this? It's like a release, so the next 15 startups are coming. So we'll figure it out. (laughs) Thanks for watching everyone. Thanks. (bright music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave | PERSON | 0.99+ |
Katie | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Natalie Erlich | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Sarbjeet | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Katie Drucker | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Peter Wagner | PERSON | 0.99+ |
Telcos | ORGANIZATION | 0.99+ |
Peter | PERSON | 0.99+ |
Natalie | PERSON | 0.99+ |
Ali Ghodsi | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Teresa Carlson | PERSON | 0.99+ |
Jeff Barr | PERSON | 0.99+ |
Alibaba | ORGANIZATION | 0.99+ |
Andy | PERSON | 0.99+ |
Cisco | ORGANIZATION | 0.99+ |
Andy Thry | PERSON | 0.99+ |
Barcelona | LOCATION | 0.99+ |
Ali | PERSON | 0.99+ |
Rome | LOCATION | 0.99+ |
Madrona Venture Group | ORGANIZATION | 0.99+ |
Jeremy Burton | PERSON | 0.99+ |
Redback Networks | ORGANIZATION | 0.99+ |
Madrona | ORGANIZATION | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
Telco | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Doug | PERSON | 0.99+ |
Wellfleet | ORGANIZATION | 0.99+ |
Harvard School of Business | ORGANIZATION | 0.99+ |
Last year | DATE | 0.99+ |
Berkeley | LOCATION | 0.99+ |
HPE GreenLake Day Power Panel | HPE GreenLake Day 2021
>> Okay. Okay. Now we're gonna go into the GreenLake Power Panel, talk about the cloud landscape, hybrid cloud, and how the partner ecosystem and customers are thinking about cloud, hybrid cloud, as a service and, of course, GreenLake. And with me are CR Houdyshell, president of Advizex; Ron Nemecek, who's the business alliance manager at CBTS; Harry Zaric, who is president of Compugen; and Benjamin Clay, who is VP of sales and alliances at Arrow Electronics. Great to see you guys. Thanks so much for coming on the Cube. >> Thanks for having us. >> Good to be here. >> Okay, here's the deal. So I'm gonna ask you guys each to introduce yourselves and your companies, add a little color to my brief intro, and then answer the following question: how do you and your customers think about hybrid cloud, and think about it in the context of where we are today and where we're going? Not just the snapshot, but where we are today and where we're going. CR, why don't you start, please? >> Sure. Thanks a lot, Dave, appreciate it. And, uh, again, CR Houdyshell, president of Advizex. I've been with the company for 18 years, the last four years as president. So I've had the great opportunity here to lead a 45-year-old company with a very strong brand and great culture. As it relates to Advizex and where we're headed with hybrid cloud, it's a journey, so we're excited to be leading that journey for the company as well as with HPE. We're very excited about where HPE is going with GreenLake. We believe it's a very strong solution when it comes to hybrid cloud. We've been an HPE partner since 1980, so for 40 years it's our longest standing OEM relationship, and we're really excited about where HPE is going with GreenLake from a hybrid cloud perspective. We feel like we've been doing hybrid cloud solutions the past few years with everything that we've focused on from a VMware perspective. But now, with where HPE is going, we think it's really changing the game, and it really comes down to giving customers that cloud experience with an on-prem solution with GreenLake. We've had a great response from our customers, and we think we're gonna continue to see that kind of increased activity and reception. >> Great. Thank you, CR, and yeah, I totally agree. It is a journey. Ron, I wonder if you could kick off your intro there, please? >> Sure. Dave, thanks for having me today, and it's a pleasure being here with all of you. My name is Ron Nemecek, business alliance manager at CBTS. In my role, I am responsible for our HPE GreenLake relationship globally. I've enjoyed a 33-year career in the IT industry. I'm thankful for the opportunity to serve in multiple functional and senior leadership roles that have helped me gather a great deal of education and experience that can be used to aid our customers with their evolving needs for business outcomes and best position them for sustainable and long-term success. I'm honored to be part of the CBTS and OnX Canada organization. CBTS stands for consult, build, transform and support. We have a 35-year relationship with HPE; we're a platinum and Inner Circle partner. We're headquartered in Cincinnati, Ohio. We service 3,000 customers, generating over a billion dollars in revenue, and we have over 2,000 associates across the globe. Our focus is partnering with our customers to deliver innovative solutions and business results through thought leadership.
We drive this innovation via our team of the best and brightest technology professionals in the industry, who have secured over 2,800 technical certifications, 260 specifically with HPE. And in our hybrid cloud business, we have clearly found that technology, new market demands for instant responses and experiences, evolving economic considerations with detailed financial evaluation, and, of course, the global pandemic have challenged each of our customers across all industries to develop an optimal cloud strategy. We now play an enhanced strategic role for our customers as their technology advisor and their guide to the right mix of cloud experiences that will maximize their organizational success with predictable outcomes. Our conversations have really moved from product roadmaps and speeds and feeds to return on investment, return on capital, and financial statements, ratios and metrics. We collaborate regularly with our customers at all levels and all departments to find an effective, comprehensive cloud strategy for their workloads and applications, ensuring proper alignment of costs with financial return. >> Great. Thank you, Ron. Yeah, today it's all about the business value. Harry, please. >> Hi Dave, thanks for the opportunity, and greetings from the Great White North. We're a Canadian-based company headquartered in Toronto, with offices across the country. We've been in the tech industry for a very long time, what we would call a solution provider; hard for my mother to understand what that means. But our goal is to help our customers realize the business value of their technology investments. Just to give you an example of what it is we try and do: we just finished a build-out of new networking, endpoint and data center technology for a brand new hospital that is now being mobilized for COVID high-risk patients. So talk about being an essential industry, providing essential services across the whole spectrum of technology. Now, in terms of what's happening in the marketplace, our customers are confused, no question about it. They hear about cloud and cloud first, and everyone goes to the cloud. But the reality is there's lots of technology, lots of applications that actually still have to run on premises for a whole bunch of reasons. And what customers want is solid, senior, serious advice as to how they leverage what they already have in terms of their existing infrastructure, but modernize it and update it so it looks and feels a lot like a cloud. But they have the security, they have the protection that they need to have, for reasons that are dependent on their industry and business, to allow them to run on-prem. And so the GreenLake philosophy is perfect. It allows customers to actually have one foot in the cloud, one foot in their traditional data center, but modernized so it actually looks like one enterprise entity. And it's that kind of flexibility that gives us an opportunity collectively, ourselves, our partners, HPE, to really demonstrate that we understand how to optimize the use of technology across all of the business applications they need to run. >> Yeah, Harry, it's interesting what you said. Cloud is kind of chaotic, my word, not yours, but there is a lot of confusion out there. I mean, what's cloud, right? Is it public cloud, is it private cloud, hybrid cloud? Now it's the edge. And of course, the answer is all of the above. Ben, what's your perspective on all this?
You know, I think as an industry, you know, I think we we've all accepted that public cloud is not necessarily gonna win the day and were, in fact, in a hybrid world, there's certainly been some some commentary impress. Um, you know, that would sort of validate that. Not that necessarily needs any validation. But I think it's the linkages between on Prem, Um, and cloud based services have increased. Its paved the way for customers to more effectively deploy hybrid solutions in the model that they want that they desired. You know, Harry was commenting on that a moment ago. Um, you know, as the trend continues, it becomes much easier for solution providers and service providers to drive there services, initiatives, uh, you know, in particular managed services. So, you know, from from an arrow perspective, as we think about how we can help scale in particular from Greenland perspective, we've got the ability to stand up some some cloud capabilities through our aero secure platform. um that can really help customers adopt Green Lake. Uh, and, uh, benefit to benefit from, um, some alliances, opportunities as well. And I'll talk more about that as we go through >>that. I didn't mean to squeeze you on a narrow. I mean, you got arrows. Been around longer than computers. I mean, if you google the history of arrow, it'll blow your mind. But give us a little, uh, quick commercial. >>Yeah, absolutely. So, um, I've been with arrow for about 20 years. I've got responsibility for alliances, organization, North America for Global value, added distribution, business consulting and channel enablement Company. Uh, you know, we bring scope, scale and and, uh, expertise as it relates to the I t industry. Um, you know, I love the fast paced, the fast paced that comes with the market, that we're all all in, and I love helping customers and suppliers both, you know, be positioned for long term success. And, you know, the subject matter here today is just a great example of that. So I'm happy to be here and or to the discussion. >>All right, We got some good brain power in the room. Let's let's cut right to the chase. Ron, Where's the pain? What are the main problems that C B. T s. I love the what it stands for. Consult Bill Transform and support the What's the main pain point that that customers are asking you to solve when it comes to their cloud strategies. >>Third day of our customers' concerns and associated risk come from the market demands to deliver their products, services and experiences instantaneously. And then the challenges is how do they meet those demands because they have aging infrastructure processes and fiscal constraints. Our customers really need us now more than ever to be excellent listeners so we can collaborate on an effective map for the strategic placement of workloads and applications in that spectrum of cloud experiences, while managing their costs and, of course, mitigating risk to their business. This collaboration with our customer customers often identify significant costs that have to be evaluated, justified or eliminated. We find significant development, migration and egress charges in their current public cloud experience, coupled with significant over provisioning, maintenance, operational and stranded asset costs in their on premise infrastructure environment. When we look at all these costs holistically through our customized workshops and assessments. 
we can identify the optimal cloud experience for the respective workloads and applications through our partnership with HPE and the availability of the HPE GreenLake solutions. Our customers now have a choice to deliver SLAs, economics and business outcomes for the workloads and applications that best reside on premise in a private cloud, and have that experience. This is a rock solid solution that eliminates, you know, the development costs and the egress charges that are associated with the public cloud, while utilizing HPE GreenLake to eliminate overprovisioning costs and the maintenance costs on aging infrastructure hardware. Lastly, our customers only have to pay for actual infrastructure usage with no upfront capital expense, and that achieves true utilization-to-cost economics, you know, with the HPE GreenLake solution from CBTS. >> I love the focus on the business case, because it's measurable and it's sort of follow the money; that's where the opportunity is. Okay, CR, I got a question for you. Thinking about Advizex customers, how are they, are they leaning into GreenLake? You know, what are they telling you is the business impact when they experience GreenLake? >> I think it goes back to what Ron was talking about. We have to solve the business challenges first, and so far the reception's been positive. When I say that, I mean customers are open; everybody, the C suite, wants to hear about cloud and where hybrid cloud fits. But what we're hearing, what we're seeing from our customers, is we're seeing more adoption from customers where it may be their first put-in, if you will. But as importantly, we're able to share other customers' experiences with potentially new clients that say, what's the first thing that happens with regard to GreenLake? Well, number one, it works. It works as advertised and as a service. That's a big step. There are a lot of people out there dabbling today, but when you can say we have a proven solution, it's working in our environment today, that's key. I think the second thing is flexibility. You know, when customers are looking for this hybrid solution, you've got to be flexible. Again, I think Ron said it well: you don't have a big capital outlay, but also, what customers want is to be able to say, we're gonna build for growth, but we don't want to pay for it until we use it, so we'll pay as we grow, not the way we used to do it with an upfront capital expenditure; now we just pay as we grow, and that really facilitates things. Another great example, and you'll hear from a customer, uh, this afternoon, is where one of the biggest benefits is they just acquired a $570 million company, and their integration is going to be very seamless because of their investment in GreenLake. They're looking at the flexibility to add to GreenLake as a big opportunity to integrate acquisitions. And finally, we see it really brings the cloud experience and as a service to our customers. And with HPE GreenLake, it brings best of breed. So it's not just what HPE has to offer; when you look at hyperconverged, they have Nutanix, Cohesity, so I really believe it brings best of breed. So, uh, to net it out and close it out, with our customers thus far the customer experience has been exceptional. With GreenLake Central as the interface, customers have had a lot of success. We just had our first customer from about a year and a half ago just re-up, and it was a highly competitive situation.
But they just said, look, it's proven, it works, and it gives us that cloud experience. So we've had a lot of great success thus far, and we're looking forward to more. >> Thank you. So, Harry, I want to pick up on something CR said and get your perspectives. So when I talk to the C suite, they do all want to hear about, you know, cloud; they have a cloud agenda, and what they tell me is it's not just about their IT transformation. They want that, but they also want to transform their business. So I wonder if you could talk, Harry, about Compugen's perspective on the potential business impact of GreenLake. And also, you know, I'm interested in how you guys are thinking about workloads, how to manage workloads, you know, how to cost optimize in IT, but also the business value that comes out of that capability. >> Yes. So, Dave, you know, if you were to talk to a CFO, and I have the good fortune to talk to lots of CFOs, they want to pay the cost when they generate the revenue; they don't want to have all the cost up front and then wait for the revenue to come through. A good example of where that's happening right now is related to the pandemic. Employees that used to work at the office have now moved to working from home, and now they have to connect remotely to run the same applications. So they use this thing called VDI, virtual desktop infrastructure, to allow them to connect to the applications that they need to run in the office. I don't want to get into too much detail, but to be able to support that from an at-home environment, they needed to buy a lot more computing capacity to handle it. Now there's an expectation that hopefully six months from now, maybe sooner than that, people will start returning to the office. They may not need that capacity, so they can turn down on the cost. And so the idea of having the capacity available when you need it, but then turning it off when you don't need it, is really a benefit of a variable cost model. Another example that I would use is one in new development. If a customer is going to implement, let's say, a line of business application, SAP is very, very popular, you know, it unfortunately takes six months to two years to actually get that application set up, installed, validated, tested and then moved through to production. You know what used to happen before? They would buy all that capacity up front and it would basically sit there for two years. And then when they finally went to full production, then they were really getting value out of that investment. But they actually lost a couple of years of technology literally sitting almost idle. And so, from a CFO perspective, the ability to support the development of those applications as they scale, GreenLake is the ideal solution that allows them to do that.
How does arrow and how do How do your partners think about building cloud experience experiences? And where does Green Lake fit in from your perspective? >>A great question. So from a narrow perspective, when you think about cloud experience and, of course, us taking a view as a distribution partner, we want to be able to provide scale and efficiency to our network of partners. So we do that through our aero screw platform. Um, just just a bit of a you know, a bit of a commercial. I mean, you get single quote single bill auto provision compared multi supplier, if you will Subscription management utilization reporting from the platform itself. So if we pivot that directly to HP, you're going to get a bit of a scoop here, Dave. So we're excited today to have Green Lake live in our platform available for our part of community to consume in particular the swift solutions that HP has announced. So we're very excited to to share that today, Um, maybe a little bit more on Green Lake. I think at this point in time, there it's differentiated, Um, in a sense that if you think about some of the other offerings in the market today and further with, um uh, having the solutions himself available in a row sphere So, you know, I would say, Do we identify the uniqueness, um, and quickly partner with HP to to work with our atmosphere platform? One other sort of unique thing is, you know, when you think about platform itself, you've got to give a consistent experience the different geographies around the world. So, you know, we're available in north of 20 countries. There's thousands of resellers and transacting on the platform on a regular basis, and frankly, hundreds of thousands and customers are leveraging today, so that creates an opportunity for both Arrow HP and our partner community. So we're excited. >>Uh, you know, I just want to open it up and we don't have much time left, but thoughts on on on differentiation. You know, when people ask me Okay, what's really different about H P E and Green Lake? As others you know are doing things that with with as a service to me, it's a I I always say cultural. It starts from the top with Antonio, and it's like the company's all in. But But I wonder from your perspective because you guys are hands on. Are there other differential factors that you would point to let me just open that up to the group? >>Yeah, if I could make a comment. You know, Green Lake is really just the latest invocation of the as a service model. And what does that mean? What that actually means is you have a continuous ongoing relationship with the customer. It's not a cell. And forget not that we ever forget about customers, but there are highlights. Customer buys, it gets installed, and then for two or three years, you may have an occasional engagement with them. But it's not continuous. When you move to a Green Lake model, you're actually helping them manage that you are in the core in the heart of their business. No better place to be if you want to be sticky and you want to be relevant, and you want to be always there for them. >>You know, I wonder if somebody else could add to and and and in your in your remarks from your perspective as a partner because, you know, Hey, a lot of people made a lot of money selling boxes, but those days are pretty much gone. I mean, you have to transform into a services mindset. But other thoughts, >>I think I think Dad did that day. I think Harry's right on right. What he the way he positioned Exactly. You get on the customer. 
Even another step back for us is we're able to have the business conversation without leading with what you just said. You don't have to leave with a storage solution to leave with a compute. You can really have step back, have a business conversation, and we've done that where you don't even bring up hp Green Lake until you get to the point of the customer says, So you can give me an on prem cloud solution that gives me scalability, flexibility, all the things you're talking about. How does that work then? Then you bring up. It's all through this HP Green link tool. It really gives you the ability to have a business conversation. And you're solving the business problems versus trying to have a technology conversation. And to me, that's clear differentiation for HP. Green length. >>All right, guys. CR Ron. Harry. Ben. Great discussion. Thank you so much for coming on the program. Really appreciate it. >>Thanks for having us, Dave. >>All >>right. Keep it right there for more great content at Green Lake Day. Right back? Yeah.
Spotlight Track | HPE GreenLake Day 2021
(bright upbeat music) >> Announcer: We are entering an age of insight where data moves freely between environments to work together powerfully, from wherever it lives. A new era driven by next generation cloud services. It's freedom that accelerates innovation and digital transformation, but it's only for those who dare to propel their business toward a new future that pushes beyond the usual barriers. To a place that unites all information under a fluid yet consistent operating model, across all your applications and data. To a place called HPE GreenLake. HPE GreenLake pushes beyond the obstacles and limitations found in today's infrastructure because application entanglements, data gravity, security, compliance, and cost issues simply aren't solved by current cloud options. Instead, HPE GreenLake is the cloud that comes to you, bringing with it, increased agility, broad visibility, and open governance across your entire enterprise. This is digital transformation unlocked, incompatibility solved, data decentralized, and insights amplified. For those thinkers, makers and doers who want to create on the fly scale up or down with a single click, stand up new ideas without risk, and view it all as a single agile system of systems. HPE GreenLake is here and all are invited. >> The definition of cloud is evolving and now clearly comprises hybrid and on-prem cloud. These trends are top of mind for every CIO and the space is heating up as every major vendor has been talking about as-a-Service models and making moves to better accommodate customer needs. HPE was the first to market with its GreenLake brand, and continues to make new announcements designed to bring the cloud experience to far more customers. Come here from HPE and its partners about the momentum that they're seeing with this trend and what actions you can take to stay ahead of the competition in this fast moving market. (bright soft music) Okay, we're with Keith White, Senior Vice President and General Manager for GreenLake at HPE, and George Hope, who's the Worldwide Head of Partner Sales at Hewlett Packard Enterprise. Welcome gentlemen, good to see you. >> Awesome to be here. >> Yeah. Thanks so much. >> You're welcome, Keith, last we spoke, we talked about how you guys were enabling high performance computing workloads to get green-late right for enterprise markets. And you got some news today, which we're going to get to but you guys, you put out a pretty bold position with GreenLake, basically staking a claim if you will, the edge, cloud as-a-Service all in. How are you thinking about its impacts for your customers so far? >> You know, the impact's been amazing and, you know, in essence, I think the pandemic has really brought forward this real need to accelerate our customer's digital transformation, their modernization efforts, and you know, frankly help them solve what was amounting to a bunch of new business problems. And so, you know, this manifests itself in a set of workloads, set of solutions, and across all industries, across all customer types. And as you mentioned, you know GreenLake is really bringing that value to them. It brings the cloud to the customer in their data center, in their colo, or at the edge. And so frankly, being able to do that with that full cloud experience. All is a pay per use, you know, fully consumption-based scenario, all managed for them so they get that as I mentioned, true cloud experience. It's really sort of landing really well with customers and we continue to see accelerated growth. 
We're adding new customers, we're adding new technology. And we're adding a whole new set of partner ecosystem folks as well that we'll talk about. >> Well, you know, it's interesting you mentioned that just cause as a quick aside it's, the definition of cloud is evolving and it's because customers, it's the way customers look at it. It's not just vendor marketing. It's what customers want, that experience across cloud, edge, you know, multiclouds, on-prem. So George, what's your take? Anything you'd add to Keith's response? >> I would, you've heard Antonio Neri say it several times and you probably saying it for yourself. The cloud is an experience, it's not a destination. The digital transformation is pushing new business models and that demands more flexible IT. And the first round of digital transformation focused on a cloud first strategy. For our customers we're looking to get more agility. As Keith mentioned, the next phase of transformation will be characterized by bringing the cloud speed and agility to all apps and data, regardless of where they live, According to IDC, by the end of 2021, 80% of the businesses will have some mechanism in place to shift the cloud centric, infrastructure and apps and twice as fast as before the pandemic. So the pandemic has actually accelerated the impact of the digital divide, specifically, in the small and medium companies which are adapting to technology change even faster and emerging stronger as a result. You know, the analysts agree cloud computing and digitalization will be key differentiators for small and medium business in years to come. And speed and automation will be pivotal as well. And by 2022, at least 30% of the lagging SMBs will accelerate digitalization. But the fair focus will be on internal processes and operations. The digital leaders, however, will differentiate by delivering their customers, a dynamic experience. And with our partner ecosystem, we're helping our customers embrace our as-a-Service vision and stand out wherever they are. on their transformation journey. >> Well, thanks for those stats, I always liked the data. I mean, look, if you're not a digital business today I feel like you're out of business only 'cause.... I'm sure there's some exceptions, but you got to get on the digital bandwagon. I think pre-pandemic, a lot of times people really didn't know what it meant. We know now what it means. Okay, Keith, let's get into the news when we do these things. I love that you guys always have something new to share. What do you have? >> No, you got it. And you know, as we said, the world is hybrid and the world is multicloud. And so, customers are expecting these solutions. And so, we're continuing to really drive up the innovation and we're adding additional cloud services to GreenLake. We just recently went to General AVailability of our MLOps, Machine Learning Operations, and our containers for cloud services along with our virtual desktop which has become very big in a pandemic world where a lot more people are working from home. And then we have shipped our SAP HEC, customer edition, which allows SAP customers to run on their premise whether it's the data center or the colo. And then today we're introducing our new Bare Metal capabilities as well as containers on Bare Metal as a Service, for those folks that are running cloud native applications that don't require any sort of hypervisor. So we're really excited about that. 
And then second, I'd say similar to that HPC as a Service experience we talked about before, where we were bringing HPC down to a broader set of customers. We're expanding the entry point for our private cloud, which is virtual machines, containers, storage, compute type capabilities in workload optimized systems. So again, this is one of the key benefits that HPE brings is it combines all of the best of our hardware, software, third-party software, and our services, and financial services into a package. And we've workload optimized this for small, medium, large and extra-large. So we have a real sort of broader base for our customers to take advantage of and to really get that cloud experience through HPE GreenLake. And, you know, from a partner standpoint we also want to make sure that we continue to make this super easy. So we're adding self-service capabilities we're integrating into our distributors marketplaces through a core set of APIs to make sure that it plugs in for a very smooth customer experience. And this expands our reach to over 100,000 additional value-added resellers. And, you know, we saw just fantastic growth in the channel in Q1, over 118% year over year growth for GreenLake Cloud Services through the channel. And we're continuing to expand, extend and expand our partner ecosystem with additional key partnerships like our colos. The colocation centers are really key. So Equinix, CyrusOne and others that we're working with and I'll let George talk more about. >> Yeah, I wonder if you could pick up on that George. I mean, look, if I'm a partner and and I mean, I see an opportunity here.. Maybe, you know, I made a lot of money in the old days moving iron. But I got to move, I got to pivot my business. You know, COVID's actually, you know, accelerating a lot of those changes, but there's a lot of complexity out there and partners can be critical in helping customers make that journey. What do you see this meaning to partners, George? >> So I completely agree with Keith and through and with our partners we give our customers choice. Right, they don't have to worry about security or cost as they would with public cloud or the hyperscalers. We're driving special initiatives via Cloud28 which we run, which is the world's largest cloud aggregator. And also, in collaboration with our distributors in their marketplaces as Keith mentioned. In addition, customers can leverage our expertise and support of our service provider ecosystem, our SI's, our ISV's, to find the right mix of hybrid IT and decide where each application or workload should be hosted. 'Cause customers are now demanding robust ecosystems, cloud adjacency, and efficient low latency networks. And the modern workload demands, secure, compliant, highly available, and cost optimized environments. And Keith touched on colocation. We're partnering with colocation facilities to provide our customers with the ability to expand bandwidth, reduce latency, and get access to a robust ecosystem of adjacent providers. We touched on Equinix a bit as one of them, but we're partnering with them to enable customers to connect to multiple clouds with private on-demand interconnections from hundreds of data center locations around the globe. We continue to invest in the partner and customer experience, you know, making ourselves easier to do business with. We've now fully integrated partners in GreenLake Central, and could provide their customers end to end support and managing the entire hybrid IT estate. 
And lastly, we're providing partners with dedicated and exclusive enablement opportunities so customers can rely on both HPE and partner experts. And we have a competent team of specialists that can help them transform and differentiate themselves. >> Yeah, so, I'm hearing a theme of simplicity. You know, I talked earlier about this being customer-driven. To me what the customer wants is they want to come in, they want simple, like you mentioned, self-serve. I don't care if it's on-prem, in the cloud, across clouds, at the edge, abstract, all that complexity away from me. Make it simple to do, not only the technology to work, you figure out where the workload should run and let the metadata decide and that's a bold vision. And then, make it easy to do business. Let me buy as-a-Service if that's the way I want to consume. And partners are all about, you know, reducing friction and driving that. So, anyway guys, final thoughts, maybe Keith, you can close it out here and maybe George can call it timeout. >> Yeah, you summed it up really nice. You know, we're excited to continue to provide what we view as the largest and most flexible hybrid cloud for our customers' apps, data, workloads, and solutions. And really being that leading on-prem solution to meet our customer's needs. At the same time, we're going to continue to innovate and our ears are wide open, and we're listening to our customers on what their needs are, what their requirements are. So we're going to expand the use cases, expand the solution sets that we provide in these workload optimized offerings to a very very broad set of customers as they drive forward with that digital transformation and modernization efforts. >> Right, George, any final thoughts? >> Yeah, I would say, you know, with our partners we work as one team and continue to hone our skills and embrace our competence. We're looking to help them evolve their businesses and thrive, and we're here to help now more than ever. So, you know, please reach out to our team and our partners and we can show you where we've already been successful together. >> That's great, we're seeing the expanding GreenLake portfolio, partners key part of it. We're seeing new tools for them and then this ecosystem evolution and build out and expansion. Guys, thanks so much. >> Yeah, you bet, thank you. >> Thank you, appreciate it. >> You're welcome. (bright soft music) >> Okay, we're here with Jo Peterson the VP of Cloud & Security at Clarify360. Hello, Jo, welcome to theCUBE. >> Hello. >> Great to see you. >> Thanks for having me. >> You're welcome, all right, let's get right into it. How do you think about cloud where we are today in 2021? The definitions evolve, but where do you see it today and where do you see it going? >> Well, that's such an interesting question and is so relevant because the labels are disappearing. So over the last 10 years, we've sort of found ourselves defining whether an environment was public or whether it was private or whether it was hybrid. Here's the deal, cloud is infrastructure and infrastructure is cloud. So at the end of the day cloud in whatever form it's taking is a platform, and ultimately, this enablement tool for the business. Customers are consuming cloud in the best way that works for their businesses. So let's also point out that cloud is not a destination, it's this journey. And clients are finding themselves at different places on that road. And sometimes they need help getting to the next milestone. 
>> Right, and they're really looking for that consistent experience. Well, what are the big waves and trends that you're seeing around cloud out there in the marketplace? >> So I think that this hybrid reality is happening in most organizations. Their actual IT portfolios include a mix of on-premise and cloud infrastructure, and we're seeing this blurred line happening between the public cloud and the traditional data center. Customers want a bridge that easily connects one environment to the other environment, and they want end-to-end visibility. Customers are becoming more intentional and strategic about their cloud roadmaps. So some of them are intentionally and strategically selecting hybrid environments because they feel that it affords them more control, cost, balance, comfort level around their security. In a way, cloud itself is becoming borderless. The major tech providers are extending their platforms in an infrastructure agnostic manner and that's to work across hybrid environments, whether they be hosted in the data center, whether it includes multiple cloud providers. As cloud matures, workload environments fit is becoming more of a priority. So forward thinking where the organizations are matching workloads to the best environment. And it's sort of application rationalization on this case by case basis and it really makes sense. >> Yeah, it does makes sense. Okay, well, let's talk about HPE GreenLake. They just announced some new solutions. What do you think it means for customers? >> I think that HPE has stepped up. They've listened to not only their customers but their partners. Customers want consumable infrastructure, they've made that really clear. And HPE has expanded the cloud service portfolio for clients. They're offering more choices to not only enterprise customers but they're expanding that offering to attract this mid-market client base. And they provided additional tools for partners to make selling GreenLake easier. This is all helping to drive channel sales. >> Yeah, so better granularity, just so it increases the candidates, better optionality for customers. And this thing is evolving pretty quickly. We're seeing a number of customers that we talked to interested in this model, trying to understand it better and ultimately, I think they're going to really lean in hard. Jo, I wonder if you could maybe think about or share with us which companies are, I got to say, getting it right? And I'm really interested in the partner piece, because if you think about the partner business, it's really, it's changing a lot, right? It's gone from this notion of moving boxes and there was a lot of money to be made over the decades in doing that, but they have to now become value-add suppliers and really around cloud services. And in the early days of cloud, I think the channel was a little bit freaked out, saying, uh-oh, they're going to cut out the middleman. But what's actually happened is those smart agile partners are adding substantial value, they've got deep relationships with customers and they're serving as really trusted advisors and executors of cloud strategies. What do you see happening in the partner community? >> Well, I think it's been a learning curve and everything that you said was spot on. It's a two way street, right? In order for VARs to sell residual services, monthly recurring services, there has to have been some incentive to do that and HPE really got it right. Because they, again listened to that partner community, and they said, you know what? 
We've got to incentivize these guys to start selling this way. This is a partnership and we expect it to be a partnership. And the tech companies that are getting it right are doing that same sort of thing, they're figuring out ways to make it palatable to that VAR, to help them along that journey. They're giving them tools, they're giving them self-serve tools, they're incentivizing them financially to make that shift. That's what's going to matter. >> Well, that's a key point you're making, I mean, the financial incentives, that's new and different. Paying, you know, incentivizing for as-a-Service models versus again, moving hardware and paying for, you know, installing iron. That's a shift in mindset, isn't it? >> It definitely is. And HPE, I think, is getting it right because, I didn't know this but I learned it, 70% of their annual sales are actually transacted through their channel. And they've seen this 116% increase in HPE GreenLake orders in Q1, from partners. So what they're doing is working.
As it relates to Advizex and where we're headed to with hybrid cloud is it's a journey. So we're excited to be leading that journey for the company as well as HPE. We're very excited about where HPE is going with GreenLake. We believe it's a very strong solution when it comes to hybrid cloud. Have been an HPE partner since, well since 1980. So for 40 years, it's our longest standing OEM relationship. And we're really excited about where HPE is going with GreenLake. From a hybrid cloud perspective, we feel like we've been doing the hybrid cloud solutions, the past few years with everything that we've focused on from a VMware perspective. But now with where HPE is going, we think, probably changing the game. And it really comes down to giving customers that cloud experience with the on-prem solution with GreenLake. And we've had great response for customers and we think we're going to continue to see that kind of increased activity and reception. >> Great, thank you C.R., and yeah, I totally agree. It is a journey and we've seen it really come a long way in the last decade. Ron, I wonder if you could kickoff your little first intro there please. >> Sure Dave, thanks for having me today and it's a pleasure being here with all of you. My name is Ron Nemecek, I'm a Business Alliance manager at CBTS. In my role, I'm responsible for our HPE GreenLake relationship globally. I've enjoyed a 33 year career in the IT industry. I'm thankful for the opportunity to serve in multiple functional and senior leadership roles that have helped me gather a great deal of education and experience that could be used to aid our customers with their evolving needs, for business outcomes to best position them for sustainable and long-term success. I'm honored to be part of the CBTS and OnX Canada organization. CBTS stands for Consult Build Transform and Support. We have a 35 year relationship with HPE. We're a platinum and inner circle partner. We're headquartered in Cincinnati, Ohio. We service 3000 customers generating over a billion dollars in revenue and we have over 2000 associates across the globe. Our focus is partnering with our customers to deliver innovative solutions and business results through thought leadership. We drive this innovation via our team of the best and brightest technology professionals in the industry that have secured over 2,800 technical certifications, 260 specifically with HPE. And in our hybrid cloud business, we have clearly found that technology, new market demands for instant responses and experiences, evolving economic considerations with detailed financial evaluation, and of course the global pandemic, have challenged each of our customers across all industries to develop an optimal cloud strategy. We now play an enhanced strategic role for our customers as their technology advisor and their guide to the right mix of cloud experiences that will maximize their organizational success with predictable outcomes. Our conversations have really moved from product roadmaps and speeds and feeds to return on investment, return on capital, and financial statements, ratios, and metrics. We collaborate regularly with our customers at all levels and all departments to find an effective comprehensive cloud strategy for their workloads and applications ensuring proper alignment and cost with financial return. >> Great, thank you, Ron. Yeah, today it's all about the business value. Harry, please. >> Hi Dave, thanks for the opportunity and greetings from the Great White North. 
We're a Canadian-based company headquartered in Toronto with offices across the country. We've been in the tech industry for a very long time. We're what we would call a solution provider. How hard for my mother to understand what that means but what our goal is to help our customers realize the business value of their technology investments. Just to give you an example of what it is we try and do. We just finished a build out of a new networking endpoint and data center technology for a brand new hospital. It's now being mobilized for COVID high-risk patients. So talk about our all being in an essential industry, providing essential services across the whole spectrum of technology. Now, in terms of what's happening in the marketplace, our customers are confused. No question about it. They hear about cloud, I mean, cloud first, and everyone goes to the cloud, but the reality is there's lots of technology, lots of applications that actually still have to run on premises for a whole bunch of reasons. And what customers want is solid senior serious advice as to how they leverage what they already have in terms of their existing infrastructure, but modernize it, update it, so it looks and feels a lot like the cloud. But they have the security, they have the protection that they need to have for reasons that are dependent on their industry and business to allow them to run on-prem. And so, the GreenLake philosophy is perfect. That allows customers to actually have one foot in the cloud, one foot in their traditional data center but modernize it so it actually looks like one enterprise entity. And it's that kind of flexibility that gives us an opportunity collectively, ourselves, our partners, HPE, to really demonstrate that we understand how to optimize the use of technology across all of the business applications they need to run. >> You know Harry, it's interesting about what you said is, the cloud it is kind of chaotic my word, not yours. But there is a lot of confusion out there, I mean, what's cloud, right? Is it public cloud, is it private cloud, the hybrid cloud? Now, it's the edge and of course the answer is all of the above. Ben, what's your perspective on all this? >> From a cloud perspective, you know, I think as an industry, I think we we've all accepted that public cloud is not necessarily going to win the day and we're in fact, in a hybrid world. There's certainly been some commentary and press that was sort of validate that. Not that it necessarily needs any validation but I think is the linkages between on-prem and cloud-based services have increased. It's paved the way for customers more effectively, deploy hybrid solutions in in the model that they want or that they desire. You know, Harry was commenting on that a moment ago. As the trend continues, it becomes much easier for solution providers and service providers to drive their services initiatives, you know, in particular managed services. >> From an Arrow perspective is we think about how we can help scale in particular from a GreenLake perspective. We've got the ability to stand up some cloud capabilities through our ArrowSphere platform that can really help customers adopt GreenLake and to benefit from some alliances opportunities, as well. And I'll talk more about that as we go through. >> And Ben, I didn't mean to squeeze you on Arrow. I mean, Arrow has been around longer than computers. I mean, if you Google the history of Arrow it'll blow your mind, but give us a little quick commercial. >> Yeah, absolutely. 
So I've been with Arrow for about 20 years. I've got responsibility for Alliance organization in North America, We're a global value added distribution, business consulting and channel enablement company. And we bring scope, scale and and expertise as it relates to the IT industry. I love the fast pace that comes with the market that we're all in. And I love helping customers and suppliers both, be positioned for long-term success. And you know, the subject matter here today is just a great example of that. So I'm happy to be here and look forward to the discussion. >> All right, we got some good brain power in the room. Let's cut right to the chase. Ron, where's the pain? What are the main problems that CBTS I love what it stands for, Consult Build Transform and Support. What's the main pain point that customers are asking you to solve when it comes to their cloud strategies? >> Sure, Dave. Our customers' concerns and associated risks come from the market demands to deliver their products, services, and experiences instantaneously. And then the challenge is how do they meet those demands because they have aging infrastructure, processes, and fiscal constraints. Our customers really need us now more than ever to be excellent listeners so we can collaborate on an effective map with the strategic placement of workloads and applications in that spectrum of cloud experiences while managing their costs, and of course, mitigating risks to their business. This collaboration with our customers, often identify significant costs that have to be evaluated, justified or eliminated. We find significant development, migration, and egress charges in their current public cloud experience, coupled with significant over provisioning, maintenance, operational, and stranded asset costs in their on-premise infrastructure environment. When we look at all these costs holistically, through our customized workshops and assessments, we can identify the optimal cloud experience for the respective workloads and applications. Through our partnership with HPE and the availability of the HPE GreenLake solutions, our customers now have a choice to deliver SLA's, economics, and business outcomes for their workloads and applications that best reside on-premise in a private cloud and have that experience. This is a rock solid solution that eliminates, the development costs that they experience and the egress charges that are associated with the public cloud while utilizing HPE GreenLake to eliminate over provisioning costs and the maintenance costs on aging infrastructure hardware. Lastly, our customers only have to pay for actual infrastructure usage with no upfront capital expense. And now, that achieves true utilization to cost economics, you know, with HPE GreenLake solutions from CBTS. >> I love focus on the business case, 'cause it's measurable and it's sort of follow the money. That's where the opportunity is. Okay, C.R., so question for you. Thinking about Advizex customers, how are they, are they leaning into GreenLake? What are they telling you is the business impact when they experience GreenLake? >> Well, I think it goes back to what Ron was talking about. We had to solve the business challenges first and so far, the reception's been positive. When I say that is customers are open. Everybody wants to, the C-suite wants to hear about cloud and hybrid cloud fits. 
But what we hear and what we're seeing from our customers is we're seeing more adoption from customers that it may be their first foot in, if you will, but as important, we're able to share other customers with our potentially new clients that say, what's the first thing that happens with regard to GreenLake? Well, number one, it works. It works as advertised and as-a-Service, that's a big step. There are a lot of people out there dabbling today but when you can say we have a proven solution it's working in our environment today, that's key. I think the second thing is,, is flexibility. You know, when customers are looking for this hybrid solution, you got to be flexible for, again, I think Ron said (indistinct). You don't have a big capital outlay but also what customers want to be able to do is we want to build for growth but we don't want to pay for it. So we'll pay as we grow not as we have to use, as we used to do, it was upfront, the capital expenditure. Now we'll just pay as we grow, and that really facilitates in another great example as you'll hear from a customer, this afternoon. But you'll hear where one of the biggest benefits they just acquired a $570 million company and their integration is going to be very seamless because of their investment in GreenLake. They're looking at the flexibility to add to GreenLake as a big opportunity to integrate for acquisitions. And finally is really, we see, it really brings the cloud experience and as-a-Service to our customers. And with HPE GreenLake, it brings the best of breed. So it's not just what HPE has to offer. When you look at Hyperconverged, they have Nutanix, they have Cohesity. So, I really believe it brings best of breeds. So, to net it out and close it out with our customers, thus far, the customer experience has been exceptional. I mean, with GreenLake Central, as interface, customers have had a lot of success. We just had our first customer from about a year and a half ago just reopened, it was a highly competitive situation, but they just said, look, it's proven, it works, and it gives us that cloud experience so. Had a lot of great success thus far and looking forward to more. >> Thank you, so Harry, I want to pick up on something C.R. said and get your perspectives. So when I talk to the C-suite, they do all want to hear about, you know, cloud, they have a cloud agenda. And what they tell me is it's not just about their IT transformation. They want that but they also want to transform their business. So I wonder if you could talk, Harry, about Compugen's perspective on the potential business impact of GreenLake. And also, I'm interested in how you guys are thinking about workloads, how to manage work, you know, how to cost optimize in IT, but also, the business value that comes out of that capability. >> Yeah, so Dave, you know if you were to talk to CFO and I have the good fortune to talk to lots of CFOs, they want to pay the costs when they generate the revenue. They don't want to have all the costs upfront and then wait for the revenue to come through. A good example of where that's happening right now is you know, related to the pandemic, employees that used to work at the office have now moved to working from home. And now, they have to connect remotely to run the same application. So use this thing called VDI, virtual interfacing to allow them to connect to the applications that they need to run in the office. 
I don't want to get into too much detail but to be able to support that from an an at-home environment, they needed to buy a lot more computing capacity to handle this. Now, there's an expectation that hopefully six months from now, maybe sooner than that, people will start returning to the office. They may not need that capacity so they can turn down on the costs. And so, the idea of having the capacity available when you need it, but then turning it off when you don't need it, is really a benefit of the variable cost model. Another example that I would use is one in new development. If a customer is going to implement a new, let's say, line of business application. SAP is very very popular. You know, it actually, unfortunately, takes six months to two years to actually get that application set up, installed, validated, tested, then moves through production. You know, what used to happen before? They would buy all that capacity upfront, and it would basically sit there for two years, and then when they finally went to full production, then they were really value out of that investment. But they actually lost a couple of years of technology, literally sitting almost sidle. And so, from a CFO perspective, his ability to support the development of those applications as he scales it, perfect. GreenLake is the ideal solution that allows him to do that. >> You know, technology has saved businesses in this pandemic. There's no question about it. When Harry was just talking about with regard to VDI, you think about that, there's the dialing up and dialing down piece which is awesome from an IT perspective. And then the business impact there is the productivity of the end users. And most C-suite executives I've talked to said productivity actually went up during COVID with work from home, which is kind of astounding if you think about it. Ben, we said Arrow's been around for a long, long time. Certainly, before all of us were born and it's gone through many many industry transitions during our lifetimes. How does Arrow and how do your partners think about building cloud experiences and where does GreenLake fit in from your perspective? >> Great question. So from an Arrow perspective, when you think about cloud experience in of course us taking a view as a distribution partner, we want to be able to provide scale and efficiency to our network of partners. So we do that through our ArrowSphere platform. Just a bit of, you know, a bit of a commercial. I mean, you get single quote, single bill, auto provision, multi supplier, if you will, subscription management, utilization reporting from the platform itself. So if we pivot that directly to HPE, you're going to get a bit of a scoop here, Dave. And we're excited today to have GreenLake live in our platform available for our partner community to consume. In particular, the Swift solutions that HPE has announced so we're very excited to share that today. Maybe a little bit more on GreenLake. I think at this point in time, that it's differentiated in a sense that, if you think about some of the other offerings in the market today and further with having the the solutions themselves available in ArrowSphere. So, I would say, that we identify the uniqueness and quickly partner with HPE to work with our ArrowSphere platform. One other sort of unique thing is, when you think about platform itself, you've got to give a consistent experience. 
The different geographies around the world so, you know, we're available in North of 20 countries, there's thousands of resellers and transacting on the platform on a regular basis. And frankly, hundreds of thousands end customers. that are leveraging today. So that creates an opportunity for both Arrow, HPE and our partner community. So we're excited. >> You know, I just want to open it up. We don't have much time left, but thoughts on differentiation. Some people ask me, okay, what's really different about HPE and GreenLake? These others, you know, are doing things with as-a-Service. To me, I always say cultural, it starts from the top with Antonio, and it's like the company's all in. But I wonder from your perspectives, 'cause you guys are hands on. Are there other differentiable factors that you would point to? Let me just open that up to the group. >> Yeah, if I could make a comment. GreenLake is really just the latest invocation of the as-a-Service model. And what does that mean? What that actually means is you have a continuous ongoing relationship with the customer. It's not a sell and forget. Not that we ever forget about customers but there are highlights. Customer buys, it gets installed, and then for two or three years you may have an occasional engagement with them but it's not continuous. When you move to our GreenLake model, you're actually helping them manage that. You are in the core, in the heart of their business. No better place to be if you want to be sticky and you want to be relevant and you want to be always there for them. >> You know, I wonder if somebody else could add to it in your remarks. From your perspective as a partner, 'cause you know, hey, a lot of people made a lot of money selling boxes, but those days are pretty much gone. I mean, you have to transform into a services mindset, but other thoughts? >> I think to add to that Dave. I think Harry's right on. The way he positioned it it's exactly where he did own the customer. I think even another step back for us is, we're able to have the business conversation without leading with what you just said. You don't have to leave with a storage solution, you don't have to lead with compute. You know, you can really have step back, have a business conversation. And we've done that where you don't even bring up HPE GreenLake until you get to the point where the customer says, so you can give me an on-prem cloud solution that gives me scalability, flexibility, all the things you're talking about. How does that work? Then you bring up, it's all through this HPE GreenLake tool. And it really gives you the ability to have a business conversation. And you're solving the business problems versus trying to have a technology conversation. And to me, that's clear differentiation for HPE GreenLake. >> All right guys, C.R., Ron, Harry, Ben. Great discussion, thank you so much for coming on the program. Really appreciate it. >> Thanks for having us, Dave. >> Appreciate it Dave. >> All right, keep it right there for more great content at GreenLake Day, be right back. (bright soft music) (upbeat music) (upbeat electronic music)
Jeff Boudreau, Dell Technologies | Dell Technologies World 2020
>>From around the globe, it's theCUBE with digital coverage of Dell Technologies World, the Digital Experience, brought to you by Dell Technologies. Hello, everyone, and welcome back to theCUBE's coverage of Dell Tech World 2020. With me is Jeff Boudreau, the President and General Manager of the Infrastructure Solutions Group at Dell Technologies. Jeff, always good to see you, my friend. How you doing? >>Good. Good to see you. >>I wish we were hanging out at a Sox game or a Pats game, but I guess this will do. But, you know, it was about a year ago when you took over leadership of ISG. We actually had a brief conversation about it; you were in the room with Jeff Clarke. I thought it was a great choice. How you doing? How you feeling? Any key moments from the past 12 months that you feel like sharing? >>Sure. So first I want to say, I do remember that about a year ago, so thank you for reminding me. Yeah, it's been a very interesting year, right? It's been one year; it was in September, one year since I took over ISG. But I'm feeling great, so thank you for asking, and I hope you're doing the same. And I'm really optimistic about where we are and where we're heading. As you know, it's been an extremely challenging year and a very unpredictable year, as we've all experienced. And I'd say for the first part of the year, especially starting in March, I've been really focused on the health and safety of the families, our customers, and our team members on the team. A lot of it has been shifting, in regards to helping our customers around work from home or education and learning from home. And during all this time, though, I'll tell you, as a team we've accomplished a lot. There's a handful of things that I'm very proud of. First and foremost is around the customer experience: we have delivered our best quality and product NPS scores in our entire history, so that's something I'm extremely proud of during this time. Second, around our innovation engine, we refreshed the entire portfolio, which you're well aware of. We had nine launches in nine weeks back in that May and June timeframe, so that's something I'm really proud of the team on. Then last, I'd say it's around the team. We shifted about 90% of our workforce from the office to home, and from an engineering standpoint, 85% of my team is engineers writing code. And so people were concerned about that, but we didn't skip a beat, so I'm pretty impressed by the team and what they've done there. So the strategy remains unchanged. We're focused on our customers, integrating across the entire portfolio and the businesses like VMware, and really focused on gaining share. So despite all the uncertainty in the market, I'm pretty pleased with the team and everything that's been going on. So yeah, it's been an interesting year, but it's really great. I'm really optimistic about what we have in front of us.
You got product cycles now kicking in. So that could be, you know, a buffer. What are you seeing with Power Store and what's the uptake look like? They're >>sure. Well, specifically, let me take a step back and the regards the portfolio. So first, you know, the portfolio itself is a direct reflection in the feedback from all our partners and our customers over the last couple of years on Day two, ramp up that innovation. I spent a lot of time in the last few years simplifying under the power brands, which you're well aware of, right? So we had a lot of for a legacy EMC and Legacy dollars. Really? How do we simplify under a set of brands really over delivering innovation on a fewer set of products that really accelerating in exceeding customer needs? And we did that across the board. So from power edge servers, you know, power Max, the high end storage, the Powerball, all that we didn't hear one. And just most recently. And, you know, it's part of the big launches. We had power scale. We have power flex for software to find. And, of course, the new flagship offer for the mid range, which is power store. Um, Specifically, the policy of the momentum has been building since our launch back in May. And the feedback from our partners and our customers has been fantastic. And we've had a lot of big wins against, you know, a lot of a lot of our core competitors. A couple examples one is Arrow Electronics SAA, Fortune 500 Global Elektronik supplier. They leverage power Store to provide, you know, basically both, you know, enterprise computing and storage needs for their for their broader bases around the world on there, really taking advantage of the 41 data reduction, really helping them simplify their capacity planning and really improve operational efficiencies specifically without impacting performance. So it's it's one. We're given the data reductions, but there's no impact on performance, which is a huge value proffer for arrow another big customers tickets and write a global law firm on their reporting to us that over 90 they've had a 90% reduction in their rack space, and they've had over five times two performance over a core competitors storage systems azi. They've deployed power store around the world, really, and it's really been helping them. Thio easily migrate workloads across, so the feedback from the customers and partners has been extremely positive. Um, there really citing benefits around the architecture, the flexibility architecture around the micro services, the containers they're loving, the D M or integration. They're loving the height of the predictable data reduction capabilities in line with in line performance or no performance penalties with data efficiencies, the workload support, I'd say the other big things around the anytime upgrades is another big thing that customers we're really talking about so very excited and optimistic in regards as we continue to re empower store the second half of the year into next year really is the full full year for power store. >>So can I ask you about that? That in line data reduction with no performance hit is that new ipe? I mean, you're not doing some kind of batch data reduction, right? >>No, it's It's new, I p. It's all patented. We've actually done a lot of work in regards to our technologies. There's some of the things we talk about GPS and deep use and smart Knicks and things like that. We've used some offload engines to help with that. So between the software and the hardware, we've had leverage new I. P. 
So we can actually provide that predictable data reduction, but with the performance customers need. So we're not going to have a trade-off where you get more efficiency and less performance, or more performance and less efficiency. >>That's interesting. Yeah, when I talk to the chip guys, they talk about these sorts of storage offloads and other offloads we're seeing. These alternative processors are really starting to hit the market; NVIDIA is the obvious one, but you're seeing others as well. It really sounds like you're taking advantage of that. >>Yeah, it's a huge benefit. I mean, with our partners, whether it's Intel and NVIDIA and folks like that, Broadcom, it's really leveraging the great innovation that they do, plus our innovation. The sum of the parts can equal more of a benefit to our customers, and at the end of the day, that's what it's all about. >>So it sounds like COVID hasn't changed your strategy. I was talking to Dennis Hoffman and he was saying, look, you know, fundamentally we're executing on the same strategy; tactically, there's things that we do differently. But summarize your strategy coming into 2021. You know, we're still early in this decade. What do you see as the trends that you're trying to take advantage of? What are you excited about? Maybe some things that keep you up at night? >>Yeah, so I'd say I'll stay with what Dennis said. Our strategy is not changing as a company. You probably got that from Michael and from Jeff, and obviously Dennis just recently. But for me, it's a two-pronged approach. One is all about winning the consolidation in the core infrastructure markets that we participate in today. So think servers, storage, networking; we're already the clear leader across all those segments that we serve, and we'll continue to innovate within our existing product categories. And you saw that with the nine launches in nine weeks. My point on that one is we're always going to make sure that we have best-of-breed offers. Whether it's a three-tier, two-tier, converged, or hyperconverged offer, we want to make sure that we serve that and have the best innovation possible. In addition to that, though, the second piece of the strategy is really around how we differentiate value across, or innovate across, ISG, Dell Technologies, and even the broader ecosystems. Some of the examples I'll give you right now that we're doing: if you think about innovating across ISG, that's all about providing improved customer experience, a set of solutions and offers that really help simplify customer operations, and really give them better TCO or better SLAs. An example of something like that is CloudIQ. It's a SaaS-based offer that we have that really helps provide great insights and telemetry to our customers, that helps them simplify their IT operations, and it's a major step forward towards, you know, autonomous infrastructure, which is really what they're asking for. Customers have been very happy with the work we've done around day one, you know, faster time to value, but now it's like day two and beyond: how do you really help me accelerate the operations and really take that away from me? The other big piece is innovating across Dell Technologies. And you know, we do this with VMware live today, and that's just the start.
So things like VxRail are an example where we work together, and we're the clear leader in HCI. Things like Dell Technologies Cloud, where we've built in VMware Cloud Foundation and Tanzu, delivering an industry-leading hybrid cloud platform. And just recently at VMworld, I'm sure you heard about it, Project Monterey was announced; that's an effort we're doing with VMware and some other partners that's really about the next generation of infrastructure. Then, taking it up a notch beyond the infrastructure and ISG space, some of the areas where we're going to be looking at end-to-end solutions to help our customers are around six key areas. I'm sure John Roese has talked about these in the past, but think cloud, edge, 5G, AI and ML, data management, and security. Those will be the big things you'll see us lean into, so the strategy is consistent, with some big themes that you'll see us lean into going into next year. >> Yeah, I mean, it is consistent, right? You guys have always tried to ride the waves, vector your portfolio into those waves, and add value. I'm particularly impressed with your focus on customer experience, and I think that's a huge deal. In the past, with a lot of companies, yours included, and your predecessor EMC, you'd see customers say, hey, you're throwing so many products at me, I can't understand the portfolio. So focusing on that, I think, is huge right now, because people want that experience to be more cloud-like, and that's what you've got to deliver. What about any news from Dell Tech World? Any announcements that you want to highlight that we could talk about? >> Sure. And actually, just touching back on the point you made about simplification, that is a major tenet of mine in regards to the organization. There are three key components that I drive: one is around customer focus, and that's keeping customers first and foremost in everything we do; two is around accelerating that innovation engine; and three is really bringing everything together as one team, so we provide a better outcome to our customers. That simplification effort you talk about is core to what we're driving. I want to do fewer things, but do them better. What that means to me is, as I make decisions to move away from older technologies and really leverage our best-of-breed shared technology IP and people IP, I can exceed customer needs in the markets we're serving. It actually allows me to accelerate my innovation engine, because I shift more and more resources onto the newer stuff. Now, for Dell Tech World, yes, we've got some cool stuff coming, and you've probably heard about a few of these. We're going to be announcing Project Apex; hopefully you've been briefed on that already, or this isn't new news and I'll be in trouble. That's really around our strategy of delivering simple, consistent, as-a-service experiences for our customers, bringing together our Dell Technologies as-a-service offerings and our cloud strategy, plus our technology offerings and our go-to-market, all under a single unified effort, which Allison Dew will be leading on behalf of our executive leadership team. That's one big area. There's also another big one I'll talk about as we expand our as-a-service offers, and we think there's big power in that in regards to our Dell Technologies
Cloud Console. We'll be launching a new cloud console that will provide a uniform experience across all those resources and give users the ability to instantly manage every aspect of their cloud journey with just a few clicks. So going back to your broader point, it's all about simplicity. >> Yeah, we're definitely all over Apex. That's something I wanted to ask you about: this notion of as-a-service really requires kind of a new mindset, certainly from a pricing standpoint and in how you talk about the customer experience; it's a whole new customer experience. You're basically giving them access to what I would consider more of a platform and giving them some greater flexibility. Yes, there are some constraints in there, because physically you can only put so much capacity in front of them, but the idea of being able to dial up and dial down within certain commitments is, I think, a powerful one. How does it change the way in which you think about developing products, in terms of this API economy and infrastructure as code? How you converse about those products internally and externally? How do you see that shaking out? >> Dave, that's an awesome question, and it's actually front and center for everything we do. Obviously, customers want choice and flexibility in what they do, and to your point, as we evolve more toward as-a-service, specific products and product brands and logos are probably not the way of the future. It's the services, it's the experience that you provide. So for me, in infrastructure, making infrastructure as a service, you really want to define what that customer experience is, the SLA they're trying to realize, and then make sure we build the right solutions, products, features, and functions to enable that. A lot of that goes back to the core engineering work we need to do right now: making sure we have the right things around the developer community, around being API-rich, around SDKs, and around how we leverage internal source or external open source, if you will. A thing that I think we're all well aware of, but ought to keep in mind, is that cloud-native applications are really relevant both to the on-premises world and the off-premises world, so think about things like portability and reusability; those are some great examples of how we think about this as we go forward. Those modern applications will require modern infrastructure, regardless of how that infrastructure is abstracted, and think about things like disaggregation, or composability, or internet-based computing; it's a huge trend that we have to make sure we're thinking of. So as we disaggregate between the physical layers and the software layers and provide that as a service, think of a modern, container-based asset that can be repurposed: it could run on a purpose-built system, it could be deployed in a converged or hyperconverged system, or it could be deployed as a software feature in a cloud. That's really how we're thinking about it as we go forward. We're talking about building modern assets or components in a write-once, use-many type of model, and we can deploy them wherever you want because of some of the abstraction and disaggregation that we're going to do.
>> You could see customers in the near term saying, I don't care so much about the product; I want the fast one, or I want the cheaper one. >> It's kind of what we talked about with the waves. If you think about it, maybe on a specific brand or portfolio, you look in and you say, hey, what's the service level that I want? To your point, for compute or for storage, it's really going to end up being the specific SLA, whether that's around the performance, or latency, or cost, or resiliency they want. They want that experience, and that's what they're going to be looking for as the end state; that's what we have to deliver as engineering. >> So there's an opportunity here for you guys that I wonder if you could comment on, and that's the storage admin that EMC essentially created. You've got this army of people who are pretty good at provisioning LUNs, although that's not really a great career path for folks anymore. But programmability is, and so is this notion of infrastructure as code as you make your systems more programmable. Is there a skill set opportunity to take that army of constituents that you guys helped train and grow over their careers and bring them along into the next decade, this new era? >> I think the easy answer is yes. Obviously that's a hard thing to do as you go forward, but I think embracing the change, and the evolution of that change, is a great opportunity. If you step back and think about data management, all data is not created equal, and it has a life cycle, if you will. From edge to core to cloud, think about data vaults and data mobility and all that; there are going to be a bunch of different personas and people touching data along the way. The IT admin and the storage admin are just one of those personas that we have to help serve, and we talk about how we make them heroes, if you will, within their broader environment. If they evolve and really help provide a modern infrastructure that enables infrastructure as code or infrastructure as a service, they become an IT hero for the rest of the team. So I think there's a huge opportunity for them to evolve as the technology evolves. >> Yeah, you talked about your families, your employees, your team; you're obviously focused on them, and you've got your products hitting all the marks. How are you spending your time these days? >> These days, right now, we're in our cycle for fiscal '22 planning, and a lot of that is about the specific markets we're serving; it's about the strategy and making sure we have people focused on those things. So it really comes back to some of the strategy tenets we're driving for next year. As I said, our big focus for this year: one is consolidation of the core markets. A second major focus for me is going to be around winning in storage, and I want to be very specific, it's winning in midrange storage, and that was one of the big reasons why PowerStore came. Then it's really making sure that we're delivering on the as-a-service things we just talked about, in regards to all the technology innovation that's required to really provide that customer experience.
And then, lastly, it's making sure that we take advantage of some of these growth factors. You're going to see, and Dennis probably talked a lot about this, telco, and edge, and as-a-service, and cloud; those things are just going to be key to everything I do. So between core infrastructure and some of these emerging opportunities, that's where I'm spending all my time. >> Well, it's a big business and a really important one for Dell. Jeff Boudreau, thanks so much for coming back in the Cube. Really a pleasure seeing you. I hope we can see each other face to face soon. >> You too. Thank you for having me. >> You're very welcome. And thank you for watching, everybody, keep it right there. This is Dave Vellante for the Cube, our continuing coverage of Dell Tech World 2020. We'll be right back right after this short break.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Telco | ORGANIZATION | 0.99+ |
Jeff Boudreau | PERSON | 0.99+ |
Michael | PERSON | 0.99+ |
Jeff Clark | PERSON | 0.99+ |
Dennis | PERSON | 0.99+ |
Jeff | PERSON | 0.99+ |
telco | ORGANIZATION | 0.99+ |
Dave Volonte | PERSON | 0.99+ |
90% | QUANTITY | 0.99+ |
May 2 | DATE | 0.99+ |
May | DATE | 0.99+ |
Deltek | ORGANIZATION | 0.99+ |
Dennis Hoffman | PERSON | 0.99+ |
85% | QUANTITY | 0.99+ |
John Rose | PERSON | 0.99+ |
March | DATE | 0.99+ |
Dell Technologies | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
Arrow Electronics SAA | ORGANIZATION | 0.99+ |
Dell Technologies | ORGANIZATION | 0.99+ |
September | DATE | 0.99+ |
one team | QUANTITY | 0.99+ |
Power Store | ORGANIZATION | 0.99+ |
10 | QUANTITY | 0.99+ |
next year | DATE | 0.99+ |
nine weeks | QUANTITY | 0.99+ |
Aziz | PERSON | 0.99+ |
Intel | ORGANIZATION | 0.99+ |
2021 | DATE | 0.99+ |
one | QUANTITY | 0.99+ |
fiscal 22 | DATE | 0.99+ |
41 data | QUANTITY | 0.99+ |
both | QUANTITY | 0.98+ |
Del Tech | ORGANIZATION | 0.98+ |
three key components | QUANTITY | 0.98+ |
Onda | ORGANIZATION | 0.98+ |
today | DATE | 0.98+ |
next decade | DATE | 0.98+ |
two tier | QUANTITY | 0.97+ |
Day two | QUANTITY | 0.97+ |
two points | QUANTITY | 0.97+ |
June | DATE | 0.97+ |
about 90% | QUANTITY | 0.97+ |
three tier | QUANTITY | 0.97+ |
first | QUANTITY | 0.96+ |
Apex | ORGANIZATION | 0.96+ |
this year | DATE | 0.96+ |
one year | QUANTITY | 0.96+ |
Fidel | PERSON | 0.96+ |
single | QUANTITY | 0.96+ |
Day one | QUANTITY | 0.96+ |
Fortune 500 Global Elektronik | ORGANIZATION | 0.95+ |
nine launches | QUANTITY | 0.95+ |
Dell Tech | ORGANIZATION | 0.94+ |
Project Monterey | ORGANIZATION | 0.94+ |
Service Storage Network | ORGANIZATION | 0.94+ |
One | QUANTITY | 0.93+ |
Mauritz | PERSON | 0.93+ |
over five times | QUANTITY | 0.93+ |
EMC | ORGANIZATION | 0.91+ |
couple | QUANTITY | 0.9+ |
Cloward | PERSON | 0.88+ |
Bennett | PERSON | 0.88+ |
Leighton | ORGANIZATION | 0.88+ |
six key areas | QUANTITY | 0.87+ |
over 90 | QUANTITY | 0.86+ |
past 12 months | DATE | 0.86+ |
Del Tech World 2020 | EVENT | 0.86+ |
Delta | ORGANIZATION | 0.85+ |
second half | QUANTITY | 0.83+ |
about | DATE | 0.82+ |
two | QUANTITY | 0.81+ |
The Road to Autonomous Database Management: How Domo is Delivering SLAs for Less
Hello everybody, and thank you for joining us today at the virtual Vertica BDC 2020. Today's breakout session is entitled "The Road to Autonomous Database Management: How Domo is Delivering SLAs for Less." My name is Sue LeClair, I'm the director of marketing at Vertica, and I'll be your host for this webinar. Joining me is Ben White, senior database engineer at Domo. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we aren't able to address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started. Ben, over to you. >> Ben: Greetings everyone, and welcome to our virtual Vertica Big Data Conference 2020. Had we been in Boston, the song you would have heard playing in the intro would have been "Boogie Nights" by Heatwave. If you've never heard of it, it's a great song. To fully appreciate that song the way I do, you have to believe that I am a genuine database whisperer. Then you have to picture me at 3 a.m. on my laptop, tailing a Vertica log, getting myself all psyched up. Now, as cool as they may sound, 3 a.m. boogie nights are not sustainable. They don't scale. In fact, today's discussion is really all about how Domo engineered the end of 3 a.m. boogie nights.
Again, I am Ben White, senior database engineer at Domo, and as we heard, the topic today is the road to autonomous database management and how Domo is delivering SLAs for less. The title is a mouthful; in retrospect I probably could have come up with something snazzier, but it is, I think, honest. For me, the most honest word in that title is "road." When I hear that word, it evokes thoughts of the journey and how important it is to just enjoy it. When you truly embrace the journey, often you look up and wonder: how did we get here, where are we, and of course, what's next? Now, I don't intend to come across as too deep here, so I'll submit there's nothing particularly prescient in simply noticing the elephant in the room when it comes to database autonomy; my opinion is merely, and perhaps more accurately, my observation. For context, imagine a place where thousands and thousands of users submit millions of ad-hoc queries every hour. Now imagine someone promised all these users that we could deliver BI leverage at cloud scale in record time. I know what many of you must be thinking: who in the world would do such a thing? Of course, that news was well received, and after the cheers from executives and business analysts everywhere, and the chants of "Keep Calm and Query On," finally started to subside, someone turns and asks, "That's possible? We can do that, right?" Except this is no imaginary place; this is a very real challenge we face at Domo. Through imaginative engineering, Domo continues to redefine what's possible. The beautiful minds at Domo truly embrace the database engineering paradigm that one size does not fit all. That little philosophical nugget is one I picked up while reading the white papers and books of some guy named Stonebraker. So to understand how I, and by extension Domo, came to truly value analytic database administration, look no further than that philosophy and what embracing it would mean. It meant, really, that while others were engineering skyscrapers, we would endeavor to build data neighborhoods with a diverse topology of database configurations. This is where our journey at Domo really gets under way. Without any purposeful intent to define our destination, not necessarily thinking about database as a service or anything like that, we had planned an ecosystem of clusters capable of efficiently performing varied workloads. We achieve this with custom configurations for node count, resource pool configuration parameters, and so on. But it also meant concerning ourselves with the unintended consequences of our ambition: the impact of increased DDL activity on the catalog, system overhead in general, the management requirements of an ever-evolving infrastructure, the multiple points of failure we would be introducing, and the advantages and disadvantages of each. Those types of discussions and considerations really helped to define the basic characteristics of our system: the databases themselves needed to be trivial, redundant, potentially ephemeral, customizable, and above all scalable, and we'll get more into that later. With this knowledge of what we were getting into, automation would have to be an integral part of development; one might even say automation became the first point of interest on our journey. Using popular DevOps tools like SaltStack, Terraform, and ServiceNow, everything would be automated. It included everything from larger multi-step tasks like database designs, database cluster creation, and reboots, to smaller routine tasks like license updates, mergeout, and projection refreshes.
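Much of what that automation manages comes down to ordinary Vertica DDL, for example the per-cluster resource pool configurations mentioned above. As a rough, hypothetical sketch of the kind of statement such tooling might template out (the pool name and values here are illustrative, not Domo's actual settings):

```sql
-- Hypothetical: a dedicated pool for ad-hoc BI queries on one cluster.
CREATE RESOURCE POOL adhoc_pool
    MEMORYSIZE '40%'          -- reserve a share of memory for this workload
    PLANNEDCONCURRENCY 24     -- expected number of concurrent queries
    MAXCONCURRENCY 48         -- hard cap before queries queue
    QUEUETIMEOUT 300          -- seconds a query may wait for resources
    RUNTIMECAP '5 minutes';   -- stop runaway ad-hoc queries

-- Route a service account to the pool.
ALTER USER bi_service RESOURCE POOL adhoc_pool;
```

The point is that once statements like these are parameterized per cluster, the tooling can stamp out differently shaped clusters from the same templates.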
All of this automation certainly made it easier for us to respond to problems within the ecosystem, but these methods alone still left our database administration reactionary, and reacting to an unpredictable stream of slow-query complaints is not a good way to manage a database. In fact, that's exactly how 3 a.m. boogie nights happen, and again, I understand there was a certain appeal to them, but ultimately managing that level of instability is not sustainable. Earlier I mentioned an elephant in the room, which brings us to the second point of interest on our road to autonomy: analytics, more specifically analytic database administration. Why are analytics so important, not just in this case but generally speaking? I mean, we have a whole conference set up to discuss it, and Domo itself is self-service analytics. The answer is curiosity. Analytics is the method by which we feed insatiable human curiosity, and that really is the impetus for analytic database administration. Analytics is also the part of the road I like to think of as a bridge, the bridge, if you will, from automation to autonomy. With that in mind, I say to you, my fellow engineers, developers, and administrators, that as conductors of the symphony of data we call analytics, we have proven to be capable producers of analytic capacity. You take pride in that, and rightfully so. The challenge now is to become more conscientious consumers. In some way, shape, or form, many of you already employ some level of analytics to inform your decisions, but far too often we are using data that would be categorized as lagging. Perhaps you're monitoring slow queries in the Management Console; better still, maybe you consult the workload analyzer; how about a logging and alerting system like Sumo Logic; and if you're lucky, you have Domo, where you monitor and alert on query metrics. These are all examples of analytics that help inform our decisions. Being at Domo, the incorporation of analytics into database administration is very organic, in other words, pretty much company mandated. As a company that provides BI leverage at cloud scale, it makes sense that we would want to use our own product to be better at the business of Domo. Adoption stretches across the entire company, and everyone uses Domo to deliver insights into the hands of the people that need it when they need it most. So it should come as no surprise that we have, from the very beginning, used our own product to make informed decisions as it relates to the application back end. In engineering, we call our internal system Domo for Domo. Domo for Domo, in its current iteration, uses a rules-based engine with elements of machine learning to identify and eliminate conditions that cause slow query performance. Pulling data from a number of sources, including our own, we could identify all sorts of issues like global query performance, actual query count, success rate as a function of query count, and of course environment timeout errors. This was a foundation: the recognition that we should be using analytics to be better conductors of curiosity. These types of real-time alerts were a legitimate step in the right direction for the engineering team, though we saw ourselves in an interesting position. With Domo for Domo, we started exploring the dynamics of using the platform to not only monitor and alert, of course, but to also triage and remediate. Just how much autonomy could we give the application? What were the pros and cons of that?
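To make the "lagging" style of monitoring above concrete, here is a sketch of the sort of query such a slow-query alert might be built on, using Vertica's documented QUERY_REQUESTS system table (the table and columns come from Vertica's documentation rather than from the talk, and the 30-second threshold is an arbitrary example):

```sql
-- Recent statements that ran longer than 30 seconds.
SELECT user_name,
       request_id,
       request_duration_ms / 1000.0 AS seconds,
       LEFT(request, 80)            AS request_snippet
FROM   v_monitor.query_requests
WHERE  request_type = 'QUERY'
  AND  start_timestamp > NOW() - INTERVAL '1 hour'
  AND  request_duration_ms > 30000
ORDER  BY request_duration_ms DESC;
```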
Trust is a big part of that equation: trust in the decision-making process, trust that we can mitigate any negative impacts, and trust in the very data itself. Still, much of the data came from systems that interacted directly, and in some cases indirectly, with the database. By its very nature, much of that data was past tense and limited, things that had already happened, without any reference or correlation to the conditions that led to those events. Fortunately, the Vertica platform holds a tremendous amount of information about the transactions it has performed, its configurations, and the characteristics of its objects like tables, projections, containers, resource pools, and so on. This treasure trove of metadata is collected in the Vertica system tables and the appropriately named Data Collector tables. As of version 9.3, there are over 190 tables that make up the system tables, while the Data Collector is a collection of 215 components. A rich collection can be found in the Vertica system tables: they provide a robust, stable set of views that let you monitor information about your system resources, background processes, workload, and performance, allowing you to more efficiently profile, diagnose, and correlate historical data such as load streams, query profiles, Tuple Mover operations, and more. Here you see a simple query to retrieve the names and descriptions of the system tables, and an example of some of the tables you'll find. The system tables are divided into two schemas: the catalog schema contains information about persistent objects, and the monitor schema tracks transient system state. Most of the tables you find there can be grouped into the following areas: system information, system resources, background processes, and workload and performance. The Vertica Data Collector extends system table functionality by gathering, retaining, and aggregating information about your database, and it makes that information available through system tables. A moment ago I showed you how to get a list of the system tables and their descriptions; here we see how to get that information for the Data Collector tables. With data from the Data Collector tables and the system tables, we now have enough to analyze what we would describe as conditional, or leading, data that allows us to be proactive in our system management. This is a big deal for Domo, and particularly for Domo for Domo, because from here we took the critical next step: we analyze this data for conditions we know or suspect lead to poor performance, and then we can suggest the recommended remediation. Really, for the first time, we were using conditional data to be proactive in our database management, in record time. We track many of the same conditions that Vertica support analyzes via scrutinize, like tables with too many projections, or non-partitioned fact tables, which can negatively affect query performance; and much as Vertica itself suggests, if the table has a date or timestamp column, we recommend partitioning by month. We also track catalog size as a percentage of total memory, with alert thresholds that trigger remediations. Requests per hour is a very important metric in determining when to trigger our scaling solution, and tracking memory usage over time allows us to adjust resource pool parameters to achieve optimal performance for the workload. Of course, the workload analyzer is a great example of analytic database administration; from here one can easily see the logical next step, where we were able to execute these recommendations manually or automatically via some configuration parameter.
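The slides referenced above showed the actual listing queries, which aren't reproduced in the transcript; queries along these lines, against the documented SYSTEM_TABLES and DATA_COLLECTOR views, would produce the listings described:

```sql
-- Names and descriptions of the system tables, by schema.
SELECT table_schema, table_name, table_description
FROM   v_catalog.system_tables
ORDER  BY table_schema, table_name;

-- The equivalent listing for the Data Collector components.
SELECT DISTINCT component, table_name, description
FROM   v_monitor.data_collector
ORDER  BY component;
```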
Now, when I started preparing for this discussion, this slide made a lot of sense as far as the logical next iteration for the workload analyzer, and I left it in because, together with the next slide, it really illustrates how firmly Vertica has its finger on the pulse of the database engineering community. In the 10.0 management console, ta-da, we have the updated workload analyzer: a column has been added to show tuning commands, and the management console lets the user select and run certain recommendations, currently tuning commands around analyzing statistics, but you can see where this is going. For us, using Domo with our Vertica connector, we were able to pull the metadata from all of our clusters; we constantly analyze that data for any number of known conditions, and we build the recommendations into scripts that we can either execute immediately or save for manual execution at a later time. As you would expect, those actions are triggered by thresholds that we can set. From the moment Eon Mode was released to beta, our team began working on a serviceable auto-scaling solution. The elastic nature of Eon Mode's separation of storage and compute clearly lent itself to our ecosystem's requirement for scalability. In building our system, we worked hard to overcome many of the obstacles that came with the more rigid architecture of Enterprise Mode, but with the introduction of Eon Mode, we now have a practical way of giving our ecosystem at Domo the architectural elasticity our model requires. Using analytics, we can now scale our environment to match demand. What we've built is a system that scales without adding management overhead or unnecessary cost, all the while maintaining optimal performance. Really, this is just our journey up to now, which begs the question: what's next? For us, we'll expand the use of Domo for Domo within our own application stack, and maybe more importantly, we'll continue to build logic into the tools we have by bringing machine learning and artificial intelligence to our analysis and decision making. To further illustrate those priorities, we announced support for Amazon SageMaker Autopilot at our Domopalooza conference just a couple of weeks ago. For Vertica, the future must include in-database autonomy; the enhanced capabilities in the new management console are, to me, a clear nod to that future. In fact, with a streamlined and lightweight database design process, all the pieces should be in place for Vertica to deliver autonomous database management itself. We'll see. I would like to thank you for listening, and now of course we will have a Q&A session, hopefully a very robust one. Thank you. [Applause]
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Boston | LOCATION | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
thousands | QUANTITY | 0.99+ |
Domo | ORGANIZATION | 0.99+ |
3 a.m. | DATE | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
today | DATE | 0.99+ |
first time | QUANTITY | 0.98+ |
this week | DATE | 0.97+ |
over 190 tables | QUANTITY | 0.97+ |
two schemas | QUANTITY | 0.96+ |
second point | QUANTITY | 0.96+ |
215 components | QUANTITY | 0.96+ |
first point | QUANTITY | 0.96+ |
three a.m. | DATE | 0.96+ |
Boogie Nights | TITLE | 0.96+ |
millions of ad-hoc queries | QUANTITY | 0.94+ |
Domo | TITLE | 0.93+ |
Vertica Big Data conference 2020 | EVENT | 0.93+ |
Ben white | PERSON | 0.93+ |
10 | QUANTITY | 0.91+ |
thousands of users | QUANTITY | 0.9+ |
one size | QUANTITY | 0.89+ |
saltstack | TITLE | 0.88+ |
4/2 | DATE | 0.86+ |
a couple of weeks ago | DATE | 0.84+ |
Datta | ORGANIZATION | 0.82+ |
end of 3 a.m. | DATE | 0.8+ |
Boogie Nights | EVENT | 0.78+ |
double arrow | QUANTITY | 0.78+ |
every hour | QUANTITY | 0.74+ |
ServiceNow | TITLE | 0.72+ |
DevOps | TITLE | 0.72+ |
Database Management | TITLE | 0.69+ |
su LeClair | PERSON | 0.68+ |
many questions | QUANTITY | 0.63+ |
SLA | TITLE | 0.62+ |
The Road | TITLE | 0.58+ |
Vertica BBC | ORGANIZATION | 0.56+ |
2020 | EVENT | 0.55+ |
database management | TITLE | 0.52+ |
Domo Domo | TITLE | 0.46+ |
version 9 3 | OTHER | 0.44+ |
Vertica Database Designer - Today and Tomorrow
>> Jeff: Hello everybody and thank you for joining us today for the Virtual VERTICA BDC 2020. Today's breakout session has been titled, "VERTICA Database Designer Today and Tomorrow." I'm Jeff Healey, Product VERTICA Marketing, I'll be your host for this breakout session. Joining me today is Yuanzhe Bei, Senior Technical Manager from VERTICA Engineering. But before we begin, (clearing throat) I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions, as we're able to during that time, any questions we don't address, we'll do our best to answer them offline. Alternatively, visit VERTICA forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums, to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button at the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. Now let's get started. Over to you Yuanzhe. >> Yuanzhe: Thanks Jeff. Hi everyone, my name is Yuanzhe Bei, I'm a Senior Technical Manager at VERTICA Server RND Group. I run the query optimizer, catalog and the disaggregated engine team. Very glad to be here today, to talk about, the "VERTICA Database Designer Today and Tomorrow". This presentation will be organized as the following; I will first refresh some knowledge about, VERTICA fundamentals such as Tables and Projections, which will bring to the question, "What is Database Designer?" and "Why we need this tool?". Then I will take you through a deep dive, into a Database Designer or we call DBD, and see how DBD's internals works, after that I'll show you some exciting DBD improvements, we have planned for 10.0 release and lastly, I will share with you, some DBD future roadmap we planned next. As most of you should already know, VERTICA is built on a columnar architecture. That means, data is stored column wise. Here we can see a very simple example, of table with four columns, and the many of you may also know, table in VERTICA is a virtual concept. It's just a logical representation of data, which means user can write SQL query, to reference the table names and column, just like other relational database management system, but the actual physical storage of data, is called Projection. A Projection can reference a subset, or all of the columns all to its anchor table, and must be sorted by at least one column. Each table need at least one C for projection which reference all the columns to the table. If you load data to a table with no projection, and automated, auto production will be created, which will be arbitrarily assorted by, the first couple of columns in the table. As you can imagine, even though such other production, can be used to answer any query, the performance is not optimized in most cases. A common practice in VERTICA, is to create multiple projections, contain difference step of column, and sorted in different ways on the same table. When query is sent to the server, the optimizer will pick the projection, that can answer the query in the most efficient way. 
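To make the projection design choices just described concrete (sort order, column selection, segmentation), here is a hypothetical projection definition; the table and column letters mirror the slide example discussed next, but the DDL itself is illustrative rather than taken from the talk:

```sql
-- Anchor table (logical definition).
CREATE TABLE sales (
    b INT,
    c INT,
    d INT,
    order_ts TIMESTAMP
);

-- A projection that keeps only the columns the workload needs,
-- sorts by the predicate/grouping columns, and segments by hash
-- so the rows are spread across the nodes.
CREATE PROJECTION sales_b_d_c
AS SELECT b, d, c
   FROM sales
   ORDER BY b, d
   SEGMENTED BY HASH(b) ALL NODES;
```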
For example, here you can say, let's say you have a query, that select columns B, D, C and sorted by B and D, the third projection will be ideal, because the data is already sorted, so you can save the sorting costs while executing the query. Basically when you choose the design of the projection, you need to consider four things. First and foremost, of course the sort order. The data already sorted in the right way, can benefit quite a lot of the query actually, like Ordered by, Group By, Analytics, Merge, Join, Predicates and so on. The select column group is also important, because the projection must contain, all the columns referenced by your workflow query. Even missing one column in the projection, this projection cannot be used for a particular query. In addition, VERTICA is the distributed database, and allow projection to be segmented, based on the hash of a set of columns, which is beneficial if the segmentation merged, the join keys or group keys. And finally encoding of each per columns is also part of the design, because the data is sorted in different way, may completely change the optimal encoding for each column. This example only show the benefit of the first two, but you can imagine the rest too are also important. But even for that, it doesn't sound that hard, right? Well I hope you change your mind already when you see this, at least I do. These machine generated queries, really beats me. It will probably take an experienced DBA hours, to figure out which projection can be benefit these queries, not even mentioning there could be hundreds of such queries, in the regular work logs in the real world. So what can we do? That's why we need DBD. DBD is a tool integrated in the VERTICA server, that it can help DBA to perform an access, on their work log query, tabled schema and data, and then automatically figure out, the most optimized projection design for their workload. In addition, DBD also a sophisticated tool, that can take customize by a user, by sending a lot of parameters objectives and so on. And lastly, DBD has access to the optimizer, so DB knows what kind of attribute, the projection need to have, in order to have the optimizer to benefit from them. DBD has been there for years, and I'm sure there are plenty of materials available online, to show you how DBD can be used in different scenarios, whether to achieve the query optimize, or load optimize, whether it's the comprehensive design, or the incremental design, whether it's a dumping deployment script, and manual deployment later, or let the DBD do the order deployment for you, and the many other options. I'm not planning to talk about this today, instead, I will take the opportunity today, to open this black box DBD, and show you what exactly hide inside. DBD is a complex tool and I have tried my best to summarize the DBD design process into seven steps; Extract, Permute, Prune, Build, Score, Identify and Encode. What do they mean? Don't worry, I will show you step by step. The first step is Extract. Extract Interesting Columns. In this step, DBD pass the design queries, and figure out the operations that can be benefited, by the potential projection design, and extract the corresponding columns, as interesting columns. So Predicates, Group By, Order By, Joint Condition, and analytics are all interesting Column to the DBD. As you can see this three simple sample queries, DBD can extract the interest in column sets on the right. Some of these column sets are unordered. 
For example, the green one for Group By a1 and b1, the DBD extracts the interesting column set, and put them in the own orders set, because either data sorted by a1 first or b1 first, can benefit from this Group By operation. Some of the other sets are ordered, and the best example is here, order by clause a2 and b2, and obviously you cannot sort it by b2 and then a2. These interesting columns set will be used as if, to extend to actual projection sort order candidates. The next step is Permute, once DBD extract all the C's, it will enumerate sort order using C, and how does DBD do that? I'm starting with a very simple example. So here you can see DBD can enumerate two sort orders, by extending d1 with the unordered set a1, b1, and the derived at two sort order candidates, d1, a1, b1, and d1, b1, a1. This sort order can benefit queries with predicate on d1, and also benefit queries by Group By a1, b1, when a1, sorry when d1 is constant. So with the same idea, DBD will try to extend other States with each other, and populate more sort order permutations. You can imagine that how many of them, there could be many of them, these candidates, based on how many queries you have in the design and that can be handled of the sort order candidates. That comes to the third step, which is Pruning. This step is to limit the candidates sort order, so that the design won't be running forever. DBD uses very simple capping mechanism. It sorts all the, sort all the candidates, are ranked by length, and only a certain number of the sort order, with longest length, will be moved forward to the next step. And now we have all the sort orders candidate, that we want to try, but whether this sort order candidate, will be actually be benefit from the optimizer, DBD need to ask the optiizer. So this step before that happens, this step has to build those projection candidate, in the catalog. So this step will build, will generates the projection DBL's, surround the sort order, and create this projection in the catalog. These projections won't be loaded with real data, because that takes a lot of time, instead, DBD will copy over the statistic, on existing projections, to this projection candidates, so that the optimizer can use them. The next step is Score. Scoring with optimizer. Now projection candidates are built in the catalog. DBD can send a work log queries to optimizer, to generate a query plan. And then optimizer will return the query plan, DBD will go through the query plan, and investigate whether, there are certain benefits being achieved. The benefits list have been growing over time, when optimizer add more optimizations. Let's say in this case because the projection candidates, can be sorted by the b1 and a1, it is eligible for Group By Pipe benefit. Each benefit has a preset score. The overall benefit score of all design queries, will be aggregated and then recorded, for each projection candidate. We are almost there. Now we have all the total benefit score, for the projection candidates, we derived on the work log queries. Now the job is easy. You can just pick the sort order with the highest score as the winner. Here we have the winner d1, b1 and a1. Sometimes you need to find more winners, because the chosen winner may only benefit a subset, of the work log query you provided to the DBD. So in order to have the rest of the queries, to be also benefit, you need more projections. 
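The "benefits" DBD scores during this step are the same ones you can spot by hand in a query plan. A hedged illustration of that manual check, reusing the object names from the sketch earlier (the exact plan text varies by version):

```sql
-- Ask the optimizer for the plan it would use.
EXPLAIN
SELECT b, d, SUM(c)
FROM   sales
GROUP  BY b, d;

-- If a projection sorted on (b, d) is chosen, the plan reports
-- GROUPBY PIPELINED; without a suitable sort order it falls back
-- to GROUPBY HASH, which needs more memory and extra sorting work.
```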
So in this case, DBD will go to the next iteration, and let's say in this case find to another winner, d1, c1, to benefit the work log queries, that cannot be benefit by d1, b1 and a1. The number of iterations and thus the winner outcome, DBD really depends on the design objective that uses that. It can be load optimized, which means that only one, super projection winner will be selected, or query optimized, where DBD try to create as many projections, to cover most of the work log queries, or somewhat balance an objective in the middle. The last step is to decide encoding, for each projection columns, for the projection winners. Because the data are sorted differently, the encoding benefits, can be very different from the existing projection. So choose the right projection encoding design, will save the disk footprint a significant factor. So it's worth the effort, to find out the best thing encoding. DBD picks the encoding, based on the actual sampling the data, and measure the storage footprint. For example, in this case, the projection winner has three columns, and say each column has a few encoding options. DBD will write the sample data in the way this projection is sorted, and then you can see with different encoding, the disk footprint is different. DBD will then compare the disk footprint of each, of different options for each column, and pick the best encoding options, based on the one that has the smallest storage footprint. Nothing magical here, but it just works pretty well. And basic that how DBD internal works, of course, I think we've heard it quite a lot. For example, I didn't mention how the DBD handles segmentation, but the idea is similar to analyze the sort order. But I hope this section gave you some basic idea, about DBD for today. So now let's talk about tomorrow. And here comes the exciting part. In version 10.0, we significantly improve the DBD in many ways. In this talk I will highlight four issues in old DBD and describe how the 10.0 version new DBD, will address those issues. The first issue is that a DBD API is too complex. In most situations, what user really want is very simple. My queries were slow yesterday, with the new or different projection can help speed it up? However, to answer a simple question like this using DBD, user will be very likely to have the documentation open on the side, because they have to go through it's whole complex flow, from creating a projection, run the design, get outputs and then create a design in the end. And that's not there yet, for each step, there are several functions user need to call in order. So adding these up, user need to write the quite long script with dozens of functions, it's just too complicated, and most of you may find it annoying. They either manually tune the projection to themselves, or simply live with the performance and come back, when it gets really slow again, and of course in most situations, they never come back to use the DBD. In 10.0 VERTICA support the new simplified API, to run DBD easily. There will be just one function designer_single_run and one argument, the interval that you think, your query was slow. In this case, user complained about it yesterday. So what does this user to need to do, is just specify one day, as argument and run it. The user don't need to provide anything else, because the DBD will look up his query or history, within that time window and automatically populate design, run design and export the projection design, and the clean up, no user intervention needed. 
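Based on the description above (one function, one interval argument), the new 10.0 call would look something like the following; the exact signature should be checked against the 10.0 documentation before relying on it:

```sql
-- Design against the queries run in the last day, then deploy
-- the resulting projection design automatically.
SELECT designer_single_run('1 day');
```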
No need to have the documentation on the side and carefully write a script, and a debug, just one function call. That's it. Very simple. So that must be pretty impressive, right? So now here comes to another issue. To fully utilize this single round function, users are encouraged to run DBD on the production cluster. However, in fact, VERTICA used to not recommend, to run a design on a production cluster. One of the reasons issue, is that DBD picks massive locks, both table locks and catalog locks, which will badly interfere the running workload, on a production cluster. As of 10.0, we eliminated all the table and ten catalog locks from DBD. Yes, we eliminate 100% of them, simple improvement, clear win. The third issue, which user may not be aware of, is that DBD writes intermediate result. into real VERTICA tables, the real DBD have to do that is, DBD is the background task. So the intermediate results, some user needs to monitor it, the progress of the DBD in concurrent session. For complex design, the intermediate result can be quite massive, and as a result, many lost files will be created, and written to the disk, and we should both stress, the catalog, and that the disk can slow down the design. For ER mode, it's even worse because, the table are shared on communal storage. So writing to the regular table, means that it has to upload the data, to the communal storage, which is even more expensive and disruptive. In 10.0, we significantly restructure the intermediate results buffer, and make this shared in memory data structure. Monitoring queries will go directly look up, in memory data structure, and go through the system table, and return the results. No Intermediate Results files will be written anymore. Another expensive lubidge of local disk for DBD is encoding design, as I mentioned earlier in the deep dive, to determine which encoding works the best for the new projection design, there's no magic way, but the DBD need to actually write down, the sample data to the disk, using the different encoding options, and to find out which ones have the smallest footprint, or pick it as the best choice. These written sample data will be useless after this, and it will be wiped out right away, and you can imagine this is a huge waste of the system resource. In 10.0 we improve this process. So instead of writing, the different encoded data on the disk, and then read the file size, DBD aggregate the data block size on-the-fly. The data block will not be written to the disk, so the overall encoding and design is more efficient and non-disruptive. Of course, this is just about the start. The reason why we put a significant amount of the resource on the improving the DBD in 10.0, is because the VERTICA DBD, as essential component of the out of box performance design campaign. To simply illustrate the timeline, we are now on the second step, where we significantly reduced, the running overhead of the DBD, so that user will no longer fear, to run DBD on their production cluster. Please be noted that as of 10.0, we haven't really started changing, how DBD design algorithm works, so that what we have discussed in the deep dive today, still holds. For the next phase of DBD, we will briefly make the design process smarter, and this will include better enumeration mechanism, so that the pruning is more intelligence rather than brutal, then that will result in better design quality, and also faster design. The longer term is to make DBD to achieve the automation. 
What entail automation and what I really mean is that, instead of having user to decide when to use DBD, until their query is slow, VERTICA have to know, detect this event, and have have DBD run automatically for users, and suggest the better projections design, if the existing projection is not good enough. Of course, there will be a lot of work that need to be done, before we can actually fully achieve the automation. But we are working on that. At the end of day, what the user really wants, is the fast database, right? And thank you for listening to my presentation. so I hope you find it useful. Now let's get ready for the Q&A.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff | PERSON | 0.99+ |
Yuanzhe Bei | PERSON | 0.99+ |
Jeff Healey | PERSON | 0.99+ |
100% | QUANTITY | 0.99+ |
forum.vertica.com | OTHER | 0.99+ |
one day | QUANTITY | 0.99+ |
second step | QUANTITY | 0.99+ |
third step | QUANTITY | 0.99+ |
tomorrow | DATE | 0.99+ |
third issue | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
First | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
Each benefit | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
third projection | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
b2 | OTHER | 0.99+ |
each column | QUANTITY | 0.99+ |
first issue | QUANTITY | 0.99+ |
one column | QUANTITY | 0.99+ |
three columns | QUANTITY | 0.99+ |
VERTICA Engineering | ORGANIZATION | 0.99+ |
Yuanzhe | PERSON | 0.99+ |
each step | QUANTITY | 0.98+ |
Each table | QUANTITY | 0.98+ |
first step | QUANTITY | 0.98+ |
DBD | TITLE | 0.98+ |
DBD | ORGANIZATION | 0.98+ |
seven steps | QUANTITY | 0.98+ |
DBL | ORGANIZATION | 0.98+ |
each | QUANTITY | 0.98+ |
one argument | QUANTITY | 0.98+ |
VERTICA | TITLE | 0.98+ |
each projection | QUANTITY | 0.97+ |
first two | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
this week | DATE | 0.97+ |
hundreds | QUANTITY | 0.97+ |
one function | QUANTITY | 0.97+ |
clause a2 | OTHER | 0.97+ |
one | QUANTITY | 0.97+ |
each per columns | QUANTITY | 0.96+ |
Tomorrow | DATE | 0.96+ |
both | QUANTITY | 0.96+ |
four issues | QUANTITY | 0.95+ |
VERTICA | ORGANIZATION | 0.95+ |
b1 | OTHER | 0.95+ |
single round | QUANTITY | 0.94+ |
4/2 | DATE | 0.94+ |
first couple of columns | QUANTITY | 0.92+ |
VERTICA Database Designer Today and Tomorrow | TITLE | 0.91+ |
Vertica | ORGANIZATION | 0.91+ |
10.0 | QUANTITY | 0.89+ |
one function call | QUANTITY | 0.89+ |
a1 | OTHER | 0.89+ |
four things | QUANTITY | 0.88+ |
c1 | OTHER | 0.87+ |
two sort order | QUANTITY | 0.85+ |
Sizing and Configuring Vertica in Eon Mode for Different Use Cases
>> Jeff: Hello everybody, and thank you for joining us today, in the virtual Vertica BDC 2020. Today's Breakout session is entitled, "Sizing and Configuring Vertica in Eon Mode for Different Use Cases". I'm Jeff Healey, and I lead Vertica Marketing. I'll be your host for this Breakout session. Joining me are Sumeet Keswani, and Shirang Kamat, Vertica Product Technology Engineers, and key leads on the Vertica customer success needs. But before we begin, I encourage you to submit questions or comments during the virtual session, you don't have to wait, just type your question or comment in the question box below the slides, and click submit. There will be a Q&A session at the end of the presentation, we will answer as many questions as we're able to during that time, any questions we don't address, we'll do our best to answer them off-line. Alternatively, visit Vertica Forums, at forum.vertica.com, post your question there after the session. Our Engineering Team is planning to join the forums to keep the conversation going. Also as reminder, that you can maximize your screen by clicking the double arrow button in the lower-right corner of the slides, and yes, this virtual session is being recorded, and will be available to view on-demand this week. We'll send you a notification as soon as it's ready. Now let's get started! Over to you, Shirang. >> Shirang: Thanks Jeff. So, for today's presentation, we have picked Eon Mode concepts, we are going to go over sizing guidelines for Eon Mode, some of the use cases that you can benefit from using Eon Mode. And at last, we are going to talk about, some tips and tricks that can help you configure and manage your cluster. Okay. So, as you know, Vertica has two modes of operation, Eon Mode and Enterprise Mode. So the question that you may have is, which mode should I implement? So let's look at what's there in the Enterprise Mode. Enterprise Mode, you have a cluster, with general purpose compute nodes, that have locally at their storage. Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, amount of data that you can store in Enterprise Mode cluster, depends on the total disk capacity of the cluster. Again, Enterprise Mode is more suitable for on premise and cloud deployments. Now, let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, we have compute and storage, that are separated by introducing S3 Bucket, or, S3 compliant storage. Now because of this separation of compute and storage, you can take advantages like mapping all dynamic scale-out and scale-in. Isolation of your workload, as well as you can load data in your cluster, without having to worry about the total disk capacity of your local nodes. Obviously, you know, it's obvious from what they accept, Eon Mode is suitable for cloud deployment. Some of our customers who take advantage of the features of Eon Mode, are also deploying it on premise, by introducing S3 compliant slash web storage. Okay? So, let's look at some of the terminologies used in Eon Mode. The four things that I want to talk about are, communal storage. It's a shared storage, or S3 compliant shared storage, a bucket that is accessible from all the nodes in your cluster. Shard, is a segment of data, stored on the communal storage. Subscription, is the binding with nodes and shards. And last, depot. 
Depot is a local copy or, a local cache, that can help query in group performance. So, shard is a segment of data stored in communal storage. When you create a Eon Mode cluster, you have to specify the shard count. Shard count decide the maximum number of nodes that will participate in your query. So, Vertica also will introduce a shard, called replica shard, that will hold the data for replicated projections. Subscriptions, as I said before, is a binding between nodes and shards. Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it for case 50. Subscribing nodes are responsible for writing and reading from shard data. Also subscriber node holds up-to-date metadata for a catalog of files that are present in the shard. So, when you connect to Vertica node, Vertica will automatically assign you set of nodes and subscriptions that will process your query. There are two important system tables. There are node subscriptions, and session subscriptions, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. So, on local disk, you have depot. Depot is a local file system cache, that can hold subset of the data, or copy of the data, in communal storage. Other things that are there, are temp storage, temp storage is used for storing data belonging to temporary tables, and, the data that spills through this, when you are processing queries. And last, is catalog. Catalog is a persistent copy of Vertica, catalog that is written to this. The writes happen at every commit. You only need the persistent copy at node startup. There is also a copy of Vertica catalog, stored in communal storage, called durability. The local copy is synced to the copy in communal storage via service, at the interval of five minutes. So, let's look at depot. Now, as I said before, depot is your file system cache. It's help to reduce network traffic, and slow performance of your queries. So, we make assumption, that when we load data in Vertica, that's the data that you may most frequently query. So, every data that is loaded in Vertica is first entering the depot, and then as a part of same transaction, also synced to communal storage for durability. So, when you query, when you run a query against Vertica, your queries are also going to find the files in the depot first, to be used, and if the files are not found, the queries will access files from communal storage. Now, the behavior of... you know, the new files, should first enter the depot or skip depot can be changed by configuration parameters that can help you skip depot when writing. When the files are not found in depot, we make assumption that you may need those files for future runs of your query. Which means we will fetch them asynchronously into the depot, so that you have those files for future runs. If that's not the behavior that you intend, you can change configuration around return, to tell Vertica to not fetch them when you run your query, and this configuration parameter can be set at database level, session level, query level, and we are also introducing a user level parameter, where you can change this behavior. Because the depot is going to be limited in size, compared to amount of data that you may store in your Eon cluster, at some point in time, your depot will be full, or hit the capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least frequently used. 
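A sketch of the kind of statements these points translate into. NODE_SUBSCRIPTIONS is named in the talk; the depot parameter names below (UseDepotForReads, UseDepotForWrites) and the exact column names come from Vertica's Eon Mode documentation rather than the talk, so verify them against your version:

```sql
-- Which nodes subscribe to which shards, and which subscription is primary.
SELECT node_name, shard_name, is_primary, subscription_state
FROM   v_catalog.node_subscriptions
ORDER  BY shard_name, node_name;

-- Read straight from communal storage for this session (skip the depot on reads).
ALTER SESSION SET UseDepotForReads = 0;

-- Have newly loaded data skip the depot database-wide.
ALTER DATABASE DEFAULT SET UseDepotForWrites = 0;
```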
Hence, the depot is going to be your query performance enhancer, and you want to shape the contents of your depot. And so what you want to do is decide what shall be in your depot. Now, Vertica provides some policies, called pinning policies, that can help you pin a table, or a partition of a table, into the depot, at the subcluster level, or at the database level. And Sumeet will talk about this a bit more in his later slides. Now, look at some of the system tables that can help you understand the size of the depot, what's in your depot, what files were evicted, and what files were recently fetched into the depot. One of the important system tables that I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out if your transaction or query fetched data from the depot, from communal storage, or both. One of the important features of Eon Mode is subclusters. Vertica lets you divide your cluster into smaller execution groups. Now, each of the execution groups has a set of nodes that together subscribe to all the shards, and can process your query independently. So when you connect to one node in a subcluster, that node, along with the other nodes in the subcluster, will process your query on its own. And because of that, we can achieve isolation, as well as, you know, scale-out and scale-in, without impacting what's happening on the rest of the cluster. The good thing about subclusters is that all the subclusters have access to the communal storage. And because of this, if you load data in one subcluster, it's accessible to the queries that are running in other subclusters. When we introduced subclusters, we knew that our customers would really love these features, and some of the things that we were considering were: we knew that our customers would dynamically scale out and in, they would add and remove lots of subclusters on demand, and we had to provide the ability to add and remove subclusters in a fast and reliable way. We knew that during off-peak hours, our customers would shut down many of their subclusters, which means more than half of the nodes could be down, and we had to make adjustments to our quorum policy, which requires at least half of the nodes to be up for the database to stay up. We were also aware that customers would add hundreds of nodes to the cluster, which means we had to make adjustments to the catalog and commit policy. To take care of all three of these requirements, we introduced two types of subclusters: primary subclusters, and secondary subclusters. The primary subcluster is the one that you get by default when you create your first Eon cluster. The nodes in the primary subcluster are always up, that means they stay up and participate in the quorum. The nodes in the primary subcluster are responsible for processing commits, and also maintain a persistent copy of the catalog on disk. This is the subcluster that you would use to process all your ETL jobs, because the Tuple Mover also runs on the nodes in the primary subcluster. If you now, at this point, want another subcluster where you would like to run queries, and also scale this subcluster up and down depending on the demand or the workload, you would create a new subcluster. And this subcluster will be of type secondary in nature. Now, secondary subclusters have nodes that don't participate in the quorum, so if these nodes are down, there is no impact on Vertica.
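To make the depot observability point concrete, here is a rough sketch using DC_FILE_READS, the Data Collector table named above, plus a way to discover the other depot-related system tables without guessing their names; the LIMIT and the ILIKE pattern are just illustrative choices.

    -- Sketch: sample recent file-read records to see whether reads were
    -- served from the depot or from communal storage.
    SELECT * FROM dc_file_reads LIMIT 50;

    -- Sketch: list depot-related system tables available in your version
    -- (eviction, fetch, and pinning tables show up here).
    SELECT table_schema, table_name
    FROM system_tables
    WHERE table_name ILIKE '%depot%';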
These nodes are also not responsible for processing commits, though they maintain up-to-date copies of the catalog in memory. They don't store the catalog on disk. And these are subclusters that you can add and remove very quickly, without impacting what is running on the other subclusters. We have customers running hundreds of nodes, subclusters with hundreds of nodes, and subclusters of sizes like 64 nodes, and they can bring these subclusters up and down, or add and remove them, within a few minutes. So before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with some of our customers who are running Eon Mode and getting feedback from them on a regular basis. And based on that feedback, we are making lots of improvements and fixes in every hotfix that we put out. So if you are running Eon Mode, and want to be part of this group, I suggest that you keep your cluster current with the latest hotfixes and work with us to give us feedback, and get the improvements that you need to be successful. So let's look at what we need to size Eon clusters. Sizing Eon clusters is very different from sizing Enterprise Mode clusters. When you are sizing a Vertica cluster running Enterprise Mode, you need to take into account the amount of data that you want to store, and the configuration of your nodes. Depending on that, you decide how many nodes you will need, and then start the cluster. In Eon Mode, to size a cluster, you need a few things, like what your shard count should be. Now, the shard count decides the maximum number of nodes that will participate in your query. And we'll talk about this a little bit more in the next slide. You will decide on the number of nodes that you will need within a subcluster, the instance type you will pick for running this subcluster, how many subclusters you will need, how many of them should be running all the time, and how many should be running in a dynamic mode. When it comes to shard count, you have to pick the shard count up front, and you can't change it once your database is up and running. So, you need to pick the shard count based on the number of nodes that you will need to process a query. Now, one thing that we want to remember here is that this is not the amount of data that you have in the database, but the amount of data your queries will process. So, you may have data for six years, but if your queries process the last month of data on most occasions, or if your dashboards are processing up to six weeks, or ten minutes, based on whatever your needs are, you will pick the number of shards, the shard count, and the nodes based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of them. And that means the maximum number of nodes in a subcluster that will process queries is going to be 12. If you feel that you need more than 12 nodes to process your query, you can pick other numbers, like 24 or 48. If you pick a higher number, like 48, and you go with three nodes in your subcluster, that means each node subscribes to 16 primary and 16 secondary shard subscriptions, which totals 32 subscriptions per node. That will leave your catalog in a bloated state.
So, pick your shard count appropriately. Don't pick prime numbers. We suggest 12 should work for most of our customers, and if you think your queries process more than the regular amount, say terabytes of data, then pick a number like 24. Don't pick a prime number. Okay? We are also coming up with features in Vertica, like crunch scaling, that will help you run queries on more nodes than the number of shards that you pick. And that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing is, you need to pick how many nodes you need within your subclusters to process your query. The ideal number would be a node count equal to the shard count, or, if you want to pick a number that is less, pick a node count such that each of the nodes has a balanced distribution of subscriptions. So over here, you have an option where you can have 12 nodes and 12 shards, or you can have two subclusters with 6 nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for batch applications, whereas two subclusters with six nodes each are more suitable for dashboard-type applications. Picking subclusters depends on your workload; you can add or remove subclusters for workload isolation, or for elastic throughput scaling. Different subclusters can have nodes of different sizes, but you need to make sure that the nodes within a subcluster are homogeneous. So this is my last slide before I hand over to Sumeet. And this, I think, is a very important slide that I want you to pay attention to. When you pick an instance, you are going to pick it based on workload and query budget. I want to make it clear here that we want you to pay attention to the local disk, because you have the depot on your local disk, which is going to be your query performance enhancer for all kinds of deployments, in the cloud as well as on premise. So contrary to what you may have read or heard, depots still play a very important role in every Eon deployment, and they act like performance enhancers. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on that performance. So pick nodes with some amount of local disk; at least two terabytes is what we suggest. i3 instances in Amazon, you know, come with a good local disk that is very helpful, and some of our customers are benefiting from them. With that said, I want to pass it over to Sumeet. >> Sumeet: So, hi everyone, my name is Sumeet Keswani, and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of SQL, and then I'll blend that into the best practices in Eon Mode. And finally, we'll go through some tips and tricks. So let's get started with the use cases. So a very basic use case that users will encounter when they start Eon Mode for the first time is that they will have two subclusters. The first subcluster will be the primary subcluster, used for ETL, like Shirang mentioned. And this subcluster will be mostly on, or always on. And there will be another subcluster used purely for queries. And this subcluster is the secondary subcluster, and it will be on sometimes, depending on the use case.
Maybe from nine to five, or Monday to Friday, depending on what application is running on it, or what users are doing on it. So this is the most basic use case, something users get started with to get their feet wet. Now, as the use of the deployment of Eon Mode with subclusters increases, users will graduate into the second use case. And this is the next level of deployment. In this situation, they still have the primary subcluster which is used for ETL, typically a larger subcluster where there is heavier ETL running, pretty much non-stop. Then they have the usual query subcluster which they will use for queries, but they may add another one, another secondary subcluster, for ad-hoc workloads. The motivation for this subcluster is to isolate the unpredictable workload from the predictable workload, so that they do not impact each other. So you may keep ad-hoc queries, or users that are running larger queries, or bad workloads that occur once in a while, on a different secondary subcluster, so as to not impact the more predictable workload running on the first subcluster. Now, there is no reason why these two subclusters need to have the same instances; they can have different numbers of nodes, different instance types, different depot configurations. Everything can be different. Another benefit is that they can be metered differently, they can be costed differently, so that the appropriate user or tenant can be billed the cost of compute. Now, as the use increases even further, this is what we see as the final state of a very advanced Eon Mode deployment here. As you see, there is the primary subcluster, of course, used for ETL, very heavy ETL, and that's always on. There are numerous secondary subclusters, some for predictable applications that have a very fine-tuned workload that needs definite performance. There are other subclusters that have different usages, some for ad-hoc queries, others for demanding tenants, and there could be still more subclusters for different departments, like Finance, that need it maybe at the end of the quarter. So very, very different applications, and this is the full and final promise of Eon, where there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. So what is Hibernate? Hibernating a Vertica database is the act of dissociating all the compute from the database, and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down, and you do not pay for compute anymore. If you have reserved instances, or any other instances, you can use them for different applications, and your Vertica database is shut down. So this is very similar to stopping a database; in Eon Mode, you're stopping all compute. The benefit, of course, being that you pay nothing anymore for compute. So what is Revive, then? Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket, or your storage, and start up the database. There is one limitation here that you should be aware of, which is that the size of the database that you had when you hibernated is the size at which you must revive it. So if you had a 12-node primary subcluster when hibernating, you need to provision 12 nodes in order to revive.
So one best practice comes down to this: you should shrink your database to the smallest size possible before you hibernate, so that you can revive it at the same size, and you don't have to spin up a ton of compute in order to revive. So basically, what this means is, when you have decided to hibernate, we ask you to remove all your secondary subclusters and shrink your primary subcluster down to the bare minimum before you hibernate it. The benefit being that when you do revive, you will be able to do so with the minimum number of nodes. And of course, before you hibernate, you must cleanly shut down the database, so that all the data can be synced to S3. Finally, let's talk about backups and replication. Backups and replication are still supported in Eon Mode. We sometimes get the question, "We're in S3, and S3 has nine nines of reliability, do we need a backup?" Yes, we highly recommend backups. You can back up by using the vbr script, you can back up your database to another bucket, and you can also copy the bucket and revive a different instance of your database. This is very useful because many times people want staging or development databases, and they need some of the data from production, and this is a nice way to get that. And it also makes sure that if you accidentally delete something, you will be able to get back your data. Okay, so let's go into best practices now. I will start with the depot, which is the biggest performance enhancer that we see for queries. So, I want to state very clearly that reading from S3, or a remote object store like S3, is very slow, because data has to go over the network, and it's very expensive; you will pay for access costs. This is where S3 is not very cheap: every time you access the data, there is an API access cost levied. Now, the depot is a performance enhancing feature that will improve the performance of queries by keeping a local cache of the data that is most frequently used. It will also reduce the cost of accessing the data, because you no longer have to go to the remote object store to get the data, since it's available on a local and permanent volume. Hence, depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is, if you are going to use a specific table or partition frequently, you can choose to pin it in the depot, so that if your depot is under pressure or is highly utilized, these objects that are most frequently used are kept in the depot. So therefore, depot shaping is the act of setting eviction policies, whereby you prevent the eviction of files that you believe you need to keep. So for example, you may keep the most recent year's data, or the most recent partition, in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not subcluster-based. Future versions of Vertica will allow fine-tuning of the depot based on each subcluster. So, let's now go and understand a little bit of the internals of how a SQL query works in Eon Mode. And once I explain this, we will blend into best practices, and it will become much clearer why we recommend certain things. So, S3 is our layer of durability, where data is persistent in an Eon database. When you run an insert query, like INSERT INTO table VALUES (1), or something similar, data is synchronously written into S3.
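Here is a minimal sketch of the pinning policies described above; the function names follow Vertica's depot pin policy functions, but the table name, the partition range, and the exact signatures are assumptions to verify against your version's documentation.

    -- Sketch: pin a frequently queried table so it resists depot eviction.
    -- 'public.sales' is a hypothetical table.
    SELECT SET_DEPOT_PIN_POLICY_TABLE('public.sales');

    -- Sketch: pin only a recent partition range (min and max partition keys
    -- as strings; the signature may differ by version).
    SELECT SET_DEPOT_PIN_POLICY_PARTITION('public.sales', '2020-01-01', '2020-03-31');

    -- Review existing pin policies, and clear one if it is no longer needed.
    SELECT * FROM depot_pin_policies;
    SELECT CLEAR_DEPOT_PIN_POLICY_TABLE('public.sales');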
So, before control returns back to the client, a copy of the data is first stored in the local depot, and then uploaded to S3. And only then do we hand control back to the client. This ensures that if something bad were to happen, the data will be persistent. The second type of SQL transaction is what we call DDLs, which are catalog operations. So for example, you create a table, or you add a column. These operations are actually working with metadata. Now, as you may know, S3 does not offer mutable storage; the storage in S3 is immutable. You can never append to a file in S3. And the way transaction logs work is that they are append operations. So when you modify the metadata, you are actually appending to a transaction log. So this poses an interesting challenge, which we resolve by appending to the transaction log locally in the catalog, and then there is a service that syncs the catalog to S3 every five minutes. This poses an interesting consideration, right? If you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. And I'll speak to this more in the subsequent slides. Now, finally, let's look at drops or truncates in Eon. Now, a drop or a truncate is really a combination of the first two things that we spoke about. When you drop a table, you are making a metadata change. You are telling Vertica that this table no longer exists, so we go into the transaction log and append to the transaction log that this table has been removed. This log, of course, will be synced every five minutes to S3, like we spoke about. There is also the secondary operation of deleting all the files that were associated with data in this table. Now, these files are on S3. And we could go about deleting them synchronously, but that would take a lot of time. And we do not want to hold up the client for this duration. So at this point, we do not synchronously delete the files; we put the files that need to be removed in a reaper queue, and return control back to the client. And this has the performance benefit that the drops appear to occur really fast. This also has a cost benefit: batching deletes, in big batches, is more performant and less costly. For example, on Amazon, you could delete 1,000 files at a time in a single call. So if you batch your deletes, you can delete them very quickly. The disadvantage of this is that if you were to terminate a Vertica cluster abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete these files. Okay, so let's go into best practices after understanding some technical details. So, as I said, reading and writing to S3 is slow and costly. So, the first thing you can do is avoid as many round trips to S3 as possible. The bigger the batches of data you load, the better performance you get per commit. The next thing is, don't read and write from S3 if you can avoid it. A lot of our customers have intermediate data processing, where they temporarily transform the data before finally committing it. There is no reason to use regular tables for this kind of intermediate data. We recommend using local temporary tables, and local temporary tables have the benefit of not having to upload data to S3. Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions.
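A minimal sketch of the local temporary table pattern recommended above for intermediate transformations; the table and column names are hypothetical.

    -- Sketch: stage intermediate data in a local temporary table so it is
    -- never uploaded to communal storage (S3).
    CREATE LOCAL TEMPORARY TABLE staging_events (
        event_time TIMESTAMP,
        user_id    INT,
        payload    VARCHAR(1000)
    ) ON COMMIT PRESERVE ROWS;

    -- Transform in the temporary table as many times as needed...
    INSERT INTO staging_events
    SELECT event_time, user_id, UPPER(payload)
    FROM raw_events
    WHERE event_time >= '2020-03-01';

    -- ...and write to a regular, communal-storage-backed table only once.
    INSERT INTO cleaned_events SELECT * FROM staging_events;
    COMMIT;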
Active partitions are the ones where you have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions, like, consider last year's data, or the year before that. Those partitions are aggressively merged into a single container. And how do we know how many partitions are active and inactive? Well, that's based on a configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging these containers, so we download the entire partition, merge the records that you loaded into it, and upload it back again. This creates a lot of network traffic, and as I said, accessing data from S3 is slow and costly. So we recommend you not load into inactive partitions. You should load into the most recent or active partitions, and if you happen to load into inactive partitions, set your active partition count correctly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were to terminate a Vertica cluster without allowing enough time for these files to get deleted, you could leak files in S3. Now, of course, if you use local temporary tables, this problem does not occur, because the files were never created in S3. But if you are using regular tables, you must allow Vertica enough time to delete these files, and you can change the interval at which we delete, and how much time we allow to delete and shut down, by editing some configuration parameters that I have mentioned here. And, yeah. Okay, so let's talk a little bit about the catalog at this point. So, the catalog is synced every five minutes onto S3 for persistence. And the catalog truncation version is the minimum viable version of the catalog to which we can revive. So, for instance, if somebody destroyed a Vertica cluster, the entire Vertica cluster, the catalog truncation version is the minimum viable version that you will be able to revive to. Now, in order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly. This allows the catalog to be synced to S3. Now, here are some SQL commands that you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you're shutting down cleanly, so this is only in cases of disaster, or some event where all nodes were terminated without the user's permission. And finally, let's talk about backups. So, one more time, we highly recommend that you take backups. You know, S3 is designed for 99.9% availability, so there could be, maybe, an occasional downtime; making sure you have backups will help you if you accidentally drop a table. S3 will not protect you against data that was deleted by accident, so having a backup helps you there. And why not back up, right? Storage is cheap. You can replicate the entire bucket and have that as a backup, or have DR, where you're running in a different region, which also serves as a backup. So, we highly recommend that you make backups. So, with this, I would like to end my presentation, and we're ready for any questions if you have them. Thank you very much. Thank you very much.
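As a footnote to the active partition guidance in this session, a small sketch of setting the active partition count on a table; the table name is hypothetical and the exact clause is worth confirming against your version's documentation.

    -- Sketch: declare how many of the newest partitions of this table count
    -- as active, so loads into them are not aggressively re-merged.
    ALTER TABLE public.sales SET ACTIVEPARTITIONCOUNT 2;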
Autonomous Log Monitoring
>> Sue: Hi everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Autonomous Monitoring Using Machine Learning". My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slide and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer them offline. Alternatively, you can also go and visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Larry, over to you. >> Larry: Hey, thanks so much. So hi, my name's Larry Lancaster and I'm here to talk to you today about something whose time I think has come, and that's autonomous monitoring. So, with that, let's get into it. So, machine data is my life. I know that's a sad life, but it's true. So I've spent most of my career kind of taking telemetry data from products, either in the field, as we used to call it, or nowadays, deployed, and bringing that data back, like log files and stats, and then building stuff on top of it. So, tools to run the business, or services to sell back to users and customers. And so, after doing that a few times, it kind of got to the point where I was really sort of sick of building the same kind of thing from scratch every time, so I figured, why not go start a company and do it so that we don't have to do it manually ever again. So, it's interesting to note, I've put a little sentence here saying, "companies where I got to use Vertica". So I've been actually kind of working with Vertica for a long time now, pretty much since they came out of alpha. And I've really been enjoying their technology ever since. So, our vision is basically that I want a system that will characterize incidents before I notice. So an incident is, you know, we used to call it a support case or a ticket in IT, or a support case in support. Nowadays, you may have a DevOps team, or a set of SREs who are monitoring a production sort of deployment. And so they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened. And so that's a pretty heady goal. And so I'm going to talk a little bit today about how we do that. So, if we look at logs in particular. Logs today, if you look at log monitoring. So monitoring is kind of that whole umbrella term that we use to talk about how we monitor systems in the field that we've shipped, or how we monitor production deployments in a more modern stack. And so basically there are log monitoring tools. But they have a number of drawbacks.
For one thing, they're kind of slow in the sense that if something breaks and I need to go to a log file, actually chances are really good that if you have a new issue, if it's an unknown unknown problem, you're going to end up in a log file. So the problem then becomes basically you're searching around looking for what's the root cause of the incident, right? And so that's kind of time-consuming. So, they're also fragile, and this is largely because log data is completely unstructured, right? So there's no formal grammar for a log file. So you have this situation where, if I write a parser today, and that parser is going to do something, it's going to execute some automation, it's going to open or update a ticket, it's going to maybe restart a service, or whatever it is that I want to happen. What'll happen is later upstream, someone who's writing the code that produces that log message, they might do something really useful for me, or for users. And they might go fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks. So it's a very fragile source for automation. And finally, because of that, people will set alerts on, "Oh, well tell me how many thousands of errors are happening every hour." Or some horrible metric like that. And then that becomes the only visibility you have in the data. So because of all this, it's a very human-driven, slow, fragile process. So basically, we've set out to kind of up-level that a bit. So I touched on this already, right? The truth is if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder, if that's the case, why do most people use metrics only for monitoring? And the reason is related to the problems I just described. They're already structured, right? So for logs, you've got this mess of stuff, so you only want to dig in there when you absolutely have to. But ironically, it's where a lot of the information that you need actually is. So we have a model today, and this model used to work pretty well. And that model is called "index and search". And it basically means you treat log files like they're text documents. And so you index them and when there's some issue you have to drill into, then you go searching, right? So let's look at that model. So 20 years ago, we had sort of a shrink-wrap software delivery model. You had an incident. With that incident, maybe you had one customer and you had a monolithic application and a handful of log files. So it's perfectly natural, in fact, usually you could just vi the log file, and search that way. Or if there's a lot of them, you could index them and search them that way. And that all worked very well because the developer or the support engineer had to be an expert in those few things, in those few log files, and understand what they meant. But today, everything has changed completely. So we live in a software as a service world. What that means is, for a given incident, first of all you're going to be affecting thousands of users. You're going to have, potentially, 100 services that are deployed in your environment. You're going to have 1,000 log streams to sift through. And yet, you're still kind of stuck in the situation where to go find out what's the matter, you're going to have to search through the log files. So this is kind of the unacceptable sort of position we're in today. So for us, the future will not be index and search. And that's simply because it cannot scale.
And the reason I say that it can't scale is because it all kind of is bottlenecked by a person and their eyeball. So, you continue to drive up the amount of data that has to be sifted through, the complexity of the stack that has to be understood, and you still, at the end of the day, for MTTR purposes, you still have the same bottleneck, which is the eyeball. So this model, I believe, is fundamentally broken. And that's why, I believe in five years you're going to be in a situation where most monitoring of unknown unknown problems is going to be done autonomously. And those issues will be characterized autonomously because there's no other way it can happen. So now I'm going to talk a little bit about autonomous monitoring itself. So, autonomous monitoring basically means, if you can imagine in a monitoring platform and you watch the monitoring platform, maybe you watch the alerts coming from it or more importantly, you kind of watch the dashboards and try to see if something looks weird. So autonomous monitoring is the notion that the platform should do the watching for you and only let you know when something is going wrong and should kind of give you a window into what happened. So if you look at this example I have on screen, just to take it really slow and absorb the concept of autonomous monitoring. So here in this example, we've stopped the database. And as a result, down below you can see there were a bunch of fallout. This is an Atlassian Stack, so you can imagine you've got a Postgres database. And then you've got sort of Bitbucket, and Confluence, and Jira, and these various other components that need the database operating in order to function. So what this is doing is it's calling out, "Hey, the root cause is the database stopped and here's the symptoms." Now, you might be wondering, so what. I mean I could go write a script to do this sort of thing. Here's what's interesting about this very particular example, and I'll show a couple more examples that are a little more involved. But here's the interesting thing. So, in the software that came up with this incident and opened this incident and put this root cause and symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, Confluence, there's no regexes that talk about starting, stopped, RDBMS, swallowed exception, and so on and so forth. So you might wonder how it's possible then, that something which is completely ignorant of the stack, could come up with this description, which is exactly what a human would have had to do, to figure out what happened. And I'm going to get into how we do that. But that's what autonomous monitoring is about. It's about getting into a set of telemetry from a stack with no prior information, and understanding when something breaks. And I could give you the punchline right now, which is there are fundamental ways that software behaves when it's breaking. And by looking at hundreds of data sets that people have generously allowed us to use containing incidents, we've been able to characterize that and now generalize it to apply it to any new data set and stack. So here's an interesting one right here. So there's a fella, David Gill, he's just a genius in the monitoring space. He's been working with us for the last couple of months. So he said, "You know what I'm going to do, is I'm going to run some chaos experiments." So for those of you who don't know what chaos engineering is, here's the idea. 
So basically, let's say I'm running a Kubernetes cluster and what I'll do is I'll use sort of a chaos injection test, something like Litmus. And basically it will inject issues, it'll break things in my application randomly to see if my monitoring picks it up. And so this is what chaos engineering is built around. It's built around sort of generating lots of random problems and seeing how the stack responds. So in this particular case, David went in and, basically, one of the tests that was presented through Litmus did a pod delete. And so that's going to basically take out some containers that are part of the service layer. And so then you'll see all kinds of things break. And so what you're seeing here, which is interesting, this is why I like to use this example. Because it's actually kind of eye-opening. So the chaos tool itself generates logs. And of course, through Kubernetes, all the log file locations that are on the host, and the container logs, are known. And those are all pulled back to us automatically. So one of the log files we have is actually from the chaos tool that's doing the breaking, right? And so what the tool said here, when it went to determine what the root cause was, was it noticed that there was this process that had these messages happen, initializing deletion lists, selecting a pod to kill, blah blah blah. It's saying that the root cause is the chaos test. And it's absolutely right, that is the root cause. But usually chaos tests don't get picked up themselves. You're supposed to be just kind of picking up the symptoms. But this is what happens when you're able to kind of tease out root cause from symptoms autonomously, is you end up getting a much more meaningful answer, right? So here's another example. So essentially, we collect the log files, but we also have a Prometheus scraper. So if you export Prometheus metrics, we'll scrape those and we'll collect those as well. And so we'll use those for our autonomous monitoring as well. So what you're seeing here is an issue where, I believe this is where we ran something out of disk space. So it opened an incident, but what's also interesting here is, you see that it pulled that metric to say that the spike in this metric was a symptom of this running out of space. So again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no actual hard-coded logic anywhere to explain any of this. And so the concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it. You'd go looking around for rare things. Things that are not normal. And you would look for indicators of breakage, and you would see, do those seem to be correlated in some dimension? That is how the system works. So as I mentioned a moment ago, metrics really do kind of complete the picture for us. We end up in a situation where we have a one-stop shop for incident root cause. So, how does that work? Well, we ingest and we structure the log files. So if we're getting the logs, we'll ingest them and we'll structure them, and I'm going to show a little bit what that structure looks like and how that goes into the database in a moment. And then of course we ingest and structure the Prometheus metrics. But here, structure really should have an asterisk next to it, because metrics are mostly structured already. They have names.
If you have your own scraper, as opposed to going into the time series Prometheus database and pulling metrics from there, you can keep a lot more metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data. And then we cross-correlate metrics and log anomalies. And then we create incidents. So this is, at a high level, kind of what's happening without any sort of stack-specific logic built in. So we had some exciting recent validation. So MayaData's a pretty big player in the Kubernetes space. Essentially, they do Kubernetes as a managed service. They have tens of thousands of customers that they manage Kubernetes clusters for. And then they're also involved, both in the OpenEBS project, as well as in the Litmus project I mentioned a moment ago. That's their tool for chaos engineering. So they're a pretty big player in the Kubernetes space. So essentially, they said, "Oh okay, let's see if this is real." So what they did was they set up our collectors, which took three minutes in Kubernetes. And then they went and, using Litmus, they reproduced eight incidents that their actual, real-world customers had hit. And they were trying to remember the ones that were the hardest to figure out the root cause of at the time. And we picked up and put a root cause indicator that was correct in 100% of these incidents, with no training, configuration, or metadata required. So this is kind of what autonomous monitoring is all about. So now I'm going to talk a little bit about how it works. So, like I said, there's no information included or required about the stack. So if you imagine a log file, for example. Now, commonly, over to the left-hand side of every line, there will be some sort of a prefix. And what I mean by that is you'll see like a timestamp, or a severity, and maybe there's a PID, and maybe there's a function name, and maybe there's some other stuff there. So basically that's kind of, it's common data elements for a large portion of the lines in a given log file. But you know, of course, the contents change. So basically today, like if you look at a typical log manager, they'll talk about connectors. And what connectors means is, for an application it'll generate a certain prefix format in a log. And that means what's the format of the timestamp, and what else is in the prefix. And this lets the tool pick it up. And so if you have an app that doesn't have a connector, you're out of luck. Well, what we do is we learn those prefixes dynamically with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors, you don't have to describe the prefix format. That's so yesterday, right? So really what we want to be doing is up-leveling what the system is doing to the point where it's kind of working like a human would. You look at a log line, you know what's a timestamp. You know what's a PID. You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system goes about it. As a result, we kind of embrace free-text logs, right?
So if you look at a typical stack, most of the logs generated in a typical stack are usually free-text. Even structured logging typically will have a message attribute, which then inside of it has the free-text message. For us, that's not a bad thing. That's okay. The purpose of a log is to inform people. And so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. They'll figure it out for themselves, right? So, you give us the logs and we'll figure out the grammar, not only for the prefix but also for the variable message part. So I already went into this, but there's more that's usually required for configuring a log manager with alerts. You have to give it keywords. You have to give it application behaviors. You have to tell it some prior knowledge. And of course the problem with all of that is that the most important events that you'll ever see in a log file are the rarest. Those are the ones that are one out of a billion. And so you may not know what's going to be the right keyword in advance to pick up the next breakage, right? So we don't want that information from you. We'll figure that out for ourselves. As the data comes in, essentially we parse it and we categorize it, as I've mentioned. And when I say categorize, what I mean is, if you look at a certain given log file, you'll notice that some of the lines are kind of the same thing. So this one will say "X happened five times" and then maybe a few lines below it'll say "X happened six times", but that's basically the same event type. It's just a different instance of that event type. And it has a different value for one of the parameters, right? So when I say categorization, what I mean is figuring out those unique types, and I'll show an example of that next. Anomaly detection, we do on top of that. So anomaly detection on metrics, in a very sort of time series by time series manner with lots of tunables, is a well-understood problem. So we also do this on the event type occurrences. So you can think of each event type occurring in time as sort of a point process. And then you can develop statistics and distributions on that, and you can do anomaly detection on those. Once we have all of that, we have extracted features, essentially, from metrics and from logs. We do pattern recognition on the correlations across different channels of information, so different event types, different log types, different hosts, different containers, and then of course across to the metrics. Based on all of this cross-correlation, we end up with a root cause identification. So that's essentially, at a high level, how it works. What's interesting, from the perspective of this call particularly, is that incident detection needs relationally structured data. It really does. You need to have all the instances of a certain event type that you've ever seen easily accessible. You need to have the values for a given sort of parameter easily, quickly available, so you can figure out what's the distribution of this over time, how often does this event type happen. You can run analytical queries against that information so that you can quickly, in real-time, do anomaly detection against new data. So here's an example of what that looks like. And this is kind of part of the work that we've done. At the top you see some examples of log lines, right? So that's kind of a snippet, it's three lines out of a log file. And you see one in the middle there that's kind of highlighted with colors, right?
I mean, it's a little messy, but it's not atypical of the log file that you'll see pretty much anywhere. So there, you've got a timestamp, and a severity, and a function name. And then you've got some other information. And then finally, you have the variable part. And that's going to have sort of this checkpoint for memory scrubbers, probably something that's written in English, just so that the person who's reading the log file can understand. And then there's some parameters that are put in, right? So now, if you look at how we structure that, the way it looks is there's going to be three tables that correspond to the three event types that we see above. And so we're going to look at the one that corresponds to the one in the middle. So if we look at that table, there you'll see a table with columns, one for severity, for function name, for time zone, and so on. And date, and PID. And then you see over to the right, with the colored columns, the parameters that were pulled out from the variable part of that message. And so they're put in, they're typed, and they're in integer columns. So this is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline. All right, so let's talk now about Vertica and why we take those tables and put them in Vertica. So Vertica really is an MPP column store, but it's more than that, because nowadays when you say "column store", people sort of think, like, for example Cassandra's a column store, whatever, but it's not. Cassandra's not a column store in the sense that Vertica is. So Vertica was kind of built from the ground up to be... So it's the original column store. So back in the C-Store project at Berkeley that Stonebraker was involved in, he said let's explore what kind of efficiencies we can get out of a real columnar database. And what he found, he and his grad students that started Vertica, what they found was that they could build a database that gives orders of magnitude better query performance for the kinds of analytics I'm talking about here today, with orders of magnitude less data storage underneath. So building on top of machine data, as I mentioned, is hard, because it doesn't have any defined schemas. But we can use an RDBMS like Vertica, once we've structured the data, to do the analytics that we need to do. So I talked a little bit about this, but if you think about machine data in general, it's perfectly suited for a columnar store. Because, if you imagine laying out sort of all the attributes of an event type, right? So you can imagine that each occurrence is going to have... So there may be, say, three or four function names that are going to occur for all the instances of a given event type. And so if you were to sort all of those event instances by function name, what you would find is that you have sort of long, million-long runs of the same function name over and over. So what you have, in general, in machine data, is lots and lots of slowly varying attributes, lots of low-cardinality data that is almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And it also, that propagates through the analytical pipeline. Because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right?
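To make the structuring idea concrete, here is a hypothetical sketch of one such event-type table and the kind of analytical query that anomaly detection relies on; every table and column name is made up for illustration and is not Zebrium's actual schema.

    -- Sketch: one relational table per log event type, with the prefix fields
    -- and the extracted variable parameters as typed columns.
    CREATE TABLE event_type_1042 (
        ts         TIMESTAMP,
        severity   VARCHAR(16),
        host       VARCHAR(128),
        func_name  VARCHAR(128),
        pid        INT,
        param_1    INT,            -- e.g. a count pulled out of the message text
        param_2    VARCHAR(64)
    );

    -- The kind of analytical query anomaly detection needs: how often does
    -- this event type occur per hour, so a spike or a silence stands out?
    SELECT DATE_TRUNC('hour', ts) AS hour, COUNT(*) AS occurrences
    FROM event_type_1042
    GROUP BY 1
    ORDER BY 1;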
So the scale-out architecture, of course, is really suitable for petascale workloads. Also, I should point out, I was going to mention it in another slide or two, but we use the Vertica Eon architecture, and we have had no problems scaling that in the cloud. It's a beautiful sort of rewrite of the entire data layer of Vertica. The performance and flexibility of Eon is just unbelievable. And so I've really been enjoying using it. I was skeptical, you could get a real column store to run in the cloud effectively, but I was completely wrong. So finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support, it has the ODBC drivers, it has the ACID compliance. Which means I don't need to worry about these things as an application developer. So I'm laying out the reasons that I like to use Vertica. So I touched on this already, but essentially what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with pure storage that doesn't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actual consistent database. And they do. We've had issues where we've gone and shut down hosts, or hosts have been shut down on us, and we have to restart the database and we don't have any consistency issues. It's unbelievable, the work that they've done. Essentially, another thing that's great about the way it works is you can use the S3 as a shared object store. You can have query nodes kind of querying from that set of files largely independently of the nodes that are writing to them. So you avoid this sort of bottleneck issue where you've got contention over who's writing what, and who's reading what, and so on. So I've found the performance using separate subclusters for our UI and for the ingest has been amazing. Another couple of things that they have is they have a lot of in-database machine learning libraries. There's actually some cool stuff on their GitHub that we've used. One thing that we make a lot of use of is the sequence and time series analytics. For example, in our product, even though we do all of this stuff autonomously, you can also go create alerts for yourself. And one of the kinds of alerts you can do, you can say, "Okay, if this kind of event happens within so much time, and then this kind of an event happens, but not this one," Then you can be alerted. So you can have these kind of sequences that you define of events that would indicate a problem. And we use their sequence analytics for that. So it kind of gives you really good performance on some of these queries where you're wanting to pull out sequences of events from a fact table. And timeseries analytics is really useful if you want to do analytics on the metrics and you want to do gap filling interpolation on that. It's actually really fast in performance. And it's easy to use through SQL. So those are a couple of Vertica extensions that we use. So finally, I would like to encourage everybody, hey, come try us out. Should be up and running in a few minutes if you're using Kubernetes. If not, it's however long it takes you to run an installer. So you can just come to our website, pick it up and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q and A.
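As an illustration of the time series analytics mentioned above, here is a minimal sketch of Vertica's TIMESERIES clause doing gap filling with linear interpolation; the metrics table and its columns are hypothetical.

    -- Sketch: gap-fill a sparse metric onto a regular one-minute grid using
    -- the TIMESERIES clause with linear interpolation.
    SELECT slice_time,
           metric_name,
           TS_FIRST_VALUE(metric_value, 'LINEAR') AS interpolated_value
    FROM raw_metrics
    TIMESERIES slice_time AS '1 minute'
        OVER (PARTITION BY metric_name ORDER BY scrape_time)
    ORDER BY metric_name, slice_time;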
End-to-End Security
>> Paige: Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled End-to-End Security in Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica. I'll be your host for this session. Joining me are Vertica software engineers Fenic Fawkes and Chris Morris. Before we begin, I encourage you to submit your questions or comments during the virtual session. You don't have to wait until the end. Just type your question or comment in the question box below the slide as it occurs to you and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Also, you can visit the Vertica forums to post your questions there after the session. Our team is planning to join the forums to keep the conversation going, so it'll be just like being at a conference and talking to the engineers after the presentation. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this whole session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. I think we're ready to get started. Over to you, Fen. >> Fenic: Hi, welcome everyone. My name is Fen. My pronouns are fae/faer, and Chris will be presenting the second half, and his pronouns are he/him. So to get started, let's kind of go over what the goals of this presentation are. First off, no deployment is the same. So we can't give you an exact, like, "here's the right way to secure Vertica," because how your deployment is set up is a factor. But the biggest one is, what is your threat model? So, if you don't know what a threat model is, let's take an example. We're all working from home because of the coronavirus, and that introduces certain new risks. Our source code is on our laptops at home, that kind of thing. But really our threat model isn't that people will read our code and copy it, like, over our shoulders. So we've encrypted our hard disks and that kind of thing to make sure that no one can get them. So basically, what we're going to give you are building blocks, and you can pick and choose the pieces that you need to secure your Vertica deployment. We hope that this gives you a good foundation for how to secure Vertica. And now, what we're going to talk about. So we're going to start off by going over encryption, just how to secure your data from attackers. And then authentication, which is kind of how to log in. Identity, which is who are you? Authorization, which is now that we know who you are, what can you do? Delegation is about how Vertica talks to other systems. And then auditing and monitoring. So, how do you protect your data in transit? Vertica makes a lot of network connections. Here are the important ones, basically. There are clients that talk to the Vertica cluster. The Vertica cluster talks to itself. And it can also talk to other Vertica clusters, and it can make connections to a bunch of external services. So first off, let's talk about client-server TLS. This is how you secure data between Vertica and clients. It prevents an attacker from sniffing network traffic and, say, picking out sensitive data. Clients have a way to configure how strict the authentication of the server cert is.
It's called the Client SSLMode and we'll talk about this more in a bit but authentication methods can disable non-TLS connections, which is a pretty cool feature. Okay, so Vertica also makes a lot of network connections within itself. So if Vertica is running behind a strict firewall, you have really good network, both physical and software security, then it's probably not super important that you encrypt all traffic between nodes. But if you're on a public cloud, you can set up AWS' firewall to prevent connections, but if there's a vulnerability in that, then your data's all totally vulnerable. So it's a good idea to set up inter-node encryption in less secure situations. Next, import/export is a good way to move data between clusters. So for instance, say you have an on-premises cluster and you're looking to move to AWS. Import/Export is a great way to move your data from your on-prem cluster to AWS, but that means that the data is going over the open internet. And that is another case where an attacker could try to sniff network traffic and pull out credit card numbers or whatever you have stored in Vertica that's sensitive. So it's a good idea to secure data in that case. And then we also connect to a lot of external services. Kafka, Hadoop, S3 are three of them. Voltage SecureData, which we'll talk about more in a sec, is another. And because of how each service deals with authentication, how to configure your authentication to them differs. So, see our docs. And then I'd like to talk a little bit about where we're going next. Our main goal at this point is making Vertica easier to use. Our first objective was security, was to make sure everything could be secure, so we built relatively low-level building blocks. Now that we've done that, we can identify common use cases and automate them. And that's where our attention is going. Okay, so we've talked about how to secure your data over the network, but what about when it's on disk? There are several different encryption approaches, each depends on kind of what your use case is. RAID controllers and disk encryption are mostly for on-prem clusters and they protect against media theft. They're invisible to Vertica. S3 and GCP are kind of the equivalent in the cloud. They also invisible to Vertica. And then there's field-level encryption, which we accomplish using Voltage SecureData, which is format-preserving encryption. So how does Voltage work? Well, it, the, yeah. It encrypts values to things that look like the same format. So for instance, you can see date of birth encrypted to something that looks like a date of birth but it is not in fact the same thing. You could do cool stuff like with a credit card number, you can encrypt only the first 12 digits, allowing the user to, you know, validate the last four. The benefits of format-preserving encryption are that it doesn't increase database size, you don't need to alter your schema or anything. And because of referential integrity, it means that you can do analytics without unencrypting the data. So again, a little diagram of how you could work Voltage into your use case. And you could even work with Vertica's row and column access policies, which Chris will talk about a bit later, for even more customized access control. Depending on your use case and your Voltage integration. We are enhancing our Voltage integration in several ways in 10.0 and if you're interested in Voltage, you can go see their virtual BDC talk. 
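As a rough illustration of that field-level flow, here's the kind of query you might run through the Voltage SecureData integration; the function names (VoltageSecureProtect, VoltageSecureAccess) and the 'ssn' format are assumptions for the sketch, so check the integration docs for the exact signatures in your version.

```python
# Hypothetical sketch: format-preserving encryption in-query via the Voltage
# SecureData integration. Function and parameter names are assumed here for
# illustration; verify them against the Vertica/Voltage documentation.
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="etl_user", password="...", database="vdb")
cur = conn.cursor()

# Encrypt on the way in: the stored value still looks like an SSN.
cur.execute("""
    INSERT INTO customers (name, ssn)
    SELECT name, VoltageSecureProtect(ssn USING PARAMETERS format='ssn')
    FROM staging_customers
""")

# Analytics (joins, GROUP BY, counts) can run on the encrypted values as-is;
# only an authorized read at the end decrypts.
cur.execute("""
    SELECT VoltageSecureAccess(ssn USING PARAMETERS format='ssn') AS ssn
    FROM customers WHERE customer_id = 42
""")
print(cur.fetchone())
conn.close()
```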
And then again, talking about roadmap a little, we're working on in-database encryption at rest. What this means is kind of a Vertica solution to encryption at rest that doesn't depend on the platform that you're running on. Encryption at rest is hard. (laughs) Encrypting, say, 10 petabytes of data is a lot of work. And once again, the theme of this talk is everyone has a different key management strategy, a different threat model, so we're working on designing a solution that fits everyone. If you're interested, we'd love to hear from you. Contact us on the Vertica forums. All right, next up we're going to talk a little bit about access control. So first off is how do I prove who I am? How do I log in? So, Vertica has several authentication methods. Which one is best depends on your deployment size/use case. Again, theme of this talk is what you should use depends on your use case. You could order authentication methods by priority and origin. So for instance, you can only allow connections from within your internal network or you can enforce TLS on connections from external networks but relax that for connections from your internal network. That kind of thing. So we have a bunch of built-in authentication methods. They're all password-based. User profiles allow you to set complexity requirements of passwords and you can even reject non-TLS connections, say, or reject certain kinds of connections. Should only be used by small deployments because you probably have an LDAP server, where you manage users if you're a larger deployment and rather than duplicating passwords and users all in LDAP, you should use LDAP Auth, where Vertica still has to keep track of users, but each user can then use LDAP authentication. So Vertica doesn't store the password at all. The client gives Vertica a username and password and Vertica then asks the LDAP server is this a correct username or password. And the benefits of this are, well, manyfold, but if, say, you delete a user from LDAP, you don't need to remember to also delete their Vertica credentials. You can just, they won't be able to log in anymore because they're not in LDAP anymore. If you like LDAP but you want something a little bit more secure, Kerberos is a good idea. So similar to LDAP, Vertica doesn't keep track of who's allowed to log in, it just keeps track of the Kerberos credentials and it even, Vertica never touches the user's password. Users log in to Kerberos and then they pass Vertica a ticket that says "I can log in." It is more complex to set up, so if you're just getting started with security, LDAP is probably a better option. But Kerberos is, again, a little bit more secure. If you're looking for something that, you know, works well for applications, certificate auth is probably what you want. Rather than hardcoding a password, or storing a password in a script that you use to run an application, you can instead use a certificate. So, if you ever need to change it, you can just replace the certificate on disk and the next time the application starts, it just picks that up and logs in. Yeah. And then, multi-factor auth is a feature request we've gotten in the past and it's not built-in to Vertica but you can do it using Kerberos. So, security is a whole application concern and fitting MFA into your workflow is all about fitting it in at the right layer. And we believe that that layer is above Vertica. If you're interested in more about how MFA works and how to set it up, we wrote a blog on how to do it. 
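Pulling those options together, here's a sketch of ordering authentication methods by priority and origin: relaxed password auth inside the internal network, LDAP for everyone else. It's plain SQL run through any client (vertica-python here), and the network ranges, LDAP URL, and parameter names are placeholders worth double-checking against the docs.

```python
# Sketch: relax auth for the internal network, require LDAP from elsewhere.
# CIDR ranges, LDAP settings, and priorities are placeholder values.
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="...", database="vdb")
cur = conn.cursor()

for stmt in [
    # Password (hash) auth, but only for connections from the internal network.
    "CREATE AUTHENTICATION v_internal METHOD 'hash' HOST '10.0.0.0/8'",
    "ALTER AUTHENTICATION v_internal PRIORITY 5",
    # LDAP auth for connections from anywhere else.
    "CREATE AUTHENTICATION v_ldap METHOD 'ldap' HOST '0.0.0.0/0'",
    "ALTER AUTHENTICATION v_ldap SET host='ldap://ldap.example.com', basedn='dc=example,dc=com'",
    "ALTER AUTHENTICATION v_ldap PRIORITY 10",
    # Make both records apply to all users.
    "GRANT AUTHENTICATION v_internal TO PUBLIC",
    "GRANT AUTHENTICATION v_ldap TO PUBLIC",
]:
    cur.execute(stmt)
conn.close()
```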
And now, over to Chris, for more on identity and authorization. >> Chris: Thanks, Fen. Hi everyone, I'm Chris. So, we're a Vertica user and we've connected to Vertica but once we're in the database, who are we? What are we? So in Vertica, the answer to those questions is principals. Users and roles, which are like groups in other systems. Since roles can be enabled and disabled at will and multiple roles can be active, they're a flexible way to use only the privileges you need in the moment. For example here, you've got Alice who has DBADMIN as a role and those are some elevated privileges. She probably doesn't want them active all the time, so she can set the role and add them to her identity set. All of this information is stored in the catalog, which is basically Vertica's metadata storage. How do we manage these principals? Well, depends on your use case, right? So, if you're a small organization or maybe only some people or services need Vertica access, the solution is just to manage it with Vertica. You can see some commands here that will let you do that. But what if we're a big organization and we want Vertica to reflect what's in our centralized user management system? Sort of a similar motivating use case for LDAP authentication, right? We want to avoid duplication hassles, we just want to centralize our management. In that case, we can use Vertica's LDAPLink feature. So with LDAPLink, principals are mirrored from LDAP. They're synced in a configurable fashion from LDAP into Vertica's catalog. What this does is it manages creating and dropping users and roles for you and then mapping the users to the roles. Once that's done, you can do any Vertica-specific configuration on the Vertica side. It's important to note that principals created in Vertica this way support multiple forms of authentication, not just LDAP. This is a separate feature from LDAP authentication and if you created a user via LDAPLink, you could have them use a different form of authentication, Kerberos, for example. Up to you. Now of course this kind of system is pretty mission-critical, right? You want to make sure you get the right roles and the right users and the right mappings in Vertica. So you probably want to test it. And for that, we've got new and improved dry run functionality, from 9.3.1. And what this feature offers you is new metafunctions that let you test various parameters without breaking your real LDAPLink configuration. So you can mess around with parameters and the configuration as much as you want and you can be sure that all of that is strictly isolated from the live system. Everything's separated. And when you use this, you get some really nice output through a Data Collector table. You can see some example output here. It runs the same logic as the real LDAPLink and provides detailed information about what would happen. You can check the documentation for specifics. All right, so we've connected to the database, we know who we are, but now, what can we do? So for any given action, you want to control who can do that, right? So what's the question you have to ask? Sometimes the question is just who are you? It's a simple yes or no question. For example, if I want to upgrade a user, the question I have to ask is, am I the superuser? If I'm the superuser, I can do it, if I'm not, I can't. But sometimes the actions are more complex and the question you have to ask is more complex. Does the principal have the required privileges?
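For the Alice example, a minimal sketch of managing principals directly in Vertica looks like this; all the names are made up.

```python
# Sketch: create a user and a role, grant the role, and only enable it when
# it's needed. Names and the password are placeholders.
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="...", database="vdb")
cur = conn.cursor()

cur.execute("CREATE USER alice IDENTIFIED BY '...'")
cur.execute("CREATE ROLE analyst")
cur.execute("GRANT analyst TO alice")            # Alice *has* the role...
# ...but it isn't active until she enables it in her own session:
#   SET ROLE analyst;   -- add it to the identity set
#   SET ROLE NONE;      -- drop the elevated privileges again
# Or make it active automatically at login:
cur.execute("ALTER USER alice DEFAULT ROLE analyst")
conn.close()
```

What a principal can actually do with those roles then comes down to privileges.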
If you're familiar with SQL privileges, there are things like SELECT, INSERT, and Vertica has a few of their own, but the key thing here is that an action can require specific and maybe even multiple privileges on multiple objects. So for example, when selecting from a table, you need USAGE on the schema and SELECT on the table. And there's some other examples here. So where do these privileges come from? Well, if the action requires a privilege, these are the only places privileges can come from. The first source is implicit privileges, which could come from owning the object or from special roles, which we'll talk about in a sec. Explicit privileges, it's basically a SQL standard GRANT system. So you can grant privileges to users or roles and optionally, those users and roles could grant them downstream. Discretionary access control. So those are explicit and they come from the user and the active roles. So the whole identity set. And then we've got Vertica-specific inherited privileges and those come from the schema, and we'll talk about that in a sec as well. So these are the special roles in Vertica. First role, DBADMIN. This isn't the Dbadmin user, it's a role. And it has specific elevated privileges. You can check the documentation for those exact privileges but it's less than the superuser. The PSEUDOSUPERUSER can do anything the real superuser can do and you can grant this role to whomever. The DBDUSER is actually a role, can run Database Designer functions. SYSMONITOR gives you some elevated auditing permissions and we'll talk about that later as well. And finally, PUBLIC is a role that everyone has all the time so anything you want to be allowed for everyone, attach to PUBLIC. Imagine this scenario. I've got a really big schema with lots of relations. Those relations might be changing all the time. But for each principal that uses this schema, I want the privileges for all the tables and views there to be roughly the same. Even though the tables and views come and go, for example, an analyst might need full access to all of them no matter how many there are or what there are at any given time. So to manage this, my first approach I could use is remember to run grants every time a new table or view is created. And not just you but everyone using this schema. Not only is it a pain, it's hard to enforce. The second approach is to use schema-inherited privileges. So in Vertica, schema grants can include relational privileges. For example, SELECT or INSERT, which normally don't mean anything for a schema, but they do for a table. If a relation's marked as inheriting, then the schema grants to a principal, for example, salespeople, also apply to the relation. And you can see on the diagram here how the usage applies to the schema and the SELECT technically but in Sales.foo table, SELECT also applies. So now, instead of lots of GRANT statements for multiple object owners, we only have to run one ALTER SCHEMA statement and three GRANT statements and from then on, any time that you grant some privileges or revoke privileges to or on the schema, to or from a principal, all your new tables and views will get them automatically. So it's dynamically calculated. Now of course, setting it up securely, is that you want to know what's happened here and what's going on. So to monitor the privileges, there are three system tables which you want to look at. The first is grants, which will show you privileges that are active for you. 
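Here's a small sketch of that schema-inherited approach; the schema, role, and table names are made up, and the exact INCLUDE SCHEMA PRIVILEGES syntax is worth confirming in the docs for your version.

```python
# Sketch: one schema-level grant instead of per-table grants.
# Schema, role, and table names are placeholders.
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="...", database="vdb")
cur = conn.cursor()

# New tables and views in this schema will inherit the schema's privileges.
cur.execute("ALTER SCHEMA sales DEFAULT INCLUDE SCHEMA PRIVILEGES")

# Relational privileges granted on the schema flow down to inheriting relations.
cur.execute("GRANT USAGE, SELECT ON SCHEMA sales TO salespeople")

# Existing tables can opt in individually:
cur.execute("ALTER TABLE sales.foo INCLUDE SCHEMA PRIVILEGES")

# (The "grants" system table mentioned above will show these as active for you.)
conn.close()
```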
That is your user and active roles and theirs and so on down the chain. Grants will show you the explicit privileges and inherited_privileges will show you the inherited ones. And then there's one more, inheriting_objects, which will show all tables and views which inherit privileges, so that's useful not so much for seeing the privileges themselves but for managing inherited privileges in general. And finally, how do you see all privileges from all these sources, right? You want to see them together in one go? Well, there's a metafunction added in 9.3.1, GET_PRIVILEGES_DESCRIPTION, which, given an object, will sum up all the privileges for the current user on that object. I'll refer you to the documentation for usage and supported types. Now, the problem with SELECT. SELECT lets you see everything or nothing. You can either read the table or you can't. But what if you want some principals to see a subset or a transformed version of the data? So for example, I have a table with personnel data and different principals, as you can see here, need different access levels to sensitive information. Social security numbers. Well, one thing I could do is I could make a view for each principal. But I could also use access policies, and access policies can do this without introducing any new objects or dependencies. It centralizes your restriction logic and makes it easier to manage. So what do access policies do? Well, we've got row and column access policies. Row access policies will hide rows and column access policies will transform data in a column, depending on who's doing the SELECTing. So it transforms the data, as we saw on the previous slide, to look as requested. Now, only if access policies let you see the raw data can you still modify the data. And the implication of this is that when you're crafting access policies, you should only use them to refine access for principals that need read-only access. That is, if you want a principal to be able to modify it, the access policies you craft should let through the raw data for that principal. So in our previous example, the loader service should be able to see every row and it should be able to see untransformed data in every column. And as long as that's true, then they can continue to load into this table. All of this is of course monitorable by a system table, in this case access_policy. Check the docs for more information on how to implement these. All right, that's it for access control. Now on to delegation and impersonation. So what's the question here? Well, the question is, who is Vertica? And that might seem like a silly question, but here's what I mean by that. When Vertica's connecting to a downstream service, for example, cloud storage, how should Vertica identify itself? Well, most of the time, we do the permissions check ourselves and then we connect as Vertica, like in this diagram here. But sometimes we can do better. And instead of connecting as Vertica, we connect with some kind of upstream user identity. And when we do that, we let the service decide who can do what, so Vertica isn't the only line of defense. And in addition to the defense in depth benefit, there are also benefits for auditing because the external system can see who is really doing something. It's no longer just Vertica showing up in that external service's logs, it's somebody like Alice or Bob trying to do something. One system where this comes into play is with Voltage SecureData. So, let's look at a couple use cases.
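Before those use cases, here's a rough sketch of the access policies from the personnel example a moment ago; the table, column, and role names are assumptions, as is the masking expression.

```python
# Sketch: mask SSNs for everyone except HR, and only let trusted roles see
# rows at all. Table, column, and role names are placeholders.
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="...", database="vdb")
cur = conn.cursor()

# Column policy: HR sees raw SSNs, everyone else sees a masked value.
cur.execute("""
    CREATE ACCESS POLICY ON public.personnel FOR COLUMN ssn
    CASE
        WHEN ENABLED_ROLE('hr') THEN ssn
        ELSE '***-**-' || RIGHT(ssn, 4)
    END
    ENABLE
""")

# Row policy: the loader role must see every row, untransformed, so it can
# keep loading; HR sees all rows too.
cur.execute("""
    CREATE ACCESS POLICY ON public.personnel FOR ROWS
    WHERE ENABLED_ROLE('loader') OR ENABLED_ROLE('hr')
    ENABLE
""")

# Per the talk, GET_PRIVILEGES_DESCRIPTION('table', 'public.personnel') and the
# access_policy system table are the places to check what's in effect.
conn.close()
```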
The first one, I'm just encrypting for compliance or anti-theft reasons. In this case, I'll just use one global identity to encrypt or decrypt with Voltage. But imagine another use case, I want to control which users can decrypt which data. Now I'm using Voltage for access control. So in this case, we want to delegate. The solution here is, on the Voltage side, give Voltage users access to appropriate identities and these identities control encryption for sets of data. A Voltage user can access multiple identities, like groups. Then on the Vertica side, a Vertica user can set their Voltage username and password in a session and Vertica will talk to Voltage as that Voltage user. So in the diagram here, you can see an example of how this is leveraged so that Alice can decrypt something but Bob cannot. Another place the delegation paradigm shows up is with storage. So Vertica can store and interact with data on non-local file systems. For example, HDFS or S3. Sometimes Vertica's storing Vertica-managed data there. For example, in Eon mode, you might store your projections in communal storage in S3. But sometimes, Vertica is interacting with external data. For example, this usually maps to a user storage location on the Vertica side and it might, on the external storage side, be something like Parquet files on Hadoop. And in that case, it's not really Vertica's data and we don't want to give Vertica more power than it needs, so let's request the data on behalf of who needs it. Let's say I'm an analyst and I want to copy from or export to Parquet, using my own bucket. It's not Vertica's bucket, it's my data. But I want Vertica to manipulate data in it. So the first option I have is to give Vertica as a whole access to the bucket and that's problematic because in that case, Vertica becomes kind of an AWS god. It can see any bucket, any Vertica user might want to push or pull data to or from any time Vertica wants. So it's not good for the principles of least access and zero trust. And we can do better than that. So in the second option, use an ID and secret key pair for an AWS IAM principal, if you're familiar with IAM, that does have access to the bucket. So I might use my credentials, the analyst's, or I might use credentials for an AWS role that has even fewer privileges than I do. Sort of a restricted subset of my privileges. And then I use that. I set it in Vertica at the session level and Vertica will use those credentials for the copy and export commands. And it gives more isolation. Something that's in the works is support for keyless delegation, using assumable IAM roles. So similar benefits to option two here, but also not having to manage keys at the user level. We can do basically the same thing with Hadoop and HDFS with three different methods. So the first option is Kerberos delegation. I think it's the most secure. If access control is your primary concern here, this will definitely give you the tightest access control. The downside is it requires the most configuration outside of Vertica with Kerberos and HDFS, but with this, you can really determine which Vertica users can talk to which HDFS locations. Then, you've got secure impersonation. If you've got a highly trusted Vertica userbase, or at least some subset of it is, and you're not worried about them doing things wrong but you want to know about auditing on the HDFS side, that's your primary concern, you can use this option. This diagram here gives you a visual overview of how that works. But I'll refer you to the docs for details.
And then finally, option three, this is bringing your own delegation token. It's similar to what we do with AWS. We set something in the session level, so it's very flexible. The user can do it at an ad hoc basis, but it is manual, so that's the third option. Now on to auditing and monitoring. So of course, we want to know, what's happening in our database? It's important in general and important for incident response, of course. So your first stop, to answer this question, should be system tables. And they're a collection of information about events, system state, performance, et cetera. They're SELECT-only tables, but they work in queries as usual. The data is just loaded differently. So there are two types generally. There's the metadata table, which stores persistent information or rather reflects persistent information stored in the catalog, for example, users or schemata. Then there are monitoring tables, which reflect more transient information, like events, system resources. Here you can see an example of output from the resource pool's storage table which, these are actually, despite that it looks like system statistics, they're actually configurable parameters for using that. If you're interested in resource pools, a way to handle users' resource allocation and various principal's resource allocation, again, check that out on the docs. Then of course, there's the followup question, who can see all of this? Well, some system information is sensitive and we should only show it to those who need it. Principal of least privilege, right? So of course the superuser can see everything, but what about non-superusers? How do we give access to people that might need additional information about the system without giving them too much power? One option's SYSMONITOR, as I mentioned before, it's a special role. And this role can always read system tables but not change things like a superuser would be able to. Just reading. And another option is the RESTRICT and RELEASE metafunctions. Those grant and revoke access to from a certain system table set, to and from the PUBLIC role. But the downside of those approaches is that they're inflexible. So they only give you, they're all or nothing. For a specific preset of tables. And you can't really configure it per table. So if you're willing to do a little more setup, then I'd recommend using your own grants and roles. System tables support GRANT and REVOKE statements just like any regular relations. And in that case, I wouldn't even bother with SYSMONITOR or the metafunctions. So to do this, just grant whatever privileges you see fit to roles that you create. Then go ahead and grant those roles to the users that you want. And revoke access to the system tables of your choice from PUBLIC. If you need even finer-grained access than this, you can create views on top of system tables. For example, you can create a view on top of the user system table which only shows the current user's information, uses a built-in function that you can use as part of the view definition. And then, you can actually grant this to PUBLIC, so that each user in Vertica could see their own user's information and never give access to the user system table as a whole, just that view. Now if you're a superuser or if you have direct access to nodes in the cluster, filesystem/OS, et cetera, then you have more ways to see events. Vertica supports various methods of logging. 
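Before the logging side of things, here's a sketch of that roll-your-own approach for system tables: a custom monitoring role plus a per-user view. The role name and the particular system tables chosen are made up.

```python
# Sketch: finer-grained access to system tables via your own role, plus a
# per-user view over the USERS system table. Names are placeholders.
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="...", database="vdb")
cur = conn.cursor()

cur.execute("CREATE ROLE monitor_light")
# Hand-pick the system tables this role may read.
cur.execute("GRANT SELECT ON v_monitor.query_requests TO monitor_light")
cur.execute("GRANT SELECT ON v_monitor.resource_pool_status TO monitor_light")
cur.execute("GRANT monitor_light TO alice")

# Per-user view: each user sees only their own row from the USERS table.
cur.execute("""
    CREATE VIEW public.my_user AS
    SELECT * FROM v_catalog.users WHERE user_name = CURRENT_USER
""")
cur.execute("GRANT SELECT ON public.my_user TO PUBLIC")
conn.close()
```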
You can see a few methods here which are generally outside of running Vertica, you'd interact with them in a different way, with the exception of active events which is a system table. We've also got the data collector. And that sorts events by subjects. So what the data collector does, it extends the logging and system table functionality, by the component, is what it's called in the documentation. And it logs these events and information to rotating files. For example, AnalyzeStatistics is a function that could be of use by users and as a database administrator, you might want to monitor that so you can use the data collector for AnalyzeStatistics. And the files that these create can be exported into a monitoring database. One example of that is with the Management Console Extended Monitoring. So check out their virtual BDC talk. The one on the management console. And that's it for the key points of security in Vertica. Well, many of these slides could spawn a talk on their own, so we encourage you to check out our blog, check out the documentation and the forum for further investigation and collaboration. Hopefully the information we provided today will inform your choices in securing your deployment of Vertica. Thanks for your time today. That concludes our presentation. Now, we're ready for Q&A.
Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives
>> Sue: Hello everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives. My name is Sue LeClaire, Director of Marketing at Vertica and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer them offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you. >> Tom: Hello everyone and thanks for joining us today for this talk. My name is Tom Wall and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third party integrations that enable the software ecosystem that surrounds Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how those can be really effective for you and make things easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help to grow the developer community and enable lots of exciting new use cases. So, every developer out there has probably had to deal with a problem like this. You have some business requirements, to maybe build some new Vertica-powered application. Maybe you have to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool, or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there's a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? Well, you have to get creative, using tools like PyODBC, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is a C library and most programming languages know how to call C code, somehow.
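To make that ODBC-bridge idea concrete, a minimal PyODBC sketch looks something like this; it assumes the Vertica ODBC driver is installed and a DSN named VerticaDSN has already been configured, which is exactly the setup work being described.

```python
# Sketch: talking to Vertica from Python by going through ODBC.
# Assumes the Vertica ODBC driver is installed and a DSN "VerticaDSN" exists.
import pyodbc

conn = pyodbc.connect("DSN=VerticaDSN;UID=dbadmin;PWD=...", autocommit=True)
cur = conn.cursor()
cur.execute("SELECT version()")
print(cur.fetchone()[0])
conn.close()
```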
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. So that's enough to get the job done but native integrations are usually a lot smoother and easier. So rather than, for example, in Python trying to fight with PyODBC, to configure things and get Unicode working, and to compile all the different pieces, the right way is to make it all work smoothly. It would be much better if you could just PIP install library and get to work. And with Vertica-Python, a new Python client library, you can actually do that. So that story, I assume, probably sounds pretty familiar to you. Sounds probably familiar to a lot of the audience here because we're all using Vertica. And our challenge, as Big Data practitioners is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs. Luckily for us, Vertica has been designed to be easy to build with and extended in this fashion. Databases as a whole had had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about it. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain. So implementing that business logic, solving that problem, without having to worry about all of these intense, sometimes details about what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what the answer is that you want. You don't tell Vertica how to get it. Vertica will figure out the right way to do it for you so that you don't have to worry about it. So this SQL abstraction is very nice because it's a well defined boundary where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems. This goes beyond though, what's accessible through SQL to Vertica. We've got well defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things write your own SQL functions, or extend database softwares with UDXs, you can do so. If you have a custom data format that might be a proprietary format, or some source system that Vertica doesn't natively support, we have extension points that allow you to use those. To make it very easy to do passive, parallel, massive data movement, loading into Vertica but also to export Vertica to send data to other systems. And with these new features in time, we also could do the same kinds of things with Machine Learning models, importing and exporting to tools like TensorFlow. 
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties that solve all different problems that are common in this big data processing world. Whether it's open source streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also BI tools and visualizers and things like that to view and use the data that you keep in your database on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want it to solve the problems that you need to solve. So Vertica has always employed open standards, and integrated with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves. And we're using these new libraries to power some new integrations with some third party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts in exciting ways to solve new problems. And the code for all these things is available now on our GitHub page. And so you can use it however you like, and even help us make it better too. So the first such project that we have is called Vertica-Python. Vertica-Python began at our customer, Uber. And then in late 2018, we collaborated with them and we took it over and made Vertica-Python the first official open source client for Vertica. You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years and it's a very common language to solve lots of different problems and use cases in the Big Data space, from things like DevOps administration and Data Science or Machine Learning, or just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 End Of Life, that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2. And also to provide a nice migration path for all of you, our users, that might be worried about the same problems with your own Python code. So Vertica-Python is used already for lots of different tools, including Vertica's admintools, now starting with 9.3.1. It was also used by DataDog to build a Vertica-DataDog integration that allows you to monitor your Vertica infrastructure within DataDog. So here's a little example of how you might use the Python Client to do some work. So here we open a connection, we run a query to find out what node we've connected to, and then we do a little data load by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python Database Client before. So we implement the DB API 2.0 standard and it feels like a Python package. So that includes things like, it's in the central package index, so you can just pip install this right now and go start using it.
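Concretely, the slide example looks roughly like this; the table name and CSV file are placeholders, and the node-name query is just one way to see which node you landed on.

```python
# Sketch of the slide example: connect, check which node we hit, then load
# some data with COPY. Table and file names are placeholders.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "vdb"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Which initiator node did we land on?
    cur.execute("SELECT node_name FROM current_session")
    print(cur.fetchone()[0])

    # A little data load: stream a local CSV through a COPY statement.
    with open("events.csv", "rb") as fh:
        cur.copy("COPY public.events FROM STDIN DELIMITER ',' ABORT ON ERROR", fh)
```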
We also have our client for the Go language. So this is called vertica-sql-go. And this is a very similar story, just in a different context, for a different programming language. So vertica-sql-go began as a collaboration with the Micro Focus SecOps group, who build Micro Focus' security products, some of which use Vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. So most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for Vertica data. And Go is another programming language rising in popularity because it offers an interesting balance of different programming design trade-offs. So it's got good performance, good concurrency and memory safety. And we liked all those things and we're using it to power some internal monitoring stuff of our own. And here's an example of the code you can write with this client. So this is Go code that does a similar thing. It opens a connection, it runs a little test query, and then it iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it the way you usually package things with Go by running that command there to acquire this package. And it's important to note here that for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourselves and you can collaborate directly with our engineering team and the other Vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality compared to the core Vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases with 40 customer reported issues filed on GitHub. That was done over 78 different pull requests and with lots of community engagement as we did so. So lots of people are using this already; as our GitHub badge just showed, there are about 5,000 downloads a day of people using it in their software. And again, we want to make this easy, not just to use but also to contribute to, understand, and collaborate with us on. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this and we're really excited about where it can go in the future. 'Cause this offers some exciting opportunities for us to collaborate with you more directly than we have ever before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge and implementation details and various best practices. And so maybe you think, "Well, I don't use Python, "I don't use Go, so maybe it doesn't matter to me." But I would argue it really does matter.
Because even if you don't use these tools and languages, there's lots of amazing vertica developers out there who do. And these clients do act as low level building blocks for all kinds of different interesting tools, both in these Python and Go worlds, but also well beyond that. Because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these to understand exactly how that's the case and what you can do with these things. So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. So these database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs. So what does the programmer need to do to actually process those SQL queries? So these interfaces are specific to a particular language or a technology stack. But the use cases and the architectures and design patterns are largely the same between different languages. They all have a need to do some networking and connect and authenticate and create a session. They all need to be able to run queries and load some data and deal with problems and errors. And then they also have a lot of metadata and Type Mapping because you want to use these clients the way you use those programming languages. Which might be different than the way that vertica's data types and vertica's semantics work. So some of this client interfaces are truly standards. And they are robust enough in terms of what they design and call for to support a truly pluggable driver model. Where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. So most of these interfaces aren't as robust as a JDBC or ODBC but that's okay. 'Cause it's good as a standard is, every database is unique for a reason. And so you can't really expose all of those unique properties of a database through these standard interfaces. So vertica's unique in that it can scale to the petabytes and beyond. And you can run it anywhere in any environment, whether it's on-prem or on clouds. So surely there's something about vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need and common patterns that arise to solve these problems in similar ways. When there isn't enough of a standard to define those comments, semantics that different databases might have in common, what you often see is tools will invent plug in layers or glue code to compensate by defining application wide standard to cover some of these same semantics. Later on, we'll get into some of those details and show off what exactly that means. So if you connect to a vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls and some client library or tool. 
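As a hedged sketch of what those API calls look like with vertica-python, the options below map onto the steps described next; the values are placeholders and option availability can vary a bit by client version.

```python
# Sketch: connection options that drive the handshake steps described below.
# All values are placeholders.
import vertica_python

conn_info = {
    "host": "vertica.example.com",          # a DNS name that may cover the whole cluster
    "port": 5433,
    "user": "app_user",
    "password": "...",
    "database": "vdb",
    "connection_load_balance": True,        # let the server pick an initiator node
    "backup_server_node": ["node02.example.com", "node03.example.com"],
    "ssl": True,                            # or an ssl.SSLContext to verify the server cert
    "session_label": "nightly-etl",         # shows up when you monitor sessions
    "autocommit": True,                     # session setting sent at startup
}

with vertica_python.connect(**conn_info) as conn:
    conn.cursor().execute("SELECT 1")
```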
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask vertica to do the heavy lifting required for that particular API call. And so these API's usually do the same kinds of things although some of the details might differ between these different interfaces. But you do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing. Here's an example from vertica-python, which just goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There's actually a lot of moving parts in what happens during a connection. So let's walk through some of that and see what actually goes on. I might have my API call like this where I say Connect and I give it a DNS name, which is my entire cluster. And I give you my connection details, my username and password. And I tell the Python Client to get me a session, give me a connection so I can start doing some work. Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by pressing the connection string. and vertica being a distributed system, we want to provide high availability, so we might need to do some DNS look-ups to resolve that DNS name which might be an entire cluster and not just a single machine. So that you don't have to change your connection string every time you add or remove nodes to the database. So we do some high availability and DNS lookup stuff. And then once we connect, we might do Load Balancing too, to balance the connections across the different initiator nodes in the cluster, or in a sub cluster, as needed. Once we land on the node we want to be at, we might do some TLS to secure our connections. And vertica supports the industry standard TLS protocols, so this looks pretty familiar for everyone who've used TLS anywhere before. So you're going to do a certificate exchange and the client might send the server certificate too, and then you going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection, and secured it, then you can start actually beginning to request a session within vertica. So you going to send over your user information like, "Here's my username, "here's the database I want to connect to." You might send some information about your application like a session label, so that you can differentiate on the database with monitoring queries, what the different connections are and what their purpose is. And then you might also send over some session settings to do things like auto commit, to change the state of your session for the duration of this connection. So that you don't have to remember to do that with every query that you have. Once you've asked vertica for a session, before vertica will give you one, it has to authenticate you. and vertica has lots of different authentication mechanisms. So there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are, where you're coming from on the network. And then you'll do an auth-specific exchange depending on what the auth mechanism calls for until you are authenticated. 
Finally, Vertica trusts you and lets you in, so you're going to establish a session in Vertica, and you might do some note keeping on the client side just to know what happened. So you might log some information, you might record what the version of the database is, you might do some protocol feature negotiation. So if you connect to a version of the database that doesn't support all these protocols, you might decide to turn some functionality off and that sort of thing. But finally, after all that, you can return from this API call and then your connection is good to go. So that connection is just one example of many different APIs. And we're excited here because with vertica-python we're really opening up the Vertica client wire protocol for the first time. And so if you're a low level Vertica developer and you might have used Postgres before, you might know that some of Vertica's client protocol is derived from Postgres. But they do differ in many significant ways. And this is the first time we've ever revealed those details about how it works and why. So not all Postgres protocol features work with Vertica because Vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over. Whereas Vertica doesn't really have very wide data values; you have long varchars, but that's about as wide as you can get. Similarly, the Vertica protocol supports lots of features not present in Postgres. So Load Balancing, for example, which we just went through an example of; Postgres is a single node system, it doesn't really make sense for Postgres to have Load Balancing. But Load Balancing is really important for Vertica because it is a distributed system. Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there but we'll get there soon. So this is really cool 'cause not only do you have now a Python client implementation, and you have a Go client implementation of this, but you can use this protocol reference to do lots of other things, too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that Vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so if you need to. But beyond clients, it's also used for other things. So you might use it for mocking and testing things. So rather than connecting to a real Vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. So Uber, for example, this blog here in this link tells a great story of how they route different queries to different Vertica clusters by intercepting these protocol messages, parsing the queries in them and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
And so we're very interested in hearing your ideas and requests and we're happy to offer advice and collaborate on building some of these things together. So let's take a look now at some of the things we've already built that do these things. So here's a picture of vertica's Grafana connector with some data powered from an example that we have in this blog link here. So this has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka which then gets loaded into vertica. And then finally, it gets visualized nicely here with Grafana. And Grafana's visualizations make it really easy to analyze the data with your eyes and see when something something happens. So in these highlighted sections here, you notice a drop in some of the activity, that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself. So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to do correctly. You got time zones, daylight savings, leap seconds, negative infinity timestamps, please don't ever use those. In every system, if it wasn't hard enough, just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries from Vertica to Grafana's back end architecture, which is implemented in Go on it's front end, which is implemented with JavaScript? Well, you read this from bottom up in terms of the processing. First, you select the timestamp and Vertica is timestamp has to be converted to a Go time object. And we have to reconcile the differences that there might be as we translate it. So Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. So that's not too big of a deal when you're querying data because you just see some extra zeros, not fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's into the Go process, it has to be converted further to render in the JavaScript UI. So that there, the Go time object has to be converted to a JavaScript Angular JS Date object. And there too, we have to reconcile those differences. So a lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date into a more human readable format, like we've done in this example here. Here's another picture. This is another picture of some time series data, and this one shows you can actually write your own queries with Grafana to provide answers. So if you look closely here you can see there's actually some functions that might not look too familiar with you if you know vertica's functions. Vertica doesn't have a dollar underscore underscore time function or a time filter function. So what's actually happening there? How does this actually provide an answer if it's not really real vertica syntax? Well, it's not sufficient to just know how to manipulate data, it's also really important that you know how to operate with metadata. So information about how the data works in the data source, Vertica in this case. 
So Grafana needs to know how time works in detail for each data source beyond doing that basic I/O that we just saw in the previous example. So it needs to know, how do you connect to the data source to get some time data? How do you know what time data types and functions there are and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you do know which tables have time columns and then they might be worth rendering in this kind of UI. So Go's database standard doesn't actually really offer many metadata interfaces. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer whereby every data source can implement hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points with JavaScript and the front end plugins and also with Go in the back end plugins to provide vertica connectivity inside Grafana. So the way this works, is that the plugin frameworks defines those standardizing functions like time and time filter, and it's our plugin that's going to rewrite them in terms of vertica syntax. So in this example, time gets rewritten to a vertica cast. And time filter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica. So let's now look at some other examples and reference architectures that we have out in our GitHub page. For some advanced integrations, there's clearly a need to go beyond these standards. So SQL and these surrounding standards, like JDBC, and ODBC, were really critical in the early days of Vertica, because they really enabled a lot of generic database tools. And those will always continue to play a really important role, but the Big Data technology space moves a lot faster than these old database data can keep up with. So there's all kinds of new advanced analytics and query pushdown logic that were never possible 10 or 20 years ago, that Vertica can do natively. There's also all kinds of data-oriented application workflows doing things like streaming data, or Parallel Loading or Machine Learning. And all of these things, we need to build software with, but we don't really have standards to go by. So what do we do there? Well, open source implementations make for easier integrations, and applications all over the place. So even if you're not using Grafana for example, other tools have similar challenges that you need to overcome. And it helps to have an example there to show you how to do it. Take Machine Learning, for example. There's been many excellent Machine Learning tools that have arisen over the years to make data science and the task of Machine Learning lot easier. And a lot of those have basic database connectivity, but they generally only treat the database as a source of data. So they do lots of data I/O to extract data from a database like Vertica for processing in some other engine. We all know that's not the most efficient way to do it. It's much better if you can leverage Vertica scale and bring the processing to the data. So a lot of these tools don't take full advantage of Vertica because there's not really a uniform way to go do so with these standards. 
So instead, we have a project called vertica-ml-python. And this serves as a reference architecture for how you can do scalable machine learning with Vertica. So this project establishes a familiar machine learning workflow that scales with Vertica. So it feels similar to, say, a scikit-learn project, except all the processing and aggregation and heavy lifting and data processing happen in Vertica. So this makes for a much more lightweight, scalable approach than you might otherwise be used to. So with vertica-ml-python, you can probably use this yourself. But you can also see how it works, so if it doesn't meet all your needs, you can still see the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. And so this is an older GitHub project, we've actually had this for a couple of years, but it is really useful and important, so I wanted to plug it here. Our User Defined eXtensions framework, or UDXs, allows you to extend the operators that Vertica executes when it does a database load or a database query. So with UDXs, you can write your own domain logic in C++, Java, Python or R, and you can call it within the context of a SQL query. And Vertica brings your logic to the data, and makes it fast and scalable and fault tolerant and correct for you. So you don't have to worry about all those hard problems. So our UDX examples demonstrate how you can use our SDK to solve interesting problems. And some of these examples are complete, totally usable packages or libraries. So for example, we have a curl source that allows you to extract data from any curlable endpoint and load it into Vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a Vertica query, all kinds of parsers and string processors and things like that. We also have more exciting and interesting things where you might not really think of Vertica being able to do that, like a heat map generator, which takes some XY coordinates and renders them on top of an image to show you the hotspots in it. So the image on the right was actually generated from one of our intern gaming sessions a few years back. So all these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or maybe that are unique to your business and your needs. Another exciting benefit is with testing. So the test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone that's ever built a Vertica integration or an application probably has a need to write some integration tests. And that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic and easy to use. So we've automated the download process and the installation and deployment process of a Vertica Community Edition. And with a single click, you can run through the tests locally and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we're running through all of them very quickly with ease, thanks to all this automation.
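As a tiny sketch of what such an integration test can look like, here's a hedged pytest-style example against a locally running Vertica Community Edition (for instance, one stood up by the kind of automation just described). The connection settings are hypothetical, and this is just the general shape, not the actual vertica-python test suite.

```python
import pytest
import vertica_python

# Hypothetical settings for a local Vertica Community Edition instance.
CONN_INFO = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "VMart"}

@pytest.fixture
def cursor():
    # One connection per test keeps tests isolated and deterministic.
    with vertica_python.connect(**CONN_INFO) as conn:
        yield conn.cursor()

def test_round_trip(cursor):
    # A fixed, data-independent query keeps the assertion stable everywhere.
    cursor.execute("SELECT 1 + 1")
    assert cursor.fetchone()[0] == 2
```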
So today, you can see how we do it in vertica-python; in the future, we might want to spin that out into its own stand-alone testbed starter project, so that if you're starting any new Vertica integration, this might be a good starting point for you to get going quickly. So that brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. So in terms of the client stuff, for Python, we are marching towards our 1.0 release, which is when we aim to be protocol complete, to support all of Vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is a new feature in Vertica 10. We have some cursor enhancements to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it, so send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced authentication methods, and other things. But there too, we want to work with you, and we want to focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's this question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or a robust Vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM or you're a vendor that resells Vertica packaged as a black box piece of a larger solution, you might have to manage the whole operational lifecycle of Vertica. There are even fewer standards for doing all these different things compared to the SQL clients. So we started with the SQL clients, because that's a well established pattern and there's lots of downstream work it can enable. But there's also clearly a need for lots of other open source protocols, architectures and examples to show you how to do these things where we don't have real standards. So we talked a little bit about how you could do UDXs or testing or machine learning, but there are all sorts of other use cases too. That's why we're excited to announce here awesome-vertica, which is a new collection of open source resources available on our GitHub page. So if you haven't heard of the awesome manifesto before, I highly recommend you check out the GitHub page on the right. We're not unique here; there are lots of awesome lists for all kinds of different tools and systems out there. And it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this tool is an open source project, so it's an open source wiki, and you can contribute to it by submitting a PR yourself. So we've seeded it with some of our favorite tools and projects out there, but there's plenty more out there and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.
A Deep Dive into the Vertica Management Console Enhancements and Roadmap
>> Jeff: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "A Deep Dive into the Vertica Management Console Enhancements and Roadmap." I'm Jeff Healey of Vertica Marketing. I'll be your host for this breakout session. Joining me are Bhavik Gandhi and Natalia Stavisky from Vertica engineering. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica Forums at forum.vertica.com and post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going well after the event. Also, a reminder that you can maximize the screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to you on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Bhavik. >> Bhavik: All right. So hello, and welcome, everybody, to this presentation of "Deep Dive into the Vertica Management Console Enhancements and Roadmap." Myself, Bhavik, and my team member, Natalia Stavisky, will go over a few useful announcements about the Vertica Management Console, discussing a few real scenarios. All right. So today we will go forward with a brief introduction to the Management Console, then we will discuss the benefits of using Management Console by going over a couple of user scenarios: a query taking too long to run, and receiving email alerts from Management Console. Then we will go over a few MC features for what we call Eon Mode databases, like provisioning and reviving Eon Mode databases from MC, managing subclusters, and understanding the Depot. Then we will go over some of the future announcements for MC that we are planning. All right, so let's get started. All right. So, do you want to know how to provision a new Vertica cluster from MC? How to analyze and understand a database workload by monitoring the queries on the database? How to balance the resource pools and use alerts and thresholds on MC? So, the Management Console is basically our answer, and we'll talk about its capabilities and new announcements in this presentation. So just to give a brief overview of the Management Console: who uses Management Console? It's generally used by IT administrators and DB admins. Management Console can be used to monitor both Eon Mode and Enterprise Mode databases. Why use Management Console? You can use Management Console for provisioning Vertica databases and clusters. You can manage the already existing Vertica databases and clusters you have, and you can use various tools on Management Console like query execution, Database Designer and Workload Analyzer, and set up alerts and thresholds to get notified about activity on the MC. So let's go over a few benefits of using Management Console. Okay. So using Management Console, you can view and optimize resource pool usage. Management Console helps you to identify some critical conditions on your Vertica cluster.
Additionally, you can set up various thresholds in MC and get alerted if those thresholds are triggered on the database. So now let's dig into the couple of scenarios. So for the first scenario, we will discuss queries taking too long, and using Workload Analyzer to possibly help solve the problem. In the second scenario, we will go over an alert email that you received from your Management Console, analyzing the problem and taking the required actions to solve it. So let's go over the scenario where queries are taking too long to run. So in this example, we have this one query that we are running using the query execution on MC. And for some reason we notice that it's taking about 14.8 seconds to execute this query, which is higher than the expected run time of the query. The query that we are running happens to be the query used by MC during extended monitoring. Notice the table name and the schema name, which is ds_requests_issued, the schema used for extended monitoring. Now in 10.0 MC we have redesigned the Workload Analyzer and Recommendations feature to show the recommendations and allow you to execute those recommendations. In our example, we have taken the table name and filtered the tuning descriptions to see if there are any tuning recommendations related to this table. As we see over here, there are three tuning recommendations available for that table. So now in 10.0 MC, you can select those recommendations and then run them. So let's run the recommendations. All right. So once the recommendations are run successfully, you can go and see all the processed recommendations that you have run previously. Over here we see that the three recommendations we had selected earlier have been successfully processed. Now we take the same query and run it on the query execution on MC and hey, it's running really fast, and we see that it takes only 0.3 seconds to run the query, which is about a 98% decrease from the original runtime of the query. So in this example we saw that using the Workload Analyzer tool on MC you can possibly triage and solve issues for your queries which are taking too long to execute. All right. So now let's go over another user scenario where a DB admin received some alert email messages from MC and would like to understand and analyze the problem. So to know more about what's going on on the database and proactively react to the problems, DB admins using the Management Console can create a set of thresholds and get alerted about the conditions on the database if a threshold value is reached, and then respond to the problem thereafter. Now as a DB admin, I see some email message notifications from MC, and upon checking the emails, I see that there are a couple of email alerts received from MC on my email. So one of the messages that I received was for Query Resource Rejections greater than 5, pool, midpool7. And then around the same time, I received another email from the MC for Failed Queries greater than 5, and in this case I see there are 80 failed queries. So now let's go to the MC and investigate the problem. So before going into the deep investigation about failures, let's review the threshold settings on MC. So as we see, we have set up the thresholds under the database settings page for failed queries in the last 10 minutes greater than 5, and MC should send an email to the individual if the threshold is triggered.
And also we have a threshold set up for query resource rejections in the last five minutes for midpool7, set to greater than 5. There are various other thresholds on this page that you can set if you desire to. Now let's go and triage those email alerts about the failed queries and resource rejections that we had received. To analyze the failed queries, let's take a look at the query statistics page on the database Overview page on MC. Let's take a look at the Resource Pools graph, and especially the failed queries for each resource pool. And over to the right, under the failed query section, I see that in the last 24 hours there are about 6,000 failed queries for midpool7. And now I switch the view to see the statistics for each user, and on this page I see that for user MaryLee, on the right hand side, there is a high number of failed queries in the last 24 hours. And to know more about the failed queries for this user, I can click on the graph for this user and get the reasons behind them. So let's click on the graph and see what's going on. And so clicking on this graph takes me to the failed queries view on the Query Monitoring page for the database, on the Database activities tab. And over here, I see there is a high number of failed queries for this user, MaryLee, with the reason stated as exceeding the high limit. To drill down more and to know more of the reasons behind it, I can click on the plus icon on the left hand side for each failed query to get the failure reason for each node on the database. So let's do that. And clicking the plus icon, I see for the two nodes that are listed, over here it says there are insufficient resources like memory and file handles for midpool7. Now let's go and analyze the midpool7 configuration and the activity on it. So to do so, I will go over to the Resource Pool Monitoring view and select midpool7. I see the resource allocations for this resource pool are very low. For example, the max memory is just 1MB and the max concurrency is set to 0. Hmm, that's a very odd configuration for this resource pool. Also in the bottom right graph for the resource rejections for midpool7, the graph shows very high values for resource rejections. All right. So since we saw some odd configurations and odd resource allocations for midpool7, I would like to see when the settings were changed on the resource pool. So to do this, I can review the audit logs that are available on the Management Console. So I can go into the Vertica Audit Logs and see the logs for the resource pool. So I just (mumbles) the logs, filtering the logs for midpool7. I see that on February 17th, the memory and other attributes for midpool7 were modified. So now let's analyze the resource activity for midpool7 around the time when the configurations were changed. So in our case we are using extended monitoring on MC for this database, so we can go back in time and see the statistics over a larger time range for midpool7. So viewing the activities for midpool7 around February 17th, around the time when these configurations were changed, we see a decrease in resource pool usage. Also, on the bottom right, we see the resource rejections for midpool7 show a linear increase after the configurations were changed. I can select a point on the graph to get more details about the resource rejections.
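As an aside, the same investigation, and the fix applied in the next step, can also be scripted in plain SQL through any client. Here's a minimal sketch using the vertica-python client; the connection settings are hypothetical placeholders and the pool sizes shown are purely illustrative, not a recommendation.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "VMart"}  # hypothetical settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Inspect the suspicious pool configuration.
    cur.execute("""
        SELECT name, memorysize, maxmemorysize, plannedconcurrency, maxconcurrency
        FROM resource_pools
        WHERE name = 'midpool7'
    """)
    print(cur.fetchall())

    # Give the pool a workable budget again (sizes chosen only for the example).
    cur.execute("""
        ALTER RESOURCE POOL midpool7
            MEMORYSIZE '4G'
            MAXMEMORYSIZE '8G'
            PLANNEDCONCURRENCY 4
            MAXCONCURRENCY 8
    """)
    conn.commit()
```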
All right, I will adjust the time range around the time when the configurations were changed for midpool7 and look at the completed queries for user MaryLee. And I see there are no completed queries for this user. Now I'm taking a look at the Failed Queries tab and adjusting the time range around the time when the configurations were changed. I can do so because we are using extended monitoring. So again, adjusting the time, I can see there is a high number of failed queries for this user. There are about 10,000 failed queries for this user after the configurations were changed on this resource pool. So now let's go and modify the settings, since we know that after the configurations were changed, this user was not able to run her queries. So you can change the resource pool settings using Management Console's database settings page, under the Resource Pools tab. So selecting midpool7, I see the same odd configurations for this resource pool that we saw earlier. So now let's go and modify the settings. So I will increase the max memory and modify the settings for midpool7 so that it has adequate resources to run the queries for the user. Hit apply on the top right to save the settings. Now let's do the validation after we change the resource pool attributes. So let's go over to the same query monitoring page and see if the MaryLee user is able to run her queries on midpool7. We see that now, after we changed the configuration for midpool7, the user can run her queries successfully, and the count for Completed Queries has increased after we modified the settings for this midpool7 resource pool. And also viewing the resource pool monitoring page, we can validate that the new configuration for midpool7 has been applied, and also that the resource pool usage after the configuration change has increased. And also on the bottom right graph, we can see that the resource rejections for midpool7 have decreased over time after we modified the settings. And since we are using extended monitoring for this database, I can see the trend in data for this resource pool, the before and after effects of modifying the settings. So initially when the settings were changed, there were high resource rejections, and after we again modified the settings, the resource rejections went down. Right. So now let's go work with provisioning and reviving an Eon Mode Vertica database cluster using the Management Console on different platforms. So Management Console supports provisioning and reviving of Eon Mode databases in various cloud environments like AWS and the Google Cloud Platform, and on Pure Storage. So for Google, for provisioning the Vertica Management Console on Google Cloud Platform you can use a launch template. Or in an AWS environment you can use the CloudFormation templates available for different OSes. Once you have provisioned Vertica Management Console, you can provision the Vertica cluster and databases from MC itself. So to provision a Vertica cluster, you can select the Create new database button available on the homepage. This will open up the wizard to create a new database and cluster. In this example, we are using the Google Cloud Platform. So the wizard will ask me for various authentication parameters for the Google Cloud Platform. And if you're on AWS, it'll ask you for the authentication parameters for the AWS environment. And going forward in the wizard, it'll ask me to select the instance type.
I will select one for the new Vertica cluster. And I'll also provide the communal location URL for my Eon Mode database and all the other preferences related to the new cluster. Once I have selected all the preferences for my new cluster, I can preview the settings and hit Create if all looks okay. So if I hit Create, MC will create new GCP instances, because we are in the GCP environment in this example. It will create a cluster on those instances, and it'll create a Vertica Eon Mode database on that cluster. And additionally, you can load test data onto it if you'd like to. Now let's go over and revive an existing Eon Mode database from a communal location. So you can do this likewise using the Management Console, by selecting the Revive Eon Mode database button on the homepage. This will again open up the wizard, for reviving the Eon Mode database. Again, in this example, since we are using the GCP platform, it will ask me for the Google Cloud Storage authentication attributes. And for reviving, it will ask me for the communal location, so I can enter the Google Storage bucket and my folder, and it will discover all the Eon Mode databases located under this folder. And I can select one of the databases that I would like to revive. And it will ask me for other Vertica preferences for this database revival. And once I enter all the preferences and review them, I can hit the Revive the database button on the wizard. So after I hit Revive database, it will create the GCP instances. The number of GCP instances it creates will be the same as the number of hosts in the original Vertica cluster. It will install Vertica on these instances, it will revive the database, and it will start the database. And after starting the database, it will be imported into the MC so you can start monitoring it. So in this example, we saw that you can provision and revive a Vertica database on the GCP platform. Additionally, you can use the AWS environment to provision and revive. So now, since we have the Eon Mode database on MC, Natalia will go over some Eon Mode features on MC, like managing subclusters and Depot activity monitoring. Over to you, Natalia. >> Natalia: Okay, thank you. Hello, my name is Natalia Stavisky. I am also a member of the Vertica Management Console team. And I will talk today about the work I did to allow users to manage subclusters using the Management Console, and also the work I did to help users understand what's going on in their Depot in the Vertica Eon Mode database. So let's look at the picture of the subclusters. On the Manage page of Vertica Management Console, you can see here a page that has blue tabs, and the tab that's active is Subclusters. You can see that there are two subclusters available in this database. And for each of the subclusters, you can see the subcluster properties, whether this is the primary subcluster or a secondary one. In this case, the primary is the default subcluster; it's indicated by a star. You can see what nodes belong to each subcluster. You can see the node state and node statistics. You can also easily add a new subcluster, and we're quickly going to do this. So once you click on the button, you'll launch the wizard that'll take you through the steps. You'll enter the name of the subcluster and indicate whether this is a secondary or primary subcluster. I should mention that Vertica recommends having only one primary subcluster, but we have both options available here.
You will enter the number of nodes for your subcluster. And once the subcluster has been created, you can manage the subcluster. What other options do we have here for managing subclusters? You can scale up an existing subcluster, and that's a similar approach: you launch the wizard and (mumbles) the nodes you want to add to your existing subcluster. You can scale down a subcluster, and MC validates the requirements for maintaining a minimal number of nodes to prevent database shutdown. So if you cannot remove any nodes from a subcluster, this option will not be available. You can stop a subcluster, and depending on whether this is a primary subcluster or a secondary subcluster, this option may be available or not available. Like in this picture, we can see that for the default subcluster this option is not available, and this is because shutting down the default subcluster will cause the database to shut down as well. You can terminate a subcluster. And again, the MC warns you not to terminate the primary subcluster and validates the requirements for maintaining a minimal number of nodes to prevent database shutdown. So now we are going to talk a little more about how the MC helps you to understand what's going on in your Depot. So the Depot is one of the core components of an Eon Mode database. And what are the frequently asked questions about the Depot? Is the Depot size sufficient? Are a subset of users putting a high load on the database? What tables are fetched and evicted repeatedly, we call it "re-fetched," in the Depot? So here in the Depot Activity Monitoring page, we now have four tabs that allow you to answer those questions. And we'll go a little more in detail through each of them, but I'll just mention what they are for now. At a Glance shows you the basic Depot configuration and also shows you query execution. Depot Efficiency, we'll talk more about that and the other tabs. Depot Content shows you what tables are currently in your Depot. And Depot Pinning allows you to see what pinning policies have been created and to create new pinning policies. Now let's go through a scenario: monitoring the performance of workloads on one subcluster. As you know, an Eon Mode database allows you to have multiple subclusters, and we'll explore how this feature is useful and how we can use the Management Console to make decisions regarding whether you would like to have multiple subclusters. So here we have, in my setup, a single subcluster called default_subcluster. It has two users that are running queries that are accessing tables, mostly in schema public. So the queries started executing, and we can see that after fetching tables from Communal, which is the red line, the rest of the time the queries are executing in Depot. The green line indicates queries running in Depot. The Depot on all nodes is about 88% full, a steady flow, and the Depot size seems to be sufficient for query execution from Depot only. That's the good case scenario. Now at around 17:15, user Sherry got an urgent request to generate a report, and she started running her queries. We can see that the picture is quite different now. The tables Sherry is querying are in a different schema and are much larger. Now we can see multiple lines in different colors. We can see a bunch of fetches and evictions, which are indicated by blue and purple bars, and a lot of queries are now spilling into Communal. This is the red and orange lines. The orange line is an indicator of a query running partially in Depot and partially getting fetched from Communal.
And the red line is data fetched from Communal storage. Let's click on one of the lines. Each data point, each point on the line, will take you to the Query Details page where you can see more about what's going on. So this is the page that shows us what queries have been run in this particular time interval, which is on top of this page in orange color. So that's about a one minute time interval, and now we can see user Sherry among the users that are running queries. Sherry's queries involve large tables and are running against a different schema. We can see the clickstream schema in part of the query request. So what is happening is that there is not enough Depot space for both the schema that's already in use and the one Sherry needs. As a result, evictions and fetches have started occurring. What other questions can we ask ourselves to help us understand what's going on? So how about, what tables are most frequently re-fetched? So for that, we will go to the Depot Efficiency page and look at the middle chart here. We can see the larger version of this chart if we expand it. So now we have 10 tables listed that are most frequently being re-fetched. We can see that there is the clickstream schema and there are other schemas, so all of those tables are being used in the queries, fetched, and then there is not enough space in the Depot, so they get evicted and they get re-fetched again. So what can be done to enable all queries to run in Depot? Option one can be to increase the Depot size. So we can do this by running the following query, which takes (mumbles) the nodes and storage location and the new Depot size. And I should mention that we can run this query from the Management Console, from the query execution page. So this would help us to increase the Depot size. What other options do we have, for example, when increasing the Depot size is not an option? We can also provision a second subcluster to isolate workloads like Sherry's. So we are going to do this now, and we will provision a second subcluster using the Manage page. Here we're creating a subcluster for Sherry, or for workloads like hers. And we're going to create a (mumbles). So Sherry's subcluster has been created. We can see it here, added to the list of the subclusters. It's a secondary subcluster. Sherry has been instructed to use the new SherrySubcluster for her work. Now let's see what happened. We'll go again to the Depot Activity page and we'll look at the At a Glance tab. We can see that around 18:07, Sherry switched to running her queries on SherrySubcluster. On top of this page, you can see the subcluster selected. So we currently have two subclusters, and I'm looking at what happened to SherrySubcluster once it was provisioned. So Sherry started using it, and after the initial fetching from Communal, which was the red line, all of Sherry's queries fit in Depot, which is indicated by the green line. Also the Depot is pretty full on those nodes, about 90% full, but the queries are processed efficiently and there is no spilling into Communal. So that's a good case scenario. Let's now go back and take a look at the original subcluster, the default subcluster. So on the left portion of the chart we can see multiple lines; that was activity before Sherry switched to her own designated subcluster. At around 18:07, after Sherry switched from the default subcluster to using her designated subcluster, she is no longer using the default subcluster and is not putting a load on it.
So the lines after that are turning a green color, which means the queries that are still running in the default subcluster are all running in Depot. We can also see that the Depot fetches and evictions bars, those purple and blue bars, are no longer showing significant numbers. Also we can check the second chart that shows Communal Storage Access, and we can see that the bars have also dropped, so there is no significant access to Communal Storage. So this problem has been solved. Each of the subclusters is serving queries from Depot, and that's our most efficient scenario. Let's also look at the other tabs that we have for Depot monitoring. Let's look at the Depot Efficiency tab. It has six charts, and I'll go through each one of them quickly. File Reads by Location gives an indicator of where the majority of query execution took place: in Depot or in Communal. Top 10 Re-Fetches into Depot, as you can imagine from the charts earlier in our use case, shows tables that are most frequently fetched and evicted and then fetched again. These are good candidates to get pinned if increasing the Depot size is not an option. Note that both of these charts have an option to select a time interval using the calendar widget, so you can get information about the activity that happened during that time interval. Depot Pinning shows what portion of your Depot is pinned, both by byte count and by table count. And the three tables at the bottom show Depot structure: how long tables stay in Depot (we would like tables to be fetched into Depot and stay there for a long time), how often they are accessed (again, we would like to see the tables in Depot accessed frequently), and what the size range of tables in Depot is. Depot Content: this tab allows us to search for tables that are currently in Depot and also to see stats like table size in Depot, how often tables are accessed, and when they were last accessed. And the same information that's available for tables in Depot is also available at the projection and partition levels for those tables. Depot Pinning: this tab allows users to see what policies currently exist, and you can do this by clicking on the first little button and clicking search. This'll show you all existing policies that are already created. The second option allows you to search for a table and create a policy. You can also use the action column to modify existing policies or delete them. And the third option provides details about the most frequently re-fetched tables, including fetch count, total access count, and number of re-fetched bytes. So all this information can help to make decisions regarding pinning specific tables. So that's about it about the Depot. And I should mention that the server team also has a very good webinar presentation on Eon Mode database Depot management and subcluster management; I strongly recommend attending it or downloading the slide presentation. Let's talk quickly about the Management Console roadmap, what we are planning to do in the future. So we are going to continue focusing on subcluster management; there are still a lot of things we can do here. Promoting/demoting subclusters, load balancing across subclusters, scheduling subcluster actions, support for large cluster mode. We'll continue working on Workload Analyzer enhancements and recommendations, and on backup and restore from the MC. Building custom thresholds, and Eon on HDFS support. Okay, so we are ready now to take any questions you may have. Thank you.
The Next-Generation Data Underlying Architecture
>> Paige: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Vertica Next-Generation Architecture." I'm Paige Roberts, Open Source Relations Manager at Vertica, and I'll be your host for this session. And joining me is Vertica Chief Architect, Chuck Bear. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box that's below the slides and click submit. So as you think about it, go ahead and type it in. There'll be a Q&A session at the end of the presentation, where we'll answer as many questions as we're able to during the time. Any questions that we don't get a chance to address, we'll do our best to answer offline. Or alternatively, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forum and keep the conversation going, so it's sort of like what the developer's lounge would be at a live conference. It gives you a chance to talk to our engineering team. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this virtual session is being recorded, and it will be available to view on demand this week. We'll send you a notification as soon as it's ready. Okay, now let's get started. Over to you, Chuck. >> Chuck: Thanks for the introduction, Paige. Vertica's vision is to help customers get value from structured data. This vision is simple: it doesn't matter what vertical the customer is in, they're all analytics companies, and it doesn't matter what the customer's environment is, as data is generated everywhere. We also can't do this alone; we know that you need other tools and people to build a complete solution. You know our database is key to delivering on the vision, because we need a database that scales. When you start a new database company, you aren't going to win against 30 year old products on features. But from day one, we had something else: an architecture built for analytics performance. This architecture was inspired by the C-store project, combining the best design ideas from academics and industry veterans like Dr. Mike Stonebraker. Our storage is optimized for performance, and we use many computers in parallel. After over 10 years of refinements against various customer workloads, much of the design held up, and serendipitously, the fact that we don't do in-place updates set Vertica up for success in the cloud as well. These days, there are other tools that embody some of these design ideas. But we have other strengths that are more important than the storage format: we're the only good analytics database that runs both on premises and in the cloud, giving customers the option to migrate their workloads to the most convenient and economical environment, and a full data management solution, not just a query tool. Unlike some other choices, ours comes with integration with the SQL ecosystem and full professional support. We organize our product roadmap into four key pillars, plus the cross-cutting concerns of open integration and performance and scale. We have big plans to strengthen Vertica, while staying true to our core.
This presentation is primarily about the separation pillar, and performance and scale. I'll cover our plans for Eon, our data management architecture, our analytic clusters, our fifth generation query executer, and our data storage layer. Let's start with how Vertica manages data. One of the central design points for Vertica was shared nothing, a design that didn't utilize dedicated hardware shared disk technology. This quote here is how Mike put it politely, but around the Vertica office, it was shared disk over Mike's dead body. And we did get some early field experience with shared disk; customers, well, in fact will run on anything if you let them. There were misconfigurations that required certified experts, obscure bugs, and so on. Another thing about the shared nothing design for commodity hardware, though, and this was in the papers, is that all the data management features like fault tolerance, backup and elasticity have to be done in software. And no matter how much you do, procuring, configuring and maintaining the machines with disks is harder. The software configuration process to add more servers may be simple, but capacity planning, racking and stacking is not. The original allure of shared storage returned; this time though, the complexity and economics are different. It's cheaper, you can provision storage with a few clicks and only pay for what you need. It expands, contracts, and moves the maintenance of the storage close to a team that is good at it. But there's a key difference: it's an object store, and object stores don't support the APIs and access patterns used by most database software. So another Vertica visionary, Ben, set out to exploit Vertica's storage organization, which turns out to be a natural fit for modern cloud shared storage. Because Vertica data files are written once and not updated, they match the object storage model perfectly. And so today we have Eon. Eon uses shared storage to hold Vertica data, with local disk depots that act as caches, ensuring that we can get the performance that our customers have come to expect. Essentially, Eon and Enterprise behave similarly, but we have the benefit of flexible storage. Today Eon has the features our customers expect; it's been developed and tuned for years, and we have successful customers such as Redpharma, and if you'd like to know more about how Eon has helped them succeed in the Amazon cloud, I highly suggest reading their case study, which you can find on vertica.com. Eon provides high availability and flexible scaling; sometimes on-premise customers with local disks get a little jealous of how recovery and sub-clusters work in Eon. Though Eon does operate on premise, particularly on Pure Storage, Enterprise also has strengths, the most obvious being that you don't need any shared storage to run it. So naturally, our vision is to converge the two modes back into a single Vertica. A Vertica that runs any combination of local disks and shared storage, with full flexibility and portability. This is easy to say, but over the next releases, here's what we'll do. First, we realize that the query executer, optimizer and client drivers and so on are already the same. Just the transaction handling and data management is different. But there's already more going on: we have peer-to-peer depot operations and other internode transfers. And Enterprise also has a network, so we could just get files from remote nodes over that network, essentially mimicking the behavior and benefits of shared storage with a layer of software.
The only difference at the end of it will be which storage holds the master copy. In Enterprise, the nodes can't drop the files, because they're the master copy. Whereas in Eon they can be evicted, because it's just the cache; the master is in shared storage. And in keeping with Vertica's current support for multiple storage locations, we can intermix these approaches at the table level. Getting there is a journey, and we've already taken the first steps. One of the interesting design ideas of the C-store paper is the idea that redundant copies don't have to have the same physical organization. Different copies can be optimized for different queries, sorted in different ways. Of course, Mike also said to keep the recovery system simple, because it's hard to debug; whenever the recovery system is being used, it's always in a high pressure situation. This turns out to be a contradiction, and the latter idea was better. Node-down performance suffers if you don't keep the storage the same, and recovery is harder if you have to reorganize data in the process. Even query optimization is more complicated. So over the past couple releases, we got rid of non-identical buddies. But the storage files can still diverge at the file level, because tuple mover operations aren't synchronized. The same record can end up in different files on different nodes. The next step in our journey is to make sure both copies are identical. This will help with backup and restore as well, because the second copy doesn't need to be backed up, or if it is backed up, it appears identical to the deduplication that's likely present in both backup systems. Simultaneously, we're improving the Vertica networking service to support this new access pattern. In conjunction with identical storage files, we will converge to a recovery system where recovering nodes can process queries immediately, by retrieving the data they need over the network from the redundant copies, as they do in Eon today, with even higher performance. The final step then is to unify the catalog and transaction model. Related concepts such as segment and shard, local catalog and shard catalog will be coalesced, as they really represented the same concepts all along, just in different modes. In the catalog, we'll make slight changes to the definition of a projection, which represents the physical storage organization. The new definition simplifies segmentation and introduces valuable granularities of sharding to support evolution over time, and offers a straightforward migration path for both Eon and Enterprise. There's a lot more to our Eon story than just the architectural roadmap. If you missed yesterday's Vertica in Eon Mode presentation about supported cloud and on-premise storage options, replays are available. Be sure to catch the upcoming presentation on sizing and configuring Vertica in Eon Mode and beyond. As we've seen with Eon, Vertica can separate data storage from the compute nodes, allowing machines to quickly fill in for each other to rebuild fault tolerance. But separating compute and storage is used for much, much more. We now offer powerful, flexible ways for Vertica to add servers and increase access to the data. In Vertica 9, this feature is called sub-clusters. It allows computing capacity to be added quickly and incrementally, and isolates workloads from each other.
If your exploratory analytics team needs direct access to the source data, they need a lot of machines and not the same number all the time, and you don't 100% trust the kind of queries and user defined functions they might be using, sub-clusters are the solution. There's much more extensive information available in our other presentation, but I'd like to point out the highlights of our latest sub-cluster best practices. We suggest having a primary sub-cluster; this is the one that runs all the time, if you're loading data around the clock. It should be sized for the ETL workloads, and it also determines the natural shard count. Additional read-oriented secondary sub-clusters can be added for real time dashboards, reports and analytics. That way, sub-clusters can be added or deprovisioned without disruption to other users. The sub-cluster features of Vertica 9.3 are working well for customers. Yesterday, the Trade Desk presented their use case for Vertica, over 300,000 in 5 sub-clusters running in the cloud. If you missed the presentation, check out the replay. But we have plans beyond sub-clusters; we're extending sub-clusters to real clusters. For the Vertica savvy, this means the clusters won't share the same spread ring network. This will provide further isolation, allowing clusters to control their own independent data sets, while replicating all or part of the data from other clusters using a publish-subscribe mechanism. Synchronizing data between clusters is a feature customers want for real business reasons they understand for themselves. This vision affects our design for ancillary aspects: how we should assign resource pools, security policies and balance client connections. We will be simplifying our data segmentation strategy, so that when data that originated in different clusters meet, they'll still get fully optimized joins, even if those clusters weren't provisioned with the same number of nodes per shard. Having a broad vision for data management is a key component of Vertica's success. But we also take pride in our execution strategy. When you start a new database from scratch as we did 15 years ago, you won't compete on features. Our key competitive points were speed and scale of analytics; we set a target of 100x better query performance than traditional databases, with fast loads. Our storage architecture provides a solid foundation on which to build toward these goals. Every query starts with data retrieval: keeping data sorted, organized by column and compressed, using adaptive encodings, keeps the data retrieval time and IO to the bare minimum theoretically required. We also keep the data close to where it will be processed, and use clusters of machines to increase throughput. We have partition pruning, a robust optimizer, and we evaluate active use of segmentation as part of the physical database design to keep records close to the other relevant records. So that's a solid foundation, but we also need optimal execution strategies and tactics. One execution strategy which we built a long time ago, but is still a source of pride, is how we process expressions. Databases and other systems with general purpose expression evaluators parse a compound expression into a tree. Here I'm using (A + 1) * B as an example. During execution, the CPU traverses the tree and computes sub-parts before the whole. Tree traversal often takes more compute cycles than the actual work to be done. Expression evaluation is a very common operation, so it's something worth optimizing.
One instinct that engineers have is to use what we call just-in-time or JIT compilation, which means generating code for the CPU for the specific expression at hand, and running that. This replaces the tree of boxes with a custom-made box for the query. This approach has complexity and bugs, but it can be made to work. It has other drawbacks, though: it adds a lot to query setup time, especially for short queries, and it pretty much eliminates the ability of mere mortals to develop user defined functions. If you go back to the problem we're trying to solve, the source of the overhead is the tree traversal. If you increase the batch of records processed in each traversal step, this overhead is amortized until it becomes negligible. It's a perfect match for a columnar storage engine. This also sets the CPU up for efficiency. CPUs are particularly good at following the same small sequence of instructions in a tight loop. In some cases, the CPU may even be able to vectorize, and apply the same processing to multiple records with the same instruction. This approach is easy to implement and debug, user defined functions are possible, and it's generally aligned with the other complexities of implementing and improving a large system. More importantly, the performance, both in terms of query setup and record throughput, is dramatically improved. You'll hear me say that we look at research and industry for inspiration. In this case, our findings are in line with academic findings. If you'd like to read papers, I recommend "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" (and yes, we did have this idea before we read that paper). However, not every decision we made in the Vertica executer passed the test of time as well as the expression evaluator. For example, sorting and grouping aren't as susceptible to vectorization, because sort decisions interrupt the flow. We have used JIT compiling on that for years, going back to around Vertica 4, and it provides modest speedups, but we know we can do even better. So we've embarked on a new design for the execution engine, which I call EE5, because it's our fifth. It's really designed especially for the cloud. Now I know what you're thinking: I just put up a slide with an old engine, a new engine, and a sleek plane headed up into the clouds. But this isn't just marketing hype. Here's what I mean when I say we've learned lessons over the years, and that we're redesigning the executer for the cloud. And of course, you'll see that the new design works well on premises as well; these changes are just more important for the cloud. Starting with the network layer: in the cloud, we can't count on all nodes being connected to the same switch. Multicast doesn't work like it does in a custom data center, so as I mentioned earlier, we're redesigning the network transfer layer for the cloud. Storage in the cloud is different, and I'm not referring here to the storage of persistent data, but to the storage of temporary data used only once during the course of query execution. Our new pattern is designed to take into account the strengths and weaknesses of cloud object storage, where we can't easily do appends. Moving on to memory: many of our access patterns that are reasonably effective on bare metal machines aren't the best choice on cloud hypervisors that have overheads like page faults. Here again, we found we can improve performance a bit on dedicated hardware, and even more in the cloud.
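To make the expression evaluation discussion above concrete, here's a toy Python sketch (purely illustrative, not Vertica's executer) that interprets the tree for (A + 1) * B once per record versus once per batch of column values. The batched version pays the tree-walking overhead per node per batch instead of per node per row, which is the amortization described above.

```python
# Expression tree for (A + 1) * B, represented as nested tuples.
EXPR = ("mul", ("add", ("col", "A"), ("const", 1)), ("col", "B"))

def eval_row(node, row):
    # Walk the whole tree for every single record.
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "col":
        return row[node[1]]
    left, right = eval_row(node[1], row), eval_row(node[2], row)
    return left + right if kind == "add" else left * right

def eval_batch(node, batch):
    # Walk the tree once per batch; each node runs a tight loop over a column.
    kind = node[0]
    if kind == "const":
        n = len(next(iter(batch.values())))
        return [node[1]] * n
    if kind == "col":
        return batch[node[1]]
    left, right = eval_batch(node[1], batch), eval_batch(node[2], batch)
    op = (lambda x, y: x + y) if kind == "add" else (lambda x, y: x * y)
    return [op(x, y) for x, y in zip(left, right)]

rows = [{"A": i, "B": i % 7} for i in range(100_000)]
batch = {"A": [r["A"] for r in rows], "B": [r["B"] for r in rows]}
assert [eval_row(EXPR, r) for r in rows] == eval_batch(EXPR, batch)
```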
Finally, and this is true in all environments, core counts have gone up, and not all of our algorithms take full advantage. There's a lot of ground to cover here, but I think sorting is the perfect example to illustrate these points. I mentioned that we use JIT in sorting. We're getting rid of JIT in favor of a data format that can be treated efficiently, independent of what the data types are. We've drawn on the best, most modern technology from academia and industry, and we've done our own analysis and testing. You know what we chose: we chose parallel merge sort. Anyone want to take a guess when merge sort was invented? It was invented in 1948, or at least documented that way, in a computing context. If you've heard me talk before, you know that I'm fascinated by how all the things I worked with as an engineer were invented before I was born. And in Vertica, we don't use the newest technologies, we use the best ones. And what is notable about Vertica is the way we've combined the best ideas together into a cohesive package. So all kidding about the 1940s aside, our redesign is actually state of the art. How do we know the sort routine is state of the art? It turns out there's a pretty credible benchmark at the appropriately named sortbenchmark.org. Anyone with resources looking for fame for their product or academic paper can try to set the record. The record was last set in 2016 with Tencent Sort, 100 terabytes in 99 seconds. Setting the record is hard; you have to come up with hundreds of machines on a dedicated high speed switching fabric. There's a lot to a distributed sort, but they all have core sorting algorithms. The authors of the paper conveniently broke out the time spent in their sort: 67 out of 99 seconds went to local sorting. If we break this out, divided by the two CPUs in each of 512 nodes, we find that each CPU sorted almost a gig and a half per second. This is for what's called an indy sort, which, like an Indy race car, isn't general purpose: it only handles fixed hundred-byte records with a 10-byte key. If the record length can vary, then it's called a daytona sort, and the Tencent daytona sort is a little slower, about 1.1 gigabytes per second per CPU. Now for Vertica, we have wide variability in record sizes and more interesting data types, but still, there's no harm in setting numbers like these next to the world record. On my 2017-era AMD desktop CPU, the Vertica EE5 sort does about two and a half gigabytes per second. Obviously, this isn't an apples-to-apples comparison, because they used their own OpenPOWER chips. But the number of DRAM channels is the same, so it's close enough to say we've hit on the right approach. And it performs this way on premise, in the cloud, and we can adapt it to cloud temp space. So what's our roadmap for integrating EE5 into the product? I compare replacing the query executer in the database to replacing the crankshaft and other parts of the engine of a car while it's being driven. We've actually done it before, between Vertica 3.5 and 5, and then we never really stopped changing it; now we'll do it again. The first part we're replacing is the algorithm called storage merge, which combines sorted data from disk. The first piece of that is in the incoming 10.0 patch, which will have the EE5 resegmented storage merge, and then we'll convert sorting and grouping as we go.
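As a tiny illustration of what a storage merge does, here's a Python sketch (standard library only, purely illustrative and nothing like Vertica's C++ implementation) that combines several already-sorted runs, like sorted files coming off disk, into one sorted stream without re-sorting everything.

```python
import heapq

# Three already-sorted "runs", standing in for sorted storage files on disk.
run1 = [1, 4, 9, 12]
run2 = [2, 3, 10]
run3 = [5, 6, 7, 11]

# heapq.merge does a lazy k-way merge: it only ever compares the current
# head of each run, which is the essence of a storage merge.
merged = list(heapq.merge(run1, run2, run3))
assert merged == sorted(run1 + run2 + run3)
print(merged)
```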
Here are the performance results so far. In cases where the Vertica executor is doing well today, simple environments with simple data patterns, such as this simple analytic query, there's already a nice speedup. When we ship the resegmentation code, which didn't quite make the freeze for 10.0, we expect a much larger bump, and longer term, when we move grouping into the storage merge operations, we'll get to where we think we ought to be, given the theoretical minimum work the CPUs need to do. Now if we look at a case where the current executor isn't doing as well, we see there's a much stronger benefit from the code shipping in Vertica 10. In fact, I turned the bar chart sideways to try to help you see the difference better. This case will also benefit from the improvements coming in 10.x point releases and beyond. There's a lot more happening in the Vertica query executor; that was just a taste. But now I'd like to switch to the roadmap for our storage layer, and I'll start with a story about how our storage access layer evolved. If you go back to the academic C-Store paper that persuaded investors to fund Vertica, the read optimized store was the part that had substantiation in the form of performance data. Much of the paper was speculative, but we tried to follow it anyway. That paper talked about the WS and the RS, the write store and the read store, and how they were supposed to work together for transaction processing. In all honesty, Vertica engineers couldn't figure out from the paper what to build, in case you want to try, and when we asked the authors what they intended, we never got enough clarification to build it that way. But here's what we built instead. We built the ROS, the read optimized store. It's sorted, columnar, and compressed, and it follows the table partitioning; it worked even better than the RS as described in the paper. We also built the WOS, the write optimized store. We built four versions of this over the years, actually, and this was the best one. It's not a set of interrelated B-trees, it's just an append-only, insertion-order, in-memory row store: no sorting, no compression, no partitioning. There is, however, a tuple mover, which does what we call moveout: it moves data from the WOS to the ROS, sorting and compressing it along the way. Let's take a moment to compare how they behave. When you load data directly into the ROS, there's a data parsing operation, then we do the sorting and compressing, and then we write out the columnar data files to stable storage. The next query that comes through executes against the ROS, and it runs as it should, because the ROS is read optimized. Let's repeat the exercise for the WOS. The load operation responds before the sorting and compressing, and before the data is written to persistent storage. Now it's possible for a query to come along, and that query could be responsible for sorting the WOS data in addition to its other processing. The effect on queries isn't predictable until the tuple mover comes along and writes the data to the ROS. Over the years, we've done a lot of comparisons between ROS and WOS. The ROS has always been better for sustained load throughput; it achieves much higher records per second without pushing back against the client, and it has ever since we developed the first usable mergeout algorithm. The ROS has always been better for predictable query performance, and it has never had the same management complexity and limitations as the WOS, where you have to pick a memory size and figure out which transactions get to use the pool. 
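Here is a toy model, entirely my own and not Vertica internals, of the two load paths just described: a direct ROS load pays the sort and compress cost before it returns, while a WOS load returns early and leaves that work to a moveout step.

```python
# A toy model of the two load paths described above (not Vertica code).
import zlib

class Ros:
    """Read optimized store: sorted, compressed, persisted at load time."""
    def __init__(self):
        self.containers = []              # each entry: compressed, sorted rows
    def load(self, rows):
        data = "\n".join(str(r) for r in sorted(rows)).encode()
        self.containers.append(zlib.compress(data))
    def scan(self):
        for c in self.containers:
            yield from zlib.decompress(c).decode().splitlines()

class Wos:
    """Write optimized store: append only, in memory, unsorted."""
    def __init__(self, ros):
        self.buffer, self.ros = [], ros
    def load(self, rows):
        self.buffer.extend(rows)          # returns quickly; no sort, no compress
    def moveout(self):
        self.ros.load(self.buffer)        # the tuple mover pays the deferred cost
        self.buffer.clear()

ros = Ros()
wos = Wos(ros)
ros.load([3, 1, 2])                       # query-ready as soon as the load returns
wos.load([9, 7, 8])                       # fast load, but queries see unsorted data
wos.moveout()                             # until moveout sorts and persists it
print(list(ros.scan()))
```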
The non-persistent nature of the WOS always caused headaches when there were unexpected cluster shutdowns. We also looked at field usage data, and we found that few customers were using the WOS much, especially among those that had studied the issue carefully. So we set out on a mission to improve the ROS to the point where it was always better than both the WOS and the ROS of the past. And now it's true: the ROS is better than the WOS, and better than the ROS of a couple of years ago. We implemented storage bundling, better catalog object handling, and better tuple mover mergeouts. And now, after extensive QA and customer testing, we've succeeded, and in Vertica 10 we've removed the WOS. Let's talk for a moment about simplicity. One of the best things Mike Stonebraker said is "no knobs." Anyone want to guess how many knobs we got rid of when we took the WOS out of the product? Twenty-two. There were five knobs to control whether data went to the WOS or the ROS, six controlling the WOS itself, six more to set policies for the tuple mover moveout, and so on. In my honest opinion, it still wasn't enough control to achieve success in a multi-tenant environment, but the big reason to get rid of the WOS is simplicity: make the lives of DBAs and users better. We have a long way to go, but we're doing it. On my desk, I keep a jar with a knob in it for each knob in Vertica. When developers add a knob to the product, they have to add a knob to the jar. When they remove one, they get to choose one to take out. We have a lot of work to do, but I'm thrilled to report that, in 15 years, Vertica 10 is the first release where the number of knobs ticked downward. Getting back to the WOS, I've saved the most important reason to get rid of it for last. We're getting rid of it so we can deliver our vision of the future to our customers. Remember how I said that with Eon mode and subclusters we got all these benefits from shared storage? Guess what can't live in shared storage: the WOS. Remember how a big part of the future was keeping the secondary copies identical to the primary copy? Independent actions of the WOS were at the root of divergence between copies of the data. You have to admit it when you're wrong. The WOS was in the original design, and it held up as a selling point for a time, but we held onto the idea of a separate ROS and WOS for too long. In Vertica 10, we can finally bid it good riddance. I've covered a lot of ground, so let's put all the pieces together. I've talked a lot about our vision and how we're achieving it, but we also still pay attention to tactical detail. We've been fine-tuning our memory management model to enhance performance. That involves revisiting tens of thousands of lines of code, much like painting the inside of a large building with small paintbrushes. We're getting results, as shown in the chart: in Vertica 9, concurrent monitoring queries used memory from the global catalog pool; in Vertica 10, they don't. This is only one example of an important detail we're improving. We've also reworked the monitoring tables and the network messages behind them, splitting them into two parts, and we've increased the data we're collecting and analyzing in our quality assurance processes. We're improving on everything. As the story goes, I still have my grandfather's axe; of course, my father had to replace the handle, and I had to replace the head. Along the same lines, we still have Mike Stonebraker's Vertica. We've replaced the query optimizer twice, and the database designer and storage layer four times each. 
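As a hedged illustration of watching that kind of memory behavior yourself, here is a sketch using the open-source vertica-python client to look at resource pool usage. The connection details are placeholders, and the system table and column names are from memory of Vertica's v_monitor schema, so they may differ by version.

```python
# Hedged sketch: peek at resource pool memory usage with vertica-python.
# Table/column names are assumed from the v_monitor schema and may vary.
import vertica_python

conn_info = {
    "host": "vertica.example.com",   # placeholder connection details
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(
        "SELECT pool_name, memory_inuse_kb, general_memory_borrowed_kb "
        "FROM v_monitor.resource_pool_status ORDER BY memory_inuse_kb DESC"
    )
    for pool_name, in_use, borrowed in cur.fetchall():
        print(f"{pool_name}: {in_use} KB in use, {borrowed} KB borrowed")
```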
The query executor is now on its fifth design. I charted out how our code has changed over the years, and I found that we don't have much left from a long time ago. I did some digging, and you know what we still have from 2007? We have the original curly braces, and a little bit of parsing code for handling dates and times. To deliver on our mission, which is to help customers get value from their structured data, with high performance, at scale, and in diverse deployment environments, we have a sound architectural roadmap, the best execution strategy, and solid tactics. On the architectural front, we're converging Eon and Enterprise, and we're extending smart analytic subclusters. In query processing, we're redesigning the execution engine for the cloud, as I've told you. And there's a lot more than just a fast engine: if you want to learn about our new support for complex data types, improvements to query optimizer statistics, or extensions to live aggregate projections and flattened tables, you should check out some of the other engineering talks at the Big Data Conference. We continue to stay on top of the details, from low-level CPU and memory behavior to monitoring and management, developing tighter feedback cycles between development, QA, and customers. And don't forget to check out the rest of the pillars of our roadmap. We have new, easier ways to get started with Vertica in the cloud. Engineers have been hard at work on machine learning and security. It's easier than ever to use Vertica with third-party products, as the variety of tool integrations continues to increase. Finally, the most important thing we can do is to help people get value from structured data, and to help people learn more about Vertica. So hopefully I've left plenty of time for Q&A at the end of this presentation. I hope to hear your questions soon.
Vertica @ Uber Scale
>> Sue: Hi, everybody. Thank you for joining us today, for the Virtual Vertica BDC 2020. This breakout session is entitled "Vertica @ Uber Scale" My name is Sue LeClaire, Director of Marketing at Vertica. And I'll be your host for this webinar. Joining me is Girish Baliga, Director I'm sorry, user, Uber Engineering Manager of Big Data at Uber. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q and A session, at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternately, you can also Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. And as a reminder, you can maximize your screen by clicking the double arrow button, in the lower right corner of the slides. And yet, this virtual session is being recorded, and you'll be able to view on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Girish over to you. >> Girish: Thanks a lot Sue. Good afternoon, everyone. Thanks a lot for joining this session. My name is Girish Baliga. And as Sue mentioned, I manage interactive and real time analytics teams at Uber. Vertica is one of the main platforms that we support, and Vertica powers a lot of core business use cases. In today's talk, I wanted to cover two main things. First, how Vertica is powering critical business use cases, across a variety of orgs in the company. And second, how we are able to do this at scale and with reliability, using some of the additional functionalities and systems that we have built into the Vertica ecosystem at Uber. And towards the end, I also have a little extra bonus for all of you. I will be sharing an easy way for you to take advantage of, many of the ideas and solutions that I'm going to present today, that you can apply to your own Vertica deployments in your companies. So stick around and put on your seat belts, and let's go start on the ride. At Uber, our mission is to ignite opportunity by setting the world in motion. So we are focused on solving mobility problems, and enabling people all over the world to solve their local problems, their local needs, their local issues, in a manner that's efficient, fast and reliable. As our CEO Dara has said, we want to become the mobile operating system of local cities and communities throughout the world. As of today, Uber is operational in over 10,000 cities around the world. So, across our various business lines, we have over 110 million monthly users, who use our rides, services, or eat services, and a whole bunch of other services that we provide to Uber. And just to give you a scale of our daily operations, we in the ride business, have over 20 million trips per day. And that each business is also catching up, particularly during the recent times that we've been having. And so, I hope these numbers give you a scale of the amount of data, that we process each and every day. And support our users in their analytical and business reporting needs. So who are these users at Uber? Let's take a quick look. So, Uber to describe it very briefly, is a lot like Amazon. We are largely an operation and logistics company. And employee work based reflects that. 
So over 70% of our employees work in teams, which come under the umbrella of Community Operations and Centers of Excellence. So these are all folks working in various cities and towns that we operate around the world, and run the Uber businesses, as somewhat local businesses responding to local needs, local market conditions, local regulation and so forth. And Vertica is one of the most important tools, that these folks use in their day to day business activities. So they use Vertica to get insights into how their businesses are going, to deeply into any issues that they want to triage , to generate reports, to plan for the future, a whole lot of use cases. The second big class of users, are in our marketplace team. So marketplace is the engineering team, that backs our ride shared business. And as part of this, running this business, a key problem that they have to solve, is how to determine what prices to set, for particular rides, so that we have a good match between supply and demand. So obviously the real time pricing decisions they're made by serving systems, with very detailed and well crafted machine learning models. However, the training data that goes into this models, the historical trends, the insights that go into building these models, a lot of these things are powered by the data that we store, and serve out of Vertica. Similarly, in each business, we have use cases spanning all the way from engineering and back-end systems, to support operations, incentives, growth, and a whole bunch of other domains. So the big class of applications that we support across a lot of these business lines, is dashboards and reporting. So we have a lot of dashboards, which are built by core data analysts teams and shared with a whole bunch of our operations and other teams. So these are dashboards and reports that run, periodically say once a week or once a day even, depending on the frequency of data that they need. And many of these are powered by the data, and the analytics support that we provide on our Vertica platform. Another big category of use cases is for growth marketing. So this is to understand historical trends, figure out what are various business lines, various customer segments, various geographical areas, doing in terms of growth, where it is necessary for us to reinvest or provide some additional incentives, or marketing support, and so forth. So the analysis that backs a lot of these decisions, is powered by queries running on Vertica. And finally, the heart and soul of Uber is data science. So data science is, how we provide best in class algorithms, pricing, and matching. And a lot of the analysis that goes into, figuring out how to build these systems, how to build the models, how to build the various coefficients and parameters that go into making real time decisions, are based on analysis that data scientists run on Vertica systems. So as you can see, Vertica usage spans a whole bunch of organizations and users, all across the different Uber teams and ecosystems. Just to give you some quick numbers, we have over 5000 weekly active, people who run queries at least once a week, to do some critical business role or problem to solve, that they have in their day to day operations. So next, let's see how Vertica fits into the Uber data ecosystem. So when users open up their apps, and request for a ride or order food delivery on each platform, the apps are talking to our serving systems. 
And the serving systems use online storage systems, to store the data as the trips and eat orders are getting processed in real time. So for this, we primarily use an in house built, key value storage system called Schemaless, and an open source system called Cassandra. We also have other systems like MySQL and Redis, which we use for storing various bits of data to support serving systems. So all of this operations generates a lot of data, that we then want to process and analyze, and use for our operational improvements. So, we have ingestion systems that periodically pull in data from our serving systems and land them in our data lake. So at Uber a data lake is powered by Hadoop, with files stored on HDFS clusters. So once the raw data lines on the data lake, we then have ETL jobs that process these raw datasets, and generate, modeled and customize datasets which we then use for further analysis. So once these model datasets are available, we load them into our data warehouse, which is entirely powered by Vertica. So then we have a business intelligence layer. So with internal tools, like QueryBuilder, which is a UI interface to write queries, and look at results. And it read over the front-end sites, and Dashbuilder, which is a dash, board building tool, and report management tool. So these are all various tools that we have built within Uber. And these can talk to Vertica and run SQL queries to power, whatever, dashboards and reports that they are supporting. So this is what the data ecosystem looks like at Uber. So why Vertica and what does it really do for us? So it powers insights, that we show on dashboards as folks use, and it also powers reports that we run periodically. But more importantly, we have some core, properties and core feature sets that Vertica provides, which allows us to support many of these use cases, very well and at scale. So let me take a brief tour of what these are. So as I mentioned, Vertica powers Uber's data warehouse. So what this means is that we load our core fact and dimension tables onto Vertica. The core fact tables are all the trips, all the each orders and all these other line items for various businesses from Uber, stored as partitioned tables. So think of having one partition per day, as well as dimension tables like cities, users, riders, career partners and so forth. So we have both these two kinds of datasets, which will load into Vertica. And we have full historical data, all the way since we launched these businesses to today. So that folks can do deeper longitudinal analysis, so they can look at patterns, like how the business has grown from month to month, year to year, the same month, over a year, over multiple years, and so forth. And, the really powerful thing about Vertica, is that most of these queries, you run the deep longitudinal queries, run very, very fast. And that's really why we love Vertica. Because we see query latency P90s. That is 90 percentile of all queries that we run on our platform, typically finish in under a minute. So that's very important for us because Vertica is used, primarily for interactive analytics use cases. And providing SQL query execution times under a minute, is critical for our users and business owners to get the most out of analytics and Big Data platforms. Vertica also provides a few advanced features that we use very heavily. So as you might imagine, at Uber, one of the most important set of use cases we have is around geospatial analytics. 
In particular, we have some critical internal dashboards, that rely very heavily on being able to restrict datasets by geographic areas, cities, source destination pairs, heat maps, and so forth. And Vertica has a rich array of functions that we use very heavily. We also have, support for custom projections in Vertica. And this really helps us, have very good performance for critical datasets. So for instance, in some of our core fact tables, we have done a lot of query and analysis to figure out, how users run their queries, what kind of columns they use, what combination of columns they use, and what joints they do for typical queries. And then we have laid out our custom projections to maximize performance on these particular dimensions. And the ability to do that through Vertica, is very valuable for us. So we've also had some very successful collaborations, with the Vertica engineering team. About a year and a half back, we had open-sourced a Python Client, that we had built in house to talk to Vertica. We were using this Python Client in our business intelligence layer that I'd shown on the previous slide. And we had open-sourced it after working closely with Eng team. And now Vertica formally supports the Python Client as an open-source project, which you can download to and integrate into your systems. Another more recent example of collaboration is the Vertica Eon mode on GCP. So as most of or at least some of you know, Vertica Eon mode is formally supported on AWS. And at Uber, we were also looking to see if we could run our data infrastructure on GCP. So Vertica team hustled on this, and provided us early preview version, which we've been testing out to see how performance, is impacted by running on the Cloud, and on GCP. And so far, I think things are going pretty well, but we should have some numbers about this very soon. So here I have a visualization of an internal dashboard, that is powered solely by data and queries running on Vertica. So this GIF has sequence have different visualizations supported by this tool. So for instance, here you see a heat map, downgrading heat map of source of traffic demand for ride shares. And then you will see a bunch of arrows here about source destination pairs and the trip lines. And then you can see how demand moves around. So, as the cycles through the various animations, you can basically see all the different kinds of insights, and query shapes that we send to Vertica, which powers this critical business dashboard for our operations teams. All right, so now how do we do all of this at scale? So, we started off with a single Vertica cluster, a few years back. So we had our data lake, the data would land into Vertica. So these are the core fact and dimension tables that I just spoke about. And then Vertica powers queries at our business intelligence layer, right? So this is a very simple, and effective architecture for most use cases. But at Uber scale, we ran into a few problems. So the first issue that we have is that, Uber is a pretty big company at this point, with a lot of users sending almost millions of queries every week. And at that scale, what we began to see was that a single cluster was not able to handle all the query traffic. So for those of you who have done an introductory course, on queueing theory, you will realize that basically, even though you could have all the query is processed through a single serving system. You will tend to see larger and larger queue wait times, as the number of queries pile up. 
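As a hedged sketch of what a BI-layer tool built on that open-sourced Python client might look like, here is a minimal example using vertica-python. The endpoint, credentials, table, and column names are hypothetical stand-ins, not Uber's actual schema.

```python
# Minimal sketch of a dashboard-style query through vertica-python.
# All identifiers below are illustrative placeholders.
import vertica_python

conn_info = {
    "host": "vertica-proxy.internal",  # placeholder endpoint
    "port": 5433,
    "user": "dashboard_svc",
    "password": "secret",
    "database": "warehouse",
}

query = """
    SELECT city_id, DATE_TRUNC('day', requested_at) AS day, COUNT(*) AS trips
    FROM fact_trips
    WHERE requested_at >= CURRENT_DATE - 30
    GROUP BY 1, 2
    ORDER BY 2, 1
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(query)
    for city_id, day, trips in cur.fetchall():
        print(city_id, day, trips)
```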
And what this means in practice for end users, is that they are basically just seeing longer and longer query latencies. But even though the actual query execution time on Vertica itself, is probably less than a minute, their query sitting in the queue for a bunch of minutes, and that's the end user perceived latency. So this was a huge problem for us. The second problem we had was that the cluster becomes a single point of failure. Now Vertica can handle single node failures very gracefully, and it can probably also handle like two or three node failures depending on your cluster size and your application. But very soon, you will see that, when you basically have beyond a certain number of failures or nodes in maintenance, then your cluster will probably need to be restarted or you will start seeing some down times due to other issues. So another example of why you would have to have a downtime, is when you're upgrading software in your clusters. So, essentially we're a global company, and we have users all around the world, we really cannot afford to have downtime, even for one hour slot. So that turned out to be a big problem for us. And as I mentioned, we could have hardware issues. So we we might need to upgrade our machines, or we might need to replace storage or memory due to issues with the hardware in there, due to normal wear and tear, or due to abnormal issues. And so because of all of these things, having a single point of failure, having a single cluster was not really practical for us. So the next thing we did, was we set up multiple clusters, right? So we had a bunch of identities clusters, all of which have the same datasets. So then we would basically load data using ingestion pipelines from our data lake, onto each of these clusters. And then the business intelligence layer would be able to query any of these clusters. So this actually solved most of the issues that I pointed out in the previous slide. So we no longer had a single point of failure. Anytime we had to do version upgrades, we would just take off one cluster offline, upgrade the software on it. If we had node failures, we would probably just take out one cluster, if we had to, or we would just have some spare nodes, which would rotate into our production clusters and so forth. However, having multiple clusters, led to a new set of issues. So the first problem was that since we have multiple clusters, you would end up with inconsistent schema. So one of the things to understand about our platform, is that we are an infrastructure team. So we don't actually own or manage any of the data that is served on Vertica clusters. So we have dataset owners and publishers, who manage their own datasets. Now exposing multiple clusters to these dataset owners. Turns out, it's not a great idea, right? Because they are not really aware of, the importance of having consistency of schemas and datasets across different clusters. So over time, what we saw was that the schema for the same tables would basically get out of order, because they were all the updates are not consistently applied on all clusters. Or maybe they were just experimenting some new columns or some new tables in one cluster, but they forgot to delete it, whatever the case might be. We basically ended up in a situation where, we saw a lot of inconsistent schemas, even across some of our core tables in our different clusters. 
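The queueing effect Girish describes can be illustrated with the textbook M/M/1 model. This little Python calculation is a back-of-the-envelope sketch, not a model of Uber's actual traffic, but it shows how queue wait explodes as a single cluster nears saturation, even when each query itself runs quickly.

```python
# Back-of-the-envelope M/M/1 illustration of queue wait vs. load.
def mm1_wait_minutes(arrival_per_min, service_minutes):
    """Average time a query waits in queue before it starts executing."""
    mu = 1.0 / service_minutes            # service rate (queries per minute)
    rho = arrival_per_min / mu            # utilization
    if rho >= 1.0:
        return float("inf")               # queue grows without bound
    return rho / (mu - arrival_per_min)   # M/M/1 mean queue wait

service = 0.5                             # each query takes ~30 seconds
for arrivals in (1.0, 1.5, 1.8, 1.95):
    wait = mm1_wait_minutes(arrivals, service)
    print(f"{arrivals:.2f} queries/min -> avg queue wait {wait:.2f} min")
```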
A second issue was, since we had ingestion pipelines that were ingesting data independently into all these clusters, these pipelines could fail independently as well. So what this meant is that if, for instance, the ingestion pipeline into cluster B failed, then the data there would be older than clusters A and C. So, when a query comes in from the BI layer, and if it happens to hit B, you would probably see different results, than you would if you went to a or C. And this was obviously not an ideal situation for our end users, because they would end up seeing slightly inconsistent, slightly different counts. But then that would lead to a bad situation for them where they would not able to fully trust the data that was, and the results and insights that were being returned by the SQL queries and Vertica systems. And then the third problem was, we had a lot of extra replication. So the 20/80 Rule, or maybe even the 90/10 Rule, applies to datasets on our clusters as well. So less than 10% of our datasets, for instance, in 90% of the queries, right? And so it doesn't really make sense for us to replicate all of our data on all the clusters. And so having this set up where we had to do that, was obviously very suboptimal for us. So then what we did, was we basically built some additional systems to solve these problems. So this brings us to our Vertica ecosystem that we have in production today. So on the ingestion side, we built a system called Vertica Data Manager, which basically manages all the ingestion into various clusters. So at this point, people who are managing datasets or dataset owners and publishers, they no longer have to be aware of individual clusters. They just set up their ingestion pipelines with an endpoint in Vertica Data Manager. And the Vertica Data Manager ensures that, all the schemas and data is consistent across all our clusters. And on the query side, we built a proxy layer. So what this ensures is that, when queries come in from the BI layer, the query was forwarded, smartly and with knowledge and data about which cluster up, which clusters are down, which clusters are available, which clusters are loaded, and so forth. So with these two layers of abstraction between our ingestion and our query, we were able to have a very consistent, almost single system view of our entire Vertica deployment. And the third bit, we had put in place, was the data manifest, which were the communication mechanism between ingestion and proxy. So the data manifest basically is a listing of, which tables are available on which clusters, which clusters are up to date, and so forth. So with this ecosystem in place, we were also able to solve the extra replication problem. So now we basically have some big clusters, where all the core tables, and all the tables, in fact, are served. So any query that hits 90%, less so tables, goes to the big clusters. And most of the queries which hit 10% heavily queried important tables, can also be served by many other small clusters, so much more efficient use of resources. So this basically is the view that we have today, of Vertica within Uber, so external to our team, folks, just have an endpoint, where they basically set up their ingestion jobs, and another endpoint where they can forward their Vertica SQL queries. And they are so to a proxy layer. So let's get a little more into details, about each of these layers. So, on the data management side, as I mentioned, we have two kinds of tables. So we have dimension tables. 
So these tables are updated every cycle, so the list of cities list of drivers, the list of users and so forth. So these change not so frequently, maybe once a day or so. And so we are able to, and since these datasets are not very big, we basically swap them out on every single cycle. Whereas the fact tables, so these are tables which have information about our trips or each orders and so forth. So these are partition. So we have one partition roughly per day, for the last couple of years, and then we have more of a hierarchical partitions set up for older data. So what we do is we load the partitions for the last three days on every cycle. The reason we do that, is because not all our data comes in at the same time. So we have updates for trips, going over the past two or three days, for instance, where people add ratings to their trips, or provide feedback for drivers and so forth. So we want to capture them all in the row corresponding to that particular trip. And so we upload partitions for the last few days to make sure we capture all those updates. And we also update older partitions, if for instance, records were deleted for retention purposes, or GDPR purposes, for instance, or other regulatory reasons. So we do this less frequently, but these are also updated if necessary. So there are endpoints which allow dataset owners to specify what partitions they want to update. And as I mentioned, data is typically managed using a hierarchical partitioning scheme. So in this way, we are able to make sure that, we take advantage of the data being clustered by day, so that we don't have to update all the data at once. So when we are recovering from an cluster event, like a version upgrade or software upgrade, or hardware fix or failure handling, or even when we are adding a new cluster to the system, the data manager takes care of updating the tables, and copying all the new partitions, making sure the schemas are all right. And then we update the data and schema consistency and make sure everything is up to date before we, add this cluster to our serving pool, and the proxy starts sending traffic to it. The second thing that the data manager provides is consistency. So the main thing we do here, is we do atomic updates of our tables and partitions for fact tables using a two-phase commit scheme. So what we do is we load all the new data in temp tables, in all the clusters in phase one. And then when all the clusters give us access signals, then we basically promote them to primary and set them as the main serving tables for incoming queries. We also optimize the load, using Vertica Data Copy. So what this means is earlier, in a parallel pipelines scheme, we had to ingest data individually from HDFS clusters into each of the Vertica clusters. That took a lot of HDFS bandwidth. But using this nice feature that Vertica provides called Vertica Data Copy, we just load it data into one cluster and then much more efficiently copy it, to the other clusters. So this has significantly reduced our ingestion overheads, and speed it up our load process. And as I mentioned as the second phase of the commit, all data is promoted at the same time. Finally, we make sure that all the data is up to date, by doing some checks around the number of rows and various other key signals for freshness and correctness, which we compare with the data in the data lake. So in terms of schema changes, VDM automatically applies these consistently across all the clusters. 
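Here is a simplified sketch, with illustrative class and method names of my own, of the two-phase load scheme just described: stage the new partition on every cluster first, and promote it to serving only if every cluster staged successfully.

```python
# Simplified two-phase load/promote sketch (illustrative, not Uber's code).
class Cluster:
    def __init__(self, name):
        self.name = name
        self.staged, self.serving = {}, {}

    def stage(self, table, partition, rows):
        # Phase one: load into a temp/staging area; may raise on failure.
        self.staged[(table, partition)] = rows
        return True

    def promote(self, table, partition):
        # Phase two: swap staged data into the serving table.
        self.serving[(table, partition)] = self.staged.pop((table, partition))

def load_partition(clusters, table, partition, rows):
    acks = []
    for c in clusters:
        try:
            acks.append(c.stage(table, partition, rows))
        except Exception:
            acks.append(False)
    if all(acks):
        for c in clusters:
            c.promote(table, partition)   # every cluster serves the same data
        return True
    return False                          # leave serving tables untouched

clusters = [Cluster("A"), Cluster("B"), Cluster("C")]
ok = load_partition(clusters, "fact_trips", "2020-03-30", [("trip1",), ("trip2",)])
print("promoted" if ok else "rolled back")
```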
So first, what we do is we stage these changes to make sure that these are correct. So this catches errors that are trying to do, an incompatible update, like changing a column type or something like that. So we make sure that schema changes are validated. And then we apply them to all clusters atomically again for consistency. And provide a overall consistent view of our data to all our users. So on the proxy side, we have transparent support for, replicated clusters to all our users. So the way we handle that is, as I mentioned, the cluster to table mapping is maintained in the manifest database. And when we have an incoming query, the proxy is able to see which cluster has all the tables in that query, and route the query to the appropriate cluster based on the manifest information. Also the proxy is aware of the health of individual clusters. So if for some reason a cluster is down for maintenance or upgrades, the proxy is aware of this information. And it does the monitoring based on query response and execution times as well. And it uses this information to route queries to healthy clusters, and do some load balancing to ensure that we award hotspots on various clusters. So the key takeaways that I have from the stock, are primarily these. So we started off with single cluster mode on Vertica, and we ran into a bunch of issues around scaling and availability due to cluster downtime. We had then set up a bunch of replicated clusters to handle the scaling and availability issues. Then we run into issues around schema consistency, data staleness, and data replication. So we built an entire ecosystem around Vertica, with abstraction layers around data management and ingestion, and proxy. And with this setup, we were able to enforce consistency and improve storage utilization. So, hopefully this gives you all a brief idea of how we have been able to scale Vertica usage at Uber, and power some of our most business critical and important use cases. So as I mentioned at the beginning, I have a interesting and simple extra update for you. So an easy way in which you all can take advantage of many of the features that we have built into our ecosystem, is to use the Vertica Eon mode. So the Vertica Eon mode, allows you to set up multiple clusters with consistent data updates, and set them up at various different sizes to handle different query loads. And it automatically handles many of these issues that I mentioned in our ecosystem. So do check it out. We've also been, trying it out on DCP, and initial results look very, very promising. So thank you all for joining me on this talk today. I hope you guys learned something new. And hopefully you took away something that you can also apply to your systems. We have a few more time for some questions. So I'll pause for now and take any questions.
Migrating Your Vertica Cluster to the Cloud
>> Jeff: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's break-out session has been titled, "Migrating Your Vertica Cluster to the Cloud." I'm Jeff Healey, and I'm in Vertica marketing. I'll be your host for this break-out session. Joining me here are Sumeet Keswani and Chris Daly, Vertica product technology engineers and key members of our customer success team. Before we begin, I encourage you to submit questions and comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer them offline. And alternatively, you can visit Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also as a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Sumeet. >> Sumeet: Thank you, Jeff. Hello everyone, my name is Sumeet Keswani, and I will be talking about planning to deploy or migrate your Vertica cluster to the Cloud. So you may be moving an on-prem cluster or setting up a new cluster in the Cloud. And there are several design and operational considerations that will come into play. You know, some of these are cost, which industry you are in, or which expertise you have, in which Cloud platform. And there may be a personal preference too. After that, you know, there will be some operational considerations like VM and cluster sizing, what Vertica mode you want to deploy, Eon or Enterprise. It depends on your use keys. What are the DevOps skills available, you know, what elasticity, separation you need, you know, what is your backup and DR strategy, what do you want in terms of high availability. And you will have to think about, you know, how much data you have and where it's going to live. And in order to understand the cost, or the cost and the benefit of deployment and you will have to understand the access patterns, and how you are moving data from and to the Cloud. So things to consider before you move a deployment, a Vertica deployment to the Cloud, right, is one thing to keep in mind is, virtual CPUs, or CPUs in the Cloud, are not the same as the usual CPUs that you've been familiar with in your data center. A vCPU is half of a CPU because of hyperthreading. There is definitely the noisy neighbor effect. There is, depending on what other things are hosted in the Cloud environment, you may see performance, you may occasionally see performance issues. There are I/O limitations on the instance that you provision, so that what that really means is you can't always scale up. You might have to scale up, basically, you have to add more instances rather than getting bigger or the right size instances. Finally, there is an important distinction here. Virtualization is not free. There can be significant overhead to virtualization. It could be as much as 30%, so when you size and scale your clusters, you must keep that in mind. 
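As a rough companion to the sizing caveats above, here is a small Python helper that treats a vCPU as a hyperthread and pads for virtualization overhead. The formula and the 30% figure are rules of thumb taken from the talk, not an official sizing calculator.

```python
# Rough rule-of-thumb sizing helper: vCPUs are hyperthreads, not cores,
# and virtualization can add meaningful overhead (illustrative only).
def vcpus_needed(physical_cores, virtualization_overhead=0.30):
    """Estimate vCPUs to roughly match a number of on-prem physical cores."""
    threads = physical_cores * 2                    # 2 hyperthreads per core
    return int(round(threads * (1 + virtualization_overhead)))

for cores in (16, 32, 48):
    print(f"{cores} on-prem cores -> plan for roughly {vcpus_needed(cores)} vCPUs")
```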
Now the other important aspect is, you know, where you put Vertica cluster is important. The choice of the region, how far it is from your various office locations. Where will the data live with respect to the cluster. And remember, popular locations can fill up. So if you want to scale out, additional capacity may or may not be available. So these are things you have to keep in mind when picking or choosing your Cloud platform and your deployment. So at this point, I want to make a plug for Eon mode. Eon mode is the latest mode, is a Cloud mode from Vertica. It has been designed with Cloud economics in mind. It uses shared storage, which is durable, available, and very cheap, like S3 storage or Google Cloud storage. It has been designed for quick scaling, like scale out, and highly elastic deployments. It has also been designed for high workload isolation, where each application or user group can be isolated from the other ones, so that they'll be paid and monitored separately, without affecting each other. But there are some disadvantages, or perhaps, you know, there's a cost for using Eon mode. Storage in S3 is neither cheap nor efficient. So there is a high latency of I/O when accessing data from S3. There is API and data access cost. There is API and data access cost associated with accessing your data in S3. Vertica in Eon mode has a pay as you go model, which you know, works for some people and does not work for others. And so therefore it is important to keep that in mind. And performance can be a little bit variable here, because it depends on cache, it depends on the local depot, which is a cache, and it is not as predictable as EE mode, so that's another trade-off. So let's spend about a minute and see how a Vertica cluster in Eon mode looks like. A Vertica cluster in Eon mode has S3 as the durability layer where all the data sits. There are subclusters, which are essentially just aggregation groups, which is separated compute, which will service different workloads. So for in this example, you may have two subclusters, one servicing ETL workload and the other one servicing (mic interference obscures speaking). These clusters are isolated, and they do not affect each other's performance. This allows you to scale them independently and isolate workloads. So this is the new Vertica Eon mode which has been specifically designed by us for use in the Cloud. But beyond this, you can use EE mode or Eon mode in the Cloud, it really depends on what your use case is. But both of these are possible, and we highly recommend Eon mode wherever possible. Okay, let's talk a little bit about what we mean by Vertica support in the Cloud. Now as you know, a Cloud is a shared data center, right. Performance in the Cloud can vary. It can vary between regions, availability zones, time of the day, choice of instance type, what concurrency you use, and of course the noisy neighbor effect. You know, we in Vertica, we performance, load, and stress test our product before every release. We have a bunch of use cases, we go through all of them, make sure that we haven't, you know, regressed any performance, and make sure that it works up to standards and gives you the high performance that you've come to expect. However, your solution or your workload is unique to you, and it is still your responsibility to make sure that it is tuned appropriately. To do this, one of the easiest things you can do is you know, pick a tested operating system, allocate the virtual machine, you know, with enough resources. 
It's something that we recommend, because we have tested it thoroughly. It goes a long way in giving you predictability. So after this I would like to now go into the various platforms, Cloud platforms, that Vertica has worked on. And I'll start with AWS, and my colleague Chris will speak about Azure and GCP. And our thoughts forward. So without further ado, let's start with the Amazon Web Services platform. So this is Vertica running on the Amazon Web Services platform. So as you probably are all aware, Amazon Web Services is the market leader in this space, and indeed really our biggest provider by far, and have been here for a very long time. And Vertica has a deep integration in the Amazon Web Services space. We provide a marketplace offering which has both pay as you go or a bring your own license model. We have many, you know, knowledge base articles, best practices, scripts, and resources that help you configure and use a Vertica database in the Cloud. We have several customers in the Cloud for many, many years now, and we have managed and console-based point and click deployments, you know, for ease of use in the Cloud. So Vertica has a deep integration in the Amazon space, and has been there for quite a bit now. So we communicate a lot of experience here. So let's talk about sizing on AWS. And sizing on any platform comes down to you know, these four or five different things. It comes down to picking the right instance type, picking the right disk volume and type, tuning and optimizing your networking, and finally, you know, some operational concerns like security, maintainability, and backup. So let's go into each one of these on the AWS ecosystem. So the choice of instance type is one of the important choices that you will make. In Eon mode, you know, you don't really need persistent disk. You can, you should probably choose ephemeral disk because it gives you extra speed, and speed with the instance type. We highly recommend the i3.4x instance types, which are very economical, have a big, 4 terabyte depot or cache per node. The i3.metal is similar to the i3.4, but has got significantly better performance, for those subclusters that need this extra oomph. The i3.2 is good for scale out of small ad hoc clusters. You know, they have a smaller cache and lower performance but it's cheap enough to use very indiscriminately. If you were in EE mode, well we don't use S3 as the layer of durability. Your local volumes is where we persist the data. Hence you do need an EBS volume in EE mode. In order to make sure that, you know, that the instance or the deployment is manageable, you might have to use some sort of a software RAID array over the EBS volumes. The most common instance type you see in EE mode is the r4.4x, the c4, or the m4 instance types. And then of course for temp space and depot we always recommend instance volumes. They're just much faster. Okay. So let's go, let's talk about optimizing your network or tuning your network. So the best, the best thing you can do about tuning your network, especially in Eon mode but in other modes too, is to get a VPC S3 endpoint. This is essentially a route table that makes sure that all traffic between your cluster and S3 goes over an internal fabric. This makes it much faster, you don't pay for egress cost, especially if you're doing external tables or your communal storage, but you do need to create it. Many times people will forget doing it. So you really do have to create it. And best of all, it's free. 
It doesn't cost you anything extra. You just have to create it during cluster creation time, and there's a significant performance difference for using it. The next thing about tuning your network is, you know, sizing it correctly. Pick the closest geographical region to where you'll consume the data. Pick the right availability zone. We highly recommend using cluster placement groups. In fact, they are required for the stability of the cluster. A cluster placement group is essentially, it operates this notion of rack. Nodes in a cluster placement group, are, you know, physically closer to each other than they would otherwise be. And this allows, you know, a 10 Gbps, bidirectional, TCP/IP flow between the nodes. And this makes sure that, you know, you get a high amount of Gbps per second. As you probably are all aware, the Cloud does not support broadcast or UDP broadcast. Hence you must use point-to-point UDP for spread in the Cloud, or in AWS. Beyond that, you know, point-to-point UDP does not scale very well beyond 20 nodes. So you know, as your cluster sizes increase, you must switch over to large cluster mode. And finally, use instances with enhanced networking or SR-IOV support. Again, it's free, it comes with the choice of the instance type and the operating system. We highly recommend it, it makes a big difference in terms of how your workload will perform. So let's talk a little bit about security, configuration, and orchestration. As I said, we provide CloudFormation scripts to make the ease of deployment. You can use the MC point and click. With regard to security, you know, Vertica does support instance profiles out of the box in Amazon. We recommend you use it. This is highly desirable so that you're not passing access keys and secret keys around. If you use our marketplace image, we have picked the latest operating systems, we have patched them, Amazon actually validates everything on marketplace and scans them for security vulnerabilities. So you get that for free. We do some basic configuration, like we disable root ssh access, we disallow any password access, we turn on encryption. And we run a basic set of security checks to make sure that the image is secure. Of course, it could be made more secure. But we try to balance out security, performance, and convenience. And finally, let's talk about backups. Especially in Eon mode I get the question, "Do we really need to back up our system, "since the data is in S3?" And the answer is yes, you do. Because you know, S3's not going to protect you against an accidental drop table. You know, S3 has a finite amount of reliability, durability, and availability. And you may want to be able to restore data differently. Also, backups are important if you're doing DR, or if you have additional cluster in a different region. The other cluster can be considered a backup. And finally, you know, why not create a backup or a disaster recovery cluster, you know, storage is cheap in the Cloud. So you know, we highly recommend you use it. So with this, I would like to hand it over to my colleague Christopher Daly, who will talk about the other two platforms that we support, that is Google and Azure. Over to you, Chris, thank you. >> Chris: Thanks, Sumeet, and hi everyone. So while there's no argument that we here at Vertica have a long history of running within the Amazon Web Services space, there are other alternative Cloud service providers where we do have a presence, such as Google Cloud Platform, or GCP. 
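Before Chris gets into GCP, here is a hedged boto3 sketch of creating the gateway-type S3 VPC endpoint that Sumeet just recommended; the region, VPC ID, and route table ID are placeholders you would replace with your own.

```python
# Hedged example: create a gateway-type S3 VPC endpoint with boto3.
# All identifiers below are placeholders for your own environment.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                # the VPC hosting your cluster
    ServiceName="com.amazonaws.us-east-1.s3",     # S3 service in the region
    RouteTableIds=["rtb-0123456789abcdef0"],      # route tables used by the nodes
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```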
For those of you who are unfamiliar with GCP, it's considered the third-largest Cloud service provider in the marketspace, and it's priced very competitively to its peers. Has a lot of similarities to AWS in the products and services that it offers, but it tends to be the go-to place for newer businesses or startups. We officially started supporting GCP a little over a year ago with our first entry into their GCP marketplace. So a solution that deployed a fully-functional and ready-to-use Enterprise mode cluster. We followed up on that with the release and the support of Google storage buckets, and now I'm extremely pleased to announce that with the launch of Vertica 10, we're officially supporting Eon mode architecture in GCP as well. But that's not all, as we're adding additional offerings into the GCP marketplace. With the launch of version 10 we'll be introducing a second listing in the marketplace that allows for the deployment of an Eon mode cluster. It's all being driven by our own management consult. This will allow customers to quickly spin up Eon-based clusters within the GCP space. And if that wasn't enough, I'm also pleased to tell you that very soon after the launch we're going to be offering Vertica by the hour in GCP as well. And while we've done a lot to automate the solutions coming out of the marketplace, we recognize the simple fact that for a lot of you, building your cluster manually is really the only option. So with that in mind, let's talk about the things you need to understand in GCP to get that done. So wag me if you think this slide looks familiar. Well nope, it's not an erroneous duplicate slide from Sumeet's AWS section, it's merely an acknowledgement of all the things you need to consider for running Vertica in the Cloud. In Vertica, the choice of the operational mode will dictate some of the choices you'll need to make in the infrastructure, particularly around storage. Just like on-prem solutions, you'll need to understand the disk and networking capacities to get the most out of your cluster. And one of the most attractive things in GCP is the pricing, as it tends to run a little less than the others. But it does translate into less choices and options within the environment. If nothing else, I want you to take one thing away from this slide, and Sumeet said this earlier. VMs running, about AWS, Sumeet said this about AWS earlier. VMs running in the GCP space run on top of hardware that has hyperthreading enabled. And that a vCPU doesn't equate to a core, but rather a processing thread. This becomes particularly important if you're moving from an on-prem environment into the Cloud. Because a physical Vertica node with 32 cores is not the same thing as a VM with 32 vCPUs. In fact, with 32 vCPUs, you're only getting about 16 cores worth of performance. GCP does offer a handful of VM types, which they categorize by letter, but for us, most of these don't make great choices for Vertica nodes. The M series, however, does offer a good core to memory ratio, especially when you're looking at the high-mem variants. Also keep in mind, performance in I/O, such as network and disk, are partially dependent on the VM size, so customers in GCP space should be focusing on 16 vCPU VMs and above for their Vertica nodes. Disk options in GCP can be broken down into two basic types, persistent disks and local disks, which are ephemeral. Persistent disks come in two forms, standard or SSD. 
For Vertica in Eon mode, we recommend that customers use persistent SSD disks for the catalog, and either local SSD disks or persistent SSD disks for the depot and the temp space. A couple of things to think about here, though. Persistent disks are provisioned as a single device with a settable size. Local disks are provisioned as multiple disk devices with a fixed size, requiring you to use some kind of software RAID to create a single storage device. So while local SSD disks provide much more throughput, you're using CPU resources to maintain that RAID set, so it's a bit of a trade-off. Persistent disks offer redundancy, either within the zone they live in or within the region, and if you're selecting regional redundancy, the disks are replicated across multiple zones in the region. This does have an effect on the performance delivered to the VM, so we don't recommend it. What we do recommend is zonal redundancy when you're using persistent disks, as it gives you that redundancy level without actually affecting performance. Remember also that in the Cloud space, all I/O is network I/O, as persistent disks are basically network-attached block storage devices. This means that disk actions can and will slow down network traffic. And finally, storage bucket access in GCP is based on GCP interoperability mode, which means it's basically compliant with the AWS S3 API. In interoperability mode, access to the bucket is granted by a key pair that GCP refers to as HMAC keys. HMAC keys can be generated for individual users or for service accounts. We recommend that when you're creating HMAC keys, you choose a service account, to ensure that the keys are not tied to a single employee. When thinking about storage for Enterprise mode, things change a little bit. We still recommend persistent SSD disks over standard ones; however, the use of local SSD disks for anything other than temp space is highly discouraged. I said it before: local SSD disks are ephemeral, meaning the data is lost if the machine is turned off or goes down, so they're not really a place you want to store your data. In GCP, multiple persistent disks placed into a software RAID set do not create more throughput like you can find in other Clouds; the I/O saturation usually hits the VM limit long before it hits the disk limit. In fact, performance of a persistent disk is determined not just by the size of the disk but also by the size of the VM. So a good rule of thumb in GCP, to maximize your I/O throughput for persistent disks, is to size them up, keeping in mind that throughput tends to max out at two terabytes for SSDs and 10 terabytes for standard disks. Network performance in GCP can be thought of in two distinct ways: there's node-to-node traffic, and then there's egress traffic. Node-to-node performance in GCP is really good within the zone, with typical traffic between nodes falling in the 10-15 gigabits per second range. This might vary a little from zone to zone and region to region, but usually it's only limited by the existing traffic where the VMs exist, so kind of a noisy neighbor effect. Egress traffic from a VM, however, is subject to throughput caps, and these are based on the size of the VM. The speed is set by the number of vCPUs in the VM at two gigabits per second per vCPU, and tops out at 32 gigabits per second. So the larger the VM, the more vCPUs you get, and the larger the cap.
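Because interoperability mode is S3-API compatible, any S3-style client pointed at the storage.googleapis.com endpoint with HMAC keys can reach the bucket. The sketch below uses boto3 purely to illustrate that pattern; the bucket name and HMAC key pair are placeholders, and per the recommendation above, the keys should belong to a service account rather than an individual.

```python
# Sketch: accessing a GCS bucket through GCP interoperability (S3-compatible) mode
# with HMAC keys. Keys and bucket name are placeholders.
import boto3

gcs = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",   # GCS interoperability endpoint
    aws_access_key_id="GOOG1EXAMPLEHMACKEY",         # placeholder HMAC access ID (service account)
    aws_secret_access_key="exampleHmacSecret",       # placeholder HMAC secret
)

# List a few objects to verify that the bucket is reachable with these keys.
resp = gcs.list_objects_v2(Bucket="my-vertica-communal-bucket", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```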
So some things to consider in the networking space for your Vertica cluster. Pick a region that's physically close to you, even if you're connecting to the GCP network from a corporate LAN as opposed to the internet; the further the packets have to travel, the longer it's going to take. Also, GCP, like most Clouds, doesn't support UDP broadcast traffic on their virtual networks, so you do have to use the point-to-point flag for spread when you're creating your cluster. And since the network cap on VMs is set at 32 gigabits per second per VM, to maximize your network egress throughput, don't use VMs smaller than 16 vCPUs for your Vertica nodes. And that gets us to the one question I get asked most often: how do I get my data into and out of the Cloud? Well, GCP offers many different methods to support different speeds and different price points for data ingress and egress. There's the obvious one, right, across the internet, either directly to the VMs or into the storage bucket, or you can light up a VPN tunnel to encrypt all that traffic. But additionally, GCP offers direct network interconnects from your corporate network; these are provided either by Google or by a partner, and they vary in speed. They also offer things called direct or carrier peering, which connect the edges of the networks between your network and GCP, and you can use a CDN interconnect, which creates, I believe, an on-demand connection from your network to the GCP network, provided by a large host of CDN service providers. So GCP offers a lot of ways to move your data around, in and out of the GCP Cloud. It's really a matter of what price point works for you and what technology your corporation is looking to use. So we've talked about AWS, we've talked about GCP, and that really only leaves one more Cloud. So last, and by far not the least, there's the Microsoft Azure environment. Holding strong to the number two place among the major Cloud providers, Azure offers a very robust Cloud offering that's attractive to customers that already consume services from Microsoft. But what you need to keep in mind is that the underlying foundation of their Cloud is based on Microsoft Windows products, and this makes their Cloud offering a little bit different in the services and offerings that they have. The good news here, though, is that Microsoft has done a very good job of getting their virtualization drivers baked into the modern kernels of most Linux operating systems, making running Linux-based VMs in Azure fairly seamless. So here's the slide again, but now you're going to notice some slight differences. First off, in Azure we only support Enterprise mode. This is because the Azure storage product is very different from Google Cloud Storage and S3 on AWS. So while we're working on getting this supported, and we're starting to focus on this, we're just not there yet. This means that since we're only supporting Enterprise mode in Azure, getting the local disk performance right is one of the keys to success in running Vertica here, with the other major key being making sure you're getting the appropriate networking speeds. Overall, Azure is a really good platform for Vertica, and its performance and pricing are very much on par with AWS. But keep in mind that the newer versions of the Linux operating systems like RHEL and CentOS run much better here than the older versions.
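Since the point-to-point requirement for spread comes up on every cloud in this talk, here is a hedged sketch of where that flag typically gets set, at install time. The host list, RPM path, and DBA user are placeholders, and you should confirm the exact install_vertica options against the documentation for your Vertica version.

```python
# Sketch: invoking install_vertica with point-to-point spread communication,
# as required on virtual networks that don't allow UDP broadcast.
# Host IPs and package path are placeholders; verify flags for your version.
import subprocess

hosts = "10.0.0.10,10.0.0.11,10.0.0.12"  # placeholder private IPs of the nodes

subprocess.run(
    [
        "/opt/vertica/sbin/install_vertica",
        "--hosts", hosts,
        "--rpm", "/tmp/vertica-10.0.x.x86_64.rpm",  # placeholder package path
        "--dba-user", "dbadmin",
        "--point-to-point",   # direct node-to-node spread instead of broadcast
    ],
    check=True,
)
```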
Okay, so first things first again: just like GCP, in Azure VMs are running on top of hardware that has hyperthreading enabled. And because of the way Hyper-V, Azure's virtualization engine, works, you can actually see this: if you look into the CPU information of the VM, you'll see how it groups the vCPUs by core and by thread. Azure offers a lot of VM types, and is adding new ones all the time, but for us there are three VM types that make the most sense for Vertica. For customers looking to run production workloads in Azure, the Es_v3 and the Ls_v2 series are the two main recommendations. While they differ slightly in CPU to memory ratio and I/O throughput, the Es_v3 series is probably the best recommendation for a generalized Vertica node, with the Ls_v2 series being recommended for workloads with higher I/O requirements. If you're just looking to deploy a sandbox environment, the Ds_v3 series is a very suitable choice that can really reduce your overall Cloud spend. VM storage in Azure is provided by a grouping of four different types of disks, all offering different levels of performance. Introduced at the end of last year, the Ultra Disk option is the highest-performing disk type for VMs in Azure. It was designed for database workloads where high throughput and low latency are very desirable. However, the Ultra Disk option is not available in all regions yet, although that's been changing slowly since its launch. The Premium SSD option, which has been around for a while and is widely available, can also offer really nice performance, especially at higher capacities. And just like with other Cloud providers, the I/O throughput you get on VMs is dictated not only by the size of the disk, but also by the size of the VM and its type. So a good rule of thumb here: VM types with an S will have a much better throughput rate than ones that don't, and the larger VMs will have higher I/O throughput than the smaller ones. You can expand the VM disk throughput by using multiple disks in Azure with a software RAID. This overcomes the limitations of single-disk performance, but keep in mind, you're now using CPU cycles to maintain that RAID, so it is a bit of a trade-off. The other nice thing in Azure is that all their managed disks are encrypted by default on the server side, so there's really nothing you need to do to enable that. And of course I mentioned this earlier: there is no native access to Azure storage yet, but it is something we're working on. We have seen folks using third-party applications like MinIO to access Azure storage as an S3 bucket, so it might be something you want to keep in mind and maybe even test out for yourself. Networking in Azure comes in two different flavors, standard and accelerated. In standard networking, the entire network stack is abstracted and virtualized. This works really well; however, there are performance limitations. Standard networking tends to top out around four gigabits per second. Accelerated networking in Azure is based on single root I/O virtualization (SR-IOV) of the Mellanox adapter. This is basically the VM talking directly to the physical network card in the host hardware, and it can produce network speeds up to 20 gigabits per second, so much, much faster. Keep in mind, though, that not all VM types and operating systems actually support accelerated networking, and just like disk throughput, network throughput is based on VM type and size.
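On the MinIO point: since Vertica has no native Azure storage access yet, some people front Azure Blob storage with a MinIO endpoint and talk to it over the S3 API. The sketch below only illustrates that access pattern; the endpoint, credentials, and bucket are placeholders for whatever you deploy yourself, and as the talk says, this is something to evaluate and test, not a supported configuration.

```python
# Sketch: reading and writing Azure-backed storage through a self-hosted MinIO
# endpoint that exposes an S3-compatible API. Endpoint, credentials, and bucket
# name are placeholders.
import boto3

minio = boto3.client(
    "s3",
    endpoint_url="https://minio.example.internal:9000",  # placeholder MinIO endpoint
    aws_access_key_id="minio-access-key",                # placeholder
    aws_secret_access_key="minio-secret-key",            # placeholder
)

# Round-trip a small object to confirm the gateway is reachable.
minio.put_object(Bucket="vertica-scratch", Key="healthcheck.txt", Body=b"ok")
obj = minio.get_object(Bucket="vertica-scratch", Key="healthcheck.txt")
print(obj["Body"].read())
```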
So what do you need to think about for networking in the Azure space? Again, stay close to home: pick regions that are geographically close to your location. Yes, the backbones between the regions are very, very fast, but the more hops your packets have to make, the longer it takes. Azure offers two types of groupings of their VMs, availability sets and availability zones. Availability zones offer good redundancy across multiple zones, but this actually increases the node-to-node latency, so we recommend you avoid them. Availability sets, on the other hand, keep all your VMs grouped together within a single zone, but make sure that no two VMs are running on the same host hardware, for redundancy. And just like the other Clouds, UDP broadcast is not supported, so you have to use the point-to-point flag when you're creating your database to ensure that spread works properly. Spread timeout, okay, this is a good one. Recently, Microsoft has started monthly rolling updates of their environment. What this looks like is that VMs running on top of hardware that's receiving an update can be paused, and this becomes problematic when the pausing of the VM exceeds eight seconds, as the unpaused members of the cluster now think the paused VM is down. So consider adjusting the spread timeout for your clusters in Azure to 30 seconds; this will help avoid a little of that. If you're deploying a large cluster in Azure, more than 20 nodes, use large cluster mode, as point-to-point spread doesn't really scale well with a lot of Vertica nodes. And finally, pick VM types and operating systems that support accelerated networking; the difference in node-to-node speeds can be very dramatic. So how do we move data around in Azure? Microsoft views data egress a little differently than other Clouds, as it classifies any data transmitted by a VM as egress; however, it only bills for data egress that actually leaves the Azure environment. Egress speed limits in Azure are based entirely on the VM type and size, and then they're limited by your connection to them. While not offering as many pathways to access their Cloud as GCP, Azure does offer a direct network-to-network connection called ExpressRoute. Offered by a large group of third-party partners, ExpressRoute provides multiple tiers of performance that are based on a flat charge for inbound data and a metered charge for outbound data. And of course you can still access Azure via the internet, and securely through a VPN gateway. So on behalf of Jeff, Sumeet, and myself, I'd like to thank you for listening to our presentation today, and we're now ready for Q&A.
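For the spread timeout recommendation above, one scripted way to apply it is through the vertica_python client, as in the hedged sketch below. The connection details are placeholders, the 30000-millisecond value mirrors the 30-second suggestion in the talk, and you should confirm SET_SPREAD_OPTION and its parameters against the documentation for your Vertica version.

```python
# Sketch: raising the spread token timeout to ~30 seconds so short Hyper-V
# maintenance pauses in Azure aren't treated as node failures.
# Connection details are placeholders; verify SET_SPREAD_OPTION for your version.
import vertica_python

conn_info = {
    "host": "10.1.0.10",      # placeholder node address
    "port": 5433,
    "user": "dbadmin",
    "password": "********",   # placeholder
    "database": "analytics",  # placeholder database name
}

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    cur.execute("SELECT SET_SPREAD_OPTION('TokenTimeout', '30000');")
    print(cur.fetchall())     # confirmation row returned by the meta-function
finally:
    conn.close()
```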