Larry Lancaster & Rod Bagg, Zebrium | Zebrium Root Cause as a Service

(upbeat music) >> Full stack observability is all the rage today. As businesses lean into digital, customer experience becomes ever more important. Why? Well, it's obvious, fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades old software companies that have traditionally done monitoring or log analytics and or application performance management stepping up their game. These established players, you know, they typically have deep feature sets and sometimes purpose-built tools that attack one particular segment of the marketplace. And now they're pivoting through M&A and some organic development trying to fill gaps in their portfolio. And then, you got all these new entrants coming to the market, claiming end to end visibility across the so-called modern cloud and now edge native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling combined with strong ecosystems to address this problem. But, you know, recent survey research from ETR confirms our thesis that no one company has it all. Here's the thing. Customers just want to figure out the root cause as quickly and as efficiently as possible. It's one thing to observe the stack end to end, but the question is who is automating the observers? And that's why we're here today. Hello, my name is Dave Vellante and welcome to this special Cube presentation where we dig into root cause analysis, and specifically, how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes and delivering it as an automated service. And in this session, we have two deep dives. 
First, we're going to dig into this exciting new field of RCaaS, Root Cause As A Service, with two of the founders and technical experts behind Zebrium. And then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well established and well known Cisco product lines, including WebEx Client and UCS. I was pretty amazed at the results and I think you'll be impressed as well. So thanks for being here. Let's get started. With me right now is Larry Lancaster, who's a founder and CTO of Zebrium. And he's joined by Rod Bagg, who's the founder and vice president of engineering at the company. Gents, welcome. Thanks for coming on. >> Thanks. >> Okay. >> It's good to be here. >> It's good to be here. >> All right, Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO. What do we need to know? >> Yeah, I mean, it's like you said. I mean, it's extremely important to our customers and to most businesses out there to drive uptime and avoid as much downtime as possible. So, you know, when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running, you know, running on the web and that's how you get a point click. Or the business depends on, you know, internal systems to drive their business and to run it. When that is down, that is hugely impacting to them. So if you take a look, you know, way back, you know, 20, 30 years ago, software was simple. You know, there wasn't much to it. It was pretty monolithic and maybe it took a couple of people to maintain it and keep it running. There wasn't really anything complicated about it. It was a single tenant piece of software. Today's software is so complicated, often running, you know, maybe hundreds of services to keep that or to actually implement what that software is doing. 
So as you point out, you know, enter the sort of observability space and the tools that are now in use to help monitor that software and make sure when something goes wrong, they know about it. But there's kind of an interesting stat around the observability space. So when you look at observability in the context or through the lens of the cost of downtime, it's really interesting. So observability tools are about a $20 billion market, okay? But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. So it's all great to know that something went wrong, but you got to know why. And it's our contention here that, you know, really, when you take a look at the observability space, you have metrics, that's a great tool. I mean, there's lots of great tools out there, you know, around metrics monitoring that's going to tell you when something went wrong. It's very rarely going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where you're being, you know, where it's happening or where something is running slow, potentially. So that's great. But again, the root cause of why it's happening is going to be buried in log files. And I can expand on that a little bit more, but, you know, when you're a software developer and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing, or why it went wrong. And it's that that really gets you to the root cause very fast. And that's our contention, that these software systems are so complex nowadays and that the root cause is lying in those logs. So how do you get there fast? 
You know, we would contend that you better automate that or you are just doomed for failure. And that's where we come in. >> Great. >> Getting to that root cause. >> Thank you, Rod. You know, it's interesting you talk about the $20 billion market. There's an analogy with security, right? We spend 80, $100 billion a year on securing our infrastructure, and yet we lose probably closer to a trillion dollars a year in breaches. And there's a similar analogy here. 20 billion could be 5X in downtime impacts or more. Okay, let's go to Larry. Tell us a little bit more about Zebrium. I'm interested always to ask a founder why you started the company. Rod touched on that a little bit. You guys have invented this concept of RCaaS. What does it mean? What problems does it solve, and how does it solve the problem? Let's get into it. >> Yeah. Hey, thanks, Dave. So I think when you said, you know, who's automating the observer, that's a great way to think about it because what observability really means is it's a property of a system that means you can see into it. You can observe the internal state and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it: automating the observer. Now, of course, it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience. That's all true, but it's important to step back and realize, like, we have cracked a real nut here. People have been trying to figure out how to automate this part of sort of the troubleshooting experience, this human part of finding the root cause indicators, for a long time. And until Zebrium came along, I would argue, no one's really done it right. 
So, you know, I think it's also important, you know, as we step back, we can probably look forward five to 10 years and say, everyone's going to look back and say, "How did we do all this manually?" You're going to see this sort of last mile of observability and troubleshooting is going to be automated everywhere because otherwise, you know, people are just... They're not going to be able to scale their business. So, you know, I think one more thing that's important to point out is, you know, I think Zebrium, you know, it's one thing to have the technology but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So, you know, we're looking at, you know, if you look at Zebrium, you'll put us on your dashboard and we don't care what kind of a dashboard it is. It could be, you know, Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff. Yep. >> Yeah. So, interesting. Again, another analogy I think about. You know, you talked about automation. If we're to look back and say this is what... We're never going to do this again, it's like provisioning LANs. Nobody provisions LANs anymore, it's all automated. >> Larry: (chuckling) That's right. >> So Larry, I'll stay with you. The skeptic in me says, this sounds amazing, but, you know... It might be too good to be true. Tell us how it works. >> Larry: (chuckling) Yeah. So that's interesting. So Cisco came along and they were equally skeptical. So what they did was they took a couple of months and they did a very detailed study. And they got together 192 incidents across four product lines, where they knew that the root cause was in the logs. 
And they knew what that root cause was because they had had their best engineers, you know work on those cases and take detailed notes of the incidents that had taken place. And so they ran that data through the Zebrium software. And what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away. When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like, you know, it was like the Apollo command center, you know when they finally, you know, touchdown on the moon kind of thing. So, you know, it's really an exciting point in time to be at the company, like just seeing everything finally being proven out according to this vision. I'm going to tell you one more story which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper modern, you know, SaaS business, they're an S3 competitor. Zoom has their files stored on Lyve Cloud, you know, to let you know who they are. So essentially, what happened was they were in alpha, their early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored. And it was, you know, fortunately for them, it was early access. So no one was expecting, you know, uptime, you know, service level objectives and so on. But they were scared, because they realized, if something like this happens in production, you know, they're screwed. So what they did was they saw Zebrium. They went and did some research, they saw Zebrium. They went in a staging environment, recreated the exact (indistinct) that they had had. And what they saw was immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're onto something transformational. >> Dave: Yeah. 
That's great. I mean, you guys are jumping up and down, I'm sure. We're going to hear from Cisco later. I bet you, they were jumping up and down too because they didn't have to do all that heavy lifting anymore. So Rod, Larry's just sort of implying that, or actually, you guys both talked about that your tool is agnostic. So how does one actually use the service? How do I deploy it? >> Yeah. So let me step back. So when we talk about logs, right? Like, you know, all these breadcrumbs being in logs and everything else? So, you know, they are a great wealth of, you know, information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, you know, we had one of our... Or we've heard from several of our customers now about what it was like prior to using Zebrium, when they, you know, have some issue, and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metric has, you know, taken a blip or something's happened that they know there's a problem. We've heard from them that it can take like a number of hours just to get to the right set of logs, like figuring out over these hundreds of services where the logs are, to get to them, maybe searching in a log manager. Just to get into the right context, even, can take hours. So, you know, that's obviously the problem we solve but, you know, we don't want them just looking at logs. I mean, you know, we don't want to put them back in the thing they don't like doing because people don't do that. They don't like doing it. So we put it up on the dashboard. So if something is going wrong with your metrics and that's the indicator, or maybe it's something with tracing that you're sort of digging through that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs, you click on one of our integrations and we integrate with all these tools that Larry's talked about. 
And when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong and you know there's an issue, take a look at the portion of your dashboard that is us, and we're going to tell you why. We're going to get you to the why that went wrong. No other work could be... You can, you know, also click down and click through to us so that you land up in our portal, if you want to do some more digging around, if you need to or whatever, maybe to get some context, what have you, but it's fair that if you ever need to do that, the answer should be right there on your dashboard. And that's how we expect people to use it. We don't want them digging in logs and going through things, we want it to be right in their workflow. >> Great. Thank you, Larry. So Rod, we talked about Cisco, and we're going to hear more from them in a moment, and Seagate. I would think this is like a perfect solution for a SaaS provider, anybody doing AI ops. Do you have some examples of those types of firms leaning into this? >> Rod: Yeah, a couple of great ones. Well, I mean, we've got many of them, but a couple that I'll touch on. We have an actual AI ops company that was looking for, you know, sort of some complementary technology and so on. And so they decided to just put us through our paces by having one of their own SREs sign up for our service in our SaaS environment, and send the logs from their system to us, you know, and just see how we did. So it turned out we ended up talking back to this SRE like a week after he had installed the product, you know, signed up and then, you know, started sending us logs. And, you know, he was hemming and hawing, saying that he was busy, like every SRE is, and that he didn't have a chance to really do much with us yet. 
And, you know, we were just, you know, having this conversation on the phone, and he comes to tell us that, yeah I've been busy because we had this, you know, terrible outage, like, you know, five days ago. And we said like, "Okay did you actually look on the Zebrium dashboard?" (chuckles) And he goes, "You know what? I didn't even think to do it yet. I mean, I'd just been so busy and frazzled." So we have an integration with that company, he hadn't put that integration in, so it wasn't in his dashboard yet, but it was certainly on ours. So he went there, and he looks and he looks on the day, you know, on the time range of when he had had this incident. And right at the very top of the page on our portal was that incident with that root cause. And he was flabbergasted. It literally would've saved him hours and hours and hours. They had this issue going on for over 24 hours. And we had the answer right there in five minutes, and it was crazy. And we get that kind of stories. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it. And you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. In SaaS companies, you know, one of our customers is Alchera. They do cost optimizations for cloud properties, you know, for AWS optimization, Google, Google cloud, and so on. But they use our software, and they have a lot of interaction, obviously with these cloud vendors and the APIs of those cloud vendors. So, you know, in order to figure out your costing at AWS, they're using all those APIs. So it turned out, you know, they had some issue where their services were breaking. And we had that root cause report right on the screen, again within five minutes, that was pointing to an API problem with Google. And they had changed one of their APIs and Alchera was not aware of it. So their stuff was breaking because of a change downstream that we had caught. 
And I'll just tell you one last one because it's somewhat related to one of these cloud vendors. You know, it was a big cloud vendor who had an outage a couple of months ago. And it's interesting because, you know, a lot of our customers will set up shared Slack channels with us, where we're monitoring or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them or the root cause that we detected for them, and that's in a shared community channel. So we could see this happening when that AWS outage happened. We could see our customers getting impacted by that AWS outage, and the root cause of what was going on there in AWS that was impacting our customers was showing up in our incidents. Now we didn't obviously, you know, have the very root cause of what was going on in AWS, per se, but we were getting to the root cause of why our customers' applications were failing. And that was because of issues going on at AWS. >> Very interesting. I mean, I think one of your biggest challenges is going to be getting people's attention because these SREs are so busy, their hair's on fire. >> Rod: That's it. Right. (chuckling). You know, when you say, hey, (indistinct). >> I tell you, if you get their attention, they love it. I mean, this AI ops company, I didn't even tell you the punchline there, but, you know, they had this incident that occurred that we found. And quite literally, the next week, they ended up signing up as a paid customer. So... >> Dave: That's great. And Larry, to give you the last word. I mean, you know, Rod was talking about, you know, changes in APIs and you know, there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs that people are trying not to leave. >> Larry: That's right. Absolutely. >> Dave: But close it out here. 
>> Yeah. I mean, that's right, you can run it on-prem. Just like we run it in our cloud, you can run it in your cloud or on your own infrastructure. Now that's all true. You know, I think the one hurdle now that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. You don't believe it, do a POC, try it yourself. And you know, people have become so jaded by the lack of, you know, real, sort of, innovation in the software industry for the last 10 years that it's hard to get people to... But guys, you got to give it a shot, I'm telling you. I'm telling you right now, it works. And you'll hear more about that from one of our customers in a minute. >> All right guys, thanks so much. Great story. Really appreciate you sharing. >> Thank you. >> Yeah. Thanks Dave. Appreciate the time. >> Okay. In a moment, we're going to hear from Cisco who is the customer in this case example and a company that has... Look, they have quite an impressive suite of observability tooling, and they've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of, like WebEx. So stay tuned and learn about how you can really take advantage of this new technology called Root Cause As A Service. You're watching theCube, the leader in enterprise and emerging tech coverage. (upbeat music)

Published Date : Jun 16 2022

Larry Lancaster & Rod Bagg

(bright intro music) >> Full stack observability is all the rage today. As businesses lean in to digital, customer experience becomes ever more important, why? Well, it's obvious. Fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action, and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades-old software companies that have traditionally done monitoring or log analytics and/or application performance management stepping up their game. These established players, you know, they typically have deep feature sets and sometimes purpose built tools that attack one particular segment of the marketplace, and now, they're pivoting through M&A and some organic development trying to fill gaps in their portfolio, and then you got all these new entrants coming to the market claiming end to end visibility across the so-called modern cloud and now edge-native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling combined with strong ecosystems to address this problem, but, you know, recent survey research from ETR confirms our thesis that no one company has at all. Here's the thing. Customers just want to figure out the root cause as quickly and efficiently as possible. It's one thing to observe the stack end to end, but the question is who is automating the observers? And that's why we're here today. Hello, my name is Dave Vellante, and welcome to this special "CUBE" presentation where we dig into root cause analysis and, specifically, how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes and delivering it as an automated service. In this session, we have two deep dives. 
First, we're going to dig into this exciting new field of RCA, root cause as a service, with two of the founders and technical experts behind Zebrium, and then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well established and well-known Cisco product lines including Webex client and UCS. I was pretty amazed at the results, and I think you'll be impressed as well. So thanks for being here. Let's get started with me right now is Larry Lancaster who's a founder and CTO of Zebrium, and he's joined by Rod Bagg who's a founder and Vice-President of Engineering at the company. Gents, welcome, thanks for coming on. >> Thanks. >> (indistinct). >> To be here. >> Great to be here. >> All right, Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO, what do we need to know? >> Yeah, I mean, it's like you said. I mean, it's extremely important to our customers and to most businesses out there to drive up time and avoid as much downtime as possible. So, you know, when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running, you know, running on the web, and that that's how you get a point click or their business depends on it and, you know, internal systems to drive their business and to run it. Now, when that is down, that is hugely impacting to them. So if you take a look, you know, way back, you know, 20, 30 years ago, software was simple. You know, there wasn't much to it. It was pretty monolithic, and maybe it took a couple of people to maintain it and keep it running. It wasn't really anything complicated about it. It was a single tenant piece of software. Today's software is so complicated, often running, you know, maybe hundreds of services to keep that or to actually implement what that software is doing. 
So as you point out, you know, enter the sort of observability space and the tools that are now in use to help monitor that software and make sure when something goes wrong, they know about it, but there's kind of an interesting stat around the observability space. So when you look at observability in the context or through the lens of the cost of downtime, it's really interesting. So observability tools are about a $20 billion market, okay? But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. So it's all great to know that something went wrong, but you got to know why, and it it's our contention here that, you know, really, when you take a look at the observability space, you have metrics. That's a great tool. I mean, there's lots of great tools out there, you know, around metrics monitoring that's going to tell you when something went wrong. It's very rarely it's going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where you're being, you know, where it's happening or where something is running slow potentially. So that's great, but again, the root cause of why it's happening is going to be buried in log files, and I can expand on that a little bit more, but, you know, when you're a software developer, and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing or why it went wrong, and it's that that really gets you to the root cause very fast, and that's, our contention is that these software systems are so complex nowadays, and that the root cause is lying in those logs. So how do you get there fast? 
You know, we would contend that you better automate that or you're just doomed for failure, and that's where we come in. >> Great. >> Getting to that request. >> Thank you, Rod. You know, it's interesting. You talk about the $20 billion market. There's an analogy with security, right? We spend 80, $100 billion a year on securing our infrastructure, and yet we lose, probably, closer to a trillion dollars a year in breaches, and there's a similar analogy here. 20 billion could be 5x in downtime impacts or more. Okay, let's go to Larry. Tell us a little bit more about Zebrium. I'm interested always to ask a founder why you started the company. Rod touched on that a little bit. You guys have invented this concept of RCAs. What does it mean? What problems does it solve? And how does it solve the problem? Let's get into it. >> Yeah, hey, thanks, Dave. So I think when you said, you know, who's automating the observer? That's a great way to think about it because what observability really means is it's a property of a system that means you can see into it. You can observe the internal state, and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it is automating the observer. Now, of course, it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience, that's all true, but it's important to step back and realize like we have cracked a real nut here. People have been trying to figure out how to automate this part of sort of the troubleshooting experience, this human part of finding the root cause indicators for a long time, and until Zebrium came along, I would argue no one's really done it right. 
So, you know, I think it's also important, you know, as we step back, we can probably look forward five to 10 years and say, "Everyone's going to look back and say, 'How did we do all this manually?'" You're going to see this sort of last mile of observability and troubleshooting is going to be automated everywhere because otherwise, you know, people are just, they're not going to be able to scale their business. So, you know, I think one more thing that's important to point out is, you know, I think Zebrium, you know, it's one thing to have the technology, but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So, you know, we're looking at, you know, if you look at Zebrium, you'll put us on your dashboard, and we don't care what kind of a dashboard it is. It could be, you know, Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff, yeah. >> Yeah, so interesting. Again, another analogy I think about, you know, you talked about automation, where to look back, and say, "This is what- We're never going to do this again." It's like provisioning LANs. Nobody provisioned LANs anymore. It's all automated. >> That's correct. >> So, Larry, stay with you. The skeptic in me says, "This sounds amazing," but if, you know, it probably too good to be true. Tell us how it works. >> Yeah, so that's interesting. So Cisco came along and they were equally skeptical. 
So what they did was they took a couple of months, and they did a very detailed study, and they got together 192 incidents across four product lines where they knew that the root cause was in the logs, and they knew what that root cause was because they'd had their best engineers, you know, work on those cases and take detailed notes of the incidents that had taken place, and so they ran that data through the Zebrium software, and what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away. When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like, you know, it was like the Apollo Command Center, you know, when they finally, (Dave laughs) you know, touchdown on the moon kind of thing. So, you know, it's really exciting at a point in time to be at the company, like just seeing everything finally being proven out according to this vision. I'm going to tell you one more story, which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper modern, you know, SaaS business. They're an S3 competitor. Zoom has their files stored on Lyve Cloud to give, you know, to let you know who they are. So, essentially, what happened was they were in alpha, in their early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored, and it was, you know, fortunately, for them, it was early access. So no one was expecting, you know, uptime, you know, service level objectives and so on, but they were scared because they realized if something like this happens in production, you know, they're screwed. So what they did was they saw Zebrium, they did some research, they saw Zebrium. 
They went in a staging environment, recreated the exact (indistinct) that they'd had, and what they saw was, immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're onto something transformational. >> Yeah, that's great. I mean, you guys are jumping up and down. I'm sure, we're going to hear from Cisco later. I bet you, they were jumping up and down, too, 'cause they didn't have to do all that heavy lifting anymore. So Rod, Larry's just sort of implying that or, actually, you guys both talked about that your tool's agnostic. So how does one actually use the service? How do I deploy it? >> Yeah, so let me step back. So when we talk about logs, right? Like, you know, all these breadcrumbs being in logs and everything else. So, you know, they are a great wealth of, you know, information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, you know, we had one of our, or we've heard from several of our customers now prior to using Zebrium, but when they're, you know, have some issue, and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metric has, you know, taken a blip or something's happened that they know there's a problem, we've heard from them that it can take like a number of hours just to get to the right set of logs, like figuring out over these hundreds of services where the logs are to get to them, maybe searching in a log manager, just to get into the right context even can take hours. So, you know, that's obviously the problem we solve, but, you know, we don't want them just looking at logs. I mean, you know, we don't want to put 'em back in the thing they don't like doing 'cause people don't do what they don't like doing. So we put it up on the dashboard. 
So if something is going wrong with your metrics, and that's the indicator or maybe it's something with tracing that you're sort of digging through now that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs. You click on one of our integrations, and we integrate with all these tools that Larry's talked about, and when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong, and you know there's an issue, take a look at the portion of your dashboard that is us, and we're going to tell you why. We're going to get you to the why that went wrong. Not no other work could be- You can, you know, also click down and click through to us so that you land up in our portal if you want to do some more digging around if you need to or whatever, maybe to get some context, what have you, but it's rare that you ever need to do that. The answer should be right there on your dashboard, and that's how we expect people to use it. We don't want them digging in logs and going through things. We want it to be right in their workflow. >> Great, thank you, Larry. So Rod, we talked about Cisco. We're going to hear more from them in a moment and Seagate. I would think this is like a perfect solution for a SaaS provider, anybody doing AIOps, do you have some examples of those types of firms leaning into this? >> Yeah, a couple of great, well, I mean, we got many of them, but couple that I'll touch on. We have an actual AIOps company that was looking for, you know, sort of some complementary technology and so on, and so they decided to just put us through our paces by having one of their own SREs sign up for our service in our SaaS environment and send the logs from their system to us, you know, and just see how we did. 
So it turned out we ended up talking back to this SRE like a week after he had installed the product, you know, signed up, and then, you know, started sending us logs, and, you know, he was hemming and hawing saying that he was busy like, you know, like every SRE is, and that he didn't have a chance to really do much with us yet, and, you know, we just, you know, having this conversation on the phone, and he comes to tell us that, "Yeah, I've been busy because we had this, you know, terrible outage like, you know, five days ago," and we said like, "Okay, did you actually look on the Zebrium dashboard?" (laughs) And he goes, "You know what? I didn't even think to do it yet. I mean, I'd just been so busy and frazzled." So we have an integration with that company. He hadn't put that integration in so it wasn't in his dashboard yet, but it was certainly on ours. So he went there and he looks on the day like, you know, on the time range of when he had this incident, and right at the very top of the page on our portal was the incident with the root cause, and he was flabbergasted. It literally would've saved him hours and hours and hours. They had this issue going on for over 24 hours, and we had the answer right there in five minutes, and it was crazy, and we get that kind of story. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it, and you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. In SaaS companies, you know, one of our customers is Archera. They do cost optimizations for cloud properties, you know, for AWS optimization, Google cloud, and so on, but they use our software, and they have a lot of interaction, obviously, with these cloud vendors and the APIs of those cloud vendors. So, you know, in order to figure out you're costing at AWS, they're using all those APIs. 
So it turned out, you know, they had some issue where their services were breaking and we had that root cause report right on the screen, again, within five minutes that was pointing to an API problem with Google, and they had changed one of their APIs, and Archera was not aware of it. So their stuff was breaking because of a change downstream that we had caught, and I'll just tell you one last one because it's somewhat related to one of these cloud vendors of, you know, big cloud vendor who had an outage couple of months ago, and it's interesting because, you know, lot of our customers will set up shared Slack channels with us where we're monitoring or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them or the root cause that we've detected for them, and that's in a shared community channel. So we could see this happening when that AWS outage happened. We could see our customers getting impacted by that AWS outage and the root cause of what was going on there in AWS that was impacting our customers, that was showing up in our incidents. Now, we didn't obviously, you know, have the very root cause of what was going on in AWS per se, but we were getting to the root cause of why our customer's applications were failing, and that was because of issues going on at AWS. >> Very interesting. I mean, I think one of your biggest challenge is going to be getting people's attention because these SREs is so busy, their hair's on fire. (all laughs) You know, he's like, "Hey, chap, I'm going to show you, look at this." >> I tell you. You get their attention, they love it. I mean, this AIOps company, I didn't even tell you the punchline there, but, you know, they had this incident that occurred that we found and, quite literally, the next week, they ended up signing up as a paid customer, so. >> That's great, and Larry, give you the last word. 
I mean, you know, Rod was talking about, you know, changes in APIs, and, you know, there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs and people don't want to leave. >> That's right, absolutely. >> But, yeah, close it out here. >> Yeah, I mean, you can, that's right, you can run it on-prem, just like we run it in our cloud. You can run it in your cloud or on your own infrastructure. Now, that's all true. You know, I think the one hurdle now that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. You don't believe it? Do a POC, try it yourself. And, you know, people have become so jaded by the lack of, you know, real sort of innovation in the software industry for the last 10 years that it's hard to get people to... But guys, you got to give it a shot. I'm telling you. I'm telling you right now, it works, and you'll hear more about that from one of our customers in a minute. >> Alright guys, thanks so much. Great story, really appreciate you sharing. >> Thank you. >> Yeah, thanks, Dave. Appreciate the time. >> Okay, in a moment, we're going to hear from Cisco who is the customer in this case example, and a company that is... Look, they have quite an impressive suite of observability tooling, and they've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of like Webex. So stay tuned and learn about how you can really take advantage of this new technology called root cause as a service. You're watching "theCUBE", the leader in enterprise and emerging tech coverage. (bright outro music)

Published Date : May 25 2022


Larry Lancaster, Zebrium | Virtual Vertica BDC 2020

>> Announcer: It's theCUBE! Covering the Virtual Vertica Big Data Conference 2020 brought to you by Vertica. >> Hi, everybody. Welcome back. You're watching theCUBE's coverage of the Vertica Virtual Big Data Conference. It was, of course, going to be in Boston at the Encore Hotel. Win big with big data with the new casino but obviously Coronavirus has changed all that. Our hearts go out and we have empathy for those people who are struggling. We are going to continue our wall-to-wall coverage of this conference and we're here with Larry Lancaster who's the founder and CTO of Zebrium. Larry, welcome to theCUBE. Thanks for coming on. >> Hi, thanks for having me. >> You're welcome. So first question, why did you start Zebrium? >> You know, I've been dealing with machine data a long time. So for those of you who don't know what that is, if you can imagine servers or whatever goes on in a data center or in a SaaS shop. There's data coming out of those servers, out of those applications and basically, you can build a lot of cool stuff on that. So there's a lot of metrics that come out and there's a lot of log files that come. And so, I've built this... Basically spent my career building that sort of thing. So tools on top of that or products on top of that. The problem is that since at least log files are completely unstructured, it's always doing the same thing over and over again, which is going in and understanding the data and extracting the data and all that stuff. It's very time consuming. If you've done it like five times you don't want to do it again. So really, my idea was at this point with machine learning where it's at there's got to be a better way. So Zebrium was founded on the notion that we can just do all that automatically. We can take a pile of machine data, we can turn it into a database, and we can build stuff on top of that. And so the company is really all about bringing that value to the market. >> That's cool. 
I want to get into that, just better understand who you're disrupting and understand that opportunity better. But before I do, tell us a little bit about your background. You got kind of an interesting background. Lot of tech jobs. Give us some color there. >> Yeah, so I started in the Valley I guess 20 years ago and when my son was born I left grad school. I was in grad school over at Berkeley, Biophysics. And I realized I needed to go get a job so I ended up starting in software and I've been there ever since. I mean, I spent a lot of time at, I guess I cut my teeth at NetApp, which was a storage company. And then I co-founded a business called Glassbeam, which was kind of an ETL database company. And then after that I ended up at Nimble Storage. Another company, EMC, ended up buying the Glassbeam so I went over there and then after Nimble though, which is where I built the InfoSight platform. That's where I kind of, after that I was able to step back and take a year and a half and just go into my basement, actually, this is my kind of workspace here, and come up with the technology and actually build it so that I could go raise money and get a team together to build Zebrium. So that's really my career in a nutshell. >> And you've got Hello Kitty over your right shoulder, which is kind of cool >> That's right. >> And then up to the left you got your monitor, right? >> Well, I had it. It's over here, yeah. >> But it was great! Pull it out, pull it out, let me see it. So, okay, so you got that. So what do you do? You just sit there and code all night or what? >> Yeah, that's right. So Hello Kitty's over here. I have a daughter and she set up my workspace here on this side with Hello Kitty and so on. And over on this side, I've got my recliner where I basically lay it all the way back and then I pivot this thing down over my face and put my keyboard on my lap and I can just sit there for like 20 hours. It's great. Completely comfortable. >> That's cool. 
All right, better put that monitor back or our guys will yell at me. But so, obviously, we're talking to somebody with serious coding chops and I'll also add that the Nimble InfoSight, I think it was one of the best pick ups that HP, HPE, has had in a while. And the thing that interested me about that, Larry, is the ability that the company was able to take that InfoSight and port it very quickly across its product lines. So that says to me it was a modern architecture, I'm sure API, microservices, and all those cool buzz words, but the proof is in their ability to bring that IP to other parts of the portfolio. So, well done. >> Yeah, well thanks. Appreciate that. I mean, they've got a fantastic team there. And the other thing that helps is when you have the notion that you don't just build on top of the data, you extract the data, you structure it, you put that in a database, we used Vertica there for that, and then you build on top of that. Taking the time to build that layer is what lets you build a scalable platform. >> Yeah, so, why Vertica? I mean, Vertica's been around for awhile. You remember you had the old RDBMS, Oracles, Db2s, SQL Server, and then the database was kind of a boring market. And then, all of a sudden, you had all of these MPP companies came out, a spate of them. They all got acquired, including Vertica. And they've all sort of disappeared and morphed into different brands and Micro Focus has preserved the Vertica brand. But it seems like Vertica has been able to survive the transitions. Why Vertica? What was it about that platform that was unique and interested you? >> Well, I mean, so they're the first ones to build, what I would call a real column store that's kind of market capable, right? So there was the C-Store project at Berkeley, which Stonebraker was involved in. And then that became sort of the seed from which Vertica was spawned. So you had this idea of, let's lay things out in a columnar way. 
And when I say columnar, I don't just mean that the data for every column is in a different set of files. What I mean by that is it takes full advantage of things like run-length encoding, and (indistinct) encoding, and block compression, and so you end up with these massive orders of magnitude savings in terms of the data that's being pulled off of storage as well as as it's moving through the pipeline internally in Vertica's query processing. So why am I saying all this? Because it's fundamentally, it was a fundamentally disruptive technology. I think column stores are ubiquitous now in analytics. And I think you could name maybe a couple of projects which are mostly open source who do something like Vertica does but name me another one that's actually capable of serving an enterprise as a relational database. I still think Vertica is unique in being that one. >> Well, it's interesting because you're a startup. And so a lot of startups would say, okay, we're going with a born-in-the-cloud database. Now Vertica touts that, well look, we've embraced cloud. You know, we have, we run in the cloud, we run on-prem, all different optionality. And you hear a lot of vendors say that, but a lot of times they're just taking their stack and stuffing it into the cloud. But, so why didn't you go with a cloud-native database and is Vertica able to, I mean, obviously, that's why you chose it, but I'm interested from a technologist standpoint as to why you, again, made that choice given all these other choices around there. >> Right, I mean, again, I'm not, so... As I explained a column store, which I think is the appropriate definition, I'm not aware of another cloud-native-- >> Hm, okay. >> I'm aware of other cloud-native transactional databases, I'm not aware of one that has the analytics form it and I've tried some of them. So it was not like I didn't look. 
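As a rough illustration of the run-length encoding mentioned in the discussion of column stores above, the sketch below collapses runs of repeated values in a single column into (value, run length) pairs. It is a minimal example under stated assumptions: the function names and the sample `severity` column are hypothetical, and this is not Vertica's actual storage format or implementation.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column of log severities, stored contiguously as a column store would.
severity = ["INFO"] * 5 + ["WARN"] * 2 + ["INFO"] * 3
encoded = rle_encode(severity)
print(encoded)
```

Ten stored values shrink to three pairs here. The saving grows with the length of the runs, which is why this kind of encoding tends to pay off on sorted or low-cardinality columns.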
What I was actually impressed with and I think what let me move forward using Vertica in our stack is the fact that Eon really is built from the ground up to be cloud-native. And so we've been using Eon almost ever since we started the work that we're doing. So I've been really happy with the performance and with reliability of Eon. >> It's interesting. I've been saying for years that Vertica's a diamond in the rough and its previous owner didn't know what to do with it because it got distracted and now Micro Focus seems to really see the value and is obviously putting some investments in there. >> Yeah >> Tell me more about your business. Who are you disrupting? Are you kind of disrupting the do-it-yourself? Or is there sort of a big whale out there that you're going to go after? Add some color to that. >> Yeah, so our broader market is monitoring software, that's kind of the high-level category. So you have a lot of people in that market right now. Some of them are entrenched in large players, like Datadog would be a great example. Some of them are smaller upstarts. It's a pretty, it's a pretty saturated market. But what's happened over the last, I'd say two years, is that there's been sort of a push towards what's called observability in terms of at least how some of the products are architected, like Honeycomb, and how some of them are messaged. Most of them are messaged these days. And what that really means is there's been sort of an understanding that's developed that MTTR is really what people need to focus on to keep their customers happy. If you're a SaaS company, MTTR is going to be your bread and butter. And it's still measured in hours and days. And the biggest reason for that is because of what's called unknown unknowns. Because of complexity. Nowadays, things are, applications are ten times as complex as they used to be. 
And what you end up with is a situation where if something is new, if it's a known issue with a known symptom and a known root cause, then you can set up an automation for it. But the ones that really cost a lot of time in terms of service disruption are unknown unknowns. And now you got to go dig into this massive mass of data. So observability is about making tools to help you do that, but it's still going to take you hours. And so our contention is, you need to automate the eyeball. The bottleneck is now the eyeball. And so you have to get away from this notion of a person's going to be able to do it infinitely more efficient and recognize that you need automated help. When you get an alert, it shouldn't be that, "Hey, something weird's happening. Now go dig in." It should be, "Here's a root cause and a symptom." And that should be proposed to you by a system that actually does the observing. That actually does the watching. And that's what Zebrium does. >> Yeah, that's awesome. I mean, you're right. The last thing you want is just another alert and it says, "Go figure something out because there's a problem." So how does it work, Larry? In terms of what you built there. Can you take us inside the covers? >> Yeah, sure. So there's really, right now there's two kinds of data that we're ingesting. There's metrics and there's log files. Metrics, there's actually sort of a framework that's really popular in DevOps circles especially but it's becoming popular everywhere, which is called Prometheus. And it's a way of exporting metrics so that scrapers can collect them. And so if you go look at a typical stack, you'll find that most of the open source components and many of the closed source components are going to have exporters that export all their stats to Prometheus. So by supporting that stack we can bring in all of those metrics. And then there's also the log files. 
And so you've got host log files in a containerized environment, you've got container logs, and you've got application-specific logs, perhaps living on a host mount. And you want to pull all those back and you want to be able to associate this log that I've collected here is associated with the same container on the same host that this metric is associated with. But now what? So once you've got that, you've got a pile of unstructured logs. So what we do is we take a look at those logs and we say, let's structure those into tables, right? So where I used to have a log message, if I look in my log file and I see it says something like, X happened five times, right? Well, that event type's going to occur again and it'll say, X happened six times or X happened three times. So if I see that as a human being, I can say, "Oh clearly, that's the same thing." And what's interesting here is the times that X, that X happened, and that this number read... I may want to know when the numbers happened as a time series, the values of that column. And so you can imagine it as a table. So now I have a table for that event type and every time it happens, I get a row. And then I have a column with that number in it. And so now I can do any kind of analytics I want almost instantly across my... If I have all my event types structured that way, everything changes. You can do real anomaly detection and incident detection on top of that data. So that's really how we go about doing it. How we go about being able to do autonomous monitoring in a way that's effective. >> How do you handle doing that for, like, a bespoke app? Do you have to, does somebody have to build a connector to those apps? How do you handle that? >> Yeah, that's a really good question. So you're right. 
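The "X happened five times" example above can be sketched in a few lines. The code below groups log lines by their constant text (the event type) and keeps the variable numbers as column values, so that each event type becomes a small table with one row per occurrence. This is only an illustration: it uses a simple regular expression for numbers rather than the machine learning described in the interview, and the names `structure`, `NUMBER`, and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumption for this sketch: the only variable parts of a message are
# standalone numbers. A real system would have to learn the variable fields.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def structure(lines):
    """Group log lines by constant text; each line adds one row of parameters."""
    tables = defaultdict(list)
    for line in lines:
        params = [float(m) for m in NUMBER.findall(line)]
        event_type = NUMBER.sub("<num>", line)  # constant part names the table
        tables[event_type].append(params)
    return dict(tables)


logs = [
    "X happened 5 times",
    "X happened 6 times",
    "disk sda1 is 91.5 percent full",
    "X happened 3 times",
]
tables = structure(logs)
for event_type, rows in tables.items():
    print(event_type, rows)
```

Once the rows exist, the values in a column can be treated as a time series, which is the property the anomaly detection described above depends on.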
So if I go and install a typical log manager, there'll be connectors for different apps and usually what that means is pulling in the stuff on the left, if you were to be looking at that log line, and it will be things like a time stamp, or a severity, or a function name, or various other things. And so the connector will know how to pull those apart and then the stuff to the right will be considered the message and that'll get indexed for search. And so our approach is we actually go in with machine learning and we structure that whole thing. So there's a table. And it's going to have a column called severity, and timestamp, and function name. And then it's going to have columns that correspond to the parameters that are in that event. And it'll have a name associated with the constant parts of that event. And so you end up with a situation where you've structured all of it automatically so we don't need collectors. It'll work just as well on your home-grown app that has no collectors or no parsers defined or anything. It'll work immediately just as well as it would work on anything else. And that's important, because you can't be asking people for connectors to their own applications. It just, it becomes now they've got to stop what they're doing and go write code for you, for your platform and they have to maintain it. It's just untenable. So you can be up and running with our service in three minutes. It'll just be monitoring those for you. >> That's awesome! I mean, that is really a breakthrough innovation. So, nice. Love to see that hittin' the market. Who do you sell to? What types of companies and what role within the company? >> Well, definitely there's two main sort of pushes that we've seen, or I should say pulls. One is from DevOps folks, SRE folks. So these are people who are tasked with monitoring an environment, basically. And then you've got people who are in engineering and they have a staging environment. And what they actually find valuable is... 
Because when we find an incident in a staging environment, yeah, half the time it's because they're tearing everything up and it's not release ready, whatever's in stage. That's fine, they know that. But the other half the time it's new bugs, it's issues and they're finding issues. So it's kind of diverged. You have engineering users and they don't have titles like QA, they're Dev engineers or Dev managers that are really interested. And then you've got DevOps and SRE people there (mumbles). >> And how do I consume your product? Is it SaaS... I sign up and you say within three minutes I'm up and running. I'm paying by the drink. >> Well, (laughs) right. So there's a couple ways. So, right. So the easiest way is if you use Kubernetes. So Kubernetes is what's called a container orchestrator. So these days, you know Docker and containers and all that, so now there's container orchestrators have become, I wouldn't say ubiquitous but they're very popular now. So it's kind of on that inflection curve. I'm not exactly sure the penetration but I'm going to say 30-40% probably of shops that were interested are using container orchestrators. So if you're using Kubernetes, basically you can install our Kubernetes chart, which basically means copying and pasting a URL and so on into your little admin panel there. And then it'll just start collecting all the logs and metrics and then you just login on the website. And the way you do that is just go to our website and it'll show you how to sign up for the service and you'll get your little API key and link to the chart and you're off and running. You don't have to do anything else. You can add rules, you can add stuff, but you don't have to. You shouldn't have to, right? You should never have to do any more work. >> That's great. So it's a SaaS capability and I just pay for... How do you price it? >> Oh, right. So it's priced on volume, data volume. I don't want to go too much into it because I'm not the pricing guy. 
But what I'll say is that it's, as far as I know it's as cheap or cheaper than any other log manager or metrics product. It's in that same neighborhood as the very low priced ones. Because right now, we're not trying to optimize for take. We're trying to make a healthy margin and get the value of autonomous monitoring out there. Right now, that's our priority. >> And it's running in the cloud, is that right? AWS West-- >> Yeah, that's right. Oh, I should've also pointed out that you can have a free account if it's less than some number of gigabytes a day we're not going to charge. Yeah, so we run in AWS. We have a multi-tenant instance in AWS. And we have a Vertica Eon cluster behind that. And it's been working out really well. >> And on your freemium, you have used the Vertica Community Edition? Because they don't charge you for that, right? So is that how you do it or... >> No, no. We're, no, no. So, I don't want to go into that because I'm not the bizdev guy. But what I'll say is that if you're doing something that winds up being OEM-ish, you can work out the particulars with Vertica. It's not like you're going to just go pay retail and they won't let you distinguish between tests, and prod, and paid, and all that. They'll work with you. Just call 'em up. >> Yeah, and that's why I brought it up because Vertica, they have a community edition, which is not neutered. It runs Eon, it's just there's limits on clusters and storage >> There's limits. >> But it's still fully functional though. >> So to your point, we want it multi-tenant. So it's big just because it's multi-tenant. We have hundreds of users on that (audio cuts out). >> And then, what's your partnership with Vertica like? Can we close on that and just describe that a little bit? >> What's it like. I mean, it's pleasant. >> Yeah, I mean (mumbles). >> You know what, so the important thing... Here's what's important. What's important is that I don't have to worry about that layer of our stack. 
When it comes to being able to get the performance I need, being able to get the economy of scale that I need, being able to get the absolute scale that I need, I've not been disappointed ever with Vertica. And frankly, being able to have ACID guarantees and everything else, like a normal mature database that can join lots of tables and still be fast, that's also necessary at scale. And so I feel like it was definitely the right choice to start with. >> Yeah, it's interesting. I remember in the early days of big data a lot of people said, "Who's going to need these ACID properties and all this complexity of databases." And of course, ACID properties and SQL became the killer features and functions of these databases. >> Who didn't see that one coming, right? >> Yeah, right. And then, so you guys have done a big seed round. You've raised a little over $6 million and you got the product market fit down. You're ready to rock, right? >> Yeah, that's right. So we're doing a launch probably, well, when this airs it'll probably be the day before this airs. Basically, yeah. We've got people... Like literally in the last, I'd say, six to eight weeks, it's just been this sort of pique of interest. All of a sudden, everyone kind of gets what we're doing, realizes they need it, and we've got a solution that seems to meet expectations. So it's like... It's been an amazing... Let me just say this, it's been an amazing start to the year. I mean, at the same time, it's been really difficult for us but more difficult for some other people that haven't been able to go to work over the last couple of weeks and so on. But it's been a good start to the year, at least for our business. So... >> Well, Larry, congratulations on getting the company off the ground and thank you so much for coming on theCUBE and being part of the Virtual Vertica Big Data Conference. >> Thank you very much. >> All right, and thank you everybody for watching. This is Dave Vellante for theCUBE. 
Keep it right there. We're covering wall-to-wall Virtual Vertica BDC. You're watching theCUBE. (upbeat music)

Published Date : Mar 31 2020

Keynote Analysis | Virtual Vertica BDC 2020

(upbeat music) >> Narrator: It's theCUBE, covering the Virtual Vertica Big Data Conference 2020. Brought to you by Vertica. >> Dave Vellante: Hello everyone, and welcome to theCUBE's exclusive coverage of the Vertica Virtual Big Data Conference. You're watching theCUBE, the leader in digital event tech coverage. And we're broadcasting remotely from our studios in Palo Alto and Boston. And, we're pleased to be covering wall-to-wall this digital event. Now, as you know, originally BDC was scheduled this week at the new Encore Hotel and Casino in Boston. Their theme was "Win big with big data". Oh sorry, "Win big with data". That's right, got it. And, I know the community was really looking forward to that, you know, meet up. But look, we're making the best of it, given these uncertain times. We wish you and your families good health and safety. And this is the way that we're going to broadcast for the next several months. Now, we want to unpack Colin Mahony's keynote, but, before we do that, I want to give a little context on the market. First, theCUBE has covered every BDC since its inception, since the BDC's inception that is. It's a very intimate event, with a heavy emphasis on user content. Now, historically, the data engineers and DBAs in the Vertica community, they comprised the majority of the content at this event. And, that's going to be the same for this virtual, or digital, production. Now, theCUBE is going to be broadcasting for two days. What we're doing, is we're going to be concurrent with the Virtual BDC. We got practitioners that are coming on the show, DBAs, data engineers, database gurus, we got security experts coming on, and really a great line up. And, of course, we'll also be hearing from Vertica Execs, Colin Mahony himself right off the keynote, folks from product marketing, partners, and a number of experts, including some from Micro Focus, which is the, of course, owner of Vertica. 
But I want to take a moment to share a little bit about the history of Vertica. The company, as you know, was founded by Michael Stonebraker. And, Vertica started, really they started out as a SQL platform for analytics. It was the first, or at least one of the first, to really nail the MPP column store trend. Not only did Vertica have an early mover advantage in MPP, but the efficiency and scale of its software, relative to traditional DBMS, and also other MPP players, is underscored by the fact that Vertica, and the Vertica brand, really thrives to this day. But, I have to tell you, it wasn't without some pain. And, I'll talk a little bit about that, and really talk about how we got here today. So first, you know, you think about traditional transaction databases, like Oracle or IBM DB2, or even enterprise data warehouse platforms like Teradata. They were simply not purpose-built for big data. Vertica was. Along with a whole bunch of other players, like Netezza, which was bought by IBM, Aster Data, which is now Teradata, Actian, ParAccel, which was the basis for Redshift, Amazon's Redshift, Greenplum was bought, in the early days, by EMC. And, these companies were really designed to run as massively parallel systems that smoked traditional RDBMS and EDW for particular analytic applications. You know, back in the big data days, I often joked that, like an NFL draft, there was a run on MPP players, like when you see a run on pulling guards. You know, once one goes, they all start to fall. And that's what you saw with the MPP columnar stores, IBM, EMC, and then HP getting into the game. So, it was like 2011, and Leo Apotheker, he was the new CEO of HP. Frankly, he had no clue, in my opinion, with what to do with Vertica, and totally missed one of the biggest trends of the last decade, the data trend, the big data trend. HP picked up Vertica for a song, it wasn't disclosed, but my guess is that it was around 200 million. 
So, rather than build a bunch of smart tokens around Vertica, which I always call the diamond in the rough, Apotheker basically permanently altered HP for years. He kind of ruined HP, in my view, with a 12 billion dollar purchase of Autonomy, which turned out to be one of the biggest disasters in recent M&A history. HP was forced to spin merge, and ended up selling most of its software to Microsoft, Micro Focus. (laughs) Luckily, during its time at HP, CEO Meg Whitman, largely was distracted with what to do with the mess that she inherited from Apotheker. So, Vertica was left alone. Now, the upshot is Colin Mahony, who was then the GM of Vertica, and still is. By the way, he's really the CEO, and he just doesn't have the title, I actually think they should give that to him. But anyway, he's been at the helm the whole time. And Colin, as you'll see in our interview, is a rockstar, he's got technical and business chops, people love him in the community. Vertica's culture is really engineering driven and they're all about data. Despite the fact that Vertica is a 15-year-old company, they've really kept pace, and not been polluted by legacy baggage. Vertica, early on, embraced Hadoop and the whole open-source movement. And that helped give it tailwinds. It leaned heavily into cloud, as we're going to talk about further this week. And they got a good story around machine intelligence and AI. So, whereas many traditional database players are really getting hurt, and some are getting killed, by cloud database providers, Vertica's actually doing a pretty good job of servicing its install base, and is in a reasonable position to compete for new workloads. On its last earnings call, the Micro Focus CFO, Stephen Murdoch, he said they're investing 70 to 80 million dollars in two key growth areas, security and Vertica. Now, Micro Focus is running its SUSE play on these two parts of its business. 
What I mean by that, is they're investing and allowing them to be semi-autonomous, spending on R&D and go to market. And, they have no hardware agenda, unlike when Vertica was part of HP, or HPE, I guess HP, before the spin out. Now, let me come back to the big trend in the market today. And there's something going on around analytic databases in the cloud. You've got companies like Snowflake and AWS with Redshift, as we've reported numerous times, and they're doing quite well, they're gaining share, especially of new workloads that are emerging, particularly in the cloud native space. They combine scalable compute, storage, and machine learning, and, importantly, they're allowing customers to scale compute and storage independent of each other. Why is that important? Because you don't have to buy storage every time you buy compute, or vice versa, in chunks. So, if you can scale them independently, you've got granularity. Vertica is keeping pace. In talking to customers, Vertica is leaning heavily into the cloud, supporting all the major cloud platforms, as we heard from Colin earlier today, adding Google. And, while my research shows that Vertica has some work to do in cloud and cloud native, to simplify the experience, its more robust and mature stack, which supports many different environments, you know deep SQL, ACID properties, and DNA that allows Vertica to compete with these cloud-native database suppliers. Now, Vertica might lose out in some of those native workloads. But, I have to say, my experience in talking with customers, if you're looking for a great MPP column store that scales and runs in the cloud, or on-prem, Vertica is in a very strong position. Vertica claims to be the only MPP columnar store to allow customers to scale compute and storage independently, both in the cloud and in hybrid environments on-prem, et cetera, cross clouds, as well. 
So, while Vertica may be at a disadvantage in a pure cloud native bake-off, its more robust and mature stack, combined with its multi-cloud strategy, gives Vertica a compelling set of advantages. So, we heard a lot of this from Colin Mahony, who announced Vertica 10.0 in his keynote. He really emphasized Vertica's multi-cloud affinity, its Eon Mode, which really allows that separation, or scaling of compute, independent of storage, both in the cloud and on-prem. Vertica 10, according to Mahony, is making big bets on in-database machine learning, he talked about that, AI, and along with some advanced regression techniques. He talked about PMML models, Python integration, which was actually something that they talked about doing with Uber and some other customers. Now, Mahony also stressed the trend toward object stores. And, Vertica now supports, let's see S3, with Eon, S3 Eon in Google Cloud, in addition to AWS, and then Pure and HDFS, as well, they all support Eon Mode. Mahony also stressed, as I mentioned earlier, a big commitment to on-prem and the whole cloud optionality thing. So 10.0, according to Colin Mahony, is all about really doubling down on these industry waves. As they say, enabling native PMML models, running them in Vertica, and really doing all the work that's required around ML and AI, they also announced support for TensorFlow. So, object store optionality is important, is what he talked about in Eon Mode, with the news of support for Google Cloud and, as well as HDFS. And finally, a big focus on deployment flexibility. Migration tools, which are a critical focus really on improving ease of use, and you hear this from a lot of customers. So, these are the critical aspects of Vertica 10.0, and an announcement that we're going to be unpacking all week, with some of the experts that I talked about. So, I'm going to close with this. My long-time co-host, John Furrier, and I have talked some time about this new cocktail of innovation. 
No longer is Moore's law the, really, mainspring of innovation. It's now about taking all these data troves, bringing machine learning and AI into that data to extract insights, and then operationalizing those insights at scale, leveraging cloud. And, one of the things I always look for from cloud is, if you've got a cloud play, you can attract innovation in the form of startups. It's part of the success equation, certainly for AWS, and I think it's one of the challenges for a lot of the legacy on-prem players. Vertica, I think, has done a pretty good job in this regard. And, you know, we're going to look this week for evidence of that innovation. One of the interviews that I'm personally excited about this week, is a new-ish company, I would consider them a startup, called Zebrium. What they're doing, is they're applying AI to do autonomous log monitoring for IT ops. And, I'm interviewing Larry Lancaster, who's their CEO, this week, and I'm going to press him on why he chose to run on Vertica and not a cloud database. This guy is a hardcore tech guru and I want to hear his opinion. Okay, so keep it right there, stay with us. We're all over the Vertica Virtual Big Data Conference, covering in-depth interviews and following all the news. So, theCUBE is going to be interviewing these folks, two days, wall-to-wall coverage, so keep it right there. We're going to be right back with our next guest, right after this short break. This is Dave Vellante and you're watching theCUBE. (upbeat music)

Published Date : Mar 31 2020

UNLIST TILL 4/2 - Autonomous Log Monitoring

>> Sue: Hi everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Autonomous Monitoring Using Machine Learning". My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slide and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer them offline. Alternatively, you can also go and visit Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Larry, over to you. >> Larry: Hey, thanks so much. So hi, my name's Larry Lancaster and I'm here to talk to you today about something that I think whose time has come and that's autonomous monitoring. So, with that, let's get into it. So, machine data is my life. I know that's a sad life, but it's true. So I've spent most of my career kind of taking telemetry data from products, either in the field, we used to call it in the field or nowadays, that's been deployed, and bringing that data back, like log file stats, and then building stuff on top of it. So, tools to run the business or services to sell back to users and customers. 
And so, after doing that a few times, it kind of got to the point where I was really sort of sick of building the same kind of thing from scratch every time, so I figured, why not go start a company and do it so that we don't have to do it manually ever again. So, it's interesting to note, I've put a little sentence here saying, "companies where I got to use Vertica" So I've been actually kind of working with Vertica for a long time now, pretty much since they came out of alpha. And I've really been enjoying their technology ever since. So, our vision is basically that I want a system that will characterize incidents before I notice. So an incident is, you know, we used to call it a support case or a ticket in IT, or a support case in support. Nowadays, you may have a DevOps team, or a set of SREs who are monitoring a production sort of deployment. And so they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened. And so that's a pretty heady goal. And so I'm going to talk a little bit today about how we do that. So, if we look at logs in particular. Logs today, if you look at log monitoring. So monitoring is kind of that whole umbrella term that we use to talk about how we monitor systems in the field that we've shipped, or how we monitor production deployments in a more modern stack. And so basically there are log monitoring tools. But they have a number of drawbacks. For one thing, they're kind of slow in the sense that if something breaks and I need to go to a log file, actually chances are really good that if you have a new issue, if it's an unknown unknown problem, you're going to end up in a log file. So the problem then becomes basically you're searching around looking for what's the root cause of the incident, right? And so that's kind of time-consuming. 
So, they're also fragile and this is largely because log data is completely unstructured, right? So there's no formal grammar for a log file. So you have this situation where, if I write a parser today, and that parser is going to do something, it's going to execute some automation, it's going to open or update a ticket, it's going to maybe restart a service, or whatever it is that I want to happen. What'll happen is later upstream, someone who's writing the code that produces that log message, they might do something really useful for me, or for users. And they might go fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks. So it's a very fragile source for automation. And finally, because of that, people will set alerts on, "Oh, well tell me how many thousands of errors are happening every hour." Or some horrible metric like that. And then that becomes the only visibility you have in the data. So because of all this, it's a very human-driven, slow, fragile process. So basically, we've set out to kind of up-level that a bit. So I touched on this already, right? The truth is if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder, if that's the case, why do most people use metrics only for monitoring? And the reason is related to the problems I just described. They're already structured, right? So for logs, you've got this mess of stuff, so you only want to dig in there when you absolutely have to. But ironically, it's where a lot of the information that you need actually is. So we have a model today, and this model used to work pretty well. And that model is called "index and search". And it basically means you treat log files like they're text documents. And so you index them and when there's some issue you have to drill into, then you go searching, right? So let's look at that model. 
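The fragility Larry describes, where a corrected spelling in a log message quietly breaks downstream automation, can be shown with a very small sketch. The log message, the misspelling, and the function name below are invented for illustration; they are not taken from any real product or from Zebrium.

```python
import re

# Hypothetical hand-written pattern, matching a (misspelled) message exactly
# as some upstream developer originally wrote it.
DISK_FULL = re.compile(r"Disk volumne (?P<vol>\S+) is full")


def should_open_ticket(line: str) -> bool:
    """Return True when the hand-written pattern matches the log line."""
    return DISK_FULL.search(line) is not None


# The message as it was when the parser was written.
old_line = "2020-03-31 10:00:01 ERROR Disk volumne /dev/sda1 is full"
# The same event after someone upstream fixes the spelling mistake.
new_line = "2020-03-31 10:00:01 ERROR Disk volume /dev/sda1 is full"

print(should_open_ticket(old_line))  # True
print(should_open_ticket(new_line))  # False: the automation silently stops firing
```

Nothing errors out in the second case, which is the point: the rule simply never matches again, and nobody is told.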
So 20 years ago, we had sort of a shrink-wrap software delivery model. You had an incident. With that incident, maybe you had one customer and you had a monolithic application and a handful of log files. So it's perfectly natural, in fact, usually you could just vi the log file, and search that way. Or if there's a lot of them, you could index them and search them that way. And that all worked very well because the developer or the support engineer had to be an expert in those few things, in those few log files, and understand what they meant. But today, everything has changed completely. So we live in a software as a service world. What that means is, for a given incident, first of all you're going to be affecting thousands of users. You're going to have, potentially, 100 services that are deployed in your environment. You're going to have 1,000 log streams to sift through. And yet, you're still kind of stuck in the situation where to go find out what's the matter, you're going to have to search through the log files. So this is kind of the unacceptable sort of position we're in today. So for us, the future will not be index and search. And that's simply because it cannot scale. And the reason I say that it can't scale is because it all kind of is bottlenecked by a person and their eyeball. So, you continue to drive up the amount of data that has to be sifted through, the complexity of the stack that has to be understood, and you still, at the end of the day, for MTTR purposes, you still have the same bottleneck, which is the eyeball. So this model, I believe, is fundamentally broken. And that's why, I believe in five years you're going to be in a situation where most monitoring of unknown unknown problems is going to be done autonomously. And those issues will be characterized autonomously because there's no other way it can happen. So now I'm going to talk a little bit about autonomous monitoring itself. 
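For readers who have not built one, the "index and search" model criticized above amounts to treating log lines as text documents. The following is a minimal sketch of that idea, assuming whitespace tokenization and made-up log lines; real log managers are far more elaborate. Note that the index answers a query only after a person has decided what to search for, which is the "eyeball" bottleneck described in the talk.

```python
from collections import defaultdict


def build_index(lines):
    """Map each lowercase token to the set of line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in line.lower().split():
            index[token].add(line_no)
    return index


def search(index, query):
    """Return sorted line numbers that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Invented log lines, for illustration only.
logs = [
    "INFO service started",
    "ERROR database connection refused",
    "INFO request served",
    "ERROR database stopped",
]
index = build_index(logs)
print(search(index, "error database"))  # [1, 3]
```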
So, autonomous monitoring basically means, if you can imagine in a monitoring platform and you watch the monitoring platform, maybe you watch the alerts coming from it or more importantly, you kind of watch the dashboards and try to see if something looks weird. So autonomous monitoring is the notion that the platform should do the watching for you and only let you know when something is going wrong and should kind of give you a window into what happened. So if you look at this example I have on screen, just to take it really slow and absorb the concept of autonomous monitoring. So here in this example, we've stopped the database. And as a result, down below you can see there were a bunch of fallout. This is an Atlassian Stack, so you can imagine you've got a Postgres database. And then you've got sort of Bitbucket, and Confluence, and Jira, and these various other components that need the database operating in order to function. So what this is doing is it's calling out, "Hey, the root cause is the database stopped and here's the symptoms." Now, you might be wondering, so what. I mean I could go write a script to do this sort of thing. Here's what's interesting about this very particular example, and I'll show a couple more examples that are a little more involved. But here's the interesting thing. So, in the software that came up with this incident and opened this incident and put this root cause and symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, Confluence, there's no regexes that talk about starting, stopped, RDBMS, swallowed exception, and so on and so forth. So you might wonder how it's possible then, that something which is completely ignorant of the stack, could come up with this description, which is exactly what a human would have had to do, to figure out what happened. And I'm going to get into how we do that. But that's what autonomous monitoring is about. 
It's about getting into a set of telemetry from a stack with no prior information, and understanding when something breaks. And I could give you the punchline right now, which is there are fundamental ways that software behaves when it's breaking. And by looking at hundreds of data sets that people have generously allowed us to use containing incidents, we've been able to characterize that and now generalize it to apply it to any new data set and stack. So here's an interesting one right here. So there's a fella, David Gill, he's just a genius in the monitoring space. He's been working with us for the last couple of months. So he said, "You know what I'm going to do, is I'm going to run some chaos experiments." So for those of you who don't know what chaos engineering is, here's the idea. So basically, let's say I'm running a Kubernetes cluster and what I'll do is I'll use sort of a chaos injection test, something like Litmus. And basically it will inject issues, it'll break things in my application randomly to see if my monitoring picks it up. And so this is what chaos engineering is built around. It's built around sort of generating lots of random problems and seeing how the stack responds. So in this particular case, David went in and he deleted, basically one of the tests that was presented through Litmus did a pod delete. And so that's going to basically take out some containers that are part of the service layer. And so then you'll see all kinds of things break. And so what you're seeing here, which is interesting, this is why I like to use this example. Because it's actually kind of eye-opening. So the chaos tool itself generates logs. And of course, through Kubernetes, all the log file locations that are on the host, and the container logs are known. And those are all pulled back to us automatically. So one of the log files we have is actually the chaos tool that's doing the breaking, right? 
And so what the tool said here, when it went to determine what the root cause was, was it noticed that there was this process that had these messages happen, initializing deletion lists, selecting a pod to kill, blah blah blah. It's saying that the root cause is the chaos test. And it's absolutely right, that is the root cause. But usually chaos tests don't get picked up themselves. You're supposed to be just kind of picking up the symptoms. But this is what happens when you're able to kind of tease out root cause from symptoms autonomously, is you end up getting a much more meaningful answer, right? So here's another example. So essentially, we collect the log files, but we also have a Prometheus scraper. So if you export Prometheus metrics, we'll scrape those and we'll collect those as well. And so we'll use those for our autonomous monitoring as well. So what you're seeing here is an issue where, I believe this is where we ran something out of disk space. So it opened an incident, but what's also interesting here is, you see that it pulled that metric to say that the spike in this metric was a symptom of this running out of space. So again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no actual hard-coded logic anywhere to explain any of this. And so the concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it. You'd go looking around for rare things. Things that are not normal. And you would look for indicators of breakage, and you would see, do those seem to be correlated in some dimension? That is how the system works. So as I mentioned a moment ago, metrics really do kind of complete the picture for us. We end up in a situation where we have a one-stop shop for incident root cause. So, how does that work? Well, we ingest and we structure the log files. 
So if we're getting the logs, we'll ingest them and we'll structure them, and I'm going to show a little bit what that structure looks like and how that goes into the database in a moment. And then of course we ingest and structure the Prometheus metrics. But here, structure really should have an asterisk next to it, because metrics are mostly structured already. They have names. If you have your own scraper, as opposed to going into the time series Prometheus database and pulling metrics from there, you can keep a lot more information about metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data. And then we cross-correlate metrics and log anomalies. And then we create incidents. So this is at a high level, kind of what's happening without any sort of stack-specific logic built in. So we had some exciting recent validation. So MayaData's a pretty big player in the Kubernetes space. Essentially, they do Kubernetes as a managed service. They have tens of thousands of customers that they manage their Kubernetes clusters for them. And then they're also involved, both in the OpenEBS project, as well as in the Litmus project I mentioned a moment ago. That's their tool for chaos engineering. So they're a pretty big player in the Kubernetes space. So essentially, they said, "Oh okay, let's see if this is real." So what they did was they set up our collectors, which took three minutes in Kubernetes. And then they went and they, using Litmus, they reproduced eight incidents that their actual, real-world customers had hit. And they were trying to remember the ones that were the hardest to figure out the root cause at the time. And we picked up and put a root cause indicator that was correct in 100% of these incidents with no training, configuration, or metadata required. So this is kind of what autonomous monitoring is all about. 
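The last two steps of the pipeline described above (cross-correlate metric and log anomalies, then create incidents) can be sketched as follows. The talk only says the anomalies are "correlated in some dimension"; this sketch assumes that dimension is time, and the `Anomaly` class, the 30-second window, and the sample events are all hypothetical. It is not Zebrium's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    ts: float      # seconds since an arbitrary epoch
    source: str    # "log" or "metric"
    detail: str


def build_incidents(anomalies, window=30.0):
    """Group anomalies that occur within `window` seconds of the previous one,
    then keep only groups containing both a log and a metric anomaly."""
    groups, current = [], []
    for a in sorted(anomalies, key=lambda x: x.ts):
        # A gap longer than the window closes the current group.
        if current and a.ts - current[-1].ts > window:
            groups.append(current)
            current = []
        current.append(a)
    if current:
        groups.append(current)
    return [g for g in groups if {x.source for x in g} >= {"log", "metric"}]


# Invented anomalies loosely modeled on the disk-space example in the talk.
anomalies = [
    Anomaly(100.0, "log", "rare event: 'no space left on device'"),
    Anomaly(105.0, "metric", "spike in filesystem usage"),
    Anomaly(900.0, "log", "rare event: 'config reloaded'"),
]
for incident in build_incidents(anomalies):
    print([a.detail for a in incident])
```

With this input, the isolated log anomaly at t=900 is dropped because nothing corroborates it, while the first two anomalies form a single incident.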
So now I'm going to talk a little bit about how it works. So, like I said, there's no information included or required about, so if you imagine a log file for example. Now, commonly, over to the left-hand side of every line, there will be some sort of a prefix. And what I mean by that is you'll see like a timestamp, or a severity, and maybe there's a PID, and maybe there's function name, and maybe there's some other stuff there. So basically that's kind of, it's common data elements for a large portion of the lines in a given log file. But you know, of course, the contents change. So basically today, like if you look at a typical log manager, they'll talk about connectors. And what connectors means is, for an application it'll generate a certain prefix format in a log. And that means what's the format of the timestamp, and what else is in the prefix. And this lets the tool pick it up. And so if you have an app that doesn't have a connector, you're out of luck. Well, what we do is we learn those prefixes dynamically with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors, you don't have to describe the prefix format. That's so yesterday, right? So really what we want to be doing is up-leveling what the system is doing to the point where it's kind of working like a human would. You look at a log line, you know what's a timestamp. You know what's a PID. You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system goes about it. As a result, we kind of embrace free-text logs, right? 
So if you look at a typical stack, most of the logs generated in a typical stack are usually free-text. Even structured logging typically will have a message attribute, which then inside of it has the free-text message. For us, that's not a bad thing. That's okay. The purpose of a log is to inform people. And so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. They'll figure it out for themselves, right? So, you give us the logs and we'll figure out the grammar, not only for the prefix but also for the variable message part. So I already went into this, but there's more that's usually required for configuring a log manager with alerts. You have to give it keywords. You have to give it application behaviors. You have to tell it some prior knowledge. And of course the problem with all of that is that the most important events that you'll ever see in a log file are the rarest. Those are the ones that are one out of a billion. And so you may not know what's going to be the right keyword in advance to pick up the next breakage, right? So we don't want that information from you. We'll figure that out for ourselves. As the data comes in, essentially we parse it and we categorize it, as I've mentioned. And when I say categorize, what I mean is, if you look at a certain given log file, you'll notice that some of the lines are kind of the same thing. So this one will say "X happened five times" and then maybe a few lines below it'll say "X happened six times" but that's basically the same event type. It's just a different instance of that event type. And it has a different value for one of the parameters, right? So when I say categorization, what I mean is figuring out those unique types and I'll show an example of that next. Anomaly detection, we do on top of that. So anomaly detection on metrics in a very sort of time series by time series manner with lots of tunables is a well-understood problem. 
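The categorization idea above ("X happened five times" and "X happened six times" are the same event type with a different parameter value) can be sketched in a few lines. This is only an illustration under a strong assumption: parameters are recognized with a hard-coded regular expression for numbers and hex values, whereas the talk describes learning the grammar with machine learning. The names `PARAM`, `event_type`, and `categorize` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: anything that looks like a standalone number or hex value is a
# parameter. A learned system would infer this instead of hard-coding it.
PARAM = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)?)\b")

def event_type(message: str) -> str:
    """Return the message with parameter-looking tokens replaced by <*>."""
    return PARAM.sub("<*>", message)

def categorize(messages):
    """Group messages by event type, keeping each instance's parameters."""
    groups = defaultdict(list)
    for msg in messages:
        groups[event_type(msg)].append(PARAM.findall(msg))
    return dict(groups)

sample = [
    "X happened 5 times",
    "cache flush took 12 ms",
    "X happened 6 times",
]

groups = categorize(sample)
for template, params in groups.items():
    print(template, params)
```

The first and third lines collapse into one event type, and the values 5 and 6 are kept as two instances of its single parameter.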
So we also do this on the event type occurrences. So you can think of each event type occurring in time as sort of a point process. And then you can develop statistics and distributions on that, and you can do anomaly detection on those. Once we have all of that, we have extracted features, essentially, from metrics and from logs. We do pattern recognition on the correlations across different channels of information, so different event types, different log types, different hosts, different containers, and then of course across to the metrics. Based on all of this cross-correlation, we end up with a root cause identification. So that's essentially, at a high level, how it works. What's interesting, from the perspective of this call particularly, is that incident detection needs relationally structured data. It really does. You need to have all the instances of a certain event type that you've ever seen easily accessible. You need to have the values for a given sort of parameter easily, quickly available so you can figure out what's the distribution of this over time, how often does this event type happen. You can run analytical queries against that information so that you can quickly, in real-time, do anomaly detection against new data. So here's an example of what this looks like. And this is kind of part of the work that we've done. At the top you see some examples of log lines, right? So that's kind of a snippet, it's three lines out of a log file. And you see one in the middle there that's kind of highlighted with colors, right? I mean, it's a little messy, but it's not atypical of the log file that you'll see pretty much anywhere. So there, you've got a timestamp, and a severity, and a function name. And then you've got some other information. And then finally, you have the variable part. 
And that's going to have sort of this checkpoint for memory scrubbers, probably something that's written in English, just so that the person who's reading the log file can understand. And then there's some parameters that are put in, right? So now, if you look at how we structure that, the way it looks is there's going to be three tables that correspond to the three event types that we see above. And so we're going to look at the one that corresponds to the one in the middle. So if we look at that table, there you'll see a table with columns, one for severity, for function name, for time zone, and so on. And date, and PID. And then you see over to the right with the colored columns there's the parameters that were pulled out from the variable part of that message. And so they're put in, they're typed and they're in integer columns. So this is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline. All right, so let's talk now about Vertica and why we take those tables and put them in Vertica. So Vertica really is an MPP column store, but it's more than that, because nowadays when you say "column store", people sort of think, like, for example Cassandra's a column store, whatever, but it's not. Cassandra's not a column store in the sense that Vertica is. So Vertica was kind of built from the ground up to be... So it's the original column store. So back in the C-Store project at Berkeley that Stonebraker was involved in, he said let's explore what kind of efficiencies we can get out of a real columnar database. And what he found was that, he and his grad students that started Vertica. What they found was that what they can do is they could build a database that gives orders of magnitude better query performance for the kinds of analytics I'm talking about here today. With orders of magnitude less data storage underneath. 
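To make the per-event-type table described earlier in this segment more concrete, here is a small sketch that turns matching log lines into typed rows and then runs an analytical query over an integer parameter column. Everything specific in it is assumed for illustration: the line layout, the `checkpoint_event` table, and the column names are hypothetical, the parsing is a fixed regular expression rather than a learned grammar, and SQLite stands in for Vertica only so the example is self-contained.

```python
import re
import sqlite3

# Hypothetical layout: "<date> <time> <severity> <pid> <function>: <message>"
LINE = re.compile(
    r"(?P<ts>\S+ \S+) (?P<severity>\w+) (?P<pid>\d+) "
    r"(?P<function>\w+): (?P<msg>.*)"
)
# Hypothetical event type with two integer parameters.
EVENT = re.compile(r"checkpoint for scrubber (\d+) took (\d+) ms")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoint_event ("
    "ts TEXT, severity TEXT, pid INTEGER, function TEXT, "
    "scrubber_id INTEGER, duration_ms INTEGER)"
)

def ingest(line: str) -> bool:
    """Insert the line as a typed row if it is an instance of the event type."""
    head = LINE.match(line)
    if head is None:
        return False
    body = EVENT.fullmatch(head["msg"])
    if body is None:
        return False
    conn.execute(
        "INSERT INTO checkpoint_event VALUES (?, ?, ?, ?, ?, ?)",
        (head["ts"], head["severity"], int(head["pid"]), head["function"],
         int(body[1]), int(body[2])),
    )
    return True

lines = [
    "2020-03-30 10:15:01 INFO 4121 mem_scrub: checkpoint for scrubber 3 took 18 ms",
    "2020-03-30 10:15:02 INFO 4121 mem_scrub: checkpoint for scrubber 4 took 22 ms",
    "2020-03-30 10:15:03 WARN 4121 mem_scrub: scrubber 3 stalled",
]
results = [ingest(line) for line in lines]

# Because the parameter is a typed column, its distribution is one query away.
avg_ms = conn.execute(
    "SELECT AVG(duration_ms) FROM checkpoint_event"
).fetchone()[0]
print(results, avg_ms)
```

The third line belongs to a different event type, so in this sketch it is simply not inserted; in the scheme the speaker describes it would land in its own table.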
So building on top of machine data, as I mentioned, is hard, because it doesn't have any defined schemas. But we can use an RDBMS like Vertica once we've structured the data to do the analytics that we need to do. So I talked a little bit about this, but if you think about machine data in general, it's perfectly suited for a columnar store. Because, if you imagine laying out sort of all the attributes of an event type, right? So you can imagine that each occurrence is going to have- So there may be, say, three or four function names that are going to occur for all the instances of a given event type. And so if you were to sort all of those event instances by function name, what you would find is that you have sort of long, million long runs of the same function name over and over. So what you have, in general, in machine data, is lots and lots of slowly varying attributes, lots of low-cardinality data that's almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And it also, that propagates through the analytical pipeline. Because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right? So the scale-out architecture, of course, is really suitable for petascale workloads. Also, I should point out, I was going to mention it in another slide or two, but we use the Vertica Eon architecture, and we have had no problems scaling that in the cloud. It's a beautiful sort of rewrite of the entire data layer of Vertica. The performance and flexibility of Eon is just unbelievable. And so I've really been enjoying using it. I was skeptical you could get a real column store to run in the cloud effectively, but I was completely wrong. So finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support, it has the ODBC drivers, it has the ACID compliance. 
Which means I don't need to worry about these things as an application developer. So I'm laying out the reasons that I like to use Vertica. So I touched on this already, but essentially what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with Pure Storage that doesn't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actual consistent database. And they do. We've had issues where we've gone and shut down hosts, or hosts have been shut down on us, and we have to restart the database and we don't have any consistency issues. It's unbelievable, the work that they've done. Essentially, another thing that's great about the way it works is you can use the S3 as a shared object store. You can have query nodes kind of querying from that set of files largely independently of the nodes that are writing to them. So you avoid this sort of bottleneck issue where you've got contention over who's writing what, and who's reading what, and so on. So I've found the performance using separate subclusters for our UI and for the ingest has been amazing. Another couple of things that they have is they have a lot of in-database machine learning libraries. There's actually some cool stuff on their GitHub that we've used. One thing that we make a lot of use of is the sequence and time series analytics. For example, in our product, even though we do all of this stuff autonomously, you can also go create alerts for yourself. And one of the kinds of alerts you can do, you can say, "Okay, if this kind of event happens within so much time, and then this kind of an event happens, but not this one," Then you can be alerted. So you can have these kind of sequences that you define of events that would indicate a problem. And we use their sequence analytics for that. 
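The kind of sequence alert described above ("this event, then that event within so much time, but not this other one") can be sketched outside the database as follows. This is plain Python written for illustration, not Vertica's sequence analytics and not the product's alert engine; the function name `sequence_alerts`, the event names, and the 60-second window are all assumptions.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event type name)

def sequence_alerts(events: List[Event], first: str, then: str,
                    unless: str, window: float) -> List[Tuple[float, float]]:
    """Find `first` followed by `then` within `window` seconds.

    A match is cancelled if an `unless` event occurs between the two.
    Returns (time of first, time of then) for every match.
    """
    ordered = sorted(events)
    alerts = []
    for i, (t0, kind0) in enumerate(ordered):
        if kind0 != first:
            continue
        for t1, kind1 in ordered[i + 1:]:
            if t1 - t0 > window:
                break  # too late to count as part of this sequence
            if kind1 == unless:
                break  # the excluding event cancels this sequence
            if kind1 == then:
                alerts.append((t0, t1))
                break
    return alerts

# Hypothetical event stream.
events = [
    (0.0, "disk_full"), (5.0, "write_failed"),
    (100.0, "disk_full"), (103.0, "cleanup_done"), (106.0, "write_failed"),
    (200.0, "disk_full"), (400.0, "write_failed"),
]

alerts = sequence_alerts(events, first="disk_full", then="write_failed",
                         unless="cleanup_done", window=60.0)
print(alerts)
```

Of the three `disk_full` events, only the first one alerts: the second is cancelled by `cleanup_done`, and the third is followed by `write_failed` too late to fall inside the window.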
So it kind of gives you really good performance on some of these queries where you're wanting to pull out sequences of events from a fact table. And time series analytics is really useful if you want to do analytics on the metrics and you want to do gap filling and interpolation on that. It's actually really fast in performance. And it's easy to use through SQL. So those are a couple of Vertica extensions that we use. So finally, I would like to encourage everybody, hey, come try us out. Should be up and running in a few minutes if you're using Kubernetes. If not, it's however long it takes you to run an installer. So you can just come to our website, pick it up and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q and A.

Published Date : Mar 30 2020
