Larry Lancaster & Rod Bagg, Zebrium | Zebrium Root Cause as a Service

(upbeat music) >> Full stack observability is all the rage today. As businesses lean into digital, customer experience becomes ever more important. Why? Well, it's obvious, fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades old software companies that have traditionally done monitoring or log analytics and or application performance management stepping up their game. These established players, you know, they typically have deep feature sets and sometimes purpose-built tools that attack one particular segment of the marketplace. And now they're pivoting through M&A and some organic development trying to fill gaps in their portfolio. And then, you got all these new entrants coming to the market, claiming end to end visibility across the so-called modern cloud and now edge native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling combined with strong ecosystems to address this problem. But, you know, recent survey research from ETR confirms our thesis that no one company has it all. Here's the thing. Customers just want to figure out the root cause as quickly and as efficiently as possible. It's one thing to observe the stack end to end, but the question is who is automating the observers? And that's why we're here today. Hello, my name is Dave Vellante and welcome to this special Cube presentation where we dig into root cause analysis, and specifically, how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes and delivering it as an automated service. And in this session, we have two deep dives. First, we're going to dig into this exciting new field of RCaaS, Root Cause As A Service with two of the founders and technical experts behind Zebrium. And then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well established and well known Cisco product lines, including WebEx Client and UCS. I was pretty amazed at the results and I think you'll be impressed as well. So thanks for being here. Let's get started. With me right now is Larry Lancaster, who's a founder and CTO of Zebrium. And he's joined by Rod Bagg, who's the founder and vice president of engineering at the company. Gents, welcome. Thanks for coming on. >> Thanks. >> Okay. >> It's good to be here. >> It's good to be here >> All right Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO. What do we need to know? >> Yeah, I mean, it's like you said. I mean, it's extremely important to our customers and to most businesses out there to drive uptime and avoid as much downtime as possible. So, you know, when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running, you know, running on the web and that's how you get a point click. Or the business depends on, you know, internal systems to drive their business and to run it. When that is down, that is hugely impacting to them. So if you take a look, you know, way back, you know, 20, 30 years ago, software was simple. You know, there wasn't much to it. It was pretty monolithic and maybe it took a couple of people to maintain it and keep it running. There wasn't really anything complicated about it. It was a single tenant piece of software. Today's software is so complicated, often running, you know, maybe hundreds of services to keep that or to actually implement what that software is doing. So as you point out, you know, enter the sort of observability space and the tools that are now in use to help monitor that software and make sure when something goes wrong, they know about it But there's kind of an interesting stat around the observability space. So when you look at observability in the context or through the lens of the cost of downtime, it's really interesting. So observability tools are about a $20 billion market, okay? But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. So it's all great to know that something went wrong but you got to know why. And it's our contention here that, you know, really, when you take a look at the observability space, you have metrics, that's a great tool. I mean, there's lots of great tools out there, you know, around metrics monitoring that's going to tell you when something went wrong. It's very rarely it's going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where you're being, you know where it's happening or where something is running slow, potentially. So that's great. But again, the root cause of why it's happening is going to be buried in log files. And I can expand on that a little bit more but you know, when you're a software developer and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing, or why it went wrong. And it's that that really gets you to the root cause very fast. And that's our contention, is that these software systems are so complex nowadays and that the root cause is lying in those logs. So how do you get there fast? You know, we would contend that you better automate that or you are just doomed for failure. And that's where we come in. >> Great. >> Getting to that root cause. >> Thank you, Rod. You know, it's interesting you talk about the $20 billion market. There's an analogy with security, right? We spend 80, $100 billion a year on securing our infrastructure, and yet we lose probably closer to a trillion dollars a year in breaches. And there's a similar analogy here. 20 billion could be 5X in downtime impacts or more. Okay, let's go to Larry. Tell us a little bit more about Zebrium. I'm interested always to ask a founder why you started the company. Rod touched on that a little bit. You guys have invented this concept of RCaaS. What does it mean? What problems does it solve, and how does it solve the problem? Let's get into it. >> Yeah. Hey, thanks, Dave. So I think when you said, you know, who's automating the observer, that that's a great way to think about it because what observability really means is it's a property of a system that means you can see into it. You can observe the internal state and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it is automating the observer. Now, of course, it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience. That's all true, but it's important to step back and realize like we have cracked a real nut here. People have been trying to figure out how to automate this part of sort of the troubleshooting experience, this human part of finding the root cause indicators for a long time. And until Zebrium came along, I would argue, no one's really done it right. So, you know, I think it's also important you know, as we step back, we can probably look forward five to 10 years and say, everyone's going to look back and say how did we do all this manually? You're going to see this sort of last mile of observability and troubleshooting is going to be automated everywhere because otherwise, you know, people are just... They're not going to be able to scale their business. So, you know, I think one more thing that's important to point out is, you know, I think Zebrium, you know, it's one thing to have the technology but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So, you know, we're looking at, you know, if you look at Zebrium, you'll put us on your dashboard and we don't care what kind of a dashboard it is. It could be, you know Datadog, New Relic, Elastic, Dynatrace, Grafana AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff. Yep. >> Yeah. So, interesting. Again, another analogy I think about. You know, you talked about automation. If we're to look back and say this is what... We're never going to do this again, it's like provisioning loans. Nobody provisions loans anymore, it's all automated. >> Larry: (chuckling) That's right. >> So Larry, I'll stay with you, then the skeptic in me says, this sounds amazing, but if I, you know... It might be too good to be true. Tell us how it works. >> Larry: (chuckling) Yeah. So that's interesting. So Cisco came along and they were equally skeptical. So what they did was they took a couple of months and they did a very detailed study. And they got together 192 incidents across four product lines, where they knew that the root cause was in the logs. And they knew what that root cause was because they had had their best engineers, you know work on those cases and take detailed notes of the incidents that had taken place. And so they ran that data through the Zebrium software. And what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away. When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like, you know, it was like the Apollo command center, you know when they finally, you know, touchdown on the moon kind of thing. So, you know, it's really an exciting point in time to be at the company, like just seeing everything finally being proven out according to this vision. I'm going to tell you one more story which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper modern, you know, SaaS business, they're an S3 competitor. Zoom has their files stored on Lyve Cloud, you know, to let you know who they are. So essentially, what happened was they were in alpha, their early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored. And it was, you know, fortunately for them, it was early access. So no one was expecting, you know, uptime, you know, service level objectives and so on. But they were scared, because they realized, if something like this happens in production, you know, they're screwed. So what they did was they saw Zebrium. They went and did some research, they saw Zebrium. They went in a staging environment, recreated the exact (indistinct) that they had had. And what they saw was immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're onto something transformational. >> Dave: Yeah. That's great. I mean, you guys are jumping up and down, I'm sure. We're going to hear from Cisco later. I bet you, they were jumping up and down too because they didn't have to do all that heavy lifting anymore. So Rod, Larry's just sort of implying that, or actually, you guys both talked about that your tool is agnostic. So how does one actually use the service? How do I deploy it? >> Yeah. So let me step back. So when we talk about logs right? Like, you know, all these bread crumbs being in logs and everything else? So, you know, they are a great wealth of you know, information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, you know, we had one of our... Or we've heard from several of our customers now prior to using Zebrium, when they, you know, have some issue, and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metric has, you know, taken a blip or something's happened that they know there's a problem. We've heard from them that it can take like a number of hours just to get to the right set of logs, like figuring out over these hundreds of services where the logs are, to get to them, maybe searching in a log manager. Just to get into the right context, even, can take hours. So, you know, that's obviously the problem we solve but, you know, we don't want them just looking at logs. I mean, you know, we don't want to put them back in the thing they don't like doing because people don't do that. They don't like doing it. So we put it up on the dashboard. So if something is going wrong with your metrics and that's the indicator, or maybe it's something with tracing that you're sort of digging through that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs, you click on one of our integrations and we integrate with all these tools that Larry's talked about. And when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong and you know there's an issue, take a look at the portion of your dashboard that is us, and we're going to tell you why. We're going to get you to the why that went wrong. No other work could be... You can, you know, also click down and click through to us so that you land up in our portal, if you want to do some more digging around, if you need to or whatever, maybe to get some context what have you, but it's fair that if you ever need to do that, the answer should be right there on your dashboard. And that that's how we expect people to use it. We don't want them digging in logs and going through things, we want it to be right in their workflow. >> Great. Thank you, Larry. So Rod, we talked about Cisco. We're going to hear more from them in a moment in Seagate. I would think this is like a perfect solution for a SaaS provider, anybody doing AI ops. Do you have some examples of those types of firms leaning into this? >> Rod: Yeah, a couple of great ones. Well, I mean, we've got many of them, but a couple that I'll touch on. We have an actual AI ops company that was looking for, you know, sort of some complimentary technology and so on. And so they decided to just put us through our paces by having one of their own SREs sign up for our service in our SaaS environment, and send the logs from their system to us, you know, and just see how we did. So it turned out we ended up talking back to this SRE like a week after he had installed the product, you know signed up and then, you know, started sending us logs. And, you know, he was hewing and hawing, saying that he was busy, like every SRE is, and that he didn't have a chance to really do much with us yet. And, you know, we were just, you know, having this conversation on the phone, and he comes to tell us that, yeah I've been busy because we had this, you know, terrible outage, like, you know, five days ago. And we said like, "Okay did you actually look on the Zebrium dashboard?" (chuckles) And he goes, "You know what? I didn't even think to do it yet. I mean, I'd just been so busy and frazzled." So we have an integration with that company, he hadn't put that integration in, so it wasn't in his dashboard yet, but it was certainly on ours. So he went there, and he looks and he looks on the day, you know, on the time range of when he had had this incident. And right at the very top of the page on our portal was that incident with that root cause. And he was flabbergasted. It literally would've saved him hours and hours and hours. They had this issue going on for over 24 hours. And we had the answer right there in five minutes, and it was crazy. And we get that kind of stories. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it. And you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. In SaaS companies, you know, one of our customers is Alchera. They do cost optimizations for cloud properties, you know, for AWS optimization, Google, Google cloud, and so on. But they use our software, and they have a lot of interaction, obviously with these cloud vendors and the APIs of those cloud vendors. So, you know, in order to figure out your costing at AWS, they're using all those APIs. So it turned out, you know, they had some issue where their services were breaking. And we had that root cause report right on the screen, again within five minutes, that was pointing to an API problem with Google. And they had changed one of their APIs and Alchera was not aware of it. So their stuff was breaking because of a change downstream that we had caught. And I'll just tell you one last one because it's somewhat related to one of these cloud vendors. You know, it was a big cloud vendor who had an outage a couple of months ago. And it's interesting because, you know, a lot of our customers will set up shared Slack channels with us, where we're monitoring or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them or the root cause that we detected for them, and that's in a shared community channel. So we could see this happening when that AWS outage happened. We could see our customers getting impacted by that AWS outage, and the root cause of what was going on there in AWS that was impacting our customers that was showing up in our incidents. Now we didn't obviously, you know, have the very root cause of what was going on in AWS, per se but we were getting to the root cause of why our customer's applications were failing. And that was because of issues going on at AWS. >> Very interesting. I mean, I think one of your biggest challenges is going to be getting people's attention because these SREs are so busy, their hair's on fire. >> Rod: That's it. Right. (chuckling). You know, when you say, hey, (indistinct). >> I tell you, if you get their attention, they love it. I mean, this AI ops company, I didn't even tell you the punchline there, but, you know, they had this incident that occurred that we found. And quite literally, the next week, they ended up signing up as a paid customer. So... >> Dave: that's great. And Larry, to give you the last word. I mean, you know, Rod was talking about, you know, changes in APIs and you know, there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs that people are trying not to leave. >> Larry: That's right. Absolutely. >> Dave: But close it out here. >> Yeah. I mean, that's right, you can run it on-prem. Just like we run it in our cloud, you can run it in your cloud or on your own infrastructure. Now that's all true. You know, I think the one hurdle now that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. You don't believe it, do a POC, try it yourself. And you know, people have become so jaded by the lack of, you know, real, sort of, innovation in the software industry for the last 10 years that it's hard to get people to... But guys, you got to give it a shot, I'm telling you. I'm telling you right now, it works. And you'll hear more about that from one of our customers in a minute. >> All right guys, thanks so much. Great story. Really appreciate you sharing. >> Thank you. >> Yeah. Thanks Dave. Appreciate the time. >> Okay. In a moment, we're going to hear from Cisco who is the customer in this case example and a company that has... Look, they have quite an impressive suite of observability tooling, and they've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of, like WebEx. So stay tuned and learn about how you can really take advantage of this new technology called Root Cause As A Service. You're watching theCube, the leader in enterprise and emerging tech coverage. (upbeat music)

Published Date : Jun 16 2022

SUMMARY :

you know, they typically All right Rod, talk to me. Or the business depends on, you know, and how does it solve the problem? And I love the way you put that. You know, you talked about automation. this sounds amazing, but if I, you know... So no one was expecting, you know, uptime, I mean, you guys are jumping up and down, We're going to get you to Do you have some examples and he looks on the day, you know, is going to be getting people's attention you say, hey, (indistinct). but, you know, they had And Larry, to give you the last word. Larry: That's right. by the lack of, you know, appreciate you sharing. you can really take advantage

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Larry Lancaster	PERSON	0.99+
Rod	PERSON	0.99+
Dave	PERSON	0.99+
Cisco	ORGANIZATION	0.99+
Larry	PERSON	0.99+
two	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
Zebrium	ORGANIZATION	0.99+
$20 billion	QUANTITY	0.99+
five	QUANTITY	0.99+
20 billion	QUANTITY	0.99+
Rod Bagg	PERSON	0.99+
Seagate	ORGANIZATION	0.99+
192 incidents	QUANTITY	0.99+
UCS	ORGANIZATION	0.99+
WebEx	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
two technical experts	QUANTITY	0.99+
Dynatrace	ORGANIZATION	0.99+
New Relic	ORGANIZATION	0.99+
ScienceLogic	ORGANIZATION	0.99+
five minutes	QUANTITY	0.99+
First	QUANTITY	0.99+
Elastic	ORGANIZATION	0.99+
next week	DATE	0.99+
ETR	ORGANIZATION	0.99+
Datadog	ORGANIZATION	0.99+
Grafana AppDynamics	ORGANIZATION	0.99+
five days ago	DATE	0.99+
one	QUANTITY	0.98+
more than 95%	QUANTITY	0.98+
Alchera	ORGANIZATION	0.98+
5X	QUANTITY	0.98+
10 years	QUANTITY	0.98+
both	QUANTITY	0.98+
over a day	QUANTITY	0.98+
Today	DATE	0.97+
today	DATE	0.97+
Zoom	ORGANIZATION	0.97+
Seagate Lyve Cloud	ORGANIZATION	0.97+
one company	QUANTITY	0.96+

Larry Lancaster & Rod Bagg

(bright intro music) >> Full stack observability is all the rage today. As businesses lean in to digital, customer experience becomes ever more important, why? Well, it's obvious. Fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action, and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades-old software companies that have traditionally done monitoring or log analytics and/or application performance management stepping up their game. These established players, you know, they typically have deep feature sets and sometimes purpose built tools that attack one particular segment of the marketplace, and now, they're pivoting through M&A and some organic development trying to fill gaps in their portfolio, and then you got all these new entrants coming to the market claiming end to end visibility across the so-called modern cloud and now edge-native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling combined with strong ecosystems to address this problem, but, you know, recent survey research from ETR confirms our thesis that no one company has at all. Here's the thing. Customers just want to figure out the root cause as quickly and efficiently as possible. It's one thing to observe the stack end to end, but the question is who is automating the observers? And that's why we're here today. Hello, my name is Dave Vellante, and welcome to this special "CUBE" presentation where we dig into root cause analysis and, specifically, how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes and delivering it as an automated service. In this session, we have two deep dives. First, we're going to dig into this exciting new field of RCA, root cause as a service, with two of the founders and technical experts behind Zebrium, and then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well established and well-known Cisco product lines including Webex client and UCS. I was pretty amazed at the results, and I think you'll be impressed as well. So thanks for being here. Let's get started with me right now is Larry Lancaster who's a founder and CTO of Zebrium, and he's joined by Rod Bagg who's a founder and Vice-President of Engineering at the company. Gents, welcome, thanks for coming on. >> Thanks. >> (indistinct). >> To be here. >> Great to be here. >> All right, Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO, what do we need to know? >> Yeah, I mean, it's like you said. I mean, it's extremely important to our customers and to most businesses out there to drive up time and avoid as much downtime as possible. So, you know, when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running, you know, running on the web, and that that's how you get a point click or their business depends on it and, you know, internal systems to drive their business and to run it. Now, when that is down, that is hugely impacting to them. So if you take a look, you know, way back, you know, 20, 30 years ago, software was simple. You know, there wasn't much to it. It was pretty monolithic, and maybe it took a couple of people to maintain it and keep it running. It wasn't really anything complicated about it. It was a single tenant piece of software. Today's software is so complicated, often running, you know, maybe hundreds of services to keep that or to actually implement what that software is doing. So as you point out, you know, enter the sort of observability space and the tools that are now in use to help monitor that software and make sure when something goes wrong, they know about it, but there's kind of an interesting stat around the observability space. So when you look at observability in the context or through the lens of the cost of downtime, it's really interesting. So observability tools are about a $20 billion market, okay? But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. So it's all great to know that something went wrong, but you got to know why, and it it's our contention here that, you know, really, when you take a look at the observability space, you have metrics. That's a great tool. I mean, there's lots of great tools out there, you know, around metrics monitoring that's going to tell you when something went wrong. It's very rarely it's going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where you're being, you know, where it's happening or where something is running slow potentially. So that's great, but again, the root cause of why it's happening is going to be buried in log files, and I can expand on that a little bit more, but, you know, when you're a software developer, and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing or why it went wrong, and it's that that really gets you to the root cause very fast, and that's, our contention is that these software systems are so complex nowadays, and that the root cause is lying in those logs. So how do you get there fast? You know, we would contend that you better automate that or you're just doomed for failure, and that's where we come in. >> Great. >> Getting to that request. >> Thank you, Rod. You know, it's interesting. You talk about the $20 billion market. There's an analogy with security, right? We spend 80, $100 billion a year on securing our infrastructure, and yet we lose, probably, closer to a trillion dollars a year in breaches, and there's a similar analogy here. 20 billion could be 5x in downtime impacts or more. Okay, let's go to Larry. Tell us a little bit more about Zebrium. I'm interested always to ask a founder why you started the company. Rod touched on that a little bit. You guys have invented this concept of RCAs. What does it mean? What problems does it solve? And how does it solve the problem? Let's get into it. >> Yeah, hey, thanks, Dave. So I think when you said, you know, who's automating the observer? That's a great way to think about it because what observability really means is it's a property of a system that means you can see into it. You can observe the internal state, and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it is automating the observer. Now, of course, it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience, that's all true, but it's important to step back and realize like we have cracked a real nut here. People have been trying to figure out how to automate this part of sort of the troubleshooting experience, this human part of finding the root cause indicators for a long time, and until Zebrium came along, I would argue no one's really done it right. So, you know, I think it's also important, you know, as we step back, we can probably look forward five to 10 years and say, "Everyone's going to look back and say, 'How did we do all this manually?'" You're going to see this sort of last mile of observability and troubleshooting is going to be automated everywhere because otherwise, you know, people are just, they're not going to be able to scale their business. So, you know, I think one more thing that's important to point out is, you know, I think Zebrium, you know, it's one thing to have the technology, but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So, you know, we're looking at, you know, if you look at Zebrium, you'll put us on your dashboard, and we don't care what kind of a dashboard it is. It could be, you know, Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff, yeah. >> Yeah, so interesting. Again, another analogy I think about, you know, you talked about automation, where to look back, and say, "This is what- We're never going to do this again." It's like provisioning LANs. Nobody provisioned LANs anymore. It's all automated. >> That's correct. >> So, Larry, stay with you. The skeptic in me says, "This sounds amazing," but if, you know, it probably too good to be true. Tell us how it works. >> Yeah, so that's interesting. So Cisco came along and they were equally skeptical. So what they did was they took a couple of months, and they did a very detailed study, and they got together 192 incidents across four product lines where they knew that the root cause was in the logs, and they knew what that root cause was because they'd had their best engineers, you know, work on those cases and take detailed notes of the incidents that had taken place, and so they ran that data through the Zebrium software, and what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away. When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like, you know, it was like the Apollo Command Center, you know, when they finally, (Dave laughs) you know, touchdown on the moon kind of thing. So, you know, it's really exciting at a point in time to be at the company, like just seeing everything finally being proven out according to this vision. I'm going to tell you one more story, which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper modern, you know, SaaS business. They're an S3 competitor. Zoom has their files stored on Lyve Cloud to give, you know, to let you know who they are. So, essentially, what happened was they were in alpha, in their early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored, and it was, you know, fortunately, for them, it was early access. So no one was expecting, you know, uptime, you know, service level objectives and so on, but they were scared because they realized if something like this happens in production, you know, they're screwed. So what they did was they saw Zebrium, they did some research, they saw Zebrium. They went in a staging environment, recreated the exact (indistinct) that they'd had, and what they saw was, immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're onto something transformational. >> Yeah, that's great. I mean, you guys are jumping up and down. I'm sure, we're going to hear from Cisco later. I bet you, they were jumping up and down, too, 'cause they didn't have to do all that heavy lifting anymore. So Rod, Larry's just sort of implying that or, actually, you guys both talked about that your tool's agnostic. So how does one actually use the service? How do I deploy it? >> Yeah, so let me step back. So when we talk about logs, right? Like, you know, all these red crumbs being in logs and everything else. So, you know, they are a great wealth of, you know, information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, you know, we had one of our, or we've heard from several of our customers now prior to using Zebrium, but when they're, you know, have some issue, and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metrics is, you know, taken a blip or something's happened that they know there's a problem, we've heard from them that it can take like a number of hours just to get to the right set of logs, like figuring out over these hundreds of services where the logs are to get to them, maybe searching in a log manager, just to get into the right context even can take hours. So, you know, that's obviously the problem we solve, but, you know, we don't want them just looking at logs. I mean, you know, we don't want to put 'em back in the thing they don't like doing 'cause people don't do what they don't like doing. So we put it up on the dashboard. So if something is going wrong with your metrics, and that's the indicator or maybe it's something with tracing that you're sort of digging through now that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs. You click on one of our integrations, and we integrate with all these tools that Larry's talked about, and when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong, and you know there's an issue, take a look at the portion of your dashboard that is us, and we're going to tell you why. We're going to get you to the why that went wrong. Not no other work could be- You can, you know, also click down and click through to us so that you land up in our portal if you want to do some more digging around if you need to or whatever, maybe to get some context, what have you, but it's fair that you ever need to do that. The answer should be right there on your dashboard, and that's how we expect people to use it. We don't want them digging in logs and going through things. We want it to be right in their workflow. >> Great, thank you, Larry. So Rod, we talked about Cisco. We're going to hear more from them in a moment and Seagate. I would think this is like a perfect solution for a SaaS provider, anybody doing AIOps, do you have some examples of those types of firms leaning into this? >> Yeah, a couple of great, well, I mean, we got many of them, but couple that I'll touch on. We have an actual AIOps company that was looking for, you know, sort of some complimentary technology and so on, and so they decided to just put us through our paces by having one of their own SREs sign up for our service in our SaaS environment and send the logs from their system to us, you know, and just see how we did. So it turned out we ended up talking back to this SRE like a week after he had installed the product, you know, signed up, and then, you know, started sending us logs, and, you know, he was hemming and hawing saying that he was busy like, you know, like every SRE is, and that he didn't have a chance to really do much with us yet, and, you know, we just, you know, having this conversation on the phone, and he comes to tell us that, "Yeah, I've been busy because we had this, you know, terrible outage like, you know, five days ago," and we said like, "Okay, did you actually look on the Zebrium dashboard?" (laughs) And he goes, "You know what? I didn't even think to do it yet. I mean, I'd just been so busy and frazzled." So we have an integration with that company. He hadn't put that integration in so it wasn't in his dashboard yet, but it was certainly on ours. So he went there and he looks on the day like, you know, on the time range of when he had this incident, and right at the very top of the page on our portal was the incident with the root cause, and he was flabbergasted. It literally would've saved him hours and hours and hours. They had this issue going on for over 24 hours, and we had the answer right there in five minutes, and it was crazy, and we get that kind of story. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it, and you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. In SaaS companies, you know, one of our customers is Archera. They do cost optimizations for cloud properties, you know, for AWS optimization, Google cloud, and so on, but they use our software, and they have a lot of interaction, obviously, with these cloud vendors and the APIs of those cloud vendors. So, you know, in order to figure out you're costing at AWS, they're using all those APIs. So it turned out, you know, they had some issue where their services were breaking and we had that root cause report right on the screen, again, within five minutes that was pointing to an API problem with Google, and they had changed one of their APIs, and Archera was not aware of it. So their stuff was breaking because of a change downstream that we had caught, and I'll just tell you one last one because it's somewhat related to one of these cloud vendors of, you know, big cloud vendor who had an outage couple of months ago, and it's interesting because, you know, lot of our customers will set up shared Slack channels with us where we're monitoring or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them or the root cause that we've detected for them, and that's in a shared community channel. So we could see this happening when that AWS outage happened. We could see our customers getting impacted by that AWS outage and the root cause of what was going on there in AWS that was impacting our customers, that was showing up in our incidents. Now, we didn't obviously, you know, have the very root cause of what was going on in AWS per se, but we were getting to the root cause of why our customer's applications were failing, and that was because of issues going on at AWS. >> Very interesting. I mean, I think one of your biggest challenge is going to be getting people's attention because these SREs is so busy, their hair's on fire. (all laughs) You know, he's like, "Hey, chap, I'm going to show you, look at this." >> I tell you. You get their attention, they love it. I mean, this AIOps company, I didn't even tell you the punchline there, but, you know, they had this incident that occurred that we found and, quite literally, the next week, they ended up signing up as a paid customer, so. >> That's great, and Larry, give you the last word. I mean, you know, Rod was talking about, you know, changes in APIs, and, you know, there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs and people don't want to leave. >> That's right, absolutely. >> But, yeah, close it out here. >> Yeah, I mean, you can, that's right, you can run it on-prem, just like we run it in our cloud. You can run it in your cloud or on your own infrastructure. Now, that's all true. You know, I think the one hurdle now that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. You don't believe it? Do a POC, try it yourself. And, you know, people have become so jaded by the lack of, you know, real sort of innovation in the software industry for the last 10 years that it's hard to get people to... But guys, you got to give it a shot. I'm telling you. I'm telling you right now, it works, and you'll hear more about that from one of our customers in a minute. >> Alright guys, thanks so much. Great story, really appreciate you sharing. >> Thank you. >> Yeah, thanks, Dave. Appreciate the time. >> Okay, in a moment, we're going to hear from Cisco who is the customer in this case example, and a company that is... Look, they have quite an impressive suite of observability tooling, and they've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of like Webex. So stay tuned and learn about how you can really take advantage of this new technology called root cause as a service. You're watching "theCUBE", the leader in enterprise and emerging tech coverage. (bright outro music)

Published Date : May 25 2022

SUMMARY :

and then you got all these new entrants all the buzzwords in your and that that's how you get a point click why you started the company. Now, of course, it means that, you know, about, you know, you but if, you know, it and it was, you know, I mean, you guys are jumping up and down. I mean, you know, we do you have some examples saying that he was busy like, you know, is going to be getting people's attention but, you know, they had I mean, you know, Rod was talking by the lack of, you know, appreciate you sharing. Appreciate the time. So stay tuned and learn about how you can

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Larry Lancaster	PERSON	0.99+
Dave	PERSON	0.99+
Rod	PERSON	0.99+
Seagate	ORGANIZATION	0.99+
two	QUANTITY	0.99+
Larry	PERSON	0.99+
Cisco	ORGANIZATION	0.99+
Rod Bagg	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Zebrium	ORGANIZATION	0.99+
Webex	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
UCS	ORGANIZATION	0.99+
$20 billion	QUANTITY	0.99+
20 billion	QUANTITY	0.99+
Grafana	ORGANIZATION	0.99+
192 incidents	QUANTITY	0.99+
five	QUANTITY	0.99+
Dynatrace	ORGANIZATION	0.99+
two technical experts	QUANTITY	0.99+
AppDynamics	ORGANIZATION	0.99+
ScienceLogic	ORGANIZATION	0.99+
New Relic	ORGANIZATION	0.99+
Datadog	ORGANIZATION	0.99+
First	QUANTITY	0.99+
five minutes	QUANTITY	0.99+
Elastic	ORGANIZATION	0.99+
ETR	ORGANIZATION	0.99+
five days ago	DATE	0.99+
10 years	QUANTITY	0.99+
hundreds	QUANTITY	0.99+
one	QUANTITY	0.98+
5x	QUANTITY	0.98+
more than 95%	QUANTITY	0.98+
next week	DATE	0.98+
both	QUANTITY	0.98+
couple of months ago	DATE	0.97+
20	DATE	0.97+
Zoom	ORGANIZATION	0.97+
Archera	ORGANIZATION	0.97+
today	DATE	0.97+
Seagate Lyve Cloud	ORGANIZATION	0.96+
over 24 hours	QUANTITY	0.95+
over a day	QUANTITY	0.95+
Today	DATE	0.95+
four	QUANTITY	0.95+
AIOps	ORGANIZATION	0.95+
hundreds of services	QUANTITY	0.94+
decades	QUANTITY	0.94+
one thing	QUANTITY	0.92+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for Seagate Lyve Cloud: