Ajay Singh, Zebrium & Michael Nappi, ScienceLogic | AWS re:Invent 2022


 

(upbeat music)
>> Good afternoon, fellow cloud nerds, and welcome back to theCUBE's live coverage of AWS re:Invent, here in fabulous Sin City, Las Vegas, Nevada. My name is Savannah Peterson, joined by my fabulous co-host, John Furrier. John, how are you feeling?
>> Great, feeling good. Just getting going. Day one of four, three more days after today.
>> Woo! Yeah.
>> So much conversation. Talking about business transformation as cloud goes next level-
>> Hot topic here for sure.
>> Next generation. Data's classic, still around, but the next-gen cloud's here, and it's changing the game. A lot more AI, machine learning, a lot more business value. I think it's going to be exciting. The next segment's going to be awesome.
>> It feels like one of those years where there's just a ton of momentum. I don't think it's just because we're back in person at scale; you can see the literally thousands of people behind us while we're here on set conducting these interviews. Our bold and brave guests, just like the two we have here, are combating the noise, the libations, and everything else going on on the show floor. Please help me welcome Mike from ScienceLogic and Ajay from Zebrium. Gentlemen, welcome to the show floor.
>> Thank you.
>> Thank you, Savannah. It's great to be here.
>> How are you feeling? Are you feeling the buzz, Mike? Feeling the energy?
>> It's tough not to feel and hear the buzz, Savannah.
>> Savannah: Yeah. (all laughing)
>> John: Can you hear me?
>> Savannah: Yeah, yeah, yeah. Can you hear me now? What about you, Ajay? How does it feel to be here?
>> Yeah, this is high energy. I'm really happy it's bounced back from COVID. I was a little concerned about attendance. This is hopping.
>> Yeah, I feel it. You can definitely feel the energy, the sense of community. We're all here for the right reasons. So I want to set the stage for everyone watching: Zebrium was recently acquired by ScienceLogic. Mike, can you tell us a little bit about that and what it means for the company?
>> Mike: Sure, sure. Well, first of all, ScienceLogic, as you may know, has been in the monitoring space for a long time now, and what-
>> Savannah: 20 years, I believe.
>> Yeah.
>> Savannah: Just about.
>> And what we've seen is a shift from monitoring infrastructure to monitoring these increasingly complex, modern, cloud native applications, right? And so this is part of a journey that we've been on at ScienceLogic to really modernize how enterprises of all sizes manage their IT estate. Okay? So, managing workloads that are increasingly in the public cloud, outside the four walls of the enterprise, workloads that are increasingly complex. They're microservices based, they're container based.
>> Mm-hmm.
>> Mike: And the rate of change, just because of things like CI/CD and agile development, has also increased the complexity in the typical IT environment. So all these things have conspired to make the traditional tools and processes of managing IT and IT applications much more difficult. They just don't scale. One of the things that we've seen recently, Savannah, is this shift toward cloud native applications, right?
>> Huge shift.
>> Mike: Today it only accounts for roughly 25% of the typical IT portfolio, but most of the projections we've seen indicate that that's going to invert in about three years: 75% of applications will be what I call cloud native. And so this really requires different technologies to understand what's going on with those applications. And so Zebrium interested us when we were looking at partners at the beginning of this year, as they have a super innovative approach to understanding really what's going on with any cloud native application. They really distill, they separate the complexity out of the equation, and they use machine learning to tremendous effect to rapidly understand the root cause of an application failure. And so I was introduced to Ajay at the beginning of this year, actually. It feels like it's been a long time now. But we've been on this journey together throughout 2022, and we're thrilled to have Zebrium now part of the ScienceLogic family.
>> Ajay, Zebrium saves people a lot of time. Obviously, I've worked with developers and seen that struggle when things break; shortening that time to recovery and understanding is so critical. Can you tell us a little bit about what's under the hood and how the ML works to make that happen?
>> Ajay: Yeah. So the goal is to figure out not just that something went wrong, but what went wrong.
>> Savannah: Right.
>> And we took, you know, based on a couple of decades of experience from my co-founders-
>> Savannah: Casual couple of decades went into this product, just to call that out. Yeah, great.
>> Exactly. It took some general learnings about the nature of software: when software breaks, you tend to see unusual things happen, and they lead to bad things happening. It's very simple.
>> Yes.
>> It turns out-
>> Savannah: Mutations lead to bad things happening, generally speaking.
>> So what Zebrium's really good at is identifying those rare things accurately and then figuring out how they connect, or correlate, to the bad things: the errors, the warnings, the alerts. So the machine learning has many stages to it, but at its heart it's classifying the event catalog of any application stack, figuring out what's rare, and when things start to break, it's telling you this cluster of events is both unusual and unlikely to be random, and is very likely the root cause report for the problem you're trying to solve. We then added some nice enhancements, such as correlation with knowledge bases on the public internet. If someone's ever solved that problem before, we're able to find a match and pull it back into our platform. But at its heart, it's a technology that can find rare events and find the connections with other events.
>> John: Yeah, and this is the theme of re:Invent this year: data, the role of data, solving end-to-end complexities. One, you mentioned that. Two, I think, Mike, your point about developers and the CI/CD pipeline is where DevOps is. That is what IT now is. So if you take digital transformation to its conclusion, or along its path, and continue it, IT is DevOps. The developers are actually doing the IT in their coding, hence the shift to autonomous IT.
>> Mike: Right, right.
>> Now, those other functions, IT used to be a department. Not anymore, or they still are, but they'll go away: security and data teams. You're starting to see the formation of-
>> Mike: Yep.
>> New replacements for IT as a function to support the developers who are building the applications that will be the company.
>> That's right. Yeah.
>> John: And do you agree with that statement?
>> Yeah, I really do. And you know, collectively, independent of whether it's traditional IT, or it's DevOps, or whatever it is, the enterprise as a whole needs to understand how the infrastructure is deployed, the health of that infrastructure, and more importantly the applications that are hosted in the infrastructure. How are they doing? What's their health? And what we are seeing, and what we're trying to facilitate at ScienceLogic, is really changing the lens of IT from low-level compute, storage, and networking to looking at everything through a services lens: looking at the services being delivered by IT back to the business, and understanding things through that lens. And Zebrium really complements that mission that we've been on, because in a lot of cases service equals application, and they can provide that kind of very real-time view of service health in, you know, kind of the IT-
>> And automation is beautiful there too, because as you get into some of the scale-
>> Yeah.
>> Ajay, understanding how to do this fast is a key component.
>> Yeah. So scale: you've pinpointed one of the dimensions that makes AI really important when it comes to troubleshooting. Humans just can't scale as fast as data, nor can they keep up with the complexity of modern applications. And the third element that we feel is really important is the velocity with which people are now rolling out changes. People develop new features within hours and push them out to production. And in a world like that, the human has just no ability or time to understand what's normal and what's bad, or to update their alert rules. You need a machine, or an AI technology, to help you with that. And that's basically what we're about.
>> So this is where AIOps comes in, right? Perfectly. Yeah.
>> Yeah. You know, John started to allude to it earlier, but having the insight into what's going on, we believe, is only half of the equation, right? Once you understand what's going on, you naturally want to take action to remediate it or optimize it. And we believe automation should not be an exercise that's left to the reader.
>> Yeah.
>> As a lot of traditional platforms have done. Instead, we have a very robust no-code, low-code automation built into our platform that allows you to take action in context with what you're seeing right then and there with the service.
>> John: Yeah. Essentially monitoring, or observability as some use the fancier term today, is critical in all operating environments. So if we look at it holistically: hey, we're a distributed computing system, aka cloud. You've got to track stuff at scale, and you've got to understand what the impact is from a systems perspective. There are consequences to understanding what goes wrong. So as you look at that, what's the challenge for customers? Because that seems to be the hard part: as they lift and shift to the cloud and run their apps on the cloud, now they've got to take it to the next level, which is more developer velocity, faster productivity, and secure.
>> Yeah.
>> I mean, that seems to be the table stakes now.
>> Yeah.
>> How are companies forming around that? Are they there yet? Are they halfway there? Where are they in the progression? One, are they changing? And if so-
>> Yeah, that's a great question. I mean, I think whether it's an IT use case or a security use case, you can't manage what you don't know about. So visibility, discoverability, understanding what's going on: in a lot of ways, that's the really hard problem to solve. And traditionally, we've approached that by harvesting data off of all these machines and devices in the infrastructure. But as we've seen with Zebrium and with related machine learning technologies, there are multiple ways of gaining insight into what's going on. Once you have the insight, be it an IT issue like a service outage, or a security vulnerability, then you can take action. And the idea is you want to make that action as seamless as possible. But I think, to answer your question, John, enterprises are still getting their heads around how to break down all the silos that have built up over the last decade or two, internally, and get visibility across the estate that really matters. And I think that's the real challenge.
>> And at the velocity that applications are growing: just looking at our notes here, the number of applications scaled from 64 million in 2017 to 147 million in 2021. That goes to what you were talking about, even with those other metrics earlier, and 582 million by 2026 is what Morgan Stanley predicts. So not only do we need to get out of silos, we need to be able to see everything all the time, all at once, from past legacy as well as when we extend at scale. How are you thinking about that, Ajay? You're now with a big partner as an umbrella. What's next for you all? How are you going to help people solve problems faster?
>> Yeah, so one of the attractions for the Zebrium team about ScienceLogic, aside from the team and the culture, was that the product portfolio was so complementary. As Mike mentioned, you need visibility; you need mapping from low-level building blocks to business services. And at the other end of the spectrum, once you know something's wrong, you need to be able to take action automatically. And again, ScienceLogic has a very strong set of product capabilities around automated actions. What we bring to the table is the middle layer: going from visibility to understanding what went wrong and figuring out the root cause. So to us, it was really exciting to be a very nice tuck-in into this broader platform, where we help complete the story.
>> Savannah: Yeah, that's exciting.
>> John: Should we do the Insta challenge?
>> I was just getting ready to do that. You go for it, John. You go ahead and kick it off.
>> So we have this little tradition now: an Instagram reel, short and sweet. If you were going to see yourself on Instagram, what would be the Instagram reel of why this year's re:Invent is so important, and why people should pay attention to what's going on right now in the industry, or at your company?
>> Well, I think, partly what Ajay was saying: it's good to be back, right? Seeing the energy and being back in 3D, en masse, is awesome again. It really is.
>> Yeah.
>> Mike: But, you know, I think this is where it's happening. We are at an inflection point in our industry, and we're seeing a sea change in the way that applications and software are delivered to businesses, to enterprises. And it's happening right here. This is the nexus of it. And so we're thrilled to be here as a part of all this, and excited about the future.
>> All right, Ajay-
>> Well done. He passes.
>> Your Instagram reel.
>> Knowing what's happening in the broader economy, in the business context, it feels even more important that companies like us are working on technologies that empower the same number of people to do more. Because it may not be realistic to just add on more headcount, given what's going on in the world. But your deliverables and your roadmaps aren't slowing down. So: still the same amount of complexity, the same growth rates, but you're going to have to deal with all of that with fewer resources and be smarter about it. So the approaches we're taking feel very much of the moment, given what's going on in the real world.
>> I love it. I love it. I've got kind of a finger-to-the-wind, potentially hardball question for you here to close it out. Given that you both have your finger really on the pulse right here, what percentage of current IT operations do you think will eventually be automated by AI and ML? Or AIOps?
>> Well, I think a large percentage of traditional IT operations, and I'm talking about, you know, network operations center type work: checking heartbeat monitors of compute, storage, and networking health. I think a lot of those things are going to be automated, right? Machine learning, just because of the scale. You can't hire enough NOC engineers to scale to that kind of complexity. But IT talent, and what they're going to be focusing on, is going to shift, and they're going to be focusing on different parts. And I believe a lot of IT is going to be much more of an enabler for the business, versus just managing things when they go wrong. So that's-
>> All right.
>> That's what I believe is part of the change.
>> That's your... all right, Ajay, what about your hot take?
>> Knowing how error-prone predictions are, (all laughing) I'll caveat mine with-
>> Savannah: We're allowing for human error here.
>> I could be wildly wrong, but if I had to guess, you know, in 10 years as much as 50% of the tasks will be automated.
>> Mike: Oh, you-
>> I love it.
>> Mike: You threw a number out there.
>> I love it. I love that he put his finger out-
>> You've got to say the Matrix. We're all going to be part of the Matrix.
>> Well, you know-
>> And Star Trek-
>> Skynet.
>> We can only turn back to this footage in a few years and quote you exactly, when you have the McKinsey research or the Morgan Stanley research that we've been mentioning here tonight, and say that you called it accurately. So I appreciate that. Ajay, it was wonderful to have you here. Congratulations on the acquisition.
>> Thank you.
>> Mike, thank you so much for being here on the ScienceLogic side, and congratulations to the team on 20 years. That's very exciting. John, thank you.
>> I try, I tried. Thank you.
>> You try, you succeed. And thank you to all of our fabulous viewers out there at home. Be sure to tweet us at theCUBE. Say hello: Furrier, SavIsSavvy. Let us know what you're thinking of AWS re:Invent, where we are live from Las Vegas all week. You're watching theCUBE, the leader in high tech coverage. My name's Savannah Peterson, and we'll see you soon.
(upbeat music)
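Ajay's description of the machine learning pipeline (classify the event catalog, score how rare each event type is, then correlate clusters of rare events with errors) can be made concrete with a short sketch. The following is a hedged illustration of that general idea only; Zebrium's actual implementation is not public, and the template masking, rarity threshold, and time window below are illustrative assumptions.

```python
import re
from collections import Counter
from datetime import timedelta

def template_of(line: str) -> str:
    """Mask variable tokens so log lines collapse into event 'types'."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)   # hex ids
    line = re.sub(r"\d+(\.\d+)*", "<num>", line)      # numbers, IPs, versions
    return re.sub(r'"[^"]*"', "<str>", line)          # quoted values

def root_cause_candidates(events, rarity_threshold=3, window=timedelta(seconds=30)):
    """events: list of (timestamp, severity, raw_line), oldest first.

    Flags rare event types that occur near ERROR-level events: the
    'unusual things that lead to bad things' Ajay describes."""
    counts = Counter(template_of(line) for _, _, line in events)
    error_times = [ts for ts, sev, _ in events if sev == "ERROR"]
    return [
        (ts, line)
        for ts, sev, line in events
        if counts[template_of(line)] <= rarity_threshold       # rare...
        and any(abs(ts - et) <= window for et in error_times)  # ...near a failure
    ]
```

A real system would also weigh how unlikely it is that the flagged events co-occur by chance, which is the "unlikely to be random" part Ajay credits the multi-stage ML with.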

Published Date: Nov 29, 2022


Atri Basu & Necati Cehreli | Zebrium Root Cause as a Service


 

>> Okay. We're back with Atri Basu, who is Cisco's resident philosopher and also holds a master's in computer science. We're going to have to unpack that a little bit. And Necati Cehreli, who's a technical lead at Cisco. Welcome, guys. Thanks for coming on theCUBE.
>> Happy to be here. Thanks a lot.
>> All right, let's get into it. We want you to explain how Cisco validated the Zebrium technology and the proof points that you have that it actually works as advertised. So first, Atri, tell us about Cisco TAC. What does Cisco TAC do?
>> So TAC is an acronym for Technical Assistance Center. It's Cisco's support arm, the support organization, and, at the risk of sounding like I'm spouting a corporate line, the easiest way to summarize what TAC does is: provide world-class support to Cisco customers. What that means is we have about 8,000 engineers worldwide, and any of our Cisco customers can either go on our web portal or call us to open a support request. We get about 2.2 million of these support requests a year. In these support requests, the customer will describe something that they need done, some networking goal that they have that they want to accomplish, and then it's TAC's job to make sure that that goal does get accomplished. Now, it could be something like they're having trouble with an existing network solution and it's not working as expected, or it could be that they're integrating with a new solution. They're upgrading devices, maybe there's a hardware failure, anything really to do with networking support and the customer's network goals. If they open up a case or a request for help, then TAC's job is to respond and make sure the customer's questions and requirements are met. About 44% of these support requests are usually trivial and can be solved within a call or within a day. But the rest of TAC's cases really involve getting into the network device and looking at logs. It's a very technical role, a very technical job. You need to be conversant with network solutions, their designs, protocols, et cetera.
>> Wow. So 56% non-trivial. And so I would imagine you spend a lot of time digging through logs. Is that true? Can you quantify that? Like, every month, how much time do you spend digging through logs, and is that a pain point?
>> Yeah, it's interesting you ask that, because when we started on this journey to augment our support engineers' workflow with Zebrium's solution, one of the things that we did was go out and ask our engineers what their experience was like doing log analysis. And the anecdotal evidence was that on average, an engineer will spend three out of their eight hours reviewing logs, either online or offline. What that means is either with the customer live on a WebEx, going over logs, network state information, et cetera, or offline, where the customer sends them the logs attached to a service request and they review them, try to figure out what's going on, and provide the customer with information. So it's a very large chunk of our day. You know, I said 8,000-plus engineers, and at three hours a day, that's 24,000 man-hours a day spent on log analysis. Now, the struggle with analyzing logs is that, out of necessity, logs are very compact. They try to pack a lot of information into very little space. This is for performance reasons, storage reasons, et cetera, but the side effect of that is that they're very esoteric. They're hard to read if you're not conversant, if you're not the developer who wrote these logs, or if you aren't doing code deep dives and looking at where the logs get printed and things like that. It may not be immediately obvious, or even obvious after a while, what a log line means or how it correlates to whatever problem you're troubleshooting. So it requires tenure. It requires, like I was saying before, a lot of knowledge about the protocol and what's expected, because when you're doing log analysis, what you're really looking for is a needle in a haystack. You're looking for that one anomalous event, that single thing that tells you this shouldn't have happened, and this was a problem. Now, doing that kind of anomaly detection requires you to know what is normal. It requires knowing what the baseline is, and that requires a very in-depth understanding of the state changes for that network solution or product. So it requires time, tenure, and expertise to do well. And it takes a lot of time even when you have that kind of expertise.
>> Wow. So thank you, Atri. And Necati, that's almost two days a week for a technical resource. That's not inexpensive. So what was Cisco looking for to help with this, and how did you stumble upon Zebrium?
>> Yeah, so we have our internal automation system, which has been running for more than a decade now. What happens is when a customer attaches a log bundle or diagnostic bundle to the service request, we take that from the SR, we analyze it, and we present some kind of information to the engineer: it can be an alert, some tables, some graph, so they can troubleshoot the particular issue. This is an incredible system, but it comes with its own challenges around maintenance, keeping it up to date and relevant with Cisco's new products, new versions of a product, new defects, new issues, and all kinds of things. What I mean by those challenges is: let's say Cisco comes up with a product today. We need to come together with those engineers, figure out how this bundle works and how it's structured. We need to select the individual logs which are relevant, and then start modeling those logs and getting some values out of them, using parsers or some regexes, to come to a level where we can consume the logs. Then people start writing rules on top of that abstraction, so people can say: in this log, I'm seeing this value together with this other value in another log, so maybe I'm hitting this particular defect. So that's how it works. And if you look at it, the abstraction can fail the next time, in the next release, when the developer or the engineer decides to change that log line which you wrote that regex against. Or we can come up with a new version which completely changes the services or processes, and then whatever you have written needs to be rewritten for the new service. And we see that a lot with products like, for instance, WebEx, where you have a very short release cycle: things can change maybe the next week with a new release. So whatever you are writing, especially that abstraction and those rules, may not be relevant with the new release. With that being said, we have an incredible rule-creation process and a governance process around it, which starts with maybe a defect and takes it to a level where we have automation in place. But if you look at it, this really ties to human bandwidth. Our engineers are really busy working on customer-facing issues daily, and sometimes creating these rules or these parsers is not their biggest priority, so they can be delayed a bit. So we have this delay between a new issue being identified and the point where we have the automation to detect it the next time a customer faces it. So with all these questions and challenges in mind, we started looking into ways of actually automating these automations: these things that we are doing manually, how can we take them a bit further and automate them? And we had a couple of things in mind that we were looking for, one of them being that this has to be product agnostic. If Cisco comes up with a product tomorrow, I should be able to take its logs, without writing complex regexes, parsers, whatever, and deploy it into this system, so it can embrace our logs and make sense of them. And we wanted this platform to be unsupervised, so that none of the engineers need to create rules or label logs as good or bad, or train the system, which requires a lot of computational power. The other most important thing for us was that we wanted this to be not noisy at all, because what happens with noise, when your level of false positives is really high, is that your engineers start ignoring the good things in between that noise. They start thinking, the next time, that this thing will not be relevant. So we wanted something with a lot less noise. And ultimately, we wanted this new platform or framework to be easily adaptable to our existing workflows. So this is where we started. We started looking, first of all, internally at whether we could build this thing, and we also started researching, and we came upon Zebrium. Actually, we came upon a presentation by Larry, one of the co-founders of Zebrium, where he clearly explained why this is different and how it works, and it immediately clicked. And we said, okay, this is exactly what we were looking for. We dived deeper. We checked the blog posts, where the Zebrium guys really explain everything very clearly; they're really open about it. And most importantly, there is a button in their system. What happens usually with AI/ML vendors is they have this button where you fill in your details and the sales guys call you back and explain the system. Here, they were like: this is our trial system, we believe in the system, you can just sign up and try it yourself. And that's what we did. We took one of our Cisco DNA Center wireless platforms, we started streaming logs out of it, and then we synthetically introduced errors: we broke things. And we realized that Zebrium was really catching the errors perfectly. On top of that, it was really quiet unless you were really breaking something. The other thing we realized during that first trial was that Zebrium was actually bringing a lot of context on top of the logs during those failures. We worked with a couple of technical leaders, and they said, okay, if this failure happens, I'm expecting this individual log to be there. And we found out with Zebrium that, apart from that individual log, there were a lot of other things giving a bit more context around the root cause, which was great. And that's where we wanted to take it to the next level. Yeah.
>> Okay. So, a couple of things to unpack there. I mean, you have the dartboard behind you, which is kind of interesting, because a lot of times it's like throwing darts at the board to try to figure this stuff out. But to your other point, Cisco actually has some pretty rich tools, with AppD doing observability, and you've made acquisitions like ThousandEyes. And like you said, I'm presuming you've got to eat your own dog food, or drink your own champagne, and so you've got to be tools agnostic. And when I first heard about Zebrium, I was like, wait a minute. Really? I was kind of skeptical. I've heard this before. You're telling me all I need is plain text and a timestamp, and you've got my problem solved? So, and I understand that you guys said, okay, let's run a POC. Let's see if we can cut that from, let's say, two days a week down to one day a week. In other words, 50%: let's see if we can automate 50% of the root cause analysis. And so you funded a POC. How did you test it? You put, you know, synthetic errors and problems in there, but how did you test that it actually works, Necati?
>> Yeah. So we wanted to take it to the next level, meaning that we wanted to back-test it with existing SRs. We chose four different products from four different verticals: data center, security, collaboration, and enterprise networking. And we found SRs where the engineer had put some kind of log in the resolution summary. So they had closed the case, and in the summary of the SR they put: I identified these log lines, and they led me to the root cause. We ingested those log bundles and tried to see if Zebrium could surface that exact same log line in its analysis. We initially did this with Atri ourselves, and after 50 tests or so we were really happy with the results. I mean, in almost all of them we saw the log line that we were looking for. But that was not enough. We brought it, of course, to our management, and they said, okay, let's try this with real users, because the log being there is one thing, but the engineer reaching that log is another thing. We wanted to make sure that when we put it in front of our users, our engineers, they could actually get to that log themselves, because we know this platform, so we can make searches and find whatever we are looking for, but we wanted them to do that. So we extended our pilot to some selected engineers, and they tested it with their own SRs and also did some back-testing on SRs which had been closed in the past or recently. And with a sample set of, I guess, close to 200 SRs, we found that the majority of the time, almost 95% of the time, the engineer could find the log they were looking for in Zebrium's analysis.
>> Yeah. Okay. So you were looking for 50%, you got to 95%. And my understanding is you actually did it with four pretty well-known Cisco products: WebEx Client, DNA Center, Identity Services Engine (ISE), and then UCS.
>> Yes. Unified Computing System.
>> So you used actual real data, and that was kind of your proof point. That sounds pretty impressive, Atri. Have you put this into production now, and what have you found?
>> Well, yes. We've launched this with the four products that you mentioned. We're providing our TAC engineers with the ability, whenever a support bundle for one of those products gets attached to the support request, to have it processed with Zebrium, and then we provide that analysis to the TAC engineer for their review.
>> So are you seeing the results in production? I mean, are you actually able to reclaim the time that people were spending? It was literally almost two days a week, down to, you know, part of a day. Is that what you're seeing in production, and what are you able to do with that extra time? Are people getting their weekends back? Are you putting them on more strategic tasks? How are you handling that?
>> Yeah. So what we're seeing is, and I can tell you from my own personal experience using this tool, that troubleshooting any one of these cases takes me no more than 15 to 20 minutes of going through the Zebrium report. And within that time, I know either what the root cause is, or that Zebrium doesn't have the information I need to solve this particular case. So we've definitely seen... well, it's been very hard to measure exactly how much time we've saved per engineer, right? Again, anecdotally, what we've heard from our users is that out of those three hours that they were spending per day, we're definitely able to reclaim at least one of those hours. And even more importantly, in terms of the kind of feedback we've gotten, I think one statement that really summarizes how Zebrium has impacted our workflow came from one of our users. They said, well, until you provided us with this tool, log analysis was a very black-and-white affair, but now it's become really colorful. And if you think about it, log analysis is indeed black and white. You're looking at it on a terminal screen where the background is black and the text is white, or you're looking at it as text where the background is white and the text is black. But what they're really trying to say is that there are hardly any visual cues that help you navigate these logs, which are so esoteric, so dense, et cetera. What Zebrium does is provide a lot of color and context to the whole process. Using their word cloud, using their interactive histogram, using the summaries of every incident, you're very quickly able to summarize what might be happening and what you need to look into: what are the important aspects of this particular log bundle that might be relevant to you? A really great use case that encapsulates all of this came very early on in our experiment. There was this support request that had been escalated to the business unit, or the development team. The TAC engineer had an intuition about what was going wrong, because of their experience, because of the symptoms they'd seen. They kind of had an idea, but they weren't able to convince the development team, because they weren't able to find any evidence to back up what they thought was happening. It was entirely happenstance that I happened to pick up that case and did an analysis using Zebrium. Then I sat down with the TAC engineer, and within 15 minutes we were able to get down to the exact sequence of events that highlighted what the TAC engineer thought was the root cause. And it was the root cause. We were then able to share that evidence with our business unit and redirect their resources, so that we could chase down the problem. And that really shows you how that color and context helps in log analysis.
>> Interesting. You know, we do a fair amount of work on theCUBE in the RPA space, robotic process automation, and the narrative in the press when RPA first started taking off was, oh, it's machines replacing humans, we're going to lose jobs. And what actually happened was people were just eliminating mundane tasks, and the employees were actually very happy about it. But my question to you is: was there ever a reticence amongst your team, like, oh wow, I'm going to lose my job if the machine replaces me? Or have you found that people were excited about this? What's been the reaction amongst the team?
>> Well, I think, you know, every automation and AI project has that immediate gut reaction of "you're automating away our jobs," and so forth. Initially there's a little bit of reticence, but, like you said, once you start using the tool, you realize that it's not your job that's getting automated away. It's just that your job is becoming a little easier to do, it's faster and more efficient, and you're able to get more done in less time. That's really what we're trying to accomplish here. At the end of the day, Zebrium will identify these incidents, do the correlation, et cetera, but if you don't understand what you're reading, then that information is useless to you. So you need the human, you need the network expert, to actually look at these incidents. What we're able to skim away, or get rid of, is all the fat involved in our process: having to download the bundle, which, when it's many gigabytes in size, and now that we're working from home with the pandemic and everything, means pulling massive amounts of logs from the corporate network onto your local device, and that takes time. Then opening it up and loading it in a text editor takes time. All of these things we're trying to get rid of. Instead, we're trying to make it easier and quicker for you to find what you're looking for. So, like you said, you take away the mundane, you take away the difficulties and the slog, but you don't really take away the work. The work still needs to be done.
>> Yeah. Great, guys. Thanks so much. Appreciate you sharing your story. It's quite fascinating, really. Thank you for coming on.
>> Thanks for having us.
>> You're very welcome. Okay, in a moment I'll be back to wrap up with some final thoughts. This is Dave Vellante, and you're watching theCUBE.
>> So today we talked about the need not only to gain end-to-end visibility, but also to automate the identification of root cause problems. Doing so with modern technology and machine intelligence can dramatically speed up the process and identify the vast majority of issues right out of the box, if you will. And this technology can work with log bundles in batches, or with real-time data. As long as there's plain text and a timestamp, it seems Zebrium's technology will get you the outcome of automating root cause analysis with very high degrees of accuracy. Zebrium is available on-prem or in the cloud. Now, this is important for some companies, the on-prem option, because there's some really sensitive data inside logs that, for compliance and governance reasons, companies have to keep inside their four walls. Now, Zebrium has a free trial. Of course they'd better, right? So check it out at zebrium.com. You can book a live demo and sign up for a free trial. Thanks for watching this special presentation on theCUBE, the leader in enterprise and emerging tech coverage. I'm Dave Vellante.

Published Date: Jun 16, 2022


Larry Lancaster & Rod Bagg, Zebrium | Zebrium Root Cause as a Service


 

(upbeat music) >> Full stack observability is all the rage today. As businesses lean into digital, customer experience becomes ever more important. Why? Well, it's obvious, fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades old software companies that have traditionally done monitoring or log analytics and or application performance management stepping up their game. These established players, you know, they typically have deep feature sets and sometimes purpose-built tools that attack one particular segment of the marketplace. And now they're pivoting through M&A and some organic development trying to fill gaps in their portfolio. And then, you got all these new entrants coming to the market, claiming end to end visibility across the so-called modern cloud and now edge native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling combined with strong ecosystems to address this problem. But, you know, recent survey research from ETR confirms our thesis that no one company has it all. Here's the thing. Customers just want to figure out the root cause as quickly and as efficiently as possible. It's one thing to observe the stack end to end, but the question is who is automating the observers? And that's why we're here today. Hello, my name is Dave Vellante and welcome to this special Cube presentation where we dig into root cause analysis, and specifically, how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes and delivering it as an automated service. And in this session, we have two deep dives. First, we're going to dig into this exciting new field of RCaaS, Root Cause As A Service with two of the founders and technical experts behind Zebrium. And then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well established and well known Cisco product lines, including WebEx Client and UCS. I was pretty amazed at the results and I think you'll be impressed as well. So thanks for being here. Let's get started. With me right now is Larry Lancaster, who's a founder and CTO of Zebrium. And he's joined by Rod Bagg, who's the founder and vice president of engineering at the company. Gents, welcome. Thanks for coming on. >> Thanks. >> Okay. >> It's good to be here. >> It's good to be here >> All right Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO. What do we need to know? >> Yeah, I mean, it's like you said. I mean, it's extremely important to our customers and to most businesses out there to drive uptime and avoid as much downtime as possible. So, you know, when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running, you know, running on the web and that's how you get a point click. Or the business depends on, you know, internal systems to drive their business and to run it. When that is down, that is hugely impacting to them. 
So if you take a look, you know, way back, you know, 20, 30 years ago, software was simple. You know, there wasn't much to it. It was pretty monolithic and maybe it took a couple of people to maintain it and keep it running. There wasn't really anything complicated about it. It was a single tenant piece of software. Today's software is so complicated, often running, you know, maybe hundreds of services to keep that or to actually implement what that software is doing. So as you point out, you know, enter the sort of observability space and the tools that are now in use to help monitor that software and make sure when something goes wrong, they know about it But there's kind of an interesting stat around the observability space. So when you look at observability in the context or through the lens of the cost of downtime, it's really interesting. So observability tools are about a $20 billion market, okay? But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. So it's all great to know that something went wrong but you got to know why. And it's our contention here that, you know, really, when you take a look at the observability space, you have metrics, that's a great tool. I mean, there's lots of great tools out there, you know, around metrics monitoring that's going to tell you when something went wrong. It's very rarely it's going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where you're being, you know where it's happening or where something is running slow, potentially. So that's great. But again, the root cause of why it's happening is going to be buried in log files. And I can expand on that a little bit more but you know, when you're a software developer and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing, or why it went wrong. And it's that that really gets you to the root cause very fast. And that's our contention, is that these software systems are so complex nowadays and that the root cause is lying in those logs. So how do you get there fast? You know, we would contend that you better automate that or you are just doomed for failure. And that's where we come in. >> Great. >> Getting to that root cause. >> Thank you, Rod. You know, it's interesting you talk about the $20 billion market. There's an analogy with security, right? We spend 80, $100 billion a year on securing our infrastructure, and yet we lose probably closer to a trillion dollars a year in breaches. And there's a similar analogy here. 20 billion could be 5X in downtime impacts or more. Okay, let's go to Larry. Tell us a little bit more about Zebrium. I'm interested always to ask a founder why you started the company. Rod touched on that a little bit. You guys have invented this concept of RCaaS. What does it mean? What problems does it solve, and how does it solve the problem? Let's get into it. >> Yeah. Hey, thanks, Dave. So I think when you said, you know, who's automating the observer, that that's a great way to think about it because what observability really means is it's a property of a system that means you can see into it. 
You can observe the internal state and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it is automating the observer. Now, of course, it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience. That's all true, but it's important to step back and realize like we have cracked a real nut here. People have been trying to figure out how to automate this part of sort of the troubleshooting experience, this human part of finding the root cause indicators for a long time. And until Zebrium came along, I would argue, no one's really done it right. So, you know, I think it's also important you know, as we step back, we can probably look forward five to 10 years and say, everyone's going to look back and say how did we do all this manually? You're going to see this sort of last mile of observability and troubleshooting is going to be automated everywhere because otherwise, you know, people are just... They're not going to be able to scale their business. So, you know, I think one more thing that's important to point out is, you know, I think Zebrium, you know, it's one thing to have the technology but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So, you know, we're looking at, you know, if you look at Zebrium, you'll put us on your dashboard and we don't care what kind of a dashboard it is. It could be, you know Datadog, New Relic, Elastic, Dynatrace, Grafana AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff. Yep. >> Yeah. So, interesting. Again, another analogy I think about. You know, you talked about automation. If we're to look back and say this is what... We're never going to do this again, it's like provisioning loans. Nobody provisions loans anymore, it's all automated. >> Larry: (chuckling) That's right. >> So Larry, I'll stay with you, then the skeptic in me says, this sounds amazing, but if I, you know... It might be too good to be true. Tell us how it works. >> Larry: (chuckling) Yeah. So that's interesting. So Cisco came along and they were equally skeptical. So what they did was they took a couple of months and they did a very detailed study. And they got together 192 incidents across four product lines, where they knew that the root cause was in the logs. And they knew what that root cause was because they had had their best engineers, you know work on those cases and take detailed notes of the incidents that had taken place. And so they ran that data through the Zebrium software. And what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away. When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like, you know, it was like the Apollo command center, you know when they finally, you know, touchdown on the moon kind of thing. So, you know, it's really an exciting point in time to be at the company, like just seeing everything finally being proven out according to this vision. 
I'm going to tell you one more story which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper modern, you know, SaaS business, they're an S3 competitor. Zoom has their files stored on Lyve Cloud, you know, to let you know who they are. So essentially, what happened was they were in alpha, their early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored. And it was, you know, fortunately for them, it was early access. So no one was expecting, you know, uptime, you know, service level objectives and so on. But they were scared, because they realized, if something like this happens in production, you know, they're screwed. So what they did was they saw Zebrium. They went and did some research, they saw Zebrium. They went in a staging environment, recreated the exact (indistinct) that they had had. And what they saw was immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're onto something transformational. >> Dave: Yeah. That's great. I mean, you guys are jumping up and down, I'm sure. We're going to hear from Cisco later. I bet you, they were jumping up and down too because they didn't have to do all that heavy lifting anymore. So Rod, Larry's just sort of implying that, or actually, you guys both talked about that your tool is agnostic. So how does one actually use the service? How do I deploy it? >> Yeah. So let me step back. So when we talk about logs right? Like, you know, all these bread crumbs being in logs and everything else? So, you know, they are a great wealth of you know, information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, you know, we had one of our... Or we've heard from several of our customers now prior to using Zebrium, when they, you know, have some issue, and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metric has, you know, taken a blip or something's happened that they know there's a problem. We've heard from them that it can take like a number of hours just to get to the right set of logs, like figuring out over these hundreds of services where the logs are, to get to them, maybe searching in a log manager. Just to get into the right context, even, can take hours. So, you know, that's obviously the problem we solve but, you know, we don't want them just looking at logs. I mean, you know, we don't want to put them back in the thing they don't like doing because people don't do that. They don't like doing it. So we put it up on the dashboard. So if something is going wrong with your metrics and that's the indicator, or maybe it's something with tracing that you're sort of digging through that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs, you click on one of our integrations and we integrate with all these tools that Larry's talked about. And when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong and you know there's an issue, take a look at the portion of your dashboard that is us, and we're going to tell you why. 
We're going to get you to the why of what went wrong. No other work should be needed. You can, you know, also click down and click through to us so that you land in our portal, if you want to do some more digging around, if you need to or whatever, maybe to get some context, what have you, but it's rare that you'd ever need to do that; the answer should be right there on your dashboard. And that's how we expect people to use it. We don't want them digging in logs and going through things, we want it to be right in their workflow. >> Great. Thank you, Larry. So Rod, we talked about Cisco. We're going to hear more from them in a moment, and from Seagate. I would think this is like a perfect solution for a SaaS provider, anybody doing AI ops. Do you have some examples of those types of firms leaning into this? >> Rod: Yeah, a couple of great ones. Well, I mean, we've got many of them, but a couple that I'll touch on. We have an actual AI ops company that was looking for, you know, sort of some complementary technology and so on. And so they decided to just put us through our paces by having one of their own SREs sign up for our service in our SaaS environment, and send the logs from their system to us, you know, and just see how we did. So it turned out we ended up talking back to this SRE like a week after he had installed the product, you know, signed up and then, you know, started sending us logs. And, you know, he was hemming and hawing, saying that he was busy, like every SRE is, and that he didn't have a chance to really do much with us yet. And, you know, we were just, you know, having this conversation on the phone, and he comes to tell us that, yeah, I've been busy because we had this, you know, terrible outage, like, you know, five days ago. And we said like, "Okay, did you actually look on the Zebrium dashboard?" (chuckles) And he goes, "You know what? I didn't even think to do it yet. I mean, I'd just been so busy and frazzled." So we have an integration with that company, he hadn't put that integration in, so it wasn't in his dashboard yet, but it was certainly on ours. So he went there, and he looks on the day, you know, on the time range of when he had had this incident. And right at the very top of the page on our portal was that incident with that root cause. And he was flabbergasted. It literally would've saved him hours and hours and hours. They had this issue going on for over 24 hours. And we had the answer right there in five minutes, and it was crazy. And we get those kinds of stories. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it. And you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. As for SaaS companies, you know, one of our customers is Alchera. They do cost optimizations for cloud properties, you know, for AWS optimization, Google Cloud, and so on. But they use our software, and they have a lot of interaction, obviously with these cloud vendors and the APIs of those cloud vendors. So, you know, in order to figure out your costing at AWS, they're using all those APIs. So it turned out, you know, they had some issue where their services were breaking. And we had that root cause report right on the screen, again within five minutes, that was pointing to an API problem with Google. Google had changed one of their APIs and Alchera was not aware of it. So their stuff was breaking because of a change downstream that we had caught.
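As an aside, the "root cause report on your dashboard" workflow Rod describes can be pictured as a small structured payload pushed through an integration. The sketch below is purely illustrative; every field name and URL is a made-up assumption, not Zebrium's actual API.

```python
# Hypothetical shape of a root-cause report pushed to a dashboard
# integration. None of these field names come from Zebrium's API; they
# just illustrate "root cause in the same timeline as your metrics".
import json

report = {
    "detected_at": "2022-05-25T14:03:07Z",  # aligns with the metric blip
    "summary": "Upstream API schema change broke cost-collection service",
    "root_cause_indicators": [
        "ERROR collector: unexpected field 'usageQty' in billing API response",
        "WARN  scheduler: job cost-sync failed 5 times in 60s",
    ],
    "confidence": 0.87,
    "link": "https://portal.example.com/report/1234",  # deep link for digging
}

# A dashboard plugin would render this on the metrics timeline;
# here we just pretty-print it.
print(json.dumps(report, indent=2))
```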
And I'll just tell you one last one because it's somewhat related to one of these cloud vendors. You know, it was a big cloud vendor who had an outage a couple of months ago. And it's interesting because, you know, a lot of our customers will set up shared Slack channels with us, where we're monitoring or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them or the root cause that we detected for them, and that's in a shared community channel. So we could see this happening when that AWS outage happened. We could see our customers getting impacted by that AWS outage, and the root cause of what was going on there in AWS, the thing that was impacting our customers, was showing up in our incidents. Now we didn't obviously, you know, have the very root cause of what was going on in AWS, per se, but we were getting to the root cause of why our customers' applications were failing. And that was because of issues going on at AWS. >> Very interesting. I mean, I think one of your biggest challenges is going to be getting people's attention because these SREs are so busy, their hair's on fire. >> Rod: That's it. Right. (chuckling). You know, when you say, hey, (indistinct). >> I tell you, if you get their attention, they love it. I mean, this AI ops company, I didn't even tell you the punchline there, but, you know, they had this incident that occurred that we found. And quite literally, the next week, they ended up signing up as a paid customer. So... >> Dave: That's great. And Larry, to give you the last word. I mean, you know, Rod was talking about, you know, changes in APIs and, you know, there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs that people don't want to let leave their environment. >> Larry: That's right. Absolutely. >> Dave: But close it out here. >> Yeah. I mean, that's right, you can run it on-prem. Just like we run it in our cloud, you can run it in your cloud or on your own infrastructure. Now that's all true. You know, I think the one hurdle now that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. If you don't believe it, do a POC, try it yourself. And you know, people have become so jaded by the lack of, you know, real, sort of, innovation in the software industry for the last 10 years that it's hard to get people to... But guys, you got to give it a shot, I'm telling you. I'm telling you right now, it works. And you'll hear more about that from one of our customers in a minute. >> All right guys, thanks so much. Great story. Really appreciate you sharing. >> Thank you. >> Yeah. Thanks Dave. Appreciate the time. >> Okay. In a moment, we're going to hear from Cisco, who is the customer in this case example and a company that has... Look, they have quite an impressive suite of observability tooling, and they've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of, like WebEx. So stay tuned and learn about how you can really take advantage of this new technology called Root Cause As A Service. You're watching theCUBE, the leader in enterprise and emerging tech coverage. (upbeat music)

Published Date : Jun 16 2022


Larry Lancaster, Zebrium | Virtual Vertica BDC 2020


 

>> Announcer: It's theCUBE! Covering the Virtual Vertica Big Data Conference 2020, brought to you by Vertica. >> Hi, everybody. Welcome back. You're watching theCUBE's coverage of the Vertica Virtual Big Data Conference. It was, of course, going to be in Boston at the Encore Hotel, win big with big data at the new casino, but obviously Coronavirus has changed all that. Our hearts go out, and we have empathy for those people who are struggling. We are going to continue our wall-to-wall coverage of this conference and we're here with Larry Lancaster who's the founder and CTO of Zebrium. Larry, welcome to theCUBE. Thanks for coming on. >> Hi, thanks for having me. >> You're welcome. So first question, why did you start Zebrium? >> You know, I've been dealing with machine data a long time. So for those of you who don't know what that is, if you can imagine servers or whatever goes on in a data center or in a SaaS shop, there's data coming out of those servers, out of those applications and basically, you can build a lot of cool stuff on that. So there's a lot of metrics that come out and there's a lot of log files that come out. And so, I've built this... Basically spent my career building that sort of thing. So tools on top of that or products on top of that. The problem is that, since log files at least are completely unstructured, you're always doing the same thing over and over again, which is going in and understanding the data and extracting the data and all that stuff. It's very time consuming. If you've done it like five times you don't want to do it again. So really, my idea was that at this point, with where machine learning is at, there's got to be a better way. So Zebrium was founded on the notion that we can just do all that automatically. We can take a pile of machine data, we can turn it into a database, and we can build stuff on top of that. And so the company is really all about bringing that value to the market. >> That's cool. I want to get into that, just better understand who you're disrupting and understand that opportunity better. But before I do, tell us a little bit about your background. You got kind of an interesting background. Lot of tech jobs. Give us some color there. >> Yeah, so I started in the Valley I guess 20 years ago and when my son was born I left grad school. I was in grad school over at Berkeley, in Biophysics. And I realized I needed to go get a job so I ended up starting in software and I've been there ever since. I mean, I guess I cut my teeth at NetApp, which was a storage company. And then I co-founded a business called Glassbeam, which was kind of an ETL database company, and another company, EMC, ended up buying Glassbeam, so I went over there. And then after that I ended up at Nimble Storage, which is where I built the InfoSight platform. After that, I was able to step back and take a year and a half and just go into my basement, actually, this is my kind of workspace here, and come up with the technology and actually build it so that I could go raise money and get a team together to build Zebrium. So that's really my career in a nutshell. >> And you've got Hello Kitty over your right shoulder, which is kind of cool. >> That's right. >> And then up to the left you got your monitor, right? >> Well, I had it. It's over here, yeah. >> But it was great! Pull it out, pull it out, let me see it. So, okay, so you got that. So what do you do? You just sit there and code all night or what?
Yeah, that's right. So Hello Kitty's over here. I have a daughter and she set up my workspace here on this side with Hello Kitty and so on. And over on this side, I've got my recliner where I basically lay it all the way back and then I pivot this thing down over my face and put my keyboard on my lap and I can just sit there for like 20 hours. It's great. Completely comfortable. >> That's cool. All right, better put that monitor back or our guys will yell at me. But so, obviously, we're talking to somebody with serious coding chops and I'll also add that the Nimble InfoSight, I think it was one of the best pickups that HP, HPE, has had in a while. And the thing that interested me about that, Larry, is that the company was able to take that InfoSight and port it very quickly across its product lines. So that says to me it was a modern architecture, I'm sure APIs, microservices, and all those cool buzzwords, but the proof is in their ability to bring that IP to other parts of the portfolio. So, well done. >> Yeah, well thanks. Appreciate that. I mean, they've got a fantastic team there. And the other thing that helps is when you have the notion that you don't just build on top of the data, you extract the data, you structure it, you put that in a database, we used Vertica there for that, and then you build on top of that. Taking the time to build that layer is what lets you build a scalable platform. >> Yeah, so, why Vertica? I mean, Vertica's been around for a while. You remember you had the old RDBMSs, Oracle, Db2, SQL Server, and then the database was kind of a boring market. And then, all of a sudden, you had all of these MPP companies come out, a spate of them. They all got acquired, including Vertica. And they've all sort of disappeared and morphed into different brands and Micro Focus has preserved the Vertica brand. But it seems like Vertica has been able to survive the transitions. Why Vertica? What was it about that platform that was unique and interested you? >> Well, I mean, they were the first ones to build what I would call a real column store that's kind of market capable, right? So there was the C-Store project at Berkeley, which Stonebraker was involved in. And then that became sort of the seed from which Vertica was spawned. So you had this idea of, let's lay things out in a columnar way. And when I say columnar, I don't just mean that the data for every column is in a different set of files. What I mean by that is it takes full advantage of things like run-length encoding, delta encoding, and block compression, and so you end up with these massive orders-of-magnitude savings in terms of the data that's being pulled off of storage as well as when it's moving through the pipeline internally in Vertica's query processing. So why am I saying all this? Because it was a fundamentally disruptive technology. I think column stores are ubiquitous now in analytics. And I think you could name maybe a couple of projects which are mostly open source who do something like Vertica does but name me another one that's actually capable of serving an enterprise as a relational database. I still think Vertica is unique in being that one. >> Well, it's interesting because you're a startup. And so a lot of startups would say, okay, we're going with a born-in-the-cloud database. Now Vertica touts that, well look, we've embraced cloud. You know, we run in the cloud, we run on-prem, all different optionality.
And you hear a lot of vendors say that, but a lot of times they're just taking their stack and stuffing it into the cloud. But, so why didn't you go with a cloud-native database? I mean, obviously, that's why you chose it, but I'm interested from a technologist standpoint as to why you, again, made that choice given all these other choices out there. >> Right, I mean, again, I'm not, so... As I explained, a column store, which I think is the appropriate definition, I'm not aware of another cloud-native-- >> Hm, okay. >> I'm aware of other cloud-native transactional databases, I'm not aware of one that has the analytics performance, and I've tried some of them. So it was not like I didn't look. What I was actually impressed with, and I think what let me move forward using Vertica in our stack, is the fact that Eon really is built from the ground up to be cloud-native. And so we've been using Eon almost ever since we started the work that we're doing. So I've been really happy with the performance and the reliability of Eon. >> It's interesting. I've been saying for years that Vertica's a diamond in the rough and its previous owner didn't know what to do with it because it got distracted, and now Micro Focus seems to really see the value and is obviously putting some investments in there. >> Yeah >> Tell me more about your business. Who are you disrupting? Are you kind of disrupting the do-it-yourself? Or is there sort of a big whale out there that you're going to go after? Add some color to that. >> Yeah, so our broader market is monitoring software, that's kind of the high-level category. So you have a lot of people in that market right now. Some of them are entrenched, large players, like Datadog would be a great example. Some of them are smaller upstarts. It's a pretty saturated market. But what's happened over the last, I'd say two years, is that there's been sort of a push towards what's called observability, in terms of at least how some of the products are architected, like Honeycomb, and how some of them are messaged; most of them are messaged that way these days. And what that really means is there's been sort of an understanding that's developed that MTTR is really what people need to focus on to keep their customers happy. If you're a SaaS company, MTTR is going to be your bread and butter. And it's still measured in hours and days. And the biggest reason for that is because of what's called unknown unknowns. Because of complexity. Nowadays, applications are ten times as complex as they used to be. And what you end up with is a situation where if something is new, if it's a known issue with a known symptom and a known root cause, then you can set up an automation for it. But the ones that really cost a lot of time in terms of service disruption are unknown unknowns. And now you got to go dig into this massive amount of data. So observability is about making tools to help you do that, but it's still going to take you hours. And so our contention is, you need to automate the eyeball. The bottleneck is now the eyeball. And so you have to get away from this notion of a person's going to be able to do it infinitely more efficiently and recognize that you need automated help. When you get an alert, it shouldn't be, "Hey, something weird's happening. Now go dig in." It should be, "Here's a root cause and a symptom." And that should be proposed to you by a system that actually does the observing. That actually does the watching.
And that's what Zebrium does. >> Yeah, that's awesome. I mean, you're right. The last thing you want is just another alert that says, "Go figure something out because there's a problem." So how does it work, Larry? In terms of what you built there. Can you take us inside the covers? >> Yeah, sure. So right now there's really two kinds of data that we're ingesting. There's metrics and there's log files. Metrics, there's actually sort of a framework that's really popular in DevOps circles especially, but it's becoming popular everywhere, which is called Prometheus. And it's a way of exporting metrics so that scrapers can collect them. And so if you go look at a typical stack, you'll find that most of the open source components and many of the closed source components are going to have exporters that export all their stats to Prometheus. So by supporting that stack we can bring in all of those metrics. And then there's also the log files. And so you've got host log files, in a containerized environment you've got container logs, and you've got application-specific logs, perhaps living on a host mount. And you want to pull all those back and you want to be able to say, this log that I've collected here is associated with the same container on the same host that this metric is associated with. But now what? So once you've got that, you've got a pile of unstructured logs. So what we do is we take a look at those logs and we say, let's structure those into tables, right? So where I used to have a log message, if I look in my log file and I see it says something like, X happened five times, right? Well, that event type's going to occur again and it'll say, X happened six times or X happened three times. So if I see that as a human being, I can say, "Oh clearly, that's the same thing." And what's interesting here is the times that X happened and the number it read... I may want to see those numbers as a time series, the values of that column. And so you can imagine it as a table. So now I have a table for that event type and every time it happens, I get a row. And then I have a column with that number in it. And so now I can do any kind of analytics I want almost instantly across my... If I have all my event types structured that way, everything changes. You can do real anomaly detection and incident detection on top of that data. So that's really how we go about doing it. How we go about being able to do autonomous monitoring in a way that's effective. >> How do you handle doing that for, like, a bespoke app? Do you have to, does somebody have to build a connector to those apps? How do you handle that? >> Yeah, that's a really good question. So you're right. So if I go and install a typical log manager, there'll be connectors for different apps and usually what that means is pulling in the stuff on the left, if you were to be looking at that log line, and it will be things like a timestamp, or a severity, or a function name, or various other things. And so the connector will know how to pull those apart and then the stuff to the right will be considered the message and that'll get indexed for search. And so our approach is we actually go in with machine learning and we structure that whole thing. So there's a table. And it's going to have a column called severity, and timestamp, and function name. And then it's going to have columns that correspond to the parameters that are in that event. And it'll have a name associated with the constant parts of that event.
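To make the event-typing idea concrete, here is a minimal sketch of what Larry is describing: collapse log lines that differ only in their variable parts into one event type, and treat each type as a table of timestamped parameter values. The sample log lines and the naive regex-based grouping are invented for illustration; Zebrium's actual approach uses unsupervised machine learning rather than hand-written patterns.

```python
# Naive illustration of "event typing": reduce each log line to its
# constant skeleton, extract the variable parts as parameters, and
# collect one table of (timestamp, params) rows per event type.
import re
from collections import defaultdict

LOG_LINES = [
    "2022-06-16T00:00:01 INFO worker: X happened 5 times",
    "2022-06-16T00:05:09 INFO worker: X happened 6 times",
    "2022-06-16T00:09:30 INFO worker: X happened 3 times",
    "2022-06-16T00:09:31 ERROR worker: connection to db-7 lost",
]

def event_type(line: str):
    """Split a log line into its constant skeleton and variable parameters."""
    body = line.split(" ", 1)[1]          # drop the timestamp
    params = re.findall(r"\d+", body)     # the variable (numeric) parts
    skeleton = re.sub(r"\d+", "<num>", body)
    return skeleton, params

tables = defaultdict(list)                # one "table" per event type
for line in LOG_LINES:
    timestamp = line.split(" ", 1)[0]
    skeleton, params = event_type(line)
    tables[skeleton].append((timestamp, params))

for skeleton, rows in tables.items():
    print(skeleton)
    for timestamp, params in rows:
        print("   ", timestamp, params)

# A crude anomaly cue that this structuring enables: an event type seen
# only once in the window is a candidate for a closer look.
rare = [s for s, rows in tables.items() if len(rows) == 1]
print("rare event types:", rare)
```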
And so you end up with a situation where you've structured all of it automatically so we don't need collectors. It'll work just as well on your home-grown app that has no collectors or no parsers defined or anything. It'll work immediately just as well as it would work on anything else. And that's important, because you can't be asking people for connectors to their own applications. It just becomes, now they've got to stop what they're doing and go write code for you, for your platform, and they have to maintain it. It's just untenable. So you can be up and running with our service in three minutes. It'll just be monitoring those for you. >> That's awesome! I mean, that is really a breakthrough innovation. So, nice. Love to see that hittin' the market. Who do you sell to? What types of companies, and what role within the company? >> Well, definitely there's two main sort of pushes that we've seen, or I should say pulls. One is from DevOps folks, SRE folks. So these are people who are tasked with monitoring an environment, basically. And then you've got people who are in engineering and they have a staging environment. And what they actually find valuable is... Because when we find an incident in a staging environment, yeah, half the time it's because they're tearing everything up and it's not release ready, whatever's in stage. That's fine, they know that. But the other half the time it's new bugs, it's issues and they're finding issues. So it's kind of diverged. You have engineering users and they don't have titles like QA, they're Dev engineers or Dev managers that are really interested. And then you've got DevOps and SRE people there (mumbles). >> And how do I consume your product? It's SaaS... I sign up and you say within three minutes I'm up and running. I'm paying by the drink. >> Well, (laughs) right. So there's a couple ways. So, right. So the easiest way is if you use Kubernetes. So Kubernetes is what's called a container orchestrator. So these days, you know Docker and containers and all that, so now container orchestrators have become, I wouldn't say ubiquitous, but they're very popular now. So it's kind of on that inflection curve. I'm not exactly sure of the penetration, but I'm going to say probably 30-40% of shops that would be interested are using container orchestrators. So if you're using Kubernetes, basically you can install our Kubernetes chart, which basically means copying and pasting a URL and so on into your little admin panel there. And then it'll just start collecting all the logs and metrics and then you just log in on the website. And the way you do that is just go to our website and it'll show you how to sign up for the service and you'll get your little API key and link to the chart and you're off and running. You don't have to do anything else. You can add rules, you can add stuff, but you don't have to. You shouldn't have to, right? You should never have to do any more work. >> That's great. So it's a SaaS capability and I just pay for... How do you price it? >> Oh, right. So it's priced on volume, data volume. I don't want to go too much into it because I'm not the pricing guy. But what I'll say is that, as far as I know, it's as cheap or cheaper than any other log manager or metrics product. It's in that same neighborhood as the very low-priced ones. Because right now, we're not trying to optimize for take. We're trying to make a healthy margin and get the value of autonomous monitoring out there. Right now, that's our priority.
>> And it's running in the cloud, is that right? AWS West-- >> Yeah, that's right. Oh, I should've also pointed out that you can have a free account; if it's less than some number of gigabytes a day, we're not going to charge. Yeah, so we run in AWS. We have a multi-tenant instance in AWS. And we have a Vertica Eon cluster behind that. And it's been working out really well. >> And on your freemium, have you used the Vertica Community Edition? Because they don't charge you for that, right? So is that how you do it or... >> No, no. We're, no, no. So, I don't want to go into that because I'm not the bizdev guy. But what I'll say is that if you're doing something that winds up being OEM-ish, you can work out the particulars with Vertica. It's not like you're going to just go pay retail and they won't let you distinguish between test, and prod, and paid, and all that. They'll work with you. Just call 'em up. >> Yeah, and that's why I brought it up because Vertica, they have a community edition, which is not neutered. It runs Eon, it's just that there's limits on clusters and storage. >> There's limits. >> But it's still fully functional though. >> So to your point, we want it multi-tenant. So it's big just because it's multi-tenant. We have hundreds of users on that (audio cuts out). >> And then, what's your partnership with Vertica like? Can we close on that and just describe that a little bit? >> What's it like? I mean, it's pleasant. >> Yeah, I mean (mumbles). >> You know what, so the important thing... Here's what's important. What's important is that I don't have to worry about that layer of our stack. When it comes to being able to get the performance I need, being able to get the economy of scale that I need, being able to get the absolute scale that I need, I've not been disappointed ever with Vertica. And frankly, being able to have ACID guarantees and everything else, like a normal mature database that can join lots of tables and still be fast, that's also necessary at scale. And so I feel like it was definitely the right choice to start with. >> Yeah, it's interesting. I remember in the early days of big data a lot of people said, "Who's going to need these ACID properties and all this complexity of databases." And of course, ACID properties and SQL became the killer features and functions of these databases. >> Who didn't see that one coming, right? >> Yeah, right. And then, so you guys have done a big seed round. You've raised a little over $6 million and you got the product-market fit down. You're ready to rock, right? >> Yeah, that's right. So we're doing a launch probably, well, when this airs it'll probably be the day before this airs. Basically, yeah. We've got people... Like literally in the last, I'd say, six to eight weeks, it's just been this sort of peak of interest. All of a sudden, everyone kind of gets what we're doing, realizes they need it, and we've got a solution that seems to meet expectations. So it's like... It's been an amazing... Let me just say this, it's been an amazing start to the year. I mean, at the same time, it's been really difficult for us but more difficult for some other people that haven't been able to go to work over the last couple of weeks and so on. But it's been a good start to the year, at least for our business. So... >> Well, Larry, congratulations on getting the company off the ground and thank you so much for coming on theCUBE and being part of the Virtual Vertica Big Data Conference. >> Thank you very much.
>> All right, and thank you everybody for watching. This is Dave Vellante for theCUBE. Keep it right there. We're covering wall-to-wall Virtual Vertica BDC. You're watching theCUBE. (upbeat music)

Published Date : Mar 31 2020


Atri Basu & Necati Cehreli | Root Cause as a Service - Never dig through logs again


 

(upbeat music) >> Okay, we're back with Atri Basu, who is Cisco's resident philosopher, who also holds a master's in computer science. We're going to have to unpack that a little bit. And Necati Cehreli, who's technical lead at Cisco. Welcome, guys. Thanks for coming on theCUBE. >> Happy to be here. >> Thanks a lot. >> All right, let's get into it. We want you to explain how Cisco validated the Zebrium technology and the proof points that you have that it actually works as advertised. So first Atri, first tell us about Cisco TAC. What does Cisco TAC do? >> So TAC, which is an acronym for Technical Assistance Center, is Cisco's support arm, the support organization. And at the risk of sounding like I'm spouting a corporate line, the easiest way to summarize what TAC does is provide world-class support to Cisco customers. What that means is we have about 8,000 engineers worldwide and any of our Cisco customers can either go on our web portal or call us to open a support request. And we get about 2.2 million of these support requests a year. And what these support requests are, are essentially the customer will describe something that they need done, some networking goal that they have that they want to accomplish. And then it's TAC's job to make sure that that goal does get accomplished. Now, it could be something like they're having trouble with an existing network solution and it's not working as expected, or it could be that they're integrating with a new solution. They're, you know, upgrading devices, maybe there's a hardware failure, anything really to do with networking support and, you know, the customer's network goals. If they open up a case asking for help, then TAC's job is to respond and make sure the customer's, you know, questions and requirements are met. About 44% of these support requests are usually trivial and, you know, can be solved within a call or within a day. But the rest of TAC cases really involve getting into the network device, looking at logs. It's a very technical role. It's a very technical job. You need to be conversant with network solutions, their designs, protocols, et cetera. >> Wow. So 56% non-trivial. And so I would imagine you spend a lot of time digging through logs. Is that true? Can you quantify that like, you know, every month how much time you spend digging through logs and is that a pain point? >> Yeah, it's interesting you asked that because when we started on this journey to augment our support engineers' workflow with the Zebrium solution, one of the things that we did was we went out and asked our engineers what their experience was like doing log analysis. And the anecdotal evidence was that on average an engineer will spend three out of their eight hours reviewing logs, either online or offline. So what that means is either with the customer live on a WebEx, they're going to be going over logs, network state information, et cetera, or they're going to do it offline, where the customer sends them the logs attached to a, you know, a service request and they review it and try to figure out what's going on and provide the customer with information. So it's a very large chunk of our day. You know, I said 8,000 plus engineers and so three hours a day, that's 24,000 man hours a day spent on log analysis. Now the struggle with logs, or analyzing logs, is that out of necessity, logs are very terse. They try to pack a lot of information in a very little space.
And this is for performance reasons, storage reasons, et cetera, but the side effect of that is they're very esoteric. So they're hard to read if you're not conversant, if you're not the developer who wrote these logs or you aren't doing code deep dives and looking at where this log's getting printed and things like that. It may not be immediately obvious, or even after a little while it may not be obvious, what that log line means or how it correlates to whatever problem you're troubleshooting. So it requires tenure. It requires, you know, like I was saying before, it requires a lot of knowledge about the protocol, what's expected, because when you're doing log analysis what you're really looking for is a needle in a haystack. You're looking for that one anomalous event, that single thing that tells you this shouldn't have happened, and this was a problem, right? Now doing that kind of anomaly detection requires you to know what is normal. It requires knowing, you know, what the baseline is. And that requires a very in-depth understanding of, you know, the state changes for that network solution or product. So it requires tenure and expertise to do well. And it takes a lot of time even when you have that kind of expertise. >> Wow. So thank you, Atri. And Necati, that's almost two days a week for a technical resource. That's not inexpensive. So what was Cisco looking for to sort of help with this and how'd you stumble upon Zebrium? >> Yeah, so, we have our internal automation system, which has been running for more than a decade now. And what happens is, when a customer attaches a log bundle or diagnostic bundle to the service request, we take that from the SR, we analyze it, and we present some kind of information, you know, it can be alerts or some tables, some graphs, to the engineer, so they can, you know, troubleshoot this particular issue. This is an incredible system, but it comes with its own challenges around maintenance to keep it up to date and relevant with Cisco's new products or a new version of a product, new defects, new issues, and all kinds of things. And what I mean by those challenges is, let's say Cisco comes up with a product today. We need to come together with those engineers. We need to figure out how this bundle works, how it's structured. We need to select individual logs which are relevant, and then start modeling these logs and getting some values out of those logs, using parsers or some regexes, to come to a level where we can consume the logs. And then people start writing rules on top of that abstraction. So people can say, in this log I'm seeing this value together with this other value in another log, so maybe I'm hitting this particular defect. So that's how it works. And if you look at it, that abstraction can fail the next time. In the next release, when the developer decides to change that log line for which you wrote that regex, or they come up with a new version where they completely change the services or processes, then whatever you have written needs to be re-written for the new service. And we see that a lot with products, like for instance, WebEx, where you have a very short release cycle, so things can change maybe the next week with a new release. So whatever you are writing, especially that abstraction and those rules, may not be relevant with that new release. That being said, we have an incredible rule creation process and governance process around it, which starts with maybe a defect.
And then it takes it to a level where we have an automation in place. But if you look at it, this really ties to human bandwidth. And our engineers are really busy working on, you know, customer-facing issues daily, and sometimes creating new rules or these parsers is not their biggest priority, so it can be delayed a bit. So we have this delay between a new issue being identified and the point where we have the automation to detect it the next time some customer faces it. So with all these questions and with all these challenges in mind, we started looking into ways of how we can actually automate this automation. These things that we are doing manually, how can we move them a bit further and automate? And we actually had a couple of things in mind that we were looking for, one of them being that this has to be product agnostic. Like, if Cisco comes up with a product tomorrow, I should be able to take its logs, without writing, you know, complex regexes, parsers, whatever, and deploy it into this system. So it can embrace our logs and make sense of them. And we wanted this platform to be unsupervised. So none of the engineers need to create rules, you know, label logs, this is bad, this is good, or train the system, which requires a lot of computational power. And the other most important thing for us was we wanted this to be not noisy at all, because what happens with noise is, when your level of false positives is really high, your engineers start ignoring the good things within that noise. So the next time, you know, they start thinking that this thing will not be relevant. So we wanted something with a lot less noise. And ultimately we wanted this new platform or new framework to be easily adaptable to our existing workflow. So this is where we started. We started looking, you know, first of all internally, at whether we could build this thing, and also started researching, and we came upon Zebrium, actually a presentation by Larry, one of the co-founders of Zebrium, where he clearly explained why this is different, how this works, and it immediately clicked, and we said, okay, this is exactly what we were looking for. We dove deeper. We checked the blog posts, where the Zebrium guys really explain everything very clearly. They're really open about it. And most importantly, there is a button in their system. What happens usually with AI/ML vendors is they have this button where you fill in your details and a sales guy calls you back and, you know, explains the system. Here, they were like, this is our trial system. We believe in the system; you can just sign up and try it yourself. And that's what we did. We took one of our Cisco DNA Center wireless platforms. We started streaming logs out of it, and then we synthetically, you know, introduced errors, like we broke things. And then we realized that Zebrium was really catching the errors perfectly. And on top of that, it was really quiet unless you were really breaking something. And the other thing we realized during that first trial was that Zebrium was actually bringing a lot of context on top of the logs. During those failures, we worked with a couple of technical leaders and they said, "Okay, if this failure happens, I'm expecting this individual log to be there." And we found out with Zebrium that apart from that individual log there were a lot of other things which gave a bit more context around the root cause, which was great. And that's where we wanted to take it to the next level. Yeah. >> Okay.
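To illustrate the brittleness Necati just described, consider a hand-maintained detection rule of the sort a legacy log-automation pipeline relies on. The log lines, regex, and defect ID below are all invented; the point is only that a harmless wording change in a new release silently defeats the rule.

```python
# Hypothetical hand-written rule: match a known defect signature in a
# product's logs. Everything here (messages, pattern, defect ID) is made up.
import re

RULE = re.compile(r"fan (\d+) speed below threshold \((\d+) rpm\)")

old_release_line = "fan 2 speed below threshold (1200 rpm)"   # matches
new_release_line = "fan[2] rpm 1200 under min threshold"      # same fault, new wording

for line in (old_release_line, new_release_line):
    m = RULE.search(line)
    if m:
        print(f"defect CSCxx11111 suspected: fan={m.group(1)}, rpm={m.group(2)}")
    else:
        # The rule silently misses the reworded message; someone now has
        # to notice the gap and rewrite the regex for the new release.
        print(f"no rule matched: {line!r}")
```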
So, you know, a couple things to unpack there. I mean, you have the dartboard behind you, which is kind of interesting, 'cause a lot of times it's like throwing darts at the board to try to figure this stuff out. But to your other point, Cisco actually has some pretty rich tools with AppD, doing observability, and you've made acquisitions like ThousandEyes. And like you said, I'm presuming you got to eat your own dog food or drink your own champagne. And so you've got to be tools agnostic. And when I first heard about Zebrium, I was like, wait a minute. Really? I was kind of skeptical. I've heard this before. You're telling me all I need is plain text and a timestamp, and you've got my problem solved. So, and I understand that you guys said, okay, let's run a POC. Let's see if we can cut that from, let's say, two days a week down to one day a week. In other words, 50%, let's see if we can automate 50% of the root cause analysis. And so you funded a POC. How did you test it? You put, you know, synthetic, you know, errors and problems in there, but how did you test that it actually works, Necati? >> Yeah. So we wanted to take it to the next level, which means that we wanted to back-test it with existing SRs. And we decided, you know, we chose four different products from four different verticals: data center, security, collaboration, and enterprise networking. And we found SRs where the engineer had put some kind of log in the resolution summary. So they closed the case, and in the summary of the SR, they put "I identified these log lines and they led me to the root cause," and we ingested those log bundles. And we tried to see if Zebrium could surface that exact same log line in its analysis. So we initially did it ourselves, and after 50 tests or so we were really happy with the results. I mean, in almost all of them we saw the log line that we were looking for, but that was not enough. And we brought it of course to our management and they said, "Okay, let's try this with real users," because the log being there is one thing, but the engineer reaching that log is another thing. So we wanted to make sure that when we put it in front of our users, our engineers, they could actually come to that log themselves, because, you know, we know this platform so we can, you know, make searches and find whatever we are looking for, but we wanted to validate that. So we extended our pilots to some selected engineers and they tested with their own SRs, and also did some back-testing for some SRs which were closed in the past or recently. And with a sample set of, I guess, close to 200 SRs, we found that the majority of the time, almost 95% of the time, the engineer could find the log they were looking for in Zebrium's analysis. >> Yeah. Okay. So you were looking for 50%, you got 95%. And my understanding is you actually did it with four pretty well known Cisco products: WebEx client, DNA Center, Identity Services Engine (ISE), and then UCS, Unified Computing System. So you used actual real data and that was kind of your proof point, but Atri, so that sounds pretty impressive. And have you put this into production now and what have you found? >> Well, yes, we've launched this with the four products that you mentioned. We're providing our TAC engineers with the ability, whenever a support bundle for one of those products gets attached to the support request, we process it and then provide that analysis to the TAC engineer for their review. >> So are you seeing the results in production?
I mean, are you actually able to reclaim that time that people are spending? I mean, it was literally almost two days a week down to, you know, a part of a day. Is that what you're seeing in production, and what are you able to do with that extra time? And people getting their weekends back? Are you putting 'em on more strategic tasks? How are you handling that? >> Yeah. So what we're seeing is, and I can tell you from my own personal experience using this tool, that troubleshooting any one of these cases, I don't take more than 15 to 20 minutes to go through the Zebrium report. And I know within that time either what the root cause is, or I know that Zebrium doesn't have the information that I need to solve this particular case. So we've definitely seen, well, it's been very hard to measure exactly how much time we've saved per engineer, right? Again, anecdotally, what we've heard from our users is that out of those three hours that they were spending per day, we're definitely able to reclaim at least one of those hours. And even more importantly, you know, the kind of feedback that we've gotten... I think one statement that really summarizes how Zebrium's impacted our workflow was from one of our users. And they said, "Well, you know, until you provided us with this tool, log analysis was a very black and white affair, but now it's become really colorful." And I mean, if you think about it, log analysis is indeed black and white. You're looking at it on a terminal screen where the background is black and the text is white, or you're looking at it as text where the background is white and the text is black. But what they're really trying to say is there are hardly any visual cues that help you navigate these logs, which are so esoteric, so dense, et cetera. But what Zebrium does is it provides a lot of color and context to the whole process. So now you're able to quickly get to, you know, using their word cloud, using their interactive histogram, using the summaries of every incident... You're very quickly able to summarize what might be happening and what you need to look into. Like, what are the important aspects of this particular log bundle that might be relevant to you? So we've definitely seen that. A really great use case that kind of encapsulates all of this came very early on in our experiment. There was this support request that had been escalated to the business unit or the development team. And the TAC engineer really had an intuition about what was going wrong because of their experience, because of, you know, the symptoms that they'd seen. They kind of had an idea, but they weren't able to convince the development team because they weren't able to find any evidence to back up what they thought was happening. And it was entirely happenstance that I happened to pick up that case and did an analysis using Zebrium. And then I sat down with the TAC engineer, and very quickly, within 15 minutes, we were able to get down to the exact sequence of events that provided evidence of what the TAC engineer thought was the root cause. And then we were able to share that evidence with our business unit and, you know, redirect their resources so that we could chase down what the problem was. And that really shows you how that color and context helps in log analysis. >> Interesting.
You know, we do a fair amount of work in theCUBE in the RPA space, robotic process automation, and the narrative in the press when RPA first started taking off was, oh, it's, you know, machines replacing humans, or we're going to lose jobs. And what actually happened was people were just eliminating mundane tasks, and the employees were actually very happy about it. But my question to you is, was there ever a reticence amongst your team? Like, oh, wow, I'm going to lose my job if the machine's going to replace me? Or have you found that people were excited about this? What's been the reaction amongst the team? >> Well, I think, you know, every automation and AI project has that immediate gut reaction of you're automating away our jobs and so forth. And initially there's a little bit of reticence, but I mean, it's like you said, once you start using the tool, you realize that it's not your job that's getting automated away. It's just that your job's becoming a little easier to do, and it's faster and more efficient. And you're able to get more done in less time. That's really what we're trying to accomplish here. At the end of the day, Zebrium will identify these incidents. They'll do the correlation, et cetera. But if you don't understand what you're reading, then that information's useless to you. So you need the human, you need the network expert, to actually look at these incidents. But what we are able to skim away, or get rid of, is all the fat that's involved in our process, like having to download the bundle, which, you know, when it's many gigabytes in size, and now we're working from home with the pandemic and everything, you're, you know, pulling massive amounts of logs from the corporate network onto your local device, that takes time, and then opening it up, loading it in a text editor, that takes time. All of these things we're trying to get rid of. And instead we're trying to make it easier and quicker for you to find what you're looking for. So it's like you said, you take away the mundane, you take away the difficulties and the slog, but you don't really take away the work; the work still needs to be done. >> Yeah, great. Guys, thanks so much, appreciate you sharing your story. It's quite fascinating. Really. Thank you for coming on. >> Thanks for having us. >> You're very welcome. >> Excellent. >> Okay. In a moment, I'll be back to wrap up with some final thoughts. This is Dave Vellante and you're watching theCUBE. (upbeat music)

Published Date : May 25 2022


Larry Lancaster & Rod Bagg


 

(bright intro music) >> Full stack observability is all the rage today. As businesses lean in to digital, customer experience becomes ever more important. Why? Well, it's obvious. Fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action, and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades-old software companies that have traditionally done monitoring or log analytics and/or application performance management stepping up their game. These established players, you know, they typically have deep feature sets and sometimes purpose-built tools that attack one particular segment of the marketplace, and now, they're pivoting through M&A and some organic development trying to fill gaps in their portfolio, and then you got all these new entrants coming to the market claiming end-to-end visibility across the so-called modern cloud and now edge-native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling combined with strong ecosystems to address this problem, but, you know, recent survey research from ETR confirms our thesis that no one company has it all. Here's the thing. Customers just want to figure out the root cause as quickly and efficiently as possible. It's one thing to observe the stack end to end, but the question is who is automating the observer? And that's why we're here today. Hello, my name is Dave Vellante, and welcome to this special "CUBE" presentation where we dig into root cause analysis and, specifically, how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes and delivering it as an automated service. In this session, we have two deep dives. First, we're going to dig into this exciting new field of RCA, root cause as a service, with two of the founders and technical experts behind Zebrium, and then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well established and well-known Cisco product lines including Webex client and UCS. I was pretty amazed at the results, and I think you'll be impressed as well. So thanks for being here. Let's get started. With me right now is Larry Lancaster, who's a founder and CTO of Zebrium, and he's joined by Rod Bagg, who's a founder and Vice-President of Engineering at the company. Gents, welcome, thanks for coming on. >> Thanks. >> (indistinct). >> To be here. >> Great to be here. >> All right, Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO, what do we need to know? >> Yeah, I mean, it's like you said. I mean, it's extremely important to our customers and to most businesses out there to drive uptime and avoid as much downtime as possible. So, you know, when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running, you know, on the web, and that's how you engage with it, point and click, or their business depends on it and, you know, internal systems to drive their business and to run it. Now, when that is down, that is hugely impactful to them.
So if you take a look, you know, way back, you know, 20, 30 years ago, software was simple. You know, there wasn't much to it. It was pretty monolithic, and maybe it took a couple of people to maintain it and keep it running. There wasn't really anything complicated about it. It was a single-tenant piece of software. Today's software is so complicated, often running, you know, maybe hundreds of services to actually implement what that software is doing. So as you point out, you know, enter the sort of observability space and the tools that are now in use to help monitor that software and make sure when something goes wrong, they know about it, but there's kind of an interesting stat around the observability space. So when you look at observability in the context or through the lens of the cost of downtime, it's really interesting. So observability tools are about a $20 billion market, okay? But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. So it's all great to know that something went wrong, but you've got to know why, and it's our contention here that, you know, really, when you take a look at the observability space, you have metrics. That's a great tool. I mean, there's lots of great tools out there, you know, around metrics monitoring that's going to tell you when something went wrong. It's very rarely going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where, you know, where it's happening or where something is running slow potentially. So that's great, but again, the root cause of why it's happening is going to be buried in log files, and I can expand on that a little bit more, but, you know, when you're a software developer, and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing or why it went wrong, and it's that that really gets you to the root cause very fast, and that's, our contention is that these software systems are so complex nowadays, and that the root cause is lying in those logs. So how do you get there fast? You know, we would contend that you'd better automate that or you're just doomed for failure, and that's where we come in. >> Great. >> Getting to that root cause.
You can observe the internal state, and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it is automating the observer. Now, of course, it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience, that's all true, but it's important to step back and realize like we have cracked a real nut here. People have been trying to figure out how to automate this part of sort of the troubleshooting experience, this human part of finding the root cause indicators for a long time, and until Zebrium came along, I would argue no one's really done it right. So, you know, I think it's also important, you know, as we step back, we can probably look forward five to 10 years and say, "Everyone's going to look back and say, 'How did we do all this manually?'" You're going to see this sort of last mile of observability and troubleshooting is going to be automated everywhere because otherwise, you know, people are just, they're not going to be able to scale their business. So, you know, I think one more thing that's important to point out is, you know, I think Zebrium, you know, it's one thing to have the technology, but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So, you know, we're looking at, you know, if you look at Zebrium, you'll put us on your dashboard, and we don't care what kind of a dashboard it is. It could be, you know, Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff, yeah. >> Yeah, so interesting. Again, another analogy I think about, you know, you talked about automation, where we'll look back and say, "This is what- We're never going to do this again." It's like provisioning LANs. Nobody provisions LANs anymore. It's all automated. >> That's correct. >> So, Larry, stay with you. The skeptic in me says, "This sounds amazing," but, you know, it's probably too good to be true. Tell us how it works. >> Yeah, so that's interesting. So Cisco came along and they were equally skeptical. So what they did was they took a couple of months, and they did a very detailed study, and they got together 192 incidents across four product lines where they knew that the root cause was in the logs, and they knew what that root cause was because they'd had their best engineers, you know, work on those cases and take detailed notes of the incidents that had taken place, and so they ran that data through the Zebrium software, and what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like that blew us away. When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like, you know, it was like the Apollo Command Center, you know, when they finally, (Dave laughs) you know, touched down on the moon kind of thing. So, you know, it's a really exciting point in time to be at the company, like just seeing everything finally being proven out according to this vision.
I'm going to tell you one more story, which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper-modern, you know, SaaS business. They're an S3 competitor. Zoom has their files stored on Lyve Cloud, to give you, you know, a sense of who they are. So, essentially, what happened was they were in alpha, in their early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored, and it was, you know, fortunately, for them, it was early access. So no one was expecting, you know, uptime, you know, service level objectives and so on, but they were scared because they realized if something like this happens in production, you know, they're screwed. So what they did was they saw Zebrium, they did some research, they saw Zebrium. They went into a staging environment, recreated the exact (indistinct) that they'd had, and what they saw was, immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're onto something transformational. >> Yeah, that's great. I mean, you guys are jumping up and down. I'm sure we're going to hear from Cisco later. I bet you they were jumping up and down, too, 'cause they didn't have to do all that heavy lifting anymore. So Rod, Larry's just sort of implying, or, actually, you guys both talked about, that your tool's agnostic. So how does one actually use the service? How do I deploy it? >> Yeah, so let me step back. So when we talk about logs, right? Like, you know, all these breadcrumbs being in logs and everything else. So, you know, they are a great wealth of, you know, information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, you know, we've heard from several of our customers now that prior to using Zebrium, when they, you know, have some issue, and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metric has, you know, taken a blip or something's happened that they know there's a problem, we've heard from them that it can take like a number of hours just to get to the right set of logs, like figuring out over these hundreds of services where the logs are to get to them, maybe searching in a log manager, just to get into the right context even can take hours. So, you know, that's obviously the problem we solve, but, you know, we don't want them just looking at logs. I mean, you know, we don't want to put 'em back in the thing they don't like doing 'cause people don't do what they don't like doing. So we put it up on the dashboard. So if something is going wrong with your metrics, and that's the indicator or maybe it's something with tracing that you're sort of digging through now that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs. You click on one of our integrations, and we integrate with all these tools that Larry's talked about, and when we detect anything, a root cause report will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong, and you know there's an issue, take a look at the portion of your dashboard that is us, and we're going to tell you why.
We're going to get you to the why of what went wrong. No other work should be needed- You can, you know, also click down and click through to us so that you land in our portal if you want to do some more digging around if you need to or whatever, maybe to get some context, what have you, but it's rare that you ever need to do that. The answer should be right there on your dashboard, and that's how we expect people to use it. We don't want them digging in logs and going through things. We want it to be right in their workflow. >> Great, thank you, Larry. So Rod, we talked about Cisco. We're going to hear more from them in a moment, and Seagate. I would think this is like a perfect solution for a SaaS provider, anybody doing AIOps, do you have some examples of those types of firms leaning into this? >> Yeah, a couple of great, well, I mean, we got many of them, but a couple that I'll touch on. We have an actual AIOps company that was looking for, you know, sort of some complementary technology and so on, and so they decided to just put us through our paces by having one of their own SREs sign up for our service in our SaaS environment and send the logs from their system to us, you know, and just see how we did. So it turned out we ended up talking back to this SRE like a week after he had installed the product, you know, signed up, and then, you know, started sending us logs, and, you know, he was hemming and hawing saying that he was busy like, you know, like every SRE is, and that he didn't have a chance to really do much with us yet, and, you know, we just, you know, having this conversation on the phone, and he comes to tell us that, "Yeah, I've been busy because we had this, you know, terrible outage like, you know, five days ago," and we said like, "Okay, did you actually look on the Zebrium dashboard?" (laughs) And he goes, "You know what? I didn't even think to do it yet. I mean, I'd just been so busy and frazzled." So we have an integration with that company. He hadn't put that integration in so it wasn't in his dashboard yet, but it was certainly on ours. So he went there and he looks on the day like, you know, on the time range of when he had this incident, and right at the very top of the page on our portal was the incident with the root cause, and he was flabbergasted. It literally would've saved him hours and hours and hours. They had this issue going on for over 24 hours, and we had the answer right there in five minutes, and it was crazy, and we get that kind of story. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it, and you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. As for SaaS companies, you know, one of our customers is Archera. They do cost optimizations for cloud properties, you know, for AWS optimization, Google Cloud, and so on, but they use our software, and they have a lot of interaction, obviously, with these cloud vendors and the APIs of those cloud vendors. So, you know, in order to figure out your costing at AWS, they're using all those APIs. So it turned out, you know, they had some issue where their services were breaking and we had that root cause report right on the screen, again, within five minutes, and it was pointing to an API problem with Google; Google had changed one of their APIs, and Archera was not aware of it.
So their stuff was breaking because of a change downstream that we had caught, and I'll just tell you one last one because it's somewhat related to one of these cloud vendors, you know, a big cloud vendor who had an outage a couple of months ago, and it's interesting because, you know, a lot of our customers will set up shared Slack channels with us where we're monitoring or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them or the root cause that we've detected for them, and that's in a shared community channel. So we could see this happening when that AWS outage happened. We could see our customers getting impacted by that AWS outage and the root cause of what was going on there in AWS that was impacting our customers, that was showing up in our incidents. Now, we didn't obviously, you know, have the very root cause of what was going on in AWS per se, but we were getting to the root cause of why our customers' applications were failing, and that was because of issues going on at AWS. >> Very interesting. I mean, I think one of your biggest challenges is going to be getting people's attention because these SREs are so busy, their hair's on fire. (all laugh) You know, he's like, "Hey, chap, I'm going to show you, look at this." >> I tell you. You get their attention, they love it. I mean, this AIOps company, I didn't even tell you the punchline there, but, you know, they had this incident that occurred that we found and, quite literally, the next week, they ended up signing up as a paid customer, so. >> That's great. And Larry, I'll give you the last word. I mean, you know, Rod was talking about, you know, changes in APIs, and, you know, there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs and people don't want it to leave. >> That's right, absolutely. >> But, yeah, close it out here. >> Yeah, I mean, you can, that's right, you can run it on-prem, just like we run it in our cloud. You can run it in your cloud or on your own infrastructure. Now, that's all true. You know, I think the one hurdle now that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. You don't believe it? Do a POC, try it yourself. And, you know, people have become so jaded by the lack of, you know, real sort of innovation in the software industry for the last 10 years that it's hard to get people to... But guys, you've got to give it a shot. I'm telling you. I'm telling you right now, it works, and you'll hear more about that from one of our customers in a minute. >> Alright guys, thanks so much. Great story, really appreciate you sharing. >> Thank you. >> Yeah, thanks, Dave. Appreciate the time. >> Okay, in a moment, we're going to hear from Cisco who is the customer in this case example, and a company that is... Look, they have quite an impressive suite of observability tooling, and they've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of like Webex. So stay tuned and learn about how you can really take advantage of this new technology called root cause as a service. You're watching "theCUBE", the leader in enterprise and emerging tech coverage. (bright outro music)

Published Date : May 25 2022


Keynote Analysis | Virtual Vertica BDC 2020


 

(upbeat music) >> Narrator: It's theCUBE, covering the Virtual Vertica Big Data Conference 2020. Brought to you by Vertica. >> Dave Vellante: Hello everyone, and welcome to theCUBE's exclusive coverage of the Vertica Virtual Big Data Conference. You're watching theCUBE, the leader in digital event tech coverage. And we're broadcasting remotely from our studios in Palo Alto and Boston. And, we're pleased to be covering wall-to-wall this digital event. Now, as you know, originally BDC was scheduled this week at the new Encore Hotel and Casino in Boston. Their theme was "Win big with big data". Oh sorry, "Win big with data". That's right, got it. And, I know the community was really looking forward to that, you know, meet up. But look, we're making the best of it, given these uncertain times. We wish you and your families good health and safety. And this is the way that we're going to broadcast for the next several months. Now, we want to unpack Colin Mahony's keynote, but, before we do that, I want to give a little context on the market. First, theCUBE has covered every BDC since its inception, since the BDC's inception that is. It's a very intimate event, with a heavy emphasis on user content. Now, historically, the data engineers and DBAs in the Vertica community, they comprised the majority of the content at this event. And, that's going to be the same for this virtual, or digital, production. Now, theCUBE is going to be broadcasting for two days. What we're doing, is we're going to be concurrent with the Virtual BDC. We've got practitioners that are coming on the show, DBAs, data engineers, database gurus, we've got security experts coming on, and really a great lineup. And, of course, we'll also be hearing from Vertica Execs, Colin Mahony himself right off the keynote, folks from product marketing, partners, and a number of experts, including some from Micro Focus, which is the, of course, owner of Vertica. But I want to take a moment to share a little bit about the history of Vertica. The company, as you know, was founded by Michael Stonebraker. And, Vertica started, really they started out as a SQL platform for analytics. It was the first, or at least one of the first, to really nail the MPP column store trend. Not only did Vertica have an early mover advantage in MPP, but the efficiency and scale of its software, relative to traditional DBMS, and also other MPP players, is underscored by the fact that Vertica, and the Vertica brand, really thrives to this day. But, I have to tell you, it wasn't without some pain. And, I'll talk a little bit about that, and really talk about how we got here today. So first, you know, you think about traditional transaction databases, like Oracle or IBM DB2, or even enterprise data warehouse platforms like Teradata. They were simply not purpose-built for big data. Vertica was. Along with a whole bunch of other players, like Netezza, which was bought by IBM, Aster Data, which is now Teradata, Actian, ParAccel, which was the basis for Redshift, Amazon's Redshift, Greenplum was bought, in the early days, by EMC. And, these companies were really designed to run as massively parallel systems that smoked traditional RDBMS and EDW for particular analytic applications. You know, back in the big data days, I often joked that, like an NFL draft, there was a run on MPP players, like when you see a run on pulling guards. You know, once one goes, they all start to fall. And that's what you saw with the MPP columnar stores, IBM, EMC, and then HP getting into the game.
So, it was like 2011, and Leo Apotheker, he was the new CEO of HP. Frankly, he had no clue, in my opinion, what to do with Vertica, and totally missed one of the biggest trends of the last decade, the data trend, the big data trend. HP picked up Vertica for a song, it wasn't disclosed, but my guess is that it was around 200 million. So, rather than build a bunch of smart tokens around Vertica, which I always call the diamond in the rough, Apotheker basically permanently altered HP for years. He kind of ruined HP, in my view, with a 12 billion dollar purchase of Autonomy, which turned out to be one of the biggest disasters in recent M&A history. HP was forced to spin-merge, and ended up selling most of its software to Microsoft, Micro Focus. (laughs) Luckily, during its time at HP, CEO Meg Whitman was largely distracted with what to do with the mess that she inherited from Apotheker. So, Vertica was left alone. Now, the upshot is Colin Mahony, who was then the GM of Vertica, and still is. By the way, he's really the CEO, and he just doesn't have the title, I actually think they should give that to him. But anyway, he's been at the helm the whole time. And Colin, as you'll see in our interview, is a rockstar, he's got technical and business chops, people love him in the community. Vertica's culture is really engineering-driven and they're all about data. Despite the fact that Vertica is a 15-year-old company, they've really kept pace, and not been polluted by legacy baggage. Vertica, early on, embraced Hadoop and the whole open-source movement. And that helped give it tailwinds. It leaned heavily into cloud, as we're going to talk about further this week. And they got a good story around machine intelligence and AI. So, whereas many traditional database players are really getting hurt, and some are getting killed, by cloud database providers, Vertica's actually doing a pretty good job of servicing its install base, and is in a reasonable position to compete for new workloads. On its last earnings call, the Micro Focus CEO, Stephen Murdoch, he said they're investing 70 to 80 million dollars in two key growth areas, security and Vertica. Now, Micro Focus is running its SUSE play on these two parts of its business. What I mean by that, is they're investing and allowing them to be semi-autonomous, spending on R&D and go to market. And, they have no hardware agenda, unlike when Vertica was part of HP, or HPE, I guess HP, before the spin out. Now, let me come back to the big trend in the market today. And there's something going on around analytic databases in the cloud. You've got companies like Snowflake and AWS with Redshift, as we've reported numerous times, and they're doing quite well, they're gaining share, especially of new workloads that are emerging, particularly in the cloud native space. They combine scalable compute, storage, and machine learning, and, importantly, they're allowing customers to scale compute and storage independent of each other. Why is that important? Because you don't have to buy storage every time you buy compute, or vice versa, in chunks. So, if you can scale them independently, you've got granularity. Vertica is keeping pace. In talking to customers, Vertica is leaning heavily into the cloud, supporting all the major cloud platforms, as we heard from Colin earlier today, adding Google.
And, while my research shows that Vertica has some work to do in cloud and cloud native, to simplify the experience, it's Vertica's more robust, mature stack, which supports many different environments, you know, deep SQL, ACID properties, and DNA, that allows Vertica to compete with these cloud-native database suppliers. Now, Vertica might lose out in some of those native workloads. But, I have to say, my experience in talking with customers, if you're looking for a great MPP column store that scales and runs in the cloud, or on-prem, Vertica is in a very strong position. Vertica claims to be the only MPP columnar store to allow customers to scale compute and storage independently, both in the cloud and in hybrid environments on-prem, et cetera, across clouds, as well. So, while Vertica may be at a disadvantage in a pure cloud native bake-off, its more robust, mature stack, combined with its multi-cloud strategy, gives Vertica a compelling set of advantages. So, we heard a lot of this from Colin Mahony, who announced Vertica 10.0 in his keynote. He really emphasized Vertica's multi-cloud affinity, its Eon Mode, which really allows that separation, or scaling of compute, independent of storage, both in the cloud and on-prem. Vertica 10, according to Mahony, is making big bets on in-database machine learning, he talked about that and AI, along with some advanced regression techniques. He talked about PMML models, Python integration, which was actually something that they talked about doing with Uber and some other customers. Now, Mahony also stressed the trend toward object stores. And, Vertica now supports, let's see, S3 with Eon in Google Cloud, in addition to AWS, and then Pure and HDFS as well; they all support Eon Mode. Mahony also stressed, as I mentioned earlier, a big commitment to on-prem and the whole cloud optionality thing. So 10.0, according to Colin Mahony, is all about really doubling down on these industry waves. As they say, enabling native PMML models, running them in Vertica, and really doing all the work that's required around ML and AI, they also announced support for TensorFlow. So, object store optionality is important, is what he talked about in Eon Mode, with the news of support for Google Cloud, as well as HDFS. And finally, a big focus on deployment flexibility. Migration tools, which are a critical focus really on improving ease of use, and you hear this from a lot of customers. So, these are the critical aspects of Vertica 10.0, and an announcement that we're going to be unpacking all week, with some of the experts that I talked about. So, I'm going to close with this. My long-time co-host, John Furrier, and I have talked for some time about this new cocktail of innovation. No longer is Moore's law the, really, mainspring of innovation. It's now about taking all these data troves, bringing machine learning and AI into that data to extract insights, and then operationalizing those insights at scale, leveraging cloud. And, one of the things I always look for from cloud is, if you've got a cloud play, you can attract innovation in the form of startups. It's part of the success equation, certainly for AWS, and I think it's one of the challenges for a lot of the legacy on-prem players. Vertica, I think, has done a pretty good job in this regard. And, you know, we're going to look this week for evidence of that innovation. One of the interviews that I'm personally excited about this week, is a new-ish company, I would consider them a startup, called Zebrium.
What they're doing is they're applying AI to do autonomous log monitoring for IT ops. And, I'm interviewing Larry Lancaster, who's their CTO, this week, and I'm going to press him on why he chose to run on Vertica and not a cloud database. This guy is a hardcore tech guru and I want to hear his opinion. Okay, so keep it right there, stay with us. We're all over the Vertica Virtual Big Data Conference, covering in-depth interviews and following all the news. So, theCUBE is going to be interviewing these folks, two days, wall-to-wall coverage, so keep it right there. We're going to be right back with our next guest, right after this short break. This is Dave Vellante and you're watching theCUBE. (upbeat music)

Published Date : Mar 31 2020


Autonomous Log Monitoring


 

>> Sue: Hi everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Autonomous Monitoring Using Machine Learning". My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slide and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer them offline. Alternatively, you can also go and visit Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Larry, over to you. >> Larry: Hey, thanks so much. So hi, my name's Larry Lancaster and I'm here to talk to you today about something whose time I think has come, and that's autonomous monitoring. So, with that, let's get into it. So, machine data is my life. I know that's a sad life, but it's true. So I've spent most of my career kind of taking telemetry data from products, either in the field, as we used to call it, or nowadays, deployed, and bringing that data back, like log files and stats, and then building stuff on top of it. So, tools to run the business or services to sell back to users and customers. And so, after doing that a few times, it kind of got to the point where I was really sort of sick of building the same kind of thing from scratch every time, so I figured, why not go start a company and do it so that we don't have to do it manually ever again. So, it's interesting to note, I've put a little sentence here saying, "companies where I got to use Vertica". So I've been actually kind of working with Vertica for a long time now, pretty much since they came out of alpha. And I've really been enjoying their technology ever since. So, our vision is basically that I want a system that will characterize incidents before I notice. So an incident is, you know, we used to call it a support case or a ticket in IT, or a support case in support. Nowadays, you may have a DevOps team, or a set of SREs who are monitoring a production sort of deployment. And so they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened. And so that's a pretty heady goal. And so I'm going to talk a little bit today about how we do that. So, if we look at logs in particular. Logs today, if you look at log monitoring. So monitoring is kind of that whole umbrella term that we use to talk about how we monitor systems in the field that we've shipped, or how we monitor production deployments in a more modern stack. And so basically there are log monitoring tools. But they have a number of drawbacks.
For one thing, they're kind of slow in the sense that if something breaks and I need to go to a log file, actually chances are really good that if you have a new issue, if it's an unknown unknown problem, you're going to end up in a log file. So the problem then becomes basically you're searching around looking for what's the root cause of the incident, right? And so that's kind of time-consuming. So, they're also fragile and this is largely because log data is completely unstructured, right? So there's no formal grammar for a log file. So you have this situation where, if I write a parser today, and that parser is going to do something, it's going to execute some automation, it's going to open or update a ticket, it's going to maybe restart a service, or whatever it is that I want to happen. What'll happen is later upstream, someone who's writing the code that produces that log message, they might do something really useful for me, or for users. And they might go fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks. So it's a very fragile source for automation. And finally, because of that, people will set alerts on, "Oh, well tell me how many thousands of errors are happening every hour." Or some horrible metric like that. And then that becomes the only visibility you have in the data. So because of all this, it's a very human-driven, slow, fragile process. So basically, we've set out to kind of up-level that a bit. So I touched on this already, right? The truth is if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder, if that's the case, why do most people use metrics only for monitoring? And the reason is related to the problems I just described. They're already structured, right? So for logs, you've got this mess of stuff, so you only want to dig in there when you absolutely have to. But ironically, it's where a lot of the information that you need actually is. So we have a model today, and this model used to work pretty well. And that model is called "index and search". And it basically means you treat log files like they're text documents. And so you index them and when there's some issue you have to drill into, then you go searching, right? So let's look at that model. So 20 years ago, we had sort of a shrink-wrap software delivery model. You had an incident. With that incident, maybe you had one customer and you had a monolithic application and a handful of log files. So it's perfectly natural, in fact, usually you could just vi the log file and search that way. Or if there's a lot of them, you could index them and search them that way. And that all worked very well because the developer or the support engineer had to be an expert in those few things, in those few log files, and understand what they meant. But today, everything has changed completely. So we live in a software as a service world. What that means is, for a given incident, first of all you're going to be affecting thousands of users. You're going to have, potentially, 100 services that are deployed in your environment. You're going to have 1,000 log streams to sift through. And yet, you're still kind of stuck in the situation where to go find out what's the matter, you're going to have to search through the log files. So this is kind of the unacceptable sort of position we're in today. So for us, the future will not be index and search. And that's simply because it cannot scale.
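To make the fragility point above concrete, here's a minimal sketch in Python. The log messages and the pattern are hypothetical, invented purely for illustration; the failure mode is the point: a parser keyed to the exact wording of a message silently stops alerting the moment a developer rewords that message upstream.

```python
import re

# A hand-written rule keyed to the exact wording of today's log message.
PATTERN = re.compile(r"ERROR: failed to connect to (\S+) after (\d+) retries")

def check_line(line: str) -> None:
    match = PATTERN.search(line)
    if match:
        host, retries = match.group(1), int(match.group(2))
        print(f"ALERT: connection failure to {host} ({retries} retries)")

# Matches the message the parser author saw when writing the rule:
check_line("2020-03-30 12:01:07 ERROR: failed to connect to db-7 after 5 retries")

# A developer later rewords the message; the alert silently disappears:
check_line("2020-03-30 12:01:07 ERROR: could not connect to db-7 after 5 retries")
```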
And the reason I say that it can't scale is because it all kind of is bottlenecked by a person and their eyeball. So, you continue to drive up the amount of data that has to be sifted through, the complexity of the stack that has to be understood, and you still, at the end of the day, for MTTR purposes, you still have the same bottleneck, which is the eyeball. So this model, I believe, is fundamentally broken. And that's why, I believe in five years you're going to be in a situation where most monitoring of unknown unknown problems is going to be done autonomously. And those issues will be characterized autonomously because there's no other way it can happen. So now I'm going to talk a little bit about autonomous monitoring itself. So, autonomous monitoring basically means, if you can imagine in a monitoring platform and you watch the monitoring platform, maybe you watch the alerts coming from it or more importantly, you kind of watch the dashboards and try to see if something looks weird. So autonomous monitoring is the notion that the platform should do the watching for you and only let you know when something is going wrong and should kind of give you a window into what happened. So if you look at this example I have on screen, just to take it really slow and absorb the concept of autonomous monitoring. So here in this example, we've stopped the database. And as a result, down below you can see there was a bunch of fallout. This is an Atlassian stack, so you can imagine you've got a Postgres database. And then you've got sort of Bitbucket, and Confluence, and Jira, and these various other components that need the database operating in order to function. So what this is doing is it's calling out, "Hey, the root cause is the database stopped and here's the symptoms." Now, you might be wondering, so what. I mean I could go write a script to do this sort of thing. Here's what's interesting about this very particular example, and I'll show a couple more examples that are a little more involved. But here's the interesting thing. So, in the software that came up with this incident and opened this incident and put this root cause and symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, Confluence, there's no regexes that talk about starting, stopped, RDBMS, swallowed exception, and so on and so forth. So you might wonder how it's possible then, that something which is completely ignorant of the stack, could come up with this description, which is exactly what a human would have had to do, to figure out what happened. And I'm going to get into how we do that. But that's what autonomous monitoring is about. It's about getting into a set of telemetry from a stack with no prior information, and understanding when something breaks. And I could give you the punchline right now, which is there are fundamental ways that software behaves when it's breaking. And by looking at hundreds of data sets that people have generously allowed us to use containing incidents, we've been able to characterize that and now generalize it to apply it to any new data set and stack. So here's an interesting one right here. So there's a fella, David Gill, he's just a genius in the monitoring space. He's been working with us for the last couple of months. So he said, "You know what I'm going to do, is I'm going to run some chaos experiments." So for those of you who don't know what chaos engineering is, here's the idea.
So basically, let's say I'm running a Kubernetes cluster and what I'll do is I'll use sort of a chaos injection test, something like Litmus. And basically it will inject issues, it'll break things in my application randomly to see if my monitoring picks it up. And so this is what chaos engineering is built around. It's built around sort of generating lots of random problems and seeing how the stack responds. So in this particular case, David went in and, basically, one of the tests that was presented through Litmus did a pod delete. And so that's going to basically take out some containers that are part of the service layer. And so then you'll see all kinds of things break. And so what you're seeing here, which is interesting, this is why I like to use this example. Because it's actually kind of eye-opening. So the chaos tool itself generates logs. And of course, through Kubernetes, all the log files locations that are on the host, and the container logs are known. And those are all pulled back to us automatically. So one of the log files we have is actually the chaos tool that's doing the breaking, right? And so what the tool said here, when it went to determine what the root cause was, was it noticed that there was this process that had these messages happen, initializing deletion lists, selecting a pod to kill, blah blah blah. It's saying that the root cause is the chaos test. And it's absolutely right, that is the root cause. But usually chaos tests don't get picked up themselves. You're supposed to be just kind of picking up the symptoms. But this is what happens when you're able to kind of tease out root cause from symptoms autonomously, is you end up getting a much more meaningful answer, right? So here's another example. So essentially, we collect the log files, but we also have a Prometheus scraper. So if you export Prometheus metrics, we'll scrape those and we'll collect those as well. And so we'll use those for our autonomous monitoring as well. So what you're seeing here is an issue where, I believe this is where we ran something out of disk space. So it opened an incident, but what's also interesting here is, you see that it pulled that metric to say that the spike in this metric was a symptom of this running out of space. So again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no actual hard-coded logic anywhere to explain any of this. And so the concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it. You'd go looking around for rare things. Things that are not normal. And you would look for indicators of breakage, and you would see, do those seem to be correlated in some dimension? That is how the system works. So as I mentioned a moment ago, metrics really do kind of complete the picture for us. We end up in a situation where we have a one-stop shop for incident root cause. So, how does that work? Well, we ingest and we structure the log files. So if we're getting the logs, we'll ingest them and we'll structure them, and I'm going to show a little bit what that structure looks like and how that goes into the database in a moment. And then of course we ingest and structure the Prometheus metrics. But here, structure really should have an asterisk next to it, because metrics are mostly structured already. They have names.
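As a rough illustration of that metrics ingest step, here's a hedged Python sketch that pulls one scrape from a Prometheus-format /metrics endpoint. The endpoint URL is hypothetical; the parser comes from the prometheus_client library, which, to my knowledge, exposes text_string_to_metric_families for exactly this purpose. Treat it as a sketch of the idea, not as Zebrium's actual scraper.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Hypothetical exporter endpoint; any Prometheus-format /metrics URL works the same way.
METRICS_URL = "http://localhost:9100/metrics"

def scrape(url: str) -> dict:
    """Pull one scrape and flatten it to {(metric_name, labels): value}."""
    text = requests.get(url, timeout=5).text
    samples = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            key = (sample.name, tuple(sorted(sample.labels.items())))
            samples[key] = sample.value
    return samples

snapshot = scrape(METRICS_URL)
print(f"collected {len(snapshot)} samples")
```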
If you have your own scraper, as opposed to going into the time series Prometheus database and pulling metrics from there, you can keep a lot more information about metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data. And then we cross-correlate metrics and log anomalies. And then we create incidents. So this is at a high level, kind of what's happening without any sort of stack-specific logic built in. So we had some exciting recent validation. So Mayadata's a pretty big player in the Kubernetes space. Essentially, they do Kubernetes as a managed service. They have tens of thousands of customers that they manage their Kubernetes clusters for them. And then they're also involved, both in the OpenEBS project, as well as in the Litmus project I mentioned a moment ago. That's their tool for chaos engineering. So essentially, they said, "Oh okay, let's see if this is real." So what they did was they set up our collectors, which took three minutes in Kubernetes. And then they went and they, using Litmus, they reproduced eight incidents that their actual, real-world customers had hit. And they were trying to remember the ones that were the hardest to figure out the root cause at the time. And we picked up and put a root cause indicator that was correct in 100% of these incidents with no training, configuration, or metadata required. So this is kind of what autonomous monitoring is all about. So now I'm going to talk a little bit about how it works. So, like I said, there's no information included or required about, so if you imagine a log file for example. Now, commonly, over to the left-hand side of every line, there will be some sort of a prefix. And what I mean by that is you'll see like a timestamp, or a severity, and maybe there's a PID, and maybe there's a function name, and maybe there's some other stuff there. So basically that's kind of, it's common data elements for a large portion of the lines in a given log file. But you know, of course, the contents change. So basically today, like if you look at a typical log manager, they'll talk about connectors. And what connectors means is, for an application it'll generate a certain prefix format in a log. And that means what's the format of the timestamp, and what else is in the prefix. And this lets the tool pick it up. And so if you have an app that doesn't have a connector, you're out of luck. Well, what we do is we learn those prefixes dynamically with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors, you don't have to describe the prefix format. That's so yesterday, right? So really what we want to be doing is up-leveling what the system is doing to the point where it's kind of working like a human would. You look at a log line, you know what's a timestamp. You know what's a PID. You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system goes about it. As a result, we kind of embrace free-text logs, right?
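Zebrium's actual categorization is learned with machine learning, as Larry says, so the following is only a toy stand-in: a Python sketch that strips a timestamp prefix and masks variable-looking tokens so that different instances of the same event collapse into one event type, then flags the rare types, which, per the talk, are the interesting ones. The example lines are invented.

```python
import re
from collections import Counter

def event_type(line: str) -> str:
    """Toy categorizer: drop a 'date time' prefix and mask variable-looking
    tokens so "checkpoint 17 written in 42 ms" and "checkpoint 18 written in
    40 ms" collapse into a single event type."""
    msg = re.sub(r"^\S+ \S+\s+", "", line)             # drop the timestamp prefix
    msg = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", msg)  # mask hex identifiers
    msg = re.sub(r"\b\d+\b", "<NUM>", msg)             # mask integers
    return msg

lines = [
    "2020-03-30 12:00:01 scrubber: checkpoint 17 written in 42 ms",
    "2020-03-30 12:05:01 scrubber: checkpoint 18 written in 40 ms",
    "2020-03-30 12:07:44 oom-killer: killed pid 9912",
]

counts = Counter(event_type(line) for line in lines)
for etype, n in counts.items():
    flag = "RARE" if n == 1 else "    "  # naive rarity flag; real systems model rates over time
    print(f"{flag} x{n}  {etype}")
```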
So if you look at a typical stack, most of the logs generated in a typical stack are usually free-text. Even structured logging typically will have a message attribute, which then inside of it has the free-text message. For us, that's not a bad thing. That's okay. The purpose of a log is to inform people. And so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. They'll figure it out for themselves, right? So, you give us the logs and we'll figure out the grammar, not only for the prefix but also for the variable message part. So I already went into this, but there's more that's usually required for configuring a log manager with alerts. You have to give it keywords. You have to give it application behaviors. You have to tell it some prior knowledge. And of course the problem with all of that is that the most important events that you'll ever see in a log file are the rarest. Those are the ones that are one out of a billion. And so you may not know what's going to be the right keyword in advance to pick up the next breakage, right? So we don't want that information from you. We'll figure that out for ourselves. As the data comes in, essentially we parse it and we categorize it, as I've mentioned. And when I say categorize, what I mean is, if you look at a certain given log file, you'll notice that some of the lines are kind of the same thing. So this one will say "X happened five times" and then maybe a few lines below it'll say "X happened six times" but that's basically the same event type. It's just a different instance of that event type. And it has a different value for one of the parameters, right? So when I say categorization, what I mean is figuring out those unique types and I'll show an example of that next. Anomaly detection, we do on top of that. So anomaly detection on metrics in a very sort of time series by time series manner with lots of tunables is a well-understood problem. So we also do this on the event types' occurrences. So you can think of each event type occurring in time as sort of a point process. And then you can develop statistics and distributions on that, and you can do anomaly detection on those. Once we have all of that, we have extracted features, essentially, from metrics and from logs. We do pattern recognition on the correlations across different channels of information, so different event types, different log types, different hosts, different containers, and then of course across to the metrics. Based on all of this cross-correlation, we end up with a root cause identification. So that's essentially, at a high level, how it works. What's interesting, from the perspective of this call particularly, is that incident detection needs relationally structured data. It really does. You need to have all the instances of a certain event type that you've ever seen easily accessible. You need to have the values for a given sort of parameter easily, quickly available so you can figure out what's the distribution of this over time, how often does this event type happen. You can run analytical queries against that information so that you can quickly, in real-time, do anomaly detection against new data. So here's an example of what this looks like. And this is kind of part of the work that we've done. At the top you see some examples of log lines, right? So that's kind of a snippet, it's three lines out of a log file. And you see one in the middle there that's kind of highlighted with colors, right?
I mean, it's a little messy, but it's not atypical of the log file that you'll see pretty much anywhere. So there, you've got a timestamp, and a severity, and a function name. And then you've got some other information. And then finally, you have the variable part. And that's going to have sort of this checkpoint for memory scrubbers, probably something that's written in English, just so that the person who's reading the log file can understand. And then there's some parameters that are put in, right? So now, if you look at how we structure that, the way it looks is there's going to be three tables that correspond to the three event types that we see above. And so we're going to look at the one that corresponds to the one in the middle. So if we look at that table, there you'll see a table with columns, one for severity, for function name, for time zone, and so on. And date, and PID. And then you see over to the right with the colored columns there's the parameters that were pulled out from the variable part of that message. And so they're put in, they're typed and they're in integer columns. So this is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline. All right, so let's talk now about Vertica and why we take those tables and put them in Vertica. So Vertica really is an MPP column store, but it's more than that, because nowadays when you say "column store", people sort of think, like, for example Cassandra's a column store, whatever, but it's not. Cassandra's not a column store in the sense that Vertica is. So Vertica was kind of built from the ground up to be... So it's the original column store. So back in the C-Store project at Berkeley that Stonebraker was involved in, he said let's explore what kind of efficiencies we can get out of a real columnar database. And what he found was that, he and his grad students that started Vertica. What they found was that what they can do is they could build a database that gives orders of magnitude better query performance for the kinds of analytics I'm talking about here today. With orders of magnitude less data storage underneath. So building on top of machine data, as I mentioned, is hard, because it doesn't have any defined schemas. But we can use an RDBMS like Vertica once we've structured the data to do the analytics that we need to do. So I talked a little bit about this, but if you think about machine data in general, it's perfectly suited for a columnar store. Because, if you imagine laying out sort of all the attributes of an event type, right? So you can imagine that each occurrence is going to have- So there may be, say, three or four function names that are going to occur for all the instances of a given event type. And so if you were to sort all of those event instances by function name, what you would find is that you have sort of long, million-long runs of the same function name over and over. So what you have, in general, in machine data, is lots and lots of slowly varying attributes, lots of low-cardinality data that is almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And it also, that propagates through the analytical pipeline. Because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right?
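You can see the effect Larry is describing with a few lines of Python: a low-cardinality column that looks incompressible in arrival order collapses to almost nothing once it's sorted, which is what a columnar engine's run-length encoding exploits. The column values here are made up for illustration.

```python
import random
from itertools import groupby

# Simulate a low-cardinality column: one of four function names per event instance.
random.seed(0)
column = [random.choice(["scrub", "flush", "evict", "fsync"]) for _ in range(1_000_000)]

def rle_runs(values):
    """Count the (value, run_length) pairs a run-length encoder would store."""
    return sum(1 for _ in groupby(values))

print("runs in arrival order:", rle_runs(column))          # roughly 750,000 runs
print("runs after sorting:   ", rle_runs(sorted(column)))  # exactly 4 runs
```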
So the scale-out architecture, of course, is really suitable for petascale workloads. Also, I should point out, I was going to mention it in another slide or two, but we use the Vertica Eon architecture, and we have had no problems scaling that in the cloud. It's a beautiful sort of rewrite of the entire data layer of Vertica. The performance and flexibility of Eon is just unbelievable. And so I've really been enjoying using it. I was skeptical you could get a real column store to run in the cloud effectively, but I was completely wrong. So finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support, it has the ODBC drivers, it has the ACID compliance. Which means I don't need to worry about these things as an application developer. So I'm laying out the reasons that I like to use Vertica. So I touched on this already, but essentially what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with Pure Storage that doesn't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actual consistent database. And they do. We've had issues where we've gone and shut down hosts, or hosts have been shut down on us, and we have to restart the database and we don't have any consistency issues. It's unbelievable, the work that they've done. Essentially, another thing that's great about the way it works is you can use the S3 as a shared object store. You can have query nodes kind of querying from that set of files largely independently of the nodes that are writing to them. So you avoid this sort of bottleneck issue where you've got contention over who's writing what, and who's reading what, and so on. So I've found the performance using separate subclusters for our UI and for the ingest has been amazing. Another couple of things that they have is they have a lot of in-database machine learning libraries. There's actually some cool stuff on their GitHub that we've used. One thing that we make a lot of use of is the sequence and time series analytics. For example, in our product, even though we do all of this stuff autonomously, you can also go create alerts for yourself. And one of the kinds of alerts you can do, you can say, "Okay, if this kind of event happens within so much time, and then this kind of an event happens, but not this one," then you can be alerted. So you can have these kinds of sequences that you define of events that would indicate a problem. And we use their sequence analytics for that. So it kind of gives you really good performance on some of these queries where you're wanting to pull out sequences of events from a fact table. And timeseries analytics is really useful if you want to do analytics on the metrics and you want to do gap filling interpolation on that. It's actually really fast in performance. And it's easy to use through SQL. So those are a couple of Vertica extensions that we use. So finally, I would like to encourage everybody, hey, come try us out. Should be up and running in a few minutes if you're using Kubernetes. If not, it's however long it takes you to run an installer. So you can just come to our website, pick it up and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q and A.
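To make the time series piece above concrete, here's a hedged sketch of the kind of gap-filling query Larry is referring to, run from Python with the vertica_python client. The connection details and the metric_samples table are hypothetical; the TIMESERIES clause and TS_FIRST_VALUE interpolation are, as far as I know, standard Vertica SQL, but treat this as a sketch under those assumptions rather than a tested query from the talk.

```python
import vertica_python

# Hypothetical connection details; adjust to your own deployment.
conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "telemetry"}

# Bucket raw samples into 1-minute slices and linearly interpolate across gaps.
GAP_FILL_SQL = """
SELECT metric_name,
       slice_time,
       TS_FIRST_VALUE(value, 'LINEAR') AS value_interp
FROM   metric_samples
TIMESERIES slice_time AS '1 minute'
       OVER (PARTITION BY metric_name ORDER BY sample_ts)
"""

with vertica_python.connect(**conn_info) as conn:
    cursor = conn.cursor()
    cursor.execute(GAP_FILL_SQL)
    for metric_name, slice_time, value in cursor.fetchmany(5):
        print(metric_name, slice_time, value)
```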

Published Date : Mar 30 2020
