Atri Basu & Necati Cehreli | Zebrium Root Cause as a Service


 

>> Okay. We're back with Atri Basu, who is Cisco's resident philosopher, who also holds a master's in computer science. We're going to have to unpack that a little bit. And Necati Cehreli, who's technical lead at Cisco. Welcome, guys. Thanks for coming on theCUBE.
>> Happy to be here.
>> Thanks a lot.
>> All right, let's get into it. We want you to explain how Cisco validated the Zebrium technology and the proof points that you have that it actually works as advertised. So first, Atri, tell us about Cisco TAC. What does Cisco TAC do?
>> So TAC is an acronym for Technical Assistance Center; it's Cisco's support arm, the support organization. At the risk of sounding like I'm spouting a corporate line, the easiest way to summarize what TAC does is provide world-class support to Cisco customers. What that means is we have about 8,000 engineers worldwide, and any of our Cisco customers can either go on our web portal or call us to open a support request. We get about 2.2 million of these support requests a year. In each of these support requests, essentially, the customer will describe something that they need done, some networking goal that they want to accomplish, and then it's TAC's job to make sure that that goal does get accomplished. It could be that they're having trouble with an existing network solution and it's not working as expected, or that they're integrating with a new solution. They're upgrading devices, maybe there's a hardware failure; anything, really, to do with networking support and the customer's network goals. If they open up a case to request help, then TAC's job is to respond and make sure the customer's questions and requirements are met. About 44% of these support requests are trivial and can be solved within a call or within a day.
But the rest of TAC cases really involve getting into the network device and looking at logs. It's a very technical role, a very technical job. You need to be conversant with network solutions, their designs, protocols, et cetera.
>> Wow. So 56% non-trivial. I would imagine you spend a lot of time digging through logs. Is that true? Can you quantify that? Every month, how much time do you spend digging through logs, and is that a pain point?
>> Yeah, it's interesting you ask that, because when we started on this journey to augment our support engineers' workflow with the Zebrium solution, one of the things we did was go out and ask our engineers what their experience was like doing log analysis. The anecdotal evidence was that, on average, an engineer will spend three out of their eight hours reviewing logs, either online or offline. What that means is either the customer is live with them on a WebEx and they're going over logs, network state information, et cetera, or they do it offline: the customer attaches the logs to a service request, and the engineer reviews them, tries to figure out what's going on, and provides the customer with information. So it's a very large chunk of our day. Like I said, 8,000-plus engineers, and at three hours a day, that's 24,000 man-hours a day spent on log analysis.
Now, the struggle with analyzing logs is that, out of necessity, logs are very terse. They try to pack a lot of information into very little space, for performance reasons, storage reasons, et cetera, but the side effect is that they're very esoteric. They're hard to read if you're not conversant, if you're not the developer who wrote these logs, or if you aren't doing code deep dives.
And when you're looking at where these logs are getting printed and things like that, it may not be immediately obvious, even after a little while, what a log line means or how it correlates to whatever problem you're troubleshooting. So it requires tenure. Like I was saying before, it requires a lot of knowledge about the protocol and what's expected, because when you're doing log analysis, what you're really looking for is a needle in a haystack. You're looking for that one anomalous event, that single thing that tells you this shouldn't have happened, and this was a problem. Now, doing that kind of anomaly detection requires you to know what is normal; it requires knowing what the baseline is. And that requires a very in-depth understanding of the state changes for that network solution or product. So it requires time, tenure, and expertise to do well, and it takes a lot of time even when you have that expertise.
>> Wow. So thank you, Atri. And Necati, that's almost two days a week for a technical resource. That's not inexpensive. So what was Cisco looking for to help with this, and how did you stumble upon Zebrium?
>> Yeah, so we have an internal automation system which has been running for more than a decade now. What happens is, when a customer attaches a log bundle or diagnostic bundle to the service request, we take it from the SR, analyze it, and present some kind of information, whether alerts, tables, or graphs, to the engineer so they can troubleshoot that particular issue. This is an incredible system, but it comes with its own maintenance challenges: keeping it up to date and relevant with Cisco's new products, new versions of a product, new defects, new issues, and all kinds of things. What I mean by those challenges is, let's say Cisco comes up with a product today.
We need to come together with those engineers. We need to figure out how this bundle works, how it's structured.
>>So these things that we are doing manually, how we can move it a bit further and automate. And we had actually a couple of things in mind that we were looking for and this being one of them being, this has to be product agnostic. Like if Cisco comes up with a product tomorrow, I should be able to take it logs without writing, you know, complex regs, pars, whatever, and deploy it into this system. So it can embrace our logs and make sense of it. And we wanted this platform to be unsupervised. So none of the engineers need to create rules, you know, label logs. This is bad. This is good. Or train the system like which requires a lot of computational power. And the other most important thing for us was we wanted this to be not noisy at all, because what happens with noises when your level of false PE positives really high your engineers start ignoring the good things between that noise. >>So they start the next time, you know, thinking that this thing will not be relevant. So we want something with a lot or less noise. And ultimately we wanted this new platform or new framework to be easily adaptable to our existing workflows. So this is where we started. We start looking into the, you know, first of all, internally, if we can build this thing and also start researching it, and we came up to Zeum actually Larry, one of the co co-founders of Zeum. We came upon his presentation where he clearly explained why this is different, how this works, and it immediately clicked in. And we said, okay, this is exactly what we were looking for. We dived deeper. We checked the block posts where SBRI guys really explained everything very clearly there, they are really open about it. And most importantly, there is a button in their system. >>So what happens usually with AI ML vendors is they have this button where you fill in your details and sales guys call you back. And, you know, we explain the system here. They were like, this is our trial system. 
We believe in the system; you can just sign up and try it yourself. And that's what we did. We took one of our Cisco Live DNA Center wireless platforms and started streaming logs out of it. Then we synthetically introduced errors; we broke things. And we realized that Zebrium was catching the errors perfectly. On top of that, it was really quiet unless you were really breaking something. The other thing we realized during that first trial was that Zebrium was actually bringing a lot of context on top of the logs. For those failures, we worked with a couple of technical leaders, and they said, okay, if this failure happens, I'm expecting this individual log to be there. And we found that with Zebrium, apart from that individual log, a lot of other things were surfaced that give more context around the root cause, which was great. And that's where we wanted to take it to the next level. Yeah.
>> Okay, so a couple of things to unpack there. I mean, you have the dartboard behind you, which is kind of interesting, because a lot of times it's like throwing darts at the board to try to figure this stuff out. But to your other point, Cisco actually has some pretty rich tools with AppD and observability, and you've made acquisitions like ThousandEyes. And like you said, I'm presuming you've got to eat your own dog food, or drink your own champagne, so you've got to be tools-agnostic. When I first heard about Zebrium, I was like, wait a minute, really? I was kind of skeptical. I've heard this before. You're telling me all I need is plain text and a timestamp, and you've got my problem solved? And I understand that you guys said, okay, let's run a POC. Let's see if we can cut that from, say, two days a week down to one day a week; in other words, let's see if we can automate 50% of the root cause analysis. And so you funded a POC. How did you test it?
You put, you know, synthetic errors and problems in there, but how did you test that it actually works, Necati?
>> Yeah. So we wanted to take it to the next level, which means we wanted to back-test it with existing SRs. We chose four different products from four different verticals: data center, security, collaboration, and enterprise networking. And we found SRs where the engineer had put some kind of log in the resolution summary; so they closed the case, and in the summary of the SR they wrote: I identified these log lines, and they led me to the root cause. We ingested those log bundles and tried to see if Zebrium could surface that exact same log line in its analysis. We initially did it ourselves, with Atri, and after 50 tests or so we were really happy with the results; in almost all of them we saw the log line we were looking for. But that was not enough.
We brought it, of course, to our management, and they said, okay, let's try this with real users, because the log being there is one thing, but the engineer reaching that log is another. We wanted to make sure that when we put it in front of our users, our engineers, they could actually get to that log themselves, because, you know, we know this platform, so we can make searches and find whatever we are looking for. So we extended our pilot to some selected engineers, and they tested it with their own SRs, and also did some back-testing on SRs which had been closed in the past or recently. And with a sample set of, I guess, close to 200 SRs, we found that the majority of the time, almost 95% of the time, the engineer could find the log they were looking for in the Zebrium analysis.
>> Yeah. Okay. So you were looking for 50%; you got to 95%.
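The back-test Necati describes reduces to a simple hit-rate measurement: for each closed SR, does the analysis surface the root-cause log line the engineer cited at close? This sketch uses invented SR records and log lines purely to show the shape of that measurement:

```python
# Hypothetical back-test records: each closed SR pairs the root-cause
# line from the engineer's resolution summary with the set of lines the
# automated analysis surfaced. A "hit" means the known line was surfaced.
closed_srs = [
    {"id": "SR1", "root_cause": "DISK-FAIL slot=2",
     "surfaced": {"DISK-FAIL slot=2", "RAID-DEGRADED"}},
    {"id": "SR2", "root_cause": "OSPF-ADJ-DOWN nbr=10.0.0.9",
     "surfaced": {"OSPF-ADJ-DOWN nbr=10.0.0.9"}},
    {"id": "SR3", "root_cause": "FAN-RPM-LOW tray=1",
     "surfaced": {"PSU-OK"}},  # a miss: analysis never showed the line
]

def hit_rate(srs):
    """Fraction of SRs whose known root-cause line was surfaced."""
    hits = sum(1 for sr in srs if sr["root_cause"] in sr["surfaced"])
    return hits / len(srs)

print(f"{hit_rate(closed_srs):.0%}")  # -> 67%
```

Run over a sample of roughly 200 SRs drawn from four product lines, a measurement of this shape is what sits behind the 95% figure quoted above.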
And my understanding is you actually did it with four pretty well-known Cisco products: WebEx Client, DNA Center, Identity Services Engine (ISE), and then UCS, Unified Computing System. So you used actual real data, and that was kind of your proof point. That sounds pretty impressive. And Atri, have you put this into production now, and what have you found?
>> Well, yes, we've launched this with the four products that you mentioned. Whenever a support bundle for one of those products gets attached to a support request, we process it with Zebrium and provide that analysis to the TAC engineer for their review.
>> So are you seeing the results in production? Are you actually able to reclaim the time people were spending? It was literally almost two days a week; is it down to part of a day? Is that what you're seeing in production, and what are you able to do with that extra time? Are people getting their weekends back? Are you putting them on more strategic tasks? How are you handling that?
>> Yeah. So what we're seeing is, and I can tell you from my own personal experience using this tool, that troubleshooting any one of these cases doesn't take me more than 15 to 20 minutes to go through the Zebrium report, and within that time I know either what the root cause is, or that Zebrium doesn't have the information I need to solve this particular case. So we've definitely seen gains, though it's been very hard to measure exactly how much time we've saved per engineer.
Again anecdotally, what we've heard from our users is that out of those three hours they were spending per day, we're definitely able to reclaim at least one. And even more importantly, in terms of the feedback we've gotten, I think one statement that really summarizes how Zebrium has impacted our workflow came from one of our users.
They said: until you provided us with this tool, log analysis was a very black-and-white affair, but now it's become really colorful. And if you think about it, log analysis is indeed black and white: you're looking at it on a terminal screen where the background is black and the text is white, or in a text file where the background is white and the text is black. What they're really saying is that there are hardly any visual cues to help you navigate these logs, which are so esoteric, so dense, et cetera. What Zebrium does is provide a lot of color and context to the whole process. Using their word cloud, their interactive histogram, the summaries of every incident, you're very quickly able to see what might be happening and what you need to look into: what are the important aspects of this particular log bundle that might be relevant to you? A really great use case that encapsulates all of this came very early on in our experiment. There was a support request that had been escalated to the business unit, the development team. The TAC engineer had an intuition about what was going wrong, because of their experience, because of the symptoms they'd seen. They kind of had an idea, but they weren't able to convince the development team, because they couldn't find any evidence to back up what they thought was happening.
It was entirely happenstance that I happened to pick up that case and did an analysis using Zebrium. I sat down with the TAC engineer, and within 15 minutes we were able to get to the exact sequence of events that provided evidence of what the TAC engineer thought was the root cause. We were then able to share that evidence with our business unit and redirect their resources so we could chase down the problem. That really shows you how that color and context helps in log analysis.
>> Interesting. You know, we do a fair amount of work on theCUBE in the RPA space, robotic process automation, and the narrative in the press when RPA first started taking off was, oh, it's machines replacing humans, we're going to lose jobs. What actually happened was people just eliminated mundane tasks, and the employees were actually very happy about it. But my question to you is, was there ever a reticence among your team, like, oh wow, I'm going to lose my job if the machine replaces me? Or have you found that people were excited about this? What's been the reaction among the team?
>> Well, I think every automation and AI project gets that immediate gut reaction of "you're automating away our jobs," and so forth. Initially there's a little bit of reticence, but like you said, once you start using the tool, you realize that it's not your job that's getting automated away; it's just that your job becomes a little easier to do, and it's faster and more efficient. You're able to get more done in less time. That's really what we're trying to accomplish here. At the end of the day, Zebrium will identify these incidents, do the correlation, et cetera.
But if you don't understand what you're reading, then that information is useless to you. So you still need the human, the network expert, to actually look at these incidents. What we are able to skim away, or get rid of, is all of the fat in the process: having to download the bundle, which, when it's many gigabytes in size, and now that we're working from home with the pandemic, means pulling massive amounts of logs from the corporate network onto your local device; that takes time. Then opening it up, loading it in a text editor; that takes time.
All of these things we're trying to get rid of. Instead, we're trying to make it easier and quicker for you to find what you're looking for. So like you said, you take away the mundane, you take away the difficulties and the slog, but you don't really take away the work; the work still needs to be done.
>> Yeah. Great, guys. Thanks so much. Appreciate you sharing your story. It's quite fascinating, really. Thank you for coming on.
>> Thanks for having us.
>> You're very welcome. Okay, in a moment I'll be back to wrap up with some final thoughts. This is Dave Vellante, and you're watching theCUBE.
>> So today we talked about the need not only to gain end-to-end visibility, but to automate the identification of root-cause problems, and how doing so with modern technology and machine intelligence can dramatically speed up the process and identify the vast majority of issues right out of the box, if you will. And this technology can work with log bundles in batches or with real-time data; as long as there's plain text and a timestamp, it seems Zebrium's technology will get you the outcome of automating root cause analysis with very high degrees of accuracy. Zebrium is available on-prem or in the cloud.
Now, this is important, because for some companies on-prem matters: there's sensitive data inside logs that, for compliance and governance reasons, has to stay inside their four walls. Zebrium has a free trial; of course they'd better, right? So check it out at zebrium.com. You can book a live demo and sign up for a free trial. Thanks for watching this special presentation on theCUBE, the leader in enterprise and emerging tech coverage. I'm Dave Vellante.

Published Date : Jun 16 2022

Atri Basu & Necati Cehreli | Root Cause as a Service - Never dig through logs again


 

(upbeat music) >> Okay, we're back with Atri Basu who is Cisco's resident philosopher who also holds a master's in computer science. We're going to have to unpack that a little bit. And Necati Cehreli, who's technical lead at Cisco. Welcome, guys. Thanks for coming on theCUBE. >> Happy to be here. >> Thanks a lot. >> All right, let's get into it. We want you to explain how Cisco validated the Zebrium technology and the proof points that you have that it actually works as advertised. So first Atri, first tell us about Cisco TAC. What does Cisco TAC do? >> So TAC is otherwise it's an acronym for Technical Assistance Center, is Cisco's support arm, the support organization. And the risk of sounding like I'm spouting a corporate line. The easiest way to summarize what TAC does is provide world class support to Cisco customers. What that means is we have about 8,000 engineers worldwide and any of our Cisco customers can either go on our web portal or call us to open a support request. And we get about 2.2 million of these support requests a year. And what these support requests are, are essentially the customer will describe something that they need done some networking goal that they have that they want to accomplish. And then it's TACs job to make sure that that goal does get accomplished. Now, it could be something like they're having trouble with an existing network solution and it's not working as expected or it could be that they're integrating with a new solution. They're, you know, upgrading devices maybe there's a hardware failure anything really to do with networking support and, you know the customer's network goals. If they open up a case for testing for help then TACs job is to respond and make sure the customer's, you know questions and requirements are met. About 44% of these support requests are usually trivial and, you know can be solved within a call or within a day. 
But the rest of TAC cases really involve getting into the network device, looking at logs. It's a very technical role. It's a very technical job. You need to be conversed with network solutions, their designs, protocols, et cetera. >> Wow. So 56% non-trivial. And so I would imagine you spend a lot of time digging through logs. Is that true? Can you quantify that like, you know, every month how much time you spend digging through logs and is that a pain point? >> Yeah, it's interesting you asked that because when we started on this journey to augment our support engineers workflow with Zebrium solution, one of the things that we did was we went out and asked our engineers what their experience was like doing log analysis. And the anecdotal evidence was that on average an engineer will spend three out of their eight hours reviewing logs either online or offline. So what that means is either with the customer live on a WebEx, they're going to be going over logs, network, state information, et cetera or they're going to do it offline where the customer sends them the logs it's attached to a, you know, a service request and they review it and try to figure out what's going on and provide the customer with information. So it's a very large chunk of our day. You know, I said 8,000 plus engineers and so three hours a day that's 24,000 man hours a day spent on log analysis. Now the struggle with logs or analyzing logs is there by out of necessity, logs are very contrite. They try to pack a lot of information in a very little space. And this is for performance reasons, storage reasons, et cetera, but the side effect of that is they're very esoteric. So they're hard to read if you're not conversant if you're not the developer who wrote these logs or you aren't doing code deep dives. 
And you're looking at where this logs getting printed and things like that, it may not be immediately obvious or even after a little while it may not be obvious what that log line means or how it correlates to whatever problem you're troubleshooting. So it requires tenure. It requires, you know, like I was saying before it requires a lot of knowledge about the protocol what's expected because when you're doing log analysis what you're really looking for is a needle in a haystack. You're looking for that one anomalous event, that single thing that tells you this shouldn't have happened, and this was a problem right. Now doing that kind of anomaly detection requires you to know what is normal. It requires, you know, what the baseline is. And that requires a very in depth understanding of, you know the state changes for that network solution or product. So it requires time to near and expertise to do well. And it takes a lot of time even when you have that kind of expertise. >> Wow. So thank you, Atri. And Necati, that's almost two days a week for a technical resource. That's not inexpensive. So what was Cisco looking for to sort of help with this and how'd you stumble upon Zebrium? >> Yeah, so, we have our internal automation system which has been running more than a decade now. And what happens is when a customer attach log bundle or diagnostic bundle into the service request we take that from the Sr we analyze it and we represent some kind of information. You know, it can be alerts or some tables, some graph, to the engineer, so they can, you know troubleshoot this particular issue. This is an incredible system, but it comes with its own challenges around maintenance to keep it up to date and relevant with Cisco's new products or a new version of a product, new defects, new issues and all kind of things. And when I mean with those challenges are let's say Cisco comes up with a product today. We need to come together with those engineers. 
We need to figure out how this bundle works, how it's structured out. We need to select individual logs, which are relevant and then start modeling these logs and get some values out of those logs, using PaaS or some rag access to come to a level that we can consume the logs. And then people start writing rules on top of that abstraction. So people can say in this log I'm seeing this value together with this other value in another log, maybe I'm hitting this particular defect. So that's how it works. And if you look at it, the abstraction it can fail the next time. And the next release when the development or engineer decides to change that log line which you write that rag X or we can come up with a new version which we completely change the services or processes then whatever you have wrote needs to be re-written for the new service. And we see that a lot with products, like for instance, WebEx where you have a very short release cycle that things can change maybe the next week with a new release. So whatever you are writing, especially for that abstraction and for those rules are maybe not relevant with that new release. With that being said we have a incredible rule creation process and governance process around it which starts with maybe a defect. And then it takes it to a level where we have an automation in place. But if you look at it, this really ties to human bandwidth. And our engineers are really busy working on you know, customer facing, working on issues daily and sometimes creating news rules or these PaaS are not their biggest priorities so they can be delayed a bit. So we have this delay between a new issue being identified to a level where we have the automation to detect it next time that some customer faces it. So with all these questions and with all challenges in mind we start looking into ways of actually how we can automate these automation. So these things that we are doing manually how we can move it a bit further and automate. 
And we actually had a couple of things in mind that we were looking for, one of them being: this has to be product-agnostic. Like, if Cisco comes up with a product tomorrow, I should be able to take its logs without writing, you know, complex regexes, parsers, whatever, and deploy it into this system, so it can ingest our logs and make sense of them. And we wanted this platform to be unsupervised, so none of the engineers need to create rules, you know, label logs, "this is bad, this is good," or train the system, which requires a lot of computational power. And the other most important thing for us was, we wanted this to be not noisy at all, because what happens with noise is, when your level of false positives is really high, your engineers start ignoring the good things in between that noise. So the next time, you know, they start thinking that this thing will not be relevant. So we wanted something with a lot less noise. And ultimately we wanted this new platform or new framework to be easily adaptable to our existing workflow. So this is where we started. We started looking into it, you know, first of all internally, whether we could build this thing ourselves, and also started researching, and we came upon Zebrium. Actually, it was Larry, one of the co-founders of Zebrium: we came upon his presentation, where he clearly explained why this is different and how it works, and it immediately clicked, and we said, okay, this is exactly what we were looking for. We dove deeper. We checked the blog posts, where the Zebrium guys really explain everything very clearly. They're really open about it. And most importantly, there is a button in their system. What happens usually with AI/ML vendors is they have this button where you fill in your details and a sales guy calls you back and, you know, explains the system. Here they were like, this is our trial system, we believe in the system, you can just sign up and try it yourself. And that's what we did.
We took one of our Cisco Live DNA Center wireless platforms. We started streaming logs out of it, and then we synthetically, you know, introduced errors; like, we broke things. And then we realized that Zebrium was really catching the errors perfectly. And on top of that, it was really quiet unless you were actually breaking something. And the other thing we realized during that first trial was that Zebrium was actually bringing a lot of context on top of the logs. During those failures, we worked with a couple of technical leaders, and they said, "Okay, if this failure happens, I'm expecting this individual log to be there." And we found with Zebrium that apart from that individual log, there were a lot of other things which give a bit more context around the root cause, which was great. And that's where we wanted to take it to the next level. Yeah. >> Okay. So, you know, a couple things to unpack there. I mean, you have the dartboard behind you, which is kind of interesting, 'cause a lot of times it's like throwing darts at the board to try to figure this stuff out. But to your other point, Cisco actually has some pretty rich tools with AppD and doing observability, and you've made acquisitions like ThousandEyes. And like you said, I'm presuming you've got to eat your own dog food, or drink your own champagne, and so you've got to be tools-agnostic. And when I first heard about Zebrium, I was like, wait a minute, really? I was kind of skeptical. I've heard this before. You're telling me all I need is plain text and a timestamp, and you've got my problem solved. So, and I understand that you guys said, okay, let's run a POC. Let's see if we can cut that from, let's say, two days a week down to one day a week. In other words, 50%; let's see if we can automate 50% of the root cause analysis. And so you funded a POC. How did you test it? You put, you know, synthetic errors and problems in there, but how did you test that it actually works, Necati? >> Yeah.
So we wanted to take it to the next level, which meant that we wanted to back-test it with existing SRs. And we decided, you know, we chose four different products from four different verticals: data center, security, collaboration, and enterprise networking. And we found SRs where the engineer put some kind of log in the resolution summary. So they closed the case, and in the summary of the SR they put, "I identified these log lines and they led me to the root cause," and we ingested those log bundles. And we tried to see if Zebrium could surface that exact same log line in its analysis. So we initially did it ourselves, and after 50 tests or so we were really happy with the results. I mean, in almost all of them we saw the log line that we were looking for, but that was not enough. And we brought it, of course, to our management, and they said, "Okay, let's try this with real users," because the log being there is one thing, but the engineer reaching that log is another thing. So we wanted to make sure that when we put it in front of our users, our engineers, they could actually come to that log themselves, because, you know, we know this platform, so we can, you know, make searches and find whatever we are looking for, but we wanted to verify that. So we extended our pilot to some selected engineers, and they tested with their own SRs, and also did some back-testing for some SRs which were closed in the past or recently. And with a sample set of, I guess, close to 200 SRs, we found that the majority of the time, almost 95% of the time, the engineer could find the log they were looking for in Zebrium's analysis. >> Yeah. Okay. So you were looking for 50%, and you got to 95%. And my understanding is you actually did it with four pretty well-known Cisco products: WebEx client, DNA Center, Identity Services Engine (ISE), and then UCS, Unified Computing System. So you used actual real data, and that was kind of your proof point. But Atri, that sounds pretty impressive.
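The back-testing protocol Necati describes, taking a closed SR whose resolution summary names the root-cause log line, running the bundle through the tool, and checking whether that line surfaces in the report, boils down to a hit-rate calculation. Here is a toy sketch with made-up data; real matching between a summary and a report would need to be fuzzier than a plain substring check:

```python
def surfaced(report_lines, root_cause_line):
    """True if the known root-cause line appears anywhere in the tool's report."""
    return any(root_cause_line in line for line in report_lines)

def hit_rate(cases):
    """cases: (report_lines, root_cause_line) pairs from closed SRs.
    Returns the percentage of SRs whose root-cause line was surfaced."""
    hits = sum(surfaced(report, rc) for report, rc in cases)
    return 100.0 * hits / len(cases)

# Toy stand-ins for real SR data.
cases = [
    (["ospf: adjacency DOWN", "fan tray 2 removed"], "fan tray 2 removed"),
    (["disk full on bootflash:", "config saved"], "disk full on bootflash:"),
    (["interface up"], "crypto engine error"),  # a miss
]
print(f"{hit_rate(cases):.0f}% of root-cause lines surfaced")  # prints 67% here
```

Running this over roughly 200 real SRs, with engineers rather than substring matches judging whether the line was findable, is what produced the roughly 95% figure quoted above.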
And have you put this into production now, and what have you found? >> Well, yes, we've launched this with the four products that you mentioned. We're providing our TAC engineers with the ability, whenever a support bundle for one of those products gets attached to a support request, to have us process it using Zebrium and then provide that analysis to the TAC engineer for their review. >> So are you seeing the results in production? I mean, are you actually able to reclaim that time that people are spending? I mean, it was literally almost two days a week, down to, you know, a part of a day. Is that what you're seeing in production, and what are you able to do with that extra time? Are people getting their weekends back? Are you putting them on more strategic tasks? How are you handling that? >> Yeah. So what we're seeing is, and I can tell you from my own personal experience using this tool, that when troubleshooting any one of these cases, I don't take more than 15 to 20 minutes to go through the Zebrium report. And I know within that time either what the root cause is, or that Zebrium doesn't have the information I need to solve this particular case. So we've definitely seen, well, it's been very hard to measure exactly how much time we've saved per engineer, right? But anecdotally, what we've heard from our users is that out of those three hours that they were spending per day, we're definitely able to reclaim at least one of those hours. And even more importantly, you know, in terms of the kind of feedback we've gotten, I think one statement that really summarizes how Zebrium's impacted our workflow came from one of our users. They said, "Well, you know, until you provided us with this tool, log analysis was a very black and white affair, but now it's become really colorful." And I mean, if you think about it, log analysis is indeed black and white.
You're looking at it on a terminal screen where the background is black and the text is white, or you're looking at it as text where the background is white and the text is black, but what they're really trying to say is that there are hardly any visual cues that help you navigate these logs, which are so esoteric, so dense, et cetera. But what Zebrium does is provide a lot of color and context to the whole process. So now, you know, using their Word Cloud, using their interactive histogram, using the summaries of every incident, you're very quickly able to summarize what might be happening and what you need to look into. Like, what are the important aspects of this particular log bundle that might be relevant to you? So we've definitely seen that. A really great use case that kind of encapsulates all of this came very early on in our experiment. There was this support request that had been escalated to the business unit, or the development team. And the TAC engineer really had an intuition about what was going wrong, because of their experience, because of, you know, the symptoms that they'd seen. They kind of had an idea, but they weren't able to convince the development team, because they weren't able to find any evidence to back up what they thought was happening. And it was entirely happenstance that I happened to pick up that case and did an analysis using Zebrium. And then I sat down with the TAC engineer, and within 15 minutes we were able to get down to the exact sequence of events: evidence of what the TAC engineer, not the customer, thought was the root cause. And then we were able to share that evidence with our business unit and, you know, redirect their resources so that we could chase down what the problem was. And that really shows you how that color and context helps in log analysis. >> Interesting.
You know, we do a fair amount of work in theCUBE in the RPA space, robotic process automation, and the narrative in the press when RPA first started taking off was, oh, you know, machines replacing humans, we're going to lose jobs. And what actually happened was people were just eliminating mundane tasks, and the employees were actually very happy about it. But my question to you is: was there ever a reticence amongst your team, like, oh wow, I'm going to lose my job if the machine's going to replace me? Or have you found that people were excited about this? What's been the reaction amongst the team? >> Well, I think, you know, every automation and AI project gets that immediate gut reaction of, you're automating away our jobs, and so forth. So initially there's a little bit of reticence, but, I mean, it's like you said: once you start using the tool, you realize that it's not your job that's getting automated away. It's just that your job's becoming a little easier to do, and it's faster and more efficient. You're able to get more done in less time. That's really what we're trying to accomplish here. At the end of the day, Zebrium will identify these incidents, they'll do the correlation, et cetera. But if you don't understand what you're reading, then that information's useless to you. So you need the human, you need the network expert, to actually look at these incidents. But what we are able to skim away, or get rid of, is all the fat that's involved in our process. Like, without this, you have to download the bundle, which, you know, when it's many gigabytes in size, and now that we're working from home with the pandemic and everything, means pulling massive amounts of logs from the corporate network onto your local device. That takes time. And then opening it up, loading it in a text editor, that takes time. All of these things are what we're trying to get rid of.
And instead we're trying to make it easier and quicker for you to find what you're looking for. So it's like you said: you take away the mundane, you take away the difficulties and the slog, but you don't really take away the work. The work still needs to be done. >> Yeah, great. Guys, thanks so much, appreciate you sharing your story. It's quite fascinating, really. Thank you for coming on. >> Thanks for having us. >> You're very welcome. >> Excellent. >> Okay. In a moment, I'll be back to wrap up with some final thoughts. This is Dave Vellante, and you're watching theCUBE. (upbeat music)

Published Date : May 25 2022
