Anurag Gupta, Shoreline.io | AWS re:Invent 2022 - Global Startup Program

(gentle music) >> Now welcome back to theCUBE, everyone. I'm John Walls, and once again, we're glad to have you here for AWS re:Invent 22. Our coverage continues here on Thursday, day three, of what has been a jam-packed week of tech and AWS, of course, has been the great host for this. It's now a pleasure to welcome in Anurag Gupta, who is the founder and CEO of Shoreline, joining us here as part of the AWS Global Showcase Startup Program, and Anurag, good to see you, sir. Thanks for joining us. >> Thank you so much. >> Tell us about Shoreline, about what you're up to. >> So we're a DevOps company. We're really focused on repairing issues. If you think about it, there are a ton of DevOps companies and we all went to the cloud in order to gain faster innovation and by and large check. Then all of the things involved in getting things into production, artifact generation, testing, configuration management, deployment, also by and large, automated. Now pity the poor SRE who's getting the deluge of stuff on them, every week, every two days, sometimes multiple times a day, and it's complicated, right? Kubernetes, VMs, lots of services, multiple clouds, sometimes, and you know, they need to know a little bit about everything. And you know what, there are a ton of companies that actually help you with what we call Day-2 Ops. It's just that most of them help you with observability, telling you what's gone wrong, or incident management, routing something to someone. But you know, back when I was at AWS, I never got really that excited about one more dashboard to look at or one more like better ticket routing. What used to really excite me was having some issue extinguished forever. And if you think about it, like the first five minutes of an incident are detecting and routing. The next hour, two hours, is some human being going in and fixing it, so that feels like the big opportunity to reduce, so hopefully we can talk a little bit about different ways that one can do that.
>> What about Day-2 Ops? Just tell me about how you define that. >> So I basically define it as once the software goes into a production, just making sure things stay up and are healthy and you're resilient and you don't get errors and all of those sorts of things because everything breaks sooner or later, you know, to a greater or lesser degree. >> Especially that SRE you're talking about, right? >> Yeah. >> So let's go back to that scenario. Yeah, you pity the poor soul because they do have to be a little expert in everything. >> Exactly. >> And that's really challenging and we all know that, that's really hard. So how do you go about trying to lighten that burden, then? >> So when you look at the numbers, about somewhere between 40% to even 95% of the alarms that fire, the alerts that fire, are false positives and that's crazy. Why is someone waking up just to deal with? >> It's a lot of wasted time, isn't it? >> A lot of wasted time. And you know, you're also training someone into what I call ClickOps, just to go in and click the button and resolve it and you don't actually know if it was the false positive or it's the rare real positive, and so that's a challenge, right? And so the first thing to do is to figure out where the false positives are. Like, let's say Datadog tells you that CPU is high and alarms. Is that a good thing or a bad thing? It's hard for them to tell, right? But you have to then introspect it into something precise like, oh, CPU is high, but response times are standard and the request rate is high. Okay, that's a good thing. I'm going to ignore this. Or CPU is high, but it kind of resolves itself, so I'm going to not wake anybody up. Or CPU is high and oh, it's the darn JVM starting to garbage collect again, so let me go and take a heap dump and give that to my dev team and then bounce the JVM and you know, without waking anybody up, or CPU is high, I have no idea what's going on. Now it's time to wake somebody up. 
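As an illustration of the "CPU is high" triage Anurag walks through, the decision order can be sketched in a few lines of Python. This is a minimal sketch, not Shoreline's implementation: the `Snapshot` fields, the `Action` names, and the idea that each signal arrives as a ready-made boolean are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE = "ignore the alarm"
    NO_PAGE = "resolved on its own, do not wake anybody"
    HEAP_DUMP_AND_RESTART = "capture a heap dump, then bounce the JVM"
    PAGE_HUMAN = "wake somebody up"


@dataclass
class Snapshot:
    """Hypothetical signals gathered when a CPU alarm fires."""
    cpu_high: bool
    response_time_normal: bool
    request_rate_high: bool
    resolved_itself: bool      # CPU dropped back without intervention
    jvm_gc_thrashing: bool     # JVM is spending its time garbage collecting


def triage(s: Snapshot) -> Action:
    """Walk the cases in the order described in the interview."""
    if not s.cpu_high:
        return Action.IGNORE
    if s.response_time_normal and s.request_rate_high:
        # Busy but healthy: high CPU is expected under high load.
        return Action.IGNORE
    if s.resolved_itself:
        return Action.NO_PAGE
    if s.jvm_gc_thrashing:
        return Action.HEAP_DUMP_AND_RESTART
    # Nothing known matches: this is the novel case worth a human.
    return Action.PAGE_HUMAN
```

Only the last branch pages a person, which matches the point that humans should be reserved for novel problems.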
You know, what you want to use humans for is the ability to think about novel stuff, not to do repetitive stuff, so that's the first step. The second step is, about 40% of what remains is repetitive and straightforward. So like a disk is full, I'd better clean up the garbage on the disk or maybe grow the disk. People shouldn't wake up to grow a disk. And so for that, what you want to do is just have those sorts of things get automated away. One of the nice things about Shoreline is, is that we take the experience in what we build for one company, and if they're willing, provide it to everybody else. Our belief is, a central tenet is, if someone somewhere fixes something, everyone everywhere should gain the benefit because we all sit on the same three clouds, we all sit on the same set of database infrastructure, et cetera. We should all get the same benefits. Why do we have to scar our own backs rather than benefiting from somebody else's scar tissue, so that's the second thing. The third thing is, okay, let's say it's not straightforward, not something I've seen before, then in that case, what often happens is on average like eight people get involved. You know, it initially goes to L1 support or L1 ops and, but they don't necessarily know because, as you say, the environment's complex. And so, you know, they go into Slack and they say, "@here, can somebody help me with this?" And those things take a much longer time, so wouldn't it be better that if your best SRE is able to say, "Hey, check these 20 things and then run these actions." We could convert that into like a Jupyter Notebook where you could say the incident got fired, I pre-populated all the diagnostics, and then I tell people very precisely, "If you see this, run this, et cetera." Like a wiki, but actually something you could run right in this product.
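The "if you see this, run this" notebook idea can be pictured as a runbook stored as data rather than as wiki prose. The sketch below is an assumption-laden illustration, not Shoreline's notebook format: the `Step` type, the diagnostic keys `disk_used_pct` and `tmp_pct`, and the 90% and 10% thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Diagnostics = Dict[str, float]


@dataclass
class Step:
    """One 'if you see this, run this' entry written by an experienced SRE."""
    see: str                                   # what the responder should observe
    condition: Callable[[Diagnostics], bool]   # machine-checkable version of 'see'
    run: str                                   # the action to take


def recommend(steps: List[Step], diagnostics: Diagnostics) -> List[str]:
    """Return the actions whose conditions match the pre-populated diagnostics."""
    return [step.run for step in steps if step.condition(diagnostics)]


# Hypothetical runbook for the disk-full case mentioned in the interview.
DISK_RUNBOOK = [
    Step(
        see="disk nearly full and temporary files take real space",
        condition=lambda d: d["disk_used_pct"] >= 90 and d["tmp_pct"] >= 10,
        run="clean up temporary files",
    ),
    Step(
        see="disk nearly full and little to reclaim",
        condition=lambda d: d["disk_used_pct"] >= 90 and d["tmp_pct"] < 10,
        run="grow the disk",
    ),
]
```

For example, `recommend(DISK_RUNBOOK, {"disk_used_pct": 95, "tmp_pct": 20})` returns `["clean up temporary files"]`, so an L1 responder sees a precise next step instead of asking in Slack.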
And then, you know, last piece of the puzzle, the smaller piece, is sometimes new things happen and when something new happens, what you want is sort of the central tech of Shoreline, which is parallel distributed, real-time debugging. And so the ability to do, you know, execute a command across your fleet rather than individual boxes so that you can say something like, "I'm hearing that my credit card app is slow. For everything tagged as being part of my credit card app, please run for everything that's running over 90% CPU, please run a top command." And so, you know, then you can run in the same time on one host as you can on 30,000 and that helps a lot. So that's the core of what we do. People use us for all sorts of things, also preventative maintenance, you know, just the proactive regular things. You know, like your car, you do an oil change, well, you know, you need to rotate your certs, certificates. You need to make sure that, you know, there isn't drift in your configurations, there isn't drift in your software. There's also security elements to it, right? You want to make sure that you aren't getting weird inbound/outbound traffic across to ports you don't expect to be open. You don't want to have these processes running, you know, maybe something's bad. And so that's all the kind of weird anomaly detection that's easy to do if you run things in a distributed parallel way across everything. That's super hard to do if you have to go and Whac-A-Mole across one box after the next. >> Well, which leads to a question just in terms of setting priorities then, which is what you're talking about helping companies establish priorities, this hierarchy of level one warning, level two, level three, level four. Sounds like that should be a basic, right? But you're saying that's not, that's not really happening in the enterprise. >> Well, you know, I would say that if you hadn't automated deployments, you should do that first. 
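The fleet-wide query from the credit card example (everything tagged as part of the app, running over 90% CPU, run `top`) can be sketched as a filter followed by a parallel fan-out. This is only a local simulation under stated assumptions: the `Host` record, the `run_on_fleet` helper, and the `fake_top` stub are hypothetical, and a real system would reach hosts through an agent or SSH rather than a Python callable.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Host:
    """Hypothetical inventory record for one machine."""
    name: str
    tags: FrozenSet[str]
    cpu_pct: float


def run_on_fleet(
    hosts: List[Host],
    tag: str,
    min_cpu_pct: float,
    command: Callable[[Host], str],
    max_workers: int = 32,
) -> Dict[str, str]:
    """Run `command` in parallel on every host with `tag` above `min_cpu_pct`."""
    targets = [h for h in hosts if tag in h.tags and h.cpu_pct > min_cpu_pct]
    if not targets:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(targets))) as pool:
        # pool.map preserves input order, so outputs line up with targets.
        outputs = pool.map(command, targets)
        return dict(zip((h.name for h in targets), outputs))


def fake_top(host: Host) -> str:
    """Stand-in for running `top` remotely; returns a placeholder string."""
    return f"{host.name}: cpu={host.cpu_pct}%"
```

Because the selection is expressed once and the fan-out is parallel, the same call shape works for one host or many, which is the contrast with going box by box.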
If you haven't automated your testing pipeline, shame on you, you should do that like a year ago. But now it's time to help people in production because you've done that other work and people are suffering. You know, the crazy thing about the cloud is, is that companies spend about three times more on the human beings to operate their cloud infrastructure as on the cloud infrastructure itself. I've yet to hear anybody say that their cloud bill is too low, you know, so, you know, there's a clear savings also available. And you know, back when I was at AWS, obviously I had to keep the lights on too, but you know, I had to do that, but it's kind of a tax on my engineers and I'd really spend, prefer to spend the head count on innovation, on doing things that delight my customers. You never delight your customers by keeping the lights on, you just avoid irritating them by turning 'em off, right? >> So why are companies so fixed in on spending so much time on manually repairing things and not looking for these kinds of little, much more elegant solution and cost-efficient, time-saving, so on so forth. >> Yeah, I think there just hasn't been very much in this space as yet because it's a hard, hard problem to solve. You know, automation's a little bit scary and that's the reality of it and the way you make it less scary is by proving it out, by doing the simple things first, like reducing the alert fatigue, you know, that's easy. You know, providing notebooks to people so that they can click things and do things in a straightforward way. That's pretty easy. The full automation, that's kind of the North Star, that's what we aspire to do. But you know, people get there over time and one of our customers had 700 instances of this particular incident solved for them last week. You imagine how many human beings would've been doing it otherwise, you know? >> Right. >> That's just one thing, you know? >> How many did it take to build a pyramid?
How many decades did that take, right? You had an announcement this week. I don't think we've talked about that. >> No, yeah, so we just announced Incident Insights, which is a free product that lets people plug into initially PagerDuty and pretty soon the Opsgenie, ServiceNow, et cetera. And what you can do is, is you give us an API key read-only and we will suck your PagerDuty data out. We apply some lightweight ML unsupervised learning, and in a couple of minutes, we categorize all of your incidents so that you can understand which are the ones that happen most often and are getting resolved really quickly. That's ClickOps, right? Those alarms shouldn't fire. Which are the ones that involve a lot of people? Those are good candidates to build a notebook. Which are the ones that happen again and again and again? Those are good candidates for automation. And so, I think one of the challenges people have is, is that they don't actually know what their teams are doing and so this is intended to provide them that visibility. One of our very first customers was doing the beta test for us on it. He used to tell us he had about 100 tickets, incidents a week. You know, he brought this tool in and he had 2,100 last week and was all, you know, like these false alarms, so while he's giving us- >> That was eye opening for him to see that, sure. >> And while he's, you know, looking at it, you know, he's just like filing Jiras to say, "Oh, change this threshold, cancel this alarm forever." You know, all of that kind of stuff. Before you get to do the fancy work, you got to clean your room before you get to do anything else, right? >> Right, right, dinner before dessert, basically. >> There you go. >> Hey, thanks for the insights on this and again the name of the new product, by the way, is... >> Incident Insights. >> Incident Insights. >> Totally free. >> Free. >> Yeah, it takes a couple of minutes to set up.
Go to the website, Shoreline.io/insight and you can be up and running in a couple of minutes. >> Outstanding, again, the company is Shoreline. This is Anurag Gupta, and thank you for being with us. We appreciate it. >> Appreciate it, thank you. >> Glad to have you here on theCUBE. Back with more from AWS re:Invent 22. You're watching theCUBE, the leader in high-tech coverage. (gentle music)
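The three buckets described for Incident Insights (frequent and quickly resolved, many people involved, recurring) can be illustrated with plain rules over exported incident records. This sketch deliberately swaps in fixed thresholds and grouping by title where the interview mentions lightweight unsupervised ML, so it is not the product's method; the `Incident` fields, the `categorize` helper, and every threshold are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical record exported from an incident management tool."""
    title: str
    minutes_to_resolve: float
    responders: int


def categorize(
    incidents: List[Incident],
    frequent: int = 10,
    quick_minutes: float = 5.0,
    many_responders: int = 4,
) -> Dict[str, str]:
    """Label each incident title using the three questions from the interview."""
    groups: Dict[str, List[Incident]] = defaultdict(list)
    for inc in incidents:
        groups[inc.title].append(inc)

    labels: Dict[str, str] = {}
    for title, group in groups.items():
        count = len(group)
        typical_minutes = median(i.minutes_to_resolve for i in group)
        typical_responders = median(i.responders for i in group)
        if count >= frequent and typical_minutes <= quick_minutes:
            labels[title] = "ClickOps: alarm probably should not fire"
        elif typical_responders >= many_responders:
            labels[title] = "notebook candidate"
        elif count >= frequent:
            labels[title] = "automation candidate"
        else:
            labels[title] = "no change suggested"
    return labels
```

Even this crude version surfaces the kind of visibility described above: which alarms are noise, which incidents pull in many people, and which keep recurring.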

Published Date : Dec 1 2022

ENTITIES

Entity	Category	Confidence
John Walls	PERSON	0.99+
Shoreline	ORGANIZATION	0.99+
Anurag Gupta	PERSON	0.99+
Thursday	DATE	0.99+
2,100	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
700 instances	QUANTITY	0.99+
Anurag	PERSON	0.99+
20 things	QUANTITY	0.99+
last week	DATE	0.99+
first step	QUANTITY	0.99+
Jiras	PERSON	0.99+
second thing	QUANTITY	0.99+
30,000	QUANTITY	0.99+
two hours	QUANTITY	0.99+
eight people	QUANTITY	0.99+
second step	QUANTITY	0.99+
95%	QUANTITY	0.99+
40%	QUANTITY	0.99+
third thing	QUANTITY	0.99+
one box	QUANTITY	0.99+
about 100 tickets	QUANTITY	0.98+
first five minutes	QUANTITY	0.98+
One	QUANTITY	0.98+
one	QUANTITY	0.98+
one thing	QUANTITY	0.97+
this week	DATE	0.97+
one company	QUANTITY	0.97+
a year ago	DATE	0.96+
first thing	QUANTITY	0.96+
first	QUANTITY	0.96+
Shoreline.io/insight	OTHER	0.96+
SRE	ORGANIZATION	0.95+
about three times	QUANTITY	0.95+
three clouds	QUANTITY	0.95+
Jupyter	ORGANIZATION	0.94+
Datadog	ORGANIZATION	0.94+
over 90% CPU	QUANTITY	0.93+
one host	QUANTITY	0.93+
Global Showcase Startup Program	EVENT	0.92+
about 40%	QUANTITY	0.91+
level four	QUANTITY	0.91+
a week	QUANTITY	0.9+
first customers	QUANTITY	0.9+
one more	QUANTITY	0.89+
every two days	QUANTITY	0.86+
level three	QUANTITY	0.86+
level one	QUANTITY	0.85+
Day	QUANTITY	0.85+
PagerDuty	ORGANIZATION	0.84+
level two	QUANTITY	0.81+
re:Invent 2022 - Global Startup Program	TITLE	0.8+
Shoreline io	ORGANIZATION	0.78+
Incident	ORGANIZATION	0.73+
ClickOps	ORGANIZATION	0.71+
Day	TITLE	0.7+
times a day	QUANTITY	0.69+
theCUBE	ORGANIZATION	0.67+
next hour	DATE	0.66+
2	TITLE	0.65+
theCUBE	TITLE	0.63+
Kubernetes	TITLE	0.62+
day three	QUANTITY	0.62+
every	QUANTITY	0.6+
ton of companies	QUANTITY	0.6+
Invent 22	TITLE	0.59+
Star	LOCATION	0.59+
Opsgenie	ORGANIZATION	0.57+
AWA	ORGANIZATION	0.57+
Invent	EVENT	0.53+
Slack	TITLE	0.52+
PagerDuty	TITLE	0.48+
22	TITLE	0.46+
2	QUANTITY	0.43+
L1	ORGANIZATION	0.33+
ServiceNow	COMMERCIAL_ITEM	0.32+
re	EVENT	0.27+
