Tammy Butow & Alberto Farronato, Gremlin CUBE Conversation, April 2020
>> Narrator: From theCUBE studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a CUBE Conversation.
>> Hello everyone, welcome to this CUBE Conversation here in Palo Alto, in the studios of theCUBE. I'm John Furrier, your host. We're here during the COVID-19 crisis doing remote interviews. I come into the studio, we've got a quarantine crew here, getting the interviews, getting the stories out there. And of course, the story we're going to continue to talk about is the impact of COVID-19 and how we're all getting back to work, either working at home or working remotely and virtually. But as things start to change, we're going to start to see events, mostly digital events, and we're here to talk about an event that's coming up called the Failover Conference from Gremlin, which has now gone digital and takes place on April 21st. But I think what's important about this conversation, and what I want to get into, is not only the event that's coming up, but also the scale problems that are being highlighted by this change in work environment, working at home. We've been talking about the at-scale problems that we're seeing, whether it's a flood or surge of traffic, and the chaos that's ensuing across the world with this pandemic. So I'm excited, I've got two great guests: Alberto Farronato, senior vice president of marketing at Gremlin, and Tammy Butow, principal site reliability engineer, or SRE. Guys, thanks for coming on. Appreciate it, thank you.
>> Thanks.
>> Thanks for having me.
>> Alberto, I want to get to you first. We've known each other before. You've been in this industry. We've all been talking about cloud native, cloud scale for some time. It's kind of inside the ropes, it's inside baseball. Tammy, you're a site reliability engineer. Everyone knows Google, knows how cloud works. This is large scale stuff.
Now with COVID-19, we're starting to see the average person, my brother, my sister, our family members and people around the world go, "Oh my God, this is really high impact." This change of behavior, this surge on the web, whether it's traffic on the internet or work-at-home tools that are inadequate, you start to see (laughs) the statistical things that were planned for not working well, and this actually maps to the things that we've been talking about in our industry. Alberto, you've been on this. How are you guys doing?
>> Yeah.
>> And what's your take on this situation we're in right now?
>> Yeah, we're doing pretty well as a company. We were born as a distributed organization to begin with, so for us working in a distributed environment from all over the world is common practice day-to-day. Personally, I'm originally from Italy. My parents, my family, are in Milan and Bergamo of all places, so I have to follow the news with extra care, and it becomes so much clearer nowadays that technology is not just a powerful tool to enable our businesses, but is also critical for our day-to-day life. Thanks to video calls, I can easily talk to my family back there every day. So that's really important. So yes, we've been talking for a long time, as you mentioned, about complex systems at scale and reliability, often in the context of mission-critical applications, but more and more these systems also need to be reliable when it comes to the back-office systems that enable people to continue to work on a daily basis.
>> Yeah, well, our hearts go out to your family and your friends in Italy, and I hope everyone stays safe there. (speaks faintly) A tough situation continues to be a challenge. Tammy, I want to get your thoughts. How's life going for you? You're a site reliability engineer. What you deal with on the tech side is now (laughs) happening in the real world. It's mind-blowing to me that we're seeing these things happen. It's a paradigm that needs attention.
How do you look at it as an SRE, dealing with this mostly on the tech side and now seeing it play out in real life?
>> It's been such an interesting situation, and obviously really terrible for everybody to have to go through and deal with. One of the things that I specialize in as a site reliability engineer is incident management. For example, I previously worked at Dropbox, where I was the incident manager on call for 500 million customers. It's like a 24/7 shift. With these large-scale incidents, you really need to be able to act fast. There are two very important metrics that we track and care about as site reliability engineers. The first one is mean time to detection: how fast can you detect that something is happening? Obviously, if you detect an issue faster, then you've got a better chance of making the impact lower, so you can contain the blast radius. I like to explain it to people like this: if you have a fire in your sauce bin in your kitchen and you put it out, that's way better than waiting until your entire house is on fire. And the other metric is mean time to resolution: how long does it take you to recover from the situation? So yeah, this is a large-scale, global incident that we're in right now.
>> Yeah, I know you guys talk a lot about chaos theory, and that applies. There's a lot of math involved, we all know that, but I think we need to look at the real world. This is now going to be table stakes, and there's now a line in the sand here, pre-pandemic and post-pandemic. I think you guys have an interesting company, Gremlin, in the sense that this is a complex system, and if you think about the world we're going to be living in, whether it's digital events, and you guys have one coming up, or how to work at home, or the tools that humans are going to be using, it's going to be working with systems, right?
So you have this new paradigm that's going to be upon us pretty quickly, and it's not just buying software mechanisms or software. It's a complex system, it's distributed computing, it's an operating system. I mean, this is kind of the world. Can you guys talk about the Gremlin situation, how you guys are attacking these new problems and these new opportunities that are emerging?
>> Sure, I can talk about that. So yeah, one of the things I've always specialized in over the last ten years is chaos engineering. The idea of chaos engineering is that you're injecting failure on purpose to uncover weaknesses. That's really important in distributed systems, with distributed cloud computing and all these different services that you're putting together. The idea is that if you inject a small failure, you can actually figure out what happens when that failure occurs, and then you can go ahead and fix it. One of the things I like to say to people is focus on what your top five critical systems are. Let's fix those first. Don't go for low-hanging fruit. Fix the biggest problems first, get rid of the biggest amount of pain that you have as a company, and then you can go ahead and actually... If you think about the Pareto principle, the 80/20 rule, if you fix 20% of your biggest problems, you'll actually solve 80% of your issues. That always works. It's something that I've done while working at the National Australia Bank doing chaos engineering, and also at Gremlin and at Dropbox, and I help a lot of our customers do that too.
>> Alberto, talk about the mindset involved. It's the most counterintuitive thing. Whoa! Whoa! Risk! The biggest systems.
>> Yeah.
>> I don't want to touch those. They're working fine right now. And then these problems just gestate. They kind of hang around, like the bin fire in the kitchen: this is okay, I don't want to touch it, the house is still working. So this is kind of a new mindset. Could you talk about what your take is on that?
Is the industry there? I mean, it was kind of a corner case. You had Netflix, you had Chaos Monkey in those days, and now it's a DevOps practice for a lot of folks, and you guys are involved in that. What's the appetite, and what's the progress of chaos engineering in the mainstream?
>> Yeah, it's interesting that you mention DevOps. Recently Gartner came up with a new, revisited DevOps framework that has chaos engineering in the middle of the lifecycle management of your application. And the reality is that systems have become so complex, with so many layers of abstraction in the infrastructure. You have hundreds of services if you're doing microservices, but even if you're not doing microservices, you have so many applications connected to each other, building really complex workflows and automation flows. It's impossible for traditional QA to really understand where the vulnerabilities are in terms of resiliency, in terms of quality. Too often the production environment is also too different from the staging environment, and so you need a fundamentally different approach to go and find where your weaknesses are, and to find them before failures happen, before you end up in a situation like the one we're in today and you are not prepared. And so, so much of what we talk about is giving people a tool and a methodology to go and find these vulnerabilities. It's not so much about creating chaos; it's about managing the chaos that is built into our current systems and exposing those vulnerabilities before they create problems. So that's a very scientific methodology and tooling that we bring to market, and we help customers as well.
>> Tammy, I want to get your thoughts on something.
We used to riff on this a lot on theCUBE; we've had a lot of conversations and riffed over the years. But you know, when the surge of Amazon Web Services came out, it was pretty obvious that cloud's amazing, and look at the startups that were born. You mentioned Dropbox, you worked there. These companies, all born in the cloud, these hyperscale companies built from scratch, it was a great way to scale up. And we used to joke about Google. People would say, "I would like a cloud like Google," but no one has Google's use cases. And Google really pioneered the SRE concept, and you've got to give 'em a lot of props for that. But now we're kind of getting to a world where it's becoming Google-like. There's more scale now than ever before. It's not a corner case; this large scale is becoming more popular and more of a preferred architecture. What's your assessment of the mainstream enterprises? How far along are they in your mind? Are they there with chaos? Are they close? Are they doing it? How does someone develop an SRE practice to get Google-like scale? 'Cause Google has an amazing network, they've got large-scale cloud, they have SREs, they've been doing it for years. How does a company that's transforming its IT (laughs) have SREs?
>> That's a great question. I get asked this a lot as well. One of our goals at Gremlin is to help make the internet more reliable for everybody: everyone using the internet, all of the engineers who are trying to build reliable services. And so I'm often asked by companies all over the world, how do we create an SRE practice and how do we practice chaos engineering? And you can actually get started rolling out your SRE program. Based on my experience, I've done it. When I worked at Dropbox, I worked with a lot of people who had been at Google and at YouTube. They were there when SRE was rolled out across those companies, and then they brought those learnings to Dropbox, and I learned from them.
But the other interesting thing is what you see if you look at enterprise companies, like large banks. For example, I worked at the National Australia Bank for six years, and we actually did a lot of work that I would consider chaos engineering and SRE practices. For example, we would do large-scale disaster recovery, where you'd fail over an entire data center to a secret data center in an unknown location, and the reason is that you're checking to make sure that everything operates okay if there's a nuclear blast. That's actually what you have to do, and you have to do that practice every quarter. But if you think about it, it's not very good to only do it once a quarter. You really want to be practicing chaos engineering and injecting failure on purpose. Actually, I prefer to do it three times a week, so I do it a lot. But I'm also someone who likes to work out a lot and be fit all the time, so I know that if you do something regularly, you get great results. So that's what I always tell everyone.
>> Yeah, get the reps in, as we say, get stronger, get the muscle memory.
>> Yep, exactly.
>> Guys, talk about the event that's coming up. You've got an event that was scheduled as a physical event, and then you were right in the planning mode when the crisis hit. You're going digital, going virtual, but it's really digital. It's on the internet. So how are you guys thinking about this? I know it's out there. It's April 21st. Can you share some specifics around the event? Who should be attending, and how do they get involved online?
>> Yeah, the event really came together about a month ago, when we started to see all the cancellations happening across the industry because of COVID-19. We are extremely engaged in the community and we give a lot of talks, and we were seeing a lot of conferences just dropping, so speakers were losing their opportunity to share their knowledge about how you do reliability and the topics that we focus on.
And so we quickly pivoted as a company and created a new online event to give everyone in the community the opportunity to just fail over to a new event, as the conference name says, and to give those speakers who had lost their speaking slots a new opportunity to go share their knowledge. That came together really quickly. We shared the idea with a dozen of our partners and everyone liked it, and all of a sudden this thing took off like crazy. In just a month we are approaching 4,000 registrations, and we have over 30 partners signed up and supporting the initiative. A lot of past partners as well covering the event. So it was impressive to see the amount of interest that we were able to generate in such a short amount of time. And really, this is a conference for anybody who is interested in resiliency. If you want to learn from the best how to build business continuity across systems, people and processes, this is a great opportunity at no cost, really. It's a free conference.
>> And the target persona and the audience you want to have attend is what? SREs or folks doing architectural work? What's the target...
>> Yeah.
>> ...person to attend?
>> Architects, SREs, developers, business leaders who care about the quality and the reliability of their applications, who need to help create a framework and a mindset for their organizations that speaks to what Tammy was saying a minute ago: having that constant practice, on a daily basis, of going and finding how to improve things.
>> You know, Tammy, we've been going to physical events with theCUBE, extracting the signal from the noise and distributing it digitally for 10 years, and I've got to ask you, because now that those events have gone away, you talk about chaos and injecting failure. Doing these digital events is not as easy as just live streaming. It's hard to replicate the value of a physical event, with its years of experience and standards, roles and responsibilities, in digital.
It's a different consumption environment. It's asynchronous, and you're trying to create a synchronous environment. It's its own complex system, so I think a lot of people are experimenting and learning (laughs) from these events, because it's pretty chaotic. So I'd love to get your thoughts on how you look at these digital events as a chaos engineer. How should people be looking at these events? How are you guys looking at... I mean, obviously you want to get the program going, get people out there, get the content, but to iterate on this, how do you view this?
>> It is really different. I actually like to compare it to fire drills in SRE. Often what you do there is create a fake incident or a fake issue, so you just say, "Let's have a fire drill." It's similar to when you're in a building and a fire drill goes off, and you have wardens and everything, and you all have to go outside. We can do that in this new world that we're all in all of a sudden. A lot of people have never run an online event, and now all of a sudden they have to. So what I would say is, do a fire drill. Run a fake one before you do the actual one to make sure that everything works okay. My other tip is to make sure that you have backup plans. Backup plans on backup plans on backup plans. As an SRE, I always have at least three to five backup plans. I'm not just saying plan A and plan B; there's also a C, D, and E, and I think that's very important. Even when you're considering technology, one of the things we say with chaos engineering is, if you're using one service, inject failure and make sure that you can fail over to a different, alternative service in case something goes wrong.
>> Yeah, hence the Failover Conference, which is the name of the conference. (chuckles)
>> Exactly!
>> Yeah, well, we certainly are going to be sending a digital reporter there, virtually. If you need any backup plans, obviously we have the remote interviews here.
If you need any help, let us know. Really appreciate it. Great to see you guys, and thanks for sharing. Any final thoughts on the conference? What happens when we get through to the other side of this? I'll give you guys the final word. We'll start with you first, Alberto.
>> Yeah, I think when we are on the other side of this, we'll understand even more the importance of effective resilience architecting and testing. As a provider of tools and methodologies for that, we think we will be able to help customers make a significant leap forward on that side. And the conference is just super exciting. I think it's going to be a great event. I encourage everyone to participate. We have a tremendous lineup of speakers who have incredible reputations in their fields, so I'm really happy and excited about the work that the team has been able to do with our partners to put together this type of event.
>> Okay, Tammy.
>> Yeah, for me, I'm actually going to be doing the opening keynote for the conference, and the topic that I'm speaking about is that reliability matters more now than ever. I'll be sharing some bizarre, weird incidents that I have worked on and experienced myself, really critical, strange issues that have come up. So yeah, I'm really looking forward to sharing that with everybody else. Please come along, it's free. You can join from your own home, and we can all be there together to support each other.
>> You've got great community support, and there are a lot of partners, press, media, ecosystem and customers, so congratulations, Gremlin, on having a conference on April 21st called the Failover Conference. theCUBE and SiliconANGLE will have a digital reporter there covering the news. Thanks for coming on and sharing. I appreciate the time. I'm John Furrier in the Palo Alto studio with a remote interview with Gremlin about their Failover Conference, April 21st.
It's really demonstrating, in my opinion, that the at-scale problems we've been working on in the industry are now more applicable than ever before as we get to the post-pandemic world after COVID-19. Thanks for watching. Be back. (calm music)
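
Tammy's advice in the conversation, to keep several backup plans and to inject failure on purpose to confirm that you can fail over to an alternative service, can be illustrated with a short sketch. The Python below is only an illustration and was not shown or described in the interview. The names `make_provider`, `call_with_failover`, and `InjectedFailure` are hypothetical, and the "providers" are plain in-process functions standing in for real services.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]


class InjectedFailure(Exception):
    """Raised by a provider that has had a failure injected on purpose."""


def make_provider(name: str, healthy: bool = True) -> Provider:
    """Build a hypothetical provider; healthy=False simulates an injected failure."""
    def call(request: str) -> str:
        if not healthy:
            raise InjectedFailure(f"{name} is down (injected)")
        return f"{name} handled {request}"
    return name, call


def call_with_failover(providers: Sequence[Provider], request: str) -> Tuple[str, str]:
    """Try each plan in order (A, B, C, ...) and return (plan name, response)
    from the first one that succeeds; raise if every plan fails."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(request)
        except InjectedFailure as exc:
            # Record the failure and fall through to the next backup plan.
            errors.append(str(exc))
    raise RuntimeError("all backup plans failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Fire drill: break plans A and B on purpose and check that plan C takes over.
    plans = [
        make_provider("plan-A", healthy=False),
        make_provider("plan-B", healthy=False),
        make_provider("plan-C"),
    ]
    used, response = call_with_failover(plans, "stream-keynote")
    print(used, "->", response)
```

Running the drill with the first plans deliberately broken is the point of the exercise: it checks the failover path before a real outage forces it. A real system would catch the actual error types of its services rather than a single illustrative exception.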