Tammy Butow & Alberto Farronato, Gremlin CUBE Conversation, April 2020

>> Narrator: From theCUBE studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is theCUBE Conversation. >> Hello everyone, welcome to theCUBE Conversation here in Palo Alto, in our studios of theCUBE, I'm John Furrier, your host. We're here during the crisis of COVID-19 doing remote interviews. I come into the studio, we've got a quarantine crew here, getting the interviews, getting the stories out there and of course, the story we're going to continue to talk about is the impact of COVID-19, and how we're all getting back to work, either working at home or working remotely and virtually certainly, but as things start to change, we're going to start to see events, mostly digital events, and we're here to talk about an event that's coming up called the Failover Conference from Gremlin, which has now gone digital because it's April 21st. But I think what's important about this conversation that I want to get into is, not only talk about the event that's coming up, but talk about the scale problems that are being highlighted by this change in work environment, working at home. We've been talking about the at-scale problems that we're seeing, whether it's a flood of surge of traffic and the chaos that's ensuing across the world with this pandemic. So I'm excited, I have two great guests, Alberto Farronato, senior vice president of marketing at Gremlin, and Tammy Butow, principal site reliability engineer, or SRE. Guys, thanks for coming on. Appreciate it, thank you. >> Thanks. >> Thanks for having me. >> Alberto, I want to get to you first. We've known each other before. You've been in this industry. We've all been talking about cloud native, cloud scale for some time. It's kind of inside the ropes, it's inside baseball. Tammy, you're a site reliability engineer. Everyone knows Google, knows how cloud works. This is large scale stuff. 
Now with COVID-19, we're starting to see the average person, my brother, my sister, our family members and people around the world go, "Oh my God, this is really a high impact." This change of behavior, this surge, whether it's traffic on the internet or work at home tools that are inadequate, you start to see (laughs) the statistical things that were planned for, not working well, and this actually maps to the things that we've been talking about in our industry. Alberto, you've been on this. How are you guys doing? >> Yeah. >> And what's your take on this situation we're in right now? >> Yeah, we're doing pretty well as a company. We were born as a distributed organization to begin with, so for us working in a distributed environment from all over the world is common practice day-to-day. Personally, I'm originally from Italy, my parents, my family, are in Milan and Bergamo of all places, so I have to follow the news with extra care and it becomes so much clearer nowadays that technology is not just a powerful tool to enable our businesses but it also is so critical for our day-to-day life, and thanks to video calls, I can easily talk to my family back there every day. So that's really important. So yes, we've been talking for a long time as you mentioned about complex systems at scale and reliability, often in the context of mission critical applications, but more and more of these systems need to be reliable also when it comes to back office systems that enable people to continue to work on a daily basis. >> Yeah, well our hearts go out to your family and your friends in Italy, and I hope everyone stays safe there (speaks faintly) a tough situation that continues to be a challenge. Tammy, I want to get your thoughts. How's life going for you? You're a site reliability engineer. What you deal with on the tech side is now (laughs) happening in the real world. It's mind blowing to me that we're seeing these things happen, it's a paradigm that needs attention. 
How do you look at it as an SRE, dealing with this mostly on the tech side and now seeing it play out in real life? >> It's been such an interesting situation, obviously really terrible for everybody to have to go through and deal with, so one of the things that I specialize in as a site reliability engineer is incident management and so for example, I previously worked at Dropbox where I was the incident manager on call for 500 million customers, it's like a 24/7 shift. With these large scale incidents, you really need to be able to act fast. There are two very important metrics that we track and care about as site reliability engineers. The first one is mean time to detection. How fast can you detect that something is happening? Obviously, if you detect an issue faster then you've got a better chance of making the impact lower so you can contain the blast radius. I like to explain it to people like, if you have a fire in your sauce bin in your kitchen, and you put it out, that's way better than waiting until your entire house is on fire. And the other metric is mean time to resolution. So how long does it take you to recover from the situation? So yeah, this is a large scale, global incident right now that we're in. >> Yeah, I know you guys talk a lot about chaos theory and that applies. A lot of math involved, we all know that, but I think we need to look at the real world. This is now going to be table stakes and there's now a line in the sand here, pre-pandemic, post-pandemic, and I think you guys have an interesting company, Gremlin, in the sense that this is a complex system and that if you think about the world we're going to be living in, whether it's digital events, and you guys have one coming up, or how to work at home or tools that humans are going to be using, it's going to be working with systems, right? 
So you have this new paradigm going to be upon us pretty quickly and it's not just buying software mechanisms or software, it's a complex system, it's distributed computing, it's an operating system. I mean this is kind of the world. Can you guys talk about the Gremlin situation of how you guys are attacking these new problems and these new opportunities that are emerging? >> Sure, I can talk about that. So yeah, one of the things I've always specialized in over the last ten years is chaos engineering. And so the idea of chaos engineering is that you're injecting failure on purpose to uncover weaknesses. So that's really important in distributed systems, with distributed cloud computing, all these different services that you're kind of putting together. But the idea is if you can inject failure, you can actually figure out what happens when I inject that small failure? And then you can actually go ahead and fix it. One of the things I like to say to people is focus on what your top five critical systems are. Let's fix those first. Don't go for low hanging fruit. Fix the biggest problems first, get rid of the biggest amount of pain that you have as a company, and then you can go ahead and actually... If you think about the Pareto principle, the 80/20 rule, if you fix 20% of your biggest problems, you'll actually solve 80% of your issues. That always works. It's something that I've done while working at the National Australia Bank doing chaos engineering. Also at Gremlin, at Dropbox and I help a lot of our customers do that too. >> Alberto, talk about the mindset involved. It's almost counterintuitive. Whoa! Whoa! Risk! The biggest system. >> Yeah. >> I don't want to touch those. They're working fine right now. And then these problems just gestate, they kind of hang around, like the bin fire in the kitchen: this is okay, I don't want to touch it. The house is still working. So this is kind of a new mindset. Could you talk about what your take is on that? 
Is the industry there? I mean, it was kind of a corner case, you had Netflix, you had Chaos Monkey in those days and then now it's a DevOps practice for a lot of folks, you guys are involved in that. What's the appetite and what's the progress of chaos engineering in the mainstream? >> Yeah, it's interesting that you mentioned DevOps, and recently Gartner came up with a new, revisited DevOps framework that has chaos engineering in the middle of the lifecycle management of your application. And the reality is that systems have become so complex in infrastructure, with so many layers of abstraction. You have hundreds of services if you're doing microservices, but even if you're not doing microservices, you have so many applications connected to each other, building really complex workflows and automation flows. It's impossible for traditional QA to really understand where the vulnerabilities are in terms of resiliency, in terms of quality. Too often the production environment is also too different from the staging environment, and so you need a fundamentally different approach to go and find where your weaknesses are and find them before they happen, before you end up finding yourself in a situation like the one we're in today and you are not prepared. And so, so much of what we talk about is giving a tool and the methodology for people to go and find these vulnerabilities. It's not so much about creating chaos, but it's about managing the chaos that is built into our current systems and exposing those vulnerabilities before they create problems. And so that's a very scientific methodology and tooling that we bring to market and we help customers with. >> Tammy, I want to get your thoughts on something. 
We used to riff a lot. In our 10th year of theCUBE, we've had a lot of conversations, we've riffed over the years, but you know when the surge of Amazon Web Services came out it was pretty obvious that cloud's amazing, and look at the startups that were born, you mentioned Dropbox, you worked there. These companies, all these born-in-the-cloud, hyperscale companies built from scratch, great way to scale up. And we used to joke about Google, people would say, "I would like a cloud like Google," but no one has Google's use cases. And Google really pioneered the SRE concept, and you got to give 'em a lot of props for that. But now we're kind of getting to a world where it's becoming Google-like. There's more scale now than ever before. It's not a corner case, it's becoming more popular and more of a preferred architecture, this large scale. What's your assessment of the mainstream enterprises, how far are they in your mind, are they there with chaos? Are they close? Are they doing it? How does someone develop an SRE practice to get the Google-like scale? 'Cause Google has an amazing network, they got large scale cloud, they have SREs, they've been doing it for years. How does a company that's transforming their IT (laughs) have SREs? >> That's a great question. I get asked this a lot as well. One of our goals at Gremlin is to help make the internet more reliable for everybody. Everyone using the internet, all of the engineers who are trying to build reliable services, and so I'm often asked by companies all over the world, how do we create an SRE practice and how do we practice chaos engineering? You can actually get started rolling out your SRE program. Based on my experiences, I've done it. So when I worked at Dropbox, I worked with a lot of people who had been at Google, they'd been at YouTube, they were there when SRE was rolled out across those companies, and then they brought those learnings to Dropbox, and I learned from them. 
But also the interesting thing is if you look at enterprise companies, such as large banks. For example, I worked at the National Australia Bank for six years, and we actually did a lot of work that I would consider chaos engineering and SRE practices. So for example, we would do large scale disaster recovery, and that's where you'd fail over an entire data center to a secret data center in an unknown location, and the reason is 'cause you're checking to make sure that everything operates okay if there's a nuclear blast. That's actually what you have to do and you have to do that practice every quarter. But if you think about it, it's not very good to only do it once a quarter. You really want to be practicing chaos engineering and injecting failure on purpose. I think actually, I prefer to do it three times a week, so I do it a lot. But I'm also someone who likes to work out a lot and be fit all the time so I know that if you do something regularly, you get great results. So that's what I always tell everyone. >> Yeah, get the reps in, as we say, get stronger, get the muscle memory. >> Yep, exactly. >> Guys, talk about the event that's coming up. You've got an event that was scheduled as a physical event and then you were right in the planning mode and then the crisis hits. You're going digital, going virtual, it's really digital, but it's digital. It's on the internet. So how are you guys thinking about this? I know it's out there. It's April 21st. Can you share some specifics around the event? Who should be attending and how do they get involved online? >> Yeah, the event really came together about a month ago when we started to see all the cancellations happening across the industry because of COVID-19, and we were extremely engaged in the community and we have a lot of talks and we were seeing a lot of conferences just dropping and so speakers losing their opportunity to really share their knowledge with respect to how you do reliability and topics that we focus on. 
And so we quickly pivoted as a company and created a new online event to give everyone in the community the opportunity to just fail over to a new event, as the conference name says, and have those speakers who have lost their speaking slots have a new opportunity to go share their knowledge. And so that came together really quickly, we shared the idea with a dozen of our partners and everyone liked it and all of a sudden this thing took off like crazy, and in just a month we are approaching 4,000 registrations, we have over 30 partners signed up and supporting the initiative. A lot of past partners as well covering the event. So it was impressive to see the amount of interest that we were able to generate in such a short amount of time. And really, this is a conference for anybody who is interested in resiliency. If you want to learn from the best on how to build business continuity across systems, people and processes, this is a great opportunity at no cost really. It's a free conference. >> And the target persona and the audience you want to have attend is what? SREs or folks doing architectural work? What's the target >> Yeah. >> person to attend? >> Architects, SREs, developers, business leaders who care about the quality and the reliability of their applications, who need to help create a framework and a mindset for their organizations that speaks to what Tammy was saying a minute ago. Having that constant practice on a daily basis of going and finding how to improve things. >> You know, Tammy, we've been going to physical events with theCUBE and extracting the signal from the noise and distributing it digitally for 10 years and I got to ask you because now that those events have gone away, you talk about chaos and injecting failure. Doing these digital events is not as easy as just live streaming, it's hard to replicate the value of a physical event, years of experience and standards, roles and responsibilities to digital. 
A different consumption environment, it's asynchronous, you're trying to create a synchronous environment. It's its own complex system, so I think a lot of people are experimenting and learning (laughs) from these events because it's pretty chaotic. So, I'd love to get your thoughts on how you look at these digital events as a chaos engineer. How should people be looking at these events? How are you guys looking at... I mean, obviously you want to get the program going, get people out there, get the content, but to iterate on this, how do you view this? >> It is really different. So I actually like to compare it to fire drills in SRE. So often what you do there is you actually create a fake incident or a fake issue, so you're just saying, "Let's have a fire drill." Similar to when you're in a building and you have a fire drill that goes off and you have wardens and everything and you all have to go outside. So we can do that in this new world that we're all in all of a sudden. A lot of people have never run an online event and now all of a sudden they have to. So what I would say is like, do a fire drill. Run a fake one before you do the actual one to make sure that everything does work okay. My other tip is make sure that you have backup plans. Backup plans on backup plans on backup plans. As an SRE, I always have at least three to five backup plans. I'm not just saying plan A and plan B, but there's also a C, D, and E and I think that's very important and even when you're considering technology, one of the things we say with chaos engineering is, if you're using one service, inject failure and make sure that you can fail over to a different alternative service in case something goes wrong. >> Yeah, hence the Failover Conference, which is the name of the conference. (chuckles) >> Exactly! >> Yeah, well we certainly are going to be sending a digital reporter there, virtually. If you need any backup plans, obviously we have the remote interviews here. 
If you need any help, let us know, really appreciate it. Great to see you guys. And thanks for sharing. Any final thoughts on the conference? What happens when we get through the other side of this? I'll give you guys a final word. We'll start with Alberto, with you first. >> Yeah, I think when we are on the other side of this, we'll understand even more the importance of effective resilience, architecting and testing. As a provider of tools and methodologies for that, we think we will be able to help customers make a significant leap forward on that side. And the conference is just super exciting. I think it's going to be a great event. I encourage everyone to participate. We have a tremendous lineup of speakers who have incredible reputations in their fields, so I'm really happy and excited about the work that the team has been able to do with our partners to put together this type of event. >> Okay, Tammy. >> Yeah, for me, I'm actually going to be doing the opening keynote for the conference and the topic that I'm speaking about is that reliability matters more now than ever. And I'll be sharing some bizarre, weird incidents that I have worked on myself, that I have experienced, really critical strange issues that have come up. But yeah, I'm really looking forward to sharing that with everybody else, so please come along, it's free. You can join from your own home and we can all be there together to support each other. >> You got great community support and there's a lot of partners, press, media, ecosystem and customers, so congratulations Gremlin, having a conference on April 21st called the Failover Conference. TheCUBE and SiliconANGLE have a digital reporter there that will be covering the news. Thanks for coming on and sharing. I appreciate the time. I'm John Furrier in the Palo Alto studio with a remote interview with Gremlin around their Failover Conference, April 21st. 
It's really demonstrating, in my opinion, the at-scale problems that we've been working on in the industry, now more applicable than ever before as we get post-pandemic with COVID-19. Thanks for watching. Be back. (calm music)

Published Date : Apr 8 2020


ENTITIES

Entity	Category	Confidence
Tammy	PERSON	0.99+
Alberto Fernando	PERSON	0.99+
Alberto	PERSON	0.99+
80%	QUANTITY	0.99+
John Furrier	PERSON	0.99+
Italy	LOCATION	0.99+
20%	QUANTITY	0.99+
Milan	LOCATION	0.99+
Palo Alto	LOCATION	0.99+
April 21st	DATE	0.99+
4,000 registrations	QUANTITY	0.99+
Google	ORGANIZATION	0.99+
Bergamo	LOCATION	0.99+
six years	QUANTITY	0.99+
Dropbox	ORGANIZATION	0.99+
National Australia Bank	ORGANIZATION	0.99+
Alberto Farronato	PERSON	0.99+
COVID-19	OTHER	0.99+
10 years	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
April 2020	DATE	0.99+
Tammy Butow	PERSON	0.99+
Gremlin	PERSON	0.99+
One	QUANTITY	0.99+
Boston	LOCATION	0.99+
over 30 partners	QUANTITY	0.99+
Gartner	ORGANIZATION	0.99+
10th unit	QUANTITY	0.99+
YouTube	ORGANIZATION	0.99+
theCUBE	ORGANIZATION	0.99+
first	QUANTITY	0.98+
Netflix	ORGANIZATION	0.98+
today	DATE	0.98+
one service	QUANTITY	0.97+
once a quarter	QUANTITY	0.97+
one	QUANTITY	0.97+
Gremlin	ORGANIZATION	0.97+
SiliconANGLE	ORGANIZATION	0.96+
Failover Conference	EVENT	0.96+
500 million customers	QUANTITY	0.96+
TheCUBE	ORGANIZATION	0.96+
hundreds of services	QUANTITY	0.95+
Gremlin	LOCATION	0.95+
first one	QUANTITY	0.95+
three times a week	QUANTITY	0.95+
five backup plans	QUANTITY	0.94+
two very important metrics	QUANTITY	0.94+
a month ago	DATE	0.94+
five critical systems	QUANTITY	0.93+
a month	QUANTITY	0.92+
a dozen	QUANTITY	0.89+
Googles	ORGANIZATION	0.88+
theCUBE Conversation	EVENT	0.88+
SRE	ORGANIZATION	0.83+
DevOps	TITLE	0.83+
two two great guests	QUANTITY	0.82+
CUBE	COMMERCIAL_ITEM	0.82+
pandemic	EVENT	0.81+

UNLISTED FOR REVIEW Tammy Butow & Alberto Farronato, Gremlin | CUBE Conversation, April 2020

from the cube studios in Palo Alto in Boston connecting with thought leaders all around the world this is a cube conversation hello everyone welcome to the cube conversation here in Palo Alto our studios of the cube I'm showing for your host we're here during the crisis of Cove in nineteen doing remote interviews I come into the studio we've got a quarantine crew or here getting the interviews getting the stories out there and of course the story we continue to talk about is the impact of Kovan 19 and how we're all getting back to work either working at home or working remotely and virtually certainly but as things start to change we can start to see events mostly digital events and we're here to talk about an event that's coming up called the failover conference from gremlin which is now gone digital because it's April 21st but I think what's important about this conversation that I want to get into is not only talk about the event that's coming up but talk about these scale problems that are being highlighted by this change in work environment working at home we've been talking about the at scale problems that we're seeing whether it's a flood of surge of traffic and the chaos that's ensuing across the world with this pandemic so I'm excited have two great guests Alberto Ferran auto senior vice president marketing gremlin and Tammy Bhutto principal site reliability engineer or SRE guys thanks for coming on appreciate it thank you Thank You Alberto I want to get to you first you know we've known each other before you've been in this industry we all we've been all been talking about the cloud native cloud scale for some time it's kind of inside the ropes it's inside baseball Tami your site reliability engineer everyone knows Google knows how well cloud works this is large-scale stuff now with The Cove in 19 we're starting to see the average person my brother my sister our family members and people around the world go oh my god this is really a high impact this 
change of behavior the surge of you know whether whether it's traffic on the internet or work at home tools that are inadequate you start to see these statistical things that were planned for not working well and this actually Maps the things that we've been talking about it in our industry Alberto you've been on this how you guys doing and what's your what's your take on this situation we're in right now yeah yeah we're we're doing pretty well as a company we were born as a distributed organization to begin with so for us working in a distributed environment from all over the world is is common practice day-to-day personally you know I'm originally from Italy my parents my family is Milan and Bergen audible places so I have to follow the news with extra care and so much in me it becomes so much clearer nowadays that technology is not just a powerful tool to enable our businesses but it also is so critical for our day-to-day life and thanks to you know video calls I can easily talk to my family back there every day Wow so that's that's really important so yes we've been talking for a long time as you mentioned about complex systems at scale and reliability often in the context of mission-critical applications but more and more these systems need to be reliable also when it comes to back office systems that enable people to continue to work on a daily basis yeah well our hearts go out to your family and your friends in Italy and hope everyone's stay safe there no that was a tough situation continues to be a challenge Tammy I want to get your thoughts how is life going for you you're a sight reliable engineer what you deal with on the tech side is now happening in the real world it's it's almost it's mind-blowing and to me that we're seeing these these things happen it's it's a paradigm that needs attention and whew look at it as a sre dealing a most from a tech side now seeing it play out in real life it's such an interesting situation really terrible so one of the 
things that I specialize in as a site reliability engineer is incident management and so for example I previously worked at Dropbox where I was you know the incident manager on call for 500 million customers you know it's like 24/7 and these large-scale incidents you really need to be able to act fast there are two very important metrics that we track and care about as a site reliability engineer the first one is mean time to detection how fast can you detect what something is happening obviously if you detect an issue faster and you've got a better chance of making the impact lower so you can contain the blast radius I like to explain it to people like if you have a fire in your sauce bin in your kitchen and you put it out that's way better than waiting until your entire house is on fire and the other metric is mean time to resolution so how long does it take you to recover from the situation so yeah this is a large-scale global incident right now that we're in yeah I know you guys do a lot of talk about chaos theory and that applies a lot of math involved we all know that but I think when you go look at the real world this is gonna be table stakes and you know there's now a line in the sand here you know pre-pandemic post pandemic and i think you guys have an interesting company gremlin in the sense that this is this is a complex system and if you think about the world we're going to be living in whether it's digital events that you guys are have one coming up or how to work at home or tools that humans are going to be using it's going to be working with systems right so you have this new paradigm gonna be upon us pretty quickly and it's not just buying software mechanisms or software it's a complex system it's distributed computing and operating so I mean this is kind of the world can you guys talk about the gremlin situation of how you guys are attacking these new problems and these new opportunities that are emerging one of the things that I've always 
specialized in over the last 10 years is chaos engineering and so the idea of chaos engineering is that you're injecting failure on purpose to uncover weaknesses so that's really important in distributed systems with distributed you know cloud computing all these different services that you're kind of putting together but the idea is if you can inject failure you can actually figure out what happens when I inject that small failure and then you can actually go ahead and fix it one of the things I like to say to people is you know focus on what your top 5 critical systems are let's fix those first don't go for low-hanging fruit fix the biggest problems first get rid of the biggest amount of pain that you have as a company and then you can go ahead and like actually if you think about Pareto principle the 80/20 rule if you fix 20% of your biggest problems you actually solve 80% of your issues that always works something that I've done while working at National Australia Bank doing chaos engineering also what gremlin at Dropbox and I help a lot of our customers do that to albariño talk about the mindset involved it's almost counterintuitive whoa-oh-oh risk the biggest system and I don't want to touch those there working fine right now and then these problems just gestate they kind of hang around to the bin in the kitchen fire you know mist okay I don't want to touch it the house is still working so this is kind of a new mindset could you talk about what your take is on that is the industry there I mean oh it was a kind of a corner case you know you had Netflix you had the chaos monkey those days and then now it's the DevOps practice for a lot of folks you guys are involved in that what's the what's the appetite what's the progress of chaos engineering and mainstream yeah it's interesting that you mentioned DevOps and you know recently Gartner came up with a new revisited devil scream work that has chaos engineering in the middle of the lifecycle of your application 
and the reality is that systems have become so complex in infrastructure so many layers of abstractions you have hundreds of services if you're doing micro services but even if you're not doing micro services you have so many applications connected to each other build really complex workflows and automation flows it's impossible for traditional QA to really understand well the vulnerability are in terms of resiliency in terms of quality too often the production environment is also too different from the staging environment and so you need a fundamentally different approach to go and find where your weaknesses are and find them before they happen before you end up finding yourself in a situation like the one we're in today and you're not prepared and so much of what we talk about is giving it >> and the methodology for people to go and find these vulnerabilities not so much about creating cause chaos but it's about managing sales that is built into our current system and exposing those vulnerabilities before they create problem and so that's a very scientific methodology and and and tooling that we would bring to market and we help customers with Tammy I want to get your thoughts on so you know we used to riff a lot of to our 10th you know cube we've had a lot of conversation we've ripped over the over the years but you know when the surge of Amazon Web Services came out as pretty obvious the clouds amazing and look at the startups that were born you mentioned Dropbox you work there these comings and all these born in the cloud these hyper scale comes built from scratch great way to scale up and we used to joke about Google people say I would like a cloud like Google but no one has Google's use cases and Google really pioneered the sre concept and you gotta give them a lot of props for that but now we're kind of getting to a world where it's becoming Google like there's more scale now than ever before it's not a corner case it's becoming more popular and more of a 
preferred architecture, this large scale. What's your assessment of the mainstream enterprises? How far along are they, in your mind? Are they there with chaos? Are they close? How are they doing it? How does someone develop an SRE practice to get to Google-like scale? Because Google has an amazing network, they've got large-scale cloud, they have SREs, they've been doing it for years. How does a company that's transforming their IT have the expertise? >> It's a great question. I get asked this a lot as well. One of our goals at Gremlin is to help make the Internet more reliable for everybody: everyone using the Internet, and all of the engineers who are trying to build reliable services. So I'm often asked by companies all over the world, how do we create an SRE practice, and how do we practice chaos engineering? Here is how you can get started rolling out your SRE program, based on my experience of having done it. When I worked at Dropbox, I worked with a lot of people who had been at Google, they'd been at YouTube. They were there when SRE was rolled out across those companies, and then they brought those learnings to Dropbox, and I learned from them. But the interesting thing is, if you look at enterprise companies, large banks for example: I worked at National Australia Bank for six years, and we actually did a lot of work that I would consider chaos engineering and SRE practices. For example, we would do large-scale disaster recovery. That's where you fail over an entire data center to a secret data center in an unknown location, and the reason is that you're checking to make sure that everything operates okay if there's a nuclear blast. That's actually what you have to do, and you have to do that practice every quarter. But if you think about it, it's not very good to only do it once a quarter. You really want to be practicing chaos engineering and injecting failure. I think, actually, I prefer to do it three times a week. So I
do it a lot, but I'm also someone who likes to work out a lot and be fit all the time, so I know that if you do something regularly, you get great results. That's what I always tell others. >> Yeah, get the reps in, as we say. Get stronger, get the muscle memory. Guys, talk about the event that's coming up. You've got an event that was scheduled as a physical event, you were right in planning mode, and then the crisis hit. You're going digital, going virtual. It's really digital, but it's digital that's on the internet. So how are you guys thinking about this? I know it's out there, it's April 21st. Can you share some specifics around the event: who should be attending and how they get involved online? >> Yeah. The event really came together about a month ago, when we started to see all the cancellations happening across the industry because of COVID-19. We are extremely engaged in the community, we have a lot of talks, and we were seeing a lot of conferences just dropping, and so speakers losing their opportunity to share their knowledge on how you do reliability and the topics that we focus on. So we quickly pivoted as a company and created a new online event to give everyone in the community the opportunity to fail over to a new event, as the conference name says, and to give those speakers who have lost their speaking slots a new opportunity to go share their knowledge. That came together really quickly. We shared the idea with a dozen of our partners and everyone liked it, and all of a sudden this thing took off like crazy. In just a month we are approaching four thousand registrations, we have over 30 partners signed up and supporting the initiative, and a lot of press partners as well covering the event. It was impressive to see the amount of interest that we were able to generate in such a short amount of time. Really, this is a conference for anybody who is interested
in resilience. If you want to learn from the best how to build business continuity for systems, people, and processes, this is a great opportunity at no cost. I mean, it's a free conference. >> And the target persona, the audience you want to have attend, is what? SREs, or folks doing architectural work? What's the target? >> Yes, the target attendees are architects, SREs, developers, and business leaders who care about the quality and reliability of their applications, who need to help create a framework and a mindset for their organization. That speaks to what Tammy was saying a minute ago: having that constant practice on a daily basis of going and finding how to improve things. >> Tammy, we've been going to physical events with theCUBE, extracting the signal from the noise and distributing it digitally, for ten years. I've got to ask you, now that those events have gone away. You talk about chaos and injecting failure. Doing these digital events is not as easy as just live streaming. It's hard to replicate the value of a physical event: years of experience and standards, roles and responsibilities. Digital is a different consumption environment. It's asynchronous, and you're trying to create a synchronous environment. It's its own complex system. So I think a lot of people are experimenting and learning from these events, because it's pretty chaotic. I'd love to get your thoughts on how you look at these digital events as a chaos engineer. How should people be looking at these events? I was looking at it, and I also want to get the program going, get people out there, get the content, but you have to iterate on this. How do you view this? >> It is really different. I like to compare it to fire drills in SRE. Often what you do there is create a fake incident or a fake issue. You're saying, let's have a fire drill, similar to when you're in a building and you have a fire drill that goes
off, you have wardens and everything, and you all have to go outside. We can do that in this new world that we're all in. All of a sudden, a lot of people who have never run an online event now have to. So what I would say is, do a fire drill. Run a fake one before you do the actual one, to make sure that everything does work okay. My other tip is, make sure that you have backup plans: backup plans on backup plans on backup plans. As an SRE, I always have at least three to five backup plans. I'm not just saying Plan A and Plan B, there's also a C, D, and E, and I think that's very important. Even when you're considering technology, one of the things we say with chaos engineering is, if you're using one service, inject failure and make sure that you can fail over to a different, alternative service in case something goes wrong. >> Yeah, hence the Failover Conference, which is the name of the conference. >> Yeah. >> Well, we certainly are going to be sending a digital reporter there virtually. If you need any backup plans, obviously we have the remote interviews here. If you need any help, let us know. Really appreciate it. Great to see you guys, and thanks for sharing. Any final thoughts on the conference? What happens when we get through to the other side of this? I'll give you guys the final word. We'll start with Alberto, you first. >> Yeah, I think when we are on the other side of this, we'll understand even more the importance of effective resilience architecting and testing. As a provider of tools and methodologies for that, we think we will be able to help customers make a significant leap forward on that side. And the conference is just super exciting. I think it's going to be great. I encourage everyone to participate. We have a tremendous lineup of speakers who have incredible reputations in their fields, so I'm really happy and excited about the work that the team has
been able to do with our partners to put together this type of event. >> Okay, Tammy? >> Yes, I'm actually going to be doing the opening keynote for the conference, and the topic that I'm speaking about is that reliability matters more now than ever. I'll be sharing some bizarre, weird incidents that I've worked on and experienced myself, really critical, strange issues that have come up. I'm really looking forward to sharing that with everybody else. So please come along. It's free, you can join from your own home, and we can all be there together to support each other. >> You've got great community support, and there are a lot of partners, press, media, an ecosystem, and customers. So congratulations. Gremlin is having a conference on April 21st called the Failover Conference. theCUBE and SiliconANGLE will have a digital reporter there covering the news. Thanks for coming on and sharing, appreciate the time. I'm John Furrier. We're here in the Palo Alto studios with a remote interview with Gremlin around their Failover Conference, April 21st. It's really demonstrating, in my opinion, the at-scale problems that we've been working on in the industry, now more applicable than ever before as we get post-pandemic with COVID-19. Thanks for watching. We'll be back. (music)
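
Tammy's tip in the conversation above, to inject failure into one service and confirm that you can fail over to an alternative, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: `FailureInjector`, `call_with_failover`, and the two streaming functions are hypothetical names invented for this example, and none of it represents Gremlin's product or API.

```python
import random


class InjectedFailure(Exception):
    """Raised when the experiment deliberately breaks a call."""


class FailureInjector:
    """Hypothetical helper: fails a wrapped call with a given probability."""

    def __init__(self, failure_rate, seed=None):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)

    def wrap(self, func):
        def wrapped(*args, **kwargs):
            # random() is in [0.0, 1.0), so a rate of 1.0 always fails
            # and a rate of 0.0 never fails.
            if self._rng.random() < self.failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        wrapped.__name__ = func.__name__
        return wrapped


def call_with_failover(services, *args):
    """Try each service in order (plan A, B, C, ...); return (name, result)."""
    errors = []
    for service in services:
        try:
            return service.__name__, service(*args)
        except Exception as exc:
            errors.append(f"{service.__name__}: {exc}")
    raise RuntimeError("all backup plans failed: " + "; ".join(errors))


# Hypothetical stand-ins for a primary and an alternative service.
def primary_stream(event):
    return f"{event} via primary"


def backup_stream(event):
    return f"{event} via backup"


# Fire drill: break the primary on every call and check the backup takes over.
injector = FailureInjector(failure_rate=1.0)
broken_primary = injector.wrap(primary_stream)
used, result = call_with_failover([broken_primary, backup_stream], "keynote")
```

Running the drill with `failure_rate=1.0` forces every call to the primary to fail, so the interesting check is simply whether `used` names the backup. Passing a longer list of services corresponds to the "plan C, D, and E" idea from the interview.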

Published Date : Apr 7 2020


