Tammy Butow & Alberto Farronato, Gremlin | CUBE Conversation, April 2020

From theCUBE studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a CUBE Conversation.

Hello everyone, welcome to this CUBE Conversation here in Palo Alto, at the studios of theCUBE. I'm John Furrier, your host. We're here during the COVID-19 crisis doing remote interviews. I come into the studio, and we've got a quarantine crew here getting the interviews and getting the stories out there. And of course the story we continue to talk about is the impact of COVID-19 and how we're all getting back to work, either working at home or working remotely and virtually. But as things start to change, we're starting to see events, mostly digital events, and we're here to talk about an event that's coming up called the Failover Conference from Gremlin, which has now gone digital; it's on April 21st. But what I think is important about this conversation is not only to talk about the event that's coming up, but to talk about the scale problems that are being highlighted by this change in work environment, working at home. We've been talking about the at-scale problems that we're seeing, whether it's a flood or surge of traffic, and the chaos that's ensuing across the world with this pandemic. So I'm excited to have two great guests: Alberto Farronato, senior vice president of marketing at Gremlin, and Tammy Butow, principal site reliability engineer, or SRE. Guys, thanks for coming on. Appreciate it.

Thank you.

Thank you.

Alberto, I want to get to you first. We've known each other before, and you've been in this industry. We've all been talking about cloud native and cloud scale for some time. It's kind of inside the ropes, it's inside baseball. Tammy, you're a site reliability engineer; everyone who knows Google knows how well cloud works. This is large-scale stuff. Now with COVID-19, we're starting to see the average person, my brother, my sister, our family members, and people around the world go, oh my god, this is really high impact, this
change of behavior, the surge, whether it's traffic on the internet or work-at-home tools that are inadequate. You start to see these statistical things that were planned for not working well, and this actually maps to the things that we've been talking about in our industry. Alberto, you've been on this. How are you guys doing, and what's your take on this situation we're in right now?

Yeah, we're doing pretty well. As a company, we were born as a distributed organization to begin with, so for us working in a distributed environment from all over the world is common practice day-to-day. Personally, I'm originally from Italy. My parents, my family are in Milan and Bergamo, of all places, so I have to follow the news with extra care. And to me it becomes so much clearer nowadays that technology is not just a powerful tool to enable our businesses, but is also so critical for our day-to-day life. Thanks to video calls, I can easily talk to my family back there every day. So that's really important. So yes, we've been talking for a long time, as you mentioned, about complex systems at scale and reliability, often in the context of mission-critical applications. But more and more, these systems also need to be reliable when it comes to the back-office systems that enable people to continue to work on a daily basis.

Yeah, well, our hearts go out to your family and your friends in Italy, and we hope everyone stays safe there. I know that was a tough situation, and it continues to be a challenge. Tammy, I want to get your thoughts. How is life going for you? You're a site reliability engineer. What you deal with on the tech side is now happening in the real world. It's mind-blowing to me that we're seeing these things happen. It's a paradigm that needs attention, and you look at it as an SRE, dealing with it mostly from a tech side, now seeing it play out in real life.

It's such an interesting situation, really terrible. So one of the
things that I specialize in as a site reliability engineer is incident management. For example, I previously worked at Dropbox, where I was the incident manager on call for 500 million customers. It's 24/7, and in these large-scale incidents you really need to be able to act fast. There are two very important metrics that we track and care about as site reliability engineers. The first one is mean time to detection: how fast can you detect that something is happening? Obviously, if you detect an issue faster, you've got a better chance of making the impact lower, so you can contain the blast radius. I like to explain it to people like this: if you have a fire in a bin in your kitchen and you put it out, that's way better than waiting until your entire house is on fire. The other metric is mean time to resolution: how long does it take you to recover from the situation? So yeah, this is a large-scale global incident that we're in right now.

Yeah, I know you guys talk a lot about chaos theory, and there's a lot of math involved; we all know that. But I think when you look at the real world, this is going to be table stakes. There's now a line in the sand here: pre-pandemic, post-pandemic. And I think you guys have an interesting company in Gremlin, in the sense that this is a complex system. If you think about the world we're going to be living in, whether it's digital events (you guys have one coming up), or how to work at home, or the tools that humans are going to be using, it's going to be about working with systems, right? So you have this new paradigm that's going to be upon us pretty quickly, and it's not just buying software mechanisms or software. It's a complex system; it's distributed computing and operating. So this is kind of the world. Can you guys talk about the Gremlin situation, how you guys are attacking these new problems and these new opportunities that are emerging?

One of the things that I've always
specialized in over the last 10 years is chaos engineering. The idea of chaos engineering is that you're injecting failure on purpose to uncover weaknesses. That's really important in distributed systems, with distributed cloud computing and all these different services that you're putting together. The idea is that if you can inject a small failure, you can figure out what happens when you inject it, and then you can go ahead and fix it. One of the things I like to say to people is: focus on what your top 5 critical systems are and fix those first. Don't go for low-hanging fruit. Fix the biggest problems first, and get rid of the biggest amount of pain that you have as a company. If you think about the Pareto principle, the 80/20 rule, if you fix 20% of your biggest problems, you actually solve 80% of your issues. That always works. It's something that I've done while working at National Australia Bank doing chaos engineering, and also at Gremlin and at Dropbox, and I help a lot of our customers do that too.

Alberto, talk about the mindset involved. It's almost counterintuitive: whoa, risk the biggest systems? I don't want to touch those; they're working fine right now. And then these problems just gestate. They kind of hang around, like the bin fire in the kitchen: it's okay, I don't want to touch it, the house is still working. So this is kind of a new mindset. Could you talk about what your take is on that? Is the industry there? I mean, it was kind of a corner case. You had Netflix, you had Chaos Monkey in those days, and now it's a DevOps practice for a lot of folks. You guys are involved in that. What's the appetite, what's the progress of chaos engineering in the mainstream?

Yeah, it's interesting that you mention DevOps. Recently Gartner came up with a new, revisited DevOps framework that has chaos engineering in the middle of the lifecycle of your application,
and the reality is that systems have become so complex. In infrastructure there are so many layers of abstraction. You have hundreds of services if you're doing microservices, but even if you're not doing microservices, you have so many applications connected to each other to build really complex workflows and automation flows. It's impossible for traditional QA to really understand where the vulnerabilities are in terms of resiliency and in terms of quality. Too often the production environment is also too different from the staging environment. So you need a fundamentally different approach to go and find where your weaknesses are, and to find them before failures happen, before you end up in a situation like the one we're in today and you're not prepared. So much of what we talk about is giving people the tools and the methodology to go and find these vulnerabilities. It's not so much about creating chaos; it's about managing the chaos that is built into our current systems and exposing those vulnerabilities before they create problems. That's a very scientific methodology and tooling that we bring to market and that we help customers with.

Tammy, I want to get your thoughts. You know, we're in our 10th year with theCUBE, and we've had a lot of conversations and riffed over the years. When the surge of Amazon Web Services came out, it was pretty obvious the cloud was amazing. Look at the startups that were born. You mentioned Dropbox; you worked there. These companies were all born in the cloud, these hyperscale companies built from scratch. It's a great way to scale up. And we used to joke about Google. People would say, I would like a cloud like Google, but no one has Google's use cases. Google really pioneered the SRE concept, and you've got to give them a lot of props for that. But now we're getting to a world where it's becoming Google-like. There's more scale now than ever before. It's not a corner case; it's becoming more popular and more of a
preferred architecture, this large scale. What's your assessment of the mainstream enterprises? How far along are they in your mind? Are they there with chaos engineering? Are they close? How are they doing? How does someone develop an SRE practice to get to Google-like scale? Because Google has an amazing network, they've got a large-scale cloud, they have SREs, and they've been doing it for years. How does a company that's transforming its IT get that expertise?

It's a great question, and I get asked this a lot as well. One of our goals at Gremlin is to help make the Internet more reliable for everybody: everyone using the Internet, and all of the engineers who are trying to build reliable services. So I'm often asked by companies all over the world, how do we create an SRE practice, and how do we practice chaos engineering? As for how you can actually get started rolling out your SRE program, I can speak from experience, because I've done it. When I worked at Dropbox, I worked with a lot of people who had been at Google and at YouTube. They were there when SRE was rolled out across those companies, they brought those learnings to Dropbox, and I learned from them. But the interesting thing is, if you look at enterprise companies such as large banks, for example, I worked at National Australia Bank for six years, and we actually did a lot of work that I would consider chaos engineering and SRE practice. For example, we would do large-scale disaster recovery. That's where you fail over an entire data center to a secret data center in an unknown location, and the reason is that you're checking to make sure that everything operates okay if there's a nuclear blast. That's actually what you have to do, and you have to do that practice every quarter. But if you think about it, it's not very good to only do it once a quarter. You really want to be practicing chaos engineering and injecting failure more often than that. Actually, I prefer to do it three times a week. I
do it a lot, but I'm also someone who likes to work out a lot and be fit all the time, so I know that if you do something regularly you get great results. That's what I always tell people.

Yeah, get the reps in, as we say. Get stronger, build the muscle memory. Guys, talk about the event that's coming up. You had an event that was scheduled as a physical event, you were right in planning mode, and then the crisis hit. You're going digital, going virtual. It's really digital, but it's digital that's on the internet. So how are you guys thinking about this? I know it's out there; it's April 21st. Can you share some specifics around the event, who should be attending, and how they get involved online?

Yeah. The event really came together about a month ago, when we started to see all the cancellations happening across the industry because of COVID-19. We are extremely engaged in the community and we give a lot of talks, and we were seeing a lot of conferences just dropping, and so speakers were losing their opportunity to share their knowledge about how you do reliability and the topics that we focus on. So we quickly pivoted as a company and created a new online event to give everyone in the community the opportunity to fail over to a new event, as the conference name says, and to give those speakers who have lost their speaking slots a new opportunity to go share their knowledge. That came together really quickly. We shared the idea with a dozen of our partners and everyone liked it, and all of a sudden this thing took off like crazy. In just a month, we are approaching four thousand registrations, and we have over 30 partners signed up and supporting the initiative, with a lot of press partners covering the event as well. So it was impressive to see the amount of interest that we were able to generate in such a short amount of time. And really, this is a conference for anybody who is interested
in resilience. If you want to learn from the best how to build business continuity across systems, people, and processes, this is a great opportunity at no cost. It's a free conference.

And the target persona, the audience you want to have attend, is what? SREs or folks doing architectural work? What's the target?

Yes, the attendees are architects, SREs, developers, and business leaders who care about the quality and reliability of their applications, and who need to help create a framework and a mindset for their organization. That speaks to what Tammy was saying a minute ago: having that constant practice, on a daily basis, of going and finding how to improve things.

You know, Tammy, we've been going to physical events with theCUBE, extracting the signal from the noise and distributing it digitally, for ten years. And I've got to ask you, now that those events have gone away: you talk about chaos and injecting failure. Doing these digital events is not as easy as just live streaming. It's hard to replicate the value of a physical event, with its years of experience, standards, roles, and responsibilities, in digital, with different consumption environments. It's asynchronous, and you're trying to create a synchronous environment. It's its own complex system. So I think a lot of people are experimenting and learning from these events, because it's pretty chaotic. I'd love to get your thoughts on how you look at these digital events as a chaos engineer. How should people be looking at these events? How are you looking at it? You want to get the program going, get people out there, and get the content, but you have to iterate on this. How do you view this?

It is really different. I actually like to compare it to fire drills in SRE. Often what you do there is you actually create a fake incident or a fake issue. You're saying, let's have a fire drill, similar to when you're in a building and you have a fire drill that goes
off, you have wardens and everything, and you all have to go outside. We can do that in this new world that we're all in. All of a sudden, a lot of people who have never run an online event now have to. So what I would say is, do a fire drill. Run a fake one before you do the actual one, to make sure that everything does work okay. My other tip is to make sure that you have backup plans: backup plans on backup plans on backup plans. As an SRE, I always have at least three to five backup plans. I'm not just saying plan A and plan B; there's also a C, D, and E, and I think that's very important. Even when you're considering technology, one of the things we say with chaos engineering is: if you're using one service, inject failure and make sure that you can fail over to a different, alternative service in case something goes wrong.

Yeah, hence the Failover Conference, which is the name of the conference.

Yeah.

Well, we certainly are going to be sending a digital reporter there virtually. If you need any backup plans, obviously we have the remote interviews here. If you need any help, let us know. Really appreciate it. All right, great to see you guys, and thanks for sharing. Any final thoughts on the conference? What happens when we get through to the other side of this? I'll give you guys a final word. Alberto, we'll start with you first.

Yeah, I think when we are on the other side of this, we will understand even more the importance of effective resilience architecting and testing. As a provider of tools and methodologies for that, we think we will be able to help customers make a significant leap forward on that side. And the conference is just super exciting. I think it's going to be a great event, and I encourage everyone to participate. We have a tremendous lineup of speakers who have incredible reputations in their fields, so I'm really happy and excited about the work that the team has
been able to do with our partners to put together this type of event.

Okay, Tammy?

Yes, I'm actually going to be doing the opening keynote for the conference, and the topic that I'm speaking about is that reliability matters more now than ever. I'll be sharing some bizarre, weird incidents that I've worked on and experienced myself, really critical, strange issues that have come up. I'm really looking forward to sharing that with everybody else. So please come along. It's free, you can join from your own home, and we can all be there together to support each other.

You've got great community support, and there are a lot of partners, press, media, an ecosystem, and customers. So congratulations, Gremlin, on having a conference on April 21st called the Failover Conference. theCUBE and SiliconANGLE will have a digital reporter there covering the news. Thanks for coming on and sharing; appreciate the time. I'm John Furrier, here in the Palo Alto studios with a remote interview with Gremlin around their Failover Conference on April 21st. It's really demonstrating, in my opinion, that the at-scale problems we've been working on in the industry are now more applicable than ever before as we get post-pandemic with COVID-19. Thanks for watching. We'll be back. [Music]
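
Tammy's two tips in the conversation, keeping a chain of backup plans and injecting failure into one service to check that you can fail over to an alternative, can be sketched in a few lines of Python. This is only an illustrative sketch, not Gremlin's tooling or API: the `FailoverClient` class, the `inject_failure` method, the `make_handler` helper, and the plan names are all hypothetical and invented for this example.

```python
from typing import Callable, Dict, List, Set, Tuple


class AllProvidersFailed(Exception):
    """Raised when every plan, from A through the last backup, has failed."""


class FailoverClient:
    """Hypothetical client that tries an ordered list of backup plans."""

    def __init__(self, providers: List[Tuple[str, Callable[[str], str]]]) -> None:
        self.providers = providers          # ordered: plan A first, then backups
        self.injected: Set[str] = set()     # plans deliberately broken for a drill

    def inject_failure(self, name: str) -> None:
        """Simulate an outage of one plan on purpose."""
        self.injected.add(name)

    def clear_failures(self) -> None:
        """End the drill and restore every plan."""
        self.injected.clear()

    def call(self, request: str) -> Tuple[str, str]:
        """Return (plan name, response) from the first plan that succeeds."""
        errors: Dict[str, str] = {}
        for name, handler in self.providers:
            try:
                if name in self.injected:
                    raise ConnectionError(f"injected failure in {name}")
                return name, handler(request)
            except Exception as exc:
                errors[name] = str(exc)     # remember why this plan failed
        raise AllProvidersFailed(errors)


def make_handler(label: str) -> Callable[[str], str]:
    """Build a stand-in service; a real one would call a network API."""
    def handler(request: str) -> str:
        return f"{label} handled {request}"
    return handler


client = FailoverClient([
    ("plan_a", make_handler("primary stream")),
    ("plan_b", make_handler("backup stream")),
    ("plan_c", make_handler("recorded fallback")),
])

# Fire drill: break plan A on purpose and confirm that plan B takes over.
client.inject_failure("plan_a")
served_by, response = client.call("keynote")
print(served_by, "->", response)
```

Running the drill before the real event, rather than discovering during it that plan B does not work, is the point of the exercise; the same loop extends to plans C, D, and E by adding entries to the list.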

Published Date : Apr 7 2020


