Dave Rensin, Google | Google Cloud Next 2018
>> Live from San Francisco, it's The Cube. Covering Google Cloud Next 2018, brought to you by Google Cloud and its ecosystem partners. >> Welcome back everyone, it's The Cube live in San Francisco at Google Cloud's big event, Next 18. GoogleNext18 is the hashtag. I'm John Furrier with Jeff Frick. Our next guest is Dave Rensin, director of CRE and network capacity at Google. CRE stands for Customer Reliability Engineering, not to be confused with SRE, which is Google's heralded program Site Reliability Engineering, a category changer in the industry. Dave, great to have you on. Thanks for coming on. >> Thank you so much for having me. >> So we had a meeting a couple months ago and I was just so impressed by how much thought and engineering and business operations have been built around Google's infrastructure. It's a fascinating case study in the history of computing. You guys obviously power yourselves, and the Cloud is just massive. You've got the Site Reliability Engineer concept that now is, I won't say a boilerplate, but it's certainly the guiding architecture for how enterprises are going to start to operate. Take a minute to explain the SRE and the CRE concept within Google. I think it's super important that you guys, again, pioneered something pretty amazing with the SRE program. >> Well, I mean, like everything, it was just formed out of necessity for us. We did the calculation 12 or 13 years ago, I think. We sat down with a piece of paper and we said, well, the number of people we need to run our systems scales linearly with the number of machines, which scales linearly with the number of users and the complexity of the stuff you're doing. Alright, carry the two, divide by six, plot the line. In ten years, now this is 13 or 14 years ago, we're going to need one million humans to run Google. And that was at the growth and complexity of 10 years ago or 12 years ago. >> Yeah, Search. (laughs) >> Search, right?
We didn't have Android, we didn't have Cloud, we didn't have Assistant, we didn't have any of these things. We were like, well, that's not going to work. We're going to have to do something different, and so that's kind of where SRE came from. It's like, how do we automate? The basic philosophy is simple: give to the machines all the things machines can do, and keep for the humans all the things that require human judgment. And that's how we get to a place where like 2,500 SREs run all of Google. >> And that's massive, and there's billions and billions of users. >> Yeah. >> Again, I think this is super important because at that time it was a telltale sign for you guys to wake up and go, well, I can't get a million humans. But it's now becoming, in my opinion, what the enterprise is going through in this digital transformation, whatever we call it these days, consumerization of IT, now it's digital transfor-- Whatever it is, the role of the human-machine interaction is now changing, people need to do more. They can collect more data than ever before. It doesn't cost them that much to collect data. >> Yeah. >> We just heard from the BigQuery guys, some amazing stuff happening. So now enterprises are almost going through the same changeover that you guys had to go through. And this is now super important because now you have the tooling and the scale that Google has. And so it's almost like it's a level up fast. So, how does an enterprise become SRE-like, quickly, to take advantage of the Cloud? >> So, you know, I would like to say this is all sort of a deliberate march of a multi-year plan. But it wasn't, it was a little accidental. Starting two or three years ago, companies were asking us, they were saying, we're getting mired in toil. Like, we're not being able to innovate because we're spending all of our budget and effort just running the things and turning the crank. How do you have billions of users and not have this problem? We said, oh, we use this thing called SRE.
And they're like, please use more words. And so we wrote a book. Right? And we expected maybe 20 people would read the book, and it was fine. And we didn't do it for any other reason other than that seemed like a very scalable way to tell people the words. And then it all just kind of exploded. We didn't expect that it was going to be true, and so a couple of years ago we said, well, maybe we should formalize our interactions, we should go out proactively and teach every enterprise we can how to do this and really work with them, and build up muscle memory. And that's where CRE comes from. That's my little corner of SRE. It's the part of SRE that, instead of being inward focused, we point out to companies. And our goal is that every firm from five to 50 thousand can follow these principles. And they can. We know they can do it. And it's not as hard as they think. The funny thing about enterprises is they have this inferiority complex, like they've been told for years by Silicon Valley firms in sort of this derogatory way that, you're just an enterprise. We're the innovate-- That's-- >> Buy our stuff. Buy our software. Buy IT. >> We're smarter than you! And it's nonsense. There are hundreds and hundreds of thousands of really awesome engineers in these enterprises, right? And if you just give them a little latitude. And so anyway, we can walk these companies on this journey and it's been, I mean you've seen it, it's just been snowballing the last couple of years. >> Well, the developers certainly have changed the game. We've seen with Cloud Native the role of developers doing toil and, or specific longer term projects at an app related IT would support them. So you had this traditional model that's been changed with agile et cetera. And DevOps, so that's great. So you know, golf clap for that. Now it's like scale. >> No, more than a golf clap, it's been real. >> It's been a high five. Now it's like, they got to go to the next level.
The next level is how do you scale it, how do I get more apps, how am I going to drive more revenue, not just reduce the cost? But now you got operators, now I have to operate things. So I think the persona of what operating something means, what you guys have hit with SRE, and CRE is part of that program, and that's really I think the aha moment. So that's where I see, and so how does someone read the book, put it in practice? Is it a cultural shift? Is it a reorganization? What are you guys seeing? What are some of the successes that you guys have been involved in? >> The biggest way to fail at doing SRE is to try to do all of it at once. Don't do that. There are a few basic principles that, if you adhere to them, the rest of it just comes organically at a pace that makes sense for your business. The easiest thing to think of is simply-- If I had to distill it down to a few simple things, it's just this. Any system involving people is going to have errors. So any goal you have that assumes perfection, 100% uptime, 100% customer satisfaction, zero error, that kind of thing, is a lie. You're lying to yourself, you're lying to your customers. It's not just unrealistic, it's, in a way, kind of immoral. So you got to embrace that. And then that difference between perfection and the amounts, the closeness to perfection that your customers really need, cuz they don't really need perfection, should be just a budget. We call it the error budget. Go spend the budget, because above that line your customers are indifferent, they don't care. And that unlocks innovation. >> So this is important, I want to just make sure I slow down on this, error budget is a concept that you're talking about. Explain that, because this is, I think, interesting. Because you're saying it's bs that there's no errors, because there's always errors, right? >> Sure.
>> So you just got to factor in and how you deal with them is-- But explain this error budget, because this operating philosophy of saying deal with errors, so explain this error budget concept. >> It comes from this observation, which is really fascinating. If you plot reliability and customer satisfaction on a graph, what you will find is, for a while, as your reliability goes up, your customer satisfaction goes up. Fantastic. And then there's a point, a magic line, after which you hit this really deep knee. And what you find is if you are much under that line your customers are angry, like pitchforks, torches, flipping cars, angry. And if you operate much above that line they are indifferent. Because the network they connect with is less reliable than you. Or the phone they're using is less reliable than you. Or they're doing other things in their day than using your system, right? And so, there's a magic line, actually there's a term, it's called an SLO, Service Level Objective. And the difference between perfection, 100%, and the line you need, which is very business specific, we say treat as a budget. If you overspend your budget your customers aren't happy, cuz you're less reliable than they need. But if you consistently underspend your budget, because they're indifferent to the change and because it is exponentially more expensive for incremental improvement, that's literally resources you're wasting. You're wasting the one resource you can never get back, which is time. Spend it on innovation. And just that mental shift, that we don't have to be perfect, lets people do open and honest, blameless postmortems. It lets them embrace their risk in innovation. We go out of our way at Google to find people who accidentally broke something, took responsibility for it, redesigned the system so that the next unlucky person couldn't break it the same way, and then we promote them and celebrate them.
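The error budget arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the `ErrorBudget` class, its field names, and the request-count framing are hypothetical, and the 99.9% target in the usage note is a made-up example. The budget is taken to be the gap between 100% and the SLO target, applied to the number of events in a measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: tracks spend against an SLO-derived error budget."""

    slo_target: float   # fraction of events that must succeed, e.g. 0.999
    total_events: int   # events (e.g. requests) observed in the window
    failed_events: int  # events that failed from the user's point of view

    def __post_init__(self) -> None:
        # A target of exactly 100% leaves no budget at all, which the
        # interview argues is an unrealistic goal, so it is rejected here.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.failed_events < 0 or self.failed_events > self.total_events:
            raise ValueError("failed_events must be between 0 and total_events")

    @property
    def allowed_failures(self) -> float:
        """The budget: the gap between perfection and the SLO, in events."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining(self) -> float:
        """Budget left to spend; negative means the budget is overspent."""
        return self.allowed_failures - self.failed_events

    @property
    def consumed_fraction(self) -> float:
        """Share of the budget already used (0.0 if there is no traffic)."""
        if self.allowed_failures == 0:
            return 0.0
        return self.failed_events / self.allowed_failures

    @property
    def overspent(self) -> bool:
        """True when users are seeing less reliability than the SLO promises."""
        return self.remaining < 0
```

As a usage example, `ErrorBudget(slo_target=0.999, total_events=1_000_000, failed_events=250)` has a budget of roughly 1,000 failed events and has used about a quarter of it, which in the terms of the interview leaves room to spend on riskier changes. With 1,500 failures the same window would be overspent.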
>> So you push the error budget but then it's basically a way to do some experimentation, to do some innovation >> Safely. >> Safely. And what you're saying is, obviously the line of unhappy customers, it's like Gmail. When Gmail breaks people are like, the world freaks out, right? But, I'm happy with Gmail right now. It's working. >> But here's the thing, Gmail breaks very, very little. Very, very often. >> I never noticed it breaking. >> Will you notice the difference between 10 milliseconds of delivery time? No, of course not. Now, would you notice an hour or whatever? There's a line, you would for sure notice. >> That's the SLO line. >> That's exactly right. >> You're also saying that if you try to push above that, it costs more and there's not >> And you don't care >> An incremental benefit >> That's right. >> It doesn't affect my satisfaction. >> Yeah, you don't care. >> I'm at nirvana, now I'm happy. >> Yeah. >> Okay, and so what does that mean now for putting things in practice? What's the ideal error budget, that's an SLO? Is that part of the objective? >> Well, that's part of the work to do as a business. And that's part of what my team does, is help you figure out what is the SLO, what is the error budget that makes sense for you for this application? And it's different. A medical device manufacturer is going to have a different SLO than a bank or a retailer, right? And the shapes are different. >> And it's interesting, we hear SLA, the Service Level Agreement, it's an old term >> Different things. >> Different things, here objective, if I get this right, is not just about speeds and feeds. There's also qualitative user experience objectives, right? So, am I getting that right? >> Very much so. SLOs and SLAs get confused a lot because they share two letters. But they don't mean anywhere near the same thing. An SLA is a legal agreement. It's a contract with your user that describes a penalty if you don't meet a certain performance.
Lawyers, and sometimes sales or marketing people, drive SLAs. SLOs are different things, driven by engineers. They are quantitative measures of your users' happiness right now. And exactly to your point, it's always from the user's perspective. Like, your user does not care if the CPU in your fleet spiked. Or the memory usage went up x. They care, did my mail delivery slow down? Or is my load balancer not serving things? So, focus from your user backwards into your systems and then you get much saner things to track. >> Dave, great conversation. I love the innovation, I love the operating philosophy, cuz you're really nailing it in terms of you want to make people happy but you're also pushing the envelope. You want to get these error budgets so we can experiment and learn, and not repeat the same mistake. That sounds like automation to me. But I want you to take a minute to explain, what SRE, that's an inward facing thing for Google, you are called a CRE, Customer Reliability Engineer. Explain what that is, because I heard Diane Greene saying, we're taking a vertical focus. She mentioned healthcare. Seems like Google is starting to get in, and applying a lot of resources, to the field, customers. What is a CRE? What does that mean? How is that a part of SRE? Explain that. >> So a couple of years ago, when I was first hired at Google, I was hired to build and run Cloud support. And one of the things I noticed, which you notice when you talk to customers a lot, is, you know, the industry's done a really fabulous job of telling people how to get to Cloud. I used to work at Amazon. Amazon does a fantastic job telling people, how do you get to Cloud? How do you build a thing? But we're awful, as an industry, about telling them how to live there. How do you run it? Cuz it's different running a thing in a Cloud than it is running it On-Prem. And you find that's the cause of a lot of friction for people.
Not that they built it wrong, but they're just operating it in a way that's not quite compatible. It's a few degrees off. And so we have this notion of, well, we know how to operate these things to scale, that's what SRE is. What if, what if, we did a crazy thing? We took some of our SREs and instead of pointing them in at our production systems, we pointed them out at customers? Like what if we genetically screened our SREs for, can talk to human, instead of can talk to machine? Which is what you optimize for when you hire an engineer. And so we started CRE, it's this part of our SRE org that we point outwards to customers. And our job is to walk that path with you and really do it to get like-- sometimes we go so far as even to share a pager with you. And really get you to that place where your operations look a lot like we're talking that same language. >> It's custom too, you're looking at their environment. >> Oh yeah, it's bespoke. And then we also try to do scale things. We did the first SRE book. At the show just two days ago we launched the companion volume to the book, which is like-- cheap plug segment, where it's the implementation details. The first book's sort of a set of principles, these are the implementation details. Anything we can do to close that gap, I don't know if I ever told you the story, but when I was a little kid, when I was like six, like 1978, my dad, who's always loved technology, decided he was going to buy a personal computer. So he went to the largest retailer of personal computers in North America, Macy's in 1978, (laughs) and he came home with two things. He came home with a huge box and a human named Fred. And Fred the human unpacked the big box and set up the monitor, and the tape drive, and the keyboard, and told us about hardware and software and booting up, because who knew any of these things in 1978? And it's a funny story that you needed a human named Fred. My view is, I want to close the gap so that CRE are the Freds.
Like, in a few years it'll be funny that you would ever need humans, from Google or anyone else, to help you learn how-- >> It's really helping people operate their new environment as a whole. It's a new first generation problem. >> Yeah. >> Essentially. Well, Dave, great stuff. Final question, I want to get your thoughts. Great that we can have this conversation. You should come to the studio and go deeper on this, I think it's a super important and new role with SREs and CREs. But the show here, if you zoom out and look at Google Cloud, look down on the stage of what's going on this week, what's the most important story that should be told that's coming out of Google Cloud? Across all the announcements, what's the most important thing that people should be aware of? >> Wow, I have a definite set of biases, I won't lie. To me, the three most exciting announcements were GKE On-Prem, the idea that managed Kubernetes you can actually run in your own environment. People have been saying for years that hybrid wasn't really a thing. Hybrid's a thing and it's going to be a thing for a long time, especially in enterprises. That's one. I think the introduction of machine learning to BigQuery, like anything we can do to bring those machine learning tools into these petabytes-- I mean, you mentioned it earlier. We are now collecting so much data, not only can we not, as companies, we can't manage it. We can't even hire enough humans to figure out the right questions. So that's a big thing. And then, selfishly, in my own view of it because of reliability, the idea that Stackdriver will let you set up SLO dashboards and SLO alerting, to me that's a big win too. Those are my top three. >> Dave, great to have you on. Our SLO at The Cube is to bring the best content we possibly can, the most interviews at an event, and get the data and share that with you live. It's The Cube here at Google Cloud Next 18. I'm John Furrier with Jeff Frick.
Stay with us, we've got more great content coming. We'll be right back after this short break.
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Rensin | PERSON | 0.99+ |
Jeff Frick | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Diane Greene | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
100% | QUANTITY | 0.99+ |
1978 | DATE | 0.99+ |
Siri | TITLE | 0.99+ |
ORGANIZATION | 0.99+ | |
John Furrier | PERSON | 0.99+ |
Fred | PERSON | 0.99+ |
hundreds | QUANTITY | 0.99+ |
20 people | QUANTITY | 0.99+ |
North America | LOCATION | 0.99+ |
two letters | QUANTITY | 0.99+ |
10 milliseconds | QUANTITY | 0.99+ |
San Francisco | LOCATION | 0.99+ |
first | QUANTITY | 0.99+ |
six | QUANTITY | 0.99+ |
first book | QUANTITY | 0.99+ |
five | QUANTITY | 0.99+ |
Android | TITLE | 0.99+ |
two | QUANTITY | 0.99+ |
an hour | QUANTITY | 0.99+ |
two things | QUANTITY | 0.99+ |
two | DATE | 0.99+ |
The Cube | ORGANIZATION | 0.98+ |
2,500 SREs | QUANTITY | 0.98+ |
Gmail | TITLE | 0.98+ |
SRE | ORGANIZATION | 0.98+ |
10 years ago | DATE | 0.98+ |
Macy | ORGANIZATION | 0.98+ |
12 years ago | DATE | 0.98+ |
one | QUANTITY | 0.98+ |
two days ago | DATE | 0.98+ |
Google Cloud | TITLE | 0.97+ |
three years ago | DATE | 0.97+ |
ORGANIZATION | 0.96+ | |
first generation | QUANTITY | 0.96+ |
zero error | QUANTITY | 0.96+ |
50 thousand | QUANTITY | 0.94+ |
GoogleNext18 | EVENT | 0.94+ |
13 | DATE | 0.93+ |
SRE | TITLE | 0.93+ |
couple of years ago | DATE | 0.92+ |
Silicon Valley | LOCATION | 0.91+ |
CRE | ORGANIZATION | 0.91+ |
couple months ago | DATE | 0.91+ |
Cloud | TITLE | 0.91+ |
agile | TITLE | 0.9+ |
Google Cloud | ORGANIZATION | 0.9+ |
Assistant | TITLE | 0.89+ |
one million humans | QUANTITY | 0.89+ |
14 years ago | DATE | 0.89+ |
SLA | TITLE | 0.88+ |
ten years | QUANTITY | 0.87+ |
12 | DATE | 0.86+ |
Stackdriver | ORGANIZATION | 0.86+ |
last couple of years | DATE | 0.85+ |