Steve Newman, Scalyr | Scalyr Innovation Day 2019

From San Mateo, it's theCUBE, covering Scalyr Innovation Day, brought to you by Scalyr.

Welcome to this special Innovation Day with theCUBE, here in San Mateo, California, the heart of Silicon Valley. I'm John Furrier with theCUBE. Our next guest is Steve Newman, the co-founder of Scalyr. Congratulations, and thanks for having us. You guys have a great company here.

Thanks. Glad to have you here.

So tell the story. What's the backstory? You have an interesting pedigree of founders: all tech entrepreneurs, tech savvy, tech athletes, as we say. How did it all start, and how did it all come together?

I always trace the story back to when I was part of the team that built the original Google Docs. A lot of the early people here at Scalyr either were part of that Google Docs team or are people we met while we were at Google. Scalyr is really an outgrowth of that: it's a solution to problems we were having trying to run that system at Google. Google Docs, of course, became part of a whole ecosystem with Google Drive and Google Sheets, all these applications working together. It's a very complicated system, and keeping it humming behind the scenes became a very complicated problem.

Well, congratulations. Google Docs is used by a lot of people, so it's been a great success. Scalyr is different, though. You're taking a different approach than the competition. What's unique about it? Can you share the history of where it came from and where it's going?

Yeah. Maybe it would be helpful, just to set the context a little bit, to go to the blackboard.

Yeah.

So, to put a little flesh on what I was saying about the very complicated system we were trying to run in the whole Google Drive ecosystem: there are all these trends in the industry nowadays, the move to the cloud, and microservices, and Kubernetes, and serverless, and continuous
deployment and everything. These are all great innovations. People are building more complex applications and evolving them faster, but it's making things a lot more complicated. To make that concrete, imagine that you're running an e-commerce site back in the dot-com, Web 1.0 era. You're going to have a web server, maybe Apache. You've got a MySQL database behind that with your inventory and your shopping carts, maybe an email gateway, and some kind of payment gateway, and that's about it. That's your system. Each one of these pieces involved going to Fry's, buying a computer, driving it over to the data center, and slotting it into a rack. A lot of sweat went into every one of those boxes, but there are only about four boxes. That's your whole system.

If you wanted to go faster, you threw more hardware at it. More RAM.

Exactly. And not literally threw, but literally carried: you literally brought in more hardware. So it took a lot of work just to run that simple system. Fast forward a couple of decades. If you're running an e-commerce site today, you're certainly not seeing the inside of a data center. Stripe will run the payments for you, Amazon will run the database server, and so on. One guy can get this going in an afternoon, literally. But nobody's running this today; this is not a competitive operation today. If you're in e-commerce today, you also have personalization, and advertising based on search history or purchase history, and there's a separate flow for gifts, and then interfacing to your delivery service. You've got 150 blocks on this diagram. Maybe your engineering team doesn't have to be so much larger, because each one of those boxes is so much easier to run, but it's still a complicated system, and trying to actually understand what's
working, what's not working, why it isn't working, and tracking that down and fixing it: this is the challenge today, and this is where we come in.

And that's the main focus for today: you can figure it out, but the complexity of the moving parts is the problem.

Exactly. So you see, oh, 10% of the time that somebody comes in to open their shopping cart, it fails. The problem pops out here, but the root cause turns out to be a problem with your database system back here. Figuring that out, that's the challenge.

OK, so with cloud technology, the economics have changed. How is cloud changing the game?

It's interesting. It changes the game for our customers, and it changes the game for us. For a customer, and we touched on this a little bit, things are a lot easier. People run stuff for you. You're not running your own hardware; often you're not even running your own software, you're just consuming a service. It's a lot easier to scale up and down, so you can do much more ambitious things and you can move a lot faster. But you have these complexity problems. For us, what it presents is an economy-of-scale opportunity. We step in to help you on the telemetry side: what's happening in my system, why is it happening, when did it start happening, what's causing it to happen? That all takes a lot of data, log data and other kinds of data. Every one of those components is generating data, and by the way, now that our customers are running 150 services instead of four, they are generating a lot more data. Traditionally, if you're trying to manage that yourself, running your own log management cluster or whatever solution, it's a real challenge as you scale up. As your system gets more complex, you've got so much data to manage. We've taken an approach where we're able to service all of our customers out of a single
centralized cluster, meaning we get an economy of scale. Each one of our customers gets to work with a log management engine that's scaled to our scale rather than the individual customer's scale.

So the older versions of log management had the same kind of complexity challenges you just drew with e-commerce. As the data types increase, so does the complexity. Is that right?

The complexity increases, but you also get into just a data scale problem. Suddenly you're generating terabytes of data, but you only want to devote a certain budget to the computing resources that are going to process that data. Because we can share our processing across all of our customers, we fundamentally change the economics. It's a little bit like when you run a search on Google: thousands, literally thousands, of servers are involved. In that tenth of a second that Google is processing the query, 3,000 servers on the Google side may have been involved. Those aren't your 3,000 servers. You're sharing them with 50 million other people in your data center region. But for a millisecond there, those 3,000 servers are all for you, and that's a big part of how Google is able to give such amazing results so quickly and still economically.

Yeah, economically for them.

And on a smaller scale, that's basically what we're doing: taking the same hardware and making all of it available to all of the customers.

People talk about metrics as the solution to scaling problems. Is that correct?

This is a really interesting question. Metrics are great. If you look up the definition of a metric, it's basically just a measurement, a number. It's a great way to boil things down. Say I've had 83 million people visit my website today and they did 163 million things. You can't make sense of that. You can boil it down to: this is the amount of traffic on the
site, this was the error rate, this was the average response time. These are great. It's a great summarization that gives you an overall flavor of what's going on. The challenge with metrics is that they can be a great way to measure your problems, your symptoms: the site's up, it's down, it's fast, it's slow. But then you want to get to the cause of that problem. Exactly why is the site down? I know something's wrong with the database, but what's the error message, and what's the exact detail here? A metric isn't going to give that to you. In particular, when people talk about metrics, they tend to have in mind a specific approach where this flood of events and data is distilled down very early: let's count the number of requests, measure the average time, and then throw away the data and keep the metric. That's efficient. Throwing away data means you don't have to pay to manage the data, and it gives you this summary. But as soon as you want to drill down, you don't have any more data. If you want to look at a different metric, one that you didn't set up in advance, you can't do it, and if you need to go into the details, you can't.

Interesting story about that. You mentioned the problem statements came from when you were at Google. One of the things I love about Google is that they really nailed the SRE model, and they clearly decoupled roles: developers and site reliability engineers, who essentially have a one-to-many relationship with all the massive hardware. That's a nice operating model with a lot of efficiencies tied together. But you guys are kind of saying that as developers use the cloud, they become their own SREs, in a way, because the cloud can give them that kind of Google-like scale. In smaller ways, not Google size, but it's a similar dynamic, where there's a lot of compute and a lot of things happening on behalf of the
application or the engineers. As developers become the operators through their role, what challenges do they have, and how do you see that happening? That's an interesting trend: as applications become larger and the cloud can service them at scale, developers become their own SREs. How does that roll out? How do you see that?

Yes, this is something we see happening at more and more of our customers. One of the implications is that you have all these developers who are now responsible for operations, but they're not that specialist SRE team. They're specialists in developing code, not in operations. They minor in operations, and they don't think of it as their real job. It's a distraction. Something goes wrong, they're called upon to help fix it, and they want to get it done as quickly as possible so they can get back to their real job. So they're not going to make the same mental investment in becoming an expert at operations and at the operations and telemetry tools. They're not going to be a log management expert or a metrics expert. So they need tools that have a gentle learning curve and that make it easy for them to get in, not really know what they're doing on this side of things, but find an answer, solve the problem, and get back out.

And that's a concept you guys have: speed to truth.

Exactly. We mean a couple of things by that. Most literally, our tool is a high-performance solution. You hand us your terabytes of log data, you ask some question, such as what's the trend on this error in this service over the last day, and we scan through that big data and give you a quick answer. But really, that's just part of the overall chain of events, which goes from the developer with a
problem until they have a solution. They have to figure out how to even approach the problem and what question to ask us. They have to pose the query in our interface. So we've done a lot of work to simplify that learning curve: instead of a complicated query language, you can click a button, get a graph, and then start breaking that down visually. OK, here's the error rate, but how does that break down by server, or user, or whatever dimension? You're able to drill down and explore in a very straightforward way.

How would you describe the culture at Scalyr? You've been around for a while, but you're still a fast-growing startup. You haven't done the B round yet. You self-funded it, got customers early, they pushed you, and now you have 300-plus customers. What's the culture like here?

This has been a fun company to build, in part because the heart of this company is the engineering team, and our customers are engineers. So we're kind of the same group, and that keeps the inside and the outside very close together. I think that's been a part of the culture we've built: we all know why we're building this and what it's for. We use Scalyr extensively internally, but even if we didn't, it's the kind of thing we've used in the past and we're going to use in the future. So I think people are really excited here, because we understand why.

And you have an opinion of the future and how it should roll out. What's the big problem statement you guys are solving as a company? How would you boil that down if asked by a customer or an engineer out there: what real problem are you solving, the core problem, the big problem that's going to help me?

At the end of the day, it's giving people the confidence to keep building these kind of
complicated systems and move quickly, because this is the business pressure everyone is under. Whatever business you're in, it has a digital element, and your competitors are doing the same thing. They are building these sophisticated systems, they're adding functionality, and they're moving quickly. You need to be able to do the same thing, but it's easy to get tangled up in this complexity. So at the end of the day, we're giving people the ability to understand those systems.

And the functionality and the software are getting stronger and more complicated with service meshes and microservices. As applications gain the ability to stand up and tear down services on the fly, they'll yield even more data.

Exactly. You get more data, it gets more complicated. Actually, if you don't mind, there's a little story I'd like to tell. Hold on just while I clear this out. This is going back to Google, and again, it's part of the inspiration for how we came to build Scalyr. This is a story of frustration, of the problems we got ourselves into.

Frustration and motivation.

Yep. So we were working on this project. It was building a file system that could tie together Google Docs, Google Sheets, Google Drive, and Google Photos. The block diagram looks kind of like the thing I just erased. But there was one particular problem we had that took us months, literally months and months and months, to track down. You'd like to solve a problem in a few minutes or a few hours, but this one took months, and it had to do with the indexing system. You have all these files in Google Drive, and you want to be able to search them. So we had modeled out how we were going to build this search engine. You'd think Google search is a solved problem, but actually, Google web search is for things the whole world can see. There's also Gmail search,
which is for things that only one person can see, so it's lots of separate little indexes. Those are both solved problems at Google. Google Drive is for things a few people can see: you share a file with your coworker or whoever, and it's actually a very different problem. But we looked at the statistics and found that the average file was shared with about 1.1 people. In other words, things were mostly private, or maybe shared with one or two people. So we said, if something's shared to three people, we're just going to make three copies of it, and now we have just the Gmail problem: each copy is for one person. Then we did the math on how much work it would be to build these indexes. In round numbers, and this would be so much larger now, at the time we had maybe one billion documents and files in the system. Each one was shared to about 1.1 people, maybe it was a thousand words long on average, and maybe it would be edited once per day on average. So we had about a trillion word updates per day, if you multiply all that together. So we put in a request and purchased machines to handle that much traffic, and we started bringing up the system, and it immediately collapsed. It was completely overloaded. We checked our numbers, and we checked them again: 1.1, about a billion, whatever. But the work coming into the system was just way beyond that. We looked at our metrics, measuring the number of documents, measuring each of these things, and all the metrics looked right. To make a months-long story short, these metrics and averages were hiding some funny business. It turned out there was this type of use case, of occasional documents that were shared to thousands of people. One specific example was the signup sheet for the Google company picnic. This is a spreadsheet, and it was shared to about 5,000 people, so it wasn't the whole
company, but a big chunk of Mountain View. It was, I don't know, let's say 20 thousand words long, because it had the name and a couple of other things for each person. This is one document, but shared to 5,000 people, and during the period people were signing up, maybe it was changing a couple thousand times per day. So you multiply out just this document and you get 200 billion word updates for that one document in a day, where we were estimating a trillion for the whole Earth. And there was something like a hundred documents like this.

So Google was hamstringing its own thing.

We were hamstringing our own thing. There were about a hundred examples like this, so now we're up to 20 trillion, and that was the whole problem: these hundred files. We would never have found that until we got way down into the details of the logs, which in this case took months, because we didn't have the tools, because we didn't have Scalyr.

Yeah, and I think this is the kind of anomaly you might see with web services evolving with microservices, where someone has an API interface with some other SaaS. As apps start to rely on each other, this is a new dynamic we're seeing, as SLAs are also tied together. So the question is, whose fault is it?

Exactly, whose fault is it? And also, things get so much more varied now. Again, Web 1.0 e-commerce: you buy a thing, you buy a thing, it's all the same. Now you're building a social media site or whatever. You've got 8 followers, you've got 8 million followers. This person has three movies rented on Netflix, this person has three thousand movies. Everything's different, and so then you get these funny things hiding.

Yeah, you're flying blind if you don't get all the data exposed. It's like a blind person trying to read Braille, as we heard earlier. Steve, thanks so much for sharing the insight. Great story. I'm John Furrier, here with theCUBE for Innovation Day at Scalyr's headquarters. Thanks for watching.
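
The capacity estimate in the Google Drive story can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the round numbers quoted in the interview. The function and variable names are illustrative assumptions, and the cost model (every edit re-indexes every word once per person the file is shared with) is the one implied by the "make one copy per person" approach described above, not a documented Google design.

```python
# Back-of-envelope index load, using only the round numbers from the interview.
# All names are hypothetical; nothing here comes from a real Google or Scalyr system.

def word_updates_per_day(docs, shares_per_doc, words_per_doc, edits_per_day):
    """Assumed model: each edit re-indexes every word, once per person sharing the file."""
    return docs * shares_per_doc * words_per_doc * edits_per_day

# Estimate built from averages: ~1 billion files, 1.1 shares, 1,000 words, 1 edit/day.
average_estimate = word_updates_per_day(1_000_000_000, 1.1, 1_000, 1)

# One outlier: the picnic signup sheet (20,000 words, 5,000 people, ~2,000 edits/day).
picnic_sheet = word_updates_per_day(1, 5_000, 20_000, 2_000)

# Roughly a hundred documents of that kind.
outliers_total = 100 * picnic_sheet

# How many times larger the outlier load is than the whole averaged estimate.
ratio = outliers_total / average_estimate

print(f"estimate from averages: {average_estimate:.2e} word updates/day")
print(f"one picnic-style sheet: {picnic_sheet:.2e} word updates/day")
print(f"100 such sheets:        {outliers_total:.2e} word updates/day")
print(f"outliers vs. estimate:  {ratio:.1f}x")
```

Under these assumptions, a hundred widely shared files generate far more indexing work than the estimate for the entire corpus, even though they barely move any of the per-file averages. That is the sense in which the averages "were hiding some funny business."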

Published Date : May 30 2019
