Ed Walsh and Thomas Hazel, ChaosSearch
>> Welcome to theCUBE, I'm Dave Vellante. Today we're going to explore the ebb and flow of data as it travels into the cloud and the data lake. The concept of data lakes was alluring when it was first coined last decade by CTO James Dixon. Rather than being limited to highly structured and curated data that lives in a relational database, in the form of an expensive and rigid data warehouse or a data mart, a data lake is formed by flowing data from a variety of sources into a scalable repository, like, say, an S3 bucket, that anyone can access and dive into. They can extract water, a.k.a. data, from that lake and analyze data that's much more fine-grained and less expensive to store at scale. The problem became that organizations started to dump everything into their data lakes with no schema on write, no metadata, no context, just shoving it into the data lake to figure out what's valuable at some point down the road. Kind of reminds you of your attic, right? Except this is an attic in the cloud, so it's too big to clean out over a weekend. Well look, it's 2021, and we should be solving this problem by now. A lot of folks are working on it, but often the solutions add other complexities for technology pros. So to understand this better, we're going to enlist the help of ChaosSearch CEO Ed Walsh and Thomas Hazel, the CTO and founder of ChaosSearch. We're also going to speak with Kevin Miller, who's the vice president and general manager of S3 at Amazon Web Services, and of course they manage the largest and deepest data lakes on the planet. And we'll hear from a customer to get their perspective on this problem and how to go about solving it. But let's get started. Ed, Thomas, great to see you. Thanks for coming on theCUBE.
>> Likewise.
>> Face to face, it's really good to be here.
>> It is nice face to face.
>> It's great.
>> So, Ed, let me start with you. We've been talking about data lakes in the cloud forever. Why is it still so difficult to extract value from those data lakes?
>> Good question. I mean, data analytics at scale has always been a challenge, right? We're making some incremental changes, but as you mentioned, we need to see some step-function changes. In fact, that's the reason ChaosSearch was founded. If you look at it, it's the same challenge around a data warehouse or a data lake: really, it's not just flowing the data in, it's how to get insights out. It falls into a couple of areas. The business side will always complain, and it's pretty uniform across everything in data lakes and everything in data warehousing. They'll say, "Hey, listen, I typically have to deal with a centralized team to do that data prep, because it's data scientists and DBAs." Most of the time they're a centralized group; sometimes they sit in business units, but usually, because they're scarce resources, they're pulled together. And then it takes a lot of time. It's arduous, it's complicated, it's a rigid process of dealing with that team. It's hard to add new data, it's very hard to share data, and there's no way to do governance without locking it down. And of course the business would like it to be more self-serve. So you hear that from the business side constantly. Now, underneath that, there are some real technology issues: we haven't really changed the way we do data prep since the two thousands, right? It falls into two big areas. The first is how to do data prep. A request comes in from a business unit: I want to do X, Y, Z with this data, I want to use this type of tool set to do the following. Someone has to be smart about how to put that data in the right schema, as you mentioned. You have to put it in the right format, so the tool sets can analyze that data, before you do anything. And then the second challenge, and I'll come back to the first because that's the biggest one, is how these different data lakes and data warehouses are persisting data, the complexity of managing that data, and the cost of computing on it. But the biggest thing is actually getting from raw data to something usable. The rigidness and complexity the business side runs into is that literally someone has to do this ETL process: extract, transform, load. A request comes in, "I need this much data put together in this particular way," and someone is literally, physically duplicating data and putting it together in a schema, stitching together almost a data puddle for each of these different requests. And anytime they have to do that, someone has to do it, and those very skilled resources are scarce in the enterprise, right? It's the DBAs and data scientists. And then when they want new data, you give them a data set, and they're always saying, "What can I add to this data now that I've seen the reports? I want fresher data." And the same process has to happen again. This takes about 60% to 80% of data scientists' and DBAs' time. It's well-documented. And this is what actually stops the process; that's what's rigid. They have to be rigid, because there's a process around it. And it takes an enterprise weeks or months. I always say three weeks or three months, and no one challenges me beyond that. It also consumes the same skill set of people you want driving digital transformation, data warehousing initiatives, modernization, becoming data-driven: all those data scientists and DBAs you don't have enough of. So this is not only hurting you getting insights out of your data lakes and warehouses; that resource constraint is hurting those other initiatives too.
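To make the ETL cycle Ed describes concrete, here is a minimal sketch of that "physical duplication" pattern: raw events are extracted from a lake bucket, transformed into the schema one business unit requested, and loaded back as a second copy shaped for that unit's tool set. The bucket names, keys, and fields are hypothetical illustrations, not anything specific to ChaosSearch or any customer.

```python
# A minimal sketch of the "physical ETL" cycle described above: extract raw
# events from the lake, transform them into the schema one business unit
# requested, and load a duplicated copy shaped for that unit's tool set.
# All bucket names, keys, and fields are hypothetical.
import json

import boto3

s3 = boto3.client("s3")

def etl_job(src_bucket: str, src_key: str, dst_bucket: str, dst_key: str) -> None:
    # Extract: read raw, schema-less JSON lines from the data lake.
    raw = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()
    events = [json.loads(line) for line in raw.splitlines() if line.strip()]

    # Transform: project each event onto the rigid schema the request called for.
    rows = [
        {"ts": e.get("timestamp"), "user": e.get("user_id"), "amount": e.get("amount", 0.0)}
        for e in events
    ]

    # Load: write a second, physically duplicated copy (the "data puddle").
    body = "\n".join(json.dumps(r) for r in rows).encode()
    s3.put_object(Bucket=dst_bucket, Key=dst_key, Body=body)

# Every follow-up request ("add this data, make it fresher") repeats the cycle,
# which is why it consumes so much scarce DBA and data-scientist time.
etl_job("raw-data-lake", "events/2021-10-01.jsonl", "bi-data-puddle", "sales/extract.jsonl")
```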
>> So that smallest atomic unit is that team, that super-specialized team, right?
>> Right.
>> Yeah. Okay. So you guys talk about activating the data lake.
>> Yep.
>> For analytics. What's unique about that? What problems are you all solving, you know, when you guys created this magic sauce?
>> Basically, there are a lot of things. I highlighted the biggest one, which is how to do the data prep, but there's also how you're persisting and using the data. In the end, there are a lot of challenges in getting analytics at scale, and this is really where Thomas and I founded the team to go after it. But I'll try to say it simply, and compare and contrast what we do with what you'd do with, say, an Elastic cluster or a BI cluster. What we do is simple: your data is in S3. Don't move it, don't transform it; in fact, we're against data movement. We literally point at that data, index it, and make it available in a data representation where you can give virtual views to end users. Those virtual views are available immediately, over petabytes of data, and they're presented to the end user as an open API. So if you're an Elasticsearch user, you can use all your Elasticsearch tools on this view. If you're a SQL user: Tableau, Looker, all the different tools. Same thing with machine learning next year. So we make it very simple. The data is already in S3; point us at it, we do the hard work of indexing it and making it available, and then we publish the open API, so your users can use exactly what they use today.
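Because the views are published through the standard Elasticsearch API, as Ed describes, existing client tooling should work unchanged. The sketch below assumes a hypothetical endpoint, credentials, and view name; only the claim that the interface is the open Elasticsearch API comes from the conversation.

```python
# Hedged sketch: querying a published virtual view exactly as if it were a
# native Elasticsearch index. The endpoint, credentials, and view name are
# hypothetical; the point is that no client-side changes are needed because
# the view is exposed through the standard Elasticsearch API.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://example-tenant.chaossearch.example.com",  # hypothetical endpoint
    basic_auth=("api_key_id", "api_key_secret"),       # hypothetical credentials
)

resp = es.search(
    index="app-logs-view",  # a virtual view over data sitting untouched in S3
    query={"match": {"level": "ERROR"}},
    size=10,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```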
That's dramatically different, so I'll give you a before and after. Let's say you're doing Elasticsearch, you're doing log analytics at scale. You're landing your data in S3, and then you're ETLing it: physically duplicating and moving data, and typically deleting a lot of data, to get it into a format Elasticsearch can use. You're persisting it up in a data layer called Lucene, physically sitting on memory, CPU and SSDs, and not on one of those, on a bunch of them. In the cloud you have to set those up and keep them up, because they're persisting on EC2; they stand up seven by 24, which is not a very cost-effective way to use cloud computing. What we do, in comparison, is literally point at the same S3. In fact, you can run us in complete parallel against the data that's already being ETLed out; we're just one more use case, read-only, that takes that data and makes these virtual views. And because we just give a virtual view to the end users, we don't need that persistence layer, that extra cost layer, that extra time, cost and complexity. When you look at what happens in Elastic, you have a constraint, a trade-off between how much you want to keep and how much you can afford to keep. And it becomes unstable over time, because you have to build out a schema, and it lives on servers: the more the schema scales out, guess what, you have to add more servers. Very expensive, and they're up seven by 24. They also become brittle: you lose one node, and the whole thing has to be put back together. We have none of that cost and complexity. You keep whatever you want in S3 as the single persistence, which is very cost-effective. On cost, we save 50 to 80%. Why? We don't follow the old paradigm of setting it up on servers, spinning them up for persistence, and keeping them up seven by 24. At query time we ask what you want to compute, bring up the right compute resources, and release those resources after the query is done. So we can do queries they can't imagine at scale, and we're able to do the exact same query at 50 to 80% savings. And you don't have to do any of the toil of moving that data or managing that layer of persistence, which is not only expensive, it becomes brittle. And, I'll be quick: once you go to BI, it's the same challenge, but with BI systems the requests are constantly coming from a business unit down to the centralized data team. Give me this flavor of data; I want to use this analytic tool on that data set. So they have to build all these pipelines, and they're constantly saying, okay, I'll give you this data and this data, duplicating it, moving it, and stitching it together. And the minute you want more data, they do the same process all over again. We completely eliminate that.
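As a rough illustration of where that 50 to 80% figure can come from, here is a toy cost model contrasting an always-on search cluster with compute that is brought up per query and released. Every price, node count, and query volume below is an invented assumption; only the pricing-model contrast is taken from the conversation.

```python
# Toy cost model contrasting an always-on search cluster (persisting data on
# compute nodes, seven by 24) with per-query compute that is spun up and then
# released. Every price, node count, and query volume here is invented purely
# for illustration.
HOURS_PER_MONTH = 730

# Always-on cluster: nodes stay up for persistence whether queried or not.
node_hourly_cost = 1.00   # $/node-hour (hypothetical)
cluster_nodes = 10
always_on_monthly = node_hourly_cost * cluster_nodes * HOURS_PER_MONTH

# On-demand model: data persists cheaply in S3; compute exists only per query.
s3_monthly = 300.0        # object-storage cost for the same data (hypothetical)
queries_per_month = 50_000
avg_query_seconds = 60
compute_hourly_cost = 3.00  # burstier per-query workers (hypothetical)
on_demand_monthly = s3_monthly + queries_per_month * (avg_query_seconds / 3600) * compute_hourly_cost

savings = 1 - on_demand_monthly / always_on_monthly
print(f"always-on: ${always_on_monthly:,.0f}/mo, on-demand: ${on_demand_monthly:,.0f}/mo")
print(f"savings: {savings:.0%}")  # about 62% under these assumptions, in the quoted 50-80% band
```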
>> And those requests queue up. Thomas, Ed hit on it: you don't have to move the data. That's kind of the exciting piece here, isn't it?
>> Absolutely. You know, the data lake philosophy has always been solid, right? The problem is we had that Hadoop hangover, where, let's say, we were using that platform in a few too many ways. I always believed in the data lake philosophy; when James came and coined it, I thought, that's it. However, HDFS wasn't really a service. Cloud object storage is a service: the elasticity, the security, the durability, all those benefits are really why we founded on cloud object storage as a first move.
>> So, Thomas, you were talking about being able to shut off, essentially, the compute so you don't have to keep paying for it. But there are other vendors out there doing something similar, separating compute from storage; they're famous for that. And you have Databricks out there doing their lakehouse thing. Do you compete with those? How do you participate, and how do you differentiate?
>> Well, you know, you've heard these terms: data lake, warehouse, now lakehouse. What everybody wants is simple in, easy in. The problem with data lakes, however, was the complexity of getting value out. So I asked, what if you could have the easy in and the value out? If you look at, say, Snowflake as a warehousing solution, you have to do all that prep and data movement to get into the system, and it's rigid, static. Databricks and the lakehouse now have the exact same thing: sure, they have a data lake philosophy, but their data ingestion is not data lake philosophy. So I said, what if we had that simple in, with a unique architecture and index technology, made virtually accessible and publishable, dynamically, at petabyte scale? Our service connects to the customer's cloud storage, you stream the data in, we set up what we call a live indexing stream, and then you go to our data refinery and publish views that can be consumed via the Elasticsearch API, using Kibana or Grafana, or as SQL tables in Looker or, say, Tableau. So we're getting the benefits of both sides: schema-on-read flexibility with schema-on-write performance. If you can do that, that's the true promise of a data lake. Again, nothing against Hadoop, but schema-on-read, with all that complexity of software, ended up a bit of a data swamp.
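The schema-on-read versus schema-on-write trade-off Thomas is collapsing can be shown in a few lines: one pays the parsing and shaping cost at ingest time, the other at query time. This is a generic illustration of the two models, not ChaosSearch code.

```python
# Generic illustration of the trade-off: schema-on-write parses and shapes
# records at ingest (fast queries, rigid pipeline), while schema-on-read
# stores raw bytes and interprets them at query time (easy ingest, more
# work per query). Not ChaosSearch code.
import json

raw_lines = [
    '{"level": "ERROR", "msg": "disk full"}',
    '{"level": "INFO", "msg": "ok"}',
]

# Schema-on-write: transform up front; queries scan pre-shaped rows.
table = [json.loads(line) for line in raw_lines]           # paid once, at ingest
errors_on_write = [r for r in table if r["level"] == "ERROR"]

# Schema-on-read: keep the raw data as-is; apply the schema inside the query.
errors_on_read = [
    rec for rec in (json.loads(line) for line in raw_lines)  # paid on every query
    if rec.get("level") == "ERROR"
]

assert errors_on_write == errors_on_read
# The claim in the conversation is that indexing the raw data once gives you
# schema-on-read flexibility while approaching schema-on-write performance.
```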
>> Well, we have to set this up, so let me give you a good prompt. Everybody I talk to has got this big bunch of Spark clusters and is now saying, all right, this doesn't scale, we're stuck. And, you know, I'm a big fan of Zhamak Dehghani and her concept of the data mesh, and it's early days. But if you fast-forward to the end of the decade, what do you see as the critical components of this notion, people call it data mesh, of the analytics stack? You're a visionary, Thomas; how do you see this thing playing out over the next decade?
>> I love her thought leadership, and to be honest, her core principles were our core principles 5, 6, 7 years ago. This idea of decentralization, data as a product, self-serve, federated computational governance: all of that was core principle for us. The trick is how you enable that mesh philosophy. I can say we're mesh-ready, meaning we can participate in a way that very few products can. If there are gates on data getting into your system, the ETL, the schema management, you've broken it, because my argument with the data mesh is that producers and consumers should have the same rights. I want the consumers, the people using the data, to choose how they consume it, as well as the producer publishing it. I can say our data refinery is that answer. You know, shoot, I'd love to open up a standard where we can really talk about producers and consumers and the rights each has. But I think she's right on the philosophy, and I think as products mature in the cloud and in these data lake capabilities, the trick is those gates. If you have to structure up front, if you have to set up those pipelines, the chance of getting your data into a mesh is the weeks and months that Ed was mentioning.
>> Well, I think you're right. I think the problem with data mesh today is the lack of standards. When you draw the conceptual diagrams, you've got a lot of lollipops, which are APIs, but they're all unique primitives. There aren't standards by which, to your point, the consumer can take the data the way he or she wants it and build their own data products, without having to tap people on the shoulder to ask how they can use it, where the data lives, and how to add their own data.
>> You're exactly right. So I'm an organization, I'm generating data, and I can courageously stream it into a lake. And then with the ChaosSearch service, the data is discoverable and configurable by the consumer. Say you go to the corner store because you want to make a certain meal tonight: you pick and choose what you want, how you want it. Imagine if the data mesh truly worked that way, where the producer publishes the information, all the things you can buy at the grocery store, and you choose what you want to make for dinner. If it's static, if you have to call up your producer to make a change, was it really a data-mesh-enabled service? I would argue not.
>> Ed, bring us home.
>> Well, maybe one more thing on this.
>> Please, yeah.
>> Because some of this is us talking about 2031, but largely these principles are what we have in production today, right? So even the self-service, where you can actually have business context on top of a data lake, we do that today. We talked about getting rid of the physical ETL, which is 80% of the work, but the last 20% is done by this refinery, where you can create virtual views with role-based access control, do all the transformation you need, and make it available. And you can give that as a role-based access service to your end users, your analysts; they don't have to be data scientists or DBAs. In the hands of a data scientist or a DBA it's powerful, but the fact of the matter is you don't have to be one. All of our employees, regardless of seniority, whether they're in finance or in sales, actually go through and learn how to do this. So you don't have to be IT. And they can come up with their own views, which is one of the things business units want to do themselves with data lakes. More importantly, because they have the context of what they're trying to do, instead of queuing up a very specific request that takes weeks, they're able to do it themselves.
>> And without having to put it in different data stores and ETL it, I can do things in real time or near real time. That's game-changing, and something we haven't been able to do, ever.
>> And then maybe just to wrap it up: listen, eight years ago Thomas and his group of founders came up with the concept of how to actually get after analytics at scale and solve the real problems. And it's not one thing; it's not just getting to S3. It's all these different things.
And what we have in market today is the ability to literally, simply stream it to S3. What we do is automate the process of getting the data into a representation that you can now share and augment, and then we publish the open API, so you can use the tools you want. The first use case is log analytics: hey, it's easy to just stream your logs in, and we give you Elasticsearch-type services. Same thing with SQL, and you'll see mainstream machine learning next year. So listen, I think we have the data lake, you know, 3.0 now, and we're just stretching our legs right now and having fun.
>> Well, you say log analytics, but I really do believe in this concept of building data products and data services, because I want to sell them, I want to monetize them, and being able to do that quickly and easily, so others can consume them, is the future. So guys, thanks so much for coming on the program. Really appreciate it. All right, in a moment, Kevin Miller of Amazon Web Services joins me. You're watching theCUBE, your leader in high-tech coverage.
Breaking Analysis: Cloud 2030, From IT to Business Transformation
>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante.
>> Cloud computing has been the single most transformative force in IT over the last decade. As we enter the 2020s, we believe that cloud will become the underpinning of a ubiquitous, intelligent and autonomous resource that will disrupt the operational stacks of virtually every company in every industry. Welcome to this week's special edition of Wikibon's CUBE Insights Powered by ETR. In this Breaking Analysis, as part of theCUBE365's coverage of AWS re:Invent 2020, we're going to put forth our scenario for the next decade of cloud evolution. We'll also drill into the most recent data on AWS from ETR's October 2020 survey of more than 1,400 CIOs and IT professionals. So let's get right into it and take a look at how we see the cloud of yesterday, today and tomorrow. This graphic shows our view of the critical inflection points that catalyzed cloud adoption. In the middle of the 2000s, the IT industry was recovering from the shock of the dot-com bubble and, of course, 9/11. CIOs were still licking their wounds from the "does IT even matter?" narrative. AWS launched its Simple Storage Service, and later EC2, with little fanfare in 2006, but developers at startups and small businesses noticed that overnight, AWS had turned the data center into an API. Analysts like myself saw the writing on the wall, but CEO after CEO poo-pooed Amazon's entrance into their territory and promised a cloud strategy that would allow them to easily defend their respective turfs. We'd seen the industry in denial before, and this was no different. The financial crisis was a boon for the cloud. CFOs saw a way to conserve cash, shift CAPEX to OPEX, and avoid getting locked into long-term capital depreciation schedules or constrictive leases. We also saw shadow IT take hold and then bleed into the 2010s in a big way. This of course created problems for organizations rightly concerned about security and rogue tech projects. CIOs were asked to come in and clean up the crime scene, and in doing so realized the inevitable, i.e., that they could transform their IT operational models, shift infrastructure management to more strategic initiatives, and drop money to the bottom lines of their businesses. The 2010s saw an era of rapid innovation and a level of data explosion that we'd not seen before. AWS led the charge with a torrid pace of innovation via frequent feature rollouts. Virtually every industry, including the all-important public sector, got into the act, again led by AWS with the seminal CIA deal. Google got in the game early, but never really took the enterprise business seriously until 2015, when it hired Diane Greene. Microsoft, though, saw the opportunity, leaned in heavily, and made remarkable strides in the second half of the decade, leveraging its massive software estate. The 2010s also saw the rapid adoption of containers and an exit from the long AI winter, which, along with the data explosion, created new workloads that began to go mainstream. During this decade we saw hybrid investments begin to take shape and show some promise, as the ecosystem broadly realized that it had to play in the AWS sandbox or it would lose customers. And we also saw edge and IoT use cases emerge, for example, AWS Ground Station. Okay, so that's a quick history of cloud from our vantage point.
The question is, what's coming next? What should we expect over the next decade? Whereas the last 10 years were largely about shifting the heavy burden of IT infrastructure management to the cloud, in the coming decade we see the emergence of a true digital revolution. And most people agree that COVID has accelerated this shift by at least two to three years. We see all industries as ripe for disruption as they create a 360-degree view across their operational stacks, meaning, for example, that sales, marketing, customer service, logistics, etc. are unified, such that the customer experience is also unified. We see data flows coming together as well, where domain-specific knowledge workers are first-party citizens in the data pipeline, i.e., not subservient to hyper-specialized technology experts. No industry is safe from this disruption, and the pandemic has given us a glimpse of what it's going to look like. Healthcare is going increasingly remote and becoming personalized; machines are making more accurate diagnoses than humans in some cases. Manufacturing will see new levels of automation. Digital cash, blockchain and new payment systems will challenge traditional banking norms. Retail has been completely disrupted in the last nine months, as has education. And we're seeing the rise of Tesla as a possible harbinger of a day when owning and driving your own vehicle could become the exception rather than the norm. Farming, insurance, on and on and on: virtually every industry will be transformed as this intelligent, responsive, autonomous, hyper-distributed system provides services that are ubiquitous and largely invisible. How's that for some buzzwords? But I'm here to tell you, it's coming. Now, a lot of questions remain. First, you may ask, is this still "cloud" that we're talking about? And I can understand why some people would ask that question. I would say this: the definition of cloud is expanding. Cloud has defined the consumption model for technology. You're seeing cloud-like pricing models move on-prem with initiatives like HPE's GreenLake and now Dell's APEX. SaaS pricing is evolving. You're seeing companies like Snowflake and Datadog challenging traditional SaaS models with a true cloud consumption pricing option. Not an option, actually; that's simply the way they price. And this, we think, is going to become the norm. Now, as hybrid cloud emerges and pushes to the edge, the cloud becomes, again, this hyper-distributed system, with a deployment and programming model that becomes much more uniform and ubiquitous. So maybe this s-curve that we've drawn here needs an adjacent s-curve with a steeper vertical, this decade jumping s-curves, if you will, into the new era. And perhaps the nomenclature evolves, but we believe that cloud will still be the underpinning of whatever we call this future platform. We also point out on this chart that public policy is going to evolve to address privacy and the concerns about concentrated industry power, which will vary by region and geography. So we don't expect the big tech backlash to abate in the coming years. And finally, we definitely see alternative hardware and software models emerging, as witnessed by Nvidia and Arm, DPUs from companies like Fungible, and AWS and others designing their own silicon for specific workloads, to control their costs and reduce their reliance on Intel.
So the bottom line is that we see programming models evolving from infrastructure-as-code to programmable digital businesses, where ecosystems power the next wave of data creation, data sharing and innovation. Okay, let's bring it back to the current state and take a look at how we see the market for cloud today. This chart shows a just-released update of our IaaS and PaaS revenue estimates for the big three cloud players: AWS, Azure and Google. You can see we've estimated Q4 revenues for each player and the full year 2020. Now, please remember our normal caveats on this data. AWS reports clean numbers, whereas Azure and GCP are estimates based on the little tidbits and breadcrumbs each company tosses our way, and we add in our own surveys and our own information from theCUBE Network. The following points are worth noting. First, while AWS's growth is lower than the other two, note what happens with the law of large numbers: yes, growth slows down, but the absolute dollars are substantial. Let me give an example. For AWS, Azure and Google in Q4 2020 versus Q4 2019, we project year-over-year quarterly growth rates of 25% for AWS, 46% for Azure and 58% for Google Cloud Platform. So, meaningfully lower growth for AWS compared to the other two. Yet AWS's revenue grows sequentially in absolute terms, from $11.6 billion to $12.4 billion, whereas the others are flat to down sequentially. Azure and GCP would have to come in with substantially higher annual growth to increase revenue from Q3 to Q4; that's a sequential increase AWS can achieve with lower year-over-year growth rates, because it's so large. Now, having said that, on an annual basis you can see both Azure and GCP are showing impressive growth, in both percentage and absolute terms. AWS is going to add more than $10 billion to its revenue this year, with Azure adding nearly $9 billion and GCP adding just over $3 billion. So there's no denying that Azure is gaining ground, as we've been reporting; GCP still has a long way to go. Thirdly, we also want to point out that these three companies alone now account for nearly $80 billion in infrastructure services annually, and the IaaS and PaaS business for the three combined is growing at around 40% per year. So much for repatriation.
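The law-of-large-numbers point is easy to check with the figures just cited. A projected $12.4 billion Q4 at 25% year-over-year growth implies roughly a $9.9 billion year-ago quarter, so the absolute dollars added are large; the comparison base for a 46% grower below is derived algebraically, not a reported Azure figure.

```python
# Checking the law-of-large-numbers point with the figures cited above: a
# projected $12.4B Q4 2020 at 25% year-over-year growth implies a Q4 2019
# base of roughly $9.9B, so "slower" growth still adds about $2.5B of new
# quarterly revenue. The 46% comparison base is derived algebraically, not
# a reported Azure number.
aws_q4_2020 = 12.4  # $B, projected in this episode
aws_yoy = 0.25

aws_q4_2019 = aws_q4_2020 / (1 + aws_yoy)
absolute_add = aws_q4_2020 - aws_q4_2019
print(f"AWS adds ~${absolute_add:.1f}B of quarterly revenue at 25% growth")

# How large would a 46% grower's year-ago quarter need to be to add as much?
required_base = absolute_add / 0.46
print(f"a 46% grower needs a ~${required_base:.1f}B year-ago quarter to match")
```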
Now let's take a deeper look at AWS specifically and bring in some of the ETR survey data. This wheel chart shows the granularity of how ETR calculates net score, or spending momentum. Each quarter, ETR gets responses from thousands of CIOs and IT buyers and asks them: are you spending more or less on a particular platform or vendor? Net score is derived by taking adoption plus increase and subtracting out decrease plus replacing; in other words, subtracting the reds from the greens. Now remember, AWS is a $45 billion company, and it has a net score of 51%. So despite its exposure to virtually every industry, including hospitality, airlines and other hard-hit sectors, far more customers are spending more with AWS than are spending less. Now let's look inside the AWS portfolio and try to understand where that spending goes. This chart shows the net score across the AWS portfolio for three survey dates going back to last October: October 2019 is the gray, the summer survey is the blue, and October 2020, the most recent survey, is the yellow. Net score is an indicator of spending velocity, and despite the deceleration shown in the yellow bars, these are very elevated net scores for AWS. Only Chime, video conferencing, is showing notable weakness in the AWS data set from the ETR survey, with an anemic 7% net score; every other sector has elevated spending scores. Let's start with Lambda on the left-hand side. You can see that Lambda has a 65% net score. For context, very few companies have net scores that high; Snowflake and Kubernetes spend are two examples with higher net scores. This is rarefied air for AWS Lambda, i.e., functions. Similarly, you can see AI, containers, cloud overall and analytics, all with net scores over 50%. Now, while database is still elevated with a 46% net score, it has come down from its highs of late. Perhaps that's because AWS has so many database options in its own portfolio and its ecosystem, and the survey maybe doesn't have enough granularity there; there's also competition, so I don't really know, but it's something we're watching. Overall, though, it's a very strong portfolio from a spending-momentum standpoint. Now let's flip the view and look at defections off the AWS platform. Okay, look at this chart. We find this mind-boggling. The chart shows the same portfolio view, but isolates the bright red portion of the wheel I showed you earlier, the replacements. And basically, you're seeing very few defections show up for AWS in the ETR survey. Again, only Chime is the sore spot; everywhere else in the portfolio we're seeing low single-digit replacements. That's very, very impressive. Now, one more data chart, and then I want to go to some direct customer feedback, and then we'll wrap. We've shown this chart before. It plots net score, or spending velocity, on the vertical axis and market share, which measures pervasiveness in the data set, on the horizontal axis. In the table in the upper-right corner, you can see the actual numbers that drive the plotting positions. And the data confirms what we know: this is a two-horse race right now between AWS and Microsoft. Google is kind of hanging out with the on-prem crowd, vying for relevance at the data center. We've talked extensively about how we would like to see Google evolve its business, rely less on appropriating our data to serve ads, and focus more on cloud; there's so much opportunity there. Nonetheless, you can see the so-called hybrid zone emerging. Hybrid is becoming real. Customers want hybrid, and AWS is going to have to learn how to support hybrid deployments with offerings like Outposts and others. But the data doesn't lie. The foundation has been set for the 2020s, and AWS is extremely well-positioned to maintain its leadership, in our view.
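For readers who want the arithmetic, net score as defined above is straightforward to compute from the survey's response buckets. The bucket shares in this sketch are hypothetical values chosen to reproduce AWS's reported 51%; ETR does not publish exactly these components here.

```python
# Net score per the definition above: (adoption + increase) minus
# (decrease + replacing), i.e. subtract the reds from the greens. The bucket
# shares below are hypothetical values that happen to reproduce AWS's
# reported 51%; they are not published ETR components.
def net_score(adoption: float, increase: float, flat: float,
              decrease: float, replacing: float) -> float:
    total = adoption + increase + flat + decrease + replacing
    assert abs(total - 1.0) < 1e-9, "response shares must sum to 100%"
    return (adoption + increase) - (decrease + replacing)

# 12% adopting + 45% spending more, minus 4% spending less + 2% replacing.
print(f"{net_score(0.12, 0.45, 0.37, 0.04, 0.02):.0%}")  # -> 51%
```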
Now, the last chart we'll show takes some verbatim comments from customers that sum up the situation. These quotes were pulled from several ETR event roundtables that occurred in 2020. The first one talks to the cloud compute bill: it spikes and can sometimes be unpredictable. The second comment is from a CIO at an IT/telco company. Let me paraphrase what he or she is saying: AWS is leading the pack and is number one, and this individual believes that AWS will continue to be number one, by a wide margin. The third quote is from a CTO at an S&P 500 organization, who talks to the cloud independence of the architecture they're setting up and the strategy they're pursuing. The central concern of this person is the software engineering pipeline, the CI/CD pipeline. The strategy is to clearly go multicloud, avoid getting locked in, and ensure that developers can be productive independent of the cloud platform: essentially separating the underlying infrastructure from the software development process. All right, let's wrap. We talked about how the cloud will evolve to become an even more hyper-distributed system that can sense, act and serve, and that provides sets of intelligent services on which digital businesses will be constructed and transformed. We expect AWS to continue to lead in this build-out, with its heritage of delivering innovations and features at a torrid pace. We believe that ecosystems will become the mainspring of innovation in the coming decade, and we feel that AWS has to embrace not only hybrid but cross-cloud services. It has to be careful not to push its ecosystem partners to competitors; it has to walk a fine line between competing with and nurturing its ecosystem. To date, its success has been key to maintaining that balance, as AWS has been able, for the most part, to call the shots. However, we shall see if competition and public policy attenuate its dominant position in this regard. What will be fascinating to watch is how AWS behaves, given its famed customer obsession, and how it decodes the customer's needs. As Steve Jobs famously said, "Some people say, give the customers what they want. That's not my approach. Our job is to figure out what they're going to want before they do." And I think it was Henry Ford who said, "If I'd asked customers what they wanted, they would've told me a faster horse." Okay, that's it for now. It was great having you for this special report from theCUBE Insights Powered by ETR. Keep it right there for more great content on theCUBE from re:Invent 2020 virtual. (cheerful music)