Ed Walsh & Thomas Hazel | A New Database Architecture for Supercloud
(bright music) >> Hi, everybody, this is Dave Vellante, welcome back to Supercloud 2. Last August, at the first Supercloud event, we invited the broader community to help further define Supercloud, we assessed its viability, and identified the critical elements and deployment models of the concept. The objectives here at Supercloud 2 are, first of all, to continue to tighten and test the concept; the second is, we want to get real-world input from practitioners on the problems that they're facing and the viability of Supercloud in terms of applying it to their business. So on the program, we've got companies like Walmart, Saks, Western Union, Ionis Pharmaceuticals, NASDAQ, and others. And the third thing that we want to do is drill into the intersection of cloud and data to project what the future looks like in the context of Supercloud. So in this segment, we want to explore the concept of data architectures and what's going to be required for Supercloud. And I'm pleased to welcome one of our Supercloud sponsors, ChaosSearch. Ed Walsh is the CEO of the company, joined by Thomas Hazel, who's the Founder, CTO, and Chief Scientist. Guys, good to see you again, thanks for coming into our Marlborough studio. >> Always great. >> Great to be here. >> Okay, so there's a little debate, I'm going to put you right on the spot. (Ed chuckling) A little debate going on in the community, started by Bob Muglia, the former CEO of Snowflake, who was at Microsoft for a long time. He looked at the Supercloud definition and said, "I think you need to tighten it up a little bit." So here's what he came up with. He said, "A Supercloud is a platform that provides a programmatically consistent set of services hosted on heterogeneous cloud providers." So he's calling it a platform, not an architecture, which was kind of interesting, and presumably the platform owner is going to be responsible for the architecture. But Dr. Nelu Mihai, who's a computer scientist behind the Cloud of Clouds Project, chimed in and responded with the following. He said, "Cloud is a programming paradigm supporting the entire lifecycle of applications with data and logic natively distributed. Supercloud is an open architecture that integrates heterogeneous clouds in an agnostic manner." So, Ed, words matter. Is this an architecture or is it a platform? >> Put us on the spot. So, I'm sure you have concepts; I would say it's an architectural or design principle. Listen, I look at Supercloud as a mega trend, just like cloud, just like data analytics. And some companies are using those design principles to literally get dramatically ahead of everyone else. I mean, things you couldn't possibly do if you didn't use cloud principles, right? So I think it's a Supercloud effect: you're able to do things you're otherwise not able to. So I think it's more a design principle, but if you do it right, you get dramatic effect as far as customer value. >> So the conversation that we were having with Muglia, and Tristan Handy of dbt Labs, was, I'll set it up as the following, and, Thomas, I'd love to get your thoughts. If you have a CRM, think about applications today: it's all about forms and codifying business processes. You type a bunch of stuff into Salesforce, and all the salespeople do it, and this machine generates a forecast.
What if you have this new type of data app that pulls data from the transaction system, the e-commerce, the supply chain, the partner ecosystem, et cetera, and then, without humans, actually comes up with a plan? That's their vision. And Muglia was saying, in order to do that, you need to rethink data architectures, and database architectures specifically; you need to get down to the level of how the data is stored on the disc. What are your thoughts on that? >> Well, first of all, I'm going to cop out, I think it's actually both. I do think it's a design principle. I think it's not open technology, but open APIs, open access, and you can build a platform on that design-principle architecture. Now, I'm a database person, I love solving the database problems. >> I've been waiting for you to launch into this. >> Yeah, so I mean, you know, Snowflake is a database, right? It's a distributed database. And we wanted to crack those codes, because, multi-region, multi-cloud, customers wanted access to their data, and their data is in a variety of forms, all these services that you talked about. And so what I saw as a core principle was cloud object storage: everyone streams their data to cloud object storage. From there we said, well, how about we rethink database architecture, rethink file format, so that we can take each one of these services and bring them together, whether distributively or centrally, such that customers can access and get answers, whether it's operational data, whether it's business data, AKA search, or SQL, complex distributed joins. But we had to rethink the architecture. I like to say we're not a first generation, or a second; we're a third generation distributed database on pure, pure cloud storage, no caching, no SSDs. Why? Because all of that, the availability, the cost, the time, is a struggle, and cloud object storage, we think, is the answer. >> So when you say no caching: when I think about how companies are solving some, you know, pretty hairy problems, take MySQL HeatWave. Everybody thought Oracle was going to just forget about MySQL; well, they come out with HeatWave. And the way they solve problems, and you see their benchmarks against Amazon, "Oh, we crush everybody," is they put it all in memory. So you said no caching? You're not getting performance through caching? How is that true, and how are you getting performance? >> Well, so five, six years ago, right? When you realize that cloud object storage is going to be everywhere, and it's going to be a core foundational, if you will, fabric, what would you do? Well, a lot of the time the second generation says, "We'll take it out of cloud storage, put it in SSDs or something, and put it into cache." And that adds a lot of time, adds a lot of cost. But I said, what if, what if we could actually make the first read hot, the first read doing distributed joins and searching? And so what we went out to do was say, we can't cache, because that adds time, that adds cost. We have to make cloud object storage high performance, so it feels like a caching SSD. That's where our patents are, that's where our technology is, and we've spent many years working towards this. So, to me, if you can crack that code, a lot of these issues we're talking about, multi-region, multicloud, different services: everybody wants to send their data to the data lake, but then they move it out. We said, "Keep it right there." >> You nailed it, the data gravity.
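(Aside: to make the "no caching" idea concrete, here is a minimal sketch of the pattern being described, querying data where it sits in S3 rather than copying it into a hot tier first. ChaosSearch's own index and query engine are proprietary and not shown here; this sketch uses AWS's S3 Select API instead, and the bucket, key, and field names are invented for illustration.)

```python
# A sketch, assuming newline-delimited JSON logs have already landed in S3.
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-log-bucket",                    # placeholder bucket
    Key="logs/2023/01/17/app.json",            # placeholder key
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.status FROM S3Object s WHERE s.status = 'error'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)
# The first read is served straight off object storage; no SSD or memory
# tier was provisioned ahead of time, and nothing stays up after the query.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```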
So, Bob's right, the data's coming in, and you need to get the data from everywhere, but you need an environment where you can deal with all the different schemas, all the different types of technology, and also at scale. Bob's right: you cannot use memory or SSDs to cache that; it doesn't scale, and it doesn't scale cost effectively. But if you could, and that's what you did, is you made object storage, S3 first, but object storage generally, the only persistence. And then we get performance; we should talk about it: it's literally, you know, hundreds of terabytes of queries, and it's done in seconds, without memory caching. We have concepts of caching, but the only persistence is the object storage itself: when we're doing caching, we're just keeping track of things off to the side, on S3 itself. So we're actually using the object storage to be a database, which is kind of what Bob was saying; we agree, but that's where you started, and people thought you were crazy. >> And maybe make it live. Don't think of it as archival or temporary space; make it live, real-time streaming, operational data. What we do is make it smart. We see the data coming in, we uniquely index it such that you can get your use cases, whether that's search, observability, security, or backend operational. But we don't have to have this, I dunno, static, fixed, siloed type of architecture and technologies that were traditionally built prior to Supercloud thinking. >> And you don't have to move everything. Essentially, you can do it wherever the data lands, whatever cloud across the globe; you're able to bring it together, and you get the cost effectiveness, because the only persistence is the cheapest persistent storage layer you can buy. But the key thing is you cracked the code. >> We had to crack the code, right? That was the key thing. >> That's where the patents are. >> And then once you do that, everything else gets easier to scale: your architecture, across regions, across clouds. >> Now, it's a general purpose database, as Bob was saying, but we use that database to solve a particular issue, which is around operational data, right? So, we agree with Bob. >> Interesting. So this brings me to this concept of data. Zhamak Dehghani is one of our speakers, you know, and we talk about data fabric, which is originally a NetApp concept; Gartner's kind of co-opted it. But the basic concept is, data lives everywhere, whether it's an S3 bucket, or a SQL database, or a data lake; it's just a node on the data mesh. So in your view, how does this fit in with Supercloud? Ed, you've said that you've built, essentially, an enabler for that, for the data mesh; I think you're an enabler for Supercloud-like principles. This is a big, chewy opportunity, and it requires, you know, a team approach. There's got to be an ecosystem, there's not going to be one Supercloud to rule them all, so where does the ecosystem fit into the discussion, and where do you fit into the ecosystem? >> Right, so we agree completely, there's not one Supercloud in effect, but we use Supercloud principles to build our platform, and then, you know, the ecosystem's going to be built on leveraging what everyone else's secret powers are, right?
So our power, our superpower, based upon what we built, is this: if you're having any scale or cost-effective-scale issues with data, machine-generated data, like business observability or security data, we are your force multiplier. We take that in singularly: simply put it in your object storage wherever it sits, and we give you uniform access to it using open API access, SQL, or, you know, the Elasticsearch API. So that's what we do, that's our superpower. So I'll play it into data mesh: that's a perfect fit, we are a node on a data mesh. But I'll also play out how we see the ecosystem developing; we talked about it just in the last couple of days. Short term, with our superpowers, we deal with this data that's coming at these environments: people, customers, building out observability or security environments, or vendors that are selling their own Supercloud, "I do observability," the Datadogs of the world, dot dot dot, the Splunks of the world, dot dot dot, and security. So we fit in naturally. What we do is cost-effective scale: just land it anywhere in the world, we deal with ingest, and it's an order of magnitude, or two or three orders of magnitude, more cost effective. Their customers are asking them to do the impossible: "Give me fast monitoring and alerting. I want it snappy, but I want it to keep two years of data, (laughs) and I want it cost effective." It doesn't work. They're good at the fast monitoring and alerting; we're good at the long-term retention. And yes, there's some gray area between those two, but one plus one together is actually cheaper, so we would partner. So that's the first ecosystem play: all the data's already in those same environments, so the security and observability players can literally, just through API, pull our data into their pane of glass. We can make it seamless for customers. Right now, we make it helpful to customers: if you're on Datadog, we make it a button, easy to go from Datadog to us for logs, and save you money. Same thing with Grafana. But you can also look at the ecosystem with those same vendors: a year ago it was, you know, all about how you can grow, growth at all costs; now it's about COGS. So literally we can go into an environment, you supply what your customer wants, but we can help with COGS. And one plus one in a partnership is better than trying to build it on your own. >> Thomas, you were saying you make the first read fast, so think about Snowflake. Everybody wants to talk about Snowflake and Databricks. So, Snowflake, great, but you've got to get the data in there. All right, so can you help with that problem? >> I mean, we want simple in, right? And if you have to have structure going in, you're not simple. So the idea is that you have a simple-in data lake, a schema-on-read type philosophy, but schema-on-write type performance. And so what I wanted to do, what we have done, is have that simple lake, stream that data in real time, and have those access points of search or SQL to go after whatever business case you need: security, observability, warehouse integration. But the key thing is, how do I make that click, click, click answer, and do it quickly? And so what we want to do is make that first read fast. Why? Because otherwise you're going to do all this siloing, layers, complexity. If your first read's not fast, you're at a disadvantage, particularly in cost.
And nobody says, "I want less data," but everyone has to cut it somewhere, whether they shorten the retention window or say they'll use AI to choose; and in a security moment, when you don't have that answer, you're in trouble. And that's why we are this service, this Supercloud service, if you will, providing access, well-known search, well-known SQL-type access, because if you have just one access point, you're at a disadvantage. >> We actually talked about Snowflake and BigQuery, and a different platform, Databricks. That's kind of where we see phase two of the ecosystem. One is easy, the low-hanging fruit: observability and security firms. But the next one is what we do, our superpower: dealing with this messy data whose schema is changing like night and day. Pipelines are tough, and it's changing all the time, but you want these things fast, and it's big data around the world. That's the next point: just use us alongside, or inside, one of their platforms, and now we get the best of both worlds. Our superpower is keeping this messy data as a stream, okay, not a batch thing, and allowing you to do that. So, that's the second one. And then, to be honest, the third one, which plays into Supercloud, and also plays perfectly into the data mesh, is if you really go to the ultimate thing: what we have done is made object storage, S3, GCS, and blob storage, a database. Put, get, complex query with big joins. You know, so back to your original thing, and Muglia teed it up perfectly, we've done that. Now imagine if that's an ecosystem: who would want that? If, again, it's uniformly available across all the regions, across all the clouds, and it's right next to where you're building a service, or where a client is, that's where the ecosystem comes in. I think people are going to use Superclouds for their superpowers. We're really good at this; that's the short term. I think the Snowflakes and the Databricks are the medium term, you know? And then I think it eventually gets to: hey, listen, if you can make object storage fast, you can just go after it with simple SQL queries, or Elastic. Who would want that? I think that's where people are going to leverage it. It's not going to be one Supercloud; we leverage the Superclouds. >> Our viewpoint is smart object storage can be programmable, and so we agree with Bob, but we're not saying do it here, or do it there. This core, fundamental layer across regions, across clouds, that everyone has? Simple in. Right now, it's hard to get data in for access, for analysis. So we said, simply, we'll automate the entire process, give you API access across regions, across clouds. And again, how do you do a distributed join that's fast? How do you do a distributed join that doesn't cost you an arm and a leg? And how do you do it at scale? That's where we've been focused. >> So prior, the cloud object store was a niche. >> Yeah. >> S3 obviously changed that. How standard is, essentially, object store across the different cloud platforms? Is that a problem for you? Is that an easy thing to solve? >> Well, let's talk about it. I mean, we've abstracted it, but fundamentally, cloud object storage is put, get, and list. That's why it's so scalable, 'cause it doesn't have all these other components. That complexity is where we have moved up, and we provide direct analytical API access. So because of its simplicity, and cost, and security, and reliability, it can scale naturally.
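(For reference, the put, get, and list primitives just mentioned look like this with boto3, the standard AWS SDK for Python; bucket and key names are placeholders.)

```python
import boto3

s3 = boto3.client("s3")

# put: write an object
s3.put_object(Bucket="my-bucket", Key="events/e1.json",
              Body=b'{"user": "alice", "action": "login"}')

# get: read it back
obj = s3.get_object(Bucket="my-bucket", Key="events/e1.json")
print(obj["Body"].read())

# list: enumerate objects under a prefix
for item in s3.list_objects_v2(Bucket="my-bucket", Prefix="events/").get("Contents", []):
    print(item["Key"], item["Size"])
```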
I mean, really, distributed object storage is easy; it's put-get anywhere. What we've done is put a layer of intelligence on it, you know, call it smart object storage, where access is simple. So whether it's multi-region, do a query across; or multicloud, do a query across; or hunting, searching. >> We've had clients doing Amazon and Google, we have some Azure, but we see Amazon and Google more, and it's a consistent service across all of them. Just literally put your data in the bucket of choice, or folder of choice, click a couple buttons, literally click to say "that's hot," and after that, it's hot, you can see it. But we're not moving data; the data gravity issue, that's the other thing. The data is already natively flowing to these pools of object storage across different regions and clouds. We don't move it, we index it right there, and we spin up stateless compute, back to the Supercloud concept. But now that allows us to do all these other things, right? >> And it's no longer just cheap and deep object storage. Right? >> Yeah, we make it the same: you have an analytic platform regardless of where you're at, and you don't have to worry about that. We deal with that, we deal with the stateless compute coming up -- >> And make it programmable. Be able to say, "I want this bucket to provide these answers." Right, that's really the hope, the vision. And the complexity to build the entire stack, and then connect them together? We said, the fabric is cloud storage; we just provide the intelligence on top. >> Let's bring it back to the customers, and one of the things we're exploring in Supercloud 2 is, you know, is Supercloud a solution looking for a problem? Is multicloud really a problem? I mean, you hear, you know, a lot of the vendor marketing says, "Oh, it's a disaster, because it's all different across the clouds." And I talked to a lot of customers, even as part of Supercloud 2, who are like, "Well, I solved that problem by just going mono cloud." Well, but then you're not able to take advantage of a lot of the capabilities and the primitives, you know, like Google's data tools, or Microsoft's simplicity, their RPA, whatever it is. So what are customers telling you, what are the near-term problems they're trying to solve today, and how are they thinking about the future? >> Listen, it's a real problem. I think this is a mega trend, just like cloud. Cloud, data, and, I always add, analytics are the mega trends. If you're looking at those, and you're not considering using Supercloud principles, in other words, leveraging what I have, abstracting it out, getting the most out of that, and then building value on top, I think you're not going to be able to keep up. In fact, there's no way you're going to keep up with this data volume. It's a geometric challenge, and you're trying to do linear things. So clients aren't necessarily asking, hey, for Supercloud, but they're really saying, "I need a better mechanism to simplify this and get value across it, and how do you abstract that out to do that?" And that's where, obviously, in our conversations, they're more amazed at what we're able to do, and what they're able to do with our platform. Because if you think of what we've done with S3, or GCS, or object storage, they can't imagine the ingest, they can't imagine how easy it is: time to glass, one minute, no matter where it lands in the world, and querying hundreds of terabytes in seconds.
People are amazed; so they're not asking for that, but they are amazed. And then when you start talking it through: if you're an enterprise person building a big cloud data platform, or doing data or analytics, and you're not trying to leverage the public clouds, somehow leverage all of them, and then build on top, then I think you're missing it. So they might not be asking for it, but they're doing it. >> And they're looking for a lens. You mentioned all these different services: how do I bring those together quickly? You know, our viewpoint, our service, is: I have all these streams of data; create a lens where they want to go after it via search, or go after it via SQL, and bring them together instantly, no ETLing out, no "define this table, put it into this database." We said, let's have a service that creates a lens across all these streams, and then make those connections. I want to take my CRM with my Google AdWords, and maybe my Salesforce: how do I do analysis? Maybe I want to hunt first, maybe I want to join, maybe I want to add another stream to it. And so our viewpoint is, it's so natural to get into these lake platforms, and then provide lenses to get that access. >> And they don't want it separate, they don't want something different here, and different there. They want it basically -- >> So this is our industry, right? If something new comes out, remember when virtualization came out: "Oh my God, this is so great, it's going to solve all these problems." And all of a sudden it just got to be this big, more complex thing. Same thing with cloud, you know? It started out with S3, and then EC2, and now hundreds and hundreds of different services. So, it's a complex matter for a lot of people, and this creates problems for customers, especially when you've got divisions that are using different clouds. And you're saying that the solution, or a solution for part of the problem, is to really allow the data to stay in place on S3, use that standard, super simple, but then give it what, Ed, you've called superpower a couple of times: make it fast, make it inexpensive, and allow you to do that across clouds. >> Yeah, yeah. >> I'll give you guys the last word on that. >> No, listen, I think, we think Supercloud allows you to do a lot more. And for us, data: everyone says more data, more problems, more budget issues. Everyone knows more data is better, and we show you how to do it cost effectively at scale. And we couldn't have done it without the design principles of leveraging the Supercloud to get capabilities; and because we use just the object storage, we're able to get these capabilities of ingest, scale, and cost effectiveness, and then we built on top of that. In the end, a database is a data platform that allows you to go after everything distributed, and to get one platform for analytics, no matter where it lands. That's where we think the Supercloud concepts are perfect, that's where our clients are seeing it, and we're kind of excited about it. >> Yeah, a third generation database, a Supercloud database, however we want to phrase it; make it simple, but provide the value, and make it instant. >> Guys, thanks so much for coming into the studio today. I really thank you for your support of theCUBE and theCUBE community; it allows us to provide events like this and free content. I really appreciate it. >> Oh, thank you. >> Thank you. >> All right, this is Dave Vellante for John Furrier and theCUBE community, thanks for being with us today.
You're watching Supercloud 2, keep it right there for more thought-provoking discussions around the future of cloud and data. (bright music)
Ed Walsh, Courtney Pallotta & Thomas Hazel, ChaosSearch | AWS 2021 CUBE Testimonial
(upbeat music) >> My name's Courtney Pallotta, I'm the Vice President of Marketing at ChaosSearch. We've partnered with theCUBE team to take every one of those assets, tailor them to meet whatever our needs were, and get them out and shared far and wide. And theCUBE team has been tremendously helpful in partnering with us to make that a success. >> theCUBE has been fantastic with us. They are thought leaders in this space. And we have a unique product, a unique vision, and they have insight into where the market's going. They've had conversations with us about data mesh, and how we fit into that new realm of data access. And with our unique vision, with our unique platform, and with theCUBE, we've uniquely come out into the market. >> What's my overall experience with theCUBE? Would I do it again, would I recommend it to others? I recommend theCUBE to everyone. In fact, I was at IBM, and some of the IBM executives didn't want to go on theCUBE because it's a live interview. Live interviews can be traumatic. But the fact of the matter is, yeah, they're tough questions, but they're in line, they're what clients are looking for. So yes, you have to be on the ball. I mean, you're always on your toes, but you get your message out so crisply. So I recommend it to everyone. I've gotten a lot of other executives to participate, and they've all had a great experience. You have to be ready. I mean, you can't go on theCUBE and not be ready, but then you can get your message out. And it has such good distribution. I can't think of a better platform. So I recommend it to everyone. If I describe ChaosSearch in one word, I'd say digital transformation, with a hyphen.
Ed Walsh and Thomas Hazel, ChaosSearch
>> Welcome to theCUBE, I am Dave Vellante. And today we're going to explore the ebb and flow of data as it travels into the cloud and the data lake. The concept of data lakes was alluring when it was first coined last decade by then-Pentaho CTO James Dixon. Rather than being limited to highly structured and curated data that lives in a relational database, in the form of an expensive and rigid data warehouse or a data mart, a data lake is formed by flowing data from a variety of sources into a scalable repository, like, say, an S3 bucket, that anyone can access, dive into, extract water, AKA data, from, and analyze data that's much more fine-grained and less expensive to store at scale. The problem became that organizations started to dump everything into their data lakes with no schema-on-write, no metadata, no context, just shoving it into the data lake to figure out what's valuable at some point down the road. Kind of reminds you of your attic, right? Except this is an attic in the cloud, so it's too big to clean out over a weekend. Well look, it's 2021 and we should be solving this problem by now. A lot of folks are working on this, but often the solutions add other complexities for technology pros. So to understand this better, we're going to enlist the help of ChaosSearch CEO Ed Walsh, and Thomas Hazel, the CTO and Founder of ChaosSearch. We're also going to speak with Kevin Miller, who's the Vice President and General Manager of S3 at Amazon Web Services, and of course they manage the largest and deepest data lakes on the planet. And we'll hear from a customer to get their perspective on this problem and how to go about solving it. But let's get started. Ed, Thomas, great to see you. Thanks for coming on theCUBE. >> Likewise. >> Face to face, it's really good to be here. >> It is nice face to face. >> It's great. >> So, Ed, let me start with you. We've been talking about data lakes in the cloud forever. Why is it still so difficult to extract value from those data lakes? >> Good question. I mean, data analytics at scale has always been a challenge, right? So, we're making some incremental changes; as you mentioned, we need to see some step-function changes. In fact, it's the reason ChaosSearch was really founded. But if you look at it, it's the same challenge around a data warehouse or a data lake: really, it's not just flowing the data in, it's how to get insights out. So it kind of falls into a couple of areas. The business side will always complain, and it's kind of uniform across everything in data lakes, everything in data warehousing. They'll say, "Hey, listen, I typically have to deal with a centralized team to do that data prep, because it's data scientists and DBAs." Most of the time they're a centralized group; sometimes they're in business units, but most of the time, because they're scarce resources, they're kept together. And then it takes a lot of time. It's arduous, it's complicated, it's a rigid process dealing with that team; it's hard to add new data, but it's also very hard to share data, and there's no way to do governance without locking it down. And of course they wish it were more self-serve. So that's what you hear from the business side constantly. Now, underneath, there are some real technology issues, because we haven't really changed the way we do data prep since the two thousands, right? So if you look at it, it falls into two big areas. One: how to do data prep. How do you take a request that comes in from a business unit?
"I want to do X, Y, Z with this data, I want to use this type of toolset to do the following." Someone has to be smart about how to put that data in the right schema, as you mentioned; you have to put it in the right format so the toolsets can analyze that data before you do anything. And then the second thing, and I'll come back to the first 'cause that's the biggest challenge, but the second challenge is how these different data lakes and data warehouses are persisting data, and the complexity of managing that data, and also the cost of computing on it. And I'll go through that. But basically, the biggest thing is actually getting from raw data to insights: the rigidness and complexity is that someone literally has to do this ETL process, extract, transform, load. A request comes in, "I need so much data, in this type of way, put together," and they're literally physically duplicating data and putting it together in a schema. They're stitching together almost a data puddle for all these different requests. And anytime they have to do that, someone has to do it, and those very skilled resources are scarce in the enterprise, right? It's DBAs and data scientists. And then when they want new data, you give them a dataset, and they're always saying, "What can I add to this data, now that I've seen the reports? I want to add this data, more fresh," and the same process has to happen. This takes about 60% to 80% of the data scientists' and DBAs' time to do this work; it's well-documented. And this is what actually stops the process. That's what's rigid: they have to be rigid, because there's a process around it. That's the biggest challenge of doing this, and it takes an enterprise weeks or months; I always say three weeks or three months, and no one challenges me on that. It also consumes the same skill set of people that you want driving digital transformation, data warehousing initiatives, monetization, being data-driven: all these data scientists and DBAs that they don't have enough of. So this is not only hurting you getting insights out of your data lakes and warehouses; this resource constraint is hurting those bigger initiatives too. >> So that smallest atomic unit is that team, that super specialized team, right? >> Right. >> Yeah. Okay. So you guys talk about activating the data lake. >> Yep. >> For analytics. What's unique about that? What problems are you solving? You know, when you guys created this magic sauce. >> No, and basically, there's a lot of things. I highlighted the biggest one, which is how to do the data prep, but there's also how you're persisting and using the data. In the end, there are a lot of challenges in how to get analytics at scale, and this is really why Thomas and I founded the team to go after this. But I'll try to say it simply: I'll compare and contrast what we do to what you'd do with maybe an Elastic cluster or a BI cluster. What we do is, you simply put your data in S3; don't move it, don't transform it. In fact, we're against data movement. We literally point at that data, we index it, and we make it available in a data representation where you can give virtual views to end users. And those virtual views are available immediately, over petabytes of data, and they're presented to the end user as an open API. So if you're an Elasticsearch user, you can use all your Elasticsearch tools on this view.
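(To make that concrete, the sketch below issues a standard Elasticsearch _search query against a published view; the endpoint and view name are invented placeholders, not real ChaosSearch identifiers.)

```python
# Hypothetical sketch: standard Elasticsearch query DSL aimed at a virtual view.
import requests

ENDPOINT = "https://search.example.invalid"   # placeholder endpoint
VIEW = "app-logs-view"                        # placeholder view name

query = {"query": {"match": {"level": "ERROR"}}, "size": 5}
r = requests.post(f"{ENDPOINT}/{VIEW}/_search", json=query)
for hit in r.json()["hits"]["hits"]:
    print(hit["_source"])                     # same response shape Kibana and
                                              # other Elasticsearch tools expect
```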
If you're a SQL user: Tableau, Looker, all the different tools, same thing, and the same with machine learning next year. So what we do is make it very simple. Simply put it there; it's already there. Point us at it. We do the hard part, indexing it and making it available, and then we publish the open APIs so your users can use exactly what they use today. So let me give you a before and after. Let's say you're doing Elasticsearch, you're doing log analytics at scale: you're landing your data in S3, and then you're ETLing it, physically duplicating and moving the data, and typically deleting a lot of it, to get it into a format Elasticsearch can use. You're persisting it in a data layer called Lucene. It's physically sitting in memory, CPUs, SSDs, and it's not one server, it's a bunch of them; in the cloud, you have to stand them up, because they're persisting on EC2, and keep them up seven by 24, not a very cost-effective way to do cloud computing. What we do in comparison is literally point at the same S3. In fact, you can run us in complete parallel with the data that's already being ETLed out; we're just one more use case, read-only, taking that data and making these virtual views. So we run in complete parallel, but what happens is we just give a virtual view to the end users. We don't need that persistence layer, that extra cost layer, that extra time, cost, and complexity. So look at what happens with Elastic: they have a constraint, a trade-off of how much you can keep versus how much you can afford to keep. And it also becomes unstable over time, because you have to build out a schema, and it's on servers; as the schema scales out, guess what, you have to add more servers, very expensive, and they're up seven by 24. And they also become brittle: you lose one node, and the whole thing has to be put back together. We have none of that cost and complexity. You keep whatever you want in S3, the single persistence, very cost effective. And on cost, we save 50 to 80%. Why? We don't go with the old paradigm of setting it up on servers, spinning them up for persistence, and keeping them up seven by 24. Literally, when your query comes in, we bring up the right compute resources, and then we release those resources after the query is done. So we can do queries they can't imagine at scale, and we're able to do the exact same query at 50 to 80% savings, and you don't have any of the toil of moving that data or managing that layer of persistence, which is not only expensive, it becomes brittle. And then, I'll be quick: once you go to BI, it's the same challenge, but in BI the requests come constantly from a business unit down to the centralized data team. "Give me this flavor of data; I want to use this analytic tool in that toolset." So they have to do all this pipelining. They're constantly saying, "Okay, I'll give you this data, this data," duplicating that data, moving it, stitching it together, and then the minute you want more data, they do the same process all over. We completely eliminate that.
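(For readers who haven't lived this, the pipeline pattern just described looks roughly like the toy script below, one bespoke job per business request; the file names and fields are illustrative.)

```python
# Toy ETL: read raw events, reshape them to the schema one report needs, and
# write a duplicate physical copy. Each new request repeats this cycle.
import json

def etl_for_report(raw_path: str, curated_path: str) -> None:
    with open(raw_path) as src, open(curated_path, "w") as dst:
        for line in src:
            event = json.loads(line)
            row = {                                   # "transform": keep only
                "ts": event.get("timestamp"),         # what this report needs
                "user": (event.get("user") or {}).get("id"),
                "status": event.get("status"),
            }
            dst.write(json.dumps(row) + "\n")         # "load": a second copy

etl_for_report("raw_events.jsonl", "report_a.jsonl")  # ...and again for report B
```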
>> And those requests queue up. Thomas, what grabs me is, you don't have to move the data. That's kind of the exciting piece here, isn't it? >> Absolutely. I think, you know, the data lake philosophy has always been solid, right? The problem is we had that Hadoop hangover, right? Where, let's say, we were using that platform in a few too many ways. I always believed in the data lake philosophy; when James came and coined it, I was like, that's it. However, HDFS wasn't really a service. Cloud object storage is a service: the elasticity, the security, the durability, all those benefits are really why we founded on cloud storage as a first move. >> So you were talking, Thomas, about being able to shut off, essentially, the compute so you don't have to keep paying for it. But there are other vendors out there doing something similar, separating compute from storage; Snowflake is famous for that. And you have Databricks out there doing their lakehouse thing. Do you compete with those? How do you participate, and how do you differentiate? >> Well, you know, you've heard these terms: data lake, warehouse, now lakehouse. And what everybody wants is simple in, easy in. However, the problem with data lakes was the complexity of getting out, of driving value. And I said, what if, what if you have the easy in and the value out? So if you look at, say, Snowflake as a warehousing solution, you have to do all that prep and data movement to get into that system, and it's rigid, static. Now, Databricks, that lakehouse, has the exact same thing: sure, they have a data lake philosophy, but their data ingestion is not data lake philosophy. So I said, what if we had that simple in, with a unique architecture and index technology, making it virtually accessible, publishable, dynamically, at petabyte scale? And so our service connects to the customer's cloud storage; they stream the data in, set up what we call a live indexing stream, and then go to our data refinery and publish views that can be consumed through the Elastic API, using Kibana or Grafana, or as SQL tables in Looker or, say, Tableau. And so we're getting the benefits of both sides: schema-on-read flexibility with schema-on-write performance. If you can do that, that's the true promise of a data lake. You know, again, nothing against Hadoop, but schema-on-read, with all that complexity of software, is how we got data swamps.
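(A minimal sketch of the schema-on-read versus schema-on-write distinction; the hard part, delivering this with write-level performance at petabyte scale, is what the speakers claim as their patented piece and is not captured here.)

```python
# Schema-on-write fixes the structure before load, so fields that don't fit
# are dropped for good. Schema-on-read keeps the raw event and binds structure
# at query time.
import json

raw = [
    '{"user": "a", "ms": 12}',
    '{"user": "b", "ms": 40, "region": "eu-west"}',  # a field that appears later
]

def view(records, fields):
    """Late-bound 'view': each query picks the fields it cares about."""
    for r in records:
        event = json.loads(r)
        yield {f: event.get(f) for f in fields}

print(list(view(raw, ["user", "region"])))
# [{'user': 'a', 'region': None}, {'user': 'b', 'region': 'eu-west'}]
# "region" is queryable with no migration, because nothing was thrown away.
```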
>> Well, somebody had to start it, okay, so we'll give them credit. But everybody I talk to has got this big bunch of Spark clusters now, saying, "All right, this doesn't scale, we're stuck." And so, you know, I'm a big fan of Zhamak Dehghani and her concept of the data mesh, and it's early days. But if you fast forward to the end of the decade, you know, what do you see as the critical components of this notion, people call it data mesh, of getting the analytics stack right? You're a visionary, Thomas: how do you see this thing playing out over the next decade? >> I love her thought leadership. To be honest, our core principles were her core principles, you know, five, six, seven years ago. And so this idea of decentralization, data as a product, self-serve, and federated computational governance: all of that was our core principle. The trick is, how do you enable that mesh philosophy? I can say we're mesh-ready, meaning we can participate in a way that very few products can. If there are gates to getting data into your system, the ETL, the schema management... My argument with the data mesh is that producers and consumers should have the same rights: I want the consumers, the people who choose how they want to consume that data, to have them as well as the producer publishing it. I can say our data refinery is that answer. You know, shoot, I'd love to open up a standard, right? Where we can really talk about the producers and consumers and the rights each side has. But I think she's right on the philosophy. I think as products mature in the cloud, in these data lake capabilities, the trick is those gates: if you have to structure up front, if you set those pipelines, the chance of getting your data into a mesh is the weeks and months that Ed was mentioning. >> Well, I think you're right. I think the problem with data mesh today is the lack of standards. You know, when you draw the conceptual diagrams, you've got a lot of lollipops, which are APIs, but they're all unique primitives. So there aren't standards by which, to your point, the consumer can take the data the way he or she wants it and build their own data products, without having to tap people on the shoulder to say, "How can I use this? Where does the data live?", and being able to add their own data. >> You're exactly right. So I'm an organization, I'm generating data, and continuously streaming it into a lake. And then with the ChaosSearch service, the data is discoverable and configurable by the consumer. Let's say you want to go to the corner store: I want to make a certain meal tonight, I want to pick and choose what I want, how I want it. Imagine if the data mesh truly had that: the producer of information offering, you know, all the things you can buy at a grocery store, and you choosing what you want to make for dinner. Because if it's static, if you have to call up your producer to make a change, was it really a data-mesh-enabled service? I would argue not. >> Ed, bring us home. >> Well, maybe one more thing with this. >> Please, yeah. >> 'Cause some of this, we're talking 2031, but largely these principles are what we have in production today, right? So even the self-service, where you can actually have business context on top of a data lake, we do that today. We talked about how we get rid of the physical ETL, which is 80% of the work, and the last 20% is done by this refinery, where you can do virtual views, do all the transformation needed, and make it available. And that's available as a role-based-access service to your end users, your analysts; you don't have to be a data scientist or DBA. In the hands of a data scientist or DBA it's powerful, but the fact of the matter is, you don't have to be one. All of our employees, regardless of seniority, whether they're in finance or in sales, actually go through and learn how to do this. So you don't have to be in IT. And they can come up with their own views, which is one of the things about data lakes: the business units want to do it themselves. But more importantly, because they have the context of what they're trying to do, instead of queuing up a very specific request that takes weeks, they're able to do it themselves. >> And if I don't have to put it in different data stores and ETL it, I can do things in real time or near real time. And that's game changing, and something we haven't been able to do, ever. >> And then maybe just to wrap it up: listen, you know, eight years ago, Thomas and his group of founders came up with the concept: how do you actually get after analytics at scale and solve the real problems? And it's not one thing, it's not just getting S3; it's all these different things.
And what we have in market today is the ability to literally, simply stream it to S3. What we do is automate the process of getting the data into a representation that you can now share and augment, and then we publish open APIs, so you can use the tools you want. First use case, log analytics: hey, it's easy to just stream your logs in, and we give you Elasticsearch-type services. Same thing with SQL, and you'll see us mainstream machine learning next year. So listen, I think we have the data lake, you know, 3.0 now, and we're just stretching our legs right now, having fun. >> Well, you started with log analytics, but I really do believe in this concept of building data products and data services, because I want to sell them, I want to monetize them, and being able to do that quickly and easily, so others can consume them, is the future. So guys, thanks so much for coming on the program. Really appreciate it.
Ed Walsh and Thomas Hazel, ChaosSearch | JSON
>> Hi everybody, this is Dave Vellante. Welcome to this CUBE conversation with Thomas Hazel, who's the Founder and CTO of ChaosSearch. I'm also joined by Ed Walsh, who's the CEO. Thomas, good to see you. >> Great to be here. >> Explain JSON. First of all, what is it? >> JSON is a powerful data representation, a data source. But let's just say that when we try to drive value out of it, it gets complicated. ChaosSearch activates customers' data lakes: customers stream their JSON data to the cloud stores that we activate. Now, the trick is the complexity of a JSON data structure; you can have all this complexity in the representation. And here's the problem: putting that representation into an Elasticsearch database or a relational database is very problematic. So what people choose to do is pick and choose what they want, and/or they just store it as a blob. And so I said, what if we create a new index technology that could store the full representation, but dynamically, in what we call our data refinery, publish access to all the permutations that you may want? Because if you do a full-on flattening of that JSON, one row theoretically could be put into a million rows, and the relational data sort of explodes.
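(The arithmetic behind "one row could become a million" is easy to demonstrate; the document below is made up, but the cross-product effect is general.)

```python
# Fully flattening a JSON document whose fields hold arrays yields the cross
# product of those arrays when normalized into relational rows.
from itertools import product

doc = {
    "user": "alice",
    "devices": ["phone", "laptop", "tablet"],   # 3 values
    "regions": ["us-east", "eu-west"],          # 2 values
    "tags": ["t1", "t2", "t3", "t4"],           # 4 values
}

rows = [
    {"user": doc["user"], "device": d, "region": r, "tag": t}
    for d, r, t in product(doc["devices"], doc["regions"], doc["tags"])
]
print(len(rows))  # 24 rows from one document; three arrays of 100 elements
                  # each would already produce 1,000,000
```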
>> But then it gets really expensive. But everybody says they have JSON support; every database vendor that I talk to, it's a big announcement: "We now support JSON." What's the deal? >> Exactly. So you take your relational database, with all those relational constructs, and you have a proprietary JSON API to pick and choose. So instead of picking and choosing up front, now you're picking and choosing on the backend, where you really want the power of relational analysis of that JSON data. And that's where Chaos comes in: we expand those data streams, and we do it in a relational way. So all that tooling you've been trained to know and love, now you have access to it. Because if you're doing proprietary APIs on JSON data, you're not using Looker, you're not using Tableau; you're doing some type of proprietary, probably manual, analysis on the backend. >> Okay. So all the tools that you've trained everybody on, you can't really use them; you've got to build some custom stuff. So maybe bring that home, then, in terms of: what's the money, why do the suits care about this stuff? >> The reason this is so important is, think about anything cloud native: Kubernetes, your different applications, what you're doing in Mongo. It's all JSON. It's very powerful, but painful if you're not keeping the data. So what data scientists and DBAs are doing is leveling it down; they're saying, "I'm only going to keep the first four things." Think about Kubernetes, your app logs: they're trying to figure out, for Black Friday, what happened. Literally, every minute they'll cut a new log. You want to be able to say, listen, these are the users that were in the system for an hour, and here are the different things they did. The fact of the matter is, if you cut that off, you lose all that fidelity, all that data. So it's really important to have. If you're trying to figure out what happened, for security or for performance, or if you're the VP of product or growth trying to figure out how to cross-sell things, you need to know what everyone is doing. If you're not handling JSON natively, like we're doing, either your system keeps expanding, on Black Friday all of a sudden the logs get huge and the next day they're not, or you've lost really powerful data that you need to harness for business value. It's what's going to drive growth, it's what's going to drive the digital transformation. Without the technology, you're kind of blind, and to be honest, you don't know, because the data scientist kind of deleted the data on you. So this is big for the business and digital transformation, but also, it was such a pain that data scientists and DBAs were forced to just keep it simple, so it didn't blow up their system. We allow them to keep it simple, but keep it all. >> Both are powerful. It reminds me of when you go on vacation with your video camera: somebody breaks into your house, you go back to look and see who did it, and the data's gone, the video's gone, because you weren't able to save it, it's too expensive. >> Well, it's funny. This is the first data source that's driving the design of the database, because of all the value. We should be designing the database around the information it stores, not the structure of how it's been organized. And so our viewpoint is, you get to choose your structure, yet retain all that content. >> So if I'm a customer and a vendor says, "Hey, we've got JSON support," what questions should I ask to really peel the onion? >> Well, in particular: is it relational access to that data? Now, they could say, "Oh, we ETL the JSON into it," but chances are, with the explosion of JSON permutations, one row to a million, they're probably not doing the full representation. So from our viewpoint, either you're doing blob-type access through proprietary JSON APIs, or you're picking and choosing; that is the market's thinking today. However, what if you could take all the permutations and design your schema based on how you want to consume it, versus how you could store it? And that's the big difference with Chaos. >> So I should be asking: how do I consume this data? Do you ETL it in? How much data explosion is going to occur once I do this? And you're saying, for ChaosSearch, the answer to those questions is... >> The answer is, again, our philosophy: simply stream your data into your cloud object storage, your data lake, and with our index technology and our data refinery, you get to create views, dynamically, instantly, whether it's a terabyte or a petabyte, and describe how you want your data to be consumed, in a relational way or an Elasticsearch way; both are consumable through our data refinery, which is unique to us. >> The refinery gives you the view. So what happens if someone wants a different view, wants to unpack different columns or different metrics? You're able to do that in a virtual view, and it's available immediately, over petabytes of data. You don't have that episode where you come back, look at the video camera, and there's no data left. >> We do appreciate the time, and the explanation on really understanding JSON. Thank you. All right, and thank you for watching this CUBE conversation. This is Dave Vellante, we'll see you next time.
Ed Walsh & Thomas Hazel | Activating the Data Lake
>> Welcome to the Cube, I'm Dave Vellante. Today we're going to explore the ebb and flow of data as it travels into the cloud and the data lake. The concept of the data lake was alluring when it was first coined last decade by CTO James Dixon. Rather than being limited to the highly structured and curated data that lives in a relational database, in the form of an expensive and rigid data warehouse or data mart, a data lake is formed by flowing data from a variety of sources into a scalable repository, like, say, an S3 bucket, that anyone can access and dive into. They can extract data from that lake and analyze data that's much more fine-grained and less expensive to store at scale. The problem became that organizations started to dump everything into their data lakes with no schema, no metadata, no context; just shove it into the data lake and figure out what's valuable at some point down the road. Kind of reminds you of your attic, right? Except this is an attic in the cloud, so it's too big to clean out over a weekend. Look, it's 2021 and we should be solving this problem by now. A lot of folks are working on it, but often the solutions add other complexities for technology pros. So to understand this better, we're going to enlist the help of ChaosSearch CEO Ed Walsh and Thomas Hazel, the CTO and founder of ChaosSearch. We're also going to speak with Kevin Miller, who's the vice president and general manager of S3 at Amazon Web Services, and of course they manage the largest and deepest data lakes on the planet. And we'll hear from a customer to get their perspective on this problem and how to go about solving it. Let's get started. Ed, Thomas, great to see you, thanks for coming on the Cube. >> Likewise. It's really good to be in this nice space. >> Great. So let me start with you, Ed. We've been talking about data lakes in the cloud forever. Why is it still so difficult to extract value from that data? >> Good question. Data analytics at scale has always been a challenge, right? We're making incremental changes, but as you mentioned, we need some step-function changes; in fact, that's the reason ChaosSearch was founded. If you look at it, it's the same challenge around a data warehouse or a data lake: it's not just flowing the data in, it's how you get insights out. It falls into a couple of areas. The business side will always complain, and it's pretty uniform across data lakes: they'll say, listen, I typically have to deal with a centralized team to do that data prep, because the data scientists and DBAs are scarce resources, so most of the time they're a centralized group, sometimes in business units, but usually together. And then it takes a lot of time. It's arduous, it's complicated, it's a rigid process. It's hard for the team to add new data, it's very hard to share data, and there's no governance without locking it all down. And of course they'd like it to be more self-service. So you hear that from the business side constantly. Underneath that, there are some real technology issues, starting with the fact that we haven't really changed the way we do data prep since the two-thousands. It falls into two big areas. One is data prep: how do you take a request that comes in from a business unit, I want to do X, Y, Z with this data, I want to use this type of tool set to do the following, when someone has to be smart about how to put that data into the right schema?
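To picture the toil Ed is describing, here is a minimal sketch of that pick-and-choose ETL step. It is purely illustrative: the bucket, keys, and field names are invented, and a real pipeline would add scheduling, schema mapping, and a load step into a warehouse.

```python
import json
import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")

def build_puddle(bucket, keys, wanted_fields):
    """Extract raw JSON events and keep only the fields one team asked
    for. Every new request repeats this pipeline with a different
    field list, duplicating data and queueing on the central team."""
    rows = []
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.splitlines():
            event = json.loads(line)
            # lossy pick-and-choose: everything else is dropped
            rows.append({f: event.get(f) for f in wanted_fields})
    return rows

# Sales wants three fields today; tomorrow marketing wants five others,
# and the same extract-transform-load cycle starts all over again.
sales_rows = build_puddle("raw-logs", ["2021/07/01.jsonl"], ["user", "sku", "ts"])
```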
>> You mentioned you have to put it in the right format so the tool sets can analyze that data before you do anything. >> Right, and I'll come back to that, because it's the biggest challenge. The second challenge is how these different data lakes and data warehouses persist the data: the complexity of managing it and the cost of computing on it. But basically the biggest thing is getting from raw data past the rigidness and complexity the business sides face. Literally, someone has to do this ETL process: extract, transform, load. A request comes in, I need this much data shaped this way, and they're physically duplicating data and putting it together in a schema, stitching together almost a data puddle for each of these different requests. And anytime they have to do that, someone skilled has to do it, and those resources are scarce in the enterprise: it's the DBAs and the data scientists. And then when they want new data, you give them a data set, and once they've seen the reports they're always saying, can I add this data, can it be fresher, and the same process has to happen again. This takes about 60 to 80 percent of data scientists' and DBAs' time; it's well documented. And this is what actually stops the process. It has to be rigid because there's a process around it, and that's the biggest challenge. In the enterprise it takes weeks or months; I always say three weeks to three months, and no one ever challenges me on that. It also consumes the same skill set of people you want driving digital transformation, data warehousing initiatives, modernization, being data-driven: all those data scientists and DBAs you don't have enough of. So this is not only hurting your ability to get insights out of your data lake; that resource constraint is also hurting you actually getting smarter. >> The atomic unit is that team, that super-specialized team, right? Okay, so you guys talk about activating the data lake for analytics. What's unique about that? What problems are you solving? What's the magic sauce? >> There are a lot of things; the biggest one I highlighted is how you do the data prep, but also how you persist and use the data. In the end, there are a lot of challenges in getting analytics at scale, and that's really what Thomas founded the team to go after. I'll try to say it simply, and compare and contrast what we do with what you'd do with, say, an Elasticsearch cluster or a BI cluster. What we do is simple: your data is in S3; don't move it, don't transform it. In fact, we're against data movement. We literally point at that data, index it, and make it available in a data representation from which you can give virtual views to end users. And those virtual views are available immediately, over petabytes of data, and they're presented to the end user as an open API. So if you're an Elasticsearch user, you can use all your Elasticsearch tools on this view; if you're a SQL user, it's Tableau, Looker, all the different tools; and the same thing with machine learning next year.
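As a sketch of what consuming one of those views through an Elasticsearch-compatible API could look like: the endpoint and view name below are hypothetical, but the request body is standard Elasticsearch query DSL, which is the point of publishing an open API.

```python
# Hypothetical: query a published view exactly as if it were an
# Elasticsearch index. The endpoint and view name are invented;
# the query body is ordinary Elasticsearch DSL.
import json
import urllib.request

ENDPOINT = "https://search.example.com"  # hypothetical service endpoint
VIEW = "app-logs-view"                   # hypothetical published view

query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1h"}}},
    "aggs": {"users": {"terms": {"field": "user.keyword", "size": 10}}},
}

req = urllib.request.Request(
    f"{ENDPOINT}/{VIEW}/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # top users over the last hour, off object storage
```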
So what we do is make it very simple. The data is already there; simply point us at it. We do the hard work of indexing it and making it available, and then you publish the open APIs, so your users use exactly what they use today. I'll give you a before and after. Say you're doing Elasticsearch, doing log analytics at scale. People land their data in S3, and then they physically duplicate it, move it, and typically delete a lot of it to get it into a format Elasticsearch can use. They persist it up in a data layer called Lucene, physically sitting in memory, CPUs, SSDs, and not one of them, a bunch of them. In the cloud you have to set those up, because they're persisting it, and they stand up 24 by 7; not a very cost-effective way to do cloud computing. What we do, in comparison, is literally point at the same S3. In fact, you can run us in complete parallel while the data is being ETL'd; we're just one more read-only use case that takes that data and makes these virtual views. We just give a virtual view to the end users. We don't need that persistence layer, that extra cost layer, that extra time and complexity. When you look at Elastic, they have a constraint, a trade-off between how much you want to keep and how much you can afford to keep. It also becomes unstable over time, because you have to build out a schema, and it's on servers; the more the schema scales out, guess what, you have to add more servers. Very expensive, and they're up seven by twenty-four. And they become brittle: you lose one node and the whole thing has to be put back together. We have none of that cost and complexity. You keep whatever you want to keep in S3, a single persistence, very cost-effective. And on cost, we save 50 to 80 percent. Why? Because we don't go with the old paradigm of standing it up on servers, spinning them up for persistence, and keeping them up 24 by 7. We literally ask the cluster what the query needs, bring up the right compute resources, and release those resources after the query is done. So we can do queries they can't imagine at scale, and we can do the exact same query at 50 to 80 percent savings, without any of the toil of moving that data or managing that persistence layer, which is not only expensive, it becomes brittle. And, I'll be quick, once you go to BI it's the same challenge. With BI systems the requests are constantly coming from a business unit down to the centralized data team: give me this flavor of data, I want to use this analytic tool on that data set. So they have to build all these pipelines. They're constantly saying, okay, I'll give you this data, duplicating that data, moving it and stitching it together. And the minute you want more data, they do the same process all over again. We completely eliminate that.
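As a back-of-the-envelope illustration of why per-query compute against object storage can undercut an always-on cluster: every number below is invented purely to show the shape of the arithmetic, not a benchmark of any product.

```python
# Toy cost model: an always-on search cluster vs. compute spun up per
# query against object storage. All figures are made-up assumptions.
HOURS_PER_MONTH = 730

always_on_nodes = 6          # hypothetical cluster size, up 24/7
node_hourly = 1.50           # hypothetical $/node-hour
always_on_cost = always_on_nodes * node_hourly * HOURS_PER_MONTH

queries_per_month = 3_000    # hypothetical workload
avg_query_minutes = 2.0      # compute lives only while a query runs
burst_nodes = 10             # more nodes per query, but short-lived
ephemeral_hours = queries_per_month * (avg_query_minutes / 60) * burst_nodes
ephemeral_cost = ephemeral_hours * node_hourly

print(f"always-on : ${always_on_cost:,.0f}/month")   # $6,570
print(f"ephemeral : ${ephemeral_cost:,.0f}/month")   # $1,500
print(f"savings   : {1 - ephemeral_cost / always_on_cost:.0%}")  # 77%
```

With these made-up inputs the ephemeral model lands at roughly 77 percent savings, inside the 50-to-80-percent band Ed quotes; real savings obviously depend on the actual workload.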
>> The questions queue up. Thomas, what got me is that you don't have to move the data; that's kind of the defining piece here, isn't it? >> Absolutely. I think the data lake philosophy has always been solid, right? The problem is we had that Hadoop hangover, where, let's say, we were using that platform in a few too many varieties of ways. I always believed in the data lake philosophy; when James coined it, I said, that's it. However, HDFS wasn't really a service. Cloud object storage is a service: the elasticity, the security, the durability, all those benefits are really why we founded the company on cloud object storage as our first move. >> So, Thomas, Ed was talking about being able to shut off the compute so you're not paying for it constantly, but there are other vendors out there doing something similar; they're famous for separating compute from storage, and Databricks is out there doing their lakehouse thing. Do you compete with those? How do you participate, and how do you differentiate? >> I know you've heard these terms: data lake, warehouse, and now lakehouse. What everybody wants is simple, easy in. However, the problem with data lakes was the complexity of driving value out. And I said, what if you could have the easy in and the value out? If you look at, say, Snowflake as a warehousing solution, you have to do all that prep and data movement to get into the system, and it's rigid, static. Databricks and the lakehouse have the exact same thing; sure, they have a data lake philosophy, but their data ingestion is not data lake philosophy. So I said, what if we had that simple in, with a unique architecture and index technology that makes the data virtually accessible and publishable, dynamically, at petabyte scale? Our service connects to the customer's cloud storage, streams the data in, sets up what we call a live indexing stream, and then, through our data refinery, publishes views that can be consumed via the Elasticsearch API, with Kibana or Grafana, or as SQL tables, with Looker or, say, Tableau. So we're getting the benefits of both sides: schema-on-read flexibility with schema-on-write performance. If you can do that, that's the true promise of a data lake. Again, nothing against Hadoop, but schema-on-read, with all that complexity of software, was what made a bit of a data swamp.
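To illustrate what publishing the same data as SQL tables buys you, here is a minimal local sketch using DuckDB as a stand-in for SQL-on-JSON. This is not ChaosSearch's engine or syntax; the file name and fields are invented, and it relies on DuckDB's JSON extension (autoloaded in recent versions).

```python
# DuckDB as a local stand-in: expose raw JSON events relationally so
# ordinary SQL (and the BI tools that speak it) can query them.
import duckdb

con = duckdb.connect()
# Each line of events.json: {"user": "...", "clicks": [{"page": "..."}, ...]}
con.sql("CREATE TABLE events AS SELECT * FROM read_json_auto('events.json')")

# A view over the nested structure: UNNEST expands the clicks array
# into one row per click, the relational shape BI tools expect.
con.sql("""
    CREATE VIEW clicks_flat AS
    SELECT "user", UNNEST(clicks) AS click
    FROM events
""")

print(con.sql("""
    SELECT "user", click.page, COUNT(*) AS views
    FROM clicks_flat
    GROUP BY "user", click.page
    ORDER BY views DESC
"""))
```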
>> Okay, I've got to give you a good prompt here: everybody I talk to has a big bunch of Spark clusters and is now saying, all right, this doesn't scale, we're stuck. And, you know, I'm a big fan of Zhamak Dehghani's concept of the data mesh, and it's early days. But if you fast-forward to the end of the decade, what do you see as the critical components of this notion people call data mesh, the whole analytics stack? You're a visionary, Thomas; how do you see this playing out? >> I love her thought leadership. To be honest, our core principles were her core principles, five, six, seven years ago: this idea of decentralized data as a product, self-serve, federated computational governance. All of that was our core principle. The trick is how you enable that mesh philosophy. I'd say we're mesh-ready, meaning we can participate in a way that very few products can. If there are gates to getting data into your system, the ETL, the schema management, then my argument with the data mesh is that producers and consumers should have the same rights. I want the consumers to choose how they want to consume that data, just as the producer chooses how to publish it, and I can say our data refinery is that answer. Shoot, I'd love to open up a standard where we can really talk about producers and consumers and the rights each has. But I think she's right on the philosophy, and as products mature in the cloud and in these data lake capabilities, the trick is those gates: if you impose the structure up front, it gates those pipelines, and the chance of getting your data into a mesh becomes the weeks and months we were mentioning. >> Well, I think you're right. I think the problem with data mesh today is the lack of standards. When you draw the conceptual diagrams, you've got a lot of lollipops, which are APIs, but they're all unique primitives. So there aren't standards by which, to your point, the consumer can take the data the way he or she wants it and build their own data products without having to tap people on the shoulder to ask, how can I use this, where does the data live? >> You're exactly right. In an organization, generally, the data will be streamed to a lake, and then the ChaosSearch service makes that data discoverable and configurable by the consumer. Think of going to the corner store: I want to make a certain meal tonight, and I want to pick and choose what I want, how I want it. Imagine if the data mesh truly let the producer lay out everything you can buy at the grocery store, and you choose what you want for dinner. If it's static, if you have to call up your producer to make a change, was it really a data-mesh-enabled service? I would argue not. >> Ed, bring us home. >> Well, maybe one more thing on this, because some of what we're talking about is 2031, but largely these principles are what we have in production today. Even the self-service, where you can have business context on top of a data lake, we do that today. We talked about getting rid of the physical ETL, which is 80 percent of the work, and the last 20 percent is done by this refinery, where you can create virtual views, with the right RBAC, do all the transformation you need, and make it available. And that's delivered as a role-based access service to your end users, actual analysts, so you don't have to be a data scientist or a DBA. In the hands of a data scientist or DBA it's powerful, but the fact of the matter is you don't have to be one: our customers' employees, regardless of seniority, whether they're in finance or in sales, actually go through and learn how to do this. So you don't have to be IT, and they can come up with their own views, which is one of the things business units want to do with data lakes themselves, because they have the context of what they're trying to do. Instead of queuing up a very specific request that takes weeks, they do it themselves. >> And without all the different data stores and the ETL, I can do things in real time or near real time. That's game-changing, and something we haven't been able to do, ever.
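Here is a hypothetical sketch of what a self-service virtual view with role-based access might look like as data. The key idea is that a view is metadata over data already in place, not a copy, so defining one doesn't kick off an ETL project. All names and fields are invented.

```python
# Hypothetical model of a self-service virtual view with RBAC.
from dataclasses import dataclass, field

@dataclass
class VirtualView:
    name: str
    source: str                  # e.g. the S3 prefix the index points at
    columns: list                # fields this view exposes
    row_filter: str = ""         # optional predicate applied at query time
    allowed_roles: set = field(default_factory=set)

    def accessible_by(self, role: str) -> bool:
        return role in self.allowed_roles

# Finance defines its own slice of the same underlying data: no
# duplication, no queueing on the central data team.
finance_view = VirtualView(
    name="revenue_by_sku",
    source="s3://raw-logs/events/",
    columns=["ts", "sku", "amount"],
    row_filter="event_type = 'purchase'",
    allowed_roles={"finance", "admin"},
)

assert finance_view.accessible_by("finance")
assert not finance_view.accessible_by("sales")
```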
>> And maybe just to wrap it up: eight years ago a group of founders came up with a concept, how do you actually get after analytics at scale and solve the real problems? And it's not one thing; it's not just getting to S3, it's all these different things. What we have in market today is the ability to literally, simply stream your data to S3, and we automate the process of getting it into a representation you can share and augment; then we publish open APIs, so you can use the tools you want. The first use case is log analytics: it's easy to just stream your logs in, and we give you Elasticsearch-published services. The same thing is coming with SQL, and you'll see mainstream machine learning next year. So listen, I think we have data lake 3.0 now, and we're just stretching our legs. >> Well, you had me at log analytics. But I really do believe in this concept of building data products and data services, because I want to sell them, I want to monetize them, and being able to do that quickly and easily, so others can consume them, is the future. Guys, thanks so much for coming on the program, really appreciate it. All right, in a moment, Kevin Miller of Amazon Web Services joins me. You're watching the Cube, your leader in high-tech coverage.