Mark Lyons, Dremio | AWS Startup Showcase S2 E2
(upbeat music) >> Hello, everyone, and welcome to theCUBE presentation of the AWS Startup Showcase, Data as Code. This is season two, episode two of the ongoing series covering the exciting startups from the AWS ecosystem. Here we're talking about operationalizing the data lake. I'm your host, John Furrier, and my guest here is Mark Lyons, VP of product management at Dremio. Great to see you, Mark. Thanks for coming on. >> Hey John, nice to see you again. Thanks for having me. >> Yeah, we were talking before we came on camera here on this showcase. We're going to spend the next 20 minutes talking about the new architectures of data lakes and how they expand and scale. But we were kind of reminiscing about the old big data days, and how this really changed. There's a lot of hangovers from (mumbles) kind of fall through, Cloud took over, now we're in a new era and the theme here is data as code. It really highlights that data is now in the developer cycles of operations. So infrastructure as code led the DevOps movement for Cloud, programmable infrastructure. Now you've got data as code, which is really accelerating DataOps, MLOps, DatabaseOps, and more developer focus. So this is a big part of it. You guys at Dremio have a Cloud platform, query engine and a data tier innovation. Take us through the positioning of Dremio right now. What's the current state of the offering? >> Yeah, sure, so happy to, and thanks for kind of introing into the space that we're headed. I think the world is changing, and databases are changing. So today, Dremio is a full database platform, data lakehouse platform on the Cloud. So we're all about keeping your data in open formats in your Cloud storage, but bringing that full functionality that you would want to access the data, as well as manage the data.
All the functionality folks would be used to, from ANSI SQL compatibility, inserts, updates, deletes on that data, keeping that data in Parquet files in the Iceberg table format, another level of abstraction so that people can access the data in a very efficient way. And going even further than that, what we announced with Dremio Arctic, which is in public preview on our Cloud platform, is a full Git-like experience for the data. So just like you said, data as code, right? We went through waves of source code and infrastructure as code. And now we can treat the data as code, which is amazing. You can have development branches, you can have staging branches, ETL branches, which are separate from production. Developers can do experiments. You can make changes, you can test those changes before you merge back to production and let the consumers see that data. Lots of innovation on the platform, super fast velocity of delivery, and lots of customers adopting it just in the first month here since we announced Dremio Cloud generally available, where the adoption's been amazing. >> Yeah, and I think we're going to dig into a lot of the architecture, but I want to highlight the point you made about the branching off and taking a branch of Git. This is what developers do, right? The developers use GitHub, Git, they make branches from code. They build on top of other code. That's open source. This is what's been around for generations. Now for the first time we're seeing data sets being taken out of production to be worked on and coded and tested and even doing look backs or even forward looking analysis. This is data being programmed. This is data as code. This is really, you couldn't get any closer to data as code. >> Yeah. It's all done through metadata by the way.
So there's no actual copying of these data sets, 'cause in these big data systems, Cloud data lakes and stuff, these tables are billions of records, trillions of records, super wide, hundreds of columns wide, thousands of columns wide. You have to do this all through metadata operations so you can control what version of the data basically an individual's working with and which version of the data the production systems are seeing, because these data sets are too big. You don't want to be moving them. You can't be moving them. You can't be copying them. It's all metadata and manifest files and pointers to basically keep track of what's going on. >> I think this is the most important trend we've seen in a long time, because if you think about what Agile did for developers, okay, speed, DevOps, Cloud scale, now you've got agility in the data side of it where you're basically breaking down the old proprietary, old ways of doing data warehousing, but not killing the functionality of what data warehouses did. Just doing more volume. Data warehouses were proprietary, not open. They were different use cases. They were single application. Developers used data warehouse queries, not a lot of volume. But as you get volume, these things are inadequate. And now you've got the new open Agile. Is this Agile data engineering at play here? >> Yeah, I think it totally is. It's bringing it as far forward as possible. We're talking about making the data engineering process easier and more productive for the data engineer, which ultimately makes the consumers of that data much happier, as well as way more experiments can happen. Way more use cases can be tried. If it's not a burden and it doesn't require building a whole new pipeline and defining a schema and adding columns and data types and all this stuff, you can do a lot more with your data much faster.
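(Editor's sketch, not part of the conversation.) The metadata-only branching described above can be illustrated with a toy catalog in which a branch is nothing more than a named pointer to an immutable snapshot, and a snapshot is just a list of data-file paths. The class names, method names, and file paths here are hypothetical and chosen for illustration; this is not Dremio Arctic's API, only a minimal sketch of the idea that branching and merging move pointers rather than data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable table version: just a list of data-file paths."""
    snapshot_id: int
    files: tuple


@dataclass
class Catalog:
    """Hypothetical catalog: branch names are pointers to snapshots."""
    snapshots: dict = field(default_factory=dict)
    branches: dict = field(default_factory=dict)

    def commit(self, branch, add_files=(), remove_files=()):
        # A commit writes a new snapshot (metadata only) and moves the pointer.
        parent = self.snapshots.get(self.branches.get(branch))
        current = parent.files if parent else ()
        files = tuple(f for f in current if f not in remove_files) + tuple(add_files)
        snap = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots[snap.snapshot_id] = snap
        self.branches[branch] = snap.snapshot_id
        return snap

    def create_branch(self, name, source):
        # Branching copies one pointer; no data files are read or copied.
        self.branches[name] = self.branches[source]

    def merge(self, source, target):
        # Simplest possible merge: fast-forward the target pointer.
        self.branches[target] = self.branches[source]

    def files(self, branch):
        return self.snapshots[self.branches[branch]].files


catalog = Catalog()
catalog.commit("main", add_files=["sales/part-0001.parquet"])
catalog.create_branch("etl", "main")
catalog.commit("etl", add_files=["sales/part-0002.parquet"])
main_before_merge = catalog.files("main")   # production still sees the old version
catalog.merge("etl", "main")
main_after_merge = catalog.files("main")    # production now sees the tested change
```

A real catalog would also have to handle conflicting commits on both branches; the fast-forward merge here deliberately ignores that case.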
So it's really going to be super impactful to all these businesses out there trying to be data driven, especially when you're looking at data as code and branching. You branch off, you can de-risk your changes. You're not worried about messing up the production system, messing up that data, having it seen by an end user. For some businesses, data is their business, so that data would be going all the way to a consumer, a third party. And then it gets really scary. There's a lot of risk if you show the wrong credit score to a consumer or you do something like that. So it's really de-risking... >> Even updating machine learning algorithms. So for instance, if the data sets change, you can always be iterating on things like machine learning or learning algorithms. This is kind of new. This is awesome, right? >> I think it's going to change the world because this stuff was so painful to do. The data sets had gotten so much bigger as you know, but we were still doing it in the old way, which was typically moving data around for everyone. It was copying data down, sampling data, moving data, and now we're just basically saying, hey, don't do that anymore. We've got to stop moving the data. It doesn't make any sense. >> So I've got to ask you, Mark, data lakes are growing in popularity. I was originally down on data lakes. I called them data swamps. I didn't think they were going to be as popular because at that time, distributed file systems like Hadoop, and object store in the Cloud were really cool. So what happened between that promise of distributed file systems and object store and data lakes? What made data lakes popular? What made that work in your opinion? >> Yeah, it really comes down to the metadata, which I already mentioned once. But we went through these waves. John, you saw we did the EDWs to the data lakes and then the Cloud data warehouses. I think we're at the start of a cycle back to the data lake.
And it's because the data lakes this time around, with the Apache Iceberg table format, with project (mumbles) and what Dremio's working on around metadata, these things aren't going to become data swamps anymore. They're actually going to be functional systems that do inserts, updates, and deletes. You can see all the commits. You can time travel them. And all the files are actually managed and optimized, so you have to partition the data. You have to merge small files into larger files. Oh, by the way, this is stuff that all the warehouses have done behind the scenes, all the housekeeping they do, but people weren't really aware of it. And the data lakes the first time around didn't solve all these problems, so those files landing in a distributed file system do become a mess. If you just land JSON, Avro or Parquet files, CSV files into HDFS, or in an S3-compatible object store, doesn't matter, if you're just parking files and you're going to deal with it as schema on read instead of schema on write, you're going to have a mess. If you don't know which tool changed the files, which user deleted a file, updated a file, you will end up with a mess really quickly. So to take care of that, you have to put a table format on it, so everyone's looking at Apache Iceberg or the Databricks Delta format, which is an interesting conversation similar to the Parquet and ORC file format one that we saw play out. And then you track the metadata. So you have those manifest files. You know which files changed when, which engine, which commit. And you can actually make a functional system that's not going to become a swamp. >> Another trend that's extending on beyond the data lake is other data sources, right? So you have a lot of other data, not just in data lakes, so you have to kind of work with that. How do you guys answer the question around some of the mission critical BI dashboards out there on the latency side?
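(Editor's sketch, not part of the conversation.) Two of the housekeeping ideas mentioned above, a commit log that records which engine and user changed which files, and merging small files into larger ones, can be illustrated in a few lines. Everything here is an assumption made for illustration: the `Commit` record, the integer timestamps, the engine and user names, and the 128 MB target size are hypothetical, and this is not the Apache Iceberg metadata layout or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    ts: int        # commit time (any increasing integer for this sketch)
    engine: str    # which engine wrote the commit
    user: str      # who made the commit
    files: tuple   # (path, size_in_mb) pairs visible after this commit


def files_as_of(log, ts):
    """Time travel: return the file list from the latest commit at or before ts."""
    visible = [c for c in log if c.ts <= ts]
    if not visible:
        raise ValueError("no commit at or before the requested time")
    return max(visible, key=lambda c: c.ts).files


def plan_compaction(files, target_mb=128):
    """Group small files into batches whose total size stays within target_mb."""
    batches, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1]):
        if size >= target_mb:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_mb:
            if len(current) > 1:  # a batch of one file is not worth rewriting
                batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


log = [
    Commit(1, "spark", "etl_job", (("a.parquet", 4), ("b.parquet", 6))),
    Commit(2, "dremio", "analyst",
           (("a.parquet", 4), ("b.parquet", 6), ("c.parquet", 200))),
]
files_at_first_commit = files_as_of(log, 1)
compaction_plan = plan_compaction(files_as_of(log, 2))
```

Because every commit keeps its own file list, reading an older version is a lookup in the log rather than a restore, and the log itself answers the "which engine, which user" question.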
A lot of people have been complaining that these mission critical BI dashboards aren't getting the kind of performance they need as they add more data sources and they try to do more. >> Yeah, that's a great question. Dremio actually does a bunch of interesting things to bring the performance of these systems up, because at the end of the day, people want to access their data really quickly. They want the response times of these dashboards to be interactive. Otherwise the data's not interesting if it takes too long to get it. To answer your question, yeah, a couple of things. First of all, from a data sources side, Dremio is very proficient with Parquet files in an object store, like we just talked about, but it also can access data in other relational systems. So whether that's a Postgres system, whether that's a Teradata system or an Oracle system. That's really useful if you have dimensional data, customer data, not the largest data set in the world, not the fastest moving data set in the world, but you don't want to move it. We can query that where it resides. Bringing in new sources, we all know that's a key to getting better insights in your data, joining sources together. And then from a query speed standpoint, there's a lot of things going on here. Everything from the Apache Arrow project, which is an in-memory format for the Parquet data, so we're not serializing and de-serializing the data back and forth. As well as what we call reflections, which is basically a re-indexing or pre-computing of the data, but we leave it in Parquet format, in an open format in the customer's account, so that you can have aggregates and other things that are really popular in these dashboards pre-computed. So millisecond response, lightning fast, like tricks that the warehouses have been doing forever. Right? >> Yeah, more deals coming in. And obviously the architecture, we'll get into that now, has to handle the growth.
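(Editor's sketch, not part of the conversation.) The pre-computation idea behind reflections, as described above, is that an aggregate popular in dashboards is computed once and then served directly instead of rescanning the raw rows on every query. The snippet below only illustrates that general idea with in-memory Python data; the function names and sample rows are hypothetical, and it says nothing about how Dremio actually builds, stores, or selects reflections.

```python
from collections import defaultdict


def build_aggregate(rows, group_key, measure):
    """Scan the raw rows once and pre-compute a sum per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[measure]
    return dict(totals)


# Raw fact rows (in a real lake these would live in Parquet files).
rows = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 80.0},
    {"region": "EU", "amount": 30.0},
]

# Built once, ahead of time, and refreshed when the underlying data changes.
sales_by_region = build_aggregate(rows, "region", "amount")


def total_for(region):
    """Dashboard query answered from the pre-computed aggregate, not the raw rows."""
    return sales_by_region.get(region, 0.0)
```

The trade-off is the usual one for materialized results: faster reads in exchange for storage and a refresh step whenever the source data changes.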
And as your customers and practitioners see the volume and the variety and the velocity of the data coming in, how are they adjusting their data strategies to respond to this? Again, Cloud is clearly the answer, not the data warehouse, but what are they doing? What's the strategy adjustment? >> It's interesting when we start talking to folks. I think sometimes it's a really big shift in thinking about data architectures and data strategies when you look at the Dremio approach. It's very different than what most people are doing today around ETL pipelines and then bringing stuff into a warehouse and, oh, the warehouse is too overloaded so let's build some cubes and extracts into the next tier of tools to speed up those dashboards for those tools. And Dremio has totally flipped this on its head and said, no, let's not do all those things. That's time consuming. It's brittle, it breaks. And actually your agility and the scope of what you can do with your data decreases. You go from all your data and all your data sources to smaller and smaller. We actually call it the pyramid of doom, and a lot of people look at this and say, yeah, that kind of looks like how we're doing things today. So from a Dremio perspective, it's really about no copy, try to keep as much data in one place, keep it in one open format and less data movement. And that's a very different approach for people. I think they don't realize how much you can accomplish that way. And your latency shrinks down too. Your actual latency from data created to insight is much shorter. And it's not because of the query response time. That latency is mostly because of data movement and copy and all these things. So if you really want to shrink your time to insight, it's not about getting a faster query from a few seconds down, it's about changing the architecture. >> The data drift as they say, interesting there.
I've got to ask you on the personnel side, team side, you've got the technical side, you've got the non-technical consumers of the data, you've got the data science or data engineering ramping up. We mentioned earlier data engineering being Agile is a key innovation here. As you've got to blend the two personas of technical and non-technical people playing with data, coding with data, where are the bottlenecks in this process today? How can data teams overcome these bottlenecks? >> I think we see a lot of bottlenecks in the process today, a lot of data movement, a lot of change requests, update this dashboard. Oh, well, that dashboard update requires an ETL pipeline update, requires a column to be added to this warehouse. So then you've got these personas, like you said, some more technical, less technical, the data consumers, the data engineers. Well, the data engineers are getting totally overloaded with requests and work. And it's not even super value-add work to the business. It's not really driving big changes in their culture and insights and new use cases for data. It's churning through kind of small changes, but it's taking too much time. It's taking days, if not weeks, for these organizations to manage small changes. And then the data consumers, the less technical folks, they can't get the answers that they want. They're waiting and waiting and waiting and they don't understand why things are so challenging, how things could take so much time. So from a Dremio perspective, it's amazing to watch these organizations unleash their data. Get the data engineers, their productivity up. Stop dealing with some of the last mile ETL and small changes to the data. And Dremio actually says, hey, data consumers, here's a really nice GUI. You don't need to be a SQL expert, well, the tool will write the joins for you. You can click on a column and say, hey, I want to calculate a new field and calculate that field.
And it's all done virtually so it's not changing the physical data sets. The actual data engineering team doesn't even really need to care at that point. So you get happier data consumers at the end of the day. They're doing things more self-service. They're learning about the data, and the data engineering teams can go do value-add things. They can re-architect the platform for the future. They can do POCs to test out new technologies that could support new use cases and bring those into the organization. Things that really add value, instead of just churning through backlogs of, hey, can we get a column added or we change... Everyone's doing app development, A/B testing, and those developers are king. Those pipelines stream all this data down when the JSON files change. You need agility. And if you don't have that agility, you just get this endless backlog that you never... >> This is data as code in action. You're committing data back into the main branch that's been tested. That's what developers do. So this is really kind of the next step function. I've got to put the customer hat on for a second and ask you kind of the pessimist question. Okay, we've had data lakes, I've got data lakes, there have been data lakes around, I've got query engines here and there, they're all over the place, what's missing? What's been missing from the architecture to fully realize the potential of a data lakehouse? >> Yeah, I think that's a great question. The customers say exactly that, John. They say, "I've got 22 databases, you've got to be kidding me. You showed up with another database." Or, hey, let's talk about a Cloud data lake or a data lake. Again, I did the data lake thing. I had a data lake and it wasn't everything I thought it was going to be. >> It was bad. It was a data swamp. >> Yeah, so customers really think this way, and you say, well, what's different this time around?
Well, the Cloud. In the original data lake world, and I'm just going to focus on data lakes, so in the original data lake world, everything was still direct-attached storage, so you had to scale your storage and compute out together. And we built these huge systems. Thousands and thousands of HDFS nodes and stuff. Well, the Cloud brought the separated compute and storage, but data lakes have never seen separated compute and storage until now. We went from the data lake with direct-attached storage to the Cloud data warehouse with separated compute and storage. So the Cloud architecture and getting compute and storage separated is a huge shift in the data lake world. And that agility of like, well, I'm only going to apply the compute that I need for this question, for this answer right now, and not have 5,000 servers of compute sitting around for some peak moment. Or just 5,000 compute servers because I have five petabytes or 50 petabytes of data that need to be stored in the disks that are attached to them. So I think the Cloud architecture and separating compute and storage is the first thing that's different this time around about data lakes. But then more importantly than that is the metadata tier. It's the data tier and having sufficient metadata to have the functionality that people need on the data lake. Whether that's for governance and compliance standpoints, to actually be able to do a delete on your data lake, or that's for productivity and treating that data as code, like we're talking about today, and being able to time travel it, version it, branch it. And now these data lakes, the data lakes back in the original days were getting to 50 petabytes. Now think about how big these Cloud data lakes could be.
Even larger, and you can't move that data around, so we have to be really intelligent and really smart about the data operations and versioning all that data, knowing which engine touched the data, which person made the last commit, and being able to track all that is ultimately what's going to make this successful. Because if you don't have the governance in place these days with data, the projects are going to fail. >> Yeah, and I think separating the query layer or SQL layer and the data tier is another innovation that you guys have. Also it's a managed Cloud service, Dremio Cloud now. And you've got the open source angle too, which is also going to open up more standardization around some of these awesome features, like you mentioned the joins, and I think you guys built on top of Parquet and some other cool things. And you've got a community developing, so you get the Cloud and community kind of coming together. So it's the real world that is coming to light saying, hey, I need real world applications, not the theory of old school. So what use cases do you see suited for this kind of new way, new architecture, new community, new programmability? >> Yeah, I see people doing all sorts of interesting things, and I'm sure what we've introduced with Dremio Arctic and data as code is going to open up a whole new world of things that we don't even know about today. But generally speaking, we have customers doing very interesting things, very data application things. Like building really high performance data into use cases, whether that's a supply chain and manufacturing use case, whether that's a pharma or biotech use case, a banking use case, and really unleashing that data right into an application. We also see a lot of traditional data analytics use cases, more in the traditional business intelligence or dashboarding use cases. That stuff is totally achievable, no problems there.
But I think the most interesting stuff is companies are really figuring out how to bring that data in. When we offer the flexibility that we're talking about, and the agility that we're talking about, you can really start to bring that data back into the apps, into the work streams, into the places where the business gets more value out of it. Not in a dashboard that some person might have access to, or a set of people have access to. So even in the Dremio Cloud announcement, the press release, there was a customer, they're in Europe, it's called Garvis AI and they do AI for supply chains. It's an intelligent application and it's showing customers transparently how they're getting to these predictions. And they stood this all up in a very short period of time, because it's a Cloud product. They don't have to deal with provisioning, management, upgrades. I think they had their stuff going in like 30 minutes or something, like super quick, which is amazing. The data was already there, and for a lot of organizations, their data's already in these Cloud storages. And if that's the case... >> If they have data, they're a use case. This is agility. This is agility coming to the data engineering field, making data programmable, enabling the data applications, the data ops for everybody, for coding... >> For everybody. And for so many more use cases at these companies. These data engineering teams, these data platform teams, whether they're in marketing or ad tech or FinServ or Telco, they have a list. There's a list, a roadmap of use cases that they're waiting to get to. And if they're drowning underwater in the current tooling and barely keeping that alive, and oh, by the way, John, you can't go hire 30 new data engineers tomorrow and bring them on the team to get capacity. You have to innovate at the architecture level to unlock more data use cases, because you're not going to go triple your team. That's not possible. >> It's going to unlock a tsunami of value.
Because everyone's clogged in the system and it's painful. Right? >> Yeah. >> They've got delays, you've got bottlenecks. You've got people complaining it's hard, scar tissue. So now I think this brings ease of use and speed to the table. >> Yeah. >> I think that's what we're all about, is making the data super easy for everyone. This should be fun and easy, not really painful and really hard and risky. In a lot of these old ways of doing things, there's a lot of risk. You start changing your ETL pipeline. You add a column to the table. All of a sudden, you've got potential risk that things are going to break and you don't even know what's going to break. >> Proprietary, not a lot of volume and usage, and on-premises, or open, Cloud, Agile. (John chuckles) Come on, which path? The curtain or the box, what are you going to take? It's a no brainer. >> Which way do you want to go? >> Mark, thanks for coming on theCUBE. Really appreciate you being part of the AWS Startup Showcase, Data as Code, great conversation. Data as code is going to enable a next wave of innovation and impact the future of data analytics. Thanks for coming on theCUBE. >> Yeah, thanks John, and thanks to the AWS team. A great partnership between AWS and Dremio too. Talk to you soon. >> Keep it right there, more action here on theCUBE. As part of the showcase, stay with us. This is theCUBE, your leader in tech coverage. I'm John Furrier, your host, thanks for watching. (downbeat music)