Tomer Shiran, Dremio | AWS re:Invent 2022
>>Hey everyone, welcome back to Las Vegas. It's theCUBE, live at AWS re:Invent 2022. This is our fourth day of coverage. Lisa Martin here with Paul Gillin. Paul, we started Monday night — we filmed and streamed for about three hours — and we've had jam-packed days Tuesday, Wednesday, and Thursday. What's your takeaway? >>We're rounding the final turn as we head into the home stretch. This show has had a lot of energy since the beginning. I'm amazed, for the fourth day of a conference, how many people are still here — I am too — and how active they are, how full the sessions are. Huge crowd for the keynote this morning. You don't see that at most day-four conferences, when everyone's on their way home. People come here to learn, and they're still >>Learning. They are still learning, and we're going to help continue that learning path. We have an alumni back with us: Tomer Shiran joins us, the CPO and co-founder of Dremio. Tomer, it's great to have you back on the program. >>Yeah, thanks for having me here. And thanks for keeping the best session for the fourth day. >>Yeah, you're right. I like that — that's good mojo to come into this interview with, Tomer. So the last time I saw you was a year ago, here in Vegas at re:Invent '21. We talked about the growth of data lakes and data lakehouses. We talked about the need for open data architectures as opposed to data warehouses. And the headline of SiliconANGLE's article on the interview we did with you was, "Dremio predicts 2022 will be the year open data architectures replace the data warehouse." We're almost done with 2022. Has that prediction come true? >>Yeah, I think we're seeing almost every company out there, certainly in the enterprise, adopting data lake and data lakehouse technology and embracing open source file and table formats. So I think that's definitely happening. Of course, nothing goes away. Data warehouses don't go away in a year — actually, they don't go away ever; we still have mainframes around. But certainly the trends are all pointing in that direction. >>Describe the data lakehouse for anybody who may not be really familiar with it, and what it really means for organizations. >>Yeah, I think you can think of the data lakehouse as the evolution of the data lake. For the last decade we've had these two options: data lakes and data warehouses. Warehouses have good SQL support and good performance, but you have to spend a lot of time and effort getting data into the warehouse, you get locked in, and they're very expensive. That's a big problem. Data lakes are more open and more scalable, but they had all sorts of limitations. What we've done now as an industry with the lakehouse — especially with technologies like Apache Iceberg — is unlock all the capabilities of the warehouse directly on object storage like S3. You can insert, update, and delete individual records. You can do transactions. You can do all the things you could do with a database, directly in open formats, without getting locked in, at a much lower cost. >>But you're still dealing with semi-structured data as opposed to structured data, and there's work that has to be done to get that into a usable form. That's where Dremio excels.
What has been happening in that area? I mean, is it formats like JSON that are enabling this to happen? How are we advancing the cause of making semi-structured data usable? >>Yeah, well, I think that's all changed. That was maybe true for the original data lakes, but now, with the lakehouse, our bread and butter is actually structured data. It's all tables with a schema. You can create tables and insert records — really everything you can do with a data warehouse, you can now do in the lakehouse. That's not to say there aren't very advanced capabilities when it comes to JSON and nested, sparse data — we excel at that as well — but we're really seeing the lakehouse take over the bread-and-butter data warehouse use cases. >>You mentioned open a minute ago. Talk about why open is important and the value it can deliver for customers. >>Yeah, well, if you look back at all the challenges companies have had with traditional data architectures, a lot of that comes from the problems with data warehouses. They're very expensive. You have to ingest the data into the warehouse in order to query it. And then it's almost impossible to get off these systems — it takes an enormous effort and tremendous cost. So you're locked in, and that's a big problem. You're also dependent on that one data warehouse vendor: you can only do things with that data that the warehouse vendor supports. Contrast that with the data lakehouse and open architectures, where the data is stored in entirely open formats — things like Parquet files and Apache Iceberg tables. That means you can use any engine on that data. You can use a SQL query engine, you can use Spark, you can use Flink. There's a dozen different engines you can use on it, both at the same time and in the future: if you ever want to try something new that comes out — some new open source innovation, some new startup — you just take it and point it at the same data. That data is now at the core, at the center of the architecture, as opposed to some vendor's logo. >>Amazon seems to be bought into the lakehouse concept. It had big announcements on day two about eliminating the ETL stage between RDS and Redshift. Do you see the cloud vendors as pushing this concept forward? >>Yeah, a hundred percent. Amazon's a great partner of ours. We work with probably ten different teams there — everything from the S3 team to the Glue team to the QuickSight team and everything in between. Their embrace of the lakehouse architecture, and the fact that they adopted Iceberg as their primary table format — I think that's exciting. As an industry, we're all coming together around standard ways to represent data, so that at the end of the day companies have this benefit of having their own data, in their own S3 account, in open formats, and being able to use all these different engines without losing any of the functionality they need — the ability to do all these interactions with data that, in the past, you would have had to move into a database or warehouse to do. You just don't have to do that anymore.
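To make that concrete, here is a minimal sketch of warehouse-style DML running directly against object storage with Apache Iceberg and Spark — the kind of record-level operations Tomer describes. The catalog name, bucket, and table are illustrative assumptions, not details from the interview.

```python
# A minimal sketch of warehouse-style DML directly on S3 with Apache Iceberg.
# Assumes Spark 3.x with the Iceberg runtime JAR on the classpath; the catalog
# name, bucket, and table are illustrative, not details from the interview.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders
    (id BIGINT, region STRING, amount DOUBLE, status STRING) USING iceberg
""")

# Record-level changes, transactionally, in an open format -- the operations
# that used to require ingesting the data into a proprietary warehouse.
spark.sql("INSERT INTO lake.sales.orders VALUES (1, 'us-east', 19.99, 'open'), (2, 'eu-west', 5.00, 'open')")
spark.sql("UPDATE lake.sales.orders SET status = 'shipped' WHERE id = 1")
spark.sql("DELETE FROM lake.sales.orders WHERE id = 2")
```

Because the table lives in an open format on S3, any Iceberg-aware engine — not just the one that wrote it — can run these same operations.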
>>Speaking of functionality, talk about what's new this year with Dremio since we've seen you last. >>Yeah, there's a lot that's new with Dremio. We now have full Apache Iceberg support: with DML commands you can do inserts, updates, deletes, COPY INTO — all of that is now a fully supported, native part of the platform. We now offer two flavors of Dremio. There's Dremio Cloud, our SaaS version, fully hosted — you sign up with your Google or Azure account and you're up and running in a minute. And there's Dremio Software, which you can self-host, usually in the cloud, but even outside of the cloud. And then we're also very excited about this new idea of data as code. We've introduced a new product, now in preview, called Dremio Arctic. The idea there is to bring the concepts of Git, or GitHub, to the world of data — things like being able to create a branch and work in isolation. If you're a data scientist, you can experiment on your own without impacting other people. If you're a data engineer ingesting data, you can transform it and test it before you expose it to others. You can do that in a branch. All these ideas that we take for granted now in the world of source code and software development, we're bringing to the world of data with Arctic. And when you think about data mesh — a lot of people are talking about data mesh now and wanting to take advantage of those concepts, thinking of data as a product — well, when you think about data as a product, we think you have to manage it like code. That's why we call it data as code. All the reasons we use things like Git to build products apply: if we want to think of data as a product, we need those capabilities with data too — the ability to go back in time, the ability to undo mistakes, to see who changed my data and when they changed that table. All of those are part of this new catalog that we've created.
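A rough sketch of what that branch-and-merge workflow looks like, in the style of Project Nessie (the open source catalog technology underneath Dremio Arctic). It assumes the Spark session from the previous sketch, reconfigured with a Nessie catalog; the branch, table, and staging-view names are hypothetical, and the exact SQL syntax varies by Nessie and Spark version.

```python
# A sketch of the "data as code" workflow described above, Nessie-style.
# Assumes a catalog named "nessie" has been configured on the Spark session;
# `staging_orders` is a hypothetical view of freshly ingested data.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_test IN nessie FROM main")
spark.sql("USE REFERENCE etl_test IN nessie")

# Ingest and transform in isolation: consumers reading `main` see nothing yet.
spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM staging_orders")

# After validation passes, publish atomically -- the data equivalent of a
# Git merge, with the commit history giving you "who changed what, and when."
spark.sql("MERGE BRANCH etl_test INTO main IN nessie")
```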
>>You talk about data as a product — that's intrinsic to the data mesh concept. What's your opinion of data mesh? Is the world ready for that radically different approach to data ownership? >>You know, we now have dozens of customers using Dremio to implement enterprise-wide data mesh solutions. And at the end of the day, I think it's just what most people would consider common sense. In a large organization, it's very hard for a single centralized team to understand every piece of data, to manage all the data themselves, to make sure the quality is correct, to make it accessible. So what data mesh is first and foremost about is federating — distributing — the ownership of data. The governance of the data still has to happen; that's at the heart of the data mesh. But it's about allowing different teams, different domains, to own their own data and really manage it like a product, with all the best practices that come with that. That's super important. So we're doing a lot with data mesh: the way Dremio Cloud has multiple projects, and the way Arctic lets you have multiple catalogs, so different groups can interact and share data among each other. And the fact that we can connect to all these different data sources, even outside your data lake — Redshift, Oracle, SQL Server, all the different databases out there — and join across different databases in addition to your data lake. That's all stuff companies want with their data mesh. >>What are some of your favorite customer stories where you've really helped them accelerate that data mesh and drive business value from it, so that more people in the organization have access to data and can really make the data-driven decisions everybody wants to make? >>There are so many of them. One of the largest tech companies in the world is creating a data mesh across all the different departments in the company. They were a big data warehouse user, and it hit the wall: the costs were so high, and the ability for people to use it for experimentation, to try new things, to collaborate — they couldn't do it, because it was so prohibitively expensive and difficult to use. So they said: we need a platform where different people can collaborate, experiment with the data, and share data with others. At a big organization like that, it's the ability to have a centralized platform but let different groups manage their own data. Several of the largest banks in the world are also doing data meshes with Dremio — one of them has over a dozen business units using Dremio — and that ability to have thousands of people on a platform and to collaborate and share among each other is super important to these guys. >>Can you contrast your approach to the market with Snowflake's? Because they have some of those same concepts. >>Snowflake's a very closed system at the end of the day — closed and very expensive. If I remember right, a quarter ago, in one of their earnings reports, the average customer was spending 70% more every year. That's not sustainable: over a decade, your cost increases 200x. Most companies aren't going to be able to swallow that. So first of all, companies need more cost-efficient solutions that are just more approachable. The second thing — we talked about the open data architecture — is that most companies now realize that if you want to build a platform for the future, you need to have the data in open formats and not be locked into one vendor. Beyond that, there's the ability to connect to all your data, even outside the lake — your NoSQL databases, your relational databases — and Dremio's semantic layer, where we can accelerate queries. Typically, what happens with data warehouses and other data lake query engines is that, because you can't get the performance you want, you end up creating lots and lots of copies of data: for every use case, a pre-joined copy of that data, a pre-aggregated version of that data. And then you have to redirect all your users to those copies. >>You've got a >>Governance problem. It's expensive, and it's hard to secure, because permissions don't travel with the data. So you have all sorts of problems. What we've done, with our semantic layer, is make it easy to expose data in a logical way. And our query acceleration technology, which we call reflections, transparently accelerates queries and gives you sub-second response times without data copies — and also without extracts into the BI tools. Because if you start doing BI extracts or imports, again you have lots of copies of data in the organization, all sorts of refresh problems, security problems — it's a nightmare. Collapsing all those copies and having a simple solution where data is stored in open formats, with fast access to any of it — that's very different from what you get with a Snowflake or any of these other companies.
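To make the "copies problem" concrete, here's a minimal sketch of the kind of hand-maintained, pre-aggregated copy Tomer is describing — the thing a reflection is meant to replace. The names reuse the earlier illustrative catalog and are not from the interview.

```python
# What the "copies problem" looks like in practice: a hand-maintained,
# pre-aggregated copy built for one dashboard. Every copy like this needs its
# own refresh job and its own permissions. A reflection inverts the model:
# the engine maintains the aggregate itself and transparently rewrites
# queries against the base table to use it, so no consumer-visible copy
# ever exists.
spark.sql("""
    CREATE TABLE lake.sales.orders_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM lake.sales.orders
    GROUP BY region
""")
```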
>>Right, that's a great explanation. I want to ask you: early this year you announced that your basic Dremio Cloud service would be free forever. How has that offer gone over? What's been the uptake? >>Thousands of people have signed up, and I think it's a great service. It's very simple: people can go on the website and try it out. We now have a test drive as well, if you want to get started with some public sample data sets and a tutorial — we've made that increasingly easy too. But yeah, we continue to take that approach of making it easy, democratizing these cloud data platforms, and lowering the barriers to adoption. >>How effective has it been in driving sales of the enterprise version? >>A lot of the business we do, when it comes to selling, is with folks who have educated themselves. They've started off, they've followed some tutorials. I think developers generally prefer the first interaction to be with a product, not a salesperson. That's basically the reason we did it. >>Before we ask you the last question, can you give us a sneak peek into the product roadmap as we enter 2023? What can you share that we should be paying attention to where Dremio is concerned? >>Yeah. Actually, a couple of days ago, here at the conference, we had a press release with all sorts of new capabilities we just released, and there's a lot more for the coming year. We'll shortly be releasing a variety of performance enhancements — in the next quarter or two we'll probably be twice as fast just in terms of raw query speed, and that's in addition to our reflections and our query acceleration. Support for all the major clouds is coming, and just a lot of capabilities that make it easier and easier to use the platform. >>Awesome. Tomer, thank you so much for joining us. My last question to you is: if you had a billboard in your desired location, and it was going to be a mic drop about why customers should be looking at Dremio, what would that billboard say? >>Well, Dremio is the easy and open data lakehouse.
Open architectures are just a lot better — a lot more future-proof, a lot easier, and just a much safer choice for the future for companies. So it's hard to argue with that, and I'd encourage people to take a look. >>Exactly. >>That wasn't the best billboard. >>I think it's a great billboard. Awesome. And thank you so much for joining Paul and me on the program, sharing what's new and some of the exciting things coming down the pipe. We're going to be keeping our eye on Dremio. >>Awesome. Always happy to be here. >>Thank you. For our guest and for Paul Gillin, I'm Lisa Martin. You're watching theCUBE, the leader in live and emerging tech coverage.
Breaking Analysis: Data Mesh...A New Paradigm for Data Management
>>From theCUBE studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. >>Data mesh is a new way of thinking about how to use data to create organizational value. Leading-edge practitioners are beginning to implement data mesh in earnest, and importantly, data mesh is not a single tool or a rigid reference architecture. Rather, it's an architectural and organizational model that's designed to address the shortcomings of decades of data challenges and failures, many of which we've talked about on theCUBE. As important, by the way, it's a new way to think about how to leverage data at scale across an organization and across ecosystems. Data mesh, in our view, will become the defining paradigm for the next generation of data excellence. Hello, and welcome to this week's Wikibon CUBE Insights, powered by ETR. In this Breaking Analysis, we welcome the founder and creator of data mesh, author, thought leader, and technologist Zhamak Dehghani. Zhamak, thank you for joining us today. Good to see you. >>Hi Dave, it's great to be here. >>All right, real quick, let's talk about what we're going to cover. I'll introduce, or reintroduce, you to Zhamak. She joined us earlier this year in our CUBE on Cloud program. She's the director of emerging tech at Thoughtworks North America, and a thought leader, practitioner, software engineer, architect, and a passionate advocate for decentralized technology solutions and data architectures. And Zhamak, since we last had you on as a guest — which was less than a year ago, I think — you've written two books in your spare time: one on data mesh, and another called Software Architecture: The Hard Parts, both published by O'Reilly. So how are you? You've been busy. >>I've been busy, yes. It's been a great year, a busy year. I'm looking forward to the end of the year and the end of these two books, but it's great to be back and speaking with you. >>Well, you've got to be pleased with the momentum that data mesh has. Let's jump to the agenda for a bit and get that out of the way. We're going to set the stage by sharing some data from ETR, our data partner, on the spending profile in some of the key data sectors. Then we're going to review the four key principles of data mesh — it's always worthwhile to set that framework — and talk a little bit about some of the dependencies and the data flows. We're really going to dig today into principle number three, around self-serve data platforms. To that end, we'll talk about some of the learnings Zhamak has captured since she embarked on the data mesh journey with her colleagues and her clients, and we specifically want to talk about some of the successful models for building the data mesh experience. Then we'll hit on some practical advice, and we'll wrap with some thought exercises — maybe a little tongue-in-cheek — from the community questions we get. So the first thing I want to do, just to get it out of the way, is introduce the spending climate. We use this XY chart to do this all the time. It shows the spending profiles in the ETR data set for some of the more data-related sectors of the ETR taxonomy. ETR dropped their October data last Friday, so I'm using the July survey here — about 1,500 respondents; we'll get into the October survey in future weeks, but I don't see a dramatic change coming. The y-axis is Net Score, or spending momentum, and the horizontal axis is market share, or presence in the data set. That red line at 40 percent: anything over that, we consider elevated. For the past eight quarters or so, we've seen machine learning/AI, RPA, containers, and cloud as the four areas where CIOs and technology buyers have shown the highest Net Scores. And as we've said, what's so impressive about cloud is that it's both pervasive and shows high velocity from a spending standpoint. We've plotted the three other data-related areas — database/EDW, analytics/BI/big data, and storage. The first two, well under the red line, are still elevated; the storage market continues to plod along. And we've plotted outsourced IT just to balance it out for context — that's an area that's not so hot right now. I just want to point out that these areas — AI, automation, containers, and cloud — are all relevant to data, and they're fundamental building blocks of data architectures, as are the two directly related to data, database and analytics, and of course storage. So that just gives you a picture of the spending sectors.
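Since Net Score drives that whole chart, here's a rough sketch of how the metric works, as I understand ETR's methodology — the real survey has more response categories, and the numbers below are made up for illustration.

```python
# A rough sketch of ETR's Net Score, to make "spending momentum" concrete.
# The actual methodology has more response categories (adoption, flat spend,
# replacement, etc.); this simplification and the survey data are made up.
def net_score(responses: list[str]) -> float:
    """Percent of respondents raising spend minus percent cutting it."""
    up = sum(r in ("adopting", "increasing") for r in responses)
    down = sum(r in ("decreasing", "replacing") for r in responses)
    return 100.0 * (up - down) / len(responses)

survey = ["adopting"] * 5 + ["increasing"] * 55 + ["flat"] * 30 + ["decreasing"] * 10
print(net_score(survey))  # 50.0 -- comfortably above the elevated 40% line
```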
>>So I wanted to share this slide, Zhamak, that you presented in your webinar. I love this. It's a taxonomy put together by Matt Turck, who's a VC, and he calls it the MAD landscape: machine learning, AI, and data. And Zhamak, the key point here is that there's no lack of tooling. You've made the data mesh concept tools-agnostic — it's not like we need more tools to succeed in data mesh, right? >>Absolutely. I think we have plenty of tools. What's missing is a meta-architecture: something that defines the landscape in a way that's in step with organizational growth, and that defines it in a way that lets these tools actually interoperate and integrate really well. Clients right now have a lot of challenges picking the right tool. Regardless of the technology, they go down one of two paths: either they bite into one big data solution and then try to fit the other integrated solutions around it, or they go to that menu of a large list of applications and spend a lot of time trying to integrate and stitch the tooling together. So I'm hoping data mesh creates that kind of meta-architecture for tools to interoperate and plug in, and I think our conversation today around the self-serve platform hopefully eliminates some of that. >>Yeah, we'll definitely circle back, because that's one of the questions we get all the time from the community. Okay, let's review the four main principles of data mesh, for those who might not be familiar with it — and for those who are, it's worth reviewing. Zhamak, allow me to introduce them, and then we can discuss a bit. A big frustration I hear constantly from practitioners is that the data teams don't have domain context. The data team is separated from the lines of business, and as a result they have to constantly context-switch, and as such there's a lack of alignment. So principle number one is focused on putting end-to-end data ownership in the hands of the domain — what I would call the business lines. The second principle is data as a product, which does cause people's brains to hurt sometimes, but it's a key component; if you start thinking about it, and talking to people who have done it, it actually makes a lot of sense. That leads to principle number three, a self-serve data infrastructure, which we're going to drill into quite a bit today. And then the question we always get is: when we introduce data mesh, how do we enforce governance in a federated model? So let me bring up a more detailed slide, Zhamak, with the dependencies, and ask you to comment, please.
>>Sure. As you said, the root cause we're trying to address is the siloing of the data, external to where the action happens — where the data gets produced, where the data needs to be shared, where the data gets used in the context of the business. That root cause of centralization gets addressed by distributing the accountability, end to end, back to the domains. This distribution of technical accountability to the domains has already happened over the last decade or so: we saw the transition from one general IT organization addressing all the needs of the company, to technology groups within IT — or even outside of it — aligning themselves to build the applications and services the different business units need. What data mesh does is extend that model and say: okay, we're aligning business with the tech and the data now — both the application of the data, for ML or insight generation inside the domains, related to the domain's needs, and the sharing of the data the domains generate with the rest of the organization. But the moment you do that, you have to solve the other problems that arise, and that gives birth to the second principle, data as a product, as a way of preventing data siloing within the domain. It changes the focus of the domains that are now producing data from "I just create the data I collect for myself, and it satisfies my needs," to the domain's responsibility being to share the data as a product, with all the wonderful characteristics a product has. I think that leads to really interesting architectural and technical implications of what actually constitutes data as a product, and we can have a separate conversation about that. But once you do that, that's the point in the conversation where the CIO says: well, how do I even manage the cost of operation if I decentralize building and sharing data to my technical teams, my application teams? Do I need to go hire another hundred data engineers? And I think that's the role of a self-serve data platform: it enables and empowers the generalist technologists we already have in the technical domains — the majority population of our developers these days. The data platform attempts to mobilize those generalist technologists to become data producers and data consumers, and to really rethink what tools these people need. So the data platform is about giving autonomy to domain teams, empowering them, and reducing the cost of ownership of the data products. And finally, as you mentioned, there's the question of how I still assure that these different data products are interoperable, secure, and respecting privacy — now in a decentralized fashion, while we're respecting the sovereignty, the domain ownership, of each domain. That leads to the idea of federation, both in the operational model — the domain owners are accountable for the interoperability of their data products, with incentives aligned to the global harmony of the data mesh — and from the technology perspective: thinking about the data product through a new lens, one where all the policies that need to be respected by these data products, such as privacy and confidentiality, can be encoded as computational, executable units embedded in every data product, so that we get governance through automation. So that's the relationship — the complex relationship — between the four principles.
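To give one concrete reading of "policies as computational, executable units," here is a minimal sketch of a data product as a self-contained unit that carries its own SLOs and executable policies. Every name, field, and policy here is an illustrative assumption, not a real framework from the episode.

```python
# A sketch of a data product as a self-contained unit: the data's contract
# (SLOs) and its governance (policies) travel with the product itself.
# All names and fields are invented for illustration.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataProduct:
    name: str
    owner_domain: str                        # the accountable domain team
    refresh_cadence: str                     # SLOs consumers can rely on
    retention: str
    policies: list[Callable[[dict], bool]] = field(default_factory=list)

    def grant_access(self, request: dict) -> bool:
        # Policies execute wherever the product is served -- governance
        # through automation rather than through a central review board.
        return all(policy(request) for policy in self.policies)

def no_pii_outside_domain(request: dict) -> bool:
    return request["domain"] == "customer" or not request["includes_pii"]

profile = DataProduct(
    name="customer-profile",
    owner_domain="customer",
    refresh_cadence="hourly",
    retention="2 years",
    policies=[no_pii_outside_domain],
)
print(profile.grant_access({"domain": "marketing", "includes_pii": False}))  # True
print(profile.grant_access({"domain": "marketing", "includes_pii": True}))   # False
```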
>>Yeah, thank you for that. There are so many important points in there. The idea of the silos, and data as a product breaking down those silos — because if you have a product and you want to sell more of it, you make it discoverable; as a P&L manager, you put it out there, you want to share it rather than hide it. And this idea of managing the cost — number three — where people say, well, centralize and you can be more efficient; but that essentially was the failure. Your related point is generalist versus specialist: that's one of the failures of Hadoop — these hyper-specialized roles emerged, and so you couldn't scale. So let's talk about the goals of data mesh for a moment. You've said the objective is to exchange a new unit of value between data producers and data consumers, and that unit of value is a data product. You've also stated that a goal is to lower the cognitive load on our brains — I love this — and to simplify the way data is presented to both producers and consumers, doing so in a self-serve manner that eliminates the tapping on shoulders, the emails, the raising of tickets to understand how data should be used, and so on. So please explain why this is so important, and how you've seen organizations reduce the friction across the data flows and the interconnectedness of things like data products across the company. >>Yeah, this is important. As you mentioned, when this whole idea of data-driven innovation first came to exist and we needed all sorts of technology stacks, we centralized the creation and usage of the data. That's okay when you first get started, when the expertise and knowledge is not yet diffused and is the privilege of very few people in the organization. But as we move to a data-driven innovation cycle, as we learn how data can unlock new programs, new models of experience, new products, it becomes really, really important to let consumers and producers talk to each other directly, without a broker in the middle. Even though that centralized broker can be a cost-effective model, if we include the cost of missed opportunity — something we could have innovated, but missed because of months spent looking for the right data — then the cost-benefit formula changes. So to have that data-driven innovation truly embedded into every domain, every team, we need a model where the consumer can directly, peer to peer, discover the data, understand it, and use it. The litmus test would be going from a hypothesis — as a data scientist, I think there's a pattern, an insight, in the customer behavior, and if I had access to all the different information about the customer, all the different touchpoints, I might be able to discover that pattern and personalize the customer's experience — to finding all the different sources, being able to understand and connect them, and then turning them into the training of my machine learning model. And the rest is, I guess, building an intelligent product.
>>Got it, thank you. A lot of what we do here in Breaking Analysis is try to curate and point people to new resources, and we'll have some additional links, because what you and your colleagues in the community are creating is not superficial. So I do want to curate some of the other material you had. If I bring up this next chart: the left-hand side is a curated description — both sides are your observations — of most of the monolithic data platforms. They're optimized for control; they serve a centralized team with hyper-specialized roles, as we talked about. The operational stacks are running enterprise software on Kubernetes, and the microservices are isolated from, say, the Spark clusters managing the analytical data. Whereas the data mesh proposes much greater autonomy, with code, data pipelines, and policy managed as independent entities rather than a single unit. And you've made the point that we have to enable generalists, borrowing from so many other examples in the industry. So it's an architecture based on decentralized thinking that can really be applied to any domain — domain-agnostic, in a way. >>Yes. If I pick one key point from that comparison, it's that the platform capabilities need to present a continuous experience: from an application developer building an application that generates some data — say, an e-commerce application — to the data product that presents and shares that data as temporal, immutable facts for analytics, to the data scientist who uses that data to personalize the experience, to the deployment of that ML model back into the e-commerce application. If we really look at this continuous journey, the walls between the separate platforms we've built need to come down. The platforms underneath — the ones supporting the operational systems, the data platforms, the ones supporting the ML models — need to play really nicely together, because as a user, I'll probably fall off a cliff every time I move between stages of this value stream. So the interoperability of our data solutions and operational solutions needs to increase drastically. So far, we've gotten away with running operational systems and applications on one end of the organization, running data analytics on the other, and building a spaghetti pipeline to connect them. Neither end is happy: I hear data scientists and data analysts pointing a finger at the application developer, saying "you're not developing your database the right way," and the application developer saying "my database is for running my application; it wasn't designed for sharing analytical data." What data mesh, as a mesh, tries to do is bring these two worlds closer together — and then the platform itself has to come closer and turn into a continuous set of services and capabilities, as opposed to these disjointed, big, isolated stacks.
>>Very powerful observations there. So we want to dig a little bit deeper into the platform, Zhamak, and have you explain your thinking here, because everybody always goes straight to the platform: what do I do with the infrastructure? You've stressed the importance of interfaces — the entries to and exits from the platform — and you use a particular parlance to describe it. This chart shows what you call the planes, not layers, of the platform. It's complicated, with a lot of connection points, so please explain these planes and how they fit together. >>Sure. There was a really good point in how you started: when we think about the capabilities that enable us to build applications, data products, and analytical solutions, we usually jump too quickly to the deep end of the actual implementation of the technologies. Do I need to go buy a data catalog? Do I need some sort of warehouse storage? What I'm trying to do is elevate us up and out, and force us to think about the interfaces and APIs — the experiences — that the platform needs to provide to run this secure, safe, trustworthy, performant mesh of data products. If you focus on the interfaces, the implementation underneath can swap out; you can replace one with another over time. That's the purpose of those lollipops: emphasizing, okay, what is the interface that provides a certain capability — the storage, the data product lifecycle management, and so on? The purpose of the planes — the mesh experience plane, the data product experience plane, and the utility plane — is to give us a language to classify different sets of interfaces and capabilities that play nicely together to provide the cohesive journey of a data product developer and a data consumer. So the three planes: at the bottom, we have a lot of utilities — that MAD, Matt Turck-style data tooling chart. We have plenty of utilities right now: workflow management, data processing, your Spark, your lake storage, your time-series storage — a lot of tooling at that level. But the layer we need to imagine and build today — and can't yet buy, as far as I know — is the layer that allows us to exchange that unit of value: to build and manage these data products. The language and APIs of this data product experience plane are not "I need this storage" or "I need that workflow processing." They're: I have a data product, and it needs to deliver certain types of data, so I need to be able to model my data. As part of this data product, I need to write processing code that keeps the data constantly alive — receiving, say, upstream user interactions with a website and generating my user's profile. I need to serve the data, keep the data alive, and provide a set of SLOs and guarantees for my data, with good documentation, so that someone who comes to the data product knows the cadence of refresh, the retention of the data, and a lot of other SLOs. And finally, I need to be able to enforce and guarantee certain policies around access control, privacy, encryption, and so on. As a data product developer, I just work with this unit — a complete, autonomous, self-contained unit — and the platform should give me ways of provisioning this unit, testing this unit, and so on. That's why I emphasize the experiences.
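A sketch of that "interfaces before implementations" point: pin down the data product experience plane as an API, so the tooling underneath can be swapped out over time. The method set here is invented for illustration — these are the kinds of capabilities Zhamak lists, not a real platform's API.

```python
# A hypothetical interface for the "data product experience plane." The
# point is the shape of the contract, not any particular implementation;
# it reuses the illustrative DataProduct type from the earlier sketch.
from typing import Protocol

class DataProductExperiencePlane(Protocol):
    def provision(self, product: "DataProduct") -> None:
        """Stand up storage, compute, and serving for one product."""
    def deploy_transformation(self, product_name: str, code: str) -> None:
        """Register the processing code that keeps the product's data alive."""
    def publish_slos(self, product_name: str, slos: dict) -> None:
        """Expose refresh cadence, retention, and quality guarantees."""
    def attach_policies(self, product_name: str, policies: list) -> None:
        """Enforce access control, privacy, and encryption on the product."""
```

Any concrete stack — a warehouse here, a lake engine there — can sit behind this contract, which is what lets the implementation evolve without breaking the developer's experience.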
And of course, we're not dealing with one or two data products; we're dealing with a mesh of data products. So at the mesh-level experience, we need a set of capabilities and interfaces to search the mesh for the right data, to explore the knowledge graph that emerges from the interconnection of data products, and to observe the mesh for anomalies. Did we create one of those giant master data products that all the data goes into and all the data comes out of? Have we found bottlenecks? To do those mesh-level things, we need a certain set of APIs and interfaces. And once we decide what constitutes that mesh experience, then we can step back and say: okay, now what sort of tool do I need to build or buy to satisfy it? That is not what the data community, or the data part of our organizations, is used to. Traditionally, we're very comfortable buying a tool and then changing the way we work to serve the tool; this is slightly inverse to the model we might be comfortable with. >>Right. And pragmatists — people who've implemented data mesh — will tell you they spent a lot of time figuring out data as a product and the definitions there, organizationally: getting domain experts to actually own the data. And they'll tell you, look, the technology will come and go. So to your point, if you have those lollipops and those interfaces, you'll be able to evolve, because we know one thing for sure in this business: technology is going to change. So you had some practical advice, and I wanted to discuss that for those who are thinking about data mesh. I scraped this slide from a presentation you made — and by the way, we'll put links in; your colleague Emily, who I believe is a data scientist, had some really great points as well that practitioners should dig into. You made a couple of points I'd like you to summarize, and to me the big takeaway was that it's not a one-and-done. This is not a 60-day project; it's a journey. I know that's cliché, but it's so very true here. >>Yes. These were a few starting points for people who are embarking on building or buying the platform that enables the mesh creation, so there's a bit of a focus on the platform angle. The first one is what we just discussed: instead of thinking about the mechanisms you're building, think about the experiences you're enabling. Identify the people: what is the persona of the data scientist — and data scientists have a wide range of personas — or of the data product developer? What persona do I need to enable and empower today? What skill sets do they have? So, think experiences, not mechanisms. I think we're at a really magical point — how many times in our lifetime do we come across a relatively blank white space in which to innovate? So let's take that opportunity and use a bit of creativity, while being pragmatic. Of course we need solutions today, or yesterday — but still, think about the experiences, not the mechanisms you need to buy.
That was the first step. And the nice thing is that there's an evolutionary, iterative path to maturity of your data mesh. If you start by thinking about which initial use cases you need to enable, which data products those use cases depend on, what the general skill set of your data product developers is, and which interfaces you need, you can start with the simplest possible platform for your first two use cases. Then think: okay, the next set of data product developers have a different set of needs. Maybe today I just enable SQL-like querying of the data; tomorrow I enable the data scientists' file-based access; the day after, I enable the streaming aspect. So have that evolutionary path ahead of you, and don't think you have to start by building out everything. One of the things we've done is take a harvesting approach: we work collaboratively with the technical, cross-functional domains that are building the data products, see how they're using the utilities, and harvest what they build back into the platform as solutions for everyone. But at the end of the day, we have to think about mobilizing the largest population of technologists we have — diffusing the technology, making it available and accessible to generalist technologists. And we've come a long way: we've gone through paradigm shifts in mobile development, in functional programming, in cloud operation. It's not that we struggle to learn something new, but we have to learn something that works nicely with the rest of the tooling in our toolbox right now. So again, put that generalist in as one of your central personas — not the only persona, of course; we will always have specialists, we will always have data scientists. But any problem that can be solved as a general engineering problem — and I think a lot of aspects of data mesh can be simple engineering problems — let's approach it that way, and then create the tooling to empower those generalists. >>Great, thank you. So listen, I've been around a long time, and as an analyst I've seen many waves, and we often say language matters. I've seen it: the mainframe language was different from the PC language, which was different from internet, from cloud, from big data, et cetera. So we have to evolve our language, and I wanted to throw a couple of things out here. I often say data is not the new oil, because data doesn't live by the laws of scarcity — we're not running out of data. But I get the analogy: it's powerful; it powered the industrial economy. Still, it's bigger than that. What do you think when you hear "data is the new oil"? >>Yeah, I don't respond well to "data as gold," or oil, or any scarce resource, because, as you said, it evokes a very different emotion. It doesn't evoke the emotion of "I want to use this, I want to utilize it." It feels like I need to hide it, collect it, keep it to myself, and not share it with anyone. It doesn't evoke the emotion of sharing.
I really do think that data — with a little asterisk, because I think the definition of data changes, and that's why I keep using the language of "data product" or "data quantum" — becomes the most important, essential element of the existence of computation. What do I mean by that? A lot of the applications we've written so far are based on imperative logic: if this happens, do that; else, do the other. We're moving to a world where those applications generate data, and the data that's generated becomes the source of the patterns we can exploit to build our applications — as in: curate the weekly playlist for Dave every Monday based on what he has listened to, and what other people have listened to, based on his profile. So we're moving to a world where it's not so much about applications using data to run their businesses; the data truly is the foundational building block for the applications of the future. And with that, I think we need to rethink the definition of data — maybe that's for a different conversation — but I really think we have to converge the data and the processing together, the substance and the processing, into a unit that is composable, reusable, trustworthy. That's the idea behind the data product as an atomic unit of what we build future solutions from. >>Got it. Now, something else I heard you say, or read, that really struck me — because it's another often-stated phrase — is that data is our most valuable asset, and you push back a little on that. People have often said data should be, or eventually will be, listed as an asset on the balance sheet. Hearing what you said, I thought: well, maybe data as a product is an income-statement thing — it's generating revenue or cutting costs — and not necessarily an asset, because I don't share my assets with people; I don't make them discoverable. Add some color to this discussion. >>I think it's actually interesting you mention that, because I read about the new policy in China where CFOs actually have a line item around the data they capture. We don't have to get into the political conversation around the authoritarian aspects of collecting data, the power it creates, and the society it leads to — but that conversation aside, I think you're right. Data as an asset generates a different behavior. It creates different performance metrics that we would measure. Before the conversation around data mesh existed, we measured the success of our data teams by the terabytes of data they were collecting, by the thousands of tables they had stamped as golden data. None of that necessarily leads to value — there's no direct line I can see between those numbers and the value the data actually generated. It's rather harmful, because it drives the wrong metrics for success. But if you invert that, toward product thinking — something you share to delight the experience of users — your measures are very different: the happiness of the user, the decreased lead time for them to actually use and get value out of the data, the growth of the population of users. It evokes a very different kind of behavior and success metrics.
I do say, if I may, that I'll probably come back and regret the choice of the word "product" one day, because of the monetization aspect of it. Maybe there's a better word to use, but it's the best I think we can use at this point in time. >>Why do you say that, Zhamak? Because it's too directly related to monetization, and that has a negative connotation, or might not apply in things like healthcare? >>Because people take shortcuts — and I remember this conversation from years back — and think that the reason to collect or have data is so that we can sell it: simply the monetization of the data. We have this idea of data marketplaces and so on, and I think that's actually the least valuable outcome we can get from thinking about data as a product — the direct sale and exchange of data for money. Product thinking should redirect our attention to what really matters: enabling the use of data to ultimately generate value for people, for customers, for organizations, for partners, as opposed to thinking of it as a unit of exchange for money. >>I love data as a product — I think your instinct was right on, and I'm glad you brought that up, because I think in the last decade people misunderstood it as selling data directly. What you're really talking about is using data as an ingredient to build a product that has value, where the value either generates revenue, cuts costs, or helps with a mission — it could be saving lives. But in some way, for a commercial company, it's about the bottom line, and that's just the way it is. So I love data as a product; I think it's going to stick. Now, one of the other things that struck me in one of your webinars was a Q&A question: can I finally get rid of my data warehouse? So I want to talk about the data warehouse, the data lake — JPMC used that term, the data lake, which some people don't like; I know John Furrier, my business partner, doesn't like it — and the data hub. One of the things I've learned from observing your work is that whether it's a data lake, a data warehouse, or a data hub, it should be a discoverable node on the mesh. The technology really doesn't matter. What are your thoughts on that? >>Yeah, the real shift is from *the* centralized data warehouse to a data warehouse where it fits. If you just cross out the "centralized" piece, we're all in agreement that data warehousing provides interesting, capable features that are still required — perhaps as an edge node of the mesh, optimized for certain queries, say financial reporting, where we still want to direct a fair bit of data into a node built just for that, because it needs the precision and speed of operation the warehouse technology provides. So that technology definitely has a place. Where it falls apart is when you want one warehouse to rule all of your data and to model your data canonically — because you have to put so much energy into harnessing that model, creating these very complex and fragile snowflake schemas and so on. That's all you do: you spend energy against the entropy of your organization trying to get your arms around the model, and the model is constantly out of step with what's happening in reality, because the reality of the business moves faster than our ability to model everything into one canonical representation. I think that's the piece we need to challenge — not necessarily the application of data warehousing on a node.
>>I want to close by coming back to the issue of standards. You've specifically envisioned data mesh to be technology agnostic, as I said before, and of course everyone, myself included, is going to run a vendor's technology platform through a data mesh filter. The reality, per the Matt Turck chart we showed earlier, is that there are lots of technologies that can be nodes within the data mesh, or facilitate data sharing, or governance, et cetera, but there's clearly a lack of standardization. I'm sometimes skeptical that the vendor community will drive this, but maybe, as with Kubernetes, Google or some other internet giant will contribute something to open source that addresses this problem. Talk a little bit more about your thoughts on standardization: what kinds of standards are needed, and where do you think they'll come from?

>>Sure. You're right that the vendors are not incentivized today to create those open standards, because the operational model of many vendors (not all of them, but some) is "bring your data to my platform, and then bring your computation to me, and all will be great." And that will be great for a portion of the clients, and a portion of the environments, where the complexity we're talking about doesn't exist. So yes, we need other players, perhaps some of the cloud providers, or people who are more incentivized to open their platforms for data sharing. As a starting point, I think standardization around data sharing. If you look at the spectrum right now, we have a de facto standard (it's not even a real standard) for something like SQL; everybody has bastardized and extended it with so many things that I don't even know what standard SQL is anymore. But we have that for some form of querying. Beyond that, I know, for example, that folks at Databricks have started to create some standards around Delta Sharing, sharing data in different models. So data sharing as a concept, the same way that APIs were about capability sharing: we need data APIs, or analytical data APIs, and data sharing extended to go beyond simply SQL or languages like that. I think we also need standards around computational policies. This is again something that is formulating in the operational world; we have a few standards around how you articulate access control, and how you identify the agents who are trying to access with different authentication mechanisms. We need to bring some of those in and add our own data-specific articulation of policies. Something as simple as identity management across different technologies is non-existent: if you want to secure your data across three different technologies, there is no common way of saying who the agent is that is acting to access the data, so that I can authenticate and authorize them. Those are some of the very basic building blocks.
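[Editor's note: Delta Sharing, the Databricks-led open protocol Zhamak mentions, is one concrete example of an analytical data API. Below is a minimal sketch using the open source delta-sharing Python client; the profile file path and the share, schema, and table names are placeholders a real data provider would supply.]

```python
# pip install delta-sharing
import delta_sharing

# A profile file holds the sharing server endpoint and a bearer token;
# "config.share" is a placeholder path supplied by the data provider.
profile = "config.share"

# Discover what the provider has shared with us.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Read one shared table as a pandas DataFrame. The URL format is
# <profile>#<share>.<schema>.<table>; the names below are hypothetical.
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.finance.quarterly_revenue")
print(df.head())
```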
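[Editor's note: the missing building block Zhamak describes, a common way to identify and authorize an agent across different technologies, corresponds roughly to a shared policy decision point. The Python sketch below is purely illustrative; it is not an existing standard or library, and the Agent, AccessRequest, and POLICIES names are invented. It only shows the shape such a common interface might take: every engine delegates the "who is asking, and may they?" question to one place.]

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    """One identity for the requester, regardless of which engine they use."""
    subject: str          # e.g. "user:dave@example.com" or "service:reporting-job"
    groups: tuple         # e.g. ("finance-analysts",)

@dataclass(frozen=True)
class AccessRequest:
    agent: Agent
    resource: str         # e.g. "finance.quarterly-revenue"
    action: str           # "read" or "write"

# Policies expressed as data (a hypothetical format), so a warehouse, a lake
# query engine, and a streaming system can all evaluate the same rules.
POLICIES = [
    {"resource": "finance.quarterly-revenue", "action": "read",
     "allowed_groups": {"finance-analysts", "executives"}},
]

def authorize(request: AccessRequest) -> bool:
    """A single decision point that each underlying technology delegates to,
    instead of keeping its own notion of who the agent is."""
    for policy in POLICIES:
        if (policy["resource"] == request.resource
                and policy["action"] == request.action
                and set(request.agent.groups) & policy["allowed_groups"]):
            return True
    return False

request = AccessRequest(
    agent=Agent("user:dave@example.com", ("finance-analysts",)),
    resource="finance.quarterly-revenue",
    action="read",
)
print(authorize(request))  # True
```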
And then the gravy on top would be new standards around enriched semantic modeling of the data, so we have a common language to describe the semantics of the data in different nodes and the relationships between them. We have prior work with RDF and the folks who were focused on linking data across the web, the "data web" work that we had in the past; we need to revisit those and see their practicality in the enterprise context. So data modeling, a rich language for semantic modeling of data, and, most importantly, data connectivity: those are some of the items on my wish list.

>>That's good. Well, we'll do our part to try to push that standards movement. Zhamak, we're going to leave it there. I'm so grateful to have you come on to the Cube. I really appreciate your time; it's always a pleasure. You're such a clear thinker, so thanks again.

>>Thank you, Dave. It's wonderful to be here.

>>Now, we're going to post a number of links to some of the great work that Zhamak and her team have done, and to her books, so check that out. Remember, we publish each week on siliconangle.com and wikibon.com, and these episodes are all available as podcasts wherever you listen; just search "Breaking Analysis podcast." Don't forget to check out etr.plus for all the survey data. Do keep in touch: I'm @dvellante, follow Zhamak @zhamakd, or you can email me at david.vellante@siliconangle.com, or comment on the LinkedIn post. This is Dave Vellante for the Cube Insights, powered by ETR. Be well, and we'll see you next time.
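[Editor's note: the RDF lineage Zhamak points to can be made concrete with the open source rdflib Python library. The sketch below is illustrative; the mesh vocabulary, the namespace URL, and the product names are all invented. It shows the kind of common language she is asking for: the semantics of two data products, and a relationship between them, captured in one machine-readable graph that any node could parse.]

```python
# pip install rdflib
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# A hypothetical vocabulary for describing data products on a mesh.
MESH = Namespace("https://example.com/mesh#")

g = Graph()
g.bind("mesh", MESH)

orders = MESH["orders.order-events"]
revenue = MESH["finance.quarterly-revenue"]

# Describe two data products and the relationship between them.
g.add((orders, RDF.type, MESH.DataProduct))
g.add((orders, RDFS.label, Literal("Order events stream")))
g.add((revenue, RDF.type, MESH.DataProduct))
g.add((revenue, RDFS.label, Literal("Quarterly revenue")))
g.add((revenue, MESH.derivedFrom, orders))  # connectivity between nodes

# Serialize to Turtle, a common representation any node can read.
print(g.serialize(format="turtle"))
```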