Zhamak Dehghani, ThoughtWorks | theCUBE on Cloud 2021

Published Date : Jan 22 2021


Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks


 

(bright music) >> In 2009, Hal Varian, Google's Chief Economist, said that statistician would be the sexiest job in the coming decade. The modern big data movement really took off later in the following year, after the second Hadoop World, which was hosted by Cloudera in New York City. Jeff Hammerbacher famously declared to me and John Furrier, in "theCUBE," that the best minds of his generation were trying to figure out how to get people to click on ads. And he said that sucks. The industry was abuzz with the realization that data was the new competitive weapon. Hadoop was heralded as the new data management paradigm. Now what actually transpired over the next 10 years was that only a small handful of companies could really master the complexities of big data and attract the data science talent really necessary to realize massive returns. As well, back then, cloud was in the early stages of its adoption. When you think about it, at the beginning of the last decade, and as the years passed, more and more data got moved to the cloud, and the number of data sources absolutely exploded, experimentation accelerated, as did the pace of change. Complexity just overwhelmed big data infrastructures and data teams, leading to a continuous stream of incremental technical improvements designed to try and keep pace, things like data lakes, data hubs, new open source projects, new tools, which piled on even more complexity. And as we reported, we believe what's needed is a complete bit flip in how we approach data architectures. Our next guest is Zhamak Dehghani, who is the Director of Emerging Technologies at ThoughtWorks. Zhamak is a software engineer, architect, thought leader and advisor to some of the world's most prominent enterprises. She's, in my view, one of the foremost advocates for rethinking and changing the way we create and manage data architectures, favoring a decentralized over monolithic structure, and elevating domain knowledge as a primary criterion in how we organize so-called big data teams and platforms. Zhamak, welcome to theCUBE, it's a pleasure to have you on the program. >> Hi David, it's wonderful to be here. >> Okay. So you're pretty outspoken about the need for a paradigm shift in how we manage our data, and our platforms at scale. Why do you feel we need such a radical change? What's your thoughts there? >> Well, I think if you just look back over the last decades, you gave us a summary of what happened since 2010. But even if we go before then, what we have done over the last few decades is basically repeating, and as you mentioned, incrementally improving, how we manage data, based on certain assumptions around, as you mentioned, centralization. Data has to be in one place so we can get value from it. But if you look at the parallel movement of our industry in general, since the birth of the internet, we are actually moving towards decentralization. If we think today, if we said on the data side that the only way the web would work, the only way we get access to various applications on the web or pages, is to centralize it, we would laugh at that idea. But for some reason, we don't question that when it comes to data, right? So I think it's time to embrace the complexity that comes with the growth of the number of sources, the proliferation of sources and consumption models, embrace the distribution of sources of data, that they're not just within one part of an organization. They're not even just within the bounds of organizations.
They're beyond the bounds of organizations. And then look back and say, okay, if that's the trend of our industry in general, given the fabric of computation and data that we put in place globally, then how do the architecture and technology and organizational structure and incentives need to move, to embrace that complexity? And to me, that requires a paradigm shift. A full stack, from how we organize our organizations, how we organize our teams, how we put technology in place, to look at it from a decentralized angle. >> Okay, so let's unpack that a little bit. I mean, you've spoken about and written that today's big data architecture, and you've basically just mentioned that it's flawed. So I want to bring up, I love your diagrams, you have a simple diagram, guys if you could bring up figure one. So on the left here, we're ingesting data from the operational systems, and other enterprise data sets. And of course, external data. We cleanse it, you've got to do the quality thing, and then serve them up to the business. So what's wrong with that picture that we just described? Granted, it's a simplified form. >> Yeah. Quite a few things. So, I would flip the question maybe back to you or the audience. If we said that there are so many sources of the data, and actually data comes from systems and from teams that are very diverse in terms of domains, right? Domain. If you just think about, I don't know, retail: the e-commerce versus order management versus customer. These are very diverse domains. The data comes from many different, diverse domains, and then we expect to put them under the control of a centralized team, a centralized system. And that centralization probably, if you zoom out, it's centralized; if you zoom in, it's compartmentalized based on functions, and we can talk about that. And we assume that the centralized model will be getting that data, making sense of it, cleansing and transforming it, then satisfying the needs of a very diverse set of consumers without really understanding the domains, because the teams responsible for it are not close to the source of the data. So there is a bit of a cognitive gap and domain understanding gap, without really understanding how the data is going to be used. When I came up with this idea, I talked to numerous data teams globally, just to see: what are the pain points? How are they doing it? And one thing that was evident in all of those conversations was that they actually didn't know, after they built these pipelines and put the data in, whether in data warehouse tables or a lake, how the data was being used. But yet they're responsible for making the data available for this diverse set of use cases. So essentially a centralized, monolithic system often is a bottleneck. So what you find is that a lot of the teams are struggling with satisfying the needs of the consumers, are struggling with really understanding the data. The domain knowledge is lost; there is a loss of understanding in that transformation. Often we end up training machine learning models on data that is not really representative of the reality of the business, and then we put them into production and they don't work, because the semantics and the syntax of the data get lost within that translation.
And we are struggling with finding people to manage a centralized system, because the technology is still fairly, in my opinion, fairly low level, and exposes the users of those technologies, let's say the warehouse, to a lot of complexity. So in summary, I think it's a bottleneck; it's not going to satisfy the pace of change, the pace of innovation, and the availability of sources. It's disconnected and fragmented; even though it's centralized, it's disconnected and fragmented from where the data comes from and where the data gets used, and is managed by a team of hyper-specialized people who are struggling to understand the actual value of the data, the actual format of the data. So it's not going to get us where our aspirations, our ambitions need to be. >> Yeah, so the big data platform is essentially, I think you call it, context agnostic. And so as data becomes more important in our lives, you've got all these new data sources injected into the system, and experimentation, as we said, with the cloud becomes much, much easier. So one of the blockers that you've cited, and you just mentioned it, is you've got these hyper-specialized roles: the data engineer, the quality engineer, the data scientist. And it's illusory. I mean, it's like an illusion. These guys, seemingly they're independent and can scale independently, but I think you've made the point that in fact, they can't. A change in a data source has an effect across the entire data life cycle, the entire data pipeline. So maybe you could add some color to why that's problematic for some of the organizations that you work with, and maybe give some examples. >> Yeah, absolutely. So in fact initially, the hypothesis around data mesh came from a series of requests that we received from our both large scale and progressive clients, progressive in terms of their investment in data architecture. So these were clients that were larger scale, they had a diverse and rich set of domains; some of them were big technology, tech companies, some of them were big retail companies, big healthcare companies. So they had that diversity of the data and a number of sources and domains. They had invested for quite a few years in multiple generations of proprietary data warehouses on-prem that were moving to cloud. They had moved through the various revisions of the Hadoop clusters, and they were moving those to the cloud, and the challenges that they were facing were simply, if I want to just simplify it in one phrase, that they were not getting value from the data that they were collecting. They were continuously struggling to shift the culture, because there was so much friction between all of these three phases: consumption of the data from sources, then transformation, and then providing it and serving it to the consumer. So that whole process was full of friction. Everybody was unhappy. So the bottom line is that you're collecting all this data, there is delay, there is lack of trust in the data itself, because the data is not representative of the reality; it's gone through a transformation by people that didn't really understand what the data was, and it got delayed. And so there's no trust, it's hard to get to the data. Ultimately, it's hard to create value from the data, and people are working really hard and under a lot of pressure, but they're still struggling. So we often, as technologists, point to technology as the solution. So we go:
Okay, this version of some proprietary data warehouse we're using is not the right thing. We should go to the cloud, and that certainly will solve our problems, right? Or the warehouse wasn't a good one, let's make a data lake version. So instead of extracting and then transforming and loading into the warehouse, and that transformation being that heavy process, because you fundamentally made an assumption, using warehouses, that if I transform this data into this multidimensional, perfectly designed schema, then everybody can run whatever query they want, and that's going to solve everybody's problem. But in reality it doesn't, because you are delayed, and there is no universal model that serves everybody's need; everybody's needs are diverse. Data scientists necessarily don't like the perfectly modeled data, they're looking for both the signals and the noise. So then we've just gone from ETLs to, let's say, now to the lake, which is: okay, let's move the transformation to the last mile. Let's just load the data into the object stores, into semi-structured files, and let the data scientists use it. But they're still struggling, because of the problems that we mentioned. So then what is the solution? What is the solution? Well, a next generation data platform. Let's put it on the cloud. And we saw clients that actually had gone through a year or multiple years of migration to the cloud, and it was great: 18 months, I've seen nine-month migrations of the warehouse, versus two-year migrations of the various data sources to the cloud. But ultimately the result is the same: unsatisfied, frustrated data users and data providers, with lack of ability to innovate quickly on relevant data, and to have an experience that they deserve to have, a delightful experience of discovering and exploring data that they trust. And all of that was still amiss. So something else more fundamentally needed to change than just the technology. >> So the linchpin to your scenario is this notion of context. And you pointed out, you made the other observation that, "Look, we've made our operational systems context aware, but our data platforms are not." And like CRM systems: sales guys are very comfortable with what's in the CRM system. They own the data. So let's talk about the answer that you and your colleagues are proposing. You're essentially flipping the architecture, whereby those domain knowledge workers, the builders if you will of data products or data services, are now first-class citizens in the data flow, and they're injecting, by design, domain knowledge into the system. So I want to put up another one of your charts guys, bring up figure two there. It talks about convergence. You showed distributed domain-driven architecture, the self-serve platform design, and this notion of product thinking. So maybe you could explain why this approach is so desirable, in your view. >> Sure. The motivation and inspiration for that approach came from studying what has happened over the last few decades in operational systems. We had a very similar problem, prior to microservices, with monolithic systems. Monolithic systems were the bottleneck; the changes we needed to make were always orthogonal to how the architecture was centralized. And we found a nice way out. I'm not saying this is a perfect way of decoupling your monolith, but it's a way that currently, where we are in our journey to become data driven, is a nice place to be, which is distribution, or decomposition, of your system as well as your organization.
I think whenever we talk about systems, we've got to talk about people and teams that are responsible for managing those systems. So the decomposition of the systems and the teams, and the data, around domains. Because that's how today we are decoupling our business, right? We are decoupling our businesses around domains, and that's a good thing. And what does that do really for us? What it does is it localizes change to the bounded context of that business. It creates clear boundaries and interfaces and contracts between the rest of the universe of the organization and that particular team, so it removes the friction that often we have for both managing the change and serving data or capability. So the first principle of data mesh is: let's decouple this world of analytical data, to mirror the same way we have decoupled our systems and teams and business. Why is data any different? And the moment you do that, so the moment you bring the ownership to people who understand the data best, then you get questions like, well, how is that any different from silos of disconnected databases that we have today, where nobody can get to the data? So then the rest of the principles are really there to address all of the challenges that come with this first principle of decomposition around domain context. And the second principle is, well, we have to expect a certain level of quality and accountability, and responsibility, from the teams that provide the data. So let's bring product thinking, and treating data as a product, to the data that these teams now share, and let's put accountability around it. We need a new set of incentives and metrics for domain teams to share the data; we need to have a new set of quality metrics that define what it means for the data to be a product, and we can go through that conversation perhaps later. So then the second principle is, okay, the teams that are now responsible, the domain teams responsible for their analytical data, need to provide that data with a certain level of quality and assurance. Let's call that a product, and bring product thinking to that.
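To ground the product thinking she describes here, the sketch below shows, in Python, what a domain team's data product contract with machine-checkable SLOs might look like. Every name and field is an illustrative assumption, not an API from ThoughtWorks or any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

@dataclass
class DataProductSLO:
    """Hypothetical service-level objectives a domain team commits to."""
    max_staleness: timedelta   # how old the newest record may be (timeliness)
    min_completeness: float    # fraction of expected records present, 0.0-1.0
    schema_version: str        # consumers can pin to a published schema

@dataclass
class DataProduct:
    """A domain-owned analytical data product with an owner and an SLO."""
    name: str                  # e.g. "orders.daily-summary" (illustrative)
    domain: str                # the business domain that owns it
    owner: str                 # the domain data product owner
    slo: DataProductSLO
    last_updated: Callable[[], datetime]  # freshness probe supplied by the product

def validate_slo(product: DataProduct, now: datetime, completeness: float) -> list[str]:
    """Checks that could run in a product's delivery pipeline, so quality is
    enforced mechanically rather than by a central governance team."""
    failures = []
    if now - product.last_updated() > product.slo.max_staleness:
        failures.append(f"{product.name}: data staler than {product.slo.max_staleness}")
    if completeness < product.slo.min_completeness:
        failures.append(f"{product.name}: completeness {completeness:.1%} "
                        f"below target {product.slo.min_completeness:.1%}")
    return failures

# Example: a product refreshed two hours ago, with both objectives met.
product = DataProduct(
    name="orders.daily-summary",
    domain="orders",
    owner="order-management team",
    slo=DataProductSLO(max_staleness=timedelta(hours=24),
                       min_completeness=0.99,
                       schema_version="1.2.0"),
    last_updated=lambda: datetime(2021, 1, 22, 10, 0),
)
print(validate_slo(product, now=datetime(2021, 1, 22, 12, 0), completeness=0.995))  # -> []
```

In a setup like this, the guarantees are declared by the domain team and validated mechanically in the product's own pipeline, which anticipates the computational governance she turns to next.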
And then the next question you get asked, by CIOs or CTOs, the people who build the infrastructure and spend the money, is, well, "It's actually quite complex to manage big data, and now we want everybody, every independent team, to manage the full stack of storage and computation and pipelines and access control and all of that?" Well, we've solved that problem in the operational world. And that requires really a new level of platform thinking, to provide infrastructure and tooling to the domain teams to now be able to manage and serve their big data, and I think that requires re-imagining the world of our tooling and technology. But for now, let's just assume that we need a new level of abstraction to hide away a ton of complexity that people unnecessarily get exposed to. And that's the third principle: creating self-serve infrastructure to allow autonomous teams to build their domains. But then the last pillar, the last fundamental pillar, is okay, once you've distributed a problem into smaller problems, you find yourself with another set of problems, which is how I'm going to connect this data. The insights happen and emerge from the interconnection of the data domains, right? They're just not necessarily locked into one domain. So the concerns around interoperability and standardization, and getting value as a result of composition and interconnection of these domains, require a new approach to governance. And we have to think about governance very differently, based on a federated model, and based on a computational model. Like once we have this powerful self-serve platform, we can computationally automate a lot of governance decisions and security decisions and policy decisions that apply to this fabric of mesh, not just a single domain, and not in a centralized way. So really, as you mentioned, the most important component of the data mesh is distribution of ownership and distribution of architecture and data; the rest of the principles are there to solve all the problems that come with that. >> So, very powerful. And guys, we actually have a picture of what Zhamak just described. Bring up figure three, if you would. So I mean, essentially, you're advocating for the pushing of the pipeline and all its various functions into the lines of business, and abstracting that complexity of the underlying infrastructure, which you kind of show here in this figure, data infrastructure as a platform down below. And you know what I love about this, Zhamak, is that to me it underscores that data is not the new oil. Because I can put oil in my car, I can put it in my house, but I can't put the same quart in both places. But I think you call it polyglot data, which is really different forms, batch or whatever. But the same data doesn't follow the laws of scarcity. I can use the same data for many, many uses, and that's what this sort of graphic shows. And then you brought in the really important sticking problem, which is the governance, which is now not command and control, it's federated governance. So maybe you could add some thoughts on that. >> Sure, absolutely. It's one of those... I think I keep referring to data mesh as a paradigm shift, and it's not just to make it sound grand and exciting or important. It's really because I want to point out that we need to question every moment when we make a decision around how we're going to design security, or governance, or modeling of the data. We need to reflect and go back and say, "Am I applying some of my cognitive biases around how I have worked for the last 40 years, how I've seen it work? Or do I really need to question it?" And we do need to question the way we have applied governance. I think at the end of the day, the role of data governance and its objective remain the same.
It's really about, instead of kind of putting those systems in a straight jacket of being constantly and don't change, embrace change, and continuous change of landscape because that's just the reality we can't escape. So the role of governance really, the modern governance model I called federated and computational. And by that I mean, every domain needs to have a representative in the governance team. So the role of the data or domain data product owner who really were understands that domain really well, but also wears that hats of the product owner. It's an important role that has to have a representation in the governance. So it's a federation of domains coming together. Plus the SMEs, and people have Subject Matter Experts who understand the regulations in that environment, who understands the data security concerns. But instead of trying to enforce and do this as a central team, they make decisions as what needs to be standardized. What needs to be enforced. And let's push that into that computationally and in an automated fashion into the platform itself, For example. Instead of trying to be part of the data quality pipeline and inject ourselves as people in that process, let's actually as a group, define what constitutes quality. How do we measure quality? And then let's automate that, and let's codify that into the platform, so that every day the products will have a CICD pipeline, and as part of that pipeline, law's quality metrics gets validated, and every day to product needs to publish those SLOs or Service Level Objectives, or whatever we choose as a measure of quality, maybe it's the integrity of the data, or the delay in the data, the liveliness of the data, whatever are the decisions that you're making. Let's codify that. So it's really the objectives of the governance team trying to satisfies the same, but how they do it, it's very, very different. And I wrote a new article recently, trying to explain the logical architecture that would emerge from applying these principles, and I put a kind of a light table to compare and contrast how we do governance today, versus how we'll do it differently, to just give people a flavor of what does it mean to embrace decentralization, and what does it mean to embrace change, and continuous change. So hopefully that could be helpful. >> Yes. There's so many questions I have. But the point you make it too on data quality, sometimes I feel like quality is the end game, Where the end game should be how fast you can go from idea to monetization with a data service. What happens again? And you've sort of addressed this, but what happens to the underlying infrastructure? I mean, spinning up EC2s and S3 buckets, and MyPytorches and TensorFlows. That lives in the business, and who's responding for that? >> Yeah, that's why I'm glad you're asking this question, David, because I truly believe we need to reimagine that world. I think there are many pieces that we can use as utilities are foundational pieces, but I can see for myself at five to seven year road map building this new tooling. I think in terms of the ownership, the question around ownership, that would remain with the platform team, but I don't perhaps a domain agnostic technology focused team, right? That there are providing a set of products themselves, but the users of those products are data product developers, right? Data domain teams that now have really high expectations, in terms of low friction, in terms of a lead time to create a new data products. 
>> Yes. There are so many questions I have. But the point you make too on data quality: sometimes I feel like quality is the end game, where the end game should be how fast you can go from idea to monetization with a data service. And you've sort of addressed this, but what happens to the underlying infrastructure? I mean, spinning up EC2s and S3 buckets, and my PyTorches and TensorFlows. That lives in the business, and who's responsible for that? >> Yeah, that's why I'm glad you're asking this question, David, because I truly believe we need to reimagine that world. I think there are many pieces that we can use as utilities or foundational pieces, but I can see, for myself, a five-to-seven-year roadmap of building this new tooling. In terms of the ownership, the question around ownership, that would remain with the platform team, but perhaps as a domain-agnostic, technology-focused team, right? They are providing a set of products themselves, but the users of those products are data product developers, right? Data domain teams that now have really high expectations in terms of low friction, in terms of the lead time to create a new data product. So we need a new set of tooling, and I think the language needs to shift from "I need a storage bucket," or "I need a storage account," or "I need a cluster to run my Spark jobs," to "here's the declaration of my data product: this is where the data will come from, this is the data that I want to serve, these are the policies that I need to apply, in terms of perhaps encryption or access control. Go make it happen, platform; go provision everything that I need, so that as a data product developer, all I can focus on is the data itself, representation of the semantics and representation of the syntax, and making sure that data meets the quality that I have to assure and that it's available." The rest, provisioning of everything that sits underneath, will have to get taken care of by the platform. And that's what I mean when I say it requires a reimagination. And there will be a data platform team. The data platform teams that we set up for our clients in fact themselves have a fair bit of complexity internally; they divide into multiple teams, multiple planes. So there would be a plane, as in a group of capabilities, that satisfies that data product developer experience. There would be a set of capabilities that deal with those nitty-gritty underlying utilities, I call them (indistinct) utilities, because to me, the level of abstraction of the platform needs to go higher than where it is. So what we call platform today is a set of utilities we'll continue using. We'll continue using object storage, we'll continue using relational databases, and so on. So there will be a plane and a group of people responsible for that. There will be a group of people responsible for capabilities that enable the mesh-level functionality, for example, being able to correlate and connect and query data from multiple nodes, that's a mesh-level capability, or being able to discover and explore the mesh of data products, that's a mesh-level capability. So it would be a set of teams as part of the platform. So we use, again, strong product thinking, and ownership embedded into that, to satisfy the experience of these now business-oriented domain data teams. So we have a lot of work to do.
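A rough sketch of what such a data product declaration could look like, under assumed field names; the idea is that the developer declares intent, and a hypothetical provision() call on the platform side takes care of everything underneath:

```python
# A sketch of a declarative data product specification, with invented field
# names: the developer states what the product is, where its data comes
# from, and which policies apply; the hypothetical provision() function
# stands in for the platform allocating storage, pipelines, and access.
from dataclasses import dataclass, field

@dataclass
class DataProductSpec:
    name: str
    domain: str
    input_sources: list       # upstream systems or other data products
    output_schema: dict       # column name -> type, the "syntax"
    policies: dict = field(default_factory=dict)

spec = DataProductSpec(
    name="customer-orders",
    domain="e-commerce",
    input_sources=["app.orders_events", "crm.customers"],
    output_schema={"order_id": "string", "customer_id": "string",
                   "total": "decimal", "placed_at": "timestamp"},
    policies={"encryption": "aes-256", "access": "analysts-only"},
)

def provision(spec: DataProductSpec) -> None:
    """Stand-in for the platform: a real implementation would allocate
    storage, wire up pipelines, and enforce the declared policies."""
    print(f"provisioning '{spec.name}' in domain '{spec.domain}'")
    for policy, value in spec.policies.items():
        print(f"  applying policy {policy}={value}")

provision(spec)
```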
>> I could go on, but unfortunately, we're out of time. I guess, first of all, I want to tell people there are two pieces that you've put out so far. One is "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh"; you guys should read that. And "Data Mesh Principles and Logical Architecture" is kind of part two. I guess my last question in the very limited time we have is: are organizations ready for this? >> I think the desire is there. I've been overwhelmed with the number of large and medium and small, private and public, government and federal organizations that have reached out to us globally. I mean, this is a global movement, and I'm humbled by the response of the industry. I think the desire is there, the pains are real, people acknowledge that something needs to change here. So that's the first step. I think awareness is spreading, organizations are more and more becoming aware; in fact, many technology providers are reaching out to us asking what they should do, because their clients are asking them; people are already asking, we need the data mesh and we need the tooling to support it. So that awareness is there, in terms of the first step of being ready. However, the ingredients of a successful transformation require top-down and bottom-up support. So it requires support from chief data and analytics officers and above; the most successful clients that we have with data mesh are the ones where the CEOs have made a statement that "we want to change the experience of every single customer using data, and we're going to commit to this." So the investment and support exist from the top through all layers, the engineers are excited, and perhaps the traditional data teams are open to change. So there are a lot of ingredients of transformation that come together. Are we really ready for it? I think the pioneers, perhaps, the innovators, if you think about that innovation adoption curve: probably the pioneers and innovators and lead adopters are making moves towards it, and hopefully, as the technology becomes more available, organizations that are less engineering-oriented, that don't have the capability in-house today but can buy it, would come next. Maybe those are not the ones who are quite ready for it yet, because the technology is not readily available and requires internal investment to make. >> I think you're right on. I think the leaders are going to lean in hard and they're going to show us the path over the next several years. And I think that the end of this decade is going to be defined a lot differently than the beginning. Zhamak, thanks so much for coming to "theCUBE" and participating in the program. >> Thank you for hosting me, David. >> Pleasure having you. >> It's been wonderful. >> All right, keep it right there everybody, we'll be back right after this short break. (slow music)

Published Date : Dec 23 2020

Joanna Parke, ThoughtWorks, Grace Hopper Celebration of Women in Computing 2017


 

>> Announcer: Live from Orlando, Florida, it's theCUBE, covering Grace Hopper Celebration of Women in Computing, brought to you by SiliconANGLE Media. (light, electronic music) >> Welcome back to theCUBE's coverage of the Grace Hopper Conference here in Orlando, Florida. I'm your host, Rebecca Knight. We're joined by Joanna Parke. She is the Group Managing Director, North America, at ThoughtWorks based in Chicago. Thanks so much for joining us, Joanna. >> Thank you, it's a pleasure to be here. >> Your company is being honored for the second year in a row as a top company for women technologists by the Anita Borg Institute. Tell our viewers what that means. >> Yeah, we're incredibly proud and super humble to be recognized again for the second year in a row. Our journey towards diversity and inclusivity really began about eight or nine years ago. It started with the top leadership of the company saying that this is a crisis in our industry, and we need to take a stand and we need to do something about it. So, it's been a long journey. It's not something that we started a couple of years ago, so there's been a lot of work by many people over the years to get us to where we are today, and we still feel that we have a long way to go. There's still a lot to do. >> So, being recognized as a top company for women technologists, it obviously means there are many women who work there. But, what else can a woman technologist looking for a job expect at ThoughtWorks? >> So, we think about not just the aspects of diversity, which is what the makeup of your workforce looks like, but we also put equal if not more importance on inclusivity. So, you can go out and you can make all sorts of efforts to hire women or minorities into your company, but if you don't have a culture and an environment in which they feel welcome and they feel like they can succeed and they can bring themselves to work, then that success won't be very lasting. So, we've focused not only on the recruiting process but also our culture, our benefits, the environment in which we work. We are a software development company and we come from a history of agile software practices, which means that we work together in a very people-oriented and collaborative way. So, in some ways we had a little bit of a head start in that, by working in that way, our culture was already built to be more team-focused and collaborative and inclusive, so that was a good advantage for us when we got started. >> So, how else do you implement these best practices of the collaboration and the inclusivity? Because, I mean, it is one thing to say that we want everyone to have a voice at the table, but it's harder to pull off. >> It is, absolutely. So, a couple things that we've done over our history, one is just starting with open conversation. We talk a lot about unconscious bias, we do education and training throughout the workforce, we try to encourage those uncomfortable conversations that really create breakthroughs in understanding. We look for people that are open and curious in the interview process, and we feel like if you are open to having your views about the world challenged, that's a really good sign. So, that's kind of one step. Then, I think, when bad behavior arises, which it always does, it's how you react and how you deal with it. So, making it clear to everyone that behavior that excludes or belittles others on the team is not tolerated. That's not the kind of culture that we want to build. It's an ongoing process. 
So, how do you call out the bad behavior, because that's hard to do, particularly if you're a junior employee. >> Yes, so we try and create a safe environment where people feel like, if I have an issue with someone on my team, particularly if it's someone more senior than me, we have a complete open-door and flat organization. So, anyone can pick up the phone and call me or our CEO or whoever they feel comfortable talking to. I think, what happens is, when that happens and people see action being taken, whether it's feedback being given or a more serious action, then it reinforces the fact that it's okay to speak up and that you are going to be heard and listened to. >> One of the underlying themes of this conference is that women technologists have a real responsibility to have a voice in this industry, and to shape how the future of software progresses. Can you talk a little bit more about that, about what you've seen and observed and also the perspective of ThoughtWorks on this issue? >> Absolutely, we all have seen the power that technology has in transforming our society, and that is only going to grow over time. It's not going away. So, it really impacts every aspect of our life, whether it's healthcare or how we interact with our family or how we go to work every day. Having a diverse set of perspectives that reflects the makeup of our society is so important. I was really impressed by Dr. Fei-Fei Li's keynote on Wednesday morning-- >> She's at Stanford. >> Yeah, Stanford and at Google right now as well. She spoke about the importance of having diverse voices in the field of artificial intelligence. She said, no other technology reflects its designers more than AI, and it is so critical that we have that diverse set of voices involved in shaping that technology. >> Is it almost too much though? As a woman technologist, not only do you have to be a trailblazer and put up with a lot of bias and sexism in the industry, but then you have this added responsibility. What's your advice to women in the field? Particularly the young women here who are at their first Grace Hopper. >> Absolutely, our CEO-- Sorry, our CTO, Rebecca Parsons, often says that the reason that she put up with it for so many years is because she's a geek, and because she's passionate about technology. So, when you're in those trying times, being able to connect with your passion and know that you're making a difference is so important. Because, if it's just something that you view as a job, or a way to make a living, you don't have that level of passion to get you through some of the hardships. So, I think, for me, that sense of responsibility is kind of a motivating and driving force. The good news is it will get easier over time. As we make progress in our industry, you don't feel so alone. You start to have other women and other marginalized groups around you that you can connect with and share experiences. >> What are some of the most exciting projects you're working on at ThoughtWorks? >> We really try to cover a broad landscape of technology. We think of ourselves as early adopters that can spot the trends in the industry and help bring them into the enterprise. So, we're doing some really exciting things in the machine-learning space, around predictive maintenance, understanding when machine parts are going to fail and being able to repair them ahead of time. Things like understanding customer insights through data. I think those areas are emerging and super exciting. >> Excellent. 
What are you looking for? Are you here recruiting? >> Absolutely. >> And, with a top company sticker on your booth, I'm sure that you are highly sought after. What are you looking for in a candidate? >> For a long time, we have articulated our strategy in three words: attitude, aptitude, and integrity. Because we feel like if we can find a person that has a passion for learning, the ability to learn, and the right attitude about that, we can work with that, right? The world of technology is changing so fast, so even if you know the tech of today, if you don't have that passion and ability to learn, you're not going to be able to keep up. So, we really look for people in terms of those character traits, and those people are the kind of people that are successful and thrive at ThoughtWorks. >> If you look at the data, it looks as though there is a looming talent shortage. Are you worried about that at ThoughtWorks? What's your-- >> Absolutely. There is a huge talent gap. It's growing by the day. We see it at our clients as well as ourselves. For me, it really comes down to the responsibility of society as well as companies to invest in upskilling our workforce. We have seen some clients take that investment and realize that the skills they needed in their workforce a few years ago look very different from what they're going to need in the future. So, we believe strongly in investing in, training, and upskilling our employees. We work with our clients to help them do so as well. But, I think we can't rely on the existing educational system to create all of the talent that we're going to need. It's really going to take investment, I believe, from society and from companies. >> And on-the-job training. >> Absolutely. There's no replacement for that, right? You can do the kind of academic and educational studies, but there's no replacement for once you get into the real world and you're with people and the day-to-day challenges arise. >> Excellent. Well, Joanna, thanks so much for coming on. It was a real pleasure talking to you. >> Thank you, it was my pleasure. >> We will have more from the Orange County Convention Center, the Grace Hopper Celebration of Women in Computing just after this. (light, electronic music)

Published Date : Oct 6 2017

Is Data Mesh the Killer App for Supercloud | Supercloud2


 

(gentle bright music) >> Okay, welcome back to our "Supercloud 2" event live coverage here at stage performance in Palo Alto, syndicating around the world. I'm John Furrier with Dave Vellante. We've got exclusive news and a scoop here for SiliconANGLE and theCUBE. Zhamak Dehghani, creator of data mesh, has formed a new company called NextData.com. NextData. She's a Cube alumni and contributor to our Supercloud initiative, as well as our coverage and Breaking Analysis with Dave Vellante on data, the killer app for Supercloud. Zhamak, great to see you. Thank you for coming into the studio, and congratulations on your newly formed venture and continued success on the data mesh. >> Thank you so much. It's great to be here. Great to see you in person. >> Dave: Yeah, finally. >> John: Wonderful. Your contributions to the data conversation have been well-documented, certainly by us and others in the industry. Data mesh is taking the world by storm. Some people are debating it, throwing, you know, cold water on it. Some think it's the next big thing. Tell us about the data mesh super data apps that are emerging out of cloud. >> I mean, data mesh, as you said, it's, you know, the pain points that it surfaced were universal. Everybody said, "Oh, why didn't I think of that?" You know, it was just an obvious next step, and people are approaching it, implementing it. I guess the last few years, I've been involved in many of those implementations, and I guess Supercloud is somewhat a prerequisite for it, because it's data mesh, and building applications using data mesh is about sharing data responsibly across boundaries. And those boundaries include organizational boundaries, cloud technology boundaries, and trust boundaries. >> I want to bring that up because your venture, NextData, which is new, just formed. Tell us about that. What wave is that riding? What specifically are you targeting? What's the pain point? >> Zhamak: Absolutely, yes. So NextData is the result of, I suppose, the pains that I suffered from implementing data mesh for many of the organizations. Basically, a lot of organizations that I've worked with, they want decentralized data. So they really embrace this idea of decentralized ownership of the data, but yet they want interconnectivity through standard APIs, yet they want discoverability and governance. So they want to have policies implemented, they want to govern that data, they want to be able to discover that data, and yet they want to decentralize it. And we do that with a developer experience that is easy and native to a generalist developer. So we try to find, I guess, the common denominator that solves those problems and enables that developer experience for data sharing. >> John: Since you just announced the news, what's been the reaction? >> Zhamak: I just announced the news right now, so what's the reaction? >> John: But people in the industry that know you, you did a lot of work in the area. What has been some of the feedback on the new venture in terms of the approach, the customers, the problem? >> Yeah, so we've been in stealth mode, so we haven't publicly talked about it, but folks that have been close to us in fact have reached out. We already have implementations of our pilot platform with early customers, which is super exciting. And we're going to have multiple of those. Of course, we're a tiny, tiny company. We can have many of those, and we are going to have multiple pilot implementations of our platform in the real world,
with real, global, large-scale organizations that have real-world problems. So we're not going to build our platform in a vacuum. And that's what's happening right now. >> Zhamak, when I think about your role at ThoughtWorks, you had a very wide observation space with a number of clients, helping them implement data mesh and other things as well prior to your data mesh initiative. But when I look at data mesh, at least the ones that I've seen, they're very narrow. I think of JPMC, I think of HelloFresh. They're generally, and obviously it's not surprising, they don't include the big vision of inclusivity across clouds, across different data stores. But it seems like people are having to go through some gymnastics to get to, you know, the organizational reality of decentralizing data, and at least pushing data ownership to the line of business. How are you approaching, or are you approaching, solving that problem? Are you taking a narrow slice? What can you tell us about NextData? >> Zhamak: Sure, yeah, absolutely. Gymnastics, the cute word to describe what the organizations have to go through. And one of those problems is that, you know, the data, as you know, resides on different platforms. It's owned by different people, it's processed by pipelines where who knows who owns them. So there's this very disparate and disconnected set of technologies that were very useful for when we thought about data and processing as a centralized problem. But when you think about data as a decentralized problem, the cost of integration of these technologies in a cohesive developer experience is what's missing. And we want to focus on that cohesive end-to-end developer experience to share data responsibly in these autonomous units, we call them data products, I guess, in data mesh, right? They constitute computation, the policies that govern that data, discoverability. So I guess, I heard this expression in the last talks that you can have your cake and eat it too. So we want people to have their cake, which is, you know, data in different places, decentralization, and eat it too, which is interconnected access to it. So we start with standardizing and codifying this idea of a data product container that encapsulates data, computation, and APIs to get to it, in a technology-agnostic way, in an open way. And then sit on top and use existing tech, you know, Snowflake, Databricks, whatever exists, you know, the millions of dollars of investments that companies have made, sit on top of those but create this cohesive, integrated experience where the data product is a first-class primitive. And that's really key here: the language and the modeling that we use is really native to data mesh, as in, I'm making a data product, I'm sharing a data product, and that encapsulates providing metadata about it, providing computation that's constantly changing the data, providing the API for that. So we're trying to kind of codify and create a new developer experience based on that, with developers, both from the provider side and the user side, connected through peer-to-peer data sharing with the data product as a primitive, first-class concept.
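As a loose illustration of that data product container idea, one might picture something like the sketch below, where the data, the computation that maintains it, its metadata, and its access API travel together as one first-class unit. The class and method names are invented for this example and do not describe NextData's actual product.

```python
# An illustrative sketch of a "data product container": data, the
# computation that keeps it current, metadata, and an access API packaged
# as one first-class unit. All names are invented for this example.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataProductContainer:
    name: str
    metadata: dict                     # semantics, owner, lineage
    transform: Callable[[list], list]  # computation shipped with the data
    records: list = field(default_factory=list)

    def ingest(self, raw: list) -> None:
        """Run the product's own computation over incoming raw data."""
        self.records.extend(self.transform(raw))

    def read(self, consumer: str) -> list:
        """The output port: every read goes through the product's API,
        which is where policy and lineage hooks would attach."""
        print(f"{consumer} read {len(self.records)} records from {self.name}")
        return list(self.records)

orders = DataProductContainer(
    name="orders",
    metadata={"owner": "e-commerce domain", "version": "1"},
    transform=lambda raw: [r for r in raw if r.get("status") == "confirmed"],
)
orders.ingest([{"id": 1, "status": "confirmed"}, {"id": 2, "status": "draft"}])
rows = orders.read(consumer="analytics-team")
```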
>> Okay, so the idea would be developers would build applications leveraging those data products, which are discoverable and governed. Now, today you see some companies, you know, take a Snowflake for example. >> Zhamak: Yeah. >> Attempting to do that within their own little walled garden. They even, at one point, used the term "mesh." I don't know if they pulled back on that. And then they sort of became aware of some of your work. But a lot of the things that they're doing within their little insulated environment, you know, support that, you know, governance, and they're building out an ecosystem. What's different in your vision? >> Exactly. So we realize that, you know, and this is a reality: like, you go to organizations, they have a Snowflake, and half of the organization happily operates on Snowflake. And the other half, oh, we are on, you know, bare infrastructure on AWS, or we are on Databricks. These are the realities. You know, this supercloud that's written up here, it's about working across boundaries of technology. So we try to embrace that. And even for our own technology, with the way we're building it, we say, "Okay, nobody's going to use only the NextData data mesh operating system. People will have different platforms." So you have to build with openness in mind, and in the case of Snowflake, I think, you know, I'm sure they have very happy customers, as long as those customers can be on Snowflake. But once you cross that boundary of platforms, then that becomes a problem. And we try to keep that in mind in our solution. >> So, it's worth reviewing that basically, the concept of data mesh is that, whether it's a data lake or a data warehouse, an S3 bucket, or an Oracle database as well, they should all be included inside of the data mesh. >> We did a session with AWS on the startup showcase, data as code. And remember, I wrote a blog post in 2007 called "Data's the New Developer Kit." Back then, they used to call them developer kits, if you remember. And we said at that time, whoever can code data >> Zhamak: Yes. >> will have a competitive advantage. >> Aren't there machines going to be doing that? Didn't we just hear that? >> Well we have, and you know, hey Siri, hey Cube, find me the best video for data mesh. There it is. I mean, this is the point: like, what's happening is that, now, data has to be addressable >> Zhamak: Yes. >> for machines and for coding. >> Zhamak: Yes. >> Because you need to call the data. So the question is, how do you manage the complexity of making data as promiscuous as possible, making it available, as well as then governing it? Because it's a trade-off. The more you make it open >> Zhamak: Definitely. >> the better the machine learning. >> Zhamak: Yes. >> But yet, there's the governance issue. So you need an OS to handle this, maybe. >> Yes, well, our mental model for our platform is an OS, an operating system. Operating systems, you know, have shown us how you can kind of abstract what's complex and take care of, you know, a lot of complexities, but yet provide an open and, you know, dynamic enough interface. So we think about it that way. We try to solve the problem of policies living with the data. Enforcement of the policies happens at the most granular level, which is, in this concept, the data product. And that would happen whether you read, write, or access a data product. But we can never imagine what all these policies could be. So our thinking is, okay, we should have an open policy framework that can allow organizations to write their own policy drivers and policy definitions, and encode and encapsulate them in this data product container.
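Such an open policy framework might look roughly like this sketch, where organizations register their own policy drivers and the platform enforces every registered policy on each access to a data product. The driver interface shown here is an assumption made for illustration.

```python
# A rough sketch of an open policy framework: organizations register their
# own policy drivers, and the platform enforces every registered policy on
# each access to a data product. The interface is an assumption.
from typing import Protocol

class PolicyDriver(Protocol):
    def check(self, user: str, action: str, record: dict) -> bool: ...

class RegionPolicy:
    """Example custom driver: EU records only readable by EU principals."""
    def check(self, user: str, action: str, record: dict) -> bool:
        if record.get("region") == "eu":
            return user.endswith("@eu.example.com")
        return True

class MaskPIIPolicy:
    """Example custom driver: forbid raw reads of records flagged as PII."""
    def check(self, user: str, action: str, record: dict) -> bool:
        return not (action == "read_raw" and record.get("pii", False))

REGISTERED: list = [RegionPolicy(), MaskPIIPolicy()]

def enforce(user: str, action: str, record: dict) -> bool:
    """Called by the data product's API on every read or write."""
    return all(driver.check(user, action, record) for driver in REGISTERED)

# e.g. a read against one record of a data product:
ok = enforce("ana@eu.example.com", "read_raw",
             {"region": "eu", "pii": False, "order_id": "42"})
print("access granted" if ok else "access denied")
```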
But I'm not going to fool myself to say that, you know, that's going to solve the problem that you just described. I think we are in this, I don't know, if I look into my crystal ball, what I think might happen is that right now, the primitives that we work with to train machine-learning models are still bits and bytes of data. They're fields, rows, columns, right? And that creates quite a large surface area, an attack area, for, you know, the privacy of the data. So perhaps one of the trends that we might see is this evolution of data APIs to become more and more computationally aware, to bring the compute to the data, to reduce that surface area, so you can really leave the control of the data to the sovereign owners of that data, right? To that data product. So I think the evolution of our data APIs perhaps will become more and more computational. So you describe what you want, and the data owner decides, you know, how to manage the- >> John: That's interesting, Dave, 'cause it's almost like what we just talked about with ChatGPT in the last segment, machine learning that's really been around the industry. It's almost as if you're starting to see reasoning come into the data. It's like you're starting to see, not just metadata, but using the data to reason, so that you don't have to expose the raw data. It's almost like a, I won't say curation layer, but an intelligence layer. >> Zhamak: Exactly. >> Can you share your vision on that? 'Cause that seems to be where the dots are connecting. >> Zhamak: Yes, this is perhaps further into the future, because just from where we stand, we still have to create that bridge of familiarity between that future and the present. So we are still in that bridge-making mode. However, by just the basic notion of saying, "I'm going to put an API in front of my data," and that API today might be as primitive as a level of indirection, as in: you tell me what you want, tell me who you are, let me go process that, all the policies and lineage, and insert all of this intelligence that needs to happen, and then, today, I will still give you a file. But by just defining that API and standardizing it, now we have this amazing extension point where we can say, "Well, in the next revision of this API, you not just tell me who you are, but you actually tell me what intelligence you're after, what's the logic that I need to go and now compute on your API." And you can kind of evolve that, right? Now you have a point of evolution toward this very futuristic, I guess, future where you just describe the question that you're asking, as you would with the chat.
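That shift, from "give me the file" to "here is the computation I want run where the data lives," could be pictured like this sketch. The request shape is purely an assumption; the point is that only the result leaves the data owner's boundary, never the raw records.

```python
# A sketch of a data API evolving from file handoff to computation-aware
# requests: the consumer describes the logic, the data owner runs it next
# to the data, and only the result is returned. Request shape is assumed.
from dataclasses import dataclass

@dataclass
class ComputationalRequest:
    consumer: str   # who is asking (input to policy decisions)
    purpose: str    # declared intent, e.g. "weekly-revenue-report"
    logic: str      # the computation to run, instead of "send me rows"

DATA = [{"total": 120.0}, {"total": 80.0}, {"total": 45.5}]  # stays with owner

def serve(request: ComputationalRequest) -> float:
    """The owner-side endpoint: policies could veto or rewrite the logic;
    only the aggregate leaves, never the raw records."""
    if request.logic == "sum(total)":
        return sum(row["total"] for row in DATA)
    raise ValueError("logic not permitted by the data owner")

result = serve(ComputationalRequest(
    consumer="finance-team", purpose="weekly-revenue-report",
    logic="sum(total)"))
print(f"aggregate returned to consumer: {result}")
```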
>> Well, this is the Supercloud, Dave. >> I have a question from a fan, I've got to get it in. It's George Gilbert. And so, his question is: you're blowing away the way we synchronize data from operational systems to the data stack to applications. So the concern that he has, and he wants your feedback on this, is: do data product app devs get exposed to more complexity with respect to moving data between data products, or maybe attributes between data products? How do you respond to that? Is that a problem, or is that something that is overstated, or do you have an answer for that? >> Zhamak: Absolutely. So I think there's a sweet spot in getting data developers, data product developers, closer to the app, but yet not burdening them with the complexity of the application and application logic, and yet reducing their cognitive load by localizing what they need to know about, which is that domain where they're operating within. Because what's happening right now? What's happening right now is that data engineers, and I have a ton of empathy for them, for the high threshold of pain that they can, you know, deal with, have been centralized. They've been put into the data team, and they have been given this unbelievable task of: make meaning out of data, put semantics over it, curate it, cleanse it, and so on. So what we are saying is: get those folks embedded into the domain, closer to the application developers. These are still separately moving units. Your app and your data products are independent, but yet tightly coupled with each other based on the context of the domain. So reduce cognitive load by localizing what they need to know about to the domain, get them closer to the application, but yet have them separate from the app, because the app provides a very different service, transactional data for my e-commerce transaction; the data product provides a very different service, longitudinal data for the variety of this intelligent analysis that I can do on the data. But yet, it's all within the domain of e-commerce or sales or whatnot. >> So a lot of decoupling and coupling create that cohesiveness. >> Zhamak: Absolutely. >> Architecture. So I have to ask you, this is an interesting question 'cause it came up on theCUBE all last year. Back in the old server, data center days, and then cloud: SRE. Google coined the term "Site Reliability Engineer" for someone to look over the hundreds of thousands of servers. We asked a question to the data engineering community, who have been suffering, by the way, I agree: is there an SRE-like role for data? Because in a way, data engineering, that platform engineer, they are like the SRE for data. In other words, managing the large scale to enable automation and self-service. What's your thoughts and reaction to that? >> Zhamak: Yes, exactly. So, maybe we go through that history of how SRE came to be. So we had the first DevOps movement, which was: remove the wall between dev and ops and bring them together. So you have one cross-functional unit of the organization that's responsible for it: you build it, you run it, right? So then there is no "I'm going to just shoot my application over the wall for somebody else to manage it." So we did that, and then we said, "Okay, as we decentralized and had this many microservices running around, we had to create a layer that abstracted a lot of the complexity around running, monitoring, and observing a lot of services, while giving autonomy to these cross-functional teams." And that's where the SRE, a new generation of engineers, came to exist. So I think if I just look- >> Hence Borg, hence Kubernetes. >> Hence, hence, exactly. Hence chaos engineering, hence embracing the complexity and messiness, right? And putting engineering discipline in place to embrace that and yet give a cohesive and high-integrity experience of those systems. So I think, if we look at that evolution, perhaps something like that is happening by bringing data and apps closer, and making them these domain-oriented data product teams, or domain-oriented cross-functional teams, full stop, and still having a very advanced, maybe at the platform infrastructure level, kind of operational team. They're not busy doing two jobs, which is taking care of domains and the infrastructure; they're building infrastructure that is embracing that complexity and interconnectivity of this data process. >> John: So you see similarities. 
>> Absolutely, but I feel like we're probably in the earlier days of that movement. >> So it's a data DevOps kind of thing happening, where scaling is happening. Good things are happening, yet a little bit fast and loose, with some complexities to clean up. >> Yes, yes. This is a different restructuring. As you said, the job of this industry as a whole, and of architects, is decompose, recompose, decompose, recompose in a new way, and now we're decomposing the centralized team, recomposing them as domains and- >> John: So is data mesh the killer app for Supercloud? >> You had to do this to me. >> Dave: Sorry, I couldn't- (John and Dave laughing) >> Zhamak: What do you want me to say, Dave? >> John: Yes. >> Zhamak: Yes, of course. >> I mean Supercloud, I think it's really, the terminology's Supercloud, Opencloud, but I think, in the spirit of it, it's this embracing of diversity and giving autonomy to people to make decisions for what's right for them, and not locking them in. I think just embracing that is baked into how data mesh assumes the world would work. >> John: Well, thank you so much for coming on Supercloud 2, really appreciate it. Data has driven this conversation. Your success with data mesh has really opened up the conversation and exposed the slow-moving data industry. >> Dave: Been a great catalyst. (John laughs) >> John: That's now going well. We can move faster, so thanks for coming on. >> Thank you for hosting me. It was wonderful. >> Okay, Supercloud 2 live here in Palo Alto, our stage performance. I'm John Furrier with Dave Vellante. We're back with more after this short break. Stay with us all day for Supercloud 2. (gentle bright music)

Published Date : Feb 17 2023

Is Data Mesh the Next Killer App for Supercloud?


 

(upbeat music) >> Welcome back to our Supercloud 2 event live coverage here of stage performance in Palo Alto syndicating around the world. I'm John Furrier with Dave Vellante. We got exclusive news and a scoop here for SiliconANGLE in theCUBE. Zhamak Dehghani, creator of data mesh has formed a new company called Nextdata.com, Nextdata. She's a cube alumni and contributor to our supercloud initiative, as well as our coverage and Breaking Analysis with Dave Vellante on data, the killer app for supercloud. Zhamak, great to see you. Thank you for coming into the studio and congratulations on your newly formed venture and continued success on the data mesh. >> Thank you so much. It's great to be here. Great to see you in person. >> Dave: Yeah, finally. >> Wonderful. Your contributions to the data conversation has been well documented certainly by us and others in the industry. Data mesh taking the world by storm. Some people are debating it, throwing cold water on it. Some are thinking it's the next big thing. Tell us about the data mesh, super data apps that are emerging out of cloud. >> I mean, data mesh, as you said, the pain point that it surface were universal. Everybody said, "Oh, why didn't I think of that?" It was just an obvious next step and people are approaching it, implementing it. I guess the last few years I've been involved in many of those implementations and I guess supercloud is somewhat a prerequisite for it because it's data mesh and building applications using data mesh is about sharing data responsibly across boundaries. And those boundaries include organizational boundaries, cloud technology boundaries, and trust boundaries. >> I want to bring that up because your venture, Nextdata, which is new just formed. Tell us about that. What wave is that riding? What specifically are you targeting? What's the pain point? >> Absolutely. Yes, so Nextdata is the result of, I suppose the pains that I suffered from implementing data mesh for many of the organizations. Basically a lot of organizations that I've worked with they want decentralized data. So they really embrace this idea of decentralized ownership of the data, but yet they want interconnectivity through standard APIs, yet they want discoverability and governance. So they want to have policies implemented, they want to govern that data, they want to be able to discover that data, and yet they want to decentralize it. And we do that with a developer experience that is easy and native to a generalist developer. So we try to find the, I guess the common denominator that solves those problems and enables that developer experience for data sharing. >> Since you just announced the news, what's been the reaction? >> I just announced the news right now, so what's the reaction? >> But people in the industry know you did a lot of work in the area. What have been some of the feedback on the new venture in terms of the approach, the customers, problem? >> Yeah, so we've been in stealth mode so we haven't publicly talked about it, but folks that have been close to us, in fact have reached that we already have implementations of our pilot platform with early customers, which is super exciting. And we going to have multiple of those. Of course, we're a tiny, tiny company. We can have many of those, but we are going to have multiple pilot implementations of our platform in real world where real global large scale organizations that have real world problems. So we're not going to build our platform in vacuum. 
And that's what's happening right now. >> Zhamak, when I think about your role at ThoughtWorks, you had a very wide observation space with a number of clients, helping them implement data mesh and other things as well prior to your data mesh initiative. But when I look at data mesh, at least the ones that I've seen, they're very narrow. I think of JPMC, I think of HelloFresh. They're generally, obviously not surprising, they don't include the big vision of inclusivity across clouds, across different data storage. But it seems like people are having to go through some gymnastics to get to the organizational reality of decentralizing data and at least pushing data ownership to the line of business. How are you approaching, or are you approaching solving that problem? Are you taking a narrow slice? What can you tell us about Nextdata? >> Yeah, absolutely. Gymnastics, the cute word to describe what the organizations have to go through. And one of those problems is that the data as you know resides on different platforms, it's owned by different people, is processed by pipelines that who knows who owns them. So there's this very disparate and disconnected set of technologies that were very useful for when we thought about data and processing as a centralized problem. But when you think about data as a decentralized problem the cost of integration of these technologies in a cohesive developer experience is what's missing. And we want to focus on that cohesive end-to-end developer experience to share data responsibly in these autonomous units. We call them data products, I guess in data mesh. That constitutes computation. That governs that data policies, discoverability. So I guess, I heard this expression in the last talks that you can have your cake and eat it too. So we want people have their cakes, which is data in different places, decentralization, and eat it too, which is interconnected access to it. So we start with standardizing and codifying this idea of a data product container that encapsulates data computation APIs to get to it in a technology agnostic way, in an open way. And then sit on top and use existing tech, Snowflake, Databricks, whatever exists, the millions of dollars of investments that companies have made, sit on top of those but create this cohesive, integrated experience where data product is a first class primitive. And that's really key here. The language and the modeling that we use is really native to data mesh, which is that I'm building a data product I'm sharing a data product, and that encapsulates I'm providing metadata about this. I'm providing computation that's constantly changing the data. I'm providing the API for that. So we we're trying to kind of codify and create a new developer experience based on that. And developer, both from provider side and user side, connected to peer-to-peer data sharing with data product as a primitive first class concept. >> So the idea would be developers would build applications leveraging those data products, which are discoverable and governed. Now today you see some companies, take a Snowflake for example, attempting to do that within their own little walled garden. They even at one point used the term mesh. I don't know if they pull back on that. And then they became aware of some of your work. But a lot of the things that they're doing within their little insulated environment support that governance, they're building out an ecosystem. What's different in your vision? >> Exactly. 
So we realized that, and this is a reality, like you go to organizations, they have a Snowflake and half of the organization happily operates on Snowflake. And on the other half, "oh, we are on Bare infrastructure on AWS or we are on Databricks." This is the reality. This supercloud that's written up here, it's about working across boundaries of technology. So we try to embrace that. And even for our own technology with the way we're building it, we say, "Okay, nobody's going to use Nextdata, data mesh operating system. People will have different platforms." So you have to build with openness in mind and in case of Snowflake, I think, they have very, I'm sure very happy customers as long as customers can be on Snowflake. But once you cross that boundary of platforms then that becomes a problem. And we try to keep that in mind in our solution. >> So it's worth reviewing that basically the concept of data mesh is that whether you're a data lake or a data warehouse, an S3 bucket, an Oracle database as well, they should be inclusive inside of the data. >> We did a session with AWS on the startup showcase, data as code. And remember I wrote a blog post in 2007 called "Data as the New Developer Kit" back then we used to call them developer kits if you remember. And that we said at that time, whoever can code data will have a competitive advantage. >> Aren't the machines going to be doing that? Didn't we just hear that? >> Well, we have. Hey, Siri. Hey, Cube, find me that best video for data mesh. There it is. But this is the point, like what's happening is that now data has to be addressable. for machines and for coding because as you need to call the data. So the question is how do you manage the complexity of big things as promiscuous as possible, making it available, as well as then governing it? Because it's a trade off. The more you make open, the better the machine learning. But yet the governance issue, so this is the, you need an OS to handle this maybe. >> Yes. So yes, well we call, our mental model for our platform is an OS operating system. Operating systems have shown us how you can abstract what's complex and take care of a lot of complexities, but yet provide an open and dynamic enough interface. So we think about it that way. Just, we try to solve the problem of policies live with the data, an enforcement of the policies happens at the most granular level, which is in this concept of the data product. And that would happen whether you read, write or access a data product. But we can never imagine what are these policies could be. So our thinking is we should have a policy, open policy framework that can allow organizations write their own policy drivers and policy definitions and encode it and encapsulated in this data product container. But I'm not going to fool myself to say that, that's going to solve the problem that you just described. I think we are in this, I don't know, if I look into my crystal ball, what I think might happen is that right now the primitives that we work with to train machine learning model are still bits and bytes and data. They're fields, rows, columns and that creates quite a large surface area and attack area for privacy of the data. So perhaps one of the trends that we might see is this evolution of data APIs to become more and more computational aware to bring the compute to the data to reduce that surface area. So you can really leave the control of the data to the sovereign owners of that data. So that data product. 
So I think that evolution of our data APIs perhaps will become more and more computational. So you describe what you want and the data owner decides how to manage. >> That's interesting, Dave, 'cause it's almost like we just talked about ChatGPT in the last segment we had with you. It was a machine learning have been around the industry. It's almost as if you're starting to see reason come into, the data reasoning is like starting to see not just metadata. Using the data to reason so that you don't have to expose the raw data. So almost like a, I won't say curation layer, but an intelligence layer. >> Zhamak: Exactly. >> Can you share your vision on that? 'Cause that seems to be where the dots are connecting. >> Yes, perhaps further into the future because just from where we stand, we have to create still that bridge of familiarity between that future and present. So we are still in that bridge making mode. However, by just the basic notion of saying, "I'm going to put an API in front of my data." And that API today might be as primitive as a level of indirection, as in you tell me what you want, tell me who you are, let me go process that, all the policies and lineage and insert all of this intelligence that need to happen. And then today, I will still give you a file. But by just defining that API and standardizing it now we have this amazing extension point that we can say, "Well, the next revision of this API, you not just tell me who you are, but you actually tell me what intelligence you're after. What's a logic that I need to go and now compute on your API?" And you can evolve that. Now you have a point of evolution to this very futuristic, I guess, future where you just described the question that you're asking from the ChatGPT. >> Well, this is the supercloud, go ahead, Dave. >> I have a question from a fan, I got to get it in. It's George Gilbert. And so his question is, you're blowing away the way we synchronize data from operational systems to the data stack to applications. So the concern that he has and he wants your feedback on this, is the data product app devs get exposed to more complexity with respect to moving data between data products or maybe it's attributes between data products? How do you respond to that? How do you see? Is that a problem? Is that something that is overstated or do you have an answer for that? >> Absolutely. So I think there's a sweet spot in getting data developers, data product developers closer to the app, but yet not overburdening them with the complexity of the application and application logic and yet reducing their cognitive load by localizing what they need to know about, which is that domain where they're operating within. Because what's happening right now? What's happening right now is that data engineers with, a ton of empathy for them for their high threshold of pain that they can deal with, they have been centralized, they've put into the data team, and they have been given this unbelievable task of make meaning out of data, put semantic over it, curate it, cleans it, and so on. So what we are saying is that get those folks embedded into the domain closer to the application developers. These are still separately moving units. Your app and your data products are independent, but yet tightly closed with each other, tightly coupled with each other based on the context of the domain. 
So reduce cognitive load by localizing what they need to know about to the domain, get them closer to the application, but yet have them separate from app because app provides a very different service. Transactional data for my e-commerce transaction. Data product provides a very different service. Longitudinal data for the variety of this intelligent analysis that I can do on the data. But yet it's all within the domain of e-commerce or sales or whatnot. >> It's a lot of decoupling and coupling create that cohesiveness architecture. So I have to ask you, this is an interesting question 'cause it came up on theCUBE all last year. Back on the old server data center days and cloud, SRE, Google coined the term, site reliability engineer, for someone to look over the hundreds of thousands of servers. We asked the question to data engineering community who have been suffering, by the way, I agree. Is there an SRE like role for data? Because in a way data engineering, that platform engineer, they are like the SRE for data. In other words managing the large scale to enable automation and cell service. What's your thoughts and reaction to that? >> Yes, exactly. So maybe we go through that history of how SRE came to be. So we had the first DevOps movement, which was remove the wall between dev and ops and bring them together. So you have one unit of one cross-functional units of the organization that's responsible for you build it, you run it. So then there is no, I'm going to just shoot my application over the wall for somebody else to manage it. So we did that and then we said, okay, there is a ton, as we decentralized and had these many microservices running around, we had to create a layer that abstracted a lot of the complexity around running now a lot or monitoring, observing, and running a lot while giving autonomy to this cross-functional team. And that's where the SRE, a new generation of engineers came to exist. So I think if I just look at. >> Hence, Kubernetes. >> Hence, hence, exactly. Hence, chaos engineering. Hence, embracing the complexity and messiness. And putting engineering discipline to embrace that and yet give a cohesive and high integrity experience of those systems. So I think if we look at that evolution, perhaps something like that is happening by bringing data and apps closer and make them these domain-oriented data product teams or domain-oriented cross-functional teams full stop and still have a very advanced maybe at the platform level, infrastructure level operational team that they're not busy doing two jobs, which is taking care of domains and the infrastructure, but they're building infrastructure that is embracing that complexity, interconnectivity of this data process. >> So you see similarities? >> I see, absolutely. But I feel like we're probably in a more early days of that movement. >> So it's a data DevOps kind of thing happening where scales happening. It's good things are happening, yet a little bit fast and loose with some complexities to clean up. >> Yes. This is a different restructure. As you said, the job of this industry as a whole, an architect, is decompose recompose, decompose recompose in new way and now we're like decomposing centralized team, recomposing them as domains. >> So is data mesh the killer app for supercloud? >> You had to do this to me. >> Sorry, I couldn't resist. >> I know. Of course you want me to say this. >> Yes. >> Yes, of course. 
I mean, supercloud, I think it's really the terminology, supercloud, open cloud. But in the spirit of it, this embracing of diversity, giving people autonomy to make the decisions that are right for them, without locking them in, I think just embracing that is baked into how data mesh assumes the world would work. >> Well, thank you so much for coming on Supercloud 2. We really appreciate it. Data has driven this conversation. Your success with data mesh has really opened up the conversation and exposed the slow-moving data industry. >> Dave: Been a great catalyst. >> It's moving now; we can move faster. So thanks for coming on. >> Thank you for hosting me. It was wonderful. >> Supercloud 2, live here in Palo Alto, our stage performance. I'm John Furrier with Dave Vellante. We'll be back with more after this short break. Stay with us all day for Supercloud 2. (upbeat music)

Published Date : Jan 25 2023


ENTITIES

Entity | Category | Confidence
Dave Vellante | PERSON | 0.99+
Dave | PERSON | 0.99+
AWS | ORGANIZATION | 0.99+
2007 | DATE | 0.99+
George Gilbert | PERSON | 0.99+
Zhamak Dehghani | PERSON | 0.99+
Nextdata | ORGANIZATION | 0.99+
Zhamak | PERSON | 0.99+
Palo Alto | LOCATION | 0.99+
Google | ORGANIZATION | 0.99+
John Furrier | PERSON | 0.99+
one | QUANTITY | 0.99+
Nextdata.com | ORGANIZATION | 0.99+
two jobs | QUANTITY | 0.99+
JPMC | ORGANIZATION | 0.99+
today | DATE | 0.99+
HelloFresh | ORGANIZATION | 0.99+
ThoughtWorks | ORGANIZATION | 0.99+
last year | DATE | 0.99+
Supercloud 2 | EVENT | 0.99+
Oracle | ORGANIZATION | 0.98+
first | QUANTITY | 0.98+
Siri | TITLE | 0.98+
Cube | PERSON | 0.98+
Databricks | ORGANIZATION | 0.98+
Snowflake | ORGANIZATION | 0.97+
Supercloud | ORGANIZATION | 0.97+
both | QUANTITY | 0.97+
one unit | QUANTITY | 0.97+
Snowflake | TITLE | 0.96+
SRE | TITLE | 0.95+
millions of dollars | QUANTITY | 0.94+
first class | QUANTITY | 0.94+
hundreds of thousands of servers | QUANTITY | 0.92+
supercloud | ORGANIZATION | 0.92+
one point | QUANTITY | 0.92+
Supercloud 2 | TITLE | 0.89+
ChatGPT | ORGANIZATION | 0.81+
half | QUANTITY | 0.81+
Data Mesh the Next Killer App | TITLE | 0.78+
supercloud | TITLE | 0.75+
a ton | QUANTITY | 0.73+
Supercloud 2 | ORGANIZATION | 0.72+
SiliconANGLE | ORGANIZATION | 0.7+
DevOps | TITLE | 0.66+
Snowflake | EVENT | 0.59+
S3 | TITLE | 0.54+
last | DATE | 0.54+
supercloud | EVENT | 0.48+
Kubernetes | TITLE | 0.47+

Breaking Analysis: Technology & Architectural Considerations for Data Mesh


 

>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. >> The introduction and socialization of data mesh has caused practitioners, business technology executives, and technologists to pause and ask some probing questions about the organization of their data teams, their data strategies, future investments, and their current architectural approaches. Some in the technology community have embraced the concept, others have twisted the definition, while still others remain oblivious to the momentum building around data mesh. Here we are in the early days of data mesh adoption. Organizations that have taken the plunge will tell you that aligning stakeholders is a non-trivial effort, but necessary to break through the limitations that monolithic data architectures and highly specialized teams have imposed on frustrated business and domain leaders. However, practical data mesh examples often lie in the eyes of the implementer, and may not strictly adhere to the principles of data mesh. Now, part of the problem is a lack of open technologies and standards that can accelerate adoption and reduce friction, and that's what we're going to talk about today: some of the key technology and architecture questions around data mesh. Hello, and welcome to this week's Wikibon CUBE Insights powered by ETR. In this Breaking Analysis, we welcome back the founder of data mesh and director of Emerging Technologies at Thoughtworks, Zhamak Dehghani. Hello, Zhamak. Thanks for being here today. >> Hi Dave, thank you for having me back. It's always a delight to connect and have a conversation. Thank you. >> Great, looking forward to it. Okay, so before we get into the technology details, I just want to quickly share some data from our friends at ETR. You know, despite the importance of data initiatives since the pandemic, CIOs and IT organizations have had to juggle, of course, a few other priorities. This is why, in the survey data, cyber and cloud computing are rated as the two most important priorities. Analytics, machine learning, and AI, which are kind of data topics, still make the top of the list, well ahead of many other categories. And look, a sound data architecture and strategy is fundamental to digital transformations, and much of the past two years, as we've often said, has been like a forced march into digital. So while organizations are moving forward, they really have to think hard about the data architecture decisions they make, because those decisions are going to impact them, Zhamak, for years to come, aren't they? >> Yes, absolutely. I mean, we are slowly moving from reason-based, logical, algorithmic computation to model-based computation and decision making, where we exploit the patterns and signals within the data. So data becomes a very important ingredient, not only of decision making, analytics, and discovering trends, but also of the features and applications that we build for the future. So we can't really ignore it. And as we see, the existing challenge around getting value from data is no longer access to computation; it's access to trustworthy, reliable data at scale. >> Yeah, and you see these domains coming together with the cloud, and obviously it has to be secure and trusted, and that's why we're here today talking about data mesh. So let's get into it.
Zhamak, first, your new book is out, 'Data Mesh: Delivering Data-Driven Value at Scale,' just recently published, so congratulations on getting that done. Awesome. Now, in a recent presentation you pulled excerpts from the book, and we're going to talk through some of the technology and architectural considerations. Just quickly, for the audience, the four principles of data mesh: domain-driven ownership, data as a product, the self-serve data platform, and federated computational governance. So I want to start with the self-serve platform and some of the data that you shared recently. You say that data mesh serves autonomous, domain-oriented teams, versus existing platforms, which serve a centralized team. Can you elaborate? >> Sure. I mean, the role of the platform is to lower the cognitive load for domain teams, for the people who are focusing on the business outcomes and the technologists who are building the applications, so they can work with data. Whether they are building analytics, automated decision making, or intelligent models, they need to be able to get access to data and use it. So the role of the platform, just stepping back for a moment, is to empower and enable these teams. Data mesh, by definition, is a scale-out model. It's a decentralized model that wants to give autonomy to cross-functional teams, so at its core it requires a set of tools that work really well in that decentralized model. When we look at the existing platforms, they try to achieve a similar outcome, right? Lower the cognitive load, give the tools to data practitioners to manage data at scale. But today the centralized data teams' job isn't really directly aligned with any one or two, you know, particular business units and business outcomes in terms of getting value from data. Their job is to manage the data and make it available for those cross-functional teams or business units to use. So the platforms they've been given are really centralized around, or tuned to work with, that centralized team structure. Although on the surface it seems: why not? Why can't I use my, you know, cloud storage or computation or data warehouse in a decentralized way? You should be able to, but some changes need to happen to those platforms. As an example, some cloud providers simply have hard limits on the number of storage accounts you can have, because they never envisaged you would have hundreds of lakes. They envisaged one or two, maybe ten lakes, right? They envisaged really centralizing data, not decentralizing data. So I think we see a shift in thinking about enabling autonomous, independent teams versus a centralized team. >> So just a follow-up, if I may; we could be here for a while. This assumes that you've sorted out the organizational considerations, that you've defined what a data product is, and a sub-product. And people will say, and of course we use the term monolithic as a pejorative, let's face it, the data warehouse crowd will say, "Well, that's what data marts did, so we've got that covered." But the premise of data mesh, if I understand it, is that whether it's a data mart or a data warehouse or a data lake or whatever, a Snowflake warehouse, it's a node on the mesh. Okay? So don't build your organization around the technology; let the technology serve the organization. Is that it? >> That's a perfect way of putting it, exactly.
I mean, for a very long time, when we've looked at decomposing complexity, we've decomposed it around technology, right? And that's maybe a good segue to the next item on that list. Oh, I need to decompose based on whether I want access to raw data, and put it on the lake; whether I want access to modeled data, and put it on the warehouse; you know, I need a team in the middle to move the data around. And then we try to fit the organization into that model. Data mesh really inverts that. As you said: look at the organizational structure first, the scale boundaries around which your organization and operation can scale, and then, as a second layer, look at the technology and how you decompose it. >> Okay. So let's go to that next point and talk about how you serve and manage autonomous, interoperable data products, where code, data, and policy, you say, are treated as one unit. Whereas your contention is that existing platforms have independent management and dashboards for catalogs, storage, et cetera. Maybe we double-click on that a bit. >> Yeah. So that functional or technical decomposition of concerns is one way, a very valid way, of decomposing complexity, and then building independent solutions to address each concern. That's what we see in the technology landscape today. You see technologies that take care of managing your data, bringing your data under some sort of control and modeling. You see technology that moves that data around and performs various transformations and computations on it. And then you see technology that tries to overlay some level of meaning: metadata, understandability, discovery, as well as policy, right? So you have your data-processing, pipeline-type technologies, versus your warehouse and lake storage technologies, and then governance comes into play. And over time we decompose and recompose, right? Deconstruct and reconstruct these pieces back together. But right now, that's where we stand. I think for data mesh really to become a reality, as in: independent sources of data, where teams can responsibly share data in a way that can be understood right then and there, where policies can be imposed right at the moment the data gets accessed at the source, and in a resilient manner, such that changes to the structure or schema of the data don't cause downstream downtime, we've got to think about a new nucleus, a new unit, of data sharing. We need to bring transformation and the governing of data back together with the data itself, around these decentralized nodes on the mesh. So that's another deconstruction and reconstruction that needs to happen around the technology, to formulate ourselves around the domains; and again, the data and the logic of the data itself, the meaning of the data itself, together.
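One way to picture the unit Zhamak describes here, data, transformation code, and policy versioned and served together, is a sketch like the following. Everything in it (the DataQuantum name, the policy shape) is a hypothetical construction for illustration, not a reference to any real product.

```python
# A hypothetical "data quantum": data, transformation code, and policy
# deployed and versioned as one unit, rather than in three separate systems.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataQuantum:
    name: str
    schema_version: str
    transform: Callable[[dict], dict]          # the code that shapes the data
    policies: list[Callable[[dict], bool]] = field(default_factory=list)
    _records: list[dict] = field(default_factory=list)

    def ingest(self, raw: dict) -> None:
        record = self.transform(raw)           # transformation travels with the data
        self._records.append(record)

    def read(self) -> list[dict]:
        # Policies are enforced at access time, where the data lives.
        return [r for r in self._records if all(p(r) for p in self.policies)]

orders = DataQuantum(
    name="orders",
    schema_version="2.0",
    transform=lambda raw: {"id": raw["order_id"], "total": float(raw["amount"])},
    policies=[lambda r: r["total"] >= 0],      # e.g., filter corrupt records
)
orders.ingest({"order_id": "a1", "amount": "30.00"})
print(orders.read())
```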
>> Great. Got it. And we're going to talk more about the importance of data sharing and the implications. But the third point deals with how operational and analytical technologies are constructed. You've got an app dev stack, and you've got a data stack. You've made the point many times, actually, that we've contextualized our operational systems but not our data systems; they remain separate. Maybe you could elaborate on this point. >> Yes, this again has a historical background. For a really long time, applications have dealt with features and the logic of running the business, encapsulating the data and the state they need to run that feature or business function. And then, for anything analytically driven, which required access to data across these applications and across a longer dimension of time, around different subjects within the organization, this analytical data, we made a decision: "Okay, let's leave those applications aside. Let's leave those databases aside. We'll extract the data out, load it or transform it, and put it under the analytical data stack." And downstream from it, the analytical data users, the data analysts, the data scientists, and, you know, the growing portfolio of users, use that data stack. And that led to this real separation of dual stacks with point-to-point integration. So applications went down the path of transactional databases or even document stores, using APIs for communicating, and then we've gone to, you know, lake storage or a data warehouse on the other side. And that, again, enforces the silo of data versus app, right? So if we are moving to a world where our ambitions are around making applications more intelligent, making them data-driven, these two worlds need to come closer. As in: ML and analytics get embedded into those applications themselves, and data sharing, as a very essential ingredient of that, gets embedded and becomes closer to those applications. So if you are looking at this now cross-functional, app-and-data-based team, right, a business team, then the technology stacks can't be so segregated, right? There has to be a continuum of experience from app delivery, to sharing of the data, to using that data, to embedding models back into those applications. And that continuum of experience requires well-integrated technologies. I'll give you an example, and in some sense we are somewhat moving in that direction. If we are talking about data sharing or data modeling, applications use one set of APIs, you know, HTTP-compliant GraphQL or REST APIs. And on the other hand, you have proprietary SQL: connect to my database and run SQL. Those are two very different models of representing and accessing data. So we kind of have to harmonize, or integrate, those two worlds a bit more closely to achieve those domain-oriented cross-functional teams. >> Yeah. We're going to talk about some of the gaps later, and actually you look at them as opportunities more than barriers. They are barriers, but they're opportunities for more innovation. Let's go on to the fourth one. The next point deals with the roles that the platform serves. Data mesh proposes that domain experts own the data, take responsibility for it end to end, and are served by the technology; we referenced that before. Whereas your contention is that today, data systems are really designed for specialists. I think you use the term hyper-specialists a lot; I love that term. And the generalists are kind of passive bystanders, waiting in line for the technical teams to serve them. >> Yes. I mean, again, the intention behind data mesh was creating a responsible data-sharing model that scales out. And I challenge any organization that has scale ambitions around data, or usage of data, that relies on small pockets of very expensive specialist resources, right?
So we have no choice but upskilling and cross-skilling the majority population of our technologists. We often call them generalists, right? That's shorthand for people who can really move from one technology to another. Sometimes we call them paint-drip people, sometimes T-shaped people. But regardless, we need the ability to really mobilize our generalists. And we had to do that at Thoughtworks. We serve a lot of clients, and like many other organizations we are also challenged with hiring specialists. So we have tested the model of having a few specialists conveying and translating the knowledge to generalists and bringing them forward. And of course, the platform is a big enabler of that. What is the language of using the technology? What are the APIs that delight that generalist experience? This doesn't mean no-code or low-code; we can't throw away good engineering practices. I think good software engineering practices remain; of course, they get adapted to the world of data to build resilient, you know, sustainable solutions. But specialty, especially around proprietary technology, is going to be a hard one to scale. >> Okay. I'm definitely going to come back and pick your brain on that one. And, you know, your point about scale-out: in the practical examples of companies that have implemented data mesh that I've talked to, and there's only a handful that I've really gone deep with, in all cases their Hadoop instances, their clusters, wouldn't scale; they couldn't scale the business around them. So that's really a key point of a common pattern that we've seen. I think in all cases they went to the data lake model on AWS, and so that maybe has some violation of the principles, but we'll come back to that. But let me go on to the next one. Of course, data mesh leans heavily toward this concept of decentralization, to support domain ownership over centralized approaches. And we certainly see the public cloud players and database companies as key actors here, with very large install bases, pushing a centralized approach. So I guess my question is: how realistic is this next point, where you have decentralized technologies ruling the roost? >> I think if you look at the places in the history of our industry where decentralization has succeeded, they heavily relied on standardization of connectivity across, you know, different components of technology. And right now, you are right: the way we get value from data relies on collection. At the end of the day, collection of data. Whether you have a deep learning model that you're training, or you have, you know, reports to generate, regardless, the model is: bring your data to a place where you can collect it, so that we can use it. And that naturally leads to a set of technologies that try to operate as a full-stack, integrated, proprietary system, with no intention of, you know, opening data for sharing. Now, conversely, if you think about the Internet itself, the Web itself, microservices, even at the enterprise level, not the planetary level, they succeeded as decentralized technologies to a large degree because of their emphasis on openness and sharing, right? API sharing. In the API world we don't say, you know, "I will build a platform to manage all of your applications." Maybe to a degree, but we actually moved away from that.
We say, "I'll build a platform that allows you to manage your APIs, manage your interfaces," right? Give you access to the API. So I think the shift needs to happen there: that definition of decentralized really means composable, open pieces of technology that can play nicely with each other, rather than a full stack that has all the control of your data, yet is somewhat decentralized within the boundary of my platform. That's simply not going to scale if data needs to come from different platforms, different locations, different geographies. It needs a rethink. >> Okay, thank you. And then the final point is that data mesh favors technologies that are domain agnostic versus those that are domain aware. And I wonder if you could help me square the circle, because it's nuanced, and I'm kind of a 100-level student of your work. But you have said, for example, that the data teams lack context of the domain; so help us understand what you mean here. >> Sure, absolutely. So as you said, data mesh tries to give autonomy, decision-making power, and responsibility to the people who have the context of those domains, right? The people who are really familiar with the different business domains, and naturally with the data that domain needs, or the data that domain shares. So if the intention of the platform is really to give the power to the people with the most relevant and timely context, the platform itself, as a shared component, naturally becomes domain agnostic to a large degree. Of course, those domains can still... "Platform" is a (chuckles) fairly overloaded word. If you think about it as a set of technology that abstracts complexity and allows building the next level of solutions on top, those domains may have their own set of platforms that are very much domain aware. But as a generalized, shareable set of technologies or tools that allows us to share data, that piece of technology needs to relinquish the knowledge of the context to the domain teams, and it becomes domain agnostic. >> Got it. Okay, makes sense. All right, let's shift gears here and talk about some of the gaps and some of the standards that are needed. You and I have talked about this a little bit before, but this digs deeper. What types of standards are needed? Maybe you could walk us through this graphic, please. >> Sure. So what I'm trying to depict here is that if we imagine a world where data can be shared from many different locations, for a variety of analytical use cases, naturally the boundary of what we call a node on the mesh encapsulates quite a few pieces internally. It's not just the data itself that the node is controlling, updating, and maintaining; it's of course the computation and the code responsible for that data, and then the policies that continue to govern that data as long as it exists. So if that's the boundary, and we shift the focus from implementation details, we can leave those for later, then what becomes really important are the seams: the APIs and interfaces that this node exposes. And I think that's where the work needs to be done and where the standards are missing. And we want those seams and interfaces to be open, because that's what allows, you know, different organizations with different boundaries of trust to share data.
Not only to share data by moving it to, yes, another location, but to share data in a way that distributed workloads, distributed analytics, distributed machine learning models, can happen on the data where it is. So if you follow that line of thinking, around decentralization and connection to data versus collection of data, I think the very, very important piece that needs really deep thinking, and I don't claim that I have done that, is: how do we share data responsibly and sustainably, right? In a way that is not brittle. If you think about it, today one of the very common ways we share data is: I'll give you a JDBC endpoint, or an endpoint to your, you know, database of choice. And now, as a user, you have access to the schema of the underlying data, and you can run various SQL queries on it. That's very simple and easy to get started with, which is why SQL is an evergreen, you know, standard or semi-standard, pseudo-standard that we all use. But it's also very brittle, because we are dependent on the underlying schema and formatting of the data, which was designed to tell the computer how to store and manage the data. So I think the data-sharing APIs of the future really need to think about removing those brittle dependencies, and about sharing not only the data but what we call metadata, I suppose: an additional set of characteristics that is always shared along with the data, to make the data's usage, I suppose, ethical and also friendly for the users. And the other element of that data-sharing API is to allow computation to run where the data exists. If you think about SQL again, as a simple, primitive example of computation: when we select, and when we filter, and when we join, the computation is happening on that data. So maybe there is a next level of articulating distributed computation on data, one that simply trains models, right? Your language primitives change in a way that allows sophisticated analytical workloads to run on the data more responsibly, with policies and access control enforced. So I think the output port I mentioned is simply about next-generation, responsible data-sharing APIs, suitable for decentralized analytical workloads.
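A rough sketch of what such an output port might look like, assuming a versioned contract plus metadata shared along with the data. The names (OrdersOutputPort, PortMetadata) and the metadata fields are this sketch's assumptions, not any standard.

```python
# Hypothetical "output port": consumers bind to a versioned, documented
# contract instead of the owner's internal schema, so internal storage can
# change without breaking downstream users.
from dataclasses import dataclass
from typing import Iterator

# Internal storage: free to change shape at any time.
_internal_rows = [("a1", 3000, "US-W"), ("a2", 4500, "US-E")]

@dataclass(frozen=True)
class PortMetadata:
    schema_version: str
    fields: dict            # field name -> documented meaning
    freshness_sla: str

class OrdersOutputPort:
    metadata = PortMetadata(
        schema_version="1.2.0",
        fields={"order_id": "public order identifier",
                "total_usd": "order total in dollars"},
        freshness_sla="updated hourly",
    )

    def read(self) -> Iterator[dict]:
        # This mapping layer absorbs internal schema changes: if the tuple
        # layout above changes, only this adapter is rewritten.
        for order_id, cents, _region in _internal_rows:
            yield {"order_id": order_id, "total_usd": cents / 100}

port = OrdersOutputPort()
print(port.metadata)        # metadata travels with the data
print(list(port.read()))
```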
>> So I'm not trying to bait you here, but I have a follow-up as well. Schema, for all its good, creates constraints. No schema-on-write didn't work either, 'cause it was just a free-for-all and it created the data swamps. But now you have technology companies trying to solve that problem. Take Snowflake, for example, enabling data sharing, but within its proprietary environment. Certainly Databricks is doing something, trying to come at it from its angle, bringing some of the best of the data warehouse together with data science. Is your contention that those remain sort of proprietary, de facto standards, and that what we need is more open standards? Maybe you could comment. >> Sure. I think there are two points. One is, as you mentioned, open standards that actually make the underlying platform invisible. I mean, my litmus test for a technology provider saying "I'm data mesh (laughs) compliant" is: is your platform invisible? As in, can I replace it with another and yet get the similar data-sharing experience that I need? So part of it is open standards, not really proprietary ones. The other angle, for sharing data across different platforms so that, you know, we don't get stuck with one technology or another, is around APIs: around code that protects that internal schema. Where we are on the curve of technology evolution, right now we are exposing the internal structure of the data, structure designed to optimize certain modes of access, to the end client and application APIs, right? So the APIs that use the data today are very much aware that this database was optimized for machine learning workloads, hence you will deal with columnar storage of the file, while this other API is optimized for a very different, report-type, relational access, and is organized around rows. I think that should become irrelevant in the API sharing of the future, because as a user I shouldn't care how this data is internally optimized, right? The language primitive that I'm using should be really agnostic to the machine optimization underneath. And if we did that, perhaps this war between the warehouse and the lake and the rest would become actually irrelevant. So we're optimizing for the best human experience, as opposed to the best machine experience. We still have to do the latter, but we have to make it invisible, make it an implementation concern. So that's the other angle: if we daydream together, the best and most resilient experience in terms of data usage would be APIs that are agnostic to the internal storage structure.
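The storage-agnostic primitive Zhamak daydreams about can be illustrated with a toy example: one read function, two physical layouts, a row store (SQLite) and a columnar layout, with callers unable to tell which one served them. This is the editor's construction under those assumptions, not any vendor's API.

```python
# A toy illustration of a storage-agnostic read primitive: the caller says
# *what* it wants; whether the owner stores rows or columns is an
# implementation concern hidden behind one call.
import sqlite3

def read_totals_row_store() -> list[float]:
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id TEXT, total REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [("a1", 30.0), ("a2", 45.0)])
    return [t for (t,) in con.execute("SELECT total FROM orders")]

def read_totals_column_store() -> list[float]:
    columns = {"id": ["a1", "a2"], "total": [30.0, 45.0]}
    return list(columns["total"])

def read_totals(port: str) -> list[float]:
    # One primitive, two physical layouts; callers never know which ran.
    impl = {"rows": read_totals_row_store, "columns": read_totals_column_store}
    return impl[port]()

assert read_totals("rows") == read_totals("columns")
```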
>> Great, thank you for that. We've wrapped our ankles now on the controversy, so we might as well wade all the way in; I can't let you go without addressing some of this, which you've catalyzed, and which, by the way, I see as a sign of progress. So this gentleman, Paul Andrew, is an architect, and he gave a presentation, I think last night. And he teased it as, quote, "The theory from Zhamak Dehghani versus the practical experience of a technical architect, AKA me," meaning him. And Zhamak, you were quick to shoot back that data mesh is not theory, it's based on practice, that some practices are experimental and some are more baked, and that data mesh avoids, by design, the specificity of vendor or technology: "Perhaps you intend to frame your post as a technology- or vendor-specific implementation." So touche, that was excellent. (Zhamak laughs) Now, you don't need me to defend you, but I will anyway. You spent 14-plus years as a software engineer and the better part of a decade consulting with some of the most technically advanced companies in the world. But I'm going to push you a little bit here and say that some of this tension is of your own making, because you purposefully don't talk about technologies and vendors, and sometimes doing so is instructive for us neophytes. So why don't you ever use specific examples of technology as frames of reference? >> Yes. My role is to push us to the next level. You know, everybody picks their battles; my role in this battle is to push us to think beyond what's available today. Of course, that's my public persona. On a day-to-day basis I actually work with clients and existing technology, and at Thoughtworks we gave a case study talk, with a colleague of mine, and I intentionally got him to talk about (indistinct) the technology that we used to implement data mesh. And the reason I haven't really embraced specific technologies in my conversations: one is, I feel the technology solutions we're using today are still not ready for the vision. I mean, we have to be in this transitional step; no matter what, we have to be pragmatic, of course, and practical, I suppose, and use the existing vendors that exist, and I wholeheartedly embrace that, but that's just not my role, to showcase that. I've gone through this transformation once before in my life. When microservices happened, we were building microservices-like architectures with technology that wasn't ready for it: big web application servers that were designed to run giant monolithic applications, and now we were trying to run little microservices on them. And the tail was wagging the dog. The environmental complexity of running those services consumed so much of our effort that we couldn't really pay attention to the business logic, the business value. And that's where we are today: the complexity of integrating existing technologies is overwhelmingly capturing a lot of our attention, cost, and effort, money and effort, as opposed to really focusing on the data products themselves. So that's just the role I have. But it doesn't mean that, you know, we have to rebuild the world. We've got to do with what we have in this transitional phase, until a new generation of technologies comes around and reshapes our landscape of tools. >> Well, impressive public discipline. Your point about microservices is interesting, because a lot of those early microservices weren't so micro, and for the naysayers: past is not prologue. Thoughtworks was really early on in the whole concept of microservices, so I'll be very excited to see how this plays out. But now, there were some other good comments. There was one from a gentleman who said the most interesting aspects of data mesh are organizational, and that's how my colleague Sanjeev Mohan frames data mesh versus data fabric. You know, I'm not sure; I think we've only scratched the surface today, and data mesh is more than that. I still think data fabric is what NetApp defined as a software-defined storage infrastructure that can serve on-prem and public cloud workloads, back in, whatever, 2016. But the point you make in the thread we're showing here is a warning, and you referenced this earlier: that segregating different modes of access will lead to fragmentation, and we don't want to repeat the mistakes of the past. >> Yes, there are comments around that. Again, going back to that original conversation: at a macro level, we've got this tendency to decompose complexity based on technical solutions. And, you know, the conversation becomes "Oh, I do batch" or "you do streaming, and we are different." We create these bifurcations in our decisions based on the technology: I do events and you do tables, right? So that sort of segregation of modes of access causes accidental complexity that we keep dealing with, because every time you create a new branch in this tree, you create a new set of tools that then somehow need to be point-to-point integrated, and you create new specialization around it. So the fewer branches we have, the better; think about the continuum of experiences that we need to create, and about technologies that simplify that continuum of experience. One of the things I can give you, for example, is a past experience.
I was really excited about the papers and the work that came out around Apache Beam, and generally around flow-based programming and stream processing, because basically they were saying: whether you are doing batch or whether you're doing streaming, it's all one stream. Sometimes the window of time over which you're computing narrows, and sometimes it widens, but at the end of the day you're just doing stream processing. It is those sorts of notions, the ones that simplify and create a continuum of experience, that resonate with me personally, more than creating these tribal fights of this type versus that mode of access. So that's why data mesh naturally selects this kind of multimodal access to support end users, right? The personas of end users.
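For readers unfamiliar with the Beam model she references, here is a small Apache Beam pipeline in Python that treats a bounded, in-memory source exactly as it would a stream: the same per-key sum runs under any window width, and only the window changes. It assumes the apache-beam package is installed; the event data is made up for illustration.

```python
# "Batch and streaming are one model": the same pipeline code sums order
# totals per key; widening or narrowing FixedWindows changes the grouping.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# (key, amount, event-time seconds)
events = [("west", 30.0, 0), ("west", 45.0, 30), ("east", 10.0, 70)]

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     | beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
     | beam.WindowInto(FixedWindows(60))       # the window width is the only knob
     | beam.CombinePerKey(sum)
     | beam.Map(print))                        # e.g. ('west', 75.0) then ('east', 10.0)
```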
>> Okay, so the last topic I want to hit: this whole discussion, the topic of data mesh, is highly nuanced, it's new, and people are going to shoehorn data mesh into their respective views of the world. We talked about lakehouses, and there are three buckets. And of course the gentleman from LinkedIn; with Azure, Microsoft has a data mesh community. You're going to have to enlist a serious army of enforcers to adjudicate. And I wrote some of this stuff down; I mean, it's interesting: Monte Carlo has a data mesh calculator, Starburst is leaning in, ChaosSearch sees themselves as an enabler, Oracle and Snowflake both use the term data mesh. And then of course you've got big practitioners: JPMC; we've talked to Intuit, Zalando; HelloFresh has been on; Netflix has this event-based, sort of streaming implementation. So my question is: how realistic is it that the clarity of your vision can be implemented and not polluted by really rich technology companies and others? (Zhamak laughs) >> Is it even possible, right? Is it even possible? That's why I practice; this is why I should practice things, because I think it's going to be hard. What I'm hopeful about is that the socio-technical framing, as I mentioned, that this is a socio-technical concern or solution, not just a technology solution, hopefully always brings us back to, you know, reality, because vendors will try to sell you snake oil that solves all of your problems (chuckles), all of your data mesh problems, and it's just going to cause more problems down the track. So we'll see; time will tell, Dave, and I count on you as one of those (laughs) folks who will continue to share their platform, to go back to the roots, the "why" in the first place. I mean, I dedicated a whole part of the book to "why," because, as you said, we get carried away with vendors and technology solutions trying to ride a wave, and in that story we forget the reason we're even making this change and spending all of these resources. So hopefully we can always come back to that. >> Yeah, and I think we can. I think you have really given this some deep thought, and as we pointed out, this was based on practical knowledge and experience. And look, we've been trying to solve this data problem for a long, long time. You've not only articulated it well, but you've come up with solutions. So Zhamak, thank you so much. We're going to leave it there, and I'd love to have you back. >> Thank you for the conversation. I really enjoyed it. And thank you for sharing your platform to talk about data mesh. >> Yeah, you bet. All right, and I want to thank my colleague Stephanie Chan, who helps research topics for us. Alex Myerson is on production, and Kristen Martin, Cheryl Knight, and Rob Hoff are on editorial. Remember, all these episodes are available as podcasts wherever you listen; all you've got to do is search Breaking Analysis Podcast. Check out ETR's website at etr.ai for all the data, and we publish a full report every week on wikibon.com and siliconangle.com. You can reach me by email at david.vellante@siliconangle.com or DM me @dvellante, or hit us up on our LinkedIn posts. This is Dave Vellante for theCUBE Insights powered by ETR. Have a great week, stay safe, be well, and we'll see you next time. (bright music)

Published Date : Apr 20 2022


ENTITIES

Entity | Category | Confidence
Kristen Martin | PERSON | 0.99+
Rob Hoff | PERSON | 0.99+
Cheryl Knight | PERSON | 0.99+
Stephanie Chan | PERSON | 0.99+
Alex Myerson | PERSON | 0.99+
Dave | PERSON | 0.99+
Zhamak | PERSON | 0.99+
one | QUANTITY | 0.99+
Dave Vellante | PERSON | 0.99+
AWS | ORGANIZATION | 0.99+
10 lakes | QUANTITY | 0.99+
Sanji Mohan | PERSON | 0.99+
Microsoft | ORGANIZATION | 0.99+
Paul Andrew | PERSON | 0.99+
two | QUANTITY | 0.99+
Netflix | ORGANIZATION | 0.99+
Zhamak Dehghani | PERSON | 0.99+
Data Mesh: Delivering Data-Driven Value at Scale | TITLE | 0.99+
Boston | LOCATION | 0.99+
Oracle | ORGANIZATION | 0.99+
14 plus years | QUANTITY | 0.99+
Palo Alto | LOCATION | 0.99+
two points | QUANTITY | 0.99+
siliconangle.com | OTHER | 0.99+
second layer | QUANTITY | 0.99+
2016 | DATE | 0.99+
LinkedIn | ORGANIZATION | 0.99+
today | DATE | 0.99+
Snowflake | ORGANIZATION | 0.99+
hundreds of lakes | QUANTITY | 0.99+
theCUBE | ORGANIZATION | 0.99+
david.vellante@siliconangle.com | OTHER | 0.99+
theCUBE Studios | ORGANIZATION | 0.98+
SQL | TITLE | 0.98+
one unit | QUANTITY | 0.98+
first | QUANTITY | 0.98+
100 level | QUANTITY | 0.98+
third point | QUANTITY | 0.98+
Databricks | ORGANIZATION | 0.98+
Europe | LOCATION | 0.98+
three buckets | QUANTITY | 0.98+
ETR | ORGANIZATION | 0.98+
DevStack | TITLE | 0.97+
One | QUANTITY | 0.97+
wikibon.com | OTHER | 0.97+
both | QUANTITY | 0.97+
Thoughtworks | ORGANIZATION | 0.96+
one set | QUANTITY | 0.96+
one stream | QUANTITY | 0.96+
Intuit | ORGANIZATION | 0.95+
one way | QUANTITY | 0.93+
two worlds | QUANTITY | 0.93+
HelloFresh | ORGANIZATION | 0.93+
this week | DATE | 0.93+
last night | DATE | 0.91+
fourth one | QUANTITY | 0.91+
Snowflake | TITLE | 0.91+
two different models | QUANTITY | 0.91+
ML Analytics | TITLE | 0.91+
Breaking Analysis | TITLE | 0.87+
two worlds | QUANTITY | 0.84+

Saket Saurabh, Next | AWS Startup Showcase S2 E2


 

[Music] >> Welcome, everyone, to theCUBE's presentation of the AWS Startup Showcase: Data as Code. This is season two, episode two of our ongoing series covering exciting startups in the AWS ecosystem, talking about data and analytics. I'm your host, Lisa Martin, and I have a CUBE alumni here with me: Saket Saurabh, the CEO and founder of Nexla. He's here to talk about the future of automated data engineering. Saket, welcome back. Great to see you. >> Lisa, thank you for having me. A pleasure to be here again. >> Let's dig into Nexla's mission: ready-to-use data in the hands of every user. What does that mean? >> It means that, you know, every organization, what are they trying to do with data? They want to make use of data, they want to make decisions from data, they want to make data a part of their business, right? The challenge is that every function in an organization today needs to leverage data, whether it is finance, whether it is HR, whether it is marketing, sales, or product. The problem for companies is that for each of these users and each of these teams, the data is not ready for them to use as it is. There is a lot that goes on before the data can be in their hands, in the tools they like to work with. And that's where a lot of data engineering happens today. I would say that is by far one of the biggest bottlenecks today for companies in accelerating their business and being, you know, truly data-driven. >> So talk to me about what makes Nexla unique. When you're in customer conversations, as every company these days, in every industry, has to be a data company, what do you tell them about what differentiates you? >> Yeah, one of the biggest challenges out there is that the variety of data that companies work with is growing tremendously. You know, every SaaS application you use becomes a data source, every type of database, every type of user event; anything can be a source of data now. It is a tremendous engineering challenge for companies to make the data usable, and the biggest challenge there is people: companies just cannot have enough people to write that code, to make the data engineering happen. And where we come in, with a very unique value, is in how to start thinking about making this whole process much faster, much more automated. At the end of the day, Lisa, time to value and time to results is by far the number one thing on top of mind for customers. >> Time to value is critical. We're all thin on patience these days, whether in our consumer or our business lives. Being able to get access to data to make intelligent decisions, whether it's about something you're going to buy or a product or service you're going to deliver, is really critical. Give me a snapshot of some of the users of Nexla. >> Yeah, the users of Nexla are actually across different industries. One of the interesting things is that the data challenges, whether you are in financial services, whether you are in retail and e-commerce, whether you are in healthcare, are very similar: it's basically getting connected to all these data systems and having the data. Now, what people do with the data is very specific to their industry. So, for example, within the retail and e-commerce world, companies like Bed Bath & Beyond and Forever 21 and Poshmark, which are retailers or e-commerce companies, use Nexla today to bring a lot of data in. So do delivery companies like DoorDash and Instacart, and so do, for example, logistics providers like Narwhal, or customer loyalty and customer data companies like Yotpo. So, across the
board; just in retail, for example, we cover a whole bunch of companies. >> Got it. Now let's dig in: you're here to talk about the future of automated data engineering. Talk to me about data engineering: what is it? Let's define it and crack it open. >> Yeah, data engineering is, I would say, by far one of the hottest areas of work today, and data engineers are among the hardest people to hire, if you're looking for one. Data engineering is basically all the code, you know, the process and the people, that connects to these systems. Just to give a very practical example: take the case of DoorDash, right? It's extremely important for them to have data as to which stores have which products and what is available. Is this something they can list for people to go and buy? Is this something they can therefore deliver? This is data that changes all the time. Now imagine them getting data from hundreds of different merchants across the board. So it is the task of data engineering to consume that data from all these different places, different formats, different APIs, different systems, and then somehow unify all the data so that it can be used by the applications they are building. So data engineering, in this case, becomes taking data from different places and making it useful; again, back to what I was talking about: ready-to-use data. It is a lot of code, it's a lot of people, and not just that: it is something that runs every single day, so it has monitoring, it has reliability, it has performance. It has every aspect of engineering, as we know it, going into it. >> You mentioned it's a hot topic, which it is, but it's also really challenging to accomplish. How does Nexla help enable that? >> Yeah, data engineering is quite interesting in that it is difficult to implement, you know, the necessary pieces, but it is also very repetitive at some level, right? I mean, when you connect to, say, ten systems and get data from them, that's not the end of it. You have ten more, and ten more, and at some point you have thousands of such, you know, data connections and data flows happening, and they're hard to maintain as well. So the way Nexla gets into the whole picture is by looking at what we can understand about data, what we can observe about the data systems, and what can be done from that, and then starting to automate certain pieces of data engineering, so that we are helping those teams accelerate a lot faster. And I would say it comes down to more people being able to do these tasks, rather than only very, very specialized people. >> More people being able to do the tasks, more users: kind of a democratization of data, really, there. Can you talk to us in more detail about how Nexla is automating data engineering? >> Yeah, I think this is best shared through a visual, so let me walk you through that a little bit, as to how we automate data engineering. If we think about data engineering, there are many parts to it, but three of the most core components are: integrating with data systems, preparing and transforming data, and then monitoring all of that, right? So automating data engineering happens in, you know, three different ways. First of all, connecting. Connecting to data is basically about the gateway to data, the ability to read and write data from different systems. This is where the data journey starts, but it is extremely complex, because people have to write code to connect to different systems. One part that we have automated is
generating these connectors, so that you don't have to write code for that. Also, making them bidirectional is extremely valuable, because now you can read and write from any system. The second part is that the gateway, the connector, has read the data, but how do you represent it to the user so that anybody can understand it? That's where the concept of a data product comes in. So we also auto-generate data products; these become the common language and entity that people can understand and work with. And the third part is taking all this automation and bringing the human into the loop. No automation is perfect, and bringing the human into the loop means that somebody who is an expert in the data, who can look at it and understand it, can now do things that only data systems experts were able to do before. So bringing that user of data directly into the picture is one important part. But let's not forget: data challenges are very diverse and very complex, so the same system also remains accessible to the engineers who are experts in it, and now both of these can work together. While an engineer comes in through APIs, SDKs, and command-line interfaces, a data user comes in through a nice no-code user interface. And all of these things coming together are what accelerates, back to that time to value that everybody really cares about. >> So if I'm in marketing and I'm a data user, I'm able to have a collaborative workflow with the data engineer? >> Yeah, for the first time that is actually possible, and everybody focuses on their expertise and their know-how. So, you know, somebody in financial services who really understands portfolios and transactions and different types of asset classes has the data in front of them. The engineers who understand the underlying real-time data feeds are still involved in the loop, but now they are not doing that back and forth. You know, as the user of data, I'm not going to the engineer saying, "Hey, can you do this for me? Can you get the data here?" That back and forth is not only time-consuming, it's frustrating, and the number one holdback, right? >> Yeah, and that's time that nobody has to waste, as we know, for many reasons. Talk to me about, when you look into your crystal ball, which I'm sure you have one, what is the future of data engineering from Nexla's perspective? You talked about the automation; what does the future hold? >> I think the future of data engineering is that we up-level it to a point where companies don't have to be slowed down by it. A lot of tooling is already happening. The way to think about this is: here in 2022, if we think our data challenges are, you know, like x, they will be a thousand x in five years, right? I mean, this complexity is just increasing very rapidly. So we think this becomes one of those fundamental layers; you know, as I was saying maybe last time, this is like the road: you don't feel it, you just move on it. You do your job, you build your products, you deliver your services as a company, and this just works for you. And that's where I think the future is, and that's where I think the future should be. We all need to work towards that. >> We're not there yet. >> Not there yet. A lot of potential, a lot of opportunity, and a lot of momentum.
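A hedged sketch of the connector idea Saket describes above: each system reduced to a small spec, with a generic, bidirectional connector produced from it. The ConnectorSpec and make_connector names are inventions of this sketch, not Nexla's actual SDK.

```python
# Rough illustration of generated, bidirectional connectors: a spec describes
# each source/destination, and one generic factory yields read/write objects.
from dataclasses import dataclass
from typing import Callable, Any

@dataclass
class ConnectorSpec:
    name: str
    reader: Callable[[], list[dict]]
    writer: Callable[[list[dict]], Any]

def make_connector(spec: ConnectorSpec):
    """Stand-in for code generation: returns a ready bidirectional connector."""
    class Connector:
        def read(self) -> list[dict]:
            return spec.reader()
        def write(self, records: list[dict]) -> Any:
            return spec.writer(records)
    Connector.__name__ = f"{spec.name.title()}Connector"
    return Connector()

# Two "systems" backed by in-memory lists, purely for illustration.
crm_store, warehouse_store = [{"id": 1, "name": "Ada"}], []
crm = make_connector(ConnectorSpec("crm", lambda: crm_store, crm_store.extend))
wh = make_connector(ConnectorSpec("warehouse", lambda: warehouse_store,
                                  warehouse_store.extend))

wh.write(crm.read())        # move data without hand-written integration code
print(warehouse_store)      # [{'id': 1, 'name': 'Ada'}]
```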
>> Speaking of momentum, I want to talk about data mesh. That is a topic of a lot of excitement and a lot of discussion. Let's unpack that. >> Yeah, I think, you know, the idea that data should be democratized, that people should get access to the data, all comes back to that basic concept of scale. Companies can scale only when more people can do the relevant jobs without depending on each other, right? So the idea of data democratization has been there for a long time, but recently, in the last couple of years, the concept of data mesh was introduced by Zhamak Dehghani and Thoughtworks, and that has really caught the attention of people, and the imagination of leadership as well. The idea that data should be available as a product, you know, that democratization can happen, and that the unit of that democratization is data presented as a product that people can use and collaborate on, is extremely powerful. I think a lot of companies are gravitating towards that, and that's why it's exciting. It is promising a future that is, you know, possible. >> So, speaking of data products, we talked a little bit about this last time, but can you really help us understand, see, smell, touch, and feel what a data product is, and give us that context? >> Yeah, absolutely. I think it's best to orient ourselves with the general thinking of how we consider something a product, right? A product is something that we find ready to use. For example, this table that I'm using right now: made out of raw materials, wood, metal, screws; somebody designed it, somebody produced it, and I'm using it. When we think about data products, we think about data as the raw material. So, for example, a spreadsheet, an API, a database query: those are the raw materials. A data product is something that further enriches and enhances that entity to be much more usable, ready to use. Let me illustrate that with a little bit of a visual, actually; that might help. The idea of the data product, and this is how a data product looks in Nexla for a user, is, as you see, first of all a logical entity. That simply means it's not a new copy of the data; just like containers are logical compute units, these data products are logical entities. But they represent data in the same consistent fashion regardless of where the data comes from or what format it is in. They give the user an idea of what the structure of the data is, what sample data looks like, and what the characteristics of the data are. They allow people to have some documentation around it: what does the data mean, what do these attributes, you know, mean, and how do I interpret them? And how do I validate that data? That's something users often know in their industry: how should my data look? Well, this value can never be negative, because it's a price, for example, right? Then there's the ability to take these data products, which, as I was mentioning earlier, we auto-generate, and create new data products from them. Now, that's something very unique about data: you could take a company's order data and say, well, the order data has an order ID and a user ID, but I need to look up the shipping address, so I can combine user data and order data to get that information in one place. So, you know, creating new data products, giving people access: "Hey, I've designed a data product, I think you'll find it useful; you can use it as it is, you don't have to start from scratch." So all of those things together make a data product something that people can find ready to use, again. >> And this is also usable by, again, that example where I'm in marketing or I'm in sales: this is available to me
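Saket's description maps naturally onto a small sketch: a data product as a logical entity carrying documentation, sample data, and validation (the "a price can never be negative" rule), plus a derived product that joins orders with users to pick up the shipping address. All names here are hypothetical, assumed for illustration rather than taken from Nexla's product.

```python
# A hypothetical data product: schema docs, sample data, validation rules,
# and derivation of a new product by combining two existing ones.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataProduct:
    name: str
    docs: dict                                   # attribute -> meaning
    validate: Callable[[dict], bool]
    records: list[dict] = field(default_factory=list)

    def sample(self, n: int = 1) -> list[dict]:
        return self.records[:n]

    def valid_records(self) -> list[dict]:
        return [r for r in self.records if self.validate(r)]

orders = DataProduct(
    name="orders",
    docs={"order_id": "order key", "user_id": "buyer", "price": "USD, >= 0"},
    validate=lambda r: r["price"] >= 0,          # a price can never be negative
    records=[{"order_id": 1, "user_id": 7, "price": 30.0}],
)
users = DataProduct(
    name="users",
    docs={"user_id": "user key", "address": "shipping address"},
    validate=lambda r: bool(r["address"]),
    records=[{"user_id": 7, "address": "1 Main St"}],
)

# Derived product: enrich orders with the shipping address from users.
addr = {u["user_id"]: u["address"] for u in users.valid_records()}
shippable = DataProduct(
    name="shippable_orders",
    docs={**orders.docs, "address": "looked up from users"},
    validate=orders.validate,
    records=[{**o, "address": addr[o["user_id"]]} for o in orders.valid_records()],
)
print(shippable.sample())
```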
as a general user? >> As a general user, in the tool of your choice. So you can say, "Oh, I am most familiar with using data in a spreadsheet; I would like it there," or "I prefer my data in Tableau or Looker, to visualize it," and you can have it there. These data products give multiple interfaces to the end user to make use of them. >> Got it, I like it. You're meeting users where they are, with relevant data that helps them understand so much more, contextually. I'm curious: when you're in customer conversations, and customers come to you saying, "Saket, we need to build a data mesh," how is Nexla relevant there? What is your conversation like? >> Yeah, when people want to build a data mesh, they're really looking at how their organization will scale into the future. There are multiple components to building a data mesh. There's the tooling, the technology portion, and there are people and processes, right? I mean, unless you train people in certain processes, saying, "Hey, when you build a data product, make sure you have taken care of privacy, or compliance with certain rules," or "who you give access to is something you have to follow some rules about," the mesh doesn't hold up. So we provide the technology component of it, and the people and processes are something that companies then adopt and carry out. The concept of a data product is core to building the data mesh; having governance on it, and having all of this be self-serve, is an essential part of it. So that's where we come into the picture, as the technology component of the whole story. >> And working to deliver on that mission of getting data into the hands of every user. In the last few minutes here, I want to dig into the target audience you mentioned. You named a few big customers that Nexla has; I heard retail, I heard e-commerce, I think I heard logistics. But talk to me about the target customer for Nexla: any verticals in particular, or any company sizes in particular as well? >> Yeah, you know, one of the top three banks in the country is a big user of Nexla as part of their data stack. We actually sit as part of their enterprise-wide AI platform, providing data to their data scientists. We're not allowed to share their name, unfortunately. But, you know, there are multiple other companies in the asset management area, for example, who work with a lot of data on markets and portfolios and so on. A leading medical devices company is using Nexla: data scientists there are using data coming in from medical devices, in real time or streaming, to train models, and combining that with other data to do the sort of clinical-trial-related research that they do. We have, you know, LinkedIn, for example, which is an excellent customer. LinkedIn is by far the largest social network, and their marketing team leverages Nexla to bring data from different types of systems together. So are companies in the education space: Nerdy is a public company that uses Nexla for, you know, student enrollment and education data as they collaborate with school districts, for example. And there are companies across the board in marketing; Live Brand, you know, for example, uses Nexla. So, in terms of who uses Nexla today: mostly mid-size to large to very large enterprises leverage Nexla as a very critical component, often for mission-critical data. >> Do you see that changing anytime soon? As every company these days has to be a data company, we expect as consumers, whether it's my grocery store
>> Do you see that changing anytime soon? As every company these days has to be a data company, we expect as consumers, whether it's my grocery store or my local coffee shop, that they've got to use data to deliver that personalized experience. Do you see the target audience kind of shifting down into the mid-market and SMB space for Nexla? >> Oh yeah, absolutely. Look, we started the journey of the company with the thinking that the most complex data challenges exist in the large enterprise, and if we can make it no-code, self-serve, and easy to use for them, we can bring the same high-end technology to everybody. And this is exactly why we recently launched in the Amazon Marketplace, so anybody can go there, get access to Nexla, and start to use it. And you will see more and more of that happen, where we will be making even some free versions of our product available. So you're absolutely right, every company needs to leverage data, and I think people are getting much better at it. You know, especially in the last couple of years, I've seen that teams have become much more sophisticated. Yes, even if you are a coffee shop and you're running campaigns, you know, getting people, Yelp reviews and so on, this is data that you can use to understand your demographic and your customer better, and run your business better. So one day, yes, we will absolutely be in the hands of every single person. >> A lot more opportunity to delight a lot more consumers and customers. Saket, thank you so much for joining me on the program during the Startup Showcase. You did a great job of helping us understand the future of automated data engineering. We appreciate your insights. >> Thank you so much, Lisa. It's a pleasure talking to you. >> Likewise. For Saket Saurabh, I'm Lisa Martin. You're watching theCUBE's coverage of the AWS Startup Showcase, season two, episode two. Stick around, more great content coming up from theCUBE, the leader in hybrid tech event coverage. [Music]

Published Date : Mar 30 2022

**Summary and Sentiment Analysis are not shown because of an improper transcript**

ENTITIES

Entity | Category | Confidence
10 systems | QUANTITY | 0.99+
10 | QUANTITY | 0.99+
Saket Saurabh | PERSON | 0.99+
lisa martin | PERSON | 0.99+
2022 | DATE | 0.99+
lisa | PERSON | 0.99+
sarah | PERSON | 0.99+
second part | QUANTITY | 0.99+
thousands | QUANTITY | 0.99+
third part | QUANTITY | 0.99+
one part | QUANTITY | 0.99+
nexla | ORGANIZATION | 0.99+
naxa | ORGANIZATION | 0.99+
three | QUANTITY | 0.98+
each | QUANTITY | 0.98+
dodash | ORGANIZATION | 0.98+
hundreds of dif | QUANTITY | 0.98+
today | DATE | 0.98+
first time | QUANTITY | 0.98+
five years | QUANTITY | 0.98+
narwhal | ORGANIZATION | 0.97+
both | QUANTITY | 0.97+
AWS | ORGANIZATION | 0.96+
instacart | ORGANIZATION | 0.96+
yacht pro | ORGANIZATION | 0.95+
linkedin | ORGANIZATION | 0.94+
one | QUANTITY | 0.94+
first | QUANTITY | 0.94+
aws | ORGANIZATION | 0.94+
one important part | QUANTITY | 0.92+
nexla | TITLE | 0.91+
one place | QUANTITY | 0.91+
every single day | QUANTITY | 0.91+
zamak digani | PERSON | 0.9+
three different ways | QUANTITY | 0.89+
amazon | ORGANIZATION | 0.89+
last couple of years | DATE | 0.88+
last couple of years | DATE | 0.88+
second | QUANTITY | 0.87+
poshmark | ORGANIZATION | 0.87+
sarah the ceo | PERSON | 0.85+
nexus | ORGANIZATION | 0.83+
season two | QUANTITY | 0.8+
a lot of people | QUANTITY | 0.8+
Showcase | EVENT | 0.79+
every function | QUANTITY | 0.79+
one day | QUANTITY | 0.78+
three banks | QUANTITY | 0.77+
10 more | QUANTITY | 0.77+
number one | QUANTITY | 0.76+
ferent | ORGANIZATION | 0.73+
lot of data | QUANTITY | 0.73+
thousand | QUANTITY | 0.73+
core components | QUANTITY | 0.7+
single person | QUANTITY | 0.69+
S2 E2 | EVENT | 0.67+
one of the biggest bottlenecks | QUANTITY | 0.67+
lot of companies | QUANTITY | 0.6+
episode two | QUANTITY | 0.59+
thecube | ORGANIZATION | 0.56+
challenges | QUANTITY | 0.53+

Exploring The Rise of Kubernetes With Two Insiders


 

>> Hi everybody. This is Dave Vellante. Welcome to this CUBE Conversation, where we're going to go back in time a little bit and explore the early days of Kubernetes, talk about how it formed, the improbable events, perhaps, that led to it, and maybe how customers are taking advantage of containers and container orchestration today, and maybe where the industry is going. Matt Provo is here. He's the founder and CEO of StormForge. And Chandler Hoisington is the general manager of EKS Edge and Hybrid at AWS. Guys, thanks for coming on. Good to see you. >> Thanks for having me. >> Thanks. >> So, Chandler, you were the vice president of engineering at Mesosphere. Is that correct? >> Well, VP of engineering at Mesosphere, and then I ran product and engineering for D2iQ. >> Yeah. Okay. So you were there in the early days of container orchestration. And Matt, you were working at a SaaS company, a Docker Swarm shop, right? >> Yep. >> Okay. So I mean, a lot of people were, you know, using your platform. It was pretty novel at the time. It was more sophisticated than what was happening with Kubernetes. Take us back. What was it like then? I mean, everybody was coming out. I remember there was, I think there was one DockerCon where everybody was coming: Kubernetes was announced, and then Docker Swarm was announced, and there were probably three or four other startups doing kind of container orchestration. What were those days like? >> Yeah, I wasn't actually at Mesosphere for those days, but I know them well. I know the story well. I came right as we started to pivot towards Kubernetes there. But it's a really interesting story. I mean, obviously they did a documentary on it and, you know, people can watch that. It's pretty good. But I think that, from my perspective, it was really interesting how this happened. You had this advent of containers coming out, right? So there's new, novel technology, and Solomon and these folks started saying, hey, you know, wait a second, what if I put a UX around these couple of Linux features that got launched a couple of years ago? What does that look like? Oh, this is pretty cool. So you have containers starting to crop up. And at the same time you had folks like ThoughtWorks and other kind of thought leaders in the space starting to talk about microservices, and saying, hey, monoliths are bad and you should break up these monoliths into smaller pieces. >> And any greenfield application should be broken up into individual, scalable units that a team can own by themselves, and they can scale independent of each other, and you can write tests against them independently of other components. And you should break up these big, big monoliths. And now we're kind of going back to monoliths, but that's for another day. So you had microservices coming out, and then you also had containers coming out at the same time. So it was like, oh, we need to put these microservices in something. Perfect, we'll put them in containers. And so at that point, before that moment, you didn't really need container orchestration. You could just run a workload in a container and be done with it, right? You don't need Kubernetes to run Docker. But all of a sudden you had tons and tons of containers, and you had to manage these in some way.
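The jump described here, from running one container to managing tons of them, is at heart a scheduling problem. Below is a toy sketch of the bin-packing decision an orchestrator automates; it handles placement only, whereas real schedulers such as Kubernetes also handle health checks, restarts, service discovery, and networking.

```python
# Toy scheduler: place each container on the node with the most free memory.
nodes = {"node-a": 4096, "node-b": 8192, "node-c": 2048}  # free MiB per node

def schedule(containers: dict[str, int]) -> dict[str, str]:
    placements = {}
    # Place the biggest workloads first, a common bin-packing heuristic.
    for name, mem in sorted(containers.items(), key=lambda kv: -kv[1]):
        candidates = {n: free for n, free in nodes.items() if free >= mem}
        if not candidates:
            raise RuntimeError(f"no node can fit {name} ({mem} MiB)")
        chosen = max(candidates, key=candidates.get)  # most headroom wins
        nodes[chosen] -= mem
        placements[name] = chosen
    return placements

print(schedule({"web": 1024, "api": 2048, "worker": 512, "cache": 4096}))
# {'cache': 'node-b', 'api': 'node-a', 'web': 'node-b', 'worker': 'node-b'}
```

With three or four containers a person can do this by hand; with thousands of containers churning across a fleet, automating exactly this kind of decision is what container orchestration came to mean.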
>> And so that's where container orchestration came from. And Ben Hindman, the creator of Mesos, was actually helping schedule Spark at the time at Berkeley, and Spark was one of the first workloads on Mesos. And then his friends at Twitter said, hey, come over, can you help us do this with containers at Twitter? He said, okay. So he helped them do it with containers at Twitter, and that's kind of how that branch of the container wars was started. And, you know, it was really, really great technology, and it actually is still in production in a lot of shops today. More and more people are moving towards Kubernetes, and Mesosphere saw that trend. And at the end of the day, even though they named the company Mesosphere, they were less concerned about helping customers with Mesos specifically. They really wanted to help customers with these distributed problems. And so it didn't make sense to just do Mesos, so they took on Kubernetes as well. And I hope I did that story justice. >> I remember, my co-founder John Furrier introduced me to Jerry Chen way back when. Jerry's first VC investment with Greylock was Docker, and we were talking, obviously very excited about it. And as Chandler was just saying, Solomon and the team simplified, you know, containers. Simple and brilliant. All right, so you guys saw the opportunity. You were a Docker Swarm shop. Why the switch? What was happening? What was the mindset back then? >> We ran into some scale challenges in kind of operationalizing, or productizing, our core machine learning. And, you know, we saw the challenges, luckily, a bit ahead of our time. And we happened to have someone on the team that was also kind of moonlighting as one of the original core contributors to Kubernetes. And so as this sort of shift was taking place, we saw the flexibility of what was becoming Kubernetes. And I'll never forget, I left on a Friday and came back on a Monday, and we had lifted and shifted to Kubernetes. The challenge was, you know, at that time you didn't have what you have today through EKS and those kinds of services. Just getting that first cluster up and running was super, super difficult, even in a small environment. >> And so I remember we finally got it up and running and it was like, nobody touch it, don't do anything. But obviously that doesn't scale either. And so that's really... you know, being kind of a data-science-focused shop at StormForge from the very beginning, and that's where our core IP is, our team looked at that problem. And then we looked at, okay, there are a bunch of parameters and ways that I can tune this application, and why are the configurations set the way that they are? And, you know, is there room to explore? And that's really where, unfortunately... >> Because Mesos had much greater enterprise capabilities than Docker Swarm, at least they were heading in that direction. But you still saw that Kubernetes was attractive, because even though it didn't have all the security features and enterprise features, it was just so simple.
I remember Jen Goldberg, who was at Google at the time, saying, no, we were focused on keeping it simple and going for mass adoption. Is that kind of what you saw? >> Yeah, and we made a bet, honestly. We saw that the growing community was really starting to... you know, we had a little bit of an inside view because we had someone that was very much in the original part of it. But you also saw the tool chain itself start to come into place, right? A little bit. And it's still hardening now. But yeah, as any startup does, we made a pivot and we made a bet, and this one paid off. >> Well, it's interesting because, you know, we said at the time, I mean, obviously Amazon invented the modern cloud. Microsoft had the advantage of this huge software estate: hey, just now run it in the cloud. Okay, great. So they had their entry point. Google didn't have an entry point. This was kind of a Hail Mary against Amazon. And I wrote a piece, you know, the improbable rise of Kubernetes to become the OS of the cloud. But I asked, did it make sense for Google to do that? They never made any money off of it, but I would argue they'd be irrelevant if they hadn't done it. And it didn't really hurt. It certainly didn't hurt Amazon. EKS... and you do containers, and your customers, you've embraced it, right? I mean, I don't know what it was like in the early days. I remember, I've talked to Amazon people about this. It's like, okay, we saw it, and then we talked to customers: what are they doing? That's kind of what the mindset is, right? >> Yeah. You know, I've been at Amazon a couple of years now, and you hear the stories of how we're customer-obsessed, we listen to our customers. Like, okay, okay, we have our company values too. You get told them when you're first hired, on the first day, and you never really think about them again. But at Amazon that really is preached every day. It really is. And we really do listen to our customers. So when customers started asking for Kubernetes, we said okay, and we built it for them. So, I mean, it's really that simple. And it's not as simple as just building them a Kubernetes service. Amazon has a big commitment now to getting involved more in the community and working with folks like StormForge, and really listening to customers and what they want. And that's why we're doing things like this. >> Well, it's interesting, because of course everybody looks at the ecosystem and says, oh, Amazon's going to kill the ecosystem. And then we saw an article the other day, I think it was CRN did an article, great job by Amazon PR, about Snowflake and Amazon's relationship. And I've said many times, Snowflake probably drives more than any other ISV out there. And so, yeah, maybe the Redshift guys might not love Snowflake, but Amazon in general, you know, they're doing great things. And I remember Andy Jassy said to me one time, look, we love the ecosystem, we need the ecosystem. They have to innovate too. If they don't keep pace, you know, they're going to be in trouble.
So that's actually a healthy kind of dynamic. I mean, as an ecosystem partner, how do you see it? >> Well, I'll go back to one thing. Without the work that Google did to open-source Kubernetes, a StormForge wouldn't exist. But without the effort that AWS, and EKS in particular, provides and opens up for developers to innovate and to continue operationalizing the shift to Kubernetes, you know, we wouldn't have nearly the opportunity that we do to actually listen to them as well, listen to the users, and be able to say, what do you want, right? Our entire reason for existence comes from asking users, like, how painful is this process? Like, how much confidence do you have in the, you know, out-of-the-box defaults that ship with your database or whatever it is? And how much do you love manually tuning your application? >> And obviously nobody said, I love that. And so I think as that ecosystem comes together and continues expanding, it just opens up a huge opportunity, not only for existing EKS and AWS users to continue innovating, but for companies like StormForge to be able to provide that opportunity for them as well. And that's pretty powerful. So I think without a lot of the moves they've made, you know, the door wouldn't be nearly as open for companies like us, who are growing quickly but are smaller, to be able to exist. >> Well, and I was saying earlier, and I wrote about this, you're going to get better capabilities. You're clearly seeing that: cluster management, which we've talked about, better automation, security, the whole shift-left movement. So obviously there's a lot of momentum right now for Kubernetes. When you think about bare-metal servers and storage, and then you had VM virtualization, VMware really, and then containers, and then Kubernetes as another abstraction, I would expect we're not at the end of the road here. What's next? Is there another abstraction layer that you would think is coming? >> Yeah. I mean, for a while it looked like... and I remember even our board members and some of our investors said, well, you know, what about serverless? And, you know, what's the next Kubernetes? As much as I love Kubernetes, which I do, and we do, nothing about what we particularly do... we are purpose-built for Kubernetes, but from a core machine learning and problem-solving standpoint, we could apply this elsewhere if we went that direction. And so time will tell what will be next. There will be something, you know, that will end up expanding beyond Kubernetes at some point. But, you know, without knowing what that is, our job is to serve our customers and serve our users in the way that they are asking for it. >> Well, serverless obviously is exploding. When you look again at the ETR survey data, when you look at the services within Amazon and other cloud providers, you know, the functions are off the charts. So that's kind of interesting and notable. Now, of course, Chandler, you've got edge in your title, you've got hybrid in your title. So, you know, this notion of the cloud expanding: it's not just a set of remote services only in the public cloud. Now it's coming to on-premises.
You actually got Andy Jassy in my head space. He said one time, we just look at the data centers as another edge location. Right? Okay, that's a way to look at it. And then you've got edge. So the cloud is expanding, isn't it? The definition of cloud is evolving. >> Yeah, that's right. I mean, customers want to run workloads in lots of places. And that's why we have things like, you know, Local Zones and Wavelength and Outposts and EKS Anywhere, EKS Distro, and obviously probably lots more things to come. And I always think of Amazon's Kubernetes strategy on a manageability scale. On one far end of the spectrum, you have EKS Distro, which is just a collection of the core Kubernetes packages, and you could take those and stand them up yourself in a broom closet in a retail shop. And then on the other far end of the spectrum, you have EKS Fargate, where you can just give us your container and we'll handle everything for you. And then we kind of try to solve everything in between, for your data center and for the cloud. And so you can really ask Amazon: I want you to manage my control plane, I want you to manage this much of my worker nodes, et cetera. And, oh, I actually want help on-prem. And so we're just trying to listen to customers and solve their problems where they're asking us to solve them. >> Go ahead. >> No, I would just add that, in a more vertically focused kind of orientation for us, we believe that optimization capabilities should transcend the location itself. And so whether that's part public, part private cloud, you know, that's part of what I love about EKS Anywhere. You should still be able to achieve optimal results that connect to your business objectives, wherever those workloads are living. >> Well, don't wince. So John and I coined this term called Supercloud, and people laugh about it, but it's different. You know, people talk about multi-cloud, but that was just really kind of vendor diversity, right? I'm running here, I'm running there, I'm running anywhere. But individually. And so Supercloud is this concept of an abstraction layer that floats wherever you are, whether it's on-prem or across clouds, and you're taking advantage of those native primitives, and then hiding that underlying complexity. And that's why, at re:Invent, the ecosystem was so excited. And they didn't call it Supercloud; we called it that. But they're clearly thinking differently about the value that they can add on top of... Goldman Sachs, right? That to me is an example of a Supercloud. They're taking their on-prem data and their software tooling, connecting it to AWS, they're running it on AWS, but they're abstracting that complexity. And I think you're going to see a lot more of that. >> Yeah. So Kubernetes itself, in many cases, is being abstracted away. There's sort of a disappearing act for Kubernetes. And I don't mean that from an adoption standpoint, but, you know, Kubernetes itself is increasingly being abstracted away, which I think is actually super interesting. >> Yeah. Kubernetes doesn't really do anything for a company. Like, we run Kubernetes... like, how does that help your bottom line?
At the end of the day, companies don't care that they're running Kubernetes. They're trying to solve a problem, which is: I need to be able to deploy my applications, I need to be able to scale them easily, I need to be able to update them easily. Those are the things they're trying to solve. So if you can give them some other way to do that, I'm sure, you know, that's what they want. It's not like, you know, a big bank is making more money because they're running Kubernetes. That's not the case. >> It gets subsumed. It just becomes invisible. >> Right. Exactly. >> You guys back in the office yet? What's the situation? >> You know, I work from my house, and we go into the office a couple of times a week. So it's, yeah, it's a crazy time. It's a crazy time to be managing and hiring. And, you know, it's definitely a challenge, but there are a lot of benefits of working from home. I've got two young kids, so I get to see them grow up a little bit more, working out of my house. So it's nice. >> Also... >> So we're, even as a smaller startup, in 26, 27 states, plus Canada and Germany, and we've got a little bit of presence in Japan, so we're very much distributed. We have not gone back, and I'm not sure we will. >> Permanently remote, potentially? >> Yeah. I mean, for us, the timing of our Series B funding, which was when we started hiring a lot, was just before COVID started really picking up. So we, you know, thankfully made a pretty good strategic decision to say, we're going to go where the talent is. And yeah, it was harder to find, for sure; it's incredibly competitive. But it was a good decision for us. We are very big on, you know, getting the teams together in person, as often as possible and in the safest way possible, obviously. But, you know, it's been a pretty interesting journey for us, and something that I'm not sure I would change, to be honest with you. >> Well, Frank Slootman moved Snowflake's HQ to Montana, and then you've got folks like Michael Dell saying, hey, same thing as you: wherever they want to work, bring yourself, and wherever you are is cool. And do you think that the hybrid mode for your team is kind of the operating mode for the foreseeable future? >> No, I think there are a lot of benefits in both, including working from the office. I don't think you can deny the face-to-face interactions. It feels good just doing this interview face to face, right? And I can see your mouth move. So there are a lot of benefits to that over a Chime call or a Zoom call or whatever. That also has advantages, right? I mean, you can be more focused at home. And I think some version of hybrid is probably in the industry's future. I don't know what Amazon's exact plans are; that's above my pay grade. But I know that, in general, the industry is definitely moving to some kind of hybrid model. And like Matt said, getting people together... I'm a big fan. At Mesosphere, we ran a very diverse, remote workforce. We had a big office in Germany, but we'd get everybody together a couple of times a year for engineering week or something like this, and you'd get a hundred people, you know, just dedicated to spending time together at a hotel in Vegas or Hamburg or wherever.
And it's a really good time. And I think that's a good model. >> Yeah. And I think, just more ETR data: the current thinking now is that hybrid is the number one sort of model. CIOs believe 36% of the workforce is going to be hybrid permanently, kind of their call: a couple of days in, a couple of days out. And the percentage that is remote is significantly higher, probably high twenties, whereas historically it's probably been 15%. So, permanent changes. And that changes the infrastructure you need to support it, the security models, and everything, you know, how you communicate. >> When COVID really started hitting in 2020, the big banks, for example, had to... I mean, you want to talk about innovation and the ability to shift quickly. Two of the bigger banks that have, in fact, adopted Kubernetes were able to shift pretty quickly, you know, systems and things that were, historically, in the office all the time. And some of that's obviously shifted back to a certain degree. But that ability... it was pretty remarkable, actually, to see that take place for some of the larger banks and others that are operating in super-regulated environments. I mean, we saw that in government agencies and stuff as well. >> Well, without the cloud, this never would've happened. >> Yeah. And I think it's funny. I remember some of the more old-school management thinking: people aren't going to work as much when they're working from home, they're going to be distracted. I think you're seeing the opposite, where people work too much; they get burned out because you're just running at your computer all day. And so I think that we're learning... I think everyone, the whole industry, is learning: what does it mean to work from home, really? And it's a fascinating case study that we're all a part of right now. >> I was talking to my wife last night about this, and she's very thoughtful. When she was in the workforce, she was at a PR firm, and a guy came in, a guest speaker, it might even have been the CEO of the company, asking, you know, on average, who stays at the office until... you know, who leaves by five o'clock? A few hands up. Who stays until, like, eight o'clock? And then he asked those people, like, why can't you get your work done in an eight-hour workday and go home? And I sit there thinking, well, that's interesting, you know, because he's always looking at me like, why can't you get it done? And I'm saying, the world has changed. Yeah, it really has, where people are just on all the time. I'm not sure it's sustainable, quite frankly. I mean, I think that we have to, you know, as organizations, think about... and I see companies doing it, you guys probably do as well... you know, take a four-day week, or a long weekend, just for your head. But there's no playbook. >> Yeah. Like I said, we're part of a case study. It's also hard because people are distributed now. So you have your meetings on the East Coast, you wake up at seven, and then you have meetings on the West Coast, and you stay until seven o'clock. So your day just stretches out, and you've got to manage this. And I think we'll figure it out. I mean, we're good at figuring this stuff out. >> There's a rise in asynchronous communication.
So with things like Slack and other tools, as helpful as they are in many cases, there's an always-on mentality. And, like, people look for that little green dot, and, you know, if you're on, you're online. So my kids, you know, we have a term now for me, because my office at home is upstairs and I'll come down, and if it's during the day they'll say, oh, Dad, you're going for a walk and talk, you know. Which was my way of getting away from the desk, getting away from Zoom, and, you know, even in Boston, getting outside, trying to at least get a little exercise, or walk and get my head away from the computer screen. But even then, it's often like, oh, I'll get a Slack notification on my phone, or someone will call me, even if it's not a scheduled walk and talk. And so it is interesting. >> A lot of ways to get in touch. Our productivity is presumably going to go through the roof. But now... All right, guys, I'll let you go. Thanks so much for coming on theCUBE. Really appreciate it. And thank you for watching this CUBE Conversation. This is Dave Vellante, and we'll see you next time.

Published Date : Mar 10 2022


ENTITIES

Entity | Category | Confidence
Dave Alante | PERSON | 0.99+
Michael Dell | PERSON | 0.99+
Jen Goldberg | PERSON | 0.99+
Amazon | ORGANIZATION | 0.99+
John | PERSON | 0.99+
Jenny | PERSON | 0.99+
Frank Slootman | PERSON | 0.99+
Ben Heineman | PERSON | 0.99+
Andy Jassy | PERSON | 0.99+
Japan | LOCATION | 0.99+
Jerry | PERSON | 0.99+
Dave Volante | PERSON | 0.99+
Andy | PERSON | 0.99+
Germany | LOCATION | 0.99+
Jesse | PERSON | 0.99+
Goldman Sachs | ORGANIZATION | 0.99+
15% | QUANTITY | 0.99+
Matt Provo | PERSON | 0.99+
Canada | LOCATION | 0.99+
Mesa sphere | ORGANIZATION | 0.99+
AWS | ORGANIZATION | 0.99+
Boston | LOCATION | 0.99+
Montana | LOCATION | 0.99+
2020 | DATE | 0.99+
Google | ORGANIZATION | 0.99+
Matt | PERSON | 0.99+
Two | QUANTITY | 0.99+
Verizon | ORGANIZATION | 0.99+
Microsoft | ORGANIZATION | 0.99+
John furrier | PERSON | 0.99+
Jerry Chen | PERSON | 0.99+
three | QUANTITY | 0.99+
36% | QUANTITY | 0.99+
five o'clock | DATE | 0.99+
Solomon | PERSON | 0.99+
Hamburg | LOCATION | 0.99+
Vegas | LOCATION | 0.99+
Monday | DATE | 0.99+
first | QUANTITY | 0.99+
two young kids | QUANTITY | 0.99+
Berkeley | LOCATION | 0.99+
26 | QUANTITY | 0.99+
Mesa sphere | ORGANIZATION | 0.99+
Friday | DATE | 0.99+
EKS | ORGANIZATION | 0.99+
Hoisington | PERSON | 0.98+
firs | QUANTITY | 0.98+
storm forge | ORGANIZATION | 0.98+
a week | QUANTITY | 0.98+
one | QUANTITY | 0.98+
today | DATE | 0.97+
both | QUANTITY | 0.97+
Kubernetes | ORGANIZATION | 0.97+
four day | QUANTITY | 0.97+
Linux | TITLE | 0.97+
Mary | PERSON | 0.97+
Kubernetes | TITLE | 0.97+
Supercloud | ORGANIZATION | 0.96+
Twitter | ORGANIZATION | 0.96+
eight o'clock | DATE | 0.96+
last night | DATE | 0.96+
S a S a Docker swarm | ORGANIZATION | 0.96+
COVID | ORGANIZATION | 0.96+
eight hour | QUANTITY | 0.96+
Mesa | ORGANIZATION | 0.95+
seven o'clock | DATE | 0.95+
27 states | QUANTITY | 0.94+
Greylock | ORGANIZATION | 0.94+

Breaking Analysis: Enterprise Technology Predictions 2022


 

>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. >> The pandemic has changed the way we think about and predict the future. As we enter the third year of a global pandemic, we see the significant impact that it's had on technology strategy, spending patterns, and company fortunes. Much has changed. And while many of these changes were forced reactions to a new abnormal, the trends that we've seen over the past 24 months have become more entrenched, and point to the way that's coming ahead in the technology business. Hello and welcome to this week's Wikibon CUBE Insights, powered by ETR. In this Breaking Analysis, we welcome our partner and colleague and business friend, Erik Porter Bradley, as we deliver what's becoming an annual tradition for Erik and me: our predictions for enterprise technology in 2022 and beyond. Erik, welcome. Thanks for taking some time out. >> Thank you, Dave. Luckily we did pretty well last year, so we were able to do this again. So hopefully we can keep that momentum going. >> Yeah, you know, I want to mention that we get a lot of inbound predictions from companies and PR firms that help shape our thinking. But one of the main objectives that we have is we try to make predictions that can be measured. That's why we use a lot of data. Now, not all will necessarily fit that parameter, but if you've seen the grading of our 2021 predictions that Erik and I did, you'll see we do a pretty good job of trying to put forth prognostications that can be declared correct or not, you know, as black and white as possible. Now let's get right into it. Our first prediction: we're going to go right into spending, something that ETR surveys for quarterly. And we've reported extensively on this. We're calling for tech spending to increase somewhere around 8% in 2022. We can see there on the slide, Erik, we predicted spending last year would increase by 4%. IDC's last check came in at five and a half percent; Gartner was somewhat higher. So in general, you know, not too bad. But looking ahead, we're seeing an acceleration from the ETR September surveys, as you can see in the yellow versus the blue bar in this chart. Many of the SMBs that were hard hit by the pandemic are picking up spending again. And the ETR data is showing acceleration above the mean for industries like energy, utilities, retail, and services, and also, notably, in the Forbes largest 225 private companies. These are companies like Mars or Koch Industries. They're predicting well-above-average spending for 2022. So Erik, please weigh in here. >> Yeah, a lot to bring up on this one; I'm going to be quick. So, 1,200 respondents on this, over a third of which were at the C-suite level. So really good data that we brought in. The usual bucket of, you know, Fortune 500, Global 2000 makes up the meat of that median, but it's 8.3% and rising, with momentum, as we see. What's really interesting right now is energy and utilities. This is usually, you know, an orphan, stock-dividend type of play; you don't see them at the highest point of tech spending. And the reason why right now is really because the state of the tech infrastructure in our energy infrastructure needs help. And it's obvious; remember the Florida municipality breach last year, when they took over the water systems, or had the ability to?
And this is a real issue, you know, there's bad nation state actors out there, and I'm no alarmist, but the energy and utility has to spend this money to keep up. It's really important. And then you also hit on the retail consumer. Obviously what's happened, the work from home shift created a shop from home shift, and the trends that are happening right now in retail. If you don't spend and keep up, you're not going to be around much longer. So I think the really two interesting things here to call out are energy utilities, usually a laggard in IT spend and it's leading, and also retail consumer, a lot of changes happening. >> Yeah. Great stuff. I mean, I recall when we entered the pandemic, really ETR was the first to emphasize the impact that work from home was going to have, so I really put a lot of weight on this data. Okay. Our next prediction is we're going to get into security, it's one of our favorite topics. And that is that the number one priority that needs to be addressed by organizations in 2022 is security and you can see, in this slide, the degree to which security is top of mind, relative to some other pretty important areas like cloud, productivity, data, and automation, and some others. Now people may say, "Oh, this is obvious." But I'm going to add some context here, Erik, and then bring you in. First, organizations, they don't have unlimited budgets. And there are a lot of competing priorities for dollars, especially with the digital transformation mandate. And depending on the size of the company, this data will vary. For example, while security is still number one at the largest public companies, and those are of course of the biggest spenders, it's not nearly as pronounced as it is on average, or in, for example, mid-sized companies and government agencies. And this is because midsized companies or smaller companies, they don't have the resources that larger companies do. Larger companies have done a better job of securing their infrastructure. So these mid-size firms are playing catch up and the data suggests cyber is even a bigger priority there, gaps that they have to fill, you know, going forward. And that's why we think there's going to be more demand for MSSPs, managed security service providers. And we may even see some IPO action there. And then of course, Erik, you and I have talked about events like the SolarWinds Hack, there's more ransomware attacks, other vulnerabilities. Just recently, like Log4j in December. All of this has heightened concerns. Now I want to talk a little bit more about how we measure this, you know, relatively, okay, it's an obvious prediction, but let's stick our necks out a little bit. And so in addition to the rise of managed security services, we're calling for M&A and/or IPOs, we've specified some names here on this chart, and we're also pointing to the digital supply chain as an area of emphasis. Again, Log4j really shone that under a light. And this is going to help the likes of Auth0, which is now Okta, SailPoint, which is called out on this chart, and some others. We're calling some winners in end point security. Erik, you're going to talk about sort of that lifecycle, that transformation that we're seeing, that migration to new endpoint technologies that are going to benefit from this reset refresh cycle. So Erik, weigh in here, let's talk about some of the elements of this prediction and some of the names on that chart. >> Yeah, certainly. I'm going to start right with Log4j top of mind. 
And the reason why is because we're seeing a real paradigm shift here, where things are no longer being attacked at the network layer; they're being attacked at the application layer, and in the application stack itself. And that is a huge shift left, and it's making DevSecOps a real priority in 2022. That's a real paradigm shift over the last 20 years; that's not where attacks used to come from. And this is going to drive a lot of changes. You called out a bunch of names in there. I would add to that list Wiz; I would add Orca Security, two names in our emerging technology study, in addition to the ones you added, that are involved in cloud security and container security. These names are either going to get gobbled up, so the traditional legacy names are going to have to start writing checks, and, you know, legacy is not fair, but they're in the data center, right? They're on-prem, they're not cloud native. So these are the names that money is going to be flowing to. They're either going to get gobbled up, or we're going to see some IPOs. And the other thing I want to talk about is what you mentioned. We have CrowdStrike on that list, we have SentinelOne on the list. Everyone knows them. Our data was so strong on Tanium that we actually went positive for the first time just today, just this morning, when that was released. The trifecta of these is so important because of what you mentioned: under-resourcing. We can't have security just tell us when something happens; it has to automate, and it has to respond. So in this next generation of EDR and XDR, an automated response has to happen, because people are under-resourced, salaries are really high, and there's a skill shortage out there. Security has to become responsive. It can't just monitor anymore. >> Yeah. Great. And we should call out too, so we named some names: Snyk, Aqua, Arctic Wolf, Lacework, Netskope, Illumio. These are all sort of IPO, or possibly even M&A, candidates. All right, our next prediction goes right to the way we work, again, something that ETR has been on for a while. We're calling for a major rethink in remote work for 2022. We had predicted last year that, by the end of 2021, there'd be a larger return to the office, with the norm being around a third of workers permanently remote. And of course the variants changed that equation and, you know, gave more time for people to think about this idea of hybrid work, and that's really come into focus. So we're predicting that hybrid is going to overtake fully remote as the dominant work model, with only about a third of the workers back in the office full-time. And Erik, we expect a somewhat lower percentage to be fully remote. It's now dipped under 30%, to around 29%, but that's still significantly higher than the historical average of around 15 to 16%. So still a major change, but this idea of hybrid, and getting hybrid right, has really come into focus, hasn't it? >> Yeah, it's here to stay. There's no doubt about it. We started this in March of 2020, as soon as the virus hit. This is the 10th iteration of the survey. No one ever thought we'd see a number where only 34% of people were going to be in office permanently. That's a permanent number; they're expecting only a third of the workers to ever come back fully in office. And against that, there's 63% that are saying their permanent workforce is going to be either fully remote or hybrid.
And I can't really explain how big of a paradigm shift this is. Since the start of the industrial revolution, people leave their house and go to work. Now they're saying that's not going to happen. The economic impact here is so broad, on so many different areas. And, you know, the reason is, like, why not, right? The productivity increase is real; we're seeing the productivity increase. Enterprises are spending on collaboration tools and productivity tools; we're seeing an increased perception of productivity in their workforce. And the CFOs can cut down an expense item. I just don't see a reason why this would end. You know, I think it's going to continue. And I also want to point out, these results, as high as they are, were before the Omicron wave hit us. I can only imagine what these results would have been if we had sent the survey out just two or three weeks later. >> Yeah, that's a great point. Okay, next prediction: we're going to look at the supply chain, specifically how it's affecting some of the hardware spending and cloud strategies in the future. So in this chart, ETR asked buyers, have you experienced problems procuring hardware as a result of supply chain issues? And, you know, despite the fact that some companies, and I would call out Dell, for example, are doing really well in terms of delivering, you can see that in the numbers; it's pretty clear there's been an impact. And it's not an across-the-board thing; it's especially acute in PCs, but also pronounced in networking, and also in firewalls, servers, and storage. And what's interesting is how companies are responding and reacting. So first, you know, I'm calling for laptop and PC demand to stay well above pre-COVID norms. The market had peaked in 2012; pre-pandemic, it kept dropping and dropping and dropping in terms of, you know, unit volume, the market was contracting. And we think it can continue to grow this year, in double digits, in 2022. But what's interesting, Erik, is, when you survey customers, despite the difficulty they're having in procuring network hardware, there's as much of a migration away from existing networks to the cloud. You could probably comment on that; their networks are more fossilized. But when it comes to firewalls and servers and storage, there's a much higher propensity to move to the cloud: 30% of customers that ETR surveyed will replace security appliances with cloud services, and 41% and 34%, respectively, will move to cloud compute and storage in 2022. So cloud's relentless march on traditional on-prem models continues. Erik, what do you make of this data? Please weigh in on this prediction. >> As if we needed another reason to go to the cloud. Right here, here it is yet again. So, this was added to the survey by client demand. They were asking about the procurement difficulties, the supply chain issues, and how they were impacting our community. So this is the first time we ran it, and it really was interesting to see, you know, the move there. And storage particularly I found interesting, because it correlated with a huge jump that we saw in one of our vendor names, Rubrik, which had the highest net score that it's ever had. So clearly we're seeing some correlation with some of these names that are, you know, really well positioned to take storage, to take data, into the cloud.
So again, you didn't need another reason to, you know, hasten this digital transformation, but here we are, we have it yet again, and I don't see it slowing down anytime soon. >> You know, that's a really good point. I mean, obviously you'd wish for no change, that would be great, but things are always going to change. So we'll talk about this a little bit later, when we get into the Supercloud conversation, but this is an opportunity for people who embrace the cloud. So we'll come back to that. And I want to hang on to cloud a bit and share some recent projections that we've made. The next prediction is: the big four cloud players are going to surpass $167 billion in IaaS and PaaS revenue in 2022. We track this. Observers of this program know that we try to create an apples-to-apples comparison between AWS, Azure, GCP, and Alibaba in IaaS and PaaS. So we're calling for 38% revenue growth in 2022, which is astounding for such a massive market. You know, AWS is probably not going to hit a hundred-billion-dollar run rate, but they're going to be close this year, and we're going to get there by 2023; you know they're going to surpass that. Azure continues to close the gap; now they're about two-thirds of the size of AWS. And Google, we think, is going to surpass Alibaba and take the number three spot. Erik, anything you'd like to add here? >> Yeah. First of all, just on a sector level: in the new survey, the net score on cloud jumped another 10%. It was already really high, at 48; it went up to 53. This train is not slowing down anytime soon. And we even added an edge compute type of player, like CloudFlare, into our cloud bucket this year, and it debuted with a net score of almost 60. So this is really an area that's expanding, not just the big three, but everywhere. We even saw Oracle and IBM jump up, so even they're having success taking some of their on-prem customers and selling them their cloud services. This is a massive opportunity, and it's not changing anytime soon. It's going to continue. >> And I think the operative word there is opportunity. So, you know, the next prediction is something that we've been having fun with, and that's this: Supercloud becomes a thing. Now, the reason I say we've been having fun is we put this concept of Supercloud out, and it's become a bit of a controversy. First, you know, what the heck's a Supercloud, right? It's sort of a buzzwordy term, but there really is, we believe, a thing here. We think there needs to be a rethinking, or at least an evolution, of the term multi-cloud. And what we mean is that, in our view, multicloud from a vendor perspective was really cloud compatibility. It wasn't marketed that way, but that's what it was. Either a vendor would containerize its legacy stack and shove it into the cloud, or a company would do the work and build a cloud native service on one of the big clouds, and they did do it, for AWS, and then Azure, and then Google. But there really wasn't much, if any, leverage across clouds. Now, from a buyer perspective, we've always said multicloud was a symptom of multi-vendor, meaning, I've got different workloads running in different clouds, or I bought a company and they run on Azure, and I do a lot of work on AWS. But generally, it wasn't a prescribed strategy to build value on top of hyperscale infrastructure. There certainly was somewhat of a, you know, reducing lock-in and hedging the risk.
But we're talking about something more here. We're talking about building value on top of the hyperscale gift of hundreds of billions of dollars in CapEx. So in addition, we're not just talking about transforming IT, which is what the last 10 years of cloud have been about, you know, doing work in the cloud because it's cheaper or simpler or more agile, all of those things. So that's beginning to change. And this chart shows some of the technology vendors that are leaning toward this Supercloud vision, in our view, building on top of the hyperscalers, which are highlighted in red. Now, Jerry Chen at Greylock, they wrote a piece called Castles in the Cloud. It got our thinking going, and he and the team at Greylock are building out a database of all the cloud services and all the sub-markets in cloud. And that got us thinking that there's a higher level of abstraction coalescing in the market, where there's tight integration of services across clouds, but the underlying complexity is hidden, and there's an identical experience across clouds, and even, in my dreams, on-prem for some platforms. So what's new, or new-ish, and evolving are things like location independence, and you've got to include the edge in that; metadata services to optimize locality of reference and data source awareness; governance; privacy, you know, application independent and dependent, actually; and recovery across clouds. So we're seeing this evolve. And in our view, the two biggest things that are new are the technology, which is evolving to where you're seeing services truly integrate cross-cloud; and the other big change is digital transformation, where there's this new innovation curve developing, and it's not just about making your IT better, it's about SaaS-ifying and automating your entire company's workflows. So Supercloud, it's not just a vendor thing to us. It's the evolution of, you know, the Marc Andreessen quote, "Every company will be a SaaS company." Every company will deliver capabilities that can be consumed as cloud services. So Erik, the chart shows net score, or spending momentum, on the y-axis, and presence in the ETR data set, or market share, on the x-axis. We've talked about Snowflake as the poster child for this concept, where the vision is you're in their cloud, sharing data in that safe place. Maybe you could make some comments. You know, what do you think of this Supercloud concept and this change that we're sensing in the market? >> Well, I think you did a great job describing the concept, so maybe I'll support it a little bit on the vendor level, and then kind of give examples of the ones that are doing it. You stole the lead there with Snowflake, right? There is no better example than what we've seen with what Snowflake can do: cross-portability in the cloud, the ability to be completely agnostic, but then build those services on top, better than anything they could offer. And it's not just there. I mean, you mentioned edge compute; that's a whole other layer where this is coming in. And CloudFlare, the momentum there is out of control. I mean, this is a company that started off just doing CDN, trying to compete with Akamai, and now they're giving you a full soup-to-nuts offering with security and an actual edge compute layer. It's a fantastic company, and what they're doing is another great example of what you're seeing here.
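As an aside, the "identical experience across clouds" idea described above can be pictured as a thin layer that hides each provider's primitives behind one interface. What follows is a deliberately simplified sketch; the adapters here are in-memory stubs standing in for real per-cloud SDK calls, and all names are hypothetical.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """The single interface an application sees, wherever it runs."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    # Stand-in for a per-cloud adapter (S3, Azure Blob, GCS, on-prem);
    # a real adapter would call that provider's SDK here.
    def __init__(self, region: str):
        self.region, self._blobs = region, {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

class SuperStore(ObjectStore):
    """Location independence: route by policy, hide which cloud served it."""
    def __init__(self, backends: dict, policy):
        self.backends, self.policy = backends, policy
    def put(self, key, data):
        self.backends[self.policy(key)].put(key, data)
    def get(self, key):
        return self.backends[self.policy(key)].get(key)

store = SuperStore(
    backends={"aws": InMemoryStore("us-east-1"), "azure": InMemoryStore("eastus")},
    policy=lambda key: "aws" if key.startswith("analytics/") else "azure",
)
store.put("analytics/report.csv", b"q1,q2\n1,2")
print(store.get("analytics/report.csv"))  # b'q1,q2\n1,2'
```

The point of the sketch is the routing policy: the application sees one object store, while placement across clouds, or on-prem, becomes a policy decision hidden beneath the interface, which is the abstraction-layer idea in miniature.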
I'm going to call out HashiCorp as well. They're more of an infrastructure services play, a little bit more of an open-source, freemium model, but what they're doing as well is completely cloud agnostic. It's dynamic. It doesn't care if you're in a container; it doesn't matter where you are. They recently IPO'd, and they're down 25%, but their data looks so good across both our emerging technology and TISA surveys. It's certainly another name that's playing in this. And another one that we mentioned as well is Rubrik. If you need storage and compute in the cloud layer, and you need to be agnostic to it, they're another one that's really playing in this space. So I think it's a great concept you're bringing up. I think it's one that's here to stay, and there are certainly a lot of vendors that fit into what you're describing. >> Excellent. Thank you. All right, let's shift to data. The next prediction might be a little tough to measure. Before, I said we're trying to be black and white here, but it relates to Data Mesh, and the ideas behind that term were created by Zhamak Dehghani of ThoughtWorks. We see Data Mesh really gaining momentum in 2022, but it's largely going to be, we think, confined to a more narrow scope. Now, the impetus for change in data architecture in many companies really stems from the fact that their Hadoop infrastructure didn't solve their data problems, and they struggled to get more value out of their data investments. Data Mesh prescribes a shift to a decentralized architecture and domain ownership of data, and a shift to data product thinking: beyond data for analytics, to data products and services that can be monetized. Now, this is very powerful in our view, but it's difficult for organizations to get their heads around, and further decentralization creates the need for a self-service platform and federated data governance that can be automated. And there are not a lot of standards around this, so it's going to take some time. At our power panel a couple of weeks ago on data management, Tony Baer predicted a backlash on Data Mesh. And I don't think it's going to be so much of a backlash; rather, the adoption will be more limited. Most implementations, we think, are going to use a starting point of AWS, and they'll enable domains to access and control their own data lakes. And while that is a very small slice of the Data Mesh vision, I think it's going to be a starting point. And the last thing I'll say is, this is going to take a decade to evolve, but I think it's the right direction. And whether it's a data lake or a data warehouse or a data hub or an S3 bucket, the concept is that these will eventually just become nodes on the data mesh that are discoverable, and access is governed. And so the idea is that the stranglehold that the data pipeline, the process, and hyper-specialized roles have on data agility is going to evolve, and decentralized architectures and the democratization of data will eventually become a norm for a lot of different use cases. And Erik, I wonder if you'd add anything to this. >> Yeah, there's a lot to add there. The first thing that jumped out to me was that mention of the word backlash. You said it's not really a backlash, but what it could be is that these are new words trying to solve an old problem, and I do think sometimes the industry will notice that right away, and maybe that'll be a little pushback. And the problems are what you already mentioned, right?
>> Great, thank you for that, Erik. Some great points. All right, for the next prediction, we're going to shine the spotlight on two of our favorite topics, Snowflake and Databricks, and the prediction here is that, of course, Databricks is going to IPO this year, as everybody expects. But the real prediction is that while these two companies are already facing off in the market, they're also going to compete with each other for M&A, especially as Databricks, after the IPO, will have more prominence and a war chest. So first, these companies are both looking pretty good on the same XY graph, with spending velocity on the vertical axis and presence, or market share, on the horizontal axis. Both Snowflake and Databricks are well above that magic 40% red dotted line, the elevated line to us. And for context, we've included a few other firms, so you can see what a good position these two companies are really in. I mean, Snowflake — wow — it just keeps moving to the right on this horizontal picture while maintaining its Net Score on the y-axis. Amazing. But here's the thing: Databricks is using the term Lakehouse, implying that it has the best of data lakes and data warehouses, and Snowflake has the vision of the data cloud and data sharing. Snowflake has nailed analytics and is now moving into data science, the domain of Databricks. Databricks, on the other hand, has nailed data science and is moving into the domain of Snowflake, the data warehouse and analytics space. But to really make this seamless, there has to be a semantic layer between these two worlds, and they're either going to build it or buy it, or both. And there are other areas, like data clean rooms and privacy and data prep and governance and machine learning tooling and AI. So the prediction is they'll not only compete in the market, but they'll step up their competition in M&A, especially after the Databricks IPO.
We've listed some target names here, like AtScale, Iguazio, InfoSum, Habu, Immuta, and I'm sure there are many, many others. Erik, care to comment?
>> Yeah. I remember a year ago when we were talking about Snowflake when they first came out, and I said, "I'm shocked if they don't use this war chest of money and start going after more, because we know Slootman; we have so much respect for him. We've seen his playbook." And I'm actually a little bit surprised that here we are, 12 months later, and he hasn't spent that money yet. So I think this prediction's just spot on. To talk a little bit about the data side: Snowflake is in rarefied air. It's all by itself. It is the number one Net Score in our entire TSIS universe. It is absolutely incredible. There are almost no negative intentions, and Global 2000 organizations are increasing their spend on it. We maintain our positive outlook. It really just stands alone. Databricks, however, also has one of the highest overall net sentiments in the entire universe, not just in its area, and this is the first time we're coming up positive on this name as well. It looks like it's not slowing down. Really interesting comment you made, though, that we normally hear from our end-user commentary in our panels and interviews: Databricks is really more used for the data science side. ML and AI is where it's best positioned in our survey. So it might still have some catching up to do to reach the caliber of usability that Snowflake is seeing right now — Snowflake having its own marketplace, for instance. There's just a lot more to Snowflake right now than there is to Databricks. But I do think you're right. These two massive vendors are heading toward a collision course, and it'll be very interesting to see how they deploy their cash. I think Snowflake, with its incredible management and leadership, probably will make the first move.
>> Well, I think you're right on that. And by the way, I'll just add that Databricks has basically said, hey, it's going to be easier for us to come from data lakes into the data warehouse. I'm not sure I buy that. I think, again, that semantic layer is a missing ingredient, so it's going to be really interesting to see how this plays out. And to your point, Snowflake's got the war chest, they've got the momentum, and they've had the public presence since November 2020. So they're probably going to start making some aggressive moves. Anyway, the next prediction is something, Erik, that you and I have talked about many, many times, and that is observability. I know it's one of your favorite topics. We see this world screaming for more consolidation as it goes all in on cloud native. These legacy stacks are fighting to stay relevant, but the direction is pretty clear. The same XY graph lays out the players in the field, with some of the new entrants that we've also highlighted, like Observe and Honeycomb and ChaosSearch that we've talked about. Erik, we put a big red target around Splunk because everyone wants their gold. So please give us your thoughts.
>> Oh man, I feel like I've been saying negative things about Splunk for too long. I've got a bad rap on this name; the Splunk shareholders come after me all the time. Listen, it really comes down to this: they're a fantastic company that was designed to do logging and monitoring, with some great tool sets around what you could do with it. But they were designed for the data center.
They were designed for on-prem. The world we're in now is so dynamic. Everything I hear from our end-user community is that all net new workloads will be going to cloud native players. It's that simple. So Splunk is entrenched; it's going to continue doing what it's doing, and it does it really, really well. But if you're doing something new, the new workloads are going to be in a dynamic environment, and that's going to go to the cloud native players. And in our data, it is extremely clear that that means Datadog and Elastic. They are by far number one and two in Net Score, increase rates, and adoption rates. It's not even close. Even New Relic is actually starting to entrench itself really well. We saw New Relic's adoption going up, which is super important, because they went to that freemium model to try to build a bit of an entrenched customer base, and that's working as well. And then you made a great list here of all the new entrants, but it goes beyond this. There are so many more. In our emerging technology survey, we're seeing Sentry, Catchpoint, Securonix, Lucidworks. There are so many options in this space. And let's not forget, the biggest data we're seeing is with Grafana, and Grafana Labs has yet to turn on its enterprise model. Elastic did it; why can't Grafana Labs do it? They have an enterprise stack. So when you look at how crowded this space is, there has to be consolidation. I recently hosted a panel, and every single person on that panel said, "Please give me consolidation," because they're the end users trying to actually deploy these tools, and it's getting a little bit confusing.
>> Great. Thank you for that. Okay, last prediction. Erik, this might be a little out of your wheelhouse, but you might have some thoughts on it, and that's that hybrid events become the new digital model and a new category in 2022. These pure play digital or virtual events are going to take a back seat to in-person hybrids. The virtual experience will eventually give way to metaverse experiences — that's going to take some time — but the physical hybrid is going to drive it. And the metaverse is ultimately going to define the virtual experience, because the virtual experience today is not great. Nobody likes virtual. And hybrid is going to become the business model. Today's pure virtual experience has to evolve. theCUBE first delivered hybrid in the middle of the last decade, but nobody really wanted it. We did Mobile World Congress last summer in Barcelona in an amazing hybrid model, which we're showing in some of the pictures here. Alex, if you don't mind bringing that back up. And every physical event that we're doing now has a hybrid and virtual component, including the pre-records — you can see our studios there with the green screen. I don't know, Erik, what do you think about the Zoom fatigue and all this? I know you host regular events with your round tables, but what are your thoughts?
>> Well, first of all, I think you and your company have just done an amazing job on this, so that's really your expertise. I spent 20 years of my career hosting intimate Wall Street idea dinners, so I'm better at navigating a wine list than I am a conference floor. But I will say that the trend just goes along with what we saw. If 35% are going to be fully remote, and 70% are going to be hybrid, then our events are going to be as well. I used to host round table dinners one or two nights a week.
Now those have gone virtual. They're now panels, they're now one-on-one interviews. We do chats, we do submitted questions, we do what we can, but there's no reason this is going to change anytime soon. I think you're spot on here.
>> Yeah. Great. All right, so there you have it, Erik and I. Listen, we always love the feedback. Love to know what you think. Thank you, Erik, for your partnership, your collaboration, and for doing these predictions with me.
>> Yeah, I always enjoy them too. And I'm actually happy — last year you made us do a baker's dozen, so thanks for keeping it to 10 this year.
>> (laughs) We've got a lot to say. I know, we cut some out. We didn't do much on crypto. We didn't really talk about SaaS — I've got some thoughts there. We didn't really do much on containers and AI.
>> You want to keep going? I've got another 10 for you.
>> RPA... All right, we'll have you back and then let's do that. All right. Don't forget, these episodes are all available as podcasts wherever you listen; just search Breaking Analysis podcast. Check out ETR's website at etr.plus — they've got a new website out, and it's the best data in the industry. And we publish a full report every week on wikibon.com and siliconangle.com. You can always reach out by email, David.Vellante@siliconangle.com, I'm @DVellante on Twitter, or comment on our LinkedIn posts. This is Dave Vellante for theCUBE Insights powered by ETR. Have a great week, stay safe, be well, and we'll see you next time. (mellow music)

Published Date : Jan 22 2022


Analyst Predictions 2022: The Future of Data Management


 

[Music] In the 2010s, organizations became keenly aware that data would become the key ingredient in driving competitive advantage, differentiation, and growth. But to this day, putting data to work remains a difficult challenge for many, if not most, organizations. Now, as the cloud matures, it has become a game changer for data practitioners by making cheap storage and massive processing power readily accessible. We've also seen better tooling in the form of data workflows, streaming, machine intelligence, AI, developer tools, security, observability, automation, new databases, and the like. These innovations accelerate data proficiency, but at the same time they add complexity for practitioners. Data lakes, data hubs, data warehouses, data marts, data fabrics, data meshes, data catalogs, and data oceans are forming, evolving, and exploding onto the scene. So in an effort to bring perspective to the sea of optionality, we've brought together the brightest minds in the data analyst community to discuss how data management is morphing and what practitioners should expect in 2022 and beyond. Hello everyone, my name is Dave Vellante with theCUBE, and I'd like to welcome you to a special CUBE presentation: Analyst Predictions 2022, the Future of Data Management. We've gathered six of the best analysts in data and data management, who are going to present and discuss their top predictions and trends for 2022 and the first half of this decade. Let me introduce our six power panelists. Sanjeev Mohan is a former Gartner analyst and principal at SanjMo. Tony Baer is principal at dbInsight. Carl Olofson is a well-known research vice president with IDC. Dave Menninger is senior vice president and research director at Ventana Research. Brad Shimmin is chief analyst for AI platforms, analytics, and data management at Omdia. And Doug Henschen is vice president and principal analyst at Constellation Research. Gentlemen, welcome to the program, and thanks for coming on theCUBE today.
>> Great to be here.
>> Thank you.
>> All right, here's the format we're going to use. I, as moderator, will call on each analyst separately, who will then deliver their prediction or megatrend, and then, in the interest of time management and pace, two analysts will have the opportunity to comment. If we have more time, we'll elongate it. But let's get started right away. Sanjeev Mohan, please kick it off. You want to talk about governance — go ahead, sir.
>> Thank you, Dave. I believe that data governance, which we've been talking about for many years, is now not only going to be mainstream, it's going to be table stakes. And for all the things you mentioned — data oceans, data lakes, lakehouses, data fabrics, meshes — the common glue is metadata. If we don't understand what data we have and aren't governing it, there is no way we can manage it. So we saw Informatica go public last year after a hiatus of six years. I'm predicting that this year we see some more companies go public. My bet is on Collibra, most likely, and maybe Alation. I'm also predicting that the scope of data governance is going to expand beyond just data. It's not just data and reports: we're going to see more transformations, like Spark jobs, Python, even Airflow. We're going to see more streaming data — the Kafka schema registry, for example — and we will see AI models become part of this whole governance suite. So the governance suite is going to be very comprehensive and very detailed: lineage, impact analysis, and then even expanding into data quality.
We've already seen that happen with some of the tools, where they're buying these smaller companies and bringing in data quality monitoring and integrating it with metadata management, data catalogs, and data access governance. So once the data governance platforms become the key entry point into these modern architectures, I'm predicting that the number of users of a data catalog is going to exceed that of a BI tool. That will take time, but we've already seen that trajectory. Right now, if you look at BI tools, I would say there are 100 users of a BI tool for every one user of a data catalog, and I see that evening out over a period of time. At some point, data catalogs will really become the main way for us to access data. The data catalog will help us visualize data, but if we want to do more in-depth analysis, it'll be the jumping-off point into the BI tool or the data science tool. And that is the journey I see for the data governance products.
>> Excellent, thank you. Some comments? Maybe Doug — a lot of things to weigh in on there. Maybe you could comment.
>> Yeah, Sanjeev, I think you're spot on about a lot of the trends. The one disagreement: I think it's really still far from mainstream. As you say, we've been talking about this for years. It's like God, motherhood, and apple pie — everyone agrees it's important, but too few organizations are really practicing good governance, because it's hard and because the incentives have been lacking. One thing that deserves mention in this context is ESG mandates and guidelines — environmental, social, and governance regulations and guidelines. We've seen the environmental regs and guidelines imposed on industries, particularly the carbon-intensive industries. We've seen the social mandates, particularly diversity, imposed on suppliers by companies that are leading on this topic. And we've seen governance guidelines now being imposed by banks and investors. So ESG is presenting new carrots and sticks, and it's going to demand more solid data, more detailed reporting, and tighter governance. But we're still far from mainstream adoption. We have a lot of best-of-breed niche players in the space. I think the signs that it's going to become more mainstream are starting with things like Azure Purview and Google Dataplex; the big cloud platform players seem to be upping the ante and starting to address governance.
>> Excellent, thank you, Doug. Brad, I wonder if you could chime in as well.
>> Yeah, I would love to be a believer in data catalogs, but to Doug's point, I think it's going to take some more pressure for that to happen. I recall metadata being something every enterprise thought they were going to get under control when we were working on service-oriented architecture back in the '90s, and that didn't happen quite the way we anticipated. And to Sanjeev's point, it's because it is really complex and really difficult to do. My hope is that we won't fade out into this nebulous nebula of domain catalogs that are specific to individual use cases — like Purview for data quality, or data governance in cybersecurity — and instead have some tooling that can actually be adaptive in gathering metadata, to create something I know is important to you, Sanjeev: this idea of observability. If you can gather enough metadata, without moving your data around, to understand the entirety of a system that's running on that data, you can do a lot to help with the governance that Doug is talking about.
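To make the lineage and impact analysis Sanjeev mentions concrete, here is a minimal Python sketch: catalog entries record their upstream sources, and impact analysis is just a reverse walk over those links. The dataset names are hypothetical.

    # dataset -> list of upstream datasets it is derived from
    LINEAGE = {
        "raw.orders": [],
        "staging.orders_clean": ["raw.orders"],
        "reports.daily_revenue": ["staging.orders_clean"],
        "ml.churn_features": ["staging.orders_clean"],
    }


    def downstream_impact(dataset: str) -> set[str]:
        """Everything that would be affected if `dataset` changed."""
        impacted = set()
        for child, parents in LINEAGE.items():
            if dataset in parents:
                impacted.add(child)
                impacted |= downstream_impact(child)  # walk transitively
        return impacted


    print(downstream_impact("raw.orders"))
    # contains: staging.orders_clean, reports.daily_revenue, ml.churn_features

A real catalog stores far richer metadata per entry, but the impact-analysis query is, at its core, this reverse traversal.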
>> So I just want to add that data governance, like many other initiatives, did not succeed in the past — even AI went into an AI winter, but that's a different topic — and a lot of these things did not succeed because, to your point, the incentives were not there. I remember when Sarbanes-Oxley came onto the scene: if a bank did not comply, they were very happy to pay a million-dollar fine — that was pocket change for them — instead of doing the right thing. But I think the stakes are much higher now. With GDPR, the floodgates opened. California has CCPA, and even CCPA is being outdated by CPRA, which is much more GDPR-like. So we are very rapidly entering a space where pretty much every major country in the world is coming up with its own compliance and regulatory requirements. Data residency is becoming really important, and I think we're going to reach a stage where it won't be optional anymore, whether we like it or not. And I think the reason data catalogs were not successful in the past is that we did not have the right focus on adoption. We were focused on features, and those features were disconnected — very hard for the business to use. They were built by IT people, for IT departments, to look at technical metadata, not business metadata. Today the tables have turned: CDOs are driving this initiative and regulatory compliance is bearing down hard, so I think the time might be right.
>> Yeah, so guys, we have to move on here, but there's some real meat on the bone. Sanjeev, I like that you called out Collibra and Alation, so we can look back a year from now and say, okay, he made the call and stuck with it. And the ratio of BI tools to data catalogs is another sort of measurement we can watch, even with some skepticism there. I wonder if someday we'll have more metadata than data. But I want to move to Tony Baer. You want to talk about data mesh, and coming off of governance — I mean, wow, the whole concept of data mesh is decentralized data, and then governance becomes a nightmare there. But take it away, Tony.
>> We'll put it this way. Data mesh — the idea, at least as proposed by ThoughtWorks — was basically unleashed a couple of years ago, and the press has been almost uniformly uncritical. A good reason for that is all the problems that Sanjeev and Doug and Brad were just speaking about: we have all this data out there, and we don't know what to do about it. Now, that's not a new problem. It was a problem when we had enterprise data warehouses; it was a problem when we had our Hadoop data clusters. It's even more of a problem now that the data's out in the cloud, where the data is not only in S3 but all over the place — and it also includes streaming, which I know we'll be talking about later. So data mesh was a response to that: the idea that the folks who really know best about governance are the domain experts. Data mesh was an architectural pattern and a process. My prediction for this year is that data mesh is going to hit cold, hard reality.
Because if you do a Google search, the published work and the articles have been largely uncritical so far, treating it as a very revolutionary new idea. I don't think it's that revolutionary, because we've talked about ideas like this before. Brad and I met years ago when we were talking about SOA, and decentralizing at that point was all at the application level. Now we're talking about it at the data level, and now we have microservices, so there's this thought of: if we manage apps cloud-natively through microservices, why don't we think of data in the same way? My sense this year — and this has been a very active search term, if you look at Google search trends — is that enterprises are going to look at this seriously, and as they do, it's going to attract its first real hard scrutiny and its first backlash. That's not necessarily a bad thing; it means it's being taken seriously. The reason I think the cold, hard light of day will shine on data mesh is that it's still a work in progress. The idea is basically a couple of years old, and there are still some pretty major gaps. The biggest gap is in the area of federated governance. Federated governance itself is not a new issue: we're trying to figure out how to strike the balance between consistent enterprise policy and governance on the one hand and, on the other, the groups that understand the data and know how to work with it. There's a huge gap there in practice and knowledge. Also, to a lesser extent, there's a technology gap in the self-service technologies that will help teams govern data through the full life cycle — from selecting the data, to building the pipelines, to determining access control, to looking at quality, to checking whether data is fresh or whether it's drifting. So my predictions are that data mesh will receive its first harsh scrutiny this year. You're going to see some enterprises declare premature victory when they've built some federated query implementations. You're going to see vendors start to data-mesh-wash their products: anybody in the data management space — whether it's a pipelining tool, an ELT tool, a catalog, or a federated query tool — is going to promote how they support this. Hopefully nobody is going to call themselves a data mesh tool, because data mesh is not a technology. And one other thing will come out of this, and it harks back to the metadata and the catalogs Sanjeev was talking about: there's going to be a renewed focus on metadata, and I think that's going to spur interest in data fabrics. Data fabrics are pretty vaguely defined, but if we take the most elemental definition — a common metadata backplane — then anybody getting serious about data mesh needs to look at a data fabric, because at the end of the day we all need to read from the same sheet of music.
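As a hedged sketch of the federated governance balance Tony describes — policy defined once, centrally, but applied locally inside each domain's own pipeline — consider the following Python fragment. The specific required fields and checks are invented for illustration only.

    # Central, enterprise-wide policy: metadata every published dataset
    # must carry before it can be shared across domains.
    REQUIRED_FIELDS = {"owner", "retention_days", "pii_classification"}


    def domain_publish_check(dataset_meta: dict) -> list[str]:
        """Run the centrally defined checks inside the owning domain."""
        missing = REQUIRED_FIELDS - dataset_meta.keys()
        problems = [f"missing field: {m}" for m in sorted(missing)]
        # Domains may layer stricter local rules on top of the central ones.
        if (dataset_meta.get("pii_classification") == "high"
                and dataset_meta.get("retention_days", 0) > 90):
            problems.append("high-PII data kept longer than 90 days")
        return problems


    print(domain_publish_check({"owner": "orders-team", "retention_days": 365}))
    # ['missing field: pii_classification']

The balance Tony points to is exactly this split: the enterprise owns the policy definition, while enforcement happens where the domain knowledge lives.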
>> So, thank you, Tony. Dave Menninger, one of the things that people like about data mesh is that it pretty crisply articulates some of the flaws in today's organizational approaches to data. What are your thoughts on this?
>> Well, I think we have to start by defining data mesh, right? The term is already getting corrupted. Tony said it's going to see the cold, hard light of day, and there's a problem right now in that there are a number of overlapping terms that are similar but not identical: data virtualization, data fabric, data federation. I don't think it's really clear what each vendor means by these terms. I see data mesh and data fabric becoming quite popular. I've interpreted data mesh as referring primarily to the governance aspects, as originally intended and specified, but that's not the way I see vendors using it; I see vendors using it much more to mean data fabric and data virtualization. So I'm going to comment on the group of those things, and I think the group of those things is going to happen. They're going to become more robust. Our research suggests that a quarter of organizations are already using virtualized access to their data lakes, and another half — so a total of three quarters — will eventually be accessing their data lakes using some sort of virtualized access. Again, whether you define it as mesh or fabric or virtualization isn't really the point; it's this notion that there are different elements of data, metadata, and governance within an organization that all need to be managed collectively. The interesting thing is when you look at the satisfaction rates of organizations using virtualization versus those that are not: it's almost double. 79% of organizations that were using virtualized access expressed satisfaction with their access to the data lake; only 39% expressed satisfaction if they weren't using virtualized access.
>> Thank you, Dave. Sanjeev, we've just got about a couple of minutes on this topic, but I know you've spoken on a panel with Zhamak Dehghani, who invented the concept. Governance obviously is a big sticking point, but what are your thoughts on this? You're on mute.
>> So my message to Zhamak and to the community is, as opposed to what Dave said, let's not define it. We spent the whole year defining it. There are four principles: domain ownership, data as a product, self-serve data infrastructure, and federated governance. Let's take it to the next level. I get a lot of questions on the difference between data fabric and data mesh, and I say you can't compare the two, because data mesh is a business concept and data fabric is a data integration pattern. How do you compare the two? You have to bring data mesh a level down. So, to Tony's point, I'm on a warpath in 2022 to take it down to: what does a data product look like? How do we handle shared data across domains and govern it? And I think what we're going to see more of in 2022 is the operationalization of data mesh.
>> I think we could have a whole hour on this topic, couldn't we? Maybe we should do that. But let's move to Carl. Carl, you're the database guy. You've been around that block for a while now. You want to talk about graph databases — bring it on.
>> Oh yeah, okay, thanks. So I regard the graph database as the next truly revolutionary database management technology. I'm looking forward, for the graph database market — which of course we haven't defined yet, so obviously I have a little wiggle room in what I'm about to say — to that market growing by about 600 percent over the next 10 years.
Now, 10 years is a long time, but over the next five years we expect to see gradual growth as people start to learn how to use it. The problem isn't that it's not useful; it's that people don't know how to use it. So let me explain, before I go any further, what a graph database is, because some of the folks on the call may not know. A graph database organizes data according to a mathematical structure called a graph. A graph has elements called nodes and edges: a data element drops into a node, and the nodes are connected by edges. Combinations of edges create structures that you can analyze to determine how things are related. In some cases the nodes and edges can have properties attached to them, which add additional informative material that makes it richer; that's called a property graph. There are two principal use cases for graph databases. There are semantic graphs, which are used to break down human language text into semantic structures that you can then search and organize to answer complicated questions; a lot of AI is aimed at semantic graphs. The other kind is the property graph I just mentioned, which has a dazzling number of use cases. Now, as I talk about this, people are probably wondering: we have relational databases — isn't that good enough? A relational database supports what I call definitional relationships. You define the relationships in a fixed structure, and the data drops into that structure: there's a foreign key value that relates one table to another, and that value is fixed. You don't change it; if you change it, the database becomes unstable, and it's not clear what you're looking at. In a graph database, the system is designed to handle change, so that it can reflect the true state of the things it's being used to track. Let me give you some examples of use cases: entity resolution, data lineage, social media analysis, customer 360, fraud prevention, cybersecurity, supply chain — that's a big one — and explainable AI, which is going to become important, because a lot of people are adopting AI but want a system that can say, after the fact, how the AI came to a conclusion or made a recommendation, and right now we don't have really good ways of tracking that. Then there's machine learning in general, social networks, data governance, data compliance, risk management, recommendation, personalization, anti-money laundering — that's another big one — identity and access management, and network and IT operations, which is already becoming a key one, where you've mapped out your data center and can track what's going on as things happen. Add root cause analysis, fraud detection — a huge one; a number of major credit card companies use graph databases for fraud detection — risk analysis, tracking and tracing, churn analysis, next best action, what-if analysis, and impact analysis. And I would add one other thing to this list: metadata management. So, Sanjeev, here you go, this is your engine. I was in metadata management for quite a while in my past life, and one of the things I found was that none of the data management technologies available to us could efficiently handle metadata, because of the kinds of structures that result from it. But graphs can. Graphs can do things like say: this term in this context means this, but in that context it means that. And logistics management and supply chain benefit as well, because graphs handle recursive relationships — by recursive relationships I mean objects that own other objects of the same type. You can do things like bill of materials — parts explosion — or an HR analysis: who reports to whom, how many levels up the chain, that kind of thing. You can do almost any of these things with relational databases, but the problem is you have to program it. It's not supported in the database, and whenever you have to program something, you can't trace it, you can't define it, you can't publish it in terms of its functionality, and it's really, really hard to maintain over time.
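Carl's "who reports to whom" example can be sketched with a plain adjacency structure in Python. A real graph database would express the same traversal in a query language such as Cypher or Gremlin, so this only shows the shape of the recursive relationship; the names are made up.

    # edges: employee -> manager (the edge is "reports to")
    REPORTS_TO = {
        "ana": "bela",
        "bela": "chen",
        "chen": None,   # chen is the top of the chain
        "dev": "bela",
    }


    def chain_of_command(employee: str) -> list[str]:
        """Walk the 'reports to' edges up to the top: Carl's HR example."""
        chain = []
        manager = REPORTS_TO.get(employee)
        while manager is not None:
            chain.append(manager)
            manager = REPORTS_TO.get(manager)
        return chain


    print(chain_of_command("ana"))  # ['bela', 'chen']

In SQL this becomes a recursive common table expression that you write and maintain by hand, which is exactly the programming burden Carl is describing.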
>> So, Carl, thank you. I wonder if we could bring Brad in. Brad, I'm sitting here wondering: is this incremental to the market, or is it disruptive and replaceable? What are your thoughts on this space?
>> It's already disrupted the market. Like Carl said, go to any bank and ask them: are you using graph databases to get fraud detection under control? They'll say, absolutely, that's the only way to solve this problem. And it is, frankly. It's the only way to solve a lot of the problems Carl mentioned — and that, I think, is its Achilles' heel in some ways. It's like finding the best way to cross the seven bridges of Königsberg: it's always going to be tied to those use cases, because it's really special and really unique. And because it's special and unique, it still, unfortunately, stands apart from the rest of the community that's building, let's say, AI outcomes, as the great example here. Graph databases and AI, as Carl mentioned, are like chocolate and peanut butter, but technologically they don't know how to talk to one another; they're completely different. You can't just stand up SQL and query them. You've got to learn — what is it, Carl?
>> Cypher.
>> Cypher, yeah, thank you — to actually get to the data in there. And if you're going to scale that graph database, especially a property graph, if you're going to do something really complex like trying to understand all of the metadata in your organization, you might just end up with a graph database winter, like we had the AI winter, simply because you run out of performance to make the thing happen. So I think it's already disrupted, but we need to treat it like a first-class citizen in the data analytics and AI community. We need to bring it into the fold, we need to equip it with the tools it needs to do the magic it does, and to do it not just for specialized use cases but for everything — because I'm with Carl. I think it's absolutely revolutionary.
>> So Brad has also identified the principal Achilles' heel of the technology, which is scaling. When these things get large and complex enough that they spill over what a single server can handle, you start to have difficulties, because the relationships span things that have to be resolved over a network, and then you get network latency that slows the system down. So that's still a problem to be solved.
>> Sanjeev, any quick thoughts on this? I mean, I think metadata is going to be the largest font on the word cloud, but what are your thoughts here?
>> I want to step away from that, so people don't associate me with only metadata, and talk about something slightly different. DB-Engines.com has done an amazing job — I think almost everyone knows that they chronicle all the major databases in use today. In January of 2022, there were 381 databases on its ranked list. The largest category is RDBMS. The second largest category is actually divided into two: property graphs and RDF graphs. Those two together make up the second largest number of databases. So, talking about Achilles' heels, this is a problem: there are so many graph databases to choose from, and they come in different shapes and forms. And to Brad's point, there are so many query languages. In RDBMS, it's SQL, end of story. Here we've got Cypher, we've got Gremlin, we've got GQL, and then your proprietary languages. So I think there's a lot of disparity in this space.
>> Excellent — all excellent points, Sanjeev, I must say. And that is a problem: the languages need to be sorted and standardized, and people need a road map as to what they can do with it. Because, as you say, you can do so many things, and so many of those things are unrelated, that you sort of say, well, what do we use this for? I'm reminded of a saying I learned a bunch of years ago: the digital computer is the only tool man has ever devised that has no particular purpose.
>> All right, guys, we've got to move on to Dave Menninger. We've heard about streaming; your prediction is in that realm, so please take it away.
>> Sure. I like to say that historical databases are going to become a thing of the past. I don't mean that they're going to go away — we need historical databases — but streaming data is going to become the default way in which we operate with data. So in the next, say, three to five years, I would expect that data platforms — and we use the term data platforms to represent the evolution of databases and data lakes — will incorporate these streaming capabilities. We're going to process data as it streams into an organization, and then it's going to roll off into historical databases. So historical databases don't go away; they become a thing of the past in the sense that they store the data that occurred previously, and as data is occurring, we're going to be processing it, analyzing it, and acting on it. We only ever ended up with historical databases because we were limited by the technology available to us. Data doesn't occur in batches; we processed it in batches because that was the best we could do — and it wasn't bad, and we've continued to improve. But streaming data today is still the exception, not the rule. There are projects within organizations that deal with streaming data, but it's not the default way in which we deal with data yet, and that's my prediction: this is going to change. Streaming data will be the default way in which we deal with data, and however you label it — maybe these databases and data platforms just evolve to handle it — we're going to deal with data in a different way. Our research shows that already about half of the participants in our analytics and data benchmark research are using streaming data, and another third are planning to use streaming technologies, so that gets us to about eight out of ten organizations that need to use this technology. That doesn't mean they have to use it throughout the whole organization, but it's pretty widespread in its use today, and it has continued to grow. If you think about the consumerization of IT, we've all been conditioned to expect immediate access to information and immediate responsiveness. We want to know if an item is on the shelf at our local retail store so we can go in and pick it up right now. That's the world we live in, and it's spilling over into the enterprise IT world, where we have to provide those same types of capabilities. So that's my prediction: the historical database becomes a thing of the past, and streaming data becomes the default way in which we operate with data.
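A minimal sketch of the pattern Dave describes — process events as they arrive, then roll the results off into a historical store — might look like the following Python fragment, with a toy in-memory list standing in for the historical database.

    from collections import defaultdict

    historical_table = []          # stand-in for the historical database
    current_window = None
    window_totals = defaultdict(float)


    def on_event(ts_minute: int, product: str, amount: float) -> None:
        """Called for every event as it arrives, not in a nightly batch."""
        global current_window
        if current_window is None:
            current_window = ts_minute
        if ts_minute != current_window:          # window closed: roll off
            for prod, total in window_totals.items():
                historical_table.append((current_window, prod, total))
            window_totals.clear()
            current_window = ts_minute
        window_totals[product] += amount         # act on the data immediately


    for minute, prod, amt in [(1, "a", 5.0), (1, "a", 2.0), (2, "b", 1.0)]:
        on_event(minute, prod, amt)

    print(historical_table)  # [(1, 'a', 7.0)] -- minute 1 rolled off

The inversion is visible in the shape of the code: there is no "load, then query" step, only a standing computation that the data flows through.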
>> All right, thank you, David. Well, what say you, Carl — a guy who's followed historical databases for a long time?
>> Well, one thing: actually, every database is historical, because as soon as you put data in it, it's now history. It no longer reflects the present state of things. Even if that history is only a millisecond old, it's still history. But I would say — and I know you're trying to be a little bit provocative in saying this, Dave, because you know as well as I do that people still need to do their taxes, they still need to do accounting, they still need to run general ledger programs and things like that, and all of that involves historical data. That's not going to go away, unless you want to go to jail. So you're going to have to deal with that. But as far as the leading-edge functionality goes, I'm totally with you, and I'm just wondering if this requires a change in the way we perceive applications in order to truly be manifested — rethinking the way applications work, saying that an application should respond instantly as soon as the state of things changes. What do you say about that?
>> I think that's true. I think we do have to think about things differently. It's not the way we designed systems in the past. We're seeing more and more systems designed that way, but again, it's not the default. And I agree 100% with you that we do need historical databases; that's clear. Some of those historical databases will even be used in conjunction with the streaming data, right? Take the data warehouse example, where you're using the data warehouse as context and the streaming data as the present. You're saying: here's a sequence of things that's happening right now — have we seen that sequence before? What does that pattern look like in past situations, and can we learn from that?
>> So, Tony Baer, I wonder if you could comment. When you think about, say, real-time inferencing at the edge, a lot of what we're discussing here in this segment looks like it's got great potential. What are your thoughts?
>> Yeah, well, I think you nailed it right on the head there. I'm going to split this one down the middle: I don't see streaming becoming the default. What I see is streaming and transaction databases and analytics data — data warehouses, data lakes, whatever — converging. And what allows us to converge technically is cloud-native architecture, where you can distribute things. You could have a node here that's doing the real-time processing and — this is where your lead-in comes in — maybe doing some of that real-time predictive analytics: looking at this customer journey, what's happening with what the customer is doing right now, and how it's correlated with what other customers are doing. In the cloud you can partition this, and because of the speed of the infrastructure you can bring these together and orchestrate them in a loosely coupled manner. The other part is that the use cases are demanding it, and this goes back to what Dave is saying: when you look at customer 360, when you look at smart utility grids, when you look at any type of operational problem, it has a real-time component and a historical component, and it needs predictives. So my sense here is that technically we can bring this together through the cloud, and the use case is that we can apply some real-time predictive analytics to these streams and feed that into the transactions, so that when we make a decision about what to do as a result of a transaction, we have this real-time input.
>> Sanjeev, did you have a comment?
>> Yeah, I was just going to say that, to this point, we have to think of streaming very differently. With the historical databases, we used to bring the data in, store the data, and then run rules on top — aggregations and all of that. In the case of streaming, the mindset changes, because the rules — the inference, all of that — are fixed, but the data is constantly changing. So it's a completely reversed way of thinking about, and building, applications on top of that.
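Tony's convergence point and Sanjeev's inversion can be sketched together: the rule is fixed up front, each arriving event is scored against historical context (the warehouse side), and the data is what keeps changing. Everything below — the context table, the threshold, the events — is illustrative only.

    # Historical context, e.g. each customer's average order value to date.
    AVG_ORDER_VALUE = {"c1": 40.0, "c2": 250.0}


    # The rule is defined once; the data flows past it.
    def flag_unusual(event: dict) -> bool:
        baseline = AVG_ORDER_VALUE.get(event["customer"], 0.0)
        return event["amount"] > 3 * baseline  # fixed predicate per event


    stream = [
        {"customer": "c1", "amount": 35.0},
        {"customer": "c1", "amount": 500.0},   # far above this customer's norm
    ]
    for event in stream:
        if flag_unusual(event):
            print("review:", event)            # real-time action on the stream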
>> So, Dave Menninger, there seemed to be some disagreement about the default. What kind of time frame are you thinking about? Is it the end of the decade that it becomes the default — where would you pin it?
>> I think around five to ten years this becomes the reality. It'll be more and more common between now and then, but then it becomes the default. And Sanjeev, at some point, maybe in one of our subsequent conversations, we need to talk about governing streaming data, because that's a whole other set of challenges. We've also talked about this in two dimensions, historical and streaming, and there's a lot of low-latency, micro-batch, sub-second processing that's not quite streaming but in many cases is fast enough — we're seeing a lot of adoption of near real time, not quite real time, as good enough for many applications. And nobody's really talking about the hardware dimension of this; that'll just happen. Carl?
>> So, near real time: maybe before you lose the customer, however you define that, right?
>> Okay, let's move on to Brad. Brad, you want to talk about automation and AI — the pipeline people feel like, hey, we can just automate everything. What's your prediction?
>> Yeah, I'm an AI aficionado, so apologies in advance for that. I think we've been seeing automation at play within AI for some time now, and it's helped us do a lot of things, especially for the practitioners that are building AI outcomes in the enterprise. It's helped them fill skills gaps, it's helped them speed development, and it's helped them actually make AI better, because in some ways it provides some swim lanes — for example, with technologies like AutoML that can auto-document and create the sort of transparency we talked about a little bit earlier.
But I think there's an interesting convergence happening with this idea of automation, and that is that the automation that started with practitioners is trying to move outside the traditional bounds of things like "I'm just trying to get my features," "I'm just trying to pick the right algorithm," "I'm just trying to build the right model." It's expanding across the full life cycle of building an AI outcome — starting at the very beginning with the data, and continuing on to the end, which is the continuous delivery and continuous automation of that outcome, to make sure it's right and it hasn't drifted, and things like that. And because it's become kind of powerful, we're starting to see this strange thing happen where the practitioners are starting to converge with the users. That is to say: if I'm in Tableau right now, I can stand up Salesforce Einstein Discovery, and it will automatically create a nice predictive algorithm for me, given the data that I pull in. And what's starting to happen — and we're seeing this from the companies that create business software, so Salesforce, Oracle, SAP, and others — is that they're starting to use these same ideals, and a lot of deep learning, to stand up out-of-the-box, flip-a-switch AI outcomes, at the ready for business users. I think that's the way it's going to go, and what it means is that AI is slowly disappearing — and I don't think that's a bad thing. I think, if anything, what we're going to see in 2022, and maybe into 2023, is a rush to put this idea of disappearing AI into practice and have as many of these solutions in the enterprise as possible. For example, SAP is going to roll out this quarter something called adaptive recommendation services, which is basically a cold-start AI outcome that can work across a whole bunch of different vertical markets and use cases. It's just a recommendation engine for whatever you need it to do in the line of business. So you're an SAP user — a sales professional, let's say — you log into your software one day, and suddenly you have a recommendation for customer churn. That's great — well, I don't know, I think that's terrifying in some ways. I think it is the future, that AI is going to disappear like that, but I am absolutely terrified of it, because what it really does is call attention to a lot of the issues we already see around AI, specific to this idea of what we at Omdia like to call responsible AI: how do you build an AI outcome that is free of bias, that is inclusive, that is fair, that is safe, that is secure, that is auditable, et cetera? That takes a lot of work to do. So if you imagine a customer that's just a Salesforce customer, let's say, and they're turning on Einstein Discovery within their sales software, they need some guidance to make sure that when they flip that switch, the outcome they're going to get is correct. That's going to take some work. And so I think we're going to see this "let's roll this out," and suddenly there are going to be a lot of problems and a lot of pushback. Some of that's going to come from GDPR and the others that Sanjeev was mentioning earlier; a lot of it's going to come from internal CSR requirements within companies that are saying, whoa, hold up, we can't do this all at once. Let's take the slow route, let's make AI automated in a smart way — and that's going to take time.
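At its core, the automation Brad describes reduces to a model-selection loop like the toy one below, which assumes scikit-learn is available. Real AutoML tools add feature engineering, hyperparameter tuning, and auto-documentation on top of this basic idea.

    # Try several candidate models, score each the same way, keep the best.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=3),
    }

    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(best, round(scores[best], 3))   # the "flip a switch" winner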
>> Yeah, so a couple of predictions there that I heard: AI essentially disappears — it becomes invisible, if I can restate it that way. And then, if I understand it correctly, Brad, you're saying there's a backlash in the near term — people will say, whoa, slow down, let's automate what we can. Those attributes you talked about are non-trivial to achieve. Is that why you're a bit of a skeptic?
>> Yeah, I think we don't have any sort of standards that companies can look to and understand, and within these companies — especially those that haven't already stood up an internal data science team — they don't have the knowledge to understand, when they flip that switch for an automated AI outcome, whether it's going to do what they think it's going to do. So we need some sort of standard methodology and practice — best practices — that every company that's going to consume this invisible AI can make use of. And one of the things Google kicked off a few years back, which is picking up some momentum and which the companies I just mentioned are starting to use, is this idea of model cards, where at least you have some transparency about what these things are doing. For the SAP example, we know that it's using a convolutional neural network with a long short-term memory model, and we know that it only works on Roman-alphabet English. Therefore I, as a consumer, can say: oh, well, I know that I need to do this internationally, so I should not just turn this on today.
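A minimal model card in the spirit of the Google proposal Brad cites might look like the following; the fields and values, including the SAP-style details, are illustrative only, not an actual vendor's card.

    # A structured, human-readable record of what the model is and where
    # it should not be used. Contents are hypothetical.
    model_card = {
        "model": "churn_recommender_v1",
        "architecture": "CNN + LSTM",
        "intended_use": "customer churn recommendations for sales users",
        "limitations": [
            "trained on Roman-alphabet English text only",
            "not evaluated for non-US markets",
        ],
        "training_data": "2019-2021 CRM interaction logs (hypothetical)",
        "fairness_checks": ["demographic parity report attached"],
    }

    # A consumer can review the card before flipping the switch.
    for limitation in model_card["limitations"]:
        print("check before enabling:", limitation)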
>> Great, thank you. Carl, can you add anything, any context here? >> Yeah, we've talked about some of the things Brad mentioned here at IDC in our Future of Intelligence group, regarding in particular the moral and legal implications of having a fully automated, you know, AI-driven system, because we already know, and we've seen, that AI systems are biased by the data that they get, right? So if they get data that pushes them in a certain direction... I think there was a story last week about an HR system that was recommending promotions for White people over Black people, because in the past, you know, White people were promoted and more productive than Black people, but it had no context as to why, which is, you know, because Black people were being historically discriminated against. But the system doesn't know that. So, you know, you have to be aware of that, and I think that, at the very least, there should be controls when a decision has either a moral or a legal implication, when you really need a human judgment. It could lay out the options for you, but a person actually needs to authorize that action. And I also think that we always will have to be vigilant regarding the kind of data we use to train our systems, to make sure that it doesn't introduce unintended biases, and to some extent they always will, so we'll always be chasing after them. >> That's absolutely right, Carl. Yeah, I think that what you have to bear in mind as a consumer of AI is that it is a reflection of us, and we are a very flawed species. And so if you look at all the really fantastic, magical-looking supermodels we see, like GPT-3 and the four that's coming out, they're xenophobic and hateful, because the data that they're built upon and the algorithms and the people that build them are us. So AI is a reflection of us. We need to keep that in mind. >> Yeah, the AI is biased 'cause humans are biased. All right, great. Okay, let's move on. Doug Henschen, you know, a lot of people said that data lake, that term's not going to live on, but it appears to have some legs here. You want to talk about lake house? Bring it on. >> Yes, I do. My prediction is that lake house, and this idea of a combined data warehouse and data lake platform, is going to emerge as the dominant data management offering. I say offering; that doesn't mean it's going to be the dominant thing that organizations have out there, but it's going to be the predominant vendor offering in 2022. Now, heading into 2021, we already had Cloudera, Databricks, Microsoft, Snowflake as proponents. In 2021, SAP, Oracle, and several of these fabric, virtualization and mesh vendors joined the bandwagon. The promise is that you have one platform that manages your structured, unstructured and semi-structured information, and it addresses both the BI analytics needs and the data science needs. The real promise there is simplicity and lower cost. But I think end users have to answer a few questions. The first is, does your organization really have a center of data gravity, or is the data highly distributed: multiple data warehouses, multiple data lakes, on-premises, cloud? If it's very distributed and you, you know, have difficulty consolidating, and that's not really a goal for you, then maybe that single platform is unrealistic and not likely to add value to you. You know, also, the fabric and virtualization vendors, the mesh idea, that's where, if you have this highly distributed situation, that might be a better path forward. The second question: if you are looking at one of these lake house offerings, you are looking at consolidating, simplifying, bringing together to a single platform, and you have to make sure that it meets both the warehouse need and the data lake need. So you have vendors like Databricks and Microsoft with Azure Synapse, new, really, to the data warehouse space, and they're having to prove that these data warehouse capabilities on their platforms can meet the scaling requirements, can meet the user and query concurrency requirements, meet those tight SLAs. And then on the other hand, you have the Oracle, SAP, Snowflake, the data warehouse folks, coming into the data science world, and they have to prove that they can manage the unstructured information and meet the needs of the data scientists. I'm seeing a lot of the lake house offerings from the warehouse crowd managing that unstructured information in columns and rows, and some of these vendors, Snowflake in particular, is really relying on partners for the data science needs. So you really got to look at a lake house offering and make sure that it meets both the warehouse and the data lake requirement.
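For readers who want to see the lake house promise in miniature, the sketch below uses the open-source DuckDB engine to write data into an open columnar format and then serve a warehouse-style SQL query straight off that file. DuckDB is just a convenient stand-in here; the vendors Doug names each ship their own engines, and the file name is invented.

```python
import duckdb

# Write a tiny "lake" file in an open columnar format (Parquet)...
duckdb.sql(
    "COPY (SELECT 1 AS id, 'widget' AS product, 9.99 AS price) "
    "TO 'orders.parquet' (FORMAT PARQUET)"
)

# ...then run a warehouse-style SQL query directly against that file,
# with no separate load step: one engine over lake-resident data.
print(duckdb.sql(
    "SELECT product, SUM(price) AS total FROM 'orders.parquet' GROUP BY product"
))
```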
>> Well, thank you, Doug. Well, Tony, if those two worlds are going to come together, as Doug was saying, the analytics and the data science world, does there need to be some kind of semantic layer in between? I don't know. Weigh in on this topic, if you would. >> Oh, didn't we talk about data fabrics before? Common metadata layer. Actually, I'm almost tempted to say let's declare victory and go home, in that this has actually been going on for a while. I actually agree with, you know, much of what Doug is saying there, which is that, I mean, I remember as far back as, I think it was like 2014, I was doing a study, I was still at Ovum, the predecessor of Omdia, looking at all these specialized databases that were coming up and seeing that, you know, there's overlap at the edges, but yet there was still going to be a reason at the time that you would have, let's say, a document database for JSON, you'd have a relational database for transactions and for data warehouse, and you had, you know, basically something at that time that resembled Hadoop for what we're now considering the data lake. Fast forward, and the thing is, what I was saying at the time is that you're seeing basically blending at the edges; I was saying that like about five or six years ago. And the lake house is essentially, you know, the current manifestation of that idea. There is a dichotomy in terms of, you know, it's the old argument: do we centralize this all, you know, in a single place, or do we virtualize? And I think it's always going to be a yin and yang; there's never going to be a single silver bullet. I do see that there are also going to be questions, and these are points that Doug raised: you know, what do you need for your performance characteristics? Do you need, for instance, high concurrency? Do you need the ability to do some very sophisticated joins, or is your requirement more to be able to distribute your processing, you know, as far as possible, to essentially do a kind of brute-force approach? All these approaches are valid based on the use case. I just see that essentially the lake house is the culmination of... it's nothing new; it's just a relatively new term, introduced by Databricks a couple of years ago, and this is the culmination of basically what's been a long-time trend. And what we see in the cloud is that we start seeing data warehouses as a checkbox item: say, hey, we can basically source data in cloud storage, in S3, Azure Blob Store, you know, whatever, as long as it's in certain formats, like, you know, Parquet or CSV or something like that. You know, I see that as becoming kind of, you know, a checkbox item. So to that extent, I think that the lake house, depending on how you define it, is already reality, and in some cases maybe new terminology, but not a whole heck of a lot new under the sun. >> Yeah. And Dave Menninger, I mean, a lot of this, thank you, Tony, but a lot of this is going to come down to, you know, vendor marketing, right? Some people try to co-opt the term; we talked about data mesh washing. What are your thoughts on this? >> Yeah, so I used the term data platform earlier, and part of the reason I use that term is that it's more vendor-neutral. We've tried to sort of stay out of the vendor terminology patenting world, right? Whether the term lake house is what sticks or not, the concept is certainly going to stick, and we have some data to back it up.
About a quarter of organizations that are using data lakes today already incorporate data warehouse functionality into it, so they consider their data lake house and data warehouse one and the same. About a quarter of organizations, a little less, but about a quarter of organizations, feed the data lake from the data warehouse, and about a quarter of organizations feed the data warehouse from the data lake. So it's pretty obvious that three-quarters of organizations need to bring this stuff together, right? The need is there, the need is apparent. The technology is going to continue to converge. I like to talk about, you know, you've got data lakes over here at one end, and I'm not going to talk about why people thought data lakes were a bad idea, because they thought you just throw stuff in a server and you ignore it, right? That's not what a data lake is. So you've got data lake people over here, and you've got database people over here, data warehouse people over here. Database vendors are adding data lake capabilities, and data lake vendors are adding data warehouse capabilities, so it's obvious that they're going to meet in the middle. I mean, like Tony says, I think we should declare victory and go home. >> And so, just to follow up on that, are you saying these, the specialized lake and the specialized warehouse, do they go away? I mean, you know, Tony, data mesh practitioners, or advocates, would say, well, they could all live as just a node on the mesh. But based on what Dave just said, are we going to see those all morph together? >> Well, number one, as I was saying before, there's always going to be this sort of, you know, centrifugal force, or this tug of war, between do we centralize the data, or do we virtualize? And the fact is, I don't think there's ever going to be any single answer. I think, in terms of data mesh, data mesh has nothing to do with how you physically implement the data. You could have a data mesh basically on a data warehouse. It's just that, you know, the difference being is that we use the same, you know, physical data store, but everybody's logically basically governing it differently. A data mesh is basically, it's not a technology, it's a process, it's a governance process. So essentially, you know, as I was saying before, this is basically the culmination of a long-time trend. We're essentially seeing a lot of blurring, but there are going to be cases where, for instance, if I need, let's say, high concurrency or something like that, there are certain things that I'm not going to be able to efficiently get out of a data lake, where, you know, basically I'm doing a system where I'm just really brute-forcing very fast file scanning and that type of thing. So I think there always will be some delineations, but I would agree with Dave and with Doug that we are seeing basically a confluence of requirements, that we need to essentially have the ability of a data lake and a data warehouse; these need to come together. So I think what we're likely to see is organizations look for a converged platform that can handle both sides for their center of data gravity. The mesh and the fabric vendors, the fabric and virtualization vendors, they're all on board with the idea of this converged platform, and they're saying, hey, we'll handle all the edge cases, the stuff that isn't in that center of data gravity, that is distributed in a cloud or at a remote location.
So you can have that single platform for the center of your data, and then bring in virtualization, mesh, what have you, for reaching out to the distributed data. >> Bingo. As was basically said, people are happy when they virtualize data. I think, yes, at this point, but to Dave Menninger's point, you know, they are converging. Snowflake has introduced support for unstructured data, so now we are literally splitting hairs here. Now, what Databricks is saying is that, aha, but it's easier to go from data lake to data warehouse than it is from data warehouse to data lake. So I think we're getting into semantics, but we've already seen these two converge. >> So it takes something like AWS, who's got, what, 15 data stores; are they going to have 15 converged data stores? That's going to be interesting to watch. All right, guys, I'm going to go down the list and do like a one word each, and you guys, each of the analysts, if you wouldn't mind, just add a very brief sort of course correction for me. So, Sanjeev, I mean, governance is going to be, maybe it's the dog that wags the tail now. I mean, it's coming to the fore, all this ransomware stuff, which, we really didn't talk much about security, but what's the one word in your prediction that you would leave us with on governance? >> It's going to be mainstream. >> Mainstream, okay. Tony Baer, mesh washing is what I wrote down; that's what we're going to see in 2022, a little reality check. You want to add to that? >> Reality check is, I hope that no vendor, you know, jumps the shark and calls their offering a data mesh product. >> Yeah, let's hope that doesn't happen. If they do, we're going to call them out. Carl, I mean, graph databases: thank you for sharing some, you know, high-growth metrics. I know it's early days, but magic is what I took away from that. It's the magic database. >> Yeah, I would actually... I've said this to people too. I kind of look at it as a Swiss Army knife of data, because you can pretty much do anything you want with it. That doesn't mean you should. I mean, it's definitely the case that if you're, you know, managing things that are in a fixed schematic relationship, probably a relational database is a better choice. There are, you know, times when a document database is a better choice. It can handle those things, but it may not be the best choice for that use case. But for a great many, especially the new emerging use cases I listed, it's the best choice. >> Thank you. And Dave Menninger, thank you, by the way, for bringing the data in; I like how you supported all your comments with some data points. But streaming data becomes the sort of default paradigm, if you will. What would you add? >> Yeah, I would say think fast, right? That's the world we live in. You got to think fast. >> Fast, love it. And Brad Shimmin, I mean, on the one hand, I was saying, okay, great, I'm afraid I might get disrupted by one of these internet giants who are AI experts, so I'm going to be able to buy instead of build AI. But then again, you know, I've got some real issues; there's a potential backlash there. So give us your bumper sticker. >> Yeah, I would say, going with Dave, think fast and also think slow, to talk about the book that everyone talks about. I would say, really, that this is all about trust: trust in the idea of automation and of a transparent, invisible AI across the enterprise, but verify. Verify before you do anything. >> And then Doug Henschen, I mean, look, I think the trend is your friend here on this prediction, with lake house really becoming dominant.
I liked the way you set up that notion of, you know, the data warehouse folks coming at it from the analytics perspective, but then you've got the data science worlds coming together. I still feel as though there's this piece in the middle that we're missing, but your final thoughts; we'll give you the last word. >> Well, I think the idea of consolidation and simplification always prevails. That's why the appeal of a single platform is going to be there. We've already seen that with, you know, Hadoop platforms moving toward cloud, moving toward object storage, and object storage becoming really the common storage point, whether it's a lake or a warehouse. And that second point: I think ESG mandates are going to come in alongside GDPR and things like that to up the ante for good governance. >> Yeah, thank you for calling that out. Okay, folks, hey, that's all the time that we have here. Your experience and depth of understanding on these key issues in data and data management were really on point, and they were on display today. I want to thank you for your contributions; really appreciate your time. >> Enjoyed it. >> Thank you. >> Now, in addition to this video, we're going to be making available transcripts of the discussion. We're going to do clips of this as well; we're going to put them out on social media. I'll write this up and publish the discussion on wikibon.com and siliconangle.com. No doubt several of the analysts on the panel will take the opportunity to publish written content, social commentary, or both. I want to thank the power panelists, and thanks for watching this special Cube presentation. This is Dave Vellante. Be well, and we'll see you next time. (bright music)

Published Date : Jan 8 2022

Predictions 2022: Top Analysts See the Future of Data


 

(bright music) >> In the 2010s, organizations became keenly aware that data would become the key ingredient to driving competitive advantage, differentiation, and growth. But to this day, putting data to work remains a difficult challenge for many, if not most organizations. Now, as the cloud matures, it has become a game changer for data practitioners by making cheap storage and massive processing power readily accessible. We've also seen better tooling in the form of data workflows, streaming, machine intelligence, AI, developer tools, security, observability, automation, new databases and the like. These innovations accelerate data proficiency, but at the same time, they add complexity for practitioners. Data lakes, data hubs, data warehouses, data marts, data fabrics, data meshes, data catalogs, data oceans are forming, they're evolving and exploding onto the scene. So in an effort to bring perspective to the sea of optionality, we've brought together the brightest minds in the data analyst community to discuss how data management is morphing and what practitioners should expect in 2022 and beyond. Hello everyone, my name is Dave Vellante with theCUBE, and I'd like to welcome you to a special Cube presentation, Analyst Predictions 2022: The Future of Data Management. We've gathered six of the best analysts in data and data management who are going to present and discuss their top predictions and trends for 2022 and the first half of this decade. Let me introduce our six power panelists. Sanjeev Mohan is a former Gartner Analyst and Principal at SanjMo. Tony Baer is principal at dbInsight. Carl Olofson is a well-known Research Vice President with IDC. Dave Menninger is Senior Vice President and Research Director at Ventana Research. Brad Shimmin is Chief Analyst, AI Platforms, Analytics and Data Management at Omdia. And Doug Henschen is Vice President and Principal Analyst at Constellation Research. Gentlemen, welcome to the program and thanks for coming on theCUBE today. >> Great to be here. >> Thank you. >> All right, here's the format we're going to use. I, as moderator, am going to call on each analyst separately, who then will deliver their prediction or mega trend, and then, in the interest of time management and pace, two analysts will have the opportunity to comment. If we have more time, we'll elongate it, but let's get started right away. Sanjeev Mohan, please kick it off. You want to talk about governance, go ahead sir. >> Thank you Dave. I believe that data governance, which we've been talking about for many years, is now not only going to be mainstream, it's going to be table stakes. And all the things that you mentioned, you know, the data oceans, data lakes, lake houses, data fabrics, meshes, the common glue is metadata. If we don't understand what data we have and how we are governing it, there is no way we can manage it. So we saw Informatica went public last year after a hiatus of six years. I'm predicting that this year we see some more companies go public. My bet is on Collibra, most likely, and maybe Alation, we'll see go public this year. I'm also predicting that the scope of data governance is going to expand beyond just data. It's not just data and reports. We are going to see more transformations, like Spark jobs, Python, even Airflow. We're going to see more streaming data, so from Kafka, Schema Registry, for example. We will see AI models become part of this whole governance suite.
So the governance suite is going to be very comprehensive, very detailed lineage, impact analysis, and then even expand into data quality. We've already seen that happen with some of the tools, where they are buying these smaller companies and bringing in data quality monitoring and integrating it with metadata management, data catalogs, also data access governance. So what we are going to see is that once the data governance platforms become the key entry point into these modern architectures, I'm predicting that the usage, the number of users, of a data catalog is going to exceed that of a BI tool. That will take time, and we've already seen that trajectory. Right now, if you look at BI tools, I would say there are a hundred users of a BI tool to one data catalog. And I see that evening out over a period of time, and at some point data catalogs will really become the main way for us to access data. The data catalog will help us visualize data, but if we want to do more in-depth analysis, it'll be the jumping off point into the BI tool, the data science tool, and that is the journey I see for the data governance products. >> Excellent, thank you. Some comments. Maybe Doug, a lot of things to weigh in on there, maybe you can comment. >> Yeah, Sanjeev, I think you're spot on, a lot of the trends. The one disagreement: I think it's really still far from mainstream. As you say, we've been talking about this for years, it's like God, motherhood, apple pie, everyone agrees it's important, but too few organizations are really practicing good governance, because it's hard and because the incentives have been lacking. I think one thing that deserves mention in this context is ESG mandates and guidelines; these are environmental, social and governance regs and guidelines. We've seen the environmental regs and guidelines imposed in industries, particularly the carbon-intensive industries. We've seen the social mandates, particularly diversity, imposed on suppliers by companies that are leading on this topic. We've seen governance guidelines now being imposed by banks on investors. So these ESGs are presenting new carrots and sticks, and it's going to demand more solid data. It's going to demand more detailed reporting and solid reporting, tighter governance. But we're still far from mainstream adoption. We have a lot of, you know, best-of-breed niche players in the space. I think the signs that it's going to be more mainstream are starting with things like Azure Purview, Google Dataplex; the big cloud platform players seem to be upping the ante and starting to address governance. >> Excellent, thank you Doug. Brad, I wonder if you could chime in as well. >> Yeah, I would love to be a believer in data catalogs. But to Doug's point, I think that it's going to take some more pressure for that to happen. I recall metadata being something every enterprise thought they were going to get under control when we were working on service-oriented architecture back in the nineties, and that didn't happen quite the way we anticipated. And so, to Sanjeev's point, it's because it is really complex and really difficult to do. My hope is that, you know, we won't sort of, how do I put this? Fade out into this nebula of domain catalogs that are specific to individual use cases, like Purview for getting data quality right, or like data governance and cybersecurity. And instead we have some tooling that can actually be adaptive to gather metadata to create something. And I know it's important to you, Sanjeev, and that is this idea of observability.
If you can get enough metadata without moving your data around, but understanding the entirety of a system that's running on this data, you can do a lot to help with the governance that Doug is talking about. >> So I just want to add that data governance, like many other initiatives, did not succeed; even AI went into an AI winter, but that's a different topic. A lot of these things did not succeed because, to your point, the incentives were not there. I remember when Sarbanes-Oxley had come onto the scene, if a bank did not do Sarbanes-Oxley, they were very happy to pay a million dollar fine. That was like, you know, pocket change for them, instead of doing the right thing. But I think the stakes are much higher now. With GDPR, the flood gates opened. Now, you know, California, you know, has CCPA, but even CCPA is being outdated with CPRA, which is much more GDPR-like. So we are very rapidly entering a space where pretty much every major country in the world is coming up with its own compliance regulatory requirements; data residency is becoming really important. And I think we are going to reach a stage where it won't be optional anymore. So whether we like it or not... And I think the reason data catalogs were not successful in the past is because we did not have the right focus on adoption. We were focused on features, and these features were disconnected, very hard for business to adopt. These were built by IT people for IT departments to take a look at technical metadata, not business metadata. Today the tables have turned. CDOs are driving this initiative, regulatory compliances are beating down hard, so I think the time might be right. >> Yeah, so guys, we have to move on here. But there's some real meat on the bone here, Sanjeev. I like the fact that you called out Collibra and Alation, so we can look back a year from now and say, okay, he made the call, he stuck it. And then the ratio of BI tools to data catalogs, that's another sort of measurement that we can take, even though with some skepticism there, that's something that we can watch. And I wonder if someday we'll have more metadata than data. But I want to move to Tony Baer. You want to talk about data mesh, and speaking, you know, coming off of governance, I mean, wow, you know, the whole concept of data mesh is decentralized data, and then governance becomes, you know, a nightmare there, but take it away, Tony. >> Well, put it this way: data mesh, you know, the idea, at least as proposed by ThoughtWorks, basically came out at least a couple of years ago, and the press has been almost uniformly uncritical. A good reason for that is all the problems that basically Sanjeev and Doug and Brad were just speaking about, which is that we have all this data out there and we don't know what to do about it. Now, that's not a new problem. That was a problem we had in enterprise data warehouses, it was a problem when we had Hadoop data clusters, and it's even more of a problem now that data is out in the cloud, where the data is not only in your data lake, it's not only in S3, it's all over the place. And it's also including streaming, which I know we'll be talking about later. So the data mesh was a response to that, the idea being that, you know, the folks who really know best about governance are the domain experts. So basically data mesh was an architectural pattern and a process. My prediction for this year is that data mesh is going to hit cold, hard reality.
Because if you do a Google search, basically the published work, the articles on data mesh have been largely, you know, pretty uncritical so far, basically lauding it as being a very revolutionary new idea. I don't think it's that revolutionary, because we've talked about ideas like this. Brad, now, you and I met years ago when we were talking about SOA and decentralization, but it was at the application level. Now we're talking about it at the data level. And now we have microservices. So there's this thought: if we're deconstructing apps in cloud native to microservices, why don't we think of data in the same way? My sense this year, and this has been a very active term if you look at Google search trends, is that now companies, like enterprises, are going to look at this seriously. And as they look at it seriously, it's going to attract its first real hard scrutiny, it's going to attract its first backlash. That's not necessarily a bad thing. It means that it's being taken seriously. The reason why I think that you'll start to see basically the cold, hard light of day shine on data mesh is that it's still a work in progress. You know, this idea is basically a couple of years old, and there's still some pretty major gaps. The biggest gap is in the area of federated governance. Now, federated governance itself is not a new issue. With federated governance, the decision we're still figuring out is, how can we basically strike the balance between, let's say, consistent enterprise policy, consistent enterprise governance, and yet the groups that understand the data and know how to use it? How do we basically sort of balance the two? There's a huge gap there in practice and knowledge. Also, to a lesser extent, there's a technology gap, which is basically in the self-service technologies that will help teams essentially govern data, you know, basically through the full life cycle, from selecting the data, to building the pipelines, to determining your access control, looking at quality, looking at basically whether the data is fresh or whether it's trending off course. So my prediction is that it will receive its first harsh scrutiny this year. You are going to see some organizations and enterprises declare premature victory when they build some federated query implementations. You're going to see vendors start to data mesh wash their products: anybody in the data management space, whether it's basically a pipelining tool, whether it's basically ELT, whether it's a catalog or federated query tool, they are all going to be, you know, basically promoting the fact of how they support this. Hopefully nobody's going to call themselves a data mesh tool, because data mesh is not a technology. We're going to see one other thing come out of this, and this harks back to the metadata that Sanjeev was talking about, and the catalog that he was just talking about, which is that there's going to be a new focus, a renewed focus, on metadata. And I think that's going to spur interest in data fabrics. Now, data fabrics are pretty vaguely defined, but if we just take the most elemental definition, which is a common metadata back plane, I think that if anybody is going to get serious about data mesh, they need to look at the data fabric, because we all, at the end of the day, need to speak, you know, need to read from the same sheet of music.
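As a concrete aside on what federated governance might look like in practice, here is one hypothetical way a domain team could describe a data product so that an enterprise-wide policy check, the common metadata back plane Tony mentions, can read it. Every field name and rule below is an invented convention for illustration; as the panel notes, data mesh is a process, not a technology, so no standard schema like this exists.

```python
# An illustrative "data product" descriptor, owned by one domain team.
# All field names, addresses, and policies here are hypothetical.
order_events_product = {
    "domain": "order-management",             # owning domain, per the data mesh idea
    "name": "order_events",
    "owner": "order-team@example.com",        # hypothetical contact
    "output_port": "s3://example-bucket/order_events/",  # hypothetical location
    "schema": {"order_id": "string", "amount": "decimal", "ts": "timestamp"},
    "freshness_sla_minutes": 15,              # self-declared quality guarantee
    "access_policy": "pii-restricted",        # hook for enterprise-wide policy
}

def check_federated_policy(product: dict, enterprise_rules: dict) -> list:
    """Apply enterprise-wide rules to a domain-owned product descriptor:
    the 'federated' part, balancing central policy with domain ownership."""
    violations = []
    if product["access_policy"] not in enterprise_rules["allowed_policies"]:
        violations.append("access policy not recognized by enterprise governance")
    if product["freshness_sla_minutes"] > enterprise_rules["max_freshness_minutes"]:
        violations.append("freshness SLA looser than the enterprise requires")
    return violations

rules = {"allowed_policies": {"public", "internal", "pii-restricted"},
         "max_freshness_minutes": 60}
print(check_federated_policy(order_events_product, rules))  # [] means compliant
```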
>> So thank you Tony. Dave Menninger, I mean, one of the things that people like about data mesh is that it pretty crisply articulates some of the flaws in today's organizational approaches to data. What are your thoughts on this? >> Well, I think we have to start by defining data mesh, right? The term is already getting corrupted, right? Tony said it's going to see the cold hard light of day. And there's a problem right now that there are a number of overlapping terms that are similar but not identical. So we've got data virtualization, data fabric, excuse me for a second. (clears throat) Sorry about that. Data virtualization, data fabric, data federation, right? So I think that it's not really clear what each vendor means by these terms. I see data mesh and data fabric becoming quite popular. I've interpreted data mesh as referring primarily to the governance aspects, as originally intended and specified. But that's not the way I see vendors using it. I see vendors using it much more to mean data fabric and data virtualization. So I'm going to comment on the group of those things. I think the group of those things is going to happen. They're going to happen, they're going to become more robust. Our research suggests that a quarter of organizations are already using virtualized access to their data lakes, and another half, so a total of three quarters, will eventually be accessing their data lakes using some sort of virtualized access. Again, whether you define it as mesh or fabric or virtualization isn't really the point here. But this notion that there are different elements of data, metadata and governance within an organization that all need to be managed collectively. The interesting thing is when you look at the satisfaction rates of those organizations using virtualization versus those that are not, it's almost double: 68% of organizations, I'm sorry, 79% of organizations that were using virtualized access express satisfaction with their access to the data lake. Only 39% express satisfaction if they weren't using virtualized access. >> Oh, thank you Dave. Sanjeev, we've just got about a couple of minutes on this topic, but I know you've spoken on a panel with (indistinct), who sort of invented the concept. Governance obviously is a big sticking point, but what are your thoughts on this? You're on mute. (panelist chuckling) >> So my message to (indistinct) and to the community is, as opposed to what they said, let's not define it. We spent a whole year defining it; there are four principles: domain, product, data infrastructure, and governance. Let's take it to the next level. I get a lot of questions on, what is the difference between data fabric and data mesh? And I'm like, I can't compare the two, because data mesh is a business concept, data fabric is a data integration pattern. How do you compare the two? You have to bring data mesh a level down. So to Tony's point, I'm on a warpath in 2022 to take it down to, what does a data product look like? How do we handle shared data across domains and governance? And I think we are going to see more of that in 2022, or the "operationalization" of data mesh. >> I think we could have a whole hour on this topic, couldn't we? Maybe we should do that. But let's move on to Carl. So Carl, you're a database guy, you've been around that block for a while now, you want to talk about graph databases, bring it on. >> Oh yeah. Okay, thanks.
So I regard graph databases as basically the next truly revolutionary database management technology. I have a forecast for the graph database market, which of course we haven't defined yet, so obviously I have a little wiggle room in what I'm about to say. But this market will grow by about 600% over the next 10 years. Now, 10 years is a long time. But over the next five years, we expect to see gradual growth as people start to learn how to use it. The problem is not that it's not useful, it's that people don't know how to use it. So let me explain, before I go any further, what a graph database is, because some of the folks on the call may not know what it is. A graph database organizes data according to a mathematical structure called a graph. The graph has elements called nodes and edges. So a data element drops into a node, the nodes are connected by edges, the edges connect one node to another node. Combinations of edges create structures that you can analyze to determine how things are related. In some cases, the nodes and edges can have properties attached to them, which add additional informative material that makes it richer; that's called a property graph. There are two principal kinds. There are semantic graphs, which are used to break down human language text into semantic structures. Then you can search it, organize it and answer complicated questions. A lot of AI is aimed at semantic graphs. Another kind is the property graph that I just mentioned, which has a dazzling number of use cases. I want to just point out, as I talk about this, people are probably wondering, well, we have relational databases, isn't that good enough? So a relational database defines... It supports what I call definitional relationships. That means you define the relationships in a fixed structure. The data drops into that structure; there's a value, a foreign key value, that relates one table to another, and that value is fixed. You don't change it. If you change it, the database becomes unstable; it's not clear what you're looking at. In a graph database, the system is designed to handle change, so that it can reflect the true state of the things that it's being used to track. So let me just give you some examples of use cases for this. They include entity resolution, data lineage, social media analysis, Customer 360, fraud prevention. There's cybersecurity; supply chain is a big one, actually. There is explainable AI, and this is going to become important too, because a lot of people are adopting AI. But they want a system after the fact to say, how did the AI system come to that conclusion? How did it make that recommendation? Right now we don't have really good ways of tracking that. Machine learning in general, social network, I already mentioned that. And then we've got, oh gosh, we've got data governance, data compliance, risk management. We've got recommendation, we've got personalization, anti-money laundering, that's another big one, identity and access management, network and IT operations is already becoming a key one, where you actually have mapped out your operation, you know, whatever it is, your data center, and you can track what's going on as things happen there, root cause analysis; fraud detection is a huge one.
A number of major credit card companies use graph databases for fraud detection, risk analysis, tracking and tracing, churn analysis, next best action, what-if analysis, impact analysis, entity resolution. And I would add one other thing, or just a few other things, to this list: metadata management. So Sanjeev, here you go, this is your engine. Because I was in metadata management for quite a while in my past life, and one of the things I found was that none of the data management technologies that were available to us could efficiently handle metadata, because of the kinds of structures that result from it, but graphs can, okay? Graphs can do things like say, this term in this context means this, but in that context, it means that, okay? Things like that. And in fact, logistics management, supply chain. And also because it handles recursive relationships; by recursive relationships I mean objects that own other objects that are of the same type. You can do things like bill of materials, you know, so like parts explosion. Or you can do an HR analysis, who reports to whom, how many levels up the chain, and that kind of thing. You can do that with relational databases, but it takes a lot of programming. In fact, you can do almost any of these things with relational databases, but the problem is, you have to program it. It's not supported in the database. And whenever you have to program something, that means you can't trace it, you can't define it, you can't publish it in terms of its functionality, and it's really, really hard to maintain over time. >> Carl, thank you. I wonder if we could bring Brad in. I mean, Brad, I'm sitting here wondering, okay, is this incremental to the market? Is it disruptive and replacement? What are your thoughts on this space? >> It's already disrupted the market. I mean, like Carl said, go to any bank and ask them, are you using graph databases to get fraud detection under control? And they'll say, absolutely, that's the only way to solve this problem. And it is, frankly. And it's the only way to solve a lot of the problems that Carl mentioned. And that is, I think, its Achilles' heel in some ways. Because, you know, it's like finding the best way to cross the seven bridges of Koenigsberg. You know, it's always going to kind of be tied to those use cases, because it's really special and it's really unique, and because it's special and it's unique, it still unfortunately kind of stands apart from the rest of the community that's building, let's say, AI outcomes, as a great example here. Graph databases and AI, as Carl mentioned, are like chocolate and peanut butter. But technologically, they don't know how to talk to one another; they're completely different. And you know, you can't just stand up SQL and query them. You've got to learn, what is it, Carl? Cypher? Yeah, thank you, to actually get to the data in there. And if you're going to scale that data, that graph database, especially a property graph, if you're going to do something really complex, like try to understand, you know, all of the metadata in your organization, you might just end up with, you know, a graph database winter, like we had the AI winter, simply because you run out of performance to make the thing happen. So, I think it's already disrupted, but we need to treat it like a first-class citizen in the data analytics and AI community. We need to bring it into the fold. We need to equip it with the tools it needs to do the magic it does, and to do it not just for specialized use cases, but for everything. 'Cause I'm with Carl. I think it's absolutely revolutionary.
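To make Carl's nodes-and-edges picture concrete, here is a small sketch of his who-reports-to-whom example as a property graph, using the open-source networkx library. A real graph database would express the traversal in a query language such as Cypher or Gremlin, as Brad notes; the names and titles below are made up for illustration.

```python
# A small property graph of Carl's "who reports to whom" example,
# built with networkx. Nodes and edges both carry properties.
import networkx as nx

g = nx.DiGraph()
# Nodes carry properties (the "property graph" idea); names are invented.
g.add_node("ana", title="CEO")
g.add_node("raj", title="VP Engineering")
g.add_node("mei", title="Engineer")
# Edges carry the relationship; direction means "reports to".
g.add_edge("raj", "ana", rel="REPORTS_TO")
g.add_edge("mei", "raj", rel="REPORTS_TO")

def chain_of_command(graph, person):
    """Follow REPORTS_TO edges recursively: the kind of query that takes
    custom programming in SQL but is native to the graph model."""
    chain = []
    current = person
    while True:
        bosses = [b for b in graph.successors(current)
                  if graph.edges[current, b]["rel"] == "REPORTS_TO"]
        if not bosses:
            return chain
        current = bosses[0]
        chain.append(current)

print(chain_of_command(g, "mei"))  # ['raj', 'ana'], two levels up the chain
```

The same recursive walk in a relational schema would need either application code or a recursive SQL query over a self-referencing foreign key, which is Carl's point about relationships living in the data rather than in a fixed structure.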
>> Brad identified the principal Achilles' heel of the technology, which is scaling. When these things get large and complex enough that they spill over what a single server can handle, you start to have difficulties, because the relationships span things that have to be resolved over a network, and then you get network latency, and that slows the system down. So that's still a problem to be solved. >> Sanjeev, any quick thoughts on this? I mean, I think metadata on the word cloud is going to be the largest font, but what are your thoughts here? >> I want to (indistinct) so people don't associate me with only metadata, so I want to talk about something slightly different. dbengines.com has done an amazing job. I think almost everyone knows that they chronicle all the major databases that are in use today. In January of 2022, there are 381 databases on a ranked list of databases. The largest category is RDBMS. The second largest category is actually divided into two: property graphs and RDF graphs. These two together make up the second largest number of databases. So talking about the Achilles' heel, this is a problem. The problem is that there are so many graph databases to choose from. They come in different shapes and forms. To Brad's point, there are so many query languages. In RDBMS, it's SQL, I know the story, but here we've got Cypher, we've got Gremlin, we've got GQL, and then there are proprietary languages. So I think there's a lot of disparity in this space. >> Well, excellent. All excellent points, Sanjeev, if I must say. And that is a problem, that the languages need to be sorted and standardized. People need to have a roadmap as to what they can do with it. Because, as you say, you can do so many things, and so many of those things are unrelated, that you sort of say, well, what do we use this for? And I'm reminded of a saying I learned a bunch of years ago, when somebody said that the digital computer is the only tool man has ever devised that has no particular purpose. (panelists chuckle) >> All right guys, we got to move on to Dave Menninger. We've heard about streaming. Your prediction is in that realm, so please take it away. >> Sure. So I like to say that historical databases are going to become a thing of the past. By that I don't mean that they're going to go away; that's not my point. I mean, we need historical databases, but streaming data is going to become the default way in which we operate with data. So in the next, say, three to five years, I would expect that data platforms, and we're using the term data platforms to represent the evolution of databases and data lakes, that the data platforms will incorporate these streaming capabilities. We're going to process data as it streams into an organization, and then it's going to roll off into historical databases. So historical databases don't go away, but they become a thing of the past. They store the data that occurred previously. And as data is occurring, we're going to be processing it, we're going to be analyzing it, we're going to be acting on it. I mean, we only ever ended up with historical databases because we were limited by the technology that was available to us. Data doesn't occur in batches. But we processed it in batches, because that was the best we could do.
And it wasn't bad, and we've continued to improve, and we've improved, and we've improved. But streaming data today is still the exception. It's not the rule, right? There are projects within organizations that deal with streaming data, but it's not the default way in which we deal with data yet. And so that's my prediction: this is going to change, we're going to have streaming data be the default way in which we deal with data. And however you label it and whatever you call it, you know, maybe these databases and data platforms just evolve to be able to handle it, but we're going to deal with data in a different way. And our research shows that already about half of the participants in our analytics and data benchmark research are using streaming data. You know, another third are planning to use streaming technologies. So that gets us to about eight out of 10 organizations that need to use this technology. And that doesn't mean they have to use it throughout the whole organization, but it's pretty widespread in its use today, and it has continued to grow. If you think about the consumerization of IT, we've all been conditioned to expect immediate access to information, immediate responsiveness. You know, we want to know if an item is on the shelf at our local retail store, and we can go in and pick it up right now. You know, that's the world we live in, and that's spilling over into the enterprise IT world. We have to provide those same types of capabilities. So that's my prediction: historical databases become a thing of the past, streaming data becomes the default way in which we operate with data. >> All right, thank you David. Well, so what say you, Carl, the guy who has followed historical databases for a long time? >> Well, one thing, actually, every database is historical, because as soon as you put data in it, it's now history. It'll no longer reflect the present state of things. But even if that history is only a millisecond old, it's still history. But I would say, I mean, I know you're trying to be a little bit provocative in saying this, Dave, 'cause you know as well as I do that people still need to do their taxes, they still need to do accounting, they still need to run general ledger programs and things like that. That all involves historical data. That's not going to go away unless you want to go to jail. So you're going to have to deal with that. But as far as the leading-edge functionality, I'm totally with you on that. And I'm just, you know, kind of wondering if this requires a change in the way that we perceive applications in order for it to truly be manifested, rethinking the way applications work. Saying that an application should respond instantly, as soon as the state of things changes. What do you say about that? >> I think that's true. I think we do have to think about things differently. It's not the way we designed systems in the past. We're seeing more and more systems designed that way. But again, it's not the default. And I agree 100% with you that we do need historical databases, you know, that's clear. And even some of those historical databases will be used in conjunction with the streaming data, right? >> Absolutely. I mean, you know, let's take the data warehouse example, where you're using the data warehouse as the context and the streaming data as the present, and you're saying, here's the sequence of things that's happening right now. Have we seen that sequence before? And where? What does that pattern look like in past situations? And can we learn from that?
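A minimal sketch of the two ideas in this exchange: acting on each event the moment it arrives, Dave's streaming-by-default point, and consulting the historical store to ask whether the live pattern has been seen before, Carl's point. The event values and the spike threshold are invented for illustration; a production system would read from something like Kafka rather than a Python list.

```python
# Act on each event as it streams in, then let it roll off into history,
# and use that history as context for the live pattern.
from collections import deque

history = []                  # stands in for the historical database
window = deque(maxlen=5)      # small rolling window over the live stream

def seen_before(sequence, past):
    """Does the current event sequence appear anywhere in the historical store?"""
    n = len(sequence)
    return any(past[i:i + n] == sequence for i in range(len(past) - n + 1))

def on_event(value):
    """Process each event the moment it arrives, then roll it off to history."""
    if window:
        avg = sum(window) / len(window)
        if value > 3 * avg:                      # act in real time on a spike...
            pattern = list(window)[-2:] + [value]
            # ...and ask the historical store: is this a familiar sequence?
            print(f"spike {value}; seen before: {seen_before(pattern, history)}")
    window.append(value)
    history.append(value)                        # the event becomes history

for v in [10, 11, 12, 95, 10, 11, 12, 95]:       # simulated stream
    on_event(v)
# First spike prints "seen before: False"; the second prints "True",
# because by then the same lead-up pattern exists in the historical store.
```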
>> So Tony Baer, I wonder if you could comment? I mean, when you think about, you know, real-time inferencing at the edge, for instance, which is something that a lot of people talk about, a lot of what we're discussing here in this segment, it looks like it's got great potential. What are your thoughts? >> Yeah, I mean, I think you nailed it, you hit it right on the head there. What I'm seeing, and I'm going to split this one down the middle, is that I don't see that basically streaming becomes the default. What I see is streaming and transaction databases and analytics data, you know, data warehouses, data lakes, whatever, converging. And what allows us technically to converge is cloud native architecture, where you can basically distribute things. So you can have a node here that's doing the real-time processing, that's also doing, and this is where it leads in, maybe some of that real-time predictive analytics, to take a look at, well look, we're looking at this customer journey, what's happening with what the customer is doing right now, and this is correlated with what other customers are doing. So the thing is that in the cloud, you can basically partition this, and because of basically the speed of the infrastructure, you can basically bring these together and kind of orchestrate them in sort of a loosely coupled manner. The other part is that the use cases are demanding it, and part of this goes back to what Dave is saying: you know, when you look at Customer 360, when you look at, let's say, smart utility products, when you look at any type of operational problem, it has a real-time component and it has an historical component, and having predictive... So, you know, my sense here is that technically we can bring this together through the cloud, and I think the use case is that we can apply some real-time sort of predictive analytics on these streams and feed this into the transactions, so that when we make a decision in terms of what to do as a result of a transaction, we have this real-time input. >> Sanjeev, did you have a comment? >> Yeah, I was just going to say that, to Dave's point, you know, we have to think of streaming very differently, because in the historical databases, we used to bring the data and store the data, and then we used to run rules on top, aggregations and all. But in the case of streaming, the mindset changes, because the rules, the inference, all of that is fixed, but the data is constantly changing. So it's a completely reversed way of thinking and building applications on top of that. >> So Dave Menninger, there seems to be some disagreement about the default. What kind of timeframe are you thinking about? Is it end of decade that it becomes the default? Where would you pin it? >> I think around, you know, between five and 10 years, I think this becomes the reality. >> I think it's... >> It'll be more and more common between now and then, but it becomes the default. And I also want, Sanjeev, at some point, maybe in one of our subsequent conversations, we need to talk about governing streaming data, 'cause that's a whole nother set of challenges. >> We've also talked about it rather in two dimensions, historical and streaming, and there's lots of low latency, micro-batch, sub-second, that's not quite streaming, but in many cases it's fast enough, and we're seeing a lot of adoption of near real-time, not quite real-time, as good enough for many applications.
(indistinct cross talk from panelists) >> Because nobody's really taking the hardware dimension (mumbles). >> That'll just happen, Carl. (panelists laughing) So, near real-time. But maybe before you lose the customer, however we define that, right? Okay, let's move on to Brad. Brad, you want to talk about automation, AI, the pipeline; people feel like, hey, we can just automate everything. What's your prediction? >> Yeah, I'm an AI aficionado, so apologies in advance for that. But, you know, I think that we've been seeing automation play within AI for some time now, and it's helped us do a lot of things, especially for practitioners that are building AI outcomes in the enterprise. It's helped them to fill skills gaps, it's helped them to speed development, and it's helped them to actually make AI better. 'Cause it, you know, in some ways provides some swim lanes, and, for example, with technologies like AutoML, can auto-document and create that sort of transparency that we talked about a little bit earlier. But I think there's an interesting kind of convergence happening with this idea of automation, and that is that we've had the automation that started happening for practitioners. It's trying to move outside of the traditional bounds of things like, I'm just trying to get my features, I'm just trying to pick the right algorithm, I'm just trying to build the right model, and it's expanding across that full life cycle of building an AI outcome, to start at the very beginning of data and to then continue on to the end, which is this continuous delivery and continuous automation of that outcome to make sure it's right and it hasn't drifted, and stuff like that. And because of that, because it's become kind of powerful, we're starting to actually see this weird thing happen where the practitioners are starting to converge with the users. And that is to say that, okay, if I'm in Tableau right now, I can stand up Salesforce Einstein Discovery, and it will automatically create a nice predictive algorithm for me, given the data that I pull in. But what's starting to happen, and we're seeing this from the companies that create business software, so Salesforce, Oracle, SAP, and others, is that they're starting to actually use these same ideals and a lot of deep learning (chuckles) to basically stand up these out-of-the-box, flip-a-switch, and you've got an AI outcome at the ready for business users. And I am very much, you know, I think that's the way that it's going to go, and what it means is that AI is slowly disappearing. And I don't think that's a bad thing. I think, if anything, what we're going to see in 2022, and maybe into 2023, is this sort of rush to put this idea of disappearing AI into practice and have as many of these solutions in the enterprise as possible. You can see, like for example, SAP is going to roll out this quarter this thing called adaptive recommendation services, which basically is a cold-start AI outcome that can work across a whole bunch of different vertical markets and use cases. It's just a recommendation engine for whatever you need it to do in the line of business. So basically, you're an SAP user, you turn on your software one day, you're a sales professional, let's say, and suddenly you have a recommendation for customer churn. Boom! It's going... that's great. Well, I don't know, I think that's terrifying.
In some ways I think it is the future, that AI is going to disappear like that, but I'm absolutely terrified of it, because I think that what it really does is it calls attention to a lot of the issues that we already see around AI, specific to this idea of what we like to call at Omdia responsible AI. Which is, you know, how do you build an AI outcome that is free of bias, that is inclusive, that is fair, that is safe, that is secure, that is auditable, et cetera, et cetera. It takes a lot of work to do. And so if you imagine a customer that's just a Salesforce customer, let's say, and they're turning on Einstein Discovery within their sales software, you need some guidance to make sure that when you flip that switch, the outcome you're going to get is correct. And that's going to take some work. And so I think we're going to see this move to "let's roll this out," and suddenly there's going to be a lot of problems, a lot of pushback that we're going to see. And some of that's going to come from GDPR and others that Sanjeev was mentioning earlier. A lot of it is going to come from internal CSR requirements within companies that are saying, "Hey, whoa, hold up, we can't do this all at once. Let's take the slow route, let's make AI automated in a smart way." And that's going to take time. >> Yeah, so a couple of predictions there that I heard. AI essentially disappears, it becomes invisible, maybe, if I can restate that. And then, if I understand it correctly, Brad, you're saying there's a backlash in the near term. People will say, oh, slow down, let's automate what we can. Those attributes that you talked about are non-trivial to achieve. Is that why you're a bit of a skeptic? >> Yeah. I think that we don't have any sort of standards that companies can look to and understand. And we certainly, within these companies, especially those that haven't already stood up an internal data science team, they don't have the knowledge to understand, when they flip that switch for an automated AI outcome, that it's going to do what they think it's going to do. And so we need some sort of standard methodology and practice, best practices, that every company that's going to consume this invisible AI can make use of. And one of the things that, you know, sort of started, that Google kicked off a few years back, that's picking up some momentum, and the companies I just mentioned are starting to use it, is this idea of model cards, where at least you have some transparency about what these things are doing. You know, so like for the SAP example, we know, for example, that it's a convolutional neural network with a long short-term memory model that it's using, we know that it only works on Roman English, and therefore me as a consumer can say, "Oh, well, I know that I need to do this internationally, so I should not just turn this on today." >> Thank you. Carl, could you add anything, any context here? >> Yeah, we've talked about some of the things Brad mentioned here at IDC in our Future of Intelligence group, regarding in particular the moral and legal implications of having a fully automated, you know, AI-driven system. Because we already know, and we've seen, that AI systems are biased by the data that they get, right?
So if they get data that pushes them in a certain direction, I think there was a story last week about an HR system that was recommending promotions for White people over Black people, because in the past, you know, White people were promoted and rated more productive than Black people, but it had no context as to why, which is, you know, because Black people were being historically discriminated against, but the system doesn't know that. So, you know, you have to be aware of that. And I think that at the very least, there should be controls when a decision has either a moral or legal implication. When you really need a human judgment, it could lay out the options for you, but a person actually needs to authorize that action. And I also think that we always will have to be vigilant regarding the kind of data we use to train our systems, to make sure that it doesn't introduce unintended biases. To some extent, they always will. So we'll always be chasing after them. But that's (indistinct). >> Absolutely Carl, yeah. I think that what you have to bear in mind as a consumer of AI is that it is a reflection of us, and we are a very flawed species. And so if you look at all of the really fantastic, magical looking super models we see, like GPT-3 and the GPT-4 that's coming out, they're xenophobic and hateful, because the data that they're built upon, the algorithms, and the people that build them are us. So AI is a reflection of us. We need to keep that in mind. >> Yeah, the AI is biased 'cause humans are biased. All right, great. All right, let's move on. Doug, you mentioned, you know, a lot of people said that data lake, that term, is not going to live on, but it seems to be, we have some lakes here. You want to talk about lake house? Bring it on. >> Yes, I do. My prediction is that lake house and this idea of a combined data warehouse and data lake platform is going to emerge as the dominant data management offering. I say offering; that doesn't mean it's going to be the dominant thing that organizations have out there, but it's going to be the predominant vendor offering in 2022. Now, heading into 2021, we already had Cloudera, Databricks, Microsoft, Snowflake as proponents; in 2021, SAP, Oracle, and several of these fabric, virtualization and mesh vendors joined the bandwagon. The promise is that you have one platform that manages your structured, unstructured and semi-structured information, and it addresses both the BI analytics needs and the data science needs. The real promise there is simplicity and lower cost. But I think end users have to answer a few questions. The first is, does your organization really have a center of data gravity, or is the data highly distributed? Multiple data warehouses, multiple data lakes, on premises, cloud. If it's very distributed and you'd have difficulty consolidating, and that's not really a goal for you, then maybe that single platform is unrealistic and not likely to add value to you. You know, also the fabric and virtualization vendors, the mesh idea: if you have this highly distributed situation, that might be a better path forward. The second question, if you are looking at one of these lake house offerings, you are looking at consolidating, simplifying, bringing together to a single platform: you have to make sure that it meets both the warehouse need and the data lake need. So you have vendors like Databricks, Microsoft with Azure Synapse.
Relatively new to the data warehouse space, and they're having to prove that the data warehouse capabilities on their platforms can meet the scaling requirements, can meet the user and query concurrency requirements, meet those tight SLAs. And then on the other hand, you have Oracle, SAP, Snowflake, the data warehouse folks coming into the data science world, and they have to prove that they can manage the unstructured information and meet the needs of the data scientists. I'm seeing a lot of the lake house offerings from the warehouse crowd managing that unstructured information in columns and rows. And some of these vendors, Snowflake in particular, are really relying on partners for the data science needs. So you really have to look at a lake house offering and make sure that it meets both the warehouse and the data lake requirement. >> Thank you, Doug. Well, Tony, if those two worlds are going to come together, as Doug was saying, the analytics and the data science world, does there need to be some kind of semantic layer in between? I don't know. Where are you on this topic? >> (chuckles) Oh, didn't we talk about data fabrics before? Common metadata layer (chuckles). Actually, I'm almost tempted to say let's declare victory and go home, in that this has actually been going on for a while. I actually agree with, you know, much of what Doug is saying there. Which is that, I mean, I remember as far back as, I think it was like 2014, I was doing a study, I was still at Ovum, (indistinct) Omdia, looking at all these specialized databases that were coming up and seeing that, you know, there's overlap at the edges. But yet, there was still going to be a reason at the time that you would have, let's say, a document database for JSON, you'd have a relational database for transactions and for data warehouse, and you had basically something at that time that resembled Hadoop for what we'd consider your data lake. Fast forward, and the thing is, what I was seeing at the time is that they were sort of blending at the edges. That was, say, about five to six years ago. And the lake house is essentially the current manifestation of that idea. There is a dichotomy in terms of, you know, it's the old argument: do we centralize this all, you know, in a single place, or do we virtualize? And I think it's always going to be a union, and there's never going to be a single silver bullet. I do see that there are also going to be questions, and these are points that Doug raised. That, you know, what do you need for your performance characteristics? Do you need, for instance, high concurrency? Do you need the ability to do some very sophisticated joins? Or is your requirement more to be able to distribute the processing, you know, as far as possible, to essentially do a kind of a brute force approach? All these approaches are valid based on the use case. I just see that essentially the lake house is the culmination of, well, it's nothing new; it's a relatively new term introduced by Databricks a couple of years ago, but this is the culmination of basically what's been a long-time trend. And what we see in the cloud is that we start seeing data warehouses offer, as a checkbox item, the ability to say, "Hey, we can basically source data in cloud storage, in S3, Azure Blob Store, you know, whatever, as long as it's in certain formats, like, you know, Parquet or CSV or something like that." I see that as becoming kind of a checkbox item.
So to that extent, I think that the lake house, depending on how you define it, is already reality. And in some cases, maybe new terminology, but not a whole heck of a lot new under the sun. >> Yeah. And Dave Menninger, I mean, a lot of these, thank you Tony, but a lot of this is going to come down to, you know, vendor marketing, right? Some people just kind of co-opt the term; we talked about, you know, data mesh washing. What are your thoughts on this? (laughing) >> Yeah, so I used the term data platform earlier. And part of the reason I use that term is that it's more vendor neutral. We've tried to sort of stay out of the vendor terminology patenting world, right? Whether the term lake house is what sticks or not, the concept is certainly going to stick. And we have some data to back it up. About a quarter of organizations that are using data lakes today already incorporate data warehouse functionality into it. So they consider their data lake and data warehouse one and the same; about a quarter of organizations, a little less, but about a quarter of organizations feed the data lake from the data warehouse, and about a quarter of organizations feed the data warehouse from the data lake. So it's pretty obvious that three quarters of organizations need to bring this stuff together, right? The need is there, the need is apparent. The technology is going to continue to converge. I like to talk about it, you know, you've got data lakes over here at one end, and I'm not going to talk about why people thought data lakes were a bad idea, because they thought you just throw stuff in a server and you ignore it, right? That's not what a data lake is. So you've got data lake people over here and you've got database people over here, data warehouse people over here; database vendors are adding data lake capabilities and data lake vendors are adding data warehouse capabilities. So it's obvious that they're going to meet in the middle. I mean, I think it's like Tony says, I think we should declare victory and go home. >> As well. So just a follow-up on that, so are you saying the specialized lake and the specialized warehouse, do they go away? I mean, Tony, data mesh practitioners, or advocates, would say, well, they could all live. It's just a node on the mesh. But based on what Dave just said, are we going to see those all morph together? >> Well, number one, as I was saying before, there's always going to be this sort of, you know, centrifugal force or this tug of war between do we centralize the data, do we virtualize? And the fact is, I don't think that there's ever going to be any single answer. I think in terms of data mesh, data mesh has nothing to do with how you physically implement the data. You could have a data mesh basically on a data warehouse. It's just that, you know, the difference being that even if we use the same physical data store, everybody's logically, you know, basically governing it differently, you know? Data mesh, in essence, is not a technology, it's processes, it's governance process. So essentially, you know, I basically see that, you know, as I was saying before, this is basically the culmination of a long-time trend; we're essentially seeing a lot of blurring, but there are going to be cases where, for instance, if I need, let's say, upserts, or I need high concurrency or something like that, there are certain things that I'm not going to be able to efficiently get out of a data lake.
And, you know, for a system where I'm just doing really brute force, very fast file scanning and that type of thing, the lake works. So I think there always will be some delineations, but I would agree with Dave and with Doug that we are seeing basically a confluence of requirements, that we need to essentially have the elements, you know, the abilities of a data lake and the data warehouse; these need to come together, so I think. >> I think what we're likely to see is organizations look for a converged platform that can handle both sides for their center of data gravity; the mesh and the fabric virtualization vendors, they're all on board with the idea of this converged platform, and they're saying, "Hey, we'll handle all the edge cases of the stuff that isn't in that center of data gravity but that is distributed off in a cloud or at a remote location." So you can have that single platform for the center of your data and then bring in virtualization, mesh, what have you, for reaching out to the distributed data. >> As Dave basically said, people are happy when they've virtualized data. >> I think we have at this point, but to Dave Menninger's point, they are converging. Snowflake has introduced support for unstructured data, so obviously the lines are blurring here. Now what Databricks is saying is that "aha, but it's easier to go from data lake to data warehouse than it is from data warehouse to data lake." So I think we're getting into semantics, but we're already seeing these two converge. >> So take somebody like AWS, which has got, what, 15 data stores. Are they going to converge those 15 data stores? This is going to be interesting to watch. All right, guys, I'm going to go down the list and do like a one-word each, and you guys, each of the analysts, if you would just add a very brief sort of course correction for me. So Sanjeev, I mean, governance is going to be... Maybe it's the dog that wags the tail now. I mean, it's coming to the fore, all this ransomware stuff, which we really didn't talk much about, security, but what's the one word in your prediction that you would leave us with on governance? >> It's going to be mainstream. >> Mainstream. Okay. Tony Baer, mesh washing is what I wrote down. That's what we're going to see in 2022, a little reality check. You want to add to that? >> Reality check, 'cause I hope that no vendor jumps the shark and claims they're offering a data mesh product. >> Yeah, let's hope that doesn't happen. If they do, we're going to call them out. Carl, I mean, graph databases, thank you for sharing some high growth metrics. I know it's early days, but magic is what I took away from that, so magic database. >> Yeah, I would actually, I've said this to people too. I kind of look at it as a Swiss Army knife of data, because you can pretty much do anything you want with it. That doesn't mean you should. I mean, there's definitely the case that if you're managing things that are in a fixed schematic relationship, probably a relational database is a better choice. There are times when a document database is a better choice. It can handle those things, but it may not be the best choice for that use case. But for a great many, especially with the new emerging use cases I listed, it's the best choice. >> Thank you. And Dave Menninger, thank you by the way for bringing the data in, I like how you supported all your comments with some data points. But streaming data becomes the sort of default paradigm, if you will, what would you add?
>> Yeah, I would say think fast, right? That's the world we live in, you've got to think fast. >> Think fast, love it. And Brad Shimmin, love it. I mean, on the one hand I was saying, okay, great, I'm afraid I might get disrupted by one of these internet giants who are AI experts; I'm going to be able to buy instead of build AI. But then again, you know, I've got some real issues. There's a potential backlash there. So give us your bumper sticker. >> I would say, going with Dave, think fast and also think slow, to reference the book that everyone talks about. I would say really that this is all about trust, trust in the idea of automation and a transparent and visible AI across the enterprise. And verify, verify before you do anything. >> And then Doug Henschen, I mean, I think the trend is your friend here on this prediction with lake house really becoming dominant. I liked the way you set up that notion of, you know, the data warehouse folks coming at it from the analytics perspective and then you get the data science worlds coming together. I still feel as though there's this piece in the middle that we're missing, but your final thoughts will give you the (indistinct). >> I think the idea of consolidation and simplification always prevails. That's why the appeal of a single platform is going to be there. We've already seen that with, you know, Hadoop platforms and moving toward cloud, moving toward object storage, and object storage becoming really the common storage point, whether it's a lake or a warehouse. And that second point, I think ESG mandates are going to come in alongside GDPR and things like that to up the ante for good governance. >> Yeah, thank you for calling that out. Okay folks, hey, that's all the time that we have here. Your experience and depth of understanding on these key issues on data and data management were really on point, and they were on display today. I want to thank you for your contributions. Really appreciate your time. >> Enjoyed it. >> Thank you. >> Thanks for having me. >> In addition to this video, we're going to be making available transcripts of the discussion. We're going to do clips of this as well; we're going to put them out on social media. I'll write this up and publish the discussion on wikibon.com and siliconangle.com. No doubt, several of the analysts on the panel will take the opportunity to publish written content, social commentary or both. I want to thank the power panelists and thanks for watching this special CUBE presentation. This is Dave Vellante, be well and we'll see you next time. (bright music)
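A quick sketch of the model card idea Brad describes above, in Python. The field names and the safe_to_enable helper here are illustrative assumptions, not Google's Model Card Toolkit schema or SAP's actual metadata; the point is simply that a card's declared limitations give a consumer something concrete to check before flipping the switch on an embedded AI feature.

# A minimal, hypothetical model card as plain Python data. Field names are
# illustrative only; Google's Model Card Toolkit defines its own schema.
model_card = {
    "model_details": {
        "name": "churn-recommender",
        "architecture": "CNN + LSTM",  # as in the SAP example discussed above
        "version": "1.0.0",
    },
    "intended_use": {
        "primary_use": "Customer churn recommendations for sales users",
        "out_of_scope": ["credit decisions", "hiring decisions"],
    },
    "limitations": {
        # The panel's caveat: the model only handles English text.
        "supported_languages": ["en"],
        "known_biases": "Trained on historical CRM data; may echo past bias.",
    },
}

def safe_to_enable(card: dict, deployment_language: str) -> bool:
    """Return True only if the deployment locale is covered by the card."""
    return deployment_language in card["limitations"]["supported_languages"]

# A consumer rolling this out internationally should see: don't turn it on yet.
print(safe_to_enable(model_card, "de"))  # False
print(safe_to_enable(model_card, "en"))  # True

Run against a non-English deployment, the check fails, which is exactly the "I should not just turn this on today" signal Brad is after.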

Published Date : Jan 7 2022


Clemence W. Chee & Christoph Sawade, HelloFresh


 

(upbeat music) >> Hello everyone. We're here at theCUBE startup showcase made possible by AWS. Thanks so much for joining us today. You know, when Zhamak Dehghani was formulating her ideas around data mesh, she wasn't the only one thinking about decentralized data architectures. HelloFresh was going into hyper-growth mode and realized that in order to support its scale, it needed to rethink how it thought about data. Like many companies that started in the early part of the last decade, HelloFresh relied on a monolithic data architecture, and the internal team had concerns about its ability to support continued innovation at high velocity. The company's data team began to think about the future and work backwards from a target architecture, which possessed many principles of so-called data mesh, even though they didn't use that term specifically. The company is a strong example of an early but practical pioneer of data mesh. Now, there are many practitioners and stakeholders involved in evolving the company's data architecture, many of whom are listed here on this slide. Two are highlighted in red and joining us today. We're really excited to welcome you to theCUBE, Clemence Chee, who is the global senior director for data at HelloFresh, and Christoph Sawade, who's the global senior director of data also of course at HelloFresh. Folks, welcome. Thanks so much for making some time today and sharing your story. >> Thank you very much. >> Thanks, Dave. >> All right, let's start with HelloFresh. You guys are number one in the world in your field. You deliver hundreds of millions of meals each year to many, many millions of people around the globe. You're scaling. Christoph, tell us a little bit more about your company and its vision. >> Yeah. Should I start or Clemence? Maybe Clemence can take the first piece, because Clemence has actually been a director at HelloFresh longer. >> Yeah, go ahead Clemence. >> I mean, yes, approximately six years ago I joined HelloFresh, and I didn't think the startup I was joining would eventually IPO. And just two years later, HelloFresh went public. And approximately three years and 10 months after HelloFresh was listed on the German stock exchange, which was just last week, HelloFresh was included in the DAX, Germany's leading stock market index, and that, to my mind, is a great, great milestone, and I'm really looking forward, and I'm very excited for the future for HelloFresh and also our data. The vision that we have is to become the world's leading food solution group. And there are a lot of attractive opportunities. So recently we did launch and expand in Norway. This was in July. And earlier this year, we launched the US brand, Green Chef, in the UK as well. We're committed to launching continuously in different geographies in the coming years and have a strong path ahead of us. With the acquisition of ready-to-eat companies like Factor in the US and the planned acquisition of Youfoodz in Australia, we are diversifying our offer, now reaching even more and more untapped customer segments and increasing our total addressable market. So by offering customers a growing range of different alternatives to shop for food and to consume meals, we are charging towards this vision and this goal to become the world's leading integrated food solutions group. >> Love it. You guys are on a rocket ship. You're really transforming the industry. And as you expand your TAM, it brings us to sort of the data as a core part of that strategy.
So maybe you guys could talk a little bit about your journey as a company, specifically as it relates to your data journey. I mean, you began as a startup, you had a basic architecture and, like everyone, you've made extensive use of spreadsheets; you built a Hadoop based system that started to grow. And when the company IPO'd, you really started to explode. So maybe describe that journey from a data perspective. >> Yes, Dave. So HelloFresh by 2015 had evolved into what amounts to a classical centralized data management setup. We grew very organically over the years, and there were a lot of very smart people around the globe, really building the company and building our infrastructure. This also means that there were a small number of internal and external data sources, and a centralized BI team with a number of people producing different reports, different dashboards and products for our executives, for example, or for different operations teams, to see the company's performance, and knowledge was transferred just by talking to each other in face-to-face conversations. And the people in the data warehouse team were considered the data wizards or the ETL wizards. Very classical challenges. And it was ETL that really dictated the style of data management, right? So our central data warehouse team then was responsible for different types of verticals in different domains, different geographies. And all this setup gave us, in the beginning, the flexibility to grow fast as a company in 2015. >> Christoph, anything to add to that? >> Yes, not explicitly to that one, but as Clemence said, right, this was kind of the setup that actually worked for us quite a while. And then in 2017, when HelloFresh went public, the company also grew rapidly. And just to give you an idea how that looked like as well, the tech departments actually increased from about 40 people to almost 300 engineers. And in the same way, the business units, as Clemence has described, also grew sustainably. So we continued to launch HelloFresh in new countries, launched new brands like EveryPlate, and also acquired other brands like Factor. And from a data perspective, that meant the number of data requests that the central (mumbles) were getting became more and more and more, and also more and more complex. So for the team, that meant that they had a fairly high mental load. They had to get a very deep understanding about the business, and also suffered a lot from this context switching back and forth. Essentially, they had to prioritize across product requests from our physical product, digital product, from, sorry, from the marketing perspective, and also from the central reporting teams. And in a nutshell, this was very hard for these people, and that altered situation meant that, let's say, the solution that we had built was not really optimal. So in a nutshell, the central function became a bottleneck and slowed down all the innovation of the company. >> It's a classic case, isn't it? I mean, Clemence, you see, the central team becomes a bottleneck, and so the lines of business, the marketing team, sales teams say, "Okay, we're going to take things into our own hands." And then of course IT and the technical team is called in later to clean up the mess. Maybe, maybe I'm overstating it, but that's a common situation, isn't it?
>> Yeah, this is exactly what happened. Right. So we had a bottleneck, we had those central teams, there was always a bit of tension. Analytics teams in those business domains like marketing, supply chain, finance, HR, and so on then started to really build their own data solutions. At some point you have to get the ball rolling, right, and then continue the trajectory, which means then that the data pipelines didn't meet the engineering standards, and there was an increased need for maintenance and support from central teams. Hence, over time, the knowledge about those pipelines and how to maintain a particular infrastructure, for example, left the company, such that most of those data assets and data sets turned into a huge debt, with decreasing data quality, also decreasing trust, decreasing transparency. And this was an increasing challenge, where a majority of time was spent in meeting rooms to align on data quality, for example. >> Yeah. And the point you were making, Christoph, about context switching, this is a point that Zhamak makes quite often: we've contextualized our operational systems, like our sales systems, our marketing systems, but not our data systems. So you're asking the data team, okay, be an expert in sales, be an expert in marketing, be an expert in logistics, be an expert in supply chain, and it's start, stop, start, stop. It's a paper cut environment, and it's just not as productive. But the flip side of that is, when you think about a centralized organization, you think, hey, this is going to be a very efficient way, a cross-functional team to support the organization, but it's not necessarily the highest velocity, most effective organizational structure. >> Yeah, so I agree with that piece, up to a certain scale. A centralized function has a lot of advantages, right? It's one tool for everyone, and you go to a single dedicated expert team. However, if you actually would like to accelerate, specifically with our type of growth, you want to give autonomy to certain teams and move the teams, or let's say the data, to the experts in these teams. And this, as you have mentioned, right, increases mental load. And you can either internally start splitting your team into different kinds of sub-teams focusing on different areas; however, that is then again just adding another piece where actually collaboration needs to happen, because the expertise still sits externally. So why not bridge that gap immediately and actually move these teams end to end into the function themselves? So maybe just to continue what Clemence was saying, and this is actually where Clemence's and my journey started to become one joint journey. Clemence was coming actually from one of these teams who built their own solutions. I was basically heading the platform team, called the data warehouse team in those days. And in 2019, when (mumbles) became more and more serious, I would say, so more and more people had recognized that this model does not really scale, in 2019 basically the leadership of the company came together and identified data as a key strategic asset. And what we mean by that is that if we leverage it in an appropriate way, it gives us a unique competitive advantage, which could help us to support and actually fully automate our decision making process across the entire value chain.
So what we're trying to do now, or what we would be aiming for, is that HelloFresh is able to build data products that have a purpose. We're moving away from the idea that it's just a byproduct. We have a purpose why we would like to collect this data; there's a clear business need behind that. And because it's so important for the company as a business, we also want to provide it as a trustworthy asset to the rest of the organization. Ideally with the best customer experience, but at least in a way that users can easily discover, understand and securely access high quality data. >> Yeah. So, and, Clemence, when you see Zhamak's writing, you see, you know, she has the four pillars and the principles. As practitioners, you look at that and say, okay, hey, that's pretty good thinking. And then now we have to apply it. And that's where the devil meets the details. So it's the four: the decentralized data ownership, data as a product, which we'll talk about a little bit, self-serve, which you guys have spent a lot of time on, and, Clemence, your wheelhouse, which is governance and a federated governance model. And it's almost like if you achieve the first two, then you have to solve for the second two; it almost creates new challenges. But maybe you could talk about that a little bit as to how it relates to HelloFresh. >> Yes. So Christoph has mentioned that we identified kind of a challenge beforehand and said, how can we actually decentralize and actually empower our different colleagues? And we realized that it was more an organizational or a cultural change. And this is something that someone also mentioned; I think ThoughtWorks mentioned in one of the white papers that it's more of an organizational or a cultural impact. And we kicked off a phased reorganization, with different phases; we're currently still in the middle of it. But we kicked off different phases of organizational restructuring, or reorganization, trying to unlock this data at scale. And the idea was really moving away from ever-growing, complex matrix organizations or matrix setups, and splitting between two different things. One is the value creation, so basically when people ask the question, what can we actually do, what should we do, this is value creation, and the how, which is capability building, and both are equal in authority. This actually creates a strong need for collaboration, and this collaboration breaks up the different silos that were built. And of course, this also includes different staffing needs for teams, staffing more, let's say, data scientists or data engineers, data professionals, into those business domains, and hence some more capability building. >> Okay, go ahead. Sorry. >> So back to Zhamak Dehghani. The idea also then crossed over when she published her papers in May 2019. And we thought, well, the four pillars that she described, around decentralized data ownership, a data as a product mindset, a self-service infrastructure and, as you mentioned, federated computational governance, suited very much our thinking at that point in time to reorganize the different teams, and this then led to not only an organizational restructure, but also a completely new approach to how we need to manage data. >> Got it. Okay. So your business is exploding.
The data team was having to become domain experts in many areas, constantly context switching, as we said, and people started to take things into their own hands. So again, we said, classic story. But you didn't let it get out of control, and that's important. And so we actually have a picture of kind of where you're going today and how it's evolved. Pat, if you could bring up the picture with the elephant, here we go. So I'll talk a little bit about the architecture. It doesn't show it here, the spreadsheet era, but Christoph, maybe you could talk about that. It does show the Hadoop monolith, which exists today; I think that's in a managed hosting service, but you preserved that piece of it. But if I understand it correctly, everything is evolving to the cloud; I think you're running a lot of this, or all of it, in AWS. Everybody's got their own data sources. You've got a data hub, which I think is enabled by a master catalog for discovery, and all this underlying technical infrastructure that is really not the focus of this conversation today. But the key here, if I understand correctly, is that these domains are autonomous, and that not only required technical thinking, but really a supportive organizational mindset, which we're going to talk about today. But Christoph, maybe you could address, you know, at a high level, some of the architectural evolution that you guys went through. >> Yeah, sure. Maybe it's also a good summary of the entire history. So as you have mentioned, right, we started in the very beginning with a monolith on the operational plane. Actually it wasn't just one monolith, it was two: one for the backend and one for the front end. And our analytical plane was essentially a couple of spreadsheets. And I think there's nothing wrong with spreadsheets: they allow you to store information, to transform data, to share this information, to visualize this data, but it's all in one tool, it's not actually separating concerns, right? Everything in one single tool. And this means that it's obviously not scalable, right? You reach the point where this kind of data management setup in one tool reaches its limits. So what we then did is we created our data lake, as we have seen here, on Hadoop, and in the very beginning it actually very much reflected our operational systems. On top of that, we used Impala as a data warehouse, but there was not really a distinction between what is our data warehouse and what is our data lake, as Impala was used as kind of an engine for both the warehouse and the data lake itself. And this organic growth actually led to a situation, as I think is clear now, where we had the centralized model, and for all the domains there were really loose Kimball modeling standards and no uniformity. We built in-house a base of materialized views that we used for the presentation layer. There was a lot of duplication of effort, and in the end, essentially, no feedback loop to help us improve what we had built, and naturally, as you said, a lack of trust. And this basically was a starting point for us to understand, okay, how can we move away? And there are a lot of different things that we can discuss, apart from this organizational structure that we have set up here; we have the three or four pillars from Zhamak.
However, there's also the next extra question around, how do we implement data products, right? What are the implications on that level? And I think that is something that we are currently still working through. >> Got it. Okay. So I wonder if we could switch gears a little bit and talk about the organizational and cultural challenges that you faced. What were those conversations like? Let's dig into that a little bit. I want to get into governance as well. >> The conversations on the cultural change. I mean, yes, we went through hyper-growth over the last years, and obviously there were a lot of new joiners, a lot of different, very, very smart people joining the company, which then meant that collaboration got a bit more difficult. Of course, the time zone differences. You had different artifacts being recreated, documentation flying around. So we had to build the company from scratch, right? Of course, this then always resulted in this tension which I described before. But the most important part here is that data has always been a very important factor at HelloFresh, and we collected more of this data and continued to use data to improve the different key areas of our business. Even with organizational struggles, like the central (mumbles) struggles, data somehow always helped us to grow through this kind of change, right? In the end, those decentralized teams in our local geographies started with solutions that served the business, which was very, very important. Otherwise, we wouldn't be at the place where we are today, but they did violate best practices and standards. And I always use the sports analogy, Dave. Like any sport, there are different rules and regulations that need to be followed. These rules are defined by, I'll call it, the sports association. And this is what you can think of as our data governance and our compliance team. Now we add the players to it, who need to follow those rules and abide by them. This is what we then call data management. Now, the different players, the professionals, also need to be trained and understand the strategy and the rules before they can play. And this is what I then call data literacy. So we realized that we need to focus on helping our teams to develop those capabilities and teach the standards for how work is being done, to truly drive functional excellence in the different domains. And one ambition of our data literacy program, for example, is to really empower every employee at HelloFresh, everyone, to make the right data-informed decisions by providing data education that scales (mumbles). And that can be different things, like including data capabilities within the learning paths, for example, right? So help them to create and deploy data products, connect data producers and data consumers, and create a common sense and more understanding of each other's dependencies, which is important. For example, SLAs, SLOs, data contracts, et cetera: people get more of a sense of ownership and responsibility. Of course, we have to define what that means: what does ownership mean, what does responsibility mean? But we are teaching this to our colleagues via individual learning paths and helping them upskill, to use also the shared infrastructure and those self-service data applications. And, all in all, to summarize, we are still in this process of learning. We're still learning as well.
So learning never stops at HelloFresh, but we are really trying to make it as much fun as possible. And in the end, we all know user behavior is changed through positive experience. So instead of having massive training programs over endless courses of workshops, leaving our new joiners and colleagues confused and overwhelmed, we're applying gamification, right? We split it into different levels of certification that our colleagues can access at different points. They can earn badges along the way, which then simplifies the process of learning and the engagement of the users. And this is what we see in surveys, for example, where our employees value this gamification approach a lot, and are even competing to collect those learning path badges to become number one on the leaderboard. >> I love the gamification. I mean, we've seen it work so well in so many different industries, not the least of which is crypto. So you've identified some of the process gaps that you saw, and you just glossed over them. Sometimes I say, pave the cow path. You didn't try to force, in other words, a new architecture into the legacy processes; you really had to rethink your approach to data management. So what did that entail? >> To rethink the way of data management, 100%. So if I take the example of the industrial revolution, or a classical supply chain revolution, just imagine that you have been riding a horse, for example, your whole life, and suddenly you can operate a car, or you suddenly receive a completely new way of transporting assets from A to B. So we needed to establish a new set of cross-functional business processes to run faster, drive faster, more robustly, and deliver data products which can be trusted and used by downstream processes and systems. Hence we had a set of new standards and new procedures that would fall into the internal data governance and compliance sector. With internal, I'm always referring to the data operations around new things like the data catalog: how to identify ownership, how to change ownership, how to certify data assets, everything around classical software development, which we now apply to data. This is some old and new thinking, right? Deployment, versioning, QA, all the different things, ingestion policies, deletion procedures, all the things that software development has been doing, we do now with data as well. In simple terms, it's a whole redesign of the supply chain of our data, with new procedures and new processes in asset creation, asset management and asset consumption. >> So data's become kind of the new development kit, if you will. I want to shift gears and talk about the notion of data product, and we have a slide that we pulled from your deck. And I'd like to unpack it a little bit. I'll just, if you can bring that up, I'll read it. A data product is a product whose primary objective is to leverage data to solve customer problems, where customers are both internal and external. So, pretty straightforward. I know you've gone much deeper in your thinking and into your organization, but how do you think about that and how do you determine, for instance, who owns what? How did you get everybody to agree? >> I can take that one. Maybe let me start with the data product. So I think that's an ongoing debate, right? And I think the debate itself is the important piece here, right? Through the debate you clarify what we actually mean by that, a product, and what is actually the mindset.
So I think just from a definition perspective, right, we found the common denominator that we say, okay, a data product is something which is important for the company, that comes with value. What do you mean by that? Okay, it's a solution to a customer problem that delivers ideally maximum value to the business. And yes, it leverages the power of data. And we have a couple of examples at HelloFresh, the historical and classical ones around dashboards, for example, to monitor our error rates, but also more sophisticated ones, for example incorporating machine learning algorithms in our recipe recommendations. However, I think the important aspects of a data product are, A: there is an owner, right? There's someone accountable for making sure that the product that you're providing is actually served and maintained, and there's someone who's making sure that it actually keeps the value of what we are promising. Combined with the idea of proper documentation, like a product description, right, so people understand how to use it, what it is about. And related to that piece is the idea of: there's a purpose, right? We need to understand, or ask ourselves, okay, why does a thing exist? Does it provide the value that we think it does? Then it leads into a good understanding of the life cycle of the data product. What do we mean? Okay, from the beginning, from the creation, you need to have a good understanding; you need to collect feedback, we need to learn about it, you need to rework it, and, finally, also to think about, okay, when is it time to decommission that piece? So overall, I think the core of this data product is product thinking 101, right? The starting point needs to be the problem and not the solution. And this is essentially what we have seen, what was missing, what brought us to this kind of data spaghetti that we had built there in a rush, essentially. Certain data assets were developed in isolation, continuously patching the solution just to fulfill these ad hoc requests that we got, without actually really understanding what the stakeholder needs. And the interesting piece is that this results in duplication of (mumbled). And this is not just frustrating and probably not the most efficient way the company should work, but also, if I build the same data assets but with slightly different assumptions across the company and multiple teams, that leads to data inconsistency. And imagine the following scenario. You, from a management perspective, are asking basically a specific question, and you get, essentially, from a couple of different teams, different kinds of graphs, different kinds of data and numbers. And in the end, you do not know which ones to trust. You do not know whether it's actually noise that you're observing or whether there's actually a signal that you're looking for. And the same if I'm running an AB test, right? I have a new feature, I would like to understand what is the business impact of this feature. I run that with a specific source, and in an unfortunate scenario, your production system is actually running on a different source. You see different numbers; what you have seen in the AB test is actually not what you see then in production, typical thing.
Then you're asking some analytics team to actually do a deep dive, to understand where the discrepancies are coming from; worst case scenario, again, there's a different kind of source. So in the end, it's a pretty frustrating scenario, and it's actually a waste of time for the people that have to identify the root cause of this type of divergence. So, in a nutshell, the highest degree of consistency is actually achieved if people are just reusing data assets. And also, in the end, in the meetup talk we've given, right, we started trying to establish this approach with AB testing. So we have a team that is kind of owning their target metrics with the associated business teams, and they're providing that as a product, also to other services, including the AB testing team. The AB testing team can use this information through a defined interface, say, okay, I'm drawing information from the metadata of an experiment. And in the end, after the assignment, after this data collection phase, they can easily add a graph to a dashboard, just grouped by the AB testing variant. And we have seen that also in other companies, so it's not just a nice dream that we have, right? I have actually looked at other companies, maybe (mumbles) on search, where we established a complete KPI pipeline that was computing all this information, and this information was both hosted by the team and used for AB testing, deep dives and regular reporting. So just one last second on the important piece: now, why am I coming back to that? It's that it requires that we are treating this data as a product, right? If we want to have multiple people using the thing that I am owning and building, we have to provide it as a trustworthy asset, and in a way that it's easy for people to discover and to actually work with. >> Yeah. And coming back to that, so this is, to me, this is why I get so excited about data mesh, because I really do think it's the right direction for organizations. When people hear data product, they think, "Well, what does that mean?" But then when you start to sort of define it as you did, it's using data to add value. That could be cutting costs, that could be generating revenue, it could be actually directly creating a product that you monetize. So it's sort of in the eyes of the beholder. But I think the other point that we've made, and you made it earlier on too, is, again, context. So when you have a centralized data team and you have all these P&L managers, a lot of times they'll question the data 'cause they don't own it. They're like, "Well, wait a minute." If it doesn't agree with their agenda, they'll attack the data. But if they own the data, then they're responsible for defending it. And that is a mindset change that's really important. And I'm curious how you got to that ownership. Was it top-down, or was somebody providing leadership? Was it more organic, bottom-up? Was it a sort of a combination? How did you decide who owned what? In other words, you know, how did you get the business to take ownership of the data, and what does owning the data actually mean? >> That's a very good question, Dave. I think that's one of the pieces where we have a lot of learnings, and basically, if you ask me how we could have done it differently, I think that would be the first piece where we would need to start: really think about how that should be approached. What does it mean if a team has ownership, right?
That means somehow that the team has the responsibility to host the data assets themselves to minimum acceptable standards, with minimum dependencies up and downstream. The interesting piece, looking backwards, is that under that definition, this extra process that we had to go through was not actually transferring ownership from a central team to the other teams, but actually, in most cases, establishing ownership. I make this distinction because saying we have to transfer ownership would erroneously suggest that the dataset was owned before. The platform team, yes, they had the capability to make the change, but it was actually the analytics teams and the business who understood the use cases, and no one actually owned it in the way that's expected. So we had to go through this very lengthy process of establishing ownership. How we have done that: in the beginning, we very naively started with, here's a document, here are all the data assets, who is probably the nearest neighbor who can actually take care of that, and then we moved it over. But the problem here is that all these things are kind of technical debt, right? They're not really properly documented, pretty unstable, built in a very inconsistent way over years, and the people that built them have already left the company. So this is actually not a nice thing that you want to receive, and people build up a certain resistance, even if they have actually bought into this idea of domain ownership. So if you ask me for the learnings: what needs to happen is, first, the company needs to really understand what our core business concepts are. We need to have this mapping from these core business concepts to the domain teams who own them, and then actually link that to the assets and integrate that better, both understanding how we can evolve the data assets and build new things in this piece and in the domain, but also how we can address the reduction of technical debt and stabilize what we already have. >> Thank you for that, Christoph. So I want to turn direction here and talk, Clemence, about governance, and I know that's an area that you're passionate about. I pulled this slide from your deck, which I kind of messed up a little bit, sorry for that. But by the way, we're going to publish a link to the full video that you guys did, so we'll share that with folks. But it's one of the most challenging aspects of data mesh. If you're going to decentralize, you quickly realize this could be the wild west, as we talked about, all over again. So how are you approaching governance? There's a lot of items on this slide that underscore the complexity, whether it's privacy, compliance, et cetera. So how did you approach this? >> Yeah, it's about connecting those dots, right? The aim of the data governance program is to promote the autonomy of every team while still ensuring that everybody has the right interoperability. So when we want to move from the wild west, riding horses, to a civilized way of transport, I can take the example of modern street traffic: all participants can maneuver independently, and as long as they follow the same rules and standards, everybody can remain compatible with each other and understand and learn from each other, so we can avoid car crashes.
So when I go from country to country, I do understand what the street infrastructure means, how I drive my car, and I can also read the traffic lights and the different signals. Likewise, as a business, at HelloFresh we do operate autonomously and consequently need to follow those external and internal rules and standards set forth by the jurisdictions in which we operate. So in order to prevent a car crash, we need to at least ensure compliance with regulations, to account for society's and our customers' increasing concern with data protection and privacy. So teaching and advocating this, evangelizing this to everyone in the company, was a key communication strategy. And of course, I mentioned data privacy and external factors; the same goes for internal regulations and processes, to help our colleagues adapt to this very new environment. So when I mentioned before the new way of thinking, the new way of dealing with and managing data, this of course implies that we need new processes and regulations for our colleagues as well. In a nutshell, this means that data governance provides a framework for managing the people, the processes, the technology and the culture around our data traffic. And all of that must come together in order to have an effective program. Providing at least a common denominator is especially critical for shared data sets, which we have managed across our different geographies, and for shared applications on shared infrastructure, as then consumed by centralized processes, for example master data, and all the metrics and KPIs which are also used for central steering. It's a big change, right? And our ultimate goal is to have this non-invasive, federated, automated and computational governance. And for that, we can't just talk about it. We actually have to go deep, use case by use case, PoC by PoC, and generate learnings with the different teams. And this would be a classical approach of identifying the target state and matching it with the current state, by identifying, together with the business teams, with the different domains, and doing a risk assessment, for example, to increase transparency, because a lot of teams might not even know what kind of situation they might be in. And this is where this training and this piece of data literacy comes into place, where we go in and train based on the findings, based on the most valuable use cases. And based on that, we help our teams to make this change, to increase their capability. It just takes a little bit more, I wouldn't say hand-holding, but a lot of guidance. >> Can I kind of chime in quickly here? (mumbles) I mean, there's a lot to the governance piece, but I think automation is important. And if you're talking about documentation, for example, yes, we can go from team to team and tell these people, hey, you have to document your data assets in a data catalog, or you have to establish a data contract and so on and so forth. But if we would like to build data products at scale, following actual governance, we need to think about automation, right? We need to think about a lot of things that we can learn from engineering, and it starts with simple things. Like, if we would like to build up trust in our data products, right, and actually want to apply the same rigor and the best practices that we know from engineering, there are things that we can do, and we should probably think about what we can copy.
And one example might be service level agreements, service level objectives and service level indicators, which on an engineering level represent the promises we make to our customers and consumers as we provide services. The objectives are the internal targets that help us keep those promises, and the indicators are how we track how we are doing. This is just one example of where federated governance comes into play. In an ideal world, you should not just talk about data as a product, but also about data products as code. That is to say, as much as possible, give the engineers the tools they are familiar with: don't ask the product managers, for example, to document the data assets in the data catalog by hand, but make it part of the configuration of a CI/CD continuous delivery pipeline, as we typically see for other engineering tasks and services. In that configuration we can think about PII, about data quality monitoring, about ingestion into the data catalog, and so on. Ideally, data products become a sort of template that can be deployed and that is verified, or rejected, at build time, before we actually make them live and deploy them to production. >> Yeah, so it's like DevOps for data products. So I'm envisioning almost a three-phase approach to governance, and it sounds like you're in the early phase of it, call it phase zero, where there's learning, literacy, training, education, kind of self-governance, with some oversight and a lot of manual work going on. Then you become process builders, then you codify it, and then you can automate it. Is that fair? >> I would rather think about automation as early as possible. Yes, the rules need to be set up first, but then actually start use case by use case: is there any small piece that we can already automate? If possible, roll that out and extend it step by step. >> Is there a role, though, that adjudicates that? Is there a central, you know, chief data officer who's responsible for making sure people are complying, or how do you handle it? >> From a platform perspective, yes: the platform implements certain pieces that we say are important and would like to enforce. However, that works very closely with the governance department, so it's Clemence's piece to define the policies that need to be implemented. >> So Clemence, essentially it's your responsibility to make sure that the policy is being followed, and then, as you were saying, Christoph, you want to compress the time to automation as fast as possible. Is that-- >> Yeah, what needs to be really clear is that it's always a split effort, right? You can't just do one or the other thing; they really go hand in hand, because to get the right information into the right engineering tooling, we need to have the transparency first. I mean, code needs to be coded, so we need to operate on the same level with the right understanding.
So there are actually two things that are important: one is policies and guidelines, but equally or more important is to align with the end users, the tech teams and engineering, and really bridge between the business teams delivering business value and the engineering teams. >> Got it. So just a couple more questions, because we've got to wrap up. I want to talk a little bit about the business outcome. I know it's hard to quantify, and I'll talk about that in a moment, but major learnings, some of the challenges that you cited, I'll just put them up here. We don't have to go into detail; I just wanted to share them with folks. But my question, and this is the advice-for-your-peers question: if you had to do it differently, if you had a do-over or a mulligan, as we like to say for you golfers, what would you do differently? >> Can we start with the transformational challenge, understanding that it also carries a high load of cultural change? I think this is important: a deliberate communication strategy needs to be put into place, and people really need to be supported, right? It's not enough to go in and say, well, we have to change towards data mesh; it's human nature to be resistant to change, and change is uncomfortable. We need to take that away by training and by communicating. Chris, you might want to add something to that. >> Definitely. I think the point I've also made before is that we need to acknowledge that data mesh is an architecture for scale, something needed by large companies that build products at scale. I mean, Dave, you mentioned that there are a lot of advantages to having a centralized team, but at some point it may make sense to decentralize. And at that point, if you think about data mesh, you have to recognize that you're not building on a green field. A big learning, which is also reflected on the slide, is: don't underestimate your baggage. Typically you come to a point where the old model doesn't work anymore; at HelloFresh, right, we lost trust in our data and saw real risks of slowing down our innovation, and that triggered the need to actually change something. Part of that transition is that we have a lot of technical debt accumulated over the years. What we have learned is that we potentially decentralized some assets too early, without taking into account the maturity of the teams taking them over, and now we're in the phase of correcting pieces of that. If you start from scratch, you have to understand: are all my teams actually ready to take on this new capability? And you have to make sure that, with decentralization, you build up these capabilities in the teams and, as Clemence mentioned, that you take the people along on the journey. These are the pieces where the knowledge gap comes in, where we need to think about hiring, literacy, and the technical debt I just talked about.
And the last piece I would add, which is not on the slide deck: from our perspective, we started on the analytical layer because that was where things were exploding, right? That's where people feel the pain. But through the efforts we have started, to modernize the current data products towards data mesh, we've understood that it always comes down, basically, to a proper shape of our operational plane. We got through a lot of pain, but the learning here is that this really needs to be a commitment from the company; it needs to be end to end. >> I think that last point you made is so critical, because I hear a lot from the vendor community about how they're going to make analytics better, and that's not unimportant, but true data product thinking and decentralized data organizations really have to operationalize in order to scale. These decisions around data architecture and organization are fundamental and lasting; it's not necessarily about an individual project's ROI. There are going to be projects and sub-projects within this architecture, but the architectural decision itself is organizational, it's cultural: what's the best approach to support your business at scale? It really speaks to who you are as a company and how you operate, and getting that right, as we've seen in the success of data-driven companies, yields tremendous results. So I'll ask each of you to give us your final thoughts and then we'll wrap. >> Can I quickly jump in on the piece you mentioned, the target architecture? When people talk about these pieces, they often have a picture of different stages: an ingestion layer, a storage layer, a transformation layer, a presentation layer, and then we basically put a lot of technology on top of that, and that's our target architecture. However, what we really need to make sure is that we have different views: we need to understand the capabilities we actually need, how it should look and feel from the different personas' experience, and only then should that translate into the target architecture from a technical perspective. Maybe just to give an outlook on what we are planning to do and how we want to move forward: based on our strategy, we would like to increase the data maturity of the entire company. This is a framework around the business strategy, and it breaks down into four pillars. People, meaning data culture, data literacy, data organizational structure and so on. Governance, as Clemence mentioned: compliance, governance, data management and so on. Technology, and we could talk for hours about that one: the data platform, the data science platform. And finally, enablement through data, meaning we need to understand data quality, data accessibility, applied science and data monetization. >> Great. Thank you, Christoph. Clemence, why don't you bring us home. Give us your final thoughts. >> Okay.
I can only agree with Christoph that what's important is to understand the maturity level where the company, the people, the organization is, and to really understand what kind of change applies to those four pillars, for example what needs to be tackled first. And this is not very clear from the very beginning. It's kind of like a green field: you come up with must-wins, with things that you really want to do, out of theory and out of different white papers. Only when you really start conducting the first initiatives do you understand how to put those thoughts together, and where you are missing out on one of those four pillars, people, process, technology and governance. Then you do the integration step by step, small steps by small steps, not boiling the ocean, so that you're really capable of identifying the gaps and seeing where you can fill them, or where you have to increase maturity first and train people or build up your tech stack. >> You know, HelloFresh is an excellent example of a company that is innovating. It was not born in Silicon Valley, which I love. It's a global company. And I've got to ask you guys, it seems like an amazing place to work. Are you guys hiring? >> Yes, definitely, we are. As mentioned, we are hiring as an entire company, and specifically for data. There are a lot of open roles, so yes, please visit our page, from data engineering to data product management, and Clemence has a lot of roles you can speak to him about. But yes. >> Guys, thanks so much for sharing with theCUBE audience. You're pioneers, and we look forward to collaborations in the future to track progress. Really want to thank you for your time. >> Thank you very much. >> Thank you very much, Dave. >> And thank you for watching theCUBE's startup showcase made possible by AWS. This is Dave Vellante. We'll see you next time. (cheerful music)
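Christoph's point earlier about treating data products as code, where catalog documentation, PII handling and quality monitoring live in CI/CD configuration and a product is verified or rejected at build time, can be made concrete with a short sketch. The Python below is a minimal illustration under assumed conventions; the descriptor fields and validation rules are hypothetical, not HelloFresh's actual tooling.

```python
from dataclasses import dataclass, field

# Hypothetical descriptor for a data product, checked in CI before deployment.
# Field names and rules are illustrative, not HelloFresh's actual schema.
@dataclass
class DataProductDescriptor:
    name: str
    owner: str                     # accountable data product owner
    description: str               # catalog documentation
    pii_columns: list = field(default_factory=list)      # columns needing masking
    freshness_slo_hours: int = 24  # SLO: maximum acceptable age of the data
    quality_checks: list = field(default_factory=list)   # e.g. "not_null:order_id"

    def validate(self) -> list:
        """Return a list of policy violations; empty means the build may proceed."""
        errors = []
        if not self.owner:
            errors.append("every data product must declare an accountable owner")
        if not self.description:
            errors.append("undocumented data products are rejected at build time")
        if self.freshness_slo_hours > 48:
            errors.append("freshness SLO exceeds the assumed global 48h policy")
        if self.pii_columns and "mask_pii" not in self.quality_checks:
            errors.append("PII columns declared but no masking check configured")
        return errors

if __name__ == "__main__":
    product = DataProductDescriptor(
        name="orders.daily_summary",
        owner="supply-chain-analytics",
        description="Daily order aggregates per region.",
        pii_columns=["customer_id"],
        quality_checks=["not_null:order_id"],
    )
    violations = product.validate()
    if violations:                 # CI step: fail the build on any violation
        raise SystemExit("\n".join(violations))
```

Run as a pipeline step before deployment, a check like this means an undocumented or non-compliant data product simply never reaches production, which is the "verified or rejected at build time" behavior Christoph describes.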

Published Date : Sep 15 2021

Breaking Analysis: How JPMC is Implementing a Data Mesh Architecture on the AWS Cloud


 

>> From theCUBE studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> A new era of data is upon us, and we're in a state of transition. You know, even our language reflects that. We rarely use the phrase big data anymore; rather, we talk about digital transformation or digital business or data-driven companies. Many have come to the realization that data is not the new oil, because unlike oil, the same data can be used over and over for different purposes. We still use terms like data as an asset. However, that same narrative, when it's put forth by the vendor and practitioner communities, includes further discussions about democratizing and sharing data. Let me ask you this: when was the last time you wanted to share your financial assets with your coworkers or your partners or your customers? Hello everyone, and welcome to this week's Wikibon Cube Insights powered by ETR. In this Breaking Analysis, we want to share our assessment of the state of the data business. We'll do so by looking at the data mesh concept and how a leading financial institution, JP Morgan Chase, is practically applying these relatively new ideas to transform its data architecture. Let's start by looking at what the data mesh is. As we've previously reported many times, data mesh is a concept and set of principles that was introduced in 2018 by Zhamak Dehghani, who's a director of technology at ThoughtWorks, a global consultancy and software development company. And she created this movement because her clients, some of the leading firms in the world, had invested heavily in predominantly monolithic data architectures that had failed to deliver the desired outcomes and ROI. So her work went deep into trying to understand that problem, and her main conclusion was that the world of data is distributed, and shoving all the data into a single monolithic architecture is an approach that fundamentally limits agility and scale. Now, a profound concept of data mesh is the idea that data architectures should be organized around business lines with domain context, and that the highly technical and hyper-specialized roles of a centralized, cross-functional team are a key blocker to achieving our data aspirations. This is the first of four high-level principles of data mesh. So first, again: the business domain should own the data end-to-end, rather than have it go through a centralized big data technical team. Second, a self-service platform is fundamental to a successful architectural approach, where data is discoverable and shareable across an organization and an ecosystem. Third, product thinking is central to the idea of data mesh; in other words, data products will power the next era of data success. And fourth, data products must be built with governance and compliance that is automated and federated. Now, there's a lot more to this concept, and there are tons of resources on the web to learn more, including an entire community that has formed around data mesh, but this should give you a basic idea. Now, the other point is that, in observing Zhamak Dehghani's work, she has deliberately avoided discussions around specific tooling, which I think has frustrated some folks, because we all like to have references that tie to products and tools and companies. So this has been a two-edged sword in that, on the one hand, it's good, because data mesh is designed to be tool-agnostic and technology-agnostic.
On the other hand, it's led some folks to take liberties with the term data mesh and claim mission accomplished when their solution, you know, may be more marketing than reality. So let's look at JP Morgan Chase and their data mesh journey. That's why I got really excited when I saw, this past week, that a team from JPMC held a meetup to discuss what they called data lake strategy via data mesh architecture. I saw that title and thought, well, that's a weird title, and I wondered, are they just taking their legacy data lakes and claiming they're now transformed into a data mesh? But in listening to the presentation, which was over an hour long, the answer is a definitive no, not at all, in my opinion. A gentleman named Scott Hollerman organized the session, which comprised these three speakers here: James Reid, who's a divisional CIO at JPMC; Arup Nanda, who is a technologist and architect; and Serita Bakst, who is an information architect, again, all from JPMC. This was the most detailed and practical discussion that I've seen to date about implementing a data mesh. And this is JP Morgan's approach, and we know they're extremely savvy and technically sound, and they've invested, it has to be billions, over the past decade on data architecture across their massive company. And rather than dwell on the downsides of their big data past, I was really pleased to see how they're evolving their approach and embracing new thinking around data mesh. So today we're going to share some of the slides that they used, comment on how it dovetails with the concept of data mesh that Zhamak Dehghani has been promoting, at least as we understand it, and dig a bit into some of the tooling being used by JP Morgan, particularly around its AWS cloud. So the first point is, it's all about business value. JPMC is in the money business, and in that world, business value is everything. So James Reid, the CIO, showed this slide and talked about their overall goals, which centered on a cloud-first strategy to modernize the JPMC platform. I think it's simple and sensible, but there are three factors on which he focused. Cutting costs, of course; you've got to do that. Number two was about unlocking new opportunities and accelerating time to value. But I was really happy to see number three, data reuse, a fundamental value ingredient in the slide he's presenting here. And his commentary was all about aligning with the domains and maximizing data reuse, i.e. data is not like oil, and making sure there's appropriate governance around that. Now, don't get caught up in the term data lake; I think it's just how JP Morgan communicates internally. It's invested in the data lake concept, so they use water analogies. They use things like data puddles, for example, which are single-project data marts, or data ponds, which comprise multiple data puddles, and these can feed into data lakes. And as we'll see, JPMC doesn't strive to have a single version of the truth from a data standpoint that resides in a monolithic data lake; rather, it enables the business lines to create and own their own data lakes that comprise fit-for-purpose data products. And they do have a single truth of metadata; okay, we'll get to that. But generally speaking, each of the domains will own their own data end-to-end and be responsible for those data products. We'll talk about that more.
Now, the genesis of this was sort of a cloud-first platform. JPMC is leaning into public cloud, which is ironic, since in the early days of cloud all the financial institutions were like, never. Anyway, JPMC is going hard after it; they're adopting agile methods and microservices architectures, and it sees cloud as a fundamental enabler, but it recognizes that on-prem data must be part of the data mesh equation. Here's a slide that starts to get into some of that generic tooling, and then we'll go deeper. And I want to make a couple of points here that tie back to Zhamak Dehghani's original concept. The first is that, unlike many data architectures, this puts data as products right in the fat middle of the chart. The data products live in the business domains and are at the heart of the architecture. The databases, the Hadoop clusters, the files and APIs on the left-hand side serve the data product builders. The specialized roles on the right-hand side, the DBAs, the data engineers, the data scientists, the data analysts, and we could have put in quality engineers, et cetera, serve the data products. Because the data products are owned by the business, they inherently have the context that is the middle of this diagram. And you can see at the bottom of the slide, the key principles include domain thinking and end-to-end ownership of the data products: they build it, they own it, they run it, they manage it. At the same time, the goal is to democratize data with self-service as a platform. One of the biggest points of contention in data mesh is governance, and as Serita Bakst said on the meetup, metadata is your friend. She kind of made a joke and said, "This sounds kind of geeky, but it's important to have a metadata catalog to understand where data resides, the data lineage, and overall change management." So to me, this really passed the data mesh sniff test pretty well. Let's look at data as products. CIO Reid said the most difficult thing for JPMC was getting their heads around data product, and they spent a lot of time getting this concept to work. Here's the slide they used to describe their data products as they relate to their specific industry. They said a common language and taxonomy is very important, and you can imagine how difficult that was; he said, for example, it took a lot of discussion and debate to define what a transaction was. But you can see, at a high level, these three product groups around wholesale credit risk, party, and trade and position data as products, and each of these can have sub-products; party, for example, will have know your customer, KYC. So a key for JPMC was to start at a high level and iterate to get more granular over time. Lots of decisions had to be made around who owns the products and the sub-products. The product owners, interestingly, had to defend why that product should even exist, what boundaries should be in place, and what data sets do and don't belong in the various products. And this was a collaborative discussion; I'm sure there was contention between the lines of business around which sub-products should be part of these circles. They didn't say this, but tying it back to data mesh: each of these products, whether in a data lake or a data hub or a data pond or data warehouse or data puddle, is a node in the global data mesh that is discoverable and governed.
And supporting this notion, Serita said, "This should not be infrastructure-bound; logically, any of these data products, whether on-prem or in the cloud, can connect via the data mesh." So again, I felt like this really stayed true to the data mesh concept. Well, let's look at some of the key technical considerations that JPM discussed in quite some detail. This chart here shows a diagram of how JP Morgan thinks about the problem, and some of the challenges they had to consider were: how do you write to various data stores, and how can you move data from one data store to another? How can data be transformed? Where's the data located? Can the data be trusted? How can it be easily accessed? Who has the right to access that data? These are all problems that technology can help solve. And to address these issues, Arup Nanda explained that the heart of this slide is the data ingestor, instead of ETL. All data producers and contributors send their data to the ingestor, and the ingestor then registers the data so it's in the data catalog. It does a data quality check, and it tracks the lineage. Then data is sent to the router, which persists the data in the data store based on the best destination as informed by the registration. This is designed to be a flexible system; in other words, the data store for a data product is not fixed, it's determined at the point of inventory, and that allows changes to be easily made in one place. The router simply reads that optimal location and sends the data to the appropriate data store. Now, the schema inferrer is used when there is no clear schema on write. In this case, the data product is not allowed to be consumed until the schema is inferred: the data goes into a raw area, the inferrer determines the schema, and then it updates the inventory system so that the data can be routed to the proper location and properly tracked. So that's some of the detail of how the sausage factory works in this particular use case; it was very interesting and informative. Now let's take a look at the specific implementation on AWS and dig into some of the tooling. As described in some detail by Arup Nanda, this diagram shows the reference architecture used by this group within JP Morgan, and it shows all the various AWS services and components that support their data mesh approach. So start with the authorization block right there underneath Kinesis. Lake Formation is the single point of entitlement and has a number of buckets, including, you can see there, the raw area that we just talked about, a trusted bucket, a refined bucket, et cetera. Depending on the data characteristics, the data catalog registration block, where you see the Glue catalog, determines in which bucket the router puts the data. And you can see the many AWS services in use here: identity, EMR, the Elastic MapReduce clusters from the legacy Hadoop work done over the years, Redshift Spectrum and Athena; JPMC uses Athena for single-threaded workloads and Redshift Spectrum for nested types, so they can be queried independently of each other. Now remember, very importantly, in this use case there is not a single Lake Formation; rather, multiple lines of business will be authorized to create their own lakes, and that creates a challenge. So how can that be done in a flexible and automated manner? And that's where the data mesh comes into play.
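As an aside, the ingestor-and-router flow Arup Nanda describes above is easy to sketch. The following Python fragment is a hypothetical rendering of the pattern as described: registration in a catalog, a quality check, schema inference when there is no schema on write, then routing to whatever store the inventory names. The function, field and store names are illustrative assumptions, not JPMC's actual code.

```python
# Hypothetical sketch of the ingestor/router pattern described above.
# All names and stores are illustrative; JPMC's implementation is not public.
CATALOG = {}                                   # data catalog: product -> registration
INVENTORY = {"trade.positions": "refined"}     # inventory: product -> best data store

def infer_schema(records):
    """Naive schema inference: take field types from the first record."""
    return {key: type(value).__name__ for key, value in records[0].items()}

def ingest(product, records, schema=None):
    """Register, quality-check, and route one batch for a data product."""
    if schema is None:                         # no clear schema on write:
        INVENTORY.setdefault(product, "raw")   # hold in the raw area first
        schema = infer_schema(records)         # inferrer determines the schema
        INVENTORY[product] = "refined"         # then updates the inventory
    CATALOG[product] = {"schema": schema, "lineage": "ingestor"}
    if any(set(r) != set(schema) for r in records):
        raise ValueError(f"{product}: quality check failed")
    route(product, records)

def route(product, records):
    """Persist to whatever store the inventory currently names; changing a
    product's destination is a one-place change the router picks up."""
    print(f"writing {len(records)} records for {product} to {INVENTORY[product]}")

ingest("trade.positions", [{"trade_id": 1, "qty": 100}])
```

The design point is that the router never hard-codes a destination; it reads the registration at write time, which is what lets a product's optimal store be changed in one place.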
So JPMC came up with this federated Lake Formation accounts idea: each line of business can create as many data producer or consumer accounts as they desire and roll them up into their master line-of-business Lake Formation account, and they cross-connect these data products in a federated model. These all roll up into a master Glue catalog, so that any authorized user can find out where a specific data element is located. So this is like a superset catalog that comprises multiple sources and syncs up across the data mesh. So again, to me this was a very well-thought-out and practical application of data mesh. Yes, it includes some notion of centralized management, but much of that responsibility has been passed down to the lines of business. It does roll up to a master catalog, but that's a metadata management effort that seems compulsory to ensure federated and automated governance. As well, at JPMC the office of the chief data officer is responsible for ensuring governance and compliance throughout the federation. All right, so let's take a look at some of the suspects in this world of data mesh and bring in the ETR data. Now, of course, ETR doesn't have a data mesh category; there's no such thing as a data mesh vendor: you build a data mesh, you don't buy it. So what we did is use the ETR dataset to select and filter on some of the culprits that we thought might contribute to the data mesh, to see how they're performing. This chart depicts a popular view that we often like to share. It's a two-dimensional graphic with net score, or spending momentum, on the vertical axis and market share, or pervasiveness in the data set, on the horizontal axis. And we filtered the data on sectors such as analytics, data warehouse and the adjacencies to things that might fit into data mesh. We think these pretty well reflect participation, though data mesh is certainly not all-encompassing, and it's a subset, obviously, of all the vendors who could play in the space. Let's make a few observations. Now, as is often the case, Azure and AWS are almost literally off the charts, with very high spending velocity and a large presence in the market. Oracle, you can see, also stands out, because much of the world's data lives inside of Oracle databases. It doesn't have the spending momentum or growth, but the company remains prominent. And you can see Google Cloud doesn't have nearly the presence in the dataset, but its momentum is highly elevated. Remember that red dotted line there, the 40% line: anything over that indicates elevated spending momentum. Let's go to Snowflake. Snowflake is consistently shown to be the gold standard in net score in the ETR dataset, and it continues to maintain highly elevated spending velocity in the data. And in many ways, Snowflake, with its data marketplace and its data cloud vision and data sharing approach, fits nicely into the data mesh concept. Now, a caution: Snowflake has used the term data mesh in its marketing, but in our view it lacks clarity, and we feel like they're still trying to figure out how to communicate what that really is. But there is, we think, a lot of potential in that vision. Databricks is also interesting, because the firm has momentum, and we expect further elevated levels on the vertical axis in upcoming surveys, especially as it readies for its IPO. The firm has a strong product and managed service, and is really one to watch.
Now, we included a number of other database companies for obvious reasons, like Redis and Mongo, MariaDB, Couchbase and Teradata. SAP as well is in there; that's not all database, but SAP is prominent, so we included them. As is IBM, more of a traditional database player, also with a big presence. Cloudera includes Hortonworks, and HPE Ezmeral comprises the MapR business that HPE acquired. So these guys got the big data movement started, between Cloudera, Hortonworks, which was born out of Yahoo, the early Hadoop innovator, and MapR, which has of course changed hands, and now that's all come together in various forms. And of course Talend and Informatica are there, two data integration companies worth noting. We also included some of the AI and ML specialists and data science players in the mix, like DataRobot, who just did a monster $250 million round, Dataiku, H2O.ai and ThoughtSpot, which is all about democratizing data and injecting AI, and I think fits well into the data mesh concept. And you know we put VMware Cloud in there for reference, because it really is the predominant on-prem infrastructure platform. All right, let's wrap with some final thoughts here. First, thanks a lot to the JP Morgan team for sharing this data. I really want to encourage practitioners and technologists to go watch the YouTube video of that meetup; we'll include the link in this session. And thank you to Zhamak Dehghani and the entire data mesh community for the outstanding work that you're doing, challenging the established conventions of monolithic data architectures. The JPM presentation gives you real credibility; it takes data mesh well beyond concept and demonstrates how it can be, and is being, done. And you know, this is not a perfect world; you're going to start somewhere, and there are going to be some failures. The key is to recognize that shoving everything into a monolithic data architecture won't support the massive scale and agility that you're after. It's maybe fine for smaller use cases in smaller firms, but if you're building a global platform in a data business, it's time to rethink data architecture. Now, much of this is enabled by the cloud, but cloud-first doesn't mean cloud-only, and it doesn't mean you'll leave your on-prem data behind; on the contrary, you have to include non-public-cloud data in your data mesh vision, just as JPMC has done. You've got to get some quick wins; that's crucial so you can gain credibility within the organization and grow. And one of the key takeaways from the JP Morgan team is that there is a place for dogma, like organizing around data products and domains and getting that right. On the other hand, you have to remain flexible, because technologies are going to come and go, so you've got to be flexible in that regard. And look, if you're going to embrace the metaphor of water, like puddles and ponds and lakes, we suggest, maybe a little tongue in cheek, but still we believe in this, that you expand your scope to include the data ocean, something John Furrier and I have talked about and laughed about extensively in theCUBE. Data oceans, it's huge: it's the new data lake. Go transcend data lake; think oceans. And think about this: just as we're evolving our language, we should be evolving our metrics. Much of the last decade of big data was about just getting the stuff to work, getting it up and running, standing up infrastructure, and managing, how much data you got? Massive amounts of data.
And there were many KPIs built around, again, standing up that infrastructure and ingesting data, a lot of technical KPIs. This decade is not just about enabling better insights; it's more than that. Data mesh points us to a new era of data value, and that requires new metrics around monetizing data products: how long does it take to go from data product conception to monetization, and how does that compare to what it is today? And what is the time to quality? If the business owns the data, and the business has the context, the quality that comes out of the chute should be, at a basic level, pretty good, and at a higher mark than from a big data team with no business context. Automation, AI and, very importantly, organizational restructuring of our data teams will heavily contribute to success in the coming years. So we encourage you: learn, lean in and create your data future. Okay, that's it for now. Remember, these episodes are all available as podcasts wherever you listen; all you've got to do is search Breaking Analysis podcast, and please subscribe. Check out ETR's website at etr.plus for all the data and all the survey information. We publish a full report every week on wikibon.com and siliconangle.com. And you can get in touch with us: email me at david.vellante@siliconangle.com, DM me @dvellante, or comment on my LinkedIn posts. This is Dave Vellante for theCUBE Insights powered by ETR. Have a great week everybody, stay safe, be well, and we'll see you next time. (upbeat music)
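To ground the federated Glue catalog idea from the JPMC segment a bit further, here is a minimal boto3 sketch of the pattern: search line-of-business catalogs for a data product, then grant a consumer SELECT through Lake Formation as the point of entitlement. `search_tables` and `grant_permissions` are real Glue and Lake Formation API calls, but the account IDs, names and overall wiring here are illustrative assumptions, not JPMC's implementation.

```python
import boto3

# Illustrative line-of-business catalog accounts rolling up to a master catalog.
LOB_CATALOG_IDS = ["111111111111", "222222222222"]  # hypothetical account IDs

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

def find_data_product(search_text):
    """Search each federated catalog for tables matching a data product name."""
    hits = []
    for catalog_id in LOB_CATALOG_IDS:
        token = None
        while True:  # page through Glue SearchTables results
            kwargs = {"CatalogId": catalog_id, "SearchText": search_text}
            if token:
                kwargs["NextToken"] = token
            page = glue.search_tables(**kwargs)
            hits += [(catalog_id, t["DatabaseName"], t["Name"])
                     for t in page["TableList"]]
            token = page.get("NextToken")
            if not token:
                break
    return hits

def grant_consumer_select(principal_arn, database, table):
    """Lake Formation as single point of entitlement: grant SELECT on one table."""
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        Permissions=["SELECT"],
    )

for catalog_id, db, name in find_data_product("trade"):
    print(f"found {db}.{name} in catalog {catalog_id}")
```

The shape matters more than the specifics: discovery happens against federated catalogs that roll up, while entitlement stays with Lake Formation, which mirrors the "superset catalog plus single point of entitlement" split described above.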

Published Date : Jul 12 2021


Collibra Day 1 Felix Zhamak


 

>>Hi, Felix. Great to be here. >>Likewise. So when I started reading about data mesh, I think about a year ago, I found that the more I read about it, the more I found myself agreeing with the principles behind data mesh. It actually took me back almost to the start of Collibra 13 years ago, based on the research we were doing on semantic technologies, even personally to my own master's thesis, which was about domain-driven ontologies. And we'll talk about domain-driven thinking, as it's a key principle behind data mesh. But before we get into that, let's not assume that everybody knows what data mesh is about, although we've seen a lot of traction and momentum, which is fantastic to see. Maybe you could start by talking about some of the key principles and give a brief overview of what data mesh is. >>Of course, happy to. So data mesh is a new approach. It's a decentralized approach to managing and accessing data, and particularly analytical data, at scale. We can break that down a little bit. What is analytical data? Well, analytical data is the data that fuels our reporting and business intelligence and, most importantly, machine learning training, right? It's an aggregate view of historical events that happen across organizations, across many domains within organizations, or even beyond the bounds of one organization. And today we manage this analytical data through very centralized solutions, whether it's a data lake or a data warehouse or combinations of the two, and, to be honest, we have kind of outsourced the accountability for it to the data team; it doesn't happen within the domains. What we have found ourselves with is a central bottleneck. As we see growth in the scale of organizations, in terms of the origins of the data and in terms of the great expectations for the data, all of these wonderful use cases that require access to that analytical data, we find ourselves constrained and limited in our agility to respond, because we have a centralized bottleneck, from team to technology to architecture. So data mesh looks back at what we've done, at the accidental complexity we've created, and tries to reimagine a different way of managing and accessing data that can truly scale as the origins of the data grow, as they become available within one organization, within one cloud or another. And it really boils down to an approach based on four principles. So far I haven't tried to be prescriptive about exactly how you implement it; I leave that to the imagination of the users. Of course I have my opinions, but without being prescriptive, I think there are four shifts that need to happen. One is that we need to start breaking down this complex problem of accessing data around boundaries that allow the solution to scale out, and the boundaries that naturally fit that model are domains, our business domains. So the first principle is domain ownership of the data: analytical data will be shared and served, and accounted for, by the domains where it comes from. And then the second dimension of that is, okay, once we break down the ownership of the data based on domains, how can we prevent data siloing? So the second principle is really treating data as a product.
>>So we consider the success of that data based on its accessibility and usability, and on the end-to-end experience of the data analysts and data scientists: we talk about data as a product. The third principle, to really make this feasible, is that we need to rethink our data platforms and our infrastructure capabilities, and create a new set of capabilities that allow domains, in fact, to own their data and manage the life cycle of their analytical data. So self-serve data infrastructure as a platform is the third principle. And the last principle is really around governance, because we do have to think about governance. In fact, when I first wrote it down, it was a small concern embedded in the text, and then I thought, okay, to make this real we need to think about security, quality and accessibility of the data at scale, in a fashion that embraces this autonomous domain ownership. So we have to think about how we can make this real with computation, how we can make those domains part of the governance, a federated governance. Federated computational governance is the fourth principle. In essence, it's an organizational shift, it's an architectural change, and of course technology needs to change, to get us to decentralized access and management of analytical data. >>Yeah, I think that makes a ton of sense. If you want to scale, typically you have to think much more decentralized than centralized. We've seen the same thing in other practices as well, that domain-driven thinking, I think especially around engineering, right? We've seen a lot of the same principles and best practices used to scale engineering teams, so let's not make the same mistakes again. But maybe we can start there, with the core principles around that domain-driven thinking. Can you elaborate a little bit on why that is so important for data organizations and data functions as well? >>Absolutely. I mean, if you look at organizations, organizations are complex systems, right? They are made of parts, which are basically domains, functions of the business: your order management and your customer management, your sales, your marketing. And the behavior of the organization is the result of an intricate network of dependencies and interactions between these domains. So if we just overlay data on this complex system, it does make sense, to really scale, to bring the ownership of, and access to, data right to the domain where it originates, to the people who know that data best and are most capable of providing it. To optimize response to change, to optimize creating new features, new services, new machine learning models, we've got to think about local optimization, but not at the cost of the global good. So domain ownership really talks about giving autonomy to the domains, and accountability to provide and model their data in a responsible way and be accountable for its quality. So we localize and empower, putting some of those responsibilities with the domains, but at the same time we think about the global good: how does each domain need to be accountable to the other domains on the mesh? That's what the governance piece covers.
And that leads to some interesting architectural shifts, because when you think about decentralization of the data, then you think, okay, if I have a machine learning model that needs three pieces of data from different domains, I end up actually distributing the compute back to those domains as well. So it starts shifting the architecture, too. We start with ownership. >>No, I think that makes a ton of sense, but I can imagine people thinking, well, if you're organizing according to these domains, aren't we going to create different silos, even more silos? And I think that's where the second principle, thinking of data as a product, comes in, which I find incredibly powerful. It's powerful because it helps us think about usability, about the consumer of that data, and about really packaging it in the right way. And there's one sentence I've heard you use that I think is incredibly powerful: it's less collecting, more connecting. Can you elaborate on that a little bit? >>Absolutely. The power and the value of the data is not in what we have collected and stored on disk, right? It's really about connecting that data to other data sets to illuminate new insights, that higher-order information, and connecting that data to the users when they want to use it. So if we shift our thinking from just collecting more in one place to the ability to connect data sets, we arrive at a different solution. I think data as a product, as you said, was exactly a kind of response to the challenges that domain-driven siloing could create. And the idea is that the data these domains now own needs to be shared, with some accountability and incentive structure, as a product. So if you bring product thinking to data, what does that mean?
If you, if you want to use it, why is also trusts so important if you think about data as a product? >>Well, uh, I mean, maybe we turn this question back to you. Would you buy the shiniest product if you don't trust it, if you, if you don't trust where it comes from, can I use it? Is it, does it have integrity? I wouldn't. I think, I think it's almost irresponsible to use the data that you can trust, right. And the, really the meaning of the trust is that, do I know enough about this data to, to, for it, to be useful for the purpose that I'm using it for? So, um, I think trust is absolutely fundamental to, as a fundamental characteristics of a data as a product. And again, it comes back to breaching the gap between what the data user knows needs to know to really trust them, use that data, to find it, whether it's suitable and what they know today. So we can bridge that gap with, uh, you know, adding documentation, adding SLRs, adding lineage, like all of these additional information, but not only that, but also having people that are accountable for providing that integrity and those silos and guaranteeing. So it's really those product owners. So I think, um, it's just, for me, it's a non trust is a non-negotiable characteristic of the data as a product, like any other consumer product. >>Exactly. Like you said, if you think about consumer product, consumer marketplace is almost Uber of Amazon, of Airbnb. You have the simple rating as a very simple way of showing trust and those two and those different stakeholders and that almost. And we also say, okay, how do we actually get there? And I think data measure also talks a little bit about the roles responsibilities. And I think the importance overall of a, of a data product owner probably is aligned with that, that importance and trust. Yeah, >>Absolutely. I think we can't just wish for these good things happens without putting the accountability and the right roles in place. And the data product owner is just the starting point for us to stop playing hot potato. When it comes to, you know, who owns the data will be accountable for not so much. Who's the actual owner of that data because the owner of the data is you and me where the data comes really from, but it's the data product owner who's going to be responsible for the life cycle of this. They know when the data gets changed with consumers, meaning you feel as a new information, make sure that that gets carried out and maybe one day retire that data. So that long term ownership with intimate understanding of the needs of the user for that data, as well as the data itself and the domain itself and managing the life cycle of that, uh, I think that's a, that's a necessary role. >>Um, and then we have to think about why would anybody want to be a data product owner, right? What are the incentives we have to set up in the infrastructure, you know, in the organization. Um, and it really comes down to, I think, adopting prior art that exists in the product ownership landscape and bring it really to the data and assume the data users as the, as the customers, right. To make them happy. So our incentives on KPIs for these people before they get product on it needs to be aligned with the happiness of their data users. >>Yep. I love that. The alignment again, to the consumer using things like we know from product management, product owner of these roles and reusing that for data, I think that makes it makes a ton of sense. And it's a good leeway to talk a little about governance, right? 
We mentioned already federated governance, computational governance at we seeing that challenge often with our customers centralizing versus decentralizing. How do we find the right balance? Can you talk a little bit about that in the context of data mesh? How do we, how do we do this? >>Yeah, absolutely. I think the, I was hoping to pack three concepts in the title of the governance, but I thought that would be quite mouthful. So, uh, as you mentioned, uh, the kind of that federated aspects, the competition aspects, and I think embedded governance, I would, if I could add another kind of phrasing there and really it's about, um, as we talked about to how to make it happen. So I think the Federation matters because the people who are really in a position listed this, their product owners in a position to provide data in a trustworthy, with integrity and secure way, they have to have a stake in doing that, right. They have to be accountable, not just for their little domain or a big domain, but also they have to have an accountability for the mesh. So some of the concerns that are applied to all of the data front, I've seen fluid, how we secure them are consistently really secure them. >>How do we model the data or the schema language or the SLO metrics, or that allows this, uh, data to be interoperable so we can join multiple data products. So we have to have, I think, a set of policies that are really minimum set of policies that we have to apply globally to all the data products and then in a federated fashion, incentivize the data product owners. So have a stake in that and make that happen because there's always going to be a challenge in prioritizing. Would I add another few attributes? So my data sets to make my customers happy, or would I adopt that this standardized modeling language, right? They have to make that kind of continuous, um, kind of prioritization. Um, and they have to be incentivized to do both. Right. Uh, and then the other piece of it is okay, if we want to apply these consistent policies, across many data products and the mesh, how would it be physically possible? >>And the only way I can see, and I have seen it done in service mesh would be possible is by embedding those policies as competition, as code into every single data product. And how do we do that again, platform has a big part of it. So be able to have this embedded policy engines and whatever those things are into the data products, uh, and to, to be able to competition. So by default, when you become a data product, as part of the scaffolding of that data product, you get all of these, um, kind of computational capabilities to configure your, your policies according to the global policies. >>No, that makes sense. That makes, that makes it on a sense. That makes sense. >>I'm just curious. Really. So you've been at this for a while. You've built this system for the 13 years came from kind of academic background. So, uh, to be honest, we run into your products, lots of our clients, and there's always like a chat conversation within ThoughtWorks that, uh, do you guys know about this product then? So and so, oh, I should have curious, well, how do you think data governance tehcnology then skip and you need to shift with data mesh, right. And, and if, if I would ask, how would your roadmap changes with database? >>Yeah, I think it's a really good question. Um, what I don't want to do is to make, make the mistake that Venice often make and think of data mesh as a product. 
I think it's a much more holistic mindset change, right? That that's organization. Yes. It needs to be a kind of a platform enablement component there. And we've actually, I think authentically what, how we think about governance, that's very aligned with some of the principles and data measures that federate their thinking or customers know about going to communities domains or operating model. We really support that flexibility. I think from a roadmap perspective, I think making that even easier, uh, as always kind of a, a focus focus area for us, um, specifically around data measures are a few things that come to mind. Uh, one, I think is connectivity, right? If you, if you give different teams more ownership and accountability, we're not going to live in a world where all of the data is going to be stored on one location, right? >>You want to give people themes the opportunity and the accountability to make their own technology decisions so that they are fit for purpose. So I think whatever platform being able to really provide out of the box connectivity to a very wide, um, area or a range of technologies, I think is absolutely critical, um, on the, on the product as a or data as a product, thinking that usability, I think that's top of mind, uh, that's part of our roadmap. You're going to hear us, uh, stock about that tomorrow as well. Um, that data consumer, how do we make it as easy as possible for people to discover data that they can trust that they can access? Um, and in that thinking is a big part of our roadmap. So again, making that as easy as possible, uh, is a, is a big part of it. >>And, and also on the, I think the computation aspect that you mentioned, I think we believe in as well, if, if it's just documentation is going to be really hard to keep that alive, right? And so you have to make an active, we have to get close to the actual data. So if you think about a policy enforcement, for example, some things we're talking about, it's not just definition is the enforcement data quality. That's why we are so excited about our or data quality, um, acquisition as well. Um, so these are a couple of the things that we're thinking of, again, your, your, um, your, your, uh, message around from collecting to connecting. We talk about unity. I think that that works really, really well with our mission and vision as well. So mark, thank you so much. I wish we had more time to continue the conversation, uh, but it's been great to have a conversation here. Thank you so much for being here today and, uh, let's continue to work on that on data. Hello. I'm excited >>To see it. Just come to like.

Published Date : Jun 17 2021
