Zhamak Dehghani, ThoughtWorks | theCUBE on Cloud 2021

>> From around the globe, it's theCUBE, presenting theCUBE on Cloud, brought to you by SiliconANGLE. >> In 2009, Hal Varian, Google's chief economist, said that being a statistician would be the sexiest job in the coming decade. The modern big data movement really took off later in the following year, after the second Hadoop World, which was hosted by Cloudera in New York City. Jeff Hammerbacher famously declared to me and John Furrier on theCUBE that the best minds of his generation were trying to figure out how to get people to click on ads. And he said, "That sucks." The industry was abuzz with the realization that data was the new competitive weapon. Hadoop was heralded as the new data management paradigm. Now, what actually transpired over the next 10 years? Only a small handful of companies could really master the complexities of big data and attract the data science talent necessary to realize massive returns. As well, back then, cloud was in the early stages of its adoption, when you think about it, at the beginning of the last decade. And as the years passed, more and more data got moved to the cloud, and the number of data sources absolutely exploded. Experimentation accelerated, as did the pace of change. Complexity just overwhelmed big data infrastructures and data teams, leading to a continuous stream of incremental technical improvements designed to try and keep pace: things like data lakes, data hubs, new open source projects, and new tools, which piled on even more complexity. And as we reported, we believe what's needed is a complete bit flip in how we approach data architectures. Our next guest is Zhamak Dehghani, who is the director of emerging technologies at ThoughtWorks. Zhamak is a software engineer, architect, thought leader, and adviser to some of the world's most prominent enterprises. She's, in my view, one of the foremost advocates for rethinking and changing the way we create and manage data architectures.
Favoring a decentralized over a monolithic structure, and elevating domain knowledge as a primary criterion in how we organize so-called big data teams and platforms. Zhamak, welcome to theCUBE. It's a pleasure to have you on the program. >> Hi, David. It's wonderful to be here. >> Well, okay, so you're pretty outspoken about the need for a paradigm shift in how we manage our data and our platforms at scale. Why do you feel we need such a radical change? What are your thoughts there? >> Well, I think if you just look back over the last decades, you gave us a summary of what happened since 2010. But even if we go before then, what we have done over the last few decades is basically repeating and, as you mentioned, incrementally improving how we've managed data, based on certain assumptions around, as you mentioned, centralization: data has to be in one place so we can get value from it. But if you look at the parallel movement of our industry in general since the birth of the Internet, we are actually moving towards decentralization. If today, leaving the data side out, we said the only way the Web would work, the only way we get access to various applications or Web pages, is to centralize it, we would laugh at that idea. But for some reason we don't question that when it comes to data, right? So I think it's time to embrace the complexity that comes with the growth of the number of sources, the proliferation of sources and consumption models, and to embrace the distribution of sources of data: they're not just within one part of the organization, they're not even just within the bounds of the organization, they're beyond the bounds of the organization.
And then look back and say: okay, if that's the trend of our industry in general, given the fabric of computation and data that we have put in place globally, then how do the architecture, technology, organizational structure, and incentives need to move to embrace that complexity? To me, that requires a paradigm shift, a full-stack one, from how we organize our organizations and our teams to how we put technology in place, to look at it from a decentralized angle. >> Okay, so let's unpack that a little bit. You've spoken about and written that today's big data architecture, and you basically just mentioned it, is flawed. So I want to bring up, and I love your diagrams, a simple diagram. Guys, if you could bring up figure one. So on the left here we're ingesting data from the operational systems and other enterprise data sets and, of course, external data. We cleanse it, you've got to do the quality thing, and then serve it up to the business. So what's wrong with that picture that we just described? And granted, it's a simplified form. >> Yeah, quite a few things. I would flip the question maybe back to you or the audience. We said that there are so many sources of data, and the data comes from systems and from teams that are very diverse in terms of domains. If you just think about, I don't know, retail: e-commerce versus order management versus customer, these are very diverse domains. The data comes from many different, diverse domains, and then we expect to put it under the control of a centralized team, a centralized system. And I know that centralization, probably, if you zoom out, it's centralized.
If you zoom in, it's compartmentalized based on functions, and we can talk about that. And we assume that the centralized model will work: getting that data, making sense of it, cleansing and transforming it, and then satisfying the needs of a very diverse set of consumers, without really understanding the domains, because the teams responsible for it are not close to the source of the data. So there is a bit of a cognitive gap and a domain-understanding gap, without really understanding how the data is going to be used. When I came up with this idea, I talked to a lot of data teams globally just to see what the pain points are and how they are doing it. And one thing that was evident in all of those conversations was that, after they built these pipelines and put the data in, whether data warehouse tables or a lake, they didn't actually know how the data was being used. Yet they are responsible for making the data available for this diverse set of use cases. So a centralized system, a monolithic system, often is a bottleneck. What you find is that a lot of the teams are struggling with satisfying the needs of the consumers and struggling with really understanding the data. The domain knowledge is lost; there is a loss of understanding in that transformation. Often we end up training machine learning models on data that is not really representative of the reality of the business, and then we put them into production and they don't work, because the semantics and the syntax of the data get lost within that translation. And we're struggling with finding people to manage a centralized system, because the technology is still, in my opinion, fairly low level and exposes the users of those technologies, let's say a warehouse, to a lot of complexity.
So in summary, I think it's a bottleneck. It's not going to satisfy the pace of change, the pace of innovation, and the pace of availability of sources. It's disconnected and fragmented: even though it's centralized, it's disconnected and fragmented from where the data comes from and where the data gets used. And it's managed by a team of hyper-specialized people who are struggling to understand the actual value of the data and the actual format of the data. So it's not going to get us where our aspirations and ambitions need to be. >> Yes. So the big data platform is essentially, I think you call it, context agnostic. And as data becomes more important in our lives, you've got all these new data sources injected into the system. Experimentation, as we said, becomes much, much easier with the cloud. So one of the blockers that you just mentioned is that you've got these hyper-specialized roles: the data engineer, the quality engineer, the data scientist. And it's illusory; I mean, it's like an illusion. Seemingly they're independent and can scale independently. But I think you've made the point that in fact they can't: a change in the data source has an effect across the entire data lifecycle, the entire data pipeline. So maybe you could add some color to why that's problematic for some of the organizations that you work with, and maybe give some examples. >> Yeah, absolutely. In fact, initially the hypothesis around that image came from a series of requests that we received from our large-scale and progressive clients, progressive in terms of their investment in data architectures. So these were clients that were larger scale. They had a diverse and rich set of domains. Some of them were big technology companies; some of them were retail companies or big health care companies.
So they had that diversity of data and of the number of sources and domains. They had invested for quite a few years in generations, multiple generations, of proprietary data warehouses on-prem that they were moving to the cloud; they had moved through the various revisions of Hadoop clusters, and they were moving those to the cloud. And the challenges that they were facing were simply, if I want to simplify it in one phrase, that they were not getting value from the data that they were collecting. They were continuously struggling to shift the culture, because there was so much friction between all three phases: consumption of the data from sources, transformation, and then providing and serving it to the consumer. That whole process was full of friction. Everybody was unhappy. So the bottom line is that you're collecting all this data, and there is delay. There is a lack of trust in the data itself, because the data is not representative of the reality: it has gone through a transformation by people who didn't really understand what the data was, and it got delayed. So there is no trust. It's hard to get to the data. Ultimately, it's hard to create value from the data, and people are working really hard and under a lot of pressure, but still struggling. So often our solutions, as technologists, point to technology. We go: okay, this version of some proprietary data warehouse we're using is not the right thing; we should go to the cloud, and that certainly will solve our problems, right? Or: the warehouse wasn't a good one; let's make a data lake version. So instead of, you know, extracting and then transforming and loading into the little bits.
And that transformation is that heavy process, because you fundamentally made an assumption, using warehouses, that if I transform this data into this multidimensional, perfectly designed schema, then everybody can run whatever query they want, and that's going to solve everybody's problem. But in reality it doesn't, because you are delayed and there is no universal model that serves everybody's needs. Everybody needs diverse data; data scientists don't necessarily like the perfectly modeled data. They're looking for both the signals and the noise. So then we've just gone from ETL to, let's say, the lake, which is: okay, let's move the transformation to the last mile. Let's just load the data into the object stores, into semi-structured files, and get the data scientists to use it. But they're still struggling because of the problems that we mentioned. So then what is the solution? Well, a next-generation data platform; let's put it on the cloud. And we saw clients that had actually gone through a year or multiple years of migration to the cloud. I've seen 18 months, nine-month migrations of the warehouse versus two-year migrations of the various data sources to the cloud. But ultimately the result is the same: unsatisfied, frustrated data users and data providers, with a lack of ability to innovate quickly on relevant data and to have the experience that they deserve, a delightful experience of discovering and exploring data that they trust. All of that was still amiss. So something else, more fundamental, needed to change than just the technology. >> So then the linchpin to your scenario is this notion of context. You made the other observation that, look, we've made our operational systems context aware.
But our data platforms are not. And, like, with a CRM system, the sales guys are very comfortable with what's in the CRM system. They own the data. So let's talk about the answer that you and your colleagues are proposing. You're essentially flipping the architecture, whereby those domain knowledge workers, the builders, if you will, of data products or data services, are now first-class citizens in the data flow, and they're injecting domain knowledge into the system by design. So I want to put up another one of your charts. Guys, bring up figure two there. It talks about convergence: you show distributed domain-driven architecture, self-serve platform design, and this notion of product thinking. So maybe you could explain why this approach is so desirable, in your view. >> Sure. The motivation and inspiration for the approach came from studying what has happened over the last few decades in operational systems. We had a very similar problem prior to microservices, with monolithic systems, where the monolith was the bottleneck. The changes we needed to make were always orthogonal to how the architecture was centralized. And we found a nice niche. I'm not saying this is the perfect way of decoupling a monolith, but it's a way that, where we currently are in our journey to become data driven, is a nice place to be: distribution or decomposition of your system as well as your organization. Whenever we talk about systems, we've got to talk about the people and teams responsible for managing those systems. So it's the decomposition of the systems and the teams and the data around domains, because that's how today we are decoupling our business, right? We're decoupling our businesses around domains, and that's a good thing. And what does that really do for us? It localizes change to the bounded context of that business.
It creates clear boundaries, interfaces, and contracts between the rest of the universe of the organization and that particular team, so it removes the friction that we often have both in managing change and in serving data or capability. So the first principle of data mesh is: let's decouple this world of analytical data the same way, to mirror the way we have decoupled our systems and teams and business. Why is data any different? And the moment you do that, the moment you bring the ownership to the people who understand the data best, you get questions: well, how is that any different from the silos of disconnected databases that we have today, where nobody can get to the data? So the rest of the principles are really there to address all of the challenges that come with this first principle of decomposition around domain context. The second principle is: well, we have to expect a certain level of quality, accountability, and responsibility from the teams that provide the data. So let's bring product thinking, treating data as a product, to the data that these teams now share, and let's put accountability around it. We need a new set of incentives and metrics for domain teams to share the data. We need a new set of quality metrics that define what it means for the data to be a product, and we can go through that conversation perhaps later. So the second principle is: okay, the domain teams that are now responsible for the analytical data need to provide that data with a certain level of quality and assurance. Let's call that a product and bring product thinking to it. And then comes the next question you get asked, by CIOs or CTOs, the people who build the infrastructure and spend the money.
They say: well, it's actually quite complex to manage big data, and now we want everybody, every independent team, to manage the full stack of storage and computation and pipelines and access control and all of that? And the answer is: well, we have solved that problem in the operational world. It requires a new level of platform thinking, to provide infrastructure and tooling to the domain teams so they can manage and serve their big data. And I think that requires reimagining the world of our tooling and technology. But for now, let's just assume that we need a new level of abstraction to hide away a ton of complexity that people unnecessarily get exposed to. That's the third principle: creating self-serve infrastructure to allow autonomous teams to build their domains. But then the last pillar, the last fundamental pillar, is: okay, once you have distributed a problem into smaller problems, you find yourself with another set of problems, which is how am I going to connect this data? The insights happen and emerge from the interconnection of the data domains, right? They're not necessarily locked into one domain. So the concerns around interoperability and standardization, and getting value as a result of composition and interconnection of these domains, require a new approach to governance. We have to think about governance very differently, based on a federated model and based on a computational model. Once we have this powerful self-serve platform, we can computationally automate a lot of governance decisions, the security decisions and policy decisions that apply to this fabric of the mesh, not just to a single domain and not in a centralized way. So really, as you mentioned, the most important component of the data mesh is distribution of ownership and distribution of architecture and data; the rest of the principles are there to solve all the problems that come with that.
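To make the idea of computational governance a little more concrete, here is a minimal sketch in Python. It is not something shown in the interview: the `DataProduct` descriptor, its fields, and the two global policies (`has_owner`, `encrypted_at_rest`) are all hypothetical, and a real platform would define its own. The only point is that rules agreed on by a federated group can be written as code and checked automatically for every data product on the mesh.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor for a domain-owned data product."""
    name: str
    domain: str
    owner: str = ""          # accountable domain team (assumed field)
    encrypted: bool = False  # whether data is encrypted at rest (assumed field)


# A global policy is a named predicate over a data product.
Policy = Callable[[DataProduct], bool]

# Illustrative policies only; a real governance group would agree on its own.
GLOBAL_POLICIES: Dict[str, Policy] = {
    "has_owner": lambda p: bool(p.owner.strip()),
    "encrypted_at_rest": lambda p: p.encrypted,
}


def check_policies(product: DataProduct,
                   policies: Dict[str, Policy] = GLOBAL_POLICIES) -> List[str]:
    """Return the names of the policies this product violates."""
    return [name for name, rule in policies.items() if not rule(product)]


def check_mesh(products: List[DataProduct]) -> Dict[str, List[str]]:
    """Map each non-compliant product name to its list of violations."""
    report: Dict[str, List[str]] = {}
    for product in products:
        violations = check_policies(product)
        if violations:
            report[product.name] = violations
    return report
```

In this sketch the domain teams keep ownership of their products, while the platform runs `check_mesh` on their behalf, so no central team has to inspect each product by hand.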
>> So, very powerful. Guys, we actually have a picture of what Zhamak just described. Bring up figure three, if you would. So essentially, you're advocating for the pushing of the pipeline and all its various functions into the lines of business, and abstracting the complexity of the underlying infrastructure, which you kind of show here in this figure: data infrastructure as a platform, down below. And you know what I love about this, Zhamak, is that to me it underscores that data is not the new oil. I can put oil in my car or I can put it in my house, but I can't put the same quart in both places. I think you call it polyglot data, which is really different forms, batch or whatever, but the same data. Data doesn't follow the laws of scarcity: I can use the same data for many, many uses, and that's what this graphic shows. And then you brought in the really important sticking problem, which is governance, which is now not command and control; it's federated governance. So maybe you could add some thoughts on that. >> Sure, absolutely. I keep referring to data mesh as a paradigm shift, and it's not just to make it sound grand and exciting or important. It's really because I want to point out that we need to question, every moment when we make a decision around how we're going to design security or governance or modeling of the data. We need to reflect and go back and say: am I applying some of my cognitive biases around how I have worked for the last 40 years, because I have seen it work? Or do I really need to question? And we do need to question the way we have applied governance. I think at the end of the day, the role of data governance and its objective remain the same. I mean, we all want quality data accessible to a diverse set of users.
And these users now have different personas: data analyst, data scientist, data application user; very diverse personas. So at the end of the day, we want quality data accessible to them, trustworthy and in an easily consumable way. However, how we get there looks very different. As you mentioned, the governance model in the old world has been very command and control, very centralized. They were responsible for quality. They were responsible for certification of the data, making sure the data complies with all sorts of regulations, and making sure data gets discovered and made available. In the world of data mesh, the job of data governance as a function becomes finding the equilibrium between what decisions need to be made and enforced globally and what decisions need to be made locally, so that we can have an interoperable mesh of data sets that can move fast and change fast. It's really about, instead of putting those systems in a straitjacket of being constant and not changing, embracing change and the continuous change of the landscape, because that's just the reality we can't escape. So the governance model is called federated and computational. By that I mean every domain needs to have a representative in the governance team. The role of the domain data product owner, who really understands the data of that domain well but also wears the hat of a product owner, is an important role that has to have representation in governance. So it's a federation of domains coming together, plus the SMEs, the subject matter experts who understand the regulations in that environment and understand the data security concerns. But they don't try to enforce and do this as a central team.
They make decisions as to what needs to be standardized and what needs to be enforced, and let's push that, computationally and in an automated fashion, into the platform itself. For example, instead of trying to be part of the data quality pipeline and inject ourselves as people into that process, let's actually, as a group, define what constitutes quality: how do we measure quality? And then let's automate that and codify it into the platform, so that every data product will have a CI/CD pipeline, and as part of that pipeline those quality metrics get validated, and every data product needs to publish those SLOs, or service level objectives. So whatever we choose as a measure of quality, maybe it's the integrity of the data, the delay in the data, the liveliness of it, whatever decisions you're making, let's codify that. So really, the objectives the governance team tries to satisfy are the same, but how they do it is very, very different. I wrote a new article recently trying to explain the logical architecture that would emerge from applying these principles, and I put in a kind of table to compare and contrast how we do governance today versus how we will do it differently, to just give people a flavor of what it means to embrace decentralization, and what it means to embrace change and continuous change. So hopefully that could be helpful. >> Yes, very. So many questions I have, but on the point you make about data quality: sometimes I feel like quality is the end game, whereas the end game should be how fast you can go from idea to monetization with a data service. What happens, and again you sort of addressed this, to the underlying infrastructure? I mean, spinning up EC2s and S3 buckets and my PyTorches and TensorFlows.
Where does that live in the business, and who's responsible for that? >> Yeah, I'm glad you're asking this question, because I truly believe we need to reimagine that world. I think there are many pieces that we can use as utilities or foundational pieces, but I can see for myself a five-to-seven-year roadmap of building this new tooling. In terms of ownership, I think that remains with the platform team, perhaps a domain-agnostic, technology-focused team that is providing a set of products themselves. And the users of those products are data product developers, right? Data domain teams that now have really high expectations in terms of low friction and in terms of the lead time to create a new data product. So we need a new set of tooling, and I think the language needs to shift from "I need a storage bucket," or "I need a storage account," or "I need a cluster to run my Spark jobs," to "Here's the declaration of my data product. This is where the data for it will come from. This is the data that I want to serve. These are the policies that I need to apply, in terms of perhaps encryption or access control. Go make it happen, platform; go provision everything that I need," so that as a data product developer all I can focus on is the data itself, the representation of the semantics and the representation of the syntax, and making sure that the data meets the quality that I have to assure and that it's available. The rest, the provisioning of everything that sits underneath, will have to be taken care of by the platform. That's what I mean by saying it requires a reimagination. And there will be a data platform team. The data platform teams that we set up for our clients, in fact, themselves have a fair bit of complexity.
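The shift in language described above, from requesting buckets and clusters to declaring a data product, can be sketched roughly as follows. This is an illustration only and not from the interview: `DataProductSpec`, its fields, and `plan_provisioning` are hypothetical names, and a real platform would provision actual resources rather than return a list of descriptive strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataProductSpec:
    """Hypothetical declaration a data product developer hands to the platform."""
    name: str
    sources: List[str]   # where the data for the product comes from
    outputs: List[str]   # the data the product wants to serve
    policies: List[str] = field(default_factory=list)  # e.g. encryption, access control


def plan_provisioning(spec: DataProductSpec) -> List[str]:
    """Translate a declaration into the steps a platform might carry out.

    The steps are plain strings here; the developer never names a bucket,
    a storage account, or a cluster.
    """
    steps = [f"provision storage for {spec.name}"]
    steps += [f"connect source {source}" for source in spec.sources]
    steps += [f"publish output {output}" for output in spec.outputs]
    steps += [f"apply policy {policy}" for policy in spec.policies]
    return steps


# Example declaration (all values are made up for illustration).
orders_spec = DataProductSpec(
    name="orders",
    sources=["order-service"],
    outputs=["daily_orders"],
    policies=["encrypt-at-rest"],
)
orders_plan = plan_provisioning(orders_spec)
```

The design choice being illustrated is the separation of concerns: the declaration carries only domain-level intent, and everything underneath is left to the platform.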
Internally, they divide into multiple teams, multiple planes. So there would be a plane, as in a group of capabilities, that satisfies the data product developer experience. There would be a set of capabilities that deal with those nitty-gritty underlying utilities. I call them utilities at this point because, to me, the level of abstraction of the platform needs to go higher than where it is. So what we call a platform today is a set of utilities we will continue to use: we'll continue to use object storage, we'll continue to use relational databases, and so on. So there will be a plane and a group of people responsible for that. There will be a group of people responsible for capabilities that enable the mesh-level functionality, for example being able to correlate, connect, and query data from multiple nodes. That's a mesh-level capability. Being able to discover and explore the mesh of data products is a mesh-level capability. So it would be a set of teams as part of the platform, with strong platform product thinking and product ownership embedded into that, to satisfy the experience of these now business-oriented domain data teams. So we have a lot of work to do. >> I could go on; unfortunately, we're out of time. But first I want to tell people there are two pieces that you've put out so far. One is "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh." You guys should read that. And "Data Mesh Principles and Logical Architecture" is kind of part two. I guess my last question, in the very limited time we have, is: are organizations ready for this? >> I think the desire is there. I've been overwhelmed with the number of large and medium and small, private and public, government and federal organizations that have reached out to us globally. I mean, this is a global movement, and I'm humbled by the response of the industry. I think the desire is there.
The pains are real; people acknowledge that something needs to change here. So that's the first step. I think that awareness is spreading. Organizations are more and more becoming aware. In fact, many technology providers are reaching out to us asking what they should do, because their clients are asking them. People are already asking: we need the data mesh, and we need the tooling to support it. So that awareness is there, in terms of the first step of being ready. However, the ingredients of a successful transformation require top-down and bottom-up support. So it requires support from chief data and analytics officers or above. The most successful clients that we have with data mesh are the ones where the CEOs have made a statement that we want to change the experience of every single customer using data, and we're going to commit to this. So the investment and support exist from the top through all layers. The engineers are excited, and perhaps the traditional data teams are open to change. So there are a lot of ingredients of a successful transformation that need to come together. Are we really ready for it? I think the pioneers, perhaps the innovators, if you think about the innovation curve, the early adopters, probably the pioneers and innovators and lead adopters, are making a move towards it. And hopefully, as the technology becomes more available, organizations that are less engineering oriented, that don't have the capability in-house today but can buy it, would come next. Maybe those are the ones who aren't quite ready for it, because the technology is not readily available and requires internal investment today. >> I think you're right on. I think the leaders are going to lean in hard, and they're going to show us the path over the next several years.
And I think the end of this decade is going to be defined a lot differently than the beginning. Zhamak, thanks so much for coming on theCUBE and participating in the program. >> Pleasure. >> All right, keep it right there, everybody. We'll be back right after this short break.

Published Date : Jan 22 2021


ENTITIES

Entity	Category	Confidence
David	PERSON	0.99+
Jean Marc de Connie	PERSON	0.99+
Hal Varian	PERSON	0.99+
Zhamak Dehghani	PERSON	0.99+
New York City	LOCATION	0.99+
John Mark	PERSON	0.99+
5	QUANTITY	0.99+
Jeff Ham Abakar	PERSON	0.99+
two year	QUANTITY	0.99+
two pieces	QUANTITY	0.99+
Google	ORGANIZATION	0.99+
John	PERSON	0.99+
nine months	QUANTITY	0.99+
2000	DATE	0.99+
18 months	QUANTITY	0.99+
first step	QUANTITY	0.99+
second principle	QUANTITY	0.99+
both places	QUANTITY	0.99+
both	QUANTITY	0.99+
One	QUANTITY	0.99+
a year	QUANTITY	0.99+
one part	QUANTITY	0.99+
first	QUANTITY	0.99+
Claudette Cloudera	PERSON	0.99+
third principle	QUANTITY	0.98+
10	DATE	0.98+
first principle	QUANTITY	0.98+
one domain	QUANTITY	0.98+
today	DATE	0.98+
Lee	PERSON	0.98+
one phrase	QUANTITY	0.98+
three phases	QUANTITY	0.98+
Cuban	OTHER	0.98+
Jammeh	PERSON	0.97+
7 year	QUANTITY	0.97+
Mawr	PERSON	0.97+
Jamaat	PERSON	0.97+
last decade	DATE	0.97+
Maurin Mawr	PERSON	0.94+
single domain	QUANTITY	0.92+
one thing	QUANTITY	0.91+
ThoughtWorks	ORGANIZATION	0.9+
one	QUANTITY	0.9+
nine	QUANTITY	0.9+
theCUBE	ORGANIZATION	0.89+
end	DATE	0.88+
last few decades	DATE	0.87+
one place	QUANTITY	0.87+
Second Hadoop World	EVENT	0.86+
three	OTHER	0.85+
C. E. O	ORGANIZATION	0.84+
this decade	DATE	0.84+
Siris	TITLE	0.83+
coming decade	DATE	0.83+
Andi	PERSON	0.81+
Chamakh	PERSON	0.8+
three buckets	QUANTITY	0.77+
Jama	PERSON	0.77+
Cuban	PERSON	0.76+
Aziz	ORGANIZATION	0.72+
years	DATE	0.72+
first class	QUANTITY	0.72+
last 40	DATE	0.67+
single customer	QUANTITY	0.66+
part two	OTHER	0.66+
last	DATE	0.66+
Cloud	TITLE	0.56+
2021	DATE	0.55+
next 10 years	DATE	0.54+
Hadoop	EVENT	0.53+
following year	DATE	0.53+
years	QUANTITY	0.51+
Cube	ORGANIZATION	0.5+
Noto	ORGANIZATION	0.45+
Cube	PERSON	0.39+
Cube	COMMERCIAL_ITEM	0.26+

Mai Lan Tomsen Bukovec, Vice President, Block and Object Storage, AWS

>> We continue with theCUBE on Cloud. We're here with Mai-Lan Tomsen Bukovec, who's the vice president of block and object storage at AWS, which comprises Elastic Block Store, Amazon S3 and Amazon Glacier. Mai-Lan, great to see you again. Thanks so much for coming on the program. >> Nice to be here. Thanks for having me, Dave. >> You're very welcome. So here we're unpacking the future of cloud, and we'd love to get your perspectives on how customers should think about the future of infrastructure, things like applying machine intelligence to their data. But just to set the stage, when we look back at the history of storage in the cloud, it obviously started with S3, and then a couple of years later AWS introduced EBS for block storage. Those are the most well-known services in the portfolio, but there's more: there's cold storage, and there are the new capabilities that you announced recently at re:Invent around, you know, super-duper block storage, and Intelligent-Tiering is another example. But it looks like AWS is really starting to accelerate and pick up the pace of customer options in storage. So my first question is, how should we think about this expanding portfolio? >> Well, I think you have to go all the way back to what customers are trying to do with their data, Dave. The path to innovation is paved by data. If you don't have data, you don't have machine learning. You don't have the next generation of analytics applications that help you chart a path forward into a world that seems to be changing every week. And so in order to have that insight, in order to have that predictive forecasting that every company needs, regardless of what industry you're in today, it all starts from data. And I think the key shift that I've seen is how customers are thinking about that data as being instantly usable. Whereas in the past, it might've been a backup. Now it's part of a data lake. 
And if you can bring that data into a data lake, you can have not just analytics or machine learning or auditing applications. It's really, what does your application do for your business, and how can it take advantage of that vast shared data set in your business? >> Awesome, so thank you. So I want to make sure we're hitting on the big trends that you're seeing in the market that are informing your strategy around the portfolio, and what you're seeing with customers. Instant usability, you know, you bring machine learning into the equation. I think people have really started to understand the benefits of cloud storage as a service and the pay-by-the-drink model. Obviously COVID has accelerated that; you know, cloud migration has accelerated. Anything else we're missing there? What are the other big trends that you see, if any? >> Well, Dave, you did a good job of capturing a lot of the drivers. The one thing I would say that just sits underneath all of it is the massive growth of digital data year over year. IDC says digital data is growing at a rate of 40% year over year. And that has been true for a while, and it's not going to stop. It's going to keep on growing because the sources of that data acquisition keep on expanding. And whether it's IoT devices or content created by users, that data is going to grow, and everything you're talking about depends on the ability to not just capture it and store it but, as you say, use it. >> Well, you know, we talk about data growth a lot, and sometimes it becomes a bromide. But I think the interesting thing that I've observed over the last couple of decades really is that the growth is non-linear, and the curve is really starting to shape exponentially. You guys always talk about that flywheel effect. It's really hard to believe. You know, people say trees don't grow to the moon. It seems like data does. 
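To put the 40% year-over-year figure quoted above in perspective, here is a minimal arithmetic sketch. It is an illustration added to the transcript, not something presented in the interview; the function names are hypothetical, and the only input taken from the conversation is the 40% annual rate attributed to IDC.

```python
import math

# Annual growth rate quoted in the interview (attributed to IDC).
ANNUAL_GROWTH = 0.40


def grow(start, years, rate=ANNUAL_GROWTH):
    """Return the data volume after `years` of compound annual growth."""
    return start * (1 + rate) ** years


def doubling_time(rate=ANNUAL_GROWTH):
    """Return the number of years needed for the volume to double."""
    return math.log(2) / math.log(1 + rate)


if __name__ == "__main__":
    # Relative size of a data set that starts at 1.0 (any unit).
    for years in (1, 5, 10):
        print(f"after {years:>2} years: {grow(1.0, years):.1f}x")
    print(f"doubling time: {doubling_time():.2f} years")
```

Under this assumption of a constant rate, a data set doubles roughly every two years and grows to about 29 times its starting size in a decade, which is the exponential shape being described.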
>> It does. And what's interesting about working in the world of AWS storage, Dave, is that it's counter-intuitive, but our goal with data growth is to make it cost effective. And so year over year, how can we make it cheaper and cheaper? It is to have customers store more and more data so they can use it. But it's also to think about the definition of usage and what kind of data is being tapped by businesses for their insights, and make that easier than it's ever been before. >> Let me ask you a follow-up question on that, Mai-Lan, because I get asked this a lot, or I hear comments a lot, that yes, AWS continuously and rigorously reduces pricing, but it's just kind of following the natural curve of Moore's law or whatever. How do you respond to that? Are there other factors involved? Obviously labor is another, you know, cost-reducing factor, but what does the trend line say? >> Well, cost efficiency is in our DNA, Dave. We come to work every day in AWS, across all of our services, and we ask ourselves how we can lower our costs and pass that along to customers. As you say, there are many different aspects to costs. There's a cost to the storage itself. There's a cost to the data center. And that's really what we've seen impact a lot of customers that were slower or just getting started with a move to the cloud: they entered 2020 and then found out exactly how expensive that data center was to maintain, because they had to put in safety equipment and do all the things that you have to do in a pandemic, in a data center. And so sometimes that cost is a little bit hidden, or it won't show up until you really don't need to have it land. But the cost of managing that explosive growth of data is very real. And when we're thinking about costs, we're thinking about how we can lower them on a per-gigabyte-per-month basis, but we're also building adaptive discounts into the product itself. 
Like, we have a storage class in S3 that's called Intelligent-Tiering. And in Intelligent-Tiering we have built-in monitoring where, if particular objects aren't frequently accessed in a given month, a customer will automatically get a discounted price for that storage. Or a customer can, you know, as of late last year, say that they want to automatically move storage in that storage class that has been stored, for example, longer than 180 days, and save 95% by moving it into deep archive storage. And so it's not just, you know, relentlessly going after and lowering the cost of storage. It's also building into the products these new ways where we can adaptively discount storage based on what a customer's storage is actually doing. >> Right, and I would add that the other thing AWS has done is it's really forced transparency, almost the same way that Amazon has done in retail. Now, Mai-Lan, when we talked last, I mentioned that S3 was an object store. And of course that's technically correct, but your comment to me was, Dave, it's more than that. And you started to talk about SageMaker and AI and bringing in machine learning. And I wonder if you could talk a little bit about the future of how storage is going to be leveraged in the cloud, which is maybe different from what we've been used to in the early days of S3, and how your customers should be thinking about infrastructure, not as bespoke services but as a suite of capabilities, and maybe some of those adjacent services that you see as most leverageable for customers, and why? 
And so an individual application that was dealing with auditing as an example wouldn't really be able to access the storage for another application, because you know, the architecture of that legacy world is tied to a data silo and S3 came out launched in 2006 and introduced very low cost storage. That is an object. And I'll tell you, Dave, you know, over the last 10 plus years we have seen all kinds of data coming to S3. Whereas before it might've been backups or it might've been images and videos. Now a pretty substantial data set is our parquet files and work files. These files are there for business analytics for more real-time type of processing. And that has really been the trend of the future, is taking these different files putting them in a shared file layer, so any application today or in the future can tap into that data. And so this idea of the shared file layer is a major trend that has been taking off for the last I would say five or six years. And I expect that to not only keep on going but to really open up the type of services that you can then do on that shared file layer. And whether that's Sage maker or some of the machine learning introduced by our connect service, it's bringing together the data as a starting point and then the applications can evolve very rapidly on top of that. >> I want to ask your opinion about big data architectures. One of our guests Chamakh Tigani, she's amazing data architect. And she's put forth this notion of a distributed global mesh. And picking up on some of the comments, Andy Jassy made it at re-invent how essentially, "Hey we're bringing AWS to the edge. "We see the data center is just another edge node." So you're seeing this massive distributed system evolving. You guys have talked about that for a while and data by its very nature is distributed but we've had this tendency to put it into a monolithic data Lake or a data warehouse and it's sort of antithetical to that distributed nature. 
So how do you see that playing out? What do you see customers in the future doing in terms of their big data architectures, and what does that mean for storage? >> It comes down to the nature of the data and, again, the usage. And Dave, that's where I see the biggest difference between these modern data architectures and the legacy of 20 years ago: the idea that the data need drives the data storage. So let's take an example of the type of data that you always want to have on the edge. We have customers today that need to have storage in the field, and whether it's the field of scientific research, or oftentimes it's content creation in the film industry, or it's military operations, there's a lot of data that needs to be captured and analyzed in the field. And for us, what that means is that, you know, we have a suite of products called Snowball, and whether it's Snowball or Snowcone, take your pick, that whole portfolio of AWS services is targeted at customers that need to do work with storage at the edge. And, you know, if you think about the need for multiple applications acting on the same data set, that's when you keep it in an AWS region. And what we've done in AWS storage is we've recognized that, depending on the need of usage, where you put your data and how you interact with it may vary. But we've built a whole set of services, like data transfer, to help make sure that we can connect data from, for example, that new Snowcone into a region automatically. And so our goal, Dave, is to make sure that when customers are operating at the edge or they're operating in the region, they have the same quality of storage service, and they have easy ways to go between them. You shouldn't have to pick; you should be able to do it all. >> So in the spirit of do-it-all, there's this sort of age-old dynamic in the tech business where you've got the friction between the best of breed and the integrated suite. And my question is around what you're optimizing for customers. 
And can you have your cake and eat it too? In other words, why AWS storage? What makes it compelling? Is it because it's kind of a best-of-breed storage service, or is it because it's integrated with AWS? Would you ever sub-optimize one in order to get an advantage in the other? Or can you actually, you know, have your cake and eat it too? >> The way that we build storage is to focus on both the breadth of capabilities and the depth of capabilities. And so where we identify a particular need where we think that it takes a whole new service to deliver, we'll go build that service. And an example of that is our AWS SFTP service. You know, there's a lot of SFTP usage out there, and there will be for a while because of, you know, the legacy B2B type of architectures that still live in the business world today. And so we looked at that problem. We said, how are we going to build that in the best depth way, with the best focus? And we launched a separate service for that. And so our goal is to take the individual building blocks of EBS and Glacier and S3 and make them best of class and the most comprehensive in the capabilities of what we can do, and where we identify a very specific need, we'll go build a service for it. But Dave, you know, for that idea of both depth and breadth, S3 Storage Lens is a great example. S3 Storage Lens is a new capability that we launched late last year. And what it does is it lets you look across all your regions and all your accounts and get a summary view of all your S3 storage, whether that's buckets or the most active prefixes that you have, and be able to drill down from that. And that is built into the S3 service and available for any customer that wants to turn it on in the AWS Management Console. >> Right, and we saw you just recently made, I called it super-duper block storage, some improvements in really addressing the highest performance. 
I want to ask you, so we've all learned about and experienced the benefits of cloud over the last several years, and especially in the last 10 months during the pandemic. But one of the challenges, and it's particularly acute with IO, is of course latency and moving data around and accessing data remotely. It's a challenge for customers, you know, due to the speed of light, et cetera. So my question is, how is AWS thinking about all that data that still resides on premises? I think we heard at re:Invent that still 90% of the opportunity, or the workloads, are still on-prem and live inside a customer's data centers. So how do you tap into those and help customers innovate with on-prem data, particularly from a storage angle? >> Well, we always want to provide the best-of-class solution for those low-latency workloads. And that's why we launched Block Express just late last year at re:Invent. Block Express is a new capability in preview on top of our io2 Provisioned IOPS volume type. And what's really interesting about Block Express, Dave, is that the way that we're able to deliver the performance of Block Express, which is SAN performance with cloud elasticity, is that we went all the way down to the network layer and we customized the hardware and software. And at the network layer we built Block Express on something called SRD, which stands for Scalable Reliable Datagram. And basically what it's letting us do is offload all of our EBS operations for Block Express onto the Nitro card, in hardware. And so that type of innovation, where we're able to, you know, take advantage of modern commodity, multi-tenant data center networks, where we're sending this new network protocol across a large number of network paths, that type of innovation all the way down to the protocol level helps us innovate in a way that's hard, in fact, I would say impossible, for other SAN providers to really catch up and keep up with. 
And so we feel that the amount of innovation that we have for delivering those low-latency workloads in our AWS cloud storage is unlimited, really, because of that ability to customize software, hardware and network protocols as we go along. Without requiring upgrades from a customer, it just gets better, and the customer benefits. Now, if you want to stay in your data center, that's why we built Outposts. And for Outposts, we have EBS and we have S3 for Outposts, and our goal there is that some customers will have workloads where they want to keep them resident in the data center. And for those customers we want to give them those AWS storage opportunities as well. >> So thank you for coming back to Block Express. So you call it, you know, SAN in the cloud. So essentially it comprises a custom-built storage network. Is that right? What you just described, SRD, I think you called it? >> Yeah, SRD is used by other AWS services as well, but it is a custom network protocol that we designed to deliver the lowest-latency experience, and we're taking advantage of it with Block Express. >> So sticking with traditional data centers for a moment, I'm interested in your thoughts on the importance of the cloud pricing approach, i.e., the consumption model, pay by the drink. Obviously it's one of the most attractive features, and I ask that because we're seeing what Andy Jassy refers to as the old guard institute flexible pricing models. Two of the biggest storage companies, HP with GreenLake, and Dell has this thing called APEX, have announced such models for on-prem and presumably cross-cloud. How do you think this is going to impact your customers' leverage of AWS cloud storage? Is it something that you have an opinion on? >> Yeah, I think it all comes down to, again, that usage of the storage, and this is where I think there's an inherent advantage for our cloud storage. 
So there might be an attempt by the old guard to lower prices or add flexibility, but at the end of the day it comes down to what the customer actually needs to tune. And if you think about gp3, which is the new EBS volume, the idea with gp3 is we're going to pass along savings to the customer by making the storage 20% cheaper than gp2. And we're going to make the product better by giving a great, reliable baseline performance. But we're also going to let customers who want to run workloads like Cassandra on EBS tune their throughput separately, for example, from their capacity. So if you're running Cassandra, sometimes you don't need to change your capacity. Your storage capacity works just fine. But what happens with, for example, a Cassandra workload is that you may need more throughput. And if you're buying a hardware appliance, you just have to buy for your peak. You have to buy for the max of what you think your throughput is and the max of what your storage is. And this inherent flexibility that we have for AWS storage, being able to tune throughput separately from capacity like you do for gp3, that is really where the future is for customers: having control over costs and control over customer experience without compromising or trading off either one. >> Awesome, thank you for that. So in the time we have remaining, Mai-Lan, I want to talk about the topic of diversity and social impact, and as a woman leader, a woman executive, I really want to get your perspectives on this. And I've shared with the audience previously, in one of my Breaking Analysis segments, your boxing video, which is awesome. And so, you've got a lot of unique, non-traditional aspects to your life, and I love it, but I want to ask you this. So it's obviously, you know, certainly politically and socially correct to talk about diversity, the importance of diversity. There's data that suggests that diversity is good economically, not just socially, and of course it's the right thing to do. 
But there are those, you know, Peter Thiel is probably the most prominent, but there are others, who say, "You know what? Forget that. Just hire people just like you. You'll be able to go faster, ramp up more quickly, hit escape velocity. It's natural," and that's what you should do. Why is that not the right approach? Why is diversity, of course, socially, you know, responsible, but also, you know, good for business? >> For Amazon, we think about diversity as something that is essential to how we think about innovation. And so, Dave, as you know from listening to some of the announcements at re:Invent, we launch a lot of new ideas, like new concepts and new services, in AWS. And just bringing that lens down to storage, S3 has been reinventing itself every year since we launched in 2006. EBS introduced the first SAN in the cloud late last year, and continues to reinvent how customers think about block storage. We would not be able to look at a product in a different way and think to ourselves, not just what does the legacy system do in a data center today, but how do we want to build this new distributed system in a way that helps customers achieve not just what they're doing today but what they want to do in five and 10 years. You can't get that innovative mindset without bringing different perspectives to the table. And so we strongly believe in hiring people who are from underrepresented groups, whether that's gender, or it's related to racial equality, or it's geographic diversity, and bringing them in to have the conversation, because those diverse viewpoints inform how we can innovate at all levels in AWS. >> Right, and so I really appreciate your perspectives on that. And as you probably know, theCUBE has been, you know, a very big advocate of diversity generally, but women in tech specifically. We've participated a lot. 
And I often ask this question, you know, as a smaller company. I and some of my other colleagues in small business, sometimes we struggle. And so my question is, what's your advice for going beyond, you know, the good old boys network? I think large companies like AWS and, you know, the big players, you've got a responsibility too, and you can put somebody in charge and make it their full-time job. How should smaller companies that are largely white male dominated become more diverse? What should they do to increase that diversity? >> I think the place to start is voice. A lot of what we try to do is make sure that the underrepresented voice is heard. And so, Dave, any small business owner in any industry can encourage voice for their underrepresented or unheard populations. And honestly, it is as simple as being in a meeting and looking around that table, or on your screen, as it were, and asking yourself, who hasn't talked? Who hasn't weighed in? Particularly if the debate is contentious or even animated. And, particularly if you note this over time, you will see that there may be somebody, whether it's someone from an underrepresented group, or a woman who's early in her career, or just a member of your team who happens to be a white male too, who's not being heard. And you can ask that person for their perspective. And that is a step that every one of us can and should take, which is to ask to have everyone's voice at the table, to listen and to weigh in on it. So I think that is something everyone should do. I think if you are a member of an underrepresented group, as, for example, I'm Vietnamese American and I'm a female in tech, I think it's something to think about, how you can make sure that you're always taking that bold step forward. And it's one of the topics that we covered at re:Invent. 
We had a great discussion with a group of women CEOs, and a lot of what we talked about is being bold, taking the challenge of being bold in tough situations. And that is an important thing, I think, for anybody to keep in mind, but especially for members of underrepresented groups, because sometimes, Dave, that bold step that you kind of think of as, like, "Oh, I don't know if I should ask for that promotion," or "I don't know if I should volunteer for that project," it's not a big ask, but it's big in your head. And so if you can internalize, as a member of, you know, a group that maybe isn't heard or seen as much, how you can take those bold challenges and step forward and learn, maybe fail also, because that's how you learn, then that is a way to also have people learn and develop and become leaders in whatever industry it is. >> That's great advice. I think most of us can relate to that, Mai-Lan, because when we started in the industry, we may have been timid. You didn't necessarily want to speak up. And I think it's incumbent upon those in a position of power, and by the way, power might just be running a meeting agenda, to maybe call on those folks. Maybe it's not diversity of gender or, you know, race. Maybe it's just the underrepresented. Maybe that's a good way to start building muscle memory. So that's unique advice that I hadn't heard before. So thank you very much for that. I appreciate it. And hey, listen, thanks so much for coming on theCUBE on Cloud. We're out of time. We really always appreciate your perspectives, and you're doing a great job. And thank you. >> Great, thank you, Dave. Thanks for having me, and have a great day. >> All right, keep it right there, everybody. You're watching theCUBE on Cloud. We'll be right back. (gentle upbeat music)
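As an illustration of the adaptive discounting described earlier in the interview, where storage that has sat for longer than 180 days can move to deep archive at a 95% saving, the following is a minimal cost sketch. It is not AWS's implementation or API: the class, the function names and the base price per gigabyte are hypothetical, and "idle" is assumed here to mean days since the object was last accessed. Only the 180-day threshold and the 95% figure come from the conversation.

```python
from dataclasses import dataclass

ARCHIVE_AFTER_DAYS = 180   # threshold mentioned in the interview
ARCHIVE_SAVING = 0.95      # "save 95%" as stated in the interview
BASE_PRICE_PER_GB = 0.02   # hypothetical monthly price, not an AWS quote


@dataclass
class StoredObject:
    key: str
    size_gb: float
    days_idle: int  # assumption: days since the object was last accessed


def monthly_price_per_gb(obj):
    """Return the per-GB monthly price, discounted once the object is idle long enough."""
    if obj.days_idle > ARCHIVE_AFTER_DAYS:
        return BASE_PRICE_PER_GB * (1 - ARCHIVE_SAVING)
    return BASE_PRICE_PER_GB


def monthly_cost(objects):
    """Return the total monthly cost for a collection of stored objects."""
    return sum(o.size_gb * monthly_price_per_gb(o) for o in objects)


if __name__ == "__main__":
    inventory = [
        StoredObject("reports/current.parquet", 100.0, 10),
        StoredObject("backups/2019.tar", 100.0, 200),
    ]
    print(f"monthly cost: {monthly_cost(inventory):.2f}")
```

The point of the sketch is the design choice discussed in the interview: the discount is driven by what each object is actually doing, so the customer does not have to reclassify data by hand.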

Published Date : Jan 11 2021

ENTITIES

Entity	Category	Confidence
Dave	PERSON	0.99+
Dell	ORGANIZATION	0.99+
AWS	ORGANIZATION	0.99+
2006	DATE	0.99+
Andy Jassy	PERSON	0.99+
HP	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
40%	QUANTITY	0.99+
90%	QUANTITY	0.99+
two	QUANTITY	0.99+
EBS	ORGANIZATION	0.99+
GreenLake	ORGANIZATION	0.99+
20%	QUANTITY	0.99+
Chamakh Tigani	PERSON	0.99+
Mai Lan Tomsen Bukovec	PERSON	0.99+
five	QUANTITY	0.99+
first question	QUANTITY	0.99+
95%	QUANTITY	0.99+
IDC	ORGANIZATION	0.99+
one	QUANTITY	0.99+
six years	QUANTITY	0.99+
Moore	PERSON	0.99+
10 years	QUANTITY	0.99+
2020	DATE	0.98+
1990s	DATE	0.98+
S3	TITLE	0.98+
both	QUANTITY	0.98+
gp2	TITLE	0.98+
gp3	TITLE	0.98+
late last year	DATE	0.98+
20 years ago	DATE	0.98+
longer than 180 days	QUANTITY	0.97+
Mai-Lan Tomsen Bukovec	PERSON	0.97+
pandemic	EVENT	0.96+
today	DATE	0.95+
Gatos	ORGANIZATION	0.94+
block express	TITLE	0.94+
EBS	TITLE	0.94+
Mai-Lan	PERSON	0.93+
Astri	ORGANIZATION	0.92+
