Collibra Day 1 Felix Zhamak

>>Hi, Felix. Great to be here. >>Likewise. Um, so when I started reading about data mesh, I think about a year ago, I found myself the more I read about it, the more I find myself agreeing with other principles behind data mesh, it actually took me back to almost the starting of Colibra 13 years ago, based on the research we were doing on semantic technologies, even personally my own master thesis, which was about domain driven ontologies. And we'll talk about domain-driven as it's a key principle behind data mesh, but before we get into that, let's not assume that everybody knows what data measures about. Although we've seen a lot of traction and momentum, which is fantastic to see, but maybe if you could start by talking about some of the key principles and, and a brief overview of what data mesh, uh, Isabella of >>Course, well, they're happy to, uh, so Dana mesh is an approach is a new approach. It's a decentralized, decentralized approach to managing and accessing data and particularly analytical data at scale. So we can break that down a little bit. What is analytical data? Well, analytical data is the data that fuels our reporting as a business intelligence. Most importantly, the machine learning training, right? So it's the data, that's, it's an aggregate view of historical events that happens across organizations, many domains within organizations, or even beyond one organization, right? Um, and today we manage, uh, this analytical data through very centralized solutions. So whether it's a data lake or data warehouse or combinations of the two, and, uh, to be honest, we have kind of outsource the accountability for it, to the data team, right? It doesn't happen within the domains. Uh, what we have found ourselves with is, uh, central button next. >>So as we see the growth in the scale of organizations, in terms of the origins of the data and in terms of the great expectations for the data, all of these wonderful use cases that are, that requires access to that, unless we're data, uh, we find ourselves kind of constraints and limited in agility to respond, you know, because we have a centralized bottleneck from team to technology, to architecture. So there's a mesh kind of is that looks at the past what we've done, accidental complexity that we've kind of created and tries to reimagine a different way of, uh, managing and accessing data that can truly scale as this origins of the data grows. As they become available within one organization, we didn't want a cloud or another, and it links down really the approach based on four principles. Uh, so I so far, I haven't tried to be prescriptive as exactly how you implement it. >>I leave that to Elizabeth, to the imaginations of the users. Um, of course I have my opinions, but, but without being prescriptive, I think there are full shifts that needs to happen. One is, uh, we need to start breaking down the, kind of this complex problem of accessing to data around boundaries that can allow this to scale out a solution. So boundaries that are, that naturally fits into that model or domains, right. Our business domain. So, so there's a first principle is the domain ownership of the data. So analytical data will be shared and served and accountable, uh, by the domains where they come from. And then the second dimension of that is, okay. So once we break down this, the ownership of the database on domains, how can we prevent this data siloing? So the second principle is really treating data as a product. >>So considering the success of that data based on the access and usability and the lifelong experience of data analysts, data scientists. So we talk about data as a product and that the third principle is to really make it possible feasible. We need to really rethink our data platforms, our infrastructure capabilities, and create a new set ourselves of capabilities that allows domain in fact, to own their data in fact, to manage the life cycle of their analytical data. So then self-serve daytime frustration and platform is the fourth principle. And the last principle is really around governance because we have to think about governance. In fact, when I first wrote it down, this was like a little kind of concern in, in embedded in what some of my texts and I thought about, okay, now to make this real, we need to think about securing and quality of the data accessibility of the data at scale, in a fashion that embraces this autonomous domain ownership. So we have to think about how can we make this real with competition of governance? How can we make those domains be part of the governance, federated governance, federally, the competition of governance is the fourth principle. So at insurance it's a organizational shift, it's an architectural change. And of course technology needs to change to get us to decentralize access and management of Emily's school data. >>Yeah, I think that makes a ton of sense. If you want to scale, typically you have to think much more distributed versus centralized at we've seen it in other practices as well, that domain-driven thinking as well. I think, especially around engineering, right? We've seen a lot of the same principles and best practices in order to scale engineering teams and not make the same mistakes again, but maybe we can start there with kind of the core principles around that domain driven thinking. Can you elaborate a little bit on that? Why that is so important than the kind of data organizations, data functions as well? >>Absolutely. I mean, if you look at your organizations, organizations are complex systems, right? There are eight made of parts, which are basically domains functions of the business, your automation and your customer management, yourselves marketing. And then the behavior of the organization is the result of an intuitive, you know, network of dependencies and interactions with these domains. So if we just overlay data on this complex system, it does make sense to really, to scale, to bring the ownership and, um, really access to data right at the domain where it originates, right. But to the people who know that data best and most capable of providing that data. So to optimize response, to change, to optimize creating new features, new services, new machine learning models, we've got to kind of think about your call optimization, but not that the cost of global good. Right. Uh, so the domain ownership really talks about giving autonomy to the domains and accountability to provide their data and model the data, um, in a responsible way, be accountable for its quality. >>So no collect some of the empower them and localize some of those responsibilities, but at the same time, you know, thinking about the global goods, so what are they, how that domain needs to be accountable against the other domains on the mission? That's the governance piece covers that. And that leads to some interesting kind of architectural shifts, because when you think about not submission of the data, then you think about, okay, if I have a machine learning model that needs, you know, three pieces of the data from the different domains, I ended up actually distributing the computer also back to those domains. So it actually starts shifting kind of architectural as well. We start with ownership. Yeah, >>No, I think that makes a ton of sense, but I can imagine people thinking, well, if you're organizing, according to these domains, aren't gonna be going to grades different silos, even more silos. And I think that's where it second principle that's, um, think of data as a product and it comes in, I think that's incredibly powerful in my mind. It's powerful because it helps us think about usability. It helps us think about the consumer of that data and really packaging it in the right way. And as one sentence that I've heard you use that I think is incredibly powerful, it's less collecting, more connecting. Um, and can you elaborate on that a little bit? >>Absolutely. I mean the power and the value of the data is not enhanced, which we have got and stored on this, right. It's really about connecting that data to other data sets to aluminate new insights. The higher order information is connecting that data to the users, right. Then they want to use it. So that's why I think, uh, if we shift that thinking from just collecting more in one place, like whatever, and ability to connect datasets, then, then arrive at a different solution. So, uh, I think data as a product, as you said, exactly, was a kind of a response to the challenges that domain-driven siloing could create. And the idea is that the data that now these domains own needs to be shared with some accountability and incentive structure as a product. So if you bring product thinking to data, what does that mean? >>That means delighting the experience that there are users who are they, they're the data analysts, data scientists. So, you know, how can we delight their experience of their journey starts with a hypothesis. I have a question. Do I have right data to answer this question with a particular model? Let me discover it, let me find it if it's useful. Do I trust it? So really fascinated in that journey? I think we have two choices in that we have the choice of source of that data. The people who are really shouldn't be accountable for it, shrug off the responsibility and say, you know, I dumped this data on some event streaming and somebody downstream, the governance or data team will take care of a terror again. So it usable piece of information. And that's what we have done for, you know, half century almost. And, or let's say let's bring intention of providing quality data back to the source and make the folks both empower them and make them accountable for providing that data right at the source as a product. And I think by being intentional about that, um, w we're going to remove a lot of accidental complexity that we have created with, you know, labyrinth pipelines of moving data from one place to another, and try to build quality back into it. Um, and that requires, you know, architectural shifts, organizational shifts, incentive models, and the whole package, >>The hope is absolutely. And we'll talk about that. Federated computational governance is going to be a really an important aspect, but the other part of kind of data as a product next to usability is whole trust. Right? If you, if you want to use it, why is also trusts so important if you think about data as a product? >>Well, uh, I mean, maybe we turn this question back to you. Would you buy the shiniest product if you don't trust it, if you, if you don't trust where it comes from, can I use it? Is it, does it have integrity? I wouldn't. I think, I think it's almost irresponsible to use the data that you can trust, right. And the, really the meaning of the trust is that, do I know enough about this data to, to, for it, to be useful for the purpose that I'm using it for? So, um, I think trust is absolutely fundamental to, as a fundamental characteristics of a data as a product. And again, it comes back to breaching the gap between what the data user knows needs to know to really trust them, use that data, to find it, whether it's suitable and what they know today. So we can bridge that gap with, uh, you know, adding documentation, adding SLRs, adding lineage, like all of these additional information, but not only that, but also having people that are accountable for providing that integrity and those silos and guaranteeing. So it's really those product owners. So I think, um, it's just, for me, it's a non trust is a non-negotiable characteristic of the data as a product, like any other consumer product. >>Exactly. Like you said, if you think about consumer product, consumer marketplace is almost Uber of Amazon, of Airbnb. You have the simple rating as a very simple way of showing trust and those two and those different stakeholders and that almost. And we also say, okay, how do we actually get there? And I think data measure also talks a little bit about the roles responsibilities. And I think the importance overall of a, of a data product owner probably is aligned with that, that importance and trust. Yeah, >>Absolutely. I think we can't just wish for these good things happens without putting the accountability and the right roles in place. And the data product owner is just the starting point for us to stop playing hot potato. When it comes to, you know, who owns the data will be accountable for not so much. Who's the actual owner of that data because the owner of the data is you and me where the data comes really from, but it's the data product owner who's going to be responsible for the life cycle of this. They know when the data gets changed with consumers, meaning you feel as a new information, make sure that that gets carried out and maybe one day retire that data. So that long term ownership with intimate understanding of the needs of the user for that data, as well as the data itself and the domain itself and managing the life cycle of that, uh, I think that's a, that's a necessary role. >>Um, and then we have to think about why would anybody want to be a data product owner, right? What are the incentives we have to set up in the infrastructure, you know, in the organization. Um, and it really comes down to, I think, adopting prior art that exists in the product ownership landscape and bring it really to the data and assume the data users as the, as the customers, right. To make them happy. So our incentives on KPIs for these people before they get product on it needs to be aligned with the happiness of their data users. >>Yep. I love that. The alignment again, to the consumer using things like we know from product management, product owner of these roles and reusing that for data, I think that makes it makes a ton of sense. And it's a good leeway to talk a little about governance, right? We mentioned already federated governance, computational governance at we seeing that challenge often with our customers centralizing versus decentralizing. How do we find the right balance? Can you talk a little bit about that in the context of data mesh? How do we, how do we do this? >>Yeah, absolutely. I think the, I was hoping to pack three concepts in the title of the governance, but I thought that would be quite mouthful. So, uh, as you mentioned, uh, the kind of that federated aspects, the competition aspects, and I think embedded governance, I would, if I could add another kind of phrasing there and really it's about, um, as we talked about to how to make it happen. So I think the Federation matters because the people who are really in a position listed this, their product owners in a position to provide data in a trustworthy, with integrity and secure way, they have to have a stake in doing that, right. They have to be accountable, not just for their little domain or a big domain, but also they have to have an accountability for the mesh. So some of the concerns that are applied to all of the data front, I've seen fluid, how we secure them are consistently really secure them. >>How do we model the data or the schema language or the SLO metrics, or that allows this, uh, data to be interoperable so we can join multiple data products. So we have to have, I think, a set of policies that are really minimum set of policies that we have to apply globally to all the data products and then in a federated fashion, incentivize the data product owners. So have a stake in that and make that happen because there's always going to be a challenge in prioritizing. Would I add another few attributes? So my data sets to make my customers happy, or would I adopt that this standardized modeling language, right? They have to make that kind of continuous, um, kind of prioritization. Um, and they have to be incentivized to do both. Right. Uh, and then the other piece of it is okay, if we want to apply these consistent policies, across many data products and the mesh, how would it be physically possible? >>And the only way I can see, and I have seen it done in service mesh would be possible is by embedding those policies as competition, as code into every single data product. And how do we do that again, platform has a big part of it. So be able to have this embedded policy engines and whatever those things are into the data products, uh, and to, to be able to competition. So by default, when you become a data product, as part of the scaffolding of that data product, you get all of these, um, kind of computational capabilities to configure your, your policies according to the global policies. >>No, that makes sense. That makes, that makes it on a sense. That makes sense. >>I'm just curious. Really. So you've been at this for a while. You've built this system for the 13 years came from kind of academic background. So, uh, to be honest, we run into your products, lots of our clients, and there's always like a chat conversation within ThoughtWorks that, uh, do you guys know about this product then? So and so, oh, I should have curious, well, how do you think data governance tehcnology then skip and you need to shift with data mesh, right. And, and if, if I would ask, how would your roadmap changes with database? >>Yeah, I think it's a really good question. Um, what I don't want to do is to make, make the mistake that Venice often make and think of data mesh as a product. I think it's a much more holistic mindset change, right? That that's organization. Yes. It needs to be a kind of a platform enablement component there. And we've actually, I think authentically what, how we think about governance, that's very aligned with some of the principles and data measures that federate their thinking or customers know about going to communities domains or operating model. We really support that flexibility. I think from a roadmap perspective, I think making that even easier, uh, as always kind of a, a focus focus area for us, um, specifically around data measures are a few things that come to mind. Uh, one, I think is connectivity, right? If you, if you give different teams more ownership and accountability, we're not going to live in a world where all of the data is going to be stored on one location, right? >>You want to give people themes the opportunity and the accountability to make their own technology decisions so that they are fit for purpose. So I think whatever platform being able to really provide out of the box connectivity to a very wide, um, area or a range of technologies, I think is absolutely critical, um, on the, on the product as a or data as a product, thinking that usability, I think that's top of mind, uh, that's part of our roadmap. You're going to hear us, uh, stock about that tomorrow as well. Um, that data consumer, how do we make it as easy as possible for people to discover data that they can trust that they can access? Um, and in that thinking is a big part of our roadmap. So again, making that as easy as possible, uh, is a, is a big part of it. >>And, and also on the, I think the computation aspect that you mentioned, I think we believe in as well, if, if it's just documentation is going to be really hard to keep that alive, right? And so you have to make an active, we have to get close to the actual data. So if you think about a policy enforcement, for example, some things we're talking about, it's not just definition is the enforcement data quality. That's why we are so excited about our or data quality, um, acquisition as well. Um, so these are a couple of the things that we're thinking of, again, your, your, um, your, your, uh, message around from collecting to connecting. We talk about unity. I think that that works really, really well with our mission and vision as well. So mark, thank you so much. I wish we had more time to continue the conversation, uh, but it's been great to have a conversation here. Thank you so much for being here today and, uh, let's continue to work on that on data. Hello. I'm excited >>To see it. Just come to like.

Published Date : Jun 17 2021

SUMMARY :

Great to be here. I found myself the more I read about it, the more I find myself agreeing with other principles So it's the data, that's, it's an aggregate view of historical events that happens in agility to respond, you know, because we have a centralized bottleneck from team to technology, I leave that to Elizabeth, to the imaginations of the users. some of my texts and I thought about, okay, now to make this real, we need to think about securing in order to scale engineering teams and not make the same mistakes again, but maybe we can start there with kind Uh, so the domain ownership really talks about giving autonomy to the domains and And that leads to some interesting kind of architectural shifts, because when you think about not And as one sentence that I've heard you use that I think is incredibly powerful, it's less collecting, data that now these domains own needs to be shared with some accountability shouldn't be accountable for it, shrug off the responsibility and say, you know, I dumped this data on some event streaming aspect, but the other part of kind of data as a product next to usability is whole So we can bridge that gap with, uh, you know, adding documentation, And I think data measure also talks a little bit about the roles responsibilities. of the data is you and me where the data comes really from, but it's the data product owner who's What are the incentives we have to set up in the infrastructure, you know, in the organization. The alignment again, to the consumer using things like we know from product management, So some of the concerns that are applied to all of the data front, Um, and they have to be incentivized to do both. So be able to have this embedded policy engines That makes, that makes it on a sense. So and so, oh, I should have curious, the principles and data measures that federate their thinking or customers know about going to communities domains or operating of the box connectivity to a very wide, um, area or a range of technologies, And, and also on the, I think the computation aspect that you mentioned, I think we believe in as well, Just come to like.

ENTITIES

Entity	Category	Confidence
Amazon	ORGANIZATION	0.99+
Felix	PERSON	0.99+
Isabella	PERSON	0.99+
Uber	ORGANIZATION	0.99+
Airbnb	ORGANIZATION	0.99+
Elizabeth	PERSON	0.99+
Felix Zhamak	PERSON	0.99+
13 years	QUANTITY	0.99+
second principle	QUANTITY	0.99+
two	QUANTITY	0.99+
today	DATE	0.99+
one sentence	QUANTITY	0.99+
third principle	QUANTITY	0.99+
second dimension	QUANTITY	0.99+
fourth principle	QUANTITY	0.99+
both	QUANTITY	0.99+
first principle	QUANTITY	0.99+
two choices	QUANTITY	0.98+
Dana	PERSON	0.98+
Emily	PERSON	0.98+
tomorrow	DATE	0.98+
first	QUANTITY	0.98+
one organization	QUANTITY	0.98+
13 years ago	DATE	0.98+
three pieces	QUANTITY	0.97+
a year ago	DATE	0.97+
One	QUANTITY	0.94+
mark	PERSON	0.93+
one location	QUANTITY	0.93+
three concepts	QUANTITY	0.92+
one place	QUANTITY	0.9+
one	QUANTITY	0.86+
eight made	QUANTITY	0.85+
four principles	QUANTITY	0.84+
single data product	QUANTITY	0.79+
Colibra	PERSON	0.76+
Venice	ORGANIZATION	0.73+
half century	DATE	0.63+
Day 1	QUANTITY	0.6+
ThoughtWorks	ORGANIZATION	0.59+

James Kobielus, IBM - IBM Machine Learning Launch - #IBMML - #theCUBE

>> [Announcer] Live from New York, it's the Cube. Covering the IBM Machine Learning Launch Event. Brought to you by IBM. Now here are your hosts Dave Vellante and Stu Miniman. >> Welcome back to New York City everybody, this is the CUBE. We're here live at the IBM Machine Learning Launch Event. Bringing analytics and transactions together on Z, extending an announcement that IBM made a couple years ago, sort of laid out that vision, and now bringing machine learning to the mainframe platform. We're here with Jim Kobielus. Jim is the Director of IBM's Community Engagement for Data Science and a long time CUBE alum and friend. Great to see you again James. >> Great to always be back here with you. Wonderful folks from the CUBE. You ask really great questions and >> Well thank you. >> I'm prepared to answer. >> So we saw you last week at Spark Summit so back to back, you know, continuous streaming, machine learning, give us the lay of the land from your perspective of machine learning. >> Yeah well machine learning very much is at the heart of what modern application developers build and that's really the core secret sauce in many of the most disruptive applications. So machine learning has become the core of, of course, what data scientists do day in and day out or what they're asked to do which is to build, essentially artificial neural networks that can process big data and find patterns that couldn't normally be found using other approaches. And then as Dinesh and Rob indicated a lot of it's for regression analysis and classification and the other core things that data scientists have been doing for a long time, but machine learning has come into its own because of the potential for great automation of this function of finding patterns and correlations within data sets. So today at the IBM Machine Learning Launch Event, and we've already announced it, IBM Machine Learning for ZOS takes that automation promised to the next step. And so we're real excited and there'll be more details today in the main event. >> One of the most funs I had, most fun I had last year, most fun interviews I had last year was with you, when we interviewed, I think it was 10 data scientists, rock star data scientists, and Dinesh had a quote, he said, "Machine learning is 20% fun, 80% elbow grease." And data scientists sort of echoed that last year. We spent 80% of our time wrangling data. >> [Jim] Yeah. >> It gets kind of tedious. You guys have made announcements to address that, is the needle moving? >> To some degree the needle's moving. Greater automation of data sourcing and preparation and cleansing is ongoing. Machine learning is being used for that function as well. But nonetheless there is still a lot of need in the data science, sort of, pipeline for a lot of manual effort. So if you look at the core of what machine learning is all about, it's supervised learning involves humans, meaning data scientists, to train their algorithms with data and so that involves finding the right data and then of course doing the feature engineering which is a very human and creative process. And then to be training the data and iterating through models to improve the fit of the machine learning algorithms to the data. In many ways there's still a lot of manual functions that need expertise of data scientists to do it right. There's a lot of ways to do machine learning wrong you know there's a lot of, as it were, tricks of the trade you have to learn just through trial and error. A lot of things like the new generation of things like generative adversarial models ride on machine learning or deep learning in this case, a multilayered, and they're not easy to get going and get working effectively the first time around. I mean with the first run of your training data set, so that's just an example of how, the fact is there's a lot of functions that can't be fully automated yet in the whole machine learning process, but a great many can in fact, especially data preparation and transformation. It's being automated to a great degree, so that data scientists can focus on the more creative work that involves subject matter expertise and really also application development and working with larger teams of coders and subject matter experts and others, to be able to take the machine learning algorithms that have been proved out, have been trained, and to dry them to all manner of applications to deliver some disruptive business value. >> James, can you expand for us a little bit this democratization of before it was not just data but now the machine learning, the analytics, you know, when we put these massive capabilities in the broader hands of the business analysts the business people themselves, what are you seeing your customers, what can they do now that they couldn't do before? Why is this such an exciting period of time for the leveraging of data analytics? >> I don't know that it's really an issue of now versus before. Machine learning has been around for a number of years. It's artificial neural networks at the very heart, and that got going actually in many ways in the late 50s and it steadily improved in terms of sophistication and so forth. But what's going on now is that machine learning tools have become commercialized and refined to a greater degree and now they're in a form in the cloud, like with IBM machine learning for the private cloud on ZOS, or Watson machine learning for the blue mixed public cloud. They're at a level of consumability that they've never been at before. With software as a service offering you just, you pay for it, it's available to you. If you're a data scientist you being doing work right away to build applications, derive quick value. So in other words, the time to value on a machine learning project continues to shorten and shorten, due to the consumability, the packaging of these capabilities and to cloud offerings and into other tools that are prebuilt to deliver success. That's what's fundamentally different now and it's just an ongoing process. You sort of see the recent parallels with the business intelligence market. 10 years ago BI was reporting and OLEP and so forth, was only for the, what we now call data scientists or the technical experts and all that area. But in the last 10 years we've seen the business intelligence community and the industry including IBM's tools, move toward more self service, interactive visualization, visual design, BI and predictive analytics, you know, through our cognos and SPSS portfolios. A similar dynamic is coming in to the progress of machine learning, the democratization, to use your term, the more self service model wherein everybody potentially will be able to be, to do machine learning, to build machine learning and deep learning models without a whole of university training. That day is coming and it's coming fairly rapidly. It's just a matter of the maturation of this technology in the marketplace. >> So I want to ask you, you're right, 1950s it was artificial neural networks or AI, sort of was invented I guess, the concept, and then in the late 70s and early 80s it was heavily hyped. It kind of died in the late 80s or in the 90s, you never heard about it even the early 2000s. Why now, why is it here now? Is it because IBM's putting so much muscle behind it? Is it because we have Siri? What is it that has enabled that? >> Well I wish that IBM putting muscle behind a technology can launch anything to success. And we've done a lot of things in that regard. But the thing is, if you look back at the historical progress of AI, I mean, it's older than me and you in terms of when it got going in the middle 50s as a passion or a focus of computer scientists. What we had for the last, most of the last half century is AI or expert systems that were built on having to do essentially programming is right, declared a rule defining how AI systems could process data whatever under various scenarios. That didn't prove scalable. It didn't prove agile enough to learn on the fly from the statistical patterns within the data that you're trying to process. For face recognition and voice recognition, pattern recognition, you need statistical analysis, you need something along the lines of an artificial neural network that doesn't have to be pre-programmed. That's what's new now about in the last this is the turn of this century, is that AI has become predominantly now focused not so much on declarative rules, expert systems of old, but statistical analysis, artificial neural networks that learn from the data. See the, in the long historical sweep of computing, we have three eras of computing. The first era before the second world war was all electromechanical computing devices like IBM's start of course, like everybody's, was in that era. The business logic was burned into the hardware as it were. The second era from the second world war really to the present day, is all about software, programming, it's COBAL, 4trans, C, Java, where the business logic has to be developed, coded by a cadre of programmers. Since the turn of this millennium and really since the turn of this decade, it's all moved towards the third era, which is the cognitive era, where you're learning the business rules automatically from the data itself, and that involves machine learning at its very heart. So most of what has been commercialized and most of what is being deployed in the real world working, successful AI, is all built on artificial neural networks and cognitive computing in the way that I laid out. Where, you still need human beings in the equation, it can't be completely automated. There's things like unsupervised learning that take the automation of machine learning to a greater extent, but you still have the bulk of machine learning is supervised learning where you have training data sets and you need experts, data scientists, to manage that whole process, that over time supervised learning is evolving towards who's going to label the training data sets, especially when you have so much data flooding in from the internet of things and social media and so forth. A lot of that is being outsourced to crowd sourcing environments in terms of the ongoing labeling of data for machine learning projects of all sorts. That trend will continue a pace. So less and less of the actual labeling of the data for machine learning will need to be manually coded by data scientists or data engineers. >> So the more data the better. See I would argue in the enablement pie. You're going to disagree with that which is good. Let's have a discussion [Jim Laughs]. In the enablement pie, I would say the profundity of Hadup was two things. One is I can leave data where it is and bring code to data. >> [Jim] Yeah. >> 5 megabytes of code to petabyte of data, but the second was the dramatic reduction in the cost to store more data, hence my statement of the more data the better, but you're saying, meh maybe not. Certainly for compliance and other things you might not want to have data lying around. >> Well it's an open issue. How much data do you actually need to find the patterns of interest to you, the correlations of interest to you? Sampling of your data set, 10% sample or whatever, in most cases that might be sufficient to find the correlations you're looking for. But if you're looking for some highly deepened rare nuances in terms of anomalies or outliers or whatever within your data set, you may only find those if you have a petabyte of data of the population of interest. So but if you're just looking for broad historical trends and to do predictions against broad trends, you may not need anywhere near that amount. I mean, if it's a large data set, you may only need five to 10% sample. >> So I love this conversation because people have been on the CUBE, Abi Metter for example said, "Dave, sampling is dead." Now a statistician said that's BS, no way. Of course it's not dead. >> Storage isn't free first of all so you can't necessarily save and process all the data. Compute power isn't free yet, memory isn't free yet, so forth so there's lots... >> You're working on that though. >> Yeah sure, it's asymptotically all moving towards zero. But the bottom line is if the underlying resources, including the expertise of your data scientists that's not for free, these are human beings who need to make a living. So you've got to do a lot of things. A, automate functions on the data science side so that your, these experts can radically improve their productivity. Which is why the announcement today of IBM machine learning is so important, it enables greater automation in the creation and the training and deployment of machine learning models. It is a, as Rob Thomas indicated, it's very much a multiplier of productivity of your data science teams, the capability we offer. So that's the core value. Because our customers live and die increasingly by machine learning models. And the data science teams themselves are highly inelastic in the sense that you can't find highly skilled people that easily at an affordable price if you're a business. And you got to make the most of the team that you have and help them to develop their machine learning muscle. >> Okay, I want to ask you to weigh in on one of Stu's favorite topics which is man versus machine. >> Humans versus mechanisms. Actually humans versus bots, let's, okay go ahead. >> Okay so, you know a lot of discussions, about, machines have always replaced humans for jobs, but for the first time it's really beginning to replace cognitive functions. >> [Jim] Yeah. >> What does that mean for jobs, for skill sets? The greatest, I love the comment, the greatest chess player in the world is not a machine. It's humans and machines, but what do you see in terms of the skill set shift when you talk to your data science colleagues in these communities that you're building? Is that the right way to think about it, that it's the creativity of humans and machines that will drive innovation going forward. >> I think it's symbiotic. If you take Watson, of course, that's a star case of a cognitive AI driven machine in the cloud. We use a Watson all the time of course in IBM. I use it all the time in my job for example. Just to give an example of one knowledge worker and how he happens to use AI and machine learning. Watson is an awesome search engine. Through multi-structure data types and in real time enabling you to ask a sequence of very detailed questions and Watson is a relevance ranking engine, all that stuff. What I've found is it's helped me as a knowledge worker to be far more efficient in doing my upfront research for anything that I might be working on. You see I write blogs and I speak and I put together slide decks that I present and so forth. So if you look at knowledge workers in general, AI as driving far more powerful search capabilities in the cloud helps us to eliminate a lot of the grunt work that normally was attended upon doing deep research into like a knowledge corpus that may be preexisting. And that way we can then ask more questions and more intelligent questions and really work through our quest for answers far more rapidly and entertain and rule out more options when we're trying to develop a strategy. Because we have all the data at our fingertips and we've got this expert resource increasingly in a conversational back and forth that's working on our behalf predictively to find what we need. So if you look at that, everybody who's a knowledge worker which is really the bulk now of the economy, can be far more productive cause you have this high performance virtual assistant in the cloud. I don't know that it's really going, AI or deep learning or machine learning, is really going to eliminate a lot of those jobs. It'll just make us far smarter and more efficient doing what we do. That's, I don't want to belittle, I don't want to minimize the potential for some structural dislocation in some fields. >> Well it's interesting because as an example, you're like the, you're already productive, now you become this hyper-productive individual, but you're also very creative and can pick and choose different toolings and so I think people like you it's huge opportunities. If you're a person who used to put up billboards maybe it's time for retraining. >> Yeah well maybe you know a lot of the people like the research assistants and so forth who would support someone like me and most knowledge worker organizations, maybe those people might be displaced cause we would have less need for them. In the same way that one of my very first jobs out of college before I got into my career, I was a file clerk in a court in Detroit, it's like you know, a totally manual job, and there was no automation or anything. You know that most of those functions, I haven't revisited that court in recent years, I'm sure are automated because you have this thing called computers, especially PCs and LANs and so forth that came along since then. So a fair amount of those kinds of feather bedding jobs have gone away and in any number of bureaucracies due to automation and machine learning is all about automation. So who knows where we'll all end up. >> Alright well we got to go but I wanted to ask you about... >> [Jim] I love unions by the way. >> And you got to meet a lot of lawyers I'm sure. >> Okay cool. >> So I got to ask you about your community of data scientists that you're building. You've been early on in that. It's been a persona that you've really tried to cultivate and collaborate with. So give us an update there. What's your, what's the latest, what's your effort like these days? >> Yeah, well, what we're doing is, I'm on a team now that's managing and bringing together all of our program for community engagement programs for really for across portfolio not just data scientists. That involves meet ups and hack-a-thons and developer days and user groups and so forth. These are really important professional forums for our customers, our developers, our partners, to get together and share their expertise and provide guidance to each other. And these are very very important for these people to become very good at, to help them, get better at what they do, help them stay up to speed on the latest technologies. Like deep learning, machine learning and so forth. So we take it very seriously at IBM that communities are really where customers can realize value and grow their human capital ongoing so we're making significant investments in growing those efforts and bringing them together in a unified way and making it easier for like developers and IT administrators to find the right forums, the right events, the right content, within IBM channels and so forth, to help them do their jobs effectively and machine learning is at the heart, not just of data science, but other professions within the IT and business analytics universe, relying more heavily now on machine learning and understanding the tools of the trade to be effective in their jobs. So we're bringing, we're educating our communities on machine learning, why it's so critically important to the future of IT. >> Well your content machine is great content so congratulations on not only kicking that off but continuing it. Thanks Jim for coming on the CUBE. It's good to see you. >> Thanks for having me. >> You're welcome. Alright keep it right there everybody, we'll be back with our next guest. The CUBE, we're live from the Waldorf-Astoria in New York City at the IBM Machine Learning Launch Event right back. (techno music)

Published Date : Feb 15 2017

SUMMARY :

Brought to you by IBM. Great to see you again James. Wonderful folks from the CUBE. so back to back, you know, continuous streaming, and that's really the core secret sauce in many One of the most funs I had, most fun I had last year, is the needle moving? of the machine learning algorithms to the data. of machine learning, the democratization, to use your term, It kind of died in the late 80s or in the 90s, So less and less of the actual labeling of the data So the more data the better. but the second was the dramatic reduction in the cost the correlations of interest to you? because people have been on the CUBE, so you can't necessarily save and process all the data. and the training and deployment of machine learning models. Okay, I want to ask you to weigh in Actually humans versus bots, let's, okay go ahead. but for the first time it's really beginning that it's the creativity of humans and machines and in real time enabling you to ask now you become this hyper-productive individual, In the same way that one of my very first jobs So I got to ask you about your community and machine learning is at the heart, Thanks Jim for coming on the CUBE. in New York City at the IBM Machine Learning

ENTITIES

Entity	Category	Confidence
Jim Kobielus	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Jim	PERSON	0.99+
Dinesh	PERSON	0.99+
Stu Miniman	PERSON	0.99+
IBM	ORGANIZATION	0.99+
James	PERSON	0.99+
80%	QUANTITY	0.99+
James Kobielus	PERSON	0.99+
20%	QUANTITY	0.99+
Jim Laughs	PERSON	0.99+
five	QUANTITY	0.99+
Rob Thomas	PERSON	0.99+
Detroit	LOCATION	0.99+
1950s	DATE	0.99+
last year	DATE	0.99+
New York	LOCATION	0.99+
New York City	LOCATION	0.99+
10 data scientists	QUANTITY	0.99+
One	QUANTITY	0.99+
Siri	TITLE	0.99+
Dave	PERSON	0.99+
10%	QUANTITY	0.99+
5 megabytes	QUANTITY	0.99+
Abi Metter	PERSON	0.99+
two things	QUANTITY	0.99+
first time	QUANTITY	0.99+
last week	DATE	0.99+
second	QUANTITY	0.99+
90s	DATE	0.99+
ZOS	TITLE	0.99+
Rob	PERSON	0.99+
last half century	DATE	0.99+
today	DATE	0.99+
early 2000s	DATE	0.98+
Java	TITLE	0.98+
one	QUANTITY	0.98+
C	TITLE	0.98+
10 years ago	DATE	0.98+
first run	QUANTITY	0.98+
late 80s	DATE	0.98+
Watson	TITLE	0.97+
late 70s	DATE	0.97+
late 50s	DATE	0.97+
zero	QUANTITY	0.97+
IBM Machine Learning Launch Event	EVENT	0.96+
early 80s	DATE	0.96+
4trans	TITLE	0.96+
second world war	EVENT	0.95+
IBM Machine Learning Launch Event	EVENT	0.94+
second era	QUANTITY	0.94+
IBM Machine Learning Launch	EVENT	0.93+
Stu	PERSON	0.92+
first jobs	QUANTITY	0.92+
middle 50s	DATE	0.91+
couple years ago	DATE	0.89+
agile	TITLE	0.87+
petabyte	QUANTITY	0.85+
BAL	TITLE	0.84+
this decade	DATE	0.81+
three eras	QUANTITY	0.78+
last 10 years	DATE	0.78+
this millennium	DATE	0.75+
third era	QUANTITY	0.72+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for half century: