

Kirk Viktor Fireside Chat Trusted Data | Data Citizens'21


 

>>Kirk focuses on the approach to modern data quality and how it can enable the continuous delivery of trusted data. Take it away, Kirk. >>Trusted data has been a focus of mine for the last several years, particularly in the area of machine learning. I spent much of my career on Wall Street, writing models and trying to create a healthy data program, sort of run the bank and protect the franchise, and figure out how to do that at scale for larger organizations. I'm excited to have the opportunity today to sit down for a fireside chat with Viktor. He is an award-winning and best-selling author of Delete, Big Data, and most recently Framers. He's also a professor of governance at Oxford. So Viktor, my question for you today is: in an era of data that is always on and always flowing, how do CDOs get comfortable, you know, the I-can-sleep-at-night factor, when data is coming in from more angles, being stored in different formats and varieties, and probably in larger quantities than ever before? In my opinion, it's just the law of large numbers: with that much data, isn't there that much more risk of having bad data or inaccuracy in your business? >>Well, thank you, Kirk, for having me on. Yes, you're absolutely right. The real problem, if I were to simplify it down to one statement, is incorrect data, and it can lead to wrong decisions that are incredibly costly: costly for trust, for the brand, for the franchise, because they can lead to decisions that are fundamentally flawed and therefore lead the business in the wrong direction. And so the real question is, how can you avoid incorrect data producing incorrect insights? And that depends on how you view trust, and how you view data and correctness, in the first place.
>>Yeah, that's interesting. You know, in my background we were constantly writing models, trying to make the models smarter all the time, and we always wanted to push that accuracy level from 89% to 90%, whatever we could get. But there's this popular theme where, over time, the models diminish in accuracy, and the only button we really had at our disposal was to retrain the model. Oftentimes I'm focused on: should we be stress testing the data, almost like a patient health exam? And how do we do that, so we can get more comfortable about the quality of the data before we run our models and our analytics? >>Yeah, absolutely. When we look at the machine learning landscape, even the big data landscape, what we see is that a lot of focus is now put on getting the models right, getting the kinks worked out, getting the ethics right, the values that are in the model. What is really not looked at, what is not focused on enough, is the data. Now, if you're looking at it from a compliance viewpoint, maybe it's okay if you just look at the model, maybe not. But if you understand that actually using the right data with the right model gives you a competitive advantage that your competitors don't have, then it is far more than compliance. And if it is far more than compliance, then the aperture for strategy opens up, and you should not just look at models. You should actually look at the data, and at the quality and correctness of the data, as a huge way to push forward your competitive advantage. >>Well, I have an even trickier one for you. I think, you know, there's so much coming in, and there's so much that we know we can measure, and so much we can replay and do what-if analysis on and kind of back-test. But do you see organizations doing things to look around the corner?
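Kirk's "patient health exam" idea, stress testing the data before any model or analytic runs on it, can be sketched as a simple profiling pass against historical baselines. This is a minimal, hypothetical illustration; the field names, thresholds, and baseline ranges are invented, not from the talk or any product.

```python
# A minimal "patient health exam" for a batch of records: compare each
# column against a historical baseline before any model consumes the data.
def health_exam(rows, baseline):
    """rows: list of dicts. baseline: {field: (min_ok, max_ok, max_null_rate)}."""
    findings = []
    n = len(rows)
    for field, (lo, hi, max_nulls) in baseline.items():
        values = [r.get(field) for r in rows]
        nulls = sum(v is None for v in values)
        if n and nulls / n > max_nulls:
            findings.append(f"{field}: null rate {nulls / n:.0%} exceeds {max_nulls:.0%}")
        present = [v for v in values if v is not None]
        bad = [v for v in present if not lo <= v <= hi]
        if bad:
            findings.append(f"{field}: {len(bad)} value(s) outside [{lo}, {hi}]")
    return findings

rows = [
    {"age": 34, "salary": 90000},
    {"age": -1, "salary": 85000},   # impossible age
    {"age": 29, "salary": None},    # missing salary
]
baseline = {"age": (0, 120, 0.0), "salary": (20000, 500000, 0.25)}
print(health_exam(rows, baseline))
```

The point of the sketch is that the exam runs on the data itself, before and independently of any model retraining.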
And maybe an interesting analogy would be something like what Tesla is doing, whether it's sensors or LIDAR: they're trying to bounce off every object they know, and they can make a lot of measurements, but the advancements in computer vision are saying, I might be able to predict what's around the corner. I might be able to get out ahead of the data error I'm about to see tomorrow. Do you see any organizations trying to take that futuristic step to sort of know the unknown and be more predictive versus reactive? >>Absolutely. Tesla is doing a bit of that, but so are others in that autonomous driving space. Waymo, the Google company that has been doing autonomous driving for a long time, what they have been doing is collecting training data through their cars and then running machine learning on that training data. Now, they hit a wall a couple of years ago, because the training data wasn't diverse enough. It didn't have that sort of Moore's law of insight anymore, even though there was more and more training data, and so the delta, the additional learning, was just limited. So what they then decided to do was to build a virtual reality called Carcraft, in which cars would actually drive around and create predictive training data. Now, what is really interesting about that is that this isn't just a model. It is a model that creates predictive data, and this predictive data is the actual value that is added to the equation here. And with this extra predictive data, they were able to improve their autonomous driving quite significantly. Five years ago their disengagement rate was one every 2,000 miles on average, and last year, five years later, it was one every 30,000 miles on average. That's a 15x improvement. And that wasn't driven by a mysterious model. It was driven by predictive data. >>Right, right. You know, that's interesting.
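The disengagement figures quoted above can be checked with back-of-the-envelope arithmetic: going from one disengagement every 2,000 miles to one every 30,000 miles is a 15x improvement.

```python
# Sanity check on the disengagement figures quoted in the chat.
miles_before = 2_000    # average miles per disengagement five years ago
miles_after = 30_000    # average miles per disengagement last year

improvement = miles_after / miles_before
print(f"{improvement:.0f}x")  # 15x
```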
I'm also a fan of trying to use data points that don't exist in the data sets, so it sounds like they were using more data, data that was derived from other sources. And maybe the most simple format that I usually get started with is, you know, what if I was looking at data from Glassdoor and I wanted to know if it was valid, if it was accurate? Of course there are going to be numbers in the age field, the salary, years of experience, and different things. But what if the years of experience, age, and academic level of someone no longer correlate to the salary? That correlation component is not a piece of data that even lives in the column, the row, the cell. So I do think there's a huge area for improvement and advancement, not just in the raw data that we see and collect, but also in the data science metrics, things like lift and correlation between the data points, that really help me certify and feel comfortable that this data makes sense. Otherwise it could just be numbers in a field. >>Indeed. And this challenge of finding the data, focusing on the right subset of the data, and manipulating it in a qualitatively right way is really something that has been with us for quite a number of years. There's a fabulous case from a few years back in Japan, when there was the suspicion that in Sumo wrestling there was match fixing going on, massive match fixing. And so investigators came in, took the data from the championship bouts, and analyzed it, and didn't find anything. What was really interesting is that later, researchers came in and read the rules and regulations of Sumo wrestling and understood that it's not just the championship bouts that matter, but sometimes also the relegation matches.
And so then they started looking at those secondary matches that nobody had looked at before, at that subset of data, and they discovered there was massive match fixing going on. It's just that nobody had looked at it, because nobody had made, just as you said, that connection between those various data sources, the sort of causal connectivity there. And so it's really crucial to understand that driving insight out of data isn't a black box thing where you feed the data in and get insight out. It really requires deep thinking about how to wire it up from the very beginning. >>That's an interesting story. I kind of wonder if the model in that case is almost the wrestlers themselves, or the output, but definitely the data that goes into it. So, I mean, do you see a path where organizations will achieve one hundred percent confidence? Because we all know there's an I-can't-sleep-at-night factor, but there's also a case of: what do I do today? I'm probably not living in a perfect world. I might be sailing a boat across an ocean that already has a hole in it. We can't turn everything off; we have to patch the boat and sail it at the same time. What do you think is a good approach for a large organization to improve its posture? >>You know, if you focus on perfection, you never achieve it. One hundred percent perfection is never achievable. And if you want some radical change, that's admirable, but a lot of times it's very risky. It's a very risky proposition. So rather than doing that, there is a lot of low-hanging fruit in an incremental, pragmatic, step-by-step approach. If I can use an analogy from history: we talk a lot about the data revolution, and before that, the industrial revolution. And when we think about the industrial revolution, we think about the steam engine, but the reality is that the steam engine wasn't just one radical invention.
In fact, there were a myriad of small incremental innovations over the course of a century that today we call the industrial revolution. And I think it's the very same thing with the data revolution, where we don't have one silver bullet that radically puts us into data Nirvana. Rather, it is this incremental, pragmatic, step-by-step change that will get us closer to where we want to be, even though there is always more work left for us. >>Yeah, that's interesting. You know, that one hits home for me, because at Collibra we ultimately take an incremental approach. We don't think there's a stop-the-world event. There's a way to learn from the past trends of our data to become incrementally smarter each day, and this kind of stops us from being in a binary project mode, where we have to wait, write something for six months, and then reassess it and hope. You know, we kind of wonder: if you're at 70% accuracy today, is being at 71% better tomorrow? At least there's a measurable amount of improvement there. And it's sort of a philosophical difference. It reminds me of my banking days, when you say, you know, past performance is no guarantee of future results. It's a nice disclaimer you can put on everything, but I actually find the opposite to be true in data.
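The 70%-versus-71% point above, preferring a small measurable gain each day over a six-month project cycle, can be sketched as a daily scorecard. The accuracy numbers here are invented for illustration.

```python
# Track a measurable quality score each day and report the day-over-day
# delta, instead of waiting months for a big-bang reassessment.
daily_accuracy = [0.70, 0.705, 0.71, 0.708, 0.72]

deltas = [round(b - a, 4) for a, b in zip(daily_accuracy, daily_accuracy[1:])]
improved_days = sum(d > 0 for d in deltas)
print(deltas)         # [0.005, 0.005, -0.002, 0.012]
print(improved_days)  # 3 of the 4 days show a measurable improvement
```

Even the one negative delta is useful: it is a concrete, same-day signal to investigate, rather than a surprise at the end of a long project.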
You know, one that's been sitting on the top of my mind recently, especially with COVID and the housing market a long time back, I competed with automation, valuation modeling, which basically means how well can you predict the price of a house? And, you know, that's always a fun one to do. And there's some big name brands out there that do that pretty well. >>Back then when I built those models, I would look at things like the size of the yard, the undulation of the land, uh, you know, whether a pool would award you more or less money for your house. And a lot of those factors were different than they are now. So those models ultimately have already changed. And now that we've seen post COVID people look for different things in housing and the prices have gone up. So we've seen a decline and then a dramatic increase. And then we've also seen things like land and pools become more valuable than they were in the housing model before, you know, what are you seeing here with models and data and how that's going to come together? And it's just, is it always going to change where you're going to have to constantly recalibrate both, you know, our understanding of the data and the models themselves? >>Well, indeed the, the problem of course is almost eternal. Um, oftentimes we have developed beautiful models that work really well. And then we're so wedded to this model or this particular kind of model. And we can fathom to give them up. I mean, if I think of my students, sometimes, you know, they, they, they, they have a model, they collect the data, then they run the analysis and, uh, it basically, uh, tells them that their model was wrong. They go out and they collect more data and more data and more data just to make sure that it isn't there, that, that, that their model is right. But the data tells them what the truth is that the model isn't right anymore that has context and goals and circumstances change the model needs to adapt. 
And we have seen it over and over again, not just in the housing market. Post COVID, and in the COVID crisis, a lot of the epidemiologists looked at the life expectancy of people. But when you look at people in the intensive care unit, suffering with long COVID, in the ICU and so on, you also need to realize, and many have, that rather than life expectancy, you also need to look at life quality as another kind of dimension. And that means your model needs to change, because you can't just have a model that optimizes on life expectancy anymore. And so what we need to do is to understand that the data, and the changes in the data, the dynamic of the data, really is a thorn in our side, prompting us to revisit the model and think very critically about what we can do in order to adjust the model to the present situation. >>Well, with that, Viktor, I've really enjoyed our chat today. Do you have any final thoughts, comments, or questions for me? >>You know, Kirk, I enjoyed it tremendously as well. I do think that what is important to understand with data is that there is no silver bullet, and there are only incremental steps forward. This is not actually something to despair over, but the source of great hope, because it means that not just tomorrow, but even the day after tomorrow, and the day after that, we can still make headway, make improvements, and get better. >>Absolutely. I like the hopeful message. I live every day to make data a better place, and it is exciting as we see the advancements in what's possible and what's on the forefront. Well, with that, I really appreciate the chat, and I would encourage anyone
who's interested in this topic to attend a session later today on modern data quality, where I go through maybe five key flaws of the past and some of the pitfalls, and explain a little bit more about how we're using unsupervised learning to solve for future problems. Thanks, Viktor. >>Thank you, Kirk. >>Thanks, Kirk and Viktor. How incredible was that?

Published Date : Jun 17 2021

