Jim Cushman, CPO, Collibra


 

>> From around the globe, it's theCUBE, covering Data Citizens '21. Brought to you by Collibra.

>> We're back talking all things data at Data Citizens '21. My name is Dave Vellante and you're watching theCUBE's continuous virtual coverage, #DataCitizens21. I'm here with Jim Cushman, Collibra's Chief Product Officer, who shared the company's product vision at the event. Jim, welcome, good to see you.

>> Thanks Dave, glad to be here.

>> Now one of the themes of your session was all around self-service and access to data. This is a big point of discussion amongst organizations that we talk to. I wonder if you could speak a little more toward what that means for Collibra and your customers, and maybe some of the challenges of getting there.

>> So Dave, our ultimate goal at Collibra has always been to enable self-service access for all customers. Now, one of the challenges is that these knowledge workers are limited in how they can access information, so our goal is to totally liberate them. Why is this important? Well, in and of itself, self-service liberates tens of millions of data-literate knowledge workers. This will drive more rapid, insightful decision-making, and it'll drive productivity and competitiveness. And to make this level of adoption possible, the user experience has to be as intuitive as, say, retail shopping, like I mentioned in my previous bit, like you're buying shoes online. But this is a little bit of foreshadowing, and there's an even more profound future than just enabling self-service. We believe a new class of shopper is coming online, and she may not be as data-literate as our knowledge worker of today. Think of her as an algorithm developer; she builds machine learning or AI. The engagement model for this user will be to build automation, personalized experiences for people to engage with data. But in order to build that automation, she too needs data. Because she's not data-literate, she needs the equivalent of a personal shopper, someone that can guide her through the experience without actually having her know all the answers to the questions that would be asked. So this level of self-service goes one step further and becomes an automated service, one to really help find the best unbiased and labeled training data to help train an algorithm in the future.

>> That's, okay please continue.

>> No please, and so all of this self and automated service needs to be complemented with a peace of mind that you're letting the right people gain access to it. So when you automate it, it's like, well, geez, are the right people getting access to this? So it has to be governed and secured. This can't become like the Wild Wild West, or like what we call a data flea market where data's everywhere. So, you know, history does quickly forget the companies that do not adjust to remain relevant. And I think we're in the midst of an exponential differentiation, and Collibra Data Intelligence Cloud is really established to be the key catalyst for companies that will be on the winning side.

>> Well, that's big, because I'm a big believer in putting data in the hands of those folks in the line of business. And of course the big question that always comes up is, well, what about governance? What about security? So to the extent that you can federate that, that's huge. Because data is distributed by its very nature, it's going to stay that way. It's complex.
You have to make the technology work in that complex environment, which brings me to this idea of low code or no code. It's gaining a lot of momentum in the industry. Everybody's talking about it, but there are a lot of questions: what can you actually expect from no code and low code, who are the right potential users of it, and is there a difference between low and no? And so from your standpoint, why is this getting so much attention, and why now, Jim?

>> If you go back even 25 years ago, we were talking about fourth- and fifth-generation languages that people were building. And it really didn't reach the total value that folks were looking for, because it always fell short. And you'd say, listen, if you didn't do all the work it took to get to a certain point, how are you possibly going to finish it? And that's where the 4GLs and 5GLs fell short as a capability. With our stuff, if you really want great self-service, how are you going to be self-service if it still requires somebody to write code? Well, I guess you could do it if the only self-service users are people who write code, but that's a pretty narrow group. So if you truly want the ability to have something show up at your front door, without you having to call somebody or make any effort to get it, then it needs to generate itself. The beauty of doing a catalog and governance, understanding all the data that is available for choice, is giving someone a selection based on objective criteria: this is the best objective choice because of its quality for what you want, or it's labeled, or it's unbiased, and it has that level of deterministic value to it, versus guessing, or subjectivity, or what my neighbor used, or what I used on my last job. Now that we've given people the power to say with confidence, this is the one that I want, the next step is, okay, can you deliver it to them without them having to write any code? So imagine being able to generate those instructions from everything we have in our metadata repository, to say this is exactly the data I need you to go get, perform what we call a distributed query against those data sets, and bring it back to them. No code written. And here's the real beauty, Dave: data pipeline development is a relatively expensive thing today, and that's why people spend a lot of money maintaining these pipelines. But imagine if there was zero cost to building your pipeline; would you spend any money to maintain it? Probably not. So if we can build it for no cost, then why maintain it? Just build it every time you need it. And again, that's done on a self-service basis.

>> I really like the way you're thinking about this, because you're right. A lot of times when you hear self-service, it's about making the hardcore developers able to do self-service. But the reality is, and you talk about that data pipeline, it's complex. A business person is sitting there waiting for data, or wants to put in new data, and it turns out that the smallest unit of work is actually that entire team. And so you sit back and wait. So to the extent that you can actually enable self-serve for the business through simplification, that's been the holy grail for a while, isn't it?

>> I agree.
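To make the metadata-driven, no-code pipeline idea concrete, here is a minimal sketch in Python. It is not Collibra's implementation; the asset model, field names, and join logic are hypothetical. It only illustrates the pattern Jim describes: a user picks governed catalog assets, and per-source queries plus a join plan are generated from metadata rather than hand-coded.

```python
# Minimal sketch (hypothetical model): build a "pipeline" purely from
# catalog metadata. Assumes one asset per source system for simplicity.
from dataclasses import dataclass
from typing import List

@dataclass
class CatalogAsset:
    """A dataset as described in a metadata repository."""
    name: str            # logical name the business user sees
    source: str          # physical system, e.g. "snowflake_prod"
    table: str           # physical table name
    columns: List[str]   # governed, quality-scored columns
    join_key: str        # key used when combining with other assets

def build_pipeline(assets: List[CatalogAsset]) -> dict:
    """Generate per-source queries and a join plan -- no hand-written code."""
    queries = {
        a.source: f"SELECT {', '.join(a.columns + [a.join_key])} FROM {a.table}"
        for a in assets
    }
    # Join every additional asset to the first one on its declared key.
    join_plan = [(assets[0].name, a.name, a.join_key) for a in assets[1:]]
    return {"queries": queries, "join_plan": join_plan}

if __name__ == "__main__":
    customers = CatalogAsset("Customers", "snowflake_prod", "crm.customers",
                             ["customer_name", "segment"], "customer_id")
    orders = CatalogAsset("Orders", "postgres_erp", "sales.orders",
                          ["order_date", "amount"], "customer_id")
    print(build_pipeline([customers, orders]))
```

Because the generated plan costs nothing to produce, it can be regenerated on every request instead of being maintained as a long-lived artifact, which is the economic argument Jim makes above.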
>> Let's dig in a little bit to where you're placing your bets. I mean, as head of products you've got to make bets, certainly many months if not years in advance. What are your big focus areas of investment right now?

>> Yeah, certainly. So one of the things we've done very successfully since our origin over a decade ago was building business-user-friendly software in what was predominantly a plumbing or infrastructure area. So business users love working with our software. They can find what they're looking for, and they don't need some cryptic key to work with it. They can think about things in their own terms, use our business glossary, and navigate through what we call our data intelligence graph to find just what they're looking for. And we don't require a business to change everything just to make it happen; we give them kind of a universal translator to talk to the data. But with all that wonderful usability, the common compromise is that it's only good up to a certain amount of information, kind of like Excel. You can do almost anything with Excel, right? But when you get into large volumes it becomes problematic, and then you need to go to a hardcore database with an application on top. So what the industry is pulling us towards is far greater amounts of data, not just millions or even tens of millions, but hundreds of millions and billions of things that we need to manage. So we have a huge focus on scale and performance on a global basis, and that's a mouthful, right? Not only are you dealing with large amounts of data at performance, but you have to do it in a global fashion and make it possible for somebody operating in Southeast Asia to have the same experience as they would in Los Angeles. And the data needs to go to the user, as opposed to having the user come to the data, as much as possible. So it really does put a lot of emphasis on what you call the non-functional requirements, also known as the "ilities," and our ability to deliver those large, enterprise-grade capabilities at scale and performance, globally, is what's driving a good number of our investments today.

>> I want to talk about data quality. This is a hard topic, but it's one that's so important, and I think it's been really challenging and somewhat misunderstood. When you think about the chief data officer role itself, it kind of emerged from these highly regulated industries, and it came out of data quality, kind of a back-office role that has now gone front and center and become pretty strategic. Having said that, the prevailing philosophy is, okay, we've got to have this centralized data quality approach that gets imposed throughout. And it really is a hard problem, and I think about these hyper-specialized roles, like the quality engineer and so forth. And again, the prevailing wisdom is, if I can centralize that, it can be lower cost and I can service these lines of business, when in reality the real value is speed. So how are you thinking about data quality? You hear so much about it. Why is it such a big deal, why is it so hard, and why is it a priority in the marketplace? Your thoughts.

>> Thanks for that. So we of course acquired a data quality company earlier this year, OwlDQ, and the big question is, okay, so why, why them, and why now, not before? Well, at least a decade ago you started hearing people talk about big data.
It was probably around 2009 that it became the big talk, and what we don't really talk about with this ever-expanding data is the byproduct: the velocity of data is increasing dramatically. The speed at which new data is presented, and the rate at which data changes, is dramatic. And why is that important to data quality? Because data quality, historically, for the last 30 years or so, has been a rules-based business where you analyze the data at a certain point in time and write a rule for it. Now, there's already room for error there, because humans are involved in writing those rules, but with the increased velocity, the likelihood that a rule atrophies and is no longer valid or useful to you increases exponentially. So we were looking for a technology that did it in a new way, similar to the way we do auto-classification when we're cataloging attributes: how do we look at millions of pieces of metadata and decide what something is, to put it into context? The ability to automatically generate these rules, and then continuously adapt them as data changes, is really a game changer for the industry itself. So we chose OwlDQ for that very reason. Not only do they have this really modern architecture to automatically generate rules, but they then continuously monitor the data and adjust those rules, cutting out the huge cost of carrying rules that aren't helping you. And frankly, you know how this works: no one really complains until there's a squeaky wheel, you get a fine or an exposure, and that's what causes a lot of issues with data quality. And then why now? Well, I think, and this is my speculation, there's so much movement of data to the cloud right now. Anyone who made big investments in data quality for their on-premise data warehouses, Netezzas, Teradatas, Oracles, et cetera, or even their data lakes, is now moving to the cloud. And they're asking, which investments that we had on premise are we going to carry forward, and which ones are we going to start anew? Data quality seems to be ripe for something new, so these new investments in the cloud are looking at a next-generation method of doing data quality, and that's where we fit in nicely. And of course, finally, you can't really do data governance and cataloging without data quality, and data quality without data governance and cataloging is kind of a hollow long-term story. So the three working together is a very powerful story.
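As an illustration of the auto-generated, continuously adapting rules Jim describes, here is a minimal sketch in Python. It is not OwlDQ's or Collibra's algorithm; the functions, thresholds, and tolerances are hypothetical. The point is only that rules are derived from observed statistics and re-derived as data changes, rather than hand-written once and left to atrophy.

```python
# Minimal sketch: derive simple quality rules (value range, null rate)
# from a profiled sample, check new data against them, then re-derive
# the rules so they adapt instead of atrophying. Thresholds are illustrative.
import statistics

def generate_rules(values):
    """Machine-written rules based on the observed distribution."""
    observed = [v for v in values if v is not None]
    mean, stdev = statistics.mean(observed), statistics.pstdev(observed)
    return {
        "min": mean - 3 * stdev,                           # expected lower bound
        "max": mean + 3 * stdev,                           # expected upper bound
        "max_null_rate": values.count(None) / len(values) + 0.05,
    }

def check(values, rules):
    """Report violations of the current rules for a new batch of data."""
    null_rate = values.count(None) / len(values)
    out_of_range = [v for v in values
                    if v is not None and not rules["min"] <= v <= rules["max"]]
    return {"null_rate_ok": null_rate <= rules["max_null_rate"],
            "out_of_range": out_of_range}

if __name__ == "__main__":
    history = [100, 102, 98, 101, None, 99, 103, 97]
    rules = generate_rules(history)              # no rule written by hand
    new_batch = [101, 250, None, None, 99]       # drifted data
    print(check(new_batch, rules))               # flags 250 and the null spike
    rules = generate_rules(history + new_batch)  # rules adapt as data changes
```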
>> I've got to ask you some Columbo questions about this, because you're right, it's rules-based, and so my immediate reaction is, okay, what are the rules around COVID or hybrid work, right? With static rules there's so much unknown, and what you're saying is you've got a dynamic process for that. And one of my gripes about the whole big data thing, and you referenced that 2009, 2010 timeframe, and I loved it, because there were a lot of profound things about Hadoop and a lot of failings, is that there's no context in the big data system. The folks in the data pipeline don't have the business context. So my question is, it sounds like you've got this awesome magic to automate, but who adjudicates the dynamic rules? Do humans play a role, and what role do they play?

>> Absolutely. There's the notion of sampling. You can only trust a machine to a certain point before you want some type of steward, or assisted or supervised learning, involved. So maybe one out of 10, one out of 20 rules that are generated, you might want to have somebody look at. There are ways to do the equivalent of supervised learning without actually paying the cost of the supervisor. Let's suppose you've written a thousand rules for your system that are five years old. We come in with our capability, analyze the same data, and generate rules ourselves. We compare the two, and there's absolutely going to be some exact matching, some overlap that validates one another. That gives you confidence that the machine learning did exactly what you did, and the likelihood that you guessed wrong and the machine learning guessed wrong in exactly the same way seems like a pretty small concern. So now you're really asking, why are they different? And you start to study those samples. What we learned is that we're able to generate between 60 and 70% of these rules, and any time we were different, we were right almost every single time; only about one out of a hundred times was it proven that the handwritten rule produced the better outcome. And of course, it's machine learning, so it learned and caught up the next time. That's the true power of this innovation: it learns from the data as well as the stewards, and it gives you confidence that you're not missing things, so you start to trust it. But you should never completely walk away; you should constantly do your periodic sampling.

>> And the secret sauce is math. I remember back in the mid-2000s, the 2006 timeframe. You mentioned auto-classification. That was a big problem with the federal rules of civil procedure: you had humans classifying, and humans don't scale, until you had all kinds of support vector machines and probabilistic latent semantic indexing, but you didn't have the compute power or the data corpus to really do it well. So it sounds like a combination of cheaper compute, a lot more data, and machine intelligence has really changed the game there. Is that a fair assumption?

>> That's absolutely fair. I think the other aspect to keep in mind is that it's an innovative technology that brings all that compute as close to the data as possible. One of the greatest expenses of doing data quality was of course the profiling concept, building up the statistics of what the data represents. In the most traditional sense, that data is pulled out of the database itself into a separate area, and once you're talking about terabytes or petabytes of data, it takes a long time to extract that much information from a database and then process through it all. Imagine bringing that profiling closer into the database, happening in the same space as the data; that cuts out something like 90% of the unnecessary processing. It also gives you the ability to do it incrementally. So you're not doing a full analysis each time: you have an expensive pass when you first look at a full database, and then over the course of a day, an hour, 15 minutes, you've only seen a small segment of change. So it feels more like a transactional analysis process.
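Here is a minimal sketch, in Python, of the incremental profiling idea Jim outlines: keep running statistics and fold in only the data that changed since the last pass, instead of re-scanning the whole table. The class and method names are hypothetical, and a real in-database implementation would push this logic down into the engine rather than run it client-side.

```python
# Minimal sketch: an incrementally maintained column profile. The first
# pass is expensive; subsequent passes only fold in the new delta.
from dataclasses import dataclass

@dataclass
class Profile:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, new_values):
        """Fold a new increment of data into the existing profile."""
        for v in new_values:
            self.count += 1
            self.total += v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)
        return self

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

if __name__ == "__main__":
    profile = Profile().update([10, 12, 11, 13])   # initial full-table pass
    profile.update([12, 14])                       # 15-minute delta only
    print(profile.count, profile.mean, profile.minimum, profile.maximum)
```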
>> Yeah, and again, we talked about the old days of big data, the Hadoop days, and what was profound was the idea of bringing five megabytes of code to a petabyte of data, but that didn't happen; we shoved it all into a central data lake. I'm really excited for Collibra. It sounds like you guys are really on the cutting edge and doing some really interesting things. I'll give you the last word, Jim, please bring us home.

>> Yeah, thanks Dave. So one of the really exciting things about our solution is that it's a combination of best-of-breed capabilities that are also integrated, so it tells the full and complete story that customers are looking for. You don't want them to have to worry about a complex integration, managing multiple vendors and the timing of their releases, et cetera. If you can find one offering where you don't have to say, well, that's good enough, where every single component is in fact the best of breed you can find, and it's integrated and managed as a service, you truly unlock the power of the data-literate individuals in your organization. And again, that goes back to our overall goal: how do we empower the hundreds of millions of people around the world who are just looking for an insightful decision? Today they feel completely locked out; it's as if they're looking for information before the internet, limited to whatever their local library has. If we can truly become somewhat like the internet of data, we make it possible for anyone to access it while we still govern it and secure it for privacy laws, and I think we have a chance to change the world for the better.

>> Great. Thank you so much, Jim. Great conversation, really appreciate your time and your insights.

>> Yeah, thank you, Dave. Appreciate it.

>> All right, and thank you for watching theCUBE's continuous coverage of Data Citizens '21. My name is Dave Vellante. Keep it right there for more great content. (upbeat music)

Published Date : Jun 17 2021
