Jim Cushman: Product Strategy & Vision | Data Citizens '21


 

>> Hi everyone, and welcome to Data Citizens. Thank you for making the time to join me and the over 5,000 data citizens like you who are looking to become united by data. My name is Jim Cushman. I serve as the chief product officer at Collibra, and I have the benefit of sharing with you the product vision and strategy of Collibra.

There are several sections to this presentation, and I can't wait to share them with you. The first is a story of how we're taking a business user and making it possible for him or her to find data, use data, and gain insight from that data, without relying on anyone in the organization to write code or do the work for them. Next, I'll share with you how Collibra will make it possible to manage metadata at scale, into the billions of assets, and again load this into our software without writing any code. Third, I will demonstrate the integration we have already achieved with our newest product release: data quality powered by machine learning. Finally, you're going to hear about how Collibra has become the most universally available solution in the market.

Now, we all know that data is a critical asset that can make or break an organization. Yet organizations struggle to capture the power of their data, and many remain afraid of how their data could be misused or abused. We also observe that the understanding of, and access to, data remains in the hands of just a small few: three out of every four companies continue to struggle to use data to drive meaningful insights. All forward-looking companies are looking for an advantage, a differentiator that will set them apart from their peers and competitors. What if you could improve your organization's productivity by just 5%? Even a modest 5% productivity improvement, compounded over a five-year period, will make your organization roughly 28% more productive (a quick check of that arithmetic follows this passage). This will leave you with an overwhelming advantage over your competition, and uniting your data-literate employees with data is the key to your success and, dare I say, survival. To unlock this potential for increased productivity and a huge competitive advantage, organizations need to enable self-service access to data for every data-literate knowledge worker.

Our ultimate goal at Collibra has always been to enable this self-service for our customers: to empower every knowledge worker to access the data they need when they need it, but with the peace of mind that your data is governed and secure. Just imagine if you had a single integrated solution that could deliver a seamless, governed, no-code user experience, delivering the right data to the right person at the right time just as simply as ordering a pair of shoes online. That would be quite a magic trick, and one that would place you and your organization on the fast track for success.

Let me introduce you to our character here: Cliff. Cliff is a business analyst. He doesn't write code; he doesn't know Julia or R or SQL, but he is data literate. When Cliff is presented with data of high quality, and can actually find that data of high quality, he knows what to do with it. Well, we're going to expose Cliff to our software and see how he can find the best data to solve his problem of the day, which is customer churn. Cliff is going to go out and find this information, bring it back, and analyze it in his favorite BI reporting tool: Tableau, of course, though it could be Looker, Power BI, or any other of your favorites.
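As an aside, the "roughly 28%" figure quoted above is plain compound growth, nothing product-specific; a two-line check (ordinary Python, unrelated to any Collibra tooling) confirms it:

```python
# A 5% annual productivity gain, compounded over five years.
gain = 1.05 ** 5 - 1
print(f"{gain:.1%}")  # -> 27.6%, i.e. the "roughly 28%" quoted above
```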
Let's go ahead and get started and see how Cliff can do this without any help from anyone in the organization. Cliff is going to log into Collibra and, being a business user, the first thing he's going to do is look for a business term. He looks for customer churn rate. Now, when he brings back churn rate, it shows him the definition of churn rate and various other things that have been attributed to it, such as data domains like product, customer, and order. Now Cliff says, okay, customer is really important, so let me click on that and see what makes up the customer definition. Cliff will scroll through customer and find the various data concepts and attributes that make up the definition of customer, and Cliff knows that customer identifier is a really important aspect of this: it helps link all the data together. So Cliff is going to want to make sure that whatever source he brings in actually has customer identifier in it, and that it's of high quality. Cliff is also interested in things such as email address, credit activity, and credit card.

But now he's going to say, okay, what data sets actually have customer as a data domain in them, and, while I'm at it, what else has product and order information? That's again relevant to the concept of customer churn. Now, as he goes on, he can actually filter down, because there are a lot of different results that could potentially come back. And again, customer identifier was very important to Cliff, so Cliff further filters on customer identifier, and he does the same on customer churn rate as well. This results in two different data sets available to Cliff for selection. Which one to use? Well, he's first presented with some data quality information: you can see the customer analytics data set has a data quality score of 76, and the sales data enrichment data set has a data quality score of 68. That's something he can see right on the front of the box, but let's dig in deeper, because the contents really matter.

So we see again the score of 76, but we also have the chance to find out that this data set is certified: it has a check mark, so he knows that someone he trusts has actually certified it. You'll see there are 91 columns that make up this data set, and rather than sifting through all of that information, Cliff is going to say, well, okay, customer identifier is very important to me; let me search through and see if I can find its data quality score. Very quickly, using a fuzzy search, he finds it and sees, wow, that's a really high data quality score of 98. And what's the alternative? Well, that data set only has 68 overall, but how about its customer identifier? Quickly he discovers that its data quality is only 70. So, all things being equal, customer analytics is the better data set for what Cliff needs to achieve (a small sketch of this comparison follows this passage).

But now he wants to ask: other people have used this; what have they had to say about it? You can see there are various reviews from peers of his in the organization that have given it five stars, which encourages Cliff's confidence that this is a great data set to use. Now Cliff wants to look in a little more detail before he finally commits to using this data set.
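To make the selection logic Cliff just applied concrete, here is a minimal illustrative sketch of ranking candidate data sets by the quality of the column that matters most, breaking ties on the overall score. The scores come from the demo; the data structures and the function are assumptions for illustration, not Collibra's actual API.

```python
# Candidate data sets with overall and per-column quality scores (0-100),
# as shown in the demo. The structure below is invented for illustration.
candidates = {
    "customer_analytics":    {"overall": 76, "columns": {"customer_identifier": 98}},
    "sales_data_enrichment": {"overall": 68, "columns": {"customer_identifier": 70}},
}

def pick_dataset(candidates: dict, key_column: str) -> str:
    """Prefer the data set whose key column scores highest, then overall score."""
    return max(
        candidates,
        key=lambda name: (
            candidates[name]["columns"].get(key_column, 0),
            candidates[name]["overall"],
        ),
    )

print(pick_dataset(candidates, "customer_identifier"))  # -> customer_analytics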
Cliff also has the opportunity to look at the data set in a broader context. What other things can I learn about customer analytics: what else is it related to? Who else uses it? Where did it come from, where does it go, and what actually happens to it? Within our graph of information, we're able to show you a diagram. You can see that customer analytics actually comes from the CRM cloud system, and from there you inherit some wonderful information. We know exactly what CRM cloud is about as an overall system; it's related to other logical models, and here you're actually seeing that it's related to a policy about PII, or personally identifiable information. This gives Cliff almost immediate knowledge that there's going to be some customer information, this PII information, that he's not going to be able to see given his user role in the organization. But Cliff says, hey, that's okay; I don't actually need to see somebody's name and social security number to do my work. I can work with other information in the data file that will help me understand why our customers are churning and what I can actually do about it. If we dig in deeper, we can see which personally identifiable information could actually cause issues.

As we scroll down, take a little bit of focus on what you'll see here, customer phone, because we'll show that to you a little bit later. These are the various fields that, once the data is fulfilled and delivered to him, Cliff will see masked and/or redacted from his use.

Now Cliff might drill in deeper and see more information, and he says, you know what, another piece that's important to my analysis is something called "is churned." This is basically a flag indicating whether a customer has actually churned; an important one, of course, because that's the analysis he's performing. Cliff sees that its score is a mere 65. That's not exactly a great data quality score, but Cliff is kind of in a hurry: his boss has come back and said, we need this information so we can take action. So he's not going to wait around for some long data quality project before he proceeds; he's going to keep working at the speed of thought. He's going to create a suggestion, an issue, and submit it as a work-queue item that informs the people responsible for the quality of this data that there's an opportunity for improvement to a data set that is highly reviewed but maybe has room to get better. As Cliff types in the explanation he'll pass along, we can also see that the data quality is made up of multiple components, such as integrity, duplication, accuracy, consistency, and conformity (an illustrative roll-up of such dimensions follows this passage). We can submit this issue and pass it through, and it will go to somebody else who can actually work on it; we'll show that to you a little bit later.

But back to Cliff. Cliff says, okay, I'd like to work with this data set, so he adds it to his data basket. And just as if he were shopping online, Cliff wants the ability to click once and be done with it. Now, it is data, and there's some sensitivity about it; again, there's an owner of this data you need to get permission from. So Cliff is going to provide information to the owner to say: here's why I need this data, how long I need it for, starting on a certain date and ending on a certain date, and ultimately what purpose I'm going to have for it.
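The demo names integrity, duplication, accuracy, consistency, and conformity as the components behind a quality score, but not how they combine. As one plausible illustration only, with invented scores and weights rather than Collibra's actual formula, a roll-up might be a weighted average:

```python
# Hypothetical per-dimension scores (0-100) for the "is churned" column;
# both the scores and the weights are invented for illustration.
dimensions = {"integrity": 70, "duplication": 60, "accuracy": 63,
              "consistency": 62, "conformity": 68}
weights    = {"integrity": 0.25, "duplication": 0.15, "accuracy": 0.25,
              "consistency": 0.15, "conformity": 0.20}

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights form a convex combination
composite = sum(score * weights[dim] for dim, score in dimensions.items())
print(round(composite))  # -> 65, the kind of single roll-up score Cliff saw
```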
Now, there's another choice Cliff gets to make: how do you want this data delivered to you? You'll see down below there are three options: one is borrow, another is lease, and another is buy. What does that mean? Well, borrow is the idea of: I don't want the data that's currently in this CRM cloud database moved anywhere; I don't want it persisted anywhere else. I just want to borrow it very short-term to use in my Tableau report and then, poof, be gone, because I don't want to create any problems in my organization. Lease is a situation where you actually do need to take possession of the data, but only for a time-boxed period; you don't need it for an indefinite amount of time. And ultimately, buy is your ability to take possession of the data and have it in perpetuity (a small sketch of these three modes follows this passage). We're going to go forward with our borrow use case, and Cliff is going to submit this, and all the fun starts there.

So Cliff has actually submitted the order, and the owner, Joanna, receives the request. Joanna opens up her task queue, sees there's work to perform, and says: ah, okay, here's work for me to do. Now, Joanna has the ability to automate this using the workflow built into Collibra, but in this situation she's going to review it manually. Cliff wants to borrow a specific data set for a certain period of time, and he wants to use it in a Tableau context. So she reviews it, makes an approval, and submits it. This in turn flips back to Cliff, who asks: okay, what obligations did I just take on in order to work with this data? He reviews each of the data sharing agreements that you, as an organization, would set up, asking: what are my restrictions for using this data set?

As Cliff accepts these notices, he triggers the process of what we call fulfillment, or a service broker. In this situation we're doing virtualized access for the borrow use case. Cliff selects Tableau as his preferred BI and reporting tool, and you can see the various other options available, from Power BI and Looker to Sisense and ThoughtSpot; others can be added over time. From there, Cliff will be alerted the minute this data is available to him. So now we're running out and doing a distributed query to get the information, and you see it returns a raw view. Now, what's really interesting is that the customer phone column has a bunch of X's in it. If you remember, that's PII, so it's actually being masked; Cliff can't see the raw data. Cliff also wants to look at it in a Tableau report, and he can see the visualization layer, but you also see an incorporation of something we call Collibra on the go. Not only do we bring the data to the report; we also tell you, the reader, how to interpret the report. It could be that someone else wants to use the very same report Cliff helped create, but they don't understand everything Cliff went through. Now they have the ability to get a full interpretation: what data was used, where did it come from, and how do I interpret the fields I see on this report? Really a clever combination of bringing the data to you and showing you how to use it.
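The borrow/lease/buy distinction described above is essentially an access-mode contract attached to the order. Here is a minimal hypothetical sketch of how such an order might be modeled; the type names and fields are assumptions for illustration, not Collibra's actual API.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class AccessMode(Enum):
    BORROW = "borrow"  # virtualized access; data is never persisted outside the source
    LEASE = "lease"    # take possession, but only for a time-boxed period
    BUY = "buy"        # take possession in perpetuity

@dataclass
class DataOrder:
    dataset: str
    requester: str
    purpose: str         # shown to the data owner during approval
    start: date
    end: Optional[date]  # an open-ended order really only makes sense for BUY
    mode: AccessMode
    bi_tool: str         # where a borrowed, virtualized view is surfaced

order = DataOrder(
    dataset="customer_analytics",
    requester="cliff",
    purpose="customer churn analysis",
    start=date(2021, 6, 1),
    end=date(2021, 6, 30),
    mode=AccessMode.BORROW,
    bi_tool="Tableau",
)
print(order.mode.value)  # -> borrow; the owner (Joanna) reviews and approves
```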
Cliff can also see this as a registered asset within Collibra. So the next shopper who comes through might, instead of shopping for the data set, actually shop for the report itself, and the report is connected with the data set he used. Now they have a full bill of materials to run a customer churn report and schedule it any time they want. We've turned Cliff into a creator of data assets, and this is where intelligence begets more intelligence; that's really what we call data intelligence.

So let's go back through that magic trick we just did with Cliff. Cliff went into the software not knowing whether the source of data he was looking for, covering customers, products, and sales, was even available to him. He went in, very quickly searched and found his data set, used facets to filter down to exactly what was available, compared and contrasted the options, observed that there wasn't enough data quality around a certain attribute that was important to him, created a suggestion for somebody to follow up on, put the data set into his shopping basket, checked out, and had it delivered to his front door. I mean, that's a bit of a magic trick, right? Cliff was successful in finding the data he wanted and having it delivered to him in his preferred mode, and he was able to look at it in Tableau.

All right, so let's talk about how we're going to make this vision a reality. Our first section here is about performance and scale, but it's also about codeless database registration: how did we get all that stuff into the data catalog and available for Cliff to find? Allow us to introduce you to what we call the asset lifecycle. Some of the largest organizations in the world might have upwards of a billion data assets: columns, tables, reports, APIs, algorithms, et cetera. These are very high volume, quite technical, and far more information than a business user like Cliff might want to engage with. Those very same large organizations may have upwards of, say, 20 to 25 million assets that are critical data sources and data assets, things they do need to highly curate and make available. Through that there's a bit of a distillation, a lifecycle of different things you might want to do along the way. So we're going to share with you how you can automatically register these sources, deal with these very large volumes at speed and at scale, and make them available with just the level of information you need to govern and protect them, while also making them available for opportunistic use cases such as the one we presented with Cliff.

As you recall, when Cliff was trying to pick his data set, he identified that the "is churned" attribute was of low quality, and he passed this over to Eliza, who is a data steward. She receives this work-queue item in a collaborative fashion (a minimal sketch of such a ticket follows this passage) and has to review the request; if you recall, this was the request to improve the data quality of "is churned." Now she needs to familiarize herself with what Cliff was observing during his shopping experience. So she digs in to look at the quality he was seeing, and sure enough, as she scrolls down to "is churned," she sees that it was a low 65 and now understands exactly what Cliff was referring to.
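The hand-offs in this story, from Cliff to Eliza and later from Eliza to John, are work-queue items carrying a data quality issue. A minimal hypothetical sketch of such a ticket (illustrative only, not Collibra's workflow engine) might be:

```python
from dataclasses import dataclass, field

@dataclass
class DQIssue:
    dataset: str
    column: str
    observed_score: int  # what the reporter saw in the catalog
    description: str
    reporter: str
    assignee: str
    history: list = field(default_factory=list)  # audit trail of hand-offs

    def reassign(self, new_assignee: str, note: str) -> None:
        """Route the ticket to another steward, keeping the trail."""
        self.history.append((self.assignee, new_assignee, note))
        self.assignee = new_assignee

issue = DQIssue("customer_analytics", "is_churned", 65,
                "Score too low for churn analysis",
                reporter="cliff", assignee="eliza")
issue.reassign("john", "Is there a better source for is_churned?")
print(issue.assignee, len(issue.history))  # -> john 1
```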
She says: aha, okay, I need to get help. I need to decide whether I have a data quality project to fix this data, or whether there's another data set in the organization that has better data for this. So she creates a work item that can go over to one of her colleagues who really focuses on data quality. She submits the request, and it goes to her colleague John, who is really familiar with data quality.

John receives the request from Eliza, and you'll see a task show up in his queue. He opens it up and finds that Eliza is asking whether there's another source out there that actually has good "is churned" data available. Now, he knows quite a bit about the quality of information in the organization's systems, so he goes into the data quality console and does a quick search for a data set he's familiar with, called customer product sales. He quickly scrolls down, finds the one that has actually been published, the one he was looking for, and opens it up to find more information about which columns are actually in there. He scrolls down to find that "is churned" is in fact one of the attributes, and it actually has active rules associated with it to manage its quality. So he says, let's look in more detail and find out: what is the quality of this data set? Oh, it's 86. That's a dramatic improvement over what we've seen before, and we can see it has trended quite nicely over time, day by day; its quality hasn't degraded. So he responds back to Eliza to say: this is the data set you want to bring in; it really will improve things. And you'll see that he refers to the refined database within the CRM cloud solution. Once he submits this, it goes back to Eliza, and she's able to continue her work.

Now, when Eliza picks this back up, she's able to very quickly go into the database registration process. She goes into CRM cloud and selects the community to which she wants to register this data set, the schemas community; CRM cloud is the system she wants to load it into, and refined is the database John told her to bring in. After a quick description, she's able to click register, and this triggers the automatic, codeless process of going out to the data source and bringing back its metadata. Now, metadata is great, but it's not the be-all and end-all; there are other values she really cares about. As she registers this data set and synchronizes the metadata, she's also asked: would you like to bring in quality information? And she says, yes, of course I want to enable the quality information from CRM refined. I also want to bring back lineage information to associate with this metadata, and I also want to select profiling and classification information. As she makes her selections, she can also say how often to synchronize: is this a daily, weekly, or monthly kind of update? That's part of the change data capture process; again, all automated, without requiring anyone to write code (an illustrative sketch of such a registration request follows this passage).

So she runs this process. After it loads in, she can open up this newly registered data set and see whether it has actually solved the problem Cliff set her on, which was improved data quality. Looking into the data quality for the "is churned" attribute shows her fantastic quality: it's at 100; it's exactly what she was looking for.
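Condensed into configuration, the registration step Eliza just walked through might look roughly like this. The field names and the `register_source` helper are hypothetical, invented to summarize the demo, not Collibra's actual API:

```python
# Hypothetical registration request mirroring the choices made in the demo.
registration = {
    "system": "CRM Cloud",
    "database": "refined",
    "community": "schemas",
    "description": "Refined CRM data with a high-quality is_churned flag",
    "sync": {
        "metadata": True,    # always pulled on registration
        "quality": True,     # bring DQ scores in alongside the metadata
        "lineage": True,     # associate lineage with the new assets
        "profiling": True,   # profiling and classification information
        "schedule": "daily", # or "weekly" / "monthly" (change data capture)
    },
}

def register_source(req: dict) -> None:
    """Stand-in for the codeless registration trigger shown in the demo."""
    print(f"Registering {req['system']}/{req['database']} "
          f"into community '{req['community']}' "
          f"(sync: {req['sync']['schedule']})")

register_source(registration)
```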
So she can suggest with confidence that this is done. But she did notice something she wants to tell John: there are a couple of data quality checks that seem to be missing from this data set. So again, in a collaborative fashion, she can pass that information along, for validity and completeness, to say: you know what, check for nulls and empties, and send that back. She submits this on to John, and John now has a work item in his task queue. But remember, she's also been working Cliff's task, and because she has added a much better source for the "is churned" information, she's going to update the task that was sent to her to notify Cliff that the work has been done and that there's a really good data set in there now; in fact, if you recall, it was 100% in terms of its data quality. This will make life a lot easier for Cliff the next time he runs his churn report analysis.

So let's talk about these audacious performance goals we have in mind. Today we already have really strong performance and amazing usability; our customers continue to tell us how great our usability is, but they keep asking for more. Well, we've decided to present something you can start to bank on. This is the performance you can expect from us, on the highly curated assets available for business users as well as on the technical and lineage assets that are more for developer use and warehouse-based work. You'll see that in Q1 or Q2 of this year we're making available 5 million curated assets. Now, you might be out there saying, hey, I'm already using the software and I've got over 20 million. That's fair; we do have customers well over 20 million assets under management. But we wanted to present this to you with zero conditions: no limitations, no "well, it depends," et cetera. This is what we can offer you without fail, and yes, it can go higher. We're also talking about the speed with which you can ingest the data: right now we're ingesting somewhere around 50,000 to 100,000 records per hour, and of course you've probably seen it go quite a bit faster, but that's the rate we are assuring. What's really impressive is that right now we can also help you manage 250 million technical assets, loading at a speed of 25 million per hour, and you can see how, over the next 18 months, about every two quarters, we show dramatic improvements, more than doubling these numbers. For most of them, leading up to the end of 2022, we're actually handling over a billion technical lineage assets and loading at a hundred million per hour. That sets the mark for the industry.

Earlier this year we announced a recent acquisition, OwlDQ. OwlDQ brought us machine-learning-based data quality, and we're now able to introduce Collibra Data Quality, the first integrated approach across OwlDQ and Collibra. We've got a demo to follow, and I'm really excited to share it with you, so let's get started. Eliza submitted a task for John to work on; remember, it was to add checks for null and for empty. John picks up this task very quickly, looks at the request, and says: ah, yes, we do have a quality-check gap when we look at "is churned."
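Before following John into the console: a null check and an empty check of the kind Eliza requested are conceptually simple. A rough pandas sketch of the two checks (illustrative only, not how Collibra Data Quality computes its scores) could be:

```python
import pandas as pd

# Toy sample standing in for the refined CRM data (invented for illustration).
col = pd.Series(["yes", "no", None, "", "no", "yes"], name="is_churned")

null_pass = 1 - col.isna().mean()                           # share of non-null rows
empty_pass = 1 - col.fillna("x").str.strip().eq("").mean()  # share of non-empty rows

print(f"null check:  {null_pass:.0%} of rows pass")   # -> 83%
print(f"empty check: {empty_pass:.0%} of rows pass")  # -> 83%
```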
So he jumps over to the data quality console and says: I need to create a new data quality test. John is able to go into the solution and set up quick, automated rules; he could also inherit rules from elsewhere. It starts with identifying the data source he needs to connect to, so he chooses the CRM refined data set that was most recently registered by Eliza. You'll see the same quality score of 86 for the data set, and you'll also see there are four rules already associated with it. Now, there are various checks John could establish here, but remember, this is a fairly simple request he received from Eliza. So he's going to go in, choose the actual field, "is churned," and from there pick the quick rule for an empty check, which immediately sets up that rule for him, and then the null check, equally fast. Once established, this analyzes all the data and sets up the baseline of data quality. This data, once captured, is periodically brought back to the catalog, so it's available not only to Eliza but also to Cliff the next time he shops in the environment.

As we look through the rules created through that very simple user experience, you can see the "is empty" and "is null" rules that were set up. These are various styles of rules that can be set up manually, generated through machine learning, or inherited. But the key is to track the creation of these rules and the metrics they generate, so the results can be brought back to the catalog and used in meaningful context by someone who's shopping. The confidence that this field is neither empty nor null, or at least that most rows aren't, now carries forward. And as you can see, those checks have been entered and are reporting a one hundred percent quality score for the null check. So, with confidence, John can respond back to Eliza and say: I've inserted them, they're up and running, and you're in good shape.

That was pretty amazing integration, right? Just four months after our acquisition, we've already brought that level of integration between Collibra Data Intelligence Cloud and Collibra Data Quality. And it doesn't stop there; we have our sights set really high. Early next year we're introducing a fully immersive experience where customers can work within Collibra and bring the data quality information all the way in, as well as manipulate the rules and generate the machine learning rules on top of it all; it will be a deeply immersive experience. We also have something really clever coming, which we call continuous data profiling, where we bring the power of data quality all the way into the database, so it's continuously running and always keeping that information fresh for you (a rough sketch of the idea follows this passage).

Now, I'd also like to share with you one of the reasons why we are the most universally available software solution in data intelligence. We've already announced that we're available on AWS and Google Cloud, and today we can announce that in Q3 we're going to be available on Microsoft Azure as well. And it's not just these three cloud providers; we've also become available on each of their marketplaces. So if you are buying our software, you can make that same purchase from their marketplace and achieve your financial objectives as well. We're very excited about this; these are very important partners for us.
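As a closing technical aside on the continuous data profiling idea mentioned above: in spirit, it means recomputing basic profile metrics close to the data on a schedule. A rough sketch with an invented profile function, and no claim about how Collibra implements it:

```python
import pandas as pd

def profile(col: pd.Series) -> dict:
    """Basic profile metrics of the kind a continuous profiler might refresh."""
    return {
        "rows": len(col),
        "null_rate": round(col.isna().mean(), 3),
        "distinct": col.nunique(dropna=True),
        "most_common": col.mode(dropna=True).iloc[0] if col.notna().any() else None,
    }

# In a continuous setup this would run inside or near the database on a schedule
# (e.g. hourly) and push results back to the catalog; here, one manual pass.
col = pd.Series(["yes", "no", "no", None, "yes", "no"], name="is_churned")
print(profile(col))  # {'rows': 6, 'null_rate': 0.167, 'distinct': 2, 'most_common': 'no'}
```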
Now, I'd also like to introduce our system integrators; without them, there's no way we could achieve our objective of growing so rapidly while handling the demand you customers have had. Accenture, Deloitte, Infosys, and others have been instrumental in making sure we can serve your needs when you need them. They've been a big part of our growth and will be a continued part of it as well. And finally, I'd like to point you to our product showcases, where we go into absolute detail on many of the topics I talked about today: data governance with Arco, data privacy with Sergio, data quality with Brian, and finally, catalog with Peter. Again, I'd like to thank you all for joining us, and we really look forward to hearing your feedback. Thank you.

Published: Jun 17, 2021
