Lisa Ehrlinger, Johannes Kepler University | MIT CDOIQ 2019


 

>> From Cambridge, Massachusetts, it's theCUBE, covering the MIT Chief Data Officer and Information Quality Symposium 2019. Brought to you by SiliconANGLE Media.
>> Hi, everybody, welcome back to Cambridge, Massachusetts. This is theCUBE, the leader in tech coverage. I'm Dave Vellante with my cohost, Paul Gillin, and we're here covering the MIT Chief Data Officer and Information Quality Conference, #MITCDOIQ. Lisa Ehrlinger is here; she's a Senior Researcher at Johannes Kepler University in Linz, Austria, and the Software Competence Center in Hagenberg. Lisa, thanks for coming on theCUBE, great to see you.
>> Thanks for having me, it's great to be here.
>> You're welcome. So on Friday you're going to lay out the results of your study of data quality tools. It covers the long tail of tools, some of the ones that may not have made the Gartner Magic Quadrant and other studies. Talk about the study and why it was initiated.
>> Okay, so the main motivation for this study was actually a very practical one, because at our department at Johannes Kepler University in Linz we have many projects with companies from different domains, like the steel industry, the financial sector, and especially the automotive industry. We have more than 20 years of experience with these companies in this department, and what kept recurring was the fact that we spent the majority of time in big data projects on data quality measurement and improvement tasks. So at some point we thought, okay, what possibilities are there to automate these tasks, and what tools are out there on the market to automate them? That was the motivation for looking at those tools. Also, companies ask us, "Do you have any suggestions? Which tool performs best in such-and-such domain?" And I think this study answers some questions that have not been answered so far at this level of detail. For example, the Gartner Magic Quadrant for Data Quality Tools is pretty interesting, but it's very high-level and focuses on the big global vendors, and it does not look at the specific measurement functionalities.
>> Yeah, you have to have some certain number of customers or revenue to get into the Magic Quadrant, so there's a long tail that they don't cover. But talk a little bit more about the methodology. Did you get hands-on, or was it more investigating what the capabilities of the tools were and talking to customers? How did you come to your conclusions?
>> We actually approached this from a very scientific side. We conducted a systematic search of which tools are out there on the market; not only commercial tools but also open-source tools were included. And I think this gives a really nice digest of the market from different perspectives, because we also include some tools that have not been investigated by Gartner, for example MobyDQ or Apache Griffin, which has really nice monitoring capabilities but of course lacks some other features of the comprehensive tools.
>> So was the goal of the methodology largely to capture a feature-function analysis, so you could compare, in binary terms, did a tool have a capability or not, and how robust it is? And to develop a common taxonomy across all these tools, is that what you did?
>> So we came up with a very detailed requirements catalog, which is divided into three fields. The first focuses on data profiling, to get a first insight into data quality. The second is data quality management in terms of dimensions, metrics, and rules. And the third part is dedicated to data quality monitoring over time. For all three categories, we came up with different case studies on a test database, and we checked: does this tool support this feature, yes, no, or partially? And when partially, to what extent? Especially in the partial assessments, we went into a lot of detail in our survey, which is already available online on arXiv. So the preliminary results are already online.
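As a rough illustration of the data profiling category described above, which means computing per-column summary statistics that give that "first insight" into data quality, here is a minimal sketch in Python with pandas. It is not code from the survey; the profile helper, the column names, and the sample data are all hypothetical.

import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of summary statistics per column of `df`."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_rate": s.isna().mean(),        # share of missing values
            "distinct": s.nunique(dropna=True),  # cardinality
            "min": s.min() if pd.api.types.is_numeric_dtype(s) else None,
            "max": s.max() if pd.api.types.is_numeric_dtype(s) else None,
        })
    return pd.DataFrame(rows)

# Hypothetical sample data, only to make the sketch runnable.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, None],
    "country": ["AT", "AT", "DE", None, "FR"],
})
print(profile(df))

A real profiling tool computes many more statistics (value patterns, distributions, functional dependencies), but the shape is the same: one summary row per column.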
>> How do you find it? Where is it available?
>> On arXiv.
>> arXiv?
>> Yes.
>> What's the URL, sorry, arXiv.com, or .org, or--
>> arXiv.org, yeah.
>> arXiv.org.
>> But actually there is an ID, which I don't have with me right now, but I can send it to you afterwards, yeah.
>> Yeah, maybe you can post that with the show notes.
>> We can post it afterwards.
>> I was amazed, you tested 667 tools. Now, I would've expected that there would be 30 or 40. Where are all of these? What do all of these long-tail tools do? Are they specialized by industry or by function?
>> Oh, sorry, I think we got some confusion here, because we identified 667 tools out there on the market, but we narrowed this down, because, as you said, it's quite impossible to assess all those tools.
>> But the question still stands: what are these very small, niche tools? What do they do?
>> Most of them are domain-specific, and I think this really highlights the very early, basic definition of data quality as "fitness for use." We can pretty much see it here: we excluded the majority of these tools just because they assess one specific kind of data, and we really wanted to find tools that are generally applicable to different kinds of data, structured data, unstructured data, and so on. Most of the excluded tools came about because someone wanted to assess the quality of, I don't know, geological data or something like that.
>> To what extent did you consider non-technical factors? Was there pricing, complexity of downloading, whether a free version is available? Did you ignore those and just focus on the feature functions, or did they play a role?
>> Basically the focus was on the feature functions, but of course we had to contact customer support. Especially with the commercial tools, we had to ask them to provide us with trial licenses, and there we received very different feedback from the companies. I think the most comprehensive study here is definitely the Gartner Magic Quadrant for Data Quality Tools, because they give a broad assessment, but what we also highlight in our study are companies with very open support that are willing to help you. For example, with Informatica Data Quality we had really close interaction in terms of support, trial licenses, and also specific functionality. Also Experian; our contact from Experian in France was really helpful here. Other companies, like IBM, focus on big customers, and there we were not able to assess the tools, for example.
>> Okay, but the other difference from the Magic Quadrant is that you guys actually used the tools, played with them, and experienced the customer experience firsthand.
>> Exactly, yeah.
>> Did you talk to customers as well? Or, because you were the customer, you had that experience.
>> Yes, I was the customer, but I was also happy to attend a data quality event in Vienna, and there I met some other customers who had experience with individual tools. Not, of course, with the wide range we observed, but it was interesting to get feedback on individual tools and verify our results, and they matched pretty well.
>> How large was the team that ran the study?
>> Five people.
>> Five people, and how long did it take you from start to finish?
>> Actually, we performed the assessment for roughly one year. And I think that's a pretty long time, especially when you see how quickly the market responds, especially in the open-source field. But nevertheless, you need to make a cut at some point, and I think it's a very recent study now, and there is also the idea to publish the preliminary results now, and we are happy with that.
>> Were there any surprises in the results?
>> One of the surprises was that we think there is definitely more potential for automation, but not only for automation. I really enjoyed the keynote this morning saying that we need more automation, but at the same time we think there is also demand for more declaration. We observed some tools that say, yes, we apply machine learning, and then you look into their documentation and find no information about which algorithm, which parameters, which thresholds. I think this matters, especially if you want to assess data quality: you really need to know which algorithm is used and how it is tuned. And the user, who in most cases will be a person with a technical background, like a chief data officer, really needs the possibility to tune these algorithms to get reliable results and to know what's going on and why, for example which records are selected.
>> So now what? You're presenting the results, right? You're obviously here at this conference and other conferences, and it's been what, a year, right?
>> Yes.
>> So what's the next wave? What's next for you?
>> We're currently working on a project called Knowledge Graph for Data Quality Assessment, which should tackle two problems at once. The first is to come up with a semantic representation of the data landscape in your company, not only the data landscape itself in terms of gathered metadata, but also automatically annotating this data schema with data profiles. What we've seen in the tools is that there are a lot of capabilities for data profiling, but this is usually left to the user ad hoc. Here, we store the profiles centrally and allow the user to continuously verify whether newly incoming data adheres to the standard data profile. I think this is definitely one step toward more automation. And the best thing about this approach would be to overcome the very arduous process of coming up with all the single rules within a team: instead, you present the data profile to the people involved in the data quality project, and they can verify it and only update and refine it, but they have an automated basis that is presented to them.
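To illustrate the monitoring idea of verifying newly incoming data against a stored standard profile, here is a minimal sketch in Python with pandas. It is not the implementation of the project described above; the baseline format and the explicit max_null_rate_increase threshold are invented, the latter echoing the point that parameters should be declared and user-tunable rather than hidden.

import pandas as pd

def check_against_baseline(df: pd.DataFrame, baseline: dict,
                           max_null_rate_increase: float = 0.05) -> list:
    """Flag columns whose null rate drifts above the stored baseline profile.

    `baseline` maps column name -> expected null rate; the threshold is an
    explicit, user-tunable parameter rather than a hidden heuristic.
    """
    violations = []
    for col, expected in baseline.items():
        if col not in df.columns:
            violations.append(f"{col}: column missing from incoming batch")
            continue
        observed = df[col].isna().mean()
        if observed > expected + max_null_rate_increase:
            violations.append(
                f"{col}: null rate {observed:.2f} exceeds baseline "
                f"{expected:.2f} + {max_null_rate_increase:.2f}"
            )
    return violations

# Hypothetical baseline profile and incoming batch.
baseline = {"customer_id": 0.0, "country": 0.1}
batch = pd.DataFrame({"customer_id": [1, None, None],
                      "country": ["AT", "DE", None]})
for v in check_against_baseline(batch, baseline):
    print("violation:", v)

The check is deliberately simple (null rates only); the same pattern generalizes to any statistic stored in the baseline profile.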
>> Oh, great, same team or new team?
>> Same team, yeah.
>> Oh, great.
>> We're continuing with it.
>> Well, Lisa, thanks so much for coming on theCUBE and sharing the results of your study. Good luck with your talk on Friday.
>> Thank you very much, thank you.
>> All right, and thank you for watching. Keep it right there, everybody; we'll be back with our next guest right after this short break. From MIT CDOIQ, you're watching theCUBE. (upbeat music)

Published Date: Jul 31, 2019

