Piotr Mierzejewski, IBM | Dataworks Summit EU 2018

>> Announcer: From Berlin, Germany, it's theCUBE covering Dataworks Summit Europe 2018 brought to you by Hortonworks. (upbeat music) >> Well hello, I'm James Kobielus and welcome to theCUBE. We are here at Dataworks Summit 2018, in Berlin, Germany. It's a great event, Hortonworks is the host, they made some great announcements. They've had partners doing the keynotes and the sessions, breakouts, and IBM is one of their big partners. Speaking of IBM, from IBM we have a program manager, Piotr, I'll get this right, Piotr Mierzejewski, your focus is on data science, machine learning, and Data Science Experience, which is one of the IBM products for working data scientists to build and to train models in team data science enterprise operational environments, so Piotr, welcome to theCUBE. I don't think we've had you before. >> Thank you. >> You're a program manager. I'd like you to discuss what you do for IBM, I'd like you to discuss Data Science Experience. I know that Hortonworks is a reseller of Data Science Experience, so I'd like you to discuss the partnership going forward and how you and Hortonworks are serving your customers, data scientists and others in those teams who are building and training and deploying machine learning and deep learning, AI, into operational applications. So Piotr, I give it to you now. >> Thank you. Thank you for inviting me here, very excited. This is a very loaded question, and I would like to begin, before I get actually to why the partnership makes sense, I would like to begin with two things. First, there is no machine learning without data. And second, machine learning is not easy. Especially, especially-- >> James: I never said it was! (Piotr laughs) >> Well there is this kind of perception, like you can have a data scientist working on their Mac, working on some machine learning algorithms and they can create a recommendation engine, let's say in two, three days' time. This is because of the explosion of open-source in that space. 
You have thousands of libraries, from Python, from R, from Scala, you have access to Spark. All these various open-source offerings that are enabling data scientists to actually do this wonderful work. However, when you start talking about bringing machine learning to the enterprise, this is not an easy thing to do. You have to think about governance, resiliency, the data access, actual model deployments, which are not trivial. When you have to expose this in a uniform fashion to actually various business units. Now all this has to actually work in a private cloud, public cloud environment, on a variety of hardware, a variety of different operating systems. Now that is not trivial. (laughs) Now when you deploy a model, as the data scientist is going to deploy the model, he needs to be able to actually explain how the model was created. He has to be able to explain what data was used. He needs to ensure-- >> Explicable AI, or explicable machine learning, yeah, that's a hot focus of our concern, of enterprises everywhere, especially in a world where governance and tracking and lineage, GDPR and so forth, are so hot. >> Yes, you've mentioned all the right things. Now, so given those two things, there's no ML without data, and ML is not easy, why the partnership between Hortonworks and IBM makes sense, well, you're looking at the number one industry leading big data platform from Hortonworks. 
Then, you look at DSX Local, which, I'm proud to say, I've been there since the first line of code, and I'm feeling very passionate about the product. The merger between the two, the ability to integrate them tightly together, gives your data scientists secure access to data, ability to leverage the Spark that runs inside a Hortonworks cluster, ability to actually work in a platform like DSX that doesn't limit you to just one kind of technology but allows you to work with multiple technologies, ability to actually work on not only-- >> When you say technologies here, you're referring to frameworks like TensorFlow, and-- >> Precisely. Very good, now that part I'm going to get into very shortly, (laughs) so please don't steal my thunder. >> James: Okay. >> Now, what I was saying is that not only are DSX and Hortonworks integrated to the point that you can actually manage your Hadoop clusters, Hadoop environments within DSX, you can actually work on your Python models and your analytics within DSX and then push it remotely to be executed where your data is. Now, why is this important? If you work with the data that's megabytes, gigabytes, maybe you know you can pull it in, but truly, what you want to do when you move to the terabytes and the petabytes of data, what happens is that you actually have to push the analytics to where your data resides, and leverage for example YARN, a resource manager, to distribute your workloads and actually train your models on your actual HDP cluster. That's one of the huge value propositions. Now, mind you, this is all done in a secure fashion, with ability to actually install DSX on the edge nodes of the HDP clusters. >> James: Hmm... >> As of HDP 2.6.4, DSX has been certified to actually work with HDP. Now, this partnership embarked, we embarked on this partnership about 10 months ago. Now, it often happens that there are announcements, but there is not much materializing after such an announcement. 
This is not true in case of DSX and HDP. We have had, just recently we have had a release of the DSX 1.2 which I'm super excited about. Now, let's talk about those open-source toolings in the various platforms. Now, you don't want to force your data scientists to actually work with just one environment. Some of them might prefer to work on Spark, some of them like their RStudio, they're statisticians, they like R, others like Python, with Zeppelin, say Jupyter Notebook. Now, how about TensorFlow? What are you going to do when actually, you know, you have to do the deep learning workloads, when you want to use neural nets? Well, DSX does support ability to actually bring in GPU nodes and do the TensorFlow training. As a sidecar approach, you can append the node, you can scale the platform horizontally and vertically, and train your deep learning workloads, and actually remove the sidecar out. So you should put it towards the cluster and remove it at will. Now, DSX also actually not only satisfies the needs of your programmer data scientists, that actually code in Python and Scala or R, but actually allows your business analysts to work and create models in a visual fashion. As of DSX 1.2, you can actually, we have embedded, integrated, SPSS Modeler, redesigned, rebranded, this is an amazing technology from IBM that's been on for a while, very well established, but now with the new interface, embedded inside the DSX platform, allows your business analysts to actually train and create the model in a visual fashion and, what is beautiful-- >> Business analysts, not traditional data scientists. >> Not traditional data scientists. >> That sounds equivalent to how IBM, a few years back, was able to bring more of a visual experience to SPSS proper to enable the business analysts of the world to build and do data-mining and so forth with structured data. Go ahead, I don't want to steal your thunder here. >> No, no, precisely. 
(laughs) >> But I see it's the same phenomenon, you bring the same capability to greatly expand the range of data professionals who can do, in this case, do machine learning hopefully as well as professional, dedicated data scientists. >> Certainly, now what we have to also understand is that data science is actually a team sport. It involves various stakeholders from the organization. From the executive, that actually gives you the business use case, to your data engineers that actually understand where your data is and can grant the access-- >> James: They manage the Hadoop clusters, many of them, yeah. >> Precisely. So they manage the Hadoop clusters, they actually manage your relational databases, because we have to realize that not all the data is in the data lakes yet, you have legacy systems, which DSX allows you to actually connect to and integrate to get data from. It also allows you to actually consume data from streaming sources, so if you actually have a Kafka message hub and actually were streaming data from your applications or IoT devices, you can actually integrate all those various data sources and federate them within the DSX to use for training machine learning models. Now, this is all around predictive analytics. But what if I tell you that right now with the DSX you can actually do prescriptive analytics as well? With the 1.2, again I'm going to be coming back to this 1.2 DSX with the most recent release we have actually added decision optimization, an industry-leading solution from IBM-- >> Prescriptive analytics, gotcha-- >> Yes, for prescriptive analysis. 
So now if you have warehouses, or you have a fleet of trucks, or you want to optimize the flow in let's say, a utility company, whether it be for power or could it be for, let's say for water, you can actually create and train prescriptive models within DSX and deploy them the same fashion as you will deploy and manage your SPSS streams as well as the machine learning models from Spark, from Python, so with XGBoost, TensorFlow, Keras, all those various aspects. >> James: Mmmhmm. >> Now what's going to get really exciting in the next two months, DSX will actually bring in natural language processing and text analysis and sentiment analysis via WEX. So Watson Explorer, it's another offering from IBM... >> James: It's called, what is the name of it? >> Watson Explorer. >> Oh Watson Explorer, yes. >> Watson Explorer, yes. >> So now you're going to have this collaborative message platform, extendable! Extendable collaborative platform that can actually install and run in your data centers without the need to access internet. That's actually critical. Yes, we can deploy on AWS. Yes, we can deploy on Azure. On Google Cloud. Definitely we can deploy in SoftLayer and we're very good at that, however in the majority of cases we find that the customers have challenges for bringing the data out to the cloud environments. Hence, with DSX, we designed it to actually deploy and run and scale everywhere. Now, how we have done it, we've embraced open source. This was a huge shift within IBM to realize that yes we do have 350,000 employees, yes we could develop container technologies, but why? Why not embrace what is actually industry standards with the Docker and equivalent as they became industry standards? 
Bring in RStudio, the Jupyter, the Zeppelin Notebooks, bring in the ability for a data scientist to choose the environments they want to work with and actually extend them and make the deployments of web services, applications, the models, and those are actually full releases, I'm not only talking about the model, I'm talking about the scripts that can go with that, the ability to actually pull the data in and allow the models to be re-trained, evaluated and actually re-deployed without taking them down. Now that's what actually becomes, that's what is the true differentiator when it comes to DSX, and all done in either your public or private cloud environments. >> So that's coming in the next version of DSX? >> Outside of DSX-- >> James: We're almost out of time, so-- >> Oh, I'm so sorry! >> No, no, no. It's my job as the host to let you know that. >> Of course. (laughs) >> So if you could summarize where DSX is going in 30 seconds or less as a product, the next version is, what is it? >> It's going to be the 1.2.1. >> James: Okay. >> 1.2.1 and we're expecting to release at the end of June. What's going to be unique in the 1.2.1 is infusing the text and sentiment analysis, so natural language processing with predictive and prescriptive analysis for both developers and your business analysts. >> James: Yes. >> So essentially a platform not only for your data scientist but pretty much every single persona inside the organization. >> Including your marketing professionals who are baking sentiment analysis into what they do. Thank you very much. This has been Piotr Mierzejewski of IBM. He's a Program Manager for DSX and for ML, AI, and data science solutions and of course a strong partnership is with Hortonworks. We're here at Dataworks Summit in Berlin. We've had two excellent days of conversations with industry experts including Piotr. We want to thank everyone, we want to thank the host of this event, Hortonworks for having us here. 
We want to thank all of our guests, all these experts, for sharing their time out of their busy schedules. We want to thank everybody at this event for all the fascinating conversations, the breakouts have been great, the whole buzz here is exciting. GDPR's coming down and everybody's gearing up and getting ready for that, but everybody's also focused on innovative and disruptive uses of AI and machine learning in business, and using tools like DSX. I'm James Kobielus for the entire CUBE team, SiliconANGLE Media, wishing you all, wherever you are, whenever you watch this, have a good day and thank you for watching theCUBE. (upbeat music)
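
Piotr's point about pushing the analytics to where the data resides, rather than pulling terabytes to the analyst, can be illustrated with a small sketch. It is plain Python under stated assumptions and does not use DSX, YARN, or Spark APIs: the names `PartialStats`, `local_summary`, `merge`, and `global_mean` are hypothetical, and each in-memory list simply stands in for a data partition that would live on a cluster node.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class PartialStats:
    """A tiny summary that is cheap to send over the network (hypothetical type)."""
    count: int
    total: float


def local_summary(partition: Sequence[float]) -> PartialStats:
    """Meant to run on the node that holds the partition; raw rows never leave it."""
    return PartialStats(count=len(partition), total=float(sum(partition)))


def merge(summaries: Iterable[PartialStats]) -> PartialStats:
    """Runs on the coordinator and combines the small per-partition summaries."""
    count = 0
    total = 0.0
    for s in summaries:
        count += s.count
        total += s.total
    return PartialStats(count=count, total=total)


def global_mean(partitions: List[Sequence[float]]) -> float:
    """Mean over all partitions, computed from summaries only."""
    merged = merge(local_summary(p) for p in partitions)
    if merged.count == 0:
        raise ValueError("no rows in any partition")
    return merged.total / merged.count


# Three lists stand in for three partitions stored on different nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = global_mean(partitions)
```

Only two numbers per partition cross the network here, which is the reason the approach scales where copying the raw data would not. A real resource manager would also handle scheduling and failures, which this sketch leaves out.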

Published Date : Apr 19 2018

ENTITIES

Entity	Category	Confidence
Piotr Mierzejewski	PERSON	0.99+
James Kobielus	PERSON	0.99+
James	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Piotr	PERSON	0.99+
Hortonworks	ORGANIZATION	0.99+
30 seconds	QUANTITY	0.99+
Berlin	LOCATION	0.99+
AWS	ORGANIZATION	0.99+
Python	TITLE	0.99+
Spark	TITLE	0.99+
two	QUANTITY	0.99+
First	QUANTITY	0.99+
Scala	TITLE	0.99+
Berlin, Germany	LOCATION	0.99+
350,000 employees	QUANTITY	0.99+
DSX	ORGANIZATION	0.99+
Mac	COMMERCIAL_ITEM	0.99+
two things	QUANTITY	0.99+
RStudio	TITLE	0.99+
DSX	TITLE	0.99+
DSX 1.2	TITLE	0.98+
both developers	QUANTITY	0.98+
second	QUANTITY	0.98+
GDPR	TITLE	0.98+
Watson Explorer	TITLE	0.98+
Dataworks Summit 2018	EVENT	0.98+
first line	QUANTITY	0.98+
Dataworks Summit Europe 2018	EVENT	0.98+
SiliconANGLE Media	ORGANIZATION	0.97+
end of June	DATE	0.97+
TensorFlow	TITLE	0.97+
thousands of libraries	QUANTITY	0.96+
R	TITLE	0.96+
Jupyter	ORGANIZATION	0.96+
1.2.1	OTHER	0.96+
two excellent days	QUANTITY	0.95+
Dataworks Summit	EVENT	0.94+
Dataworks Summit EU 2018	EVENT	0.94+
SPSS	TITLE	0.94+
one	QUANTITY	0.94+
Azure	TITLE	0.92+
one kind	QUANTITY	0.92+
theCUBE	ORGANIZATION	0.92+
HDP	ORGANIZATION	0.91+

Mandy Chessell, IBM | Dataworks Summit EU 2018

>> Announcer: From Berlin, Germany, it's theCUBE covering Dataworks Summit Europe 2018. Brought to you by Hortonworks. (electronic music) >> Well, hello, welcome to theCUBE. I'm James Kobielus. I'm the lead analyst for big data analytics within the Wikibon team of SiliconANGLE Media. I'm hosting theCUBE this week at Dataworks Summit 2018 in Berlin, Germany. It's been an excellent event. Hortonworks, the host, had... We've completed two days of keynotes. They made an announcement of the Data Steward Studio as the latest of their offerings and demonstrated it this morning, to address GDPR compliance, which of course is hot and heavy and is coming down on enterprises both in the EU and around the world including in the U.S., and the May 25th deadline is fast approaching. One of Hortonworks' prime partners is IBM. And today on this CUBE segment we have Mandy Chessell. Mandy is a distinguished engineer at IBM who did an excellent keynote yesterday all about metadata and metadata management. Mandy, great to have you. >> Hi and thank you. >> So I wonder if you can just reprise or summarize the main takeaways from your keynote yesterday on metadata and its role in GDPR compliance, so forth, and the broader strategies that enterprise customers have regarding managing their data in this new multi-cloud world where Hadoop and open source platforms are critically important for storing and processing data. So Mandy go ahead. >> So, metadata's not new. I mean it's basically information about data. And a lot of companies are trying to build a data catalog which is not a catalog of, you know, actually containing their data, it's a catalog that describes their data. >> James: Is it different from an index or a glossary? How's the catalog different from-- >> Yeah, so catalog actually includes both. So it is a list of all the data sets plus links to glossary definitions of what those data items mean within the data sets, plus information about the lineage of the data. 
It includes information about who's using it, what they're using it for, how it should be governed. >> James: It's like a governance repository. >> So governance is part of it. So the governance part is really saying, "This is how you're allowed to use it, this is how the data's classified, these are the automated actions that are going to happen on the data as it's used within the operational environment." >> James: Yeah. >> So there's that aspect to it, but there is the collaboration side. Hey I've been using this data set, it's great. Or, actually this data set is full of errors, we can't use it. So you've got feedback to data set owners as well as exchange and collaboration between data scientists working with the data. So it's really, it is a central resource for an organization that has a strong data strategy, is interested in becoming a data-driven organization as such, so, you know, this becomes their major catalog over their data assets, and how they're using it. So when a regulator comes in and says, "Can you show me that you're managing personal data?" The data catalog will have the information about where personal data's located, what type of infrastructure it's sitting on, how it's being used by different services. So they can really show that they know what they're doing and then from that they can show how the processes are using the metadata in order to use the data appropriately day to day. >> So Apache Atlas, so it's basically a catalog, if I understand correctly at least for IBM and Hortonworks, it's Hadoop, it's Apache Atlas and Apache Atlas is essentially a metadata open source code base. >> Mandy: Yes, yes. >> So explain what Atlas is in this context. >> So yes, Atlas is a collection of code, but it supports a server, a graph-based metadata server. It also supports-- >> James: A graph-based >> Both: Metadata server >> Yes >> James: I'm sorry, so explain what you mean by graph-based in this context. 
>> Okay, so it runs using the JanusGraph graph repository. And this is very good for metadata 'cause if you think about what it is, it's connecting dots. It's basically saying this data set means this value and needs to be classified in this way and this-- >> James: Like a semantic knowledge graph >> It is, yes actually. And on top of it we impose a type system that describes the different types of things you need to control and manage in a data catalog, but the graph, the Atlas component gives you that graph-based, sorry, graph-based repository underneath, but on top we've built what we call the open metadata and governance libraries. They run inside Atlas so when you run Atlas you will have all the open metadata interfaces, but you can also take those libraries and connect them and load them actually into another vendor's product. And what they're doing is allowing metadata to be exchanged between repositories of different types. And this becomes incredibly important as an organization increases their maturity and their use of data because you can't just have knowledge about data in a single server, it just doesn't scale. You need to get that knowledge into every runtime environment, into the data tools that people are using across the organization. And so it needs to be distributed. >> Mandy I'm wondering, the whole notion of what you catalog in that repository, does it include, or does Apache Atlas support adding metadata relevant to data derivative assets like machine learning models-- >> Mandy: Absolutely. >> So forth. >> Mandy: Absolutely, so we have base types in the open metadata layer, but also it's a very flexible and extensible type system. So, if you've got a specialist machine learning model that needs additional information stored about it, that can easily be added to the runtime environment. And then it will be managed through the open metadata protocols as if it was part of the native type system. 
>> Because of course, as an analyst, one of my core areas is artificial intelligence and one of the hot themes in artificial, well there's a broad umbrella called AI safety. >> Mandy: Yeah. >> And one of the core subsets of that is something called explicable AI, being able to identify the lineage of a given algorithmic decision back to what machine learning models fed from what data. >> Mandy: Yeah. >> Through what action, like when let's say a self-driving vehicle hits a human being, for legal, you know, discovery, whatever. So what I'm getting at, what I'm working through to is the extent to which the Hortonworks, IBM big data catalog running Atlas can be a foundation for explicable AI either now or in the future. We see a lot of enterprise, me as an analyst at least, sees lots of enterprises that are exploring this topic, but it's not to the point where it's in production, explicable AI, but where clearly companies like IBM are exploring building a stack or an architecture for doing this kind of thing in a standardized way. What are your thoughts there? Is IBM working on bringing, say Atlas and the overall big data catalog into that kind of a use case? >> Yes, yeah, so if you think about what's required, you need to understand the data that was used to train the AI, what data's been fed to it since it was deployed because that's going to change its behavior, and then also a view of how that data's going to change in the future so you can start to anticipate issues that might arise from the model's changing behavior. And this is where the data catalog can actually associate and maintain information about the data that's being used with the algorithm. You can also associate the checking mechanism that's constantly monitoring the profile of the data so you can see where the data is changing over time, that will obviously affect the behavior of the machine learning model. 
So it's really about providing, not just information about the model itself, but also the data that's feeding it, how those characteristics are changing over time so that you know the model is continuing to work into the future. >> So tell us about the IBM, Hortonworks partnership on metadata and so forth. >> Mandy: Okay. >> How is that evolving? So, you know, your partnership is fairly tight. You clearly, you've got ODPI, you've got the work that you're doing related to the big data catalog. What can we expect to see in the near future in terms of initiatives building on all of that for governance of big data in the multi-cloud environment? >> Yeah so Hortonworks started the Apache Atlas project a couple of years ago with a number of their customers. And they built a base repository and a set of APIs that allow it to work in the Hadoop environment. We came along last year, formed our partnership. That partnership includes this open metadata and governance layer. So since then we worked with ING as well and ING bring the, sort of, user perspective, this is the organization's use of the data. And, so between the three of us we are basically transforming Apache Atlas from a Hadoop-focused metadata repository to an enterprise-focused metadata repository. Plus enabling other vendors to connect into the open metadata ecosystem. So we're standardizing types, standardizing format, the format of metadata, there's a protocol for exchanging metadata between repositories. And this is all coming from that three-way partnership where you've got a consuming organization, you've got a company who's used to building enterprise middleware, and you've got Hortonworks with their knowledge of open source development in their Hadoop environment. >> Quick out of left field, as you develop this architecture, clearly you're leveraging Hadoop HDFS for storage. 
Are you looking at, at least evaluating, maybe using blockchain for more distributed management of the metadata in these heterogeneous environments in the multi-cloud, or not? >> So Atlas itself does run on HDFS, but doesn't need to run on HDFS, it's got other storage environments so that we can run it outside of Hadoop. When it comes to blockchain, so blockchain is for sharing data between partners, small amounts of data that basically express agreements, so it's like a ledger. There are some aspects that we could use for metadata management. It's more that we actually need to put metadata management into blockchain. So the agreements and contracts that are stored in blockchain are only meaningful if we understand the data that's there, what its quality is, where it came from, what it means. And so actually there's a very interesting distributed metadata question that comes with the blockchain technology. And I think that's an important area of research. >> Well Mandy we're at the end of our time. Thank you very much. We could go on and on. You're a true expert and it's great to have you on theCUBE. >> Thank you for inviting me. >> So this is James Kobielus with Mandy Chessell of IBM. We are here this week in Berlin at Dataworks Summit 2018. It's a great event and we have some more interviews coming up so thank you very much for tuning in. (electronic music)
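
The graph-based catalog Mandy describes, with data sets, classifications, and lineage links between them, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not Apache Atlas code: the `Catalog` class, its method names, the `personal_data` label, and the sample assets are all hypothetical, and a plain dictionary stands in for the graph repository.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class Catalog:
    """Toy metadata catalog: assets are nodes, 'derived from' links are edges."""

    def __init__(self) -> None:
        self.asset_types: Dict[str, str] = {}
        self.classifications: Dict[str, Set[str]] = {}
        self.derived_from: Dict[str, Set[str]] = {}

    def add_asset(self, name: str, asset_type: str,
                  classifications: Iterable[str] = ()) -> None:
        self.asset_types[name] = asset_type
        self.classifications[name] = set(classifications)
        self.derived_from.setdefault(name, set())

    def add_lineage(self, downstream: str, upstream: str) -> None:
        """Record that `downstream` was produced from `upstream`."""
        for name in (downstream, upstream):
            if name not in self.asset_types:
                raise KeyError(f"unknown asset: {name}")
        self.derived_from[downstream].add(upstream)

    def upstream(self, name: str) -> Set[str]:
        """All assets that `name` depends on, directly or indirectly (BFS)."""
        seen: Set[str] = set()
        queue = deque([name])
        while queue:
            current = queue.popleft()
            for parent in self.derived_from[current]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen

    def touches(self, name: str, classification: str) -> List[str]:
        """Assets in the lineage of `name` (itself included) carrying a label."""
        candidates = self.upstream(name) | {name}
        return sorted(a for a in candidates
                      if classification in self.classifications[a])


# Hypothetical sample assets: two source data sets feed a model.
catalog = Catalog()
catalog.add_asset("crm_customers", "data_set", ["personal_data"])
catalog.add_asset("web_clicks", "data_set")
catalog.add_asset("training_set", "data_set")
catalog.add_asset("churn_model", "ml_model")
catalog.add_lineage("training_set", "crm_customers")
catalog.add_lineage("training_set", "web_clicks")
catalog.add_lineage("churn_model", "training_set")

personal_sources = catalog.touches("churn_model", "personal_data")
```

Asking which personal data feeds `churn_model` is the kind of question a regulator or an explainability review would pose, and in this sketch it reduces to a graph traversal plus a filter on classifications. A production catalog would add a type system, access control, and exchange between repositories, none of which is modeled here.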

Published Date : Apr 19 2018

ENTITIES

Entity	Category	Confidence
James Kobielus	PERSON	0.99+
Mandy Chessell	PERSON	0.99+
IBM	ORGANIZATION	0.99+
ING	ORGANIZATION	0.99+
James	PERSON	0.99+
three	QUANTITY	0.99+
Berlin	LOCATION	0.99+
Mandy	PERSON	0.99+
Hortonworks	ORGANIZATION	0.99+
May 25th	DATE	0.99+
last year	DATE	0.99+
U.S.	LOCATION	0.99+
two days	QUANTITY	0.99+
Atlas	TITLE	0.99+
yesterday	DATE	0.99+
Berlin, Germany	LOCATION	0.99+
SiliconANGLE Media	ORGANIZATION	0.99+
Data Steward Studio	ORGANIZATION	0.99+
both	QUANTITY	0.99+
Both	QUANTITY	0.98+
EU	LOCATION	0.98+
GDPR	TITLE	0.98+
One	QUANTITY	0.98+
one	QUANTITY	0.98+
Dataworks Summit 2018	EVENT	0.97+
Dataworks Summit EU 2018	EVENT	0.96+
this week	DATE	0.94+
single server	QUANTITY	0.94+
Hadoop	TITLE	0.94+
today	DATE	0.93+
this morning	DATE	0.93+
three-way partnership	QUANTITY	0.93+
Wikibon	ORGANIZATION	0.91+
Hortonworks'	ORGANIZATION	0.9+
Atlas	ORGANIZATION	0.89+
Dataworks Summit Europe 2018	EVENT	0.89+
couple of years ago	DATE	0.87+
Apache Atlas	TITLE	0.86+
Cube	COMMERCIAL_ITEM	0.83+
Apache	ORGANIZATION	0.82+
JanusGraph	TITLE	0.79+
hot themes	QUANTITY	0.68+
Hado	ORGANIZATION	0.67+
Hadoop HDFS	TITLE	0.63+

Day Two Keynote Analysis | Dataworks Summit 2018

>> Announcer: From Berlin, Germany, it's theCUBE covering Dataworks Summit Europe 2018. Brought to you by Hortonworks. (electronic music) >> Hello and welcome to theCUBE on day two of Dataworks Summit 2018 from Berlin. It's been a great show so far. We have just completed the day two keynote and in just a moment I'll bring ya up to speed on the major points and the presentations from that. It's been a great conference. Fairly well attended here. The hallway chatter, discussion's been great. The breakouts have been stimulating. For me the takeaway is the fact that Hortonworks, the show host, has announced yesterday at the keynote, Scott Gnau, the CTO of Hortonworks announced Data Steward Studio, DSS they call it, part of the data plane, Hortonworks data plane services portfolio, and it could not be more timely, Data Steward Studio, because we are now five weeks away from GDPR, that's the General Data Protection Regulation becoming the law of the land. When I say the land, the EU, but really any company that operates in the EU, and that includes many U.S.-based and APAC-based and other companies, will need to comply with the GDPR as of May 25th and ongoing, in terms of protecting the personal data of EU citizens. And that means a lot of different things. Data Steward Studio, announced yesterday, was demo'd today by Hortonworks and it was a really excellent demo, and showed that it's a powerful solution for a number of things that are at the core of GDPR compliance. The demo covered the capability of the solution to discover and inventory personal data within a distributed data lake or enterprise data environment, number one. Number two, the ability of the solution to centralize consent, provide a consent portal essentially that data subjects can use then to review the data that's kept on them, to make fine-grained consents or withdraw consents for use in profiling of their data that they own. 
And then number three, the show, they demonstrated the capability of the solution then to execute the data subjects' requests in terms of the handling of their personal data. The three main points in terms of enabling, adding the teeth to enforce GDPR in an operational setting in any company that needs to comply with GDPR. So, what we're going to see, I believe going forward in the, really in the whole global economy and in the big data space is that Hortonworks and others in the data lake industry, and there's many others, are going to need to roll out similar capabilities in their portfolios 'cause their customers are absolutely going to demand it. In fact the deadline is fast approaching, it's only five weeks away. One of the interesting takeaways from the keynote this morning was the fact that John Kreisa, the VP for marketing at Hortonworks, today did a quick survey of those in the audience, a poll, asking how ready they are to comply with GDPR as of May 25th and it was a bit eye opening. I wasn't surprised, but I think it was 19 or 20%, I don't have the numbers in front of me, said that they won't be ready to comply. I believe it was somewhere between 20 and 30% said they will be able to comply. About 40%, don't quote me on that, but a fair plurality said that they're preparing. So that indicates that they're not entirely 100% sure that they will be able to comply 100% to the letter of the law as of May 25th. I think that's probably accurate in terms of ballpark figures. I think there's a lot of, I know there's a lot of companies, users racing for compliance by that date. And so really GDPR is definitely the headline banner, umbrella story around this event and really around the big data community world-wide right now in terms of enterprise investments in the needed compliance software and services and capabilities that are needed to comply with GDPR. That was important. 
That wasn't the only thing that was covered in, not only the keynotes, but in the sessions here so far. AI, clearly AI and machine learning are hot themes in terms of the innovation side of big data. There's compliance, there's GDPR, but really innovation in terms of what enterprises are doing with their data, with their analytics, they're building more and more AI and embedding that in conversational UIs and chatbots and they're embedding AI in, you know, all manner of e-commerce applications, internal applications in terms of search, as well as things like face recognition, voice recognition, and so forth and so on. So, what we've seen here at the show is what I've been seeing for quite some time, which is that more of the actual developers who are working with big data are the data scientists of the world. And more of the traditional coders are getting up to speed very rapidly on the new state of the art for building machine learning and deep learning, AI, natural language processing into their applications. That said, so Hortonworks has become a fairly substantial player in the machine learning space. In fact, you know, really across their portfolio, many of the discussions here I've seen show that everybody's buzzing about getting up to speed on frameworks for building and deploying and iterating and refining machine learning models in operational environments. So that's definitely a hot theme. And so there was an AI presentation this morning from the first gentleman that came on that laid out the broad parameters of what developers are doing and looking to do with data that they maintain in their lakes, training data to both build the models and train them and deploy them. So, that was also something I expected and it's good to see at Dataworks Summit that there is a substantial focus on that in addition of course to GDPR and compliance. It's been about seven years now since Hortonworks was essentially spun off of Yahoo.
It's been I think about three years or so since they went IPO. And what I can see is that they are making great progress in terms of their growth, in terms of not just the finances, but their customer acquisition and their deal size and also customer satisfaction. I get a sense from talking to many of the attendees at this event that Hortonworks has become a fairly blue chip vendor, that they're really in many ways, continuing to grow their footprint of Hortonworks products and services in most of their partners, such as IBM. And from what I can see everybody was rapt with attention around Data Steward Studio and I sensed, sort of a sigh of relief that it looks like a fairly good solution and so I have no doubt that a fair number of those in this hall right now are probably, as we say in the U.S., probably kicking the tires of DSS and probably going to expedite their adoption of it. So, with that said, we have day two here, so what we're going to have is Alan Gates, one of the founders of Hortonworks coming on in just a few minutes and I'll be interviewing him, asking about the vibrancy and the health of the community, the Hortonworks ecosystem, developers, partners, and so forth as well as of course the open source communities for Hadoop and Ranger and Atlas and so forth, the growing stack of open source code upon which Hortonworks has built their substantial portfolio of solutions. Following him we'll have John Kreisa, the VP for marketing. I'm going to ask John to give us an update on, really the, sort of the health of Hortonworks as a business in terms of the reach out to the community in terms of their messaging obviously and have him really position Hortonworks in the community in terms of who he sees them competing with. What segments is Hortonworks in now? The whole Hadoop segment increasingly... Hadoop is there. It's the foundation. The word is not invoked in the context of discussions of Hortonworks as much now as it was in the past.
And the same thing for, say, Cloudera, one of their closest traditional rivals, closest in the sense that people associate them. I was at the Cloudera analyst event the other week in Santa Monica, California. It was the same thing. I think both of these vendors are on a similar path to become fairly substantial data warehousing and data governance suppliers to the enterprises of the world that have traditionally gone with the likes of IBM and Oracle and SAP and so forth. So I think Hortonworks has definitely evolved into a far more diversified solution provider than people realize. And that's really one of the takeaways from Dataworks Summit. With that said, this is Jim Kobielus. I'm the lead analyst, I should've said that at the outset. I'm the lead analyst on SiliconANGLE Media's Wikibon team, focused on big data analytics. I'm your host this week on the Cube at Dataworks Summit Berlin. I'll close out this segment and we'll get ready to talk to the Hortonworks and IBM personnel. I understand there's a gentleman from Accenture on as well today on the Cube here at Dataworks Summit Berlin. (electronic music)

Published Date : Apr 19 2018

SUMMARY :

Announcer: From Berlin, Germany, it's the Cube as a business in terms of the reach out to the community

ENTITIES

Entity	Category	Confidence
Jim Kobielus	PERSON	0.99+
John Kreisa	PERSON	0.99+
Hortonworks	ORGANIZATION	0.99+
Scott Gnau	PERSON	0.99+
IBM	ORGANIZATION	0.99+
John	PERSON	0.99+
Cloudera	ORGANIZATION	0.99+
May 25th	DATE	0.99+
Berlin	LOCATION	0.99+
Yahoo	ORGANIZATION	0.99+
five weeks	QUANTITY	0.99+
Alan Gates	PERSON	0.99+
Oracle	ORGANIZATION	0.99+
Data Steward Studio	ORGANIZATION	0.99+
General Data Protection Regulation	TITLE	0.99+
Santa Monica, California	LOCATION	0.99+
GDPR	TITLE	0.99+
19	QUANTITY	0.99+
both	QUANTITY	0.99+
100%	QUANTITY	0.99+
today	DATE	0.99+
20%	QUANTITY	0.99+
one	QUANTITY	0.99+
yesterday	DATE	0.99+
U.S.	LOCATION	0.99+
DSS	ORGANIZATION	0.99+
30%	QUANTITY	0.99+
Berlin, Germany	LOCATION	0.98+
Dataworks Summit 2018	EVENT	0.98+
three main points	QUANTITY	0.98+
Atlas	ORGANIZATION	0.98+
20	QUANTITY	0.98+
about seven years	QUANTITY	0.98+
Accenture	ORGANIZATION	0.97+
SiliconANGLE	ORGANIZATION	0.97+
One	QUANTITY	0.97+
about three years	QUANTITY	0.97+
Day Two	QUANTITY	0.97+
first gentleman	QUANTITY	0.96+
day two	QUANTITY	0.96+
SAP	ORGANIZATION	0.96+
EU	LOCATION	0.95+
Datawork Summit Europe 2018	EVENT	0.95+
Dataworks Summit	EVENT	0.94+
this morning	DATE	0.91+
About 40%	QUANTITY	0.91+
Wikibon	ORGANIZATION	0.9+
EU	ORGANIZATION	0.9+

Muggie van Staden, Obsidian | Dataworks Summit 2018

>> Voiceover: From Berlin, Germany, it's theCUBE, covering DataWorks Summit Europe 2018, brought to you by Hortonworks. >> Hi, hello, welcome to theCUBE, I'm James Kobielus. I'm the lead analyst for Big Data Analytics at the Wikibon, which is the team inside of SiliconANGLE Media that focuses on emerging trends and technologies. We are here, on theCUBE at DataWorks Summit 2018 in Berlin, Germany. And I have a guest here. This is, Muggie, and if I get it wrong, Muggie Van Staden >> That's good enough, yep. >> Who is with Obsidian, which is a South Africa-based partner of Hortonworks. And I'm not familiar with Obsidian, so I'm going to ask Muggie to tell us a little bit about your company, what you do, your focus on open source, and really the opportunities you see for big data, for Hadoop, in South Africa, really the African continent as a whole. So, Muggie? >> Yeah, James great to be here. Yes, Obsidian, we started it 23 years ago, focusing mostly on open source technologies, and as you can imagine that has changed a lot over the last 23 years when we started the concept of selling Linux was basically a box with a hat and maybe a T-shirt in it. Today that's changed. >> James: Hopefully there's a stuffed penguin in there, too. (laughing) I could use that right now. >> Maybe a manual. So our business has evolved a lot over the last 23 years. And one of the technologies that has come around is Hadoop. And we actually started with some of the other Hadoop vendors out there as our first partnerships, and probably three or four years ago we decided to take on Hortonworks as one of our vendors. We found them an amazing company to work with. And together with them we've now worked in four of the big banks in South Africa. One of them is actually here at DataWorks Summit. They won an award last night. So it's fantastic to be part of all of that. And yes, South Africa being so far removed from the rest of the world. They have different challenges. 
Everybody's nervous of Cloud. We have the joys that we don't really have any Cloud players locally yet. The two big players, Microsoft and Amazon, are planning some data centers soon. So the guys have different challenges to Europe and to the States. But big data, the big banks are looking at it, starting to deploy nice Hadoop clusters, starting to ingest data, starting to get real business value out of it, and we're there to help, and hopefully the four is the start for us and we can help lots of customers on this journey. >> Are South African-based companies, because you are so distant in terms of miles on the planet from Europe, from the EU, is any company in South Africa, or many companies, concerned at all about the global, or say the General Data Protection Regulation, GDPR? US-based companies certainly are 'cause they operate in Europe. So is that a growing focus for them? And we have five weeks until GDPR kicks in. So tell me about it. >> Yeah, so from a South African point of view, some of the banks and some of the companies would have subsidiaries in Europe. So for them it's a very real thing. But we have our own Act called PoPI, which is the protection of private information, so very similar. So everybody's keeping an eye on it. Everybody's worried. I think everybody's worried for the first company to be fined. And then they will all make sure that they get their things right. But, I think not just because of a legislation, I think it's something that everybody should worry about. How do we protect data? How do we make sure the right people have access to the correct data when they should and nobody violates that because I mean, in this day and age, you know, Google and Amazon and those guys probably know more about me than my family does. So it's a challenge for everybody. And I think it's just the right thing for companies to do is to make sure that the data that they do have that they really do take good care of it. 
We trust them with our money and now we're trusting them with our data. So it's a real challenge for everybody. >> So how long has Obsidian been a partner of Hortonworks and how has your role, or partnership I should say, evolved over that time, and how do you see it evolving going forward. >> We've been a partner about three or four years now. And started off as a value added reseller. We're also a training partner in South Africa for them. And as they as a company have evolved, we've had to evolve with them. You know, so they started with HDP as the Hadoop platform. Now they're doing NiFi and HDF, so we have to learn all of those technologies as well. But very, very excited where they're going with DataPlane Service, just managing a customer's data across multiple clusters, multiple clouds, because that's realistically where we see all the customers going, is you know clusters, on-premise clusters and typically multiple Clouds, and how do you manage that? And we are very excited to walk this road together with Hortonworks and all the South African customers that we have. >> So you say your customers are deploying multiple Clouds. Public Clouds or hybrid private-public Clouds? Give us a sense, for South Africa, whether public Cloud is a major, or is a major deployment option or choice for financial services firms that you work with. >> Not necessarily financial services, so most of them are kicking tires at this stage, nobody's really put major work loads in there. As I mentioned, both Amazon and Microsoft are planning to put data centers down in South Africa very soon, and I think that will spur a big movement towards Cloud, but we do have some customers, unfortunately not Hortonworks customers, that are actually mostly in the Cloud. And they are now starting to look at a multi-Cloud strategy. So to ideally be in the three or four major Cloud providers and spinning up the right workloads in the right Cloud, and we're there to help. 
>> One of the most predominant workloads that your customers are running in the Cloud, is it backend in terms of data ingest and transformation? Is it a bit of maybe data warehousing with unstructured data? Is it a bit of things like queryable archiving? I want to get a sense for, what is predominant right now in workloads? >> Yeah I think most of them start with (mumble) environments. (mumbles) one customer that's heavily into Cloud from a data point of view. Literally it's their data warehouse. They put everything in there. I think from the banking customers, most of them are considering DR of their existing Hadoop clusters, maybe a subset of their data and not necessarily everything. And I think some of them are also considering putting their unstructured data outside on the Cloud because that's where most of it's coming from. I mean, if you have Twitter, Facebook, LinkedIn data, it's a bit silly to pull all of that into your environment, why not just put it in the Cloud, that's where it's coming from, and analyze that and connect it back to your data where relevant. So I think a lot of the customers would love to get there, and now Hortonworks makes it so much easier to do that. I think a lot of them will start moving in that direction. >> Now, excuse me, so are any or many of your customers doing development and training of machine learning algorithms and models in their Clouds? And to the extent that they are, are they using tools like the IBM Data Science Experience that Hortonworks resells for that? >> I think it's definitely on the radar for a lot of them. I'm not aware of anybody using it yet, but lots of people are looking at it and excited about the partnership between IBM and Hortonworks. And IBM has been a longstanding player in the South African market, and it's exciting for us as well to bring them into the whole Hortonworks ecosystem, and together solve real world problems. 
>> Give us a sense for how built out the big data infrastructure is in neighboring countries like Botswana or Angola or Mozambique and so forth. Is that an area that your company, are those regions that your company operates in? Sells into? >> We don't have offices, but we don't have a problem going in and helping customers there, so we've had projects in the past, not data related, that we've flown in and helped people. Most of the banks from a South African point of view, have branches into Africa. So it's on the roadmap, some are a little bit ahead of others, but definitely on the roadmap to actually put down Hadoop clusters in some of the major countries all throughout Africa. There's a big debate, do you put it down there, do you leave the data in South Africa? So they're all going through their own legislation, but it's definitely on the roadmap for all of them to actually take their data, knowledge in data science, up into Africa. >> Now you say that in South Africa proper, there are privacy regulations, you know, maybe not the same as GDPR, but equivalent. Throughout Africa, at least throughout Southern Africa, how is privacy regulation, is it lacking or is it emerging? >> I think it's emerging. A lot of the countries do have the basic rule that their data shouldn't leave the country. So everybody wants that data sovereignty and that's why a lot of them will not go to Cloud, and that's part of the challenges for the banks, that if they have banks up in Botswana, etc., and Botswana's rules are that the data has to stay in country, they have to figure out a way how do they connect that data to get the value for all of their customers. So real world challenges for everybody. >> When you're going into and selling into an emerging, or developing nation, do you need to provide upfront consulting to help the customer bootstrap their own understanding of the technology and making the business case and so forth? And how consultative is the selling process... 
>> Absolutely, and what we see with the banks, most of them even have a consultative approach within their own environment, so you would have the South African team maybe flying into the team at (mumbles) Botswana, and share some of the learnings that they've had. And then help those guys get up to speed. The reality is the skills are not necessarily in country. So there's a lot of training, a lot of help to go and say, we've done this, let us upskill you. And be a part of that process. So we sometimes send in teams to come and do two, three day training, basics, etc., so that ultimately the guys can operationalize in each country by themselves. >> So, that's very interesting, so what do you want to take away from this event? What do you find most interesting in terms of the sessions you've been in around the community showcase that you can take back to Obsidian, back in your country and apply? Like the announcement this morning of the Data Steward Studio. Do you see a possibility that your customers might be eager to use that for curation of their data in their clusters? >> Definitely, and one of the key messages for me was Scott, the CTO's message about your data strategy, your Cloud strategy, and your business strategy. It is effectively the same thing. And I think that's the biggest message that I would like to take back to the South African customers is to go and say, you need to start thinking about this. You know, as Cloud becomes a bigger reality for us, we have to align, we have to go and say, how do we get your data where it belongs? So you know, we like to say to our customers, we help the teams get the right code to the right compute and the right data, and I think it's absolutely critical for all of the customers to go and say, well, where is that data going to sit? Where is the right compute for that piece of data? And can we get it then, can we manage it, etc.? And align to business strategy. 
Everybody's trying to do digital transformation, and those three things go very much hand-in-hand. >> Well, Muggie, thank you very much. We're at the end of our slot. This has been great. It's been excellent to learn more about Obsidian and the work you're doing in South Africa, providing big data solutions or working with customers to build the big data infrastructure in the financial industry down there. So this has been theCUBE. We've been speaking with Muggie Van Staden of Obsidian Systems, and here at DataWorks Summit 2018 in Berlin. Thank you very much.

Published Date : Apr 18 2018

SUMMARY :

brought to you by Hortonworks. I'm the lead analyst for Big Data Analytics at the Wikibon, and really the opportunities you see for big data, and as you can imagine that has changed a lot I could use that right now. So the guys have different challenges to Europe or say the general data protection regulation, GDPR? And I think it's just the right thing for companies to do and how do you see it evolving going forward. And we are very excited to walk this road together So you say your customers are deploying multiple Clouds. And they are now starting to look at a multi-Cloud strategy. One of the most predominant workloads and now Hortonworks makes it so much easier to do that. and excited about the partnership the big data infrastructure is in neighboring countries but definitely on the roadmap to actually put down you know, maybe not the same as GDPR, and that's part of the challenges for the banks, And how consultative is the selling process... and share some of the learnings that they've had. around the community showcase that you can take back for all of the customers to go and say, and the work you're doing in South Africa,

ENTITIES

Entity	Category	Confidence
IBM	ORGANIZATION	0.99+
James Kobielus	PERSON	0.99+
Amazon	ORGANIZATION	0.99+
Hortonworks	ORGANIZATION	0.99+
Microsoft	ORGANIZATION	0.99+
Europe	LOCATION	0.99+
Muggie Van Staden	PERSON	0.99+
Africa	LOCATION	0.99+
Google	ORGANIZATION	0.99+
Muggie van Staden	PERSON	0.99+
Botswana	LOCATION	0.99+
Mozambique	LOCATION	0.99+
Angola	LOCATION	0.99+
Muggie	PERSON	0.99+
Scott	PERSON	0.99+
South Africa	LOCATION	0.99+
James	PERSON	0.99+
Southern Africa	LOCATION	0.99+
two	QUANTITY	0.99+
LinkedIn	ORGANIZATION	0.99+
Berlin	LOCATION	0.99+
three day	QUANTITY	0.99+
three	QUANTITY	0.99+
GDPR	TITLE	0.99+
Facebook	ORGANIZATION	0.99+
Berlin, Germany	LOCATION	0.99+
Twitter	ORGANIZATION	0.99+
Obsidian Systems	ORGANIZATION	0.99+
first company	QUANTITY	0.99+
five weeks	QUANTITY	0.99+
four	QUANTITY	0.99+
first partnerships	QUANTITY	0.99+
three	DATE	0.99+
Today	DATE	0.98+
Linux	TITLE	0.98+
23 years ago	DATE	0.98+
DataWorks Summit 2018	EVENT	0.98+
both	QUANTITY	0.97+
EU	LOCATION	0.97+
Wikibon	ORGANIZATION	0.97+
one	QUANTITY	0.97+
PoPI	TITLE	0.97+
Data Steward Studio	ORGANIZATION	0.97+
each country	QUANTITY	0.97+
Cloud	TITLE	0.97+
US	LOCATION	0.96+
last night	DATE	0.96+
SiliconANGLE Media	ORGANIZATION	0.96+
four years	QUANTITY	0.96+
DataWorks Summit	EVENT	0.96+
One	QUANTITY	0.96+
Dataworks Summit 2018	EVENT	0.95+
Hadoop	ORGANIZATION	0.93+
about three	QUANTITY	0.93+
two big players	QUANTITY	0.93+
theCUBE	ORGANIZATION	0.93+

Abhas Ricky, Hortonworks | Dataworks Summit 2018

>> Announcer: From Berlin, Germany, it's the CUBE covering Dataworks Summit Europe 2018. Brought to you by Hortonworks. >> Welcome to the CUBE, we're here at Dataworks Summit 2018 in Berlin. I'm James Kobielus. I am the lead analyst for big data analytics on the Wikibon team of SiliconANGLE Media. On the CUBE, we extract the signal from the noise and here at Dataworks Summit, the signal is big data analytics and increasingly the imperative for many enterprises is compliance with GDPR. The General Data Protection Regulation comes in five weeks, May 25th. There's more things going on so what I'm going to be doing today for the next 20 minutes or so is from Hortonworks I have Abhas Ricky who is the director of strategy and innovation. He helps customers, and he'll explain what he does, but at a high level, he helps customers to identify the value of investments in big data, analytics, big data platforms in their business. And Abhas, how do you justify the value of compliance with GDPR? I guess the value would be to avoid penalties for noncompliance, right? Can you do it as an upside as well? Is there an upside in terms of if you make an investment, and you probably will need to make an investment to comply, can you turn this around as a strategic asset, possibly? >> Yeah, so I'll take a step back first. >> James: Like a big data catalog and so forth. >> Yeah, so if you look at the value part which you said, it's interesting that you mentioned it. So there's a study which was done by McKinsey which said that only 15% of executives can understand what is the value of a digital initiative, let alone a big data initiative. >> James: Yeah. >> Similarly, Gartner says that if you look at the various portraits and if you look at various issues, the fundamental thing which executives struggle with is identifying the value which they will get. So that is where I pitch in. That is where I come in from a data perspective. 
Now if you look at GDPR specifically, one of the things that we believe, and I've done multiple blogs around that and webinars, GDPR should be treated as a business opportunity because of the fact that -- >> James: Any opportunity? >> Business opportunity. It shouldn't necessarily be seen as a compliance burden on costs or your balance sheets because of the fact, it is the one single opportunity which allows you to clean up your data supply chain. It allows you to look at your data assets with a holistic view, and you create a transparent data supply chain, and your IT systems talk to each other. So some of the provisions, as you know, in addition to right to consent, right to portability, etc. It is also privacy by design which says that you have to be proactive in defining your IT systems and architecture. It's not necessarily reactive. But guess what? If you're able to do that, you will see the benefits in other use cases like single view of customer or fraud or anti-money laundering because at the end of the day, all GDPR is asking you to say is where do you store your data, what's the lineage, what's the provenance? Can you identify what the personally identifiable information is for any particular customer? And can you use that to your effect as you go forward? So it's a great opportunity because to be able to comply with the provisions, you've got to take steps before that which is essentially streamlining your data operations which obviously will have a domino effect on the efficiency of other use cases. So I believe it's a business opportunity. >> Right, now part of that opportunity in terms of getting your arms around what data you have, where the GDPR is concerned, the customer has a right to withhold consent for you and the enterprise that holds that data to use that personal data of theirs which they own for various and sundry reasons. 
Many enterprises and many of Hortonworks' customers are using their big data for things like AI and machine learning. Won't this compliance with GDPR limit their ability to seize the opportunity to build deep learning and so forth? What are customers saying about that? Is that going to be kind of a downer or a chilling effect on their investments in AI and so forth? >> So there's two elements around it. The first thing which you said, there are customers, there's machine learning and AI, yes, there are. But broadly speaking, before you're able to do machine learning and AI, you need to get your data sets onto a particular platform in a particular fashion, clean data, otherwise, you can't do AI or machine learning on top of it. >> James: Right. >> So the reason why I say it's an opportunity is that because you're being forced by compliance to get that data from every other place onto this platform. So obviously those capabilities will get enhanced. Having said that, I do agree if I'm an organization which does targeting, retargeting of customers based on multiple segmentations and then one of the things is online advertisements. In that case, yes, your ability might get affected, but I don't think you'll get prohibited. And that affected time span will be only small because you just adapt. So the good thing about machine learning and AI is that you don't create rules, you don't create manual rules. They pick up the rules based on the patterns and how the data sets have been performing. So obviously once you have created those structures in place, initially, yes, you'll have to make an investment to alter your programs of work. However, going forward, it will be even better. Because guess what? You just cleaned your entire data supply chain. 
So that's how I would see that, yes, a lot of companies, ecommerce, you do targeting and retargeting based on the customer DNA, based on their shopping profiles, based on their shopping habits and then based off that, you give them the next best offer or whatever. So, yes, that might get affected initially, but that's not because GDPR is there or not. That's just because you're changing your program of work. You're changing the fundamental way by which you're sourcing the data, where it's coming from and which data you can use. But once you have tags against each of those attributes, once you have access controls, once you know exactly which customer attributes you can touch and which you cannot for the purposes, whether you have consent or not, your life's even better. The AI tools or the machine learning algorithms will learn by themselves. >> Right, so essentially, once you have a tight ship in terms of managing your data in line with the GDPR strictures and so forth, it sounds like what you're saying is that it gives you as an enterprise the confidence and assurance that if you want to use that data and need to use that data, you know exactly how; you've got the processes in place to gain the necessary consents from customers. So there won't be any nasty surprises later on of customers complaining because you've got legal procedures for getting the consent and that's great. You know, one of the things, Abhas, we're hearing right now in terms of compliance requirements that are coming along, maybe not a part of GDPR directly yet, but related to it is the whole notion of algorithmic transparency. 
As you build machine learning models and these machine learning models are driven into working applications, being able to transparently identify if those models take, in particular, let's say autonomous action based on particular data and particular variables, and then there are some nasty consequences like crashing an autonomous vehicle, the ability, they call it explainable AI, to roll that back and determine who's liable for that event. Does Hortonworks have any capability within your portfolio to enable more transparency into the algorithmic underpinnings of a given decision? Is that something that you enable in your solutions or that your partner IBM enables through DSX and so forth? Give us a sense whether that's a capability currently that you guys offer and whether that's something, in terms of your understanding, are customers asking for that yet or is that too futuristic? >> So I would say that it's a two-part question. >> James: Yeah. >> The first one, yes, there are multiple regulations coming in, like Vilica Financial Markets, there's MiFID, the BCBS, etc. and organizations have to comply. You've got the IFRS which span to brokers, the insurance, etc., etc. So, yes, a lot of organizations across industries are getting affected by compliance use cases. Where does Hortonworks come into the picture? To be able to be compliant from a data standpoint, A, you need to be able to identify which of those data sources you need to implement a particular use case. B, you need to get them to a certain point whereby you can do analytics on that. And then there's the whole storage and processing and all of that. 
But also, which you might have heard at the keynote today, from a cloud perspective, it's starting to get more and more complex because everyone's moving to the cloud which means, if you look at any large multi-national organization, most of them have a hybrid cloud structure because they work with two or three cloud vendors which makes the process even more complex because now you have multiple clusters, you have on premise and you have multiple different IT systems which need to talk to each other. Which is where the Hortonworks DataPlane Services come into the picture because it gives you a unified view of your global data assets. >> James: Yes. >> Think of it like a single pane of glass whereby you can do security and governance across all data assets. So from those angles, yes, we definitely enable those use cases which will help with compliance. >> Making the case to the customer for a big data catalog along the lines of what you guys offer, in making the case, there's a lot of upfront data architectural work that needs to be done to get all your data assets into shape within the context of the catalog. How do they justify making that expense in terms of hiring the people, the data architects and so forth needed to put it all in shape? I mean, how long does it take before you can really stand up a working data catalog in most companies? >> So again, you've asked two questions. First of all is how do they justify it? Which is where we say that the platform is a means to an end. It's enabling you to deliver use cases. So I look at it in terms of five key value drivers. Either it's a risk reduction or it's a cost reduction or it's a cost avoidance. >> James: Okay. >> Or it's a revenue optimization, or it's time to market. Against each one of these value drivers, or multiple of them or a combination of them, each of the use cases that you're delivering on the platform will lead you to benefits around that. 
My job, obviously, is to work with the customers and executives to understand what that will be, to quantify the potential impact, which will then form the basis and give my customer champions enough ammunition so that they can go back and justify those investments. >> James: Abhas, we're going to have to cut it short, but I'm going to let you finish your point here, but we have to end this segment so go ahead. >> That's fine. >> Okay, well, anyway, we have had Abhas Ricky, who is the director of strategy and innovation at Hortonworks. We're here at Dataworks Summit Berlin. And thank you very much. Sorry to cut it short, but we have to move to the next guest. >> No worries, pleasure, thank you very much. >> Take care, have a good one. >> Thanks a lot, yes. (upbeat music)

Published Date : Apr 18 2018

SUMMARY :

Brought to you by Hortonworks. and you probably will need to make an investment to comply, Yeah, so if you look at the value part which you said, the various portraits and if you look at various issues, So some of the provisions, as you know, the customer has a right to withhold consent for you you need to get your data sets onto a particular platform the way they are coming from and which data can you use. and need to use that data, you know exactly come into the picture because it gives you which whereby you can do security and governance a big data catalog along the lines of what you guys offer, the platform is a means to an end. will lead you to benefits around that. but I'm going to let you finish your point here, And thank you very much Thanks a lot, yes.

ENTITIES

Entity	Category	Confidence
James	PERSON	0.99+
James Kobielus	PERSON	0.99+
two	QUANTITY	0.99+
Berlin	LOCATION	0.99+
IBM	ORGANIZATION	0.99+
two questions	QUANTITY	0.99+
BCBS	ORGANIZATION	0.99+
two-part	QUANTITY	0.99+
General Data Protection Regulation	TITLE	0.99+
Abhas	PERSON	0.99+
Gardner	PERSON	0.99+
Hortonworks	ORGANIZATION	0.99+
15%	QUANTITY	0.99+
two elements	QUANTITY	0.99+
Vilica Financial Markets	ORGANIZATION	0.99+
each	QUANTITY	0.99+
Abhas Ricky	PERSON	0.99+
SiliconANGLE Media	ORGANIZATION	0.99+
GDPR	TITLE	0.99+
May 25th	DATE	0.99+
today	DATE	0.98+
First	QUANTITY	0.98+
Berlin, Germany	LOCATION	0.98+
Dataworks Summit 2018	EVENT	0.98+
one	QUANTITY	0.98+
first	QUANTITY	0.97+
first one	QUANTITY	0.97+
single	QUANTITY	0.97+
Dataworks Summit	EVENT	0.96+
five weeks	QUANTITY	0.95+
five key value drivers	QUANTITY	0.95+
first thing	QUANTITY	0.95+
Wikibon	ORGANIZATION	0.95+
one single opportunity	QUANTITY	0.93+
single pane	QUANTITY	0.91+
McKinsey	ORGANIZATION	0.9+
CUBE	ORGANIZATION	0.9+
Mid Fair	ORGANIZATION	0.89+
three cloud vendors	QUANTITY	0.89+
IFRS	TITLE	0.87+
each one	QUANTITY	0.87+
Dataworks Summit Europe 2018	EVENT	0.86+
DSX	TITLE	0.8+
Hortonwork	ORGANIZATION	0.78+
next 20 minutes	DATE	0.72+

Scott Gnau, Hortonworks | Dataworks Summit EU 2018

(upbeat music) >> Announcer: From Berlin, Germany, it's The Cube, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. >> Hi, welcome to The Cube, we're separating the signal from the noise and tuning into the trends in data and analytics. Here at DataWorks Summit 2018 in Berlin, Germany. This is the sixth year, I believe, that DataWorks has been held in Europe. Last year I believe it was in Munich, now it's in Berlin. It's a great show. The host is Hortonworks and our first interviewee today is Scott Gnau, who is the chief technology officer of Hortonworks. Of course Hortonworks established themselves about seven years ago as one of the up and coming startups commercializing a then brand new technology called Hadoop and MapReduce. They've moved well beyond that in terms of their go to market strategy, their product portfolio, their partnerships. So Scott, this morning, it's great to have ya'. How are you doing? >> Glad to be back and good to see you. It's been a while. >> You know, yes, I mean, you're an industry veteran. We've both been around the block a few times but I remember you years ago. You were at Teradata and I was at another analyst firm. And now you're with Hortonworks. And Hortonworks is really on a roll. I know you're not Rob Bearden, so I'm not going to go into the financials, but your financials look pretty good, your latest. You're growing, your deal sizes are growing. Your customer base is continuing to deepen. So you guys are on a roll. So we're here in Europe, we're here in Berlin in particular. It's five weeks, you did the keynote this morning, it's five weeks until GDPR. The sword of Damocles, the GDPR sword of Damocles. It's not just affecting European-based companies, but it's affecting North American companies and others who do business in Europe. 
So your keynote this morning, your core theme was that, if you're an enterprise, your business strategy is equated with your cloud strategy now, is really equated with your data strategy. And you got to a lot of that. It was a really good discussion. And where GDPR comes into the picture is the fact that protecting data, personal data of your customers is absolutely important, in fact it's imperative and mandatory, and will be in five weeks or you'll face a significant penalty if you're not managing that data and providing customers with the right to have it erased, or the right to withdraw consent to have it profiled, and so forth. So enterprises all over the world, especially in Europe, are racing as fast as they can to get compliant with GDPR by the May 25th deadline time. So, one of the things you discussed this morning, you had an announcement overnight that Hortonworks has released a new solution in technical preview called The Data Steward Studio. And I'm wondering if you can tie that announcement to GDPR? It seems like data stewardship would have a strong value for your customers. >> Yeah, there's definitely a big tie-in. GDPR is certainly creating a milestone, kind of a trigger, for people to really think about their data assets. But it's certainly even larger than that, because when you even think about driving digitization of a business, driving new business models and connecting data and finding new use cases, it's all about finding the data you have, understanding what it is, where it came from, what's the lineage of it, who had access to it, what did they do to it? These are all governance kinds of things, which are also now mandated by laws like GDPR. And so it's all really coming together in the context of the new modern data architecture era that we live in, where a lot of data that we have access to, we didn't create. 
And so it was created outside the firewall by a device, by some application running with some customer, and so capturing and interpreting and governing that data is very different than taking derivative transactions from an ERP system, which are already adjudicated and understood, and governing that kind of a data structure. And so this is a need that's driven from many different perspectives, it's driven from the new architecture, the way IoT devices are connecting and just creating a data bomb, that's one thing. It's driven by business use cases, just saying what are the assets that I have access to, and how can I try to determine patterns between those assets where I didn't even create some of them, so how do I adjudicate that? >> Discovering and cataloging your data-- >> Discovering it, cataloging it, actually even... When I even think about data, just think the files on my laptop, that I created, and I don't remember what half of them are. So creating the metadata, creating that trail of bread crumbs that lets you piece together what's there, what's the relevance of it, and how, then, you might use it for some correlation. And then you get in, obviously, to the regulatory piece that says sure, if I'm a new customer and I ask to be forgotten, the only way that you can guarantee to forget me is to know where all of my data is. >> If you remember that they are your customer in the first place and you know where all that data is, if you're even aware that it exists, that's the first and foremost thing for an enterprise to be able to assess their degree of exposure to GDPR. >> So, right. It's like a whole new use case. It's a microcosm of all of these really big things that are going on. And so what we've been trying to do is really leverage our expertise in metadata management using the Apache Atlas project. >> Interviewer: You and IBM have done some major work-- >> We work with IBM and the community on Apache Atlas. 
You know, metadata tagging is not the most interesting topic for some people, but in the context that I just described, it's kind of important. And so I think one of the areas where we can really add value for the industry is leveraging our lowest common denominator, open source, open community kind of development to really create a standard infrastructure, a standard open infrastructure for metadata tagging, into which all of these use cases can now plug. Whether it's I want to discover data and create metadata about the data based on patterns that I see in the data, or I've inherited data and I want to ensure that the metadata stay with that data through its life cycle, so that I can guarantee the lineage of the data, and be compliant with GDPR-- >> And in fact, tomorrow we will have Mandy Chessell from IBM, a key Hortonworks partner, discussing the open metadata framework you're describing and what you're doing. >> And that was part of this morning's keynote close also. It all really flowed nicely together. Anyway, it is really a perfect storm. So what we've done is we've said, let's leverage this lowest common denominator, standard metadata tagging, Apache Atlas, and uplevel it, and not have it be part of a cluster, but actually have it be a cloud service that can be in force across multiple data stores, whether they're in the cloud or whether they're on prem. >> Interviewer: That's the Data Steward Studio? >> Well, Data Plane and Data Steward Studio really enable those things to come together. >> So the Data Steward Studio is the second service >> Like an app. >> under the Hortonworks DataPlane service. >> Yeah, so the whole idea is to be able to tie those things together, and when you think about it in today's hybrid world, and this is where I really started, where your data strategy is your cloud strategy, they can't be separate, because if they're separate, just think about what would happen. So I've copied a bunch of data out to the cloud. 
All memory of any lineage is gone. Or I've got to go set up manually another set of lineage that may not be the same as the lineage it came with. And so being able to provide that common service across footprint, whether it's multiple data centers, whether it's multiple clouds, or both, is a really huge value, because now you can sit back and through that single pane, see all of your data assets and understand how they interact. That obviously has the ability then to provide value like with Data Steward Studio, to discover assets, maybe to discover assets and discover duplicate assets, where, hey, I can save some money if I get rid of this cloud instance, 'cause it's over here already. Or to be compliant and say yeah, I've got these assets here, here, and here, I am now compelled to do whatever: delete, protect, encrypt. I can now go do that and keep a record through the metadata that I did it. >> Yes, in fact that is very much at the heart of compliance, you got to know what assets there are out there. And so it seems to me that Hortonworks is increasingly... the H-word rarely comes up these days. >> Scott: Not Hortonworks, you're talking about Hadoop. >> Hadoop rarely comes up these days. When the industry talks about you guys, it's known that's your core, that's your base, that's where HDP and so forth, great product, great distro. In fact, in your partnership with IBM, a year or more ago, I think it was IBM standardized on HDP in lieu of their distro, 'cause it's so well-established, so mature. But going forward, you guys in many ways, Hortonworks, you have positioned yourselves now. Wikibon sees you as being the premier solution provider of big data governance solutions specifically focused on multi-cloud, on structured data, and so forth. So the announcement today of the Data Steward Studio very much builds on that capability you already have there. So going forward, can you give us a sense of your roadmap in terms of building out the DataPlane service? 
'Cause this is the second of these services under the DataPlane umbrella. Give us a sense for how you'll continue to deepen your governance portfolio in DataPlane. >> Really the way to think about it, there are a couple of things that you touched on that I think are really critical, certainly for me, and for us at Hortonworks to continue to repeat, just to make sure the message got there. Number one, Hadoop is definitely at the core of what we've done, and was kind of the secret sauce. Some very different stuff in the technology, also the fact that it's open source and community, all those kinds of things. But that really created a foundation that allowed us to build the whole beginning of big data data management. And we added and expanded to the traditional Hadoop stack by adding Data in Motion. And so what we've done is-- >> Interviewer: NiFi, I believe, you made a major investment. >> Yeah, so we made a large investment in Apache NiFi, as well as Storm and Kafka as kind of a group of technologies. And the whole idea behind doing that was to expand our footprint so that we would enable our customers to manage their data through its entire lifecycle, from being created at the edge, all the way through streaming technologies, to landing, to analytics, and then even analytics being pushed back out to the edge. So it's really about having that common management infrastructure for the lifecycle of all the data, including Hadoop and many other things. And then in that, obviously as we discuss whether it be regulation, whether it be, frankly, future functionality, there's an opportunity to uplevel those services from an overall security and governance perspective. And just like Hadoop kind of upended traditional thinking... and what I mean by that was not the economics of it, specifically, but just the fact that you could land data without describing it. That seemed so unimportant at one time, and now it's like the key thing that drives the difference. 
Think about sensors that are sending in data that reconfigure firmware, and those streams change. Being able to acquire data and then assess the data is a big deal. So the same thing applies, then, to how we apply governance. I said this morning, traditional governance was hey, I started this employee, I have access to this file, this file, this file, and nothing else. I don't know what else is out there. I only have access to what my job title describes. And that's traditional data governance. In the new world, that doesn't work. Data scientists need access to all of the data. Now, that doesn't mean we need to give away PII. We can encrypt it, we can tokenize it, but we keep referential integrity. We keep the integrity of the original structures, and those who have a need to actually see the PII can get the token and see the PII. But it's governance thought inversely as it's been thought about for 30 years. >> It's so great you've worked governance into an increasingly streaming, real-time in motion data environment. Scott, this has been great. It's been great to have you on The Cube. You're an alum of The Cube. I think we've had you at least two or three times over the last few years. >> It feels like 35. Nah, it's pretty fun. >> Yeah, you've been great. So we are here at Dataworks Summit in Berlin. (upbeat music)
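
The tokenization idea Scott describes above, hiding PII while keeping referential integrity, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a Hortonworks or DataPlane API: the `Tokenizer` class, the in-memory vault, the demo key, and the sample records are all hypothetical. It uses a keyed hash so that the same input always maps to the same token, which is what lets tokenized tables still be joined.

```python
import hashlib
import hmac


class Tokenizer:
    """Deterministic PII tokenizer (hypothetical, for illustration only)."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        # token -> original value; stands in for a secured token vault
        self._vault = {}

    def tokenize(self, value: str) -> str:
        # Keyed hash: identical inputs yield identical tokens,
        # so join keys still line up after tokenization.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = "tok_" + digest[:16]
        self._vault[token] = value
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        # Only callers with a demonstrated need get the original value back.
        if not authorized:
            raise PermissionError("caller may not view PII")
        return self._vault[token]


tokenizer = Tokenizer(secret_key=b"demo-key-not-for-production")

# Hypothetical sample data: two tables that share an email column.
customers = [{"email": "ana@example.com", "country": "DE"}]
orders = [{"email": "ana@example.com", "total": 42.0}]

safe_customers = [{**row, "email": tokenizer.tokenize(row["email"])} for row in customers]
safe_orders = [{**row, "email": tokenizer.tokenize(row["email"])} for row in orders]

# The join still works on the tokenized column, without exposing the email.
joined = [
    (c["country"], o["total"])
    for c in safe_customers
    for o in safe_orders
    if c["email"] == o["email"]
]
print(joined)
```

A data scientist working with `safe_customers` and `safe_orders` can correlate the two tables without ever seeing an email address, while a caller passing `authorized=True` to `detokenize` can recover the original value. A real deployment would need proper key management and a persistent, access-controlled vault, which this sketch leaves out.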

Published Date : Apr 18 2018

SUMMARY :

Brought to you by Hortonworks. So Scott, this morning, it's great to have ya'. Glad to be back and good to see you. So, one of the things you discussed this morning, of the new modern data architecture era that we live in, forgotten, the only way that you can guarantee and foremost thing for an enterprise to be able And so what we've been trying to do is really leverage so that I can guarantee the lineage of the data, discussing the open metadata framework you're describing And that was part of this morning's keynote close also. those things to come together. of lineage that may not be the same as the lineage And so it seems to me that Hortonworks is increasingly... When the industry talks about you guys, it's known And so what we've done is-- Interviewer: NiFi, I believe, you made So the same thing applies, then, to how we apply governance. It's been great to have you on The Cube. Nah, it's pretty fun.. So we are here at Dataworks Summit in Berlin.

ENTITIES

Entity	Category	Confidence
Europe	LOCATION	0.99+
Scott	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Berlin	LOCATION	0.99+
Scott Gnau	PERSON	0.99+
Hortonworks	ORGANIZATION	0.99+
Teradata	ORGANIZATION	0.99+
Last year	DATE	0.99+
May 25th	DATE	0.99+
five weeks	QUANTITY	0.99+
Mandy Chessell	PERSON	0.99+
GDPR	TITLE	0.99+
Munich	LOCATION	0.99+
Rob Bearden	PERSON	0.99+
second service	QUANTITY	0.99+
30 years	QUANTITY	0.99+
both	QUANTITY	0.99+
tomorrow	DATE	0.99+
first	QUANTITY	0.99+
Berlin, Germany	LOCATION	0.99+
second	QUANTITY	0.99+
DataPlane	ORGANIZATION	0.99+
sixth year	QUANTITY	0.98+
three times	QUANTITY	0.98+
first interviewee	QUANTITY	0.98+
Dataworks Summit	EVENT	0.98+
one	QUANTITY	0.97+
this morning	DATE	0.97+
DataWorks Summit 2018	EVENT	0.97+
MapReduce	ORGANIZATION	0.96+
Hadoop	TITLE	0.96+
Hadoop	ORGANIZATION	0.96+
one time	QUANTITY	0.96+
35	QUANTITY	0.96+
single pane	QUANTITY	0.96+
NiFi	ORGANIZATION	0.96+
today	DATE	0.94+
DataWorks Summit Europe 2018	EVENT	0.93+
Data Steward Studio	ORGANIZATION	0.93+
Dataworks Summit EU 2018	EVENT	0.92+
about seven years ago	DATE	0.91+
a year or	DATE	0.88+
years	DATE	0.87+
Storm	ORGANIZATION	0.87+
Wikibon	ORGANIZATION	0.86+
Apache NiFi	ORGANIZATION	0.85+
The Cube	PERSON	0.84+
North American	OTHER	0.84+
DataWorks	ORGANIZATION	0.84+
Data Plane	ORGANIZATION	0.76+
Data Steward Studio	TITLE	0.75+
Kafka	ORGANIZATION	0.75+

Search Results for DataWorks Summit Europe 2018: