
Search Results for Apache Atlas:

David Piester, Io-Tahoe & Eddie Edwards, Direct Energy | AWS re:Invent 2019


 

>> Announcer: Live from Las Vegas, it's theCUBE, covering AWS re:Invent 2019. Brought to you by Amazon Web Services along with its ecosystem partners. >> Hey, welcome back to theCUBE's coverage of AWS re:Invent 2019 from Las Vegas. This is day two of our three days of coverage. Two sets, lots of Cube content. Lisa Martin here with Justin Warren, founder and chief analyst at PivotNine. Justin and I are joined by a couple of guests new to the Cube. We've got David Piester, Global Head of Sales for Io-Tahoe. Welcome. And Eddie Edwards, with a cool name, Global Data Services Director from Direct Energy. Welcome, Eddie. >> Thank you. >> Okay, so David, I know we had somebody from Io-Tahoe on yesterday, but I'd love for you to give our audience an overview of Io-Tahoe, and then you've got to tell us what the name means. >> Okay. Well, Dave Piester here. Io-Tahoe thinks it's a wonderful event here at AWS, and we're excited to be here. Io-Tahoe is located downtown on Wall Street, New York. And as for the name, well, there are a lot of different meanings, but mainly 'Tahoe' is for the data lake, and 'I/O', input/output into the lake, is how it was originally meant. A little background on Io-Tahoe: we were founded in 2014, started in stealth, and came out of stealth in 2017 with two signature clients, one of whom you're going to hear from in a moment, Direct Energy; the other one, GE, and we'll speak to those in just a moment. Io-Tahoe takes a unique approach: we have nine machine learning algorithms and 14 feature sets that interrogate the data at the data level. We go past metadata, solving that really difficult data challenge, and I'm going to let Eddie describe some of the use cases, which were around data migration, PII discovery, and so, over to you. >> Eddie, tell us a little bit about Direct Energy: where you're located, what you guys do, and how data is absolutely critical to your business. >> Yeah, sure. So Direct Energy is the largest residential energy supplier in the US, with around 5,000 employees. A lot of this has come from acquisitions, so as you can imagine, we have a vast amount of data that we need to manage. Currently I've got just under 1,700 applications in my portfolio, and a lot of the challenges we face are around cost, driving down the cost to serve so we can pass that back on to our consumers, and the challenge we've had is how best to gain that understanding. Where Io-Tahoe came into play was mainly around the ability to use the product quickly, being able to connect to our existing sources to discover the data, and then to catalog that information and start applying the rules, whether around legislation like GDPR or the many cases where there are differences between the states in the understandings and definitions. So the product gives us the ability to bring a common approach to that information. A good success story would be: about three months ago we took the thirty-odd applications for our North America home business. We were able to run them through the product within a week, and that gave us the information to then consolidate the estate downwards, working with our business colleagues to identify all the data, define the archival retention rules, bring more meaning to the data, and actually improve our sales opportunities by highlighting rich information that was not known previously. >> Yes, you mentioned that you're growing through acquisition. One thing that people tend to underestimate around IT is that it's not homogeneous.
It's a heterogeneous environment. As soon as you buy another company, you've got another silo, you've got another data set, you've got something else. So walk us through how Io-Tahoe actually deals with that very disparate set of data that you've now inherited from acquiring all of these different companies. >> Yeah, exactly right. Every time we acquire an organization, they have various different applications running in their estate, whether it be an old Oracle or, say, a SQL-type environment. What we're able to do is use the product to plug in and profile, to understand what's inside, what knowledge they have around their customer base, and how we can bring that in to build up a single view and offer additional value-adding products or rewards for customers, whether that be on our HVAC side, heating, ventilation and air-con, where again we have 4,600 engineers in that space. So it's opening up new opportunities and territories to us. >> Go ahead. >> Just to add to that, we're across multiple sectors, but this problem of death-by-Excel was in financial services; we're located on Wall Street, as I mentioned, and this problem of legacy, disparate data sources, and understanding and knowing your data was a common problem. Banks were just throwing people at the problem. So his use case, with 1,700 applications, a lot of them legacy, fits right into what we do. And cataloging, as he mentioned: we catalog with that discovery and search engine that we have. We enable search across the enterprise, and on discovery we auto-tag and auto-classify the sensitive data into the catalog automatically. That's a key part of what we do. >> And David, thinking of differentiation, what was the opportunity that you guys saw, and is the cataloging of sensitive information one of the key things that makes it a differentiator? >> We enable data governance, so it's not just sensitive information we catalog; we catalog the entire data set, multiple data sets. What differentiates us is the machine learning: we interrogate, in brute force, the data itself, going past metadata. So a billion rows, 100,000 columns, large complex data sets: we interrogate every field value, and we tell you, this looks like a phone number, this looks like an address, this looks like a first name, this looks like a last name, and we tag that to the catalog. And then anything that's sensitive in nature gets color-coded red or green: highly sensitive, sensitive. So that's our big differentiator.
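Io-Tahoe's actual classifiers are proprietary machine learning, but the idea David describes, interrogating raw field values rather than metadata and tagging what they look like, can be sketched with simple pattern rules. Here is a minimal illustration in Python; the patterns, labels, and sensitivity levels are assumptions for the example, not Io-Tahoe's real algorithms or feature sets:

```python
import re

# Assumed example patterns; a real classifier would use ML over many
# feature sets, not just regular expressions.
PATTERNS = {
    "phone_number": re.compile(r"^\+?\d[\d\s\-\(\)]{7,14}$"),
    "email":        re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn":       re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

SENSITIVITY = {"phone_number": "sensitive", "email": "sensitive",
               "us_ssn": "highly_sensitive"}

def classify_column(values):
    """Guess a semantic tag for a column by voting over its field values."""
    votes = {}
    for v in values:
        for tag, pattern in PATTERNS.items():
            if pattern.match(str(v).strip()):
                votes[tag] = votes.get(tag, 0) + 1
    if not votes:
        return None, None
    tag = max(votes, key=votes.get)
    # Require a majority of values to match before tagging the column.
    if votes[tag] < len(values) / 2:
        return None, None
    return tag, SENSITIVITY[tag]

print(classify_column(["212-555-0147", "+1 917 555 0199", "646-555-0123"]))
# -> ('phone_number', 'sensitive')
```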
>> So is that like 100% visibility into the granularity of what is in this data? >> Yes, and that's one of the issues we see ahead of us: we're finding a lot of folks want to go to the cloud, but they can't get access to the data, they don't know their data, they don't understand it. And so we're that bridge. We're a key strategic partner for AWS, and we're excited about the opportunity that's come about in the last six months with AWS, because we're going to be that key piece for migration to the cloud. >> So, on the data lake, and I love the name Io-Tahoe: in your opinion, you hear so many different things about data lakes, data lakes turning into data swamps. Is there still a lot of value in data lakes that customers, just like you were saying before, simply don't know they have? >> Well, that's an interesting transition to one of our other clients, but I just want to make a note that we actually started in the relational world, so we were already across heterogeneous environments; the Tahoe part just has more to do with the lake. At a time a few years back, everybody was just dumping data into the lake. They didn't understand what was in there, and in this era of privacy that's created a big issue. Comcast had this problem: a large Teradata instance, just dumping into the lake, not understanding how their data was flowing, not understanding what was in the lake sensitivity-wise. And they want to enable BI, they want to start doing analytics, but you've got to understand and know the data, right? So for Comcast we enabled DataOps for them automatically with our machine learning; that was one of the use cases. And then they put the information in, and we integrated with Apache Atlas, and they have a large AWS instance, and they're able to then better govern their data. And one other customer had a very complex use case around their data: 36 ERPs being migrated to one virtual ERP in the lake. Think about finance data, how difficult that is to manage and understand. So we were a key piece in helping that migration happen in weeks rather than months. >> David, you mentioned cloud, and clearly we're at a cloud show, but you also mentioned knowing your data. One aspect of cloud is that it moves fast, and it's a much bigger scale than what we've been used to. So I'm interested, and maybe, Eddie, you can fill us in here as well, in the use of a tool to help you know your data, when we're not creating any less data; there's just more and more of it. At this speed and this scale, how important is it that you actually have tooling to provide to the humans who have to go and operate on all of this data? >> Building on what David was saying around the speed and agility side: now all our information, at least for the North America home business, is in AWS, held in an S3 bucket. We are already starting work with AWS Connect on the call center side, being able to stream that information through, so we're getting to the point now as an organization where we're able to profile the data in real time, take that information, and predict what the customer is going to do as part of the machine learning side. So we're starting to trial a capability where we will interject into a call to say, well, a customer might be on your digital site trying to complete a journey; you can see their challenges around the data, and you can then step in with a chat, using, say, the new AWS chat capability that's just coming through at the moment. >> One of the opportunities I'm hearing, sorry Eddie, is the opportunity to leverage the insights in the data to deliver, as you mentioned, customer rewards, a more personalized experience, or a call center agent knowing the problem this customer is experiencing, that they have tried X, Y, and Z to resolve it, or that this customer is loyal and pays their bills on time and should be eligible for some sort of rewards program. I think Amazon.com has created this demanding consumer: we expect you to know us, I expect you to serve us up things that you think we want.
>> Talk to me about the opportunity that Io-Tahoe is giving your business to be able to delight customers in ways that you probably couldn't even have predicted. >> Well, David touched on the tagging earlier. By understanding the data that's coming through, being able to use the data-flow technology and the categorization, we're able to then tie it in with the wider estate. David mentioned the 36 ERPs; we've just gone through the same in other parts of our organization, and we're driving an additional level of value, turning away from a manual, labor-intensive task. I used to have 20 architects who would daily go through trying to build an understanding of the relationships. I do not need that now; I just have a couple of people who are able to take the outputs and then validate the information using the product. >> And I like that. There's just so much there: you mentioned the customer-360 example at a call center. There's so much DataOps that has to happen to make that happen, and that's the most difficult challenge to solve, and that's where we come in. And after you catalog the data, I just want to touch on this: we enable search for the enterprise. So you're now connected to 50, 100, 150 sources with our software. Now you've cataloged it, you've profiled it, and now you can search: Karen Kim, Kim Smith. So your engineers, your architects, your data stewards, your business analysts, these folks can now search anything they want and find anything sensitive, find that person, find an invoice, and that helps enable the customer 360 you mentioned. >> But what I'm also hearing is that it has the potential to enable a better relationship between IT and the business. >> Absolutely, it brings those both together, because they're so siloed. In this day and age your data is siloed, and your business is siloed in different business units, so this helps exactly that: collaborate, crowdsource, bring it all together on one platform. >> And so, of your 1,700 applications, and you mentioned the 36 or so ERPs, what percentage, if you can guess, have you been able to reduce in duplicate, triplicate, et cetera applications? And what are some of the overarching business benefits that Direct Energy is achieving? >> So in terms of Direct Energy, I'd say that we're just at the beginning of our journey. We're about four months in, and we've already decommissioned 12 of the applications, and I'm starting to move out to the wider estate. In terms of benefits, the ROI is probably around 300% at the moment. >> A 300% ROI in just a few months. >> Just now, you've got some of the basic savings around the storage side, and we're also getting large savings from some of the existing support agreements that we have in place. David touched on DataOps: I've been able to reduce the number of people required to support the team, there is now a more common understanding within the organization, and we're able to turn it more into a self-service opportunity for business operations, pushing it along the line from being a technical problem to a business challenge. And at the end of the day, they're the experts; they understand the data better than any IT folks sat in a corner, right? >> So I'm going to ask you one more question: what gave you the confidence that Io-Tahoe was the right solution for you? >> Purely down to the open-source side. I've been using
Io-Tahoe probably for about two years in parts of the organization. We were very early adopters of other technologies in the open-source market, and it was just the ability, on the proof of concept, to turn it around in hours, where you'd go to a traditional vendor and it would take a few months and large business cases; we didn't need any of that. We were able to show results within 24 to 48 hours, and that buys the confidence. And I'm sure David would take the challenge of being able to plug in some data sets and show you the data. >> Cool stuff, guys. Well, thank you for sharing with us what you guys are doing at Io-Tahoe, keeping the data lake blue, and the successes that you're achieving in such a short time at Direct Energy. I appreciate your time, guys. >> Thank you. >> Excellent. Our pleasure. >> Know your data. >> Exactly, know your data. For my guest and my co-host Justin Warren, I'm Lisa Martin. I'm going to go off and learn my data. You've been watching theCUBE at AWS re:Invent 2019. Thanks for watching.

Published Date : Dec 4 2019


Mandy Chessell, IBM | Dataworks Summit EU 2018


 

>> Announcer: From Berlin, Germany, it's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. (electronic music) >> Well hello, welcome to theCUBE, I'm James Kobielus. I'm the lead analyst for big data analytics within the Wikibon team of SiliconANGLE Media. I'm hosting the Cube this week at DataWorks Summit 2018 in Berlin, Germany. It's been an excellent event. Hortonworks, the host, has... we've completed two days of keynotes. They made an announcement of the Data Steward Studio as the latest of their offerings and demonstrated it this morning, to address GDPR compliance, which of course is coming down hot and heavy on enterprises both in the EU and around the world, including in the U.S., and the May 25th deadline is fast approaching. One of Hortonworks' prime partners is IBM, and today on this Cube segment we have Mandy Chessell. Mandy is a distinguished engineer at IBM who did an excellent keynote yesterday all about metadata and metadata management. Mandy, great to have you. >> Hi, and thank you. >> So I wonder if you can just reprise or summarize the main takeaways from your keynote yesterday on metadata and its role in GDPR compliance and so forth, and the broader strategies that enterprise customers have regarding managing their data in this new multi-cloud world, where Hadoop and open source platforms are critically important for storing and processing data. So Mandy, go ahead. >> So, metadata's not new. I mean, it's basically information about data. And a lot of companies are trying to build a data catalog, which is not a catalog that actually contains their data; it's a catalog that describes their data. >> James: Is it different from an index or a glossary? How's the catalog different? >> Yeah, so a catalog actually includes both. So it is a list of all the data sets, plus links to glossary definitions of what those data items mean within the data sets, plus information about the lineage of the data. It includes information about who's using it, what they're using it for, how it should be governed. >> James: It's like a governance repository. >> So governance is part of it. The governance part is really saying, "This is how you're allowed to use it, this is how the data's classified, these are the automated actions that are going to happen on the data as it's used within the operational environment." So there's that aspect to it, but there is also the collaboration side. Hey, I've been using this data set, it's great. Or, actually, this data set is full of errors, we can't use it. So you've got feedback to data set owners, as well as exchange and collaboration between data scientists working with the data. So it really is a central resource for an organization that has a strong data strategy and is interested in becoming a data-driven organization. This becomes their major catalog of their data assets and how they're using them. So when a regulator comes in and says, "Can you show me that you're managing personal data?", the data catalog will have the information about where personal data is located, what type of infrastructure it's sitting on, how it's being used by different services. So they can really show that they know what they're doing, and then from that they can show how the processes that use the metadata ensure the data is used appropriately day to day.
>> So Apache Atlas: it's basically a catalog, if I understand correctly. At least for IBM and Hortonworks, it's Hadoop, it's Apache Atlas, and Apache Atlas is essentially an open source metadata code base. >> Mandy: Yes, yes. >> So explain what Atlas is in this context. >> So yes, Atlas is a collection of code, but it supports a server, a graph-based metadata server. It also supports-- >> James: A graph-based-- >> Both: Metadata server. >> Yes. >> James: I'm sorry, so explain what you mean by graph-based in this context. >> Okay, so it runs using the JanusGraph graph repository. And this is very good for metadata, because if you think about what it is, it's connecting dots. It's basically saying this data set means this value and needs to be classified in this way, and this-- >> James: Like a semantic knowledge graph. >> It is, yes, actually. And on top of it we impose a type system that describes the different types of things you need to control and manage in a data catalog. But the graph, the Atlas component, gives you that graph-based repository underneath, and on top we've built what we call the open metadata and governance libraries. They run inside Atlas, so when you run Atlas you will have all the open metadata interfaces, but you can also take those libraries and connect them and load them actually into another vendor's product. And what they're doing is allowing metadata to be exchanged between repositories of different types. And this becomes incredibly important as an organization increases their maturity and their use of data, because you can't just have knowledge about data in a single server; it just doesn't scale. You need to get that knowledge into every runtime environment, into the data tools that people are using across the organization. And so it needs to be distributed. >> Mandy, I'm wondering, the whole notion of what you catalog in that repository: does it include, or does Apache Atlas support, adding metadata relevant to data-derivative assets like machine learning models and so forth? >> Mandy: Absolutely. So we have base types in the open metadata layer, but it's also a very flexible and extensible type system. So if you've got a specialist machine learning model that needs additional information stored about it, that can easily be added to the runtime environment, and then it will be managed through the open metadata protocols as if it was part of the native type system.
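As a concrete picture of how such a catalog is driven programmatically, Atlas exposes the graph and type system Mandy describes through a REST interface. The sketch below, in Python, searches for entities carrying a sensitivity classification and attaches a classification to one entity; the host, port, credentials, and the "PII" classification name are assumptions for the example, with endpoint paths following the commonly deployed Atlas v2 REST API:

```python
import requests

ATLAS = "http://atlas-host:21000/api/atlas/v2"  # assumed host, default Atlas port
AUTH = ("admin", "admin")                       # assumed credentials

# Basic search: find Hive tables tagged with an (assumed) "PII" classification.
resp = requests.post(f"{ATLAS}/search/basic", auth=AUTH, json={
    "typeName": "hive_table",
    "classification": "PII",
    "limit": 25,
})
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["guid"], entity["attributes"].get("qualifiedName"))

# Tag another entity as sensitive by attaching a classification to its GUID.
guid = "00000000-0000-0000-0000-000000000000"   # placeholder GUID
requests.post(
    f"{ATLAS}/entity/guid/{guid}/classifications",
    auth=AUTH,
    json=[{"typeName": "PII"}],
).raise_for_status()
```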
Is IBM working on bringing, say Atlas and the overall big data catalog into that kind of a use case. >> Yes, yeah, so if you think about what's required, you need to understand the data that was used to train the AI how, what data's been fed to it since it was deployed because that's going to change its behavior, and then also a view of how that data's going to change in the future so you can start to anticipate issues that might arising from the model's changing behavior. And this is where the data catalog can actually associate and maintain information about the data that's being used with the algorithm. You can also associate the checking mechanism that's constantly monitoring the profile of the data so you can see where the data is changing over time, that will obviously affect the behavior of the machine learning model. So it's really about providing, not just information about the model itself, but also the data that's feeding it, how those characteristics are changing over time so that you know the model is continuing to work into the future. >> So tell us about the IBM, Hortonworks partnership on metadata and so forth. >> Mandy: Okay. >> How is that evolving? So, you know, your partnership is fairly tight. You clearly, you've got ODPI, you've got the work that you're doing related to the big data catalog. What can we expect to see in the near future in terms of, initiatives building on all of that for governance of big data in the multi-cloud environment? >> Yeah so Hortonworks started the Apache Atlas project a couple of years ago with a number of their customers. And they built a base repository and a set of APIs that allow it to work in the Hadoop environment. We came along last year, formed our partnership. That partnership includes this open metadata and governance layer. So since then we worked with ING as well and ING bring the, sort of, user perspective, this is the organization's use of the data. And, so between the three of us we are basically transforming Apache Atlas from an Hadoop focused metadata repository to an enterprise focused metadata repository. Plus enabling other vendors to connect into the open metadata ecosystem. So we're standardizing types, standardizing format, the format of metadata, there's a protocol for exchanging metadata between repositories. And this is all coming from that three-way partnership where you've got a consuming organization, you've got a company who's used to building enterprise middleware, and you've got Hortonworks with their knowledge of open source development in their Hadoop environment. >> Quick out of left field, as you develop this architecture, clearly you're leveraging Hadoop HTFS for storage. Are you looking to at least evaluating maybe using block chain for more distributed management of the metadata in these heterogeneous environments in the multi-cloud, or not? >> So Atlas itself does run on HTFS, but doesn't need to run on HTFS, it's got other storage environments so that we can run it outside of Hadoop. When it comes to block chain, so block chain is, for, sharing data between partners, small amounts of data that basically express agreements, so it's like a ledger. There are some aspects that we could use for metadata management. It's more that we actually need to put metadata management into block chain. So the agreements and contracts that are stored in block chain are only meaningful if we understand the data that's there, what it's quality, where it came from what it means. 
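The "checking mechanism" described here, one that watches the profile of live data against the profile captured at training time, can be illustrated with a very simple statistical test. A minimal sketch in Python; the threshold and the z-score style test are assumptions chosen for brevity, where a production monitor would use richer tests such as population-stability or KS statistics:

```python
from statistics import mean, stdev

def profile(values):
    """Summarize a numeric feature column."""
    return {"mean": mean(values), "stdev": stdev(values), "n": len(values)}

def drifted(train_profile, live_values, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold` standard
    deviations from the training mean (a deliberately simple test)."""
    live_mean = mean(live_values)
    z = abs(live_mean - train_profile["mean"]) / train_profile["stdev"]
    return z > threshold

# Profile captured when the model was trained (and stored in the catalog
# alongside the model's metadata).
train = profile([52.1, 49.8, 50.3, 51.0, 48.9, 50.7])

# New data flowing to the deployed model.
print(drifted(train, [61.2, 63.5, 60.9, 62.8]))  # True -> investigate the model
```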
>> So tell us about the IBM-Hortonworks partnership on metadata and so forth. >> Mandy: Okay. >> How is that evolving? Your partnership is fairly tight: you've got ODPi, you've got the work that you're doing related to the big data catalog. What can we expect to see in the near future in terms of initiatives building on all of that for governance of big data in the multi-cloud environment? >> Yeah, so Hortonworks started the Apache Atlas project a couple of years ago with a number of their customers. They built a base repository and a set of APIs that allow it to work in the Hadoop environment. We came along last year and formed our partnership. That partnership includes this open metadata and governance layer. Since then we've worked with ING as well, and ING brings the user perspective, the organization's use of the data. And so between the three of us we are basically transforming Apache Atlas from a Hadoop-focused metadata repository to an enterprise-focused metadata repository, plus enabling other vendors to connect into the open metadata ecosystem. So we're standardizing types, standardizing the format of metadata; there's a protocol for exchanging metadata between repositories. And this is all coming from that three-way partnership, where you've got a consuming organization, you've got a company that's used to building enterprise middleware, and you've got Hortonworks with their knowledge of open source development in their Hadoop environment. >> Quick one out of left field: as you develop this architecture, clearly you're leveraging Hadoop HDFS for storage. Are you looking to, or at least evaluating, maybe using blockchain for more distributed management of the metadata in these heterogeneous environments in the multi-cloud, or not? >> So Atlas itself does run on HDFS, but doesn't need to run on HDFS; it's got other storage environments, so we can run it outside of Hadoop. When it comes to blockchain: blockchain is for sharing data between partners, small amounts of data that basically express agreements, so it's like a ledger. There are some aspects that we could use for metadata management, but it's more that we actually need to put metadata management into blockchain. The agreements and contracts that are stored in blockchain are only meaningful if we understand the data that's there: what its quality is, where it came from, what it means. And so actually there's a very interesting distributed-metadata question that comes with the blockchain technology, and I think that's an important area of research. >> Well, Mandy, we're at the end of our time. Thank you very much. We could go on and on. You're a true expert, and it's great to have you on the Cube. >> Thank you for inviting me. >> So this is James Kobielus with Mandy Chessell of IBM. We are here this week in Berlin at DataWorks Summit 2018. It's a great event, and we have some more interviews coming up, so thank you very much for tuning in. (electronic music)

Published Date : Apr 19 2018


Alan Gates, Hortonworks | Dataworks Summit 2018


 

(techno music) >> Announcer: From Berlin, Germany, it's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. >> Well hello, welcome to theCUBE. We're here on day two of DataWorks Summit 2018 in Berlin, Germany. I'm James Kobielus, lead analyst for big data analytics in the Wikibon team of SiliconANGLE Media. And who we have here today: we have Alan Gates, who's one of the founders of Hortonworks, and Hortonworks of course is the host of DataWorks Summit. Well, hello, Alan. Welcome to theCUBE. >> Hello, thank you. >> Yeah, so Alan, you and I go way back. Essentially, what we'd like you to do first of all is just explain a little bit of the genesis of Hortonworks: where it came from, your role as a founder from the beginning, how that's evolved over time, but really how the company has evolved, specifically with the focus on the community, the Hadoop community, the open source community. You have a deepening open source stack that you build upon, with Atlas and Ranger and so forth. Give us a sense for all of that, Alan. >> Sure. So as I think is well known, we started as the team at Yahoo that really was driving a lot of the development of Hadoop. We were one of the major players in the Hadoop community. I was in that team for four years; I think the team itself was going for about five. And it became clear that there was an opportunity to build a business around this. Some others had already started to do so, and we wanted to participate in that. We worked with Yahoo to spin out Hortonworks, and actually they were a great partner in that, helped us get that spun out. And the leadership team of the Hadoop team at Yahoo became the founders of Hortonworks, and brought along a bunch of the other engineers to help get started. And really at the beginning, it was Hadoop, Pig, Hive, HBase, you know, the kind of beginning projects. So a pretty small toolkit. And our early customers were very engineering-heavy people, companies who knew how to take those tools and build something directly on those tools, right?
>> Well, the Hadoop community as a whole started off with a focus on the data engineers of the world. >> Yes. >> And I think it's shifted, and confirm this for me, over time, so that you focus increasingly with your solutions on the data scientists who are doing the development of the applications, and the data stewards, from what I can see at this show. >> I think it's really just a part of the adoption curve, right? When you're early on that curve, you have people who are very into the technology, understand how it works, and want to dive in there. So those tend to be, as you said, the data engineering types in this space. As that curve grows out, it gets wider and wider. There's still plenty of data engineers that are our customers, that are working with us, but as you said, the data analysts, the BI people, data scientists, data stewards, all those people are now starting to adopt it as well. And they need different tools than the data engineers do. They don't want to sit down and write Java code. Some of the data scientists might want to work in Python in a notebook like Zeppelin or Jupyter, but some may want to use SQL, or even Tableau or something on top of SQL to do the presentation. And of course data stewards want tools more like Atlas to help manage all their stuff. So that does drive us to, one, put more things into the toolkit, so you see the addition of projects like Apache Atlas and Ranger for security and all that. Another area of growth, I would say, is also the kind of data that we're focused on. So early on, we were focused on data at rest: you know, we're going to store all this stuff in HDFS. As the data scene has evolved, there's a lot more focus now on a couple of things. One is what we call data-in-motion, for our HDF product, where you've got a stream manager like Kafka or something like that. >> (James) Right. >> So there's processing that kind of data. But now we also see a lot of data in various places. It's not just, oh, okay, I have a Hadoop cluster on premise at my company; I might have some here, some on premise somewhere else, and I might have it in several clouds as well. >> OK, so your focus has shifted, like the industry in general, towards streaming data in multi-clouds, where it's more stateful interactions and so forth? I think you've made investments in Apache NiFi, so-- >> (Alan) Yes. >> Give us a sense for your NiFi versus Kafka and so forth inside of your product strategy. >> Sure. So NiFi is really focused on that data at the edge, right? So you're bringing data in from sensors, connected cars, airplane engines, all those sorts of things that are out there generating data, and you need to figure out what parts of the data to move upstream and what parts not to. What processing can I do here so that I don't have to move it upstream? When I have an error event or a warning event, can I turn up the amount of data I'm sending in, right? Say this airplane engine is suddenly heating up, maybe a little more than it's supposed to; maybe I should ship more of the logs upstream when the plane lands and connects than I would otherwise. That's the kind of thing that Apache NiFi focuses on. I'm not saying it runs in all those places, but my point is, it's that kind of edge processing. Kafka is still going to be running in a data center somewhere; it's still a pretty heavyweight technology in terms of memory and disk space and all that, so it's not going to run on some sensor somewhere. But it is that data-in-motion, right? I've got millions of events streaming through a set of Kafka topics, watching all that sensor data that's coming in from NiFi and reacting to it, maybe putting some of it in the data warehouse for later analysis, all those sorts of things. So that's the differentiation there between Kafka and NiFi. >> Right, right, right. So, going forward, do you see more of your customers working on internet-of-things projects? We don't often, at least in the industry's popular mind, associate Hortonworks with edge computing and so forth. Is that accurate? >> I think that we will have more and more customers in that space. I mean, our goal is to help our customers with their data wherever it is. >> (James) Yeah. >> When it's on the edge, when it's in the data center, when it's moving in between, when it's in the cloud: all those places, that's where we want to help our customers store and process their data. So I wouldn't want to say that we're going to focus on just the edge or the internet of things, but that certainly has to be part of our strategy, because it has to be part of what our customers are doing.
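The adaptive edge behavior Alan describes, shipping more detail upstream only when a reading looks abnormal, amounts to a small filtering policy. Here is a sketch of that decision logic in Python; the temperature threshold and sampling rate are invented for illustration, and a real deployment would typically express this as a NiFi flow rather than hand-written code:

```python
import random

TEMP_WARN_C = 240.0        # assumed warning threshold for an engine reading
NORMAL_SAMPLE_RATE = 0.01  # forward ~1% of readings in the normal case

def should_forward(reading):
    """Decide at the edge whether a sensor reading goes upstream.

    Warnings always go upstream, with full detail; normal readings are
    sampled down to save bandwidth.
    """
    if reading["temp_c"] >= TEMP_WARN_C:
        reading["detail"] = "full"   # include full logs with the event
        return True
    return random.random() < NORMAL_SAMPLE_RATE

readings = [{"sensor": "engine-1", "temp_c": t} for t in (180.0, 251.3, 175.2)]
upstream = [r for r in readings if should_forward(r)]
print(upstream)  # the 251.3 reading is always forwarded
```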
>> When I think about the Hortonworks community, now we have to broaden our understanding, because you have a tight partnership with IBM, which obviously is well-established, huge, and global. Give us a sense for, as you guys have teamed more closely with IBM, how your community has changed or broadened or shifted in its focus. Or has it? >> I don't know that it's shifted the focus. I mean, IBM was already part of the Hadoop community. They were already contributing; obviously, they've contributed very heavily on projects like Spark and some of those, and they continue some of that contribution. So I wouldn't say that it's shifted it; it's just that we are working more closely together as we both contribute to those communities, working more closely together to present solutions to our mutual customer base. But I wouldn't say it's really shifted the focus for us. >> Right, right. Now, we're in Europe right now, but it doesn't matter that we're in Europe: GDPR is coming down fast and furious. Data Steward Studio, we had the demonstration today; it was announced yesterday. And it looks like a really good tool for the main requirements for compliance, which are to discover and inventory your data and to set up what I like to refer to as a consent portal, so the data subject can then go and make a request to have their data forgotten and so forth. Give us a sense, going forward, for how or if Hortonworks, IBM, and others in your community are going to work towards greater standardization in the functional capabilities of the tools and platforms for enabling GDPR compliance. Because it seems to me that the industry is going to need some reference architecture for these kinds of capabilities, so that going forward your ecosystem of partners can build add-on tools in some common framework, like the one that was laid out today, which looks like a good basis. Is there anything that you're doing in terms of pushing towards more open source standardization in that area? >> Yes, there is. So actually one of my responsibilities is the technical management of our relationship with ODPi, which-- >> (James) Yes. >> --Mandy Chessell referenced yesterday in her keynote, and that is where we're working with IBM, with ING, with other companies to build exactly those standards, right? Because we do want to build it around Apache Atlas. We feel like that's a good tool for the basis of that, but we know, one, that some people are going to want to bring their own tools to it; they're not necessarily going to want to use that one platform. So we want to do it in an open way, so that they can still plug in their metadata repositories and communicate with others, and we want to build the standards on top of that for how you properly implement the features that GDPR requires, like the right to be forgotten. You know, what are the protocols around PII data? How do you prevent a breach? How do you respond to a breach? >> Will that all be under the umbrella of ODPi, that initiative of the partnership, or will it be a separate group? >> Well, certainly Apache Atlas is part of Apache and remains so. What ODPi is really focused on is that next layer up of how we engage, not the programmers, because programmers can engage really well at the Apache level, but the next level up. We want to engage the data professionals, the people whose job it is, the compliance officers, the people who don't sit and write code. And frankly, if you connect them to the engineers, there's just going to be an impedance mismatch in that conversation. >> You've got policy wonks and you've got tech wonks, so they understand each other at the wonk level. >> That's a good way to put it.
And so that's where ODPi really comes in: that group of compliance people speak a completely different language, but we still need to get them all talking to each other, as you said, so that there are specifications around how we do this, and what compliance is. >> Well, Alan, thank you very much. We're at the end of our time for this segment. This has been great; it's been great to catch up with you. Hortonworks has been evolving very rapidly, and it seems to me that, going forward, you're well-positioned now for the new GDPR age to take your overall solution portfolio, your partnerships, and your capabilities to the next level, in an open source framework. In many ways, though, you're not entirely, 100%, purely open source; nobody is. You're still very much focused on open frameworks for building very scalable solutions for enterprise deployment. Well, this has been Jim Kobielus with Alan Gates of Hortonworks, here on theCUBE at DataWorks Summit 2018 in Berlin. We'll be back fairly quickly with another guest, and thank you very much for watching our segment. (techno music)

Published Date : Apr 19 2018


Scott Gnau, Hortonworks | Dataworks Summit EU 2018


 

(upbeat music) >> Announcer: From Berlin, Germany, it's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. >> Hi, welcome to theCUBE. We're separating the signal from the noise and tuning into the trends in data and analytics, here at DataWorks Summit 2018 in Berlin, Germany. This is the sixth year, I believe, that DataWorks has been held in Europe. Last year I believe it was in Munich; now it's in Berlin. It's a great show. The host is Hortonworks, and our first interviewee today is Scott Gnau, the chief technology officer of Hortonworks. Of course Hortonworks established itself about seven years ago as one of the up-and-coming startups commercializing a then brand-new technology called Hadoop and MapReduce. They've moved well beyond that in terms of their go-to-market strategy, their product portfolio, their partnerships. So Scott, this morning, it's great to have you. How are you doing? >> Glad to be back, and good to see you. It's been a while. >> You know, yes, I mean, you're an industry veteran. We've both been around the block a few times, but I remember you from years ago: you were at Teradata and I was at another analyst firm. And now you're with Hortonworks, and Hortonworks is really on a roll. I know you're not Rob Bearden, so I'm not going to go into the financials, but your latest financials look pretty good. You're growing, your deal sizes are growing, your customer base is continuing to deepen. So you guys are on a roll. So we're here in Europe, in Berlin in particular. It's five weeks, you did the keynote this morning, it's five weeks until GDPR. The sword of Damocles, the GDPR sword of Damocles. It's not just affecting European-based companies; it's affecting North American companies and others who do business in Europe. So your keynote this morning, your core theme, was that if you're an enterprise, your business strategy is now equated with your cloud strategy, is really equated with your data strategy. And where GDPR comes into the picture is the fact that protecting data, the personal data of your customers, is absolutely important; in fact it's imperative and mandatory, and will be in five weeks, or you'll face a significant penalty if you're not managing that data and providing customers with the right to have it erased, or the right to withdraw consent to have it profiled, and so forth. So enterprises all over the world, especially in Europe, are racing as fast as they can to get compliant with GDPR by the May 25th deadline. So, one of the things you discussed this morning: you had an announcement overnight that Hortonworks has released a new solution in technical preview called the Data Steward Studio, and I'm wondering if you can tie that announcement to GDPR. It seems like data stewardship would have a strong value for your customers. >> Yeah, there's definitely a big tie-in. GDPR is certainly creating a milestone, kind of a trigger, for people to really think about their data assets. But it's certainly even larger than that, because when you think about driving digitization of a business, driving new business models, and connecting data and finding new use cases, it's all about finding the data you have, understanding what it is, where it came from, what the lineage of it is, who had access to it, what did they do to it. These are all governance kinds of things, which are also now mandated by laws like GDPR.
And so it's all really coming together in the context of the new modern data architecture era that we live in, where a lot of the data that we have access to, we didn't create. It was created outside the firewall, by a device, by some application running with some customer, and so capturing and interpreting and governing that data is very different from taking derivative transactions from an ERP system, which are already adjudicated and understood, and governing that kind of data structure. And so this is a need that's driven from many different perspectives. It's driven from the new architecture, the way IoT devices are connecting and just creating a data bomb; that's one thing. It's driven by business use cases, just saying, what are the assets that I have access to, and how can I try to determine patterns between those assets, where I didn't even create some of them, so how do I adjudicate that? >> Discovering and cataloging your data-- >> Discovering it, cataloging it, actually even... when I think about data, just think of the files on my laptop that I created, and I don't remember what half of them are. So creating the metadata, creating that trail of breadcrumbs that lets you piece together what's there, what the relevance of it is, and how you might then use it for some correlation. And then you get in, obviously, to the regulatory piece that says, sure, if I'm a new customer and I ask to be forgotten, the only way that you can guarantee to forget me is to know where all of my data is. >> If you remember that they are your customer in the first place, and you know where all that data is, if you're even aware that it exists: that's the first and foremost thing for an enterprise, to be able to assess their degree of exposure to GDPR. >> So, right. It's like a whole new use case. It's a microcosm of all of these really big things that are going on. And so what we've been trying to do is really leverage our expertise in metadata management using the Apache Atlas project. >> Interviewer: You and IBM have done some major work-- >> We work with IBM and the community on Apache Atlas. You know, metadata tagging is not the most interesting topic for some people, but in the context that I just described, it's kind of important. And so I think one of the areas where we can really add value for the industry is leveraging our lowest-common-denominator, open source, open community kind of development to really create a standard infrastructure, a standard open infrastructure for metadata tagging, into which all of these use cases can now plug. Whether it's I want to discover data and create metadata about the data based on patterns that I see in the data, or I've inherited data and I want to ensure that the metadata stays with that data through its life cycle, so that I can guarantee the lineage of the data and be compliant with GDPR-- >> And in fact, tomorrow we will have Mandy Chessell from IBM, a key Hortonworks partner, discussing the open metadata framework you're describing and what you're doing. >> And that was part of this morning's keynote close also; it all really flowed nicely together. Anyway, it really is a perfect storm. So what we've done is we've said, let's leverage this lowest common denominator, standard metadata tagging, Apache Atlas, and uplevel it, and not have it be part of a cluster, but actually have it be a cloud service that can be enforced across multiple data stores, whether they're in the cloud or whether they're on prem.
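That trail of breadcrumbs is queryable once entities are registered: Atlas can walk the graph of processes that produced a data set. Here is a hedged sketch in Python of fetching lineage for one entity; the host, credentials, and GUID are placeholders, with the endpoint following the commonly deployed Atlas v2 REST API:

```python
import requests

ATLAS = "http://atlas-host:21000/api/atlas/v2"  # assumed host, default Atlas port
AUTH = ("admin", "admin")                       # assumed credentials
guid = "00000000-0000-0000-0000-000000000000"   # placeholder entity GUID

# Walk lineage in both directions, up to three hops away.
resp = requests.get(f"{ATLAS}/lineage/{guid}",
                    auth=AUTH,
                    params={"direction": "BOTH", "depth": 3})
resp.raise_for_status()
lineage = resp.json()

# Each relation is an edge: a process read one entity and wrote another.
for edge in lineage.get("relations", []):
    print(edge["fromEntityId"], "->", edge["toEntityId"])
```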
>> Interviewer: That's the Data Steward Studio? >> Well, DataPlane and Data Steward Studio really enable those things to come together. >> So the Data Steward Studio is the second service-- >> Like an app. >> --under the Hortonworks DataPlane Service. >> Yeah, so the whole idea is to be able to tie those things together, and when you think about it in today's hybrid world, and this is where I really started, where your data strategy is your cloud strategy: they can't be separate, because if they're separate, just think about what would happen. So I've copied a bunch of data out to the cloud; all memory of any lineage is gone. Or I've got to go set up, manually, another set of lineage that may not be the same as the lineage it came with. And so being able to provide that common service across footprints, whether it's multiple data centers, whether it's multiple clouds, or both, is a really huge value, because now you can sit back and, through that single pane, see all of your data assets and understand how they interact. That obviously has the ability then to provide value, like with Data Steward Studio, to discover assets, maybe to discover duplicate assets, where, hey, I can save some money if I get rid of this cloud instance, because it's over here already. Or to be compliant and say, yeah, I've got these assets here, here, and here; I am now compelled to do whatever: delete, protect, encrypt. I can now go do that and keep a record, through the metadata, that I did it. >> Yes, in fact that is very much at the heart of compliance: you've got to know what assets are out there. And it seems to me that Hortonworks is increasingly... the H-word rarely comes up these days. >> Scott: Not Hortonworks; you're talking about Hadoop. >> Hadoop rarely comes up these days. When the industry talks about you guys, it's known that's your core, that's your base, that's where HDP came from, and so forth: great product, great distro. In fact, in your partnership with IBM a year or more ago, I think IBM standardized on HDP in lieu of their own distro, because it's so well-established, so mature. But going forward, Hortonworks, you have positioned yourselves now... Wikibon sees you as being the premier solution provider of big data governance solutions, specifically focused on multi-cloud, on structured data, and so forth. So the announcement today of the Data Steward Studio very much builds on that capability you already have there. Going forward, can you give us a sense of your roadmap in terms of building out the DataPlane Service? Because this is the second of these services under the DataPlane umbrella. Give us a sense for how you'll continue to deepen your governance portfolio in DataPlane. >> Really, the way to think about it: there are a couple of things that you touched on that I think are really critical, certainly for me, and for us at Hortonworks, to continue to repeat, just to make sure the message gets there. Number one, Hadoop is definitely at the core of what we've done, and was kind of the secret sauce: some very different stuff in the technology, also the fact that it's open source and community, all those kinds of things. But that really created a foundation that allowed us to build the whole beginning of big data data management. And we added to and expanded the traditional Hadoop stack by adding data-in-motion. And so what we've done is-- >> Interviewer: NiFi, I believe; you made a major investment.
>> Yeah, so we made a large investment in Apache NiFi, as well as Storm and Kafka, as kind of a group of technologies. And the whole idea behind doing that was to expand our footprint so that we would enable our customers to manage their data through its entire lifecycle: from being created at the edge, all the way through streaming technologies, to landing, to analytics, and then even analytics being pushed back out to the edge. So it's really about having that common management infrastructure for the lifecycle of all the data, including Hadoop and many other things. And then in that, obviously, whether it be regulation or, frankly, future functionality, there's an opportunity to uplevel those services from an overall security and governance perspective. And just like Hadoop kind of upended traditional thinking... and what I mean by that was not the economics of it, specifically, but just the fact that you could land data without describing it. That seemed so unimportant at one time, and now it's the key thing that drives the difference. Think about sensors that are sending in data, that reconfigure firmware, and those streams change: being able to acquire data and then assess the data is a big deal. So the same thing applies, then, to how we apply governance. I said this morning: traditional governance was, hey, this employee has started, they have access to this file, this file, this file, and nothing else. They don't know what else is out there; they only have access to what their job title describes. And that's traditional data governance. In the new world, that doesn't work. Data scientists need access to all of the data. Now, that doesn't mean we need to give away PII. We can encrypt it, we can tokenize it, but we keep referential integrity. We keep the integrity of the original structures, and those who have a need to actually see the PII can get the token and see the PII. But it's governance thought of inversely from how it's been thought about for 30 years. >> It's great that you've worked governance into an increasingly streaming, real-time, in-motion data environment. Scott, this has been great. It's been great to have you on theCUBE. You're an alum of theCUBE; I think we've had you at least two or three times over the last few years. >> It feels like 35. Nah, it's pretty fun. >> Yeah, you've been great. So we are here at DataWorks Summit in Berlin. (upbeat music)
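Scott's closing point, tokenize the PII but keep referential integrity, works because the tokenization is deterministic: the same value always maps to the same token, so joins and aggregates still line up. A minimal sketch in Python; the key handling and token format are assumptions for illustration, since real systems keep keys in a vault and often use format-preserving schemes:

```python
import hmac, hashlib

SECRET_KEY = b"example-key-from-a-vault"  # assumed; a real system uses a KMS

def tokenize(value: str) -> str:
    """Replace a PII value with a deterministic token.

    The same input always yields the same token, so joins and counts
    across data sets still line up (referential integrity), while the
    original value stays hidden from anyone without the key service.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

orders = [("alice@example.com", 120.50), ("bob@example.com", 80.00),
          ("alice@example.com", 42.10)]

tokenized = [(tokenize(email), amount) for email, amount in orders]

# Both Alice rows carry the same token, so per-customer aggregation still works.
assert tokenized[0][0] == tokenized[2][0]
print(tokenized)
```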

Published Date : Apr 18 2018



Wikibon Action Item | De-risking Digital Business | March 2018


 

>> Hi, I'm Peter Burris. Welcome to another Wikibon Action Item. (upbeat music) We're once again broadcasting from theCUBE's beautiful Palo Alto, California studio. I'm joined here in the studio by George Gilbert and David Floyer. And then remotely, we have Jim Kobielus, David Vellante, Neil Raden and Ralph Finos. Hi guys. >> Hey. >> Hi. >> How are you all doing? >> This is a great, great group of people to talk about the topic we're going to talk about, guys. We're going to talk about the notion of de-risking digital business. Now, the reason why this becomes interesting is, the Wikibon perspective for quite some time has been that the difference between business and digital business is the role that data assets play in a digital business. Now, if you think about what that means: every business institutionalizes its work around what it regards as its most important assets. A bottling company, for example, organizes around the bottling plant. A financial services company organizes around the regulatory impacts or limitations on how they share information and what is regarded as fair use of data and other resources and assets. The same thing exists in a digital business. There's a difference between, say, Sears and Walmart. Walmart makes use of data differently than Sears, and the specific assets that are employed have had a significant impact on how the retail business is structured. Along comes Amazon, which is even deeper in the use of data as a basis for how it conducts its business, and Amazon is institutionalizing work in quite different ways and has been incredibly successful. We could go on and on and on with a number of different examples of this, and we'll get into that. But what it means ultimately is that the tie between data and what is regarded as valuable in the business is becoming increasingly clear, even if it's not perfect. And so traditional approaches to de-risking data, through backup and restore, now need to be re-thought so that it's not just de-risking the data, it's de-risking the data assets. And, since those data assets are so central to the business operations of many of these digital businesses, it means de-risking the whole business. So, David Vellante, give us a starting point. How should folks think about this different approach to envisioning business? And digital business, and the notion of risk? >> Okay thanks Peter, I mean I agree with a lot of what you just said and I want to pick up on that. I see the future of digital business as really built around data, sort of agreeing with you, building on what you just said. Really where organizations are putting data at the core, and increasingly I believe that organizations that have traditionally relied on human expertise as the primary differentiator will be disrupted by companies where data is the fundamental value driver, and I think there are some examples of that and I'm sure we'll talk about it. And in this new world humans have expertise that leverages the organization's data model and creates value from that data with augmented machine intelligence. I'm not crazy about the term artificial intelligence. And you hear a lot about data-driven companies, and I think such companies are going to have a technology foundation that is increasingly described as autonomous, aware, anticipatory, and importantly in the context of today's discussion, self-healing. So able to withstand failures and recover very quickly.
So de-risking a digital business is going to require new ways of thinking about data protection and security and privacy. Specifically as it relates to data protection, I think it's going to be a fundamental component of the so-called data-driven company's technology fabric. This can be designed into applications, into data stores, into file systems, into middleware, and into infrastructure, as code. And many technology companies are going to try to attack this problem from a lot of different angles, trying to infuse machine intelligence into the hardware, software, and automated processes. And the premise is that many companies will architect their technology foundations, not as a set of remote cloud services that they're calling, but rather as a ubiquitous set of functional capabilities that largely mimic a range of human activities, including storing, backing up, and virtually instantaneous recovery from failure. >> So let me build on that. So what you're kind of saying, if I can summarize, and we'll get into whether or not it's human expertise or some other approach or notion of business. But you're saying that increasingly patterns in the data are going to have absolute consequential impacts on how a business ultimately behaves. We got that right? >> Yeah absolutely. And how you construct that data model, and provide access to the data model, is going to be a fundamental determinant of success. >> Neil Raden, does that mean that people are no longer important? >> Well no, no I wouldn't say that at all. I was talking with the head of a medical school a couple of weeks ago, and he said something that really resonated. He said that there're as many doctors who graduated at the bottom of their class as the top of their class. And I think that's true of organizations too. You know what, 20 years ago I had the privilege of interviewing Peter Drucker for an hour, and he foresaw this, 20 years ago. He said that people who run companies have traditionally had IT departments that provided operational data, but they needed to start to figure out how to get value from that data, and not only get value from that data but get value from data outside the company, not just internal data. So he kind of saw this big data thing happening 20 years ago. Unfortunately, he had a prejudice for senior executives. You know, he never really thought about any other people in an organization except the highest people. And I think what we're talking about here is really the whole organization. I think that, I have some concerns about the ability of organizations to really implement this without a lot of fumbles. I mean it's fine to talk about the five digital giants, but there's a lot of companies out there where, you know, the bar isn't really that high for them to stay in business. And they just seem to get along. And I think if we're going to de-risk, we really need to help companies understand the whole process of transformation, not just the technology. >> Well, take us through it. What is this process of transformation? That includes the role of technology but is bigger than the role of technology. >> Well, it's like anything else, right. There has to be communication, there has to be some element of control, there has to be a lot of flexibility, and most importantly I think there has to be acceptability, by the people who are going to be affected by it, that it is the right thing to do.
And I would say you start with assumptions, I call it assumption analysis, in other words let's all get together and figure out what our assumptions are, and see if we can't line 'em up. Typically IT is not good at this. So I think it's going to require the help of a lot of practitioners who can guide them. >> So Dave Vellante, reconcile one point that you made. I want to come back to this notion of how we're moving from businesses built on expertise and people to businesses built on expertise resident as patterns in the data, or data models. Why is it that the most valuable companies in the world seem to be the ones that have the most real hardcore data scientists? Isn't that expertise and people? >> Yeah it is, and I think it's worth pointing out. Look, the stock market is volatile, but right now the top-five companies: Apple, Amazon, Google, Facebook and Microsoft, in terms of market cap, account for about $3.5 trillion, and there's a big distance between them, and they've clearly surpassed the big banks and the oil companies. Now again, that could change, but I believe that it's because they are data-driven. So called data-driven. Does that mean they don't need humans? No, but human expertise surrounds the data, whereas at most companies human expertise is at the center and the data lives in silos, and I think it's very hard to protect data, and leverage data, that lives in silos. >> Yes, so here's where I'll take exception to that, Dave. And I want to get everybody to build on top of this just very quickly. I think that human expertise has surrounded, in other businesses, the buildings. Or, the bottling plant. Or, the wealth management. Or, the platoon. So I think that the organization of assets has always been the determining factor of how a business behaves, and we institutionalized work, in other words where we put people, based on the business' understanding of assets. Do you disagree with that? Is that, are we wrong in that regard? I think data scientists are an example of reinstitutionalizing work around a very core asset, in this case, data. >> Yeah, you're saying that the most valuable asset is shifting from some of those physical assets, the bottling plant et cetera, to data. >> Yeah we are, we are. Absolutely. Alright, David Floyer. >> Neil: I'd like to come in. >> Panelist: I agree with that too. >> Okay, go ahead Neil. >> I'd like to give an example from the news. Cigna's acquisition of Express Scripts for $67 billion. Who the hell is Cigna, right? Connecticut General is just a sleepy life insurance company, and INA was a second-tier property and casualty company. They merged a long time ago, they got into health insurance, and suddenly, who's Express Scripts? I mean that's a company that nobody ever even heard of. They're a pharmacy benefit manager, what is that? They're an information management company, period. That's all they do. >> David Floyer, what does this mean from a technology standpoint? >> So I wanted to emphasize one thing that evolution has always taught us: you have to be able to come from where you are. You have to be able to evolve from where you are and take the assets that you have. And the assets that people have are their current systems of record, other things like that. They must be able to evolve into the future to better utilize what those systems are. And the other thing I would like to say-- >> Let me give you an example just to interrupt you, because this is a very important point.
One of the primary reasons why the telecommunications companies, who so many people, analysts included, believed had this fundamental advantage because so much information's flowing through them, is that when you're writing assets off over 30 years, that kind of locks you into an operational mode, doesn't it? >> Exactly. And the other thing I want to emphasize is that the most important thing is sources of data, not the data itself. So for example, real-time data is very very important. So what is the source of your real-time data? If you've given that away to Google or your IOT vendor, you have made a fundamental strategic mistake. So understanding the sources of data, making sure that you have access to that data, is going to enable you to build the sorts of processes that data digitization requires. >> So let's turn that concept into kind of a Geoffrey Moore strategy bromide. At the end of the day you look at your value proposition, and then what activities are central to that value proposition, and what data is thrown off by those activities and what data's required by those activities. >> Right, both internal-- >> We got that right? >> Yeah. Both internal and external data. What are those sources that you require? Yes, that's exactly right. And then you need to put together a plan which takes you from where you are and the sources of data you have, and then focuses on how you can use that data to either improve revenue or to reduce costs, or a combination of those two things, as a series of specific exercises. And in particular, using that data to automate in real-time as much as possible. That to me is the fundamental requirement to actually be able to do this and make money from it. If you look at every example, it's all real-time. It's real-time bidding at Google, it's real-time allocation of resources by Uber. That is what people need to focus on. So it's those steps, practical steps, that organizations need to take that I think we should be giving a lot of focus on. >> You mention Uber. David Vellante, we're not just talking, once again, about the Uberization of things, are we? Or is that what we mean here? So, what we'll do is we'll turn the conversation very quickly over to you, George. And there are existing today a number of different domains where we're starting to see a new emphasis on how we start pricing some of this risk. Because when we think about de-risking as it relates to data, give us an example of one. >> Well we were talking earlier, in financial services, risk itself is priced just the way time is priced, in terms of what premium you'll pay in terms of interest rates. But there's also something that's softer that's come into much more widely-held consciousness recently, which is reputational risk. Which is different from operational risk. Reputational risk is about, are you a trusted steward for data? Some of that could be personal information, and a use case that's very prominent now with the European GDPR regulation is, you know, if I ask you as a consumer or an individual to erase my data, can you say with extreme confidence that you have? That's just one example. >> Well I'll give you a specific number on that. We've mentioned it here on Action Item before. I had a conversation with a Chief Privacy Officer a few months ago who told me that they had priced out what the fines to Equifax would have been had the problem occurred after GDPR fines were enacted. It was $160 billion, was the estimate.
There's not a lot of companies on the planet that could deal with a $160 billion liability. Like that. >> Okay, so we have a price now that might have been kind of, sort of mushy before. And the notion of trust hasn't really changed over time; what's changed is the technical implementations that support it. And in the old world with systems of record, we basically collected as much data as we could from our operational applications, put it in the data warehouse and its data mart satellites, and we tried to govern it within that perimeter. But now we know that data basically originates and goes just about anywhere. There's no well-defined perimeter. It's much more porous, far more distributed. You might think of it as a distributed data fabric, and the only way you can be a trusted steward of that is if, across the silos, without trying to centralize all the data that's in them, you can enforce who's allowed to access it, what they're allowed to do, and audit who's done what to what type of data, when and where. And then there's a variety of approaches. Just to pick two: one is discovery-oriented, using machine learning to figure out what's going on with the data estate; Alation is an example. And then there's another example, which is where you try and get everyone to plug into what's essentially a new system catalog. It doesn't live in any one place in that distributed fabric, but it acts like the fabric for your data fabric. >> That's one of the properties of coming at this. But when we think, Dave Vellante, coming back to you for a second. When we think about the conversation, there's been a lot of presumption or a lot of bromide. Analysts like to talk about, don't get Uberized. We're not just talking about getting Uberized. We're talking about something a little bit different, aren't we? >> Well yeah, absolutely. I think Uber's going to get Uberized, personally. But I think there's a lot of evidence, I mentioned the big five, but if you look at Spotify, Waze, Airbnb, yes Uber, yes Twitter, Netflix, Bitcoin is an example, 23andMe. These are all examples of companies that, I'll go back to what I said before, are putting data at the core and building human expertise around that core to leverage that expertise. And I think it's easy to sit back, for some companies to sit back and say, "Well I'm going to wait and see what happens." But to me anyway, there's a big gap between kind of the haves and the have-nots. And I think that that gap is around applying machine intelligence to data and applying cloud economics. Zero marginal cost economics and the API economy. An always-on sort of mentality, et cetera, et cetera. And that's what the economy, in my view anyway, is going to look like in the future. >> So let me put out a challenge. Jim, I'm going to come to you in a second, very quickly, on some of the things that start looking like data assets. But today, when we talk about data protection, we're talking about simply a whole bunch of applications and a whole bunch of devices just spinning that data off, so we have it at a third site, and then, if there's a catastrophe, you know, large or small, being able to restore it, often in hours or days. So we're talking about an improvement on RPO and RTO, but when we talk about data assets, and I'm going to come to you in a second with that, David Floyer, but when we talk about data assets, we're talking about not only the data, the bits.
We're talking about the relationships and the organization, and the metadata, as being a key element of that. So David, I'm sorry, Jim Kobielus, just really quickly, thirty seconds. Models, what do they look like? What does the new nature of some of these assets look like? >> Well, the new nature of these assets is the machine learning models that are driving so many business processes right now. And so really the core assets there are the data, obviously, from which they are developed and also from which they are trained. But also very much the knowledge of the data scientists and engineers who build and tune this stuff. And so really, what you need to do is, you need to protect that knowledge and grow that knowledge base of data science professionals in your organization, in a way that builds on it. And hopefully you keep the smartest people in house. And they can encode more of their knowledge in automated programs to manage the entire pipeline of development. >> We're not talking about files. We're not even talking about databases, are we, David Floyer? We're talking about something different: algorithms and models. Is today's technology really set up to do a good job of protecting the full organization of those data assets? >> I would say that they're not even being thought about yet. And going back to what Jim was saying, those data scientists are the only people who understand that, in the same way as in the year 2000 the COBOL programmers were the only people who understood what was going on inside those applications. And we as an industry have to allow organizations to be able to protect the assets inside their applications, and use AI, if you like, to actually understand what is in those applications and how they are working. And I think an incredibly important part of de-risking is ensuring that you're not dependent on a few experts who could leave at any moment, in the same way as the COBOL programmers could have left. >> But it's not just the data, and it's not just the metadata, it really is the data structure. >> It is the model. Just the whole way that this has been put together and the reason why. And the ability to continue to upgrade that and change that over time. So those assets are incredibly important, but at the moment there isn't technology available for you to actually protect those assets. >> So if I combine what you just said with what Neil Raden was talking about, David Vellante's put forward a good vision of what's required. Neil Raden's made the observation that this is going to be much more than technology. There's a lot of change, not change management at a low level inside IT, but business change, and the technology companies also have to step up and be able to support this. We're seeing this, we're seeing a number of different vendor types start to enter into this space. Certainly storage guys, Dylon Sears talking about doing a better job of data protection; we're seeing middleware companies, TIBCO and DISCO, talk about doing this differently. We're seeing file systems, Scality, WekaIO, talk about doing this differently. Backup and restore companies, Veeam, Veritas. I mean, everybody's looking at this and they're all coming at it. Just really quickly, David, where's the inside track at this point? >> For me, there is so much whitespace as to be unbelievable. >> So nobody has an inside track yet. >> Nobody has an inside track. Just to start with a few things. It's clear that you should keep data where it is.
The cost of moving data around an organization from inside to out is crazy. >> So companies that keep data in place, or technologies to keep data in place, are going to have an advantage. >> Much, much, much greater advantage. Sure, there must be backups somewhere. But you need to keep the working copies of data where they are, because it's the real-time access, usually, that's important. So if it originates in the cloud, keep it in the cloud. If it originates in a data provider, on another cloud, that's where you should keep it. If it originates on your premises, keep it where it originated. >> Unless you need to combine it. But that's a new origination point. >> Then you're taking subsets of that data and combining them as needed. So that would be my first point. So organizations are going to need to put together what George was talking about, this metadata of all the data, how it interconnects, how it's being used. The flow of data through the organization; it's amazing to me that when you go to an IT shop they cannot define for you how the data flows through that data center or that organization. That's the requirement that you have to have, and AI is going to be part of that solution, of looking at all of the applications and the data and telling you where it's going and how it's working together. >> So the second thing would be companies that are able to build or conceive of networks as data will also have an advantage. And I think I'd add a third one: companies that demonstrate a real, sustained understanding of the unbelievable change that's required. You can't just say, oh, Facebook wants this, therefore everybody's going to want it. There's going to be a lot of push marketing that goes on on the technology side. Alright, so let's get to some Action Items. David Vellante, I'll start with you. Action Item. >> Well, the future's going to be one where systems see, they talk, they sense, they recognize, they control, they optimize. It may be tempting to say, you know what, I'm going to wait, I'm going to sit back and wait to figure out how I'm going to close that machine intelligence gap. I think that's a mistake. I think you have to start now, and you have to start with your data model. >> George Gilbert, Action Item. >> I think you have to keep in mind the guardrails related to governance, and trust, when you're building applications on the new data fabric. And you can take the approach of a platform-oriented one, where you're plugging into an API, like Apache Atlas, that Hortonworks is driving, or a discovery-oriented one, as David was talking about, which would be something like Alation, using machine learning. But if, let's say, the use case starts out as IOT, edge analytics, and cloud inferencing, that data science pipeline itself has to now be part of this fabric. Including the output of design time, meaning the models themselves, so they can be managed. >> Excellent. Jim Kobielus, you've been pretty quiet but I know you've got a lot to offer. Action Item, Jim. >> I'll be very brief. What you need to do is protect your data science knowledge base. That's the way to de-risk this entire process. And that involves more than just a data catalog. You need a data science expertise registry within your distributed value chain. And you need to manage that as a very human asset that needs to grow. That is your number one asset going forward. >> Ralph Finos, you've also been pretty quiet. Action Item, Ralph.
>> Yeah, I think you've got to be careful about what you're trying to get done. It depends on your industry, whether it's finance or whether it's the entertainment business; there are different requirements about data in those different environments. And you need to be cautious about that, and you need leadership on the executive business side of things. The last thing in the world you want to do is depend on data scientists to figure this stuff out. >> And I'll give you the second to last answer or Action Item. Neil Raden, Action Item. >> I think there's been a lot of progress lately in creating tools for data scientists to be more efficient, and they need to be, because the big digital giants are draining them from other companies. So that's very encouraging. But in general I think becoming a data-driven, digital transformation company is, for most companies, a big job, and I think they need to do it in piece parts, because if they try to do it all at once they're going to be in trouble. >> Alright, so that's a great conversation, guys. Oh, David Floyer, Action Item. David's looking at me saying, ah, what about me? David Floyer, Action Item. >> (laughing) So my Action Item comes from an Irish proverb: if you ask for directions, they will always answer you, "I wouldn't start from here." So the Action Item that I have is, if somebody is coming in saying you have to re-do all of your applications and re-write them from scratch, and start in a completely different direction, that is going to be a 20-year job and you're not going to ever get it done. So you have to start from what you have, the digital assets that you have, and you have to focus on improving those with additional applications and additional data, using that as the foundation for how you build that business, with a clear long-term view. And if you look at some of the examples that were given earlier, particularly in the insurance industry, that's what they did. >> Thank you very much, guys. So, let's do an overall Action Item. We've been talking today about the challenges of de-risking digital business, which ties directly to the overall understanding of the role that data assets play in businesses, and the technology's ability to move from just protecting data, restoring data, to actually restoring the relationships in the data, the structures of the data, and, very importantly, the models that are resident in the data. This is going to be a significant journey. There's clear evidence that this is driving a new valuation within the business. Folks talk about data as the new oil. We don't necessarily see things that way, because data, quite frankly, is a very very different kind of asset. It can be shared, because it doesn't suffer the same limits on scarcity. So as a consequence, what has to happen is, you have to start with where you are. What is your current value proposition? And what data do you have in support of that value proposition? And then whiteboard it, clean slate it, and say, what data would we like to have in support of the activities that we perform? Figure out what those gaps are. Find ways to get access to that data through piecemeal, piece-part investments that provide a roadmap of priorities looking forward. Out of that will come a better understanding of the fundamental data assets that are being created. New models of how you engage customers. New models of how operations works on the shop floor. New models of how financial services are being employed and utilized.
And use that as a basis for then starting to put forward plans for bringing technologies in that are capable of not just supporting the data and protecting the data, but protecting the overall organization of data in the form of these models, in the form of these relationships, so that the business can, as it creates these, as it throws off these new assets, treat them as the special resource that the business requires. Once that is in place, we'll start seeing businesses more successfully reorganize, reinstitutionalize the work around data, and it won't just be the big technology companies, the ones people call digital natives, that are well down this path. I want to thank George Gilbert, David Floyer here in the studio with me. David Vellante, Ralph Finos, Neil Raden and Jim Kobielus on the phone. Thanks very much, guys. Great conversation. And that's been another Wikibon Action Item. (upbeat music)
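A thread running through these action items, the metadata catalog George describes and the model assets Jim and David describe, can be sketched in a few lines. The toy registry below is a hypothetical illustration, not any vendor's product; all asset names and tags are made up. It records datasets and models as assets with lineage edges, so an audit question like "does this model depend on PII anywhere upstream?" becomes a simple graph walk.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    kind: str                               # "dataset" or "model"
    owner: str
    tags: set = field(default_factory=set)  # e.g. {"pii"}

class Catalog:
    """Toy governance catalog: assets plus lineage edges between them."""

    def __init__(self):
        self.assets = {}    # name -> Asset
        self.lineage = []   # (upstream_name, downstream_name) pairs

    def register(self, asset, derived_from=()):
        """Record an asset and the assets it was derived from."""
        self.assets[asset.name] = asset
        for upstream in derived_from:
            self.lineage.append((upstream, asset.name))

    def upstream_of(self, name):
        """Every asset this one was derived from, directly or transitively."""
        direct = [u for u, d in self.lineage if d == name]
        found = list(direct)
        for u in direct:
            found.extend(self.upstream_of(u))
        return found

catalog = Catalog()
catalog.register(Asset("raw_customers", "dataset", "ops", {"pii"}))
catalog.register(Asset("features_v3", "dataset", "data-eng"),
                 derived_from=["raw_customers"])
catalog.register(Asset("churn_model_v3", "model", "data-science"),
                 derived_from=["features_v3"])

# Audit question: does the model depend on PII-tagged data upstream?
upstream = catalog.upstream_of("churn_model_v3")
print(any("pii" in catalog.assets[a].tags for a in upstream))  # True
```

The design point is that the model is registered exactly like a dataset: once models are first-class entries with lineage, the same catalog answers both compliance questions and the "which experts built this, from what?" questions the panel raises.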

Published Date : Mar 16 2018



Rob Bearden, Hortonworks & Rob Thomas, IBM | BigData NYC 2017


 

>> Announcer: Live from Midtown Manhattan, it's theCUBE. Covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay, welcome back, everyone. We're here live in New York City for BigData NYC, our annual event with SiliconANGLE Media, theCUBE, and Wikibon, in conjunction with Strata Hadoop, which is now called Strata Data as that show evolves. I'm John Furrier, cohost of theCUBE, with Peter Burris, head of research for SiliconANGLE Media, and General Manager of Wikibon. Our next two guests are two legends in the big data industry, Rob Bearden, the CEO of Hortonworks, really one of the founders of the big data movement, you know, got Cloudera and Hortonworks, really kind of built that out, and Rob Thomas, General Manager of IBM Analytics. Both of them have made big-time investments. Congratulations on your success, guys. Welcome back to theCUBE, great to see you guys! >> Great to see you. >> Great, yeah. >> And got an exciting partnership to talk about, as well. >> So, but let's do a little history, you guys, obviously, I want to get to that, and get clarified on the news in a second, but you guys have been there from the beginning, kind of looking at the market, developing it, almost from the embryonic state to now. I mean, what a changeover. Give a quick comparison of where we've come from and what's the current landscape now, because you have, it evolved into so much more. You got IOT, you got AI, you have a lot of things in the enterprise. You've got cloud computing. A lot of tailwinds for this industry. It's gotten bigger. It's become big and now it's huge. What are your thoughts, guys? >> You know, so you look at arcs, and really all this started with Hadoop, and Rob and I met early in the days of that. You've kind of gone from, the early few years were about optimizing operations. Hadoop is a great way for a company to become more efficient, take out costs in their data infrastructure, and so that put huge momentum into this area, and now we've kind of fast-forwarded to the point where now it's about, "So how am I actually going to extract insight?" So instead of just getting operational advantages, how am I going to get competitive advantage, and that's about bringing the world of data science and machine learning, running it natively on Hadoop; that's the next chapter, and that's what Rob and I are working closely together on. >> Rob, your thoughts, too? You know, we've been talking about data in motion. You guys were early on in that, seeing that trend. Real time is still hot. Data is still the core asset people are trying to figure out and move from wrangling to actually enabling that data. >> Right. Well, you know, in the early days of Big Data, it was, to Rob's point, it was very much about bringing operational leverage and efficiency and being able to aggregate very siloed data sets, and unlocking that data and bringing it into a central platform.
In the early days, the resources went to making Hadoop an enterprise-viable data platform, with security, governance, and operations management capabilities that mirrored any of the proprietary transactional or EDW platforms, and the lesson learned in that was that by bringing all that data together in a central data set, we now can understand what's happening with our customers, and with our other assets, pre-transaction, and so we can become very prescriptive in engaging in new business models. And so what we've learned now is that the further upstream we can get in the world of IOT, and bring that data under management from the point of origination and be able to manage it all the way through its life cycle, we can create new business models with higher velocity of engagement and a lot more rapid value that gets created. It, though, creates a number of new challenges in all the areas of how you secure that data, how you bring governance across that entire life cycle from a common stream set. >> Well, let's talk about the news you guys have. Obviously, the partnership. Partnerships have become the new normal in the open source era that we're living in. We're seeing open source software grow exponentially in the forecasts for the next five and ten years, with exponential growth in new code. Just new people coming on board, new developers, dev ops is mainstream. Partnerships are key for communities. 90% of the code is going to be open source, 10% differentiated, the code sandwich, as Jim Zemlin, the executive director of the Linux Foundation, puts it, and you're seeing that work. You guys have worked together with Apache Atlas. What's the news, what's the relationship with Hortonworks and IBM? Share the news. >> So, a lot of great work's been happening there, generally in the open source community, around Apache Atlas, and making sure that we're bringing mission-critical governance capabilities across the big data sets and environments. As we then get into the complexity of now multiple data lakes, multiple tiers of data coming from multiple sources, that brings a higher level of requirement in both the security and governance aspects, and that's where the partnership with IBM is continuing to drive Apache Atlas into mission critical enterprise viability, but then when we get into the distributed models and enterprise requirements, the IBM platforms leveraging Atlas and what we're doing together then take that into the mission critical enterprise capability. >> You got the open source, and now you got the enterprise. Rob, we've talked many times about the enterprise as a hard, hard environment to crack for, say, a startup, but even now, they're becoming reliant on open source, but yet they have a lot of operational challenges. How does this relate to the challenge of, you know, the CIO and his staff, now new personas coming in; you're seeing the data science role, you see it expanding from analytics to dev ops. A lot of challenges. >> Look, enterprises are getting better at this. Clearly we've seen progress the last five years on that, but to kind of go back and link the points, there's a phrase I heard that I like. It says, "There's no AI without IA," meaning information architecture. Fundamentally, what our partnership is about is delivering the right information architecture. So it's Hadoop federated with whatever you have in terms of warehouses and databases. We partner around IBM Common SQL for that.
It's metadata for your core governance, because without governance you don't have compliance, you can't offer self-service analytics, so we are forming what I would call the fluid data layer for an enterprise that enables them to get to this future of AI, and my view is there's a stop in between, which is data science, machine learning, applications that are ready today that clients can put into production and improve the outcomes they're getting. What we're focused on right now is how we take the information architecture we've been able to establish and then help clients on this journey. That's what enterprises want, because that's how they're going to build differentiation in their businesses. >> But the definition of an information architecture is closest to applications, and maybe this informs your perspective; it's close to the applications that the business is running on. Goes back to your observation about, "We used to be focused on optimizing operations." As you move away from those applications, your information architecture becomes increasingly diffuse. It's not as crystal clear. How do you drive that clarity, as the data moves to derive new applications? >> Rob and I have talked about this. I think we're at the dawn of probably a new era in application development. Much more agile, flexible applications that are taking advantage of data wherever it resides. We are really early in that. Right now we are in the "let's actually put machine learning and data science into practice, let's extract value from the data we've got" stage. That will then inform a new set of applications, which is related to the announcements that Hortonworks made this week around DataPlane, which is looking at multi-cloud environments and how you would manage applications and data across those. Rob, you can speak to that better than I can, I think. >> Well, the DataPlane thing, this information architecture, I think you're 100% right on. What we're hearing from customers in the enterprise is, they see the IOT buzz, oh, of course they're going to connect with IOT devices down the road, but when they see the security challenges, when they see the operational challenges around hiring people to actually run the dev ops, they have to then re-architect. So there's certainly a conversation we see on what is the architecture for the data, but also a little bit bigger than that, the holistic architecture of, say, cloud. So a lot of people are, like, trying to clean up their house, if you will, to be ready for this new era, and I think Wikibon, your private cloud report you guys put out really amplified that by saying, "Yeah, they see these trends, but they got to kind of get their act together." They got to look at who the staff is, what the data architecture's going to be, what apps are being developed, so doing a lot more retrenching. Given that, if we agree, what does that mean for the data plane, and then your vision of having that data architecture so that this will be a solid foundational transition? >> I think we all hit on the same point, which is that it's about enabling a next-generation IT architecture, sort of along both the X and the Y axis of the network, and generally what Big Data's been able to do, and Hadoop specifically, over the last five years, was enabling the existing application architectures, and, to use the term you coined, those were known processes on known technology; that's how applications have been enabled for the last 20 years.
Big Data and Hadoop generally have unlocked that ability to now be able to move all the way out to the edge and incorporate IOT, data at rest, data in motion, on-prem and cloud hybrid architecture. What that's done is said, "Now we know how to build an application that takes advantage of an event or an occurrence and then can drive an outcome in a variety of ways. We don't have to wait for a static programming model to automate a function." >> And in fact, if we wait, we're going to fail. That's one of the biggest challenges. I mean, IBM, I will tell you guys, or I'll tell you, Rob, that one of the craziest days I've ever spent is I flew from Japan to New York City for the IBM Information Architecture Announcement back in like 1994, and it was the most painful two days I've ever experienced in my entire life. That's a long time ago. It's ancient history. We can't use information architecture as a way of slowing things down. What we need to be able to do is introduce technology that, again, allows the clarity of information architecture close to these core applications to move, and that may involve things like machine learning itself being embedded directly into how we envision data being moved, how we envision optimization, how we envision the data plane working. So, as you guys think about this data plane, everybody ends up asking themselves, "Is there a natural place for data to be?" What's going to be centralized, what's going to be decentralized, and I'm asking you, is increasingly the data going to be decentralized, but the governance, security, and policies that we put in place going to be centralized, and is that what's going to inform the operation of the data plane? What do you guys think? >> It's our view, very specifically from Hortonworks' perspective, that we want to give the ability for the data to exist and reside wherever the physics dictate, whether that be on-prem, whether that be in the cloud, and we want to give the ability to process and take action on an event or an occurrence, or drive an outcome, as early in the cycle as possible. >> Describe what you mean by "early in the cycle." >> So, as we see conditions emerge. A machine part breaking down. A customer taking an action. A supply chain inventory outage. >> So as close as possible to the event that's generating the data. >> As it's being generated, or as the processes are leading up to the natural outcome, and we can maybe disintermediate for a better outcome. And so that means that we have to be able to engage with the data irrespective of where it is in its cycle, and that's where we've enabled, with DataPlane, the ability to abstract out the requirement of where that data is, and to be able to have a common plane, pun intended, for the operations and managing and provisioning of the environment, for being able to govern that and secure it, which are increasingly becoming intertwined, because you have to deal with it from point of origin through point at rest. >> The new phrase, "The single plane of glass." All joking aside, I want to just get your thoughts on this, Rob, too. "What's in it for me? I'm the customer. Right now I have a couple challenges." This is what we hear from the market.
"I need data consistency because things are happening in "real time; whatever events are going on with data, we know "more data's going to be coming out from the edge and "everywhere else, faster and more volume, so I need "consistency of my data, and I don't want "to have multiple data silos," and then they got to integrate the data, so on the application developer side, a dev ops-like ethos is emerging where, "Hey, if there's data being done, I need to integrate that "into my app in real time," so those are two challenges. Does the data plane address that concern for customers? That's the question. >> Today it enables the ops world. >> So I can integrate my apps into the data plane. >> My apps and my other data assets, irrespective of where they reside, on-prem, cloud, or out to the edge, and all points in between. >> Rob, for enterprise, is this going to be the single pane of glass for data governance? Is that how the vision that you guys see this, because that's a benefit. If that could happen, that's essentially one step towards the promised land, if you will, for more data flowing through apps and app developers. >> So let me reshape a little bit. There's two main problems that collectively we have to address for enterprises: one is they want to apply machine learning and data science at scale, and they're struggling with that, and two is they want to get the cloud, and it's not talked about nearly enough, but most clients are really struggling with that. Then you fast forward on that one, we are moving to a multi-cloud world, absolutely. I don't think any enterprise is going to standardize on a single cloud, that's pretty clear. So you need things like data plane that acknowledge it's a multi-cloud world, and even as you move to multi clouds, you want a single focus for your data governance, a single strategy for your data governance, and then what we're doing together with IBM Data Science Experience with Hortonworks, let's say, whatever data you have in there, you can now do your machine learning right where that data is. You don't need to move it around. You can if you want, but you don't have to move it around, 'cause it's built in, and it's integrated right into the Hadoop ecosystem. That solves the two main enterprise pain points, which is help me get the cloud, help me apply data science and machine learning. >> Well we'll have to follow up and we'll have to do just a segment just on that. I think multi-cloud is clearly the direction, but what the hell does that mean? If I run 365 on Azure, that's one app. If I run something else on Amazon, that's multiple clouds, not necessarily moving workloads across. So the question I want to ask here is, it's clear from customers they want single code bases that run on all clouds seamlessly so I don't have to scale up on things on Amazon, Azure, and Google. Not all clouds are created equal in how they do things. Storage, through ever, inside the data factories of how they process. That's a challenge. How do you guys see that playing out of, you have on-premise activities that have been bootstrapped. Now you have multiple clouds with different ways of doing things, from pipelining, ingestion and processing, and learning. How do you see that playing out? Clouds just kind of standardizing around data plane? >> There's also the complexity of even within the multi-clouds, you're going to have multiple tiers within the clouds, if you're running in one data center in Asia, versus one in Latin America, maybe a couple across the Americas. 
>> But as a customer, do I need to know the cloud internals of Amazon, Azure, and Google? >> You do. In a stand-alone world, yes you do. That's where we have to bring and abstract the complexity of that out, and that's the goal with data plane, is to be able to extract, whether it's, which tier it's in, on-prem, or whether it's on, irrespective of which cloud platform. >> But Rob Thomas, I really like the way you put it. There may be some other issues that users have to worry about, certainly there are some that we think, but the two questions of, "Where am I going to run the machine learning," and "How am I going to get that to the cloud appropriately," I really like the way you put that. At the end of the day, what users need to focus on is less where the application code is, and more where the data is, so that they can move the application code or they can move the work to the data. That's fundamentally the perspective. We think that businesses don't take their business to the cloud, they bring the cloud to their business. So, when you think about this notion of increasingly looking at a set of work that needs to be performed, where the data exists, and what acts you're going to take in that data, it does suggest that data is going to become more of a centerpiece asset within the business. How does some of the things that you guys are doing lead customers to start to acknowledge data as an asset so they're making the appropriate investments in their data as their business evolves, and partly in response to data as an asset? What do you think? >> We have to do our job to build to common denominators, and that's what we're doing to make this easy for clients. So today we announced the IBM integrated analytics system. Same code base on private cloud as on a hardware system as on public cloud, all of it federates to Hortonworks through common sequel. That's what clients need, 'cause it solves their problem. Click of a button, they can get the cloud, and by the way, on private cloud it's based on Kubernetes, which is aligned with what we have on public cloud. We're working with Hortonworks to optimize Yarn and Kubernetes working together. These are the meaty issues that if we don't solve it, then clients have to deal with the bag of bolts, and so that's the kind of stuff we're solving together. So think about it: one single code base for managing your data, federates to Hadoop, machine learning is built into the system, and it's based on Kubernetes, that's what clients want. >> And the containers is just great, too. Great cloud-native trend. You guys been great, active in there. Congratulations to both of you guys. Final question, get you guys the last word: How does the relationship between Hortonworks and IBM evolve? How do you guys see this playing out? More of the same? Keep integrating in code? Is there any new thing you see on the horizon that you're going to be knocking down in the future? >> I'll take the first shot. The goal is to continue to make it simple and easy for the customer to get to the cloud, bring those machine learning and data science models to the data, and make it easy for the consumption of the new next generation of applications, and continue to make our customer successful and drive value, but to do it through transparently enabling the technology platforms together, and I think we've acknowledged the things that IBM is extraordinarily good at, the things that Hortworks is good at, and bring those two together with virtually no overlap. 
>> Rob, you've been very partner-centric. Your thoughts on this partnership? >> Look, it's what clients want. Since we announced this, the results and the response has been fantastic, and I think it's for one simple reason. So, Hortonworks' mission, we all know, is open source, and delivering in the community. They do a fantastic job of that. We also know that sometimes, clients need a little bit more, and so, when you bring those two things together, that's what clients want. That's very different than what other people in the industry do that say, "We're going to create a proprietary wrapper "around your Hadoop environment and lock your data in." That's the opposite of what we're doing. We're saying we're giving you full freedom of open source, but we're enabling you to augment that with machine learning, data science capabilities. This is what clients want. That's why the partnership's working. I think that's why we've gotten the response that we have. >> And you guys have been multiple years into the new operating model of being much more aggressive within the Big Data community, which has now morphed into much larger landscape. You pleased with some of the results you're seeing on the IBM side and more coding, more involvement in these projects on your end? >> Yeah, I mean, look, we were certainly early on Spark, created a lot of momentum there. I think it actually ended up helping both of our interests in the market. We built a huge community of developers at IBM, which is not something IBM had even a few years ago, but it's great to have a relationship like this where we can continue to augment our skills. We make each other better, and I think what you'll see in the future is more on the governance side; I think that's the piece that's still not quite been figured out by most enterprises yet. The need is understood. The implementation is slow, so you'll see more from us collectively there. >> Well, congratulations in the community work you guys have done. I think the community's model's evolving mainstream as well. Open source will continue to grow. Congratulations. Rob Bearden and Rob Thomas here inside theCUBE, more coverage here in Big Data NYC with theCUBE, after this short break.

Published Date : Sep 27 2017

Wrap Up | IBM Fast Track Your Data 2017


 

>> Narrator: Live from Munich, Germany, it's theCUBE, covering IBM, Fast Track Your Data. Brought to you by IBM. >> We're back. This is Dave Vellante with Jim Kobielus, and this is theCUBE, the leader in live tech coverage. We go out to the events. We extract the signal from the noise. We are here covering a special presentation of IBM's Fast Track Your Data, and we're in Munich, Germany. It's been a day-long session. We started this morning with a panel discussion with five senior-level data scientists that Jim and I hosted. Then we did CUBE interviews in the morning. We cut away to the main tent. Kate Silverton did a very choreographed, scripted, but very well done, main keynote set of presentations. IBM made a couple of announcements today, and then we finished up theCUBE interviews. Jim and I are here to wrap. We're actually running on IBMgo.com. We're running live. Hilary Mason talking about what she's doing in data science, and also we've got a session on GDPR. You've got to log in to see those sessions. So go to IBMgo.com, and you'll find those. Hit the schedule and go to the Hilary Mason and GDPR channels, and check that out, but we're going to wrap now. Jim, two main announcements today. I hesitate to call them big announcements. I mean they were, you know, just kind of ... I think the word you used last night was perfunctory. You know, I mean they're okay, but they're not game changing. So what did you mean? >> Well first of all, when you look at ... Though IBM is not calling this a signature event, it's essentially a signature event. They do these every June or so. You know, in the past several years, the signature events have had like a one-track theme, whether it be IBM announcing they're investing deeply in Spark, or IBM announcing that they're focusing on investing in R as the core language for data science development. This year at this event in Munich, it's really a three-track event, in terms of the broad themes, and I mean they're all important tracks, but none of them is like game-changing. Perhaps IBM doesn't intend them to be, it seems. One of which is obviously Europe. We're holding this in Munich. And a couple of things of importance to European customers, first and foremost GDPR. The deadline next year, in terms of compliance, is approaching. So sound the alarm, as it were. And IBM has rolled out compliance and governance tools, download-and-go for the information governance catalog and so forth. Now announcing the consortium with Hortonworks to build governance on top of Apache Atlas, but also IBM announcing that they've opened up a DSX center in England and a machine-learning hub here in Germany, to help their European clients, in those countries especially, to get deeper down into data science and machine learning, in terms of developing those applications. That's important for the audience, the regional audience here. The second track, which is also important, and I alluded to it, is governance. In all of its manifestations you need a master catalog of all the assets for building and maintaining and controlling your data applications and your data science applications. The catalog, the consortium, the various offerings at IBM are announced and discussed in great detail. They've brought in customers and partners like Northern Trust to talk about the importance of governance, not just as a compliance mandate, but also as a potential strategy for monetizing your data. That's important.
Number three is what I call cloud-native data applications and how the state of the art in developing data applications is moving towards containerized and orchestrated environments that involve things like Docker and Kubernetes. The IBM DB2 developer community edition. Been in the market for a few years. The latest version they announced today includes Kubernetes support. Includes support for JSON. So it's geared towards a new generation of cloud and data apps. What I'm getting at ... Those three core themes are Europe, governance, and cloud-native data application development. Each of them is individually important, but none of them is a game changer. And one last thing. Data science and machine learning is one of the overarching envelope themes of this event. They've had Hilary Mason. A lot of discussion there. My sense is I was a little bit disappointed because there weren't any significant new announcements related to IBM evolving their machine learning portfolio into deep learning or artificial intelligence, in an environment where their direct competitors like Microsoft and Google and Amazon are making a huge push in AI, in terms of their investments. There's a bit of a discussion, and Rob Thomas got to it this morning, about DSX working with Power AI, the IBM platform. I would like to hear more going forward about IBM investments in these areas. So I thought it was an interesting bunch of announcements. I'll backtrack on perfunctory. I'll just say it was good that they had this for a lot of reasons, but like I said, none of these individual announcements is really changing the game. In fact like I said, I think I'm waiting for the fall, to see where IBM goes in terms of doing something that's actually differentiating and innovative. >> Well I think that the event itself is great. You've got a bunch of partners here, a bunch of customers. I mean it's active. IBM knows how to throw a party. They always have. >> And the sessions are really individually awesome, I mean in terms of what you learn. >> The content is very good. I would agree. The two announcements were, you know, DB2, sort of what I call community edition. Simpler, easier to download. Even Dave can download DB2. I really don't want to download DB2, but I could, and play with it I guess. You know I'm not a database guy, but those of you out there that are, go check it out. And the other one was the sort of unified data governance. They tried to tie it in. I think they actually did a really good job of tying it into GDPR. We're going to hear over the next, you know, 11 months, just a ton of GDPR-readiness fear, uncertainty and doubt from the vendor community, kind of like we heard with Y2K. We'll see what kind of impact GDPR has. I mean it looks like it's the real deal, Jim. I mean it looks like, you know, this 4%-of-turnover penalty. The penalties are much more onerous than any other sort of regulation that we've seen in the past, where you could just sort of fluff it off. Say yeah, just pay the fine. I think you're going to see a lot of, well, pay the lawyers to delay this thing and battle it. >> And one of our people in theCUBE that we interviewed said it exactly right. The GDPR is like the inverse of Y2K. In Y2K everybody was freaking out. It was actually nothing when it came down to it. Here, nobody on the street is really buzzing. I mean the average person is not buzzing about GDPR, but it's hugely important.
And like you said, I mean some serious penalties may be in the works for companies that are not complying, companies not just in Europe, but all around the world who do business with European customers. >> Right, okay, so now bring it back to sort of machine learning, deep learning. You basically said to Rob Thomas, I see machine learning here. I don't see a lot of the deep learning stuff quite yet. He said stay tuned. You know you were talking about TensorFlow and things like that. >> Yeah they supported that ... >> Explain. >> So Rob indicated that IBM very much, like with Power AI and DSX, provides an open framework or toolkit for plugging in your, you the developer's, preferred machine learning or deep learning toolkit of an open source nature. And there's a growing range of open source deep learning toolkits beyond, you know, TensorFlow, including Theano and MXNet and so forth, that IBM is supporting within the overall DSX framework, but also within the Power AI framework. In other words they've got those capabilities. They're sort of burying that message under a bushel basket, at least in terms of this event. Also one of the things that ... I said this to Mena Scoyal. Watson data platform, which they launched last fall, very important product. Very important platform for collaboration among data science professionals, in terms of the machine learning development pipeline. I wish there was more about the Watson data platform here, about where they're taking it, what the customers are doing with it. Like I said a couple of times, I see Watson data platform as very much a DevOps tool for the new generation of developers that are building machine learning models directly into their applications. I'd like to see IBM, going forward, turn Watson data platform into a true DevOps platform, in terms of continuous integration of machine learning, deep learning, and other statistical models. Continuous training, continuous deployment, iteration. I believe that's where they're going, or probably will be going. I'd like to see more. I'm expecting more along those lines going forward. What I just described, DevOps for data science, is a big theme that we're focusing on at Wikibon, in terms of where the industry is going. >> Yeah, yeah. And I want to come back to that again, and get an update on what you're doing within your team, and talk about the research. Before we do that, I mean one of the things we talked about on theCUBE, in the early days of Hadoop, is that the guys who are going to make the money in this big data business are the practitioners. You're not going to see, you know, these multi-hundred billion dollar valuations come out of the Hadoop world. And so far that prediction has held up well. It's the Airbnbs and the Ubers and the Spotifys and the Facebooks and the Googles, the practitioners who are applying big data, that are crushing it and making all the money. You see Amazon now buying Whole Foods. That in our view is a data play, but who's winning here, in either the vendor or the practitioner community? >> Who's winning are the startups with a hot new idea that's changing, that's disrupting some industry, or set of industries, with machine learning, deep learning, big data, etc. For example everybody's, with bated breath, waiting for, you know, self-driving vehicles. And as the ecosystem develops, somebody's going to clean up.
And one or more companies, companies we probably never heard of, leveraging everything we're describing here today, data science and containerized distributed applications that involve, you know, deep learning for image analysis and sensor analysis and so forth. Putting it all together in some new fabric that changes the way we live on this planet. But as you said, the platforms themselves, whether they be Hadoop or Spark or TensorFlow, whatever, they're open source. You know, and the fact is, by its very nature, the profit margins on selling open-source-based solutions inexorably migrate to zero. So you're not going to make any money as a tool vendor, or a platform vendor. You've got to make money ... If you're going to make money, you make money, for example, from providing an ecosystem within which innovation can happen. >> Okay we have a few minutes left. Let's talk about the research that you're working on. What's exciting you these days? >> Right, right. So I think a lot of people know I've been around the analyst space for a long, long time. I've joined the SiliconANGLE Wikibon team just recently. I used to work for a very large solution provider, and what I do here for Wikibon is I focus on data science as the core of next-generation application development. When I say next-generation application development, it's the development of AI, deep learning, machine learning, and the deployment of those data-driven statistical assets into all manner of applications. And you look at the hot stuff, like chatbots for example. Transforming the experience in e-commerce on mobile devices. Siri and Alexa and so forth. Hugely important. So what we're doing is we're focusing on AI and everything. We're focusing on containerization and the building of AI micro-services, and the ecosystem of the pipelines and the tools that allow you to do that. DevOps for data science, distributed training, federated training of statistical models, and so forth. We are also very much focusing on the whole distributed containerized ecosystem, Docker, Kubernetes and so forth, and where that's going, in terms of changing the state of the art in application development. Focusing on the API economy. All of those things that you need to wrap around the payload of AI to deliver it into every ... >> So you're focused on that intersection between AI and the related topics and the developer. Who is winning in that developer community? Obviously Amazon's winning. You got Microsoft doing a good job there. Google, Apple, who else? I mean how's IBM doing, for example? Maybe name some names. Who impresses you in the developer community? But specifically let's start with IBM. How is IBM doing in that space? >> IBM's doing really well. IBM has, for quite a while, been very good about engaging with a new generation of developers, using Spark and R and Hadoop and so forth to build applications rapidly and deploy them rapidly into all manner of applications. So IBM has very much reached out to, in the last several years, the Millennials for whom all of this, these new tools, have been their core repertoire from the very start. And I think in many ways, like today's DB2 developer community edition, it's very much geared to that market. Saying, you know, to the cloud-native application developer, take a second look at DB2. There's a lot in DB2 that you might bring into your next application development initiative, alongside your Spark toolkit and so forth. So IBM has startup envy.
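A small sketch to make the AI micro-services pattern above concrete: a statistical model wrapped behind an HTTP endpoint so it can be containerized and released like any other service. This is a generic illustration with scikit-learn and Flask, not a description of any IBM or Wikibon tooling; the toy model and endpoint are invented.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

# Train a toy model at startup. In a real DevOps-for-data-science
# pipeline this artifact would be versioned, tested, and promoted
# through CI/CD rather than trained inline.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    return jsonify({"class": int(model.predict([features])[0])})

if __name__ == "__main__":
    app.run(port=8080)
```

Put that behind a Dockerfile and a Kubernetes service and you have the basic deployable unit: the model as a replaceable micro-service rather than a report.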
They're a big old company. Been around more than a hundred years. And they're trying to very much bootstrap and restart their brand in this new context, in the 21st century. I think they're making a good effort at doing it. In terms of community engagement, they have a really good community engagement program, all around the world, in terms of hackathons and developer days, you know, meetups here and there. And they get lots of turnout and very loyal customers, and IBM's got the broadest portfolio. >> So you still bleed a little bit of blue. So I've got to squeeze it out of you now here. So let me push a little bit on what you're saying. So DB2 is the emphasis here, trying to position DB2 as appealing for developers, but why not some of the other, you know, acquisitions that they've made? I mean you don't hear that much about Cloudant, dashDB, and things of that nature. You would think that those would be more appealing to some of the developer communities than DB2. Or am I mistaken? Is it IBM sort of going after the core, trying to evolve that core, you know, constituency? >> No, they've done a lot of strategic acquisitions like Cloudant, and they've acquired graph databases and brought them into their platform. IBM has every type of database or file system that you might need for web or social or Internet of Things. And so for all of the development challenges, IBM has got a really high-quality, fit-the-purpose, best-of-breed underlying data platform for it. They've got huge amounts of developers energized all around the world working on this platform. DB2, in the last several years they've taken all of their platforms, their legacy ... That's the wrong word. All their existing mature platforms, like DB2, and brought them into the IBM cloud. >> I think legacy is the right word. >> Yeah, yeah. >> These things have been around for 30 years. >> And they're not going away because they're field-proven and ... >> They are evolving. >> And customers have implemented them everywhere. And they're evolving. If you look at how IBM has evolved DB2 in the last several years into ... For example, they responded to the challenge from SAP HANA. They brought BLU Acceleration in-memory technology into DB2 to make it screamingly fast and so forth. IBM has done a really good job of turning around these product groups and their product architectures, making them cloud-first. And then reaching out to a new generation of cloud application developers. Like I said today, things like DB2 developer community edition, it's just the next chapter in this ongoing saga of IBM turning itself around. Like I said, each of the individual announcements today is like, okay, that's interesting. I'm glad to see IBM showing progress. None of them is individually disruptive. I think last week, though, Hortonworks was disruptive in the sense that IBM recognized that BigInsights didn't really have a lot of traction in the Hadoop space, not as much as they would have wished. Hortonworks very much does, and IBM has cast its lot to work with HDP, but Hortonworks recognizes they haven't achieved any traction with data scientists, therefore DSX makes sense as part of the Hortonworks portfolio. Likewise Big SQL makes perfect sense as the SQL front end to HDP.
I think the teaming of IBM and Hortonworks augurs further things that they'll be doing in the future, not just governance, but really putting together a broader cloud portfolio for the next generation of data scientists doing work in the cloud. >> Do you think Hortonworks is a legitimate acquisition target for IBM? >> Of course they are. >> Why would IBM ... You know, educate us. Why would IBM want to acquire Hortonworks? What does that give IBM? Open source mojo, obviously. >> Yeah, mojo. >> What else? >> Strong loyalty with the Hadoop market, with developers. >> The developer angle, it would supercharge the developer angle, and maybe make it more relevant outside of some of those legacy systems. Is that it? >> Yeah, but also remember that Hortonworks came from Yahoo, the team that developed much of what became Hadoop. They've got an excellent team. Strategic team. So in many ways, you can look at Hortonworks as one part acqui-hire, if they ever do that, and one part a really substantial and growing solution portfolio that in many ways is complementary to IBM. Hortonworks is really deep on the governance of Hadoop. IBM has gone there, but I think Hortonworks is even deeper, in terms of their laser focus. >> Ecosystem expansion, and it actually really wouldn't be that expensive of an acquisition. I mean it's, you know, north of ... Maybe a billion dollars might get it done. >> Yeah. >> You know, so would you pay a billion dollars for Hortonworks? >> Not out of my own pocket. >> No, I mean if you're IBM. You think that would deliver that kind of value? I mean you know how IBM thinks about acquisitions. They're good at acquisitions. They look at the IRR. They have their formula. They blue-wash the companies and they generally do very well with acquisitions. Do you think Hortonworks would fit that profile, that monetization profile? >> I wouldn't say that Hortonworks, in terms of monetization potential, would match, say, what IBM has achieved by acquiring Netezza. >> Cognos. >> Or SPSS. I mean SPSS has been an extraordinarily successful ... >> Well, the day IBM acquired SPSS they tripled the license fees. As a customer I know, ouch, it worked. It was incredibly successful. >> Well, yeah. Cognos was. Netezza was. And SPSS. Those three acquisitions in the last ten years have been extraordinarily pivotal and successful for IBM to build what they now have, which is really the most comprehensive portfolio of fit-to-purpose data platforms. So in other words all those acquisitions prepared IBM to duke it out now with their primary competitors in this new field, which are Microsoft, who's newly resurgent, and Amazon Web Services. In other words, the two Seattle vendors. Seattle has come on strong, to the point where, in big data and the cloud, Seattle is almost eclipsing Silicon Valley as the locus of innovation and really of customer adoption in the cloud space. >> Quite amazing. Well, Google still hanging in there. >> Oh yeah. >> Alright, Jim. Really a pleasure working with you today. Thanks so much. Really appreciate it. >> Thanks for bringing me on your team. >> And Munich crew, you guys did a great job. Really well done. Chuck, Alex, Patrick wherever he is, and our great makeup lady. Thanks a lot. Everybody back home. We're out. This is Fast Track Your Data. Go to IBMgo.com for all the replays. Youtube.com/SiliconANGLE for all the shows. TheCUBE.net is where we tell you where theCUBE's going to be. Go to wikibon.com for all the research.
Thanks for watching everybody. This is Dave Vellante with Jim Kobielus. We're out.
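One concrete rendering of the containerized direction discussed in this wrap, the DB2 developer community edition with Kubernetes support: the sketch below stands up a single DB2 container on a cluster with the official Kubernetes Python client. The image name, environment variables, and port are assumptions for illustration; the actual container image and its required settings may differ.

```python
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

# Container spec -- image name and env vars are hypothetical.
container = client.V1Container(
    name="db2",
    image="ibmcom/db2-developer-c",  # placeholder image name
    ports=[client.V1ContainerPort(container_port=50000)],
    env=[
        client.V1EnvVar(name="LICENSE", value="accept"),
        client.V1EnvVar(name="DB2INST1_PASSWORD", value="changeme"),
    ],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="db2-community"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "db2"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "db2"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(
    namespace="default", body=deployment)
```

A single replica of a stateful database is only a toy; a production deployment would add persistent volumes and proper secrets. But the sketch shows how "download and go" maps onto an ordinary cluster primitive.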

Published Date : Jun 25 2017

Seth Dobrin, IBM Analytics - IBM Fast Track Your Data 2017


 

>> Announcer: Live from Munich, Germany; it's The Cube. Covering IBM; fast-track your data. Brought to you by IBM. (upbeat techno music) >> So for you here at the show, generally and specifically, what are you doing here today? >> There's really three things going on at the show, three high-level things. One is we're talking about our new... How we're repositioning our hybrid data management portfolio, specifically some announcements around DB2 in a hybrid environment, and some highly transactional offerings around DB2. We're talking about our unified governance portfolio; so actually delivering a platform for unified governance that allows our clients to interact with governance and data management kinds of products in a more streamlined way, and helps them actually solve a problem instead of just offering products. The third is really around data science and machine learning. Specifically we're talking about our machine learning hub that we're launching here in Germany. Prior to this we had a machine learning hub in San Francisco, Toronto, one in Asia, and now we're launching one here in Europe. >> Seth, can you describe what this hub is all about? Is this a data center where you're hosting machine learning services, or is it something else? >> Yeah, so this is where clients can come and learn how to do data science. They can bring their problems, bring their data to our facilities, learn how to solve a data science problem in a more team-oriented way; interacting with data scientists, machine learning engineers, basically, data engineers, developers, to solve a problem for their business around data science. These previous hubs have been completely booked, so we wanted to launch them in other areas to try and expand their capacity. >> You're hosting a round table today, right, on the main tent? >> Yep. >> And you've got a customer on, you guys are going to be talking about sort of applying practices in financial and other areas. Maybe describe that a little bit. >> We have a customer on from ING, Heinrich, who's the chief architect for ING. ING, IBM, and Hortonworks have a consortium, if you would, or a framework that we're doing around Apache Atlas and Ranger, as the kind of open-source operating system for our unified governance platform. So much as IBM has positioned Spark as a unified, kind of open-source operating system for analytics, for a unified governance platform... For a governance platform to be truly unified, you need to be able to integrate metadata. The biggest challenge in connecting your data environments, if you're an enterprise that was not internet-born or cloud-born, is that you have proprietary metadata platforms that all want to be the master. When everyone wants to be the master, you can't really get anything done. So what we're doing around Apache Atlas is we are setting up Apache Atlas as kind of a virtual translator, if you would, or a dictionary between all the different proprietary metadata platforms, so that you can get a single unified view of your data environment across hybrid clouds, on premise, in the cloud, and across different proprietary vendor platforms. Because it's open source, there are these connectors that can go in and out of the proprietary platforms. >> So Seth, you seem like you're pretty tuned in to the portfolio within the analytics group. How are you spending your time as the Chief Data Officer? How do you balance it between customer visits, maybe talking about some of the products, and then your sort of day job?
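A brief sketch to make the Apache Atlas "virtual translator" role concrete: once metadata from the different proprietary platforms lands in Atlas, any client can query that single unified view through one REST API. Below is a minimal example against the Atlas v2 API; the host (21000 is the Atlas default port), the default dev credentials, and the search terms are assumptions for illustration.

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")  # Atlas dev defaults; replace in practice

# Basic search: entities of one type whose name matches a term.
resp = requests.get(
    f"{ATLAS}/search/basic",
    params={"typeName": "hive_table", "query": "customer"},
    auth=AUTH,
)
resp.raise_for_status()

for entity in resp.json().get("entities", []):
    attrs = entity.get("attributes", {})
    print(entity["typeName"], attrs.get("qualifiedName"))
```

The same call works regardless of which source system originally owned the metadata, which is exactly the point of the translator pattern.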
>> I actually have three day jobs. My job's actually split into kind of three pieces. My primary mission is really around transforming IBM's internal business unit, internal business workings, to use data and analytics to run our business. So kind of internal business unit transformation. Part of that business unit transformation is also making sure that we're compliant with regulations like GDPR and other regulations. Another third is really around kind of rethinking our offerings from a CDO perspective. As a CDO, and as you know, Dave, I've only been with IBM for seven months. As a former client recently, and as a CDO, what is it that I want to see from IBM's offerings? We kind of hit on it a little bit with the unified governance platform, where I think IBM makes fantastic products. But as a client, if a salesperson shows up to me, I don't want them selling me a product, 'cause if I want an MDM solution, I'll call you up and say, "Hey, I need an MDM solution. Give me a quote." What I want is them showing up and saying, "I have a solution that's going to solve your governance problem across your portfolio." Or, "I'm going to solve your data science problem." Or, "I'm going to help you master your data, and manage your data across all these different environments." So really working with the offering management and the dev teams to define what are these three or four kind of business platforms that we want to settle on. We know three of them at least, right? We know that we have hybrid data management. We have unified governance. We have data science and machine learning, and you could think of the Z franchise as a fourth platform. >> Seth, can you net out how governance relates to data science? 'Cause there is governance of the statistical models, machine learning, and so forth, version control. I mean, in an end-to-end machine learning pipeline, there are various versions of various artifacts that have to be managed in a structured way. Does your unified governance bundle, or portfolio, address those requirements? Or just the data governance? >> Yeah, so the unified governance platform really kind of focuses today on data governance and how good data governance can be an enabler of rapid data science. So if you have your data all pre-governed, it makes it much quicker to get access to data and understand what you can and can't do with data; especially being here in Europe, in the context of the EU GDPR. You need to make sure that your data scientists are doing things that are approved by the user, because basically it's your data, and you have to give explicit consent to allow things to be done with it. But the long term vision is that... essentially the output of models is data, right? And how you use and deploy those models also needs to be governed. So the long term vision is that we will have a governance platform for all those things as well. I think it makes more sense for those things to be governed in the data science platform, if you would. And we... >> We often hear, separate from GDPR and all that, something called algorithmic accountability, which is being discussed more in policy circles, in government circles around the world, as strongly related to everything you're describing. Being able to trace the lineage of any algorithmic decision back to the data, the metadata, and so forth, and the machine learning models that might have driven it. Is that where IBM's going with this portfolio? >> I think that's the natural extension of it.
We're thinking really in the context of them as two different pieces, but if you solve them both and you connect them together, then you've covered that problem. But I think you're absolutely right. As we're leveraging machine learning and artificial intelligence in general, we need to be able to understand how we got to a decision, and that includes the model, the data, how the data was gathered, how the data was used and processed. So it is that entire pipeline, 'cause it is a pipeline. You're not doing machine learning or AI in a vacuum. You're doing it in the context of the data, and you're doing it in the context of the individuals or the organizations that you're trying to influence with the output of those models. >> I call it DevOps for data science. >> Seth, in the early Hadoop days, the real headwind was complexity. It still is, by the way. We know that. Companies like IBM are trying to reduce that complexity. Spark helps a little bit. So the technology will evolve, we get that. It seems like one of the other big headwinds right now is that most companies don't have a great understanding of how they can take data and monetize it, turn it into value. Most companies, many anyway, make the mistake of, "Well, I don't really want to sell my data," or, "I'm not really a data supplier." And they're kind of thinking about it, maybe not in the right way. But we seem to be entering a next wave here, where people are beginning to understand I can cut costs, I can do predictive maintenance, I can maybe not sell the data, but I can enhance what I'm doing and increase my revenue, maybe my customer retention. They seem to be tuning in, more so; largely, I think, 'cause of the chief data officer roles, helping them think that through. I wonder if you would give us your point of view on that narrative. >> I think what you're describing is kind of the digital transformation journey. I think the end game, as enterprises go through a digital transformation, is how do I sell services, outcomes, those types of things. How do I sell an outcome to my end user? That's really the end game of a digital transformation in my mind. But before you can get to that, before you transform your business's objectives, there are a couple of intermediary steps that are required. The first is what you're describing, those kinds of data transformations. Enterprises need to really get a handle on their data and become data driven, and start then transforming their current business model; so how do I accelerate my current business leveraging data and analytics? I kind of frame that as the data science transformation aspect of the digital journey. Then the next aspect of it is how do I transform my business and change my business objectives? Part of that first step is, in fact, how do I optimize my supply chain? How do I optimize my workforce? How do I optimize my goals? How do I get to my current, you know, the things that Wall Street cares about for a business; how do I accelerate those, make those faster, make those better, and really put my company out in front? 'Cause really, in the grand scheme of things, there are two types of companies today: there are the companies that are going to be the disruptors, and there are the companies that are going to get disrupted. Most companies want to be the disruptors, and it's a process to do that. >> So the accounting industry doesn't have standards around valuing data as an asset, and many of us feel as though waiting for that is a mistake. You can't wait for that.
You've got to figure it out on your own. But again, it seems to be somewhat of a headwind because it puts data and data value in this fuzzy category. But there are clearly the data haves and the data have-nots. What are you seeing in that regard? >> I think the first... When I was in my former role, my former company went through an exercise of valuing our data and our decisions. I'm actually doing that same exercise at IBM right now. We're going through IBM, at least in the analytics business unit, the part I'm responsible for, and going to all the leaders and saying, "What decisions are you making?" "Help me understand the decisions that you're making." "Help me understand the data you need to make those decisions." And that does two things. Number one, it does get to the point of, how can we value the decisions? 'Cause each one of those decisions has a specific value to the company. You can assign a dollar amount to it. But it also helps you change how people in the enterprise think. Because the first time you go through and ask these questions, they talk about the dashboards they want to help them make their preconceived decisions, validated by data. They have a preconceived notion of the decision they want to make. They want the data to back it up. So they want a dashboard to help them do that. So when you come in and start having this conversation, you kind of stop them and say, "Okay, what you're describing is a dashboard. That's not a decision. Let's talk about the decision that you want to make, and let's understand the real value of that decision." So you're doing two things: you're building a portfolio of decisions, which then becomes, to your point, Jim, about DevOps for data science, the backlog for your data scientists in the long run. You then connect those decisions to the data that's required to make them, and you can extrapolate from the value of each decision to the component of value that each piece of data contributes to it. So you can group your data logically within an enterprise; customer, product, talent, location, things like that, and you can assign a value to those based on the decisions they support. >> Jim: So... >> Dave: Go ahead, please. >> As a CDO, following on that, are you also, as part of that exercise, trying to assess the value of not just the data, but of data science as a capability? Or particular data science assets, like machine learning models? In the overall scheme of things, that kind of valuation can then drive IBM's decision to ramp up their internal data science initiatives, or redeploy it, or, give me a... >> That's exactly what happened. As you build this portfolio of decisions, each decision has a value. So I am now assigning a value to the data science models that my team will build. CDOs are a relatively new role in many organizations. When money gets tight, they say, "What's this guy doing?" (Dave laughing) Having a portfolio of decisions that's saying, "Here's real value I'm adding..." So, number one, "Here's the value I can add in the future," and as you check off those boxes, you can kind of go and say, "Here's value I've added. Here's where I've changed how the company's operating. Here's where I've generated X billions of dollars of new revenue, or cost savings, or cost avoidance, for the enterprise." >> When you went through these exercises at your previous company, and now at IBM, are you using standardized valuation methodologies? Did you kind of develop your own, or come up with a scoring system? How'd you do that?
>> I think there are some things, around like net promoter score, where there are pretty good standards on how to assign value to increases or decreases in net promoter score for certain aspects of your business. In other areas, you need to kind of decide as an enterprise, how do we value our assets? Do we use a three-year, five-year, ten-year NPV? Do we use some other metric? You need to frame it in the terms of reference your CFO is used to, so that it's in the context the company is used to talking about. For most companies, it's net present value. >> Okay, and you're measuring that on an ongoing basis. >> Seth: Yep. >> And fine-tuning as you go along. Seth, we're out of time. Thanks so much for coming back in The Cube. It was great to see you. >> Seth: Yeah, thanks for having me. >> You're welcome, good luck this afternoon. >> Seth: Alright. >> Keep it right there, buddy. We'll be back. Actually, let me run down the day here for you, just take a second to do that. We're going to end our Cube interviews for the morning, and then we're going to cut over to the main tent. So in about an hour, Rob Thomas is going to kick off the main tent here with a keynote, talking about where data goes next. Hilary Mason's going to be on. There's a session with Dez Blanchfield on data science as a team sport. Then the big session on changing regulations, GDPR. Seth, you've got some customers that you're going to bring on and talk about these issues. And then, sort of a balancing act, the balancing act of hybrid data. Then we're going to come back to The Cube and finish up our Cube interviews for the afternoon. There are also going to be two breakout sessions; one with Hilary Mason, and one on GDPR. You've got to go to IBMgo.com and log in and register. It's all free to see those breakout sessions. Everything else is open. You don't even have to register or log in to see that. So keep it right here, everybody. Check out the main tent. Check out siliconangle.com, and of course IBMgo.com for all the action here. Fast track your data. We're live from Munich, Germany; and we'll see you a little later. (upbeat techno music)
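A small sketch of the valuation exercise Seth outlines above, assigning a net present value to each decision in the portfolio so they can be ranked. The decisions, cash flows, and discount rate below are invented for illustration.

```python
def npv(cash_flows, rate):
    """Net present value of yearly cash flows at a given discount rate."""
    return sum(cf / (1 + rate) ** yr for yr, cf in enumerate(cash_flows, 1))

# Hypothetical decision portfolio: decision -> expected yearly value
# over a three-year view, in dollars.
decisions = {
    "optimize field-service routing": [200_000, 250_000, 250_000],
    "churn-risk scoring for renewals": [150_000, 300_000, 400_000],
}

RATE = 0.08  # illustrative discount rate
for name, flows in sorted(decisions.items(),
                          key=lambda kv: npv(kv[1], RATE), reverse=True):
    print(f"{name}: NPV = ${npv(flows, RATE):,.0f}")
```

Whether the horizon is three, five, or ten years, the mechanics are the same; what matters, as Seth says, is framing the result in the terms your CFO already uses.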

Published Date : Jun 24 2017

Rob Thomas, IBM Analytics | IBM Fast Track Your Data 2017


 

>> Announcer: Live from Munich, Germany, it's theCUBE. Covering IBM: Fast Track Your Data. Brought to you by IBM. >> Welcome, everybody, to Munich, Germany. This is Fast Track Your Data brought to you by IBM, and this is theCUBE, the leader in live tech coverage. We go out to the events, we extract the signal from the noise. My name is Dave Vellante, and I'm here with my co-host Jim Kobielus. Rob Thomas is here, he's the General Manager of IBM Analytics, and longtime CUBE guest, good to see you again, Rob. >> Hey, great to see you. Thanks for being here. >> Dave: You're welcome, thanks for having us. So we're talking about, we missed each other last week at the Hortonworks DataWorks Summit, but you came on theCUBE, you guys had the big announcement there. You're sort of getting out of doing a Hadoop distribution, right? TheCUBE gave up our Hadoop distribution several years ago, so. It's good that you joined us. But, um, that's tongue-in-cheek. Talk about what's going on with Hortonworks. You guys are now going to be partnering with them essentially to replace BigInsights, and you're going to continue to service those customers. But there's more than that. What's that announcement all about? >> We're really excited about that announcement, that relationship, just to kind of recap for those that didn't see it last week. We are making a huge partnership with Hortonworks, where we're bringing data science and machine learning to the Hadoop community. So IBM will be adopting HDP as our distribution, and that's what we will drive into the market from a Hadoop perspective. Hortonworks is adopting IBM Data Science Experience and IBM machine learning to be a core part of their Hadoop platform. And I'd say this is a recognition. One is, companies should do what they do best. We think we're great at data science and machine learning. Hortonworks is the best at Hadoop. Combine those two things, it'll be great for clients. And we also talked about extending that to things like Big SQL, where they're partnering with us on Big SQL, around modernizing data environments. And then third, which relates a little bit to what we're here in Munich talking about, is governance, where we're partnering closely with them around unified governance, Apache Atlas, advancing Atlas in the enterprise. And so, there are a lot of dimensions to the relationship, but I can tell you, since I was on theCUBE a week ago with Rob Bearden, client response has been amazing. Rob and I have done a number of client visits together, and clients see the value of unlocking insights in their Hadoop data, and they love this, which is great. >> Now, I mean, the Hadoop distro, I mean early on you got into that business, just, you had to do it. You had to be relevant, you want to be part of the community, and a number of folks did that. But it's really sort of best left to a few guys who want to do that, and Apache open source is really, I think, the way to go there. Let's talk about Munich. You guys chose this venue. There's a lot of talk about GDPR, you've got some announcements around unified governance, but why Munich? >> So, there's something interesting that I see happening in the market. So first of all, you look at the last five years. There are only 10 companies in the world that have outperformed the S&P 500 in each of those five years. And we started digging into who those companies are and what they do. They are all applying data science and machine learning at scale to drive their business. And so, something's happening in the market.
That's what leaders are doing. And I look at what's happening in Europe, and I say, I don't see the European market being that aggressive yet around data science, machine learning, how you apply data for competitive advantage, so we wanted to come do this in Munich. And it's a bit of a wake-up call, almost, to say hey, this is what's happening. We want to encourage clients across Europe to think about how they start to do something now. >> Yeah, of course, GDPR is also a hook. The European Union, and you guys have talked about that, you've got some keynotes today, and some breakout sessions that are discussing that, but talk about the two announcements that you guys made. There's one on DB2, there's another one around unified governance, what do those mean for clients? >> Yeah, sure, so first of all on GDPR, it's interesting to me, it's kind of the inverse of Y2K, which is there's very little hype, but there are huge ramifications. And Y2K was kind of the opposite. So look, it's coming, May 2018, clients have to be GDPR-compliant. And there's a misconception in the market that it only impacts companies in Europe. It actually impacts any company that does any type of business in Europe. So, it impacts everybody. So we are announcing a platform for unified governance that makes sure clients are GDPR-compliant. We've integrated software technology across analytics, IBM Security, some of the assets from the Promontory acquisition that IBM did last year, and we are delivering the only platform for unified governance. And that's what clients need to be GDPR-compliant. The second piece is data has to become a lot simpler. As you think about my comment, who's leading the market today? Data's hard, and so we're trying to make data dramatically simpler. And so for example, with DB2, what we're announcing is you can download and get started using DB2 in 15 minutes or less, and anybody can do it. Even you can do it, Dave, which is amazing. >> Dave: (laughs) >> For the first time ever, you can-- >> We'll test that, Rob. >> Let's go test that. I would love to see you do it, because I guarantee you can. Even my son can do it. I had my son do it this weekend before I came here, because I wanted to see how simple it was. So that announcement is really about bringing, or introducing, a new era of simplicity to data and analytics. We call it Download And Go. We started with SPSS, we did that back in March. Now we're bringing Download And Go to DB2, and to our governance catalog. So the idea is to make data really simple for enterprises. >> You had a community edition previous to this, correct? There was-- >> Rob: We did, but it wasn't this easy. >> Wasn't this simple, okay. >> Not anybody could do it, and I want to make it so anybody can do it. >> Is simplicity, the rate of simplicity, the only differentiator of the latest edition, or, I believe you have Kubernetes support now with this new edition, can you describe what that involves? >> Yeah, sure, so there are two main things that are new functionally-wise, Jim, to your point. So one is, look, we're big supporters of Kubernetes. And as we are helping clients build out private clouds, the best answer for that in our mind is Kubernetes, and so when we released Data Science Experience for Private Cloud earlier this quarter, that was on Kubernetes, and we're extending that now to other parts of the portfolio. The other thing we're doing with DB2 is we're extending JSON support for DB2.
So think of it as: you're working in a relational environment, and now, just through SQL, you can integrate with non-relational environments, JSON, documents, any type of NoSQL environment. So we're finally bringing to fruition this idea of a data fabric, which is, I can access all my data from a single interface, and that's pretty powerful for clients. >> Yeah, more cloud data development. Rob, I wonder if you can, we can go back to the machine learning, one of the core focuses of this particular event and the announcements you're making. Back in the fall, IBM made an announcement of Watson machine learning for IBM Cloud, at World of Watson. In February, you made an announcement of IBM machine learning for the z platform. What are the machine learning announcements at this particular event, and can you sort of connect the dots in terms of where you're going, in terms of what sort of innovations you are driving into your machine learning portfolio going forward? >> I have a fundamental belief that machine learning is best when it's brought to the data. So, we started with, like you said, Watson machine learning on IBM Cloud, and then we said, well, what's the next big corpus of data in the world? That's an easy answer, it's the mainframe, that's where all the world's transactional data sits, so we did that. Last week with the Hortonworks announcement, we said we're bringing machine learning to Hadoop, so we've kind of covered all the landscape of where data is. Now, the next step is about how do we bring a community into this? And the way that you do that is we don't dictate a language, we don't dictate a framework. So if you want to work with IBM on machine learning, or in Data Science Experience, you choose your language. Python, great. Scala or Java, you pick whatever language you want. You pick whatever machine learning framework you want, we're not trying to dictate that because there are different preferences in the market, so what we're really talking about here this week in Munich is this idea of an open platform for data science and machine learning. And we think that is going to bring a lot of people to the table. >> And with open, one thing, with open platform in mind, one thing to me that is conspicuously missing from the announcements today, correct me if I'm wrong, is any indication that you're bringing support for the deep learning frameworks like TensorFlow into this overall machine learning environment. Am I wrong? I know you have Power AI. Is there a piece of Power AI in these announcements today? >> So, stay tuned on that. We are, it takes some time to do that right, and we are doing that. But we want to optimize so that you can do machine learning with GPU acceleration on Power AI, so stay tuned on that one. But we are supporting multiple frameworks, so if you want to use TensorFlow, that's great. If you want to use Caffe, that's great. If you want to use Theano, that's great. That is our approach here. We're going to allow you to decide what's the best framework for you. >> So as you look forward, maybe it's a question for you, Jim, but Rob, I'd love you to chime in. What does that mean for businesses? I mean, is it just more automation, more capabilities as you evolve that timeline, without divulging any sort of secrets? What do you think, Jim? Or do you want me to ask-- >> What do I think, what do I think you're doing? >> No, you ask about deep learning, like, okay, that's, I don't see that, Rob says okay, stay tuned. What does it mean for a business, that, if like-- >> Yeah.
>> If I'm planning my roadmap, what does that mean for me in terms of how I should think about the capabilities going forward? >> Yeah, well, what it means for a business, first of all, is what they're going to use deep learning for: doing things like video analytics and speech analytics, and more of the challenges involving convolutional neural networks to do pattern recognition on complex data objects for things like connected cars, and so forth. Those are the kinds of things that can be done with deep learning. >> Okay. And so, Rob, you're talking about here in Europe how the uptake in some of the data orientation has been a little bit slower, so I presume from your standpoint you don't want to over-rotate to some of these things. But what do you think, I mean, it sounds like there is a difference between certainly Europe and those top 10 companies in the S&P, outperforming the S&P 500. What's the barrier, is it just an understanding of how to take advantage of data, is it cultural, what's your sense of this? >> So, to some extent, data science is easy, data culture is really hard. And so I do think that culture's a big piece of it. And the reason we're kind of starting with a focus on machine learning, simplistic view, machine learning is a general-purpose framework. And so it invites a lot of experimentation, a lot of engagement, and we're trying to make it easier for people to on-board. As you get to things like deep learning as Jim's describing, that's where the market's going, there's no question. Those tend to be very domain-specific, vertical-type use cases, and to some extent, what I see clients struggle with is, they say, well, I don't know what my use case is. So we're saying, look, okay, start with the basics. A general-purpose framework, do some tests, do some iteration, do some experiments, and once you find out what's working, then you can go to a deep learning type of approach. And so I think you'll see an evolution towards that over time, it's not either-or. It's more a question of sequencing. >> One of the things we've talked to you about on theCUBE in the past, you and others, is that IBM obviously is a big services business. This big data is complicated, but great for services, but one of the challenges that IBM and other companies have had is how do you take that service expertise, codify it to software and scale it at large volumes and make it adoptable? I thought the Watson Data Platform announcement last fall, I think at the time you called it DataWorks, and then the name evolved, was really a strong attempt to do that, to package a lot of expertise that you guys had developed over the years, maybe even some different software modules, but bring them together in a scalable software package. So is that the right interpretation, and how's that going, what's the uptake been like? >> So, it's going incredibly well. What's interesting to me is what everybody remembers from that announcement is the Watson Data Platform, which is a decomposable framework for doing these types of use cases on the IBM cloud. But there was another piece of that announcement that is just as critical, which is we introduced something called the Data First method. And that is the recipe book that says to a client, given where you are, how do you get to this future on the cloud? And that's the part that people, clients, struggle with, is how do I get from step to step? So with Data First, we said, well, look. There are different approaches to this.
You can start with governance, you can start with data science, you can start with data management, you can start with visualization; there are different entry points. You figure out the right one for you, and then we help clients through that. And we've made the Data First method available to all of our business partners so they can go do that. We work closely with our own consulting business on that, GBS. But that to me is actually the thing from that event that has had, I'd say, the biggest impact on the market: just helping clients map out an approach, a methodology, to getting on this journey. >> So that was a catalyst. So this is not a sequential process; you can start, you can enter, like you said, wherever you want, and then pick up the other pieces, from a maturity model standpoint? >> Exactly, because everybody is at a different place in their own life cycle, and so we want to make that flexible. >> I have a question about the clients, the customers' use of Watson Data Platform in a DevOps context. So, are more of your customers looking to use Watson Data Platform to automate more of the stages of the machine learning development and the training and deployment pipeline? And do you see, IBM, do you see yourself taking the platform and evolving it into a more full-fledged automated data science release pipelining tool? Or am I misunderstanding that? >> Rob: No, I think that-- >> Your strategy. >> Rob: You got it right, I would just, I would expand a little bit. So, one is it's a very flexible way to manage data. When you look at the Watson Data Platform, we've got relational stores, we've got column stores, we've got in-memory stores, we've got the whole suite of open-source databases under the Compose.io umbrella, we've got Cloudant. So we've delivered a very flexible data layer. Now, in terms of how you apply data science, we say, again, choose your model, choose your language, choose your framework, that's up to you. And we allow clients, many clients start by building models on their private cloud, then we say you can deploy those into the Watson Data Platform, so therefore they're running on the data that you have as part of that data fabric. So, we're continuing to deliver a very fluid data layer to which you can then apply data science, apply machine learning, and there's a lot of data moving into the Watson Data Platform because clients see that flexibility. >> All right, Rob, we're out of time, but I want to kind of set up the day. We're doing CUBE interviews all morning here, and then we cut over to the main tent. You can get all of this on IBMgo.com; you'll see the schedule. Rob, you've got, you're kicking off a session. We've got Hilary Mason, we've got a breakout session on GDPR, maybe set up the main tent for us. >> Yeah, main tent's going to be exciting. We're going to debunk a lot of misconceptions about data and about what's happening. Marc Altshuller has got a great segment on what he calls the death of correlations, so we've got some pretty engaging stuff. Hilary's got a great piece that she was talking to me about this morning. It's going to be interesting. We think it's going to provoke some thought and ultimately provoke action, and that's the intent of this week. >> Excellent. Well, Rob, thanks again for coming to theCUBE. It's always a pleasure to see you. >> Rob: Thanks, guys, great to see you. >> You're welcome; all right, keep it right there, buddy. We'll be back with our next guest.
This is theCUBE, we're live from Munich, Fast Track Your Data, right back. (upbeat electronic music)
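
To ground the data fabric idea Rob describes above, the short sketch below shows the pattern of querying relational-style tables and JSON documents through one SQL interface. It uses open-source Apache Spark purely as a stand-in, not the Watson Data Platform's actual API, and the file names (customers.csv, events.json) are invented for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-fabric-sketch").getOrCreate()

    # A relational-style table (CSV standing in for a warehouse extract).
    customers = spark.read.option("header", True).csv("customers.csv")
    customers.createOrReplaceTempView("customers")

    # Semi-structured JSON documents, exposed through the same SQL surface.
    events = spark.read.json("events.json")
    events.createOrReplaceTempView("events")

    # One SQL statement spanning both worlds: the essence of the pitch.
    result = spark.sql("""
        SELECT c.name, COUNT(*) AS n_events
        FROM customers c
        JOIN events e ON c.id = e.customer_id
        GROUP BY c.name
    """)
    result.show()

The point is not the specific engine; it is that the analyst writes ordinary SQL while the platform hides where and how each source is stored.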

Published Date : Jun 22 2017


Scott Gnau, Hortonworks - DataWorks Summit 2017


 

>> Announcer: Live, from San Jose, in the heart of Silicon Valley, it's The Cube, covering DataWorks Summit 2017. Brought to you by Hortonworks. >> Welcome back to The Cube. We are live at DataWorks Summit 2017. I'm Lisa Martin with my cohost, George Gilbert. We've just come from this energetic, laser-light-show-infused keynote, and we're very excited to be joined by one of the keynote speakers today, the CTO of Hortonworks, Scott Gnau. Scott, welcome back to The Cube. >> Great to be here, thanks for having me. >> Great to have you back here. One of the things that you talked about in your keynote today was collaboration. You talked about the modern data architecture, and one of the things that I thought was really interesting is that now, where Hortonworks is, you are empowering cross-functional teams: operations managers, business analysts, data scientists, really helping enterprises drive the next generation of value creation. Tell us a little bit about that. >> Right, great. Thanks for noticing, by the way. I think the next, the important thing, kind of as a natural evolution for us as a company and as a community is, and I've seen this time and again in the tech industry, we've kind of moved from really cool breakthrough tech more into a solutions base. So I think this whole notion is really about how we're making that natural transition. And when you think about all the cool technology and all the breakthrough algorithms and all that, that's really great, but how do we then take that and turn it to value really quickly and in a repeatable fashion. So, the notion that I launched today is really making these three personas really successful. If you can focus, combining all of the technology, usability, and even some services around it, to make each of those folks more successful in their job. So I've broken it down really into three categories. We know the traditional business analyst, right? They've got SQL, and they've been doing predictive modeling of structured data for a very long time, and there's a lot of value generated from that. Making the business analyst successful in a Hadoop-inspired world is extremely valuable. And why is that? Well, it's because Hadoop actually now brings a lot more breadth of data and frankly a lot more depth of data than they've ever had access to before. But being able to communicate with that business analyst in a language they understand, SQL, being able to make all those tools work seamlessly, is the next extension of success for the business analyst. We spent a lot of time this morning talking about data scientists, the next great frontier, where you bring together lots and lots and lots of data, heavy math, and heavy compute with the data scientists, and really enable them to go build out that next generation of high-definition kind of analytics, all right. And we're all, certainly I am, captured by the notion of self-driving cars, and you think about a self-driving car, and the success of that is purely based on the successful data science. In those cameras and those machines being able to infer images more accurately than a human being, and then make decisions about what those images mean. That's all data science, and it's all about raw processing power and lots and lots and lots of data to make those models train and become more accurate than what would otherwise happen. So enabling the data scientist to be successful, obviously, that's a use case.
You know, certainly voice-activated, voice-response kinds of systems, for better customer service; better fraud detection, you know, the cost of a false positive is a hundred times the cost of missing a fraudulent behavior, right? That's because you've irritated a really good customer. So being able to really train those models in high definition is extremely valuable. So bringing together the data, but also the tool set, so that data scientists can actually act as a team and collaborate and spend less of their time finding the data, and more of their time providing the models. And I said this morning, last but not least, the operations manager. This is really, really, really important. And a lot of times, especially geeks like myself, are just, ah, operations guys are just a pain in the neck. Really, really, really important. We've got data that we've never thought of. Making sure that it's secured properly, making sure that we're managing within the regulations of privacy requirements, making sure that we're governing it and making sure how that data is used, alongside our corporate mission, is really important. So creating that tool set so that the operations manager can be confident in turning these massive files of data over to the business analyst and to the data scientist, and be confident that the company's mission and the regulations that they're working within in those jurisdictions are all in compliance. And so that's what we're building on, and that stack, of course, is built on open source Apache Atlas and open source Apache Ranger, and it really makes for an enterprise-grade experience. >> And a couple things to follow on to that: we've heard of this notion for years that there is a shortage of data scientists, and now it's such a core strategic enabler of business transformation. Is this collaboration, this team support that was talked about earlier, is this helping to spread data science across these personas, to enable more of them to become data scientists? >> Yeah, I think there are two aspects to it, right? One is certainly really great data scientists are hard to find; they're scarce. They're unique creatures. And so, to the extent that we're able to combine the tool set to make the data scientists that we have more productive, I think the numbers are astronomical, right? You could argue that, with the wrong tool set, a data scientist might spend 80% or 90% of his or her time just finding the data and only 10% working on the problem. If we can flip that around and make it 10% finding the data and 90% working on the problem, that's, like, an order of magnitude more breadth of data science coverage that we get from the same pool of data scientists, so I think that from an efficiency perspective, that's really huge. The second thing, though, is that by looking at these personas and the tools that we're rolling out, can we start to package up things that the data scientists are learning and move those models onto the business analyst's desktop. So, now, not only is there more breadth and depth of data, but frankly, there's more depth and breadth of models that can be run and inferred with traditional business process, which means turning that into better decision making, turning that into better value for the business, just kind of happens automatically. So, you're leveraging the value of data scientists. >> Let me follow that up, Scott. So, right now the biggest time sink for the data scientist or the data engineer is data cleansing and transformation.
Where do the cloud vendors fit in, in terms of having trained some very broad horizontal models for vision, natural language understanding, text to speech, where they have accumulated a lot of data assets and then created models that were trained and could be customized? Do you see a role for, not just these general vision- and language-related models coming from the cloud vendors, but for other vendors who have data assets to provide more fully baked models, so that you don't have to start from scratch? >> Absolutely. So, one of the things that I talked about also this morning is this notion, and I said it this morning, kind of open: where open community, open source, and open ecosystem, I think it's now open to the third power, right, and it's talking about open models and algorithms. And I think all of those same things are really creating a tremendous opportunity, the likes of which we've not seen before, and I think it's really driving the velocity in the market, right. Because we're collaborating in the open, things just get done faster and more efficiently, whether it be in the core open source stuff or whether it be in the open ecosystem, being able to pull tools in. Of course, the announcement earlier today, with IBM's Data Science Experience software as a framework for the data scientists to work as a team, but that thing in and of itself is also very open. You can plug in Python, you can plug in open source models and libraries, some of which were developed in the cloud and published externally. So, it's all about continued availability of open collaboration that is the hallmark of this wave of technology. >> Okay, so we have this issue of how much we can improve the productivity with better tools or with some amount of data. But then, the part that everyone's also pointing out, besides the cloud experience, is also the ability to operationalize the models and get them into production, either in bespoke apps or packaged apps. How's that going to sort of play out over time? >> Well, I think two things you'll see. One, certainly in the near term, again, with our collaboration with IBM and the Data Science Experience. One of the key things there is not only, not just making the data scientists be able to be more collaborative, but also the ease with which they can publish their models out into the wild. And so, kind of closing that loop to action is really important. I think, longer term, what you're going to see, and I gave a hint of this a little bit in my keynote this morning, is, I believe in five years, we'll be talking about scalability, but scalability won't be the way we think of it today, right? Oh, I have this many petabytes under management. That's upkeep. But truly, scalability is going to be how many connected devices do you have interacting, and how many analytics can you actually push, from a model perspective, out to the sensor or out to the device to run locally. Why is that important? Think about it as a consumer with a mobile device. The time of interaction, your attention span, do you get an offer at the right time, and is that offer relevant? It can't be rules-based; it has to be model-based. There's no time for the electrons to move from your device across a power grid, run an analytic, and have it come back. It's going to happen locally. So scalability, I believe, is going to be determined in terms of the CPU cycles and the total interconnected IoT network that you're working in. What does that mean from your original question?
That means applications have to be portable, models have to be portable, so that they can execute out to the edge where it's required. And so that's, obviously, part of the key technology that we're working with in Hortonworks DataFlow, and the combination of Apache NiFi and Apache Kafka and Storm, to really combine that: how do I manage, not only data in motion, but ultimately, how do I move applications and analytics to the data, and not be required to move the data to the analytics? >> So, question for you. You talked about real-time offers, for example. We talk a lot about predictive analytics, advanced analytics, data wrangling. What are your thoughts on preemptive analytics? >> Well, I think that, while that sounds a little bit spooky, because we're kind of mind reading, I think those things can start to exist. Certainly because we now have access to all of the data, and we have very sophisticated data science models that allow us to understand and predict behavior, yeah, the timing of real-time analytics or real-time offer delivery could actually, from our human perception, arrive before I thought about it. And isn't that really cool, in a way? I'm thinking about, I need to go do X, Y, Z. Here's a relevant offer, boom. So it's no longer, I clicked here, I clicked here, I clicked here, and in five seconds I get a relevant offer, but before I even thought to click, I got a relevant offer. And again, to the extent that it's relevant, it's not spooky. >> Right. >> If it's irrelevant, then you deal with all of the other downstream impact. So that, again, points to more and more and more data, and more and more and more accurate and sophisticated models, to make sure that that relevance exists. >> Exactly. Well, Scott Gnau, CTO of Hortonworks, thank you so much for stopping by The Cube once again. We appreciate your conversation and insights. And for George Gilbert, I am Lisa Martin. You're watching The Cube live, from day one of the DataWorks Summit in the heart of Silicon Valley. Stick around, though, we'll be right back.
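
Scott's closing argument, that analytics must travel to the sensor or device rather than data traveling to the analytics, comes down to model portability. The sketch below illustrates that idea in the simplest possible terms with scikit-learn and joblib; it is a hedged stand-in for the pattern, not Hortonworks DataFlow or any Apache NiFi/Kafka/Storm pipeline, and the file name model.joblib is invented.

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Train centrally, where the data and compute live.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Serialize the model; this artifact is what gets pushed to the edge.
    joblib.dump(model, "model.joblib")

    # --- on the device: score locally, no round trip to the data center ---
    local_model = joblib.load("model.joblib")
    print(local_model.predict(X[:5]))

In a production edge deployment the serialization format, transport, and runtime would all differ, but the sequencing (train where the data is, execute where the decision is made) is exactly what the interview describes.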

Published Date : Jun 13 2017


Seth Dobrin, IBM - IBM Interconnect 2017 - #ibminterconnect - #theCUBE


 

>> Announcer: Live from Las Vegas, it's theCUBE, covering InterConnect 2017. Brought to you by IBM. >> Okay, welcome back everyone. We are here live in Las Vegas from Mandalay Bay for IBM InterConnect 2017. This is theCUBE's three-day coverage of IBM InterConnect. I'm John Furrier with my co-host Dave Vellante. Our next guest is Seth Dobrin, Vice President and Chief Data Officer for IBM Analytics. Welcome to theCUBE, welcome back. >> Yeah, thanks for having me again. I love sittin' down and chattin' with you guys. >> You're a CDO, Chief Data Officer, and that's a really pivotal role, because you've got to look, as a chief, over all of the data with IBM Analytics. Also you have customers you're delivering a lot of solutions to, and it's cutting edge. I liked the keynote on day one here. You had Chris Moody at Twitter. He's a data guy. >> Seth: Yep. >> I mean, you guys have a deal with Twitter, so you've got more data. You've got the Weather Company, you got that data set. You have IBM customer data. You guys are full of data right now. >> We're bursting at the seams with data, and that's a good thing. >> So what's the strategy, and what are you guys working on, and what are the key points that you guys are honing in on? Obviously, Cognitive to the Core is Rometty's theme. How are you guys making data work for IBM and your customers? >> If you think about IBM Analytics, we're really focusing on five key areas, five things that we think, if we get right, will help our clients learn how to drive their business and data strategies right. One is around how do I manage data across hybrid environments? So what's my hybrid data management strategy? It used to be how do I get to public cloud, but really what it is, it's a conversation about, every enterprise has their business-critical assets, what people call legacy. If we call them business-critical and we think about, these are how companies got here today. This is what they make their money on today. The real challenge is how do we help them tie those business-critical assets to their future-state cloud, whether it's public cloud, private cloud, or something in between, a hybrid cloud. So one of the key strategies for us is hybrid data management. Another one is around unified governance. If you look at governance in the past, governance was an inhibitor. It was something that people went (groan) "Governance, so I have to do it." >> John: Barbed wire. >> Right, you know. When I've been at companies before, and thought about building a data strategy, we spent the first six months of building the data strategy trying to figure out how to avoid data governance, or the word data governance, and really, we need to embrace data governance as an enabler. If you do it right, if you do it upfront, if you wrap in things that include model management; how do I make sure that my data scientists can get to the data they need upfront by classifying data ahead of time; understanding entitlements; understanding what the intent was when people gave consent. You also take out of the developers' hands the need to worry about governance, because now, in a unified governance platform, right, it's all API-driven. Just like our applications are all API-driven, how do we make our governance platform API-driven? If I'm an application developer, by the way, I'm not, I can now call an API to manage governance for me, so I don't need to worry about, am I giving away the shop? Am I going to get the company sued? Am I going to get fired? Now I'm calling an API.
That's only two of them, right? The third one is really around data science and machine learning. So how do we make machine learning pervasive across enterprises, with things like Data Science Experience and Watson Machine Learning? We're now bringing that machine-learning capability to the private cloud, right, because 90% of the data that exists can't be Googled; it's behind firewalls. How do we bring machine learning to that? >> One more! >> One more! That's around, God, I gave you quite a list-- >> Hybrid data management, unified governance, data science and machine learning-- >> Oh, the other one is open source, our commitment to open source: Hadoop, Spark. As we think about unified governance, a truly unified governance platform needs to be built on top of open source, so IBM is doubling down on our commitment to Apache Atlas as a framework backbone, a metadata framework for our unified governance platform. >> Wait, did we miss one? Hybrid data management, unified governance, data science and machine learning (talking over another), pervasive, and open source. >> That's four. >> I thought it was five. >> No. >> Machine learning and data science are two, so technically five. >> There's only four. If I said five, there's only four. >> Cover the data governance thing, because this unification is interesting to me, because one of the things we see in the marketplace is people hungry for data ops. Like what DevOps was for cloud, a whole application developer model, where a new developer persona is emerging: I want to code, and I want to just tap data, handled by brilliant people or cognitive engines that just serve me up what I need, like a routine or a procedure, or a subroutine, whatever you want to call it. That's a data DevOps model kind of thing. How will you guys do it? Do you agree with that, and how does that play out? >> That's a combination, in my mind; it's a combination of an enterprise creating data assets, so treating data as the asset it is and not the digital droppings of applications, and it's that combined with metadata. It gets back to the Apache Atlas conversation. If you want to understand your data and know where it is, it's a metadata problem. What's the data; what's the lineage; where is it; where does it live; how do I get to it; what can I, can't I do with it. And so that just reinforces the need for an open source, ubiquitous metadata catalog, a single catalog, and then a single catalog of policies associated with that, all driven in a composable way through APIs. >> That's a fundamental, cultural thinking shift, because you're saying, "I don't want to just take exhaust from apps, which is just how people have been dealing with data." You're saying, "Get holistic and say you need to create an asset class or layer or something that is designed." >> If enterprises are going to be successful with data, now we're getting to five things, right, so there's five things. They need to treat data as an asset. It's got to be a first-class citizen, not digital droppings, and they need a strategy around it. So, conceptually, what are the pieces of data that I care about? My customers, my products, my talent, my finances; what are the limited number of things? What is my data science strategy? How do I build deployable data science assets? I can't be developing machine-learning models and deploying them in Excel spreadsheets. They have to be integrated into my processes.
I have to have a cloud strategy: am I going to be on premise? Am I going to be off premise? Am I going to be something in between? I have to get back to unified governance; I have to govern it, right? Governing in a single place is hard enough, let alone multiple places. And then there's my talent piece. >> Could you peg a progress bar of where the industry is on these things you just said? Because, I think-- >> Dave: Again, we only got through four. >> No, talent was the last one. >> Talent, sorry, missed it. >> In the progress bar of work, how are the enterprises doing right now? 'Cause actually the big conversation on the cloud side is enterprise-readiness, enterprise-grade; that's kind of an ongoing conversation. But now, if you take your premise, which I think is accurate, that I've got to have a centralized data strategy and platform, not a data (mumbles), more than that, software, et cetera, where's the progress bar? Where are people beginning? >> I think they are all over the map. I've only been with IBM for four months, and I've been spending much of that time literally traveling around the world talking to clients, and clients are all over the map. Last week I spent a week in South America with a media company, a cable company down there. Before setting up the meeting, the guy was like, "Well, you know, we're not that far along down this journey," and I was like, "Oh, my God, you guys are like so far ahead of everyone else! That's not even funny!" And then I'm sitting down with big banks that think they're like way out there, and they haven't even started on the journey. So it's really literally all over the place, and it's even within industry. There's financial companies that are also way out there. There's another bank in Brazil that uses biometrics to access ATMs; you don't need a PIN anymore. They have analytics that drive all that. That's crazy. We don't have anything like that here. >> Are you meeting with CDOs? >> Yeah, mostly CDOs, or kind of de facto CDOs, like we talked about before the show. Mostly CDOs. >> So you may be unique in the sense that you are working for a technology company, so a lot of your time is outward-focused, but when you travel around and meet with the CDOs, how much of their time is inward-focused versus outward-focused? >> My time is actually split between inward and outward focus, because part of my time is transforming our own business using data and analytics, because IBM is a company and we've got to figure out how to do that. >> Is it correct that yours is probably a higher percentage outward? >> Mine's probably a higher percentage outward than most CDOs, yeah. So I think most CDOs are 70 to 80% inward-focused and 20% outward-focused, and a lot of that outward focus is just trying to understand what other people are doing. >> I guess it's okay for now, but will that change over time? >> I think that's about right. It gets back to the other conversation we had before the show about your monetization strategy. I think a company progresses to where it's no longer about how do I change my processes and use data to monetize my internal processes. If I'm going to start figuring out how I sell data, then CDOs need to get a more external--
>> Yeah, and I think it's important when I talk about data governance, I think things that we used to talk about is data management is all part of data governance. Data governance is not just controlling. It's all of that. It's how do I understand my data, how do I provide access to my data. It's all those things you need to enable your business to thrive on data. >> My question for you is a personal one. How did you get to be a CDO? Do you go to a class? I'm going to be a CDO someday. Not that you do that, I'm just-- >> CDO school. >> CDO school. >> Seth: I was staying in a Holiday Express last night. (laughing) >> Tongue in cheek aside, people are getting into CDO roles from interesting vectors, right? Anthropology, science, art, I mean, it's a really interesting, math geeks certainly love, they thrive there, but there's not one, I haven't yet seen one sweet spot. Take us through how you got into it and what-- >> I'm not going to fit any preconceived notion of what a CDO is, especially in a technology company. My background is in molecular and statistical genetics. >> Dave: Well, that explains it. >> I'm a geneticist. >> Data has properties that could be kind of biological. >> And actually, if you think about the routes of big data and data science, or big data, at least, the two of the predative, they're probably fundamental drivers of the concept of big data were genetics and astrophysics. So 20 years ago when I was getting my PhD, we were dealing with tens and hundreds of gigabyte-sized files. We were trying to figure out how do we get stuff out of 15 Excel files because they weren't big enough into a single CSV file. Millions of rows and millions of crude, by today's standard, but it was still, how do we do this, and so 20 years ago I was learning to be a data scientist. I didn't know it. I stopped doing that field and I started managing labs for a while and then in my last role, we kind of transformed how the research group within that company, in the agricultural space, handled and managed data, and I was simultaneously the biggest critic and biggest advocate for IT, and they said, "Hey, come over and help us figure out how to transform "the company the way we've transformed this group." >> It's looks like when you talk about your PhD experience, it's almost like you were so stuck in the mud with not having to compute power or sort of tooling. It's like a hungry man saying "Oh, it's an unlimited "abundance of compute, oh, I love what's going on." So you almost get gravitated, pulled into that, right? >> It was funny, I was doing a demo upstairs today with, one of the sales guys was doing a demo with some clients, and in one line of code, they had expressed what was part of my dissertation. It was a single line of code in a script and it was like, that was someone's entire four-year career 20 years ago. >> Great story, and I think that's consistent with just people who just attracted to it, and they end up being captains of industry. This is a hot field. You guys have a CDO of that happening in San Francisco. We'll be doing some live streaming there. What's the agenda because this is a very accelerating field? You mentioned now dealing practically with compliance and governance, which is you'd run in the other direction in the old days, now this embracing that. It's got to get (mumbles) and discipline in management. What's going to go on at CDO Summit or do you know? >> At the CDO Summit next week, I think we're going to focus on three key areas, right? 
What does a cloud journey look like? Maybe four key areas, right. So a cloud journey; how do you monetize data, and what does that even mean; and talent. So at all these CDO Summits, and the IBM CDO Summits have been going on for three or four years now, every one of them has a talent conversation; and then governance. I think those are four key concepts, and not surprisingly, they were four of the five on my list. I think that's what we're really going to talk about. >> The unified governance, tell us how that happens in your vision, because that's something... you hear unified identity, we hear blockchain looking at a whole new disruptive way of dealing with value digitally. How do you see the data governance thing unifying? >> Well, I think again, it's around... IBM did a great job of figuring out how to take an open source product, that was Spark, and make it the heart of our products. It's going to be the same thing with governance. Apache Atlas is in its infancy right now; having that open backbone means that people can get in and out of it easily. If you're going to have a unified governance platform, it's going to be open by definition, because I need to get other people's products on there. I can't go to an enterprise and say, we're going to sell you a unified governance platform, but you've got to buy all IBM, or you've got to spend two years doing development work to get it on there. So open is the framework, and composable, API-driven, and proactive are, I think, kind of the key pieces for it. >> So we all remember the client-server days, where it took a decade and a half to realize, "Oh, my gosh, this is out of control, and we need to bring it back in." And the Wild West days of big data; it feels like enterprises have nipped that governance issue in the bud at least, maybe they don't have it under control yet, but they understand the need to get it under control. Is that a fair statement? >> I think they understand the need. The data is so big and grows so fast that another component, I didn't mention it, maybe it was implied a little bit, is automation. You need to be able to capture metadata in an automated fashion. We were talking to a client earlier who has 400 terabytes a day of data changes, not even talking about what new data they are ingesting. How do they keep track of that? It's got to be automated. This unified governance: you need to capture this metadata in as automated a fashion as possible. Master data needs to be automated when you think about-- >> And make it available in real time, low-latency, because otherwise it becomes a data swamp. >> Right, it's got to be proactive, real-time, on-demand. >> Another thing I wanted to ask you, Seth, and get your opinion on, is sort of the mid-2000s, when the federal rules of civil procedure changed and electronic documents and records became admissible. It was always about how do I get rid of data, and that's changed. Everybody wants to keep data and analyze it, and so forth, so what about that balance? And one of the challenges back then was data classification. I can't scale my governance, I can't eliminate and defensibly delete data, unless I can classify it. Is the analog true where, with data as an opportunity, I can't do a good job, or a good enough job, analyzing my data and keeping my data under control without some kind of automated classification? And has the industry solved that?
>> I don't think the industry has completely solved it yet, but I think with cognitive tools, there's tools out there, that we have, that other people have, that, if you give them parameters and train them, can automatically classify the data for you, and I think classification is one of the keys. You need to understand how the data's classified so you understand who can access it, how long you should keep it, and so it's key, and that's got to be automated also. I think we've done a fair job as an industry of doing that. There's still a whole lot of work, especially as you get into the kind of specialized sectors, and so I think that's a key, and we've got to do a better job of helping companies train those things so that they work. I'm a big proponent of, don't give your data away to IT companies. It's your asset. Don't let them train their models with your data and sell it to other people, but there are some caveats there. There are some core areas where industries need to get together and let IT companies, whether it's IBM or someone else, train models for things just like that, for classification, because if someone gets it wrong, it can bring the whole industry down. >> It's almost an open (talking over each other) source paradigm, almost. It's like open source software. Share some data, but I-- >> Right, and there are some key things that aren't differentiating that, as an industry, you should get together and share. >> You guys are making, IBM is making a big deal out of this, and I think it's super important. I think it's probably the top thing that CDOs and CIOs need to think about right now: if I really own my data, and that data is needed to train my big data models, who owns the models, and how do I protect my IP? >> And are you selling it to my competitors? Are you going down the street and taking away my IP, my differentiating IP, and giving it to my competitor? >> So do I own the model, 'cause the data and models are coming together, and that's what IBM's telling me. >> Seth: Absolutely. >> I own the data and the models that it informs, is that correct? >> Yeah, that's absolutely correct. You guys made the point earlier about IBM bursting at the seams on data. That's really the driver for it. We need to do a key set of training. We need to train our models with content for industries, bring those trained models to companies, and let them train specific versions for their company with their data; data that, unless there's a reason they tell us to move it, is never going to leave their company. >> I think that's a great point about you being full of data, because a lot of people who are building solutions and scaffolding for data, aka software, have never been data-full. The typical, "Oh, I'm going to be a software company," and they build something that they don't (mumbles) for. You're data-full, so you know the problem. You're living it every day. It's an opportunity. >> Yeah, and that's why, when a startup comes to you and says, "Hey, we have this great AI algorithm. Give us your data," they want to resell that model, because they don't have access to the content. If you look at what IBM's done with Watson, right? That's why there are specialized verticals that we're focusing Watson on, Watson Health, Watson Financial, because we are investing in data in those areas; you can look at the acquisitions we've done, right. We're investing in data to train those models. >> We should follow up on this, because this brings up the whole scale point.
If you look at all the innovators of the past decade, even two decades, Yahoo, Google, Facebook, these are companies that were webscalers before there was anything that they could buy. They built their own because they had their own problem at scale. >> At scale. >> And data at scale is a whole other mind-blowing issue. Do you agree? >> Absolutely. >> We're going to put that on the agenda for the CDO Summit in San Francisco next week. Seth, thanks so much for joining us on theCube. Appreciate it; Chief Data Officer, this is going to be a hot field. The CDO is going to be a very important opportunity for anyone watching in the data field. This is going to be new opportunities. Get that data, get it controlled, taming the data, making it valuable. This is theCUBE, taming all of the content here at InterConnect. I'm John Furrier with Dave Vellante. More content coming. Stay with us. Day Two coverage continues. (innovative music tones)
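
One concrete thread running through Seth's answers is that governance only scales if metadata capture and classification are automated. The toy sketch below shows the flavor of that automation in pandas; the regex patterns and the classify_columns helper are hypothetical, and far cruder than the cognitive tools he alludes to.

    import re
    import pandas as pd

    # Toy rules: tag columns whose sampled values all look like emails or phones.
    PATTERNS = {
        "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    }

    def classify_columns(df: pd.DataFrame, sample: int = 100) -> dict:
        tags = {}
        for col in df.columns:
            values = df[col].dropna().astype(str).head(sample)
            for label, pattern in PATTERNS.items():
                if len(values) and all(pattern.match(v) for v in values):
                    tags[col] = label
                    break
            else:
                tags[col] = "unclassified"
        return tags

    df = pd.DataFrame({
        "contact": ["a@b.com", "c@d.org"],
        "phone":   ["555-123-4567", "555-987-6543"],
        "notes":   ["hello", "world"],
    })
    print(classify_columns(df))
    # {'contact': 'email', 'phone': 'phone', 'notes': 'unclassified'}

Real platforms layer trained models, lineage capture, and policy APIs on top of this basic move: inspect the values, attach a classification, and let downstream governance decisions key off the tag.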

Published Date : Mar 22 2017


Nenshad Bardoliwalla, Paxata - #BigDataNYC 2016 - #theCUBE


 

>> Voiceover: Live from New York, it's The Cube, covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to New York City, everybody. Nenshad Bardoliwalla is here; he's the co-founder and chief product officer at Paxata, a company that, I want to say three years ago, came out of stealth on The Cube. >> October 27, 2013. >> Right, and we were at the Warwick Hotel, across the street from the Hilton. Yeah, Prakash came on The Cube and came out of stealth. Welcome back. >> Thank you very much. >> Great to see you guys. Taking the world by storm. >> Great to be here, and of course, Prakash sends his apologies. He couldn't be here, so he sent his stunt double. (Dave and George laugh) >> Great, so give us the update. What's the latest? >> So there are a lot of great things going on in our space. The thing that we announced here at the show is what we're calling Paxata Connect, OK? Just in the same way that we created the self-service data preparation category, and now there are 50 companies that claim they do self-service data prep, we are moving the industry to the next phase of what we are calling our business information platform. Paxata Connect is one of the first major milestones in getting to that vision of the business information platform. What Paxata Connect allows our customers to do is, number one, to have visual, completely declarative, point-and-click browsing access to a variety of different data sources in the enterprise. For example, we are the only company that we know of that supports connecting to multiple, simultaneous, different Hadoop distributions in one system. So a Paxata customer can connect to MapR, they can connect to Hortonworks, they can connect to Cloudera, and they can federate across all of them, which is a very powerful aspect of the system. >> And part of this involves, when you say declarative, it means you don't have to write a program to retrieve the data. >> Exactly right. Exactly right. >> Is this going into HDFS, into Hive, or? >> Yes, it is. In fact, this multi-source Hadoop capability is one part of Paxata Connect. The second is, as we've moved into this information platform world, our customers are telling us they want read-write access to more than just Hadoop. Hadoop is obviously a very important part, but we're actually supporting NoSQL data sources like Cloudant and MongoDB for read and write, and we're supporting, for the first time, write to relational databases; we already supported read, but now we actually support write as well. So Paxata is really becoming kind of this fabric, a business-centric information fabric, that allows people to move data from anywhere to any destination, and transform it, profile it, explore it along the way. >> Excellent. Let's get into some of the use cases. >> Yeah, tell us where the banks are. The sense at the conference is that everyone has sort of got their data lakes, to some extent, up and running. Now where are they pushing to go next? >> Sure, that's an excellent question. So we have really focused on the enterprise segment, as you know. So among the customers that are working with Paxata, from an industry perspective, banking is, of course, a very important one; we were really proud to share the stage yesterday with both Citi and Standard Chartered Bank, two of our flagship banking customers.
But Paxata is also heavily used in the United States government, in the intelligence community, I won't say any more about that. It's used heavily in retail and consumer products, it's used heavily in the high-tech space, it's used heavily by data service providers, that is, companies whose entire business is based on data. But to answer your question specifically, what's happening in the data lake world is that a lot of folks, the early adopters, have jumped onto the data lake bandwagon. So they're pouring terabytes and petabytes of data into the data lake. And then the next question the business asks is, OK, now what? Where's the data, right? One of the simplest use cases, but actually one that's very pervasive for our customers, is they say, "Look, we don't even know, "our business people, they don't even know "what's in Hadoop right now." And by the way, I will also say that the data lake is not just Hadoop, but Amazon S3 is also serving as a data lake. The capabilities inside Microsoft's cloud are also serving as a data lake. Even the notion of a data lake is becoming this sort of polymorphic distributed thing. So what they do is, they want to be able to get what we like to say is first eyes on data. We let people with Paxata, especially with the release of Connect, to just point and click their way and to actually explore the data in all of the native systems before they even bring it in to something like Paxata. So they can actually sneak preview thousands of database tables or thousands of compressed data sets inside of Amazon S3, or thousands of data sets inside of Hadoop, and now the business people for the first time can point and click and actually see what is in the data lake in the first place. So step number one is, we have taken the approach so far in the industry of, there have been a lot of IT-driven use cases that have motivated people to go to the data lake approach. But now, we obviously want to show, all of our companies want to show business value, so tools and platforms like Paxata that sit on top of the data lake, that can federate across multiple data lakes and provide business-centric access to that information is the first significant use case pattern we're seeing. >> Just a clarification, could there be two roles where one is for slightly more technical business user exposes views summarizing, so that the ultimate end user doesn't have to see the thousands of tables? >> Absolutely, that's a great question. So when you look at self-service, if somebody wants to roll out a self-service strategy, there are multiple roles in an organization that actually need to intersect with self-service. There is a pattern in organizations where people say, "We want our people to get access to all the data." Of course it's governed, they have to have the right passwords and SSO and all that, but they're the companies who say, yes, the users really need to be able to see all of the data across these different tables. But there's a different role, who also uses Paxata extensively, who are the curators, right? 
These are the people who say, look, I'm going to provision the raw data, provide the views, provide even some normalization or transformation, and then land that data back into another layer. People talk about the data layers: they go from layer zero to layer one to layer two, different directory structures, but the point is, there's a natural processing frame that they're going through with their data, and then, from the curated data that's created by the data stewards, the analysts can go pick it up. >> One of the other big challenges that our research is showing, that chief data officers express, is that they get this data in the data lake. So they've got the data sources, you're providing access to it; the other piece is they want to trust that data. There's obviously a governance piece, but then there's a data quality piece. Maybe you could talk about that? >> Absolutely. So use case number one is about access. The second reason that people are not so... So, why are people doing data prep in the first place? They are trying to make information-driven decisions that actually help move their business forward. So if you look at researchers from firms like Forrester, they'll say there are two reasons that slow down the latency of going from raw data to decision. Number one is access to data. That's the use case we just talked about. Number two is the trustworthiness of data. Our approach is very different on that. Once people actually can find the data that they're looking for, the big paradigm shift in the self-service world is that, instead of trying to process data based on transforming the metadata attributes, like, I'm going to draw on a workflow diagram, bring in this table, aggregate with this operator, then split it this way, filter it, which is the classic ETL paradigm, the, I don't want to say profound, but maybe very obvious thing we did was to say, "What if people could actually look at the data in the first place?" >> And sort of program it by example? >> We can tell, that's right. Because our eyes can tell us, our brains help us to say, we can immediately look at a data set, right? You look at an age column, let's say. There are values in the age column of 150 years. Maybe 20 years from now there may be someone on Earth who lives to 150 years. But pretty much -- >> Highly unlikely. >>
Now, what happens in terms of governance, or more importantly, just trust, when the pipeline, you know, has to go beyond where you're working on it, to some of the analytics or some of the basic ingest? To say, "I know this data came from here "and it's going there." >> That's right, how do we verify the fidelity of these data sources? It's a fantastic question. So, in my career, having worked in BI for a couple of decades, I know I look much younger but it actually has been a couple of decades. Remember, the camera adds about 15 pounds, for those of you watching at home. (Dave and George laugh) >> George: But you've lost already. >> Thank you very much. >> So you've lost net 30. (Nenshad laughs) >> Or maybe I'm back to where I'm supposed to be. What I've seen as the two models of governance in the enterprise when it comes to analytics and information management, right? There's model one, which is, we're going to build an enterprise data warehouse, we're going to know all the possible questions people are going to ask in advance, we're going to preprogram the ETL routines, we're going to put something like a MicroStrategy or BusinessObjects, an enterprise-reporting factory tool. Then you spend 10 million dollars on that project, the users come in and for the first time they use the system, and they say, "Oh, I kind of want to change this, this way. "I want to add this calculation." It takes them about five minutes to determine that they can't do it for whatever reason, and what is the first feature they look for in the product in order to move forward? Download to Excel, right? So you invested 15 million dollars to build a download to Excel capability which they already had before. So if you lock things down too much, the point is, the end users will go around you. They've been doing it for 30 years and they'll keep doing it. Then we have model two. Model two is, Excel spreadsheet. Excel Hell, or spreadmarts. There are lots of words for these things. You have a version of the data, you have a version of the data, I have a version of the data. We all started from the same transactional data, yet you're the head of sales, so suddenly your forecast looks really rosy. You're the head of finance, you really don't like what the forecast looks like. And I'm the product guy, so why am I even looking at the forecast in the first place, but somehow I got access to the data, right? These are the two polarities of the enterprise that we've worked with for the last 30 years. We wanted to find sort of a middle path, which is to say, let's give people the freedom and flexibility to be able to do the transformations they need to. If they want to add a column, let them add a column. If they want to change a calculation, let them add a a calculation. But, every single step in the process must be recorded. It must be versioned, it must be auditable. It must be governed in that way. So why the large banks and the intelligence community and the large enterprise customers are attracted to Paxata is because they have the ability to have perfect retraceability for every decision that they make. I can actually sit next to you and say, "This is why the data looks like this. "This is how this value, which started at one million, "became 1.5 million." That covers the Paxata part. But then the answer to the question you asked is, how do you even extend that to a broader ecosystem? 
I think that's really about the metadata interchange initiatives that many vendors, in the Hadoop space but also in the traditional enterprise space, have had going for years. If you look at something like Apache Atlas or Cloudera Navigator, those are systems designed to collect, aggregate, and connect these different metadata steps so you can see the end-to-end flow: this is the raw data that got ingested into Hadoop; these are the transformations the end user did in Paxata to make it ready for analytics; this is how it's being consumed in something like Zoomdata. You have the entire life cycle of the data manifested as a software asset.

>> So, in other words, they're not just managing within the perimeter of Hadoop. They are managers of managers.

>> That's right, that's right. The data is coming from anywhere, and it's going anywhere. Then you can add another dimension of complexity: it's not just one Hadoop cluster, it's 10 Hadoop clusters. And of those 10 clusters, three are in Amazon, four are in Microsoft, and three are in Google Cloud Platform. How do you know what people are doing with data then?

>> How is this all presented to the user? What does the user see?

>> Great question. The trick to all of this, to self-service, is to know very clearly who the person is you're trying to serve: what are their technical skills and capabilities, and how can you get them productive as fast as possible? When we created this category, our key notion was that we were going after analysts. That's a very generic term, because in some sense we're all analysts in our day-to-day lives. But in Paxata, a business analyst, in an enterprise organizational context, is somebody who can use Microsoft Excel; they have to have that skill or they won't be successful with today's Paxata. They have to know what a VLOOKUP is, because a VLOOKUP is a way to pull data from a second data source into one; we would all know that as a join or a lookup. And the third thing is, they have to know what a pivot table is and how a pivot table works. The key insight we had is that, of the hundreds of millions of people who use Excel on a day-to-day basis, a lot of their work is data prep, and Excel, being an amazing generic tool, is actually quite bad at data prep. So when a customer asks, "Are we a good candidate to use Paxata?" and we're talking to the actual person who's going to use the software, I say, "Do you know what a VLOOKUP is, yes or no? Do you know what a pivot table is, yes or no?" If they have those skills, Paxata was designed to be very attractive to them. It's completely point-and-click, completely visual, completely interactive. There's no scripting in the whole process, because do you think the average Excel analyst wants to script, or wants to use a proprietary wrangling language? I'm sorry, but analysts don't want to wrangle. Data scientists, the 1% of the 1%, maybe they like to wrangle, but that's not the broader analyst community, and the broader community is the much larger market opportunity we've targeted.

>> Well, very large. I mean, a lot of people are familiar with those concepts in Excel, and if they're not, they're relatively easy to learn.

>> Nenshad: That's right.
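For readers coming from the Excel side, the two skills named above map directly onto standard data-prep operations: a VLOOKUP is a join against a second data source, and a pivot table is a grouped aggregation. A small illustrative sketch, with tables and column names invented for the example:

```python
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region_code": ["NA", "EU", "NA", "APAC"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})
regions = pd.DataFrame({
    "region_code": ["NA", "EU", "APAC"],
    "region_name": ["North America", "Europe", "Asia Pacific"],
})

# VLOOKUP: pull a value from a second source by key, i.e. a left join.
enriched = sales.merge(regions, on="region_code", how="left")

# Pivot table: aggregate amounts by region.
summary = enriched.pivot_table(index="region_name", values="amount",
                               aggfunc="sum")
print(summary)
```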
>> Excellent. All right, Nenshad, we have to leave it there. Thanks very much for coming on The Cube, appreciate it.

>> Thank you very much for having me.

>> Congratulations on all the success.

>> Thank you.

>> All right, keep it right there, everybody. We'll be back with our next guest. This is The Cube, live from New York City at Big Data NYC. We'll be right back. (electronic music)

Published Date: Sep 30, 2016

ENTITIES

Entity | Category | Confidence
Citi | ORGANIZATION | 0.99+
October 27, 2013 | DATE | 0.99+
George | PERSON | 0.99+
George Gilbert | PERSON | 0.99+
Nenshad | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
Dave Vellante | PERSON | 0.99+
Prakash | PERSON | 0.99+
Dave | PERSON | 0.99+
New York City | LOCATION | 0.99+
Nvidia | ORGANIZATION | 0.99+
Cisco | ORGANIZATION | 0.99+
Earth | LOCATION | 0.99+
15 million dollars | QUANTITY | 0.99+
two | QUANTITY | 0.99+
30 years | QUANTITY | 0.99+
Forrester | ORGANIZATION | 0.99+
Excel | TITLE | 0.99+
thousands | QUANTITY | 0.99+
50 companies | QUANTITY | 0.99+
10 million dollars | QUANTITY | 0.99+
Standard Chartered Bank | ORGANIZATION | 0.99+
New York City | LOCATION | 0.99+
Nenshad Bardoliwalla | PERSON | 0.99+
two reasons | QUANTITY | 0.99+
one million | QUANTITY | 0.99+
Microsoft | ORGANIZATION | 0.99+
Amazon | ORGANIZATION | 0.99+
first | QUANTITY | 0.99+
two roles | QUANTITY | 0.99+
two polarities | QUANTITY | 0.99+
1.5 million | QUANTITY | 0.99+
Hortonworks | ORGANIZATION | 0.99+
150 years | QUANTITY | 0.99+
Hadoop | TITLE | 0.99+
Paxata | ORGANIZATION | 0.99+
second reason | QUANTITY | 0.99+
One | QUANTITY | 0.99+
two models | QUANTITY | 0.99+
second | QUANTITY | 0.99+
one | QUANTITY | 0.99+
yesterday | DATE | 0.99+
Both | QUANTITY | 0.99+
three years ago | DATE | 0.99+
first time | QUANTITY | 0.98+
first time | QUANTITY | 0.98+
New York | LOCATION | 0.98+
both | QUANTITY | 0.98+
1% | QUANTITY | 0.97+
third thing | QUANTITY | 0.97+
one system | QUANTITY | 0.97+
about five minutes | QUANTITY | 0.97+
Paxata | PERSON | 0.97+
first feature | QUANTITY | 0.97+
Data | LOCATION | 0.96+
one part | QUANTITY | 0.96+
United States government | ORGANIZATION | 0.95+
thousands of tables | QUANTITY | 0.94+
20 years | QUANTITY | 0.94+
Model two | QUANTITY | 0.94+
10 Hadoop clusters | QUANTITY | 0.94+
terabytes | QUANTITY | 0.93+