Cloud First – Data-Driven Reinvention | Drew Allan | Cloudera 2021
>>Okay. Now we're going to dig into the data landscape and cloud, of course, and talk a little bit more about that with Drew Allan. He's a managing director at Accenture. Drew, welcome. Great to see you. Thank you. So let's talk a little bit about this. You've been in this game for a number of years, and you've got particular expertise in data and in finance and insurance. Within the data and analytics world, even our language is changing. We don't talk about big data so much anymore; we talk more about digital, or data-driven. When you think about where we've come from and where we're going, what are the puts and takes that you have with regard to what's going on in the business today? >>Well, thanks for having me. I think some of the trends we're seeing, in terms of challenges and puts and takes, are that a lot of companies are already on this digital transformation journey. They've focused on customer experience as table stakes; everyone wants to focus on that and on digitizing their channels. But a lot of them are seeing that they don't even own their channels, necessarily. We're working with a big cruise line, for example, and yes, they've invested in digitizing what they own, but a lot of the channels they sell through, they don't own: it's the travel agencies or third-party resellers. So having the data to know where those agencies are, that's something they've discovered. And so there's a big focus on not just digitizing, but also really understanding your customers and going across products, because a lot of the data has been built up in individual channels and in individual digital products. And so bringing that data together is something that companies that have really figured it out in the last few years use as a big differentiator.
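The cross-channel unification Drew describes can be sketched minimally as merging per-channel records into one customer profile. The channel names, fields, and records below are invented for illustration; a real implementation would also need identity resolution and conflict handling.

```python
# Minimal sketch of unifying customer data that has built up in separate
# channels into a single cross-channel view. Channel names and records
# are invented for illustration; later channels simply overlay earlier
# ones when fields collide.

def unify_customers(channel_records):
    """Merge per-channel records keyed on customer_id into one profile
    that lists every channel the customer touched."""
    profiles = {}
    for channel, records in channel_records.items():
        for rec in records:
            profile = profiles.setdefault(
                rec["customer_id"],
                {"customer_id": rec["customer_id"], "channels": []})
            profile["channels"].append(channel)
            profile.update({k: v for k, v in rec.items() if k != "customer_id"})
    return profiles

channels = {
    "web": [{"customer_id": "C1", "email": "c1@example.com"}],
    "call_center": [{"customer_id": "C1", "phone": "555-0100"},
                    {"customer_id": "C2", "phone": "555-0101"}],
}
customers = unify_customers(channels)
```

The point is the shape of the output: one record per customer, with the journey across channels visible in one place instead of siloed per system.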
And what we're seeing too is a big trend that the data rich are getting richer. Companies that have really invested in data are seeing outsized market share, outsized earnings per share, and outsized revenue growth, and it's a big differentiator. For companies just getting started, one of the missteps is to try to capture all the data at once. The average company has 10,000 to 20,000 individual data elements. When you want to start out, focus on 300 to 500 critical data elements: about 5% of a company's data drives 90% of the business value. Those key critical data elements are what you need to govern first and invest in first. That's something we tell companies at the beginning of their data strategy: first focus on those critical data elements, really get a handle on governing that data, organizing that data, and building data products around that data. >>You can't boil the ocean, right. And I feel like pre-pandemic there was a lot of complacency: oh yeah, we'll get to that; not on my watch; I'll be retired before it becomes an issue. And then of course the pandemic was, as I sometimes call it, a forced march to digital. In many respects it wasn't planned; you just had to do it. And now I feel like people are stepping back and saying, okay, let's really rethink this and do it right. But is there a sense of urgency, do you think? >>Absolutely. With COVID, we were working with a retailer that had 12,000 stores across the U.S., and they didn't have the insights to drill down and understand, with the unrest and with COVID, was the store operational? With a supply chain spanning multiple distributors, what did they have in stock?
So there are millions of data points where you need to drill down, to the cell level, to the store level, to really understand how the business is performing. We like to think about it, for a CEO and the leadership team, as a digital cockpit. Think about a pilot: they have a cockpit with all these dials and dashboards for understanding performance. Leaders should be able to drill down and understand, for each individual unit of their business, how it is performing. That's really what we want to see for businesses: can they get down to that individual performance? >>And it's the ability to connect those dots and traverse those data points without having to go into one system, come back out, go into a new system, and come back out. That's really been a lot of the frustration. Where does machine intelligence and AI fit in? Is it a dot connector, if you will, and an enabler? We saw decades of the AI winter, and then there's been a lot of talk about it, but it feels like with the amount of data we've collected over the last decade, and the low cost of processing that data, it's now real. Where do you see AI fitting in? >>Yeah, I think there's been a lot of innovation in the last 10 years, with the low cost of storage and computing, non-linear algorithms, knowledge graphs, and a whole bunch of opportunities in cloud. The big opportunity is that you can apply AI in areas where a human just couldn't have the scale to do it alone. Back to the example of the cruise line: you may have a ship being built that has 4,000 cabins, sailing to multiple destinations over its 30-year life cycle.
Each one of those cabins is priced individually for each destination. It's physically impossible for a human to calculate the dynamic pricing across all those cabins and destinations; you need a machine to do that pricing. What the machine is leveraging is all that data, to calculate and essentially assist the human, in all these situations where a human couldn't scale up to that amount of data alone. >>You know, it's interesting. One of the things we talked about with Mick Halston earlier was that everybody's algorithms are out of whack. You look at airline pricing, you look at hotels; as a consumer, you used to be able to kind of game the system and predict, and they can't even predict these days. And I feel as though data and AI are actually going to bring us back to some kind of normalcy and predictability. What do you see in that regard? >>Yeah. When I talk to the top AI engineers and data scientists, we're definitely not at a point where we have what they call broad AI, where you can get machines to solve general knowledge problems, where they can solve one problem and then a distinctly different problem. That's still many years away. But with narrow AI there are still tons of use cases that can really address business performance challenges and accuracy challenges. For example, in commercial lines insurance, where I work a lot of the time, the biggest leakage of loss experience and pricing for commercial insurers is that someone will go in as an agent and select an industry code to quote out a policy: say, I'm a restaurant business, I'll select this industry code. But there are, let's say, a dozen permutations. You could be an outdoor restaurant.
>>You could be a bar, you could be a caterer, and all of that leads to different loss experience. So what they did, and we helped them do this, is build a machine learning algorithm that, at the time the agent is entering the name and address, crawls the web and predicts in real time: is this address actually a business that's a restaurant with indoor dining? Does it have a bar? Is there outdoor dining? And that's able to more accurately price the policy and reduce the loss experience. So there's a lot you can do, even with narrow AI, that can really drive top-line business results. >>Yeah, I like that term narrow AI, because getting things done is important. Let's talk about cloud a little bit, because people talk about cloud first, and public cloud first doesn't necessarily mean public cloud only, of course. So what's the right operating model, the right regime? Hybrid cloud? We talked earlier about hybrid data. Help us squint through the cloud landscape. >>Yeah. I think most Fortune 500 companies can't just snap their fingers and say, let's move all of our data centers to the cloud. They've got to move gradually, and it's usually a journey that takes two to three-plus years, even more in some cases. So they have to move their data incrementally to the cloud, and that means moving to a hybrid posture, where some of their data is on premises and some of it is in the public cloud. That's the term hybrid cloud, essentially. And so what they've had to think about, from an intelligence perspective, is the privacy of that data: where is it being moved? Can they reduce the replication of that data? Because ultimately, replicating data from on-premises to the cloud introduces errors and data quality issues.
So thinking about how you manage on-premises and public cloud as a transition, and how you move in a manner that's well organized and well thought out, is something Accenture thinks about and helps our clients with quite a bit. >>Yeah. So I've been a big proponent of the lines of business becoming much more involved in the data pipeline, the data process, if you will. If you think about our major operational systems, they all have line-of-business context in them: the salespeople know the CRM data, and the logistics folks are very much in tune with ERP. I almost feel like for the past decade the lines of business have been somewhat removed from the data team, and that seems to be changing. What are you seeing in terms of the lines of business being much more involved in end-to-end ownership, if I can use that term, of the data, helping determine things like data quality? >>Yeah. I think this is where thinking about your data operating model, the idea of a chief data officer, and having data on the CEO's agenda is really important: to get the lines of business to really think about data sharing and reuse, and to get them to unlock the data. Because they do think about their data as a fiefdom. Data has value, but you've got to get organizations and their silos to open it up and bring that data together, because that's where the value is. When you think about a customer, they don't operate in their journey across the business in siloed channels. They don't think, I use only the web and then I use the call center. They think about it as one experience, and that data is a single journey. So we like to think about data as a product.
You should think about data the same way you think about your products: data as a product. You should have the idea that every two weeks you have releases to it, that it has operational resiliency. Having a real product mindset for delivering your data is very important for success. And that's where it's not just the hard things, the critical data elements and the right platform architecture; there's the soft stuff as well: a product mindset for data, the right data culture and business adoption, and the right value mindset for data. That, I think, is really important. >>I think data as a product is a very powerful concept, and it maybe is uncomfortable to some people sometimes. In the early days of big data, people thought, okay, data as a product means I'm going to sell my data, and that's not necessarily what you mean. You mean thinking about data that can fuel products, which you can then monetize as a product or as a service. And I like to think about a new metric in the industry: how long does it take to get from idea to monetization? I'm a business person, I have an idea for a data product; how long does it take me to get from idea to monetization? That's something that, as a business person, I'm going to use to measure the success of my data team and my data architecture. Is that kind of thinking starting to hit the marketplace? >>Insurers now are partnering with auto manufacturers to monetize driver usage data from telematics: driver behavior and how auto manufacturers are using that data.
That's very important to insurers, so how an auto manufacturer can monetize that data matters. And also in insurance, in cyber insurance: are there new ways we can look at how companies are being attacked with viruses and malware, and is there a way to monetize that information? Companies that are able to think in an agile way about how to collect this data, bring it together, treat it as a product, and then potentially sell it as a service: that's something successful companies are doing. >>Great examples of data products, and it might be revenue generating, or in the case of cyber, maybe it reduces my expected loss. Exactly. And it drops right to my bottom line. What's the relationship between Accenture and Cloudera? I presume you guys meet at the customer, but maybe you could give us some insight. >>So I'm the executive sponsor for the Accenture-Cloudera partnership on the Accenture side. We do quite a lot of business together, and Cloudera has been a great partner for us. They've got a great product in the Cloudera Data Platform, and what we do, as a big systems integrator for them, is help configure it. We have a number of engineers and architects across the world who come in and install Cloudera's data platform, and who think about the value cases where you can really organize data and bring it together for all these different types of use cases, just like the examples we talked about. For the telematics example, to realize something like that you're bringing in petabytes, huge scales of data that you just couldn't handle on a normal platform. You need to think about cloud.
You need to think about speed of data and real-time insights, and Cloudera is the right data platform for that. >>Cloudera ushered in the modern big data era, we all kind of know that, and early on it was very services-intensive. You guys were right there helping people think it through; there weren't enough data scientists. We've all been through that. And in your wheelhouse industries, financial services and insurance, they were some of the early adopters, weren't they? >>Absolutely. In insurance you've got huge amounts of data with loss history, and a lot with IoT. There's a whole theme in insurance of sensorizing things, taking the physical world and digitizing it. And there's a big shift in insurance where it's not just about pricing the risk of a loss, but actually reducing the loss before it even happens. It's called risk control or loss control: can we put sensors on oil pipelines or on elevators and reduce accidents before they happen? So we're working with an insurer to actually listen to elevators as they move up and down. Are there signals, just in listening to the audio of an elevator over time, that say this elevator is going to need maintenance before a critical accident could happen? So there are huge applications, not just in structured data, but in unstructured data like voice and audio and video, where a partner like Cloudera has a huge role to play. >>Great example, and again a narrow use case for machine intelligence, but real value. We'll leave it there. Thanks so much for taking some time. Thank you.
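The elevator-listening idea above can be sketched very roughly: track a baseline of an audio-level reading over time and flag readings that drift far from it, as an early maintenance signal. The readings, window size, and threshold below are invented; a real system would work on richer audio features than a single level.

```python
# Rough sketch of audio-based predictive maintenance: flag any reading
# that jumps well above the rolling average of the readings before it.
# All numbers here are illustrative, not from a real elevator.

def maintenance_alerts(readings, window=3, threshold=1.5):
    """Return the index of any reading more than `threshold` times the
    average of the previous `window` readings."""
    alerts = []
    for i in range(window, len(readings)):
        baseline = sum(readings[i - window:i]) / window
        if readings[i] > threshold * baseline:
            alerts.append(i)
    return alerts

# Steady hum, then a spike that might mean a worn bearing.
levels = [1.0, 1.1, 0.9, 1.0, 2.4, 1.0]
```

Running `maintenance_alerts(levels)` flags the spike, the kind of signal that would trigger a maintenance ticket before a critical failure.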
Jitesh Ghai, Informatica | CUBE Conversation, July 2020
(ambient music) >> Narrator: From theCUBE studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a CUBE Conversation. >> Hello, and welcome to this CUBE Conversation. I'm John Furrier, host of theCUBE, here in our Palo Alto studios. During this quarantine our crew is doing all the interviews remotely, getting all the top stories, especially during this COVID pandemic. We have a great conversation here with Jitesh Ghai, Senior Vice President and General Manager of Data Management at Informatica, a multi-time CUBE alum. We can't be in person this year because of the pandemic, but there's a lot of great content, and we've been doing a lot of interviews with you guys. Jitesh, great to see you. Thanks for coming on. >> Hey, great to see you again. We weren't able to make it happen in person this year, but if not in person, virtually will have to work. >> In our past conversations on theCUBE, and with the Informatica team, it's always been kind of an inside-baseball, inside-the-ropes conversation about data in the industry. Now, more than ever, with the pandemic, you're starting to see people get it: oh, I get it now; I get why data is important. I can see why Cloud First, Mobile First, Data First strategies, and now Virtual First, are transformational. Everyone's feeling it; you can't ignore it. It's happening. It's also highlighting what's working and what's not. I have to ask you, Jitesh: in the current environment, what are you seeing as the opportunities your customers are dealing with in their approach to data? Clearly you're working at that data layer, there's a lot of innovation opportunity, you've got CLAIRE on the AI side, all great. But now, with the pandemic, it's really forcing the conversation: I've got to rethink what happens after, and have a really good strategy. >> Yeah, you're exactly right. There's a broad-based realization that, well, let me take a step back.
First, we all know that as global 2000 organizations, and really in general, we all need to be data driven; we need to make fact-based decisions. A lot of good work has happened over the last few years as organizations have realized just how important data is to innovate and to deliver new products, services, and business models. What's really happened during this COVID pandemic is a greater appreciation for trust in data. Historically, organizations have been on a journey toward being increasingly data driven, but there was some element of gut, or experience, that, combined with data, would get them to the decisions they were looking for. In this pandemic world of great uncertainty, with supply chains falling apart on occasion, groceries not getting delivered on time, and so on, the appreciation of, and critical importance placed on, the quality and trustworthiness of data is greater than ever for driving insights. Leaders are more hesitant to just go with their gut; there is a tremendous reliance on data. And we're seeing it more than ever in the healthcare provider sector and in the public sector, with federal, state, and local government, as all of these organizations have to make very difficult decisions and are increasingly relying on high-quality, trustworthy, governed data to help them make what can be life-or-death decisions. So, a big shift in the appreciation for the importance and trustworthiness of data, the data estate, and the insights drawn from it. >> So as the GM of Data Management and Senior Vice President at Informatica, you get a good view of things. I've got to ask, because I love this data 4.0 concept: talk about what it means to you. Your customers have been doing data management with you for a while, but now it's data 4.0, and that has a feeling of agility to it.
It's got kind of a DevOps vibe. It feels like a lot of automation is being discussed, and you mentioned trust. What does data 4.0 mean? >> Data 4.0, for us, is where AI and ML are powering data management. What do I mean by that? There is a greater appreciation for high-quality, trustworthy data to enable organizations to make fact-based decisions and be more data driven. But how do you do that when data is exponentially growing in volume; when data types are increasing; when data is moving between clouds, between on-premises and clouds, and between various ecosystems; when new data sources are emerging; and when the internet of things is yet another exploding source of data? That's a lot of different types of data, a lot of volume, and a lot of different locations where data resides. So the question becomes: how do you practically manage this data without intelligence and automation? That's what the era of data 4.0 is. AI and ML power data management, making it more intelligent and automating more and more of what was historically manual, so that organizations can scale to the breadth of data they have; get a greater understanding of their data landscape within the enterprise and of the quality of the data within that landscape; understand how it's moving; and understand the associated privacy implications of how that data is being used and how effectively it's protected. All underpinned by our CLAIRE engine, which is AI and ML applied to metadata, to deliver the intelligence and enable the automation of data management operations. >> Awesome. Thanks for taking the time to define that; love that.
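CLAIRE itself is proprietary, so as a stand-in, here is only a rule-based sketch of the kind of automation "AI-powered data management" implies: scanning column metadata and auto-tagging likely sensitive or notable fields. The patterns and tag names are invented for illustration.

```python
# Toy stand-in for ML-driven metadata classification: auto-tag columns
# by name patterns. A real engine would learn from data values and
# usage metadata, not just names.

import re

TAG_RULES = [
    (re.compile(r"ssn|social_security"), "PII:national_id"),
    (re.compile(r"email"), "PII:email"),
    (re.compile(r"phone"), "PII:phone"),
    (re.compile(r"amount|balance|price"), "financial"),
]

def auto_tag(columns):
    """Return {column_name: [tags]} by matching name patterns."""
    tagged = {}
    for name in columns:
        tags = [tag for pattern, tag in TAG_RULES
                if pattern.search(name.lower())]
        tagged[name] = tags
    return tagged
```

Even this crude version shows the payoff: tagging 20,000 columns by hand is impractical, while an automated pass over metadata scales to the whole estate and can feed governance and privacy policies downstream.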
The question I want to ask you, and I'll put you on the spot here because I think this is an important conversation we've been having, and one we write a lot about on siliconangle.com, is this. Customers say to us: hey, John, I'm investing in cloud-native technologies, using cloud data warehouses and data lakes. I need to make this work, because this is a scale opportunity. I need to come out of this pandemic with really agile, scalable solutions so I can move fast on my applications. How do you comment on that? What are your thoughts, given you guys are in the middle of all this with data management? >> I couldn't agree more. Increasingly, data workloads are moving to the cloud. It's projected that by 2022, 75% of all databases will be in the cloud, and COVID-19 is really accelerating it. It's opening the eyes of leadership, of decision makers, to be truly Cloud First and Cloud Native, now more than ever. Traditional banking organizations and highly regulated industries that have been hesitant to move to the cloud are now aggressively embarking on that journey, and industries that were early adopters of the cloud are accelerating their journeys. I mentioned earlier that we had a very seamless transition as we moved to a work-from-home environment, and that's because our IT is Cloud First, Cloud Native. Why does that matter? It's through being Cloud First and Cloud Native that you get the resiliency, agility, and flexibility benefits in these uncertain times. And we're seeing that with the data and analytics stack as well. Customers are accelerating the move to cloud data warehouses and cloud data lakes, and becoming cloud native for their data management stack in addition to their data analytics platforms. >> Great stuff, and I agree a hundred percent. Cloud native is where it's going, but you aren't there (laughs) yet. Still, hybrid and multi-cloud is a big discussion. I want to get your thoughts. >> Completely.
>> On how that's going to play out. Because with hybrid cloud and multi-cloud: public cloud is amazing, we know that. But hybrid and multi-cloud, as the next generation of interoperability framework for cloud services, mean you're going to have to overlay and manage data governance and privacy. It's going to get more complicated, right? So how are you seeing your customers approach that, on the public side and then with hybrid? Because that's become a big discussion point. >> Hybrid is an absolutely critical enabling capability as organizations modernize their on-premises estate into the cloud. You need to be able to connect to your on-premises applications and databases and migrate the data that's important into the cloud, so hybrid is an essential capability. When I say Informatica is Cloud First, Cloud Native, being Cloud First and Cloud Native as a data-management-as-a-service provider essentially requires the capability to connect to on-premises data sources, and therefore to be hybrid. So a hybrid architecture is an essential part of it. Equally, it's important to enable organizations to understand what needs to go to the cloud. As you're modernizing your infrastructure, your applications, and your data and analytics stack, you don't need to bring everything to the cloud with you, so there's an opportunity to introduce efficiencies. That's done by scanning the data landscape on-premises, scanning the data that already exists in the various public clouds you partner with, and understanding what's important and what's not: what can be decommissioned and left behind to realize savings, and what is important for the business and needs to be moved into a cloud-native analytics stack. And that's really where our CLAIRE metadata intelligence capabilities come to bear.
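The migrate-versus-decommission triage just described can be sketched, very roughly, as a pass over usage metadata. The field names, cutoffs, and catalog entries below are assumptions for illustration, not how CLAIRE actually scores data.

```python
# Toy triage of a data landscape ahead of a cloud migration: keep what
# the business actively uses, flag the rest as decommission candidates.
# Thresholds and metadata fields are invented for this sketch.

def triage(datasets, min_queries=10, max_idle_days=365):
    """Split datasets into those worth migrating to the cloud and
    candidates to decommission, based on simple usage metadata."""
    migrate, decommission = [], []
    for ds in datasets:
        active = (ds["queries_last_90d"] >= min_queries
                  and ds["days_since_last_access"] <= max_idle_days)
        (migrate if active else decommission).append(ds["name"])
    return migrate, decommission

catalog = [
    {"name": "claims_2020", "queries_last_90d": 340, "days_since_last_access": 2},
    {"name": "legacy_fax_log", "queries_last_90d": 0, "days_since_last_access": 1100},
]
```

The design point is that the decision is driven by scanned metadata, not by manually auditing every source, which is what makes it workable across thousands of datasets.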
And that's really what serves as the foundation of data governance, data cataloging, and data privacy: enabling organizations to get the right data into the cloud, to do so while ensuring privacy, and to govern that data in their now cloud-native analytics stack, whether it's AWS, Azure, GCP, Snowflake, or Databricks, all partners, all deep partnerships that we have. >> Jitesh, I want to get your thoughts on something. I was having a Zoom call a couple of weeks ago with a bunch of CXO friends, practitioners, probably some of them your customers. It was kind of a social get-together, but we were talking about the world we're living in: the pandemic, COVID data, fake news. And one of the comments was, finally the whole world now realizes what my life is like, referring to how we're seeing fake news and misinformation screw up an election, and how COVID has ten zillion different data points that people are using to tell stories. What does it really mean? There's a lot of trust involved, and people are confused. Again, in that backdrop, he said: that's my world. >> Right. This comes back down to some of the things you're talking about: trust. We've talked about metadata services in the past, and the democratization of data has been around for a while in the enterprise. When you're dealing with bad data, or fake data, or too much data, you can make data say whatever you want, so you've got to make sense of it. What's your reaction to his comment? >> Completely agree. And that goes back to the earlier comment I made about making fact-based decisions you can have confidence in, because the insight is based on trusted data. You mentioned data democratization; our point of view is that to democratize data, you have to do it on a foundation of governance.
There's a reason why traffic lights exist, it's to facilitate or at least attempt to facilitate the optimal free flow of traffic without getting into accidents, without causing congestion, so on and so forth. Equally, you need to have a foundation of governance. And I realized that there's an optical tension of democratized data, which is, free data for everybody consume it whenever and however you want, and then governance, which seems to imply, locking things down controlling them. And really, when I say you need a foundation of data governance, you need to enable for organizations to implement guardrails so that data can be effectively democratized. So that data consumers can easily find data. They can understand how trustworthy it is, what the quality of it is, and they can access it in easy way and consume it, while adhering to the appropriate privacy policies that are fit for the use of that particular set of data that a data and data consumer wants to access. And so, how do you practically do that? That's where data 4.0 AI power data management comes into play. In that, you need to build a foundation of what we call intelligent data governance. A foundation of scanning metadata, combining it with business metadata, linking it into an enterprise knowledge graph that gives you an understanding of an organization and enterprises data language. It auto tags auto curates, it gives you insight into the quality of the data, and now enables organizations to publish these curated data sets into a capability, what we call a data marketplace, so that much like Amazon.com, you can shop for the data, you can browse home and garden, electronics various categories. You can identify the data sets that are interesting to you, when you select them, you can look at the quality dimensions that have already been analyzed and associated with the data set. And you can also review the privacy policies that govern the use of that data set. 
And if you're interested in it, find the data sets, add them to your shopping cart, like you would do with Amazon.com, and check out. And when you do that triggers off an approval workflow to enable organizations to that last mile of governing access. And once approved, we can automatically provision the datasets to wherever you want to analyze them, whether it's in Tableau Power BI, an S3 market, what have you. And that is what I mean by a foundation of intelligent data governance. That is enabling data democratization. >> A common metadata layer gives you capabilities to use AI, I get that, There's a concept that you guys are talking a lot about, this augmentation to the data. This augmented data management activities that go on. What does that mean? Can you describe and explain that further and unpack that? This augmented data management activity? >> Yeah, and what do we mean by augmented data management, it's a really a first step into full blown automation of data management. In the old world, a developer would connect to a source, parse the source schema, connect to another source, parse its source schema, connect to the target, understand the target schema, and then pick the appropriate fields from the various sources, structure it through a mapping and then run a job that transforms the data and delivers it to a target database, in its structure, in its schema, in its format. Now that we have enterprise scale metadata intelligence, we know what source of data looks like, we know what targets exist as you simply pick sources and targets, we're able to automatically generate the mappings and automate this development part of the process so that organizations can more rapidly build out data pipelines to support their AI to operationalize AIML, to enable data science, and to enable analytics. >> Jitesh great insight. I really appreciate you explaining all this concept and unpacking that with me. 
Final point, I'd love you to have you just take a minute to put the plug in there for Informatica, what you're working on? What are your customers doing? What are some of the best practices coming out of the current situation? Take a minute to talk about that. >> Yeah, thank you, I'm happy to. It really comes down to focusing on enabling organizations to have a complete understanding of their data landscape. And that is, where we're enabling organizations to build an enterprise knowledge graph of technical metadata, business metadata, operational usage metadata, social metadata to understand and link and develop the necessary context to understand what data exists, where how it's used, what its purpose is and whether or not you should be using. And that's where we're building the Google for the enterprise to help organizations develop that. Equally, leveraging that insight, we're building out the necessary that insight and intelligence through CLAIRE, we're building out the automation in the data quality capabilities, in the data integration capabilities, in the metadata management capabilities, in the master data management capabilities, as well as the data privacy capability. So things that our tooling historically used to do manually, we're just automating it so that organizations can more productively access data, understand it and scale their understanding and insight and analytics initiatives with greater trust greater insight. It's all built on a foundation of our intelligent data platform. >> Love it, scaling data. It's that's really the future fast, available, highly available, integrated to the applications for AI. That's the future. >> Exactly right. Data 4.0, (laughs) AI power data management. >> I love talking about data in the future, because I think that's really valuable. 
And I think developers, and I've always been saying for over a decade now data is a critical piece for the applications, and AI really unlocks that of having it available, and surface is critical. You guys doing a great job. Thanks for the insight, appreciate you Jitesh. Thank you for coming on. >> Thanks for having me. Pleasure to be here. >> You couldn't do it in person with Informatica world but we're getting the conversations here on the remote CUBE, CUBE virtual. I'm John Furrier, you're watching CUBE conversation with Jitesh Ghai Senior Vice President General Manager, Data Manager at Informatica. Thanks for watching. (upbeat music)
Jitesh Ghai, Informatica | CUBE Conversation, July 2020
(ambient music) >> Narrator: From theCUBE studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a CUBE conversation. >> Hello, welcome to this CUBE conversation. I'm John Furrier, host of theCUBE, here in our Palo Alto studios. During this quarantine, our crew is doing all the interviews, getting all the top stories, especially during this COVID pandemic. Great conversation here with Jitesh Ghai, Senior Vice President and General Manager of Data Management with Informatica, a multi-time CUBE alumni. We can't be in person this year because of the pandemic, but a lot of great content. We've been doing a lot of interviews with you guys. Jitesh, great to see you. Thanks for coming on. >> Hey, great to see you again. We weren't able to make it happen in person this year, but if not in person, virtual will have to work.
Everyone's feeling it, you can't ignore it. It's happening. It's also highlighting what's working and what's not. I have to ask you, in the current environment, Jitesh, what are you seeing as some of those opportunities that your customers are dealing with in their approach to data? 'Cause clearly, you're working at that data layer, there's a lot of innovation opportunity, you've got CLAIRE on the AI side, all great. But now with the pandemic, it's really forcing that conversation: I've got to rethink what's going to happen after, and have a really good strategy. >> Yeah, you're exactly right. There's a broad based realization that, I'll take a step back. First, we all know that as global 2000 organizations, or in general, we all need to be data driven, we need to make fact based decisions. And there is a lot of that good work that's happened over the last few years, as organizations have realized just how important data is to innovate and to deliver new products and services, new business models. What's really happened is that, during this COVID pandemic, there is a greater appreciation for trust in data. Historically, organizations became data driven, we're on the journey of being increasingly data driven. However, there was some element of, oh, gut or experience, and that combined with data will get us to the outcomes we're looking for, will enable us to make the decisions. In this pandemic world of great uncertainty, supply chains falling apart on occasion, groceries not getting delivered on time, et cetera, et cetera, the appreciation and critical importance of the quality, of the trust, of data is greater than ever to drive the insights for organizations. Leaders are more hesitant to just go with your-gut type approaches. There is a tremendous reliance on data.
And we're seeing it in particular, more than ever, as you can imagine, in the healthcare provider sector, in the public sector with federal, state and local, as all of these organizations are having to make very difficult decisions, and are increasingly relying on high quality, trustworthy, governed data to help them make what can be life or death decisions. So a big shift in appreciation for the importance of, and trustworthiness in, their data, their data estate and their insights. >> So as the GM of Data Management and Senior Vice President at Informatica, you get a good view of things. I've got to ask you, I love this data 4.0 concept. Talk about what that means to you, because you've got customers that have been doing data management with you guys for a while, but now it's data 4.0, and that has a feeling of agility to it. It's got kind of a DevOps vibe. It feels like a lot of automation being discussed, and you mentioned trust. What does data 4.0 mean? >> So data 4.0 for us is where AI and ML is powering data management. And so what do I mean by that? There is a greater insight and appreciation for high quality, trustworthy data to enable organizations to make fact based decisions, to be more data driven. But how do you do that when data is exponentially growing in volume, where data types are increasing, where data is moving increasingly between Clouds, between On-premises and Clouds, between various ecosystems, new data sources are emerging, and the internet of things is yet another exploding source of data? This is a lot of different types of data, a lot of volume of data, a lot of different locations, and gravity of data, where data resides. So the question becomes: how do you practically manage this data without intelligence and automation? And that's what the era of data 4.0 is.
Where AI and ML is powering data management, making it more intelligent, automating more and more of what was historically manual, to enable organizations to scale to the breadth of data that they need: to get a greater understanding of their data landscape within the enterprise, to get a greater understanding of the quality of the data within their landscape, how it's moving, and the associated privacy implications of how that data is being used, how effectively it's protected, and so on and so forth. All underpinned by our CLAIRE engine, which is AI and ML applied to metadata, to deliver the intelligence and enable the automation of the data management operations. >> Awesome. Thanks for taking the time to define that, love that. The question I want to ask you, and I'll put you on the spot here, because I think this is an important conversation we've been having, and also writing a lot about on siliconangle.com, is that customers say to us, "Hey, John, I'm investing in Cloud Native technologies, using Cloud data warehouses and data lakes. I need to make this work because this is a scale opportunity. I need to come out of this pandemic with really agile, scalable solutions so that I can move fast on my applications." How do you comment on that? What are your thoughts on this? Because you guys are in the middle of all this with the data management.
I mentioned earlier that we had a very seamless transition as we moved to a work from home environment, and that's because our IT is Cloud First, Cloud Native. And why is that? It's because it's through being Cloud First and Cloud Native that you get the resiliency, the agility, the flexibility benefits in these uncertain times. And we're seeing that with the data and analytics stack as well. Customers are accelerating the move to Cloud data warehouses and Cloud data lakes, and becoming Cloud Native for their data management stack in addition to their data analytics platforms. >> Great stuff, which I agree with a hundred percent. Cloud Native is where it goes, but you aren't there (laughs) yet. Still, Hybrid and Multi-cloud is a big discussion. I want to get your thoughts >> Completely.
As you're modernizing your infrastructure, your applications, your data and analytics stack, you don't need to bring everything to the Cloud with you. So there's an opportunity for organizations to introduce efficiencies. And that's done by enabling organizations to really scan the data landscape On-premise, scan the data that already exists in the various Public clouds that they partner with, and understand what's important, what's not, what can be decommissioned and left behind to realize savings, and what is important for the business and needs to be moved into a Cloud Native analytics stack. And that's really where our CLAIRE metadata intelligence capabilities come to bear. And that's really what serves as the foundation of data governance, data cataloging and data privacy: to enable organizations to get the right data into the Cloud, to do so while ensuring privacy, and to ensure that they govern that data in their new, now Cloud Native, analytics stack, whether it's AWS, Azure, GCP, Snowflake, Databricks, all partners, all deep partnerships that we have. >> Jitesh, I want to get your thoughts on something. I was having a Zoom call a couple weeks ago with a bunch of CXO friends, people, practitioners, probably some of them are your customers. It was kind of a social get together. But we were talking about the world we're living in, the pandemic, from COVID data to fake news, and one of the comments was, finally the whole world now realizes what my life is like. And he was referring to how we're seeing fake news and misinformation kind of screw up an election, and you've got COVID with 10 zillion different data points, and people are using it to tell stories. And what does it really mean? There's a lot of trust involved. People are confused, and all that's going on. Again, in that backdrop, he said, that's my world. >> Right.
This comes back down to some of the things you're talking about: trust. We've talked about metadata services in the past. This authenticity, this data democratization, has been around for a while in the enterprise, so that, dealing with bad data or fake data or too much data, you can make data (laughs) into whatever you want. You've got to make sense of it. What are your thoughts on his comment? I mean, what does it make you feel? >> Completely agree, completely agree. And that goes back to the earlier comment I made about making fact based decisions that you can have confidence in, because the insight is based on trusted data. And so, you mentioned data democratization. Our point of view is that to democratize data, you have to do it on a foundation of governance, right? There's a reason why traffic lights exist: it's to facilitate, or at least attempt to facilitate, the optimal free flow of traffic without getting into accidents, without causing congestion, so on and so forth. Equally, you need to have a foundation of governance. And I realize that there's an apparent tension between democratized data, which is free data for everybody, consume it whenever and however you want, and governance, which seems to imply locking things down, controlling them. And really, when I say you need a foundation of data governance, you need to enable organizations to implement guardrails so that data can be effectively democratized. So that data consumers can easily find data. They can understand how trustworthy it is, what the quality of it is, and they can access it in an easy way and consume it, while adhering to the appropriate privacy policies that are fit for the use of that particular set of data that a data consumer wants to access. And so, how do you practically do that? That's where data 4.0, AI-powered data management, comes into play. In that, you need to build a foundation of what we call intelligent data governance.
A foundation of scanning metadata, combining it with business metadata, linking it into an enterprise knowledge graph that gives you an understanding of an organization's, an enterprise's, data language. It auto-tags, auto-curates, it gives you insight into the quality of the data, and now enables organizations to publish these curated data sets into a capability we call a data marketplace, so that, much like Amazon.com, you can shop for the data. You can browse home and garden, electronics, various categories. You can identify the data sets that are interesting to you; when you select them, you can look at the quality dimensions that have already been analyzed and associated with the data set. And you can also review the privacy policies that govern the use of that data set. And if you're interested in it, you find the data sets, add them to your shopping cart, like you would do with Amazon.com, and check out. And when you do, that triggers off an approval workflow, to enable organizations to govern that last mile of access. And once approved, we can automatically provision the datasets to wherever you want to analyze them, whether it's in Tableau, Power BI, an S3 bucket, what have you. And that is what I mean by a foundation of intelligent data governance. That is enabling data democratization. >> A common metadata layer gives you capabilities to use AI, I get that. There's a concept that you guys are talking a lot about, this augmentation of the data, these augmented data management activities that go on. What does that mean? Can you describe and explain that further, and unpack that, this augmented data management activity? >> Yeah, what do we mean by augmented data management? It's really a first step into full blown automation of data management.
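The shop-for-data flow Jitesh describes, browse curated data sets, review their quality scores and privacy policies, check out, trigger an approval workflow, then provision on approval, can be sketched in a few lines of Python. This is a hypothetical illustration only: the class and field names are invented for this sketch and are not Informatica's actual data marketplace API.

```python
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    category: str           # storefront category, e.g. "marketing"
    quality_score: float    # curated quality dimensions rolled up to 0..1
    privacy_policy: str     # policy governing use of this data set

class DataMarketplace:
    """Minimal sketch of a governed, shoppable data marketplace."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        self.cart = []
        self.pending_approvals = []

    def browse(self, category):
        # Browse curated data sets by category, like shopping a storefront.
        return [d for d in self.datasets if d.category == category]

    def add_to_cart(self, dataset):
        self.cart.append(dataset)

    def checkout(self, requester):
        # Checking out triggers an approval workflow for each data set,
        # the "last mile" of governing access.
        requests = [{"dataset": d.name, "requester": requester, "status": "pending"}
                    for d in self.cart]
        self.pending_approvals.extend(requests)
        self.cart = []
        return requests

    def approve(self, request, target):
        # Once approved, provision the data set to the requested target
        # (a BI tool, an object store, and so on).
        request["status"] = "approved"
        return f"provisioned {request['dataset']} to {target}"
```

A consumer would call `browse`, inspect `quality_score` and `privacy_policy` on each hit, `add_to_cart`, `checkout`, and a data steward would later call `approve` to provision the set to, say, a Tableau workspace.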
In the old world, a developer would connect to a source, parse the source schema, connect to another source, parse its source schema, connect to the target, understand the target schema, and then pick the appropriate fields from the various sources, structure them through a mapping, and then run a job that transforms the data and delivers it to a target database, in its structure, in its schema, in its format. Now that we have enterprise scale metadata intelligence, we know what sources of data look like, we know what targets exist, so as you simply pick sources and targets, we're able to automatically generate the mappings and automate this development part of the process, so that organizations can more rapidly build out data pipelines to support their AI, to operationalize AI/ML, to enable data science, and to enable analytics. >> Jitesh, great insight. I really appreciate you explaining all these concepts and unpacking that with me. Final point, I'd love to have you just take a minute to put the plug in there for Informatica. What are you working on? What are your customers doing? What are some of the best practices coming out of the current situation? Take a minute to talk about that. >> Yeah, thank you, I'm happy to. It really comes down to focusing on enabling organizations to have a complete understanding of their data landscape. And that is where we're enabling organizations to build an enterprise knowledge graph of technical metadata, business metadata, operational usage metadata, social metadata, to understand and link and develop the necessary context to understand what data exists, where, how it's used, what its purpose is, and whether or not you should be using it. And that's where we're building the Google for the enterprise, to help organizations develop that.
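The before-and-after Jitesh walks through, manually parsing source and target schemas versus auto-generating source-to-target mappings from scanned metadata, can be sketched as below. This is an illustrative stand-in, not CLAIRE itself: real metadata intelligence applies ML to enterprise-scale metadata, while this sketch stubs the matching with simple name normalization and type checks. All function and schema names are hypothetical.

```python
def normalize(name):
    # Crude canonical form: "Cust_ID", "cust id" and "custid" all
    # collapse to "custid" so differently-styled fields can match.
    return name.lower().replace("_", "").replace(" ", "")

def generate_mapping(source_schemas, target_schema):
    """Auto-generate a source-to-target field mapping from scanned schemas.

    source_schemas: {source_name: {field_name: type}}  (what the scans found)
    target_schema:  {field_name: type}                 (the chosen target)
    Returns {target_field: (source_name, source_field)} for every target
    field whose normalized name and type match a scanned source field.
    """
    mapping = {}
    for tgt_field, tgt_type in target_schema.items():
        for src_name, fields in source_schemas.items():
            for src_field, src_type in fields.items():
                if normalize(src_field) == normalize(tgt_field) and src_type == tgt_type:
                    mapping[tgt_field] = (src_name, src_field)
                    break
            if tgt_field in mapping:
                break
    return mapping
```

The point of the sketch is the shape of the automation: the developer picks sources and a target, and the mapping step that used to be hand-built falls out of the metadata.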
Equally, leveraging that insight and intelligence through CLAIRE, we're building out the automation in the data quality capabilities, in the data integration capabilities, in the metadata management capabilities, in the master data management capabilities, as well as the data privacy capabilities. So things that our tooling historically used to do manually, we're just automating, so that organizations can more productively access data, understand it, and scale their understanding and insight and analytics initiatives with greater trust, greater insight. It's all built on a foundation of our intelligent data platform. >> Love it, scaling data. That's really the future: fast, available, highly available, integrated into the applications for AI. That's the future. >> Exactly right. Data 4.0, (laughs) AI-powered data management. >> I love talking about data in the future, because I think that's really valuable. And I think developers, and I've always been saying this for over a decade now, data is a critical piece for the applications, and AI really unlocks that; having it available and surfaced is critical. You guys are doing a great job. Thanks for the insight, appreciate you, Jitesh. Thank you for coming on. >> Thanks for having me. Pleasure to be here. >> You couldn't do it in person with Informatica World, but we're getting the conversations here on the remote CUBE, CUBE virtual. I'm John Furrier, you're watching a CUBE conversation with Jitesh Ghai, Senior Vice President and General Manager of Data Management at Informatica. Thanks for watching. (upbeat music)
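The enterprise knowledge graph idea from Jitesh's final answer, linking technical, business, operational usage, and social metadata around the same data assets, can be sketched minimally like this. The structure and the example relations are assumptions for illustration, not a real catalog schema.

```python
class KnowledgeGraph:
    """Tiny metadata knowledge graph: nodes are data assets or business
    terms, and each edge records which kind of metadata (technical,
    business, operational, social) links them."""

    def __init__(self):
        self.edges = []

    def link(self, source, relation, target, metadata_kind):
        self.edges.append((source, relation, target, metadata_kind))

    def context(self, node):
        # Everything we know about one node, across all metadata kinds:
        # the "necessary context" to judge what data exists and how it's used.
        return [(s, r, t, k) for (s, r, t, k) in self.edges
                if s == node or t == node]

# Hypothetical example: four kinds of metadata about one table.
kg = KnowledgeGraph()
kg.link("orders_table", "has_column", "customer_id", "technical")
kg.link("orders_table", "means", "Customer Order", "business")
kg.link("orders_table", "read_by", "churn_dashboard", "operational")
kg.link("orders_table", "rated_useful_by", "analytics_team", "social")
```

Querying `kg.context("orders_table")` pulls the four linked facts together, which is the basic move behind a "Google for the enterprise": one lookup that spans every metadata kind.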
Moe Abdulla Tim Davis, IBM | IBM Think 2018
(upbeat music) >> Announcer: Live from Las Vegas, it's The Cube, covering IBM Think 2018. Brought to you by IBM. >> We're back at IBM Think 2018. This is The Cube, the leader in live tech coverage. My name is Dave Vellante. I'm here with my co-host Peter Burris. Moe Abdulla is here. He's the vice president of Cloud Garage and Solution Architecture, Hybrid Cloud, for IBM, and Tim Davis is here, from the Data Analytics and Cloud Architecture Group and Services Center of Excellence at IBM. Gentlemen, welcome to The Cube. >> Glad to be here. >> Thanks for having us. >> Moe, Garage, Cloud Garage, I'm picturing drills and wrenches. What's the story with Garage? Bring that home for us. >> (laughs) I wish it was that type of a garage. My bill would go down for sure. No, The Garage is playing on the theme of the start-up, the idea of how you bring in new ideas and innovate on them, but for the enterprises. So what two people can do with pizza and innovate, how do you bring that to a larger concept? That's what The Garage is really about. >> Alright, and Tim, talk about your role. >> Yeah, I lead the data and analytics field team, and so we're really focused on helping companies do digital transformation and really drive digital and analytics, data, into their businesses to get better business value, accelerate time to value. >> Awesome, so we're going to get into it. You guys have both written books. We're going to get into the Field Guide and we're going to get into the Cloud Adoption Playbook, but Peter, I want you to jump in here, because I know you've got to run, so get your questions in and then I'll take over. >> Sure. I think the obvious question number one is, one of the biggest challenges we've had in analytics over the past couple of years is we had to get really good at the infrastructure, and really good at the software, and really good at this, and really good at that, and there were a lot of pilot failures, because if you succeeded at one you might not have succeeded at the other.
The Garage sounds like it's time-to-value based. Is that the right way to think about this? And what are you guys together doing to drive time to value, facilitate adoption, and get to the changes, the outcomes that the business really wants? >> So Tim, you want to start? >> Yeah I can start because Moe leads the overall Garage, and within the Garage we have something called the Data First Methodology, where we're really driving a direct engagement with the clients where we help them develop a data strategy, because most clients, when they do digital transformation or really go after data, they're taking kind of a legacy approach. They're building these big monolithic data warehouses, they're doing big master data management programs, and what we're really trying to do is change the paradigm, and so we connect with the Data First Methodology through the Garage to get to a data strategy that's connected to the business outcome, because it's: what data and analytics do you need to successfully achieve what you're trying to do as a business? A lot of this is digital transformation, which means you're not only changing what you're doing from a data warehouse to a data lake, but you're also accelerating the data, because now we have to get into the time domain of a customer, or your customer, where they may be consuming things digitally, and so they're at a website, they're moving into a bank branch, they go into a social media site, maybe they're being contacted by a fintech. You've got to retain and maintain a digital relationship, and that's the key.
So you take a very complex problem that people are talking and over-talking and over-engineering, and you really bring it down to something that has a client value, user-centered. So bring the discipline from the business side, the operation side, the developers, and we mush them together to center that. That's one way to do fast. The second way-- >> By the way, I worked with a client. They started calling it minimum viable outcomes. >> Yes, minimum viable outcomes, minimum viable products, there are a lot of types of these minimum viables to achieve; we're talking about four weeks, six weeks, and so on and so forth. The story of American Airlines was taking all of their kiosk systems for example and really changing them both in terms of the types of services they can deliver, so now you can recheck your flights, et cetera, within six-week periods and you really, that's fast, and doing it in one terminal and then moving to others. The second way you do fast is by understanding that the change is not just technology. The change is culture, process, and so on. So when you come to The Garage, it's not like the mechanic-style garage where you are sitting in the waiting room and the mechanic is fixing your car. Not at all. You really have some sort of mechanical skills and you're in there with me. That's called pair programming. That's called test-driven; these types of techniques and methodologies are proven in the industry. So Tim will sit right next to me and we'll code together. By the time Tim goes back to his company, he's now an expert on how to do it. So fast is achieving the cultural transformation as well as this minimum viable aspect. >> Hands on, and you guys are actually learning from each other in that experience, aren't you? >> Absolutely. >> Oh yeah. >> And then sharing, yeah.
>> I would also say I would think that there's one more thing for both of you guys and that is increasingly as business acknowledges that data is an asset unlike traditional systems approaches where we built a siloed application, this server, that database manager, this data model, that application and then we do some integration at some point in time, when you start with this garage approach, data-centric approach, figure out how that works, now you have an asset that can be reused in a lot of new and interesting ways. Does that also factor into this from a speed aspect? >> Yeah it does. And this is a key part. We have something called data science experience now and we're really driving pilots through The Garage, through the data first method to get that rapid engagement and the goal is to do sprints, to do 12 to 20 week kind of sprints where we actually produce a business outcome that you show to the business and then you put it into production and we're actually developing algorithms and other things as we go that are part of the analytic result and that's kind of the key and behind that, you know the analytic result is really the, kind of the icing on the cake and the business value where you connect, but there's a whole foundation underneath that of data and that's why we do a data topology and the data topology has kind of replaced the data lake, replaces all that modeling because now we can have a data topology that spans on premise, private cloud, and public cloud and we can drive an integrated strategy with the governance program over that to actually support the data analytics that you're trying to drive and that's how we get at that. >> But that topology's got to tie back to the attributes of the data, right? Not the infrastructure that's associated with it. >> It does and the idea of the topology is you may have an existing warehouse. 
That becomes a zone in the topology, so we aren't really ripping and replacing, we're augmenting, you know, so we may augment an on-premise warehouse that may sit in a relational database technology with a Hadoop environment that we can spin up in the cloud very rapidly, and then the data science applications, and so we can have a discovery zone as well as the traditional structured reporting, and the level of data quality can be mixed. You may do analytic discovery against raw data versus where you have highly processed data where we have extreme data quality for regulatory reporting. >> Compared to a god box where everything goes through some pipe into that box. >> And you put it on later. >> Yes. >> Well and this is the, when Hadoop came out, right, people thought they were going to dump all their data into Hadoop and something beautiful was going to happen, right? And what happened is everybody created a lot of data swamps out there. >> Something really ugly happened. >> Right, right, it's just a pile of data. >> Well they ended up with a cheaper data warehouse. >> But it's not, because that data warehouse was structured, it has-- >> Dave: Yeah and data quality. >> All the data modeling, but all that stuff took massive amounts of time. When you just dump it into a Hadoop environment you have no structure, you have to discover the structure, so we're really doing all the things we used to do with data warehousing, only we're doing it in an incremental, agile, faster method where you can also get access to the data all the way through it. >> Yeah that makes sense. >> You know it's not like we will sell no wine before its time, you know you can. >> Yeah, yeah, yeah, yeah. >> You know, now you can eat the grapes, you can drink the wine as it's fermenting, and you can-- >> No wrong or right, just throw it in and figure it out.
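The zone idea Tim describes, a topology of named zones with different locations and quality levels rather than one monolithic warehouse, can be pictured with a small sketch. Everything below (zone names, fields, the helper function) is a hypothetical illustration, not an actual IBM topology format:

```python
# Toy model of a "data topology": named zones, each with a location,
# a data-quality level, and a purpose, instead of one monolithic warehouse.
# All names and fields here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    location: str   # "on-premise", "private-cloud", or "public-cloud"
    quality: str    # "raw" for discovery, "curated" for regulated reporting
    purpose: str

topology = [
    Zone("warehouse", "on-premise", "curated", "regulatory reporting"),
    Zone("discovery", "public-cloud", "raw", "analytic discovery"),
    Zone("sandbox", "private-cloud", "raw", "data science experiments"),
]

def zones_by_location(zones, location):
    # The topology spans environments; the technology behind a zone can
    # change without the topology itself changing.
    return [z.name for z in zones if z.location == location]

print(zones_by_location(topology, "public-cloud"))  # ['discovery']
```

The point of the sketch is that the existing warehouse is not ripped out; it simply becomes one zone among several, each governed at the quality level its purpose needs.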
There's an image that Tim chose: the idea of a data lake is this organized library with books, but the reality is a library with all the books dumped in the middle, and go find the book that you want. >> Peter: And no Dewey Decimal. >> And, exactly. And if you want to pick up on the idea that you had earlier, when you look at that type of a solution, the squad structure is changing. To solve that particular problem you no longer just have your data people on one side. You have a data person, you have the business person that's trying to distill it, you have the developer, you have the operator, so the concept of DevOps to try and synchronize between these players is now really evolved, and this is the first time you're hearing it, right at The Cube. It's the Biz Data DevOps. That's the new way we actually start to tell this. >> Dave: Explain that, explain that to us. >> Very simple. It starts with business requirements. So the business reflects the user and the consumer, and they come with not just generics, they come with very specific requirements that then automatically and immediately say what are the most valuable data sources I need, either from my enterprise or externally? Because the minute I understand those requirements and the persistence of those requirements, I'm now shaping the way the solution has to be implemented. Data first, not data as an afterthought. That's why we call it the data first method. The developers then, when they're building the cloud infrastructure, really understand the type of resilience, the type of compliance, the type of meshing that you need to do, and they're doing it from the outset. And because of the fact that they're dealing with data, the operation people automatically understand that they have to deal with the time to recovery and so on and so forth. So now we're having this. >> Makes sense. You're not throwing it over the wall. >> Exactly. >> That's where the DevOps piece comes in.
>> And you're also understanding the velocity of data through the enterprise, as well as the gaps that you have as an enterprise, because when you go into a digital world you have to accumulate a lot more data, and then you have to be able to match that and you have to be able to do identity resolution to get to a customer to understand all the dimensions of it. >> Well in the digital world, data is the core, and it's interesting what you were saying, Moe, about essentially the line of business identifying the data sources, because they're the ones who know how data affects monetization. >> Yes. >> Inderpal Bhandari, when he took over as IBM Chief Data Officer, said you must form partnerships with the line of business in order to understand how to monetize, how data contributes to the monetization, and your DevOps metaphor is very important because everybody is sort of on the same page, is the idea, right? >> That's right. >> And there's a transformation here because we're working very closely with Inderpal's team and the emergence of a Chief Data Officer in many enterprises, and we actually kind of had a program that we still have going from last year, which is kind of the Chief Data Officer success program, where you can help get at this, because the classic IT structure has kind of started to fail because it's not data oriented, it's technology oriented, so by getting to a data-oriented organization and having an elevated Chief Data Officer, you can get aligned with the line of business, really get your hands on the data, and we prescribe the data topology, which is actually the back cover of that book, shows an example of one, because that's the new center of the universe. The technologies can change, this data can live on premise or in the cloud, but the topology should only change when your business changes-- (drowned out) >> This is hugely important so I want to pick up on something Ginni Rometty was talking about yesterday, was incumbent disruptors.
And when I heard that I'm like, come on, no way. You know, instant skeptic. >> Tim: And that's what, that's what it is. >> Right and so then I started-- >> Moe: Wait, wait, discover. >> To think about it and you guys, what you're describing is how you take somebody, a company, who's been organized around human expertise and other physical assets for years, decades, maybe hundreds of years and transform them into a data-oriented company-- >> Tim: Exactly. >> Where data is the core asset and human expertise is surrounding that data. And look, most data's in silos. You're busting down those silos. >> Exactly. >> And giving the prescription to do that. >> Exactly, yeah exactly. >> I think that's what Tim actually said. You heard us use the word prescriptive. You heard us use the word methodology, data first method or The Garage method, and what we're really starting to see is these patterns from enterprises. You know, what works for a startup does not necessarily translate easily for an enterprise. You have to make it work in the context of the existing baggage, the existing processes, the existing culture. >> Customer expectations. >> Expectations, the scale, all of those type dimensions. So this particular notion of a prescription is we're taking the experiences from Hertz, Marriott, American Airlines, RVs, all of these clients that really have made that leap and got the value, and essentially started to put it in the simple framework, seven elements to those frameworks, and that's in the adoption, yeah. >> You're talking this, right? >> Yeah. >> So we got two documents here, the Cloud Adoption Playbook, which Moe you authored, co-authored. >> Moe: With Tim's help. >> Tim as well, and then this Field Guide, the IBM Data and Analytic Strategy Field Guide, that Tim you also contributed to, right? >> Yeah, I wrote some of it, yeah. >> Which augments the book, so I'll give you the description of it too.
Well I love the hybrid cloud data topology in the back. >> That's an example of a topology on the back. >> So that's kind of cool. But go ahead, let's talk about these. >> So if you look at the cover of that book, a piece of art, very well drawn. That's right. You will see that there are seven elements. You start to see architecture, you start to see culture and organization, you start to see methodology, you start to see all of these different components. >> Dave: Governance, management, security, emerging tech. >> That's right, those really are important in any type of transformation. And then when you look at the data piece, that's a way of taking that data and applying all of these dimensions, so when a client comes forward and says, "Look, I'm having a data challenge in the sense of how do I transform access, how do I share data, how do I monetize?," we start to take them through all of these dimensions, and what we've been able to do is, to go back to our starting comment, accelerate the transformation. >> And the real engagement that we're getting pulled into now in many cases, and getting pulled right up the executive chains at these companies, is data strategy, because this is kind of the core. So many companies have a business strategy, very good business strategies, but then you ask for their data strategy, they show you some kind of block diagram architecture or they show you a bunch of servers in the data center. You know, that's not a strategy. The data strategy really gets at the sources and consumption, velocity of data, and gaps in the data that you need to achieve your business outcome. And so by developing a data strategy, this opens up the patterns and the things that we talk to. So now we look at data security, we look at data management, we look at governance, we look at all the aspects of it to actually lay this out.
And another thought here, the other transformation is in data warehousing; we've been doing this for the past, some of us longer than others, 20 or 30 years, right? And our whole thing then was we're going to align the silos by dumping all the data into this big data warehouse. That is really not the path to go, because these things became like giant dinosaurs: big, monolithic, difficult to change. The data lake concept is you leave the data where it is and you establish a governance and management process over top of it, and then you augment it with things like cloud, like Hadoop, like other things where we can rapidly spin up, and we're taking advantage of things like object stores and advanced infrastructures, and this is really where Moe and I connect with our IBM Cloud Private platforms, with our data capabilities, because we can now put together managed solutions for some of these major enterprises and even show them the road map, and that's really that road map. >> It's critical in that transformation. Last word, Moe. >> Yeah, so to me I think the exciting thing about this year, versus when we spoke last year, is the maturity curve. You asked me this last year, you said, "Moe where are we on the maturity curve of adoption?" And I think the fact that we're talking today about data strategies and so on is a reflection of how people have matured. >> Making progress. >> Earlier on, they really start to think about experimenting with ideas. We're now starting to see them access detailed deep information about approaches and methodologies to do it, and the key word for us this year was not about experimentation or trial, it's about acceleration. >> Exactly. >> Because they've proven it in that garage fashion in small places, now I want to do it at the American Airlines scale, I want to do it at the global scale. >> Exactly. >> And I want, so acceleration is the key theme of what we're trying to do here.
What a change from 15, 20 years ago when the deep data warehouse was the single version of the truth. It was like a snake swallowing a basketball. >> Tim: Yeah exactly, that's a good analogy. >> And you had a handful of people who actually knew how to get in there and you had this huge asynchronous process to get insights out. Now you guys have a very important, in a year you've made a ton of progress, yeah. >> It's democratization of data. Everyone should, yeah. >> So guys, really exciting, I love the enthusiasm. Congratulations. A lot more work to do, a lot more companies to affect, so we'll be watching. Thank you. >> Thank you so much. >> Thank you very much. >> And make sure you read our book. (Tim laughs) >> Yeah definitely, read these books. >> There'll be a quiz after. >> Cloud Adoption Playbook and IBM Data and Analytic Strategy Field Guide. Where can you get these? I presume on your website? >> On Amazon, you can get these on Amazon. >> Oh you get them on Amazon, great. Okay, good. >> Thank you very much. >> Thanks guys, appreciate it. >> Alright, thank you. >> Keep it right there everybody, this is The Cube. We're live from IBM Think 2018 and we'll be right back. (upbeat electronic music)
Rob Thomas, IBM Analytics | IBM Fast Track Your Data 2017
>> Announcer: Live from Munich, Germany, it's theCUBE. Covering IBM: Fast Track Your Data. Brought to you by IBM. >> Welcome, everybody, to Munich, Germany. This is Fast Track Your Data, brought to you by IBM, and this is theCUBE, the leader in live tech coverage. We go out to the events, we extract the signal from the noise. My name is Dave Vellante, and I'm here with my co-host Jim Kobielus. Rob Thomas is here, he's the General Manager of IBM Analytics, and longtime CUBE guest, good to see you again, Rob. >> Hey, great to see you. Thanks for being here. >> Dave: You're welcome, thanks for having us. So we're talking about, we missed each other last week at the Hortonworks DataWorks Summit, but you came on theCUBE, you guys had the big announcement there. You're sort of getting out of doing a Hadoop distribution, right? TheCUBE gave up our Hadoop distribution several years ago, so it's good that you joined us. But, um, that's tongue-in-cheek. Talk about what's going on with Hortonworks. You guys are now going to be partnering with them essentially to replace BigInsights, and you're going to continue to service those customers. But there's more than that. What's that announcement all about? >> We're really excited about that announcement, that relationship, just to kind of recap for those that didn't see it last week. We are making a huge partnership with Hortonworks, where we're bringing data science and machine learning to the Hadoop community. So IBM will be adopting HDP as our distribution, and that's what we will drive into the market from a Hadoop perspective. Hortonworks is adopting IBM Data Science Experience and IBM machine learning to be a core part of their Hadoop platform. And I'd say this is a recognition. One is, companies should do what they do best. We think we're great at data science and machine learning. Hortonworks is the best at Hadoop. Combine those two things, it'll be great for clients.
And, we also talked about extending that to things like Big SQL, where they're partnering with us on Big SQL, around modernizing data environments. And then third, which relates a little bit to what we're here in Munich talking about, is governance, where we're partnering closely with them around unified governance, Apache Atlas, advancing Atlas in the enterprise. And so, it's a lot of dimensions to the relationship, but I can tell you since I was on theCUBE a week ago with Rob Bearden, client response has been amazing. Rob and I have done a number of client visits together, and clients see the value of unlocking insights in their Hadoop data, and they love this, which is great. >> Now, I mean, the Hadoop distro, I mean early on you got into that business, just, you had to do it. You had to be relevant, you want to be part of the community, and a number of folks did that. But it's really sort of best left to a few guys who want to do that, and Apache open source is really, I think, the way to go there. Let's talk about Munich. You guys chose this venue. There's a lot of talk about GDPR, you've got some announcements around unified governance, but why Munich? >> So, there's something interesting that I see happening in the market. So first of all, you look at the last five years. There are only 10 companies in the world that have outperformed the S&P 500 in each of those five years. And we started digging into who those companies are and what they do. They are all applying data science and machine learning at scale to drive their business. And so, something's happening in the market. That's what leaders are doing. And I look at what's happening in Europe, and I say, I don't see the European market being that aggressive yet around data science, machine learning, how you apply data for competitive advantage, so we wanted to come do this in Munich. And it's a bit of a wake-up call, almost, to say hey, this is what's happening.
We want to encourage clients across Europe to think about how they start to do something now. >> Yeah, of course, GDPR is also a hook. It's a European Union regulation, and you guys have talked about that; you've got some keynotes today, and some breakout sessions that are discussing that, but talk about the two announcements that you guys made. There's one on DB2, there's another one around unified governance, what do those mean for clients? >> Yeah, sure, so first of all on GDPR, it's interesting to me, it's kind of the inverse of Y2K, which is there's very little hype, but there are huge ramifications. And Y2K was kind of the opposite. So look, it's coming, May 2018, clients have to be GDPR-compliant. And there's a misconception in the market that it only impacts companies in Europe. It actually impacts any company that does any type of business in Europe. So, it impacts everybody. So we are announcing a platform for unified governance that makes sure clients are GDPR-compliant. We've integrated software technology across analytics, IBM Security, some of the assets from the Promontory acquisition that IBM did last year, and we are delivering the only platform for unified governance. And that's what clients need to be GDPR-compliant. The second piece is data has to become a lot simpler. As you think about my comment, who's leading the market today? Data's hard, and so we're trying to make data dramatically simpler. And so for example, with DB2, what we're announcing is you can download and get started using DB2 in 15 minutes or less, and anybody can do it. Even you can do it, Dave, which is amazing. >> Dave: (laughs) >> For the first time ever, you can-- >> We'll test that, Rob. >> Let's go test that. I would love to see you do it, because I guarantee you can. Even my son can do it. I had my son do it this weekend before I came here, because I wanted to see how simple it was.
So that announcement is really about bringing, or introducing, a new era of simplicity to data and analytics. We call it Download And Go. We started with SPSS, we did that back in March. Now we're bringing Download And Go to DB2, and to our governance catalog. So the idea is to make data really simple for enterprises. >> You had a community edition previous to this, correct? There was-- >> Rob: We did, but it wasn't this easy. >> Wasn't this simple, okay. >> Not anybody could do it, and I want to make it so anybody can do it. >> Is simplicity, the rate of simplicity, the only differentiator of the latest edition, or I believe you have Kubernetes support now with this new edition, can you describe what that involves? >> Yeah, sure, so there are two main things that are new, functionality-wise, Jim, to your point. So one is, look, we're big supporters of Kubernetes. And as we are helping clients build out private clouds, the best answer for that in our mind is Kubernetes, and so when we released Data Science Experience for Private Cloud earlier this quarter, that was on Kubernetes, extending that now to other parts of the portfolio. The other thing we're doing with DB2 is we're extending JSON support for DB2. So think of it as, you're working in a relational environment, now just through SQL you can integrate with non-relational environments, JSON, documents, any type of NoSQL environment. So we're finally bringing to fruition this idea of a data fabric, which is: I can access all my data from a single interface, and that's pretty powerful for clients. >> Yeah, more cloud data development. Rob, I wonder if you can, we can go back to the machine learning, one of the core focuses of this particular event and the announcements you're making. Back in the fall, IBM made an announcement of Watson Machine Learning for IBM Cloud, at World of Watson. In February, you made an announcement of IBM machine learning for the z platform.
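Rob's point above about reaching into JSON documents through plain SQL, the "data fabric" idea, can be sketched briefly. DB2's actual JSON functions and syntax differ; the snippet below uses SQLite's JSON functions (bundled with Python) purely to illustrate the concept of one SQL interface spanning relational rows and document-shaped data:

```python
# Illustration only: querying a JSON document through SQL, so relational
# and non-relational data share a single interface. This is SQLite's JSON
# syntax, not DB2's; the concept, not the API, is the point.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO events (doc) VALUES (?)",
    (json.dumps({"customer": "acme", "channel": "web", "amount": 42}),),
)

# Plain SQL reaching inside the document, filtering and projecting on
# JSON fields just as it would on relational columns.
row = conn.execute(
    "SELECT json_extract(doc, '$.customer'), json_extract(doc, '$.amount') "
    "FROM events WHERE json_extract(doc, '$.channel') = 'web'"
).fetchone()
print(row)
```

The design point is the single interface: the application asks one SQL engine one question, regardless of whether the underlying data is a table or a document.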
What are the machine learning announcements at this particular event, and can you sort of connect the dots in terms of where you're going, in terms of what sort of innovations are you driving into your machine learning portfolio going forward? >> I have a fundamental belief that machine learning is best when it's brought to the data. So, we started with, like you said, Watson machine learning on IBM Cloud, and then we said well, what's the next big corpus of data in the world? That's an easy answer, it's the mainframe, that's where all the world's transactional data sits, so we did that. Last week with the Hortonworks announcement, we said we're bringing machine learning to Hadoop, so we've kind of covered all the landscape of where data is. Now, the next step is about how do we bring a community into this? And the way that you do that is we don't dictate a language, we don't dictate a framework. So if you want to work with IBM on machine learning, or in Data Science Experience, you choose your language. Python, great. Scala or Java, you pick whatever language you want. You pick whatever machine learning framework you want, we're not trying to dictate that because there's different preferences in the market, so what we're really talking about here this week in Munich is this idea of an open platform for data science and machine learning. And we think that is going to bring a lot of people to the table. >> And with open, one thing, with open platform in mind, one thing to me that is conspicuously missing from the announcement today, correct me if I'm wrong, is any indication that you're bringing support for the deep learning frameworks like TensorFlow into this overall machine learning environment. Am I wrong? I know you have Power AI. Is there a piece of Power AI in these announcements today? >> So, stay tuned on that. We are, it takes some time to do that right, and we are doing that. 
But we want to optimize so that you can do machine learning with GPU acceleration on Power AI, so stay tuned on that one. But we are supporting multiple frameworks, so if you want to use TensorFlow, that's great. If you want to use Caffe, that's great. If you want to use Theano, that's great. That is our approach here. We're going to allow you to decide what's the best framework for you. >> So as you look forward, maybe it's a question for you, Jim, but Rob I'd love you to chime in. What does that mean for businesses? I mean, is it just more automation, more capabilities as you evolve that timeline, without divulging any sort of secrets? What do you think, Jim? Or do you want me to ask-- >> What do I think, what do I think you're doing? >> No, you ask about deep learning, like, okay, that's, I don't see that, Rob says okay, stay tuned. What does it mean for a business, that, if like-- >> Yeah. >> If I'm planning my roadmap, what does that mean for me in terms of how I should think about the capabilities going forward? >> Yeah, well what it means for a business, first of all, is what they're going, they're using deep learning for, is doing things like video analytics, and speech analytics and more of the challenges involving convolutional neural networks to do pattern recognition on complex data objects for things like connected cars, and so forth. Those are the kind of things that can be done with deep learning. >> Okay. And so, Rob, you're talking about here in Europe how the uptick in some of the data orientation has been a little bit slower, so I presume from your standpoint you don't want to over-rotate to some of these things. But what do you think, I mean, it sounds like there is a difference between certainly Europe and those top 10 companies in the S&P, outperforming the S&P 500. What's the barrier, is it just an understanding of how to take advantage of data, is it cultural, what's your sense of this?
>> So, to some extent, data science is easy, data culture is really hard. And so I do think that culture's a big piece of it. And the reason we're kind of starting with a focus on machine learning, simplistic view, machine learning is a general-purpose framework. And so it invites a lot of experimentation, a lot of engagement, we're trying to make it easier for people to on-board. As you get to things like deep learning as Jim's describing, that's where the market's going, there's no question. Those tend to be very domain-specific, vertical-type use cases and to some extent, what I see clients struggle with, they say well, I don't know what my use case is. So we're saying, look, okay, start with the basics. A general purpose framework, do some tests, do some iteration, do some experiments, and once you find out what's hunting and what's working, then you can go to a deep learning type of approach. And so I think you'll see an evolution towards that over time, it's not either-or. It's more of a question of sequencing. >> One of the things we've talked to you about on theCUBE in the past, you and others, is that IBM obviously is a big services business. This big data is complicated, but great for services, but one of the challenges that IBM and other companies have had is how do you take that service expertise, codify it to software and scale it at large volumes and make it adoptable? I thought the Watson data platform announcement last fall, I think at the time you called it Data Works, and then so the name evolved, was really a strong attempt to do that, to package a lot of expertise that you guys had developed over the years, maybe even some different software modules, but bring them together in a scalable software package. So is that the right interpretation, how's that going, what's the uptake been like? >> So, it's going incredibly well. 
What's interesting to me is what everybody remembers from that announcement is the Watson Data Platform, which is a decomposable framework for doing these types of use cases on the IBM cloud. But there was another piece of that announcement that is just as critical, which is we introduced something called the Data First method. And that is the recipe book to say to a client, so given where you are, how do you get to this future on the cloud? And that's the part that people, clients, struggle with, is how do I get from step to step? So with Data First, we said, well look. There's different approaches to this. You can start with governance, you can start with data science, you can start with data management, you can start with visualization, there's different entry points. You figure out the right one for you, and then we help clients through that. And we've made Data First method available to all of our business partners so they can go do that. We work closely with our own consulting business on that, GBS. But that to me is actually the thing from that event that has had, I'd say, the biggest impact on the market, is just helping clients map out an approach, a methodology, to getting on this journey. >> So that was a catalyst, so this is not a sequential process, you can start, you can enter, like you said, wherever you want, and then pick up the other pieces from a maturity model standpoint? >> Exactly, because everybody is at a different place in their own life cycle, and so we want to make that flexible.
>> Rob: No, I think that-- >> Your strategy. >> Rob: You got it right, I would just, I would expand a little bit. So, one is it's a very flexible way to manage data. When you look at the Watson Data Platform, we've got relational stores, we've got column stores, we've got in-memory stores, we've got the whole suite of open-source databases under the Compose.io umbrella, we've got Cloudant. So we've delivered a very flexible data layer. Now, in terms of how you apply data science, we say, again, choose your model, choose your language, choose your framework, that's up to you, and we allow clients, many clients start by building models on their private cloud, then we say you can deploy those into the Watson Data Platform, so therefore then they're running on the data that you have as part of that data fabric. So, we're continuing to deliver a very fluid data layer which then you can apply data science, apply machine learning there, and there's a lot of data moving into the Watson Data Platform because clients see that flexibility. >> All right, Rob, we're out of time, but I want to kind of set up the day. We're doing CUBE interviews all morning here, and then we cut over to the main tent. You can get all of this on IBMgo.com, you'll see the schedule. Rob, you've got, you're kicking off a session. We've got Hilary Mason, we've got a breakout session on GDPR, maybe set up the main tent for us. >> Yeah, main tent's going to be exciting. We're going to debunk a lot of misconceptions about data and about what's happening. Marc Altshuller has got a great segment on what he calls the death of correlations, so we've got some pretty engaging stuff. Hilary's got a great piece that she was talking to me about this morning. It's going to be interesting. We think it's going to provoke some thought and ultimately provoke action, and that's the intent of this week. >> Excellent, well Rob, thanks again for coming to theCUBE. It's always a pleasure to see you.
>> Rob: Thanks, guys, great to see you. >> You're welcome; all right, keep it right there, buddy. We'll be back with our next guest. This is theCUBE, we're live from Munich, Fast Track Your Data, right back. (upbeat electronic music)
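One mechanic behind Rob's "build models on their private cloud, then deploy those into the Watson Data Platform" answer is simply serializing a trained model and rehydrating it next to the data. A minimal standard-library sketch of that round trip follows; the dict-based "model" is a stand-in, not how Watson actually packages models.

```python
# Train-here, deploy-there: serialize a model and rehydrate it
# next to the data. The dict "model" is a stand-in for a real
# trained object; no Watson packaging format is implied.
import pickle

model = {"weights": [0.4, 0.6], "bias": 0.1}

def score(m, features):
    """Apply the model to one feature vector."""
    return sum(w * f for w, f in zip(m["weights"], features)) + m["bias"]

blob = pickle.dumps(model)       # "export" from the private cloud
deployed = pickle.loads(blob)    # "import" where the platform's data lives

assert score(deployed, [1.0, 2.0]) == score(model, [1.0, 2.0])
```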
Derek Schoettle & Adam Kocoloski, IBM - IBM Interconnect 2017 - #ibminterconnect - #theCUBE
>> Narrator: Live from Las Vegas! It's the Cube covering Interconnect 2017, brought to you by IBM. >> Okay, welcome back everyone. We are live in Las Vegas at IBM Interconnect 2017, IBM's cloud and now data show. I'm John Furrier with my co-host Dave Vellante. This is the Cube. Our next guest is Derek Schoettle, the general manager of Watson Data Platform, and Adam Kocoloski who's the CTO of the Watson Data Platform. Guys, welcome to the Cube. Good to see you again Derek. Great to see you, welcome Adam! >> Thanks, John. >> So, obviously the data was a big part of the theme. You saw Chris Moody from Twitter up there, obviously, they have a ton of data. I like to joke about they have a really active user right now in the President of the United States. >> Daily State of the Union, I think, was the one take away. >> Daily State of the Union. But this is the conversation that's happening in all over IT, and enterprise, and cloud, both public and enterprise, is the data conversation in context to cloud. Super relevant right now, and there's architecturals at play, it's app, it impacts app developers, it impacts architectures. And that's the Holy Grail, the so-called app data layer or cloud data layer. What's your vision, guys, on this? Derek, I'll start with you, your vision on this data opportunity. How does IBM approach it? And what's different from, or could be different from the competitors? >> Yeah, I know, one, it's an exciting time. We were just chatting about before we went live is, there's so much change taking place in and around data, right? It used to be it's the natural currency, it's everything everyone is talking about. The reality is, it's changing business models, right? It introduces a whole new set of discussions when you introduce cloud, self-service and open source. 
So, when we step back and think about how we can differentiate, how we can make IBM's offer to clients and the broader market interesting, is shift to a platform strategy where it says, instead of discrete composable services that act independent of one another that are not, I'll say, self-aware, shift into a platform where you have common governance, you have common management, and you have really a collaborative by design approach where data is at the epicenter. Data is what starts every conversation whether you're on the app dev side, whether you are a data scientist, someone who's, you know, at the edge of discovery. And cloud's what's enabling that, self-service is what's enabling that and operationalize is what we do. I mean, we spend our days thinking about and then operationalizing feature, function, and then performance for a lot of different workloads. 'Cause it used to be, I think the, I was at Vertica, right? So that was the introduction of volume, variety, and velocity, right? Now, with the introduction of AI and cognitive, it's really about taking any and all and rationalizing it. And any and all meaning sitting within your corporate structure, as well as what's more broadly in the internet, out available within social media, right? That to me is the shift that's taking place. It's all companies are realizing they made a lot of investments, they have a lot of data, and they're not taking advantage of it. And we see that the big shift is... People are saying data scientist, what we think about is the merging of data and science. You think of science as cognitive and AI, right? That's a small population that really understands and can take advantage of. You have a whole big market that's out there in traditional data and analytics. Our platform is about merging those two. It's really about merging those experiences so everyone takes advantage of the benefits of data and science.
>> What's the conversations that you are having, Derek, with customers? Because I think that's, there's a lot of bells going off into the CXO or even practitioners when you hear about machine learning, you hear AI, cognitive, autonomous vehicles, sensor networks. Obviously that's, the alarms are going off, like, I'd better get my act together. So, how do they pull that off? How do your customers pull off making that happen? Because now you got to bring in to be cloud ready, you have all these decoupled component parts. >> Yeah. >> John: You got to operate them in the cloud and you got to kind of have an on-prem component that's hybrid. What are the conversations that you are having with customers in how they're pulling this off? >> Yeah so, I'll cover the first piece, and I know Adam is spending certainly this week and a lot of time as well with clients on this topic. You know, the first part of the discussion is do you believe that the cloud can help you? Most folks are saying, "Yes, we believe it can help". Second piece is, how do I take advantage of emerging technologies that are moving at a rate and pace that perhaps my skills, my existing IT architecture, and my business model can't fully kind of, grasp, if not take advantage of? So, what we've introduced is a methodology, a data first method, which literally is a, it sounds simple, but at the end of the day, it is a common, uniform, agile way for us as IBM to engage with partners and clients that literally starts with the discovery workshop that says how does data inform your business? It's not static reporting anymore, it's what is the data that's sitting within your organization? You heard it from James at PlayFab. Data is changing the way people build in games today, thinking about how to enrich games, so on and so forth. Data First Method is what we've introduced, so you'll see going forward, IBM will sell Data First, we will engage Data First. 
So, any conversation with someone who says, "How do I take advantage of AI, "or machine learning, "or data science experience?". Well, let's step back for a second and talk about data. 'Cause 30 years ago, 20, that's how every conversation started. You get on a whiteboard, you design a schema, you talk about the relationships. That's how it started, and we're kind of cycling back to that, right? We got to put data first. >> So, Adam, the geeks are always arguing speeds, "I got a Hadoop cluster here, "I got this over here.". I mean, there's a lot of variety and diversity in terms of how people can manage either databases, and middleware or what not, right? So, how do you see the data first? How does it play out architecturally? And how does that play out for the solution? >> I think one of the big advantages we have in the world of the cloud platform is this opportunity to, on the one hand, use more a broader variety of compossible services, but also be able to take different parts of the business that were historically a little bit more separated from one another and bring them together. So you look at a Hadoop-flavored data leg on premises. It's a good area to do discovery, a good area to do exploration. But what clients really care about time and time again, a common refrain is the operationalization of the analytics, of the machine learning models. How do I take this insight that my data science team has discovered, and have it really influence a business process or incorporate it into an application? And in the on-premises architecture, that's often times quite a challenge. In the world of the cloud platform and the Watson data platform, we have an opportunity to be a little bit closer to things like the world of kubernetes which are really ideally suited for deploying and scaling microservices and APIs in a cloud-native, fault-tolerant, reliable fashion, right? 
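Adam's point about Kubernetes being "ideally suited for deploying and scaling microservices and APIs" usually cashes out as a model wrapped in a small HTTP scoring endpoint that the orchestrator replicates. A hedged standard-library sketch follows; the payload shape, port, and weights are invented for illustration and imply no IBM contract.

```python
# A minimal scoring endpoint of the kind Kubernetes would scale out.
# Payload shape and weights are illustrative, not any IBM API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHTS = [0.5, -0.25]

def score(features):
    """Stand-in for a deployed model's scoring function."""
    return sum(w * f for w, f in zip(WEIGHTS, features))

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        reply = json.dumps({"score": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

# To actually serve one replica (Kubernetes would run many):
#   HTTPServer(("", 8080), ScoringHandler).serve_forever()
```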
So, you're seeing us take that menu of composable services in the cloud platform, and treat the data platform as one such composition. An opinionated way to put together this menu of services specifically to help data professionals collaborate, and drive the business forward. >> So, when you guys announced the Watson Data Platform, I think you called it Data Works, then changed the name, about five, maybe six months ago you messaged that 80% of, you know, data professionals' time is spent wrangling data, not enough time doing the fun stuff. And the premise was you coming up with a platform for collaboration that sort of integrates those different roles as well as, as you pointed out just now, allows you to operationalize analytics. Okay, so we're five months in, six months in, what kind of proof points do you have? Have you seen it? I mean, some people were skeptical saying, "Okay, well, it's IBM, "they've put a nice wrapper on this thing, "pulling in some different legacy components, "and you know, nice name." Okay, so, what do you say to that? And what evidence do you have that what you said is going to come true is actually coming true? >> You're going to do tech and I can do customer? >> Yeah, go for customer first. >> Yeah, so what we've seen is if you think about why we ended up at a platform. So, if you roll the tape back to when Cloudant got acquired in 2014, the journey that we were on was everyone was building rich applications, they wanted to be smarter, they wanted to understand what that exhaust was coming off. >> Right. >> Derek: And they wanted to add different ingredients to it. So, instead of a do-it-yourself kit with a bunch of proprietary interoperability issues, a ton of expense and inefficiency, and that can't take advantage of the cloud, we decided, and this very much set our path: let's build a platform that allows you to easily ingest, govern, curate, and then, I'll say, present and deploy.
So, starting in actually June, and this started first with Spark. We made a huge bet on Spark 'cause we believed that to be kind of the operational operating system, if you will, for an analytic fabric. So, it started in Spark. Then, when we announced the Watson Data Platform in October it was, here's how we're going to take our heritage around governance, our heritage around traditional structured, non-structured data repositories, and here's how we're going to take visualization and distribution of data. So, that then next went into how we bring it to market? That's Data First. So, we've been working with large insurance companies, large financial services companies, retailers, gaming companies, and the net that we see is three things. First is, yes everyone agrees the platform is the right place to go. It's where do we get started? How do I take my existing investment and take advantage of this platform? And that, invariably, is I'm going to build a net new application whether it be Watson Conversations, so that runs into Watson Data Platform. We want to ingest data, but we want that data to be resident on-prem, we want it to be native to the cloud, and so we're going to work through the architectural change to adopt that. Another great example is we want to start with just an analytic application because we are already hosting with you a mobile app. Well, we're going to run it into your analytic fabric using dashDB, and dashDB works with Watson Analytics and we're going to build an application that's resident. The really creative and compelling piece here, back to your comment on IBM is, it's really hard to buy things from this company historically.
Buying things from IBM is not easy, so we built a platform, we built the methodology to help you understand how to take advantage of it, and now we have a subscription, the Bluemix subscription, through which you can come in and draw down those services, be it an object store, be it a SQL data store, be the visualization layer. >> John: Composability basically. >> Yeah, but in a common governed framework. The big takeaway is, and I'll pass to Adam, governance and security and operationalizing the platform is what we can bring to bear. 'Cause we're bringing Open Source, we're bringing proprietary technologies, but if it's done independent, it doesn't really deliver on the promise of a platform. >> I will say that architecturally, that's incredibly liberating to know that there is this one common mind model. >> It's also highly requested by customers. That's what they want. >> Derek: That's what they want. It's the path to get there that I think is, we're at that intersection right now, it's crossing the chasm. >> John: So, what's liberating? Give us good-- >> Oh, just the fact that you know that if there's a common access control layer under the hood, if there's a common governance layer under the hood, that you don't have to compromise and come up with an alternative proposition for taking some capability, maybe deploying a model to a scoring engine. You can have the one purpose-built scoring engine and know that I can call that in on demand from discovery phase to go to production and I don't have to sort of engage in another separate mind conversation or separate entitlement conversation or a separate enabling conversation. This catalog is allowing it to work together. >> That to me, from a team sport perspective, changes the steps you have to take. So, think of ETL. ETL really in a modern real time, like getting away from batch and go into real time, that's just flow.
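Derek's "getting away from batch and going to real time, that's just flow" maps neatly onto a generator: the same enrichment logic can run over a closed batch or over records as they arrive. A small illustrative sketch, with invented record fields:

```python
# Same transform, two modes: batch materialization vs. flow.
def enrich(record):
    """Shared enrichment step (the 'T' in ETL)."""
    out = dict(record)
    out["total"] = out["qty"] * out["price"]
    return out

def flow(source):
    """Streaming style: enrich each record as it arrives."""
    for record in source:
        yield enrich(record)

orders = [{"qty": 2, "price": 3.0}, {"qty": 1, "price": 5.0}]

batch_result = [enrich(r) for r in orders]   # batch: all at once
stream_result = list(flow(iter(orders)))     # flow: record by record

assert batch_result == stream_result
```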
So, the skill set and the ownership of the infrastructure associated with that has evolved, especially in cloud where that's just a dynamic where it's going to be a team deciding here's the data I want, here's how I want to enrich it, here's how I want to govern and curate it. >> It's a team sport. I love that. We were just at the Strata Hadoop. We had our big data SV event and the collision between batch and real time, they are not mutually exclusive and some people just made bets on batch and forgot real time. And they have real time people who don't do batch. So, you kind of see that coming together. >> Adam: Convergence. >> So, the question, Adam, for you is that, with the world kind of moving in that direction, how do you rationalize so the customer who's saying, "Hey, I'm cloud native but I also have a hybrid here "and I want to be cloud native purely "on these net new applications". So, there's a conversation happening. I call it the dev ops of data which is like data ops. Hey, I'm a programmer. I just want data as code. I just don't want to get in the weeds of setting up a data warehouse, and prepping an ETL, all that batch stuff that someone else does. I'm writing some software. I want data native to my app, but I don't want to go in and do the wrangling. I don't want to go out. I just want stuff to magically work. How do you tackle that premise? >> I mean, I think the dev ops of data piece is certainly a topic we're going to be hearing a lot more about over the next coming six months, in a year. I think the reason for that is precisely because this earlier topic of operationalization. You've got lots of people building up, budding data science teams and so on. And the first thing they're going to do is be working in the discovery area. They won't be in the world of pushing things to production.
When they do, it's going to become more important that the folks who truly understand the details of the algorithm are close enough to the deployed assets, so that they can understand how this model is behaving over time. So that they can understand new data quality issues that might have cropped up and get close to that without obviously sort of breaking the separation duties that are important for a production system. So, I think, that is one part of the data ops conversation that hasn't yet been worked out. It's going to be a real opportunity for folks who-- >> That's an emerging area. You agree, right? >> It's a cultural shift too. I mean that is a re-thinking of, because most companies keep data in steel pipes. They're highly regulated. Their rules, the personalities that own them so to speak. The proposition that we've been on and every client asks for is how do I create a common fabric that gives access to people, that is governed and curated so you can always give a shopping experience. People that work with data do not want to talk about and say this: "How long does it take to stand up a server? "When can I get the data stood up in the staging area "so I can actually access it?" That's over. >> It's interesting, we're doing some Wikibon research on this, and this is the point where people look at value extraction of the data so they tend to, it's kind of like if you're a hammer, everything looks like a nail. So if you're in IT, it's infrastructure. If you are on the business line, it's the apps. So, you're seeing the shift where apps are creating the value, but the infrastructure is more elastic, more composable so it's enablement by itself so that's interesting. So, your thoughts on that, guys? Where is that value of the data coming from most, right now? Is it the apps? Is the infrastructure still evolving? The hybrid not-- >> We think there's a value model here.
There are certainly elements of the data pipeline that are purely operational, reporting-based and things like that, which drive value on their own. But we also recognize that it's new uses of data and new business processes that are primarily driven by applications, driven by conversational interfaces, driven by these sort of emerging paradigms. And one of our goals in the data platform is to ensure that clients can move along that curve more aggressively. >> How are people getting started with the Watson Data Platform? Do they go jumping all in? Is there a community edition, you can try it before you buy it kind of thing? >> Yeah, so you're signing up in Bluemix. You have access to a set of services around the platform. You have a 30-day window where you can try everything included within it, and then at some point you got to commit to a credit card or you got to commit a 12-month term agreement. I think in parallel, we see a lot of other companies that end up blasting in size challenge for IBM. We have a lot of clients. We have got a lot of clients that we are working with today in traditional architects and infrastructure, helping them through a methodology, helping them with the right skills. That is a more traditional, hey, come in and try an analytic workload on the platform. We'll give the skills. We'll help do the enablement and then we're off and running. I think the big difference is whether or not clients are paying for and they are willing to pay for it. 'Cause we are helping them get to this new model. We're helping them get to the platform, and I think the big thing we're working through is how do we get to velocity? I think when you look at these workloads that are happening. The reason they're happening is now data is not just in some dark corner. With AI, the machine learning is always on. So, there's a lot of different ways in which you can unleash that, that then, how do you take advantage of it? And that is a cultural shift.
It's re-thinking business models, it's re-thinking how you got skills deployed which is incredibly exciting for us, and I think the market in general. I think back to how AI is cast in many cases as the robots are going to rule the world. There's a lot of good that can come from exposing vast amounts of data to AI and to frameworks where you can get a lot of value out of it. From how to better position products to how to, better design of medicines to fulfillment chains in countries that need help. >> So, guys, in the last minute that we have I want you to take a minute to either together or one of you guys talk about how IBM is helping solve what seems to be the number one question we get on the Cube where I get asked, hey, how do you help me build a hybrid architecture. I have more data-rich workloads coming on board now. Either I have some heavy data rich workloads that are run on-prem, I got more cloud action coming, I got IOT and I'm investing in data science. So, how do you guys specifically help me build a hybrid cloud architecture that's going to fuel and support data-rich workloads and propel my data science operation. >> Yeah, so, I'll take the basics for me. It is the Data First method. It is dashDB, which is an extensible on-prem hybrid in the cloud so that the common analytic fabric. There's Data Connect, which is our ability to move data batch continuous into different end states in the cloud, and then there's data science experience. So data science experience is our offering that brings together community, it brings together content, it brings together various tooling for the data scientist or data engineers. And I think the other piece of this is, we have something called solutions assurance. 
So we're literally designing patterns that we stand up in our own environments that reflect what we see on Premise and what we see workloads going into the cloud with, and stamping that as hybrid architectures that are repeatable, and we remove risk, the operational risk. But the reality is (mumbles) is, clients have to make sacrifices in getting to the cloud. You have to deprecate, you have to rethink. And that's where some of the smoothing of those rough edges come into the discipline of us saying, here's a supported architecture, here's the destination that you're going to, and we're going to have to work together to get there. Which is the fun part, I mean, that's what we're all in this for, is getting the outcomes. >> I think the key is not to pretend that these environments are completely identical to one another. There are things that the public cloud is uniquely well suited for. So let's make sure that those kinds of use cases are really nailed there, right? And then there are other cases where you're dealing with mainframe systems running critical business processes, and you want to be able to infuse that process with some analytics. So you have to look at the use case. Maybe it's training a machine learning model in the cloud, being able to export that model and run it-- >> So use proven solutions and be prepared to be handling new ones coming onboard. Alright, Derek Schoettle, general manager, and Adam Kocoloski, the CTO, the leaders at IBM Watson Data Group, IMB Watson Platform. This is The Cube, back with more live coverage after this short break.
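Adam's earlier point about understanding "how this model is behaving over time" and catching "new data quality issues" is typically operationalized as a drift check: compare live data against the training-time profile and alert past a threshold. A toy standard-library sketch; the z-score rule and the 3.0 threshold are illustrative choices, not a named IBM feature.

```python
# Toy drift check: flag when live data wanders from the training profile.
# The z-score rule and 3.0 threshold are illustrative choices only.
from statistics import mean, stdev

def drifted(train_values, live_values, z_threshold=3.0):
    """True when the live mean sits more than z_threshold training
    standard deviations away from the training mean."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) / sigma > z_threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5]

assert not drifted(train, [10.2, 9.8, 10.1])  # looks like training data
assert drifted(train, [25.0, 26.0, 24.5])     # the feature has shifted
```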
Nick Pentreath, IBM STC - Spark Summit East 2017 - #sparksummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is The Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Boston, everybody. Nick Pentreath is here this year; he's a principal engineer at the IBM Spark Technology Center in South Africa. Welcome to The Cube. >> Thank you. >> Great to see you. >> Great to see you. >> So let's see, it's a different time of year here than you're used to. >> I've flown from, I don't know the Fahrenheit equivalent, but 30 degrees Celsius heat and sunshine to snow and sleet, so. >> Yeah, yeah. So it's a lot chillier here. Wait until tomorrow. But, so we were joking. You probably get the T-shirt for the longest flight here, so welcome. >> Yeah, I actually need the parka, or like a beanie. (all laugh) >> Little better. Long sleeve. So Nick, tell us about the Spark Technology Center, STC is its acronym, and your role there. >> Sure, yeah, thank you. So the Spark Technology Center was formed by IBM a little over a year ago, and its mission is to focus on the Open Source world, particularly Apache Spark and the ecosystem around that, and to really drive forward the community and to make contributions to both the core project and the ecosystem. The overarching goal is to help drive adoption, yeah, and particularly enterprise customers, the kind of customers that IBM typically serves. And to harden Spark and to make it really enterprise ready. >> So why Spark? I mean, we've watched IBM do this now for several years. The famous example that I like to use is Linux. When IBM put $1 billion into Linux, it really went all in on Open Source, and it drove a lot of IBM value, both internally and externally for customers. So what was it about Spark? I mean, you could have made a similar bet on Hadoop. You decided not to, you sort of waited to see that market evolve. What was the catalyst for having you guys all go in on Spark? >> Yeah, good question.
I don't know all the details, certainly, of what the internal drivers were, because I joined STC a little under a year ago, so I'm fairly new. >> Translate the hallway talk, maybe. (Nick laughs) >> Essentially, I think you raise very good parallels to Linux and also Java. >> Absolutely. >> So Spark, sorry, IBM, made these investments in Open Source technologies that proved to be transformational and kind of game-changing. And I think, you know, most people will probably admit within IBM that they maybe missed the boat, actually, on Hadoop and saw Spark as the successor and actually saw a chance to really dive into that and kind of almost leapfrog and say, "We're going to back this as the next generation analytics platform and operating system for analytics and big data in the enterprise." >> Well, I don't know if you happened to watch the Super Bowl, but there's a saying that it's sometimes better to be lucky than good. (Nick laughs) And that sort of applies, and so, in some respects, maybe missing the window on Hadoop was not a bad thing for IBM >> Yeah, exactly because not a lot of people made a ton of dough on Hadoop and they're still sort of struggling to figure it out. And now along comes Spark, and you've got this more real time nature. IBM talks a lot about bringing analytics and transactions together. They've made some announcements about that and affecting business outcomes in near real time. I mean, that's really what it's all about and one of your areas of expertise is machine learning. And so, talk about that relationship and what it means for organizations, your mission. >> Yeah, machine learning is a key part of the mission. And you've seen the kind of big data in the enterprise story, starting with the kind of Hadoop and data lakes. And that's evolved into, now we've, before, we just dumped all of this data into these data lakes and these silos and maybe we had some Hadoop jobs and so on.
But now we've got all this data we can store, what are we actually going to do with it? So part of that is the traditional data warehousing and business intelligence and analytics, but more and more, we're seeing there's a rich value in this data, and to unlock it, you really need intelligent systems. You need machine learning, you need AI, you need real time decision making that starts transcending the boundaries of all the rule-based systems and human-based systems. So we see machine learning as one of the key tools and one of the key unlockers of value in these enterprise data stores. >> So Nick, perhaps paint us a picture of someone who's advanced enough to be working with machine learning with IBM, and we know that the tool chain's kind of immature. Although, IBM with Data Works or Data First has a fairly broad end-to-end sort of suite of tools, but what are the early-use cases? And what needs to mature to go into higher volume production apps or higher-value production apps? >> I think the early-use cases for machine learning in general and certainly at scale are numerous and they're growing, but classic examples are, let's say, recommendation engines. That's an area that's close to my heart. In my previous life before IBM, I built a startup that had a recommendation engine service targeting online stores and e-commerce players and social networks and so on. So this is a great kind of example use case. We've got all this data about, let's say, customer behavior in your retail store or your video-sharing site, and in order to serve those customers better and make more money, if you can make good recommendations about what they should buy, what they should watch, or what they should listen to, that's a classic use case for machine learning and unlocking the data that is there, so that is one of the drivers of some of these systems, players like Amazon, they're sort of good examples of the recommendation use case.
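The recommendation use case Nick describes can be illustrated with a toy co-occurrence recommender: suggest the items that most often appear alongside what a customer already has. This is purely a sketch of the idea, not Amazon's or any production system; the data and the scoring rule are invented for the example:

```python
from collections import Counter
from itertools import combinations

def recommend(histories, owned, top_n=2):
    # Count how often pairs of items appear together across all baskets.
    cooc = Counter()
    for basket in histories:
        for a, b in combinations(sorted(set(basket)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    # Score unseen items by co-occurrence with what the user already has.
    scores = Counter()
    for item in owned:
        for (a, b), count in cooc.items():
            if a == item and b not in owned:
                scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]

histories = [["beer", "chips"], ["beer", "chips", "salsa"], ["chips", "salsa"]]
print(recommend(histories, {"beer"}))  # -> ['chips', 'salsa']
```

Real systems replace the pairwise counts with factorization or learned models, but the shape of the problem, behavior data in, ranked suggestions out, is the same.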
Another is fraud detection, and that is a classic example in financial services, enterprise, which is a kind of staple of IBM's customer base. So these are a couple of examples of the use cases, but the tool sets, traditionally, have been kind of cumbersome. So Amazon built everything from scratch themselves using customized systems, and they've got teams and teams of people. Nowadays, you've got this built into Apache Spark; you've got, in Spark, a machine learning library, you've got good models to do that kind of thing. So I think from an algorithmic perspective, there's been a lot of advancement and there's a lot of standardization and almost commoditization of the model side. So what is missing? >> George: Yeah, what else? >> And what are the shortfalls currently? So there's a big difference between the current view, I guess the hype of the machine learning as you've got data, you apply some machine learning, and then you get profit, right? But really, there's a hugely complex workflow that involves this end-to-end story. You've got data coming from various data sources, you have to feed it into one centralized system, transform and process it, extract your features and do your sort of hardcore data science, which is the core piece that everyone sort of thinks about as the only piece, but that's kind of in the middle and it makes up a relatively small proportion of the overall chain. And once you've got that, you do model training and selection testing, and you now have to take that model, that machine-learning algorithm, and you need to deploy it into a real system to make real decisions. And that's not even the end of it because once you've got that, you need to close the loop, what we call the feedback loop, and you need to monitor the performance of that model in the real world. You need to make sure that it's not deteriorating, that it's adding business value. All of these kinds of things.
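The end-to-end workflow Nick walks through, ingest, feature extraction, training, deployment, and the feedback loop, can be reduced to a skeleton like this. The stage names and the toy threshold "model" are illustrative assumptions, not any real pipeline:

```python
def ingest(sources):
    # Bring records from several source systems into one place.
    return [rec for src in sources for rec in src]

def extract_features(records):
    # The "hardcore data science" step: raw record -> (feature, label).
    return [(rec["amount"], rec["fraud"]) for rec in records]

def train(examples):
    # Toy model: flag any amount at or above the smallest fraud seen.
    return min(amount for amount, fraud in examples if fraud)

def deploy(threshold):
    # Deployment means making the model callable by real systems.
    return lambda amount: amount >= threshold

def monitor(model, labeled_stream):
    # The feedback loop: track live accuracy so deterioration is caught.
    hits = sum(model(amount) == fraud for amount, fraud in labeled_stream)
    return hits / len(labeled_stream)

sources = [[{"amount": 10, "fraud": False}, {"amount": 900, "fraud": True}],
           [{"amount": 25, "fraud": False}]]
model = deploy(train(extract_features(ingest(sources))))
print(monitor(model, [(15, False), (950, True)]))  # -> 1.0
```

Notice how little of the code is the "model": most of the surface area is plumbing, which is exactly Nick's point about where the engineering effort goes.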
So I think that is the real, the piece of the puzzle that's missing at the moment is this end-to-end, delivering this end-to-end story and doing it at scale, securely, enterprise-grade. >> And the business impact of that presumably will be a better-quality experience. I mean, recommendation engines and fraud detection have been around for a while, they're just not that good. Retargeting systems are too little, too late, and fraud detection is kind of cumbersome. Still a lot of false positives. Getting much better, certainly compressing the time. It used to be six months, >> Yes, yes. Now it's minutes or seconds, but a lot of false positives still, so, but are you suggesting that by closing that gap, that we'll start to see from a consumer standpoint much better experiences? >> Well, I think that's imperative because if you don't see that from a consumer standpoint, then the mission is failing because ultimately, it's not magic that you just simply throw machine learning at something and you unlock business value and everyone's happy. You have to, you know, there's a human in the loop, there. You have to fulfill the customer's need, you have to fulfill consumer needs, and the better you do that, the more successful your business is. You mentioned the time scale, and I think that's a key piece, here. >> Yeah. >> What makes better decisions? What makes a machine-learning system better? Well, it's better data and more data, and faster decisions. So I think all of those three are coming into play with Apache Spark, the end-to-end story, streaming systems, and the models are getting better and better because they're getting more data and better data. >> So I think we've, the industry, has pretty much attacked the time problem. Certainly for fraud detection and recommendation systems, the quality issue. Are we close?
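Dave's false-positive point is easy to make concrete: fraud models are usually judged on precision and false-positive rate rather than raw accuracy, because legitimate transactions vastly outnumber fraud. A quick worked example from assumed confusion-matrix counts:

```python
def precision(tp, fp):
    # Of everything we flagged as fraud, how much really was fraud?
    return tp / (tp + fp)

def false_positive_rate(fp, tn):
    # Of all legitimate transactions, how many did we wrongly flag?
    return fp / (fp + tn)

# Assumed counts: 90 frauds caught, 910 legitimate transactions
# flagged anyway, 99,000 legitimate transactions correctly passed.
print(precision(90, 910))               # -> 0.09
print(false_positive_rate(910, 99000))  # roughly 0.0091
```

A system can wrongly flag under one percent of legitimate traffic and still be wrong nine times out of ten when it raises an alarm, which is why "a lot of false positives" persists even as models improve.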
I mean, are we talking about 6-12 months before we really sort of start to see a major impact to the consumer and ultimately, to the company who's providing those services? >> Nick: Well, >> Or is it further away than that, you think? >> You know, it's always difficult to make predictions about timeframes, but I think there's a long way to go to go from, yeah, as you mentioned where we are, the algorithms and the models are quite commoditized. The time gap to make predictions is kind of down to this real-time nature. >> Yeah. >> So what is missing? I think it's actually less about the traditional machine-learning algorithms and more about making the systems better and getting better feedback, better monitoring, so improving the end user's experience of these systems. >> Yeah. >> And that's actually, I don't think it's, I think there's a lot of work to be done. I don't think it's a 6-12 month thing, necessarily. I don't think that in 12 months, certainly, you know, everything's going to be perfectly recommended. I think there's areas of active research in the kind of academic fields of how to improve these things, but I think there's a big engineering challenge to bring in more disparate data sources, to better, to improve data quality, to improve these feedback loops, to try and get systems that are serving customer needs better. So improving recommendations, improving the quality of fraud detection systems. Everything from that to medical imaging and cancer detection. I think we've got a long way to go. >> Would it be fair to say that we've done a pretty good job with traditional application lifecycle in terms of DevOps, but we now need the DevOps for the data scientists and their collaborators? >> Nick: Yeah, I think that's >> And where is IBM along that?
>> Yeah, that's a good question, and I think you kind of hit the nail on the head, that the enterprise applied machine learning problem has moved from the kind of academic to the software engineering and actually, DevOps. Internally, someone mentioned the word train ops, so it's almost like, you know, the machine learning workflow and actually professionalizing and operationalizing that. So recently, IBM, for one, has announced Watson Data Platform and now, Watson Machine Learning. And that really tries to address that problem. So really, the aim is to simplify and productionize these end-to-end machine-learning workflows. So that is the product push that IBM has at the moment. >> George: Okay, that's helpful. >> Yeah, and right. I was at the Watson Data Platform announcement, what they called the Data Works. I think they changed the branding. >> Nick: Yeah. >> It looked like there were numerous components that IBM had in its portfolio that's now strung together. And to create that end-to-end system that you're describing. Is that a fair characterization, or is it underplaying? I'm sure it is. The work that went into it, but help us maybe understand that better. >> Yeah, I should caveat it by saying we're fairly focused, very focused at STC on the Open Source side of things, so my work is predominantly within the Apache Spark project and I'm less involved in the data bank. >> Dave: So you didn't contribute specifically to Watson Data Platform? >> Not to the product line, so, you know, >> Yeah, so it's really not an appropriate question for you? >> I wouldn't want to kind of, >> Yeah. >> To talk too deeply about it >> Yeah, yeah, so that, >> Simply because I haven't been involved. >> Yeah, that's, I don't want to push you on that because it's not your wheelhouse, but then, help me understand how you will commercialize the activities that you do, or is that not necessarily the intent?
So the intent with STC particularly is that we focus on Open Source, and a core part of that is that we, being within IBM, have the opportunity to interface with other product groups and customer groups. >> George: Right. >> So while we're not directly focused on, let's say, the commercial aspect, we want to effectively leverage the ability to talk to real-world customers and find the use cases, talk to other product groups that are building this Watson Data Platform and all the product lines and the features, Data Science Experience, it's all built on top of Apache Spark and the platform. >> Dave: So your role is really to innovate? >> Exactly, yeah. >> Leverage Open Source and innovate. >> Both innovate and kind of improve, so improve performance, improve efficiency. When you are operating at the scale of a company such as IBM and other large players, your customers and you as product teams and builders of products will come into contact with all the kind of little issues and bugs >> Right. >> And performance >> Make it better. Problems, yeah. And that is the feedback that we take on board and we try and make it better, not just for IBM and their customers. Because it's an Apache product and everyone benefits. So that's really the idea. Take all the feedback and learnings from enterprise customers and product groups and centralize that in the Open Source contributions that we make. >> Great. Would it be, so would it be fair to say you're focusing on making the core Spark, Spark ML and Spark MLlib capabilities, sort of the machine learning libraries and the pipeline, more robust? >> Yes. >> And if that's the case, we know there needs to be improvements in its ability to serve predictions in real time, like high speed. We know there's a need to take the pipeline and sort of share it with other tools, perhaps. Or collaborate with other tool chains. >> Nick: Yeah. >> What are some of the things that the Enterprise customers are looking for along those lines?
Yeah, that's a great question and very topical at the moment. So both from an Open Source community perspective and Enterprise customer perspective, this is one of the, if not the key, I think, kind of missing pieces within the Spark machine-learning kind of community at the moment, and it's one of the things that comes up most often. So it is a missing piece, and we as a community need to work together and decide: is this something that we build within Spark and provide that functionality? Is it something where we try and adopt open standards that will benefit everybody and that provide a kind of one standardized format, or way of serving models? Or is it something where there's a few Open Source projects out there that might serve for this purpose, and do we get behind those? So I don't have the answer because this is ongoing work, but it's definitely one of the most critical kind of blockers, or, let's say, areas that need work at the moment. >> One quick question, then, along those lines. IBM, the first thing IBM contributed to the Spark community was Spark ML, which is, as I understand it, it was an ability to, I think, create an ensemble sort of set of models to do a better job or create a more, >> So are you referring to System ML, I think it is? >> System ML. >> System ML, yeah, yeah. >> What are they, I forgot. >> Yeah, so, so. >> Yeah, where does that fit? >> System ML started out as an IBM research project, and perhaps the simplest way to describe it is as a kind of SQL optimizer: just as a SQL optimizer takes SQL queries and decides how to execute them in the most efficient way, System ML takes a kind of high-level mathematical language and compiles it down to an execution plan that runs in a distributed system. So in much the same way as your SQL operators allow this very flexible and high-level language, you don't have to worry about how things are done, you just tell the system what you want done.
System ML aims to do that for mathematical and machine learning problems, so it's now an Apache project. It's been donated to Open Source and it's an incubating project under very active development. And that is really, there's a couple of different aspects to it, but that's the high-level goal. The underlying execution engine is Spark. It can run on Hadoop and it can run locally, but really, the main focus is to execute on Spark and then expose these kind of higher level APIs that are familiar to users of languages like R and Python, for example, to be able to write their algorithms and not necessarily worry about how do I do large scale matrix operations on a cluster? System ML will compile that down and execute that for them. >> So really quickly, follow up, what that means is it's a higher level way for people who aren't, sort of, cluster aware to write machine-learning algorithms that are cluster aware? >> Nick: Precisely, yeah. >> That's very, very valuable. When it works. >> When it works, yeah. So it does, again, with the caveat that I'm mostly focused on Spark and not so much the System ML side of things, so I'm definitely not an expert. I don't claim to be an expert in it. But it does, you know, it works at the moment. It works for a large class of machine-learning problems. It's very powerful, but again, it's a young project and there's always work to be done, so exactly the areas that I know that they're focusing on are these areas of usability, hardening up the APIs and making them easier to use and easier to access for users coming from the R and Python communities who, again are, as you said, they're not necessarily experts on distributed systems and cluster awareness, but they know how to write a very complex machine-learning model in R, for example. And it's really trying to enable them with a set of API tools.
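The SQL-optimizer analogy can be made concrete. A declarative engine is free to reassociate (A·A)·x into A·(A·x), turning an O(n³) plan into an O(n²) one, and the user's script never changes. This toy sketch only illustrates that idea; it is not System ML's actual rewrite machinery:

```python
def matvec(A, x):
    # One matrix-vector product: O(n^2) work.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmat(A, B):
    # Matrix-matrix product: O(n^3) work.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]
x = [1, 1]
naive = matvec(matmat(A, A), x)      # (A*A)*x: builds an n-by-n intermediate
optimized = matvec(A, matvec(A, x))  # A*(A*x): vectors only, same answer
print(naive == optimized, optimized)  # -> True [17, 37]
```

The user just writes "A A x"; an optimizer that knows the operand shapes picks the cheaper association automatically, which is exactly the "tell the system what you want done" promise.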
So in terms of the underlying engine, there are, I don't know how many hundreds of thousands, millions of lines of code and years and years of research that's gone into that, so it's an extremely powerful set of tools. But yes, a lot of work still to be done there and ongoing to make it, in a way, to make it user ready and Enterprise ready in a sense of making it easier for people to use it and adopt it and to put it into their systems and production. >> So I wonder if we can close, Nick, just a few questions on STC, so the Spark Technology Center in Cape Town, is that a global expertise center? Is STC a virtual sort of IBM community, or? >> I'm the only member visiting Cape Town, >> David: Okay. >> So I'm kind of fairly lucky from that perspective, to be able to kind of live at home. The rest of the team is mostly in San Francisco, so there's an office there that's co-located with the Watson West office >> Yeah. >> And Watson teams >> Sure. >> That are based there in Howard Street, I think it is. >> Dave: How often do you get there? >> I'll be there next week. >> Okay. >> So I typically, sort of two or three times a year, I try and get across there >> Right. And interface with the team, >> So, >> But we are a fairly, I mean, IBM is obviously a global company, and I've been surprised actually, pleasantly surprised there are team members pretty much everywhere. Our team has a few scattered around including me, but in general, when we interface with various teams, they pop up in all kinds of geographical locations, and I think it's great, you know, a huge diversity of people and locations, so. >> Anything, I mean, these early days here, early day one, but anything you saw in the morning keynotes or things you hope to learn here? Anything that's excited you so far?
I caught a couple of the morning keynotes, but had to dash out to kind of prepare for, I'm doing a talk later, actually on feature hashing for scalable machine learning, so that's at 12:20, please come and see it. >> Dave: A breakout session, it's at what, 12:20? >> 20 past 12:00, yeah. >> Okay. >> So in room 302, I think, >> Okay. >> I'll be talking about that, so I needed to prepare, but I think some of the key exciting things that I have seen that I would like to go and take a look at are kind of related to deep learning on Spark. I think that's been a hot topic recently, and one of the areas where, again, Spark, perhaps, hasn't been the strongest contender, let's say, but there's some really interesting work coming out of Intel, it looks like. >> They're talking here on The Cube in a couple hours. >> Yeah. >> Yeah. >> I'd really like to see their work. >> Yeah. >> And that sounds very exciting, so yeah. I think every time I come to a Spark summit, there are always new projects from the community, various companies, some of them big, some of them startups that are pushing the envelope, whether it's research projects in machine learning, whether it's adding deep learning libraries, whether it's improving performance for kind of commodity clusters or for single, very powerful single nodes, there's always people pushing the envelope, and that's what's great about being involved in an Open Source community project and being part of those communities, so yeah. That's one of the talks that I would like to go and see. And I think I, unfortunately, had to miss some of the Netflix talks on their recommendation pipeline. That's always interesting to see. >> Dave: Right. >> But I'll have to check them on the video (laughs). >> Well, there's always another project in Open Source land. Nick, thanks very much for coming on The Cube and good luck. Cool, thanks very much. Thanks for having me. >> Have a good trip, stay warm, hang in there. (Nick laughs) Alright, keep it right there.
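The feature hashing Nick's talk covers, often called the hashing trick, maps an unbounded set of sparse features into a fixed-length vector without keeping a feature dictionary. A rough pure-Python sketch of the idea (not the Spark implementation):

```python
import zlib

def hash_features(tokens, num_buckets=8):
    # Map each token into a fixed-length count vector via a stable hash,
    # so memory stays bounded no matter how many distinct features exist.
    # (zlib.crc32 is used because Python's built-in hash() is randomized
    # per process, which would make vectors irreproducible across runs.)
    vec = [0] * num_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode()) % num_buckets] += 1
    return vec

v = hash_features(["user=42", "country=ZA", "country=ZA"])
print(len(v), sum(v))  # -> 8 3
```

The trade-off is collisions: distinct features can land in the same bucket, but with enough buckets the impact on model quality is usually small, which is what makes it practical at web scale.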
My buddy George and I will be back with our next guest. We're live. This is The Cube from Spark Summit East, #sparksummit. We'll be right back. (upbeat music) (gentle music)
Next-Generation Analytics Social Influencer Roundtable - #BigDataNYC 2016 #theCUBE
>> Narrator: Live from New York, it's The Cube, covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, NVIDIA, and our ecosystem sponsors, now here's your host, Dave Vellante. >> Welcome back to New York City, everybody, this is The Cube, the worldwide leader in live tech coverage, and this is a Cube first, we've got a nine person, actually eight person panel of experts, data scientists, all alike. I'm here with my co-host, James Kobielus, who has helped organize this panel of experts. James, welcome. >> Thank you very much, Dave, it's great to be here, and we have some really excellent brain power up there, so I'm going to let them talk. >> Okay, well thank you again-- >> And I'll interject my thoughts now and then, but I want to hear them. >> Okay, great, we know you well, Jim, we know you'll do that, so thank you for that, and appreciate you organizing this. Okay, so what I'm going to do with our panelists is ask you to introduce yourself. I'll introduce you, but tell us a little bit about yourself, and talk a little bit about what data science means to you. A number of you started in the field a long time ago, perhaps data warehouse experts before the term data science was coined. Some of you started probably after Hal Varian said it was the sexiest job in the world. (laughs) So think about how data science has changed and/or what it means to you. We're going to start with Gregory Piatetsky, who's from Boston. A Ph.D., KDnuggets, Greg, tell us about yourself and what data science means to you. >> Okay, well thank you Dave and thank you Jim for the invitation. Data science in a sense is the second oldest profession.
I think people have this built-in need to find patterns, and whatever we find we want to organize the data, but we do it well on a small scale and we don't do it well on a large scale, so really, data science takes our need and helps us organize what we find, the patterns that we find that are really valid and useful and not just random. I think this is a big challenge of data science. I've actually started in this field before the term Data Science existed. I started as a researcher and organized the first few workshops on data mining and knowledge discovery, and the term data mining became less fashionable, became predictive analytics, now it's data science and it will be something else in a few years. >> Okay, thank you, Yves Mulkers, Yves, I of course know you from Twitter. A lot of people know you as well. Tell us about your experiences and what data science means to you. >> Well, data science to me is if you take the two words, the data and the science, the science holds a lot of expertise and skills there, it's statistics, it's mathematics, it's understanding the business and putting that together with the digitization of what we have. It's not only the structured data or the unstructured data that you store in the database, try to get out and try to understand what is in there, but even video that is coming in, and then trying to find, like Greg already said, the patterns in there and bringing value to the business, looking from a technical perspective, but still linking that to the business insights, and you can do that on a technical level, but then you don't know yet what you need to find, or what you're looking for. >> Okay great, thank you. Craig Brown, Cube alum. How many people have been on the Cube actually before? >> I have. >> Okay, good. I always like to ask that question. So Craig, tell us a little bit about your background and, you know, data science, how has it changed, what's it all mean to you?
>> Sure, so I'm Craig Brown, I've been in IT for almost 28 years, and that was obviously before the term data science, but I've evolved from, I started out as a developer. And evolved through the data ranks, as I called it, working with data structures, working with data systems, data technologies, and now we're working with data pure and simple. Data science to me is an individual or team of individuals that dissect the data, understand the data, help folks look at the data differently than just the information that, you know, we usually use in reports, and get more insights on how to utilize it and better leverage it as an asset within an organization. >> Great, thank you Craig, okay, Jennifer Shin? Math is obviously part of being a data scientist. You're good at math I understand. Tell us about yourself. >> Yeah, so I'm a senior principal data scientist at the Nielsen Company. I'm also the founder of 8 Path Solutions, which is a data science, analytics, and technology company, and I'm also on the faculty in the Master of Information and Data Science program at UC Berkeley. So math is part of it, I'm teaching statistics for data science actually this semester, and I think for me, I consider myself a scientist primarily, and data science is a nice day job to have, right? Something where there's industry need for people with my skill set in the sciences, and data gives us a great way of being able to communicate sort of what we know in science in a way that can be used out there in the real world. I think the best benefit for me is that now that I'm a data scientist, people know what my job is, whereas before, maybe five, ten years ago, no one understood what I did. Now, people don't necessarily understand what I do now, but at least they understand kind of what I do, so it's still an improvement. >> Excellent. Thank you Jennifer.
Joe Caserta, you're somebody who started in the data warehouse business, and saw that snake swallow a basketball and grow into what we now know as big data, so tell us about yourself. >> So I've been doing data for 30 years now, and I wrote the Data Warehouse ETL Toolkit with Ralph Kimball, which is the best-selling book in the industry on preparing data for analytics, and with the big paradigm shift that's happened, you know for me the past seven years has been, instead of preparing data for people to analyze data to make decisions, now we're preparing data for machines to make the decisions, and I think that's the big shift from data analysis to data analytics and data science. >> Great, thank you. Miriam, Miriam Fridell, welcome. >> Thank you. I'm Miriam Fridell, I work for Elder Research, we are a data science consultancy, and I came to data science sort of through a very circuitous route. I started off as a physicist, went to work as a consultant and software engineer, then became a research analyst, and finally came to data science. And I think one of the most interesting things to me about data science is that it's not simply about building an interesting model and doing some interesting mathematics, or maybe wrangling the data, all of which I love to do, but it's really the entire analytics lifecycle, and the value that you can actually extract from data at the end, and that's one of the things that I enjoy most is seeing a client's eyes light up, or a wow, I didn't really know we could look at data that way, that's really interesting. I can actually do something with that, so I think that, to me, is one of the most interesting things about it. >> Great, thank you. Justin Sadeen, welcome. >> Absolutely, thank you, thank you. So my name is Justin Sadeen, I work for Morph EDU, an artificial intelligence company in Atlanta, Georgia, and we develop learning platforms for non-profit and private educational institutions.
So I'm a Marine Corps veteran turned data enthusiast, and so what I think about data science is the intersection of information, intelligence, and analysis, and I'm really excited about the transition from big data into smart data, and that's what I see data science as. >> Great, and last but not least, Dez Blanchfield, welcome mate. >> Good day. Yeah, I'm the one with the funny accent. So data science for me is probably the funniest job I've ever had to describe to my mom. I've had quite a few different jobs, and she's never understood any of them, and this one she understands the least. I think a fun way to describe what we're trying to do in the world of data science and analytics now is it's the equivalent of high altitude mountain climbing. It's like the extreme sport version of the computer science world, because we have to be this magical unicorn of a human that can understand plain English problems from C-suite down and then translate it into code, either solo or as teams of developers. And so there's this black art that we're expected to be able to transmogrify from something that we just in plain English say I would like to know X, and we have to go and figure it out, so there's this neat extreme sport view I have of rushing down the side of a mountain on a mountain bike and just dodging rocks and trees and things occasionally, because invariably, we do have things that go wrong, and they don't quite give us the answers we want. But I think we're at an interesting point in time now with the explosion in the types of technology that are at our fingertips, and the scale at which we can do things now, once upon a time we would sit at a terminal and write code and just look at data and watch it in columns, and then we ended up with spreadsheet technologies at our fingertips.
Nowadays it's quite normal to instantiate a small high performance distributed cluster of computers, effectively a super computer in a public cloud, and throw some data at it and see what comes back. And we can do that on a credit card. So I think we're at a really interesting tipping point now where this coinage of data science needs to be slightly better defined, so that we can help organizations who have weird and strange questions that they want to ask, tell them solutions to those questions, and deliver on them in, I guess, a commodity deliverable. I want to know xyz and I want to know it in this time frame and I want to spend this much amount of money to do it, and I don't really care how you're going to do it. And there's so many tools we can choose from and there's so many platforms we can choose from, it's this little black art of computing, if you'd like, we're effectively making it up as we go in many ways, so I think it's one of the most exciting challenges that I've had, and I think I'm pretty sure I speak for most of us in that we're lucky that we get paid to do this amazing job. That we get to make up on a daily basis in some cases. >> Excellent, well okay. So we'll just get right into it. I'm going to go off script-- >> Do they have unicorns down under? I think they have some strange species, right? >> Well we put the pointy bit on the back. You guys have it on the front. >> So I was at an IBM event on Friday. It was a chief data officer summit, and I attended what was called the Data Divas' breakfast. It was a women in tech thing, and one of the CDOs, she said that 25% of chief data officers are women, which is much higher than you would normally see in the profile of IT. We happen to have 25% of our panelists are women. Is that common? Miriam and Jennifer, is that common for the data science field? Or is this a higher percentage than you would normally see-- >> James: Or a lower percentage?
>> I think certainly for us, we have hired a number of additional women in the last year, and they are phenomenal data scientists. I don't know that I would say, I mean I think it's certainly typical that this is still a male-dominated field, but I think like many male-dominated fields, physics, mathematics, computer science, I think that that is slowly changing and evolving, and I think certainly, that's something that we've noticed in our firm over the years at our consultancy, as we're hiring new people. So I don't know if I would say 25% is the right number, but hopefully we can get it closer to 50. Jennifer, I don't know if you have... >> Yeah, so I know at Nielsen we have actually more than 25% of our team is women, at least the team I work with, so there seems to be a lot of women who are going into the field. Which isn't too surprising, because with a lot of the issues that come up in STEM, one of the reasons why a lot of women drop out is because they want real world jobs and they feel like they want to be in the workforce, and so I think this is a great opportunity with data science being so popular for these women to actually have a job where they can still maintain that engineering and science view background that they learned in school. >> Great, well Hillary Mason, I think, was the first data scientist that I ever interviewed, and I asked her what are the sort of skills required and the first question that we wanted to ask, I just threw other women in tech in there, 'cause we love women in tech, is about this notion of the unicorn data scientist, right? It's been put forth that there's the skill sets required to be a date scientist are so numerous that it's virtually impossible to have a data scientist with all those skills. >> And I love Dez's extreme sports analogy, because that plays into the whole notion of data science, we like to talk about the theme now of data science as a team sport. 
Must it be an extreme sport is what I'm wondering, you know. The unicorns of the world seem to be... Is that realistic now in this new era? >> I mean when automobiles first came out, they were concerned that there wouldn't be enough chauffeurs to drive all the people around. Is there an analogy with data, to be a data-driven company? Do I need a data scientist, and does that data scientist, you know, need to have this unbelievable mixture of skills? Or are we doomed to always have a skill shortage? Open it up. >> I'd like to have a crack at that, so it's interesting, when automobiles were a thing, when they first brought cars out, and before they, sort of, were modernized by the likes of Ford's Model T, when we got away from the horse and carriage, they actually had human beings walking down the street with a flag warning the public that the horseless carriage was coming, and I think data scientists are very much like that. That we're kind of expected to go ahead of the organization and try and take the challenges we're faced with today and see what's going to come around the corner. And so we're like the little flag-bearers, if you'd like, in many ways of this is where we're at today, tell me where I'm going to be tomorrow, and try and predict the day after as well. It is very much becoming a team sport though. But I think the concept of data science being a unicorn has come about because the coinage hasn't been very well defined, you know, if you were to ask 10 people what a data scientist were, you'd get 11 answers, and I think this is a really challenging issue for hiring managers and C-suites when they say I want data science, I want big data, I want an analyst. They don't actually really know what they're asking for. Generally, if you ask for a database administrator, it's a well-described job spec, and you can just advertise it and some 20 people will turn up and you interview to decide whether you like the look and feel and smell of 'em.
When you ask for a data scientist, there's 20 different definitions of what that one data science role could be. So we don't initially know what the job is, we don't know what the deliverable is, and we're still trying to figure that out, so yeah. >> Craig, what about you? >> So from my experience, when we talk about data science, we're really talking about a collection of experiences with multiple people. I've yet to find, at least from my experience, a data science effort with a lone wolf. So you're talking about a combination of skills, and so you don't have, no one individual needs to have all that makes a data scientist a data scientist, but you definitely have to have the right combination of skills amongst a team in order to accomplish the goals of the data science team. So from my experiences and from the clients that I've worked with, we refer to the data science effort as a data science team. And I believe that's very appropriate to the team sport analogy. >> For us, we look at a data scientist as a full stack web developer, a jack of all trades, I mean they need to have a multitude of backgrounds, from a programmer to an analyst. You can't find one subject matter expert, it's very difficult. And if you're able to find a subject matter expert, you know, through the lifecycle of product development, you're going to require that individual to interact with a number of other members from your team who are analysts, and then you just end up training this person to be, again, a jack of all trades, so it comes full circle. >> I own a business that does nothing but data solutions, and we've been in business 15 years, and it's been, the transition over time has been going from being a conventional wisdom run company with a bunch of experts at the top to becoming more of a data-driven company using data warehousing and BI, but now the trend is absolutely analytics driven.
So if you're not becoming an analytics-driven company, you are going to be behind the curve very very soon, and it's interesting that IBM is now coining the phrase of a cognitive business. I think that is absolutely the future. If you're not a cognitive business from a technology perspective, and an analytics-driven perspective, you're going to be left behind, that's for sure. So in order to stay competitive, you know, you need to really think about data science, think about how you're using your data, and I also see that what's considered the data expert has evolved over time too, where it used to be just someone really good at writing SQL, or someone really good at writing queries in any language, but now it's becoming more of an interdisciplinary action where you need soft skills and you also need the hard skills, and that's why I think there's more females in the industry now than ever. Because you really need to have a really broad width of experiences that really wasn't required in the past. >> Greg Piatetsky, you have a comment? >> So there are not too many unicorns in nature or as data scientists, so I think organizations that want to hire data scientists have to look for teams, and there are a few unicorns like Hillary Mason or maybe Usama Fayyad, but they generally tend to start companies and it's very hard to retain them as data scientists. What I see is another evolution, automation, and you know, steps like IBM Watson, the first platform, is eventually a great advance for data scientists in the short term, but probably what's likely to happen in the longer term is kind of more and more of those skills becoming subsumed by a machine learning layer within the software. How long will it take, I don't know, but I have a feeling that the paradise for data scientists may not be very long-lived.
When a data scientist, let's say a unicorn data scientist starts a company, as you've phrased it, and the company's product is built on data science, do they give up becoming a data scientist in the process? It would seem that they become a data scientist of a higher order if they've built a product based on that knowledge. What are your thoughts on that? >> Well, I know a few people like that, so I think maybe they remain data scientists at heart, but they don't really have the time to do the analysis and they really have to focus more on strategic things. For example, today actually is the birthday of Google, 18 years ago, so Larry Page and Sergey Brin wrote a very influential paper back in the '90s about PageRank. Have they remained data scientists? Perhaps a very very small part, but that's not really what they do, so I think those unicorn data scientists could quickly evolve to have to look for teams to capture those skills. >> Clearly they come to a point in their career where they build a company based on teams of data scientists and data engineers and so forth, which relates to the topic of team data science. What is the right division of roles and responsibilities for team data science? >> Before we go, Jennifer, did you have a comment on that? >> Yeah, so I guess I would say for me, when data science came out and there was, you know, the Venn Diagram that came out about all the skills you were supposed to have? I took a very different approach than all of the people who I knew who were going into data science. Most people started interviewing immediately, they were like this is great, I'm going to get a job.
I went and learned how to develop applications, and learned computer science, 'cause I had never taken a computer science course in college, and made sure I trued up that one part where I didn't know these things or had the skills from school, so I went headfirst and just learned it, and then now I have actually a lot of technology patents as a result of that. So to answer Jim's question, actually. I started my company about five years ago. And originally started out as a consulting firm slash data science company, then it evolved, and one of the reasons I went back in the industry and now I'm at Nielsen is because you really can't do the same sort of data science work when you're actually doing product development. It's a very very different sort of world. You know, when you're developing a product you're developing a core feature or functionality that you're going to offer clients and customers, so I think definitely you really don't get to have that wide range of sort of looking at 8 million models and testing things out. That flexibility really isn't there as your product starts getting developed. >> Before we go into the team sport, the hard skills that you have, are you all good at math? Are you all computer science types? How about math? Are you all math? >> What were your GPAs? (laughs) >> David: Anybody not math oriented? Anybody not love math? You don't love math? >> I love math, I think it's required. >> David: So math yes, check. >> You dream in equations, right? You dream. >> Computer science? Do I have to have computer science skills? At least the basic knowledge? >> I don't know that you need to have formal classes in any of these things, but I think certainly as Jennifer was saying, if you have no skills in programming whatsoever and you have no interest in learning how to write SQL queries or R or Python, you're probably going to struggle a little bit. >> James: It would be a challenge. >> So I think yes, I have a Ph.D.
in physics, I did a lot of math, it's my love language, but I think you don't necessarily need to have formal training in all of these things, but I think you need to have a curiosity and a love of learning, and so if you don't have that, you still want to learn and however you gain that knowledge I think, but yeah, if you have no technical interests whatsoever, and don't want to write a line of code, maybe data science is not the field for you. Even if you don't do it every day. >> And statistics as well? You would put that in that same general category? How about data hacking? You got to love data hacking, is that fair? Yves, you have a comment? >> Yeah, I think so, while we've been discussing that for me, the most important part is that you have a logical mind and you have the capability to absorb new things and the curiosity you need to dive into that. While I don't have an education in IT or whatever, I have a background in chemistry and those things that I learned there, I apply to information technology as well, and from a part that you say, okay, I'm a tech-savvy guy, I'm interested in the tech part of it, you need to speak that business language and if you can do that crossover and understand what other skill sets or parts of the roles are telling you, I think the communication in that aspect is very important.
And there was a bunch of men put in a room and told, you're mathematicians and you come from universities, and you can crack codes, but they couldn't. And so what they ended up doing was running these ads, and putting challenges, they actually put, I think it was crossword puzzles in the newspaper, and this deluge of women came out of all kinds of different roles without math degrees, without science degrees, but could solve problems, and they were thrown at the challenge of cracking codes, and invariably, they did the heavy lifting. On a daily basis for converting messages from one format to another, so that this very small team at the end could actually get in play with the sexy piece of it. And I think we're going through a similar shift now with what we refer to as data science in the technology and business world. Where the people who are doing the heavy lifting aren't necessarily what we'd think of as the traditional data scientists, and so, there have been some unicorns and we've championed them, and they're great. But I think the shift's going to be to accountants, actuaries, and statisticians who understand the business, and come from an MBA-style background that can learn the relevant pieces of math and models that we need to apply to get the data science outcome. I think we've already been here, we've solved this problem, we've just got to learn not to try and reinvent the wheel, 'cause the media hypes this whole thing of data science as exciting and new, but we've been here a couple times before, and there's a lot to be learned from that, in my view. >> I think we had Joe next. >> Yeah, so I was going to say that, data science is a funny thing. To use the word science is kind of a misnomer, because there is definitely a level of art to it, and I like to use the analogy, when Michelangelo would look at a block of marble, everyone else looked at the block of marble to see a block of marble.
He looks at a block of marble and he sees a finished sculpture, and then he figures out what tools do I need to actually make my vision? And I think data science is a lot like that. We hear a problem, we see the solution, and then we just need the right tools to do it, and I think part of consulting and data science in particular, it's not so much what we know out of the gate, but it's how quickly we learn. And I think everyone here, what makes them brilliant, is how quickly they could learn any tool that they need to see their vision get accomplished. >> David: Justin? >> Yeah, I think you make a really great point, for me, I'm a Marine Corps veteran, and the reason I mentioned that is 'cause I work with two veterans who are problem solvers. And I think that's what data scientists really are, in the long run are problem solvers, and you mentioned a great point that, yeah, I think just problem solving is the key. You don't have to be a subject matter expert, just be able to take the tools and intelligently use them. >> Now when you look at the whole notion of team data science, what is the right mix of roles, like role definitions within a high-quality or a high-performing data science team? Now IBM, with, of course, our announcement of Project DataWorks and so forth, we're splitting the role division, in terms of data scientists versus data engineers versus application developers versus business analysts, is that the right breakdown of roles? Or what would the panelists recommend in terms of understanding what kind of roles make sense within, like I said, a high performing team that's looking for trying to develop applications that depend on data, machine learning, and so forth? Anybody want to? >> I'll tackle that.
So the teams that I have created over the years, made up these data science teams that I brought into customer sites, have a combination of developer capabilities, and some of them are IT developers, but some of them were developers of things other than applications. They designed buildings, they did other things with their technical expertise besides building technology. The other piece besides the developer is the analytics, and analytics can be taught as long as they understand how algorithms work and the code behind the analytics, in other words, how are we analyzing things, and from a data science perspective, we are leveraging technology to do the analyzing through the tool sets, so ultimately as long as they understand how tool sets work, then we can train them on the tools. Having that analytic background is an important piece. >> Craig, is it easier to, I'll go to you in a moment Joe, is it easier to cross train a data scientist to be an app developer, than to cross train an app developer to be a data scientist, or does it not matter? >> Yes. (laughs) And not the other way around. It depends on the-- >> It's easier to cross train a data scientist to be an app developer than-- >> Yes. >> The other way around. Why is that? >> Developing code can be as difficult as the tool set one uses to develop code. Today's tool sets are very user friendly, whereas it is very difficult to teach a person to think along the lines of developing code when they don't have any idea of the aspects of code, of building something. >> I think it was Joe, or you next, or Jennifer, who was it? >> I would say that one of the reasons for that is data scientists will probably know if the answer's right after you process data, whereas a data engineer might be able to manipulate the data but may not know if the answer's correct. So I think that is one of the reasons why having a data scientist learn the application development skills might be an easier time than the other way around.
>> I think Miriam had a comment? Sorry. >> I think that what we're advising our clients to do is to not think, before data science and before analytics became so required by companies to stay competitive, it was more of a waterfall, you have a data engineer build a solution, you know, then you throw it over the fence and the business analyst would have at it, where now, it must be agile, and you must have a scrum team where you have the data scientist and the data engineer and the project manager and the product owner and someone from the chief data office all at the table at the same time and all accomplishing the same goal. Because all of these skills are required, collectively, in order to solve this problem, and it can't be done daisy-chained anymore, it has to be a collaboration. And that's why I think Spark is so awesome, because you know, Spark is a single interface that a data engineer can use, a data analyst can use, and a data scientist can use. And now with what we've learned today, having a data catalog on top so that the chief data office can actually manage it, I think is really going to take Spark to the next level. >> James: Miriam? >> I wanted to comment on your question to Craig about is it harder to teach a data scientist to build an application or vice versa, and one of the things that we have worked on a lot in our data science team is incorporating a lot of best practices from software development, agile, scrum, that sort of thing, and I think particularly with a focus on deploying models, that we don't just want to build an interesting data science model, we want to deploy it, and get some value.
You need to really incorporate these processes from someone who might know how to build applications, and that, I think for some data scientists can be a challenge, because one of the fun things about data science is you get to get into the data, and you get your hands dirty, and you build a model, and you get to try all these cool things, but then when the time comes for you to actually deploy something, you need deployment-grade code in order to make sure it can go into production at your client's site and be useful, for instance, so I think that there's an interesting challenge on both ends, but one of the things I've definitely noticed with some of our data scientists is it's very hard to get them to think in that mindset, which is why you have a team of people, because everyone has different skills and you can mitigate that. >> Dev-ops for data science? >> Yeah, exactly. We call it insight ops, but yeah, I hear what you're saying. Data science is becoming increasingly an operational function as opposed to strictly exploratory or developmental. Did someone else have a, Dez? >> One of the things I was going to mention, one of the things I like to do when someone gives me a new problem is take all the laptops and phones away. And we just end up in a room with a whiteboard. And developers find that challenging sometimes, so I had this one line where I said to them don't write the first line of code until you actually understand the problem you're trying to solve, right? And I think where the data science focus has changed the game for organizations who are trying to get some systematic repeatable process that they can throw data at and just keep getting answers and things, no matter what the industry might be, is that developers will come with a particular mindset on how they're going to codify something without necessarily getting the full spectrum and understanding the problem in the first place.
What I'm finding is the people that come at data science tend to have more of a hacker ethic. They want to hack the problem, they want to understand the challenge, and they want to be able to get it down to plain English simple phrases, and then apply some algorithms and then build models, and then codify it, and so most of the time we sit in a room with whiteboard markers just trying to build a model in a graphical sense and make sure it's going to work and that it's going to flow, and once we can do that, we can codify it. I think when you come at it from the other angle, from the developer ethic, and you're like I'm just going to codify this from day one, I'm going to write code. I'm going to hack this thing out and it's just going to run and compile. Often, you don't truly understand what you're trying to get to at the end point, and you can just spend days writing code, and I think someone made the comment that sometimes you don't actually know whether the output is actually accurate in the first place. So I think there's a lot of value being provided from the data science practice of understanding the problem in plain English at a team level, so what am I trying to do from the business consulting point of view? What are the requirements? How do I build this model? How do I test the model? How do I run a sample set through it? Train the thing and then make sure what I'm going to codify actually makes sense in the first place, because otherwise, what are you trying to solve in the first place? >> Wasn't it Einstein who said if I had an hour to solve a problem, I'd spend 55 minutes understanding the problem and five minutes on the solution, right? It's exactly what you're talking about. >> Well I think, I will say, getting back to the question, the thing with building these teams, I think a lot of times people don't talk about is that engineers are actually very very important for data science projects and data science problems.
For instance, if you're just trying to prototype something or come up with a model, then data science teams are great; however, if you need to actually put that into production, the code the data scientist has written may not be optimal, so as we scale out, it may actually be very inefficient. At that point, you want an engineer to step in and optimize that code, so I think it depends on what you're building, and that dictates what kind of division you want among your teammates, but I do think that a lot of times the engineering component is really undervalued out there. >> Jennifer, it seems that the data engineering function, data discovery and preparation and so forth, is becoming automated to a greater degree, but if I'm listening to you, I don't hear that data engineering as a discipline is becoming extinct in terms of a role that people can be hired into. You're saying that there's a strong ongoing need for data engineers to optimize the entire pipeline to deliver the fruits of data science in production applications, is that correct? So they play that very much operational role as the backbone for... >> So I think a lot of times businesses will go to data scientists to build a better model, a predictive model, but that model may not be something that you really want to implement out there when there's a million users coming to your website, 'cause it may not be efficient, it may take a very long time, so I think in that sense it is important to have good engineers, and without them your whole product may fail. You may build the best model, it may have the best output, but if you can't actually implement it, then really what good is it? >> What about calibrating these models? How do you go about doing that and testing it in the real world? Has that changed over time? Or is it...
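Jennifer's point, that prototype code which is fine for a model-building exercise can be too inefficient at production scale, is easy to show with a toy example. This is a generic sketch (a rolling mean, not any system discussed on the panel): the "prototype" version recomputes each window from scratch, while the "engineered" version produces the identical output in a single pass.

```python
# Prototype (data-science) version: recompute the window sum for every
# point -- O(n * k), fine on a sample, painful at production scale.
def rolling_mean_prototype(xs, k):
    return [sum(xs[i - k + 1:i + 1]) / k for i in range(k - 1, len(xs))]

# Engineered version: maintain one running sum -- same output, O(n).
def rolling_mean_optimized(xs, k):
    window_sum = sum(xs[:k])
    out = [window_sum / k]
    for i in range(k, len(xs)):
        window_sum += xs[i] - xs[i - k]
        out.append(window_sum / k)
    return out

data = [float(i % 7) for i in range(10_000)]
# The optimization must not change the answer -- only how fast it arrives.
assert rolling_mean_prototype(data[:100], 5) == rolling_mean_optimized(data[:100], 5)
```

The behavior is unchanged; only the cost profile is, which is exactly the kind of rework the panel says engineers contribute before a model faces a million users.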
>> So one of the things that can happen, and we found this with one of our clients, is when you build a model, you do it with the data that you have, and you try to use a very robust cross-validation process to make sure it's robust and sturdy, but one thing that can sometimes happen is after you put your model into production, there can be external factors, societal or whatever, things that have nothing to do with the data that you have, or the quality of the data, or the quality of the model, which can actually erode the model's performance over time. As an example, think about cell phone contracts, right? Those have changed a lot over the years, so the type of data plan you had five years ago might not be the same as today, because a totally different type of plan is offered. So if you're building a model on that to, say, predict who's going to leave and go to a different cell phone carrier, the validity of your model over time is going to completely degrade based on nothing that you put into the model or the data that was available, so I think you need to have this sort of model management and monitoring process to take these factors into account and know when it's time to do a refresh. >> Cross-validation matters even at a single point in time. For example, there was an article in the New York Times recently where they gave the same data set, survey data for the upcoming presidential election, to five different data scientists, and the five came to five different predictions. They were all high-quality data scientists, yet the results showed a wide variation about who was on top, whether it was Hillary or whether it was Trump, so that shows you that even at any point in time, cross-validation is essential to understand how robust the predictions might be. Does somebody else have a comment? Joe?
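Miriam's "model management and monitoring process" can be sketched very simply. The numbers here are all hypothetical (the baseline score, the tolerance, the window size); the point is only the mechanism: compare a rolling production accuracy against the pre-deployment baseline and flag when the gap says it's time for a refresh.

```python
from collections import deque

BASELINE_ACCURACY = 0.90   # hypothetical score measured during cross-validation
TOLERANCE = 0.10           # how much erosion we accept before retraining
WINDOW = 100               # number of recent production outcomes to monitor

recent = deque(maxlen=WINDOW)

def record_outcome(was_correct: bool) -> None:
    """Log whether a production prediction turned out to be correct."""
    recent.append(was_correct)

def needs_refresh() -> bool:
    """True once a full window exists and accuracy has eroded past tolerance."""
    if len(recent) < WINDOW:
        return False
    rolling_accuracy = sum(recent) / len(recent)
    return rolling_accuracy < BASELINE_ACCURACY - TOLERANCE

# Simulate drift: the model starts accurate, then the world changes
# (say, carriers introduce a totally different kind of data plan).
for i in range(200):
    record_outcome(i < 100 or i % 3 == 0)   # only ~33% correct after step 100

print("refresh needed:", needs_refresh())
```

Nothing about the model itself changed here, which is Miriam's point: the world moved, and only monitoring catches that.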
>> I just want to say that this drives home the point about having a scrum team for each project, with the engineer and the data scientist, data engineer and data scientist, working side by side, because it is important to assume that whatever we're building will eventually go into production. In the data warehousing world, you'd get the data out of the systems, out of your applications, you'd do analysis on your data, and the nirvana was maybe that the data would go back to the system, but typically it didn't. Nowadays, the applications are dependent on the insight coming from the data science team; the behavior of the application, the personalization, the individual experience for a customer are all highly dependent on it. So, you asked whether data science is part of the dev-ops team: absolutely, now it has to be. >> Whose job is it to figure out the way in which the data is presented to the business? Where's the sort of presentation, the visualization plan? Is that the data scientist's role? Does that depend on whether or not you have that gene? Do you need a UI person on your team? Where does that fit? >> Wow, good question. >> Well, usually that's the output. I mean, once you get to the point where you're visualizing the data, you've created an algorithm or some sort of code that produces what's to be visualized, so at the end of the day the customers can see what all the fuss is about from a data science perspective. But it's usually post the data science component. >> So do you run into situations where you can see it and it's blatantly obvious, but it doesn't necessarily translate to the business? >> Well, there's an interesting challenge with data, and we throw the word data around a lot, and I've got this fun line I like throwing out there: if you torture data long enough, it will talk. So the challenge then is to figure out when to stop torturing it, right?
And it's the same with models. In many other parts of organizations, if someone's doing a financial report on the performance of the organization and they're doing it in a spreadsheet, they'll get two or three peers to review it and validate that they've come up with a working model and the answer actually makes sense. And I think we're rushing so quickly at doing analysis on data that comes to us in various formats and at high velocity that it's very important for us to actually stop and do peer reviews, of the models and the data and the output as well, because otherwise we start making decisions very quickly about things that may or may not be true. It's very easy to get the data to paint any picture you want, and you gave the example of the five different attempts at that thing. I have this shoot-out approach as well, where I'll take a team, get two different people to do exactly the same thing in completely different rooms, and come back and challenge each other, and it's quite amazing to see the looks on their faces when they're like, oh, I didn't see that, and then go back and do it again, and just keep iterating until they both get the same outcome. In fact, there's a really interesting anecdote about when the UNIX operating system was being written: a couple of the authors went away and wrote the same program without realizing the other was doing it, and when they came back, they actually had, line for line, the same piece of C code, 'cause they'd actually gotten to a truth, a perfect version of that program. And I think we often need to look at it that way: when we're building models and playing with data, if we can't come at it from different angles and get the same answer, then maybe the answer isn't quite true yet, so there's a lot of risk in that.
And it's the same with presentation. You know, you can paint any picture you want with a dashboard, but who's actually validating whether the dashboard's painting the correct picture? >> James: Go ahead, please. >> There is a science, actually, behind data visualization. You know, if you're doing trending, it's a line graph; if you're doing comparative analysis, it's a bar graph; if you're doing percentages, it's a pie chart. There is a certain science to it, it's not as much of a mystery as the novice thinks, but what makes it challenging is that, just like any presentation, you have to consider your audience. And whenever we're delivering a solution, either insight or just data in a grid, we really have to consider who the consumer of this data is, and actually cater the visual to that person or to that particular audience. And that is part of the art, and that is what makes a great data scientist. >> The consumer may in fact be the source of the data itself, like in a mobile app, so you're tuning their visualization, and then their behavior is changing as a result, and then the data on their changed behavior comes back, so it can be a circular process. >> So Jim, at a recent conference, you were tweeting about the citizen data scientist, and you got emasculated by-- >> I spoke there too. >> Okay. >> TWI on that same topic, I got-- >> Kirk Borne I hear came after you. >> Kirk meant-- >> Called foul, flag on the play. >> Kirk meant well. I love Claudia Emahoff too, but yeah, it's a controversial topic. >> So I wonder what our panel thinks of that notion, citizen data scientist. >> Can I respond about citizen data scientists? >> David: Yeah, please. >> I think this term was introduced by a Gartner analyst in 2015, and I think it's a very dangerous and misleading term.
I think definitely we want to democratize the data and give access to more people, not just data scientists but managers and BI analysts, but there is already a term for such people: we can call them business analysts, because it implies some training, some understanding of the data. If you use the term citizen data scientist, it implies that without any training you take some data and then you find something there, and, as Dez mentioned, we've seen many examples; it's very easy to find completely spurious, random correlations in data. So we don't want citizen dentists to treat our teeth or citizen pilots to fly planes, and if data's important, having citizen data scientists is equally dangerous. I think actually Gartner did not use the term citizen data scientist in their 2016 hype cycle, so hopefully they will put this term to rest. >> So Gregory, you apparently are defining citizen to mean incompetent as opposed to simply self-starting. >> Well, self-starting is very different, but that's not what I think was the intention. What we see in terms of data democratization is a big trend toward automation. There are many tools, for example from companies like DataRobot, and probably IBM has interesting machine learning capability toward automation, so I recently started a page on KDnuggets for automated data science solutions, and there are already 20 different firms that provide different levels of automation. So full automation can maybe deliver some expertise, but it's very dangerous to have a partly automated tool and at some point ask citizen data scientists to take the wheel. >> I want to chime in on that. >> David: Yeah, pile on. >> I totally agree with all of that.
I think the comment I just want to quickly put out there is that the space we're in is a very young and rapidly changing world, and what we haven't had yet is the time to stop, take a deep breath, and actually define ourselves. If you look at computer science in general, a lot of the traditional roles have had 10 or 20 years of history, and so through the hiring process and the development of those spaces, we've actually had time to breathe and define what those jobs are. We know what a systems programmer is, and we know what a database administrator is, but we haven't yet had a chance as a community to stop and breathe and say, well, what do we think these roles are? And so, to fill that void, the media creates coinages, and I think this is the risk we've got now: the concept of a data scientist was itself a term coined to fill a void, because no one quite knew what to call somebody tinkering around data science who didn't come from that background, and I think that's something we need to sit up and pay attention to, because if we don't own that and drive it ourselves, then somebody else is going to fill the void and create these very frustrating concepts like citizen data scientist, which drive us all crazy. >> James: Miriam's next. >> So I wanted to comment. I agree with both of the previous comments, but in terms of a citizen data scientist, whether you're a citizen data scientist or an actual data scientist, whatever that means, I think one of the most important things you can have is a sense of skepticism, right? Because you can get spurious correlations and it's like, wow, my predictive model is so excellent, you know? And being aware of things like leaks from the future, right?
This actually isn't predictive at all; it's a result of the thing I'm trying to predict. And so one thing we try to do is, if something really looks too good, we go back in and make sure: did we not look at the data correctly? Is something missing? Did we have a problem with the ETL? So I think a healthy sense of skepticism is important to make sure that you're not taking a spurious correlation and trying to derive some significant meaning from it. >> I think there's a Dilbert cartoon I saw that described that very well. Joe, did you have a comment? >> I think that in order for citizen data scientists to really exist, we need more maturity in the tools they would use. My vision is that the BI tools of today are all going to be replaced with natural language processing and search: just be able to open up a search bar and say, give me sales by region, and to take that one step further into the future, it should be able to answer, what are my sales going to be next year? It should trigger a simple linear regression, or be able to say which features of the televisions are actually affecting sales and run a clustering algorithm. I hopefully think that will be the future, but I don't see anything like that today, and in order to have a true citizen data scientist, you would need to have that, and that is pretty sophisticated stuff. >> For me, the idea of a citizen data scientist is something I can relate to. For instance, when I was in graduate school, I started doing some research on FDA data. It was an open source data set of about 4.2 million data points. Technically, when I graduated, the paper was still not published, and so in some sense, you could think of me as a citizen data scientist, right?
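The spurious correlations the panelists keep warning about are easy to manufacture on purpose, which is itself a useful exercise in skepticism. This sketch uses entirely random numbers (no real data set): with many candidate features and few samples, at least one feature will usually look strongly "predictive" purely by chance.

```python
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 1,000 purely random "features", each measured on only 20 samples,
# against a target that is also pure noise:
target = [random.random() for _ in range(20)]
features = [[random.random() for _ in range(20)] for _ in range(1000)]

best = max(abs(corr(f, target)) for f in features)
print(f"strongest purely-spurious correlation: {best:.2f}")
```

Everything here is noise by construction, yet the winning feature looks meaningful, which is exactly why "if it looks too good, go back and check the ETL" is sound practice.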
I wasn't getting funding, I wasn't doing it for school, but I was still continuing my research, so I'd like to hope that with all the new data sources out there, there might be scientists, or people who were maybe kept out of a field, who wanted to be in STEM and for whatever life circumstance couldn't be, that they might be encouraged to actually go and look into the data and maybe build better models or validate information that's out there. >> So Justin, I'm sorry, you had one comment? >> It seems data science was termed before academia adopted formalized training for data science. But yeah, like Dez said, you can make data work for whatever problem you're trying to solve; whatever answer you want to see, you can make the data work around it, you can make it happen. And I kind of consider that like data creep in project management: you're so hyper-focused on a solution, on trying to find the answer, that you create an answer that works for that solution, but it may not be the correct answer, and I think the crossover discussion works well for that case. >> So, but the term comes up 'cause there's a frustration, I guess, right? That data science skills are not plentiful, and it's potentially a bottleneck in an organization. Supposedly 80% of your time is spent on cleaning data, is that right? Is that fair? So there's a problem. How much of that can be automated, and when? >> I'll have a shot at that. So I think there's a shift that's going to come about where we move from centralized data sets to data at the edge of the network, and this is something that's happening very quickly now, because we can't just haul everything back to a central spot. When the internet of things actually wakes up: things like the Boeing 787 Dreamliner, that thing's got 6,000 sensors in it and produces half a terabyte of data per flight, and there are 87,400 flights per day in domestic airspace in the U.S.
That's 43.5 petabytes of raw data, and that's about three years' worth of disk manufacturing in total, right? We're never going to copy that across to one place, we can't process it, so I think the challenge ahead of us is looking at how we're going to move the intelligence and the analytics to the edge of the network and pre-cook the data in different tiers: have a look at the raw material we get, boil it down to a slightly smaller data set, bring a metadata version of that back, and eventually get to the point where we've only got the very minimum data set and data points we need to make key decisions. Without that, we're already at the point where we have too much data and we can't munch it fast enough, and we can't spin up enough tin even if we switch the cloud on, and that's just this never-ending deluge of noise, right? You've got that signal versus noise problem, so we're now seeing a shift where people are looking at how to move the intelligence back to the edge of the network, which we actually solved some time ago in the security space. You know, spam filtering: if an email hits Google on the west coast of the U.S., they create a checksum for that spam email, it immediately goes into a database, and that email never gets through on the opposite coast, because they already know it's spam. They recognize that email coming in, that's evil, stop it. So we've already fixed this in security with intrusion detection, we've fixed it in spam, so we now need to take that learning, bring it into business analytics, if you like, see where we're finding patterns in behavior, and push that out to the edge of the network, so if I'm seeing demand over here for tickets on a new sale of a show, I need to be able to see where else I'm going to see that demand and start responding to it before the demand comes about.
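The spam-checksum pattern Dez describes, fingerprint once at one edge node and reject everywhere else without reprocessing, is simple to sketch. This is a generic illustration, not Google's actual mechanism; the shared set stands in for whatever replicated database the edge nodes would consult.

```python
import hashlib

# Shared "edge" blocklist: once any node flags a message,
# every other node learns its fingerprint.
known_spam = set()

def fingerprint(message: str) -> str:
    """Content checksum used as the shared spam signature."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()

def flag_as_spam(message: str) -> None:
    """First node to analyze the spam publishes its checksum."""
    known_spam.add(fingerprint(message))

def is_spam(message: str) -> bool:
    """Any other node rejects the same message without re-analyzing it."""
    return fingerprint(message) in known_spam

flag_as_spam("You have won a free cruise! Click here.")    # seen on the west coast
print(is_spam("You have won a free cruise! Click here."))  # True: blocked elsewhere
print(is_spam("Quarterly report attached."))               # False: unknown message
```

The analytics version of this, as Dez suggests, would push learned patterns of demand or behavior out to edge nodes the same way the checksum is pushed.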
I think that's a shift we're going to see quickly, because we'll never keep up with the data-munching challenge, and the volume's just going to explode. >> David: We just have a couple minutes. >> That does sound like a great topic for a future Cube panel: data science on the edge of the fog. >> I've got a hundred questions around that. So we're wrapping up here, just got a couple minutes. Final thoughts on this conversation, or any other pieces that you want to punctuate? >> I think one thing that's been really interesting for me, being on this panel, is hearing all of my co-panelists talking about common themes and things that we are also experiencing, which isn't a surprise, but it's interesting to hear how ubiquitous some of the challenges are, and also, at the announcement earlier today, some of the things they're talking about and thinking about, we're also talking about and thinking about. So I think it's great to hear that we're all in different countries and different places, but we're experiencing a lot of the same challenges, and that's been really interesting for me to hear. >> David: Great, anybody else, final thoughts? >> To echo Dez's thoughts, we're never going to catch up with the amount of data that's produced, so it's about transforming big data into smart data. >> I would just say that with the shift from normal data, small data, to big data, the answer is automate, automate, automate, and we've been talking about advanced algorithms and machine learning for the science, for changing the business, but there also needs to be machine learning and advanced algorithms for the back room, where we're getting smarter about how we ingest data and how we fix data as it comes in, because we can actually train the machines to understand data anomalies and what we want to do with them over time. And I think the further upstream we get with data correction, the less work there will be downstream.
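Joe's idea of training machines to catch data anomalies at ingest, fixing data upstream so less work lands downstream, can be sketched with the simplest possible "learned" screen: fit basic statistics on historical readings, then quarantine incoming values that fall outside them. The sensor values and the 3-sigma cutoff are illustrative choices, not anything from the panel; a real pipeline would use far richer models.

```python
# Illustrative upstream screen: learn what "normal" looks like from history,
# then quarantine statistical outliers at the point of ingestion.
def fit_stats(history):
    """Learn mean and standard deviation from past readings."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    return mean, var ** 0.5

def ingest(value, mean, std, z_cutoff=3.0):
    """Accept a reading, or quarantine it if it is a gross outlier."""
    if std > 0 and abs(value - mean) / std > z_cutoff:
        return ("quarantine", value)
    return ("accept", value)

# Hypothetical sensor history: readings clustered around 20-24.
history = [20.0 + (i % 5) for i in range(100)]
mean, std = fit_stats(history)

print(ingest(22.5, mean, std))   # ('accept', 22.5): plausible reading
print(ingest(500.0, mean, std))  # ('quarantine', 500.0): caught at the source
```

Quarantined values go to a correction step instead of silently polluting every downstream model, which is the "fix it in our laboratory" idea in miniature.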
And I also think that the concept of being able to fix data at the source is gone; that's behind us. Right now, the data that we're using to analyze and change the business we typically have no control over. Like Dez said, it's coming from sensors and machines and the internet of things, and if it's wrong at the source, it's always going to be wrong, so we have to figure out how to fix it in our laboratory. >> Eaves, final thoughts? >> I think it's a mind shift, being a data scientist. If you look back, why did you start developing or writing code? Because you liked to code, whatever, just for the sake of building a nice algorithm or a piece of software. Now, with the spirit of a data scientist, you're looking at a problem and saying, this is where I want to go, so you have more the top-down approach than the bottom-up approach. And you have the big picture, and that is what you really need as a data scientist: look across technologies, look across departments, look across everything, and then on top of that, try to apply as many skills as you have available. That's the kind of unicorn they're trying to look for, because it's pretty hard to find people with that wide a vision on everything that is happening within the company. You need to be aware of technology, you need to be aware of how a business is run and how it fits within a cultural environment, and you have to work with people, and all those things together, to my belief, make it very difficult to find those good data scientists. >> Jim? Your final thoughts? >> My final thought is that this is an awesome panel, and I'm so glad that you've come to New York, and I'm hoping that you all stay, of course, for the IBM DataFirst launch event that will take place this evening about a block over at Hudson Mercantile. So that's pretty much it. Thank you, I really learned a lot. >> I want to second Jim's thanks, really, great panel.
Awesome expertise, really appreciate you taking the time, and thanks to the folks at IBM for putting this together. >> And I'm a big fan of most of you, all of you, on this session here, so it's great just to meet you in person, thank you. >> Okay, and I want to thank Jeff Frick for being a human curtain there with the sun setting here in New York City. Well, thanks very much for watching. We're going to be across the street at the IBM announcement, on the ground, and we open up again tomorrow at 9:30 at Big Data NYC, Big Data Week, Strata plus Hadoop World. Thanks for watching, everybody, that's a wrap from here. This is the Cube, we're out. (techno music)