
Eric Seidman, Veritas | CUBEConversation, November 2018


 

(upbeat music) >> Hello everyone, I'm John Furrier here in the Palo Alto theCUBE studios. I'm the co-host of theCUBE. Also co-founder of SiliconANGLE Media. We're here for some big news from Veritas. We're with Eric Seidman, who's the director of Solutions Marketing for Veritas. Veritas is introducing today, and the press release is on the wire, Veritas Predictive Insights. Eric, thanks for coming in today and sharing the news. >> Thanks John, absolutely, thanks for having me. >> So you guys have a unique new thing for Veritas. Not new to the industry, but new in capabilities, called Predictive Insights. I know Dave Vellante is actually linked on your press release and covered it in Chicago under embargo. This is exciting news for Veritas because you guys have so much customer installed base, tons of data. Talk about what this new product is. What's the news? >> Well, thanks John, actually the news is pretty exciting. Our customers are very excited and receptive about it. What it's actually doing is helping our customers reduce both planned and unplanned downtime. And the way we're doing that is with an analytics engine that we've developed that's taking all the data from over 15,000 of our appliances around the world. We've been collecting that data for three years. We have hundreds of millions of data points from that. And we're utilizing our own AI and ML engines that we've created to be able to predict things in customers' environments that may cause them downtime or outages, and fix those before they happen. That's why our customers are really excited about it. >> So how much does this cost? >> Well, it doesn't really cost anything. It's a value add. You know, if our customers are utilizing our Veritas auto support services today, then as of yesterday, the service is turned on and we're already looking at their systems and creating this intelligence on them. >> So this is immediately valuable. >> And immediately valuable, yes.
>> So this is a new product from Veritas that takes existing operational data from your customers' environments. >> Correct. >> You guys are matching it in your corpus of metadata. >> Exactly. >> The telemetry data, what, hundreds of millions of signals, call center, real log data, real outages and real things. >> Right, right. >> And creating machine learning and AI on top of it to extract value for you guys or for the customer? >> Well, it's really for the customer. The benefit for the customer is that we have insights into, you know, our worldwide universe of customers. But we can look at individual systems and say, why is this one operating differently than the others? And then the machine learning will actually determine that the ones that are operating really well have this patch and this patch installed. You know, those types of things. And then we can apply that learning and that model to a particular customer's system. >> And they get a dashboard. >> And they get a dashboard that'll highlight what we call the system reliability score. So there's this, you know, in big enterprises there's a lot of fatigue associated with events that are occurring all the time. You think of an enterprise, we have customers with many, many NetBackup appliances alone. But you think of their entire infrastructure and all the alerts that they're getting. It creates a lot of fatigue. A lot of things go unfixed because they're minor events, like maybe a patch needs to be installed or a firmware update, while they're fixing the more hair-on-fire problems. But then ultimately those what looked like smaller events build up and build up, and then they create outages. So what we're able to do is to identify which systems have potential anomalies. Highlight those very visually. Then they can drill down and we'll have prescriptive maintenance that can be taken to improve that. >> So site reliability score, we'll get to that in a second, I think that's a big deal.
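Eric's description of asking "why is this one operating differently than the others?" is, in essence, fleet-wide anomaly detection. As a rough, hypothetical sketch (the metric, appliance names, and threshold here are invented, not Veritas's actual analytics), a z-score comparison against the fleet might look like:

```python
from statistics import mean, stdev

def fleet_outliers(telemetry, threshold=2.0):
    """Flag systems whose metric deviates sharply from the fleet norm.

    telemetry maps system id -> one telemetry metric (say, a job
    failure rate). Returns ids whose z-score against the fleet
    exceeds `threshold` in absolute value. Needs >= 2 systems.
    """
    values = list(telemetry.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:  # every system behaves identically
        return []
    return [sid for sid, v in telemetry.items()
            if abs((v - mu) / sigma) > threshold]

# Five appliances behave alike; the sixth has a 10x failure rate.
fleet = {"appliance-a": 0.020, "appliance-b": 0.030,
         "appliance-c": 0.025, "appliance-d": 0.022,
         "appliance-e": 0.028, "appliance-f": 0.310}
print(fleet_outliers(fleet))  # ['appliance-f']
```

In practice this would run per metric, with flagged systems feeding the prescriptive-maintenance step Seidman describes.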
I want to read the press release headline. >> Okay. >> Veritas Predictive Insights uses artificial intelligence, AI, and machine learning, ML, to predict and prevent unplanned service. Now the key word there is unplanned service. This is kind of the doomsday scenario for customers. They've got a large data center or large infrastructure devices. Unplanned basically means an outage; something happens, something bad happens. >> Yeah, something bad happens. >> And no one likes that, so what you guys are doing is giving them a value-added dashboard that taps into a product. So if, correct me if I'm wrong, but if a customer has Veritas, if they have the products, they get the service. If they become a customer, they now have the capability built in out of the gate. >> Absolutely, right. >> And so they see all this, so you're taking all the data from years of experience. >> Yeah, exactly. >> Giving them a dashboard to help them look at unplanned downtime type scenarios, and give them specific actions to take, particularly analytics and prescriptive analytics for them. >> Exactly, so what we're really trying to achieve for our customers is to use that intelligence and machine learning to identify things that may cause an outage in the future and prevent that outage from occurring, causing that downtime, by taking remedial action in advance of that happening. And that's the beauty of Predictive Insights. That's really what it's providing for our customers. >> So you guys have this always-on feature called auto support. >> Correct. >> That kicks in and it brings the system reliability score, SRS. I think this is important. I want you to explain this, I think this is a trend we're seeing certainly on the Cloud side of the market. Google has pioneered this concept called site reliability engineering over years of practice and they make their infrastructure work great.
So we know that that kind of concept of having reliability, you guys are now giving a score to each appliance. >> Correct. >> It's almost like a health detector or like a credit score. >> Definitely, credit score is a good analogy for that. >> So explain SRS, what's it mean for the customer and what's the impact to them? >> Yeah, so I don't know if you ever, like, maybe you use one of those credit scoring apps or something like that, where it's monitoring your credit from three different agencies or whatever. That's kind of what we're doing, only the data sets are coming from a much broader set of appliances, right. But we're showing you your system reliability score, credit score if you will. And then we're showing you very prescriptively the processes you can take to improve your credit score, if you will, or your system's health and reliability. So that might be installing a firmware patch, installing a software update, things of that nature. Replacing some drives that may fail in the future. And all of those steps will then increase that reliability score. >> And also you see in the hacking world, you know that one of the biggest parts of security breaches is not installing a patch. >> Yeah, exactly. >> The unplanned, unforeseen things are, you know, some sort of thing goes on, a hurricane, wildfire. You never know what's going to happen, so you've got to be prepared for those kinds of infrastructure changes or whatever. So I get that, so the operators can have a nice dashboard. I totally buy that. I want to get into the impact on the business side. How does this help the business owner that's your customer? Does this help them with planning, refresh rates, total cost of ownership? Can you just talk about the impact of how this data relates to their job? Because I'd be like, what's in it for me? >> Yeah, no, exactly. And there's really three key areas that we're addressing for our customers. I mean, the first one is around improving their operational efficiency, right.
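The credit-score analogy maps onto a simple mental model: start from a perfect score, deduct a weighted penalty per open issue, and rank the fixes by how much each would recover. The issue names and weights below are entirely hypothetical, just to make the idea concrete:

```python
# Hypothetical issue weights, purely illustrative; not Veritas's actual scoring.
ISSUE_WEIGHTS = {
    "missing_patch": 5,
    "outdated_firmware": 8,
    "drive_predicted_to_fail": 20,
    "capacity_above_90_percent": 15,
}

def reliability_score(open_issues):
    """Credit-score-style health number: start at 100 and
    deduct a weight per open issue, flooring at 0."""
    return max(0, 100 - sum(ISSUE_WEIGHTS.get(i, 0) for i in open_issues))

def prescriptive_actions(open_issues):
    """List fixes biggest-impact-first, mirroring the
    'here's how to improve your score' guidance."""
    return sorted(open_issues, key=lambda i: ISSUE_WEIGHTS.get(i, 0),
                  reverse=True)

issues = ["missing_patch", "drive_predicted_to_fail"]
print(reliability_score(issues))     # 75
print(prescriptive_actions(issues))  # drive replacement first
```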
Again, reducing that alert fatigue and making it easier on the infrastructure management to do their job with fewer headaches, with fewer dashboards lighting up. So it's very, very prescriptive on highlighting what needs to be done and helping them through that process. The other area is around prescriptive potential fault detection. And fixing those anomalies before they can actually cause a downtime event, right, doing that in advance. So that's reducing the planned and unplanned downtime, which can be significant in terms of cost to your business. One of the analysts states that it's 20 million dollars a year in cost associated with downtime events like that, and that varies by industry. >> And that's a dart at the board, it's a big number. >> It's a big number, yeah. >> You pick your number, right, and see which one. >> And then the third area is really around helping our customers have better predictability into what their utilization requirements are. So the benefit there is really helping them improve their ROI on our appliances. Because now they don't have to overbuy and overprovision capacity, because we can show them the trend data, the amount of efficiency they're getting from the data. And they can right-size their appliances in terms of performance and capacity, and then we can warn them in advance. >> That's a real big thing, is what's happening there. That's proactive. >> It's very proactive. >> It's not reactive. >> Exactly. >> Well, you can solve on the reactive side because you just fix it. But the proactive side is really where things break: as you blow over capacity, you might want to add more. >> Yeah, believe it or not, those types of things have caused downtime events in our customers, where they're assuming their backups are going to complete, as an example with NetBackup appliances, and yet they're out of capacity. And at the last moment that's a fire drill for them.
So we can show them out 30, 60, 90 days what their utilization is, and then a threshold: at this point in time you're going to have a potential outage, some kind of problem. And so we recommend that you add this capacity before that ever occurs. >> Alright, talk about the customer reaction to that. You guys actually announced it today, and you talk to customers all the time. When you showed customers this in pre-launch, what was some of the feedback you heard? What were the key areas, what did they hone in on, what were the key things about Predictive Insights that made them get jazzed up about this? >> Yeah, so it's really, I would say, it covers the two key areas that I already mentioned. One of 'em is helping prevent unplanned downtime. That's a big concern for our customers in any industry. And this is going to be able to help them overcome that, you know, kind of rear-view-mirror look as to what's happening in the data center, and fixing a problem after it's occurred. Now they'll be able to be in advance of that and eliminate, or at least significantly reduce, those types of issues. And then the other one is helping, again, with that event fatigue in the operational model. That's where we've gotten the best feedback. >> So I'm going to ask you a hard question, which is, hey, you know, predictive analytics has been around for a while, prescriptive too. Why now, what's different about this opportunity? Obviously free is good because your customers get turned on pretty quickly. They get the benefits immediately, and new customers get it. I get that piece. >> Yeah. >> But what's different about you guys with this versus what might be out in the market? >> Yeah, I would say the key differentiation is that we have this very, very large universe of installed base systems that we've been gathering data on for over three years now. So the more data you have, the more data points you have, the better results you'll get from a machine learning type of environment.
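The 30/60/90-day capacity runway Seidman describes at the start of this exchange boils down to trend extrapolation: fit the recent utilization growth and project when it crosses a danger threshold. A minimal sketch, assuming daily samples and a 95% alert threshold (both assumptions, not Veritas specifics):

```python
def days_until_threshold(daily_util, threshold=0.95):
    """Fit a linear trend to utilization samples (fraction of capacity,
    one per day, at least two samples) and estimate days until
    `threshold` is crossed. Returns None for flat or shrinking usage,
    0 if already at or over the threshold."""
    n = len(daily_util)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_util) / n
    # Least-squares slope over day indices 0..n-1.
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_util))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None
    if daily_util[-1] >= threshold:
        return 0
    return round((threshold - daily_util[-1]) / slope)

# 30 days of history, growing one point of utilization per day.
history = [0.50 + 0.01 * day for day in range(30)]
print(days_until_threshold(history))  # 16 days of runway before 95% full
```

A real implementation would likely weight recent samples more heavily and model seasonality, but the runway idea is the same.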
And we're still collecting data, both the telemetry data that's coming in from the machines, as well as from our service personnel. So that right off the bat differentiates our solution from others that may have been out sooner, in that we've developed a rich data set that is being applied to the machine learning. And hence, our results out of the gate are very, very good. >> And you're using that, you're not actually charging for it. So that's another big one. >> Yeah, that's true too. >> So let's get into the specifics on the rollout. So this is a digital transformation table stake. You guys are checking a big box here. >> Sure. >> This is good. It gives your product some capability that leverages that metadata, and that is what this data-driven world is about. And certainly IoT is going to make this even more of a table stake. >> Absolutely. >> On the rollout side, it's all appliances, Veritas. >> Uh huh. >> And then software only, and then you're going to go beyond Veritas, is that right? >> Yeah. >> Explain that, what does that mean? So I get the appliances. What does software only mean and what does beyond Veritas mean? >> Yeah, so just to reiterate, today it's our appliances only. But many of our customers consume our solutions as software. And they're putting it on their own servers, a bring-your-own-server model. Probably about 40% of our customers, right. So we believe we can add this type of capability to be able to provide insights into our software that's installed on independent third-party hardware as well. Maybe some of the capabilities won't be as rich, but we're going to start building those capabilities over time and try to bring in that data and help those customers that are software-only customers. >> And that's on the road map? >> That's on the road map. >> Okay, so it's not available today? Okay, beyond Veritas? >> Yeah, so obviously many of our customers today are protecting data on-prem, protecting data in the Cloud, or some kind of hybrid model.
And we support, we don't really care where the customers want to store their data. We're capable of protecting it and helping them achieve whatever those Cloud type of initiatives are in that environment. So an obvious next step would be, hey, how can we bring this to help you know where your data is located and how it's working in those environments? Is that backup going to be able to be restored, as an example? So we're looking at future capabilities to add on to this. There's going to be huge value to our customers. >> This is great news. Thanks for coming in and sharing. I really appreciate it. I want to get your thoughts on some observations that we've been making. Certainly theCUBE coverage of Veritas has increased. Dave Vellante's been out on the road with the team, looked at some of the new backup and recovery versions, looking at the new UI, kind of a new Veritas going on here. >> It kind of is. >> What's the vibe going on in Veritas? What's new about Veritas for the folks watching now and saying, this is really cool, Veritas is cool and relevant right now? You guys have product-market fit. You guys have kind of a new Veritas vibe going on. What's it all about? Share your thoughts. >> Yeah, so I think there's, you know, some people call us legacy, right? But I don't think that's necessarily a bad term, right. I mean, like, when I'm gone, I hope I've left a legacy, right, that's worthwhile. And so we have that legacy, which is great. Because we've been adding value for our customers for many, many years. But what's new and exciting, I think, for us is that we're able to provide solutions that are very, very simple to utilize, very easy to accommodate whatever their requirements are, whether it's on-prem or hybrid or in the Cloud, we don't really care. So we've kind of progressed, I would say, into a very, very modern architecture for what we're doing, and meeting the requirements today of what our customers are doing as well as looking forward.
And this Predictive Insights piece, I think, is just another manifestation of how we're progressing as a company, what we can bring today to the current problems in the data center, and also looking out in terms of where the future requirements are as well. And we're ready for those. >> Well, legacy is a great word. I love that you brought that up because it's a double-edged sword. If you're a legacy and you don't do anything and you rest on your legacy, then you kind of, you're just milking that until the legacy is dry. >> Fair. >> But if you look at what Microsoft's done, they're classified as a legacy vendor. Office was shrink-wrapped software. >> Yeah. >> Satya Nadella comes over and now they're the darling of Cloud. They've shifted their products and execution to be what customers want, which is Cloud. Now they've got Office 365, Azure, you know, have been repurposed. There's some stuff they could still work on, but they've clearly cleared the runway. >> Yeah. >> And Oracle, not so much; Microsoft has. So this is a Veritas kind of vibe that's going on, similar to Microsoft. You guys are saying, hey, we've got an installed base. We're going to use that and leverage the assets of that installed base, that legacy. Harness it and make it part of the digital transformation. Is that kind of the vibe? >> No, exactly, and I think Microsoft is a great example. I mean, we're in tight partnership with Azure as a matter of fact. I just came from one of our Vision solutions stages where a gentleman from Azure shared the stage with us and talked about our partnerships and all that. So I mean, great example, but we're bringing those capabilities into the Cloud era, if you will. We have solutions that run natively in the Cloud, help that environment, so. >> Making the transition to digital transformation. Veritas, the new Veritas, they've got the solutions that are Cloud-enabled. Using data for the benefit of the customers, not just trying to bolt it on and make more money.
They're actually bringing value to the install base and changing the game up. Eric Seidman here inside theCUBE. Director of Solutions Marketing at Veritas. Part of theCUBE conversation, part of their news coverage of their Predictive Insights. I'm John Furrier, here in the Palo Alto studios, thanks for watching. (upbeat music)

Published Date : Nov 19 2018



Action Item | How to get more value out of your data, April 06, 2018


 

>> Hi, I'm Peter Burris and welcome to another Wikibon Action Item. (electronic music) One of the most pressing strategic issues that businesses face is how to get more value out of their data. In our opinion that's the essence of a digital business transformation: the use of data as an asset to improve your operations and take better advantage of market opportunities. The problem with data, though, is that it's shareable, it's copyable, it's reusable. It's easy to create derivative value out of it. One of the biggest misnomers in the digital business world is the notion that data is the new fuel or the new oil. It's not. You can only use oil once. You can apply it to a purpose and not multiple purposes. Data you can apply to a lot of purposes, which is why you are able to get such interesting and increasing returns to that asset if you use it appropriately. Now, this becomes especially important for technology companies that are attempting to provide digital business technologies or services or other capabilities to their customers. In the consumer world, it's started to come to a head. Questions about Facebook's reuse of a person's data through an ad-based business model are now starting to lead people to question the degree to which the information asymmetry, about what I'm giving and how they're using it, is really worth the value that I get out of Facebook. It's something that consumers and certainly governments are starting to talk about. It's also one of the bases for GDPR, which is going to start enforcing significant fines in the next month or so. In the B2B world that question is going to become especially acute. Why? Because as we try to add intelligence to the services and the products that we are utilizing within digital business, some of that requires a degree of, or some sort of, relationship where some amount of data is passed to improve the models and machine learning and AI that are associated with that intelligence.
Now, some companies have come out and said flat out they're not going to reuse a customer's data, IBM being a good example of that, when Ginni Rometty at IBM Think said, we're not going to reuse our customers' data. The question for the panel here is, is that going to be a part of a differentiating value proposition in the marketplace? Are we going to see circumstances in which some companies keep the cost of products and services low by reusing a client's data, while others, sustaining their experience and sustaining a trust model, say they won't? How is that going to play out in front of customers? So joining me today here in the studio, David Floyer. >> Hi there. >> And on the remote lines we have Neil Raden, Jim Kobielus, George Gilbert, and Ralph Finos. Hey, guys. >> All: Hey. >> All right so... Neil, let me start with you. You've been in the BI world as a user, as a consultant, for many, many years. Help us understand the relationship between data, assets, ownership, and strategy. >> Oh, God. Well, I don't know that I've been in the BI world. Anyway, as a consultant, when we would do a project for a company, there were very clear lines of what belonged to us and what belonged to the client. They were paying us generously. They would allow us to come into their company and do things that they needed, and in return we treated them with respect. We wouldn't take their data. We wouldn't take their data models that we built, for example, and sell them to another company. That's just, as far as I'm concerned, that's just theft. So if I'm housing another company's data because I'm a cloud provider or some sort of application provider, and I say, well, you know, I can use this data too? To me the analogy is, I'm a warehousing company and independently I go into the warehouse and I say, you know, these guys aren't moving their inventory fast enough, I think I'll sell some of it. It just isn't right. >> I think it's a great point. Jim Kobielus.
As we think about the role that data and machine learning play in training models and delivering new classes of services, we don't have a clean answer right now. So what's your thought on how this is likely to play out? >> I agree totally with Neil, first of all. If it's somebody else's data, you don't own it, therefore you can't sell it and you can't monetize it, clearly. But where you have derivative assets, like machine learning models that are derived from data, it's the same phenomenon, it's the same issue at a higher level. You can build and train, or should, your machine learning models only from data that you have legal access to, that you own or have a license for, and so forth. So as you're building these derivative assets, first and foremost, make sure, as you're populating your data lake to build and to do the training, that you have clear ownership over the data. So with GDPR and so forth, we have to be doubly, triply vigilant to make sure that we're not using data that we don't have authorized ownership of or access to. That is critically important. And so, I get kind of queasy when I hear some people say we use blockchain to make the sharing of training data more distributed and federated or whatever. It's like, wait a second. That doesn't solve the issues of ownership. That makes it even more problematic. If you get this massive blockchain of data coming from hither and yon, who owns what? How do you know? Do you dare build any models whatsoever from any of that data? That's a huge gray area that nobody's really addressed yet. >> Yeah well, it might mean that the blockchain has been poorly designed. I think we talked in one of the previous Action Items about the role that blockchain design is going to play.
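Jim's rule, train only on data you have clear rights to, can be enforced mechanically if every record entering the data lake carries a provenance tag. A toy sketch; the field names and rights categories are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Record:
    payload: dict
    source: str
    usage_rights: str  # e.g. "owned", "licensed", "unknown"

ALLOWED_RIGHTS = {"owned", "licensed"}

def admit_to_training(records):
    """Gate the training set: only records with clear ownership or
    license enter it; anything ambiguous is quarantined for review
    instead of being silently used."""
    admitted = [r for r in records if r.usage_rights in ALLOWED_RIGHTS]
    quarantined = [r for r in records if r.usage_rights not in ALLOWED_RIGHTS]
    return admitted, quarantined

batch = [
    Record({"metric": 1.2}, "customer-a", "licensed"),
    Record({"metric": 3.4}, "blockchain-feed", "unknown"),
]
ok, held = admit_to_training(batch)
print(len(ok), len(held))  # 1 1
```

Note how data arriving "from hither and yon" with unknown provenance never reaches the models under this gate, which is exactly Jim's worry about federated blockchain feeds.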
But moving aside from the blockchain, it seems as though we generally agree that data is typically owned by somebody, and that the ownership of it, as Neil said, means that you can't intercept it at some point, just because it is easily copied, and then generate rents on it yourself. David Floyer, what does that mean from an ongoing systems design and development standpoint? How are we going to assure, as Jim said, not only that we know what data is ours, but that we have the right protection strategies in place, so that as the data moves, we have some influence and control over it? >> Well, my starting point is that AI and AI-infused products are fueled by data. You need that data, and Jim and Neil have already talked about that. In my opinion, the most effective way of improving a company's products, whatever the products are, from manufacturing, agriculture, financial services, is to use AI-infused capabilities. That is likely to give you the best return on your money, and businesses need to focus on their own products. That's the first place you are trying to protect from anybody coming in. Businesses own that data. They own the data about your products in use by your customers; use that data to improve your products with AI-infused function, and use it before your competition eats your lunch. >> But let's build on that. So we're not saying that, for example, if you're a storage system supplier, since that's a relatively easy one. You've got very, very fast SSDs. Very, very fast NVMe over Fabric. Great technology. You can collect data about how that system is working, but that doesn't give you rights to then also collect data about how the customer's using the system. >> There is a line which you need to make sure that you are covering. For example, Call Home on a product, any product: whose data is that? You need to make sure that you can use that data.
You have some sort of agreement with the customer, and that's a win-win, because you're using that data to improve the product, prove things about it. But it's very, very clear that you should have a contractual relationship, as Jim and Neil were pointing out. You need the right to use that data; it can't go beyond that agreement. But you must get it, because if you don't get it, you won't be able to improve your products. >> Now, we're talking here about technology products, which often have very concrete and obvious ownership and people who are specifically responsible for administering them. But when we start getting into the IoT domain, or other places where the device is infused with intelligence and might be collecting data that's not directly associated with its purpose, just by virtue of the nature of the sensors that are out there, the whole concept of the digital twin introduces some tension into all this. George Gilbert. Take us through what's been happening with the overall suppliers of technology that are related to digital twin building, designing, etc. How are they securing, or making promises committing to their customers, that they will not cross this data boundary as they improve the quality of their twins? >> Well, as you quoted Ginni Rometty starting out, she's saying IBM, unlike its competitors, will not take advantage of and leverage and monetize your data. But it's a little more subtle than that, and digital twins are just another manifestation of the industry-specific solution development that we've done for decades. The difference, as Jim and David have pointed out, is that with machine learning, it's not so much code that's at the heart of these digital twins, it's the machine learning models, and the data is what informs those models. Now...
So you don't want all your secret sauce to go from Mercedes-Benz to BMW, but at the same time the economics of industry solutions mean that you do want some of the repeatability that we've always gotten from industry solutions. You might have parts that are just company-specific. And so in IBM's case, if you really parse what they're saying, they take what they learn in terms of the models from the data when they're working with BMW, and some of that is going to go into the industry-specific models that they're going to use when they're working with Mercedes-Benz. If you really, really sort of peel the onion back and ask them, it's not the models, it's not the features of the models, but it's the coefficients that weight the features, or variables, in the models that they will keep segregated by customer. So in other words, you get some of the benefits, the economic benefits, of reuse across customers with similar expertise, but you don't actually get all of the secret sauce. >> Now, Ralph Finos-- >> And I agree with George here. I think that's an interesting topic. That's one of the important points. It's not kosher to monetize data that you don't own, but conceivably, if you can abstract from that data at some higher level, like George is describing, in terms of weights and coefficients and so forth, in a neural network that's derivative from the model, at some point in the abstraction you should be able to monetize. I mean, it's like a paraphrase of some copyrighted material. A paraphrase, I'm not a lawyer, but you can sell a paraphrase because it's your own original work that's based, obviously, on your reading of Moby Dick or whatever it is you're paraphrasing. >> Yeah, I think-- >> Jim, I-- >> Peter: Go ahead, Neil. >> I agree with that, but there's a line. There was a guy who worked at Capital One, this was about ten years ago, and he was their chief statistician or whatever.
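The segregation George describes, shared feature definitions across an industry with the fitted coefficients kept private per customer, maps onto a simple structure: one common feature extractor, many privately fitted weight vectors. Everything below (feature names, weights) is invented to illustrate the split, not IBM's actual design:

```python
# Industry-wide, reusable: the same feature definitions for every customer.
SHARED_FEATURES = ["vibration_rms", "bearing_temp", "hours_since_service"]

def extract_features(sensor_row):
    """Shared feature extraction: what the model looks at."""
    return [sensor_row[name] for name in SHARED_FEATURES]

class CustomerModel:
    """Linear scoring model whose weights are fit on, and never
    leave, one customer's data."""
    def __init__(self, weights):
        self.weights = weights  # customer-private coefficients

    def score(self, sensor_row):
        return sum(w * x for w, x in
                   zip(self.weights, extract_features(sensor_row)))

bmw = CustomerModel([0.7, 0.2, 0.1])       # fit on one customer's data only
mercedes = CustomerModel([0.3, 0.5, 0.2])  # fit on another's data only

row = {"vibration_rms": 1.0, "bearing_temp": 2.0, "hours_since_service": 3.0}
print(bmw.score(row), mercedes.score(row))  # same features, different weights
```

The vendor can reuse `SHARED_FEATURES` and `extract_features` across the industry, while each `weights` vector stays inside the customer engagement that produced it.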
This was before we had words like machine learning and data science; it was called statistics and predictive analytics. He left the company, formed his own company, and rewrote and recoded all of the algorithms he had for about 20 different predictive models, then licensed that stuff to Sybase and Teradata and whatnot. Now, the question I have is, did that cross the line or didn't it? These were algorithms actually developed inside Capital One. Did he have the right to use those, even if he wrote new computer code to make them run in databases? So it's more than just data, I think. It's a marketplace, and I think that if you own something, someone should not be able to take it and make money on it. But that doesn't mean you can't make an agreement with them to do so, and I think we're going to see a lot of that. IMSN gets data on prescription drugs, and IRI and Nielsen get scanner data; they pay for it, then they add value to it and resell it. So I think that's really the issue: the use has to be understood by all the parties, and the compensation has to be appropriate to the use. >> All right, so Ralph Finos. As a guy who looks at market models and handles a lot of the fundamentals for how we do our forecasting, look at this from the standpoint of how people are going to make money, because clearly what we're talking about sounds like the idea that any derivative use is embedded in algorithms. Seeing how those contracts get set up, and I've got a comment on that in a second, but the promise, a number of years ago, that people were going to start selling data willy-nilly as a way of capturing value out of their economic or business activities, hasn't matured yet generally. Do we see this brand new data economy, where everybody's selling data to each other, being the way that this all plays out? >> Yeah, I'm having a hard time imagining this as a marketplace. 
I think we pointed at the manufacturing and technology industries, where some of this makes sense. But from a practitioner perspective, you're looking for variables that are meaningful and in a form you can actually use to make predictions, where you understand the history and the validity of that data. In a lot of cases there's a lot of garbage out there that you can't use. And the notion of paying for something that you ultimately look at and say, oh crap, this isn't really helping me, is going to be... maybe not an insurmountable barrier, but it's going to create some obstacles in the market for adoption of this kind of thought process. We have to think about the utility of the data that feeds your models. >> Yeah, I think there are going to be a lot of legal questions raised, and I recommend that people go look at a recent SiliconANGLE article written by Mike Wheatley and edited by our Editor in Chief Robert Hof about Microsoft letting technology partners own the rights to joint innovations. This is quite a change for Microsoft, who used to send you, if you sent them an email with an idea, an email back saying, oh, just to let you know, any correspondence we have here is the property of Microsoft. So there clearly is tension in the model about how we're going to utilize data and enable derivative use, and how we're going to appropriate value and share in the returns from it. I think this is going to be an absolutely central feature of business models, certainly in the digital business world, for quite some time. 
The last thing I'll mention here, and then I'll get to the Action Items, is that one of the biggest challenges, whenever we start talking about how we set up businesses and institutionalize the work that's done, is to look at the nature and scope of the assets. In circumstances where an asset is used by two parties and is generating a high degree of value, as measured by the transactions against that asset, there's always going to be a tendency for one party to try to take ownership of it. The party that's able to generate greater returns than the other almost always makes a move to try to take more control of that asset, and that's the basis of governance. And so everybody talks about data governance as though it's something you worry about with your backup and restore. Well, that's important, but this notion of data governance is increasingly going to become a feature of strategy and boardroom conversations about what it really means to create data assets, sustain those data assets, get value out of them, and how we determine whether or not the right balance is being struck between the value that we're getting out of our data and the value third parties are getting out of our data, including customers. So with that, let's do a quick Action Item. David Floyer, I'm looking at you. Why don't we start here. David Floyer, Action Item. >> So my Action Item is for businesses: you should focus. Focus on data about your products in use by your customers, to help improve the quality of your products and fuse AI into those products as one of the most efficient ways of adding value to them. And do that before your competition has a chance to come in and get data that will stop you from doing that. >> George Gilbert, Action Item. >> I guess mine would be that... in most cases you want to embrace some amount of reuse because of the economics involved in your joint development with a solution provider. 
But if others are going to get some benefit from reusing some of the intellectual property that informs the models you build, make sure you negotiate with your vendor that for any upgrades to those models, whether they're digital twins or in other forms, there's a canonical version that can come back and be an upgrade path for you as well. >> Jim Kobielus, Action Item. >> My Action Item is for businesses to regard your data as a product that you monetize yourself. Or, if you are unable to monetize it yourself, and there is a partner, like a supplier or a customer, who can monetize that data, then negotiate the terms of that monetization in your relationship and be vigilant about it, so you get a piece of that stream, even if the bulk of the work is done by your partner. >> Neil Raden, Action Item. >> It's all based on transparency. Your data is your data. No one else can take it without your consent. That doesn't mean that you can't get involved in relationships where there's an agreement to do that. But the problem is that most agreements, especially when you look at business-to-consumer ones, are so onerous that nobody reads them and nobody understands them. So the person providing the data has to have an unequivocal right to sell it to you, and the person buying it has to really understand the limits on what they can do with it. >> Ralph Finos, Action Item. You're muted, Ralph. But it was brilliant, whatever it was. (Peter laughs) >> Well it was, and I really can't say much more than that. But I think, from a practitioner perspective, I understand how from a manufacturing perspective the value could be there. But as a practitioner, if you're fishing for data out there that someone has, that might look like something you can use, chances are it's not. And you need to be real careful about spending money to get data that you're not really clear is going to help you. >> Great. All right, thanks very much team. So here's our Action Item conclusion for today. 
The whole concept of digital business is predicated on the idea of using data assets in a differential way to better serve your markets and improve your operations. It's your data. Increasingly, that is going to be the basis for differentiation. And any careless undertaking that allows that data to get out has the potential that someone else can, through their data science and their capabilities, re-engineer much of what you regard as your differentiation. We've had conversations with leading data scientists who say that if someone were to sell customer data into an open marketplace, it would take about four days for a great data scientist to re-engineer almost everything about your customer base. So as a consequence, we have to tread lightly here as we think about what it means to release data into the wild. Ultimately, the challenge for any business will be: how do I establish the appropriate governance and protections, not just looking at the technology but at the overall notion of the data assets. If you don't understand how to monetize your data and nonetheless enter into a partnership with somebody else, by definition that partner is going to generate greater value out of your data than you are. There are significant information asymmetries here. So every company must undertake an understanding of how to generate value out of its data. We don't think that there's going to be a general-purpose marketplace for sharing data, in a lot of ways. 
This is going to be a heavily contracted arrangement, but that doesn't mean we should not take important steps right now to start doing a better job of instrumenting our products and services so that we can start collecting data about them, because the path forward is going to demonstrate that we're going to be able to dramatically improve the quality of the goods and services we sell, by reducing asset specificity for our customers and by making those goods and services more intelligent and more programmable. Finally, is this going to be a feature of a differentiated business relationship through trust? We're open to that. Personally, I'll speak for myself, I think it will. I think there is ultimately going to be an important element of being able to demonstrate to a customer base, to a marketplace, that you take privacy, data ownership, and intellectual property control of data assets seriously, and that you are very, very specific, very transparent, in how you're going to use those in derivative business transactions. All right. So once again, David Floyer, thank you very much here in the studio. On the phone: Neil Raden, Ralph Finos, Jim Kobielus, and George Gilbert. This has been another Wikibon Action Item. (electronic music)

Published Date : Apr 6 2018


Eric Siegel, Predictive Analytics World - #SparkSummit - #theCUBE


 

>> Announcer: Live from San Francisco it's theCUBE Covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE. You are watching coverage of Spark Summit 2017. It's day two, we've got so many new guests to talk to today. We already learned a lot, right George? >> Yeah, I mean we had some, I guess, pretty high bandwidth conversations. >> Yes, well I expect we're going to have another one here too, because the person we have is the founder of Predictive Analytics World, it's Eric Siegel, Eric welcome to the show. >> Hey thanks Dave, thanks George. You go by Dave or David? >> Dave: Oh you can call me sir, and that would be. >> I was calling you, should I, can I bow? >> Oh no we are bowing to you, you're the author of the book, Predictive Analytics, I love the subtitle, the Power to Predict Who Will Click, Buy, Lie or Die. >> And that sums up the industry right? >> Right, so if people are new to the industry, that's sort of an informal definition of predictive analytics, basically also known as machine learning. Where you're trying to make predictions for each individual, whether it's a customer for marketing, a suspect for fraud or law enforcement, a voter for political campaigning, a patient for healthcare. So, in general it's on that level, it's a prediction for each individual. So how does data help make those predictions? And then you can only imagine just how many ways in which predicting on that level helps organizations improve all their activities. >> Well we know you were on the keynote stage this morning. Could you maybe summarize for the CUBE audience, what a couple of the top themes that you were talking about? >> Yeah, I covered two advanced topics. I wanted to make sure this pretty technical audience was aware of because a lot of people aren't and one is called uplift modeling, so that's optimizing for persuasion for things like marketing and also for healthcare, actually. And for political campaigning. 
So when you do predictive analytics for targeting marketing, the traditional approach is: let's predict whether this person will buy if I contact them, because if so, it's probably a good idea to spend the two dollars to send them a brochure, the marketing treatment, right. But there is actually a slightly different question that would drive even better decisions, which is not will this person buy, but would contacting them, sending them the brochure, influence them to buy? Will it increase the chance that we get that positive outcome? That's a different question, and it doesn't correspond to standard predictive modeling or machine learning methods. So uplift modeling, also known as net lift modeling or persuasion modeling, is a way to create a predictive model like any other, except that its target is: is it a good idea to contact this person, because contact will increase the chances of a positive outcome. So that's the first of the two. And I crammed this all into 20 minutes. The other one is a little more commonly known, but I think people should revisit it, and it's called p-hacking, or vast search, where you can be fooled by randomness in data relatively easily. In the era of Big Data there is this all too common pitfall where you find a predictive insight in the data and it turns out it was actually just a random perturbation. How do you know the difference? >> Dave: Fake news right? >> Okay, fake news, except that in this case it was generated by a computer, right? And then there is a statistical test that makes it look like it's actually statistically significant, so we lend it credibility. 
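Eric's distinction between "will they buy" and "will contacting them change whether they buy" can be sketched with a toy simulation. This is an illustrative approach only, not Eric's or any vendor's actual method; the segment names and probabilities are invented. The simplest uplift estimate compares response rates between treated and control customers within a segment:

```python
import random

random.seed(7)

# Toy customer log: each record has a segment, whether we sent the
# brochure (treated), and whether the customer bought.
def simulate(n=4000):
    rows = []
    for _ in range(n):
        segment = random.choice(["persuadable", "sure_thing"])
        treated = random.random() < 0.5
        if segment == "persuadable":
            p_buy = 0.30 if treated else 0.10  # contact helps a lot here
        else:
            p_buy = 0.50                       # buys anyway; contact is wasted
        rows.append((segment, treated, random.random() < p_buy))
    return rows

def uplift_by_segment(rows):
    """Uplift = P(buy | treated) - P(buy | control), per segment."""
    stats = {}
    for seg, treated, bought in rows:
        s = stats.setdefault(seg, {"t": [0, 0], "c": [0, 0]})
        arm = s["t"] if treated else s["c"]
        arm[0] += bought
        arm[1] += 1
    return {seg: s["t"][0] / s["t"][1] - s["c"][0] / s["c"][1]
            for seg, s in stats.items()}

uplift = uplift_by_segment(simulate())
# "persuadable" shows a large positive uplift; "sure_thing" is near zero,
# even though a plain will-they-buy model would target it most heavily.
```

The "sure thing" segment buys at a high rate whether or not it is contacted, which is exactly why a conventional buy-propensity model would waste the two dollars on it.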
So you can avert it: you have to compensate for the fact that you are trying lots of things, that you are evaluating many different predictive insights or hypotheses, whatever you want to call them, and make sure that the one you end up believing has been checked for the possibility that it was just random luck. That's known as p-hacking. >> Alright, so uplift modeling and p-hacking. George do you want to drill in on those a little bit. >> Yeah, I want to start from maybe the vocabulary of our audience, where they'd say uplift modeling goes beyond prediction. And even for the second one, p-hacking, is that where you're essentially playing with the parameters of the model to find the difference between correlation and causation, and going from prediction to prescription? >> It's not about causation, actually. Correlation is what you get when you have a predictive insight or some component of a predictive model where you see two things connected, therefore one is predictive of the other. Now, the fact that this does not entail causation is a really good point to remind people of. But even before you address that question, the first question is: is this correlation actually legit? Is there really a correlation between these things? Is this an actual finding, or did it just happen to be the case in this particular limited sample of data that I have access to at the moment, right? So is it a real link or correlation in the first place, before you even start asking any question about causality? And it does relate to what you alluded to with regard to tuning parameters, because it's closely related to this issue of overfitting. People who do predictive modeling are very familiar with overfitting. The standard practice, which all tools and implementations of machine learning and predictive modeling follow, is that they hold aside an evaluation set called the test set. So you don't get to cheat: you create a predictive model. 
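The "vast search" pitfall Eric describes can be demonstrated with a small simulation (illustrative only; a real analysis would use a proper statistics library). Testing a thousand random "insights" against a pure-noise outcome yields dozens of nominally significant findings, and a multiple-comparisons correction such as Bonferroni discards nearly all of them:

```python
import math
import random

random.seed(0)

def p_value(feature, outcome):
    """Two-sided normal-approximation p-value for the difference in
    outcome rates between the feature=1 and feature=0 groups."""
    g1 = [o for f, o in zip(feature, outcome) if f]
    g0 = [o for f, o in zip(feature, outcome) if not f]
    if not g1 or not g0:
        return 1.0
    p1, p0 = sum(g1) / len(g1), sum(g0) / len(g0)
    p = (sum(g1) + sum(g0)) / (len(g1) + len(g0))
    se = math.sqrt(p * (1 - p) * (1 / len(g1) + 1 / len(g0)))
    if se == 0:
        return 1.0
    z = abs(p1 - p0) / se
    return math.erfc(z / math.sqrt(2))

n_hypotheses, n_records = 1000, 200
outcome = [random.random() < 0.5 for _ in range(n_records)]  # pure noise

# "Vast search": test 1000 random candidate insights against the outcome.
pvals = [p_value([random.random() < 0.5 for _ in range(n_records)], outcome)
         for _ in range(n_hypotheses)]

naive_hits = sum(p < 0.05 for p in pvals)  # spurious "discoveries" at 0.05
bonferroni_hits = sum(p < 0.05 / n_hypotheses for p in pvals)  # corrected
```

Everything here is random noise by construction, yet the naive threshold "finds" a few dozen insights; dividing the threshold by the number of hypotheses tried is the compensation Eric is pointing at.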
It learns from the data, does the number crunching, it's mostly automated, right. And it comes out with this beautiful model that predicts well, and then you evaluate it, you assess it over this held-aside set. Oh, my thing's falling off here. >> Dave: Just a second on your. >> So then you evaluate it on this held-aside set. It was quarantined, so you didn't get to cheat. You didn't get to look at it when you were creating the model. So it serves as an objective performance measure. The problem, and here is the huge irony, is with the things we get from data, the predictive insights. There was one famous one that was broadcast too loudly, because it's not nearly as credible as people first thought: that an orange used car is a better one to buy because it's less likely to be a lemon. That's what it looked like in this one data set. The problem is that this is a single insight that's relatively simple, just using the car's color to make the prediction. A predictive model is much more complex and deals with lots of other attributes, not just the color: make, year, model, everything on that individual car or individual person. You can imagine all the attributes; that's the point of the modeling process, the learning process, how you consider multiple things. If it's a really simple thing based just on the car color, then even many of the most advanced data science practitioners kind of forget that there is still potential to effectively overfit, that you might have found something that doesn't apply in general, that only applies over this particular set of data. So that's where the trap lies: they don't necessarily hold themselves to the high standard of having this held-aside test set. So it's kind of an ironic thing. The insights most likely to make the headlines, like orange cars, are simpler and easier to understand, but it's less well understood that they could be wrong. 
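The quarantined test set can be illustrated with a deliberately overfit "model" that simply memorizes its training data (a contrived sketch, not a real modeling workflow): it scores perfectly on the data it has seen and collapses to roughly coin-flip accuracy on the held-aside data, which is exactly what the held-aside set exists to expose.

```python
import random

random.seed(42)

# Labels are pure coin flips: there is nothing real to learn.
data = [(i, random.random() < 0.5) for i in range(200)]
train, test = data[:100], data[100:]

# A "model" that just memorizes every training example.
memorized = dict(train)

def predict(x):
    return memorized.get(x, False)  # unseen inputs: fall back to a default

def accuracy(rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

train_acc = accuracy(train)  # a perfect 1.0 -- pure overfitting
test_acc = accuracy(test)    # collapses to roughly chance performance
```

The gap between `train_acc` and `test_acc` is the overfit; evaluating only on the training data would have "confirmed" a worthless model.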
>> You know, keying off that, that's really interesting, because we've been hearing for years that what's made deep learning, especially, relevant over the last few years is huge compute up in the cloud and huge data sets. >> Yeah. >> But we're also starting to hear about methods of generating a sort of synthetic data, so that if you don't have, I don't know what the term is, organic training data, and then test data, we're getting to the point where we can do high-quality models with less. >> Yes, less of that training data. And did you. >> Tell us. >> Did you interview with the keynote speaker from Stanford about that? >> No, I only saw part of his. >> Yeah, his speech yesterday. That's an area that I'm relatively new to, but it sounds extremely important, because that is the bottleneck. He called it, if data's the new oil, he's calling it the new-new oil, which is more specific than data: it's training data. So all of the machine learning or predictive modeling methods of which we speak are, in most cases, what's called supervised learning. The thing that makes it supervised is that you have a bunch of examples where you already know the answer. So if you're trying to figure out whether a picture is of a cat or of a dog, that means you need a whole bunch of data from which to learn, the training data, where you've already got it labeled. You already know the correct answer. In many business applications, just because of history, you know who did or didn't respond to your marketing, and you know who did or did not turn out to be fraudulent. History is experience from which to learn; it's in the data, so you do have it labeled, yes or no. You already know the answer, so you don't need to predict on those cases; they're in the past, but you use them as training data. So we have that in many cases. 
But for something like classifying an image, where we're trying to figure out whether there's a picture of a cat somewhere in the image, or whatever, all these big image classification problems, you often do need a manual effort to label the data: to have the positive and negative examples, which is what's called the training data, the learning data. There's definitely a bottleneck, so anything that can be done to avert that bottleneck, decrease the amount of training data we need, or find ways to make, sort of, rough training data that may serve as a building block for the modeling process, this kind of thing. That's not my area of expertise; it sounds really intriguing though. >> What about, and this may be further out on the horizon, but one thing we are hearing about is the extreme shortage of data scientists, who need to be teamed up with domain experts to figure out the knobs, the variables, to create these elaborate models. We're told that even if you're doing the traditional statistical machine learning models, eventually deep learning can help us identify the features or the variables, just the way it sort of identifies, you know, ears and whiskers and a nose and then figures out from that the cat. Is that something that's in the near term, or the medium term, in terms of helping to augment what the data scientist does? >> It's in the near term, and that's why everyone's excited about deep learning right now. Basically, the reason we built these machines called computers is because they automate stuff. Pretty much anything that you can think of and define well, you can program. Then you've got a machine that does it. Of course, one of the things we wanted it to do is to learn from data. Now, it's literally very analogous to what it means for a human to learn. You've got a limited number of examples, and you're trying to draw generalizations from those. 
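Supervised learning from labeled history, as Eric describes it, can be reduced to a few lines. This is a toy 1-nearest-neighbor sketch with invented features; real models consider far more attributes and far more data. The "learning" is just: copy the known answer of the most similar past case.

```python
# Labeled history: (age, past_purchases) -> responded to marketing?
history = [
    ((25, 0), False), ((30, 1), False), ((22, 0), False),
    ((45, 6), True),  ((52, 8), True),  ((48, 5), True),
]

def predict_response(customer):
    """1-nearest-neighbor: reuse the label of the most similar past case."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(history, key=lambda pair: dist(pair[0], customer))
    return label

print(predict_response((50, 7)))  # resembles past responders -> True
print(predict_response((24, 1)))  # resembles past non-responders -> False
```

The labels in `history` are the "answers we already know"; without them there is nothing to supervise the learning, which is the bottleneck the keynote called the new-new oil.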
When you go to bigger-scale problems, the thing you're classifying isn't just a customer and all the things you know about the customer, are they likely to commit fraud, yes or no. It becomes a level more complex when it's an image, right? An image is worth a thousand words, and maybe literally more than a thousand words' worth of data if it's high resolution. So how do you process that? Well, there's all sorts of research like, well, we can define the thing that tries to find arcs and circles and edges and that kind of thing, or we can try to, once again, let that be automatic. Let the computer do that. So deep learning is a way to allow that; Spark is a way to make it operate quickly, but there's another level of scale other than speed. That level of scale is how complex a task you can leave up to the automaton to handle by itself. That's what deep learning does: it scales in that respect. It has the ability to automate more layers of that complexity, as far as finding those kinds of what might be domain-specific features in images. >> Okay, but I'm thinking not just, help me figure out speech-to-text and natural language understanding, or classification. >> Anything with a signal, where it's a high-bandwidth amount of data coming in that you want to classify. >> OK, so does that extend to: I'm building a very elaborate predictive model, not on whether there's a cat in the video or in the picture, so much as, I guess you called it, whether there's uplift potential and how big that potential is, in the context of making a sale on an eCommerce site? >> What you just tapped into is that when you go to marketing and many other business applications, you don't actually need high accuracy; what you need is a prediction that's better than guessing. 
So for example, if I get a 1% response rate to my marketing campaign, but I can find a pocket that's got a 3% response rate, it may very well be rocket science to learn from the data how to define that sub-segment with the higher response rate, or whatever it is. But the 3% isn't, I have high confidence this person's definitely going to buy; it's still just 3%. Yet that difference can make a huge difference, and can improve the bottom line of marketing by a factor of five, that kind of thing. It's not necessarily about accuracy. If you've got an image and you need to know, is there a picture of a car, or is the traffic light somewhere in this image green or red, then in certain application areas, self-driving cars or what have you, it does need to be accurate, right. But maybe there's more potential for it to be accurate, because there's more predictability inherent to that problem. I can predict that there's a traffic light with a green light somewhere in an image because there is enough labeled data, and the nature of the problem is more tractable: it's not as challenging to find where the traffic light is and then which color it is. You need it to scale, to reach that level of classification performance in terms of accuracy, or whatever measure you use, for certain applications. >> Are you seeing new methodologies, like reinforcement learning, or deep learning where the models are adversarial, make big advances in terms of what they can learn without a lot of supervision? Like the ones where. >> It's more self-learning and unsupervised. >> Sort of, glue yourself onto this video game screen, we'll give you control of the steering wheel, and you figure out how to win. >> Having less required supervision, more self-learning, anomaly detection or clustering, these are some of the unsupervised approaches. 
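The economics Eric is pointing at, where a prediction that is merely "better than guessing" still transforms the bottom line, can be checked with back-of-the-envelope arithmetic. The dollar figures and list sizes below are invented for illustration; only the 1% and 3% response rates come from his example.

```python
# Illustrative numbers only: $2 to mail a brochure, $150 profit per sale.
cost_per_mail = 2.0
profit_per_sale = 150.0

def campaign_profit(n_mailed, response_rate):
    # Expected profit = expected sales revenue minus mailing cost.
    return n_mailed * (response_rate * profit_per_sale - cost_per_mail)

# Blanket campaign: mail all 1,000,000 customers at the base 1% rate.
blanket = campaign_profit(1_000_000, 0.01)
# Targeted campaign: mail only the 200,000 the model scores into a 3% pocket.
targeted = campaign_profit(200_000, 0.03)

print(round(blanket))   # the blanket campaign loses money
print(round(targeted))  # the modest 3% pocket is solidly profitable
```

Neither campaign "knows" who will buy; moving from 1% to 3% is nowhere near high accuracy, yet under these assumptions it is the difference between losing money and making it.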
When it comes to vision, there are parts of the process that can be unsupervised, in the sense that you don't need labels on your target, like is there a car in the picture. It can still learn the feature detection in a way that doesn't require that supervised data. Although image classification in general, deep learning on that level, is not my area of expertise. That's a very up-and-coming part of machine learning, but it's only needed when you have these high-bandwidth inputs, like an entire high-resolution image, or video, or high-bandwidth audio. So it's signal-processing-type problems where you start to need that kind of deep learning. >> Great discussion Eric, just a couple of minutes to go in this segment here. I want to make sure I give you a chance to talk about Predictive Analytics World. What's your affiliation with that, and what do you want theCUBE audience to know? >> Oh sure, Predictive Analytics World, I'm the founder. It's the leading cross-vendor event focused on commercial deployment of predictive analytics and machine learning. Our main event, held a few times a year, is a broad-scope, business-focused event, but we also have industry-vertical specialized events just for financial services, healthcare, workforce, manufacturing and government applications of predictive analytics and machine learning. So there are a number each year: two weeks from now in Chicago, October in New York, and you can see the full agendas at PredictiveAnalyticsWorld.com. >> Alright, great short commercial there. 30 seconds. 
By the way, that's something I worked really hard on with the book, which is meant for all readers, although the last few chapters get advanced. >> How do you get executive sponsors to get what you're doing? >> Well, as I say, give them the book. Because the point of the book is it's pop science: it's accessible, it's analytically driven, it's entertaining, it keeps things relevant, but it does address advanced topics at the end. So it sort of ends with an industry overview kind of thing. The bottom line there, in general, is that you want to focus on the business impact. As I mentioned briefly a second ago: if we can improve target marketing this much, it will increase profit by a factor of five, something like that. So you start with that, and then answer any questions they have about, well, how does it work, what makes it credible that it really has that much bottom-line potential. When you're a techie, you're inclined to start with the technology that you're excited about. That's my background, so that's sort of the definition of being a geek: you're more enamored with the technology than the value it produces. Because it's amazing that it works, and it's exciting, it's interesting, it's scientifically challenging. But when you're talking to decision makers, you have to start with the eventual carrot at the end of the stick, which is the value. >> The business outcome. >> Yeah. >> Great, well that's going to be the last word. That might even make it onto our CUBE Gems segment, great sound bites. George, thanks again, great questions, and Eric, the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die, thank you for being on the show, we appreciate your time. >> Eric: Sure, yeah thank you, great to meet you. >> Thank you for watching theCUBE, we'll be back in just a few minutes with our next guest here at Spark Summit 2017.

Published Date : Jun 7 2017

SUMMARY :

brought to you by Databricks. to talk to today. Yeah, I mean we had some, I guess, because the person we have is the founder You go by Dave or David? I love the subtitle, the Power to Predict Who Will Click, And then you can only imagine just how many ways what a couple of the top themes that you were talking about? there is this all to common pitfall where you find and make sure that the one that you are believing George do you want to drill on those a little bit. is that where you're essentially of a predictive model where you see these things connected The problem is, that when you have a single insight over the last few years is huge compute up in the cloud so that if you don't have, I don't know what the term is, Yes, less of that training data. it's in the data, so you do have that labeled, That's something that's in the near term, the medium term and all the things you know about the customer, help me figure out speech to text that you want to classify. so much as I guess you called it, So what you just tapped into was Are you seeing like new methodologies like and unsupervised. and you figure out how to win. that you don't need labels on your target ad what do you want theCUBE audience to know? in Chicago, October in New York and you can see what the toughest question you got is how do you get this level of complexity is that you want to focus on the business impact. and Eric the author of Predictive Analytics, the Power Thank you for watching theCUBE we'll be back


Rob Lantz, Novetta - Spark Summit 2017 - #SparkSummit - #theCUBE


 

>> Announcer: Live from San Francisco it's the CUBE covering Spark Summit 2017 brought to you by Databricks. >> Welcome back to the CUBE, we're continuing to talk to people who are not just talking about things but doing things. We're happy to have, from Novetta, the Director of Predictive Analytics, Mr. Rob Lantz. Rob, welcome to the show. >> Thank you. >> And off to my right, George, how are you? >> Good. >> We've introduced you before. >> Yes. >> Well let's talk to the guest. Let's get right to it. I want to talk to you a little bit about what does Novetta do and then maybe what apps you're building using Spark. >> Sure, so Novetta is an advanced analytics company, we're medium sized and we develop custom hardware and software solutions for our customers who are looking to get insights out of their big data. Our primary offering is a hard entity resolution engine. We scale up to billions of records and we've done that for about 15 years. >> So you're in the business end of analytics, right? >> Yeah, I think so. >> Alright, so talk to us a little bit more about entity resolution, and that's all Spark right? This is your main priority? >> Yes, yes, indeed. Entity resolution is the science of taking multiple disparate data sets, traditional big data, and taking records from those and determining which of those are actually the same individual or company or address or location and which of those should be kept separate. We can aggregate those things together and build profiles and that enables a more robust picture of what's going on for an organization. >> Okay, and George? >> So what did you do... What was the solution looking like before Spark and how did it change once you adopted Spark? >> Sure, so with Spark, it enabled us to get a lot faster. Obviously those computations scaled a lot better. Before, we were having to write a lot of custom code to get those computations out across a grid. 
When we moved to Hadoop and then Spark, that made us, let's say, able to scale those things and get it done overnight or in hours and not weeks. >> So when you say you had to do a lot of custom code to distribute across the cluster, does that include when you were working with MapReduce, or was this even before the Hadoop era? >> Oh it was before the Hadoop era and that predates my time so I won't be able to speak expertly about it, but to my understanding, it was a challenge for sure. >> Okay so this sounds like a service that your customers would then themselves build on. Maybe an ETL customer would figure out master data from a repository that is not as carefully curated as the data warehouse or similar applications. So who is your end customer and how do they build on your solution? >> Sure, so the end customer typically is an enterprise that has large volumes of data that deal in particular things. They collect, it could be customers, it could be passengers, it could be lots of different things. They want to be able to build profiles about those people or companies, like I said, or locations, any number of things can be considered an entity. The way they build upon it then is how they go about quantifying those profiles. We can help them do that, in fact, some of the work that I manage does that, but often times they do it themselves. They take the resolved data, and that gets resolved nightly or even hourly, and they build those profiles themselves for their own purpose. >> Then, to help us think about the application or the use case holistically, once they've built those profiles and essentially harmonized the data, what does that typically feed into? >> Oh gosh, any number of things really. Oh, shoot. We've got deployments in AWS in the cloud, we've got deployments, lots of deployments on premises obviously. That can go anywhere from relational databases to graph query language databases. Lots of different places from there for sure. 
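The entity resolution Rob describes, taking records from disparate data sets and deciding which refer to the same individual, can be sketched in miniature. The toy Python below is my own illustration, not Novetta's engine: the single name field, the fuzzy-match rule, and the 0.9 threshold are all assumptions. It normalizes records and greedily clusters the ones whose names are similar enough, while keeping "Rob Lantz" and "Robert Lantz" apart, as in Rob's own example later in the conversation:

```python
from difflib import SequenceMatcher

def normalize(record):
    # Lowercase and collapse whitespace so "rob  LANTZ " and "Rob Lantz" compare equal.
    return {k: " ".join(v.split()).lower() for k, v in record.items()}

def same_entity(a, b, threshold=0.9):
    # Toy match rule: fuzzy similarity on the name alone. A real engine
    # would weigh many fields (address, email, phone) with tuned thresholds.
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

def resolve(records, threshold=0.9):
    """Greedily assign each record to the first cluster it matches."""
    clusters = []
    for rec in map(normalize, records):
        for cluster in clusters:
            if same_entity(rec, cluster[0], threshold):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

records = [
    {"name": "Rob Lantz"},
    {"name": "rob  LANTZ "},   # same person, messy formatting
    {"name": "Robert Lantz"},  # similar name, different person
]
clusters = resolve(records)
```

Here the 0.9 threshold is what keeps the two Robs merged but Robert separate; at Novetta's scale the interesting work is doing this comparison over billions of records without comparing every pair.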
>> Okay so, this actually sounds like everyone talks now about machine learning informing every category of software. This sounds like you take the old style ETL, where master data was a value-add layer on top, and that was, it took a fair amount of human judgment to do. Now, you're putting that service on top of ETL and you're largely automating it, probably with, I assume, some supervised guidance, supervised training. >> Yes, so we're getting into the machine learning space as far as entity extraction and resolution and recognition because more and more data is unstructured. But machine learning isn't necessarily a baked-in part of that. Actually entity resolution is a prerequisite, I think, for quality machine learning. So if Rob Lantz is a customer, I want to be able to know what has Rob Lantz bought in the past from me. And maybe what is Rob Lantz talking about in social media? Well I need to know how to figure out who those people are, and whether Rob Lantz and Robert Lantz are completely different people, I don't want to collapse those two things together. Then I would build machine learning on top of that to say, right, now what's his behavior going to be in the future. But once I have that robust profile built up, I can derive a lot more interesting features with which to apply the machine learning. >> Okay, so you are a Databricks customer and there's also a burgeoning partnership. >> Rob: Yeah, I think that's true. >> So talk to us a little bit about what are some of the frustrations you had before adopting Databricks and maybe why you chose it. >> Yeah, sure. So the frustrations primarily with a traditional Hadoop environment involved having to go from one customer site to another customer site with an incredibly complex technology stack and then do a lot of the cluster management for those customers even after they'd already set it up, because of all the inner workings of Hadoop and that ecosystem. 
Getting our Spark application installed there, we had to penetrate layers and layers of configuration in order to tune it appropriately to get the performance we needed. >> David: Okay, and were you at the keynote this morning? >> I was not, actually. >> Okay, I'm not going to ask you about that then. >> Ah. >> But I am going to ask you a little bit about your wishlist. You've been talking to people maybe in the hallway here, you just got here today but, what do you wish the community would do or develop, what would you like to learn while you're here? >> Learning while I'm here, I've already picked up a lot. So much going on and it's such a fast paced environment, it's really exciting. I think if I had a wishlist, I would want a more robust MLlib, machine learning library. All the things that you can get in traditional scientific computing stacks, moved onto Spark MLlib for easier access on a cluster, would be great. >> I thought several years ago MLlib took over from Mahout as the most active open source community for adding, really, I thought, scale-out machine learning algorithms. If it doesn't have it all now, or maybe all is something you never reach, kind of like Red Queen effect, you know? >> Rob: For sure, for sure. >> What else is attracting these scale-out implementations of the machine learning algorithms? >> Um? >> In other words, what are the platforms? If it's not Spark then... >> I don't think it exists frankly, unless you write your own. I think that would be the way to go. That's the way to go about it now. I think what organizations are having to do with machine learning in a distributed environment is just go with good enough, right. Whereas maybe some of the ensemble methods that are, actually aren't even really cutting edge necessarily, but you can really do a lot of tuning on those things, doing that tuning distributed at scale would be really powerful. 
I read somewhere, and I'm not going to be able to quote exactly where it was but, actually throwing more data at a problem is more valuable than tuning a perfect algorithm frankly. If we could combine the two, I think that would be really powerful. That is, finding the right algorithm and throwing all the data at it would get you a really solid model that would pick up on that signal that underlies any of these phenomena. >> David: Okay well, go ahead George. >> I was going to ask, I think that goes back to, I don't know if it was a Google paper, or one of the Google search quality guys who's a luminary in the machine learning space who says, "data always trumps algorithms." >> I believe that's true and that's true in my experience certainly. >> Once you have this machine learning and once you've perhaps simplified the multi-vendor stack, then what does your solution start looking like in terms of broadening its appeal, because of the lower TCO. And then, perhaps embracing more use cases. >> I don't know that it necessarily embraces more use cases because entity resolution applies so broadly already, but what I would say is it will give us more time to focus on improving the ER itself. That's I think going to be a really, really powerful improvement we can make to Novetta entity analytics as it stands right now. That's going to go into, as we alluded to before, the machine learning as part of the entity resolution. Entity extraction, automated entity extraction from unstructured information, and not just unstructured text but unstructured images and video, could be a really powerful thing. Taking in stuff that isn't tagged and pulling the entities out of that automatically without actually having to have a human in the loop. Pulling every name out, every phone number out, every address out. Go ahead, sorry. 
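The "good enough" ensemble methods Rob mentions, tuned distributed at scale, can be illustrated with bagging, one of the simplest ensembles: fit many weak models on bootstrap resamples of the data and let them vote. This single-machine toy is illustrative only (the stump model and the data are mine, not Novetta's); at scale, each resample fit is an independent job that could run on its own node:

```python
import random
from collections import Counter

def train_stump(sample):
    # A decision stump: the single threshold on x that best fits the sample.
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in sample}):
        acc = sum((x >= t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bag_stumps(data, n_models=25, seed=7):
    # Each model trains on a bootstrap resample of the data; these fits
    # are independent, which is what makes the tuning easy to distribute.
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def predict(stumps, x):
    votes = Counter(x >= t for t in stumps)  # majority vote of the ensemble
    return votes.most_common(1)[0][0]

# Noisy toy data: the true rule is "x > 5", with one mislabeled point.
data = [(x, x > 5) for x in range(11)]
data[3] = (3, True)
stumps = bag_stumps(data)
```

The vote averages out both the label noise and the variance of the individual stumps, which is the "not cutting edge, but lots of tuning headroom" property Rob is pointing at.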
>> This goes back to a couple conversations we've had today where people say data trumps algorithms, even if they don't say it explicitly, so the cloud vendors who are sitting on billions of photos, many of which might have house street addresses and things like that, or faces, how do you make better... How do you extract better tuning for your algorithms from data sets that I assume are smaller than the cloud vendors'? >> They're pretty big. We employ data engineers that are very experienced at tagging that stuff manually. What I would envision would happen is we would apply somebody for a week or two weeks, to go in and tag the data as appropriate. In fact, we have products that go in and do concept tagging already across multiple languages. That's going to be the subject of my talk tomorrow as a matter of fact. But we can tag things manually or with machine assistance and then use that as a training set to go apply to the much larger data set. I'm not so worried about the scale of the data, we already have a lot, a lot of data. I think it's going to be getting that proof set that's already tagged. >> So what you're saying is, it actually sounds kind of important. That actually almost ties into what we hear about Facebook training their messenger bot, where we can't do it purely just on training data so we're going to take some data that needs semi-supervision, and that becomes our new labeled set, our new training data. Then we can run it against this broad, unwashed mass of training data. Is that the strategy? >> Certainly we would get there. We would want to get there, and that's the beauty of what Databricks promises, is that ability to save a lot of the time that we would spend doing the grunt work on cluster management to innovate in that way, and we're really excited about that. >> Alright, we've got just a minute to go here before the break, so I wanted to ask you maybe, the wish list question, I've been asking everybody today, what do you wish you had? 
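The workflow described here, hand-tagging a small proof set and then using it to label the much larger data set, is essentially self-training. A toy one-dimensional sketch follows; it is my own illustration, not Novetta's pipeline, and the nearest-centroid model and confidence margin are assumptions. Only points the current model is confident about get pseudo-labeled, then the model is refit on the grown set:

```python
def centroid(points):
    return sum(points) / len(points)

def self_train(seed, unlabeled, rounds=3, margin=2.0):
    # Grow the labeled set by pseudo-labeling only points far from the
    # decision boundary, then refit the class centroids on the result.
    labeled = dict(seed)                      # point -> class
    for _ in range(rounds):
        c0 = centroid([x for x, y in labeled.items() if y == 0])
        c1 = centroid([x for x, y in labeled.items() if y == 1])
        for x in unlabeled:
            if x in labeled:
                continue
            d0, d1 = abs(x - c0), abs(x - c1)
            if abs(d0 - d1) >= margin:        # confident enough to accept
                labeled[x] = 0 if d0 < d1 else 1
    return labeled

seed = {1.0: 0, 9.0: 1}                       # the small hand-tagged "proof set"
unlabeled = [0.5, 2.0, 4.0, 6.0, 8.5]
labels = self_train(seed, unlabeled)
```

Two hand-tagged points end up labeling five more, which is the leverage Rob is after: a week of an engineer's tagging, amplified across the full data set.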
Whether it's in entity resolution or some other area in the next couple of years for Novetta, what's on your list? >> Well I think that would be the more robust machine learning library, all in Spark, kind of native, so we wouldn't have to deploy that ourselves. Then, I think everything else is there, frankly. We are very excited about the platform and the stack that comes with it. >> Well that's a great ending right there, George do you have any other questions you want to ask? Alright, we're just wrapping up here. Thank you so much, we appreciate you being on the show Rob, and we'll see you out there in the Expo. >> I appreciate it, thank you. >> Alright, thanks so much. >> George: It's good to meet you. >> Thanks. >> Alright, you are watching the CUBE here at Spark Summit 2017, stay tuned, we'll be back with our next guest.

Published Date : Jun 6 2017



Raymie Stata, SAP - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: From San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. >> Welcome back everyone. We are at Big Data Silicon Valley, running in conjunction with Strata + Hadoop World in San Jose. I'm George Gilbert and I'm joined by Raymie Stata, and Raymie was most recently CEO and Founder of Altiscale, a Hadoop-as-a-service vendor, one of the few out there not part of one of the public clouds. And in keeping with all of the great work they've done, they got snapped up by SAP. So, Raymie, since we haven't seen you, I think on The Cube since then, why don't you catch us up with all that, the good work that's gone on between you and SAP since then. >> Sure, so the acquisition closed back in September, so it's been about six months. And it's been a very busy six months. You know, there's just a lot of blocking and tackling that needs to happen. So, you know, getting people on board. Getting new laptops, all that good stuff. But certainly a huge effort for us was to open up a data center in Europe. We've long had demand to have that European presence, both because I think there's a lot of interest over in Europe itself, but also large, multi-national companies based in the US, you know, it's important for them to have that European presence as well. So, it was a natural thing to do as part of SAP, so kind of first order of business was to expand over into Europe. So that was a big exercise. We've actually had some good traction on the sales side, right, so we're getting new customers, larger customers, more demanding customers, which has been a good challenge too. >> So let's pause for a minute on, sort of unpack for folks, what Altiscale offered, the core services. >> Sure. >> That were, you know, here in the US, and now you've extended to Europe. >> Right. So our core platform is kind of Hadoop, Hive, and Spark, you know, as a service in the cloud. And so we would offer HDFS and YARN for Hadoop. Spark and Hive kind of well-integrated. 
And we would offer that as a cloud service. So you would just, you know, get an account, login, you know, store stuff in HDFS, run your Spark programs, and the way we encourage people to think about it is, I think very often vendors have trained folks in the big data space to think about nodes. You know, how many nodes am I going to get? What kind of nodes am I going to get? And the way we really force people to think twice about Hadoop and what Hadoop as a service means is, you know, they don't, why are you asking that? You don't need to know about nodes. Just store stuff, run your jobs. We worry about nodes. And that, you know, once people kind of understood, you know, just how much complexity that takes out of their lives and how that just enables them to truly focus on using these technologies to get business value, rather than operating them. You know, there's that aha moment in the sales cycle, where people say yeah, that's what I want. I want Hadoop as a service. So that's been our value proposition from the beginning. And it's remained quite constant, and even coming into SAP that's not changing, you know, one bit. 
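The "just store stuff, run your jobs, we worry about nodes" contract is, at heart, an interface boundary. The sketch below is a deliberately loose single-machine analogy of that idea, my own illustration and not Altiscale's API: the user supplies only data and job logic, and the service layer picks the parallelism without ever exposing a node count:

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend "HDFS": just named blobs of text the user has stored.
store = {
    "logs/day1.txt": "error warn info error",
    "logs/day2.txt": "info info error",
}

def count_errors(path):
    # The "job": this per-file logic is all the user has to write.
    return store[path].split().count("error")

def run_job(paths):
    # The service side: it alone decides how many workers to use.
    # The user never asks "how many nodes am I going to get?"
    with ThreadPoolExecutor() as pool:  # worker count chosen for us
        return sum(pool.map(count_errors, paths))

total = run_job(store)
```

Swap the thread pool for a YARN cluster and the blob dict for HDFS and the shape of the pitch is the same: the parallelism is an operational detail hidden behind `run_job`.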
It's very much like Athena and Big Query where you just store stuff in tables and you issue queries and you don't worry about how much compute, you know, and managing it. I think, by throwing, you know, Spark into the equation, and YARN more generally, right, we can handle a broader range of these cases. So, for example, you don't have to store data in tables, you can store them into HDFS files which is good for processing log data, for example. And with Spark, for example, you have access to a lot of machine learning algorithms that are a little bit harder to run in the context of, say, Athena. So I think it's the same model, in terms of, it's fully operated for you. But a broader platform in terms of its capabilities. >> Okay, so now let's talk about what SAP brought to the table and how that changed the use cases that were appropriate for Altiscale. You know, starting at the data layer. >> Yeah, so, I think the, certainly the, from the business perspective, SAP brings a large, very engaged customer base that, you know, is eager to embrace, kind of a data-driven mindset and culture and is looking for a partner to help them do that, right. And so that's been great to be in that environment. SAP has a number of additional technologies that we've been integrating into the Altiscale offering. So one of them is Vora, which is kind of an interactive SQL engine, it also has time series capabilities and graph capabilities and search capabilities. So it has a lot of additive capabilities, if you will, to what we have at Altiscale. And it also integrates very deeply into HANA itself. And so we now have that Vora technology available as a service at Altiscale. 
>> Let me make sure, so that everyone understands, and so I understand too, is that so you can issue queries from HANA and they can, you know, beyond just simple SQL queries, they can handle the time series, and predictive analytics, and access data sort of seamlessly that's in Hadoop, or can it go the other way as well? >> It's both ways. So you can, you know, from HANA you can essentially federate out into Vora. And through that access data that's in a Hadoop cluster. But it's also the other way around. A lot of times there's an analyst who really lives in the big data world, right, they're in the Hadoop world, but they want to join in data that's sitting in a HANA database, you know. Might be dimensions in a warehouse or, you know, customer details even in a transactional system. And so, you know, that Hadoop-based analyst now has access to data that's out in those HANA databases. >> Do you have some lighthouse accounts that are working with this already? >> Yes, we do. (laughter) >> Yes we do, okay. I guess that was the diplomatic way of saying yes. But no comment. Alright, so tell us more about SAP's big data stack today and how that might evolve. >> Yeah, of course now, especially that now we've got the Spark, Hadoop, Hive offering that we have. And then Vora sitting on top of that. There's an offering called Predictive Analytics, which is Spark-based predictive analytics. >> Is that something that came from you, or is that, >> That's an SAP thing, so this is what's been great about the acquisition is that SAP does have a lot of technologies that we can now integrate. And it brings new capabilities to our customer base. So those three are kind of pretty key. And then there's something called Data Services as well, which allows us to move data easily in and out of, you know, HANA and other data stores. 
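The two-way federation Raymie describes, a Hadoop-side analyst joining raw events against dimensions sitting in a HANA database, reduces to a join across two stores. The toy Python below shows only the shape of that query; the table names and fields are invented, and a real Vora-style engine would push SQL down to each store rather than loop in application code:

```python
# Two "stores": a warehouse dimension table (the HANA side) and raw
# event records (the Hadoop side). Federation = one query across both.
hana_customers = {42: {"name": "Acme", "segment": "enterprise"}}
hadoop_events = [
    {"customer_id": 42, "action": "login"},
    {"customer_id": 42, "action": "purchase"},
    {"customer_id": 7,  "action": "login"},   # no matching dimension row
]

def federated_join(events, customers):
    # Enrich each big-data event with its warehouse dimensions,
    # dropping events whose key has no dimension row (an inner join).
    out = []
    for e in events:
        dim = customers.get(e["customer_id"])
        if dim is not None:
            out.append({**e, **dim})
    return out

rows = federated_join(hadoop_events, hana_customers)
```

The direction can flip, too: the same join run from the HANA side would treat the Hadoop cluster as the remote table, which is the "it's both ways" point above.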
>> Is it, is this ability to federate queries between Hadoop and HANA and then migration of the data between the stores, does that, has that changed the economics of how much data people, SAP customers, maintain, and sort of what types of apps they can build on it, now that it's economically feasible to store a lot more data. >> Well, yes and no. I think in the context of Altiscale, both before and after the acquisition, very often there's what you might call a big data source, right. It could be your web logs, it could be some IoT-generated log data, it could be social media streams. You know, this is data that, you know, doesn't have a lot of structure coming in. It's fairly voluminous. It doesn't, very naturally, go into a SQL database, and that's kind of the sweet spot for the big data technologies like Hadoop and Spark. So, that data comes into your big data environment. You can transform it, you can do some data quality on it. And then you can eventually stage it out into something like a HANA data mart, you know, to make it available for reporting. But obviously there's stuff that you can do on the larger dataset in Hadoop as well. So, in a way, yes, you can now tame, if you will, those huge data sources that, you know, weren't practical to put into a HANA database. >> If you were to prioritize, in the context of, sort of, the applications SAP focuses on, would the highest-priority use case be IoT-related stuff, where, you know, it was just prohibitive to put it in HANA since it's mostly in memory. But, you know, SAP is exposed to tons of that type of data, which would seem to most naturally have an affinity to Altiscale. >> Yeah, so, I mean, IoT is a big initiative. And is a great use case for big data. But, you know, the financial services industry, as another example, is fairly far down the path using Hadoop technologies for many different use cases. And so, that's also an opportunity for us. 
>> So, let me pop back up, you know, before we have to wrap. With Altiscale as part of the SAP portfolio, have the two companies sort of gone to customers with a more, with more transformational options, that, you know, you'll sell together? >> Yeah, we have. In fact, Altiscale actually is no longer called Altiscale, right? We're part of a portfolio of products, you know, known as the SAP Cloud Platform. So, you know, under the cloud platform we're the big data services. The SAP Cloud Platform is all about business transformation. And business innovation. And so, we bring to that portfolio the ability to now bring the types of data sources that I've just discussed, you know, to bear on these transformative efforts. And so, you know, we fit into some momentum SAP already has, right, to help companies drive change. >> Okay. So, along those lines, which might be, I mean, we know the financial services has done a lot of work with, and I guess telcos as well, what are some of the other verticals that look like they're primed to fall, you know, with this type of transformational network? >> So you mentioned one, which I kind of call manufacturing, right, and there tends to be two kind of different use cases there. One of them I call kind of the shop floor thing. Where you're collecting a lot of sensor data, you know, out of a manufacturing facility with the goal of increasing yield. So you've got the shop floor. And then you've got the, I think, more commonly discussed measuring stuff out in the field. You've got a product, you know, out in the field. Bringing the telemetry back. Doing things like predictive maintenance. So, I think manufacturing is a big sector ready to go for big data. And healthcare is another one. You know, people pulling together electronic medical records, you know, trying to combine that with clinical outcomes, and I think the big focus there is to drive towards, kind of, outcome-based models, even on the payment side. 
And big data is really valuable to drive and assess, you know, kind of outcomes in an aggregate way. >> Okay. We're going to have to leave it on that note. But we will tune back in at I guess Sapphire or TechEd, whichever of the SAP shows is coming up next to get an update. >> Sapphire's next. Then TechEd. >> Okay. With that, this is George Gilbert, and Raymie Stata. We will be back in few moments with another segment. We're here at Big Data Silicon Valley. Running in conjunction with Strata + Hadoop World. Stay tuned, we'll be right back.

Published Date : Mar 15 2017



Claudia Perlich, Dstillery - Women in Data Science 2017 - #WiDS2017 - #theCUBE


 

>> Narrator: Live from Stanford University, it's theCUBE covering the Women in Data Science Conference 2017. >> Hi, welcome back to theCUBE, I'm Lisa Martin and we are live at Stanford University at the second annual Women in Data Science one-day tech conference. We are joined by one of the speakers for the event today, Claudia Perlich, the Chief Scientist at Dstillery. Claudia, welcome to theCUBE. >> Claudia: Thank you so much for having me. It's exciting. >> It is exciting! It's great to have you here. You are quite the prolific author, you've won data mining competitions and awards, you speak at conferences all around the world. Talk to us about what you're currently doing as the Chief Scientist for Dstillery. Who's Dstillery? What's the Chief Scientist's role, and how are you really leveraging data and science to be a change agent for your clients? >> I joined Dstillery when it was still called Media6Degrees as a very small startup in the New York ad tech space. It was very exciting. I came out of the IBM Watson Research Lab and really found this a new challenging application area for my skills. What does a Chief Scientist do? It's a good question, I think it actually took the CEO about two years to finally give me a job description, (laughter) and the conclusion at that point was something like, okay there is technical contribution, so I sit down and actually code things and I build prototypes and I play around with data. I also am referred to as Intellectual Leadership, so I work a lot with the teams just kind of scoping problems, brainstorming what may work or doesn't, and finally, that's what I'm here for today, is what they consider an Ambassador for the company, so being the face to talk about the more scientific aspects of what's happening now in ad tech, which brings me to what we actually do, right. 
One of the things that happened over the recent past in advertising is that it became an incredible playground for data science, because the available data is incomparable to many other fields that I have seen. And so Dstillery was a pioneer in that space, starting to look initially at social data, things that people shared, but over the years it has really grown into getting a sense of the digital footprint of what people do. And our primary business model was to bring this to marketers to help them, on a much more individualized basis, identify who their current as well as future customers are. Really get a very different understanding than these broad middle-aged-soccer-mom kind of categories, to honor the individual tastes and preferences and actions that really truly reflect the variety of what people do. I'm many things, as you mentioned: I publish, I'm a mom, and I have a horse, so there are many different parts to me. I don't think any single description fully captures that, and we felt that advertising is a great space to explore how you can translate that and help both sides: the people that are being interacted with, as well as the brands that want to make sure they reach the right individuals. >> Lisa: Very interesting. Well, as the buyer's journey has changed to mostly online, >> Exactly. >> You're right, it's an incredibly rich opportunity for companies to harness more of that behavioral information and probably see things that they wouldn't have predicted. We were talking to Walmart Labs earlier, and one of the interesting insights that they shared was that, especially in Silicon Valley where people spend too much time in the car commuting-- (laughter) You have a long commute as well, by train. >> Yes.
>> And you'd think that people would want, I want my groceries to show up on my doorstep, I don't want to have to go into the store, and they actually found the opposite: people in such a cosmopolitan area as Silicon Valley actually want to go into the store and pick up-- >> Claudia: Yep. >> Their groceries, so it's very interesting how the data actually can sometimes really change. It's really the scientific method on a very different scale. >> Claudia: Much smaller. >> But really using the behavioral insights to change the shopping experience, but also to change the experience of companies that are looking to sell their products. >> I think that the last part of the puzzle is, the question is no longer what is the right video for the Super Bowl, I mean we have the Super Bowl coming up, right? >> Lisa: Right. Right. >> They did a study on when people pay attention to the Super Bowl. You can actually tell, cuz you know what people don't do when they pay attention to the Super Bowl? >> Lisa: Mm-hmm. >> They're not playing around with their phones. They're not playing-- >> Lisa: Of course. >> Candy Crush and all these things, so in the ad tech environment, we actually see that the demand for digital ads goes down when people really focus on what's going on on the big screen. But that was a diversion ... >> Lisa: It's very interesting (laughter) though, cuz it's something that's very tangible. It's a real-world application. Question for you about data science and your background. You mentioned that you worked with IBM Watson. Forbes has just said that Data Scientist is the best job to apply for in 2017. What is your vision? Talk to us about your team, how you've built it up, and how you're using big data and science to really optimize the products that you deliver to your customers. >> Data Science is really many, many different flavors, and in some sense I became a Data Scientist long before the term really existed.
Back then I was just a particularly weird kind of geek. (laughter) You know, all of a sudden it's-- >> Now it has a name. (laughter) >> Right, and the reputation of being fun, and so you see really many different application areas demanding very different skillsets. The focus of our company has always been around: can we predict what people are going to do? That was always the primary focus, and now you see that it's very nicely reflected at the event too. All of a sudden, communicating this becomes a much bigger part of the puzzle, where people say, "Okay, I realize that you're really good at predicting, but can you tell me why, what is it, these nuggets of insight-- >> Interpretation, right. >> "That you mentioned. Can you visualize what's going on?" And so we grew the team, initially from a small group of really focused machine learning and predictive skills, over to the broader: can you communicate it? Can you explain to the customers, the brands, what happened here? Can you visualize data? That's kind of the broader shift, and I think the most challenging part, where I can tell there is a bit of a shortcoming in skillset in the broader picture: we have a lot of people who are really good today at analyzing data and coding, so that part has caught up. There are so many Data Science programs. What I'm still looking for is, how do you bring management and corporate culture to the place where they can truly take advantage of it?
>> Before we went live here, you mentioned that you teach at NYU, but you're also teaching Data Science to the business folks. I would love for you to expand a little bit more upon that, and how are you helping to educate these people to understand the impact? Cuz that's really a change agent within the company. That's a cultural change, which is really challenging-- >> Claudia: Very much so. >> Lisa: What's their perception? What's their interest in understanding how this can really drive value? >> What you see, I've been teaching this course for almost six years now, and originally it was really kind of the hardcore coders, who also happened to get a PhD on the side, who came to the course. Now you increasingly have a very broad collection of business-minded people. I typically teach in the part-time program, meaning they all have day jobs, and they've realized in their day jobs: I need this. I need that. That skill. That knowledge. We're trying to get to the point where, without having to teach them Python and R or whatever the new toys are: How can you identify opportunities? How do you know which of the many different flavors of Data Science, from prediction to visualization to just analyzing historical data to maybe even causality, which of these tools is appropriate for the task at hand? And then being able to evaluate whether the level of support that a machine can bring is even sufficient. Because just because you can analyze data doesn't mean that the reliability of the model is truly sufficient to support a downstream business project. Being able to really understand those trade-offs without necessarily being able to sit down and code it yourself, that knowledge has become a lot more valuable, and I really enjoy the brainstorming when we're just trying to scope a project, when they come with problems from their day jobs and say, "Hey, we're trying to do that." And saying, "Are you really trying to do that?"
"What are you actually able to execute? "What kind of decisions can you make?" This is almost like the brainstorming in my own company now brought out to much broader people working in hospitals, people working in banking, so I get exposed to all of these kinds of problems said and that makes it really exciting for me. >> Lisa: Interesting. When Dstillery is talking to customer or prospective customers, is this now something that you're finding is a board level conversation within businesses? >> Claudia: No, I never get bored of that, so there is a part of the business that is pretty well understood and executed. You come to us, you give us money, and we will execute a digital campaign, either on mobile phones, on video, and you tell me what it is that you want me to optimize for. Do you want people to click on your ad? Please don't say yes, that's the worst possible things you may ask me to do-- (laughter) But let's talk about what you're going to measure, whether you want people to show up in your store, whether you really care about signing up for a test drive, and then the system automatically will build all the models that then do all the real-time bidding. Advertising, I'm not sure how many people are aware, as your New York Times page loads, every single ad slot on that side is sold in a real-time auction. About 50 billion times a day, we receive a request whether we want to bid on the opportunity to show somebody an ad. >> Lisa: Wow. >> So that piece, I can't make 50 billion decisions a day. >> Lisa: Right. >> It is entirely automated. There's this fully automated machine learning that just serves that purpose. What makes it interesting for me now that ... Now this is kind of standard fare if you want to move over and is more interesting parts. Well, can you for instance predict which of the 15 different creatives I have for Jobani, should I show you? >> Lisa: Mm,hmm. 
>> The one with the woman running, or the one with the kid opening it, so there are nuances to it, and exploring these new challenges, or going into totally new areas, talking about, for instance, churn prediction. I know an awful lot about people; I can predict very many things, and a lot of them go far beyond just how you interact with ads, which is almost the most boring part. We can see people researching diabetes. We can provide snapshots to pharma, telling them, here's really where we see a rise of activity on a certain topic, and maybe this is something of interest to understand which population is driving those changes. These kinds of conversations really make it exciting for me, to bring the knowledge of what I see back to many different constituents and see what kinds of problems we can possibly support with that. >> Lisa: It's interesting too. It sounds like you're not just providing ad technology to customers-- >> Claudia: Yeah. >> You're really helping them understand where they should be looking to drive value for their businesses. >> Claudia: That's really been the focus increasingly, and I enjoy that a lot. >> Lisa: I can imagine that's quite interesting. I want to ask you a little bit, before we wrap up here, about your talk today. I was looking at the title of your abstract: "Beware what you ask for: The secret life of predictive models". (laughter) Talk to us about some of the lessons you've learned when things have gone a little bit, huh, I didn't expect that. >> I'm a huge fan of predictive modeling. I love the capabilities and what this technology can do. That being said, it's a collection of aha moments where you're looking at this and think, this doesn't really smell right. To give you an example from ad tech, and I alluded to this: when people say, "Okay, we want a high click-through rate." Yes, that means I have to predict who will click on an ad.
And then you realize that no matter what the campaign, no matter what the product, the model always chooses to show the ad on the flashlight app. Yeah, because that's when people fumble in the dark. The model is really, really good at predicting when people are likely to click on an ad, except that's really not what you intended-- >> Right. >> When you asked me to do that. >> Right. >> So it's almost that the better and more powerful the models are, the more they move off into a sidetracked direction you didn't even know existed. Something similar happened with one of these competitions that I won, for Siemens Medical, where you had to identify, in MRI images of breasts, which of these regions are most likely benign and which ones have cancer. We built models, we did really, really well, all was good. Until we realized that the patient ID was by far the most predictive feature. Now, this really shouldn't happen. Your social security number shouldn't be able to predict-- >> Lisa: Right. >> Anything, really. It wasn't the social security number, but when we started looking a little bit deeper, we realized what had happened: the data set was a sample from different sources, and one was a treatment center and one was a screening center, and they had certain ranges of patient IDs, so the model had learned where the machine stood, not what the image actually contained about the probability of having cancer. Whoever assembled the data set possibly didn't think about the downstream effect this can have on modeling-- >> Right. >> Which brings us back to the data science skillset as really comprehensive, starting all the way from the beginning, where the data is collected, all the way down to being extremely skeptical about your own work and really making sure that it truly reflects what you want it to do. You asked earlier what makes really good Data Scientists.
The intuition to feel when something is wrong, and to be able to pinpoint and trace it back, with the curiosity of really needing to understand everything about the whole process. >> Lisa: And also not only being able to communicate it, but probably being willing to fail. >> Claudia: That is the number one requirement, really. If you want to have a data-driven culture, you have to embrace failure, because otherwise you will fail. >> Lisa: How do you find the reception (laughter) to that fact by your business students? Is that something that they're used to hearing, or does it sound like a foreign language to them? >> I think the majority of them are in junior enough positions that they-- >> Lisa: Okay. >> Truly embrace that, and if at all, they have come across the fact that they weren't allowed to fail as often as they had wanted to. I think once you go into the higher levels of conversation, and we see that a lot in the ad tech industry where you have incentive problems. We see a lot of fraud being targeted. At the end of the day, the ad agency doesn't want to confess to the client that, yeah, they just wasted five million dollars-- >> Lisa: Right. >> Of ad spend on bots, and even the CMO might not feel very comfortable confessing that to the CO-- >> Right. >> Claudia: Being willing to truly face up to the truth that the data sometimes throws in your face, that can be quite difficult for a company or even an industry. >> Lisa: Yes, it can. It's quite revolutionary. As is this event. So, Claudia Perlich, we thank you so much for joining us-- >> My pleasure. >> Lisa: On theCUBE today, and we know that you're going to be mentoring a lot of people that are here. We thank you for watching theCUBE. We are live at Stanford University from the Women in Data Science Conference. I am Lisa Martin, and we'll be right back. (upbeat music)
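The patient-ID leakage Claudia recounts above is easy to reproduce on synthetic data: when two data sources have disjoint ID ranges and different label rates, the identifier alone becomes "predictive". The following is a minimal, entirely hypothetical sketch of the kind of audit that catches it; the ID ranges and label rates are invented for illustration and have nothing to do with the actual Siemens Medical data:

```python
import random

random.seed(0)

# Synthetic version of the two-source data set: a screening center
# (IDs 0-4999, few positives) and a treatment center (IDs 5000-9999,
# many positives). Each row is (patient_id, label).
rows = [(pid, 1 if random.random() < 0.05 else 0) for pid in range(5000)]
rows += [(pid, 1 if random.random() < 0.60 else 0) for pid in range(5000, 10000)]

def id_threshold_accuracy(rows, threshold):
    """Accuracy of predicting 'positive' purely from patient_id >= threshold.
    An identifier should carry no signal; accuracy well above the
    majority-class base rate means the ID encodes where the data came
    from, i.e. leakage."""
    correct = sum((pid >= threshold) == bool(label) for pid, label in rows)
    return correct / len(rows)

# Majority-class base rate: what you'd get by always predicting the
# more common label, with no features at all.
positives = sum(label for _, label in rows)
base_rate = max(positives, len(rows) - positives) / len(rows)

leak_score = id_threshold_accuracy(rows, threshold=5000)
# leak_score comfortably beats base_rate here, flagging the ID feature.
```

Running a check like this on every "feature" that should be meaningless (IDs, timestamps, file names) is a cheap way to build the skepticism she describes into the pipeline itself.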

Published Date: Feb 3, 2017

SUMMARY :

Live from Stanford University at the second annual Women in Data Science conference, Lisa Martin interviews Claudia Perlich, Chief Scientist at Dstillery. Perlich describes her path from the IBM Watson Research Lab to ad tech, how Dstillery uses individual behavioral data and fully automated real-time bidding (roughly 50 billion bid requests a day) to predict consumer behavior, and how she teaches data science to business-minded students at NYU. She closes with cautionary tales from her talk "Beware what you ask for: The secret life of predictive models", including click-prediction models that favor flashlight apps and a medical imaging competition where patient IDs leaked the data source, and argues that a data-driven culture has to embrace failure.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Claudia Perlich | PERSON | 0.99+
Lisa Martin | PERSON | 0.99+
Lisa | PERSON | 0.99+
Claudia | PERSON | 0.99+
2017 | DATE | 0.99+
Candy Crush | TITLE | 0.99+
Silicon Valley | LOCATION | 0.99+
Siemens Medical | ORGANIZATION | 0.99+
Dstillery | ORGANIZATION | 0.99+
New York | LOCATION | 0.99+
Super Bowl | EVENT | 0.99+
Super Bowl | EVENT | 0.99+
Walmart Labs | ORGANIZATION | 0.99+
IBM Watson Research Lab | ORGANIZATION | 0.99+
Jobani | PERSON | 0.99+
five million dollars | QUANTITY | 0.99+
both models | QUANTITY | 0.99+
both sides | QUANTITY | 0.99+
single | QUANTITY | 0.99+
today | DATE | 0.99+
15 different creatives | QUANTITY | 0.98+
One | QUANTITY | 0.97+
#WiDS2017 | EVENT | 0.97+
about two years | QUANTITY | 0.97+
ARM | ORGANIZATION | 0.97+
Women in Data Science Conference 2017 | EVENT | 0.97+
Women in Data Science Conference | EVENT | 0.97+
Women in Data Science | EVENT | 0.96+
one | QUANTITY | 0.96+
Media6Degrees | ORGANIZATION | 0.96+
About 50 billion times a day | QUANTITY | 0.95+
Forbes | ORGANIZATION | 0.95+
Stanford University | ORGANIZATION | 0.93+
50 billion decisions a day | QUANTITY | 0.92+
Women in Data Science 2017 | EVENT | 0.92+
Beware what you ask for: The secret life of predictive models | TITLE | 0.9+
IBM Watson | ORGANIZATION | 0.89+
theCUBE | ORGANIZATION | 0.89+
almost six years | QUANTITY | 0.88+
one day | QUANTITY | 0.86+
Stanford University | ORGANIZATION | 0.84+
NYU | ORGANIZATION | 0.82+
single ad | QUANTITY | 0.72+
python | ORGANIZATION | 0.66+
second annual | QUANTITY | 0.62+
one of the speakers | QUANTITY | 0.61+
New York Times | TITLE | 0.6+
dozen | QUANTITY | 0.56+