
The Power to Predict Who Will Click, Buy, Lie or Die:

Eric Siegel, Predictive Analytics World - #SparkSummit - #theCUBE


 

>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

>> Welcome back to theCUBE. You are watching coverage of Spark Summit 2017. It's day two, and we've got so many new guests to talk to today. We already learned a lot, right George?

>> Yeah, I mean we had some, I guess, pretty high bandwidth conversations.

>> Yes, well I expect we're going to have another one here too, because the person we have is the founder of Predictive Analytics World. It's Eric Siegel. Eric, welcome to the show.

>> Hey, thanks Dave, thanks George. You go by Dave or David?

>> Dave: Oh, you can call me sir, and that would be...

>> I was calling you, should I, can I bow?

>> Oh no, we are bowing to you. You're the author of the book Predictive Analytics, and I love the subtitle: The Power to Predict Who Will Click, Buy, Lie or Die.

>> And that sums up the industry, right?

>> Right, so if people are new to the industry, that's sort of an informal definition of predictive analytics, basically also known as machine learning, where you're trying to make predictions for each individual, whether it's a customer for marketing, a suspect for fraud or law enforcement, a voter for political campaigning, or a patient for healthcare. So in general it's on that level: a prediction for each individual. How does data help make those predictions? And then you can only imagine just how many ways predicting on that level helps organizations improve all their activities.

>> Well, we know you were on the keynote stage this morning. Could you maybe summarize for theCUBE audience a couple of the top themes you were talking about?

>> Yeah, I covered two advanced topics that I wanted to make sure this pretty technical audience was aware of, because a lot of people aren't. One is called uplift modeling, and that's optimizing for persuasion, for things like marketing, and also for healthcare, actually, and for political campaigning. When you do predictive analytics for targeting marketing, the traditional approach is: let's predict, will this person buy if I contact them? Because if they're likely to buy, well, maybe it's a good idea to spend the two dollars to send them the brochure, the marketing treatment, right? But there's actually a slightly different question that would drive even better decisions, which is not "will this person buy" but "would contacting them, sending them the brochure, influence them to buy? Will it increase the chance that we get that positive outcome?" That's a different question, and it doesn't correspond to standard predictive modeling or machine learning methods. So uplift modeling, also known as net lift modeling or persuasion modeling, is a way to create a predictive model like any other, except that its target is: is it a good idea to contact this person, because doing so will increase the chances of a positive outcome? So that's the first of the two. And I crammed all of this into 20 minutes.
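To make that concrete, here is a minimal sketch of one common way to build an uplift model, the two-model ("T-learner") approach: fit one response model on the people who were contacted, another on those who weren't, and score each person on the difference. This is an illustrative choice rather than the specific method from the keynote; the data is synthetic, and scikit-learn, pandas, and NumPy are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical campaign history: who was contacted (treated) and who bought.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "past_purchases": rng.poisson(2, n),
    "treated": rng.integers(0, 2, n),
})
# Synthetic ground truth: contact persuades younger, engaged customers a bit.
p_buy = 0.05 + 0.02 * df["past_purchases"] + 0.03 * df["treated"] * (df["age"] < 40)
df["bought"] = rng.random(n) < p_buy.clip(0, 1)

features = ["age", "past_purchases"]
treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]

# Two response models: one learned from contacted customers, one from the rest.
model_t = GradientBoostingClassifier().fit(treated[features], treated["bought"])
model_c = GradientBoostingClassifier().fit(control[features], control["bought"])

# Uplift = P(buy | contacted) - P(buy | not contacted): contact the top scorers.
df["uplift"] = (model_t.predict_proba(df[features])[:, 1]
                - model_c.predict_proba(df[features])[:, 1])
print(df.sort_values("uplift", ascending=False).head())
```

Note that the model's target is the incremental effect of the contact, not the purchase itself, which is exactly the distinction Siegel draws.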
The other one was a little more commonly known, but I think people would do well to revisit it, and it's called p-hacking, or vast search, where you can be fooled by randomness in data relatively easily. In the era of big data, there is this all-too-common pitfall where you find a predictive insight in the data, and it turns out it was actually just a random perturbation. How do you know the difference?

>> Dave: Fake news, right?

>> Okay, fake news, except that in this case it was generated by a computer, right? And then there's a statistical test that makes it look like it's actually statistically significant, so we give it credibility. To avert that, you have to compensate for the fact that you're trying lots, that you're evaluating many different predictive insights, or hypotheses, whatever you want to call them, and make sure that for the one you end up believing, you've checked the probability that it wasn't just random luck. That pitfall is what's known as p-hacking.
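The "vast search" pitfall is easy to demonstrate with a quick simulation: test enough random candidate predictors against pure noise and a few will look statistically significant. This sketch assumes NumPy and SciPy, and uses a Bonferroni correction, one standard way to compensate for the number of hypotheses tried, though not a method the talk specifically prescribes.

```python
import numpy as np
from scipy.stats import pearsonr

# Pure noise: the outcome is random, so no candidate is truly predictive.
rng = np.random.default_rng(1)
n_rows, n_candidates = 500, 200
outcome = rng.normal(size=n_rows)
candidates = rng.normal(size=(n_candidates, n_rows))

p_values = np.array([pearsonr(c, outcome)[1] for c in candidates])

naive_hits = np.sum(p_values < 0.05)                      # ~10 false "insights"
bonferroni_hits = np.sum(p_values < 0.05 / n_candidates)  # almost always 0
print(f"naively 'significant': {naive_hits}, after Bonferroni: {bonferroni_hits}")
```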
>> Alright, so uplift modeling and p-hacking. George, do you want to drill into those a little bit?

>> Yeah, I want to start from maybe the vocabulary of our audience, where they'd say that uplift modeling goes beyond prediction. And even for the second one, p-hacking, is that where you're essentially playing with the parameters of the model to find the difference between correlation and causation, and going from prediction to prescription?

>> It's not about causation, actually. So correlation is what you get when you find a predictive insight, or some component of a predictive model, where you see these things connected, and therefore one is predictive of the other. Now, the fact that that does not entail causation is a really good point to remind people of. But even before you address that question, the first question is: is this correlation actually legit? Is there really a correlation between these things? Is this an actual finding? Or did it just happen to hold in this particular limited sample of data that I have access to at the moment, right? So is it a real link or correlation in the first place, before you even start asking any question about causality? And it is related to what you alluded to with regard to tuning parameters, because it's closely related to this issue of overfitting. People who do predictive modeling are very familiar with overfitting. The standard practice, which all tools and implementations of machine learning and predictive modeling follow, is to hold aside an evaluation set, called the test set, so you don't get to cheat. You create a predictive model, it learns from the data, it does the number crunching, it's mostly automated, right, and it comes out with this beautiful model that does well predicting. And then you evaluate it, you assess it, over this held-aside... oh, my thing's falling off here.

>> Dave: Just a second, on your...

>> So then you evaluate it on this held-aside set. It was quarantined, so you didn't get to cheat; you didn't get to look at it when you were creating the model. So it serves as an objective performance measure. The problem is, and here is the huge irony, with the predictive insights we get from data: there was one famous one that was broadcast too loudly, because it's not nearly as credible as people first thought, which is that an orange used car is a better one to buy, because it's less likely to be a lemon. That's what it looked like in this one data set. The problem is that when you have a single insight, it's relatively simple, just using the car's color to make the prediction. A predictive model is much more complex and deals with lots of other attributes, not just the color: for example, the make, year, and model, everything on that individual car or individual person. You can imagine all the attributes; that's the point of the modeling process, the learning process, how do you consider multiple things. If it's just a really simple thing based only on the car's color, then many of even the most advanced data science practitioners kind of forget that there's still the potential to effectively overfit, that you might have found something that doesn't apply in general, that only applies over this particular set of data. So that's where the trap falls, and they don't necessarily hold themselves to the high standard of having this held-aside test set. So it's kind of an ironic thing: the findings most likely to make the headlines, like orange cars, are simpler and easier to understand, but it's less well understood that they could be wrong.
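Here is a toy version of that held-aside discipline: "discover" the safest car color on one half of a synthetic data set, then check it on the quarantined half before believing it. The colors carry no real signal in this data, so the test half exposes the in-sample "insight" as luck. Assumes pandas, NumPy, and scikit-learn; all numbers are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
cars = pd.DataFrame({
    "color": rng.choice(["orange", "white", "black", "red", "silver"], 10_000),
    "lemon": rng.random(10_000) < 0.10,  # 10% lemons regardless of color
})
train, test = train_test_split(cars, test_size=0.5, random_state=0)

# "Discover" the insight on the training half...
rates = train.groupby("color")["lemon"].mean()
best = rates.idxmin()
print(f"train: {best} cars look safest ({rates[best]:.1%} lemons)")

# ...then check it on the quarantined half before trusting it.
held_out = test.loc[test["color"] == best, "lemon"].mean()
print(f"test:  {best} cars at {held_out:.1%} vs. {test['lemon'].mean():.1%} overall")
```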
>> You know, keying off that, that's really interesting, because we've been hearing for years that what's made deep learning especially relevant over the last few years is huge compute up in the cloud and huge data sets.

>> Yeah.

>> But we're also starting to hear about methods of generating a sort of synthetic data, so that if you don't have, I don't know what the term is, organic training data and test data, we're getting to the point where we can do high-quality models with less.

>> Yes, less of that training data. And did you...

>> Tell us.

>> Did you interview the keynote speaker from Stanford about that?

>> No, I only saw part of his...

>> Yeah, his speech yesterday. That's an area that I'm relatively new to, but it sounds extremely important, because that is the bottleneck. He called it, if data's the new oil, he's calling it the new new oil, which is more specific than data: it's training data. So all of the machine learning or predictive modeling methods of which we speak are, in most cases, what's called supervised learning. The thing that makes it supervised is that you have a bunch of examples where you already know the answer. So if you're trying to figure out whether a picture is of a cat or of a dog, that means you need a whole bunch of data from which to learn, the training data, where you've already got it labeled, where you already know the correct answer. In many business applications, just because of history, you know who did or didn't respond to your marketing, and you know who did or did not turn out to be fraudulent. History is experience from which to learn. It's in the data, so you do have that label, yes or no; you already know the answer, so you don't need to predict for them, it's in the past, but you use it as training data. So we have that in many cases. But for something like classifying an image, where we're trying to figure out whether there's a picture of a cat somewhere in the image, or whatever all these big image classification problems are, you often do need a manual effort to label the data, to have the positive and negative examples. That's what's called training data, the learning data. There's definitely a bottleneck, so there's value in anything that can be done to avert that bottleneck, to decrease the amount we need, or to find ways to make rough training data that can serve as a building block for the modeling process, that kind of thing. That's not my area of expertise, but it sounds really intriguing.

>> What about, and this may be further out on the horizon, but one thing we are hearing about is the extreme shortage of data scientists, who need to be teamed up with domain experts to figure out the knobs, the variables, to create these elaborate models. We're told that even if you're doing the traditional statistical machine learning models, eventually deep learning can help us identify the features, or the variables, just the way it sort of identifies, you know, ears and whiskers and a nose, and then figures out from those the cat. Is that something that's in the near term or the medium term, in terms of helping to augment what the data scientist does?

>> It's in the near term, and that's why everyone's excited about deep learning right now. Basically, the reason we built these machines called computers is that they automate stuff. Pretty much anything that you can think of and define well, you can program, and then you've got a machine that does it. Of course, one of the things we wanted them to do is to learn from data. It's literally very analogous to what it means for a human to learn: you've got a limited number of examples, and you're trying to draw generalizations from them. It becomes a level more complex when you go to bigger-scale problems, where the thing you're classifying isn't just a customer and all the things you know about that customer, are they likely to commit fraud, yes or no, but an image. An image is worth a thousand words, and maybe literally more than a thousand words of data if it's high resolution. So how do you process that? Well, there's all sorts of research: we can hand-define the thing that tries to find arcs and circles and edges and that kind of thing, or we can, once again, let that be automatic, let the computer do it. Deep learning is a way to allow that. Spark is a way to make it operate quickly, but there's another level of scale besides speed. That level of scale is how complex a task you can leave up to the automaton to do by itself. That's what deep learning does: it scales in that respect. It has the ability to automate more layers of that complexity, as far as finding those kinds of domain-specific features in images.

>> Okay, but I'm thinking not just the, help me figure out speech-to-text and natural language understanding, or classify...

>> Anything with a signal, where it's a high-bandwidth amount of data coming in that you want to classify.

>> Okay, so does that extend to, I'm building a very elaborate predictive model, not on "is there a cat in the video or in the picture" so much as, I guess you called it, "is there an uplift potential, and how big is that potential," in the context of making a sale on an eCommerce site?

>> So what you just tapped into is that when you go to marketing and many other business applications, you don't actually need high accuracy. What you have to do is make a prediction that's better than guessing. So for example, if I get a 1% response rate to my marketing campaign, but I can find a pocket with a 3% response rate, it may very much be rocket science to learn from the data how to define that specific sub-segment with the higher response rate, or whatever it is. But the 3% isn't "I have high confidence this person's definitely going to buy"; it's still just 3%. Yet that difference can make a huge difference, and can improve the bottom line of marketing by a factor of five, that kind of thing. It's not necessarily about accuracy.
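A back-of-the-envelope sketch of why that 1% versus 3% difference matters so much. The two-dollar contact cost echoes the brochure example from earlier in the conversation; the per-sale profit is a hypothetical figure, chosen here so the improvement works out to the factor of five Siegel mentions.

```python
COST_PER_CONTACT = 2.00   # the two-dollar brochure
PROFIT_PER_SALE = 400.00  # hypothetical margin on each conversion

def profit_per_1000(response_rate: float) -> float:
    """Expected profit from contacting 1,000 people at a given response rate."""
    return 1000 * (response_rate * PROFIT_PER_SALE - COST_PER_CONTACT)

print(f"${profit_per_1000(0.01):,.0f}")  # blanket mailing at 1%  ->  $2,000
print(f"${profit_per_1000(0.03):,.0f}")  # targeted pocket at 3%  -> $10,000
```

The model never has to be "accurate" in the usual sense: moving from 1% to 3% still leaves 97% of contacted people not buying, yet it quintuples the campaign's profit.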
>> If you've got an image and you need to know whether there's a picture of a car, or whether a traffic light somewhere in the image is green or red, then for certain application areas, self-driving cars and what have you, it does need to be accurate, right? But maybe there's more potential for it to be accurate there, because there's more predictability inherent to the problem. I can predict that there's a traffic light showing a green light somewhere in an image because there's enough labeled data, and the nature of the problem is more tractable: it's not as challenging to find where the traffic light is, and then which color it is. You need it to scale, to reach that level of classification performance, in terms of accuracy or whatever measure you use, for certain applications.

>> Are you seeing new methodologies, like reinforcement learning, or deep learning where the models are adversarial, making big advances in terms of what they can learn without a lot of supervision? Like the ones where...

>> It's more self-learning and unsupervised.

>> Sort of, glue yourself onto this video game screen, we'll give you control of the steering wheel, and you figure out how to win.

>> Having less required supervision, more self-learning: anomaly detection and clustering are some of the unsupervised ones. When it comes to vision, there are parts of the process that can be unsupervised, in the sense that you don't need labels on your target, like whether there's a car in the picture, but it can still learn the feature detection in a way that doesn't require that supervised data. Although image classification in general, deep learning on that level, is not my area of expertise. That's a very up-and-coming part of machine learning, but it's only needed when you have these high-bandwidth inputs, like an entire high-resolution image, or video, or high-bandwidth audio. It's signal-processing-type problems where you start to need that kind of deep learning.
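For reference, a minimal sketch of the two unsupervised methods named here, clustering and anomaly detection: neither needs a labeled target, only the raw examples. The data is synthetic, scikit-learn and NumPy are assumed, and the specific algorithms (k-means and isolation forest) are illustrative choices rather than ones endorsed in the conversation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2)),  # one behavior segment
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(500, 2)),  # another segment
    [[20.0, -20.0]],                                       # a lone oddball
])

segments = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # no labels needed
outliers = IsolationForest(random_state=0).fit_predict(X)  # -1 flags anomalies
print(f"segment sizes: {np.bincount(segments)}, anomalies: {(outliers == -1).sum()}")
```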
>> Great discussion, Eric. Just a couple of minutes to go in this segment here. I want to make sure I give you a chance to talk about Predictive Analytics World. What's your affiliation with it, and what do you want theCUBE audience to know?

>> Oh sure. Predictive Analytics World, I'm the founder. It's the leading cross-vendor event focused on the commercial deployment of predictive analytics and machine learning. Our main event, held a few times a year, is a broad-scope, business-focused event, but we also have industry-vertical specialized events, just for financial services, healthcare, workforce, manufacturing, and government applications of predictive analytics and machine learning. So there are a number each year: two weeks from now in Chicago, October in New York, and you can see the full agendas at PredictiveAnalyticsWorld.com.

>> Alright, great short commercial there. 30 seconds.

>> It's the elevator pitch.

>> Answer the toughest question in 30 seconds: what was the toughest question you got after your keynote this morning? Maybe a hallway conversation, or...

>> What's the toughest question I got after my keynote?

>> Dave: From one of the attendees.

>> Oh, the question that always comes up is how you get this level of complexity across to non-technical people, your boss, your colleagues, or your friends and family. By the way, that's something I worked really hard on with the book, which is meant for all readers, although the last few chapters have...

>> How do you get executive sponsors to get what you're doing?

>> Well, as I say, give them the book, because the point of the book is that it's pop science: it's accessible, it's analytically driven, it's entertaining, it keeps things relevant, but it does address advanced topics at the end, so it sort of ends with an industry overview, that kind of thing. The bottom line there, in general, is that you want to focus on the business impact. As I mentioned a second ago: if we can improve targeted marketing this much, it will increase profit by a factor of five, something like that. You start with that, and then answer any questions they have about how it works, and what makes it credible that it really has that much potential for the bottom line. When you're a techie, you're inclined to go the other way around: you start with the technology that you're excited about. That's my background, and that's sort of the definition of being a geek, that you're more enamored with the technology than with the value it produces. Because it's amazing that it works, and it's exciting, it's interesting, it's scientifically challenging. But when you're talking to the decision makers, you have to start with the eventual carrot at the end of the stick, which is the value.

>> The business outcome.

>> Yeah.

>> Great, well that's going to be the last word. That might even make it onto our CUBE Gems segment, great sound bites. George, thanks again, great questions. And Eric, the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die, thank you for being on the show, we appreciate your time.

>> Eric: Sure, yeah, thank you, great to meet you.

>> Thank you for watching theCUBE. We'll be back in just a few minutes with our next guest, here at Spark Summit 2017.

Published Date: Jun 7, 2017
