Eric Siegel, Predictive Analytics World - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

>> Welcome back to theCUBE. You are watching coverage of Spark Summit 2017. It's day two, and we've got so many new guests to talk to today. We've already learned a lot, right George?

>> Yeah, I mean we had some, I guess, pretty high-bandwidth conversations.

>> Yes, and I expect we're going to have another one here too, because our guest is the founder of Predictive Analytics World. It's Eric Siegel. Eric, welcome to the show.

>> Hey, thanks Dave, thanks George. Do you go by Dave or David?

>> Dave: Oh, you can call me sir, and that would be...

>> I was calling you... should I, can I bow?

>> Oh no, we are bowing to you. You're the author of the book Predictive Analytics, and I love the subtitle: The Power to Predict Who Will Click, Buy, Lie or Die.

>> And that sums up the industry, right?

>> Right. So if people are new to the industry, that's sort of an informal definition of predictive analytics, basically also known as machine learning, where you're trying to make predictions for each individual, whether it's a customer for marketing, a suspect for fraud or law enforcement, a voter for political campaigning, or a patient for healthcare. In general it's on that level: a prediction for each individual. So how does data help make those predictions? And then you can only imagine just how many ways predicting on that level helps organizations improve all their activities.

>> Well, we know you were on the keynote stage this morning. Could you maybe summarize for theCUBE audience a couple of the top themes you talked about?

>> Yeah, I covered two advanced topics I wanted to make sure this pretty technical audience was aware of, because a lot of people aren't. One is called uplift modeling: that's optimizing for persuasion, for things like marketing, and also for healthcare, actually, and for political campaigning. When you do predictive analytics for targeting marketing, the traditional approach is to predict: will this person buy if I contact them? Because if so, it's probably a good idea to spend the two dollars to send them a brochure, the marketing treatment, right. But there's actually a slightly different question that would drive even better decisions, which is not "will this person buy," but "would contacting them, sending them the brochure, influence them to buy? Will it increase the chance that we get that positive outcome?" That's a different question, and it doesn't correspond with standard predictive modeling or machine learning methods. So uplift modeling, also known as net lift modeling or persuasion modeling, is a way to create a predictive model like any other, except that its target is: is it a good idea to contact this person, because it will increase the chances of a positive outcome? So that's the first of the two, and I crammed all of this into 20 minutes.
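To make the distinction concrete, here is a minimal sketch of one common uplift approach, the two-model method, in scikit-learn. The synthetic data, column names, and model choice are assumptions for illustration; Siegel doesn't prescribe a particular implementation, and real uplift work starts from a randomized treatment/control campaign.

```python
# Two-model uplift sketch (illustrative; assumes data from a randomized
# campaign with a binary `treated` flag and a `bought` outcome).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "prior_purchases": rng.poisson(2, n),
    "treated": rng.integers(0, 2, n),        # 1 = received the brochure
})
# Synthetic outcome: some customers are persuadable when contacted.
base = 0.05 + 0.01 * df["prior_purchases"]
lift = np.where(df["age"] < 40, 0.04, 0.0)   # under-40s respond to contact here
df["bought"] = rng.random(n) < base + lift * df["treated"]

features = ["age", "prior_purchases"]
treat, ctrl = df[df["treated"] == 1], df[df["treated"] == 0]

# Model 1: P(buy | contacted). Model 2: P(buy | not contacted).
m_treat = GradientBoostingClassifier().fit(treat[features], treat["bought"])
m_ctrl = GradientBoostingClassifier().fit(ctrl[features], ctrl["bought"])

# Uplift = predicted change in purchase probability caused by the contact.
df["uplift"] = (m_treat.predict_proba(df[features])[:, 1]
                - m_ctrl.predict_proba(df[features])[:, 1])
print(df.sort_values("uplift", ascending=False).head())  # contact these first
```

The key difference from a standard response model is the target: you rank people by the difference the contact makes, not by the raw probability of buying.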
>> The other one is a little more commonly known, but I think people should revisit it. It's called p-hacking, or vast search, where you can be fooled by randomness in data relatively easily. In the era of big data, there's an all-too-common pitfall where you find a predictive insight in the data, and it turns out it was actually just a random perturbation. How do you know the difference?

>> Dave: Fake news, right?

>> Okay, fake news, except that in this case it was generated by a computer, right? And there's a statistical test that makes it look like it's actually statistically significant, so we give it credibility. But you can avert that: you have to compensate for the fact that you're trying lots of things, that you're evaluating many different predictive insights, or hypotheses, whatever you want to call them, and make sure that for the one you end up believing, you've checked for the possibility that it was just random luck. That's known as p-hacking.
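The trap is easy to reproduce. This sketch (an illustration, not Siegel's own demonstration) manufactures two hundred purely random "insights," shows that a handful pass p < 0.05 anyway, and then applies the two standard defenses he alludes to: correcting for the number of hypotheses tried, and re-testing on data the search never saw.

```python
# P-hacking in miniature: test many random hypotheses, and by chance
# some will clear p < 0.05 even though no real relationship exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_rows, n_features = 500, 200
X = rng.normal(size=(n_rows, n_features))   # 200 candidate "insights"
y = rng.normal(size=n_rows)                 # outcome: pure noise

pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])
print(f"'Significant' at p<0.05: {(pvals < 0.05).sum()} of {n_features}")
# Expect roughly 10 false positives -- these are the headline-makers.

# Defense 1: Bonferroni-style correction for the hypotheses tried.
print(f"Survive Bonferroni: {(pvals < 0.05 / n_features).sum()}")

# Defense 2: re-test the "best" finding on fresh data it never saw.
best = pvals.argmin()
X2, y2 = rng.normal(size=(n_rows, n_features)), rng.normal(size=n_rows)
print("p-value on fresh data:", stats.pearsonr(X2[:, best], y2)[1])
```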
>> Alright, so uplift modeling and p-hacking. George, do you want to drill into those a little bit?

>> Yeah, I want to start from the vocabulary of our audience, where, say, uplift modeling goes beyond prediction. And even for the second one, p-hacking: is that where you're essentially playing with the parameters of the model to find the difference between correlation and causation, and going from prediction to prescription?

>> It's not about causation, actually. Correlation is what you get when you find a predictive insight, or some component of a predictive model, where you see these things connected, and therefore one is predictive of the other. Now, the fact that correlation does not entail causation is a really good point to remind people of. But even before you address that question, the first question is: is this correlation actually legit? Is there really a correlation between these things? Is this an actual finding, or did it just happen to be the case in this particular limited sample of data I have access to at the moment? So: is it a real link or correlation in the first place, before you even ask any question about causality? And it does relate to what you alluded to with regard to tuning parameters, because it's closely related to the issue of overfitting. People who do predictive modeling are very familiar with overfitting. The standard practice, and all tools and implementations of machine learning and predictive modeling do this, is to hold aside an evaluation set, called the test set, so you don't get to cheat. You create a predictive model; it learns from the data, does the number crunching, mostly automated, and it comes out with this beautiful model that predicts well. Then you assess it over this held-aside set. Oh, my thing's falling off here.

>> Dave: Just a second, on your...

>> So then you evaluate it on this held-aside set. It was quarantined, so you didn't get to cheat: you didn't get to look at it while you were creating the model. So it serves as an objective performance measure. The problem is, and here's the huge irony: there was one famous insight from data that was broadcast too loudly, because it's not nearly as credible as people first thought. It's that an orange used car is a better one to buy, because it's less likely to be a lemon. That's what it looked like in this one data set. The problem is, when you have a single insight that's relatively simple, you're just using the car's color to make the prediction. A predictive model is much more complex and deals with lots of other attributes, not just the color: the make, year, model, everything about that individual car or individual person. You can imagine all the attributes. That's the point of the modeling process, the learning process: how do you consider multiple things? If it's just a really simple thing based only on the car's color, then many of even the most advanced data science practitioners kind of forget that there's still potential to effectively overfit: you might have found something that doesn't apply in general, only over this particular set of data. So that's where the trap falls, and they don't necessarily hold themselves to the high standard of having the held-aside test set. So it's an ironic thing: the insights most likely to make the headlines, like orange cars, are simpler and easier to understand, but it's less well understood that they could be wrong.

>> You know, keying off that, that's really interesting, because we've been hearing for years that what's made deep learning especially relevant over the last few years is huge compute up in the cloud and huge data sets.

>> Yeah.

>> But we're also starting to hear about methods of generating a sort of synthetic data, so that if you don't have, I don't know what the term is, organic training data and test data, we're getting to the point where we can do high-quality models with less.

>> Yes, less of that training data. And did you... did you interview the keynote speaker from Stanford about that?

>> No, I only saw part of his speech yesterday.

>> That's an area I'm relatively new to, but it sounds extremely important, because that is the bottleneck. He called it, if data's the new oil, the new-new oil, which is more specific than data: it's training data. All of the machine learning or predictive modeling methods we're talking about are, in most cases, what's called supervised learning. The thing that makes it supervised is that you have a bunch of examples where you already know the answer. So if you're trying to figure out whether a picture is of a cat or a dog, you need a whole bunch of data from which to learn, the training data, where it's already labeled: you already know the correct answer. In many business applications, just because of history, you know who did or didn't respond to your marketing, and you know who did or did not turn out to be fraudulent. History is experience from which to learn; it's in the data, so you do have it labeled, yes or no. You already know the answer, so you don't need to predict on those past cases, but you use them as training data. We have that in many cases. But for something like classifying an image, where you're trying to figure out whether there's a picture of a cat somewhere in the image, or whatever these big image classification problems are, you often do need a manual effort to label the data, to have the positive and negative examples. That's what's called training data, the learning data. There's definitely a bottleneck, so anything that can be done to avert it, to decrease the amount we need, or to find ways to make rough training data that can serve as a building block for the modeling process, that kind of thing. That's not my area of expertise, but it sounds really intriguing.

>> What about, and this may be further out on the horizon, but one thing we are hearing about is the extreme shortage of data scientists, who need to be teamed up with domain experts to figure out the knobs, the variables, to create these elaborate models.
We're told that even if you're doing the traditional statistical machine learning models, eventually deep learning can help us identify the features, the variables, just the way it sort of identifies, you know, ears and whiskers and a nose, and then figures out the cat from those. Is that something in the near term, or the medium term, in terms of helping to augment what the data scientist does?

>> It's in the near term, and that's why everyone's excited about deep learning right now. Basically, the reason we built these machines called computers is that they automate stuff. Pretty much anything you can think of and define well, you can program, and then you've got a machine that does it. Of course, one of the things we wanted them to do is learn from data. It's literally very analogous to what it means for a human to learn: you've got a limited number of examples, and you're trying to draw generalizations from them. Now, you go to bigger-scale problems, where the thing you're classifying isn't just a customer and all the things you know about the customer, are they likely to commit fraud, yes or no. It becomes a level more complex when it's an image, right? An image is worth a thousand words, and maybe literally more than a thousand words of data if it's high resolution. So how do you process that? Well, there's all sorts of research: we can define the thing that tries to find arcs and circles and edges, that kind of thing. Or we can, once again, let that be automatic, let the computer do it. Deep learning is a way to allow that. Spark is a way to make it operate quickly, but there's another level of scale beyond speed: how complex a task you can leave up to the automaton to handle by itself. That's what deep learning does; it scales in that respect. It has the ability to automate more layers of that complexity, like finding those kinds of domain-specific features in images.

>> Okay, but I'm thinking not just of help me figure out speech-to-text and natural language understanding, or classifying.

>> Anything with a signal, where it's a high-bandwidth amount of data coming in that you want to classify.

>> Okay, so does that extend to: I'm building a very elaborate predictive model, not on whether there's a cat in the video or in the picture, so much as, I guess you called it, whether there's an uplift potential, and how big that potential is, in the context of making a sale on an eCommerce site?

>> So what you just tapped into is that when you go to marketing and many other business applications, you don't actually need high accuracy. What you need is a prediction that's better than guessing. For example, if I get a 1% response rate to my marketing campaign, but I can find a pocket with a 3% response rate, it may very much be rocket science to learn from the data how to define that specific sub-segment with the higher response rate, or whatever it is. But the 3% isn't "I have high confidence this person's definitely going to buy"; it's still just 3%. Yet that difference can make a huge difference, and can improve the bottom line of marketing by a factor of five, that kind of thing. It's not necessarily about accuracy.
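The arithmetic behind that factor-of-five claim is easy to check directly. The cost and revenue figures below are invented for illustration; only the response rates come from the conversation.

```python
# Back-of-envelope profit: mass mailing vs. targeting a 3% pocket.
# Cost and revenue figures are assumptions for illustration only.
cost_per_piece = 2.00          # the two-dollar brochure
revenue_per_sale = 220.00
audience = 1_000_000

def profit(n_mailed: int, response_rate: float) -> float:
    return n_mailed * (response_rate * revenue_per_sale - cost_per_piece)

mass = profit(audience, 0.01)            # mail everyone at a 1% response rate
targeted = profit(audience // 4, 0.03)   # mail only the predicted-best quarter
print(f"mass: ${mass:,.0f}   targeted: ${targeted:,.0f}")
# mass: $200,000   targeted: $1,150,000 -- a model that is nowhere near
# "accurate" (97% of targeted prospects still don't buy) can still
# multiply campaign profit several times over.
```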
Now, if you've got an image and you need to know whether there's a picture of a car in it, or whether a traffic light somewhere in the image is green or red, then for certain application areas, self-driving cars and what have you, it does need to be accurate, right? But maybe there's more potential for it to be accurate, because there's more predictability inherent to that problem. I can predict that there's a green traffic light somewhere in an image because there's enough labeled data, and the nature of the problem is more tractable: it's not as challenging to find where the traffic light is and then which color it is. You need it to scale, to reach that level of classification performance, in terms of accuracy or whatever measure you use, for certain applications.

>> Are you seeing new methodologies, like reinforcement learning, or deep learning where the models are adversarial, making big advances in terms of what they can learn without a lot of supervision? Like the ones where...

>> It's more self-learning, and unsupervised.

>> Sort of glue yourself onto this video game screen, we'll give you control of the steering wheel, and you figure out how to win.

>> Having less required supervision, more self-learning: anomaly detection or clustering are some of the unsupervised ones. When it comes to vision, there are parts of the process that can be unsupervised, in the sense that you don't need labels on your target, like whether there's a car in the picture, but it can still learn the feature detection in a way that doesn't require that supervised data. Although image classification in general, deep learning on that level, is not my area of expertise. That's a very up-and-coming part of machine learning, but it's only needed when you have these high-bandwidth inputs, like an entire high-resolution image, or a video, or high-bandwidth audio. It's signal processing type problems where you start to need that kind of deep learning.
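The two unsupervised methods Eric name-checks, clustering and anomaly detection, are easy to sketch: no labels appear anywhere below. This is an illustration on synthetic data, not anything discussed on air.

```python
# Unsupervised learning needs no labeled outcomes: clustering finds
# structure, anomaly detection flags points that don't fit it.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
normal = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(980, 2))
odd = rng.normal(loc=[6.0, 6.0], scale=0.5, size=(20, 2))   # rare behavior
X = np.vstack([normal, odd])

clusters = KMeans(n_clusters=2, n_init=10, random_state=7).fit_predict(X)
flags = IsolationForest(contamination=0.02, random_state=7).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
print("points flagged anomalous:", (flags == -1).sum())   # -1 = outlier
```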
>> Great discussion, Eric. Just a couple of minutes to go in this segment, so I want to make sure I give you a chance to talk about Predictive Analytics World. What's your affiliation with it, and what do you want theCUBE audience to know?

>> Oh sure. Predictive Analytics World, I'm the founder. It's the leading cross-vendor event focused on commercial deployment of predictive analytics and machine learning. Our main event, held a few times a year, is a broad-scope, business-focused event, but we also have industry-vertical specialized events just for financial services, healthcare, workforce, manufacturing, and government applications of predictive analytics and machine learning. So there are a number each year: two weeks from now in Chicago, in October in New York, and you can see the full agendas at PredictiveAnalyticsWorld.com.

>> Alright, great short commercial there. 30 seconds.

>> It's the elevator pitch.

>> Answer the toughest question in 30 seconds: what's the toughest question you got after your keynote this morning? Maybe a hallway conversation.

>> What's the toughest question I got after my keynote?

>> Dave: From one of the attendees.

>> Oh, the question that always comes up is how you get this level of complexity across to non-technical people, your boss or your colleagues or your friends and family. By the way, that's something I worked really hard on with the book, which is meant for all readers, although the last few chapters have the advanced material.

>> How do you get executive sponsors to get what you're doing?

>> Well, as I say, give them the book. Because the point of the book is that it's pop science: it's accessible, it's analytically driven, it's entertaining, it keeps things relevant, but it does address advanced topics toward the end, so it sort of ends with an industry overview, that kind of thing. The bottom line there, in general, is that you want to focus on the business impact. As I mentioned a second ago, if we can improve target marketing this much, it will increase profit by a factor of five, something like that. So you start with that, and then answer any questions they have about how it works and what makes it credible that it really has that much potential for the bottom line. When you're a techie, you're inclined to go the other way: you start with the technology you're excited about. That's my background, and that's sort of the definition of being a geek, that you're more enamored with the technology than the value it produces, because it's amazing that it works, and it's exciting, interesting, scientifically challenging. But when you're talking to the decision makers, you have to start with the eventual carrot at the end of the stick, which is the value.

>> The business outcome.

>> Yeah.

>> Great, well that's going to be the last word. That might even make it onto our CUBE Gems segment, great sound bites. George, thanks again, great questions. And Eric, the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die, thank you for being on the show. We appreciate your time.

>> Eric: Sure, yeah, thank you. Great to meet you.

>> Thank you for watching theCUBE. We'll be back in just a few minutes with our next guest here at Spark Summit 2017.
Rob Lantz, Novetta - Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

>> Welcome back to theCUBE. We're continuing to talk to people who are not just talking about things but doing things. We're happy to have, from Novetta, the Director of Predictive Analytics, Mr. Rob Lantz. Rob, welcome to the show.

>> Thank you.

>> And off to my right, George, how are you?

>> Good.

>> We've introduced you before.

>> Yes.

>> Well, let's talk to the guest. Let's get right to it. I want to ask you a little bit about what Novetta does, and then maybe what apps you're building using Spark.

>> Sure. Novetta is an advanced analytics company. We're medium-sized, and we develop custom hardware and software solutions for customers who are looking to get insights out of their big data. Our primary offering is an entity resolution engine. We scale up to billions of records, and we've done that for about 15 years.

>> So you're at the business end of analytics, right?

>> Yeah, I think so.

>> Alright, so talk to us a little more about entity resolution. And that's all Spark, right? This is your main priority?

>> Yes, indeed. Entity resolution is the science of taking multiple disparate data sets, traditional big data, taking records from those, and determining which of them are actually the same individual or company or address or location, and which should be kept separate. We can aggregate those things together and build profiles, and that enables a more robust picture of what's going on for an organization.

>> Okay, and George?

>> So what did the solution look like before Spark, and how did it change once you adopted Spark?

>> Sure. Spark enabled us to get a lot faster; obviously those computations scaled a lot better. Before, we had to write a lot of custom code to get those computations out across a grid. When we moved to Hadoop, and then Spark, that let us scale those things and get the work done overnight, or in hours, not weeks.

>> So when you say you had to write a lot of custom code to distribute across the cluster, does that include when you were working with MapReduce, or was this even before the Hadoop era?

>> Oh, it was before the Hadoop era, and that predates my time, so I won't be able to speak expertly about it, but to my understanding it was a challenge, for sure.

>> Okay, so this sounds like a service that your customers would then themselves build on. Maybe an ETL customer would figure out master data from a repository that is not as carefully curated as the data warehouse, or similar applications. So who is your end customer, and how do they build on your solution?

>> Sure. The end customer is typically an enterprise with large volumes of data about particular things. It could be customers, it could be passengers, lots of different things. They want to build profiles about those people, or companies, like I said, or locations; any number of things can be considered an entity. The way they build upon it, then, is how they go about quantifying those profiles. We can help them do that, in fact some of the work I manage does that, but often they do it themselves. They take the resolved data, which gets resolved nightly or even hourly, and build those profiles themselves for their own purposes.
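For readers new to the concept, a toy version of the matching step at the heart of entity resolution looks something like this. It is illustrative only, not Novetta's engine: production systems use learned scoring models, far more attributes, and cluster-scale blocking, but the core question, are these records the same entity, is the same.

```python
# Toy entity resolution: block on a cheap key, then score candidate
# pairs on field similarity. Thresholds and weights are illustrative.
from collections import defaultdict
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Rob Lantz",    "zip": "22102", "phone": "5551234"},
    {"id": 2, "name": "Robert Lantz", "zip": "22102", "phone": "5551234"},
    {"id": 3, "name": "Robert Lantz", "zip": "90210", "phone": "5559999"},
]

def similarity(a: dict, b: dict) -> float:
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    phone_sim = 1.0 if a["phone"] == b["phone"] else 0.0
    return 0.6 * name_sim + 0.4 * phone_sim

# Blocking: only compare records that share a zip code, which avoids
# the n^2 blowup of comparing every record against every other.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)

for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            score = similarity(block[i], block[j])
            verdict = "SAME entity" if score > 0.8 else "keep separate"
            print(block[i]["id"], block[j]["id"], round(score, 2), verdict)
```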
>> Then, to help us think about the application or the use case holistically: once they've built those profiles and essentially harmonized the data, what does that typically feed into?

>> Oh gosh, any number of things, really. We've got deployments in AWS in the cloud, and lots of deployments on premises, obviously. From there, the data can go anywhere from relational databases to graph query language databases. Lots of different places, for sure.

>> Okay, so this actually sounds like, everyone now talks about machine learning informing every category of software. It sounds like you take old-style ETL, where master data was a value-add layer on top, and that took a fair amount of human judgment. Now you're putting that service on top of ETL, and you're largely automating it, probably with, I assume, some supervised guidance, supervised training.

>> Yes, we're getting into the machine learning space for entity extraction, resolution, and recognition, because more and more data is unstructured. But machine learning isn't necessarily a baked-in part of that. Actually, entity resolution is a prerequisite, I think, for quality machine learning. If Rob Lantz is a customer, I want to know what Rob Lantz has bought from me in the past, and maybe what Rob Lantz is talking about in social media. Well, I need to figure out who those people are: who's Rob Lantz, and whether Robert Lantz is a completely different person, because I don't want to collapse those two together. Then I would build machine learning on top of that to say, right, what's his behavior going to be in the future? Once I have that robust profile built up, I can derive a lot more interesting features with which to apply the machine learning.

>> Okay, so you are a Databricks customer, and there's also a burgeoning partnership.

>> Rob: Yeah, I think that's true.

>> So talk to us a little bit about some of the frustrations you had before adopting Databricks, and maybe why you chose it.

>> Yeah, sure. The frustrations, primarily with a traditional Hadoop environment, involved going from one customer site to another with an incredibly complex technology stack, and then doing a lot of the cluster management for those customers even after they'd already set it up, because of all the inner workings of Hadoop and that ecosystem. To get our Spark application installed, we had to penetrate layers and layers of configuration in order to tune it appropriately and get the performance we needed.

>> David: Okay, and were you at the keynote this morning?

>> I was not, actually.

>> Okay, I'm not going to ask you about that then.

>> Ah.

>> But I am going to ask you a little bit about your wishlist. You've been talking to people in the hallways here, you just got here today, but what do you wish the community would do or develop? What would you like to learn while you're here?

>> Learning while I'm here, I've already picked up a lot. There's so much going on, and it's such a fast-paced environment, it's really exciting. If I had a wishlist, I would want a more robust MLlib, the machine learning library: all the things you can get in traditional scientific computing stacks, moved into Spark's MLlib for easier access on a cluster. That would be great.

>> I thought several years ago MLlib took over from Mahout as the most active open source community for adding, really, I thought, scale-out machine learning algorithms.
If it doesn't have it all now, or maybe "all" is something you never reach, kind of like a Red Queen effect, you know?

>> Rob: For sure, for sure.

>> What else is attracting these scale-out implementations of the machine learning algorithms? In other words, what are the platforms, if it's not Spark?

>> I don't think it exists, frankly, unless you write your own. I think that would be the way to go about it now. What organizations are having to do with machine learning in a distributed environment is just go with good enough, right? Whereas some of the ensemble methods, which actually aren't even really cutting edge necessarily, allow a lot of tuning, and doing that tuning distributed, at scale, would be really powerful. I read somewhere, and I'm not going to be able to quote exactly where, that actually throwing more data at a problem is more valuable than tuning a perfect algorithm, frankly. If we could combine the two, I think that would be really powerful: finding the right algorithm and throwing all the data at it would get you a really solid model that picks up on the signal underlying any of these phenomena.
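Part of that wish does exist in Spark's ML library today: distributed, cross-validated tuning of an ensemble method. A minimal PySpark sketch, assuming a DataFrame `df` with an assembled `features` vector column and a binary `label` column; the model and grid choices are illustrative.

```python
# Distributed ensemble tuning in Spark ML: grid-search a random forest
# with cross-validation; each candidate fit runs across the cluster.
# Assumes `df` has "features" (vector) and "label" (0/1) columns.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 200])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),  # AUC by default
                    numFolds=3)

model = cv.fit(df)        # every fold/candidate trains on distributed data
print(model.avgMetrics)   # mean AUC for each point on the grid
```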
>> David: Okay well, go ahead George.

>> I was going to ask, I think that goes back to, I don't know if it was a Google paper, or one of the Google search quality guys, a luminary in the machine learning space, who says "data always trumps algorithms."

>> I believe that's true, and it's true in my experience, certainly.

>> Once you have this machine learning, and once you've perhaps simplified the multi-vendor stack, what does your solution start looking like in terms of broadening its appeal, because of the lower TCO, and then perhaps embracing more use cases?

>> I don't know that it necessarily embraces more use cases, because entity resolution already applies so broadly, but what I would say is it will give us more time to focus on improving the ER itself. That's going to be a really, really powerful improvement we can make to Novetta entity analytics as it stands right now. That goes to what we alluded to before: machine learning as part of the entity resolution. Automated entity extraction from unstructured information, and not just unstructured text but unstructured images and video, could be a really powerful thing. Taking in stuff that isn't tagged and pulling the entities out automatically, without having a human in the loop. Pulling every name out, every phone number, every address. Go ahead, sorry.

>> This goes back to a couple of conversations we've had today where people say data trumps algorithms, even if they don't say it explicitly. So the cloud vendors who are sitting on billions of photos, many of which might have house street addresses and things like that, or faces: how do you extract better tuning for your algorithms from data sets that, I assume, are smaller than the cloud vendors'?

>> They're pretty big. We employ data engineers who are very experienced at tagging that stuff manually. What I would envision is that we'd apply somebody for a week or two weeks to go in and tag the data as appropriate. In fact, we have products that do concept tagging already, across multiple languages. That's going to be the subject of my talk tomorrow, as a matter of fact. But we can tag things manually, or with machine assistance, and then use that as a training set to apply to the much larger data set. I'm not so worried about the scale of the data; we already have a lot, a lot of data. I think the challenge is getting that proof set that's already tagged.

>> So what you're saying actually sounds kind of important. It almost ties into what we hear about Facebook training their Messenger bot: they can't do it purely on training data, so they take some data that needs semi-supervision, and that becomes the new labeled set, the new training data. Then they can run it against this broad, unwashed mass of data. Is that the strategy?

>> Certainly we would get there. We would want to get there, and that's the beauty of what Databricks promises: the ability to save a lot of the time we would spend doing the grunt work of cluster management, and to innovate in that way instead. We're really excited about that.

>> Alright, we've got just a minute to go here before the break, so let me ask you the wishlist question I've been asking everybody today. What do you wish you had, whether it's in entity resolution or some other area, in the next couple of years for Novetta? What's on your list?

>> Well, I think it would be that more robust machine learning library, all in Spark, kind of native, so we wouldn't have to deploy it ourselves. Beyond that, I think everything else is there, frankly. We're very excited about the platform and the stack that comes with it.

>> Well, that's a great ending right there. George, do you have any other questions you want to ask? Alright, we're just wrapping up here. Thank you so much, we appreciate you being on the show, Rob, and we'll see you out there in the Expo.

>> I appreciate it, thank you.

>> Alright, thanks so much.

>> George: It's good to meet you.

>> Thanks.

>> Alright, you are watching theCUBE here at Spark Summit 2017. Stay tuned, we'll be back with our next guest.
Raymie Stata, SAP - Big Data SV 17 - #BigDataSV - #theCUBE
>> Announcer: From San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017.

>> Welcome back, everyone. We are at Big Data Silicon Valley, running in conjunction with Strata + Hadoop World in San Jose. I'm George Gilbert, and I'm joined by Raymie Stata. Raymie was most recently CEO and founder of Altiscale, a Hadoop-as-a-service vendor, one of the few out there not part of one of the public clouds. And in keeping with all the great work they've done, they got snapped up by SAP. So, Raymie, since we haven't seen you on theCUBE since then, why don't you catch us up on all the good work that's gone on between you and SAP since then?

>> Sure. So the acquisition closed back in September, so it's been about six months, and it's been a very busy six months. There's just a lot of blocking and tackling that needs to happen: getting people on board, getting new laptops, all that good stuff. But certainly a huge effort for us was to open up a data center in Europe. We've long had demand for that European presence, both because I think there's a lot of interest over in Europe itself, but also because for large multinational companies based in the US, it's important to have that European presence too. It was a natural thing to do as part of SAP, so kind of the first order of business was to expand into Europe. That was a big exercise. We've also had some good traction on the sales side, so we're getting new customers, larger customers, more demanding customers, which has been a good challenge too.

>> So let's pause for a minute and unpack for folks what Altiscale offered, the core services.

>> Sure.

>> That were, you know, here in the US, and that you've now extended to Europe.

>> Right. So our core platform is Hadoop, Hive, and Spark as a service in the cloud. We offer HDFS and YARN for Hadoop, with Spark and Hive well integrated, and we offer that as a cloud service. You just get an account, log in, store stuff in HDFS, and run your Spark programs. The way we encourage people to think about it is this: I think vendors have trained folks in the big data space to think about nodes. How many nodes am I going to get? What kind of nodes? And the way we really force people to think twice about Hadoop, and about what Hadoop as a service means, is to ask: why are you asking that? You don't need to know about nodes. Just store stuff, run your jobs; we worry about nodes. Once people understand just how much complexity that takes out of their lives, and how it lets them truly focus on using these technologies to get business value rather than operating them, there's that aha moment in the sales cycle where they say: yeah, that's what I want. I want Hadoop as a service. That's been our value proposition from the beginning, and it's remained quite constant; even coming into SAP, that's not changing one bit.

>> So, just to be clear then: you took over a lot of the operational responsibilities, so that when you say don't worry about nodes, the customer pours x amount of data into storage, which in your case would be HDFS, and then compute is independent of that. You spin up however much capacity they need, with Spark for instance, to process it, or Hive. Okay, so.
>> And it's all on demand.

>> Yeah, so it sounds like, how close is it to the BigQuery or Athena services, Athena on AWS or BigQuery on Google, where you're not aware of any servers, either for storage or for compute?

>> Yeah, I think that's a very good comparable. It's very much like Athena and BigQuery, where you just store stuff in tables, you issue queries, and you don't worry about how much compute you need or about managing it. By throwing Spark into the equation, and YARN more generally, we can handle a broader range of use cases. For example, you don't have to store data in tables; you can store it in HDFS files, which is good for processing log data. And with Spark you have access to a lot of machine learning algorithms that are a little harder to run in the context of, say, Athena. So I think it's the same model, in terms of being fully operated for you, but a broader platform in terms of its capabilities.

>> Okay, so now let's talk about what SAP brought to the table and how that changed the use cases that are appropriate for Altiscale. You know, starting at the data layer.

>> Yeah. From the business perspective, SAP brings a large, very engaged customer base that is eager to embrace a data-driven mindset and culture and is looking for a partner to help them do that. So it's been great to be in that environment. SAP also has a number of additional technologies that we've been integrating into the Altiscale offering. One of them is Vora, which is an interactive SQL engine; it also has time series capabilities, graph capabilities, and search capabilities. So it has a lot of additive capabilities, if you will, on top of what we had at Altiscale, and it integrates very deeply with HANA itself. We now have that technology available as a service at Altiscale.

>> Let me make sure everyone understands, and that I understand too: you can issue queries from HANA, and beyond just simple SQL queries they can handle time series and predictive analytics, and seamlessly access data that's in Hadoop. Or can it go the other way as well?

>> It's both ways. From HANA you can essentially federate out into Vora, and through that, access data that's in a Hadoop cluster. But it's also the other way around. A lot of times there's an analyst who really lives in the big data world, in the Hadoop world, but they want to join in data that's sitting in a HANA database: dimensions in a warehouse, or even customer details in a transactional system. So that Hadoop-based analyst now has access to data that's out in those HANA databases.
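The Hadoop-side join Stata describes looks roughly like this in PySpark. The JDBC URL, credentials, and table names below are placeholder assumptions, and this sketch uses Spark's generic JDBC reader rather than any Vora-specific connector.

```python
# A Hadoop-side analyst joining voluminous data in HDFS against a
# dimension table held in a relational store. Connection details are
# placeholder assumptions, not a real deployment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federated-join").getOrCreate()

# Semi-structured side: clickstream logs already sitting in HDFS.
clicks = spark.read.json("hdfs:///data/weblogs/2017/06/*.json")

# Curated side: a customer dimension pulled over JDBC.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:sap://hana-host:30015/")  # placeholder
             .option("dbtable", "MART.CUSTOMER_DIM")        # placeholder
             .option("user", "analyst")
             .option("password", "...")
             .load())

enriched = clicks.join(customers, on="customer_id", how="left")
enriched.groupBy("customer_segment").count().show()
```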
>> Do you have some lighthouse accounts that are working with this already?

>> Yes, we do. (laughter)

>> Yes we do, okay. I guess that was the diplomatic way of saying yes, but no comment. Alright, so tell us more about SAP's big data stack today and how it might evolve.

>> Yeah. So now we've got the Spark, Hadoop, and Hive offering, with Vora sitting on top of that. There's also an offering called Predictive Analytics, which is Spark-based predictive analytics.

>> Is that something that came from you, or is that...

>> That's an SAP thing. This is what's been great about the acquisition: SAP has a lot of technologies that we can now integrate, and it brings new capabilities to our customer base. So those three are pretty key. And then there's something called Data Services as well, which allows us to move data easily in and out of HANA and other data stores.

>> Does this ability to federate queries between Hadoop and HANA, and to migrate data between the stores, change the economics of how much data SAP customers maintain, and what types of apps they can build on it, now that it's economically feasible to store a lot more data?

>> Well, yes and no. In the context of Altiscale, both before and after the acquisition, very often there's what you might call a big data source. It could be your web logs, it could be IOT-generated log data, it could be social media streams. This is data that doesn't have a lot of structure coming in, and it's fairly voluminous. It doesn't naturally go into a SQL database, and that's the sweet spot for big data technologies like Hadoop and Spark. So that data comes into your big data environment, you can transform it, you can do some data quality on it, and then you can eventually stage it out into something like a HANA data mart to make it available for reporting. But obviously there's stuff you can do on the larger data set in Hadoop as well. So in a way, yes: you can now tame, if you will, those huge data sources that weren't practical to put into HANA databases.
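That ingest-transform-stage flow is conventional enough to sketch in PySpark. The paths, schema, and quality rules below are illustrative assumptions rather than anything Altiscale prescribes.

```python
# Taming a raw IOT log source: parse, apply data-quality rules, and
# stage a clean aggregate for the downstream mart. Illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iot-staging").getOrCreate()

raw = spark.read.json("hdfs:///landing/iot/sensors/*.json")

clean = (raw
         .withColumn("ts", F.col("ts").cast("timestamp"))
         .withColumn("reading", F.col("reading").cast("double"))
         .filter(F.col("reading").isNotNull())
         .filter(F.col("reading").between(-50.0, 150.0))  # drop bad sensors
         .dropDuplicates(["device_id", "ts"]))

# Aggregate to the grain the reporting mart wants, then stage it.
hourly = (clean
          .groupBy("device_id", F.window("ts", "1 hour").alias("hour"))
          .agg(F.avg("reading").alias("avg_reading"),
               F.count(F.lit(1)).alias("n_readings")))
hourly.write.mode("overwrite").parquet("hdfs:///staging/iot_hourly")
```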
>> If you were to prioritize, in the context of the applications SAP focuses on, would the highest-priority use case be the IOT-related stuff, where it was just prohibitive to put it in HANA since it's mostly in memory? SAP is exposed to tons of that type of data, which would seem to most naturally have an affinity to Altiscale.

>> Yeah, so IOT is a big initiative, and it's a great use case for big data. But the financial services industry, as another example, is fairly far down the path of using Hadoop technologies for many different use cases, so that's also an opportunity for us.

>> So let me pop back up before we have to wrap. With Altiscale as part of the SAP portfolio, have the two companies gone to customers with more transformational options that you'll sell together?

>> Yeah, we have. In fact, Altiscale is actually no longer called Altiscale; we're part of a portfolio of products known as the SAP Cloud Platform. Under the Cloud Platform, we're the big data services. The SAP Cloud Platform is all about business transformation and business innovation, and we bring to that portfolio the ability to bring the types of data sources I've just discussed to bear on those transformative efforts. So we fit into momentum SAP already has in helping companies drive change.

>> Okay. So along those lines, we know financial services has done a lot of work, and I guess telcos as well. What are some of the other verticals that look like they're primed to follow with this type of transformation?

>> So you mentioned one, which I'd call manufacturing, and there tend to be two different use cases there. One of them I call the shop floor: you're collecting a lot of sensor data out of a manufacturing facility, with the goal of increasing yield. And then there's the more commonly discussed case of measuring products out in the field: you've got a product out there, you bring the telemetry back, and you do things like predictive maintenance. So I think manufacturing is a big sector that's ready to go for big data. And healthcare is another one. People are pulling together electronic medical records and trying to combine them with clinical outcomes, and the big focus there is to drive toward outcome-based models, even on the payment side. Big data is really valuable for driving and assessing outcomes in an aggregate way.

>> Okay. We're going to have to leave it on that note, but we will tune back in at, I guess, Sapphire or TechEd, whichever of the SAP shows comes up next, to get an update.

>> Sapphire's next, then TechEd.

>> Okay. With that, this is George Gilbert and Raymie Stata. We will be back in a few moments with another segment. We're here at Big Data Silicon Valley, running in conjunction with Strata + Hadoop World. Stay tuned, we'll be right back.