Dinesh Nirmal, IBM | CUBEConversation
(upbeat music)

>> Hi everyone. We have a special program today. We are joined by Dinesh Nirmal, who is VP of Analytics Development at IBM, and Dinesh has an extremely broad perspective on what's going on in this part of the industry, and IBM has a very broad portfolio. So, between the two of us, I think we can cover a lot of ground today. So, Dinesh, welcome.

>> Oh thank you George. Great to be here.

>> So just to frame the discussion, I wanted to hit on sort of four key highlights. One is balancing compatibility across cloud, on-prem, and edge versus leveraging specialized services that might be on any one of those platforms. And then harmonizing and simplifying both the management and the development of services across these platforms. You have that trade-off between: do I do everything compatibly, or can I take advantage of platform-specific stuff? And then, we've heard a huge amount of noise on machine learning, and everyone says they're democratizing it. We want to hear your perspective on how you think that's most effectively done. And then, if we have time, how to manage machine learning feedback, data feedback loops, to improve the models. So, let's start with that.

>> So you talked about the private cloud and the public cloud, and then, how do you manage the data and the models, or the other analytical assets, across the hybrid environment of today. If you look at enterprises, a hybrid format is what most customers adopt. I mean, you have some data on the public side, but your mission critical data, the data that's very core to your transactions, exists in the private cloud. Now, how do you make sure that the data you've pushed onto the cloud is data you can go use to build models? And that you can then take that model and deploy it on-prem or on the public cloud.

>> Is that the emerging sort of mainstream design pattern, where mission critical systems are less likely to move, for latency, or for the fact that they're fused to their own hardware, but you take the data, and the research for the models happens up in the cloud, and then that gets pushed down close to where the transaction decisions are?

>> Right. There's also the economics of data that comes into play. So if you are doing, you know, a large scale neural net, where you have GPUs, and you want to do deep learning, obviously, you know, it might make more sense for you to push it into the cloud and do that with one of the deep learning frameworks out there. But then you have your core transactional data, which includes your customer data, you know, or your customer medical data, which I think some customers might be reluctant to push onto a public cloud, but you still want to build models and predict and all those things. So I think it's a hybrid nature: depending on the sensitivities of the data, customers might decide to put it on the public cloud versus the private cloud, which is on their premises, right? So then how do you serve those customer needs, making sure that you can build a model on the cloud and deploy that model on the private cloud, or vice versa. I mean, you can build that model only on the private cloud, and then deploy it on your public cloud. Now the challenge, one last statement, is that people think, well, once I build a model and I deploy it on the public cloud, then it's easy, because it's just an API call at that time, just calling that model to execute the transactions. But that's not the case.
You take a support vector machine, for example, right. That still has support vectors in there, which means your data is there, right? So even though you're saying you're deploying the model, you still have sensitive data there. Those are the kinds of things customers need to think about before they go deploy those models.

>> So this is maybe a topic for our Friday interview with a member of the Watson IoT family, but it's not so black and white when you say we'll leave all your customer data with you, and we'll work on the models, because it's, sort of like, teabags, you know. You can take the customer's teabag and squeeze some of the tea out, in your IBM or public cloud, and give them back the teabag, but you're getting some of the benefit of that data.

>> Right, so like, it depends on the algorithms you build. You could take a linear regression, and you don't have the challenges I mentioned, as in a support vector machine, because none of the data is moving; it's just the model. So it depends. I think that's where, you know, what Watson has done will help tremendously, because the data is secure in that sense. But if you're building on your own, it's a different challenge; you've got to make sure you pick the right algorithms to do that.

>> Okay, so let's move on to the modern, sort of, what we call operational analytic pipeline, where the key steps are ingest, process, analyze, predict, serve, and you can drill down on those more. Today, those pipelines are pretty much built out of multi-vendor components. How do you see that evolving under the pressures of, or tension between, simplicity, coming from one vendor with the pieces all designed together, and specialization, where you want to have, you know, a unique tool in one component?

>> Right, so you're exactly right. You can take a two-pronged approach. One is, you can go to a cloud provider, get each of the services, and stitch them together. That's one approach. A challenging approach, but it has its benefits, right; I mean, you bring some core strengths from each vendor into it. The other one is the integrated approach, where you ingest the data, you shape or cleanse the data, you get it prepared for analytics, you build the model, you predict, you visualize. I mean, that all comes in one. The benefit there is you get the whole stack in one, you have a whole pipeline that you can execute, you have one service provider giving those services, it's managed. So all those benefits come with it, and that's probably the preferred way; with it all integrated together in one stack, I think that's the path most people go towards, because then you have the whole pipeline available to you, and also the services that come with it, and any updates that come with it. If you take the first route, one challenge you have is: how do you make sure all these services are compatible with each other? How do you make sure they're compliant? If you're an insurance company, you want it to be HIPAA compliant. Are you going to individually make sure that each of these services is HIPAA compliant? Or would you get it from one integrated provider, where you can make sure they are HIPAA compliant, tests are done? All those benefits, to me, outweigh going and putting unmanaged services all together, and then creating a data lake to underlay all of it.
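To make the support vector machine point from the top of this exchange concrete, here is a small sketch, shown in scikit-learn for brevity rather than Spark, with synthetic data: a fitted kernel SVM physically embeds training rows as its support vectors, while a linear model ships only coefficients.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression

X = np.random.rand(200, 4)                 # stand-in for sensitive customer records
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

svm = SVC(kernel="rbf").fit(X, y)
# The SVM artifact contains actual training rows (the support vectors),
# so deploying the model also means deploying that data.
print(svm.support_vectors_.shape)

lin = LinearRegression().fit(X, y)
# A linear model carries only learned parameters, no raw records.
print(lin.coef_, lin.intercept_)
```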
>> Would it be fair to say, to use an analogy, that Hadoop, sort of, originating in many different Apache products, is a quasi-multi-vendor kind of pipeline, and the state of the machine learning analytic pipeline is still kind of multi-vendor today? If you see that moving toward a single-vendor pipeline, who do you see as the sort of last man standing?

>> So, I mean, I can speak from an IBM perspective. I can say that the benefit a vendor like IBM brings forward is, take public or private cloud or hybrid: you obviously have the choice of going to the public cloud, you can get the same service on the public cloud, so you get a hybrid experience, so that's one aspect of it. Then, if you get the integrated solution, all the way from ingest to visualization, you have one provider, it's tested, it's integrated, you know, it's combined, it works well together. So I would say, going forward, if you look at it purely from an enterprise perspective, integrated solutions is the way to go, because that is what will be the last man standing. I'll give you an example. I was with a major bank in Europe, about a month ago, and I took them through our Data Science Experience, our machine learning product, and all that, and you know, the CTO's take was: Dinesh, I got it. Building the model itself only took us two days, but incorporating our model into our existing infrastructure, it has been 11 months and we haven't been able to do it. So that's the challenge enterprises face, and they want an integrated solution to bring that model into their existing infrastructure. So that's, you know, that's my thought.

>> Today though, let's talk about the IBM pipeline. Spark is core, ingest is, off the--

>> Dinesh: Right, so you can do Spark Streaming, you can use Kafka, or you can use InfoSphere Streams, which is our proprietary tool.

>> Right, although you wouldn't really use structured streaming for ingest, 'cause of the back pressure?

>> Right, so they are--

>> The point that I'm trying to make is, it's still multi-vendor, and then on the serving side, I don't know where, once the analysis is done and predictions are made, some sort of SQL database has to take over. So today, it's still pretty multi-vendor. How do you see any of those products broadening their footprints so that the number of pieces decreases?

>> So, good question. They are all going to get into the end-to-end pipeline, because that's where the value is. Unless you provide an integrated, end-to-end solution for a customer, especially an enterprise customer, it's all about putting it all together, and putting these pieces together is not easy. Even when you ingest the data, IoT kind of data, a lot of times, 99% of the time, data is not clean, unless you're in a competition where you get cleansed data; in the real world, that never happens. So then, I would say 80% of a data scientist's time is spent on cleaning the data, shaping the data, preparing the data to build that pipeline. So for most customers, it's critical that they get that end-to-end, well-oiled, well-connected, integrated solution, rather than taking an isolated solution from each vendor. To answer your question, yes, every vendor is going to move into ingest, data cleansing, transformation, building the pipeline, and then visualization; if you look at those five steps, all of that has to be developed.
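A hedged sketch of the Kafka ingest path Dinesh mentions, using Spark Structured Streaming in PySpark; the broker address and topic name are placeholders, and the job needs the spark-sql-kafka connector package on its classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "transactions")                # placeholder topic
       .load())

# Kafka delivers raw bytes; cast the payload before cleansing and shaping downstream.
events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

query = (events.writeStream
         .format("console")   # stand-in sink; a real pipeline would cleanse and persist
         .outputMode("append")
         .start())
```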
>> But just building the data cleansing and transformation, having it native to your own pipeline, that doesn't sound like it's going to solve the problem of messy data that needs, you know, human supervision to correct.

>> I mean, so there is some level of human supervision, to be sure. I'll give you an example, right. When data comes in from an insurance company, a lot of times the gender could be missing. How do you know if it's a male or female? Then you've got to build another model to say, you know, this patient has gone for a prostate exam, it's a male; gynecology, it's a female. So you have to do some inference work in there to make sure that the data is clean, and then there's some human supervision to make sure that this is good to build models, because when you're executing that pipeline in real time,

>> Yeah.

>> It's all based on the past data, so you want to make sure that the data is as clean as possible to train the model that you're going to execute on.

>> So, let me ask you, turning to a slide we've got about complexity, first for developers, and then second for admins. If we take the steps in the pipeline as ingest, process, analyze, predict, serve, and the sort of products or product categories as Kafka; Spark Streaming and SQL; a web service for predict; and MPP SQL or NoSQL for serve; even if they all came from IBM, would it be possible to unify the data model, the addressing and namespace, and, I'm just kicking off a few that I can think of, the programming model, persistence, transaction model, workflow, testing, integration? It's one thing to say it's all IBM, and it's another thing for the developer working with it to see it as one suite.

>> So it has to be validated, and that's the benefit that IBM brings already, because we obviously test each segment to make sure it works. But when you talk about complexity, building the model is one part, you know, development of the model, but the complexity also comes in the deployment of the model, and then we talk about the management of the model: how do you monitor it? When was the model deployed, was it deployed in test, was it deployed in production, who changed that model last, what was changed, and how is it scoring? Is it scoring high or low? You want to get a notification when the model starts scoring low. So complexity is all the way across, from getting the data in, cleaning the data, developing the model; it never ends. And the other benefit that IBM has added is the feedback loop, which, when you talk about complexity, reduces the complexity. Today, if the model scores low, you have to take it offline, retrain the model based on the new data, and then redeploy it. Usually for enterprises, there are slots where you can take it offline and put it back online, all these things, so it's a process. What we have done is created a feedback loop where we are training the model in real time, using real time data, so the model is continuously--

>> Online learning.

>> Online learning.

>> And challenger/champion, or A/B testing, to see which one is more robust.

>> Right, so you can do that. I mean, you could have multiple models where you do A/B testing, but in this case, you can continuously train the model to say, okay, this model scores the best. And then, another benefit is that, if you look at the whole machine learning process, there's the data, there's development, there's deployment.
On the development side, more and more it's getting commoditized, meaning picking the right algorithm; there are a lot of tools, including IBM's, that can suggest what's the right one to use for this, so that piece is getting a little less complex, I don't want to say easier, but less complex. But the data cleansing and the deployment, those are the two hard parts for enterprises: when you have thousands of models, how do you make sure that you deploy the right model?

>> So you might say that the pipeline for managing the model is sort of separate from the original data pipeline; maybe it includes the same technology, or much of the same technology, but once your data pipeline is in production, the model pipeline has to keep cycling through.

>> Exactly, and the data pipeline could be changing. If you take a loan example, right, a lot of the data that goes into the model pipeline is static. I mean, my age, it's not going to change every day, I mean, it is, but you know, the age that goes in, my salary, my race, my gender, those are static data that you can take from the data and put in there. But then there's also real time data that's coming in: my loan amount, my credit score, all those things. So how do you bring that data pipeline, between real time and static data, into the model pipeline, so the model can predict accurately, and, based on the score dipping, you should be able to retrain the model using real time data.

>> I want to take you, Dinesh, to the issue of a multi-vendor stack again, and the administrative challenges. So here, we look at a slide that shows me just rattling off some of the admin challenges: governance, performance modeling, scheduling, orchestration, availability, recovery, authentication, authorization, resource isolation, elasticity, testing, integration, so that's the Y-axis, and then every different product in the pipeline as the X-axis, say Kafka, Spark structured streaming, MPP SQL, NoSQL, so you've got a mess.

>> Right.

>> Most open source companies are trying to make life easier for companies by managing their software as a service for the customer, and that's typically how they monetize. But tell us what you see the problem is, or will be, with that approach.

>> So, great question. Let me take a very simple example. Probably most of our audience know about GDPR, which is the European law with the right to be forgotten. So if you're an enterprise, and I say, George, I want my data deleted, you have to delete all of my data within a period of time. Now, that's where one of the aspects you talked about, governance, comes in. How do you make sure you have governance across not just the data but your analytical assets? If you're using a multi-vendor solution in all of that, that data governance, how do I make sure that data gets deleted by all these services that are tied together?

>> Let me maybe make an analogy. On CSI, when they pick up something at the crime scene, they've got to make sure that it's bagged, and the chain of custody doesn't lose its integrity all the way back to the evidence room. I assume you're talking about something like that.

>> Yeah, something similar. Where the data, as it moves between private cloud, public cloud, and the analytical assets using that data, all those things need to work seamlessly for you to execute that particular transaction, to delete data from everywhere.

>> So it's not just administrative costs, but regulations, that are pushing towards more homogeneous platforms.
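One way to picture the fan-out problem just described: a right-to-be-forgotten request only succeeds if every service in the stack confirms the delete. Everything in this sketch, the service names and the client class, is invented for illustration.

```python
# Invented service list; a real stack would have one client per vendor or product.
SERVICES = ["ingest_store", "feature_store", "model_registry", "audit_log"]

def delete_everywhere(person_id: str, clients: dict) -> dict:
    """Ask every service to erase person_id and report who confirmed.
    Compliance means every entry must come back True."""
    results = {}
    for name in SERVICES:
        try:
            results[name] = clients[name].delete_person(person_id)
        except Exception:
            results[name] = False  # one failure leaves the request unfulfilled
    return results

class FakeClient:
    def delete_person(self, person_id: str) -> bool:
        return True  # stand-in for a real per-service deletion API

status = delete_everywhere("cust-123", {name: FakeClient() for name in SERVICES})
print(status, "fully deleted:", all(status.values()))
```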
>> Right, right. And even if you take some of the other things on the stack, monitoring, logging, metering: each provides some of those capabilities, but you have to make sure, when you put all these services together, how are they going to integrate? You want one monitoring stack, so if you're pulling, you know, your IoT kind of data into a data center, or doing your whole stack evaluation, how do you make sure you're getting the right monitoring data across the board? Those are the kinds of challenges that you will have.

>> It's funny you mention that, because we were talking to an old Lotus colleague of mine, who was CTO of Microsoft's IT organization, and we were talking about how the cloud vendors can put a machine learning management application across their properties, or their services, but he said one of the first problems you'll encounter is the telemetry. It's really easy on hardware, CPU utilization, memory utilization, I/O, but as you get higher up in the application services, it becomes much more difficult to harmonize, so that a program can figure out what's going wrong.

>> Right, and I mean, like anomaly detection, right?

>> Yes.

>> I mean, how do you make sure you're seeing patterns where you can predict something before it happens, right?

>> Is that on the road map for...?

>> Yeah, so we're already working with some big customers to say, if you have a data center, how do you look at outages to predict what can go wrong in the future; root cause analysis, I mean, that is a huge problem to solve. So let's say a customer hit a problem, you took an outage: what caused it? Because today, you have specialists who will come and try to figure out what the problem is, but can we use machine learning or deep learning to figure out, is it a fix that was missing, or an application that got changed that caused a CPU spike, that caused the outage? That whole root cause analysis is the one that's the hardest to solve, because you are talking about people's decades' worth of knowledge, and now you are training a machine to do that prediction.

>> And from my understanding, root cause analysis is most effective when you have a rich model of how your, in this case, infrastructure and apps are working, and there might be many little models, but they're held together by some sort of knowledge graph that says here is where all the pieces fit, these are the pieces below these, sort of as peers to these other things. How does that knowledge graph get built, and is this the next generation of the configuration management database?

>> Right, so I call it the self-healing, self-managing, self-fixing data center. It's easy for you to turn up the heat or the A/C so the temperature goes down; I mean, those are good, but the real value for a customer is exactly what you mentioned, building up that knowledge graph from different models that all come together. But the hardest part is, predicting an anomaly is one thing, but getting to the root cause is a different thing, because at that point, you're saying, I know exactly what caused this problem, and I can prevent it from happening again. That's not easy. We are working with our customers to figure out how we get to the root cause analysis, but it's all about building the knowledge graph with multiple models coming from different systems; today, I mean, enterprises have systems from multiple vendors.
We have to bring all that monitoring data into one source, and that's where that knowledge comes in, and then different models will feed that data, and then you need to mine that data, using deep learning algorithms, to say: what caused this?

>> Okay, so this actually sounds extremely relevant, although, in the interest of time, we're probably going to have to dig down on that one another time. But just at a high level, it sounds like the knowledge graph is sort of your web or directory into how local components or local models work, and then, knowing that, if it sees problems coming up here, it can understand how it affects something else tangentially.

>> So think of the knowledge graph as a neural net, because it's building a new neural net based on the past data, and it has that built-in knowledge where it says, okay, these symptoms seem to be a problem that I have encountered in the past. Now I can predict the root cause, because I know this happened in the past. So it's kind of like putting out that net to build new problem determinations as it goes along. It's a complex task. It's not easy to get to root cause analysis. But that's something we are aggressively working on developing.

>> Okay, so let me ask, let's talk about sort of democratizing machine learning and the different ways of doing that. You've actually talked about the big pain points, maybe not so sexy, but critical, which are operationalizing the models and preparing the data. Let me bounce off you some of the other approaches. One that we have heard from Amazon is that they're saying, well, data munging might be an issue, and operationalizing the models might be an issue, but the biggest issue in terms of making this developer-ready is: we're going to take the machine learning we use to run our business, whether it's merchandising fashion, running recommendation engines, managing fulfillment or logistics, and, just like they did with AWS, they're dog-fooding it internally, and then they're going to put it out on AWS as a new layer of the platform. Where do you see that being effective, and where less effective?

>> Right, so let me answer the first part of your question, the democratization of machine learning. That happens when, for example, a real estate agent who has no idea about machine learning is able to come and predict the house prices in an area. That, to me, is democratizing, because at that point you have made it available to everyone; everyone can use it. But that comes back to our first point, which is having that clean set of data. You can build all the pre-canned pipelines out there, but if you're not feeding a clean set of data in, none of this works, you know. Garbage in, garbage out, that's what you're going to get. So when we talk about democratization, it's not that easy and simple, because you can build all these pre-canned pipelines that you have used in-house for your own purposes, but every customer has many unique cases. If I take you as a bank, your fraud detection methods are completely different from mine as a bank; my limit for fraud detection could be completely different. So there is always customization involved, the data that's coming in is different, so while it's a buzzword, I think there's knowledge that people need to feed it, there are models that need to be tuned and trained, and there's deployment that is completely different. So you know, there is work that has to be done.
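A sketch of what a "pre-canned" pipeline for the real-estate example might look like in Spark ML; the feature columns are assumptions for illustration, not a description of any actual IBM offering.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assumed listing schema: square_feet, bedrooms, zip_median_income, price.
assembler = VectorAssembler(
    inputCols=["square_feet", "bedrooms", "zip_median_income"],
    outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="price")

# The "pre-canned" part: the stages come ready-made, and a non-expert
# only supplies their own (clean!) data to fit and predict.
house_price_pipeline = Pipeline(stages=[assembler, lr])
# model = house_price_pipeline.fit(listings_df)         # listings_df assumed
# predictions = model.transform(new_listings_df)
```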
>> So then what I'm taking away from what you're saying is, you don't have to start from ground zero with your data, but you might want to add some of your data, which is specialized, or slightly different from what the pre-trained model used. You still have to worry about operationalizing it, so it's not a pure developer-ready API, but it uplevels the skills requirement so that it's not quite as demanding as working with TensorFlow or something like that.

>> Right, I mean, you can always build pre-canned pipelines and make them available, and we have already done that. For example, for fraud detection, we have pre-canned pipelines; for IT analytics, we have pre-canned pipelines. So it's nothing new; you can always take what you have done in-house and make it available to the public or the customers, but then they have to take it and do customization to meet their demands, bring their data to retrain the model; all those things have to be done. It's not just about providing the model, because every customer use case is completely different. Whether you are looking at fraud detection from one bank's perspective, not all banks are going to do the same thing. Same thing for predicting, for example, the loan; I mean, your loan approval process is going to be completely different from my loan approval process as a bank.

>> So let me ask you then, and we're getting low on time here, but if you had to characterize Microsoft Azure, Google, and Amazon as each bringing to bear certain advantages and disadvantages, and you're now the ambassador, so you're not a representative of IBM, help us understand the sweet spot for each of those. Like, you're trying to fix the two sides of the pipeline, I guess, thinking of it like a barbell; you know, where are the others, based on their data assets and their tools, and where do they need to work?

>> So, there are two aspects to it. There's the enterprise aspect: as an enterprise, I would like to say, it's not just about the technology, but there's also the services aspect. If my model goes down in the middle of the night, and my banking app is down, who do I call? If I'm using a service that is available on the cloud provider which is open source, do I have the right amount of coverage to call somebody and fix it? So there are the enterprise capabilities, availability, reliability; that is different. Then a developer comes in with a CSV file that he or she wants to build a model from to predict something; that's different. These are two different aspects. So if you talk about, you know, all these vendors, if I'm wearing an enterprise hat, some of the things I would look at are: can I get an integrated solution, end to end, on the machine learning platform?

>> And that means end to end in one location,

>> Right.

>> So you don't have network issues or latency and stuff like that.

>> Right, it's an integrated solution, where I can bring in the data, there are no challenges with latency, those kinds of things, and then can I get the enterprise-level service, SLAs, all those things, right? So there, the named vendors obviously have an upper hand, because enterprises prefer them over a brand new open source vendor that comes along. But then, within enterprises, there are lines of business building models using some of the open source vendors, which is okay, but eventually those have to get deployed, and then how do you make sure you have those enterprise capabilities there. So if you ask me, I think each vendor brings some capabilities.
I think the benefit IBM brings is, one, you have the choice or the freedom to bring in cloud or on-prem or hybrid, and you have all the choices of languages: we support R, Python, Spark, I mean, SPSS. So the choice, the freedom, the reliability, the availability, the enterprise nature, that's where IBM comes in and differentiates, and for our customers, that's a huge plus.

>> One last question, and we're really out of time. In terms of thinking about a unified pipeline, when we were at Spark Summit, sitting down with Matei Zaharia and Reynold Xin, the question came up that Databricks has an incomplete pipeline: no persistence, no ingest, not really much in the way of serving, but boy are they good at, you know, data transformation, and munging, and machine learning. But they said they consider it part of their ultimate responsibility to take control. And on the ingest side it's Kafka; on the serving side, it might be Redis or something else, or the Spark databases like SnappyData and Splice Machine. Spark is so central to IBM's efforts. What might a unified Spark pipeline look like? Have you guys thought about that?

>> It's not there; obviously they probably could be working on it, but for our purposes, Spark is critical for us, and the reason we invested in Spark so much is because of the execution engine, where you can take a tremendous amount of data and, you know, crunch through it in a very short amount of time; that's the reason. We also invested in Spark SQL, because we have a good chunk of customers who still use SQL heavily. We put a lot of work into Spark ML, so we are continuing to invest, and probably they will get to an integrated solution, but it's not there yet; as it comes along, we'll adapt. If it meets our needs and demands, and enterprises can do it, then definitely, I mean, you know. We saw that Spark's core engine has the ability to crunch a tremendous amount of data, so we are using it; I mean, 45 of our internal products use Spark as their core engine. Our DSX, Data Science Experience, has Spark as its core engine. So, yeah, today it's not there, but I know they're probably working on it, and if there are elements of this whole pipeline that come together, that are convenient for us to use at an enterprise level, we will definitely consider using them.

>> Okay, on that note, Dinesh, thanks for joining us, and taking time out of your busy schedule. My name is George Gilbert, I'm with Dinesh Nirmal from IBM, VP of Analytics Development, and we are at the Cube studio in Palo Alto, and we will be back in the not too distant future, with more interesting interviews with some of the gurus at IBM.

(peppy music)
Rob Lantz, Novetta - Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's the CUBE, covering Spark Summit 2017, brought to you by Databricks.

>> Welcome back to the CUBE. We're continuing to talk with people who are not just talking about things but doing things. We're happy to have, from Novetta, the Director of Predictive Analytics, Mr. Rob Lantz. Rob, welcome to the show.

>> Thank you.

>> And off to my right, George, how are you?

>> Good.

>> We've introduced you before.

>> Yes.

>> Well, let's talk to the guest. Let's get right to it. I want to talk to you a little bit about what Novetta does, and then maybe what apps you're building using Spark.

>> Sure, so Novetta is an advanced analytics company. We're medium-sized, and we develop custom hardware and software solutions for our customers, who are looking to get insights out of their big data. Our primary offering is a hardened entity resolution engine. We scale up to billions of records, and we've done that for about 15 years.

>> So you're in the business end of analytics, right?

>> Yeah, I think so.

>> Alright, so talk to us a little bit more about entity resolution, and that's all Spark, right? This is your main priority?

>> Yes, yes, indeed. Entity resolution is the science of taking multiple disparate data sets, traditional big data, taking records from those, and determining which of those are actually the same individual or company or address or location, and which of those should be kept separate. We can aggregate those things together and build profiles, and that enables a more robust picture of what's going on for an organization.

>> Okay, and George?

>> So, what was the solution looking like before Spark, and how did it change once you adopted Spark?

>> Sure. Spark enabled us to get a lot faster; obviously those computations scaled a lot better. Before, we were having to write a lot of custom code to get those computations out across a grid. When we moved to Hadoop, and then Spark, that made us able to, let's say, scale those things, and get it done overnight or in hours, not weeks.

>> So when you say you had to do a lot of custom code to distribute across the cluster, does that include when you were working with MapReduce, or was this even before the Hadoop era?

>> Oh, it was before the Hadoop era, and that predates my time, so I won't be able to speak expertly about it, but to my understanding, it was a challenge for sure.

>> Okay, so this sounds like a service that your customers would then themselves build on. Maybe an ETL customer would figure out master data from a repository that is not as carefully curated as the data warehouse, or similar applications. So who is your end customer, and how do they build on your solution?

>> Sure, so the end customer typically is an enterprise that has large volumes of data that deal in particular things. They collect; it could be customers, it could be passengers, it could be lots of different things. They want to be able to build profiles about those people, or companies, like I said, or locations; any number of things can be considered an entity. The way they build upon it then is how they go about quantifying those profiles. We can help them do that, in fact, some of the work that I manage does that, but oftentimes they do it themselves. They take the resolved data, and that gets resolved nightly or even hourly. They build those profiles themselves for their own purposes.
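A toy sketch of the resolution idea Rob describes, nothing like Novetta's actual engine: normalize a couple of fields, then group records that collide on the normalized key into one profile.

```python
from collections import defaultdict

def norm(s: str) -> str:
    # Crude normalization: lowercase and strip non-alphanumerics.
    return "".join(ch for ch in s.lower() if ch.isalnum())

records = [
    {"id": 1, "name": "Rob Lantz",    "zip": "22003"},
    {"id": 2, "name": "ROB  LANTZ",   "zip": "22003"},
    {"id": 3, "name": "Robert Lantz", "zip": "90210"},
]

profiles = defaultdict(list)
for rec in records:
    key = (norm(rec["name"]), rec["zip"])  # crude blocking + match key
    profiles[key].append(rec["id"])

print(dict(profiles))  # ids 1 and 2 resolve together; id 3 stays separate
```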
>> Then, to help us think about the application or the use case holistically, once they've built those profiles and essentially harmonized the data, what does that typically feed into?

>> Oh gosh, any number of things, really. Oh, shoot. We've got deployments in AWS, in the cloud; we've got lots of deployments on premises, obviously. From there it can go anywhere, from relational databases to graph query language databases. Lots of different places, for sure.

>> Okay, so this actually sounds like, everyone talks now about machine learning informing every category of software. This sounds like you take the old-style ETL, where master data was a value-add layer on top, and that took a fair amount of human judgment to do. Now, you're putting that service on top of ETL and you're largely automating it, probably with, I assume, some supervised guidance, supervised training.

>> Yes, so we're getting into the machine learning space as far as entity extraction and resolution and recognition, because more and more data is unstructured. But machine learning isn't necessarily a baked-in part of that. Actually, entity resolution is a prerequisite, I think, for quality machine learning. So if Rob Lantz is a customer, I want to be able to know what Rob Lantz has bought in the past from me, and maybe what Rob Lantz is talking about on social media. Well, I need to figure out who those people are: who's Rob Lantz, and whether Robert Lantz is a completely different person; I don't want to collapse those two things together. Then I would build machine learning on top of that to say, right, what's his behavior going to be in the future? But once I have that robust profile built up, I can derive a lot more interesting features with which to apply the machine learning.

>> Okay, so you are a Databricks customer, and there's also a burgeoning partnership.

>> Rob: Yeah, I think that's true.

>> So talk to us a little bit about what are some of the frustrations you had before adopting Databricks, and maybe why you chose it.

>> Yeah, sure. So the frustrations, primarily with a traditional Hadoop environment, involved having to go from one customer site to another customer site with an incredibly complex technology stack, and then do a lot of the cluster management for those customers even after they'd already set it up, because of all the inner workings of Hadoop and that ecosystem. Getting our Spark application installed there, we had to penetrate layers and layers of configuration in order to tune it appropriately to get the performance we needed.

>> David: Okay, and were you at the keynote this morning?

>> I was not, actually.

>> Okay, I'm not going to ask you about that then.

>> Ah.

>> But I am going to ask you a little bit about your wishlist. You've been talking to people, maybe in the hallway here, you just got here today, but what do you wish the community would do or develop? What would you like to learn while you're here?

>> Learning while I'm here, I've already picked up a lot. There's so much going on, and it's such a fast-paced environment, it's really exciting. I think if I had a wishlist, I would want a more robust MLlib, the machine learning library. All the things that you can get in traditional scientific computing stacks, moved into Spark MLlib for easier access on a cluster, would be great.

>> I thought several years ago MLlib took over from Mahout as the most active open source community for adding, really, I thought, scale-out machine learning algorithms.
If it doesn't have it all now, or maybe "all" is something you never reach, kind of like the Red Queen effect, you know?

>> Rob: For sure, for sure.

>> Where else are these scale-out implementations of the machine learning algorithms showing up?

>> Um?

>> In other words, what are the platforms? If it's not Spark, then...

>> I don't think it exists, frankly, unless you write your own. I think that would be the way to go about it now. I think what organizations are having to do with machine learning in a distributed environment is just go with good enough, right? Whereas maybe some of the ensemble methods, which actually aren't even really cutting edge necessarily, but where you can really do a lot of tuning on those things, doing that tuning distributed at scale would be really powerful. I read somewhere, and I'm not going to be able to quote exactly where it was, but actually throwing more data at a problem is more valuable than tuning a perfect algorithm, frankly. If we could combine the two, I think that would be really powerful. That is, finding the right algorithm and throwing all the data at it would get you a really solid model that would pick up on the signal that underlies any of these phenomena.

>> David: Okay well, go ahead, George.

>> I was going to ask, I think that goes back to, I don't know if it was a Google paper, or one of the Google search quality guys who's a luminary in the machine learning space, who says, "data always trumps algorithms."

>> I believe that's true, and that's true in my experience, certainly.

>> Once you have this machine learning, and once you've perhaps simplified the multi-vendor stack, what does your solution start looking like in terms of broadening its appeal, because of the lower TCO, and then perhaps embracing more use cases?

>> I don't know that it necessarily embraces more use cases, because entity resolution applies so broadly already, but what I would say is it will give us more time to focus on improving the ER itself. That, I think, is going to be a really, really powerful improvement we can make to Novetta entity analytics as it stands right now. That's going to go into, as we alluded to before, the machine learning as part of the entity resolution: entity extraction, automated entity extraction from unstructured information, and not just unstructured text, but unstructured images and video. That could be a really powerful thing. Taking in stuff that isn't tagged and pulling the entities out of it automatically, without actually having to have a human in the loop. Pulling every name out, every phone number out, every address out. Go ahead, sorry.

>> This goes back to a couple conversations we've had today, where people say data trumps algorithms, even if they don't say it explicitly. So the cloud vendors who are sitting on billions of photos, many of which might have house street addresses and things like that, or faces: how do you extract better tuning for your algorithms from data sets that I assume are smaller than the cloud vendors'?

>> They're pretty big. We employ data engineers who are very experienced at tagging that stuff manually. What I would envision would happen is we would apply somebody for a week or two weeks to go in and tag the data as appropriate. In fact, we have products that go in and do concept tagging already, across multiple languages. That's going to be the subject of my talk tomorrow, as a matter of fact.
But we can tag things manually, or with machine assistance, and then use that as a training set to go apply to the much larger data set. I'm not so worried about the scale of the data; we already have a lot, a lot of data. I think it's going to be getting that proof set that's already tagged.

>> So what you're saying actually sounds kind of important. That almost ties into what we hear about Facebook training their Messenger bot, where we can't do it purely on existing training data, so we're going to take some data that needs semi-supervision, and that becomes our new labeled set, our new training data. Then we can run it against this broad, unwashed mass of training data. Is that the strategy?

>> Certainly we would get there. We would want to get there, and that's the beauty of what Databricks promises: that ability to save a lot of the time that we would spend doing the grunt work on cluster management, to innovate in that way, and we're really excited about that.

>> Alright, we've got just a minute to go here before the break, so I wanted to ask you maybe the wishlist question I've been asking everybody today: what do you wish you had? Whether it's in entity resolution or some other area, in the next couple of years for Novetta, what's on your list?

>> Well, I think that would be the more robust machine learning library, all in Spark, kind of native, so we wouldn't have to deploy that ourselves. Then, I think everything else is there, frankly. We are very excited about the platform and the stack that comes with it.

>> Well, that's a great ending right there. George, do you have any other questions you want to ask? Alright, we're just wrapping up here. Thank you so much, we appreciate you being on the show, Rob, and we'll see you out there in the Expo.

>> I appreciate it, thank you.

>> Alright, thanks so much.

>> George: It's good to meet you.

>> Thanks.

>> Alright, you are watching the CUBE here at Spark Summit 2017, stay tuned, we'll be back with our next guest.
Holden Karau, IBM Big Data SV 17 #BigDataSV #theCUBE
>> Announcer: Big Data Silicon Valley 2017. >> Hey, welcome back, everybody, Jeff Frick here with The Cube. We are live at the historic Pagoda Lounge in San Jose for Big Data SV, which is associated with Strathead Dupe World, across the street, as well as Big Data week, so everything big data is happening in San Jose, we're happy to be here, love the new venue, if you're around, stop by, back of the Fairmount, Pagoda Lounge. We're excited to be joined in this next segment by, who's now become a regular, any time we're at a Big Data event, a Spark event, Holden always stops by. Holden Karau, she's the principal software engineer at IBM. Holden, great to see you. >> Thank you, it's wonderful to be back yet again. >> Absolutely, so the big data meme just keeps rolling, Google Cloud Next was last week, a lot of talk about AI and ML and of course you're very involved in Spark, so what are you excited about these days? What are you, I'm sure you've got a couple presentations going on across the street. >> Yeah, so my two presentations this week, oh wow, I should remember them. So the one that I'm doing today is with my co-worker Seth Hendrickson, also at IBM, and we're going to be focused on how to use structured streaming for machine learning. And sort of, I think that's really interesting, because streaming machine learning is something a lot of people seem to want to do but aren't yet doing in production, so it's always fun to talk to people before they've built their systems. And then tomorrow I'm going to be talking with Joey on how to debug Spark, which is something that I, you know, a lot of people ask questions about, but I tend to not talk about, because it tends to scare people away, and so I try to keep the happy going. >> Jeff: Bugs are never fun. >> No, no, never fun. >> Just picking up on that structured streaming and machine learning, so there's this issue of, as we move more and more towards the industrial internet of things, like having to process events as they come in, make a decision. How, there's a range of latency that's required. Where does structured streaming and ML fit today, and where might that go? >> So structured streaming for today, latency wise, is probably not something I would use for something like that right now. It's in the like sub second range. Which is nice, but it's not what you want for like live serving of decisions for your car, right? That's just not going to be feasible. But I think it certainly has the potential to get a lot faster. We've seen a lot of renewed interest in ML liblocal, which is really about making it so that we can take the models that we've trained in Spark and really push them out to the edge and sort of serve them in the edge, and apply our models on end devices. So I'm really excited about where that's going. To be fair, part of my excitement is someone else is doing that work, so I'm very excited that they're doing this work for me. >> Let me clarify on that, just to make sure I understand. So there's a lot of overhead in Spark, because it runs on a cluster, because you have an optimizer, because you have the high availability or the resilience, and so you're saying we can preserve the predict and maybe serve part and carve out all the other overhead for running in a very small environment. >> Right, yeah. So I think for a lot of these IOT devices and stuff like that it actually makes a lot more sense to do the predictions on the device itself, right. 
These models generally are megabytes in size, and we don't need a cluster to do predictions on these models, right. We really need the cluster to train them, but I think for a lot of cases, pushing the prediction out to the edge node is actually a pretty reasonable use case. And so I'm really excited that we've got some work going on there.

>> Taking that one step further, we've talked to a bunch of people, like at GE, at their Minds and Machines show, and IBM's Genius of Things, where you want to be able to train the models up in the cloud, where you're getting data from all the different devices, and then push the retrained model out to the edge. Can that happen in Spark, or do we have to have something else orchestrating all that?

>> So actually pushing the model out isn't something that I would do in Spark itself; I think that's better served by other tools. Spark is not really well suited to large amounts of internet traffic, right. But it's really well suited to the training, and I think with MLlib-local we'll essentially be able to provide both sides of it, and the copy part will be left up to whoever it is that's doing the work, right. Because, like, if you're copying over a cell network you need to do something very different than if you're broadcasting over terrestrial XM or something like that; you need to do something very different for satellite.

>> If you're at the edge on a device, would you be actually running, like you were saying earlier, structured streaming, with the prediction?

>> Right, I don't think you would use structured streaming per se on the edge device, but essentially there would be a lot of code shared between structured streaming and the code that you'd be using on the edge device. And it's being factored out now so that we can have this code sharing in Spark machine learning. And you would use structured streaming maybe on the training side, and then on the serving side you would use your custom local code.

>> Okay, so tell us a little more about Spark ML today and how we can democratize machine learning, you know, for a bigger audience.

>> Right. I think machine learning is great, but right now you really need a strong statistical background to be able to apply it effectively. We probably can't get rid of that for all problems, but I think for a lot of problems, doing things like hyperparameter tuning can actually give really powerful tools to just, like, regular engineering folks, who are smart, but maybe don't have a strong machine learning background. And Spark's ML pipelines make it really easy to sort of construct multiple stages, and then just be like, okay, I don't know what these parameters should be; I want you to do a search over what these different parameters could be for me. And that makes it really easy to do this as just a regular engineer with less of an ML background.

>> Would that be, just for those of us who don't know what hyperparameter tuning is, the knobs, the variables?

>> Yeah, it's going to spin the knobs on, like, our regularization parameter on our regression, and it can also spin some knobs on maybe the n-gram sizes that we're using on the inputs to something else, right. And it can compare how these knobs sort of interact with each other, because often you can tune one knob, but you actually have six different knobs that you want to tune, and if you just explore each one individually, you're not going to find the best setting for them working together.
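A sketch of that knob-spinning with Spark ML pipelines: regularization, n-gram size, and feature count are searched together, since their best values interact. The column names and grid values here are assumptions for illustration.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

tokenizer = Tokenizer(inputCol="text", outputCol="words")
ngram = NGram(inputCol="words", outputCol="ngrams")
tf = HashingTF(inputCol="ngrams", outputCol="features")
lr = LogisticRegression(labelCol="label")
pipe = Pipeline(stages=[tokenizer, ngram, tf, lr])

# Search the knobs jointly, not one at a time.
grid = (ParamGridBuilder()
        .addGrid(ngram.n, [1, 2, 3])
        .addGrid(tf.numFeatures, [1 << 10, 1 << 14])
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())

cv = CrossValidator(estimator=pipe, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
# best_model = cv.fit(train_df)  # train_df assumed to have "text" and "label" columns
```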
>> So this would make it easier for, as you're saying, someone who's not a data scientist to set up a pipeline that lets you predict.

>> I think so, very much. It brings a lot of the benefits from sort of the SciPy world to the big data world. And SciPy is really wonderful about making machine learning really accessible, but it's just not ready for big data, and I think this does a good job of bringing the same concepts, if not the code, to big data.

>> The SciPy, if I understand, is it a notebook that would run essentially on one machine?

>> SciPy can be put in a notebook environment, and generally it would run on, yeah, a single machine.

>> And so to make that sit on Spark means that you could then run it on a cluster--

>> So this isn't actually taking SciPy and distributing it; this is just, like, stealing the good concepts from SciPy and making them available for big data people. Because SciPy has done a really good job of making a very intuitive machine learning interface.

>> So just to put a fine qualifier on one thing: if you're doing the internet of things and you have Spark at the edge and you're running the model there, it's the programming model, so structured streaming is one way of programming Spark, but if you don't have structured streaming at the edge, would you just be using the core batch Spark programming model?

>> So at the edge, you wouldn't even be using batch, right, because you're trying to predict individual events, right. So you'd just be calling predict with every new event that you're getting in. And you might have a queue mechanism of some type. But essentially if we had batch there, we would be adding additional latency, and at the edge, the reason we're moving the models to the edge is to avoid the latency.

>> So just to be clear then, on the programming model: it wouldn't be structured streaming, and we're taking out all the overhead that forced us to use batch with Spark. The reason I'm trying to clarify is that a lot of people have had this question for a long time, which is: are we going to have a different programming model at the edge from what we have at the center?

>> Yeah, that's a great question. And I don't think the answer is finished yet, but I think the work is being done to try and make it look the same. Of course, you know, trying to make it look the same; this is Boosh, it's not like she's actually barking at us right now, even though she looks like a dog, she is. There will always be things which are a little bit different from the edge to your cluster, but I think Spark has done a really good job of making things look very similar in single node cases and multi node cases, and I think we can probably bring the same things to ML.

>> Okay, so it's almost like we're coming back around: Spark took us from single machine to cluster, and now we have to essentially bring it back for an edge device that's really lightweight.

>> Yeah, I think at the end of the day, just from a latency point of view, that's what we have to do for serving. For some models, not for everyone. Like, if you're building a website with a recommendation system, you don't need to serve that model on the edge node, that's fine, but if you've got a car device, we can't depend on cell latency, right; you have to serve that in the car.

>> So what are some of the things, some of the other things that IBM is contributing to the ecosystem, that you see having a big impact over the next couple years?
>> So there's a lot of really exciting things coming out of IBM. And I'm obviously pretty biased. I spend a lot of time focused on Python support in Spark, and one of the most exciting things is coming from my co-worker Brian, I'm not going to say his last name in case I get it wrong, but Brian is amazing, and he's been working on integrating Arrow with Spark, and this can make it so that it's going to be a lot easier to sort of interoperate between JVM languages and Python and R, so I'm really optimistic about the sort of Python and R interfaces improving a lot in Spark and getting a lot faster as well. And we're also, in addition to the Arrow work, we've got some work around making it a lot easier for people in R and Python to get started. The R stuff is mostly actually the Microsoft people, thanks Felix, you're awesome. I don't actually know which camera I should have done that to but that's okay. >> I think you got it! >> But Felix is amazing, and the other people working on R are too. But I think we've both been pursuing sort of making it so that people who are in the R or Python spaces can just use pip install, conda install, or whatever tool it is they're used to working with, to just bring Spark into their machine really easily, just like they would sort of any other software package that they're using. Because right now, for someone getting started in Spark, if you're in the Java space it's pretty easy, but if you're in R or Python you have to do sort of a lot of weird setup work, and it's worth it, but if we can get rid of that friction, I think we can get a lot more people in these communities using Spark. >> Let me see, just as a scenario, R Server is getting fairly well integrated into SQL Server, so would you be able to use R as the language with a Spark execution engine to somehow integrate it into SQL Server as an execution engine for doing the machine learning and predicting? >> You definitely, well I shouldn't say definitely, you probably could do that. I don't necessarily know if that's a good idea, but that's the kind of stuff that this would enable, right, it'll make it so that people that are making tools in R or Python can just use Spark as another library, right, and it doesn't have to be this really special setup. It can just be this library and they point it at the cluster and they can do whatever work they want to do. That being said, the SQL Server R integration, if you find yourself using that to do distributed computing, you should probably take a step back and rethink what you're doing. >> George: Because it's not really scale out. >> It's not really set up for that. And you might be better off doing this by connecting your Spark cluster to your SQL Server instance using JDBC or a special driver and doing it that way, but you definitely could do it in an inverted sort of way. >> So last question from me, if you look out a couple years, how will we make machine learning accessible to a bigger and bigger audience? And I know you touched on the tuning of the knobs, hyperparameter tuning, what will it look like ultimately? >> I think ML pipelines are probably what things are going to end up looking like.
But I think the other part that we'll sort of see is we'll see a lot more examples of how to work with certain kinds of data, because right now, like, I know what I need to do when I'm ingesting some textual data, but I know that because I spent like a week trying to figure out what the hell I was doing once, right. And I didn't bother to write it down. And it looks like no one else bothered to write it down. So really I think we'll see a lot of tools that look very similar to the tools we have today, they'll have more options and they'll be a bit easier to use, but I think the main thing that we're really lacking right now is good documentation and sort of good books and just good resources for people to figure out how to use these tools. Now of course, I mean, I'm biased, because I work on these tools, so I'm like, yeah, they're pretty great. So there might be other people who are like, Holden, no, you're wrong, we need to rethink everything. But I think this is, we can go very far with the pipeline concept. >> And then that's good, right? The democratization of these things opens it up to more people, you get more creative people solving more different problems, that makes the whole thing go. >> You can install Spark easily, you can, you know, set up an ML pipeline, you can train your model, you can start doing predictions, people that haven't been able to do machine learning at scale can get started super easily, and build a recommendation system for their small little online shop and be like, hey, you bought this, you might also want to buy Boosh, he's really cute, but you can't have this one. No no no, not this one. >> Such a tease! >> Holden: I'm sorry, I'm sorry. >> Well Holden, with that, we'll say goodbye for now, I'm sure we will see you in June in San Francisco at the Spark Summit, and look forward to the update. >> Holden: I look forward to chatting with you then. >> Absolutely, and break a leg this afternoon at your presentation. >> Holden: Thank you. >> She's Holden Karau, I'm Jeff Frick, he's George Gilbert, you're watching The Cube, we're at Big Data SV, thanks for watching. (upbeat music)
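Two practical threads from this segment can be sketched in a few lines: Spark installed as an ordinary Python package, and a Spark cluster reading from SQL Server over JDBC rather than pushing distributed work into the database, as Holden suggests. The host, database, table, and credentials below are hypothetical, and the Microsoft JDBC driver jar has to be available to Spark:

```python
# pip install pyspark  -- Spark arrives like any other Python library
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-sketch")
         .config("spark.jars", "/path/to/mssql-jdbc.jar")  # hypothetical path
         .getOrCreate())

# Read a table from SQL Server; Spark, not the database, then does the
# distributed computing.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
      .option("dbtable", "dbo.transactions")
      .option("user", "spark_reader")
      .option("password", "...")
      .load())

df.groupBy("store_id").count().show()
```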
Frederick Reiss, IBM STC - Big Data SV 2017 - #BigDataSV - #theCUBE
>> Narrator: Live from San Jose, California it's the Cube, covering Big Data Silicon Valley 2017. (upbeat music) >> Big Data SV 2017, day two of our wall to wall coverage of the Strata Hadoop Conference, Big Data SV, really what we call Big Data Week because this is where all the action is going on down in San Jose. We're at the historic Pagoda Lounge in the back of the Fairmont, come on by and say hello, we've got a really cool space and we're excited, we've never been in this space before, so we're excited to be here. So we've got George Gilbert here from Wikibon, and we're really excited to have our next guest, he's Fred Reiss, he's the chief architect at the IBM Spark Technology Center in San Francisco. Fred, great to see you. >> Thank you, Jeff. >> So I remember when Rob Thomas, we went up and met with him in San Francisco when you guys first opened the Spark Technology Center a couple of years ago now. Give us an update on what's going on there, I know IBM's putting a lot of investment in this Spark Technology Center in the San Francisco office specifically. Give us kind of an update of what's going on. >> That's right, Jeff. Now we're in the new Watson West building in San Francisco at 505 Howard Street, colocated; we have about a 50 person development organization. Right next to us we have about 25 designers, and on the same floor a lot of developers from Watson doing a lot of data science, and from the Weather Underground, doing weather and data analysis, so it's a really exciting place to be, lots of interesting work in data science going on there. >> And it's really great to see how IBM is taking the core Watson, obviously enabled by Spark and other core open source technology, and now applying it, we're seeing Watson for Health, Watson for autonomous vehicles, Watson for Marketing, Watson for this, and really bringing that type of machine learning power to all the various verticals in which you guys play. >> Absolutely, that's been what Watson has been about from the very beginning, bringing the power of machine learning, the power of artificial intelligence to real world applications. >> Jeff: Excellent. >> So let's tie it back to the Spark community. Most folks understand how Databricks builds out the core, or does most of the core work for, like, the SQL workload, the streaming and machine learning, and I guess graph is still immature. We were talking earlier about IBM's contributions in helping to build up the machine learning side. Help us understand what the Databricks core technology for machine learning is and how IBM is building beyond that. >> So the core technology for machine learning in Apache Spark comes out, actually, of the machine learning department at UC Berkeley as well as a lot of different members of the community. Some of those community members also work for Databricks. We actually at the IBM Spark Technology Center have made a number of contributions to core Apache Spark and the libraries, for example recent contributions in neural nets. In addition to that, we also work on a project called Apache SystemML, which used to be proprietary IBM technology, but the IBM Spark Technology Center has turned SystemML into Apache SystemML, it's now an open Apache incubating project that's been moving forward out in the open. You can now download the latest release online, and that provides a piece that we saw was missing from Spark and a lot of other similar environments: an optimizer for machine learning algorithms.
So in Spark, you have the Catalyst optimizer for data analysis, DataFrames, SQL; you write your queries in terms of those high-level APIs and Catalyst figures out how to make them go fast. In SystemML, we have an optimizer for high-level languages like R and Python where you can write algorithms in terms of linear algebra, in terms of high-level operations on matrices and vectors, and have the optimizer take care of making those algorithms run in parallel, run at scale, taking account of the data characteristics. Does the data fit in memory? If so, keep it in memory. Does the data not fit in memory? Stream it from disk. >> Okay, so there was a ton of stuff in there. >> Fred: Yep. >> And if I were to refer to that as so densely packed as to be a black hole, that might come across wrong, so I won't refer to that as a black hole. But let's unpack that, so the, and I meant that in a good way, like high bandwidth, you know. >> Fred: Thanks, George. >> Um, so the traditional Spark, the machine learning that comes with Spark's MLlib, one of its distinguishing characteristics is that the models, the algorithms that are in there, have been built to run on a cluster. >> Fred: That's right. >> And very few others have built machine learning algorithms to run on a cluster, but as you were saying, you don't really have an optimizer for finding which of the algorithms would fit optimally to solve a problem. Help us understand, then, how SystemML solves a more general problem for, say, ensemble models and for scale out, I guess I'm, help us understand how SystemML fits relative to Spark's MLlib and the more general problems it can solve. >> So, MLlib and a lot of other packages, such as Sparkling Water from H2O, for example, provide you with a toolbox of algorithms, and each of those algorithms has been hand-tuned for a particular range of problem sizes and problem characteristics. This works great as long as the particular problem you're facing as a data scientist is a good match to the implementation that you have in your toolbox. What SystemML provides is less like having a toolbox and more like having a machine shop. You have a lot more flexibility, you have a lot more power, you can write down an algorithm as you would write it down if you were implementing it just to run on your laptop, and then let the SystemML optimizer take care of producing a parallel version of that algorithm that is customized to the characteristics of your cluster, customized to the characteristics of your data.
Just like SQL and query optimization: by allowing you to separate that logical description of what you're looking for from the physical description of how to get at it, it lets you have a parallel database with the exact same language as a single-machine database. In SystemML, because we have an optimizer that separates that logical description of the machine learning algorithm from the physical implementation, we can target a lot of parallel systems, and we can also target a large server, and the code, the code that implements the algorithm, stays the same. >> Okay, now let's take that a step further. You refer to matrix math and I think linear algebra and a whole lot of other things that I never quite made it to since I was a humanities major, but when we're talking about those things, my understanding is that those are primitives that Spark doesn't really implement, so that if you wanted to do neural nets, which rely on some of those constructs for high performance, >> Fred: Yes. >> Then, um, that's not built into Spark. Can you get to that capability using SystemML? >> Yes. SystemML, at its core, provides you as a user with a library of linear algebra primitives, just like a language like R or a library like NumPy gives you matrices and vectors and all of the operations you can do on top of those primitives. And just to be clear, linear algebra really is the language of machine learning. If you pick up a paper about an advanced machine learning algorithm, chances are the specification for what that algorithm does and how that algorithm works is going to be written in the paper literally in linear algebra, and the implementation that was used in that paper is probably written in a language where linear algebra is built in, like R, like NumPy. >> So it sounds to me like Spark has done the work of sort of the blocking and tackling of machine learning to run in parallel. And that's, I mean, to be clear, since we haven't really talked about it, that's important when you're handling data at scale and you want to train, you know, models on very, very large data sets. But it sounds like when we want to go to some of the more advanced machine learning capabilities, the ones that today are making all the noise with, you know, speech to text, text to speech, natural language understanding, those neural network based capabilities are not built into the core Spark MLlib; would it be fair to say you could start getting at them through SystemML? >> Yes, SystemML is a much better way to do scalable linear algebra on top of Spark than the very limited linear algebra that's built into Spark. >> So alright, let's take the next step. Can SystemML be grafted onto Spark in some way, or would it have to be an entirely new API that doesn't integrate with all the other Spark APIs? In a way, that has differentiated Spark, where each API is sort of accessible from every other. Can you tie SystemML in, or do the Spark guys have to build more primitives into their own sort of engine first? >> A lot of the work that we've done with the Spark Technology Center as part of bringing SystemML into the Apache ecosystem has been to build a nice, tight integration with Apache Spark, so you can pass Spark DataFrames directly into SystemML and you can get DataFrames back. Your SystemML algorithm, once you've written it in terms of one of SystemML's scripting languages, just plugs into Spark like all the algorithms that are built into Spark.
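As a rough illustration of what Fred describes, here is a minimal sketch using Apache SystemML's Python bindings (the `systemml` package and its MLContext API; the exact API surface varied across releases, so treat this as indicative rather than definitive). Ordinary least squares is written once as plain linear algebra, and the optimizer decides the physical plan based on the data and the cluster:

```python
import numpy as np
from pyspark.sql import SparkSession
from systemml import MLContext, dml  # Apache SystemML Python bindings

spark = SparkSession.builder.appName("systemml-sketch").getOrCreate()
ml = MLContext(spark)

X = np.random.rand(1000, 10)
y = X.dot(np.random.rand(10, 1))

# OLS via the normal equations, written as linear algebra; the same
# script can run single-node or distributed, SystemML picks the plan.
script = dml("w = solve(t(X) %*% X, t(X) %*% y)").input(X=X, y=y).output("w")
w = ml.execute(script).get("w").toNumPy()
print(w[:3])
```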
>> Okay, so that would keep Spark competitive with more advanced machine learning frameworks for a longer period of time; in other words, it wouldn't hit the wall the way it would if it encountered TensorFlow from Google, for Google's way of doing deep learning. Spark wouldn't hit the wall once it needed, like, a TensorFlow, as long as it had SystemML so deeply integrated the way you're doing it. >> Right, with a system like SystemML, you can quickly move into new domains of machine learning. So for example, this afternoon I'm going to give a talk with one of our machine learning developers, Mike Dusenberry, about our recent efforts to implement deep learning in SystemML, like full-scale convolutional neural nets running on a cluster, in parallel, processing many gigabytes of images, and we implemented that with very little effort, because we have this optimizer underneath that takes care of a lot of the details of how you get that data into the processing, how you get the data spread across the cluster, how you get the processing moved to the data or vice versa. All those decisions are taken care of in the optimizer; you just write down the linear algebra parts and let the system take care of it. That let us implement deep learning much more quickly than we would have if we had done it from scratch. >> So it's just this ongoing cadence of basically removing the infrastructure and management burden from the data scientists and enabling them to concentrate really where their value is, which is on the algorithms themselves, so they don't have to worry about how many clusters it's running on, and that configuration, kind of the typical dev ops that we see on the regular development side, but now you're really bringing that into the machine learning space. >> That's right, Jeff. Personally, I find all the minutiae of making a parallel algorithm work really fascinating, but a lot of people working in data science really see parallelism as a tool. They want to solve the data science problem, and SystemML lets you focus on solving the data science problem because the system takes care of the parallelism. >> You guys could go on in the weeds for probably three hours, but we don't have enough coffee, and we're going to set up a follow-up time because you're both in San Francisco. But before we let you go, Fred, as you look forward into 2017, kind of the advances that you guys have done there at the IBM Spark Center in the city, what are kind of the next couple great hurdles that you're looking to cross, new challenges that are getting you up every morning, that you're excited to come back a year from now and be able to say, wow, these are the one or two things that we were able to take down in 2017? >> We're moving forward on several different fronts this year. On one front, we're helping to get the notebook experience with Spark notebooks consistent across the entire IBM product portfolio. We helped a lot with the rollout of notebooks on Data Science Experience on Z, for example, and we're working actively with the Data Science Experience and with the Watson Data Platform. On the other hand, we're contributing to Spark 2.2. There are some exciting features, particularly in SQL, that we're hoping to get into that release, as well as some new improvements to MLlib. We're moving forward with Apache SystemML, we just cut version 0.13 of that. We're talking right now on the mailing list about getting SystemML out of incubation, making it a full, top-level project.
And we're also continuing to help with the adoption of Apache Spark technology in the enterprise. Our latest focus has been on deep learning on Spark. >> Well, I think we found him! Smartest guy in the room. (laughter) Thanks for stopping by, and good luck on your talk this afternoon. >> Thank you, Jeff. >> Absolutely. Alright, he's Fred Reiss, he's George Gilbert, and I'm Jeff Frick, you're watching the Cube from Big Data SV, part of Big Data Week in San Jose, California. (upbeat music) (mellow music) >> Hi, I'm John Furrier, the cofounder of SiliconANGLE Media, cohost of the Cube. I've been in the tech business since I was 19, first programming on minicomputers.
Jean Francois Puget, IBM | IBM Machine Learning Launch 2017
>> Announcer: Live from New York, it's theCUBE, covering the IBM machine learning launch event. Brought to you by IBM. Now, here are your hosts, Dave Vellante and Stu Miniman. >> Alright, we're back. Jean Francois Puget is here, he's a distinguished engineer for machine learning and optimization at IBM Analytics, and a CUBE alum. Good to see you again. >> Yes. >> Thanks very much for coming on, big day for you guys. >> Jean Francois: Indeed. >> It's like giving birth every time you guys launch one of these products. We saw you a little bit in the analyst meeting, pretty well attended. Give us the highlights from your standpoint. What are the key things that we should be focused on in this announcement? >> For most people, machine learning equals machine learning algorithms. Algorithms, when you look at newspapers or blogs, social media, it's all about algorithms. Our view is that, sure, you need algorithms for machine learning, but you need steps before you run algorithms, and after. Before, you need to get data, to transform it, to make it usable for machine learning. Then, you run algorithms. These produce models, and then you need to move your models into a production environment. For instance, you use an algorithm to learn from past credit card transaction fraud. You can learn models, patterns, that correspond to fraud. Then you want to use those models, those patterns, in your payment system. And moving from where you run the algorithm to the operational system is a nightmare today, so our value is to automate what you do before you run algorithms, and then what you do after. That's our differentiator. >> I've had some folks on theCUBE in the past, years ago actually, say, "You know what, algorithms are plentiful." I think he made the statement, I remember, my friend Avi Mehta: "Algorithms are free. It's what you do with them that matters." >> Exactly. I believe open source has won for machine learning algorithms. Now the future is with open source, clearly. But it solves only a part of the problem you're facing if you want to put machine learning into action. So, exactly what you said. What you do with the results of an algorithm is key. And open source people don't care much about it, for good reasons. They are focusing on producing the best algorithms. We are focusing on creating value for our customers. It's different. >> In terms of, you mentioned open source a couple times, in terms of customer choice, what's your philosophy with regard to the various tooling and platforms for open source, how do you go about selecting which to support? >> Machine learning is fascinating. It's overhyped, maybe, but it's also moving very quickly. Every year there is new cool stuff. Five years ago, nobody spoke about deep learning. Now it's everywhere. Who knows what will happen next year? Our take is to support open source, to support the top open source packages. We don't know which one will win in the future. We don't even know if one will be enough for all needs. We believe one size does not fit all, so our take is to support a curated list of major open source packages. We start with Spark ML for many reasons, but we won't stop at Spark ML. >> Okay, I wonder if we can talk use cases. Two of my favorite, well, let's just start with fraud. Fraud has become much, much better over the past, certainly, 10 years, but still not perfect. I don't know if perfection is achievable, but a lot of false positives. How will machine learning affect that?
Can we expect as consumers even better fraud detection in more real time? >> If we think of the full life cycle going from data to value, we will provide a better answer. We still use a machine learning algorithm to create models, but a model does not tell you what to do. It will tell you, okay, this credit card transaction coming in has a high probability of being fraud. Or this one has a lower probability. But then it's up to the designer of the overall application to make decisions, so what we recommend is to use machine learning for prediction, but not only that, and then use, maybe, (murmuring). For instance, if your machine learning model tells you this is fraud with a high probability, say 90%, and this is a customer you know very well, it's a 10-year customer you know very well, then you can be confident that it's fraud. Then if the next one tells you this is a 70% probability, but it's been a customer for one week. In a week, we don't know the customer, so the confidence we can put in the machine learning should be lower, and there you will not reject the transaction immediately. Maybe you don't approve it automatically, maybe you send a one-time passcode, or you route it into a separate review system, but you don't reject it outright. Really, the idea is to use machine learning predictions as yet another input for making decisions. You're making decisions informed by what you could learn from your past. But it's not replacing human decision-making. Our approach at IBM, you don't see IBM speak much about artificial intelligence in general, because we don't believe we're here to replace humans. We're here to assist humans, so we say augmented intelligence, or assistance. That's the role we see for machine learning. It will give you additional data so that you make better decisions. >> It's not the concept that you object to, it's the term artificial intelligence. It's really machine intelligence, it's not fake. >> I started my career with a PhD in artificial intelligence, I won't say when, but long enough ago. At that time, there were already promises that we would have Terminators in the next decade, and this and that. And the same happened in the '60s, or it was after the '60s. And then there was an AI winter, and we have a risk here of another AI winter, because some people are just making claims that are not substantiated, I believe. I don't think the technology's here that we can replace human decision-making altogether any time soon, but we can help. We can certainly make professionals more efficient, more productive with machine learning. >> Having said that, there are a lot of cognitive functions that are getting replaced, maybe not by so-called artificial intelligence, but certainly by machines and automation. >> Yes, so we're automating a number of things, and maybe we won't need to have people do quality checks, and just have an automated vision system detect defects. Sure, so we're automating more and more, but this is not new, it has been going on for centuries. >> Well, the list evolves. So, what can humans do that machines can't, and how would you expect that to change? >> We're moving away from IBM machine learning, but it is interesting. You know, each time there is a capability that a machine can automate, we basically redefine intelligence to exclude it, so you know. That's what I foresee. >> Yeah, well, robots a while ago, Stu, couldn't climb stairs, and now, look at that. >> Do we feel threatened because a robot can climb a stair faster than us?
Not necessarily. >> No, it doesn't bother us, right. Okay, question? >> Yeah, so I guess, bringing it back down to the solution that we're talking about today, if I'm now doing the analytics, the machine learning, on the mainframe, how do we make sure that we don't overrun and blow out all our MIPS? >> We recommend, so we are not using the mainframe's base compute system. We recommend using zIIPs, so additional specialty processors, to not overload the system, so it's a very important point. We claim, okay, if you do everything on the mainframe, you can learn from operational data. You don't want to disturb it, and "you don't want to disturb" takes on a lot of different meanings. One that you just said: you don't want to slow down your operational processing, because you're going to hurt your business. But you also want to be careful. Say we have a payment system where there is a machine learning model predicting fraud probability as part of the system. You don't want a young, bright data scientist deciding that he had a great idea, a great model, and pushing his model into production without asking anyone. So you want to control that. That's why we insist we are providing governance, and that includes a lot of things, like keeping track of how models were created and from which data sets, so lineage. We also want to have access control, and not allow just anyone to deploy a new model because we make it easy to deploy, so we want to have role-based access, and only someone with some executive, well, it depends on the customer, but not everybody can update the production system, and we want to support that. And that's something that differentiates us from open source. Open source developers, they don't care about governance. It's not their problem, but it is our customers' problem, so this solution will come with all the governance and integrity constraints you can expect from us. >> Can you speak to, the first solution's going to be on z/OS, what does the roadmap look like, and what are some of the challenges of rolling this out to other private cloud solutions? >> We are going to ship, this quarter, IBM Machine Learning for Z. It starts with Spark ML as a base open source. This is interesting, but it's not all there is for machine learning. So that's how we start. We're going to add more in the future. Last week we announced we will ship Anaconda, which is a major distribution for the Python ecosystem, and it includes a number of machine learning open source packages. We announced it for next quarter. >> I believe in the press release it said down the road things like TensorFlow are coming, H2O. >> Anaconda we announced for next quarter, so we will leverage it when it's out. Then indeed, we have a roadmap to include major open source, so the major open source are the ones from Anaconda (murmuring), mostly. Key deep learning, so TensorFlow, and probably one or two additional, we're still discussing. One that I'm very keen on is called XGBoost, in one word. People don't speak about it in newspapers, but this is what wins all Kaggle competitions. Kaggle is a machine learning competition site. When I say all, all that are not image recognition competitions. >> Dave: And that was ex-- >> XGBoost, X-G-B-O-O-S-T. >> Dave: XGBoost, okay. >> XGBoost, and it's-- >> Dave: X-ray gamma, right? >> It's really a package. When I say we don't know which package will win, XGBoost was introduced a year ago also, or maybe a bit more, but not so long ago, and now, if you have structured data, it is the best choice today.
It's really fast-moving, but so, we will support major deep learning packages and major classical machine learning packages, like the ones from Anaconda or XGBoost. The other thing is we start with Z. We announced in the analyst session that we will have a Power version, and a private cloud, meaning x86, version as well. I can't tell you when, because it's not firm, but it will come. >> And in public cloud as well, I guess, we'll, you've got components in the public cloud today, like the Watson Data Platform, that you've extracted and put here. >> We have extracted part of the Data Science Experience, so we've extracted notebooks and a graphical tool called ModelBuilder from DSX as part of IBM Machine Learning now, and we're going to add more of DSX as we go. But the goal is to really share code and function across private cloud and public cloud. As Rob Thomas defined it, we want with private cloud to offer all the features and functionality of public cloud, except that it would run inside a firewall. We are really developing machine learning and Watson machine learning on a common code base. It's an internal open source project. We share code, and then we ship on different platforms. >> I mean, you haven't, just now, used the word hybrid. Every now and then IBM does, but do you see that so-called hybrid use case as viable, or do you see it more, some workloads should run on prem, some should run in the cloud, and maybe they'll never come together? >> Machine learning, you basically have two phases: one is training and the other is scoring. I see people moving training to cloud quite easily, unless there is some regulation about data privacy. Training is a good fit for cloud because usually you need a large computing system, but only for a limited time, so elasticity's great. But then deployment, if you want to score a transaction in a CICS transaction, it has to run beside CICS, not in the cloud. If you want to score data on an IoT gateway, you want to score on the gateway, not in a data center. I would say that may not be what people think of first, but what will really drive the split between public cloud, private, and on prem is where you want to apply your machine learning models, where you want to score. For instance, smart watches are turning into health and fitness measurement systems. You want to score your health data on the watch, not on the internet somewhere. >> Right, and in that CICS example that you gave, you'd essentially be bringing the model to the CICS data, is that right? >> Yes, that's what we do. That's the value of machine learning for Z: if you want to score transactions happening on Z, you need to be running on Z. So it's clear, mainframe people, they don't want to hear about public cloud, so they will be the last ones moving. They have their reasons, but they like the mainframe because it stays really, really secure and private. >> Dave: Public cloud's a dirty word. >> Yes, yes, for Z users. At least that's what I was told, and I could check with many people. But we know that in general the move is toward public cloud, so we want to help people, depending on where they are in their journey to the cloud. >> You've got one of those, too. Jean Francois, thanks very much for coming on theCUBE, it was really a pleasure having you back. >> Thank you. >> You're welcome. Alright, keep it right there, everybody. We'll be back with our next guest. This is theCUBE, we're live from the Waldorf Astoria. IBM's machine learning announcement, be right back. (electronic keyboard music)
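Pulling two threads from this conversation together, here is a minimal sketch of the pattern Puget describes: an XGBoost model (the package he highlights) produces a fraud probability, and the application combines that score with what it knows about the customer before deciding. The features, thresholds, and tenure rules are hypothetical, purely for illustration:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((5000, 8))                   # transaction features (placeholder)
y = (rng.random(5000) < 0.05).astype(int)   # ~5% of transactions labeled fraud

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

def decide(txn, tenure_days):
    # The model's probability is one input to the decision, not the decision.
    p_fraud = model.predict_proba(txn.reshape(1, -1))[0, 1]
    if p_fraud > 0.9 and tenure_days > 3650:
        return "reject"    # 10-year customer, very high score: confident call
    if p_fraud > 0.7 and tenure_days < 30:
        return "step-up"   # week-old customer: one-time passcode, not rejection
    return "approve"

print(decide(rng.random(8), tenure_days=7))
```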
Rob Thomas, IBM | IBM Machine Learning Launch
>> Narrator: Live from New York, it's theCUBE. Covering the IBM Machine Learning Launch Event. Brought to you by IBM. Now, here are your hosts, Dave Vellante and Stu Miniman. >> Welcome back to New York City, everybody, this is theCUBE, we're here at the IBM Machine Learning Launch Event. Rob Thomas is here, he's the general manager of the IBM Analytics group. Rob, good to see you again. >> Dave, great to see you, thanks for being here. >> Yeah, it's our pleasure. So two years ago, IBM announced the Z platform, and the big theme was bringing analytics and transactions together. You guys are sort of extending that today, bringing machine learning. So the news just hit three minutes ago. >> Rob: Yep. >> Take us through what you announced. >> This is a big day for us. The announcement is we are going to bring machine learning to private Clouds, and my observation is this: you look at the world today, over 90% of the data in the world cannot be Googled. Why is that? It's because it's behind corporate firewalls. And as we've worked with clients over the last few years, sometimes they don't want to move their most sensitive data to the public Cloud yet, and so what we've done is we've taken the machine learning from IBM Watson, we've extracted that, and we're enabling that on private Clouds, and we're telling clients you can get the power of machine learning across any type of data, whether it's data in a warehouse, a database, unstructured content, email, you name it, we're bringing machine learning everywhere. To your point, we were thinking about, so where do we start? And we said, well, what is the world's most valuable data? It's the data on the mainframe. It's the transactional data that runs the retailers of the world, the banks of the world, insurance companies, airlines of the world, and so we said we're going to start there, because we can show clients how they can use machine learning to unlock value in their most valuable data. >> And which, you say private Cloud, of course, we're talking about the original private Cloud, >> Rob: Yeah. >> Which is the mainframe, right? >> Rob: Exactly. >> And I presume that you'll extend that to other platforms over time, is that right? >> Yeah, I mean, we're going to think about every place that data is managed behind a firewall; we want to enable machine learning as an ingredient. And so this is the first step, and we're going to be delivering every quarter starting next quarter, bringing it to other platforms, other repositories, because once clients get a taste of the idea of automating analytics with machine learning, what we call continuous intelligence, it changes the way they do analytics. And, so, demand will be off the charts here. >> So it's essentially Watson ML extracted and placed on Z, is that right? And describe how people are going to be using this and who's going to be using it. >> Sure, so Watson on the Cloud today is IBM's Cloud platform for artificial intelligence, cognitive computing, augmented intelligence. A component of that is machine learning. So we're bringing that as IBM Machine Learning, which will run today on the mainframe, and then in the future, other platforms. Now let's talk about what it does. What it is, is single-place, unified model management, so you can manage all your models from one place. And we've got really interesting technology that we pulled out of IBM Research, called CADS, which stands for Cognitive Assistant for Data Scientists.
And the idea behind CADS is, you don't have to know which algorithm to choose, we're going to choose the algorithm for you. You build your model, and we'll decide, based on all the algorithms available, the open-source ones, what you've built yourself, what IBM's provided, what's the best way to run it, and our focus here is, it's about productivity of data science and data scientists. No company has as many data scientists as they want, and so we've got to make the ones they do have vastly more productive, and so with technology like CADS, we're helping them do their job more efficiently and better. >> Yeah, CADS, we've talked about this in theCUBE before, it's like an algorithm to choose an algorithm, and it makes the best fit. >> Rob: Yeah. >> Okay. And you guys addressed some of the collaboration issues at your Watson Data Platform announcement last October, so talk about the personas who are asking you to give me access to mainframe data, and give me access to tooling that actually resides on this private Cloud. >> It's definitely a data science persona, but we see, I'd say, an emerging market where it's more the business analyst type that is saying, I'd really like to get at that data, but I haven't been able to do that easily in the past. So giving them a single pane of glass, if you will, with some light data science experience, where they can manage their models, using CADS to actually make it more productive. And then we have something called a feedback loop that's built into it, which is, you build a model running on Z, and as you get new data in, these are the largest transactional systems in the world, so there's data coming in every second. As you get new data in, that model is constantly updating. The model is learning from the data that's coming in, and it's becoming smarter. That's the whole idea behind machine learning in the first place. And that's what we've been able to enable here. Now, you and I have talked through the years, Dave, about IBM's investment in Spark. This is one of the first, I would say, world-class applications of Spark. We announced Spark on the mainframe last year; what we're bringing with IBM Machine Learning is leveraging Spark as an execution engine on the mainframe, and so I see this as Spark finally coming into the mainstream, when you talk about Spark accessing the world's greatest transactional data. >> Rob, I wonder if you can help our audience kind of squint through a compare and contrast, public Cloud versus what you're offering today, 'cause one thing, public Clouds keep adding new services, and machine learning seemed like one of those areas that they would add, like IBM had done with a machine learning platform. Streaming, absolutely, you hear mobile, streaming applications absolutely happened in the public Cloud. Is cost similar in private Cloud? Can I get all the services? How will IBM and your customer base keep up with that pace of innovation that we've seen from IBM and others in the public Cloud, on prem? >> Yeah, so, look, my view is it's not an either-or. Because when you look at this valuable data, clients want to do some of it in public Cloud, and they want to keep a lot of it in the system that they built on prem. So our job is, how do we actually bridge that gap? So I see machine learning, like we've talked about, becoming much more of a hybrid capability over time, because the data they want to move to the Cloud, they should do that. The economics are great. The data, doing it on private Cloud, actually the economics are tremendous as well.
And so we're delivering an elastic infrastructure on private Cloud as well that can scale like the public Cloud. So to me it's not either-or, it's about what everybody wants as Cloud features. They want the elasticity, they want a cloud-style interface, they want the economics of Cloud, and our job is to deliver that in both places. Whether it's on the public Cloud, which we're doing, or on the private Cloud. >> Yeah, one of the thought exercises I've gone through is, if you follow the data, and follow the applications, it's going to show you where customers are going to do things. If you look at IoT, if you look at healthcare, there are lots of uses where it's going to be on prem, it's going to be on the edge. I got to interview Walmart a couple of years ago at the IBM Edge show, and they leveraged Z globally to run their sales, their enablement, and obviously they're not going to use AWS as their platform. What are the trends, what do you hear from your customers, how much of the data, are there reasons why it needs to stay at the edge? It's not just compliance and governance, it's just because that's where the data is, and I think you were saying there's just so much data on the Z series itself compared to other environments. >> Yeah, and it's not just the mainframe, right? Let's be honest, there's just massive amounts of data that still sits behind corporate firewalls. And while I believe the end destination is that a lot of that will be on public Cloud, what do you do now? Because you can't wait until that future arrives. And so, the biggest change I've seen in the market in the last year is clients are building private Clouds. It's not traditional on-premise deployments; they're building an elastic infrastructure behind their firewall. You see it a lot in heavily-regulated industries, so financial services, where they're dealing with things like GDPR, any type of retailer who's dealing with things like PCI compliance. Heavily-regulated industries are saying, we want to move there, but we've got challenges to solve right now. And so, our mission is, we want to make data simple and accessible, wherever it is, on private Cloud or public Cloud, and help clients on that journey. >> Okay, so carrying through on that, so you're now unlocking access to mainframe data, great. If I have, say, a retail example, and I've got some data science, I'm building some models, I'm accessing the mainframe data, if I have data that's elsewhere in the cloud, how specifically, with regard to this announcement, will a practitioner execute on that? >> Yeah, so, one is you could decide on one place that you want to land your data and have it be resident, so you could do that. We have scenarios where clients are using Data Science Experience on the Cloud, but they're actually leaving the data behind the firewalls. So we don't require them to move the data; our model is one of flexibility in terms of how they want to manage their data assets, which I think is unique in terms of IBM's approach. Others in the market say, if you want to use our tools, you have to move your data to our Cloud; some of them even say, as you click through the terms, now we own your data, now we own your insights. That's not our approach. Our view is it's your data. If you want to run the applications in the Cloud and leave the data where it is, that's fine. If you want to move both to the Cloud, that's fine. If you want to leave both on private Cloud, that's fine.
We have capabilities like Big SQL, where we can actually federate data across public and private Clouds, so we're trying to provide choice and flexibility when it comes to this. >> And, Rob, in the context of this announcement, that would be, that example you gave, would be done through APIs that allow me access to that Cloud data, is that right? >> Yeah, exactly, yes. >> Dave: Okay. >> So last year we announced something called Data Connect, which is basically, think of it as a bus between private and public Cloud. You can leverage Data Connect to seamlessly and easily move data. It's very high-speed, it uses our Aspera technology under the covers, so you can do that. >> Dave: A recent acquisition. >> Rob, IBM's been very active in open source engagement, in trying to help the industry sort out some of the challenges out there. Where do you see the state of the machine learning frameworks? Google of course has TensorFlow, we've seen Amazon pushing MXNet, is IBM supporting all of them, are there certain horses that you have strong feelings for? What are your customers telling you? >> I believe in openness and choice. So with IBM Machine Learning you can choose your language, you can use Scala, you can use Java, you can use Python, more to come. You can choose your framework. We're starting with Spark ML because that's where we have our competency and that's where we see a lot of client desire. But I'm open to clients using other frameworks over time as well, so we'll start to bring that in. I think the IT industry always wants to kind of put people into a box. This is the model you should use. That's not our approach. Our approach is, you can use the language, you can use the framework that you want, and through things like IBM Machine Learning, we give you the ability to tap this data that is your most valuable data. >> Yeah, the box today has just become this mosaic, and you have to provide access to all the pieces of that mosaic. One of the things that practitioners tell us is they struggle sometimes, and I wonder if you could weigh in on this, whether to invest in improving the model or in capturing more data, since they have limited budget. And I've had people tell me, no, you're way better off getting more data in; I've had people say, no, no, now with machine learning we can advance the models. What are you seeing there, what are you advising customers in that regard? >> So, compute has become relatively cheap, which is good. Data acquisition has become relatively cheap. So my view is, go full speed ahead on both of those. The value comes from the right algorithms and the right models. That's where the value is. And so I encourage clients to even think about maybe separating your teams. You have one that's focused on data acquisition and how you do that, and another team that's focused on model development, algorithm development. Because otherwise, if you give somebody both jobs, they both get done halfway, typically. And the value is from the right models, the right algorithms, so that's where we stress the focus. >> And models to date have been okay, but there's a lot of room for improvement. Like, the two examples I like to use are retargeting, ad retargeting, which, as we all know as consumers, is not great. You buy something and then you get targeted for another week. And then fraud detection, which is actually, for the last ten years, quite good, but there's still a lot of false positives.
Where do you see IBM machine learning taking that practical use case in terms of improving those models? >> Yeah, so why are there false positives? The issue typically comes down to the quality of data and the amount of data that you have; that's why. Let me give an example. So one of the clients that's going to be talking at our event this afternoon is Argus, who's focused on the healthcare space. >> Dave: Yeah, we're going to have him on here as well. >> Excellent. So Argus is basically, they collect data across payers; they're focused on healthcare: payers, providers, pharmacy benefit managers. And their whole mission is how do we cost-effectively serve different scenarios or different diseases, in this case diabetes, and how do we make sure we're getting the right care at the right time? So they've got all that data on the mainframe, they're constantly getting new data in, it could be about blood sugar levels, it could be about glucose, it could be about changes in blood pressure. Their models will get smarter over time because they built them with IBM machine learning, so that what's cost-effective today may not be the most effective or cost-effective solution tomorrow. But we're giving them that continuous intelligence as data comes in to do that. That is the value of machine learning. I think sometimes people miss that point: they think it's just about making the data scientists' job easier. That productivity is part of it, but it's really about the veracity of the data and that you're constantly updating your models. >> And the patient outcome there, I read through some of the notes earlier, is if I can essentially opt in to allow the system to adjudicate the medication or the claim, and if I do so, I can get that instantaneously or in near real-time, as opposed to having to wait weeks and phone calls and haggling. Is that right, did I get that right? >> That's right, and look, there's two dimensions. It's the cost of treatment, so you want to optimize that, and then it's the effectiveness. And which one's more important? Well, they're both actually critically important. And so what we're doing with Argus is helping them build models where they deploy this so that they're optimizing both of those. >> Right, and in that case, again, back to the personas, that would be, and you guys stressed this at your announcement last October, it's the data scientist, it's the data engineer, it's, I guess, even the application developer, right? Involved in that type of collaboration. >> My hope would be over time, when I talked about how we view machine learning as an ingredient everywhere that data is, is that you want to embed machine learning into any applications that are built. And at that point you no longer need a data scientist per se for that case; you can just have the app developer that's incorporating that. Whereas for another tough challenge like the one we discussed, that's where you need data scientists. So think about it: you need to divide and conquer the machine learning problem, where the data scientist can play, the business analyst can play, the app developers can play, the data engineers can play, and that's what we're enabling. >> And how does streaming fit in? We talked earlier about this sort of batch, interactive, and now you have this continuous sort of workload. How does streaming fit? >> So we use streaming in a few ways. One is very high-speed data ingest; it's a good way to get data into the Cloud. We also can do analytics on the fly.
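[Editor's note: Rob is describing IBM's own streaming engine; as an illustration of the same "analytics on the fly" pattern in this document's Spark context, here is a sketch with Spark Structured Streaming. It reuses the model variable from the previous sketch; the Kafka broker and topic are placeholders, and the spark-sql-kafka package is assumed to be on the classpath.]

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    schema = StructType([
        StructField("txn_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("merchant_risk", DoubleType()),
        StructField("txn_per_hour", DoubleType()),
    ])

    # Continuous ingest from a placeholder Kafka topic.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "transactions")
              .load())

    parsed = (events
              .select(from_json(col("value").cast("string"), schema).alias("t"))
              .select("t.*"))

    # Reuse the fitted pipeline model: each arriving event is scored in-flight.
    scored = model.transform(parsed)
    query = scored.writeStream.outputMode("append").format("console").start()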
So a lot of our use cases around streaming are ones where we actually build analytical models into the streaming engine so that you're doing analytics on the fly. So I view that as a different side of the same coin. It's kind of based on your use case: how fast you're ingesting data, whether you need, you know, sub-millisecond response times. If you constantly have data coming in, you need something like a streaming engine to do that. >> And it's actually consolidating that data pipeline, is what you described, which is big in terms of simplifying the complexity, this mosaic of Hadoop, for example, and that's a big value proposition of Spark. Alright, we'll give you the last word, you've got an audience outside waiting, big announcement today; final thoughts. >> You know, we talked about machine learning for a long time. I'll give you an analogy. So in 1896, Charles Brady King is the first person to drive an automobile down the street in Detroit. It was 20 years later before Henry Ford actually turned it from a novelty into mass appeal. So it was like a 20-year incubation period before you could actually automate it, you could make it more cost-effective, you could make it simpler and easy. I feel like we're kind of in the same thing here, where the data era in my mind began around the turn of the century. Companies came onto the internet, started to collect a lot more data. It's taken us a while to get to the point where we could actually make this really easy and do it at scale. And people have been wanting to do machine learning for years. It starts today. So we're excited about that. >> Yeah, and we saw the same thing with the steam engine, it was decades before it actually was perfected, and now the timeframe in our industry is compressed to years, sometimes months. >> Rob: Exactly. >> Alright, Rob, thanks very much for coming on theCUBE. Good luck with the announcement today. >> Thank you. >> Good to see you again. >> Thank you guys. >> Alright, keep it right there, everybody. We'll be right back with our next guest, we're live from the Waldorf Astoria, the IBM Machine Learning Launch Event. Be right back. [electronic music]
Manish Gupta, Redis Labs | Spark Summit East 2017
>> Announcer: Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts Dave Vellante and George Gilbert. >> Welcome back to snowy Boston, everybody. This is theCUBE, the leader in live tech coverage. We're here at Spark Summit East, hashtag SparkSummit. Manish Gupta is here, he's the CMO at Redis Labs. Manish, welcome to theCUBE. >> Thank you, good to be here. >> So, you know, 10 years ago you'd say you're in the database business and everybody would yawn. Now you're the life of the party. >> Yeah, the world has changed. I think the party has lots and lots of players. We are happy to be on the top of that heap. >> It is a crowded space, so how does Redis Labs differentiate? >> Redis Labs is the company behind the massively popular open source Redis, and Redis became popular because of its performance primarily, and then simplicity. Developers could very easily spin up an instance of Redis, solve some very hairy problems, and time to market was a big issue for them. Redis Enterprise took that forward and enabled it to be mission critical, ready for the largest workloads, ready for the things that enterprises need in a highly distributed, clustered environment. So they have resilience and they benefit from the performance of Redis. >> And your claim to fame, as you say, is that top-gun performance, you guys will talk about some of the benchmarks later. We're talking about use cases like fraud detection, as an example. Obviously ad serving would be another one. But add some color to that if you would. >> Wherever you need to make real time real, Redis plays a very important role. It is able to deliver millions of operations per second with sub-millisecond latency, and that's the hallmark. With the data structures that comprise Redis, you can solve the problems in a new way, and the reason you can get that performance is because the data structures take some very complex issues and simplify the operation. Depending on the use case, you could use one of the data structures, you can mix and match the data structures, so that's the power of Redis. We're used for IoT, for machine learning, for metering and billing in telecommunications environments, for personalization, for ad serving with companies like Groupon and others, and the list goes on and on. >> Yeah, you've got a big list on your website of all your customers, so you can check that out. Let's get the business model piece out of the way. Everybody's always fascinated. Okay, you got open source, how do you make money? How does Redis make money? >> Yeah, you know, we believe strategically fostering the growth of open source is foundational in our business model, and we invest heavily, both R&D and marketing, to do that. On top of that, to enable enterprise success and deployment of Redis, we have the mission critical, highly available Redis Enterprise offerings. Our monetization is entirely based on the Redis Enterprise platform, which takes advantage of the data structures and performance of core Redis, but layers on top management and the capabilities that make things like auto-recovery, auto-sharding, and management much, much easier for the enterprise. We make that available in four deployment models. The enterprise can select our Redis Cloud, which runs on public infrastructure on any of the four major platforms. We also allow for the enterprise to select a VPC environment in their own private clouds.
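[Editor's note: the data structures Manish credits for that performance are visible directly from a client. A small sketch using the redis-py package and its current API; keys, fields, and scores are invented.]

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # A hash stores a user profile as one keyed record.
    r.hset("user:1001", "name", "Ada")
    r.hset("user:1001", "segment", "frequent-buyer")

    # A sorted set ranks items by score: the shape of a recommendation
    # or ad-serving lookup.
    r.zadd("recs:user:1001", {"sku:42": 0.91, "sku:77": 0.85})

    # Top-N retrieval is a single logarithmic-time range query.
    print(r.zrevrange("recs:user:1001", 0, 4, withscores=True))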
They can also get our software and self-manage it, or get our software and we can manage it for them. Four deployment options are the modalities through which enterprise customers help us monetize. >> When you said four major platforms, you meant cloud platforms? >> That's right. AWS, >> So, AWS, Azure >> Azure, Google, and IBM. >> Is IBM software, got there in the fourth, alright. >> That's right, all four. >> Go to the whip IBM. Go ahead, George. >> Along the lines of the business model, and we were sort of starting to talk about this earlier offline, you're just one component in building an application, and there's always this challenge of, well, I can manage my component better than anyone else, but it's got to fit with a bunch of other vendors' components. How do you make that seamless to the customer so that it's not defaulting over to a cloud vendor who has to build all the components themselves to make it work together? >> Certainly, you know, the database is an integral part of your stack, of your application stack, but it is a stack, so there are other components. Redis and Redis Labs have a very, very large ecosystem within which we operate. We work closely with others on interfaces, on connectors, on interoperability, and that's a sustained environment that we invest in on a continuous basis. >> How do you handle application consistency? A lot, in the no-SQL world, even in the AWS world, you hear about eventual consistency, but in the real-time world, there's a need for something more rigorous. What's your philosophy there, how do you approach that? >> I think that's an issue that many no-SQL vendors have not been able to crack. Redis Labs has been at the forefront of that. We are taking an approach, and we are offering, what we call tuneable consistency. Depending on the economics and the business model and the use case, the needs for consistency vary. In some cases, you do need immediate consistency. In other cases, you don't ever need consistency. And to give that flexibility to the customer is very important, so we've taken the approach where you can go from loose consistency to what we call strong eventual consistency. That approach is based on a fairly well trusted architecture and approach called CRDTs, Conflict-free Replicated Data Types. That approach allows us, regardless of what the cluster magnitude or the distribution looks like geographically, to deliver strong eventual consistency, which meets the needs of the majority of customers. >> What are you seeing in terms of, you know, also in that, a discussion about ACID properties, and how many workloads really need ACID properties. What are you seeing now, as you get more cloud native workloads and more no-SQL oriented workloads, in terms of the requirement for those ACID properties? >> First of all, we truly believe and agree that not all environments require ACID support. Having said that, to be a truly credible database, you must support ACID, and we do. Redis is ACID-compliant, supports ACID, and Redis Labs certainly supports that. >> I remember on a stage once with Curt Monash, I'm sure you know Curt, right? Very famous database person. And he basically had a similar answer. But you would say that increasingly there are workloads, the growth workloads, that don't necessarily require that, is that a fair statement? >> That's a fair statement, I would say. >> Dave: Great, good.
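[Editor's note: Redis Enterprise's CRDT implementation is its own; as a toy illustration of why conflict-free replicated data types can give strong eventual consistency, here is a minimal grow-only counter.]

    # Toy G-Counter CRDT: replicas accept writes independently and still
    # converge. This is an illustration, not Redis Labs' implementation.

    class GCounter:
        def __init__(self, replica_id, replicas):
            self.replica_id = replica_id
            self.counts = {r: 0 for r in replicas}  # one slot per replica

        def increment(self, n=1):
            # A replica only ever advances its own slot, so writes from
            # different replicas can never conflict.
            self.counts[self.replica_id] += n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Element-wise max is commutative, associative, and idempotent,
            # so merges can happen in any order, any number of times.
            for r, c in other.counts.items():
                self.counts[r] = max(self.counts.get(r, 0), c)

    # Two replicas accept writes while partitioned, then reconcile.
    a = GCounter("a", ["a", "b"])
    b = GCounter("b", ["a", "b"])
    a.increment(3)
    b.increment(2)
    a.merge(b)
    b.merge(a)
    assert a.value() == b.value() == 5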
>> There's a trade-off, though. When you talk about strong eventual consistency, potentially you have to wait for, presumably, a quorum of the partitions, I'm getting really technical here, but in other words, you've got a copy of the data here-- >> Dave: Good CMO question. (laughing) >> But your value proposition to the customers is, we get this stuff done fast, and if you have to wait for a couple of other servers to make sure that they've got the update, that can slow things way down. How does that trade-off work? >> I think that's part of the power of our architecture. We have a shared-nothing, single-proxy architecture where all of the replication, the disaster recovery, and the consistency management of the back end is handled by the proxy, and we ensure that the performance is not degraded when you are working through the consistency challenges. And that's where a significant amount of IP is: in the development of that proxy. >> I'll take that as a, let's go into it even more offline. >> Manish: Sounds good. >> And I have some other CMO questions, if I may. A lot of young companies like yours, especially in the open source world, when they go to get the word out, they rely on their community, their open source community, and that's the core, and that makes a lot of sense, it's their peeps. As you grow more into enterprise grade apps and workloads, how do you extend beyond that? What is Redis Labs doing to sort of reach that C-Suite? Are you even trying to reach that C-Suite and up-level the messaging? How do you as a CMO deal with those challenges? >> Maybe I'll begin by talking about the personas that matter to us in the ecosystem. At the enterprise level, the architects, the developers, are the primary target, which we try to influence in the early part of the decision cycle, at the architectural level. The ultimate teams that manage, run, and operate the infrastructure are certainly the DevOps or operations teams, and we spend time there. All along, for some of the enterprise engagements, CIOs, chief data officers, and CTOs tend to play a very important role in the decisions and the selection process, and so we do influence and interact with the C-Suite quite heavily. What the power of open source gives us is that groundswell of love for Redis. Literally you can walk around a developer environment, such as the Spark Summit here, and you'll find people wearing Redis Geek shirts. And we get emails from Kazakhstan and strange places from all over the world where we don't necessarily have a sales force, requesting t-shirts, "send us stickers." Because people love Redis, and the word of mouth, that ground level love for the technology, makes the decisions so much easier and smoother. We're not convincing; it's not a philosophical battle anymore. It's simply about the use case and the solution where Redis Enterprise fits or doesn't fit. >> Okay, so it really is that core developer community that are your advocates, and they're able to internally sell to the C-Suite. A lot of times the C-Suite, not the CTO so much, but certainly the CIO, CDO are like, "Yeah, yeah, they're geekin' out on some new hot thing. What's the business impact?" Do you get that question a lot, and how do you address it? >> I think then you get to some of the very basic tools, ROI calculators and the value proposition. For the C-level, the message is very simple. We are the least risky bet. We are the best long-term proposition, and we are the best cost answer for their implementation.
Particularly as the needs are increasingly becoming more real-time in nature, they are not batch processed. Yes, there will always be some of that, but as the workloads change, there is a need for faster processing, there is a need for quick insights, and real-time is not just a moniker anymore, right? Real-time truly needs to be delivered today. And so, I think those three propositions for the C-Suite are resonating very well. >> Let's talk about ROI calculators for a second. I love talking about it because it underscores what a company feels its core value proposition is. I would think with Redis Labs part of the value proposition is you are enabling new types of workloads and new types of, whether it's sources of revenue or productivity. And these are generally telephone numbers as compared to some of the cost savings head to head against your competition, which of course you want to stress as well, because the CFO cares about the CAPEX. What do you emphasize in that, and we don't have to get into the calculator itself, but in the conceptual model, what's the emphasis? Is it on those sort of business value attributes, or is it on the sort of cost savings? How do you translate performance into that business value? A lot of questions there, but if you could summarize, that'd be great. >> Well, I think you can think of it in three dimensions. The very first one is, does the performance support the use case or the solution that is required? That's the very first one, and in our books, that's operations per second and the latency. The second piece is the cost side, and that has two components to it. The first component is, what are the compute requirements? So, what is the infrastructure underneath that has to support it? And the efficiency that Redis and Redis Enterprise have is dramatically superior to the alternatives. And so, the economics show up. To run a million operations per second, we can do that on two nodes, as opposed to alternatives, which might need 50 nodes or 300 nodes. >> You can utilize your assets on the floor much better than maybe the competition can. >> This is where the data structures come into play quite a bit. That's one part of-- >> Dave: That's one part of the cost. >> Yeah. The other part of the cost is the human cost. >> Dave: People, yeah. >> And because, and this goes back to the open source, the people are available with the talent and the competency and appreciation for Redis; it's easy to procure those people, and your cost of acquisition and deployment goes down quite a bit. So, there's a human cost to it. The third dimension to this whole equation is time to market. And time to market is measured in many ways. Is it lost revenue if it takes you longer to get there? And Redis consistently, across multiple analysts' reports, gets top ranking for the fastest way to get to market because of how simple it is. Beyond performance, simplicity is a second hallmark. >> That's a benefit acceleration, and you can quantify that. >> Absolutely, absolutely. And that's a revenue parameter, right. >> For years, people have been saying this Cambrian explosion of databases is unsustainable, and sort of in response we've gotten a squaring of the Cambrian explosion.
The question is, with your sort of very flexible, I don't want to get too geeky, 'cause Dave'll cut me off, but with the idea that you can accommodate time series and all these different types of data, are we approaching a situation where customers can start consolidating their database choices and have fewer vendors, fewer products in their landscape? >> I think not only are we getting there, but we must get there. You've got over 300 databases in the marketplace, and imagine a CIO or an architect trying to sort through that to make a decision; it's difficult, and you certainly cannot support it from a training standpoint or from an investment, CAPEX, and all that standpoint. What we have done with Redis is introduce something called Redis Modules. We released that at the last RedisConf in May in San Francisco. And the Redis Module is a very simple concept but a very powerful concept. It's an API which can be utilized to take an existing development effort, written in C/C++, and port it onto the Redis data structures. This gives you the flexibility, without having to reinvent the wheel every single time, to take that investment and port it on top of Redis, and you get the performance, and now Redis becomes a multi-model database. And I'm going to get to your answer of how you address the multiple needs so you don't need multiple databases. To give you some examples, since the introduction of Redis Modules, we now have over 50 modules that have been published by a variety of places, not just Redis Labs, which indicates how simple and how powerful this model is. We took Lucene and developed the world's fastest full-text search engine as a module. We have very recently introduced Redis machine learning as a module that works with Spark ML and serves as a great serving layer in the machine learning domain. Just two very simple examples, but work that's being done, ported over onto Redis data structures, and now you have the ability to do some very powerful things because of what Redis is. And this is the way the future's going to be. I think every database is trying to offer multi-functionality, to be multi-model in nature, but instead of doing it one step at a time, this approach gives us the ability to leverage the entire ecosystem. >> Your point being consolidation's inevitable in this business as well. >> Manish: Architectural consolidation. >> Yes, but also you would think, company consolidation, isn't that going to follow? What do you make of the market, and tell me, if you look back on the database market, what Oracle was able to achieve in the face of, maybe not as many players, but you had Sybase and Informix, and certainly DB2's still around, and SQL Server's still around, but Oracle won, and maybe it was SQL standards that did it. It's great to be lucky and good. Can we learn from that, or is this a whole different world? Are there similarities, and how do you see that consolidation potentially shaking out, if you agree that there will be consolidation? >> Yeah, there has to be, first and foremost, an architectural approach that solves the OPEX, CAPEX challenge for the enterprise. But beyond that, no industry can sustain the diversity and the fragmentation that exists in the database world. I think there will always be new things coming out of universities particularly. There's great innovation and research happening, and that is required to augment.
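[Editor's note: on the Redis Modules point above, module commands travel over the same wire protocol as core Redis commands, so a stock client can call them. This sketch assumes a server started with the RediSearch full-text module loaded and follows the early FT.* command shapes, which may differ in later versions; the index and document are invented.]

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # A module simply registers new commands alongside GET, SET, ZADD, etc.
    r.execute_command("FT.CREATE", "products", "SCHEMA", "title", "TEXT")
    r.execute_command("FT.ADD", "products", "doc:1", "1.0",
                      "FIELDS", "title", "red running shoes")

    print(r.execute_command("FT.SEARCH", "products", "running"))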
But at the end of the day, the commercial enterprises cannot support the fragmented volume that we have today in the database world, so there is going to be some consolidation, and it's not unnatural. I think it's natural, it's expected; time will tell what that looks like. We've seen some of our competitors acquire smaller companies to add graph functionality, to add search functionality. We just don't think that's the level of consolidation that really moves the needle for the industry. It's got to be at a higher level of consolidation. >> I don't want to, don't take this the wrong way, don't hate me for saying it, but is Oracle sort of the enemy, if I can say that. I mean, it's like, no, okay. >> Depends how you define enemy. >> I'm not going to go do many of the workloads that you're talking about on Oracle, despite what Larry tells me at Oracle OpenWorld. And I'm not going to make Oracle my choice for any of the workloads that you guys are working on. I guess, I mean, everybody who's in the database business looks at that and says, "Hey, we can do it cheaper, better, more productively," but could you respond to that, and what do you make of Amazon's moves in the database world? Does that concern you? >> We think of Amazon and Oracle as two very different philosophies, if you can use that word. The approach we have taken is really a forward-looking approach and philosophy. We believe that the needs of the market need to be solved in new ways, and new ways should not be encumbered by old approaches. We're not trying to go and replicate what was done in the SQL world or in a relational database world. Our approach is, how do you deliver a multi-model database that has the real-time attribute attached to it, in a way that requires very limited compute horsepower and very few resources to manage? You take all of those things as kind of the core philosophy, which is a forward-looking philosophy. We are definitely not trying to replicate what an Oracle used to be. AWS I think is a very different animal. >> Dave: Interesting, though. >> They have defined the cloud, and I think they play a very important role. We are a strong partner of theirs; much of our traffic runs on AWS infrastructure, certainly also on other clouds. I think AWS is one to watch in how they evolve. They have database offerings, including Redis offerings. However, we fully recognize, and the industry recognizes, that that's not at the same capability as Redis Enterprise. It's open source Redis managed by AWS, and that's fine as a cache, but you cannot persist, and you really cannot have a multi-model capability that's a full database in that approach. >> And you're in the marketplace. >> Manish: We are in the marketplace. >> Obviously. >> And actually, we announced a few weeks ago that you can buy and get Redis Cloud access, which is Redis Enterprise cloud, on AWS through the integrated billing approach on their marketplace. You can have an AWS account and get our service, the true Redis Enterprise service. >> And as a software company, you'd figure, okay, the cloud infrastructure's a service; we don't care what infrastructure it runs on, whatever the customer wants. But you see AWS making these moves up-market, you've got to obviously be paying attention to that. >> Manish: Certainly, certainly. >> Go ahead, last question. >> Interesting that you were saying that to solve this problem of proliferation of choice, it has to be multi-model with speed and low resource requirements.
If I were to interpret that from an old-style database perspective, it would be, the multi-model part is something you are addressing now with the extensibility, but the speed means taking out that abstraction layer that was the query optimizer, sort of, and working almost at the storage layer, or having an option to do that. Would that be a fair way to say it? >> No, I don't think that necessarily needs to be the case. For us, speed comes from the simplicity and the power of the data structures. Instead of having to serialize and deserialize before you process data in a Spark context, or instead of having to look for data that is perhaps not put in sorted sets for a use case that you might be running a query on, if the data is already handled through one of the data structures, you now have a much faster query time; you have the ability to reach the data in the right way. And again, this is no-SQL, right, so it's schema-less on write, and it sets your schema as you want it to be on read. We marry that with the data structures, and that gives you the ultimate speed. >> We have to leave it there, but Manish, I'll give you the last word. Things we should be paying attention to for Redis Labs this year, events, announcements? >> I think the big thing I would leave the audience with is RedisConf 2017. It's May 31 to June 2 in San Francisco. We are expecting over 1,000 people. The brightest minds around Redis and the database world will be there, and anybody who is considering deploying a next generation database should attend. >> Dave: Where are you doing that? >> It's the Marriott Marquis in San Francisco. >> Great, is that on Howard Street, across from the--? >> It is right across from Moscone. >> Great, awesome location. People know it, easy to get to. Well, congratulations on the success. We'll be lookin' for outputs from that event, and hope to see you again on theCUBE. >> Thank you, enjoyed the conversation. >> Alright, good. Keep it right there, everybody, we'll be back with our next guest. This is theCUBE, we're live from Spark Summit East. Be right back. (upbeat electronic rock music)
Nick Pentreath, IBM STC - Spark Summit East 2017 - #sparksummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is The Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Boston, everybody. Nick Pentreath is here; he's a principal engineer at the IBM Spark Technology Center in South Africa. Welcome to The Cube. >> Thank you. >> Great to see you. >> Great to see you. >> So let's see, it's a different time of year here than you're used to. >> I've flown from, I don't know the Fahrenheit equivalent, but 30 degrees Celsius heat and sunshine, to snow and sleet, so. >> Yeah, yeah. So it's a lot chillier there. Wait until tomorrow. But, so we were joking. You probably get the T-shirt for the longest flight here, so welcome. >> Yeah, I actually need the parka, or like a beanie. (all laugh) >> Little better. Long sleeve. So Nick, tell us about the Spark Technology Center, STC is its acronym, and your role there. >> Sure, yeah, thank you. So the Spark Technology Center was formed by IBM a little over a year ago, and its mission is to focus on the Open Source world, particularly Apache Spark and the ecosystem around that, and to really drive forward the community and to make contributions to both the core project and the ecosystem. The overarching goal is to help drive adoption, yeah, particularly with enterprise customers, the kind of customers that IBM typically serves, and to harden Spark and to make it really enterprise ready. >> So why Spark? I mean, we've watched IBM do this now for several years. The famous example that I like to use is Linux. When IBM put $1 billion into Linux, it really went all in on Open Source, and it drove a lot of IBM value, both internally and externally for customers. So what was it about Spark? I mean, you could have made a similar bet on Hadoop. You decided not to; you sort of waited to see that market evolve. What was the catalyst for having you guys all go in on Spark? >> Yeah, good question. I don't know all the details, certainly, of what the internal drivers were, because I joined the STC a little under a year ago, so I'm fairly new. >> Translate the hallway talk, maybe. (Nick laughs) >> Essentially, I think you raise very good parallels to Linux and also Java. >> Absolutely. >> So Spark, sorry, IBM, made these investments in Open Source technologies that it sees as transformational and kind of game-changing. And I think, you know, most people within IBM will probably admit that they maybe missed the boat, actually, on Hadoop, and saw Spark as the successor, and actually saw a chance to really dive into that and kind of almost leapfrog and say, "We're going to back this as the next generation analytics platform and operating system for analytics and big data in the enterprise." >> Well, I don't know if you happened to watch the Super Bowl, but there's a saying that it's sometimes better to be lucky than good. (Nick laughs) And that sort of applies, and so, in some respects, maybe missing the window on Hadoop was not a bad thing for IBM. >> Yeah, exactly, because not a lot of people made a ton of dough on Hadoop, and they're still sort of struggling to figure it out. And now along comes Spark, and you've got this more real time nature. IBM talks a lot about bringing analytics and transactions together. They've made some announcements about that and affecting business outcomes in near real time. I mean, that's really what it's all about, and one of your areas of expertise is machine learning.
And so, talk about that relationship and what it means for organizations, your mission. >> Yeah, machine learning is a key part of the mission. And you've seen the kind of big data in the enterprise story, starting with Hadoop and data lakes, and that's evolved. Before, we just dumped all of this data into these data lakes and these silos, and maybe we had some Hadoop jobs and so on. But now we've got all this data we can store; what are we actually going to do with it? So part of that is the traditional data warehousing and business intelligence and analytics, but more and more, we're seeing there's rich value in this data, and to unlock it, you really need intelligent systems. You need machine learning, you need AI, you need real time decision making that starts transcending the boundaries of all the rule-based systems and human-based systems. So we see machine learning as one of the key tools and one of the key unlockers of value in these enterprise data stores. >> So Nick, perhaps paint us a picture of someone who's advanced enough to be working with machine learning with IBM, and we know that the tool chain's kind of immature. Although IBM, with DataWorks or DataFirst, has a fairly broad end-to-end sort of suite of tools, what are the early use cases? And what needs to mature to go into higher volume production apps or higher-value production apps? >> I think the early use cases for machine learning in general, and certainly at scale, are numerous and they're growing, but classic examples are, let's say, recommendation engines. That's an area that's close to my heart. In my previous life before IBM, I built a startup that had a recommendation engine service targeting online stores and e-commerce players and social networks and so on. So this is a great kind of example use case. We've got all this data about, let's say, customer behavior in your retail store or your video-sharing site, and in order to serve those customers better and make more money, if you can make good recommendations about what they should buy, what they should watch, or what they should listen to, that's a classic use case for machine learning and unlocking the data that is there. So that is one of the drivers of some of these systems; players like Amazon are sort of good examples of the recommendation use case. Another is fraud detection, and that is a classic example in financial services enterprises, which are a kind of staple of IBM's customer base. So these are a couple of examples of the use cases, but the tool sets, traditionally, have been kind of cumbersome. So Amazon built everything from scratch themselves using customized systems, and they've got teams and teams of people. Nowadays, you've got this built into Apache Spark, you've got Spark's machine learning library, you've got good models to do that kind of thing. So I think from an algorithmic perspective, there's been a lot of advancement, and there's a lot of standardization and almost commoditization on the model side. So what is missing? >> George: Yeah, what else? >> And what are the shortfalls currently? So there's a big difference between the current view, I guess the hype, of machine learning: you've got data, you apply some machine learning, and then you get profit, right? But really, there's a hugely complex workflow that involves this end-to-end story.
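[Editor's note: the recommendation engine use case Nick describes maps onto Spark's built-in ALS (alternating least squares) implementation. A sketch using the current PySpark API; ratings_df is an assumed DataFrame of (user, item, rating) rows.]

    from pyspark.ml.recommendation import ALS

    als = ALS(userCol="user", itemCol="item", ratingCol="rating",
              coldStartStrategy="drop")  # skip users/items unseen in training
    model = als.fit(ratings_df)

    # Five ranked recommendations per user, computed from the learned factors.
    model.recommendForAllUsers(5).show(truncate=False)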
You've got data coming from various data sources, you have to feed it into one centralized system, transform and process it, extract your features, and do your sort of hardcore data science, which is the core piece that everyone sort of thinks about as the only piece, but that's kind of in the middle, and it makes up a relatively small proportion of the overall chain. And once you've got that, you do model training, selection, and testing, and you now have to take that model, that machine-learning algorithm, and deploy it into a real system to make real decisions. And that's not even the end of it, because once you've got that, you need to close the loop, what we call the feedback loop, and you need to monitor the performance of that model in the real world. You need to make sure that it's not deteriorating, that it's adding business value. All of these kinds of things. So I think the piece of the puzzle that's missing at the moment is delivering this end-to-end story and doing it at scale, securely, enterprise-grade. >> And the business impact of that presumably will be a better-quality experience. I mean, recommendation engines and fraud detection have been around for a while, they're just not that good. Retargeting systems are too little, too late, and fraud detection is kind of cumbersome. Still a lot of false positives. Getting much better, certainly compressing the time. It used to be six months,
>> And that's actually, I don't think it's, I think there's a lot of work to be done. I don't think it's a 6-12 month thing, necessarily. I don't think that in 12 months, certainly, you know, everything's going to be perfectly recommended. I think there's areas of active research in the kind of academic fields of how to improve these things, but I think there's a big engineering challenge to bring in more disparate data sources, to better, to improve data quality, to improve these feedback loops, to try and get systems that are serving customer needs better. So improving recommendations, improving the quality of fraud detection systems. Everything from that to medical imaging and counter detection. I think we've got a long way to go. >> Would it be fair to say that we've done a pretty good job with traditional application lifecycle in terms of DevOps, but we now need the DevOps for the data scientists and their collaborators? >> Nick: Yeah, I think that's >> And where is BMI along that? >> Yeah, that's a good question, and I think you kind of hit the nail on the head, that the enterprise applied machine learning problem has moved from the kind of academic to the software engineering and actually, DevOps. Internally, someone mentioned the word train ops, so it's almost like, you know, the machine learning workflow and actually professionalizing and operationalizing that. So recently, IBM, for one, has announced what's in data platform and now, what's in machine learning. And that really tries to address that problem. So really, the aim is to simplify and productionize these end-to-end machine-learning workflows. So that is the product push that IBM has at the moment. >> George: Okay, that's helpful. >> Yeah, and right. I was at the Watson data platform announcement you call the Data Works. I think they changed the branding. >> Nick: Yeah. >> It looked like there were numerous components that IBM had in its portfolio that's now strung together. And to create that end-to-end system that you're describing. Is that a fair characterization, or is it underplaying? I'm sure it is. The work that went into it, but help us maybe understand that better. >> Yeah, I should caveat it by saying we're fairly focused, very focused at HTC on the Open Source side of things, So my work is predominately within the Apache Spark project and I'm less involved in the data bank. >> Dave: So you didn't contribute specifically to Watson data platform? >> Not to the product line, so, you know, >> Yeah, so its really not an appropriate question for you? >> I wouldn't want to kind of, >> Yeah. >> To talk too deeply about it >> Yeah, yeah, so that, >> Simply because I haven't been involved. >> Yeah, that's, I don't want to push you on that because it's not your wheelhouse, but then, help me understand how you will commercialize the activities that you do, or is that not necessarily the intent? >> So the intent with HTC particularly is that we focus on Open Source and a core part of that is that we, being within IBM, we have the opportunity to interface with other product groups and customer groups. >> George: Right. >> So while we're not directly focused on, let's say, the commercial aspect, we want to effectively leverage the ability to talk to real-world customers and find the use cases, talk to other product groups that are building this Watson data platform and all the product lines and the features, data sans experience, it's all built on top of Apache Apache Spark and platform. 
>> Dave: So your role is really to innovate? >> Exactly, yeah. >> Leverage and Open Source and innovate. >> Both innovate and kind of improve, so improve performance improve efficiency. When you are operating at the scale of a company such as IBM and other large players, your customers and you as product teams and builders of products will come into contact with all the kind of little issues and bugs >> Right. >> And performance >> Make it better. Problems, yeah. And that is the feedback that we take on board and we try and make it better, not just for IBM and their customers. Because it's an Apache product and everyone benefits. So that's really the idea. Take all the feedback and learnings from enterprise customers and product groups and centralize that in the Open Source contributions that we make. >> Great. Would it be, so would it be fair to say you're focusing on making the core Spark, Spark ML and Spark ML Lib capabilities sort of machine learning libraries and in the pipeline, more robust? >> Yes. >> And if that's the case, we know there needs to be improvements in its ability to serve predictions in real time, like high speed. We know there's a need to take the pipeline and sort of share it with other tools, perhaps. Or collaborate with other tool chains. >> Nick: Yeah. >> What are some of the things that the Enterprise customers are looking for along the lines? >> Yeah, that's a great question and very topical at the moment. So both from an Open Source community perspective and Enterprise customer perspective, this is one of the, if not the key, I think, kind of missing pieces within the Spark machine-learning kind of community at the moment, and it's one of the things that comes up most often. So it is a missing piece, and we as a community need to work together and decide, is this something that we built within Spark and provide that functionality? Is is something where we try and adopt open standards that will benefit everybody and that provides a kind of one standardized format, or way or serving models? Or is it something where there's a few Open Source projects out there that might serve for this purpose, and do we get behind those? So I don't have the answer because this is ongoing work, but it's definitely one of the most critical kind of blockers, or, let's say, areas that needs work at the moment. >> One quick question, then, along those lines. IBM, the first thing IBM contributed to the Spark community was Spark ML, which is, as I understand it, it was an ability to, I think, create an ensemble sort of set of models to do a better job or create a more, >> So are you referring to system ML, I think it is? >> System ML. >> System ML, yeah, yeah. >> What are they, I forgot. >> Yeah, so, so. >> Yeah, where does that fit? >> System ML started out as a IBM research project and perhaps the simplest way to describe it is, as a kind of sequel optimizer is to take sequel queries and decide how to execute them in the most efficient way, system ML takes a kind of high-level mathematical language and compiles it down to a execution plan that runs in a distributed system. So in much the same way as your sequel operators allow this very flexible and high-level language, you don't have to worry about how things are done, you just tell the system what you want done. System ML aims to do that for mathematical and machine learning problems, so it's now an Apache project. It's been donated to Open Source and it's an incubating project under very active development. 
And that is really, there's a couple of different aspects to it, but that's the high-level goal. The underlying execution engine is Spark. It can run on Hadoop and it can run locally, but really, the main focus is to execute on Spark and then expose these kind of higher level APRs that are familiar to users of languages like R and Python, for example, to be able to write their algorithms and not necessarily worry about how do I do large scale matrix operations on a cluster? System ML will compile that down and execute that for them. >> So really quickly, follow up, what that means is if it's a higher level way for people who sort of cluster aware to write machine-learning algorithms that are cluster aware? >> Nick: Precisely, yeah. >> That's very, very valuable. When it works. >> When it works, yeah. So it does, again, with the caveat that I'm mostly focused on Spark and not so much the System ML side of things, so I'm definitely not an expert. I don't claim to be an expert in it. But it does, you know, it works at the moment. It works for a large class of machine-learning problems. It's very powerful, but again, it's a young project and there's always work to be done, so exactly the areas that I know that they're focusing on are these areas of usability, hardening up the APRs and making them easier to use and easier to access for users coming from the R and Python communities who, again are, as you said, they're not necessarily experts on distributed systems and cluster awareness, but they know how to write a very complex machine-learning model in R, for example. And it's really trying to enable them with a set of APR tools. So in terms of the underlying engine, they are, I don't know how many hundreds of thousands, millions of lines of code and years and years of research that's gone into that, so it's an extremely powerful set of tools. But yes, a lot of work still to be done there and ongoing to make it, in a way to make it user ready and Enterprise ready in a sense of making it easier for people to use it and adopt it and to put it into their systems and production. >> So I wonder if we can close, Nick, just a few questions on STC, so the Spark Technology Centers in Cape Town, is that a global expertise center? Is is STC a virtual sort of IBM community, or? >> I'm the only member visiting Cape Town, >> David: Okay. >> So I'm kind of fairly lucky from that perspective, to be able to kind of live at home. The rest of the team is mostly in San Francisco, so there's an office there that's co-located with the Watson west office >> Yeah. >> And Watson teams >> Sure. >> That are based there in Howard Street, I think it is. >> Dave: How often do you get there? >> I'll be there next week. >> Okay. >> So I typically, sort of two or three times a year, I try and get across there >> Right. And interface with the team, >> So, >> But we are a fairly, I mean, IBM is obviously a global company, and I've been surprised actually, pleasantly surprised there are team members pretty much everywhere. Our team has a few scattered around including me, but in general, when we interface with various teams, they pop up in all kinds of geographical locations, and I think it's great, you know, a huge diversity of people and locations, so. >> Anything, I mean, these early days here, early day one, but anything you saw in the morning keynotes or things you hope to learn here? Anything that's excited you so far? 
>> A couple of the morning keynotes, but had to dash out to kind of prepare for, I'm doing a talk later, actually on feature hashing for scalable machine learning, so that's at 12:20, please come and see it. >> Dave: A breakout session, it's at what, 12:20? >> 20 past 12:00, yeah. >> Okay. >> So in room 302, I think, >> Okay. >> I'll be talking about that, so I needed to prepare, but I think some of the key exciting things that I have seen that I would like to go and take a look at are kind of related to the deep learning on Spark. I think that's been a hot topic recently in one of the areas, again, Spark is, perhaps, hasn't been the strongest contender, let's say, but there's some really interesting work coming out of Intel, it looks like. >> They're talking here on The Cube in a couple hours. >> Yeah. >> Yeah. >> I'd really like to see their work. >> Yeah. >> And that sounds very exciting, so yeah. I think every time I come to a Spark summit, they always need projects from the community, various companies, some of them big, some of them startups that are pushing the envelope, whether it's research projects in machine learning, whether it's adding deep learning libraries, whether it's improving performance for kind of commodity clusters or for single, very powerful single modes, there's always people pushing the envelope, and that's what's great about being involved in an Open Source community project and being part of those communities, so yeah. That's one of the talks that I would like to go and see. And I think I, unfortunately, had to miss some of the Netflix talks on their recommendation pipeline. That's always interesting to see. >> Dave: Right. >> But I'll have to check them on the video (laughs). >> Well, there's always another project in Open Source land. Nick, thanks very much for coming on The Cube and good luck. Cool, thanks very much. Thanks for having me. >> Have a good trip, stay warm, hang in there. (Nick laughs) Alright, keep it right there. My buddy George and I will be back with our next guest. We're live. This is The Cube from Sparks Summit East, #sparksummit. We'll be right back. (upbeat music) (gentle music)