Reynold Xin, Databricks - #Spark Summit - #theCUBE


 

>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad, here with George Gilbert. George. >> Good to be here. >> Thanks for hanging with us. Well, here's the other man of the hour. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit, and there are a lot of interesting things and a lot of interesting people to meet. >> Well, I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with it for a long time, right? >> Yes, I've been contributing to Spark for about five or six years, and I probably have the most commits to the project. Lately I'm working more with other people to help design the roadmap for both Spark and Databricks. >> Well, let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard about in the keynote this morning. What are some of the most exciting new developments? >> So, in general, if we look at Spark, there are three directions I would say we are doubling down on. The first direction is deep learning. Deep learning is extremely hot and very capable, but as we alluded to earlier in a blog post, deep learning has reached sort of a mass-production point, in which it shows tremendous potential, but the tools are very difficult to use. We are hoping to democratize deep learning, and do for deep learning what Spark did for big data, with this new library called Deep Learning Pipelines. What it does is integrate different deep learning libraries directly in Spark, and it can actually expose models in SQL, so even business analysts are capable of leveraging that. So that's one area, deep learning. The second area is streaming. Streaming, again, I think a lot of customers have aspirations to shorten the latency and increase the throughput in streaming. So the structured streaming effort is going to be generally available, and last month alone on the Databricks platform, I think our customers processed three trillion records using structured streaming. We also have a new effort to push the latency all the way down to the millisecond range, so you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing is a very mature area from the traditional database point of view, but from a big data one it's still pretty new, and there are a lot of use cases popping up there. And in Spark, with approaches like the cost-based optimizer, and also the work in the Databricks runtime with DBIO, we're substantially improving the performance and the capabilities of data warehousing features. >> We're going to dig into some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind, maybe, about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debuggability of big data jobs.
Many of them are fairly technical engineers, and some of them are less sophisticated engineers; they have written jobs, and sometimes the job runs slow. The performance engineer in me would think: how do I make the job run fast? A different way to solve that problem is to ask how we can expose the right information so the customer can actually understand and figure it out themselves: this is why my job is slow, and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, debuggability. >> Debuggability. >> Reynold: And visibility, yeah. >> Alright, awesome. George? >> So, let's go back and unpack some of those kind of juicy areas that you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. And so DeepLearning4J and, I think, Intel's BigDL were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are in their native environments? >> Yeah, so this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras, and Caffe. What the Deep Learning Pipelines library does is actually expose all these single-node deep learning tools, highly optimized for, say, GPUs or CPUs, to be available as an estimator, like a module in a pipeline of the machine learning pipeline library in Spark. So now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. When you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out the best model you can get. This is where Spark really shines. When you combine Spark with some deep learning library, be it BigDL or MXNet or TensorFlow, you can use Spark to distribute that training and then do cross-validation on it, so you can find the best model very quickly. And Spark takes care of all the job scheduling, all the fault tolerance properties, and how you read data in from different data sources. >> And without my dropping too much into the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all of that can now be managed by Spark? >> In that library, Spark will be able to take care of picking the best model out of it. And there are different ways you can design how you define the best. The best could be some average of different models. The best could be just picking one out of them. The best could be, maybe, a tree of models that you classify on. >> George: And that's a hyperparameter configuration choice? >> So that is actually built-in functionality in Spark's machine learning pipeline. And now what we're doing is letting you plug all those deep learning libraries directly into that, as part of the pipeline, to be used.
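Reynold's point about hyperparameter tuning and cross-validation maps directly onto Spark ML's built-in pipeline tooling. A minimal sketch of that pattern with the standard pyspark.ml API; the DataFrame train_df and its column names are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assemble raw columns into a feature vector (column names are made up).
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# The grid of experiments to run; Spark schedules the runs across the cluster.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

best_model = cv.fit(train_df).bestModel  # Spark picks the best model for you
```

Spark parallelizes the grid and the cross-validation folds across the cluster, which is the "over a hundred experiments" workload Reynold mentions.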
>> Maybe just one more thing to add. >> Yeah, yeah. >> Another really cool functionality of the Deep Learning Pipelines library is transfer learning. So as you said, deep learning takes a very long time and is very computationally demanding, and it takes a lot of resources and expertise to train. But with transfer learning, what we allow customers to do is take an existing deep learning model that's well trained in a different domain, retrain it on a very small amount of data very quickly, and adapt it to a different domain. That's sort of how the demo with the James Bond car works: there's a general image classifier, and we retrained it on probably just a few thousand images. And now we can actually detect whether a car is James Bond's car or not. >> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model of a similar situation. I want to ask, in the time we have: there's always been this debate about whether Spark should manage state, whether it's a database or a key-value store. Tell us how the thinking about that has evolved, and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define, but the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantages of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of opportunity in building on top of existing storage systems: a lot of opportunities for optimization in how you can leverage the specific properties of the underlying storage system to get maximum performance. For example, how you do intelligent caching, and how you start thinking about building indexes against the stored data for scan workloads. >> With Tungsten, you take advantage of the latest hardware, and as we get more memory-intensive systems, and now that the Catalyst optimizer has a cost-based optimizer, or will have, and large memory: can you change how you go about knowing what data you're managing in the underlying system, and therefore achieve a tremendous acceleration in performance? >> This is actually one area we invested in with the DBIO module as part of Databricks Runtime. A lot of this is still in progress, but, for example, we're adding some form of indexing capability to the system, so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups, or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're actually adding indexing functionality on top of it as part of DBIO.
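The transfer learning workflow Reynold describes above is roughly what Databricks' Deep Learning Pipelines library exposed at the time. A hedged sketch, assuming the 2017-era sparkdl API, where a pre-trained InceptionV3 network acts as a fixed featurizer and only a small classifier is trained on the new domain; the DataFrames and the label column are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer  # Databricks Deep Learning Pipelines

# Reuse a network trained on a different domain as a feature extractor,
# then fit only a lightweight classifier on a few thousand labeled images.
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
classifier = LogisticRegression(labelCol="is_bond_car", featuresCol="features")

model = Pipeline(stages=[featurizer, classifier]).fit(labeled_images_df)
predictions = model.transform(new_images_df)
```

Because the heavy network is reused rather than retrained, the fit runs quickly on a small labeled set, which is the point of the James Bond car demo.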
Updates are another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize that for both use cases, point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates, rather than transactional updates in which every time you swipe a credit card some record gets updated. That probably belongs more on transactional databases like Oracle, or MySQL even. >> What about when you think about people who are going to run... they started out with Spark on prem, they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like? >> Really interesting; it's kind of two questions, maybe. The first is the hybrid on-prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute. So when you want to move, for example, workloads from on prem to the cloud, the one you care the most about is probably actually the data, because the compute doesn't really matter that much in terms of where you run it, but data is the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on prem, relying on the caching solution we have to minimize the data transfer over time. That's one route, and I would say it's pretty popular. Another one is, with Amazon, you can literally use something like Snowball: you give them hard drives, and the trucks will ship your data and put it directly in S3. With IoT, a common pattern we see is that a lot of the edge devices push the data directly into some firehose like Kinesis or Kafka, and I'm sure Google and Microsoft both have their own variants of that, and then you use Spark to directly subscribe to those topics and process them in real time with structured streaming. >> And so would Spark be down, let's say, at the site level, if it's not on the device itself? >> It's an interesting thought, and maybe one thing we should consider more in the future is how we push Spark to the edges. Right now it's more of a centralized model, in which the devices push data into Spark, which is centralized somewhere. I've seen one example, I don't remember the exact use case, but it had to do with some scientific experiment in the North Pole, and of course there you don't have a great uplink to transfer all the data back to some national lab, so rather they would do smart parsing there and then ship the aggregated result back. There's another one, but it's less common. >> Alright, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on for the benefit of everybody? >> In general, Spark came along with two focuses. One is performance, the other one is ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think is actually pretty usable; the systems are fast enough. Now we should work on actually making (mumbles) even easier to use.
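The IoT pattern Reynold sketches above, edge devices publishing into a firehose like Kafka or Kinesis with Spark subscribing to the topics, is only a few lines with structured streaming. A minimal sketch against the standard Kafka source; the broker address, topic name, and payload interpretation are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

# Subscribe to the topic the edge devices publish to.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "device-events")              # placeholder
          .load())

# Treat the Kafka message value as a device id (a simplification) and
# count events per device per minute, in real time.
counts = (events
          .selectExpr("CAST(value AS STRING) AS device_id", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"), col("device_id"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
```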
>> That's also what we focus a lot on at Databricks. >> David: Democratizing access, right? >> Absolutely. >> Alright, well, Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show. >> Thank you very much, David and George. >> Thank you all for watching. We're at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.

Published Date : Jun 7 2017


Dinesh Nirmal, IBM | CUBEConversation


 

(upbeat music) >> Hi everyone. We have a special program today. We are joined by Dinesh Nirmal, who is VP of Analytics Development at IBM. Dinesh has an extremely broad perspective on what's going on in this part of the industry, and IBM has a very broad portfolio, so between the two of us, I think we can cover a lot of ground today. So, Dinesh, welcome. >> Oh, thank you, George. Great to be here. >> So, just to frame the discussion, I wanted to hit on four key highlights. One is balancing compatibility across cloud, on-prem, and edge versus leveraging specialized services that might be on any one of those platforms. Then, harmonizing and simplifying both the management and the development of services across these platforms; you have that trade-off between "do I do everything compatibly" and "can I take advantage of platform-specific stuff." Then, we've heard a huge amount of noise about machine learning, and everyone says they're democratizing it; we want to hear your perspective on how you think that's most effectively done. And then, if we have time, how to manage machine learning data feedback loops to improve the models. So, let's start with that. >> So you talked about the private cloud and the public cloud, and how you manage the data and the models, or the other analytical assets, across the hybrid nature of today. If you look at enterprises, it's a hybrid format that most customers adopt. I mean, you have some data on the public side, but you have your mission critical data, the data that's very core to your transactions, existing in the private cloud. Now, how do you make sure that you can use the data you've pushed to the cloud to build models, and then take that model and deploy it on-prem or on the public cloud? >> Is that the emerging sort of mainstream design pattern, where mission critical systems are less likely to move, for latency, or because they're fused to their own hardware, but you take the data, and the research for the models happens up in the cloud, and then that gets pushed down close to where the transaction decisions are? >> Right, so there's also the economics of data that comes into play. If you are doing, you know, a large-scale neural net, where you have GPUs and you want to do deep learning, obviously it might make more sense for you to push it into the cloud and be able to do that with one of the deep learning frameworks out there. But then you have your core transactional data, which includes your customer data, you know, or your customer medical data, which I think some customers might be reluctant to push onto a public cloud, but you still want to build models and predict and all those things. So it's a hybrid nature: depending on the sensitivities of the data, customers might decide to put it on the cloud versus the private cloud, which is on their premises, right? So then how do you serve those customer needs, making sure that you can build a model on the cloud and deploy that model on the private cloud, or vice versa; I mean, you can build that model on the private cloud and deploy it on your public cloud. Now the challenge, one last statement, is that people think, well, once I build a model and deploy it on the public cloud, then it's easy, because it's just an API call at that time, just calling that model to execute the transactions. But that's not the case.
You take a support vector machine, for example: that still has vectors in there, which means your data is there. So even though you're saying you're deploying the model, you still have sensitive data there. Those are the kinds of things customers need to think about before they go deploy those models. >> So, and this is a topic for our Friday interview with a member of the Watson IT family, but it's not so black and white when you say we'll leave all your customer data with you and we'll work on the models, because it's sort of like teabags, you know: you can take the customer's teabag and squeeze some of the tea out in your IBM or public cloud, and give them back the teabag, but you're getting some of the benefit of this data. >> Right, so it depends, depends on the algorithms you build. You could take a linear regression, and you don't have the challenges I mentioned with a support vector machine, because none of the data is moving, it's just the model. So it depends. I think that's where, you know, what Watson has done will help tremendously, because the data is secure in that sense. But if you're building on your own, it's a different challenge: you've got to make sure you pick the right algorithms to do that. >> Okay, so let's move on to the modern, sort of, what we call operational analytic pipeline, where the key steps are ingest, process, analyze, predict, serve, and you can drill down on those more. Today, those pipelines are pretty much built out of multi-vendor components. How do you see that evolving under the tension between simplicity, coming from one vendor with the pieces all designed together, and specialization, where you want, you know, a unique tool in one component? >> Right, so you're exactly right. You can take a two-pronged approach. One is, you can go to a cloud provider, get each of the services, and stitch them together. That's one approach; a challenging approach, but it has its benefits, right, I mean, you bring some core strengths from each vendor into it. The other one is the integrated approach, where you ingest the data, you shape or cleanse the data, you get it prepared for analytics, you build the model, you predict, you visualize. I mean, that all comes in one. The benefit there is you get the whole stack in one, you have a whole pipeline that you can execute, you have one service provider that's giving you services, it's managed. So all those benefits come with it, and that's probably the preferred way, integrated all together in one stack. I think that's the path most people go towards, because then you have the whole pipeline available to you, and also the services that come with it, and any updates that come with it. If you take the first route, one challenge you have is how you make sure all these services are compatible with each other. How do you make sure they're compliant? If you're an insurance company, you want it to be HIPAA compliant. Are you going to individually make sure that each of these services is HIPAA compliant? Or, if you get it from one integrated provider, you can make sure they are HIPAA compliant and the tests are done. So all those benefits, to me, outweigh going and putting unmanaged services together and then creating a data lake to underlay all of it.
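Dinesh's support vector machine caveat above is concrete: a fitted SVM literally stores selected training rows as its support vectors, while a linear regression keeps only aggregate coefficients. A toy scikit-learn illustration of the difference, with made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

X = np.array([[0.0, 0.1], [0.2, 0.9], [0.9, 0.8], [1.0, 0.2]])
y = np.array([0, 1, 1, 0])

svm = SVC(kernel="rbf").fit(X, y)
print(svm.support_vectors_)       # actual training rows live inside the model

lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)  # only fitted parameters, no raw rows
```

So deploying the SVM ships pieces of the training data along with it, which is exactly the sensitivity Dinesh flags.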
>> Would it be fair to say, to use an analogy, that Hadoop, originating in many different Apache products, is a quasi-multi-vendor kind of pipeline, and the state of the machine learning analytic pipeline is still kind of multi-vendor today? If you see that moving toward a single-vendor pipeline, who do you see as the sort of last man standing? >> So, I can speak from an IBM perspective. The benefit a vendor like IBM brings forward spans public or private cloud, or hybrid: you obviously have the choice of going to the public cloud, and you can get the same service on the public cloud, so you get a hybrid experience. That's one aspect of it. Then, if you get the integrated solution, all the way from ingest to visualization, you have one provider; it's tested, it's integrated, you know, it works well together. So I would say, going forward, if you look at it purely from an enterprise perspective, integrated solutions are the way to go, because that is what will be the last man standing. I'll give you an example. I was with a major bank in Europe about a month ago, and I took them through our Data Science Experience, our machine learning project, and all that, and the CTO's take was: Dinesh, I got it. Building the model itself only took us two days, but incorporating our model into our existing infrastructure, it has been 11 months and we haven't been able to do it. So that's the challenge enterprises face, and they want an integrated solution to bring that model into their existing infrastructure. So that's, you know, that's my thought. >> Today, though, let's talk about the IBM pipeline. Spark is core, ingest is, off the-- >> Dinesh: Right, so you can do Spark Streaming, you can use Kafka, or you can use InfoSphere Streams, which is our proprietary tool. >> Right, although you wouldn't really use structured streaming for ingest, 'cause of the back pressure? >> Right, so they are-- >> The point that I'm trying to make is, it's still multi-vendor, and then on the serving side, once the analysis is done and predictions are made, some sort of SQL database has to take over. So today, it's still pretty multi-vendor. How do you see any of those products broadening their footprints so that the number of pieces decreases? >> Good question. They are all going to get into the end-to-end pipeline, because that's where the value is. Unless you provide an integrated, end-to-end solution for a customer, especially an enterprise customer, it's all about putting it all together, and putting these pieces together is not easy. Even when you ingest the data, IoT kinds of data, a lot of times, 99% of the time, the data is not clean, unless you're in a competition where you get cleansed data; in the real world, that never happens. So then, I would say 80% of a data scientist's time is spent on cleaning the data, shaping the data, preparing the data to build that pipeline. So for most customers, it's critical that they get that end-to-end, well-oiled, well-connected, integrated solution, rather than taking an isolated solution from each vendor. To answer your question: yes, every vendor is going to move into ingest, the data cleansing phase, transformation, building the pipeline, and then visualization; if you look at those five steps, all of that has to be developed.
>> But just building the data cleansing and transformation, having it native to your own pipeline, doesn't sound like it's going to solve the problem of messy data that needs, you know, human supervision to correct. >> I mean, there is some level of human supervision, to be sure. So I'll give you an example: when data comes in from an insurance company, a lot of times the gender could be missing. How do you know if it's a male or a female? Then you've got to build another model to say, you know, this patient has gone for a prostate exam, so it's a male; gynecology is a female. So you have to do some inference work in there to make sure that the data is clean, and then there's some human supervision to make sure that it's good enough to build models on, because when you're executing that pipeline in real time... >> Yeah. >> It's all based on the past data, so you want to make sure that the data is as clean as possible to train the model that you're going to execute on.
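Dinesh's gender example is classic rule-based imputation: derive the missing field from another column before the data reaches training. A sketch of that step in PySpark; the DataFrame, column names, and procedure codes are all hypothetical:

```python
from pyspark.sql import functions as F

# Fill in missing gender from procedure codes before the data is used
# for training (column names and code values are made up).
cleaned = claims_df.withColumn(
    "gender",
    F.when(F.col("gender").isNotNull(), F.col("gender"))
     .when(F.col("procedure") == "prostate_exam", F.lit("M"))
     .when(F.col("procedure") == "gynecology_exam", F.lit("F"))
     .otherwise(F.lit(None).cast("string")))  # leave the rest for human review
```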
On development side, more and more it's getting commoditized, meaning picking the right algorithm, there's a lot of tools, including IBM, where he can say, question what's the right one to use for this, so that piece is getting a little, less complex, I don't want to say easier, but less complex, but the data cleansing and the deployment, these are two enterprises, when you have thousands of models how do you make sure that you deploy the right model. >> So you might say that the pipeline for managing the model is sort of separate from the original data pipeline, maybe it includes the same technology, or as much of the same technology, but once your pipeline, your data pipeline is in production, the model pipeline has to keep cycling through. >> Exactly, so the data pipeline could be changing, so if you take a lone example, right, a lot of the data that goes in the model pipeline, is static, I mean, my age, it's not going to change every day, I mean, it is, but you know, the age that goes into my salary, my race, my gender, those are static data that you can take from data and put it in there, but then there's also real time data that's coming, my loan amount, my credit score, all those things, so how do you bring that data pipeline between real time and static data, into the model pipeline, so the model can predict accurately and based on the score dipping, you should be able to re-try the model using real time data. >> I want to take, Dinesh, to the issue of a multi-vendor stack again, and the administrative challenges, so here, we look at a slide that shows me just rattling off some of the admin challenges, governance, performance modeling, scheduling orchestration, availability, recovering authentication, authorization, resource isolation, elasticity, testing integration, so that's the Y-axis, and then for every different product in the pipeline, as the access, say Kafka, Spark, structured streaming MPP, sequel, no sequel, so you got a mess. >> Right. >> Most open source companies are trying to make life easier for companies by managing their software as a service for the customer, and that's typically how they monetize. But tell us what you see the problem is, or will be with that approach. >> So, great question. Let me take a very simple example. Probably most of our audience know about GDPR, which is the European law to write to forget. So if you're an enterprise, and say, George, I want my data deleted, you have to delete all of my data within a period of time. Now, that's where one of the aspects you talked about with governance comes in. How do you make sure you have governance across not just data but your individual assets? So if you're using a multi-vendor solution, in all of that, that state governance, how do I make sure that data get deleted by all these services that's all tied together. >> Let me maybe make an analogy. On CSI, when they pick up something at the crime scene, they got to make sure that it's bagged, and the chain of custody doesn't lose its integrity all the way back to the evidence room. I assume you're talking about something like that. >> Yeah, something similar. Where the data, as it moves between private cloud, public cloud, analytical assets, is using that data, all those things need to work seamlessly for you to execute that particular transaction to delete data from everywhere. >> So that's, it's not just administrative costs, but regulations that are pushing towards more homogenous platforms. 
>> Right, right, and even if you take some of the other things on the stack, monitoring, logging, metering, provides some of those capabilities, but you have to make sure when you put all these services together, how are they going to integrate all together? You have one monitoring stack, so if you're pulling you know, your IOT kind of data into a data center, or your whole stack evaluation, how do you make sure you're getting the right monitoring data across the board? Those are the kind of challenges that you will have. >> It's funny you mention that, because we were talking to an old Lotus colleague of mine, who was CTO of Microsoft's IT organization, and we were talking about how the cloud vendors can put machine learning application, machine learning management application across their properties, or their services, but he said one of the first problems he'll encounter is the telemetry, like it's really easy on hardware, CPUs, utilization, memory utilization, a noise enabler for iO, but as you get higher up in the application services, it becomes much more difficult to harmonize, so that a program can figure out what's going wrong. >> Right, and I mean, like anomaly detection, right? >> Yes. >> I mean, how do you make sure you're seeing patterns where you can predict something before it happens, right? >> Is that on the road map for...? >> Yeah, so we're already working with some big customers to say, if you have a data center, how do you look at outage to predict what can go wrong in the future, root cause analysis, I mean, that is a huge problem solved. So let's say customer hit a problem, you took an outage, what caused it? Because today, you have specialists who will come and try to figure out what the problem is, but can we use machine learning or deep learning to figure out, is it a fix that was missing, or an application got changed that caused a CPU spike, that caused the outage? So that whole cost analysis is the one that's the hardest to solve, because you are talking about people's decades worth of knowledge, now you are influencing a machine to do that prediction. >> And from my understanding, root cause analysis is most effective when you have a rich model of how your, in this case, data structure and apps are working, and there might be many little models, but they're held together by some sort of knowledge graph that says here is where all the pieces fit, these are the pieces below these, sort of as peers to these other things, how does that knowledge graph get built in, and is this the next generation of a configuration management database. >> Right, so I call it the self-healing, self-managing, self-fixing data center. It's easy for you to turn up the heat or A/C, the temperature goes down, I mean, those are good, but the real value for a customer is exactly what you mentioned, building up that knowledge graft from different models that all comes together, but the hardest part, is, how do you, predicting an anomaly is one thing, but getting to the root cause is a different thing, because at that point, now you're saying, I know exactly what's caused this problem, and I can prevent it from happening again. That's not easy. We are working with our customers to figure out how do we get to the root cause analysis, but it's all about building the knowledge graph with multiple models coming from different systems, today, I mean enterprises have different systems from multi-vendors. 
We have to bring all that monitoring data into one source, and that's where that knowledge comes in, and then different models will feed that data, and then you need to prime that data, using deep learning algorithms to say, what caused this? >> Okay, so this actually sounds extremely relevant, although we're probably, in the interest of time, going to have to dig down on that one another time, but, just at a high level, it sounds like the knowledge graph is sort of your web or directory, into how local components or local models work, and then, knowing that, if it sees problems coming up here, it can understand how it affects something else tangentially. >> So think of knowledge graph as a neural net, because it's just building new neural net based on the past data, and it has that built-in knowledge where it says, okay, these symptoms seem to be a problem that I have encountered in the past. Now I can predict the root cause because I know this happened in the past. So it's kind of like you putting that net to build new problem determinations as it goes along. So it's a complex task. It's not easy to get to root cause analysis. But that's something we are aggressively working on developing. >> Okay, so let me ask, let's talk about sort of democratizing machine learning and the different ways of doing that. You've actually talked about the big pain points, maybe not so sexy, but that are critical, which is operationalizing the models, and preparing the data. Let me bounce off you some of the other approaches. One that we have heard from Amazon is that they're saying, well, data expunging might be an issue, and operationalizing the models might be an issue, but the biggest issue in terms of making this developer ready, is we're going to take the machine learning we use to run our business, whether it's merchandising fashion, running recommendation engines, managing fulfillment or logistics, and just like I did with AWS, they're dog-fooding it internally, and then they're going to put it out on AWS as a new layer of a platform. Where do you see that being effective, and where less effective? >> Right, so let me answer the first part of your question, the democratization of learning. So that happens when for example, a real estate agent who has no idea about machine learning, be able to come and predict the house prices in this area. That's to me, is democratizing. Because at that time, you have made it available to everyone, everyone can use it. But that comes back to our first point, which is having that clean set of data. You can build all the pre-canned pipelines out there, but if you're not feeding the set of data into, none of this, you know. Garbage in, garbage out, that's what you're going to get. So when we talk about democratization, it's not that easy and simple because you can build all this pre-canned pipelines that you have used in-house for your own purposes, but every customer has many unique cases. So if I take you as a bank, your fraud detection methods is completely different than me as a bank, my limit for fraud detection could be completely different. So there is always customization that's involved, the data that's coming in is different, so while it's a buzzword, I think there's knowledge that people need to feed it, there's models that needs to be tuned and trained, and there's deployment that is completely different, so you know, there is work that has to be done. 
>> So then, what I'm taking away from what you're saying is, you don't have to start from ground zero with your data, but you might want to add some of your own data, which is specialized, or slightly different from what the pre-trained model was built on. You still have to worry about operationalizing it, so it's not a pure developer-ready API, but it uplevels the skills requirement so that it's not quite as demanding as working with TensorFlow or something like that. >> Right. I mean, you can always build pre-canned pipelines and make them available; we have already done that. For example, for fraud detection we have pre-canned pipelines; for IT analytics we have pre-canned pipelines. So it's nothing new: you can always take what you have done in-house and make it available to the public or to customers, but then they have to take it and do customization to meet their demands, bring their data to retrain the model, and all those things have to be done. It's not just about providing the model; every customer use case is completely different. Whether you are looking at fraud detection from one bank's perspective, not all banks are going to do the same thing. Same thing for predicting, for example, the loan: your loan approval process is going to be completely different from mine as a bank. >> So let me ask you then, and we're getting low on time here: if you had to characterize Microsoft Azure, Google, and Amazon as each bringing to bear certain advantages and disadvantages, and you're now the ambassador, not a representative of IBM, help us understand the sweet spot for each of those. You're trying to fix the two sides of the pipeline, I guess, thinking of it like a barbell; where are the others, based on their data assets and their tools, and where do they need to work? >> So, there are two aspects to it. There's the enterprise aspect: as an enterprise, I would say it's not just about the technology, there's also the services aspect. If my model goes down in the middle of the night and my banking app is down, who do I call? If I'm using an open source service that is available on a cloud provider, do I have the right amount of coverage to call somebody and fix it? So there are the enterprise capabilities, availability, reliability; that is different from a developer who comes in with a CSV file that he or she wants to use to build a model to predict something. Those are two different aspects. So if you talk about, you know, all these vendors: if I'm wearing an enterprise hat, one of the things I would look at is, can I get an integrated solution, end to end, on the machine learning platform? >> And that means end to end in one location. >> Right. >> So you don't have network issues or latency and stuff like that. >> Right, it's an integrated solution, where I can bring in the data, there are no challenges with latency, those kinds of things, and I can get the enterprise-level service, the SLAs, all those things, right? So in there, the named vendors obviously have an upper hand, because they are preferred by enterprises over a brand new open source product that comes along. But within enterprises there are lines of business building models using some of the open source vendors, which is okay, but eventually those models have to get deployed, and then how do you make sure you have those enterprise capabilities there? So if you ask me, I think each vendor brings some capabilities.
I think the benefit IBM brings is, one, you have the choice or the freedom of cloud, on-prem, or hybrid, and you have all the choices of languages: we support R, Python, Spark, I mean, SPSS. So the choice, the freedom, the reliability, the availability, the enterprise nature: that's where IBM comes in and differentiates, and for our customers that's a huge plus. >> One last question, and we're really out of time. In terms of thinking about a unified pipeline: when we were at Spark Summit, sitting down with Matei Zaharia and Reynold Xin, the question came up that Databricks has an incomplete pipeline, no persistence, no ingest, not really much in the way of serving, but boy are they good at, you know, data transformation and munging and machine learning. But they said they consider it part of their ultimate responsibility to take control. And on the ingest side it's Kafka; the serving side might be Redis or something else, or the Spark databases like SnappyData and Splice Machine. Spark is so central to IBM's efforts: what might a unified Spark pipeline look like? Have you guys thought about that? >> It's not there; obviously they could be working on it, but for our purposes, Spark is critical for us, and the reason we invested in Spark so much is the execution engine, where you can take a tremendous amount of data and, you know, crunch through it in a very short amount of time. That's the reason we also invested in Spark SQL, because we have a good chunk of customers who still use SQL heavily. We put a lot of work into Spark ML, so we are continuing to invest, and probably they will get to an integrated solution; it's not there yet, but as it comes along, we'll adapt. If it meets our needs and demands, and enterprises can use it, then definitely; I mean, you know, we saw that Spark's core engine has the ability to crunch a tremendous amount of data, so we are using it. 45 of our internal products use Spark as their core engine. Our DSX, Data Science Experience, has Spark as its core engine. So, yeah, today it's not there, but I know they're probably working on it, and if there are elements of this whole pipeline that come together that are convenient for us to use at an enterprise level, we will definitely consider using it. >> Okay, on that note, Dinesh, thanks for joining us and taking time out of your busy schedule. My name is George Gilbert, I'm with Dinesh Nirmal from IBM, VP of Analytics Development, and we are at the Cube studio in Palo Alto. We will be back in the not too distant future with more interesting interviews with some of the gurus at IBM. (peppy music)

Published Date : Aug 22 2017
