Ion Stoica, Databricks - Spark Summit East 2017 - #sparksummit - #theCUBE
>> [Announcer] Live from Boston, Massachusetts. This is theCUBE. Covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert.
>> [Dave] Welcome back to Boston, everybody. This is Spark Summit East, #SparkSummit, and this is theCUBE. Ion Stoica is here. He's Executive Chairman of Databricks and Professor of Computer Science at UC Berkeley. The smarts are rubbing off on me. I always feel smart when I co-host with George. And now having you on is just a pleasure, so thanks very much for taking the time.
>> [Ion] Thank you for having me.
>> So loved the talk this morning, we learned about RISELab, we're going to talk about that. Which is the son of AMP. You may be the father of those two, so. Again, welcome. Give us the update, great keynote this morning. How's the vibe, how are you feeling?
>> [Ion] I think it's great, you know. Thank you, and thank everyone for attending the summit. There's a lot of energy, a lot of interesting discussions, and a lot of ideas going around. So I'm very happy about how things are going.
>> [Dave] So let's start with RISELab. Maybe take us back, for those who don't understand, to the birth of AMP, what you were trying to achieve there, and what's next.
>> Yeah, so AMPLab was a six-year project at Berkeley, and it involved around eight faculty members and, over the duration of the lab, around 60 students and postdocs. And the mission of AMPLab was to make sense of big data. AMPLab started in 2009, at the end of 2009, and the premise was that in order to make sense of this big data, we need a holistic approach, which involves algorithms, in particular machine-learning algorithms; machines, meaning systems, large-scale systems; and people, meaning crowdsourcing. And more precisely, the goal was to build a stack, a data analytics stack for interactive analytics, to be used across industry and academia. And, of course, being at Berkeley, it had to be open source. (laugh) So that's basically what AMPLab was, and it was the birthplace of Apache Spark; that's why you are all here today. And a few other open-source systems, like Apache Mesos and Alluxio, which was previously called Tachyon. And so AMPLab ended in December last year, and this January we started a new lab, which is called RISE. RISE stands for Real-time Intelligent Secure Execution. And the premise of the new lab is that the real value in the data is actually the decision you can make on the data. And you can see this more and more at almost every organization. They want to use their data to make some decision to improve their business processes, applications, services, or come up with new applications and services. But then if you think about that, what does it mean that the emphasis is on the decision? It means that you want the decision to be fast, because fast decisions are better than slower decisions. You want decisions to be on fresh data, on live data, because decisions on the data I have right now are better than decisions on the data from yesterday or last week. And then you also want to make targeted, personalized decisions, because decisions based on personal information are better than decisions based on aggregate information. So that's the fundamental premise. So therefore you want to build platforms, tools, and algorithms to enable intelligent real-time decisions on live data, with strong security.
And security is a big emphasis of the lab, because it means providing privacy, confidentiality, and integrity, and you hear about data breaches or things like that every day. So for an organization, it is extremely important to provide privacy and confidentiality to their users, and it's not only because the users want that, but it also indirectly can help them to improve their service. Because if I guarantee your data is confidential with me, you are probably much more willing to share some of your data with me. And if you share some of the data with me, I can build and provide better services. So that's basically, in a nutshell, what the lab is and what the focus is.
>> [Dave] Okay, so you said three things: fast, live, and targeted. So fast means you can affect the outcome.
>> Yes.
>> Live data means it's better quality. And then targeted means it's relevant.
>> Yes.
>> Okay, and then my question on security. I felt like when cloud and Big Data came to the fore, security became a do-over. (laughter) Is that a fair assessment? Are you doing it over?
>> [George] Or as Bill Clinton would call it, a Mulligan.
>> Yeah, if you get a Mulligan on security.
>> I think security is, it's always a difficult topic, because it means so many things for so many people.
>> Hmm-mmm.
>> So there are instances where the cloud is actually quite secure. The cloud can actually be more secure than some on-prem deployments. In fact, if you hear about these data leaks or security breaches, you don't hear about them happening in the cloud. And there is some reason for that, right? It is because they have trained people, you know, they are paranoid about this, they do security reviews maybe much more often, and things like that. But still, you know, the state of security is not that great. Right? For instance, if I compromise your operating system, whether it's in the cloud or not in the cloud, I can do anything. Right? Or your VM, right? In the cloud you run on a VM, and now you are also going to run some containers. Right? And there are a lot of attacks, sophisticated attacks, where your data is encrypted, but if I can look at the access patterns, how much data you transferred, or how much data you accessed from memory, then I can infer something about what you are doing, about your queries, right? If it's more data, maybe it's a query on New York. If it's less data, it's probably something smaller, like maybe something on Berkeley. So you can infer from multiple queries just by looking at the access. So it's a difficult problem. But fortunately, again, there are some new technologies being developed and some new algorithms which give us some hope. One of the most interesting technologies happening today is hardware enclaves. So with hardware enclaves you can execute the code within this enclave, which is hardware protected. And even if your operating system or VM is compromised, they cannot access the code which runs inside this enclave. And Intel has Intel SGX, and we are working and collaborating with them actively. ARM has TrustZone, and AMD also announced they are going to have a similar technology in their chips. So that's a very interesting and very promising development. I think the other aspect, which is a focus of the lab, is that even if you have the enclaves, it doesn't automatically solve the problem, because the code itself can have vulnerabilities. Yes, I can run the code in a hardware enclave, but the code can send data outside.
>> Right.
>> Right, the enclave is a more granular perimeter. Right?
>> Yeah. So yeah, the security experts in our lab are looking at this: maybe how to split the application so you run only a small part, the critical part, in the enclave, and you can make sure that code is secure, and the rest of the code you run outside. But the rest of the code is only going to work on data which is encrypted. Right? So there is a lot of interesting research there, which is good.
>> And does blockchain fit in there as well?
>> Yeah, I think blockchain is a very interesting technology. And again, it's real-time, and there are also very interesting directions in that area.
>> Yeah, right.
>> Absolutely.
>> So you guys... George, you've shared with me sort of what you were calling a new workload. So you had batch and you have interactive, and now you've got continuous-
>> Continuous, yes.
>> And I know that's a topic that you want to discuss, and I'd love to hear more about that. But George, tee it up.
>> Well, okay. So we were talking earlier, and the objective of RISE is fast and continuous-type decisions. And this is different from the traditional approach, where you either do it batch or you do it interactive. So maybe tell us about some applications where that is one workload among the other traditional workloads. And then let's unpack that a little more.
>> Yeah, so I'll give you a few applications. So it's more than continuously interacting with the environment; you also learn continuously. I'll give you some examples. So for instance, in one example, think about wanting to detect a network security attack, and respond, diagnose, and defend in real time. What this means is that you need to continuously get logs from the network, and the more endpoints you can get them from, the better. Right? Because more data will help you to detect things faster. But then you need to detect new patterns, and you need to learn the new patterns. Because the new security attacks, which are the ones that are effective, are slightly different from the past ones, because you hope that you already have defenses in place for the past ones. So now you are going to learn that, and then you are going to react. You may push patches in real time. You may push filters, installing new filters on firewalls. So that's one application that's going on in real time. Another application can be about self-driving. Now, self-driving has made tremendous strides, and a lot of algorithms, you know, very smart algorithms, are now implemented on the cars. Right? All the systems are on the cars. But imagine now that you want to continuously get the information from these cars, aggregate and learn, and then send the information you learned back to the cars. Like, for instance, if there's an accident, or a roadblock, or an object which is dropped on the highway, you can learn from the other cars what they've done in that situation. It may mean that in some cases the driver took an evasive action, right? Maybe you can also monitor the cars which are not self-driving, but driven by humans. And then you learn that in real time, and then the other cars which follow, confronted with the same situation, now know what to do. Right? So this is, again, I want to emphasize this: not only continuously sensing the environment and making decisions, but a very important component about learning.
>> Let me take you back to the security example as I sort of process the auto one.
>> Yeah, yeah.
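To make the detect-learn-react loop described above concrete, here is a minimal sketch in Python. It is an illustration under assumed details, not anything built at RISELab: the simulated log stream, the per-source baselines, and the `push_firewall_filter` stand-in are all hypothetical, but they show the shape of a pipeline that keeps learning a baseline from live logs and reacts by pushing a new filter.

```python
import random
import statistics
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class LogEvent:
    source_ip: str
    bytes_sent: int


def push_firewall_filter(source_ip: str) -> None:
    # Stand-in for the real-time reaction step, e.g. installing a new
    # filter on a firewall; here we just print the action taken.
    print(f"[react] blocking traffic from {source_ip}")


def log_stream(n_events: int = 600):
    # Simulated endpoint logs: mostly normal traffic from ten sources,
    # with one source whose behavior changes partway through.
    for i in range(n_events):
        if i > 300 and random.random() < 0.3:
            yield LogEvent("10.0.0.5", random.randint(50_000, 90_000))
        else:
            yield LogEvent(f"10.0.0.{random.randint(1, 10)}",
                           random.randint(500, 2_000))


def run_loop() -> None:
    history = defaultdict(lambda: deque(maxlen=100))  # recent volume per source
    blocked = set()

    for event in log_stream():
        if event.source_ip in blocked:
            continue
        window = history[event.source_ip]
        # Detect: flag traffic far above this source's learned baseline.
        if len(window) >= 20:
            mean = statistics.mean(window)
            spread = statistics.pstdev(window) or 1.0
            if event.bytes_sent > mean + 6 * spread:
                push_firewall_filter(event.source_ip)
                blocked.add(event.source_ip)
                continue
        # Learn: otherwise fold the observation into the baseline, so the
        # model keeps adapting to fresh, live data rather than yesterday's.
        window.append(event.bytes_sent)


if __name__ == "__main__":
    run_loop()
```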
>> So in the security example, it doesn't sound like... I mean, if you have a vast network, you know, endpoints, software, infrastructure, you're not going to have one God model looking out at everything.
>> Yes.
>> So I assume that means there are models distributed everywhere, and they don't necessarily know what an entirely new attack pattern looks like. So in other words, that isolated model doesn't know what it doesn't know. I don't know if that's what Rumsfeld called it.
>> Yes. (laughs)
>> How does it know what to pass back for retraining?
>> Yes. Yes. Yes. So there are many aspects and there are many things you can look at. And again, it's a research problem, so I cannot give you the solution now; I can hypothesize and give you some examples. But for instance, you can correlate by observing the effects. Some of the effects of the attack are visible. In some cases, a denial-of-service attack, that's pretty clear. And so forth; they may cause computers to crash, right? So once you see some of these kinds of anomalies, right, anomalies on the end devices, end hosts, and things like that, maybe reported by humans, right? Then you can try to correlate with what kind of traffic you've got. Right? And from that correlation, probably, and hopefully, you can develop some models to identify what kind of traffic, where it comes from, what the content is, and so forth, causes the anomalous behavior.
>> And where is that correlation happening?
>> I think it will happen everywhere, right? Because-
>> At the edge and at the center.
>> Absolutely.
>> And then I assume, it sounds like, the models both at the edge and at the center are ensemble models.
>> Yes.
>> Because you're tracking different behavior.
>> Yes. You are going to track different behavior, and, I think that's a good hypothesis, you are then going to ensemble them to come up with the best decision.
>> Okay, so now let's wind forward to the car example.
>> Yeah.
>> So it sounds like there's a mesh network; at least, Peter Levine's sort of talk was that there are near-local compute resources, and you can use bitcoin to pay for it, or blockchain, or however it works. But that sort of topology, we haven't really encountered before in computing, have we? And how imminent is that sort of...
>> I think that some of this stuff you can do today in the cloud. I think if you want super-low latency, probably you need to have more computation towards the edges, but if I'm thinking I want reactions on the order of tens to hundreds of milliseconds, in theory you can do it today with the cloud infrastructure we have. And if you think about it, in many cases, even if you can only do it within a few hundred milliseconds, it's still super useful. Right? To avoid this object which has dropped on the highway, you know, if I have a few hundred milliseconds, many cars will effectively avoid it, having that information.
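A toy sketch of the edge-and-center ensemble idea discussed above: each edge site scores an event with its own locally trained model, and the center combines those scores with a global view to reach one decision. The models, thresholds, and event fields here are invented for illustration; they are not from the interview or from any real system.

```python
from typing import Callable, Dict

# Hypothetical per-edge anomaly scorers (0.0 = normal, 1.0 = highly anomalous).
# In practice each would be a model trained on that edge's own traffic.
EDGE_MODELS: Dict[str, Callable[[dict], float]] = {
    "edge-nyc": lambda event: min(event["bytes"] / 100_000, 1.0),
    "edge-sfo": lambda event: 1.0 if event["failed_logins"] > 5 else 0.1,
}


def center_model(event: dict) -> float:
    # Hypothetical global model: it sees cross-site patterns, e.g. the
    # same source touching many sites at once.
    return 0.9 if event["sites_touched"] > 3 else 0.2


def ensemble_decision(event: dict, block_threshold: float = 0.6) -> bool:
    # Combine the edge scores and the center score; a simple weighted
    # average here, but any combiner (voting, stacking) would fit.
    edge_scores = [model(event) for model in EDGE_MODELS.values()]
    edge_avg = sum(edge_scores) / len(edge_scores)
    combined = 0.5 * edge_avg + 0.5 * center_model(event)
    return combined >= block_threshold


if __name__ == "__main__":
    suspicious = {"bytes": 80_000, "failed_logins": 9, "sites_touched": 5}
    print("block?", ensemble_decision(suspicious))  # expected: block? True
```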
>> Let's take that conversation about the edge a little further, the one we were having off camera. So there's a debate in our community about how much data will stay at the edge and how much will go into the cloud. David Flores said 90% of it will stay at the edge. Your comment was, it depends on the value. What do you mean by that?
>> I think that depends on who I am and how I perceive the value of the data. And, you know, what can the value of the data be? This is what I was saying. I think that the value of the data is fundamentally about what kind of decisions, what kind of actions, it will enable me to take. Right? So here I'm not just talking about, you know, credit card information or things like that, where there is clearly an action somebody's going to take on it. If I do believe that the data can provide me with the ability to take better actions or make better decisions, I think I want to keep it. And the reason I want to keep it is also that it's not only about the decisions it enables me to make now; everyone is going to continuously improve their algorithms, develop new algorithms. And when you do that, how do you test them? You test on the old data. Right? So I think that, for all these reasons, a lot of data, valuable data in this sense, is going to go to the cloud. Now, is there a lot of data that should remain at the edges? I think that's fair. But again, if a cloud provider, or someone who provides a service in the cloud, believes that the data is valuable, I do believe that eventually it is going to get to the cloud.
>> So if it's valuable, it will be persisted and will eventually get to the cloud? And we talked about latency, with the example of evasive action: you can't send the data back to the cloud to make the decision, you have to make it in real time. But eventually that data, if it's important, will go back to the cloud. The other question is, of all this data that we are now processing on a continuous basis, how much actually will get persisted? Much of it probably does not get persisted, right? Is that a fair assumption?
>> Yeah, I think so. And probably not all the data is equal. All right? It's like, even if you take continuous video, all right? The cars continuously have video from multiple cameras, and radar and lidar, all of this stuff. This is continuous. And if you think about it, I would assume that you don't want to send all the data to the cloud. But the data around the interesting events, you may want to send, right? So before and after the car has a near-accident, or took an evasive action, or the human had to intervene: in all these cases, probably I want to send the data to the cloud. But in most cases, probably not.
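The "send only the window around interesting events" idea lends itself to a small sketch. The class below is a hypothetical illustration (the frame format, window sizes, and trigger are assumed): it keeps a rolling buffer of recent sensor frames on the vehicle and hands back only the slice surrounding an event such as a near-accident or a human takeover, which is the part worth uploading to the cloud.

```python
from collections import deque
from typing import List, Optional


class EventWindowRecorder:
    """Keep a rolling buffer of frames; emit only windows around events."""

    def __init__(self, pre_frames: int = 100, post_frames: int = 100):
        self.buffer = deque(maxlen=pre_frames)  # frames before a potential event
        self.post_frames = post_frames
        self._pending_post = 0
        self._capture: List[dict] = []

    def on_frame(self, frame: dict, interesting: bool) -> Optional[List[dict]]:
        """Feed one frame; return a window to upload once it is complete."""
        if interesting and self._pending_post == 0:
            # Event detected: snapshot what led up to it, then keep
            # collecting a fixed number of frames after it.
            self._capture = list(self.buffer)
            self._pending_post = self.post_frames

        self.buffer.append(frame)

        if self._pending_post > 0:
            self._capture.append(frame)
            self._pending_post -= 1
            if self._pending_post == 0:
                window, self._capture = self._capture, []
                return window  # caller ships only this slice to the cloud
        return None


if __name__ == "__main__":
    recorder = EventWindowRecorder(pre_frames=3, post_frames=2)
    for i in range(10):
        window = recorder.on_frame({"frame_id": i}, interesting=(i == 5))
        if window is not None:
            print("upload frames:", [f["frame_id"] for f in window])
            # upload frames: [2, 3, 4, 5, 6]
```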
>> That's good. We have to leave it there, but I'll give you the last word on things that are exciting you, things you're working on, interesting projects.
>> Yeah, so what really excites me is how we are going to have these continuous applications: you are going to continuously interact with the environment, and you are going to continuously learn and improve. And here there are many challenges, and I just want to mention a few more which we haven't discussed. One, in general, is explainability. Right? If these systems augment the human decision process, if these systems are going to make decisions which impact you as a human, you want to know why. Right? Like I gave this example: assuming you have machine-learning algorithms making a diagnosis on your MRI or X-ray, you want to know why. What is it in this X-ray that causes that decision? If you go to the doctor, they are going to point and show you: okay, this is why you have this condition. So I think this is very important, because as a human you want to understand. And you want to understand not only why the decision happened, but also what you have to do, what you need to do, to do better in the future, right? Like, if your mortgage application is turned down, I want to know why that is, because next time, when I apply for a mortgage, I want to have a higher chance of getting it through. So I think that's a very important aspect. And the last thing I will say, and this is super important, is about having algorithms which can say "I don't know." Right? It's like, okay, I have never seen this situation in the past, so I don't know what to do. This is much better than just giving you the wrong decision. Right?
>> Right, or a low probability that you don't know what to do with. (laughs)
>> Yeah.
>> Excellent. Ion, thanks again for coming on theCUBE. It was really a pleasure having you.
>> Thanks for having me.
>> You're welcome. All right, keep it right there, everybody. George and I will be back to do our wrap right after this short break. This is theCUBE. We're live from Spark Summit East. Right back. (techno music)
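As a closing footnote to that last point about algorithms that can say "I don't know": one common way to get this behavior is to let a model abstain whenever its top predicted probability falls below a threshold and hand those cases to a human. The sketch below is a generic illustration with made-up numbers, not a description of any RISELab system.

```python
from typing import Dict, Optional


def predict_or_abstain(probs: Dict[str, float],
                       min_confidence: float = 0.75) -> Optional[str]:
    """Return the top label, or None ("I don't know") when unsure."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= min_confidence else None


if __name__ == "__main__":
    confident = {"approve": 0.92, "deny": 0.08}
    unfamiliar = {"approve": 0.55, "deny": 0.45}  # a case unlike past data
    print(predict_or_abstain(confident))   # approve
    print(predict_or_abstain(unfamiliar))  # None -> escalate to a human
```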