Matei Zaharia, Databricks - #SparkSummit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE. Covering Spark Summit 2017, brought to you by Databricks. (upbeat music)

>> Welcome back to Spark Summit 2017, you're watching theCUBE, and we have an honored guest here today. His name is Matei Zaharia, and Matei is the creator of Spark, Chief Technologist, and Co-Founder of Databricks. Did I get all that right?

>> Yeah, thanks a lot for having me again. Excited to be here.

>> Matei, we were watching your keynote this morning, and we were all excited to hear about better support for deep learning and about some of the structured streaming apps now being in production. I want to ask you what happened after the keynote. What kind of feedback have you heard from people in the hallways?

>> Yeah, definitely, so the feedback has been super positive. I think people really like the direction that we're moving in with Apache Spark and with these libraries, such as the deep learning pipelines one. So we've gotten a lot of questions about the deep learning library, when it will support more data types and so on. It's really good at supporting images right now. And also with streaming, I think people are just excited to try out the low latency streaming.

>> Any other priorities people asked you about that maybe you haven't focused on yet?

>> That I haven't focused on in the keynote? That's a good question. I think overall, some of the things we keep seeing are that people just want to make it easier to operate Spark at scale and to simplify things like monitoring and debugging, so that's a constant theme that we're seeing. And then another thing that's generally been going on, which I didn't focus on this time, is increasing usage by Python and R users. So there's a lot of work in the latest release to continue improving that, to make it easier to use in those languages.

>> Okay, we were watching the impressive demos this morning. In fact, George was watching the keynote; he saw the one millisecond latency and he said wow. George, you want to ask a little more about that?

>> So yeah, let's talk about that, because there's this rise of continuous apps, which I think you guys named.

>> Matei: Yeah.

>> And it resonates with everyone, to go along with batch and request-response. And in the past, people were saying, well, Spark was doing many micro-batches, so latency was a couple hundred milliseconds. So now that you're down at one millisecond, what does that change in terms of the class of apps that you're appropriate for? Or, you know, some people have talked about the criticality of event processing. Where is Spark on that now?

>> Yeah, definitely. So the goal of this is exactly to support the full range of latencies, possibly all the way down to sub-millisecond latency, and to give users the same programming model, so they don't have to use a different system or a lower-level programming model to get that low latency. And so basically, since we began structured streaming, we tried to make sure the API is not tied to micro-batching in any way. And so this is the next step, to actually eliminate that from the engine and be able to execute these computations continuously.
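For readers who want a concrete picture of what Matei is describing, the continuous execution mode later shipped as an experimental feature in Spark 2.3, selected through the ordinary structured streaming API. Here is a minimal sketch; the built-in rate source, console sink, and one-second checkpoint interval are illustrative choices, not details from the keynote demo.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-sketch").getOrCreate()

# A built-in test source that emits (timestamp, value) rows at a fixed rate.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

# The same DataFrame code runs in either execution mode; the trigger alone
# selects continuous (millisecond-level) execution. The "1 second" argument
# sets the checkpoint interval, not the processing latency.
query = (events.selectExpr("value % 10 AS key", "timestamp")
         .writeStream
         .format("console")
         .trigger(continuous="1 second")  # experimental continuous mode
         .start())
```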
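The deep learning pipelines library mentioned at the top of the conversation plugs into the same DataFrame world. A sketch of image-based transfer learning, roughly along the lines of the Databricks announcement, might look like the following; it assumes the sparkdl package is installed, and the directory layout and label values are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import lit
from sparkdl import DeepImageFeaturizer, readImages

# Hypothetical layout: one directory of images per class.
cats = readImages("/data/images/cats").withColumn("label", lit(1.0))
dogs = readImages("/data/images/dogs").withColumn("label", lit(0.0))
train = cats.union(dogs)

# Transfer learning: a pretrained InceptionV3 network turns each image into
# a feature vector, and a simple logistic regression is trained on top.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[featurizer, lr]).fit(train)
```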
And what are the new applications? So I think this really enables two types of things we've seen. One is automated decision-making systems. This could be on, say, a website, or say when someone's applying for a loan, but it could be at even lower latency, say stock market style trading, or internet of things, or industrial monitoring, and making decisions there. That's one thing. And then the other thing we see people doing is a lot of stream-to-stream ETL, which is a bit more boring in some ways, but once you set that up, it's nice to have these very low latency transformations that can produce new streams from existing ones, because then nothing downstream from them is affected in terms of latency.

>> So in this last example, it's sort of to help build microservice-type applications.

>> Yeah, exactly. Well, in general, there's basically this whole architecture of saying all my data will be streams, and then I'll have some applications that just produce new streams. And then later that stuff can go into a data lake or into a real-time system or whatever. So it's basically keeping it low latency while it remains in stream form.

>> So we were talking earlier, and we've been talking to the SnappyData folks and the Splice Machine folks, and they built Spark into a DBMS, so that like it's immutable, I'm sorry, mutable.

>> Matei: Mutable, yeah.

>> Like a data frame is updatable. So what does that make possible, even if you can do the same things with Spark without it? What does it make easier?

>> So that's also in the same spirit of continuous applications. It's saying you should have a single programming model and interface for doing both your transactional work and your analytics afterward, and then maybe serving the results of the analytics. So that makes a lot of sense, and an example of that would be, you know, I keep going back to the financial or credit card type of use cases, but it would be something where users are conducting transactions and maybe you learn stuff about them from that. You say, okay, here's where they're located, here's what they're purchasing, whatever. And then you also want to make a decision. For example, do I allow them to go past the limit on their credit card? Or is this a normal use or a fraudulent one? So that's where it helps to integrate these. So there are products like SnappyData that integrate a specific database with Spark, and we're also trying to improve the APIs in Spark so that people can integrate their own system, whatever database or key-value store they want.

>> So would you have to jump through hoops if you didn't want to integrate any other store, other than talking to a file system?

>> Yeah, if you want to do these transactions on a file system, there will be some performance constraints to doing that. It depends on the rate; it's definitely the simplest thing, and if you have a low enough rate of updates it could actually be fine. But if you want more fine-grained updates, then it becomes a problem.

>> It would seem like if you tack on a product for ingest, not that you really want to get into that, think Kafka, which could also stretch into the transforms and some basic analytics, and you mentioned, I think in the Spark Summit East keynote, Redis for serving, you've got now a sort of multi-vendor product stack. And so there's complexity to that.

>> Matei: Yeah, definitely, yeah.
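As a concrete picture of the stream-to-stream ETL pattern Matei describes, here is a minimal sketch that reads one Kafka topic, cleans it, and republishes a new stream. It assumes the spark-sql-kafka connector is on the classpath, and the broker address, topic names, schema, and checkpoint path are all illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, struct, to_json
from pyspark.sql.types import DoubleType, StringType, StructType

spark = SparkSession.builder.appName("stream-to-stream-etl").getOrCreate()

# Hypothetical schema for raw purchase events.
schema = (StructType()
          .add("user", StringType())
          .add("amount", DoubleType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # illustrative
       .option("subscribe", "purchases.raw")                 # illustrative topic
       .load())

# Parse, filter, and re-serialize: the cleaned stream becomes a new topic
# that downstream consumers can read without an extra latency penalty.
cleaned = (raw
           .select(from_json(col("value").cast("string"), schema).alias("e"))
           .where(col("e.amount") > 0)
           .select(to_json(struct("e.user", "e.amount")).alias("value")))

query = (cleaned.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "purchases.clean")                 # illustrative
         .option("checkpointLocation", "/tmp/etl-checkpoint")
         .start())
```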
>> Do you foresee a scenario where you could see that as a high-volume solution and it's something that you would take ownership of?

>> I see. So, well, do you mean from the Apache Spark side or from the Databricks side?

>> George: Actually, either.

>> Yeah, so I think from the Spark side, basically so far the project doesn't provide storage; it just provides computation, and it plugs into different storage engines. And so it would be kind of a big shift, it might be possible, but it would be kind of a big shift to say, okay, we'll also provide persistent storage. I think the more likely thing that will happen is better and better integrations with the most widely used open source storage systems. So Redis is one; Apache Kafka, there's a lot of work on integrating that better, and so on. From the Databricks side, that is different, because that is a fully managed cloud service, and it definitely makes sense there that you'd have a turnkey solution. Right now we can actually build that for people who want it, sometimes with other vendors or with services built into Amazon, but that makes a lot of sense.

>> And Matei, something I read a press release on, but I didn't hear it in the keynote this morning. I hate to steal thunder from tomorrow, but can you give us a sneak preview on serverless apps? What's that about?

>> Yeah, so this is actually something we put out a press release on today, and we'll have a full keynote tomorrow morning and a lot more details on our website. So this is Databricks Serverless. It's basically a serverless platform for running Apache Spark and data science. So, not to steal away too much thunder, but serverless computing is this idea that users can just submit a query or computation, they don't have to configure the hardware at all, and they just get high performance and they get results. And so far it's been very successful with stateless workloads, such as SQL, or Amazon Lambda, which is, you know, just functions serving a webpage or something like that. So this is going to be the first offering that actually extends that model to data science and in general to Spark workloads. So you can have machine learning users, you can have these streaming applications, all these things, in that kind of environment. So yeah, we'll have a lot more detail on that tomorrow; it's something that we're excited about.

>> I want to circle back to IoT apps. There's sort of an emerging consensus that we're going to do a lot of training in the cloud, because we have access to big compute and lots of data. But then there's the issue on the edge: in the near to medium term, the footprint, like a lot of people are telling us, high-volume devices will have 3 megs of memory, and a gateway server would have like two gigs and two cores. So can you carve Spark up into fitting on one of the...

>> That's a good question. I think for that, again, the most likely way that would happen is through data sources. For example, there are these projects like Apache NiFi, and other projects as well, that let you build up a data pipeline from IoT devices all the way to the cloud, and you can imagine pushing some computation through those. So yeah, I don't have a very concrete answer. It is something that's coming up a bunch though, so we do want to support this type of splitting the computation.
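On the edge side of that split, the idea Matei returns to below, sending on only the interesting readings, can be pictured with a tiny dependency-free filter like this sketch. It is a generic illustration of the concept, not a Spark or Databricks component, and the threshold and warm-up values are arbitrary.

```python
import math

class EdgeFilter:
    """Forward only readings more than k standard deviations from a running
    mean, so the 'boring' data never has to leave a constrained device."""

    def __init__(self, k=2.0, warmup=3):
        self.k = k
        self.warmup = warmup  # readings observed before any are flagged
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def observe(self, x):
        # Judge x against the statistics seen so far, then fold it in;
        # the running mean/variance update uses O(1) memory.
        anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomaly = std > 0 and abs(x - self.mean) > self.k * std
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomaly

f = EdgeFilter()
readings = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0]
print([x for x in readings if f.observe(x)])  # only the 5.0 outlier is sent on
```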
>> But in terms of splitting the computation, you could take a trained model. Model training is fat compute, and then the trained model--

>> You can definitely push the model out and do inference.

>> Would that inference have to happen in a Spark runtime, or could it be somewhere else?

>> I think it could happen anywhere else also. And actually, we do see a lot of people wanting to export machine learning pipelines or models from Spark into another environment, so it can happen somewhere else too. And then the other aspect of it is also data collection. So if you can push something down that says here is when the data is interesting, you should remember these readings and send them on, that would also help. Because otherwise, you know, say it's a video camera or something, most of the time it's looking at nothing, and you don't want to send all of that back.

>> That's actually a key point, which is some folks, especially in the IT ops area, where, you know, it's training wheels for IoT because they're doing machine learning on infrastructure--

>> Matei: Yeah, which is there.

>> Yeah, they say, oh, flag anything outside two standard deviations of the band of expectations. But there's more of an answer to that, I gather, from what you're saying.

>> Yeah, I mean, I think you can create, for example, a small machine learning model that decides whether what it's seeing is unusual and sends it back. Or you can even make it query-specific: like, I want to find this type of object that's going by the camera, and try to find that. So I think there's a lot of room to improve that.

>> Okay, well, we have just a couple of minutes left here, and I want to look to the future a little bit. There's been some great progress since the summit last year to this one. What would you say is the next boundary that needs to be pushed to get Spark to the next level, whatever that may be?

>> Yeah, definitely. Well, okay, so first of all, in terms of the project today, the big workloads that we're seeing come up all the time are deep learning and stream processing. These are the big emerging ones. I mean, there's still a lot of data warehousing, ETL and so on, that's still there, but these are the new ones, so that's what we're focusing on, on our team at least, and we'll continue building out the stuff that you saw announced today. I think beyond that, and this is more on the Databricks side, part of the problem is also just making it much easier for teams or businesses to begin using these technologies at all. And that's where we think cloud computing, or software as a service, is the way, because you just turn it on and you can immediately start doing things. The way that I view it is, right now the barrier to doing any project with data science or machine learning, or even simple analytics on unstructured data, is really high, so companies can only do it on a few projects. There might be like a hundred things they could be trying, but they can only afford to spend on two or three of them. So if you lower that barrier, there will be a lot more of them, and everyone will be able to quickly try one of these applications and see whether it actually works.

>> And this ties into some of your graduate studies, like with model management and things like that?

>> Yeah, so on the research side.
I'm also, you know, doing research at Stanford, and on that side we have this lab called DAWN, which is about usable machine learning. It's exactly these things: how do you enable an order of magnitude more people to do things with machine learning? So actually we're also doing the video push-down thing I mentioned, that's one thing we're looking at, and a bunch of other stuff as well.

>> Matei, we could talk to you all day, but we don't have all day. We're up against the break here, but I wanted to thank you very much for coming and sharing a few moments, and we look forward to seeing you in the hallways here at Spark Summit, right?

>> Yeah, thanks again for having me.

>> Thanks for joining us, and thank you all for watching. Here we are on theCUBE at Spark Summit 2017, thanks for watching. (upbeat music)