Holden Karau, Google | Flink Forward 2018
>> Narrator: Live from San Francisco, it's the Cube, covering Flink Forward, brought to you by Data Artisans. (tech music) >> Hi, this is George Gilbert, we're at Flink Forward, the user conference for the Apache Flink Community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very eminent guest, with a long pedigree, Holden Karau, formerly of IBM, and Apache Spark fame, putting Apache Spark and Python together. >> Yes. >> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors. >> Yeah. >> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced. Yes. >> So, Holden, tell us about the objective of putting these together. What type of use cases.... >> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark, with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot of people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job, to do similar data preparation for when they want to make real predictions. 
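The duplicated-preparation problem Holden describes can be sketched in plain Python. This is an editorial illustration, not the tf.Transform or Beam API; the `prepare` function and bucket count are invented for the example. The point is that one function is shared by both paths, so the training-time and serving-time logic cannot drift apart:

```python
import hashlib

NUM_BUCKETS = 32  # illustrative hash-bucket count, chosen arbitrarily


def prepare(record: str) -> list:
    """Tokenize a raw record and hash each token into a fixed bucket.

    The same function runs at training time (over the whole dataset)
    and at serving time (on each incoming record), so there is no
    second copy of the logic to forget to update.
    """
    tokens = record.lower().split()
    return [
        int(hashlib.md5(t.encode()).hexdigest(), 16) % NUM_BUCKETS
        for t in tokens
    ]


# Training time: prepare every record in the batch.
training_features = [prepare(r) for r in ["Flink Forward 2018", "Beam on Flink"]]

# Serving time: the *same* function prepares the live record.
serving_features = prepare("Beam on Flink")

assert serving_features == training_features[1]
```

Without the shared function, the hashing or tokenization in the serving job can silently diverge from the training job, which is exactly the "change one variable in one place and forget the other" failure mode discussed here.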
And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way, that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models. >> Okay, so help us tie that back to, in this case, Flink. >> Yes. >> And also to clarify, that data prep.... My impression was data prep was a different activity. It was like design time and serving was run time. But you're saying that they can be better integrated? >> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time. >> So, by having a common Beam API and the common provider underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving. >> Yes, very true. >> So, tell us what type of customers want to write to the Beam API and have that portability? >> Yeah, so that's a really good question. 
So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies, but they don't want to be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs. >> Does Spark Structured Streaming support the Beam API? >> So, right now, the Spark support for Beam is limited. It's in the old DStream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing it. Because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see, is like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of cool things that are being done nowadays, are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python, across these different engines. Flink supports Python, but it's maybe not a first class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go. >> In to support Go. >> Which is a little out of left field. 
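The "write the job once, pick the engine later" idea behind Beam can be sketched with a toy in plain Python. This is an editorial concept sketch, not the real Beam API (a real Beam pipeline uses `apache_beam.Pipeline` with a runner option); the runner classes here are invented to show that the user's job definition is untouched when the engine is swapped:

```python
def pipeline(records):
    """The user's job, defined once: a simple word count."""
    counted = {}
    for word in (w for r in records for w in r.split()):
        counted[word] = counted.get(word, 0) + 1
    return counted


class LocalRunner:
    """Executes the job in-process."""
    def run(self, job, data):
        return job(data)


class PretendClusterRunner:
    """Stands in for a Flink/Spark/Dataflow runner. A real runner
    would ship the job to the cluster; here we execute locally to
    show that swapping runners does not change the job itself."""
    def run(self, job, data):
        return job(data)


data = ["beam on flink", "beam on spark"]
assert LocalRunner().run(pipeline, data) == PretendClusterRunner().run(pipeline, data)
```

The same separation is what lets a team prototype against one engine and move to another without rewriting their entire job.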
>> So, would it be fair to say that the value of Beam, for potential Flink customers, they can work and start on Google Cloud Platform. They can start on one of several stream processors. They can move to another one later, and they also inherit the better language support, or bindings from the Beam API? >> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally. >> So, what type of customers do you see who have moved into production with machine learning? >> So the.... >> And the streaming pipelines? >> The biggest customer that's in production is obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify. >> Without the names, the sort of use cases and the general industry.... >> I don't want to get in trouble. >> Okay. >> I'm just going to ... sorry. >> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to MapReduce? >> Yes, so.... >> And is that a competitor then to Flink? >> I think Flink and Dataflow can be used in some of the same cases. But, I think they're more complementary. Flink is something you can run on-prem. You can run it with different vendors. And Dataflow is very much like, "I can run this on Google Cloud." 
And part of the idea with Beam is to make it so that people who want to write Dataflow jobs but maybe want the flexibility to go back to something else later can still have that. Yeah, we could swap in Flink or Dataflow execution engines if we're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? >> George: Okay. >> It's probably one of those, sort of, friendly competitions. Where we both push each other to do better and add more features that the respective projects have. >> Okay, 30 second question. >> Cool. >> Do you see people building stream processing applications with machine learning as part of it to extend existing apps or for ground up new apps? >> Totally. I mostly see it as extending existing apps. This is obviously, possibly a bias, just for the people that I talk to. But, going ground up with both streaming and machine learning, at the same time, like, starting both of those projects fresh is a really big hurdle to get over. >> George: For skills. >> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been king for a long time and streaming is getting its day in the sun. So, we could start seeing people becoming more adventurous and doing both, at the same time. >> Holden, on that note, we'll have to call it a day. That was most informative. >> It's really good to see you again. >> Likewise. So this is George Gilbert. 
We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)
Holden Karau, IBM Big Data SV 17 #BigDataSV #theCUBE
>> Announcer: Big Data Silicon Valley 2017. >> Hey, welcome back, everybody, Jeff Frick here with The Cube. We are live at the historic Pagoda Lounge in San Jose for Big Data SV, which is associated with Strata + Hadoop World, across the street, as well as Big Data week, so everything big data is happening in San Jose, we're happy to be here, love the new venue, if you're around, stop by, back of the Fairmont, Pagoda Lounge. We're excited to be joined in this next segment by, who's now become a regular, any time we're at a Big Data event, a Spark event, Holden always stops by. Holden Karau, she's the principal software engineer at IBM. Holden, great to see you. >> Thank you, it's wonderful to be back yet again. >> Absolutely, so the big data meme just keeps rolling, Google Cloud Next was last week, a lot of talk about AI and ML and of course you're very involved in Spark, so what are you excited about these days? What are you, I'm sure you've got a couple presentations going on across the street. >> Yeah, so my two presentations this week, oh wow, I should remember them. So the one that I'm doing today is with my co-worker Seth Hendrickson, also at IBM, and we're going to be focused on how to use structured streaming for machine learning. And sort of, I think that's really interesting, because streaming machine learning is something a lot of people seem to want to do but aren't yet doing in production, so it's always fun to talk to people before they've built their systems. And then tomorrow I'm going to be talking with Joey on how to debug Spark, which is something that I, you know, a lot of people ask questions about, but I tend to not talk about, because it tends to scare people away, and so I try to keep the happy going. >> Jeff: Bugs are never fun. >> No, no, never fun. 
>> Just picking up on that structured streaming and machine learning, so there's this issue of, as we move more and more towards the industrial internet of things, like having to process events as they come in, make a decision. How, there's a range of latency that's required. Where does structured streaming and ML fit today, and where might that go? >> So structured streaming for today, latency wise, is probably not something I would use for something like that right now. It's in the like sub second range. Which is nice, but it's not what you want for like live serving of decisions for your car, right? That's just not going to be feasible. But I think it certainly has the potential to get a lot faster. We've seen a lot of renewed interest in MLlib-local, which is really about making it so that we can take the models that we've trained in Spark and really push them out to the edge and sort of serve them in the edge, and apply our models on end devices. So I'm really excited about where that's going. To be fair, part of my excitement is someone else is doing that work, so I'm very excited that they're doing this work for me. >> Let me clarify on that, just to make sure I understand. So there's a lot of overhead in Spark, because it runs on a cluster, because you have an optimizer, because you have the high availability or the resilience, and so you're saying we can preserve the predict and maybe serve part and carve out all the other overhead for running in a very small environment. >> Right, yeah. So I think for a lot of these IOT devices and stuff like that it actually makes a lot more sense to do the predictions on the device itself, right. These models generally are megabytes in size, and we don't need a cluster to do predictions on these models, right. We really need the cluster to train them, but I think for a lot of cases, pushing the prediction out to the edge node is actually a pretty reasonable use case. 
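The on-device serving Holden describes can be sketched in plain Python. This is an editorial illustration, not MLlib-local's actual API; the weights here are made up, standing in for a small model trained on the cluster and copied to the device. Prediction is just arithmetic, so no cluster (and no network round-trip) is needed per event:

```python
import math

# Hypothetical weights exported from a model trained on the cluster;
# in practice these would be loaded from a file pushed to the device.
WEIGHTS = [0.8, -1.2, 0.5]
BIAS = 0.1


def predict(features):
    """Score one event on-device: a dot product and a sigmoid."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


# Each incoming event is scored individually, as it arrives.
score = predict([1.0, 0.5, 2.0])
assert 0.0 < score < 1.0
```

The asymmetry is the whole point: training needs the cluster's data and compute, but the trained model is only megabytes of numbers, cheap to evaluate anywhere.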
And so I'm really excited that we've got some work going on there. >> Taking that one step further, we've talked to a bunch of people, both like at GE, and at their Minds and Machines show, and IBM's Genius of Things, where you want to be able to train the models up in the cloud where you're getting data from all the different devices and then push the retrained model out to the edge. Can that happen in Spark, or do we have to have something else orchestrating all that? >> So actually pushing the model out isn't something that I would do in Spark itself, I think that's better served by other tools. Spark is not really well suited to large amounts of internet traffic, right. But it's really well suited to the training, and I think with MLlib-local it'll essentially, we'll be able to provide both sides of it, and the copy part will be left up to whoever it is that's doing their work, right, because like if you're copying over a cell network you need to do something very different as if you're broadcasting over a terrestrial XM or something like that, you need to do something very different for satellite. >> If you're at the edge on a device, would you be actually running, like you were saying earlier, structured streaming, with the prediction? >> Right, I don't think you would use structured streaming per se on the edge device, but essentially there would be a lot of code share between structured streaming and the code that you'd be using on the edge device. And it's being factored out now so that we can have this code sharing and Spark machine learning. And you would use structured streaming maybe on the training side, and then on the serving side you would use your custom local code. >> Okay, so tell us a little more about Spark ML today and how we can democratize machine learning, you know, for a bigger audience. >> Right, I think machine learning is great, but right now you really need a strong statistical background to really be able to apply it effectively. 
And we probably can't get rid of that for all problems, but I think for a lot of problems, doing things like hyperparameter tuning can actually give really powerful tools to just like regular engineering folks who, they're smart, but maybe they don't have a strong machine learning background. And Spark's ML pipelines make it really easy to sort of construct multiple stages, and then just be like, okay, I don't know what these parameters should be, I want you to do a search over what these different parameters could be for me, and it makes it really easy to do this as just a regular engineer with less of an ML background. >> Would that be like, just for those of us who are, who don't know what hyperparameter tuning is, that would be the knobs, the variables? >> Yeah, it's going to spin the knobs on like our regularization parameter on like our regression, and it can also spin some knobs on maybe the n-gram sizes that we're using on the inputs to something else, right. And it can compare how these knobs sort of interact with each other, because often you can tune one knob but you actually have six different knobs that you want to tune and you don't know, if you just explore each one individually, you're not going to find the best setting for them working together. >> So this would make it easier for, as you're saying, someone who's not a data scientist to set up a pipeline that lets you predict. >> I think so, very much. I think it does a lot of the, brings a lot of the benefits from sort of the SciPy world to the big data world. And SciPy is really wonderful about making machine learning really accessible, but it's just not ready for big data, and I think this does a good job of bringing these same concepts, if not the code, but the same concepts, to big data. >> The SciPy, if I understand, is it a notebook that would run essentially on one machine? >> SciPy can be put in a notebook environment, and generally it would run on, yeah, a single machine. 
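The joint knob search described a moment ago can be sketched as a toy grid search in plain Python. This is an editorial stand-in, not Spark's actual CrossValidator/ParamGridBuilder API; the `evaluate` function is invented, standing in for fitting a model at one configuration and returning its held-out score:

```python
from itertools import product


def evaluate(regularization, ngram_size):
    """Stand-in for training + validating one configuration; a real
    pipeline would fit a model here and score it on held-out data.
    This toy scoring function peaks at regularization=0.1, ngram=2."""
    return -(regularization - 0.1) ** 2 - (ngram_size - 2) ** 2


# Search the knobs *jointly*, since the best value of one knob can
# depend on the setting of another -- tuning each one individually
# can miss the best combination.
grid = product([0.01, 0.1, 1.0], [1, 2, 3])
best = max(grid, key=lambda cfg: evaluate(*cfg))
assert best == (0.1, 2)
```

The value for a non-specialist is that only the candidate values have to be declared; the search over their combinations is mechanical.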
>> And so to make that sit on Spark means that you could then run it on a cluster-- >> So this isn't actually taking SciPy and distributing it, this is just like stealing the good concepts from SciPy and making them available for big data people. Because SciPy's done a really good job of making a very intuitive machine learning interface. >> So just to put a fine sort of qualifier on one thing, if you're doing the internet of things and you have Spark at the edge and you're running the model there, it's the programming model, so structured streaming is one way of programming Spark, but if you don't have structured streaming at the edge, would you just be using the core batch Spark programming model? >> So at the edge you'd just be using, you wouldn't even be using batch, right, because you're trying to predict individual events, right, so you'd just be calling predict with every new event that you're getting in. And you might have a q mechanism of some type. But essentially if we had this batch, we would be adding additional latency, and I think at the edge we really, the reason we're moving the models to the edge is to avoid the latency. >> So just to be clear then, is the programming model, so it wouldn't be structured streaming, and we're taking out all the overhead that forced us to use batch with Spark. So the reason I'm trying to clarify is a lot of people had this question for a long time, which is are we going to have a different programming model at the edge from what we have at the center? >> Yeah, that's a great question. And I don't think the answer is finished yet, but I think the work is being done to try and make it look the same. 
Of course, you know, trying to make it look the same, this is Boosh, it's not like actually barking at us right now, even though she looks like a dog, she is, there will always be things which are a little bit different from the edge to your cluster, but I think Spark has done a really good job of making things look very similar on single node cases to multi node cases, and I think we can probably bring the same things to ML. >> Okay, so it's almost time, we're coming back, Spark took us from single machine to cluster, and now we have to essentially bring it back for an edge device that's really light weight. >> Yeah, I think at the end of the day, just from a latency point of view, that's what we have to do for serving. For some models, not for everyone. Like if you're building a website with a recommendation system, you don't need to serve that model like on the edge node, that's fine, but like if you've got a car device we can't depend on cell latency, right, you have to serve that in car. >> So what are some of the things, some of the other things that IBM is contributing to the ecosystem that you see having a big impact over the next couple years? >> So there's a lot of really exciting things coming out of IBM. And I'm obviously pretty biased. I spend a lot of time focused on Python support in Spark, and one of the most exciting things is coming from my co-worker Brian, I'm not going to say his last name in case I get it wrong, but Brian is amazing, and he's been working on integrating Arrow with Spark, and this can make it so that it's going to be a lot easier to sort of interoperate between JVM languages and Python and R, so I'm really optimistic about the sort of Python and R interfaces improving a lot in Spark and getting a lot faster as well. And we're also, in addition to the Arrow work, we've got some work around making it a lot easier for people in R and Python to get started. 
The R stuff is mostly actually the Microsoft people, thanks Felix, you're awesome. I don't actually know which camera I should have done that to but that's okay. >> I think you got it! >> But Felix is amazing, and the other people working on R are too. But I think we've both been pursuing sort of making it so that people who are in the R or Python spaces can just use like pip install, conda install, or whatever tool it is they're used to working with, to just bring Spark into their machine really easily, just like they would sort of any other software package that they're using. Because right now, for someone getting started in Spark, if you're in the Java space it's pretty easy, but if you're in R or Python you have to do sort of a lot of weird setup work, and it's worth it, but like if we can get rid of that friction, I think we can get a lot more people in these communities using Spark. >> Let me see, just as a scenario, the R server is getting fairly well integrated into SQL Server, so would it be, would you be able to use R as the language with a Spark execution engine to somehow integrate it into SQL Server as an execution engine for doing the machine learning and predicting? >> You definitely, well I shouldn't say definitely, you probably could do that. I don't necessarily know if that's a good idea, but that's the kind of stuff that this would enable, right, it'll make it so that people that are making tools in R or Python can just use Spark as another library, right, and it doesn't have to be this really special setup. It can just be this library and they point out the cluster and they can do whatever work it wants to do. That being said, the SQL Server R integration, if you find yourself using that to do like distributed computing, you should probably take a step back and like rethink what you're doing. >> George: Because it's not really scale out. >> It's not really set up for that. 
And you might be better off doing this with like, connecting your Spark cluster to your SQL Server instance using like JDBC or a special driver and doing it that way, but you definitely could do it in another inverted sort of way. >> So last question from me, if you look out a couple years, how will we make machine learning accessible to a bigger and bigger audience? And I know you touched on the tuning of the knobs, hyperparameter tuning, what will it look like ultimately? >> I think ML pipelines are probably what things are going to end up looking like. But I think the other part that we'll sort of see is we'll see a lot more examples of how to work with certain kinds of data, because right now, like, I know what I need to do when I'm ingesting some textual data, but I know that because I spent like a week trying to figure out what the hell I was doing once, right. And I didn't bother to write it down. And it looks like no one else bothered to write it down. So really I think we'll see a lot of tools that look very similar to the tools we have today, they'll have more options and they'll be a bit easier to use, but I think the main thing that we're really lacking right now is good documentation and sort of good books and just good resources for people to figure out how to use these tools. Now of course, I mean, I'm biased, because I work on these tools, so I'm like, yeah, they're pretty great. So there might be other people who are like, Holden, no, you're wrong, we need to rethink everything. But I think this is, we can go very far with the pipeline concept. >> And then that's good, right? The democratization of these things opens it up to more people, you get more creative people solving more different problems, that makes the whole thing go. 
>> You can like install Spark easily, you can, you know, set up an ML pipeline, you can train your model, you can start doing predictions, you can, people that haven't been able to do machine learning at scale can get started super easily, and build a recommendation system for their small little online shop and be like, hey, you bought this, you might also want to buy Boosh, he's really cute, but you can't have this one. No no no, not this one. >> Such a tease! >> Holden: I'm sorry, I'm sorry. >> Well Holden, that will, we'll say goodbye for now, I'm sure we will see you in June in San Francisco at the Spark Summit, and look forward to the update. >> Holden: I look forward to chatting with you then. >> Absolutely, and break a leg this afternoon at your presentation. >> Holden: Thank you. >> She's Holden Karau, I'm Jeff Frick, he's George Gilbert, you're watching The Cube, we're at Big Data SV, thanks for watching. (upbeat music)
Holden Karau, IBM - #BigDataNYC 2016 - #theCUBE
>> Narrator: Live from New York, it's the CUBE from Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, Nvidia. And our ecosystem sponsors. Now, here are your hosts: Dave Vellante and Peter Burris. >> Welcome back to New York City, everybody. This is the CUBE, the worldwide leader in live tech coverage. Holden Karau is here, principal software engineer with IBM. Welcome to the CUBE. >> Thank you for having me. It's nice to be back. >> So, what's with Boo? >> So, Boo is my stuffed dog that I bring-- >> You've got to hold Boo up. >> Okay, yeah. >> Can't see Boo. >> So, this is Boo. Boo comes with me to all of my conferences in case I get stressed out. And she also hangs out normally on the podium while I'm giving the talk as well, just in case people get bored. You know, they can look at Boo. >> So, Boo is not some new open source project. >> No, no, Boo is not an open source project. But Boo is really cute. So, that counts for something. >> All right, so, what's new in your world of Spark and machine learning? >> So, there's a lot of really exciting things, right. Spark 2.0.0 came out, and that's really exciting because we finally got to get rid of some of the chunkier APIs. And data sets are just becoming sort of the core base of everything going forward in Spark. This is bringing the Spark SQL engine to all sorts of places, right. So, the machine learning APIs are built on top of the data set API now. The streaming APIs are being built on top of the data set APIs. And this is starting to actually make it a lot easier for people to work together, I think. And that's one of the things that I really enjoy is when we can have people from different sort of profiles or roles work together. And so this support of data sets being everywhere in Spark now lets people with more of like a SQL background still write stuff that's going to be used directly in sort of a production pipeline. 
And the engineers can build whatever, you know, production ready stuff they need on top of the SQL expressions from the analysts and do some really cool stuff there. >> So, chunky API, what does that mean to a layperson? >> Sure, um, it means like, for example, there's this thing in Spark where one of the things you want to do is shuffle a whole bunch of data around and then look at all of the records associated with a given key, right? But, you know, when the APIs were first made, right, it was made by university students. Very smart university students, but you know, it started out as like a grad school project, right? And so finally with 2.0, we were able to get rid of things like places where we use traits like iterables rather than iterators. And because of these minor little chunky things, we had to keep supporting this old API, because you can't break people's code in a minor release, but when you do a big release like Spark 2.0, you can actually go, okay, you need to change your stuff now to start using Spark 2.0. But as a result of changing that in this one place, we're actually able to better support spilling to disk. And this is for people who have too much data to fit in memory even on the individual executors. So, being able to spill to disk more effectively is really important from a performance point of view. So, there's a lot of clean up of getting rid of things which were sort of holding us back performance-wise. >> So, the value is there. Enough value to break the-- >> Yeah, enough value to break the APIs. And 1.6 will continue to be updated for people that are not ready to migrate right today. But for the people that are looking at it, it's definitely worth it, right? You get a bunch of real cool optimizations. >> One of the themes of this event of the last couple of years has been complexity.
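The iterable-to-iterator change Holden describes can be illustrated outside of Spark. The sketch below is pure Python, not Spark code, and all names and data are invented: the point is that an iterator over each key's values can only be traversed once, which is what frees an engine to stream (or spill to disk) values for a key instead of materializing them all in memory.

```python
from itertools import groupby
from operator import itemgetter

def sum_by_key(records):
    """Group (key, value) pairs and reduce each group by streaming an
    iterator over its values, never building a full list per key.

    A toy analogy for the iterable -> iterator API shift: each `group`
    below is a one-pass iterator, so values can be consumed as they
    arrive rather than held in memory all at once.
    """
    # groupby requires its input sorted by key, the same way a shuffle
    # delivers records grouped by key to a reducer.
    records = sorted(records, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)  # consume the iterator once
        for key, group in groupby(records, key=itemgetter(0))
    }
```

For example, `sum_by_key([("a", 1), ("b", 2), ("a", 3)])` reduces each key's values in a single streaming pass.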
You guys wrote an article recently in SiliconANGLE about some of the broken promises of open source, really the root of it being complexity. So, Spark addresses that to a large degree. >> I think so. >> Maybe you could talk about that and explain to us sort of how and what the impact could be for businesses. >> So, I think Spark does a really good job of being really user-friendly, right? It has a SQL engine for people that aren't comfortable with writing, you know, Scala or Java or Python code. But then on top of that, right, there's a lot of analysts that are really familiar with Python. And Spark actually exposes Python APIs and is working on exposing R APIs. And this is making it so that if you're working on Spark, you don't have to understand the internals in a lot of depth, right? There's some other streaming systems where to make them perform really well, you have to have a really deep mental model of what you're doing. But with Spark, it's much simpler and the APIs are cleaner, and they're exposed in the ways that people are already used to working with their data. And because it's exposed in ways that people are used to working with their data, they don't have to relearn large amounts of complexity. They just have to learn it in the few cases where they run into problems, right? Because it will work most of the time just with the sort of techniques that they're used to doing. So, I think that it's really cool. Especially structured streaming, which is new in Spark 2.0. And structured streaming makes it so that you can write sort of arbitrary SQL expressions on streaming data, which is really awesome. Like, you can do aggregations without having to sit around and think about how to effectively do an aggregation over different microbatches. That's not a problem for you to worry about. That's a problem for the Spark developers to worry about. Which, unfortunately, is sometimes a problem for me to worry about, but you know, not too often.
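The structured streaming point, that the engine worries about combining an aggregation across microbatches so the user doesn't have to, can be sketched with a toy running aggregate. This is not Spark's implementation, just an illustration of the contract, with invented names:

```python
class RunningAverage:
    """Toy aggregation state folded across microbatches, in the spirit
    of structured streaming: the user asks for 'average', the engine
    handles merging partial results from each batch."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        # Fold one microbatch of values into the running state.
        self.count += len(batch)
        self.total += sum(batch)

    @property
    def value(self):
        # The average over everything seen so far, across all batches.
        return self.total / self.count if self.count else None

agg = RunningAverage()
for batch in [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]:  # three microbatches
    agg.update(batch)
# agg.value is now the average over all six values, 3.5
```

The design point is that only a small, mergeable state (count and total) crosses batch boundaries, never the raw records themselves.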
Boo helps out whenever it gets too stressful. >> First of all, a lot to learn. But there's been some great research done in places like Cornell and Penn and others about how the open source community collaborates and works together. And I'm wondering, is the open source community that's building things like Spark, especially in a domain like Big Data, where the use cases themselves are so complex and so important... Are we starting to take some of the knowledge that the contributors are developing on how to collaborate and how to work together, and is that starting to find its way into the tools so that the whole thing starts to collaborate better? >> Yeah, I think, actually, if you look at Spark, you can see that there's a lot of sort of tools that are being built on top of Spark, which are also being built in similar models. I mean, the Apache Software Foundation is a really good tool for managing projects of a certain scale. You can see a lot of Spark-related projects that have also decided that becoming part of the Apache Foundation is a good way to manage their governance and collaborate with different people. But then there's people that look at Spark and go, like, wow, there's a lot of overhead here. I don't think I'm going to have 500 people working on this project. I'm going to go and model my project after something a bit simpler, right? And I think that both of those are really valid ways of building open source tools on Spark. But it's really interesting seeing there's a Spark components page, essentially, a Spark packages list, for the community to publish the work that they're doing on top of Spark. And it's really interesting to see all of the collaborations that are happening there. Especially even between vendors sometimes. You'll see people make tools which help everyone's data access go faster. And it's open source, so you'll see it start to get contributed into other people's data access layers as well.
>> So, the pedagogy of how the open source community works is starting to find its way into the tools, so people who aren't in the community but are focused on the outcomes are now able to not only gain experience about how the big data works, but also how people working on complex outcomes need to work. >> I think that's definitely happening. And you can see that a lot with, like, the collaboration layers that different people are building on top of Spark, like the different notebook solutions, are all very focused on enabling collaboration, right? Because if you're an analyst and you're writing some Python code on your local machine, you're not going to, like, probably set up a GitHub repo to share that with everyone, right? But if you have a notebook and you can just send the link to your friends and be like, hey, what's up, can you take a look at this? You can share your results more easily and you can also work together a lot more, more collaboratively. And then so Databricks is doing some great things. IBM as well. I'm sure there's other companies building great notebook solutions who I'm forgetting. But the notebooks, I think, are really empowering people to collaborate in ways that we haven't traditionally seen in the big data space before. >> So, collaboration, to stay on that theme. So, we had eight data scientists on a panel the other night and just talking about, collaboration came up, and the question is specifically from an application developer standpoint. As data becomes, you know, the new development kit, how much of a data scientist do you have to become, or are you becoming, as a developer? >> Right, so, my role is very different, right? Because I focus just on tools, mostly. So, my data science is mostly to make sure that what I'm doing is actually useful to other people. Because a lot of the people that consume my stuff are data scientists. So, for me, personally, like, the answer is not a whole lot.
But for a lot of my friends that are working in more traditional sort of data engineering roles, where they're empowering specific use cases, they find themselves either working really closely with data scientists, often to be like, okay, what are your requirements? What data do I need to be able to get to you so you can do your job? And, you know, sometimes if they find themselves blocking on the data scientists, they're like, how hard could it be? And it turns out, you know, statistics is actually pretty complicated. But sometimes, you know, they go ahead and pick up some of the tools on their own. And we get to see really cool things with really, really ugly graphs. 'Cause they do not know how to use graphing libraries. But, you know, it's really exciting. >> Machine learning is another big theme at this conference. Maybe you could share with us your perspectives on ML and what's happening there. >> So, I really think machine learning is very powerful. And I think machine learning in Spark is also super powerful. And the traditional thing is, like, you down-sample your data. And you train a bunch of your models. And then, eventually, you're like, okay, I think this is the model that I want to build for real. And then you go and you get your engineer to help you train it on your giant data set. But Spark and the notebooks that are built on top of it actually mean that it's entirely reasonable for data scientists to take the tools which are traditionally used by the data engineering roles, and just start directly applying them during their exploration phase. And so we're seeing a lot of really more interesting models come to life, right? Because if you're always working with down-sampled data, it's okay, right? Like, you can do reasonable exploration on down-sampled data. But you can find some really cool sort of features, once you're working with your full data set, that you wouldn't normally find, right?
'Cause you're just not going to have that show up in your down-sampled data. And I think also streaming machine learning is a really interesting thing, right? Because we see there's a lot of IoT devices and stuff like that. And, like, the traditional machine learning thing is, I'm going to build a model and then I'm going to deploy it. And then, like, a week later, I'll maybe consider building a new model. And then I'll deploy it. And so very much it looks like the old software release processes, as opposed to the more agile software release processes. And I think that streaming machine learning can look a lot more like, sort of, the agile software development processes, where it's like, cool, I've got a bunch of labeled data from our contractors. I'm going to integrate that right away. And if I don't see any regression on my cross-validation set, we're just going to go ahead and deploy that today. And I think it's really exciting. I'm obviously a little biased, because some of my work right now is on enabling machine learning with structured streaming in Spark. So, I obviously think my work is useful. Otherwise I would be doing something else. But it's entirely possible. You know, everyone will be like, Holden, your work is terrible. But I hope not. I hope people find it useful. >> Talking about sampling. At our first Hadoop World in 2010, Abhi Mehta, who stopped by again today, of course, made the statement then: sampling's dead. It's dead. Is sampling dead? >> Sampling didn't quite die. I think we're getting really close to killing sampling. Sampling will only be dead once all of the data scientists in the organization have access to the same tools that the data engineers have been using, right? 'Cause otherwise you'll still be sampling. You'll still be implicitly doing your model selection on down-sampled data. And we'll still probably always find an excuse to sample data, because I'm lazy and sometimes I just want to develop on my laptop.
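The retrain-and-deploy loop Holden describes, integrate new labeled data right away and ship unless the cross-validation set regresses, can be sketched as a simple gate. The function name, scores, and tolerance below are illustrative assumptions, not from any real pipeline:

```python
def maybe_deploy(new_score, current_score, tolerance=0.01):
    """Gate a streaming-ML style deploy: ship the retrained model only
    if its cross-validation score hasn't dropped more than `tolerance`
    below the currently deployed model's score.

    All names and the threshold are invented for illustration.
    """
    return new_score >= current_score - tolerance

# Retrained model scores about the same as production: deploy today.
assert maybe_deploy(new_score=0.91, current_score=0.915)
# Clear regression on the cross-validation set: hold the release back.
assert not maybe_deploy(new_score=0.80, current_score=0.90)
```

This is the sense in which streaming ML resembles an agile release process: the check runs on every retrain, and deployment is the default outcome rather than a scheduled event.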
But, you know, I think we're getting close to killing a lot more of sampling. >> Do you see an opportunity to start utilizing many of these tools to actually improve the process of building models, finding data sources, identifying individuals that need access to the data? Are we going to start turning big data on the problem of big data? >> Oh, that's really exciting. And so, okay, this is something that I find really enjoyable. So, traditionally, when everyone's doing their development on their laptop, you don't get to collect a lot of metrics about what they're doing, right? But once you start moving everyone into a sort of more integrated notebook environment, you can be like, okay, these are the data sets that these different people are accessing. These are the things that I know about them. And you can actually train a recommendation algorithm on the data sets to recommend other data sets to people. And there are people that are starting to do this. And I think it's really powerful, right? Because in small companies it's maybe not super important, right? Because I'll just go and ask my coworker, like, hey, what data sets do I want to use? But if you're at a company at Google or IBM scale, or even, like, a 500 person company, you're not going to know all of the data sets that are available for you to work with. And the machine will actually be able to make some really interesting recommendations there. >> All right, we have to leave it there. We're out of time. Holden, thanks very much. >> Thank you so much for having me and having Boo. >> Pleasure. All right, any time. Keep right there everybody. We'll be back with our next guest. This is the CUBE. We're live from New York City. We'll be right back.
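As a closing sketch, the dataset-recommendation idea from the interview, training a recommender on which data sets people access, can be illustrated with a tiny co-occurrence counter. Every name and the access log below are invented; a real system would use collaborative filtering at scale rather than this toy:

```python
from collections import Counter

def recommend_datasets(access_log, user, top_n=2):
    """Recommend data sets by co-occurrence: data sets opened by users
    who share at least one data set with `user`, ranked by how many
    such users opened them. A deliberately tiny stand-in for the
    recommendation idea described in the interview."""
    mine = access_log.get(user, set())
    scores = Counter()
    for other, theirs in access_log.items():
        if other == user or not (mine & theirs):
            continue  # skip the user themselves and unrelated users
        for dataset in theirs - mine:
            scores[dataset] += 1
    return [dataset for dataset, _ in scores.most_common(top_n)]

log = {
    "ana": {"clicks", "sales"},
    "bo": {"clicks", "sales", "weather"},
    "cy": {"sales", "fraud"},
}
# ana overlaps with both bo and cy, so their other data sets
# ("weather" and "fraud") come back as recommendations.
```

At Google or IBM scale, where no one knows every available data set, even this crude signal surfaces tables a person would never think to ask a coworker about.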