
Holden Karau, IBM - #BigDataNYC 2016 - #theCUBE

>> Narrator: Live from New York, it's the CUBE, from Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, and Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and Peter Burris.

>> Welcome back to New York City, everybody. This is the CUBE, the worldwide leader in live tech coverage. Holden Karau is here, principal software engineer with IBM. Welcome to the CUBE.

>> Thank you for having me. It's nice to be back.

>> So, what's with Boo?

>> So, Boo is my stuffed dog that I bring--

>> You've got to hold Boo up.

>> Okay, yeah.

>> Can't see Boo.

>> So, this is Boo. Boo comes with me to all of my conferences in case I get stressed out. And she also hangs out normally on the podium while I'm giving the talk as well, just in case people get bored. You know, they can look at Boo.

>> So, Boo is not some new open source project.

>> No, no, Boo is not an open source project. But Boo is really cute. So, that counts for something.

>> All right, so, what's new in your world of Spark and machine learning?

>> So, there's a lot of really exciting things, right? Spark 2.0.0 came out, and that's really exciting because we finally got to get rid of some of the chunkier APIs. And datasets are just becoming sort of the core base of everything going forward in Spark. This is bringing the Spark SQL engine to all sorts of places, right? So, the machine learning APIs are built on top of the Dataset API now. The streaming APIs are being built on top of the Dataset APIs. And this is starting to actually make it a lot easier for people to work together, I think. And that's one of the things that I really enjoy, is when we can have people from different sorts of profiles or roles work together. And so this support of datasets being everywhere in Spark now lets people with more of, like, a SQL background still write stuff that's going to be used directly in sort of a production pipeline. And the engineers can build whatever, you know, production-ready stuff they need on top of the SQL expressions from the analysts and do some really cool stuff there.

>> So, chunky API, what does that mean to a layperson?

>> Sure, um, it means, like, for example, there's this thing in Spark where one of the things you want to do is shuffle a whole bunch of data around and then look at all of the records associated with a given key, right? But, you know, when the APIs were first made, right, they were made by university students. Very smart university students, but, you know, it started out as like a grad school project, right? And, like, um, so finally with 2.0, we were able to get rid of things like places where we used traits like iterables rather than iterators. And because of these minor little clunky things, it's like we had to keep supporting this old API, because you can't break people's code in a minor release. But when you do a big release like Spark 2.0, you can actually go, okay, you need to change your stuff now to start using Spark 2.0. But as a result of changing that in this one place, we're actually able to better support spilling to disk. And this is for people who have too much data to fit in memory even on the individual executors. So, being able to spill to disk more effectively is really important from a performance point of view. So, there's a lot of cleanup of getting rid of things which were sort of holding us back performance-wise.
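As a rough illustration of the iterable-to-iterator change described above, here is a minimal Spark 2.0 Scala sketch (the Event class and its field names are made up for illustration). The point is that flatMapGroups hands you each group as a single-pass Iterator, which is what lets Spark stream and spill oversized groups to disk instead of materializing them in memory:

```scala
import org.apache.spark.sql.SparkSession

case class Event(userId: String, bytes: Long)

object IteratorGroups {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iterator-groups").getOrCreate()
    import spark.implicits._

    val events = Seq(Event("a", 10L), Event("a", 5L), Event("b", 7L)).toDS()

    // The grouped values arrive as an Iterator[Event]: traversable only
    // once, so the engine never has to hold a whole group in memory.
    val totals = events
      .groupByKey(_.userId)
      .flatMapGroups { (user, it: Iterator[Event]) =>
        Iterator((user, it.map(_.bytes).sum))
      }

    totals.show()
    spark.stop()
  }
}
```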
>> So, the value is there. Enough value to break the--

>> Yeah, enough value to break the APIs. And 1.6 will continue to be updated for people that are not ready to migrate right today. But for the people that are looking at it, it's definitely worth it, right? You get a bunch of really cool optimizations.

>> One of the themes of this event over the last couple of years has been complexity. You guys wrote an article recently in SiliconANGLE about some of the broken promises of open source, really the root of it being complexity. So, Spark addresses that to a large degree.

>> I think so.

>> Maybe you could talk about that and explain to us sort of how, and what the impact could be for businesses.

>> So, I think Spark does a really good job of being really user-friendly, right? It has a SQL engine for people that aren't comfortable with writing, you know, Scala or Java or Python code. But then on top of that, right, there's a lot of analysts that are really familiar with Python. And Spark actually exposes Python APIs and is working on exposing R APIs. And this is making it so that if you're working on Spark, you don't have to understand the internals in a lot of depth, right? There's some other streaming systems where, to make them perform really well, you have to have a really deep mental model of what you're doing. But with Spark, it's much simpler, the APIs are cleaner, and they're exposed in the ways that people are already used to working with their data. And because it's exposed in ways that people are used to working with their data, they don't have to relearn large amounts of complexity. They just have to learn it in the few cases where they run into problems, right? Because it will work most of the time just with the sort of techniques that they're used to doing. So, I think that it's really cool. Especially structured streaming, which is new in Spark 2.0. And structured streaming makes it so that you can write sort of arbitrary SQL expressions on streaming data, which is really awesome. Like, you can do aggregations without having to sit around and think about how to effectively do an aggregation over different microbatches. That's not a problem for you to worry about. That's a problem for the Spark developers to worry about. Which, unfortunately, is sometimes a problem for me to worry about, but, you know, not too often. Boo helps out whenever it gets too stressful.
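To make the structured streaming point concrete, here is a minimal Scala sketch against the Spark 2.0 API (the socket source and port are stand-ins for a real stream). The aggregation is the same groupBy/count you would write on a static DataFrame; maintaining it across microbatches is Spark's problem, not yours:

```scala
import org.apache.spark.sql.SparkSession

object StreamingCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-counts").getOrCreate()
    import spark.implicits._

    // A stream of text lines; a local socket keeps the example self-contained.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // An ordinary-looking aggregation: Spark keeps the running counts
    // up to date across microbatches behind the scenes.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```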
>> First of all, a lot to learn. But there's been some great research done in places like Cornell and Penn and others about how the open source community collaborates and works together. And I'm wondering, is the open source community that's building things like Spark, especially in a domain like big data, where the use cases themselves are so complex and so important, are we starting to take some of the knowledge in the contributors on how to collaborate and how to work together, and starting to find that way into the tools, so that the whole thing starts to collaborate better?

>> Yeah, I think, actually, if you look at Spark, you can see that there's a lot of sort of tools that are being built on top of Spark, which are also being built in similar models. I mean, the Apache Software Foundation is a really good tool for managing projects of a certain scale. You can see a lot of Spark-related projects that have also decided that becoming part of the Apache Foundation is a good way to manage their governance and collaborate with different people. But then there's people that look at Spark and go, like, wow, there's a lot of overhead here. I don't think I'm going to have 500 people working on this project. I'm going to go and model my project after something a bit simpler, right? And I think that both of those are really valid ways of building open source tools on Spark. But it's really interesting seeing, there's a Spark components page, essentially, a Spark packages list, for the community to publish the work that they're doing on top of Spark. And it's really interesting to see all of the collaborations that are happening there. Especially even between vendors sometimes. You'll see people make tools which help everyone's data access go faster. And it's open source, so you'll see it start to get contributed into other people's data access layers as well.

>> So, the pedagogy of how the open source community works is starting to find its way into the tools, so that people who aren't in the community, but are focused on the outcomes, are now able to not only gain the experience about how the big data works, but also how people working on complex outcomes need to work.

>> I think that's definitely happening. And you can see that a lot with, like, the collaboration layers that different people are building on top of Spark, like the different notebook solutions, which are all very focused on enabling collaboration, right? Because if you're an analyst and you're writing some Python code on your local machine, you're not going to, like, probably set up a GitHub repo to share that with everyone, right? But if you have a notebook and you can just send the link to your friends and be like, hey, what's up, can you take a look at this? You can share your results more easily, and you can also work together a lot more collaboratively. And so Databricks is doing some great things. IBM as well. I'm sure there's other companies building great notebook solutions who I'm forgetting. But the notebooks, I think, are really empowering people to collaborate in ways that we haven't traditionally seen in the big data space before.

>> So, collaboration, to stay on that theme. We had eight data scientists on a panel the other night, and collaboration came up. The question is, specifically from an application developer standpoint, as data becomes, you know, the new development kit, how much of a data scientist do you have to become, or are you becoming, as a developer?

>> Right, so, my role is very different, right? Because I focus just on tools, mostly. So, my data science is mostly to make sure that what I'm doing is actually useful to other people. Because a lot of the people that consume my stuff are data scientists. So, for me, personally, like, the answer is not a whole lot. But for a lot of my friends that are working in more traditional sort of data engineering roles, where they're empowering specific use cases, they find themselves working really closely with data scientists, often to be like, okay, what are your requirements? What data do I need to be able to get to you so you can do your job? And, you know, sometimes if they find themselves blocking on the data scientists, they're like, how hard could it be? And it turns out, you know, statistics is actually pretty complicated. But sometimes, you know, they go ahead and pick up some of the tools on their own. And we get to see really cool things with really, really ugly graphs. 'Cause they do not know how to use graphing libraries. But, you know, it's really exciting.

>> Machine learning is another big theme at this conference. Maybe you could share with us your perspectives on ML and what's happening there.
>> So, I really think machine learning is very powerful. And I think machine learning in Spark is also super powerful. And the traditional thing is, you down-sample your data, and you train a bunch of your models. And then, eventually, you're like, okay, I think this is the model that I want to build for real. And then you go and you get your engineer to help you train it on your giant data set. But Spark, and the notebooks that are built on top of it, actually mean that it's entirely reasonable for data scientists to take the tools which are traditionally used by the data engineering roles and just start directly applying them during their exploration phase. And so we're seeing a lot of really interesting models come to life, right? Because if you're always working with down-sampled data, it's okay, right? Like, you can do reasonable exploration on down-sampled data. But you can find some really cool sort of features once you're working with your full data set that you wouldn't normally find, right? 'Cause you're just not going to have that show up in your down-sampled data. And I think also streaming machine learning is a really interesting thing, right? Because we see there's a lot of IoT devices and stuff like that. And, like, the traditional machine learning thing is, I'm going to build a model and then I'm going to deploy it. And then, like, a week later, I'll maybe consider building a new model, and then I'll deploy it. And so very much it looks like the old software release processes, as opposed to the more agile software release processes. And I think that streaming machine learning can look a lot more like sort of the agile software development processes, where it's like, cool, I've got a bunch of labeled data from our contractors, I'm going to integrate that right away. And if I don't see any regression on my cross-validation set, we're just going to go ahead and deploy that today. And I think it's really exciting. I'm obviously a little biased, because some of my work right now is on enabling machine learning with structured streaming in Spark. So, I obviously think my work is useful. Otherwise I would be doing something else. But it's entirely possible, you know, everyone will be like, Holden, your work is terrible. But I hope not. I hope people find it useful.
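Spark 2.0 itself didn't yet ship machine learning on structured streaming, so as a stand-in, here is a hedged batch sketch in Scala of the retrain-and-gate loop described above. The paths, the label/features schema, and the production score are all assumptions; the shape of the idea is: fold in the new labeled data, and only deploy if the cross-validation set shows no regression:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.SparkSession

object RetrainAndGate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("retrain-and-gate").getOrCreate()

    // Existing training data plus the newest batch from the labelers;
    // assumed to have "label" and "features" columns already prepared.
    val data = spark.read.parquet("/data/labeled/latest")
    val Array(train, holdout) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model = new LogisticRegression().fit(train)

    // Area under ROC on the held-out set, versus the production model.
    val evaluator = new BinaryClassificationEvaluator()
    val newScore = evaluator.evaluate(model.transform(holdout))
    val deployedScore = 0.85 // stand-in for the current model's score

    // No regression on the cross-validation set? Ship it today.
    if (newScore >= deployedScore) model.write.overwrite().save("/models/prod")

    spark.stop()
  }
}
```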
>> Talking about sampling. At our first Hadoop World in 2010, Abhi Mehta, he stopped by again today, of course, and he made the statement then: sampling's dead. It's dead. Is sampling dead?

>> Sampling didn't quite die. I think we're getting really close to killing sampling. Sampling will only be dead once all of the data scientists in the organization have access to the same tools that the data engineers have been using, right? 'Cause otherwise you'll still be sampling. You'll still be implicitly doing your model selection on down-sampled data. And we'll still probably always find an excuse to sample data, because I'm lazy and sometimes I just want to develop on my laptop. But, you know, I think we're getting close to killing a lot more of sampling.

>> Do you see an opportunity to start utilizing many of these tools to actually improve the process of building models, finding data sources, identifying individuals that need access to the data? Are we going to start turning big data on the problem of big data?

>> Oh, that's really exciting. And so, okay, this is something that I find really enjoyable. One of the things is, traditionally, when everyone's doing their development on their laptop, right, you don't get to collect a lot of metrics about what they're doing. But once you start moving everyone into a sort of more integrated notebook environment, you can be like, okay, these are the data sets that these different people are accessing, these are the things that I know about them. And you can actually train a recommendation algorithm on the data sets to recommend other data sets to people. And there are people that are starting to do this. And I think it's really powerful, right? Because in small companies it's maybe not super important, right? Because I'll just go and ask my coworker, like, hey, what data sets do I want to use? But if you're at a company at Google or IBM scale, or even, like, a 500-person company, you're not going to know all of the data sets that are available for you to work with. And the machine will actually be able to make some really interesting recommendations there.
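A minimal sketch of that dataset-recommendation idea, treating access logs as implicit feedback for spark.ml's ALS recommender (the log path and column names are assumptions, user/dataset identifiers are assumed to be integer IDs, and recommendForAllUsers arrived in later Spark releases):

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object DatasetRecommender {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-recommender").getOrCreate()

    // One row per (analyst, dataset) pair with an access count that
    // serves as an implicit "rating"; integer IDs assumed.
    val accesses = spark.read.parquet("/logs/dataset-accesses")
      .toDF("userId", "datasetId", "accessCount")

    val als = new ALS()
      .setImplicitPrefs(true) // counts, not explicit star ratings
      .setUserCol("userId")
      .setItemCol("datasetId")
      .setRatingCol("accessCount")

    val model = als.fit(accesses)

    // Suggest the top five datasets for each analyst.
    model.recommendForAllUsers(5).show(truncate = false)

    spark.stop()
  }
}
```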
>> All right, we have to leave it there. We're out of time. Holden, thanks very much.

>> Thank you so much for having me and having Boo.

>> Pleasure. All right, any time. Keep right there, everybody. We'll be back with our next guest. This is the CUBE. We're live from New York City. We'll be right back.
Published Date: Sep 30, 2016

