Kickoff - #SparkSummit - #theCUBE
>> Announcer: Live, from San Francisco, it's theCUBE! Covering Spark Summit 2017. Brought to you by Databricks. (energetic techno music)

>> David: Welcome to theCUBE! I'm your host, David Goad, and we're here at Spark Summit 2017 in San Francisco, where it's all about data science and engineering at scale. Now, I know there have been a lot of great technology shows here at Moscone Center, but this is going to be one of the greatest, I think. We are joined here by George Gilbert, who is the lead analyst for big data and analytics at Wikibon. George, welcome to theCUBE.

>> George: Good to be here, David.

>> David: All right, so, I know this is kind of like reporting in real time, 'cause you're listening to the keynote right now, right?

>> George: Yeah.

>> David: Well, I wanted to get us started with some of the key themes that you've heard. You've done a lot of work recently on how applications are changing with machine learning, as well as the new distributed computing. So, as you listen to what Matei is talking about, and some of the other keynote presenters, what are some of the key themes you're hearing so far?

>> George: There are two big things they're emphasizing at this Spark Summit. One is structured streaming, which they've been talking about more and more over the last 18 months, but it officially goes production-ready in the 2.2 release of Spark, which is imminent. They also showed something really interesting with structured streaming. There have always been other streaming products, and the relevance of streaming is that we're more and more building applications that process data continuously, not in big batches or just request-response with a user interface. Your streaming capabilities dictate the class of apps you're appropriate for. Spark structured streaming had a lot of overhead in it, because it had to manage a cluster and it was working with a query optimizer, so it would basically batch up events into groups that went through once every 200 milliseconds to a full second. That's near real-time, but not considered real-time. And I know I'm driving into the details a bit, but it's very significant. They demoed on stage today--

>> David: I saw the demo.

>> George: They showed structured streams running at one millisecond latency. That's a big breakthrough, because it means, essentially, you can do per-event processing, which is true streaming.
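For readers who want to see what the per-event processing George describes looks like in code, here is a minimal PySpark sketch. Two caveats: the continuous trigger shown below shipped after this interview, as an experimental feature in Spark 2.3, and the Kafka broker and topic names are made up for illustration.

```python
# A minimal sketch of Spark structured streaming with a continuous trigger,
# assuming Spark 2.3+ and a hypothetical Kafka source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-sketch").getOrCreate()

# Read a stream of events from Kafka (broker and topic are illustrative).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# A simple per-record projection; continuous mode supports map-like
# operations such as select/filter, not aggregations.
query = (events.selectExpr("CAST(value AS STRING) AS event")
         .writeStream
         .format("console")
         .trigger(continuous="1 second")  # checkpoint interval, not a batch size
         .start())

query.awaitTermination()
```

The key contrast with the older micro-batch path is the trigger: instead of grouping events into batches every few hundred milliseconds, records flow through individually, which is what gets latency down toward the one-millisecond figure from the demo.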
>> David: And so this contributes to deep learning, right? Low-latency streaming.

>> George: Well, it can complement it, because when you do machine learning, or deep learning, you basically have a model and you want to predict something. The stream is flowing along, and for every data element in the stream you might want a prediction, or a classification, or something like that. Spark had okay support for deep learning before, but that's the other big emphasis now. Before, they could serve models in production, but training models was more difficult for deep learning; that took parallelization they didn't have.

>> David: I noticed there were three demos that kind of tied together in a bit of a James Bond story. Maybe start with the first one, about image classification and transfer learning, and tell me a little more about what you heard there. I know you need to mute your keynote. The first demo was from Tim Hunter.

>> George: The demo, with the James Bond theme, was: they're learning to label cars, and then they're showing cars that appeared in James Bond movies, and they're training the model to predict, was this car seen in a James Bond movie? They also joined it with data that showed where the car was last seen, so it's sort of like a James Bond sighting. They trained that model, and then they ran it in production, essentially, at real-time speed.

>> David: And the continuous processing demo showed how fast that could be run.

>> George: Right, right. That was a cool demo. That was a nice visual.
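The demo George is describing used transfer learning: take a network pre-trained on general images, and train only a small classifier on top of it for the narrow task. Below is a hedged sketch of that pattern using the Deep Learning Pipelines (sparkdl) library Databricks introduced around this Summit; the directory paths and labels are hypothetical, not taken from the actual demo.

```python
# A sketch of the transfer-learning pattern from the keynote demo, using
# Databricks' Deep Learning Pipelines (sparkdl). Paths/labels are invented.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import lit
from sparkdl import DeepImageFeaturizer, readImages

# Label car images as "seen in a Bond film" (1) or not (0).
bond_cars = readImages("/data/bond_cars").withColumn("label", lit(1))
other_cars = readImages("/data/other_cars").withColumn("label", lit(0))
train_df = bond_cars.union(other_cars)

# Transfer learning: a pre-trained InceptionV3 network extracts image
# features; only the logistic regression on top is actually trained.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
classifier = LogisticRegression(maxIter=20, featuresCol="features",
                                labelCol="label")
model = Pipeline(stages=[featurizer, classifier]).fit(train_df)
```

Because only the small classifier is trained, a model like this fits quickly, which is part of what makes training live on stage plausible.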
>> David: And then we had the gentleman from Stanford, Christopher Re, who came up to talk more about the applications for machine learning. Is it really going to be radically easier to use?

>> George: We didn't make it all the way through that keynote, but yes, there are things that can make machine learning easier to use. For one thing, if you take the old statistical machine learning stuff, it's still very hard to identify the features, or the variables, that you're going to use in the model. Today, features are something a data scientist identifies in collaboration with a domain expert. But just the way deep learning learns the features of a cat, like, here's the nose, here's the ears, here's the whiskers, many people expect that over the next few years deep learning will help identify the features for models. So you turn machine learning on itself, and it helps itself. That's among the things that should get easier.

>> David: We're going to get to talk to several of those keynoters a little later in the show, so we'll do a deeper dive on that. Maybe talk to us just generally about who's here at this show, and what do you think they're looking for in the Spark community?

>> George: Spark was always bottom-up, adoption-first, because it fixed some really difficult problems with the predecessor technology, MapReduce, which was the compute engine in Hadoop. That was not familiar to most programmers, whereas with Spark there's an API for machine learning, an API for batch processing, for stream processing, for graph processing, and you can use SQL over all of those, and that made it much more accessible. And now machine learning's built in, streaming's built in. MapReduce, the old version, was the equivalent of assembly language. This is a SQL-level language.

>> David: And so you were here at Spark Summit 2016, right?

>> George: Yeah.

>> David: We've seen some advances since then. Would you say they're incremental advances, or are we really making big leaps?

>> George: Well, Spark 2.0 was a big leap, and we're just approaching 2.2. I would say that getting structured streaming down to such low latency is a big, big deal, and so is adding good support for deep learning, which is now all the craze. Most people are using it for, essentially, vision, listening, speaking, and natural language processing, but it'll spread to other use cases.

>> David: Yeah, we're going to hear about some more of those use cases throughout the show. We've got customers coming in, I won't name them all right now, but they'll be rolling in. What do you want to know most from those customers?

>> George: The real thing is, Spark started out as offline analytic preparation of data that was in data lakes, and it's moving more into the mainstream of production apps. The big question is, what's the sweet spot? What types of apps work, and where are the edge conditions? That's what I think we'll be looking for.

>> David: And when Matei came out on stage, what did you hear from him? What was the first thing he was prioritizing? Feel free to check your notes that you were taking!

>> George: He did the state of the union, as he normally does. The astonishing figure is that there are, I think, 375,000 Spark Meetup members--

>> David: Wow.

>> George: Yeah. And that's grown over the last four years from basically almost zero. His focus really was on deep learning and on streaming, and those are the things we want to drill down on a little bit, in the context of: what can you build with both?

>> David: Well, we're coming up on our first break here, George. I'm really looking forward to interviewing some more of the guests today. So, thanks very much, and I invite you to stay with us here on theCUBE. We'll see you soon. (energetic techno music)