Jack Norris, MapR - Spark Summit East 2016 #SparkSummit #theCUBE
>> Announcer: From New York, extracting the signal from the noise, it's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert.

>> We're right here in Midtown at the Hilton hotel. This is Spark Summit East, and this is theCUBE. theCUBE goes out to the events; we extract the signal from the noise. Jack Norris is here. He's the CMO of MapR and a longtime CUBE alum. Jack, it's great to see you again. You've been here since the beginning of this whole big data meme, and it might've started here, I don't know.

>> I think you're right. I mean, it really did start here, I think in this building. It was our first big data show, at the original, you know, Hadoop World. And you guys, like I say, have been there from the start. You were kind of impatient early on. You said, you know, we're just going to go build solutions and ignore the noise, and you built a really nice business. You've been growing, you're growing your sales force, and things are good, and all of a sudden, boom, the Spark thing comes in. So we're seeing the evolution. I remember saying to George in the early days of Hadoop, we were geeking out, talking about all the bits and bytes, and then it turned into a business discussion. Now it's like we're back to the hardcore bits and bytes. So give us the update from MapR's point of view: where are we in the whole big data space?

>> Well, I think it has transitioned. I mean, if you look at the typical large Fortune company, or the web 2.0s, it's really, how do we best leverage our data, and how do we leverage it so that we can make decisions much faster, right? That high-frequency decision-making process. Typically that involves taking production data and analytics and joining them together, so that you're actually impacting the business as it happens, and doing that effectively requires innovations. So the exciting thing about Spark is having a distributed compute engine that's much easier to develop on and much faster.

>> I remember in the early days we'd be at these shows and the big question was, you know, can you take the humans out of the equation? It's like, no, humans are the last mile. Is that changing, or do we still need that human interaction?

>> Humans are an important part of the process, but increasingly, if you can adjust and make, you know, small algorithmic decisions, and make those decisions at that kind of moment of truth, you get big impact. I'll give you a few examples. Ad platforms: the Rubicon Project runs over a hundred billion ad auctions a day. Humans are part of that process in terms of setting it up and reviewing it, but each supply-and-demand decision there is an automated decision, and optimizing it has a huge impact on the bottom line. Fraud: taking a credit card swipe, that transaction, and deciding, is this fraudulent or not, avoiding false positives, et cetera, is a big leverage item. We're seeing things like that across manufacturing, retail, healthcare. And it isn't about asking bigger questions, or doing reports and looking back at what happened last week. It's more, how can I have an infrastructure in place that allows this organization to be agile? Because it's not the companies with the most data that are going to win.
It's the companies that are the most agile and making intelligent decisions.

>> There's so much data that humans can't ingest it any faster. I mean, we just can't keep up. So the world needs data scientists, it needs trained developers. You've got some news I want to talk about on the training side, but even there, we can only throw so many bodies at the problem. So it's really software that's going to allow us to scale, and software's hard; software takes time. We've seen a lot of the spend in the analytics and big data world go to services, and obviously you guys and others have been working hard to shift it towards software. I want to come back to that training issue. We heard this morning that Databricks has trained 20,000 people. That's a lot, but still a long way to go. You guys are putting some investment into training. Talk about that news.

>> Yeah. Well, it starts at the underlying software. If you can do things in the platform to make it much easier, and do things that are hard to surround with services, like data protection, right? If you've lost data, it doesn't matter how many people you throw at it; you can't recover it. So that's kind of the starting point.

>> You're going to get fired.

>> The approach we've taken is a software-product approach to the training as well. So we rolled out on-demand training: it's free, it's on demand, you work at your own pace. It's got different modules, with some hands-on labs associated with the training, if you will. We launched that last January, so it's basically coming up on its year anniversary, and we recently celebrated having trained 50,000 people on Hadoop and big data. Today we're announcing an expansion with Spark classes. We've got a full curriculum around Spark, including a certification, so you can get Spark certification through this MapR on-demand training.

>> Gotcha. You said something really, really intriguing that I want to dive into a little bit, when we were talking about the small decisions that can be made really, really fast. A human in the loop might have to train the models, but at runtime, as you said, it's not about asking bigger questions, it's finding faster answers. What had to change in your platform, or in the underlying technology, to make that possible?

>> You know, there's a lot that goes into it. It's typically a series of functions, a kind of breadth that needs to be brought to the problem, as well as squeezing out latencies. In the traditional approach, different applications and different analytic techniques each dictate a separate silo, a separate, you know, schema of data; you've got those all around the organization, data kind of travels, and you get an answer at the end of some period of time. Instead, it's converging all of that onto a single platform and squeezing out those latencies, so that you can take an informed action at the speed of business, if you will.

>> Let's say Spark never came along. Would that be possible?

>> Yes. Yes.

>> How would you do it?

>> So if you look at the different architectures out there, there's typically deep analytics, in terms of, let's go look at the trends, you know, over the last seven years, what happened; then doing actions on a streaming set, say, for instance, with Storm; and then real-time database operations.
You could do that with HBase or MapR-DB, and put all of that together. What Spark has really done is made that whole development process much easier and much more streamlined, and that's where a lot of the excitement is.

>> So you mentioned two use cases earlier, ad tech and fraud detection, and I want to ask you about the state of those. Ad tech has obviously come a long way, but it's still got a ways to go. I mean, who's making money on ads? Obviously Google makes tons of money; everybody else is sort of chasing them. Facebook's making money, probably because they didn't let Google in. Okay. So how will Spark affect that business, and what's MapR's role in evolving it to the next level?

>> So there are different kinds of compute and different types of things you can do on the data. I think increasingly we're seeing streaming analytics, making those decisions as the data arrives, right? And then there's the whole ecosystem question of how you coordinate those flows of data. It's not just a simple "here's the origin, here's the destination"; there's typically a complex data flow. That's where we've focused with MapR Streams, this huge publish-and-subscribe infrastructure, so that you can get real-time data to the appropriate location and then do the right operations. A lot of that involves Spark, but not exclusively.

>> Okay. And then on fraud detection, we've obviously come a long way. Sampling has kind of died, yes, but now we're getting too many false positives. You get the call, and, you know, I get a lot of calls because we buy so much equipment. So what about the next level? What are you guys doing to take fraud detection to the next level, so that when I get on a plane in Boston and land in London, it knows? Is that a database problem? Is it an integration problem, a systems problem? And what role do you guys play in solving it?

>> Well, there are a lot of details and techniques that probably go beyond what we'll share publicly, or what our customers talk about publicly. In general, the more data you can apply to a problem, the more context, the better off you are; that's how I'd summarize it. So instead of sampling, or instead of "boy, that's a strange purchase over there," it's understanding: this is Dave Vellante, this is the full body of expenditures he's made, the types of things, and here's who he frequently purchases from. And here's a transaction trend that started in San Francisco, went to New York, et cetera. In context, it makes more sense.

>> Part of that is more data, and the other part is just better algorithms, better learnings, applied on a continuous basis. How are your customers dealing with that constraint? I mean, if they've got a hundred dollars to spend, they can only spend so much on each piece: gathering more data, cleaning the data (they spend so much time getting it ready) versus improving their machine learning algorithms or other techniques. What are you seeing there as best practice? It probably varies, I'm sure, but give us some color on it.
>> I'll actually go back to Google; you know, excellent insights have come from Google. They wrote a paper called "The Unreasonable Effectiveness of Data," and in it they squarely addressed that problem: given the choice to invest in either a more complex model and algorithm or in putting more data at the problem, putting in more data had a huge impact. My simple explanation is that if you're sampling the data, you have to have a model that tries to recreate reality; if you're looking at all of the data, the anomalies can pop up and be more apparent. And the more context you can bring, the more data from other sources, the better picture you get of what's happening, and the better off you are. So that requires scale, it requires speed, and it requires different techniques that can be brought to bear: here's the database operation, here's a streaming operation, here's a deep, file-based machine learning algorithm.

>> So a lot of vendors in the big data ecosystem are coming at Spark from different angles, trying to add value to it and sort of bathe themselves in the halo. Now, you guys took some time up front to build a converged platform, so that you weren't trying to wrap your arms around 37 different projects. Can you tell us how having that converged platform, even without having anticipated Spark, allows you to add more value to it than other approaches?

>> So we simplify. If you look at the Hadoop ecosystem, it's basically separated into the components for compute and management on top of the data layer, right? The Hadoop Distributed File System. So how do you scale data? How do you protect it? That's, very simply, what's going on. Spark does a great job at that top layer, but it doesn't do anything about defining the underlying storage layer, and in the Hadoop community that underlying storage layer is a batch system. So you're trying to do, you know, micro-batch streaming operations on top of batch-oriented data. What we addressed was to take that whole data layer, make it real-time, make it random read-write, and converge enterprise storage together with Hadoop support and Spark support on a single platform. And that's basically the difference.

>> And to make it enterprise-grade. You guys were really the first to lead there; everybody else started talking about enterprise grade after you were already delivering it. So you've had a lead. Do you feel like you still have a lead, or is that the kind of thing where you hit the top of the S-curve and start innovating elsewhere?

>> NC State did a study just this past year that found only 25% of data corruption issues are identified and properly handled by the Hadoop Distributed File System, and 42% of those are silent. So there's a huge gap between quote-unquote enterprise-grade features and what we think is required.

>> Yes, silent data corruption has been a problem for decades, and you're saying it's no different in the Hadoop ecosystem, especially as mainstream businesses start to adopt this. What's happening in the Valley? In the Wall Street Journal every day you read about down rounds, flat rounds, people who can't get B rounds. You guys are funded, you're growing, you're talking about investments. What do you see? Do you feel like you're achieving escape velocity?
Maybe give us an update on the state of the business.

>> Yeah. I think the state of the business is best represented by the customers, right? And the customers vote, in terms of, you know, how well this technology is driving their business. We've got a recent study that shows the returns customers are getting. We've got 1% churn, a 99% retention rate with our customers. We've got an expansion rate that's unbelievable. We've got multi-million-dollar customers in seven of the top verticals, and nine customers at the $10 million level. So we're seeing significant investments and, more importantly, significant returns on the part of customers, where they're not just running a single application on the platform, but multiple applications.

>> Jack Norris, MapR. Always focused, and always a pleasure having you on theCUBE. Thanks very much for coming on.

>> Appreciate it.

>> Keep right there, buddy. We'll be back with our next guest. This is theCUBE, live from Spark Summit East. We're right back.
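To make the pattern Jack describes more concrete, here is a minimal sketch of per-event scoring over a publish-and-subscribe stream. MapR Streams is Kafka-API compatible, so the sketch uses the plain Kafka client API; the broker address, topic name, record fields, and the stand-in looks_fraudulent() rule are all illustrative assumptions, not MapR's actual implementation.

```python
# A minimal sketch, assuming a Kafka-compatible broker at localhost:9092.
# With MapR Streams, the topic would instead be addressed as a pathname
# (something like "/apps/events:transactions").
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a card swipe the moment it happens.
producer.send("transactions", {"card": "4242...", "amount": 1899.00, "city": "London"})
producer.flush()

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def looks_fraudulent(txn):
    # Hypothetical placeholder rule standing in for a trained model:
    # flag large purchases far from the cardholder's recent activity.
    return txn["amount"] > 1000 and txn["city"] not in ("Boston", "New York")

# Score each transaction at the "moment of truth," one event at a time.
for msg in consumer:
    txn = msg.value
    if looks_fraudulent(txn):
        print("hold transaction for review:", txn)
```

In a real deployment, the placeholder rule would be a trained model, and the consumer would typically be a Spark streaming job fanning out to databases and analytics rather than a bare loop; the point of the sketch is only the shape of the flow: publish as events occur, subscribe, and decide per event.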
Kickoff - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE! Covering Spark Summit 2017. Brought to you by Databricks. (energetic techno music)

>> Welcome to theCUBE! I'm your host, David Goad, and we're here at Spark Summit 2017 in San Francisco, where it's all about data science and engineering at scale. Now, I know there have been a lot of great technology shows here at Moscone Center, but this is going to be one of the greatest, I think. We are joined here by George Gilbert, who is the lead analyst for big data and analytics at Wikibon. George, welcome to theCUBE.

>> Good to be here, David.

>> All right, so I know this is kind of like reporting in real time, because you're listening to the keynote right now, right?

>> George: Yeah.

>> Well, I wanted to get us started with some of the key themes you've heard. You've done a lot of work recently on how applications are changing with machine learning, as well as the new distributed computing. So, as you listen to what Matei is talking about, and some of the other keynote presenters, what are some of the key themes you're hearing so far?

>> There are two big things they're emphasizing at this Spark Summit. One is structured streaming, which they've been talking about more and more over the last 18 months, but it officially goes production-ready in the 2.2 release of Spark, which is imminent. They also showed something really, really interesting with structured streaming. There have always been other streaming products, and the relevance of streaming is that we're more and more building applications that process data continuously, not in big batches or just request-response with a user interface. Your streaming capabilities dictate the class of apps you're appropriate for. Spark structured streaming had a lot of overhead in it, because it had to manage a cluster and it was working with a query optimizer, so it would basically batch up events in groups that would go through once every 200 milliseconds to a full second. That's near real-time, but not considered real-time. And I know I'm driving into the details a bit, but it's very significant. They demoed on stage today...

>> David: I saw the demo.

>> They showed structured streams running at one-millisecond latency. That's a big breakthrough, because it means, essentially, you can do per-event processing, which is true streaming.

>> And so this contributes to deep learning, right? Low-latency streaming.

>> Well, it can complement it, because when you do machine learning, or deep learning, you basically have a model and you want to predict something. The stream is flowing along, and for every data element in the stream, you might want a prediction, or a classification, or something like that. Spark had okay support for deep learning before, but that's the other big emphasis now. Before, they could serve models in production, but training deep learning models was more difficult; that took parallelization Spark didn't have.

>> I noticed there were three demos that tied together in a bit of a James Bond story. Maybe the first one was talking about image classification and transfer learning. Tell me a little bit more about what you heard there; I know you needed to mute your keynote. The first demo was from Tim Hunter.
>> The demo, with James Bond among my favorite movies, showed cars. They're learning to label cars, and then showing cars that appeared in James Bond movies, training the model to predict: was this car seen in a James Bond movie? They also joined it with data showing where each car was last seen, so it's sort of like a James Bond sighting. They trained that model and then ran it in production, essentially, at real-time speed.

>> And the continuous processing demo showed how fast that could run.

>> Right, right. That was a cool demo, a nice visual.

>> And then we had the gentleman from Stanford, Christopher Re, who came up to talk more about the applications for machine learning. Is it really going to be radically easier to use?

>> We didn't make it all the way through that keynote, but yes, there are things that can make machine learning easier to use. For one thing, with the old statistical machine learning, it's still very hard to identify the features, or variables, that you're going to use in the model; today that's something a data scientist works out in collaboration with a domain expert. Many people expect deep learning, over the next few years, to help with that. Just the way deep learning learns the features of a cat (here's the nose, here's the ears, here's the whiskers), there's the expectation that it will help identify the features for models. So you turn machine learning on itself, and it helps. That's among the things that should get easier.

>> We're going to get to talk to several of those keynoters a little later in the show, and do a deeper dive on that. Maybe talk to us generally about who's here at this show, and what do you think they're looking for in the Spark community?

>> Spark was always bottom-up, adoption-first, because it fixed some really difficult problems with the predecessor technology, MapReduce, which was the compute engine in Hadoop. That was not familiar to most programmers, whereas with Spark there's an API for machine learning, an API for batch processing, for stream processing, for graph processing, and you can use SQL over all of those, which made it much more accessible. And now machine learning's built in, streaming's built in. The old MapReduce was the equivalent of assembly language; Spark is a SQL-level language.

>> And so you were here at Spark Summit 2016, right?

>> George: Yeah.

>> We've seen some advances since then. Would you say they're incremental advances, or are we really making big leaps?

>> Well, Spark 2.0 was a big leap, and we're just approaching 2.2. I would say that getting structured streaming down to such low latency is a big, big deal, and so is adding good support for deep learning, which is now all the craze. Most people are using deep learning for, essentially, vision, listening, speaking, and natural language processing, but it'll spread to other use cases.

>> Yeah, we're going to hear about some more of those use cases throughout the show. We've got customers coming in; I won't name them all right now, but they'll be rolling in. What do you want to know most from those customers?

>> The real thing is, Spark started out as offline analytic preparation of data that was in data lakes.
And it's moving more into the mainstream of production apps. The big thing is, what's the sweet spot? What types of apps? Where are the edge conditions? That's what I think we'll be looking for.

>> And when Matei came out on stage, what did you hear from him? What was the first thing he was prioritizing? Feel free to check the notes you were taking!

>> He did the state of the union, as he normally does. The astonishing figure is that there are, I think, 375,000 Spark Meetup members...

>> David: Wow.

>> Yeah. And that's grown over the last four years, basically from almost zero. His focus really was on deep learning and on streaming, and those are the things we want to drill down on a little bit, in the context of: what can you build with both?

>> Well, we're coming up on our first break here, George. I'm really looking forward to interviewing more of the guests today. So, thanks very much, and I invite you to stay with us here on theCUBE. We'll see you soon. (energetic techno music)
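As a concrete sketch of the per-event processing George describes, here is roughly what a continuous-trigger structured streaming job looks like in PySpark. One hedge up front: the continuous trigger demoed on stage shipped after the 2.2 release discussed here (it landed in Spark 2.3), so this illustrates the direction rather than 2.2 itself, and the broker address, topic name, and checkpoint path are illustrative assumptions.

```python
# A minimal sketch of continuous (per-event) structured streaming.
# Run with the Kafka source package on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 ...
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# Read a stream of events from a Kafka topic (topic name is hypothetical).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "car-sightings")
    .load()
)

# A per-record projection; continuous mode supports map/filter-style
# operations, not aggregations.
sightings = events.selectExpr("CAST(value AS STRING) AS sighting")

# The continuous trigger processes records one at a time, which is what
# enables roughly millisecond latency.
query = (
    sightings.writeStream
    .format("console")
    .trigger(continuous="1 second")
    .option("checkpointLocation", "/tmp/continuous-demo-ckpt")
    .start()
)

query.awaitTermination()
```

Note the design point hiding in the trigger argument: the "1 second" controls how often the long-running tasks checkpoint their progress, not how events are batched, which is exactly the difference from the older micro-batch mode that grouped events every 200 milliseconds or so.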