Wikibon Big Data Market Update Pt. 1 - Spark Summit East 2017 - #sparksummit - #theCUBE

>> [Announcer] Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

>> [Dave] We're back, welcome to Boston, everybody. This is a special presentation that George Gilbert and I are going to provide to you now. SiliconANGLE Media is the umbrella brand of our company, and we've got three sub-brands. One of them is Wikibon, the research organization that George works in; then of course we have theCUBE; and then SiliconANGLE, which is the tech publication. And then we extensively, as you may know, use CrowdChat and other social data, but we want to drill down now on the Wikibon research side of things. Wikibon was the first research company ever to do a big data forecast. Many, many years ago, our friend Jeff Kelly produced that for several years, we opensourced it, and I think it really helped the industry a lot, sort of framing the big data opportunity. Then George last year did the first Spark forecast, really Spark adoption. So what we want to do now is talk about some of the trends in the marketplace. This is going to be done in two parts: today is part one, and we're really going to talk about the overall market trends and the market conditions, and then we're going to go to part two tomorrow, where you're going to release some of the numbers, right? And we'll share some of the numbers today. So, we're going to start on the first slide here; we're going to share with you some slides. The Wikibon forecast review, and George, I'm going to ask you to talk about where we are at with big data apps. Everybody's saying it's peaked, big data's now going mainstream, so where are we at with big data apps?

>> [George] Okay, so, just to provide context, I want to quote the former CTO of VMware, Steve Herrod. He said, "In the end, it wasn't big data, it was big analytics." And what's interesting is that when we start thinking about it, there have traditionally been two classes of workloads: batch, which in the context of analytics means running reports in the background, doing offline business intelligence, and then the interactive-type work. What's emerging is a third class, something that's continuously happening. It doesn't mean that all apps are going to be always on; it just means that all apps will have a batch component, an interactive component, like with the user, and then a streaming, or continuous, component.

>> [Dave] So it's a new type of workload.

>> [George] Yes.

>> [Dave] Okay. Anything else you want to point out here?

>> [George] Yeah, it's worth mentioning that this isn't going to burst fully-formed out of the clouds and become sort of a new standard. There are two things that have to happen: the technology has to mature, so right now you have some pretty tough trade-offs between integration, which provides simplicity, and choice and optimization, which gives you fragmentation; and then the skillset. Both of those need to develop.

>> [Dave] Alright, we're going to talk about both of those a little bit later in this segment. Let's go to the next slide, which really talks to some of the high-level forecast that we released last year, so these are last year's numbers, correct?

>> [George] Yes, yes.

>> [Dave] Okay, so, what's changed?
You've got the ogive curve, which is sort of the streaming penetration, Spark/streaming; that was last year, and this is now reflective of continuous. You'll be updating that, so how is this changing? What do you want us to know here?

>> [George] Okay, so the key takeaways here are, first, we took three application patterns. The first is the data lake, which is sort of the original canonical repository of all your data. That never goes away, but on top of it you layer what we were calling last year systems of engagement, which is where you've got the interactive machine learning component helping to anticipate and influence a user's decision. And then on top of that, which was the aqua color, was the self-tuning systems, which is probably more IIoT stuff, where you've got a whole ecosystem of devices and intelligence in the cloud and at the edge, and you don't necessarily need a human in the loop. But when you look at these now, you can break them down as having three types of workloads: the batch, the interactive, and the continuous.

>> [Dave] Okay, and that is sort of a new workload here, and this is a real big theme of your research now. We all remember, no, we don't all remember, I remember punch cards; that's the ultimate batch. Then of course the terminals were interactive, and you think of that as closer to real time, but now there's this notion of continuous. If you go to the next slide, Patrick, we can take a look at how workloads are changing, so George, take us through that dynamic.

>> [George] Okay, so, to understand where we're going, sometimes it helps to look at where we've come from. The traditional workloads, if we talk about applications, were divided into batch versus interactive, but they were also divided into online transaction processing, operational applications, systems of record, and then the analytic side, which was reporting on it, but that was sort of backward-looking reporting. We began to see some convergence between the two with web and mobile apps, where a user was interacting with the analytics that informed an interaction they might have. That's looking backwards, and we're going to take a quick look at some of the new technologies that augmented those older application patterns. Then we're going to go look at the emergent workloads and what they look like.

>> [Dave] Okay, so let's have a quick conversation about this before we go on to the next segment. Hadoop obviously was batch. It really was a way, as we've talked about today and on many other days on theCUBE, to reduce the expense of doing data warehousing and business intelligence. I remember we were interviewing Jeff Hammerbacher, and he said, "When I was at Facebook, my mission was to break the dependency and the container, the storage container." So he really needed to reduce costs; he saw that infrastructure needed to change. So if you look at the next slide, which is really talking to Hadoop doing batch and traditional BI, take us through that, and then we'll sort of evolve to the future.

>> [George] Okay, so this is an example of traditional workloads, batch business intelligence, because Hadoop has not really gotten to the maturity point where you can do interactive business intelligence. It's going to take a little more work.
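[Editor's note: as a concrete illustration of the batch BI pattern being described, here is a minimal, hypothetical PySpark sketch of a backward-looking report over data-lake files. The paths and column names are illustrative assumptions, not details from the discussion. George continues below.]

```python
# A minimal sketch of batch business intelligence on a data lake.
# All paths and column names here are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-bi-report").getOrCreate()

# The data lake holds far more raw data than a warehouse could affordably keep.
orders = spark.read.parquet("hdfs:///datalake/raw/orders/")

# A classic backward-looking report: monthly revenue by region.
report = (orders
          .withColumn("month", F.date_trunc("month", F.col("order_ts")))
          .groupBy("region", "month")
          .agg(F.sum("amount").alias("revenue"))
          .orderBy("region", "month"))

# Batch output: written on a schedule, read offline by BI tools.
report.write.mode("overwrite").parquet("hdfs:///datalake/reports/revenue_by_region/")
```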
But here, you've basically put into a repository more data than you could possibly ever fit in a data warehouse, and the key is, this environment was very fragmented. There were many different engines involved, and so there was high developer complexity and high operational complexity. We're getting to the point where we can do somewhat better on the integration, and we're getting to the point where we might be able to do interactive business intelligence and start doing a little bit of advanced analytics, like machine learning.

>> [Dave] Okay. Let's talk a little bit about why we're here. We're here 'cause it's Spark Summit, and Spark was designed to simplify big data, simplify a lot of the complexity in Hadoop. So on the next slide, you've got this red line of Spark. What is Spark's role, what does that red line represent?

>> [George] Okay, so the key takeaway from this slide is a couple things. One, it's interesting, but when you listen to Matei Zaharia, the creator of Spark, he said, "I built this to be a better MapReduce than MapReduce," which was the old crufty heart of Hadoop. And of course, they've stretched it far beyond the original intentions, but it's not the panacea yet. If you put it in the context of a data lake, it can help you with what a data engineer does with exploring and munging the data, and what a data scientist might do in terms of processing the data and getting it ready for more advanced analytics, but it doesn't give you an end-to-end solution, not even within the data lake. The point of explaining this is important, because we want to explain how, even in the newer workloads, Spark isn't yet mature enough to handle the end-to-end integration, and by making that point, we'll show where it still needs more work, and where you have to substitute other products.

>> [Dave] Okay, so let's have a quick discussion about those workloads. Workloads really drive everything, a lot of decisions for organizations: where to put things, how to protect data, where the value is. So in this next slide, you're juxtaposing traditional workloads with emerging workloads. Let's talk about these new continuous apps.

>> [George] Okay, so, this tees it up well, 'cause we focused on the traditional workloads. The emerging ones are where data is always coming in. You could take a big flow of data and sort of end it and bucket it, and turn it into a batch process, but now we have the capability to keep processing it, and if you want answers from it in very near real time, you don't want to stop it from flowing. The first workload that took off like this was collecting telemetry about the operation and performance of your apps and your infrastructure, and Splunk sort of conquered that workload first. The second one, the one that everyone's talking about now, is sort of Internet of Things, but more accurately the Industrial Internet of Things, and that stream of data is, again, something you'll want to analyze and act on with as little delay as possible. The third one is interesting: asynchronous microservices. This is difficult, because it doesn't necessarily require a lot of new technology so much as a new skillset for developers, and that's going to mean it takes off fairly slowly. Maybe new developers coming out of school will adopt it whole cloth, but this is where you don't rely on a big central database; you break things into little pieces, and each piece manages itself.
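[Editor's note: the following is a minimal sketch of what such a continuous workload can look like in Spark Structured Streaming, ingesting an unbounded stream of device telemetry and continuously emitting rolling aggregates. The Kafka topic, broker address, message schema, and console sink are illustrative assumptions.]

```python
# A minimal sketch of a "continuous" workload with Spark Structured Streaming.
# Assumes the spark-sql-kafka connector is on the classpath; the topic name,
# broker address, and schema are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("continuous-telemetry").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Ingest: an unbounded stream of device telemetry from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "device-telemetry")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Process: rolling per-device averages over one-minute windows, emitted
# continuously as data flows rather than at the end of a batch run.
rolling = (events
           .withWatermark("event_ts", "2 minutes")
           .groupBy("device_id", F.window("event_ts", "1 minute"))
           .agg(F.avg("temperature").alias("avg_temp")))

# Serve: just the console here; a real pipeline would write to a fast store.
query = rolling.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```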
>> [Dave] So you say the components of these arrows that you're showing, ingest, explore, process, serve, these are all sort of discrete elements of the data flow that you have to then integrate as a customer?

>> [George] Yes. Frankly, these are all steps that could be an end-to-end integrated process, but it's not yet mature enough to really do it end-to-end. For example, we don't even have a data store that can go all the way from ingest to serve, and by ingest, I mean taking the millions, potentially millions or more, events per second coming in from your Internet of Things devices. The explore step would be in that same data store, letting you visualize what's there; process is doing the analysis; and serve then is, from that same data store, letting your industrial devices or your business intelligence workloads get real-time updates. For this to work as one whole, we need a data store, for example, that can go from end to end, in addition to compute and analytic capabilities that go end to end. The point of this is, for continuous workloads, we do want to get to this integrated point somehow, sometime, but we're not there yet.

>> [Dave] Okay, let's go deeper and take a look at the next slide. You've got this data feedback loop, and you've got this prediction on top of it. What does all that mean? Let's double-click on that.

>> [George] Okay, so now we're unpacking the slide we just looked at into two different elements: one is what you're doing when you're running the system, and the next one will be what you're doing when you're designing it. For this one, what you're doing when you're running the system, I've grayed out where the data is coming from and where it's going to, just to focus on how we're operating on the data. And again, to repeat the green part, which is storage, we don't have an end-to-end integrated store that could cost-effectively, scalably handle this whole chain of steps. But what we do have is that in the runtime, you're going to ingest the data, you're going to process it and make it ready for prediction, and then there's a step called devops for data science. We know devops for developers, but devops for data science, as we're going to see, actually unpacks a whole 'nother level of complexity. This is where you get the prediction: okay, if this turbine is vibrating and has a heat spike, shut it down, because something's going to fail. That's the prediction component, and the serve part then takes that prediction and makes sure that that device gets it fast.

>> [Dave] So you're putting that capability in the hands of the data science component so they can effect that outcome virtually instantaneously?

>> [George] Yes, but in this case, the data scientist will have done that at design time. We're still at run time; once the data scientist has built that model, it's the engineer who's keeping it running.

>> [Dave] Yeah, but it's designed into the process, that's the devops analogy. Okay great, well let's go to that next piece, which is design. How does this all affect design, what are the implications there?

>> [George] So now, before, we had ingest, process, then prediction with devops for data science, and then serving. Now, when you're at design time, you ingest the data, and there's a whole unpacking of steps, which requires a handful, or two fistfuls, of tools right now to make operate.
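[Editor's note: a minimal, hypothetical sketch of the run-time loop just described: a model built at design time is loaded, applied to fresh sensor readings, and the resulting prediction is pushed to a fast serving layer. The model path, input data, and use of Redis are assumptions for illustration. George continues below with the design-time steps.]

```python
# A minimal sketch of run-time predict-and-serve for the turbine example.
# The model path, input data, and Redis serving layer are hypothetical.
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("predict-and-serve").getOrCreate()

# The data scientist built and saved this pipeline at design time;
# at run time the engineer just keeps it fed and running.
model = PipelineModel.load("hdfs:///models/turbine_failure_v1")

# Ingest: the latest sensor readings (vibration, temperature) per turbine.
readings = spark.read.parquet("hdfs:///datalake/hot/turbine_readings/")

# Predict: e.g., vibration plus a heat spike implies imminent failure.
scored = model.transform(readings).select("turbine_id", "prediction")

# Serve: push each decision to a low-latency store so that devices and
# dashboards can pick it up with as little delay as possible.
def push_to_cache(rows):
    import redis  # imported inside the function so it runs on executors
    r = redis.Redis(host="cache", port=6379)
    for row in rows:
        r.set("turbine:%s:action" % row.turbine_id,
              "shutdown" if row.prediction == 1.0 else "ok")

scored.foreachPartition(push_to_cache)
```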
Those design-time steps are to acquire the data, explore it, prepare it, model it, assess it, and distribute it. All of those are today handled by a collection of tools that you have to stitch together, and then you have process, which could typically be done in Spark, where you do the analysis, and then serving it. Spark isn't ready to serve; that's typically a high-speed database, one that either has tons of data for history, or gets very, very fast updates, like a Redis that's almost like a cache. So the point of this is, we can't yet take Spark as gospel from end to end.

>> [Dave] Okay, so there's a lot of complexity here.

>> [George] Right, that's the trade-off.

>> [Dave] So let's take a look at the next slide, which talks to where that complexity comes from. Let's look at it first from the developer side, and then we'll look at the admin. So on the next slide, we're looking at the complexity from the dev perspective. Explain the axes here.

>> [George] Okay. So, there are two axes. If you look at the x-axis at the bottom, there's ingest, explore, process, serve. Those were the steps, at a high level, that we said a developer has to master, and it's going to be in separate products, because we don't have the maturity today. Then on the y-axis we have some, but not all, of the different things a developer has to deal with for each product; it's not an exhaustive list. The complexity is multiplying all the steps on the y-axis, data model, addressing, programming model, persistence, all the stuff on the y-axis, by all the products a developer needs on the x-axis. It's a mess, which is why it's very, very hard to build these types of systems today.

>> [Dave] Well, and why everybody's pushing on this whole unified integration, that was a major thing that we heard throughout the day today. What about from the admin's side? Let's take a look at the next slide, which is our last slide, in terms of the operational complexity. Take us through that.

>> [George] Okay, so, the admin is when the system's running, and reading out the complexity, or inferring the complexity, follows the same process. On the y-axis, there's a separate set of tasks. These are admin-related: governance, scheduling and orchestration, high availability, all the different types of security, resource isolation. Each of these is done differently for each product, and the products are on the x-axis, ingest, explore, process, serve, so when you multiply those out, and again, this isn't exhaustive, you get, again, essentially a mess of complexity.

>> [Dave] Okay, so we got the message: if you're a practitioner of these so-called big data technologies, you're going to be dealing with more complexity. Despite the industry's pace of trying to address that, and the new projects popping up, it nonetheless feels like the complexity curve is growing faster than customers' ability to absorb that complexity. Okay, well, is there hope?

>> [George] Yes. But here's where we've had this conundrum. The Apache opensource community has been the most amazing source of innovation I think we've ever seen in the industry, but the problem is, going back to the amazing book, The Cathedral and the Bazaar, about opensource innovation versus top-down, the cathedral has this central architecture that makes everything fit together harmoniously and beautifully, with simplicity. But the bazaar is so much faster, 'cause it's sort of this free market of innovation.
The Apache ecosystem is the bazaar, and the burden is on the developer and the administrator to make it work together, and it was most appropriate for the big internet companies that had the skills to do that. Now, the companies that are distributing these Apache opensource components are doing a Herculean job of putting them together, but they weren't designed to fit together. On the other hand, you've got the cloud service providers, who are building, to some extent, services that have standard APIs that might've been supported by some of the Apache products, but they have proprietary implementations, so you have lock-in, but they have more of the cathedral-type architecture that--

>> [Dave] And they're delivering them as services, even though actually many of those data services, the discrete APIs, as you point out, are proprietary. Okay, so, very useful, George, thank you. If you have questions on this presentation, you can hit Wikibon.com and fire off a question to us; we'll make sure it gets to George and gets answered. This is part one; in part two tomorrow, we're going to dig into some of the numbers, right? So if you care about where the trends are, what the numbers look like, what the market size looks like, we'll be sharing that with you tomorrow. All this stuff, of course, will be available on demand, and we'll be doing CrowdChats on this. George, excellent job, thank you very much for taking us through this. Thanks for watching today, it is a wrap of day one, Spark Summit East. We'll be back live tomorrow from Boston. This is theCUBE, so check out siliconangle.com for a review of all the action today, all the news; check out Wikibon.com for all the research; and siliconangle.tv is where we house all these videos, check that out. We start again tomorrow at 11 o'clock East Coast time, right after the keynotes. This is theCUBE, we're at Spark Summit, #SparkSummit, we're out, see you tomorrow. (electronic music jingle)

Published Date: Feb 8, 2017
