
Search Results for SnappyData:

Jags Ramnarayan, SnappyData - Spark Summit 2017 - #SparkSummit - #theCUBE


 

(techno music) >> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> You are watching the Spark Summit 2017 coverage by theCUBE. I'm your host David Goad, and I'm joined by George Gilbert. How you doing, George? >> Good to be here. >> And honored to introduce our next guest, the CTO from SnappyData, wow, we were lucky to get this guy. >> Thanks for having me. >> David: Jags Ramnarayan, Jags, thanks for joining us. >> Thanks, thanks for having me. >> And for people who may not be familiar, maybe tell us, what does SnappyData do? >> So SnappyData, in a nutshell, is taking Spark, which is a compute engine, and in some sense augmenting the guts of Spark so that Spark truly becomes a hybrid database. A single data store that's capable of taking Spark streams, doing transactions, providing mutable state management in Spark, but most importantly being able to turn around and run analytical queries on that state that is continuously changing. That's in a nutshell. Let me just say a few things. SnappyData itself is a startup that was spun out of Pivotal. We've been out of Pivotal for roughly about a year, so the technology itself was to a great degree incubated within Pivotal. It's a product called GemFire within VMware and Pivotal. So we took the guts of GemFire, which is an in-memory database, designed for transactional, low-latency, high-concurrency scenarios, and we are sort of fusing it, that's the key thing, fusing it into Spark, so that now Spark becomes significantly richer, not just as a compute platform, but as a store. >> Great, and we know this is not your first Spark Summit, right? How many have you been to? Lost count? >> Boy, let's see, three, four Spark Summits now; if I include this year's, four to five. >> Great, so an active part of the community. What were you expecting to learn this year, and have you been surprised by anything? >> You know, it's always wonderful to see, I mean, every time I come to Spark, it's just a new set of innovations, right? I mean, when I first came to Spark, it was a mix of, let's talk about data frames, all of these, let's optimize my queries. Today you come, I mean, there is such a wide spectrum of amazing new things that are happening. It's just mind boggling. Right from AI techniques, structured streaming, and the real-time paradigm, and sort of this confluence that Databricks brings to it. How can I create a confluence through a unified mechanism? That, I think, is where it is really brilliant. >> Okay, well let's talk about how you're innovating at SnappyData. What are some of the applications or current projects you're working on? >> So a number of things. I mean, GE is an investor in SnappyData, so we're trying to work with GE in the industrial IoT space. We're working with large health care companies, also in their IoT space. So the pattern with SnappyData is one that has a lot of high velocity streams of data emerging, where the streams could be, for instance, Kafka streams driving Spark streams, but streams could also be operational databases. Your Postgres instance and your Cassandra database instance, and they're all generating continuous changes to data that's emerging in an operational world. Can I suck that in and almost create a replica of that state that might be emerging in the source operational environment, and still allow interactive analytics at scale for a number of concurrent users on live data.
Not cube data, not pre-aggregated data, but on live data itself, right? Being able to almost give you Google-like speeds to live data. >> George, we've heard people talking about this quite a bit. >> Yeah, so Jags, as you said upfront, Spark was conceived as sort of a general purpose, I guess, analytic compute engine, and adding DBMS to it, like sort of not bolting it on, but deeply integrating it, so that the core data structures now have DBMS properties, like transactionality, that must make a huge change in the scope of applications that are applicable. Can you describe some of those for us? >> Yeah. The classic paradigm today that we find time and again is the so-called SMACK stack, right? I mean, there was the lambda stack, now there's a SMACK stack. Which is really about Spark running on Mesos, but really using Spark streaming as an ingestion capability, and there is continuous state that is emerging that I want to write into Cassandra. So what we find very quickly is that the moment the state is emerging, I want to throw in a business intelligence tool on top and immediately do live dashboarding on that state that is continuously changing and emerging. So what we find is that the first part, which is the high-speed ingest, the ability to transform these data sets, cleanse the data sets, get the cleansed data into Cassandra, works really well. What is missing is this ability to say, well, how am I going to get insight? How can I ask interesting, insightful questions, get responses immediately on that live data, right? And so the common problem there is, the moment I have Cassandra working, let's say, with Spark, every time I run an analytical query, you only have two choices. One is use the parallel connector to pull in the data sets from Cassandra, right, and now unfortunately, when you do analytics, you're working with large volumes. And every time I run even a simple query, all of a sudden I could be pulling in 10 gigabytes, 20 gigabytes of data into Spark to run the computation. Hundreds of seconds lost. Nothing like interactive; it's all about batch querying. So how can I turn around and say that if stuff changes in Cassandra, I can have an immediate real-time reflection of that mutable state in Spark, on which I can run queries rapidly. That's a very key aspect to us. >> So you were telling me earlier that you didn't see, necessarily, a need to replace entirely the Cassandra in the SMACK stack, but to complement it. >> Jags: That's right. >> Elaborate on that. >> So our focus, much like Spark, is all about in-memory state management, in-memory processing. And Cassandra, realistically, is really designed to say, how can I scale to the petabyte, right, for key value operations, semi-structured data, what have you. So we think there are a number of scenarios where you still want Cassandra to be your store, because in some sense a lot of these guys have already adopted Cassandra in a fairly big way. So you want to say, hey, leave your petabyte-level volume in there, and you can essentially work with the real-time state, which could still be many terabytes of state, essentially in main memory; that's what we're specializing in.
And we're also, I mean, I can touch on this approximate query processing technology, which is the other key part here, to say, hey, I can't really give you 1,000 cores and 1,000 machines just so that you can do your job really well. So one of the techniques we are adopting, which even the Databricks guys started with Blink, essentially, is an approximate query processing engine; we have our own approximate query processing engine, as an adjunct, essentially, to our store. What that essentially means is to say, can I take a billion records and synthesize something really, really small, using smart sampling techniques, sketching techniques, essentially statistical structures, that can be stored along with Spark, in Spark memory itself, and fuse it with the Spark Catalyst query engine. So that as you run your query, we can very smartly figure out, can I use the approximate data structures to answer the questions extremely quickly. Even when the data would be in petabyte volume, I have these data structures that are now taking maybe gigabytes of storage only. So hopefully not getting too, too technical. >> So the Spark Catalyst query optimizer, like an Oracle query optimizer, knows about the data that it's going to query, only in your case, you're taking what Catalyst knows about Spark, and extending it with what's stored in your native, also Spark-native, data structures. >> That's right, exactly. So think about it: an optimizer always takes a query plan and says, here are all the possible plans you can execute, and here is a cost estimate for these plans. We essentially inject more plans into that and hopefully, our plan is even more optimized than the plans that the Spark Catalyst engine came up with. And Spark is beautiful because the Catalyst engine is a very pluggable engine. So you can essentially augment that engine very easily. >> So you've been out in the marketplace, whether in alpha, beta, or now, production, for enough time so that the community is aware of what you've done. What are some of the areas that you're being pulled in that people didn't associate Spark with? >> So more often, we land up in situations where they're looking at SAP HANA, as an example, maybe a MemSQL, maybe just Postgres, and all of a sudden, there are these hybrid workloads, which is the Gartner term of HTAP, so there's a lot of HTAP use cases that we get pulled into. So there's no Spark, but we get pulled into it because we're just a hybrid database. That's how people look at us, essentially. >> Oh, so you pull Spark in because that's just part of your solution. >> Exactly, right. So think about it: Spark is not just data frames and rich API, but also it has a SQL interface, right. I can essentially execute SQL, select SQL. Of course we augment that SQL so that now you can do what you expect from a database, which is an insert, an update, a delete; can I create a view, can I run a transaction? So all of a sudden, it's not just a Spark API, but what we provide looks like a SQL database itself. >> Okay, interesting.
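To make the approximate query idea concrete, here is a minimal sketch using ordinary PySpark sampling. SnappyData's actual engine maintains stratified samples and statistical sketches ahead of time and fuses them into the Catalyst optimizer, which plain Spark does not do; the table, column names, and the fixed one percent fraction below are invented for illustration. Stratified sampling (sampleBy) is used rather than a uniform sample precisely so that rare groups are not lost.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aqp-sketch").getOrCreate()

# Hypothetical billion-row table of trades.
trades = spark.read.parquet("/data/trades")

# Stratified 1% sample keyed on symbol, so thinly traded symbols
# still appear instead of vanishing from a uniform sample.
fractions = {row["symbol"]: 0.01
             for row in trades.select("symbol").distinct().collect()}
sample = trades.sampleBy("symbol", fractions, seed=42)

# Approximate aggregate: scale the sampled sum back up by 1/fraction.
# Queries now scan gigabytes of sample instead of the full data set.
approx = (sample.groupBy("symbol")
                .agg((F.sum("volume") / 0.01).alias("approx_volume")))
approx.show()

The plan-injection point Jags makes is the difference between this sketch and the real thing: his engine exposes such structures to Catalyst as alternative, cheaper query plans, so the optimizer, not the user, decides when an approximate answer is good enough.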
>> So tell us, in the work with GE, they're among the first that have sort of educated the world that there's so much data coming off devices, that we have to be intelligent about what we filter and send to the cloud; we train models, potentially, up there, we run them closer to the edge, so that we get low latency analytics, but you were telling us earlier that there are alternatives, especially when you have such an intelligent database, working both at the edge and in the cloud. >> Right, so that's a great point. See, what's happening with sort of a lot of these machine learning models is that these models are learned on historical data sets. And quite often, especially if you look at predictive maintenance, those class of use cases, in industrial IoT, the patterns could evolve very rapidly, right? Maybe because of climate changes and, let's say, for a windmill farm, there are a few windmills that are breaking down so rapidly it's affecting everything else, in terms of the power generation. So being able to sort of alter the model itself, incrementally and in near real-time, is becoming more and more important. >> David: Wow. >> It's still a fairly academic research kind of area, but for instance, we are working very closely with the University of Michigan to sort of say, can we use some of these approximate techniques to incrementally also learn a model. Right, sort of incrementally augment a model, potentially at the edge, or even inside the cloud, for instance. >> David: Wow. >> So if you're doing it at the edge, would you be updating the instance of the model associated with that locale, and then would the model in the cloud be sort of like the master, and then that gets pushed down, until you have an instance and a master. >> That's right. See, most typically what will happen is you have computed a model using a lot of historical data. You have typically supervised techniques to compute a model. And you take that model and inject it potentially into the edge, so that it can execute that model, which is the easy part; everybody does that. So you continue to do that, right, because you really want the data scientists to be poring through those paradigms, looking and sort of tweaking those models. But for a certain number of models, even the models injected in the edge, can I re-tweak that model in an unsupervised way, is kind of the play we're also kind of venturing into slowly, but that's all in the future. >> But if you're doing it unsupervised, do you need metrics that sort of flag, like what is the champion challenger, and figure out-- >> I should say that, I mean, not all of these models can work in this very real-time manner. So, for instance, we've been looking at saying, can we reclassify NPC, the name place classifier, to essentially do incremental classification, or incrementally learning the model. Clustering approaches can actually be done in an unsupervised way in an incremental fashion. Things like that. There's a whole spectrum of algorithms that really need to be thought through for approximate algorithms to actually apply. So it's still active research. >> Really great discussion, guys. We've just got about a minute to go before the break, really great stuff. I don't want to interrupt you. But maybe switch real quick to business drivers. Maybe with SnappyData or with other peers you've talked to today. What business drivers do you think are going to affect the evolution of Spark the most?
>> I mean, for us, as a small company, the single biggest challenge we have, it's like what one of you guys said, analysts, it's raining databases out there. And the ability to constantly educate people on how you can essentially realize a very next-generation data pipeline, in a very simplified manner, is the challenge we are running into, right. I mean, I think the business model for us is primarily how many people are going to go and say, yes, batch-related analytics is important, but incrementally, for competitive reasons, we want to be playing that real-time analytics game a lot more than before, right? So that's going to be big for us, and hopefully we can play a big part there, along with Spark and Databricks. >> Great, well, we appreciate you coming on the show today, and sharing some of the interesting work that you're doing. George, thank you so much, and Jags, thank you so much for being on theCUBE. >> Thanks for having me on, I appreciate it. Thanks, George. >> And thank you all for tuning in. Once again, we have more to come, today and tomorrow, here at Spark Summit 2017, thanks for watching. (techno music)

Published Date : Jun 6 2017


Don DeLoach, Midwest IoT Council | PentahoWorld 2017


 

>> Announcer: Live, from Orlando, Florida, it's TheCUBE, covering PentahoWorld 2017. Brought to you by Hitachi Vantara. >> Welcome back to sunny Orlando, everybody. This is TheCUBE, the leader in live tech coverage. My name is Dave Vellante and this is PentahoWorld, #PWorld17. Don DeLoach here, he's the co-chair of the Midwest IoT Council. Thanks so much for coming on TheCUBE. >> Good to be here. >> So you've just written a new book. I've got it right here, hot off the presses, in my hands. The Future of IoT, leveraging the shift to a data-centric world. Can you see that okay? Alright, great, how's that, you got that? Well, congratulations on getting the book done. >> Thanks. >> It's like, the closest a male can come to having a baby, I guess. But, so, it's fantastic. Let's start with sort of the premise of the book. What, why'd you write it? >> Sure, I'll give you the short version, 'cause that in and of itself could go on forever. I'm a data guy by background. And for the last five or six years, I've really been passionate about IoT. And the two converged with a focus on data, but it was kind of ahead of where most people in IoT were, because they were mostly focused on sensor technology and communications, and to a limited extent, the workflow. So I kind of developed this thesis around where I thought the market was going to go. And I would have this conversation over and over and over, but it wasn't really sticking, and so I decided maybe I should write a book to talk about it, and it took me forever to write the book 'cause fundamentally I didn't know what I was doing. Fortunately, I was able to eventually bring on a couple of co-authors, and collectively we were able to get the book written, and we published it in May of this year. >> And give us the premise, how would you summarize? >> So the central thesis of the book is that the market is going to shift from a focus on IoT-enabled products, like a smart refrigerator or a low-fat fryer or a turbine in a factory or a power plant or whatever. It's going to shift from the IoT-enabled products to the IoT-enabled enterprise. If you look at the Harvard Business Review article that Jim Heppelmann and Michael Porter did in 2014, they talked about the progression from products to smart products to smart, connected products, to product systems, to systems of systems. We've largely been focused on smart, connected products, or as I would call them, IoT-enabled products. And most of the technology vendors have focused their efforts on helping the lighting vendor or the refrigerator vendor or whatever IoT-enable their product. But when that moves to mass adoption of IoT, if you're the CIO or the CEO of SeaLand or Disney or Walmart or whatever, you're not going to want to be a company that has 100,000 IoT-enabled products. You're going to want to be an IoT-enabled company. And the difference is really all around data primacy and how that data is treated. So, right now, most of the data goes from the IoT-enabled product to the product provider. And they tell you what data you can get. But if you look at the progression, it's almost mathematically impossible that that is sustainable, because organizations are going to want to take, like, let's just say we're talking about a fast food restaurant. They're going to want to take the data from the low-fat fryer and the data from the refrigerator or the shake machine or the lighting system or whatever, and they're going to want to look at it in the context of the other data.
And they're going to also want to combine it with their point-of-sale or crew scheduling, or inventory, and then if they're smart, they'll start to even pull in external data, like pedestrian traffic or street traffic or microweather or whatever, and they'll create a much richer signature. And then, it comes down to governance, where I want to create this enriched data set, and then propagate it to the right constituent in the right time in the right way. So you still give the product provider back the data that they want, and there's nothing that precludes you from doing that. And you give the low-fat fryer provider the data that they want, but you give your regional and corporate offices a different view of the same data, and you give the FDA or your supply chain partner, it's still the same atomic data, but what you're doing is you're separating the creation of the data from the consumption of the data, and that's where you gain maximum leverage, and that's really the thesis of the book. >> It's data, great summary by the way, so it's data in context, and the context of the low-fat fryer is going to be different than the workflow within that retail operation. >> Yeah, that's right, and again, this is where the product providers have initially kind of pushed back, because they feel like they have stickiness and loyalty that's bred out of that link. But, first of all, that's going to change. So if you're Walmart or a major concern and you say, "I'm going to do a lighting RFP," and there's 10 vendors that say, "Hey, we want to compete for this," and six of 'em will allow Walmart to control the data, and four say, "No, we have to control the data," their list just went to six. They're just not going to put up with that. >> Dave: Period, the end, absolutely. >> That's right. So if the product providers are smart, they're going to get ahead of this and say, "Look, I get where the market's going. We're going to need to give you control of the data, but I'm going to ask for a contract that says I'm going to get the data I'm already getting, 'cause I need to get that, and you want me to get that. But number two, I'm going to recognize that Walmart can give me my data back, but enrich it and contextualize it so I get better data back." So everybody can win, but it's all about the right architecture. >> Well, and the product guys had the Trojan horse strategy of getting in when nobody was really looking. >> Don: That's right. >> And okay, so they've got there. Do you envision, Don, a point at which the Walmarts might say, "No, that's our data and you don't get it"? >> Um, not really-- >> Or is there going to be a quid pro quo? >> And here's why. The argument that the product providers have made all along is, almost in a condescending way sometimes, although not intentionally condescending, it's been, look, we're selling you this low-fat fryer for your fast food restaurant. And you say you want the data, but you know, we have a team of people who are experts in this. Leave that to us, we'll analyze the data and we'll give you back what you need. Now, there's some truth to the fact that they should know their products better than anybody, and if I'm the fast food chain, I want them to get that data so that they can continually analyze and help me do my job better. They just don't have to get that data at my expense. There are ways to cooperatively work this, but again, it comes back to just the right architecture.
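To make that governance point concrete, here is a minimal pure-Python sketch of the fan-out Don describes (the pattern he goes on to call the "first receiver" just below): one enriched set of atomic readings, with each constituent receiving only its own view. The record fields, constituent names, and selection rules are all invented for illustration.

def enrich(reading, context):
    """Combine a raw device reading with contextual data (POS, weather, ...)."""
    return {**reading, **context}

# Each constituent sees a governed slice of the same atomic data.
VIEWS = {
    "fryer_vendor":    ("device_id", "temp", "cycles"),
    "regional_office": ("store_id", "temp", "sales"),
    "regulator":       ("store_id", "temp"),
}

def propagate(reading, context):
    enriched = enrich(reading, context)
    # Governance step: right data, right constituent, right shape.
    return {who: {k: enriched[k] for k in keys} for who, keys in VIEWS.items()}

reading = {"device_id": "fryer-7", "store_id": "s-42", "temp": 182.5, "cycles": 310}
context = {"sales": 1210, "weather": "rain"}
print(propagate(reading, context))

The product provider still gets its telemetry back, enriched, which is the contract Don outlines above, while the enterprise retains the full atomic superset.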
So what we call the first receiver is, in essence, setting up an abstraction close to the point of the ingestion of all this data, upon which it's cleansed, enriched, and then propagated again to the right constituent in the right time in the right way. And by the way, I would add, with the right security considerations, and with the right data privacy considerations, 'cause like, if you look around the market now, things like GDPR in Europe, and what we've seen in the US just in the wake of the elections and everything around how data is treated, privacy concerns are going to be huge. So if you don't know how to treat the data in the context of how it needs to be leveraged, you're going to lose that leverage of the data. >> Well, plus the widget guys are going to say, "Look, we have to do predictive maintenance on those devices, and you want us to do that." You know, they say follow the money. Let's follow the data. So, what's the data flow look like in your mind? You got these edge devices. >> Yep, physical or virtual. Doesn't have to be a physical edge. Although, in a lot of cases, there are good reasons why you'd want a physical edge, but there's nothing technologically that says you have to have a physical edge. >> Elaborate on that, would you? What do you mean by virtual? >> Sure, so let's say I have a server inside a retail outfit. And it's collecting all of my IoT data and consolidating it and persisting it into a data store and then propagating it to a variety of constituents. That would be creating the first receiver in the physical edge. There's nothing that says that that edge device can't grab that data, but then persist it in a distributed Amazon cloud instance, or a Rackspace instance or whatever. It doesn't actually need to be persisted physically on the edge, but there's no reason it can't either. >> Okay, now I understand. So the guys at Wikibon, which is a sort of sister company to TheCUBE, have envisioned this three-tiered data model where you've got the devices at the edge where real-time activity's going on, real-time analytics, and then you've got this sort of aggregation point, I guess call it a gateway, and that's, as I say, aggregation of all these edge devices. And then you've got the cloud where the heavy modeling is done. It could be your private cloud or your public cloud. So does that three-tier model make sense to you? >> Yeah, so what you're describing as the first tier is actually the sensor layer. The gateway layer that you're describing, in the book, would be characterized as the first receiver. It's basically an edge tier that is augmented to persist and enrich the data and then apply the proper governance to it. But what I would argue is, in reality, I mean, your reference architecture is spot-on, but if you actually take that one step further, it's actually an n-tier architecture. Because there's no reason why the data doesn't go from the ten franchise stores, to the regional headquarters, to the country headquarters, to the corporate headquarters, and every step along the way, including the edge, you're going to see certain types of analytics and computational work done. I'll put in a plug for my friends at Hitachi Lumada on this; you know, there's like 700 horizontal IoT platforms out there. There aren't going to be 700 winners. There's going to be probably eight to 10, and that's only because the different specific verticals will provide for more winners than it would be if it was just one, like a search engine.
But the winners are going to have to have an extensible architecture that will ultimately allow enterprises to do the very things I'm talking about doing. And so there are a number out there, but one of the things, Rob Tiffany, who's the CTO of Lumada, and his team, I think, have a really good handle on an architecture that is really plausible for accomplishing this as the market migrates into the future. >> And that architecture's got to be very flexible, not just elastic, but sometimes we use the word plastic, plasticity, being able to go in any direction. >> Well, sure, up to and including the use of digital twins and avatars and the logic that goes along with that, and the ability to spin something up and spin something down gives you that flexibility that you as an enterprise need, and the larger the enterprise, the more important that becomes. >> How much of the data, Don, at that edge do you think will be persisted, two-part question? It's not all going to be persisted, is it? Isn't that too expensive? Is it necessary to persist all of that data? >> Well, no. So this is where you'll hear the notion of data exhaust. What that really means is, let's just say I'm instrumenting every room in this hotel, and each room has six different sensors in it, and I'm taking a reading once a second. The ratio of inconsequential to consequential data is probably going to be over 99 to one. So it doesn't really make sense to persist that data, and it sure as hell doesn't make sense to take that data and push it into a cloud where I spend more to reduce the value of the payload. That's just dumb. But what will happen is, there are two things. One, I think people will see the value in locally persisting the data that has value, the consequential data, and doing that in a way that's stored at least for some period of time, so you can run the type of edge analytics that might benefit from having that persisted store. The other thing that I think will happen, and this is, I don't talk much, I talk a little bit about it in the book, but there's this whole notion where, when we get to the volumes of data that we really talk about, where IoT will go by like 2025, it's going to push the physical limitations of how we can accommodate that. So people will begin to use techniques like developing statistical metadata models that are a highly accurate metadata representation of the entirety of the data set, but probably in about one percent of the space, that's queryable and suitable for machine learning, where it's going to enable you to do what you just physically couldn't do before. So that's a little bit into the future, but there are people doing some fabulous work on that right now, and that'll creep into the overall lexicon over time. >> Is that a lightweight digital twin that gives you substantially the same insight? >> It could augment the digital twin in ways that allow you to stand up digital twins where you might not be able to before. The example that most people would know about is, like in the Apache ecosystem, there are toolsets like SnappyData that are basically doing approximation, but they're doing it via sampling. And that is a step in that direction, but what you're looking for is very high value approximation that doesn't lose the outliers. So like in IoT, one of the things you normally are looking for is, where am I going to pick up on anomalous behavior?
Well, if I'm using a sample set, and I'm only taking 15%, I by definition am going to lose a lot of that anomalous behavior. So it has to be a holistic representation of the data, but what happens is that that data is transformed into statistics that can be queryable as if it was the atomic data set, but what you're getting is a very high value approximation in a fraction of the space and time and resources. >> Okay, but that's not sampling. >> No, it's statistical metadata. My last company had developed a thing that we called approximate query, and it was based on that exact set of patents around the formation of a statistical metadata model. It just so happens it's absolutely suited for where IoT is going. It's kind of, IoT isn't really there yet. People are still trying to figure out the edge in its most basic forms, but the sheer weight of the data and the progression of the market is going to force people to be innovative in how they look at some of these things. Just like, if you look at things like privacy, right now, people think in terms of anonymization. And that's, basically, I'm going to de-link data contextually, where I'm going to effectively lose the linkages to the context in order to conform with data privacy. But there are techniques, like if you look at GDPR, there are techniques, within certain safe harbors, that allow you to pseudonymize the data, where you can actually relink it under certain conditions. And there are some smart people out there solving these problems. That's where the market's going to go; it's just going to get there over time. And what I would also add to this equation is, at the end of the day, right now, the concepts that are in the book about the first receiver and the abstraction of the creation of the data from the consumption of the data, look, it's a pretty basic thing, but it's the type of shift that is going to be required for enterprises to truly leverage the data. The things about statistical metadata and pseudonymization, pseudonymization will come before the statistical metadata. But the market forces are going to drive more and more into those areas, but you got to walk before you run. Right now, most people still have silos, which is interesting, because when you think about the whole notion of the internet of things, it infers that it's this exploitation of understanding the state of physical assets in a very broad based environment. And yet, the funny thing is, most IoT devices are silos that emulate M2M, sort of peer-to-peer networks just using the internet as a communication vehicle. But that'll change. >> Right, and that's really, again, back to the premise of the book. We're going from these individual products, where all the data is locked into the product silo, to this digital fabric, that is an enterprise context, not a product context. >> That's right, and if you go to the toolsets that Pentaho offers, the analytic toolsets.
Let's just say, now that I've got this rich data set, assuming I'm following basic architectural principles so that I can leverage the maximum amount of data, that now gives me the ability to use these types of toolsets to do far better operational analytics to know what's going on, far better forensic analysis and investigative analytics to mine through the data and do root cause analysis, far better predictive analytics and prescriptive analytics to figure out what will go on, and ultimately feed the machine learning algorithms to get to, in essence, the living organism, the adaptive systems that are continuously changing and adapting to circumstances. That's kind of the Holy Grail. >> You mentioned Hitachi Vantara before. I'm curious what your thoughts are on the Hitachi, you know, two years ago, we saw the acquisition, said, okay, now what? And you know, on paper it sounded good, and now it starts to come together, it starts to make more sense. You know, storage is going to the cloud. HDS says, alright, well, we got this Hitachi relationship. But what do you make of that? How do you assess it, and where do you see it going? >> First of all, I actually think the moves that they've done are good. And I would not say that if I didn't think it; I'd just find a politically correct way not to say that. But I do think it's good. So they created the Hitachi Insight Group about a year and a half ago, and now that's been folded into Hitachi Vantara, alongside HDS and Pentaho, and I think that it's a fairly logical set of elements coming together. I think they're going down the right path. In full disclosure, I worked for Hitachi Data Systems from '91 til '94, so it's not like I'm a recent employee of them, it's 25 years ago, but my experience with Hitachi corporate and the way they approach things has been, unlike a lot of really super large companies, who may be super large, but may not be the best engineers, or may not always get everything done so well, Hitachi's a really formidable organization. And I think what they're doing with Pentaho and HDS and the Insight Group, and specifically Lumada, is well thought out, and I'm optimistic about where they're going. And by the way, they won't be the only winner in the equation. There's going to be eight or nine different key players, but I would not short them whatsoever. I have high hopes for them. >> The TAM is enormous. Normally, Hitachi eventually gets to where it wants to go. It's a very thoughtful company. I've been watching them for 30 years. But to a lot of people, the Pentaho and the Insight's play make a lot of sense, and then HDS, you used to work for HDS, lot of infrastructure still, lot of hardware, but a relationship with Hitachi Limited, that is quite strong, where do you see that fit, that third piece of the stool? >> So, this is where there's a few companies that have unique advantages, with Hitachi being one of them. Because if you think about IoT, IoT is the intersection of information technology and operational technology. So it's one thing to say, "I know how to build a database," or "I can build machine learning algorithms," or whatever. It's another thing to say, "I know how to build trains or CAT scans or smart city lighting systems." And the domain expertise married with the technology delivers a set of capabilities that you can't match without that domain expertise.
And, I mean, if you even just reduce it down to artificial intelligence and machine learning, you get an expert ML or AI guy, and they're only as good as the limits of their domain expertise. So that's why, and again, that's why I go back to the comparison to search engines, where there's going to be, like, there's Google and maybe Yahoo. There's probably going to be more platform winners, because the vertical expertise is going to be very, very important, but there's not going to be 700 of 'em. But Hitachi has an advantage that they bring to the table, 'cause they have very deep roots in energy, in medical equipment, in transportation. All of that will manifest itself in what they're doing in a big way, I think. >> Okay, so, but a lot of the things that you described, and help me understand this, are Hitachi Limited. Now of course, Hitachi Data Systems started as National Advanced Systems, a distribution arm for Hitachi IT products. >> Don: Right, good for you, not many people remember. >> I'm old. So, like I said, I had a 30 year history with this company. Do you foresee that, and by the way, interestingly, HDS was often criticized back when you were working there, it was like, it's still a distribution hub, but in the last decade, HDS has become much more of a contributor to the innovation and the product strategy and so forth. Having said that, it seems to me advantageous if some of those things you discussed, the trains, the medical equipment, can start flowing back through HDS. I'm not sure if that's explicitly the plan. I didn't necessarily hear that, but it sort of has to, right? >> Well, I'm not privy to those discussions, so it would be conjecture on my part. >> Let's opine, but right, doesn't that make sense? >> Don: It makes perfect sense. >> Because, I mean, HDS for years was just this storage silo. And then storage became a very uninteresting business, and credit to Hitachi for pivoting. But it seems to me that they could really, and they probably have a, I had Brian Householder on earlier; I wish I had explored this more with him. But it just seems, the question for them is, okay, how are you going to tap those really diverse businesses. I mean, it's a business like a GE or a Siemens. I mean, it's very broad based. >> Well, again, conjecture on my part, but one way I would do it would be to start using Lumada in the various operations, the domain-specific operations, right now within Hitachi. Whether they plan to do that or not, I'm not sure of. I've heard that they probably will. >> That's a data play, obviously, right? >> Well, it's a platform play. And it's enabling technology that should augment what's already going on in the various elements of Hitachi. Again, this is conjecture on my part. But you asked, so let's just go with this. I would say that makes a lot of sense. I'd be surprised if they don't do that. And I think in the process of doing that, you start to crosspollinate that expertise, and that gives you a unique advantage. It goes back to, if you have unique advantages, you can choose to exploit them or not. Very few companies have the set of unique advantages that somebody like Hitachi has in terms of their engineering and massive reach into so many areas; you know, Hitachi, GE, Siemens, these are companies that have big reach, to the extent that they exploit it or not.
One of the things about Hitachi that's different than almost anybody though is they have all this domain expertise, but they've been in the technology-specific business for a long time as well, making computers. And so, they actually already have the internal expertise to crosspollinate, but you know, whether they do it or not, time will tell. >> Well, but it's interesting to watch the big whales, the horses in the track, if you will. Certainly GE has made a lot of noise, like, okay, we're a software company. And now you're seeing, wow, that's not so easy, and then again, I'm sanguine about GE. I think eventually they'll get there. And then you see IBM's got their sort of IoT division. They're bringing in people. Another company with a lot of IT expertise. Not a lot of OT expertise. And then you see Hitachi, who's actually got both. Siemens I don't know as well, but presumably, they're more OT than IT and so you would think that if you had to evaluate the companies' positions, that Hitachi's in a unique position. Certainly have a lot of software. We'll see if they can leverage that in the data play, obviously Pentaho is a key piece of that. >> One would assume, yeah for sure. No, I mean, I again, I think, I'm very optimistic about their future. I think very highly of the people I know inside that I think are playing a role here. You know, it's not like there aren't people at GE that I think highly of, but listen, you know, San Ramon was something that was spun up recently. Hitachi's been doing this for years and years and years. You know, so different players have different capabilities, but Hitachi seems to have sort of a holistic set of capabilities that they can bring together and to date, I've been very impressed with how they've been going about it. And especially with the architecture that they're bringing to bear with Lumada. >> Okay, the book is The Future of IoT, leveraging the shift to a data-centric world. Don DeLoach, and you had a co-author here as well. >> I had two co-authors. One is Wael Elrifai from Pentaho, Hitachi Vantara and the other is Emil Berthelsen, a Gartner analyst who was with Machina Research and then Gartner acquired them and Emil has stayed on with them. Both of them great guys and we wouldn't have this book if it weren't for the three of us together. I never would have pulled this off on my own, so it's a collective work. >> Don DeLoach, great having you on TheCUBE. Thanks very much for coming on. Alright, keep it right there buddy. We'll be back. This is PentahoWorld 2017, and this is TheCUBE. Be right back.
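A concrete footnote on the "data exhaust" and statistical metadata ideas Don raises above: a minimal pure-Python sketch of reducing a window of raw sensor readings to compact summary statistics plus the anomalous raw values, so the consequential one percent survives while the exhaust is dropped. The record shape and the two-sigma threshold are invented for illustration; real statistical-metadata models are far more sophisticated than this.

import statistics

def summarize_window(sensor_id, readings, sigma=2.0):
    # Reduce one window of raw readings to summary stats plus outliers.
    mean = statistics.mean(readings)
    stdev = statistics.pstdev(readings)
    # Keep raw values only for anomalous points; note that a uniform
    # sample, as Don says, would tend to drop exactly these.
    outliers = [r for r in readings if abs(r - mean) > sigma * stdev]
    return {
        "sensor": sensor_id,
        "count": len(readings),
        "mean": mean,
        "stdev": stdev,
        "min": min(readings),
        "max": max(readings),
        "outliers": outliers,  # the consequential data, persisted in full
    }

window = [20.1, 20.0, 19.9, 20.2, 20.1, 35.7, 20.0]  # one sensor, one minute
print(summarize_window("room-101/temp", window))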

Published Date : Oct 27 2017


Matthew Hunt | Spark Summit 2017


 

>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, we're talking about data science and engineering at scale, and we're having a great time, aren't we, George? >> We are! >> Well, we have another guest now we're going to talk to, I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg, Matt, thanks for joining us! >> My pleasure. >> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark Summits have you been to? >> Almost all of them, actually, it's quite amazing to see the 10th one, yes. >> And you're pretty actively involved with the user group on the east coast? >> Matt: Yeah, I run the New York users group. >> Alright, well, what's that all about? >> We have some 2,000 people in New York who are interested in finding out what goes on, and which technologies to use, and what are people working on. >> Alright, so hopefully, you saw the keynote this morning with Matei? >> Yes. >> Alright, any comments or reactions from the things that he talked about as priorities? >> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming in advance, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people-- >> Well, the one millisecond today was kind of a wow one-- >> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. >> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down on some of the details. >> Sure. Bloomberg is a large company with 4,000-plus developers, we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news in the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there's some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess, trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, so that's a natural application. Another one would be regulatory: there's a regulatory system called MiFID, or MiFID II, the regulations required for Europe; you have to be able to record every trade for seven years, provide daily reports, so there's clearly a lot around that. And then I would also just say, our other internal databases have significant analytics that can be done, which is just kind of scratching the surface. >> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far?
>> I would definitely say that we have some things that are latency constrained; it tends to be not like high frequency trading, where you care about microseconds, but milliseconds are important, how long does it take to get an answer, but I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always. >> And so when you say coupled, is it because it's a trade-off, or 'cause you need both? >> Right, so it's a little bit of both; for a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often, greater efficiencies. Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other. >> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building? >> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one, but part of what's useful is, often when you sit down to try and determine what is the main cause of latency, you have to look at the full profile of a stack of what it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say a lot of what the Databricks guys and the Spark community have worked on over the years is connected to that, Project Tungsten and so on, all these things that made things much slower, much less efficient than they needed to be, and we can close that gap a lot, I would say, from the very beginning. >> This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take end-to-end control of continuous apps, for simplicity and performance, and so there's this, we'll write with transactional consistency, so we're assuring the customer of exactly-once semantics when we write to a file system or database or something like that. But Spark has never really done native storage, whereas Matei came here on the show earlier today and said, "Well, Databricks as a company is going to have to do something in that area," and he talks specifically about databases, and he said, he implied that Apache Spark, separate from Databricks, would also have to do more in state management, I don't know if he was saying key value store, but how would that open up a broader class of apps, how would it make your life simpler as a developer? >> Right. Interesting and great question; this is kind of a subject that's near and dear to my own heart, I would say. So part of that, when you take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted it to be, which is a form of a universal computation engine. So there's a lot of value if you can learn one small skillset, but it can work in a wide variety of use cases, whether it's streaming or at rest or analytics, and plug other things in.
As always, there's a gap in any such system between theory and reality, and how much can you close that gap; but as for storage systems, this is something that you and I have talked about before, and I've written about it a fair amount too. Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need a file store, you need a distributed file store, you need a database with generally transactional semantics, because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and then there's how close can you get to that, versus how much do you have to fit together from other parts; very interesting question.
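Matt's latency-versus-efficiency point earlier maps directly onto the trigger knob in Spark's Structured Streaming, so a small sketch may help. One hedge up front: the continuous mode shown second, the millisecond-class processing teased in the day's keynote, did not actually ship until Spark 2.3, after this conversation, and the rate source and console sink here are only demo stand-ins for real Kafka-style ingest.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-tradeoff").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 1000).load()

# Micro-batch trigger: each slice runs as a small batch job, so per-record
# latency is higher, but throughput and efficiency are better.
micro = (events.writeStream.format("console")
         .trigger(processingTime="1 second")
         .start())

# Continuous trigger (Spark 2.3+): long-running tasks give millisecond-class
# latency, but only map-like operations are allowed; the interval is just
# the checkpoint cadence, not a batch boundary.
cont = (events.writeStream.format("console")
        .option("checkpointLocation", "/tmp/ckpt")  # hypothetical path
        .trigger(continuous="1 second")
        .start())

spark.streams.awaitAnyTermination()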
And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming and integrating a database directly, although there's interesting possibilities with that too. How do you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take data frames, the core data structure, in Spark, and give it transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributing key value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems, versus relational databases, which generally started as single image on single machine, and are moving towards federation distribution, and there's been a lot with that with post grads, for example. One of the questions would be, is it just knobs, or why don't they work well together? And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done, just as we have database query optimizers, why not, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is the very efficient copy of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing, as soon as you hop from one system to another, all of a sudden, you have the semantic computational expense, that's a problem, we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know, where should a query be executed and which operator should I push down, you need something that I think of as a meta-optimizer, and also, knowledge about the shape of the data, or statistics underlying, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable. 
The hardest thing of all, in any organization, is to get people working together, but the more people work together to enable these pieces (how do I efficiently work with databases, how do better optimizations make streaming more mature), the more people can use it in practice, and that's why people develop software, to actually tackle these real-world problems. So I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segment. Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE; please stay with us, as Spark Summit 2017 will be back in a few moments.

Published Date : Jun 6 2017


Kickoff - Spark Summit East 2017 - #sparksummit - #theCUBE


 

>> Narrator: Live from Boston, Massachusetts, this is theCUBE covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Everybody, the euphoria is still palpable here; we're in downtown Boston at the Hynes Convention Center. For Spark Summit East, #SparkSummit, my co-host George Gilbert and I will be unpacking what's going on for the next two days. George, it's good to be working with you again. >> Likewise. >> I always like working with my man, George Gilbert. We go deep, George goes deeper. Fantastic action going on here in Boston, actually quite a good crowd here; it was packed this morning in the keynotes. The rave is streaming, everybody's talking about streaming. Let's go back a little bit though, George. When Spark first came onto the scene, you saw these projects coming out of Berkeley; it was the hope of bringing real-timeness to big data, dealing with some of the memory constraints that we found going from batch to real-time interactive and now streaming, and you're going to talk about that a lot. Then you had IBM come in and put a lot of dough behind Spark, basically giving it a stamp, IBM's imprimatur-- >> George: Yeah. >> Much in the same way it did with Linux-- >> George: Yeah. >> Kind of elbowing its way into-- >> George: Yeah. >> The marketplace and sort of gaining a foothold. Many people at the time thought that Hadoop needed Spark more than Spark needed Hadoop. A lot of people thought that Spark was going to replace Hadoop. Where are we today? What's the state of big data? >> Okay, so to set some context: when Hadoop V1, classic Hadoop, came out, it was a file system, a commodity file system, keep everything really cheap, don't have to worry about shared storage, which is very expensive, and the processing model, the execution engine for munging through data, was MapReduce. We're all familiar with those-- >> Dave: Complicated but dirt cheap. >> Yes. >> Dave: Relative to a traditional data warehouse. >> Yes. >> Don't buy a big Oracle Unix box or Linux box, buy this new file system, figure out how to make it work, and you'll save a ton of money. >> Yeah, but unlike the traditional RDBMS's, it wasn't really that great for doing interactive business intelligence and things like that. It was really good for big batch jobs that would run overnight or over periods of hours, things like that. The irony is, when Matei Zaharia, the co-creator of Spark, or actually the creator, and co-founder of Databricks, which is the steward of Spark, created the language and the execution environment, his objective was to do a better MapReduce than Hadoop's MapReduce: make it faster, take advantage of memory. But he did such a good job of it that he was able to extend it to be a uniform engine, not just for MapReduce-type batch stuff, but for streaming stuff. >> Dave: So originally they started out thinking, if I get this right-- >> Yeah. >> It was sort of a microbatch, leveraging memory more effectively, and then it extended beyond-- >> The microbatch is their current way to address the streaming stuff. >> Dave: Okay. >> It takes MapReduce jobs, which would be big, long-running jobs, and slices them up, so each little slice turns into an element in the stream. >> Dave: Okay, so the point is it was an improvement upon these big, long batch jobs-- >> George: Yeah. >> They're taking it from batch to interactive to real-time. So let's go back to big data for a moment here. >> George: Yeah.
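A minimal sketch of the microbatch model George describes, using Structured Streaming's built-in rate source so it runs without a Kafka cluster; the trigger interval and windowed count are illustrative choices, not anything from the conversation.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.appName("microbatch-sketch").getOrCreate()

    # The built-in "rate" source emits rows continuously; in a real
    # pipeline this would be a Kafka topic instead.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Each trigger slices the stream into a small batch and runs the
    # same DataFrame plan over it -- the microbatch model in a nutshell.
    counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .trigger(processingTime="10 seconds")
             .start())
    query.awaitTermination()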
>> Big data was the hottest topic in the world three or four years ago, and now it's sort of waned as a buzzword, but big data is now becoming more mainstream. We've talked about that a lot. A lot of people think it's done. Is big data done? >> George: No, it's more that it's sort of boring for us pundits to talk about, because it's becoming part of the fabric. The use cases are what's interesting. It started out as a way to collect all data into this really cheap storage repository, and then once you did that, this was the data you couldn't afford to put into your Teradata data warehouse at $25,000 per terabyte, with running costs a multiple of that. Here you put all your data in, your data scientists and data engineers started munging with the data, and you started taking workloads off your data warehouse, like ETL things that didn't belong there. Now people are beginning to experiment with business intelligence exploration and reporting on Hadoop, taking more workloads off the data warehouse. There are limitations there that will get solved by putting MPP SQL back-ends on it; we're working on that step. The one that comes after that is to make it easier for data scientists to use this data to create predictive models-- >> Dave: Okay, so I often joke that the ROI on big data was reduction on investment, lowering the denominator-- >> George: Yeah. >> In the expense equation, and I think it's fair to say that big data and Hadoop succeeded in achieving that. But then the question becomes, what's the real business impact? Clearly big data has not, except in some edge cases, and there are a number of edge cases and examples, lived up to the promise of real-time, affecting outcomes, you know, taking the human out of the decision, bringing transactions and analytics together. Now we're hearing a lot of that talk around AI and machine learning, and of course IoT is the next big thing; that's where streaming fits in. Is it the same wine in a new bottle? Or is it sort of the evolution of the data meme? >> George: It's an evolution, but it's not just a technology evolution to make it work. We've been talking about big data as efficiency, low cost, cost reduction for the existing type of infrastructure, but when it starts going into machine learning, you're doing applications that are more strategic and more top-line focused. That means your C-level execs actually have to get involved, because they have to talk about the strategic objectives, like growth versus profitability, or which markets you want to target first. >> So has Spark been a headwind or a tailwind for Hadoop? >> I think it's very much been a tailwind, because it simplified a lot of things that took many, many engines in Hadoop. That's something that Matei, the creator of Spark, has been talking about for a while. >> Dave: Okay, something I learned today, and actually I had heard this before, but the way I phrased it in my tweet: genomics is kicking Moore's Law's ass. >> George: Yeah. >> The price performance of sequencing a gene improves 3x every year, versus what is essentially a doubling every 18 months for Moore's Law. The amount of data that's being created is just enormous; I think we heard from the Broad Institute that they create 17 terabytes a day-- >> George: Yeah. >> As compared to YouTube, which is 24 terabytes a day. >> And in a few years it will be-- >> It will be dwarfing YouTube. >> Yeah.
>> Of course Twitter you couldn't even see-- >> Yeah. >> So what do you make of that? Is that just a fun fact, is that a new use case, or is that really where this whole market is headed? >> It's not just a fun fact, because we've been hearing for years and years about this study about data doubling every 18 to 24 months, and that's coming from the legacy storage guys, who can only double their capacity every 18 to 24 months. The reality is that when we take what was analog data and make it digitally accessible, the only thing preventing us from capturing all this data is the cost to acquire and manage it. The available data is growing much, much faster than 40% every 18 months. >> Dave: So what you're saying is that-- I mean, this industry has marched to the cadence of Moore's Law for decades, and what you're saying is that linear curve is actually reshaping, and it's becoming exponential. >> George: For data-- >> Yes. >> George: So the pressure is on for compute, which is now the bottleneck, to get cleverer and cleverer about how to process it-- >> So that says innovation has to come from elsewhere, not just Moore's Law. It's got to come from a combination of things-- Thomas Friedman talks a lot about Moore's Law being one of the fundamentals, but there are others. >> George: Right. >> So from a data perspective, what are those combinatorial effects that are going to drive innovation forward? >> George: There was a big meetup for Spark last night, and the focus was this new database called SnappyData that spun out of Pivotal; it's being mentored by Paul Maritz, ex-head of development at Microsoft in the 90s and former head of VMware. The interesting thing about this database, and we'll start seeing it in others, is that you don't necessarily want to query and analyze petabytes at once; it will take too long, sort of like munging through data of that size on Hadoop took too long. You can do things that approximate the answer and get it much faster. We're going to see more tricks like that. >> Dave: It's interesting you mention Maritz. I heard a lot of messaging this morning about essentially real-time analysis, being able to make decisions on data that you've never seen before and actually affect outcomes. This narrative I first heard from Maritz many, many years ago when they launched Pivotal. He launched Pivotal to be this platform for building big data apps, and now you're seeing Databricks and others sort of usurp that messaging and actually seem to be at the center of that trend. What's going on there? >> I think there are two, what would you call it, two centers of gravity; our CTO David Floyer talks about this. The edge is becoming more intelligent, because there's a huge bandwidth and latency gap between these smart devices at the edge, whether the smart device is a car, a drone, or just a bunch of sensors on a turbine. Those things need to analyze and respond in near real-time or hard real-time, like how to tune themselves, things like that, but they also have to send a lot of data back to the cloud, to learn about how these things evolve. In other words, it would be like sending the data to the cloud to figure out how the weather patterns are changing. >> Dave: Mm-hmm. >> That's the analogy. You need them both. >> Dave: Okay. >> So Spark right now is really good in the cloud, but they're doing work so they can take a lighter-weight version and put it at the edge. We've also seen Amazon put some stuff at the edge, and Azure as well. >> Dave: I want you to comment.
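A small sketch of the approximate-answer idea in stock Spark terms, using its built-in approximate aggregates; SnappyData's own machinery goes further, so treat this as an analogy rather than its implementation. The table sizes and error bounds are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import approx_count_distinct, col

    spark = SparkSession.builder.appName("approx-sketch").getOrCreate()

    # Ten million synthetic rows standing in for a large table.
    df = spark.range(10000000).withColumn("user_id", col("id") % 250000)

    # HyperLogLog-style sketch: trades a small, tunable error (rsd) for
    # an aggregate that never materializes the full distinct set.
    df.agg(approx_count_distinct("user_id", rsd=0.02)).show()

    # Approximate quantiles with a bounded relative error, instead of a
    # full sort of the column.
    print(df.approxQuantile("id", [0.5, 0.95], 0.01))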
We're going to talk about this later; we have a-- George and I are going to do a two-part series at this event. We're going to talk about the state of the market, and then we're going to release a glimpse at our big data numbers, our Spark forecast, our streaming forecast. I mention streaming because we talk about batch, we talk about interactive/real-time, you know, you're at a terminal-- anybody who's as old as I am remembers that. But now you're talking about streaming. Streaming is a new workload type; you call these things continuous apps, like streams of events coming into a call center, for example-- >> George: Yeah. >> As one example that you used. Add some color to that. Talk about that new workload type and the role of streaming, and really, potentially, how it fits into IoT. >> Okay, so for the last 60 years, since the birth of digital computing, we've had one of two workloads. They were either batch, which is jobs that ran offline, where you put your punch cards in and sometime later the answer comes out, or interactive, which was originally green screens and is now PCs and mobile devices. The third one coming up now is continuous, or streaming: data that you act on in near real-time. It's not that those apps will replace the previous ones; it's that you'll have apps that mix continuous processing, batch processing, and interactive. An example today would be all the information about how your applications and data center infrastructure are operating; that's a lot of streams of data that Splunk first took aim at and did very well with, so that you're looking in real-time and able to figure out if something goes wrong. That type of stuff, all the telemetry from your data center, is a training wheel for the Internet of Things, where you've got lots of stuff out at the edge. >> Dave: It's interesting you mention Splunk. Splunk doesn't actually use the big data term in its marketing, but they actually are big data, and they are streaming. They're not talking about it, they're just doing it, but anyway-- alright George, thanks for that overview. We're going to break now and bring on our first guest, Arun Murthy, co-founder at Hortonworks, so keep it right there, everybody. This is theCUBE, we're live from Spark Summit East, #SparkSummit, we'll be right back. (upbeat music)
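A short sketch of the continuous-app idea George outlines, mixing a static batch table with a streaming source through the same DataFrame API; the call-center framing and all names are hypothetical, and the rate source stands in for a real event feed.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("continuous-app-sketch").getOrCreate()

    # A static batch table: hypothetical call-center queues per agent.
    agents = spark.createDataFrame(
        [(0, "triage"), (1, "billing"), (2, "support")],
        ["agent_id", "queue"])

    # A continuous source; each generated row stands in for a call
    # hitting the center.
    calls = (spark.readStream.format("rate").option("rowsPerSecond", 5)
             .load()
             .withColumn("agent_id", col("value") % 3))

    # The same DataFrame operations cover batch and streaming data:
    # a stream-to-static join, which Structured Streaming supports.
    enriched = calls.join(agents, "agent_id")

    (enriched.writeStream
     .format("console")
     .outputMode("append")
     .start()
     .awaitTermination())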

Published Date : Feb 8 2017
