Jags Ramnarayan, SnappyData - Spark Summit 2017 - #SparkSummit - #theCUBE
(techno music)

>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.

>> You are watching the Spark Summit 2017 coverage by theCUBE. I'm your host David Goad, joined by George Gilbert. How you doing, George?

>> Good to be here.

>> And honored to introduce our next guest, the CTO from SnappyData. Wow, we were lucky to get this guy.

>> Thanks for having me.

>> David: Jags Ramnarayan. Jags, thanks for joining us.

>> Thanks, thanks for having me.

>> And for people who may not be familiar, maybe tell us, what does SnappyData do?

>> So SnappyData, in a nutshell, is taking Spark, which is a compute engine, and in some sense augmenting the guts of Spark so that Spark truly becomes a hybrid database. A single data store that's capable of taking Spark streams, doing transactions, providing mutable state management in Spark, but most importantly being able to turn around and run analytical queries on that state that is continuously emerging. That's it in a nutshell. Let me just say a few things. SnappyData itself is a startup that was spun out of Pivotal. We've been out of Pivotal for roughly about a year, so the technology itself was to a great degree incubated within Pivotal. It's a product called GemFire within VMware and Pivotal. So we took the guts of GemFire, which is an in-memory database designed for transactional, low-latency, high-concurrency scenarios, and we are fusing it, that's the key thing, fusing it into Spark, so that now Spark becomes significantly richer, not just as a compute platform, but as a store.

>> Great, and we know this is not your first Spark Summit, right? How many have you been to? Lost count?

>> Boy, let's see, three, four Spark Summits now; if I include this year's Spark Summit, four to five.

>> Great, so an active part of the community. What were you expecting to learn this year, and have you been surprised by anything?

>> You know, it's always wonderful to see. Every time I come to Spark Summit, there's just a new set of innovations, right? When I first came to Spark, it was a mix of, let's talk about data frames, and how do I optimize my queries. Today you come, and there is such a wide spectrum of amazing new things happening. It's just mind-boggling. Right from AI techniques, to structured streaming and the real-time paradigm, and this confluence that Databricks brings to it all. How they create that confluence through a unified mechanism is really brilliant, I think.

>> Okay, well let's talk about how you're innovating at SnappyData. What are some of the applications or current projects you're working on?

>> So, a number of things. I mean, GE is an investor in SnappyData, so we're trying to work with GE in the industrial IoT space. We're working with large health care companies, also on the IoT side. So the pattern we target with SnappyData is one where there are a lot of high-velocity streams of data emerging. The streams could be, for instance, Kafka streams driving Spark streams, but the streams could also be operational databases, your Postgres instance and your Cassandra database instance, all generating continuous changes to data in an operational world. Can I suck that in and almost create a replica of the state that is emerging in the SQL operational environment, and still allow interactive analytics at scale for a number of concurrent users on live data? Not cube data, not pre-aggregated data, but the live data itself, right? Being able to almost give you Google-like speeds on live data.
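[Editor's note: a minimal sketch of the ingestion pattern described above, in vanilla Spark Structured Streaming. The Kafka topic, broker address, and payload format are hypothetical, and the example assumes the spark-sql-kafka package is on the classpath; SnappyData's actual path writes into its own mutable in-memory tables rather than the memory sink used here.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LiveIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("live-ingest")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Kafka topic standing in for the high-velocity streams
    // described above (CDC feeds from Postgres/Cassandra would look similar).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sensor-events")
      .load()

    // Parse "deviceId,metric" payloads into typed columns.
    val events = raw.selectExpr("CAST(value AS STRING) AS line")
      .select(split($"line", ",").getItem(0).as("deviceId"),
              split($"line", ",").getItem(1).cast("double").as("metric"))

    // The in-memory sink makes the continuously arriving state queryable
    // with plain SQL -- a stand-in for SnappyData's mutable in-memory tables.
    val query = events.writeStream
      .format("memory")
      .queryName("live_events")
      .outputMode("append")
      .start()

    // Interactive query over live, un-preaggregated data.
    spark.sql("SELECT deviceId, avg(metric) FROM live_events GROUP BY deviceId").show()

    query.awaitTermination()
  }
}
```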
>> George, we've heard people talking about this quite a bit.

>> Yeah, so Jags, as you said up front, Spark was conceived as sort of a general-purpose, I guess, analytic compute engine, and adding DBMS to it, not bolting it on but deeply integrating it, so that the core data structures now have DBMS properties like transactionality, must make a huge change in the scope of applications that are applicable. Can you describe some of those for us?

>> Yeah. The classic paradigm we find time and again is the so-called SMACK stack, right? There was the lambda stack; now there's the SMACK stack, which is really about Spark running on Mesos, using Spark Streaming as an ingestion capability, with continuous state emerging that I want to write into Cassandra. What we find very quickly is that the moment that state is emerging, I want to throw a business intelligence tool on top and immediately do live dashboarding on that state that is continuously changing and emerging. So what we find is that the first part, the high-speed ingest, the ability to transform these data sets, cleanse them, and get the cleansed data into Cassandra, works really well. What is missing is the ability to say, well, how am I going to get insight? How can I ask interesting, insightful questions and get responses immediately on that live data, right? And so the common problem there is, the moment I have Cassandra working, let's say, with Spark, every time I run an analytical query I only have two choices. One is to use the parallel connector to pull in the data sets from Cassandra, and now, unfortunately, when you do analytics you're working with large volumes. Every time I run even a simple query, all of a sudden I could be pulling 10 gigabytes, 20 gigabytes of data into Spark to run the computation. Hundreds of seconds lost. Nothing like interactive; it's all batch querying. So how can I turn around and say that if stuff changes in Cassandra, I can have an immediate, real-time reflection of that mutable state in Spark, on which I can run queries rapidly? That's a very key aspect for us.

>> So you were telling me earlier that you didn't see, necessarily, a need to replace the Cassandra in the SMACK stack entirely, but to complement it.

>> Jags: That's right.

>> Elaborate on that.

>> So our focus, much like Spark's, is all about in-memory state management and in-memory processing. And Cassandra, realistically, is really designed to say, how can I scale to the petabyte, right, for key-value operations, semi-structured data, what have you. So we think there are a number of scenarios where you still want Cassandra to be your store, because in some sense a lot of these guys have already adopted Cassandra in a fairly big way. So you want to say, hey, leave your petabyte-level volume in there, and you can essentially work with the real-time state, which could still be many terabytes, in main memory; that's what we specialize in.
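[Editor's note: the per-query pull Jags describes looks roughly like this with the open-source spark-cassandra-connector (assumed on the classpath); the keyspace, table, and host below are made up for illustration.]

```scala
import org.apache.spark.sql.SparkSession

object CassandraPull {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-pull")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1") // hypothetical cluster
      .getOrCreate()

    // Each analytical query re-reads the table through the parallel connector,
    // shipping gigabytes across the network before any aggregation happens.
    val readings = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "telemetry", "table" -> "readings")) // made-up names
      .load()

    // Even a simple aggregate pays the full pull cost on every run -- the
    // "hundreds of seconds" problem, versus querying mutable state already
    // resident in Spark's memory.
    readings.groupBy("device_id")
      .avg("metric")
      .show()

    spark.stop()
  }
}
```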
>> And we're also, I mean, I can touch on this approximate query processing technology, which is the other key part here. To say, hey, I can't really throw 1,000 cores and 1,000 machines at this just so that you can do your job really well. So one of the techniques we are adopting, which even the Databricks guys started with BlinkDB, is essentially an approximate query processing engine. We have our own approximate query processing engine as an adjunct, essentially, to our store. What that means is to say, can I take a billion records and synthesize something really, really small, using smart sampling techniques, sketching techniques, essentially statistical structures, that can be stored along with Spark, in Spark memory itself, and fused with the Spark Catalyst query engine. So that as you run your query, we can very smartly figure out, can I use the approximate data structures to answer the question extremely quickly. Even when the data is at petabyte volume, I have these data structures taking maybe gigabytes of storage only. Hopefully that's not getting too, too technical.

>> So the Spark Catalyst query optimizer, like an Oracle query optimizer, knows about the data that it's going to query, only in your case you're taking what Catalyst knows about Spark and extending it with what's stored in your native, also Spark-native, data structures.

>> That's right, exactly. Think about it: an optimizer always takes a query plan and says, here are all the possible plans you can execute, and here is the cost estimate for these plans. We essentially inject more plans into that, and hopefully our plan is even more optimized than the plans that the Spark Catalyst engine came up with. And Spark is beautiful because the Catalyst engine is a very pluggable engine. So you can essentially augment that engine very easily.
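[Editor's note: a sketch of the two ingredients in plain Spark: a stratified sample built with the DataFrame sampleBy method, and an extra optimizer rule injected through Spark's experimental Catalyst hook. The rule below is a deliberate no-op that only marks where a real AQP engine such as SnappyData's would rewrite base-table scans into sample scans; all data and names are illustrative.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object AqpHooks {

  // Placeholder optimizer rule. A real AQP extension would pattern-match
  // scans of the base table here and substitute a registered sample plan,
  // scaling aggregates by the sampling fraction.
  object RouteToSample extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op marker
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aqp-hooks")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Inject the extra rule: Catalyst runs it alongside its built-in
    // optimizations -- the pluggability referred to above.
    spark.experimental.extraOptimizations ++= Seq(RouteToSample)

    // Toy fact table standing in for "a billion records".
    val sales = Seq(("web", 12.0), ("web", 7.5), ("store", 3.2), ("store", 9.9))
      .toDF("channel", "amount")

    // Stratified sample: keep ~10% per channel. This is the "smart sampling"
    // half; sketches (count-min, HyperLogLog, etc.) would complement it.
    val sample = sales.stat.sampleBy("channel",
      Map("web" -> 0.1, "store" -> 0.1), seed = 42L)

    // Approximate answer: aggregate the sample, scale by 1/fraction.
    sample.groupBy($"channel").sum("amount")
      .selectExpr("channel", "`sum(amount)` * 10 AS approx_total")
      .show()

    spark.stop()
  }
}
```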
>> So you've been out in the marketplace, whether in alpha, beta, or now production, for enough time that the community is aware of what you've done. What are some of the areas you're being pulled into that people didn't associate Spark with?

>> So more often, we land in situations where they're looking at SAP HANA, as an example, maybe MemSQL, maybe just Postgres, and all of a sudden there are these hybrid workloads, which is the Gartner term HTAP. So there are a lot of HTAP use cases we get pulled into. There's no Spark, but we get pulled in because we're just a hybrid database. That's how people look at us, essentially.

>> Oh, so you pull Spark in because that's just part of your solution.

>> Exactly, right. So think about it: Spark is not just data frames and rich APIs; it also has a SQL interface, right? I can essentially execute SQL, select statements. Of course, we augment that SQL so that now you can do what you expect from a database: an insert, an update, a delete; can I create a view, can I run a transaction? So all of a sudden, it's not just a Spark API; what we provide looks like a SQL database itself.

>> Okay, interesting.
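[Editor's note: a sketch of that augmented SQL surface, assuming SnappyData's documented SnappySession entry point and its artifacts on the classpath; the table name, schema, and statements are illustrative. Plain Spark SQL of this era would reject the UPDATE and DELETE.]

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

object MutableSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mutable-sql")
      .master("local[*]")
      .getOrCreate()

    // SnappySession extends SparkSession with mutable-table SQL support
    // (per SnappyData's docs; an assumption in this sketch).
    val snappy = new SnappySession(spark.sparkContext)

    // A mutable row table -- illustrative name and schema.
    snappy.sql("CREATE TABLE positions (symbol STRING, qty INT) USING row")

    // Statements a vanilla 2017-era Spark SQL engine would not accept:
    snappy.sql("INSERT INTO positions VALUES ('ACME', 100)")
    snappy.sql("UPDATE positions SET qty = 150 WHERE symbol = 'ACME'")
    snappy.sql("DELETE FROM positions WHERE qty = 0")

    // The same session still answers analytical Spark SQL on the live state.
    snappy.sql("SELECT symbol, sum(qty) FROM positions GROUP BY symbol").show()
  }
}
```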
>> So tell us about the work with GE. They're among the first that have educated the world that there's so much data coming off devices, we have to be intelligent about what we filter and send to the cloud; we train models, potentially, up there, and run them closer to the edge so that we get low-latency analytics. But you were telling us earlier that there are alternatives, especially when you have such an intelligent database working both at the edge and in the cloud.

>> Right, that's a great point. See, what's happening with a lot of these machine learning models is that the models are learned on historical data sets. And quite often, especially if you look at predictive maintenance, those classes of use cases in industrial IoT, the patterns can evolve very rapidly, right? Maybe because of climate changes; let's say, for a windmill farm, there are a few windmills breaking down so rapidly that it's affecting everything else in terms of the power generation. So being able to alter the model itself, incrementally and in near real-time, is becoming more and more important.

>> David: Wow.

>> It's still a fairly academic research kind of area, but for instance, we are working very closely with the University of Michigan to say, can we use some of these approximate techniques to incrementally learn a model as well. Sort of incrementally augment a model, potentially at the edge, or even inside the cloud, for instance.

>> David: Wow.

>> So if you're doing it at the edge, would you be updating the instance of the model associated with that locale, and then would the model in the cloud be sort of like the master, which then gets pushed down, so you have an instance and a master?

>> That's right. Typically what happens is you have computed a model using a lot of historical data; you typically use supervised techniques to compute a model. And you take that model and inject it, potentially, into the edge, so that it can compute with that model, which is the easy part; everybody does that. So you continue to do that, right, because you really want the data scientists to be poring over those patterns, looking at and tweaking those models. But for a certain number of models, even the models injected at the edge, can I re-tweak that model in an unsupervised way? That's the kind of play we're also venturing into slowly, but that's all in the future.

>> But if you're doing it unsupervised, do you need metrics that sort of flag, like, what is the champion challenger, and figure out--

>> I should say that not all of these models can work in this very real-time manner. So, for instance, we've been looking at saying, can we recast the naive Bayes classifier to essentially do incremental classification, or incrementally learn the model. Clustering approaches can actually be done in an unsupervised way, in an incremental fashion. Things like that. There's a whole spectrum of algorithms that really need to be thought through for approximate algorithms to actually apply. So it's still active research.
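[Editor's note: of the approaches mentioned, incremental clustering is one Spark's MLlib already ships as StreamingKMeans, which updates cluster centers with every micro-batch. A minimal sketch, assuming feature vectors arrive as comma-separated lines on a local socket; an edge deployment would more likely read from Kafka or MQTT.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EdgeClustering {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("edge-clustering")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: sensor feature vectors as "0.1,0.2,0.3" lines.
    val features = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    // Cluster centers are updated with each micro-batch, so the model tracks
    // drift (the windmill-farm scenario) without a full supervised retrain.
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(0.9) // gradually forget stale batches
      .setRandomCenters(dim = 3, weight = 0.0)

    model.trainOn(features)
    model.predictOn(features).print() // cluster assignment per incoming vector

    ssc.start()
    ssc.awaitTermination()
  }
}
```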
>> Really great discussion, guys. We've just got about a minute to go before the break; really great stuff, I don't want to interrupt you. But maybe switch real quick to business drivers, maybe with SnappyData or with other peers you've talked to today. What business drivers do you think are going to affect the evolution of Spark the most?

>> I mean, for us, as a small company, the single biggest challenge is, like one of you guys, the analysts, said: it's raining databases out there. And the need to constantly educate people on how they can realize a next-generation data pipeline in a very simplified manner is the challenge we are running into, right? I think the business question for us is primarily how many people are going to go and say, yes, batch-oriented analytics is important, but incrementally, for competitive reasons, we want to be playing the real-time analytics game a lot more than before, right? So that's going to be big for us, and hopefully we can play a big part there, along with Spark and Databricks.

>> Great, well, we appreciate you coming on the show today and sharing some of the interesting work that you're doing. George, thank you so much, and Jags, thank you so much for being on theCUBE.

>> Thanks for having me on, I appreciate it. Thanks, George.

>> And thank you all for tuning in. Once again, we have more to come, today and tomorrow, here at Spark Summit 2017. Thanks for watching. (techno music)