Jim Campigli, WANdisco - #BigDataNYC 2015 - #theCUBE
>> Live from New York, it's The Cube, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now, for your hosts, John Furrier and Dave Vellante.

>> Hello, everyone. Welcome back. We're live in New York City for The Cube, a special big data [inaudible 00:00:27] our flagship program; we go out to the events and extract the [Inaudible 00:00:30]. We are here live as part of Strata Hadoop Big Data NYC. I'm John Furrier, with my co-host, Dave Vellante. Our next guest is Jim Campigli, the Chief Product Officer at WANdisco. Welcome back to The Cube. Great to see you.

>> Thanks, great to be here.

>> You've been COO of WANdisco, head of marketing, and now Chief Product Officer for a few years. You guys have always had the patents. David was on earlier, and I asked him specifically: why don't the other guys just do what you do? I wanted you to comment deeper on that, because he had a great answer. He said, patents. But you guys do something that's really hard, that people can't do.

>> Right.

>> So let's get into it, because Fusion is a big announcement you guys made, a big deal with EMC, a lot of traction with that, and it's one of these things that is kind of talked about, but not talked about. It's really a big deal. So what is the reason why you guys are so successful on the product side?

>> Well, I think, first of all, it starts with the technology that we have patented, and it's this true active-active replication capability that we have. Other software products claim to have active-active replication, but when you drill down on what they're really doing, typically what's happening is they'll have a set of servers that they replicate across, and you can write a transaction at any server, but then that server is responsible for propagating it to all of the other servers in the implementation. There's no mechanism for pre-agreeing to that transaction before it's actually written, so there's no way to avoid conflicts up front, there's no way to effectively handle scenarios where some of the servers in the implementation go down while the replication is in process, and very frequently those solutions end up requiring administrators to do periodic resynchronization, to go back and manually find out what didn't take and deal with all the deltas, whereas we offer guaranteed consistency. Effectively, what happens with us is you can write at any server as well, but the difference is we go through a peer-to-peer agreement process, and once a quorum of the servers in the implementation agrees to the transaction, they all accept it, and we make sure everything is written in the same order on every server. And every server knows the last good transaction it processed, so if it goes down at some point in time, as soon as it comes back up, it can grab all the transactions it missed during that time slice while it was offline and resync itself automatically, without an administrator having to do anything. And you can use that feature not only for network and server outages that cause downtime, but even for planned maintenance, which is one of the biggest causes of Hadoop availability issues, because obviously, if you've got a global deployment, when it's midnight on Sunday in the U.S., it's the start of the business day on Monday in Europe, and then it's the middle of the afternoon in Asia. So if you take Hadoop clusters down, somebody somewhere in the world is going to be going without their applications and data.
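To make the agreement process described above concrete, here is a minimal Java sketch of majority-quorum agreement over a totally ordered transaction log, with automatic catch-up from the last good transaction. It illustrates the general technique only; the class names and the simplified quorum check are assumptions of this sketch, not WANdisco's patented DConE implementation.

```java
// Minimal sketch: majority-quorum agreement over a totally ordered
// transaction log. Illustrative only; not WANdisco's DConE.
import java.util.ArrayList;
import java.util.List;

class Transaction {
    final long sequence;     // global position agreed by the quorum
    final String operation;  // e.g. "create /data/file1"

    Transaction(long sequence, String operation) {
        this.sequence = sequence;
        this.operation = operation;
    }
}

class ReplicaServer {
    private final List<Transaction> log = new ArrayList<>();
    private long lastApplied = 0;  // last good transaction this server processed
    boolean online = true;

    // Apply transactions strictly in the agreed sequence order.
    void apply(Transaction tx) {
        if (!online || tx.sequence != lastApplied + 1) return;
        log.add(tx);
        lastApplied = tx.sequence;
    }

    long lastApplied() { return lastApplied; }

    // Catch-up: replay everything missed while offline.
    void resyncFrom(List<Transaction> agreedLog) {
        for (Transaction tx : agreedLog) {
            if (tx.sequence > lastApplied) apply(tx);
        }
    }
}

public class QuorumReplicationSketch {
    public static void main(String[] args) {
        List<ReplicaServer> servers = List.of(
            new ReplicaServer(), new ReplicaServer(), new ReplicaServer());
        List<Transaction> agreedLog = new ArrayList<>();
        long nextSeq = 1;

        // A write proposed at any server is accepted only once a majority
        // (quorum) of servers is reachable to agree on its log position.
        String[] writes = {"create /a", "append /a", "create /b"};
        for (String op : writes) {
            long reachable = servers.stream().filter(s -> s.online).count();
            if (reachable > servers.size() / 2) {           // quorum check
                Transaction tx = new Transaction(nextSeq++, op);
                agreedLog.add(tx);
                servers.forEach(s -> s.apply(tx));          // same order everywhere
            }
            if (op.equals("create /a")) servers.get(2).online = false; // simulate outage
        }

        // The failed server comes back and resyncs itself from its last
        // good transaction; no administrator intervention needed.
        ReplicaServer recovered = servers.get(2);
        recovered.online = true;
        recovered.resyncFrom(agreedLog);
        System.out.println("Server 3 caught up to seq " + recovered.lastApplied());
    }
}
```

The sketch mirrors the properties from the interview: writes are accepted only with a quorum, every server applies them in the same global order, and a recovered server catches up automatically.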
>> It's interesting; I want to get your comments on this, because it highlights the next conversation we've been hearing all throughout The Cube this week: analytics, outcomes. These are the kinds of things people talk about, because that means there are checks being written. Hadoop is moving into production. People have done the clusters. It used to be the conversation, hey, x number of clusters, you do this, you do that, replication here and there, YARN, all these different buzzwords. Really feeds and speeds. Now, Hadoop is relevant, but it's kind of invisible. It's under the hood.

>> Right.

>> Yet it's part of other things in the network, so high availability and non-disruptive operations are table stakes now. So I want you to talk about that nuance, because that's what we're seeing as the thing that's powering, as the engine of, Hadoop deployments. What is that? Take us through that nuance, because that's one of the things you guys have been doing a lot of work in that's making it reliable and stable: to actually go out and play with Hadoop, deploy it, make sure it's always on.

>> Well, we really come into play when companies are moving Hadoop out of the lab and into production, when they have defined application SLAs, when they can only have so much downtime. It may be business requirements, it may be regulatory compliance issues, for example in financial services. They pretty much always have to have their data available. They have to have a solid backup of the data. That's a hard requirement for them to put anything into production in their data centers.

>> The other use case we've been hearing is, okay, I've got Hadoop, I've been playing with it, now I need to scale it up big time. I need to double, triple my clusters. I have to put it with my applications. Then the conversation is, okay, wait, do I need to do more sysadmin work? How do you address that particular piece? Because I think that's where Fusion comes in, from how I'm reading it, but is that a Fusion value proposition? Is it a WANdisco thing, and what does the customer see, and is that happening?

>> Yeah, so there are actually two angles to that, and the first is, how do we maintain that uptime? How do we make sure there's performance and availability to meet the SLAs, the production SLAs? The active-active replication that we have patents for, that I described earlier, is embodied in our DConE, our distributed coordination engine, which is at the core of Fusion. Once a Fusion server is installed with each of your Hadoop clusters, that active-active replication capability is extended to them, and we expose the HDFS API, so the client applications, Sqoop, Flume, Impala, Hive, anything that would normally run against a Hadoop cluster, talk through us. If data has been defined for replication, we do the active-active replication of it; otherwise it passes straight through and processes normally on the local cluster. So how does that address the issues you were talking about? What you're getting by default with our active-active replication is effectively continuous hot backup. That means if one cluster or an entire data center goes offline, that data exists elsewhere. Your users can fail over. They can continue accessing the data, running their applications. As soon as that cluster comes back online, it resyncs automatically. Now, what's the other...

>> No user involvement? No admin?

>> No user involvement in that.
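As a concrete illustration of that pass-through model, the sketch below uses only the stock Hadoop FileSystem client API: the application code is identical whether fs.defaultFS points at a NameNode or at a replication proxy in front of the cluster. The fusion-proxy hostname and the /sales path are hypothetical placeholders for this sketch, not documented WANdisco endpoints.

```java
// Illustrative only: a standard HDFS client writes through the same
// org.apache.hadoop.fs.FileSystem API regardless of what sits behind
// fs.defaultFS. The proxy hostname below is a made-up placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the proxy instead of a NameNode directly;
        // nothing else in the application changes.
        conf.set("fs.defaultFS", "hdfs://fusion-proxy.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/sales/2015/q3.csv"))) {
            out.writeBytes("region,revenue\nus-east,1200000\n");
        }
        // If /sales is defined for replication, the write is agreed and
        // applied on every participating cluster; otherwise it stays local.
    }
}
```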
Now, the only time, and this gets back into what I was talking about earlier: if I take servers offline for planned maintenance, to upgrade the hardware, the operating system, whatever it may be, I can take advantage of that feature, as I was alluding to earlier. I can take the servers of an entire cluster offline, and Fusion knows the last good transactions that were processed on that cluster. As soon as the admin turns it back on, it'll resync itself automatically. So that's how you avoid downtime, even for planned maintenance, if you have to take an entire location off. Now, to your other question, how do you scale this stuff up? Think about what we do. We eliminate idle standby hardware, because everything is full read-write. You don't have standby read-only backup clusters and servers when we come into the picture. So let's say we walk into an existing implementation, and they've got two clusters. One is the active cluster where everything's being written to, read from, actively being accessed by users. The other is simply taking snapshots or periodic backups, or they're using DistCp or something else, but they really can't get full utilization out of that. We come in with our active-active replication capability, and they don't have to change anything, but what suddenly happens is, as soon as they define what they want replicated, we'll replicate it for them initially to the other clusters. They don't have to pre-sync it, and the cluster that was formerly for disaster recovery, for backup, is now live and fully usable. So guess what? I'm now able to scale up to twice my original implementation by just leveraging that formerly read-only backup cluster that I was...

>> Is there a lot of configuration involved in that, or is it automatic?

>> No. So basically what happens, again, is you don't have to synchronize the clusters in advance. The way we replicate is based on this concept of folders, and you can think of a folder as basically a collection of files and subdirectories that roll up into root directories, effectively, that typically reflect particular applications that people are using with Hadoop, or groups of users that have data sets they access for their various sets of applications. You define the replicated folders, basically a high-level directory that consists of everything in it, and as soon as you do that, here's what we'll do automatically in a new implementation. Let's keep it simple. Let's say you just have two clusters, two locations. We'll replicate that folder in its entirety to the target you specify, and then from that point on, we're just moving the deltas over the wire (a sketch of this pattern follows this exchange). So you don't have to do anything in advance. And then suddenly that backup hardware is fully usable, and you've doubled the size of your implementation. You've scaled up to 2x.

>> So, I mean, what you were describing before really strikes me that the way you tell the complexity of a product and the value of a product in this space is what happens when something goes wrong.

>> Yep.

>> That's the question you always ask: how do you recover? Because recovery is a very hard thing, and your patents, you've got a lot of math inside there.

>> Right.

>> But you also said something that's interesting, which is that you're an asset utilization play.

>> Right.

>> You're able to go in relatively simply and say, okay, you've got this asset that's underutilized; I'm now going to give you back some capacity that's on the floor and take advantage of that.
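The initial-copy-then-deltas pattern just described can be sketched in a few lines. This toy model compares file versions to decide what to ship; it is our own illustration with made-up paths and version strings, not WANdisco's implementation, which replicates agreed file system operations rather than diffing state after the fact.

```java
// Toy sketch of "initial full copy, then deltas only" folder replication.
// File contents are modeled as version strings for brevity.
import java.util.HashMap;
import java.util.Map;

public class ReplicatedFolderSketch {
    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("/sales/q1.csv", "v1");
        source.put("/sales/q2.csv", "v1");

        // Initial replication: ship the whole folder to the target once.
        Map<String, String> target = new HashMap<>(source);

        // Later changes on the source...
        source.put("/sales/q2.csv", "v2");   // modified file
        source.put("/sales/q3.csv", "v1");   // new file

        // ...are propagated as deltas only: entries whose version differs.
        for (Map.Entry<String, String> e : source.entrySet()) {
            if (!e.getValue().equals(target.get(e.getKey()))) {
                target.put(e.getKey(), e.getValue()); // move only this delta
                System.out.println("Replicated delta: " + e.getKey());
            }
        }
    }
}
```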
>> Right, and you're able to scale up without spending any more on hardware and infrastructure.

>> So I'm interested in, so, another company: you now have an EMC partnership this week. And they sort of got into this way back in the mainframe days with SRDF. When I first heard about WANdisco, I always thought, it's like SRDF for Hadoop, but it's active-active. Then they bought that, yada yada.

>> And there are no distance limitations for their active-active.

>> So what's the nature of the relationship with EMC?

>> Okay, so basically EMC, like the other storage vendors that want to play in the Hadoop space, exposes some form of an HDFS API, and in fact, if you look at Hortonworks or Cloudera, if you go and look at Cloudera Manager, one of the things it asks you when you're installing it is whether you're going to run this on regular HDFS storage, effectively a bunch of commodity boxes typically, or whether you're going to use EMC Isilon or the various other options. What we're able to do is replicate across Hadoop clusters running on Isilon, running on EMC ECS, running on standard HDFS, and what that allows these companies to do is, without modifying those storage systems, without migrating that data off of them, incorporate it into an enterprise-wide data lake, if that's what they want to do, and selectively replicate across all of those different storage systems. It could be a mix of different Hadoop distributions. You could have replication between CDH, HDP, Pivotal, MapR, all of those things, including the EMC storage that I just mentioned, which was mentioned in the press release, Isilon, and ECS, which effectively has Hadoop-compatible API support. And we can create, in effect, a single virtual cluster out of all of those different platforms.

>> So is it a go-to-market relationship? Is it an OEM deal?

>> Yeah, it was really born out of the fact that we have some mutual customers that want to do exactly what I just described. They have standard Hortonworks or Cloudera deployments in house, they've got data running on Isilon, and they want to deploy a data lake that includes what they've got stored on Isilon with what they've got in HDFS and Hadoop, and replicate across that.

>> Did you go through, like, the onerous EMC certification process?

>> Yeah, we went through that process. We actually set up environments in our labs where we had EMC Isilon and ECS running, and did demonstration integrations: replication across Isilon to HDP, Hortonworks, Isilon to Cloudera, ECS to Isilon to HDP and Cloudera, and so forth. So we did prove it out. They saw that. In fact, they lent us boxes to actually do this in our labs, so they were very motivated, and they're seeing us in some of their bigger accounts.

>> Talk about the aspect of two things: non-disruptive operations, meaning I have to want to deploy stuff, because now that Hadoop has a hardened top with some abstraction layer, with analytics to focus on, there's a lot of work going on under the hood, and a large-scale enterprise might have a zillion versions of Hadoop. They might have a little Hortonworks here; they might have something over here. So there might be some diversity in the distributions. That's one thing. The other one is operational disruption.

>> Right.

>> What do you guys do there? Is it zero disruption, and how do you deal with multiple versions of the distro?
>> Okay, so basically, the simplest way to describe what we're doing is that we're providing a common API across all of these different distributions, running on different storage platforms and so forth, so that the client applications are always interacting with us. They're not worrying about the nuances of the particular Hadoop APIs that these different things expose. So we're providing a layer of abstraction, effectively. We're transparent, in that sense, operationally, once we're installed. The other thing is, and I mentioned this earlier, when we come in, you don't have to pre-sync clusters, you don't have to make sure they're all the same versions or the same distros or any of that. Just install us, select the data that you want to replicate, and we'll replicate it over initially to the target clusters, and from that point on, you just go. It just works. And we talked about the core patent for active-active replication. We've got other patents that have been approved, three patents now and seven applications pending, that allow this active-active replication to take place while servers are being added to and removed from implementations, without disrupting user access or running applications and so forth.

>> Final question for you: sum up the show this week. What's the vibe here? What's the aroma? Is it really Hadoop next? What is the overall Big Data NYC story here at Strata Hadoop? What's the main theme that you're seeing coming out of the show?

>> I think the main theme we're starting to see is twofold. One is that we are seeing more and more companies moving this into production. There's a lot of interest in Spark and the whole fast data concept, and I don't think that Spark is necessarily orthogonal to Hadoop at all. I think the two have to coexist. If you think about Spark Streaming and the whole fast data concept, basically, Hadoop provides the historical data at rest. It provides the historical context. The streaming data provides the point-in-time information. What Spark together with Hadoop allows you to do is that real-time analysis, the real-time informed decision-making, but within historical context instead of a single point-in-time vacuum. So I think that's what's happening, and you notice the vendors themselves aren't saying, oh, it's all Spark, forget Hadoop. They're really talking about coexisting.

>> Alright. Jim, from WANdisco, Chief Product Officer, really in the trenches, talking about what's under the hood and making it all scale in the infrastructure so the analysts can hit the scene. Great to see you again. Thanks for coming and sharing your insight here on The Cube. Live in New York City, we are here, day two of three days of wall-to-wall coverage of Big Data NYC in conjunction with Strata. We'll be right back with more live coverage in a moment, here in New York City, after this short break.