Distributed Data with Unifi Software
>> Narrator: From the Silicon Angle Media office in Boston, Massachusetts, it's theCUBE. Now, here's your host, Stu Miniman.

>> Hi, I'm Stu Miniman, and we're here at the east coast studio for Silicon Angle Media. Happy to welcome back to the program a many-time guest, Chris Selland, who is now the Vice President of Strategic Growth with Unifi Software. Great to see you, Chris.

>> Thanks so much, Stu. Great to see you too.

>> Alright, so Chris, we'd had you on in your previous role many times.

>> Chris: Yes.

>> I think not only is this the first time we've had you on since you made the switch, but it's also the first time we've had somebody from Unifi Software on. So why don't you give us a little bit of background on Unifi and what brought you to this opportunity.

>> Sure, absolutely happy to sort of open up the relationship with Unifi Software. I'm sure it's going to be a long and good one. I joined the company about six months ago at this point, so earlier this year. I had actually worked with Unifi for a bit as a partner, when I was previously at the Vertica business inside of HP/HPE, as you know, for a number of years prior to that, where we did all the work together. I also knew the founders of Unifi, who were actually at Greenplum, which was a direct Vertica competitor. Greenplum was acquired by EMC; Vertica was acquired by HP. We were sort of friendly, respected competitors, and so I've known the founders for a long time. It was partly the people, but it was really the idea, the product. I was actually reading the piece that Peter Burris just did on wikibon.com about distributed data, and it played so into our value proposition. I think it's where things are going right now, and I think the market's bearing that out.

>> The piece you reference was actually from a Wikibon research meeting; we run those weekly. They're internal right now, but soon we'll be broadcasting them as video, because of course we do a lot of video. We pull the whole team together, and George Gilbert actually led this one for us, talking about what architectures I need to build when I start doing distributed data. With my background really more in the cloud and infrastructure world, we see it's a hybrid, and many times a multi-cloud, world. Therefore, one of the things we look at as critical is: wait, I've got things in multiple places. I've got my SaaS over here, I've got multiple public clouds I'm using, and I've got my data center. How do I get my arms around all the pieces? And of course data is critical to that.

>> Right, exactly. And the fact is that more and more people need data to do their jobs these days. Working with data is no longer just the domain of data scientists. Organizations are certainly investing in data scientists, but there's a shortage, and at the same time marketing people, finance people, operations people, supply chain folks all need data to do their jobs. And as you said, it's distributed: it's in legacy systems, it's in the data center, it's in warehouses, it's in SaaS applications, it's in the cloud, it's on premise. It's all over the place.

>> Chris, I've talked to so many companies, and everybody seems to be nibbling at a piece of this. We go to the Amazon show and there's this ginormous ecosystem where everybody's picking at it. Can you drill in a little bit on what problems you solve?
I've talked to people dealing with everything from just trying to get the licensing in place, to trying to empower the business units to do things, to government compliance, of course. So where's Unifi's point in this?

>> Well, it starts with having come out of essentially the data warehousing market. Now, of course, with all the investments in HDFS, Hadoop infrastructure, and open source infrastructure, there's been this fundamental thinking that the answer is: if I get all of the data in one place, then I can analyze it. Well, that just doesn't work.

>> Right.

>> Because it's just not feasible. So when you step back, it's one of those ah-ha moments that makes total sense. What we do is we basically catalog the data in place. So you can use your legacy data that's on the mainframe. Let's say I'm a marketing person trying to do an analysis of selling trends, marketing trends, marketing effectiveness. I want to use some order data that's on the mainframe, some clickstream data that's sitting in HDFS, some customer data in the CRM system, or maybe it's in Salesforce or Marketo. I need some data out of Workday. I want to use some external data, say, weather data, to look at seasonal analysis. I want to do neighborhooding. So how do I do that? I may be sitting there with Qlik or Tableau or Looker or one of these modern BI or visualization products, but at the same time, where's the data? So our value proposition starts with this: we catalog the data and we show where the data is. Okay, you've got these data sources, this is what they are, we describe them. Then there's a whole collaboration element to the platform that lets people, as they're using the data, say: yes, that's order data, but it's old data, so it's good if you use it up to 2007, and the more current data's over here. Things like that. And then we also help the person use it. I almost said IT, but it's not just IT, and it's not just data scientists; it's really about democratizing the use. Because business people don't know how to do inner and outer joins, or what a schema is. They just know: I'm trying to do a better job of analyzing sales trends, I've got all these different data sources, but once I've found them, once I've decided what I want to use, how do I use them? So we answer that question too.

>> Yeah, Chris, it reminds me a lot of some of the early value propositions we heard when Hadoop and the whole big data wave came: how do I, as a smaller company, or even a bigger company, do it faster and for less money than it used to be? It used to be millions of dollars and 18 months to roll out. Is it right to say this is kind of an extension of that big data wave, or what's different and what's the same?

>> Absolutely, we use a lot of that stuff. We've got flexibility in what we can use, but for most of our customers we use HDFS to store the data. We use Hive as the most typical data format, though you have flexibility there. We use MapReduce or Spark to do transformation of the data. So we use all of those open source components, and as the platform is used by multiple users, because it's designed to be an enterprise platform, the data does eventually migrate into the data lake. But we don't require you to get it there as a prerequisite. As I said, this is one of the things we really talk about a lot: we catalog the data where it is, in place, so you don't have to move it to use it, and you don't have to move it to see it. But at the same time, if you want to move it, you can. The fundamental idea that I've got to move it all first, that I've got to put it all in one place first, never works. We've come into so many projects where organizations have tried to do that, and they just can't. It's too complex these days.
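To make the "inner and outer joins" point concrete, here is a minimal sketch of the kind of cross-source join Chris describes, written in PySpark against toy in-memory data. The source names, fields, and join key are all hypothetical illustrations; Unifi's product generates this kind of work for business users rather than requiring them to write it.

```python
# A toy sketch of a cross-source join: order data (imagine it came off
# the mainframe) joined with CRM customer data. All names and fields
# here are hypothetical illustrations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-source-join-sketch").getOrCreate()

# Stand-ins for two cataloged sources; in practice these would be read
# from HDFS, a CRM export, a SaaS connector, and so on.
orders = spark.createDataFrame(
    [("C001", "2017-06-01", 120.0), ("C002", "2017-06-03", 80.0)],
    ["customer_id", "order_date", "amount"],
)
customers = spark.createDataFrame(
    [("C001", "Acme Corp", "Northeast"), ("C003", "Globex", "West")],
    ["customer_id", "name", "region"],
)

# A full outer join keeps customers with no orders and orders with no
# matching CRM record, exactly the inner-versus-outer distinction that
# business users rarely want to hand-code themselves.
joined = orders.join(customers, on="customer_id", how="full_outer")
joined.show()
```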
>> Alright, Chris, what are some of the organizational dynamics you're seeing from your customers? You mentioned data scientists and the business users. Who is identifying these issues, who's driving them, and who's got the budget to try to fix some of these challenges?

>> Well, our best implementations these days are almost all driven by use cases, so they're driven by business needs. Some of the big ones: I've sort of talked about customers already, but take customer 360 views. For instance, there's a very large credit union client of ours that has all of its data organized by accounts, but they can't really look at Stu Miniman as a customer. How do I look at Stu's value to us as a customer? I can look at his mortgage account, his savings account, his checking account, his debit card, but I can't just see Stu. I want to organize my data that way. That type of customer 360, or the marketing analysis I talked about, is a great use case. Another one we've been seeing a lot of is compliance, where it's about having a better handle on what data there is and where it is. This is where some of the governance aspects of what we do also come into play, even though we're very much about solving business problems. There's a very strong data governance element, because of things like data compliance. We're working, for instance, with MoneyGram, a customer of ours. In this day and age in particular, when money flows across borders, regulators often want to know: that money that went from here to there, tell me where it came from, tell me where it went, tell me the lineage. And they need to be able to respond to those inquiries very, very quickly. Now, the reality is that data sits in all sorts of different places, both inside and outside the organization. Being able to organize that and respond more quickly and effectively is a big competitive advantage. It helps with avoiding regulatory fines, and it also helps with customer responsiveness. And then you've got things like GDPR, the General Data Protection Regulation, which is being driven by the EU, and which is sort of like the next Y2K. Anybody in data who isn't paying attention to it needs to be, pretty quickly, at least if they're a big enough company to be doing business in Europe. Because if you are doing business with European companies or European customers, this is going to be a requirement as of May next year. There's a whole other set of rules about how data's kept, how data's stored, and what control customers have over their data, things like the 'Right to Be Forgotten'. As data's gotten more important, as you might imagine, the regulators have gotten more interested in what organizations are doing with it. Having a framework that organizes that and helps you be more compliant with those regulations is absolutely critical.
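As an illustration of the lineage requirement Chris mentions, here is a minimal, hypothetical sketch of a lineage trail: an append-only chain of hops that lets you answer "where did this come from, and where did it go" quickly. The class and field names are invented for illustration; this is the shape of the problem, not Unifi's implementation.

```python
# Hypothetical sketch of a data-lineage trail: an ordered list of hops
# that can be queried to answer a regulator's "where did it come from,
# where did it go" question. Not Unifi's implementation.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass(frozen=True)
class LineageHop:
    record_id: str      # the transaction or record being traced
    source: str         # system or party it came from
    destination: str    # system or party it went to
    timestamp: datetime

class LineageTrail:
    def __init__(self):
        self._hops: List[LineageHop] = []

    def record(self, hop: LineageHop) -> None:
        self._hops.append(hop)

    def trace(self, record_id: str) -> List[LineageHop]:
        # Full ordered history for one record, oldest hop first.
        return sorted(
            (h for h in self._hops if h.record_id == record_id),
            key=lambda h: h.timestamp,
        )

trail = LineageTrail()
trail.record(LineageHop("tx42", "branch-us", "clearing-house", datetime(2017, 6, 1)))
trail.record(LineageHop("tx42", "clearing-house", "bank-eu", datetime(2017, 6, 2)))
for hop in trail.trace("tx42"):
    print(hop.source, "->", hop.destination, hop.timestamp.date())
```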
>> Yeah, my understanding of GDPR is that if you don't comply, there are hefty fines.

>> Chris: Major fines.

>> Major fines that are going to hit you. Does Unifi solve that? Is there other re-architecture or redesign that customers need to do to be compliant?

[speaking at the same time]

>> No, that's the whole idea again: being able to leave the data where it is, but know what it is and where it is, and if and when I need to use it, know where it came from, where it's going, and where it went. We provide the platform that enables customers to do that, or partners to build the solutions for their customers.

>> I'm curious about customers and their adoption of public cloud. How does that play into what you are doing? They deploy more SaaS environments. We were having a conversation off camera today about the consolidation that's happening in the software world. What do those dynamics mean for your customers?

>> Well, public cloud is obviously booming and growing, and just about any organization has some public cloud infrastructure at this point. There are some very heavily regulated areas, health care's probably a good example, where there's very little public cloud. But even there we're working with it. We're part of the Microsoft Accelerator Program and work very closely with the Azure team, for instance, and they're working in some health care environments where you have to be HIPAA compliant, so there is a lot of caution around that. But nonetheless, the move to public cloud is certainly happening. I was just reading some stats the other day, I can't remember if they were Wikibon or other stats, and it's still only about 5% of IT spending. The reality is that organizations of any size have plenty of on-prem data. And with all the use of SaaS solutions, Salesforce, Workday, Marketo, all of these different SaaS applications, much of our data is in somebody else's data center as well. So it's absolutely a hybrid environment. That's why the report you guys put out on distributed data spoke so much to what our value proposition is, and that's why I'm really glad to be here to talk to you about it.

>> Great. Chris, tell us a little bit about the company itself: how many employees you have, and what metrics you can share about the number of customers, revenue, things like that.

>> Sure. We've got, I believe, about 65 people at the company right now. I joined, like I said, earlier this year, late February, early March. At that point we were about 40 people, so we've been growing very quickly. I can't get too specific about our revenue, but basically we're well into the triple-digit growth phase. We're still a small company, but we're growing quickly. Our number of customers is up in the triple digits as well, so we're expanding very rapidly. And we're a platform company, so we serve a variety of industries. Some of the big ones are health care and financial services, but even more than industry, it tends to be driven by the use cases I talked about. And we're building out our partnerships also; that's a big part of what I do.

>> Can you share anything about funding, where you are?
>> Oh yeah, funding, you asked about that, sorry. Yes, we raised our B round of funding, which closed in March of this year. Pelion Venture Partners, who you may know, Canaan Partners, and most recently Scale Venture Partners are investors. The company's raised a little over $32 million so far.

>> Partnerships: you mentioned Microsoft already. Any other key partnerships you want to call out?

>> We have a very broad partner network, which we're building up, but among the ones we're leaning in with the most, Microsoft is certainly one. We're doing a lot of work with the folks at Cloudera as well. We also work with Hortonworks, and we work with MapR. We're working almost across the board in the BI space. We've spent a lot of time with the folks at Looker, who were also a partner I worked with very closely during my Vertica days. We're working with Qlik, we're working with Tableau, really just about everybody in BI and visualization. I don't think people like the term BI anymore; the desktop visualization space. And then on public cloud, also Google and Amazon, so really all the major players. Those are the ones we've worked with most closely to date. As I mentioned earlier, we're part of the Microsoft Accelerator Program, so we're very involved in the Microsoft ecosystem. I actually just wrote a blog post, which I don't believe has been published yet, about some of what we call the full-stack solutions we've been rolling out with Microsoft for a few customers, where we're sitting on Azure, using HDInsight, which is essentially Microsoft's cloud Hadoop distribution, and visualizing in Power BI. So we've got a lot of deep integration with Microsoft, but we've got a broad network as well. And I should also mention service providers; we're building out our service provider partnerships too.

>> Yeah, Chris, I'm surprised we haven't talked about AI or machine learning yet at all. It feels like everybody that was doing big data has now pivoted, maybe a little bit early in the buzzword phase. What's your take on that? You've been a part of this for a while. Is big data just old now and we have a new thing, or how do you put those together?

>> Well, I think what we do maps very well to what's going on with AI and ML, at least in my personal view, because it's really part of the fabric of what our product does. I talked before about how, once you've found the data you want to use, the question is: how do I use it? There's a lot of ML built into that. We do what's called one-click functions, and these one-click functions get smarter as more and more people use the product and use the data. So if I've got some table over here and some SaaS data source over there, we grab the metadata, even though we don't require moving the data. We look at the metadata, things like field names, and then we'll suggest to the user: join this data source with that data source and see what it looks like. And if they say, ah, that worked, that feedback becomes part of the whole ML infrastructure, and we're more likely to advise the next few folks, through the one-click function, that if you're trying to do an analysis of sales trends, you might want to use this source and that source and join them together this way.
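A toy sketch of how metadata-driven join suggestions might work, assuming a simple field-name overlap score adjusted by user acceptance feedback. The scoring scheme is invented for illustration; Unifi has not published how its one-click functions rank candidates.

```python
# Toy join-suggestion scorer: rank candidate source pairs by shared
# field names, then nudge scores with accept/reject feedback. The
# scoring scheme here is invented for illustration only.
from collections import defaultdict

class JoinSuggester:
    def __init__(self):
        self.feedback = defaultdict(float)  # (src_a, src_b) -> adjustment

    @staticmethod
    def overlap(fields_a, fields_b):
        # Jaccard similarity of the two sources' column names.
        a, b = set(fields_a), set(fields_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def score(self, name_a, fields_a, name_b, fields_b):
        key = tuple(sorted((name_a, name_b)))
        return self.overlap(fields_a, fields_b) + self.feedback[key]

    def record_feedback(self, name_a, name_b, accepted):
        # Each acceptance makes this pair more likely to be suggested.
        key = tuple(sorted((name_a, name_b)))
        self.feedback[key] += 0.1 if accepted else -0.1

s = JoinSuggester()
crm = ["customer_id", "name", "region"]
orders = ["customer_id", "order_date", "amount"]
print(s.score("crm", crm, "orders", orders))   # overlap on customer_id
s.record_feedback("crm", "orders", accepted=True)
print(s.score("crm", crm, "orders", orders))   # higher after acceptance
```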
So it's a combination of AI and ML built into the fabric of what we do, and the community aspect of more and more people using it. But going back to your original question, there was a quote, and I'll misquote it if I try to repeat it directly, but I think it might have been John Furrier who was recently talking about ML and saying that eventually we're not going to talk about ML any more than we talk about the phone business. It's just going to become integrated into the fabric of how organizations do business and how organizations do things. So we very much have it built in. You could certainly call us an AI/ML company if you want; it's definitely part of our slide deck. But at the same time, it's something that will just become a part of doing business over time. And it really depends on large data sets. As we all know, this is why it's so cheap to get Amazon Echos and such these days: there's value in that data. There was just another piece, which I actually shared on LinkedIn today as a matter of fact, talking about Amazon and Whole Foods and asking: why are they getting such a valuation premium? They're getting such a valuation premium because they're smart about using data, but one of the reasons they're smart about using data is that they have the data. So the more data you collect, the more data you use, the smarter the systems get, and the more useful the solutions become.

>> Absolutely. Last year at Amazon re:Invent, John Furrier interviewed Andy Jassy, and I posited that the customer flywheel is going to be replaced by that data flywheel, and enhanced to make things spin even further.

>> That's exactly right, and once you get that flywheel going, it becomes a bigger and bigger competitive advantage. By the way, that's also why the regulators are getting interested these days too, right? There's sort of a flywheel going back the other way. But from our perspective, first of all, it just makes economic sense. These things could conceivably get out of control, at least that's what the regulators think, if you're not careful and there isn't some oversight, and I would say that some oversight is probably a good idea. So you've got flywheels pushing in both directions. But one way or another, organizations need to get much smarter and much more precise and prescriptive about how they use data, and that's really what we're trying to help with.

>> Okay, Chris, I want to give you the final word. Unifi Software, you're working on the strategic road pieces. What should we look for from you and your segment through the rest of 2017?

>> Well, I've always been a big believer, and I've probably cited 'Crossing the Chasm' many times on theCUBE during my prior HP tenure, that we should be talking about customers, and we should be talking about use cases. It's not about alphabet-soup technology or data lakes; it's about the solutions, and it's about how organizations are moving themselves forward with data.
Going back to that Amazon example: so I think from us, yes, we just released 2.0, and we've got a very active blog, so come by unifisoftware.com and visit. But it's also going to be about what our customers are doing, and that's really what we're going to try to promote. If you remember, this was also something, for all the years I've worked with you guys, I've been very much focused on. You always have to make sure the customer has agreed to be cited; it's nice when you can name them and reference them, and we're working on our customer references, because that's what I think is the most powerful in this day and age. Because, going back to what I said before, this is going throughout organizations now. People don't necessarily care about the technology infrastructure, but they care about what's being done with it. So being able to tell those customer stories, I think that's what you're going to see and hear the most from us. But we'll talk about our product as much as you let us as well.

>> Great. It reminds me of when Wikibon was founded: it was really about IT practitioners being able to share with their peers. Now, in the software economy today, they're doing things in software that can often be leveraged by their peers, and there's that flywheel. Just like when Salesforce first rolled out: they make one change, and then everybody else has that option. We're starting to see that more and more as we deploy SaaS and cloud; it's not shrink-wrapped software anymore.

>> I think to that point, I was at a conference earlier this year, an IT conference, and I was really sort of floored, because what the enlightened IT folks were talking about, and there are more and more enlightened IT folks these days, was the same thing: how our business is succeeding by being better at leveraging data. I think there are tremendous opportunities for people in IT, but they really have to think outside the box. It's not about Hadoop and Sqoop and SQL and Java anymore; it's really about business solutions. If you can start to think that way, there are tremendous opportunities, and we're just scratching the surface.

>> Absolutely. We've found that's really one of the proof points of what digital transformation is for companies. Alright, Chris Selland, always a pleasure to catch up with you. Thanks so much for joining us, and thank you for watching theCUBE.

>> Chris: Thanks too.

(techno music)
Jim Campigli, WANdisco - #BigDataNYC 2015 - #theCUBE
>> Live from New York. It's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now, here are your hosts, John Furrier and Dave Vellante.

>> Hello, everyone. Welcome back. We're live in New York City for theCUBE, a special big data [inaudible 00:00:27], our flagship program, where we go out to the events. [Inaudible 00:00:30] We are here live as part of Strata Hadoop Big Data NYC. I'm John Furrier, with my co-host Dave Vellante. Our next guest is Jim Campigli, the Chief Product Officer at WANdisco. Welcome back to theCUBE. Great to see you.

>> Thanks, great to be here.

>> You've been COO of WANdisco, head of marketing, and now Chief Product Officer for a few years. You guys have always had the patents. David was on earlier, and I asked him specifically: why don't the other guys just do what you do? I wanted you to comment deeper on that, because he had a great answer. He said, patents. But you guys do something that's really hard, that people can't do.

>> Right.

>> So let's get into it, because Fusion is a big announcement you guys made: a big deal with EMC, a lot of traction with that. And it's one of these things that is kind of talked about, but not talked about. It's really a big deal. So what is the reason why you guys are so successful on the product side?

>> Well, I think, first of all, it starts with the technology we have patented, and it's this true active-active replication capability that we have. Other software products claim to have active-active replication, but when you drill down on what they're really doing, typically what's happening is they'll have a set of servers they replicate across, and you can write a transaction at any server, but then that server is responsible for propagating it to all of the other servers in the implementation. There's no mechanism for pre-agreeing to that transaction before it's actually written, so there's no way to avoid conflicts up front, and there's no way to effectively handle scenarios where some of the servers in the implementation go down while the replication is in process. Very frequently, those solutions end up requiring administrators to do periodic resynchronization, to go back and manually find out what didn't take and deal with all the deltas, whereas we offer guaranteed consistency. Effectively, with us you can write at any server as well, but the difference is we go through a peer-to-peer agreement process, and once a quorum of the servers in the implementation agree to the transaction, they all accept it, and we make sure everything is written in the same order on every server. Every server knows the last good transaction it processed, so if it goes down at some point in time, as soon as it comes back up, it can grab all the transactions it missed during that time slice while it was offline and resync itself automatically, without an administrator having to do anything. And you can use that feature not only for network and server outages that cause downtime, but even for planned maintenance, which is one of the biggest causes of Hadoop availability issues. Because obviously, if you've got a global deployment, when it's midnight on Sunday in the U.S., it's the start of the business day on Monday in Europe, and it's the middle of the afternoon in Asia. So if you take Hadoop clusters down, somebody somewhere in the world is going to be without their applications and data.
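As a rough illustration of the agreement step Jim describes, here is a minimal, hypothetical sketch of quorum-based commit over an ordered log, with catch-up for a server that was offline. It glosses over everything that makes the real problem hard (and patented): leaderless consensus over a WAN, up-front conflict avoidance, and membership changes. The centralized coordinator below is a simplification for illustration, not how WANdisco's engine works.

```python
# Minimal sketch of quorum-agreed, ordered replication with catch-up.
# Hypothetical illustration only; none of WANdisco's real machinery
# (a leaderless, Paxos-style coordination engine) is shown here.

class Server:
    def __init__(self, name):
        self.name = name
        self.log = []          # transactions in globally agreed order
        self.online = True

    def apply(self, seq, tx):
        # Every server writes the same transactions in the same order.
        assert seq == len(self.log), "gap detected; must catch up first"
        self.log.append(tx)

    def catch_up(self, global_log):
        # Replay everything missed since the last good transaction.
        for seq in range(len(self.log), len(global_log)):
            self.apply(seq, global_log[seq])

class Coordinator:
    def __init__(self, servers):
        self.servers = servers
        self.global_log = []   # the agreed total order

    def propose(self, tx):
        # Commit only if a majority of servers can agree.
        voters = [s for s in self.servers if s.online]
        if len(voters) <= len(self.servers) // 2:
            raise RuntimeError("no quorum; transaction rejected")
        seq = len(self.global_log)
        self.global_log.append(tx)
        for s in voters:
            s.apply(seq, tx)

a, b, c = Server("a"), Server("b"), Server("c")
cluster = Coordinator([a, b, c])
cluster.propose("tx1")
c.online = False                # c goes down (outage or maintenance)
cluster.propose("tx2")          # still commits: a and b form a quorum
c.online = True
c.catch_up(cluster.global_log)  # c resyncs automatically on return
print(a.log == b.log == c.log)  # True: guaranteed consistency
```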
>> It's interesting, and I want to get your comments on this, because it highlights the conversation we've been hearing all throughout theCUBE this week: analytics, outcomes. These are the things people talk about, because that means there are checks being written. Hadoop is moving into production; people have done the clusters. The conversation used to be feeds and speeds: hey, x number of clusters, you do this, you do that, replication here and there, YARN, all these different buzzwords. Now Hadoop is relevant, but it's kind of invisible. It's under the hood.

>> Right.

>> Yet it's part of other things in the network, so high availability and non-disruptive operations are now table stakes. So I want you to talk about that nuance, because that's what we're seeing as the engine of Hadoop deployments. What is it? Take us through that nuance, because that's one of the areas where you guys have been doing a lot of work, making it reliable and stable, so people can actually go out, deploy Hadoop, and make sure it's always on.

>> Well, we really come into play when companies are moving Hadoop out of the lab and into production: when they have defined application SLAs and can only have so much downtime. It may be business requirements, it may be regulatory compliance issues, for example in financial services. They pretty much always have to have their data available, and they have to have a solid backup of the data. That's a hard requirement for them to put anything into production in their data centers.

>> The other use case we've been hearing is: okay, I've got Hadoop, I've been playing with it, now I need to scale it up big time. I need to double or triple my clusters. I have to put it with my applications. Then the conversation is: okay, wait, do I need to do more sysadmin work? How do you address that particular piece? That's where I think Fusion comes in, from how I'm reading it, but is that a Fusion value proposition? Is it a WANdisco thing, and is that happening?

>> Yeah, so there are actually two angles to that, and the first is: how do we maintain that uptime? How do we make sure there's performance and availability to meet the production SLAs? The active-active replication that we have patents for, which I described earlier, is embodied in our DConE distributed coordination engine, and it's at the core of Fusion. Once a Fusion server is installed with each of your Hadoop clusters, that active-active replication capability is extended to them, and we expose the HDFS API, so the client applications, Sqoop, Flume, Impala, Hive, anything that would normally run against a Hadoop cluster, talk through us. If data has been defined for replication, we do the active-active replication of it; otherwise it passes straight through and is processed normally on the local cluster. So how does that address the issues you were talking about? What you're getting by default with our active-active replication is effectively continuous hot backup. That means if one cluster or an entire data center goes offline, that data exists elsewhere. Your users can fail over; they can continue accessing the data and running their applications. As soon as that cluster comes back online, it resyncs automatically. Now, what's the other...

>> No user involvement? No admin?

>> No user involvement in that.
Now, the only time, and this gets back to what I was talking about earlier, is when I take servers offline for planned maintenance: upgrading the hardware, the operating system, whatever it may be. I can take advantage of that same feature. I can take the servers of an entire cluster offline, and Fusion knows the last good transactions that were processed on that cluster. As soon as the admin turns it back on, it resyncs itself automatically. So that's how you avoid downtime, even for planned maintenance, if you have to take an entire location offline.

Now, to your other question: how do you scale this stuff up? Think about what we do. We eliminate idle standby hardware, because everything is full read-write. You don't have standby read-only backup clusters and servers when we come into the picture. So let's say we walk into an existing implementation with two clusters. One is the active cluster that everything's being written to, read from, and actively accessed by users. The other is simply taking snapshots or periodic backups, or they're using distcp or something else, but they really can't get full utilization out of it. We come in with our active-active replication capability, and they don't have to change anything, but as soon as they define what they want replicated, we'll replicate it for them initially to the other clusters. They don't have to pre-sync it, and the cluster that was formerly for disaster recovery and backup is now live and fully usable. So guess what? I'm now able to scale up to twice my original implementation just by leveraging that formerly read-only backup cluster.

>> Is there a lot of configuration involved in that, or is it automatic?

>> No. Basically, again, you don't have to synchronize the clusters in advance. The way we replicate is based on this concept of folders, and you can think of a folder as a collection of files and subdirectories that roll up into root directories, which typically reflect particular applications people are using with Hadoop, or groups of users that have data sets they access for their various applications. You define the replicated folders, basically a high-level directory consisting of everything in it, and as soon as you do that, in a new implementation, and let's keep it simple and say you just have two clusters in two locations, we'll replicate that folder in its entirety to the target you specify, and from that point on we're just moving the deltas over the wire. So you don't have to do anything in advance. And suddenly that backup hardware is fully usable, and you've doubled the size of your implementation. You've scaled up to 2x.

>> So, what you were describing before really strikes me: the way you tell the complexity of a product and the value of a product in this space is what happens when something goes wrong.

>> Yep.

>> That's the question you always ask: how do you recover? Because recovery's a very hard thing, and in your patents you've got a lot of math inside there.

>> Right.

>> But you also said something interesting, which is that you're an asset-utilization play.

>> Right.

>> You're able to go in relatively simply and say: okay, you've got this asset that's underutilized. I'm now going to give you back some capacity that's on the floor and take advantage of that.

>> Right, and you're able to scale up without spending any more on hardware and infrastructure.
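A sketch of the folder-based model Jim describes: one full initial copy, then only deltas over the wire. The snapshot-diff approach below is a generic illustration invented for this example; Fusion actually replicates agreed file-system operations in order rather than diffing directories.

```python
# Generic sketch of "replicate the folder once, then ship only deltas."
# Illustration only; Fusion replicates agreed file-system operations
# in a total order, not directory diffs like this.
import shutil
from pathlib import Path

def initial_sync(src: Path, dst: Path) -> dict:
    """Full copy of the replicated folder; returns an mtime snapshot."""
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return {p.relative_to(src): p.stat().st_mtime
            for p in src.rglob("*") if p.is_file()}

def ship_deltas(src: Path, dst: Path, snapshot: dict) -> dict:
    """Copy only files created or changed since the last snapshot."""
    current = {p.relative_to(src): p.stat().st_mtime
               for p in src.rglob("*") if p.is_file()}
    for rel, mtime in current.items():
        if snapshot.get(rel) != mtime:        # new or modified file
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src / rel, target)
    return current

# Usage (paths are hypothetical): one full sync, then incremental passes.
# snap = initial_sync(Path("/data/sales"), Path("/backup/sales"))
# snap = ship_deltas(Path("/data/sales"), Path("/backup/sales"), snap)
```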
>> So I'm interested in another thing. You're now with an EMC partnership this week, and they sort of got into this way back in the mainframe days with SRDF. I always thought when I first heard about WANdisco, it's like SRDF for Hadoop, but it's active-active. Then they bought that, yada yada.

>> And there's no distance limitation for their active-active.

>> So what's the nature of the relationship with EMC?

>> Okay, so basically EMC, like the other storage vendors that want to play in the Hadoop space, exposes some form of an HDFS API. In fact, if you look at Hortonworks or Cloudera, if you go into Cloudera Manager, one of the things it asks when you're installing is: are you going to run this on regular HDFS storage, effectively a bunch of commodity boxes typically, or are you going to use EMC Isilon or the various other options? What we're able to do is replicate across Hadoop clusters running on Isilon, running on EMC ECS, and running on standard HDFS. What that allows these companies to do, without modifying those storage systems and without migrating data off of them, is incorporate them into an enterprise-wide data lake, if that's what they want, and selectively replicate across all of those different storage systems. It could be a mix of different Hadoop distributions: you could have replication between CDH, HDP, Pivotal, MapR, all of those things, including the EMC storage I just mentioned in the press release, Isilon, and ECS, which effectively has Hadoop-compatible API support. And we can create, in effect, a single virtual cluster out of all of those different platforms.

>> So is it a go-to-market relationship? Is it an OEM deal?

>> Yeah, it was really born out of the fact that we have some mutual customers that want to do exactly what I just described. They have standard Hortonworks or Cloudera deployments in house, they've got data running on Isilon, and they want to deploy a data lake that includes what they've got stored on Isilon along with what they've got in HDFS and Hadoop, and replicate across all of it.

>> Was it an onerous EMC certification process?

>> Yeah, we went through that process. We actually set up environments in our labs where we had EMC Isilon and ECS running, and did demonstration integrations: replication from Isilon to HDP, to Hortonworks, Isilon to Cloudera, ECS to Isilon to HDP and Cloudera, and so forth. So we did prove it out, and they saw that. In fact, they lent us boxes to do this in our labs, so they were very motivated, and they're seeing us in some of their bigger accounts.

>> Talk about two things: first, non-disruptive operations. People want to deploy stuff, because now that Hadoop has a hardened top with an abstraction layer and an analytics focus, there's a lot of work going on under the hood, and a large-scale enterprise might have a zillion versions of Hadoop. They might have a little Hortonworks here, they might have something over there, so there might be some diversity in the distributions. That's one thing. The other one is operational disruption.

>> Right.

>> What do you guys do there? Is it zero disruption, and how do you deal with multiple versions of the distro?
>> Okay, the simplest way to describe what we do is that we provide a common API across all of these different distributions, running on different storage platforms and so forth, so that the client applications are always interacting with us. They're not worrying about the nuances of the particular Hadoop APIs that these different things expose. So we're providing a layer of abstraction, effectively; we're transparent, operationally, once we're installed. The other thing, and I mentioned this earlier: you don't have to pre-sync clusters, and you don't have to make sure they're all the same versions or the same distros or any of that. Just install us, select the data you want to replicate, and we'll replicate it initially to the target clusters, and from that point on, you just go. It just works. We talked about the core patent for active-active replication; we've also got other patents, three granted now and seven applications pending, that allow this active-active replication to take place while servers are being added and removed from implementations, without disrupting user access or running applications.

>> Final question for you: sum up the show this week. What's the vibe here? What's the aroma? Is it really Hadoop next? What is the overall Big Data NYC story here at Strata Hadoop? What's the main theme you're seeing coming out of the show?

>> I think the main theme is twofold. One is that we are seeing more and more companies moving this into production. The other is that there's a lot of interest in Spark and the whole fast data concept, and I don't think Spark is orthogonal to Hadoop at all; I think the two have to coexist. If you think about Spark Streaming and the whole fast data concept: Hadoop provides the historical data at rest, the historical context, and the streaming data provides the point-in-time information. What Spark together with Hadoop allows you to do is real-time analysis, real-time informed decision-making, but within historical context instead of in a single point-in-time vacuum. So I think that's what's happening, and you'll notice the vendors themselves aren't saying, oh, it's all Spark, forget Hadoop. They're really talking about coexistence.

>> Alright, Jim Campigli from WANdisco, Chief Product Officer, really in the trenches, talking about what's under the hood and making it all scale in the infrastructure so the analysts can hit the scene. Great to see you again. Thanks for coming on and sharing your insight here on theCUBE. Live in New York City, we're here for day two of three days of wall-to-wall coverage of Big Data NYC, in conjunction with Strata. We'll be right back with more live coverage from New York City after this short break.