Jack Norris | Strata Data Conference 2013

>> Okay, we're back here inside theCUBE, our flagship program, where we go out to the events and extract the signal from the noise. This is Strata Conference, O'Reilly Media's big data event. We're talking about Hadoop, analytics, and data platforms, and how big data has come into the enterprise through the front door, as we heard yesterday. I'm John Furrier with Dave Vellante of Wikibon.org, and we're here with Jack Norris, a CUBE alumnus and a favorite guest here. You're an executive at MapR, and you guys are leading the charge with this use of Hadoop. Welcome back to theCUBE.

>> Thank you.

>> Okay, so let's chat about what's going on. What's your take on all the big news out here for the distributions, all the big power moves? You guys have a relationship with EMC, an exclusive relationship with those guys. Intel's got a distribution. Hortonworks is with Microsoft. A lot of things going on. This is your wheelhouse, so what's your take on the Hadoop action here?

>> Well, there's an article in Forbes where I think they said it best: this is showing that MapR has had the right strategy all along. What we're seeing is that there's a fairly low bar to taking Apache Hadoop and providing a distribution, so we're seeing a lot of new entrants in the market, and there are a lot of options if you want to try Hadoop, experiment, and get started. And then there's production-class Hadoop, which includes enterprise data protection, snapshots, mirrors, and the ability to integrate, and that's basically MapR. So start in test and dev with a lot of options, and then move into production class with MapR.

>> So break it down for the folks out there who are dipping a toe in the water and hearing all the noise, because right now the noise level is very high with the recent announcements. But you guys have been doing business for many years in this area. So when people say, "Hey, I want to get a Hadoop distribution for the enterprise,"
What should they be looking for? Because it's not that easy to sift through the noise. Could you share with the folks out there what to look for: the table stakes, the check boxes? There are a lot of claims, a lot of noise, and a lot of different options. Some teams have more committers than others, or no committers. That's all noise. What are the key things that customers need to know?

>> I think there are three areas. One is how it integrates into your enterprise. With Hadoop, you have the Hadoop Distributed File System API; that's how you interact. If you're also able to use standard tools with standard file and database access, it makes it much, much easier. MapR is unique in supporting NFS and making that happen. That's a big difference. The second is dependability: there are high-availability capabilities, and then there's data protection. I'll focus on snapshots as an example. You've got data replicated in Hadoop, and that's great. But if you have a user error or an application error, that's replicated just as quickly. So you need the ability to recover to a point in time.

>> Yeah.

>> So if I say, "Hey, I made a mistake, can I go back two minutes earlier?", snapshots make that possible. MapR is unique in snapshot support. And then finally there's disaster recovery: mirroring, where you can go across clusters and mirror what's going on across the WAN, and being able to recover in the case of a disaster where you lose a whole cluster or lose a whole section.

>> And that's not available in other...

>> Those aren't available either.

>> NFS?

>> Snapshots have been on the JIRA list for over five years.

>> Yeah. Okay. So I wonder...

>> If I could finish that, there's a third.
Because I said three and nearly stopped at two. The third is performance and scale.

>> So that'd be...

>> Integration, dependability, and speed.

>> Okay. So dependability covers snapshots and DR. Let's talk about performance, because Google's a big partner of yours, and we just had them on theCUBE at Strata. You have a record-setting benchmark. EMC, take that. Well, you work with EMC. So let's talk about the performance real quick, and then we'll talk about some of the EMC conversations. You have a variety of performance benchmarks, with Google and within the enterprise. Can you talk about those?

>> What we announced this week was the MinuteSort world record. MinuteSort runs across technologies; it's just: how much data can you sort in 60 seconds? The previous record was done in the labs at Microsoft with special-purpose software, and they did 1.4 terabytes. Hadoop hasn't been used for the record since 2009. It's been several years, because Hadoop has features that work against performance, things like checkpointing and logging, because it assumes you've got long-running MapReduce jobs. So we set the record with our distribution of Hadoop, which means we had one hand tied behind our back, given that technology. Secondly, we set it in the cloud, which is the other hand tied behind our back, because it's a virtualized environment.

>> So you set the record with just your legs.

>> And 1.5 terabytes in 60 seconds. Very proud of that.

>> Well, that's interesting, because Dave and I and our teams have been doing a lot of lab testing on cost. And it's an interesting benchmark, because people don't always look at the nuance of the cost when comparing cloud performance versus bare metal. Most people don't factor in the setup cost of deployment.

>> Exactly.
>> So can you just quickly talk about that, and how significant, what order of magnitude, that is for your customers?

>> The previous Hadoop record took 3,400 servers, about 27,000 cores, and almost 14,000 disks, and did 600 gigs, actually a little less than that, at 578. On Google, we did it with 2,100 virtual instances and 8,000 cores, and did 1.5 terabytes.

>> And cost? What does it cost to spin up on Google versus...

>> Basically, if you look at that and conservatively assume $4,000 per server, it's $13.8 million worth of hardware previously. And the cost to do that run on Google was $20.33.

>> Well, you got a discount. I mean, come on, you're a partner. Does it really cost that much? I mean, is that what they would charge for it?

>> Actually, if you look at the actual charges, it would be $1,200.

>> Okay. So it's not six million. So, millions to thousands.

>> Yep.

>> Okay, that's impressive. We'll have to go look at the numbers, like we're going to look at Greenplum's numbers in the next couple of weeks. Let's talk about the Google relationship. What's the update with that?

>> We're very excited about it. We're actually deployed throughout the cloud, and we've got multiple partners. Google's is in limited preview, so we've got a number of customers testing that and doing some really interesting things.

>> We monitor the data center market, obviously, with our proprietary tools that you know about, ViewFinder and CrowdSpots. And the data center vertical is interesting. If you look at the sentiment analysis of the conversation on just the Twitter data, it's Facebook, Apple, these companies. And when we dig into the numbers, it's not so much the companies; it's the fact that their data center operations are being looked at as the leading indicator for where CEOs are going.
So I want to ask you: in your conversations with your customers, what are the conversations around moving to the cloud, and where are they in that transition? Because we hear, "Yeah, we want the cloud for all the benefits you were mentioning." Google and Facebook are the gold standards. Their architecture isn't necessarily a cut-and-paste architecture, but customers see the benefits of what they're doing. So what are your conversations with enterprise customers around cloud architecture, and what other features, besides replication and disaster recovery, are they looking at?

>> Well, it's basically workload driven and dataset driven. For data that's already in the cloud, a natural first step is, "Why don't I do the analysis there as well?" Things like Google Earth and digital advertising data are really interesting candidates for that. Also periodic workloads: if they have workloads that need to spin up and spin down, the cloud works really well for that. And in some cases it's driven by their own environments. They've got data centers that are approaching capacity, they need to do offloads, and they're looking at the cloud as an alternative because it's easy to get up and running quickly.

>> I want to come back to one of your three value props, the dependability piece, and specifically snapshots. Somebody asked me one time, a couple of years ago, "How do you back up a petabyte?" And the answer was, "Well, you don't." So I want to ask you how your customers are protecting their data and what you guys are bringing to the table.

>> Snapshots are not a bolt-on feature. It's a low-level feature based on the underlying data architecture. When we architected that from the beginning, snapshots were a core feature.
And if you use a technique called redirect-on-write, you're not copying the data. So you can do a petabyte snapshot almost instantaneously, because you're just tracking the pointers to the latest blocks that have been written. If the data isn't changing, you can snapshot every minute and not have any additional storage overhead.

>> Right. Okay. And so MapR's technology will allow them to set that, dial it up and dial it down?

>> We support logical volumes, so you can set policies at the volume level. You can say, "This volume is critical data," and then set the policy that critical data gets a snapshot every minute. And then I can change what the definition of critical data is; maybe it's every five minutes, et cetera. So you can set up these different policies on volumes and have snapshots happen independently for each.

>> So you can do that by workload, or dataset, or application, or whatever. It's essentially provided as a service, as opposed to a one-size-fits-all approach.

>> Exactly. And that also corresponds to user access, administrative privileges, and other features and policies within the cluster.

>> How about this whole trend toward bringing SQL into Hadoop? What's your take on that, and what's your angle?

>> Interactive SQL is an important aspect, because you've got so many people in the organization trained on and leveraging SQL. But it's one of many use cases that need to run across a big data platform. There's a range: batch analytics, interactive capabilities with SQL, database operations, NoSQL, search, streaming. All of those are functions that need to run across a platform.
So it's a piece, but it's not the big driver. What we've seen is that there's a higher arrival rate of machine-generated data, and a machine-generated response to that data, for digital advertising, for recommendation engines, for fraud detection, can really move the needle for an organization and produce huge swings in profitability.

>> Move the ball down the field big time.

>> Yeah. And having an interactive piece with a human element involved doesn't really scale and work on a 24-by-7 basis.

>> Jack, final question. We're over now by a minute, but I want to ask one parting question. It's obviously a very competitive landscape right now, and the stakes are higher because the market opportunity is massive. What's MapR's business strategy going forward? No change in direction? Is it going to be same old, same old? Do you guys have any new things coming down, as you see the marketplace?

>> We've got a huge lead when it comes to mission-critical, enterprise-grade features, and our focus is one platform: the ability to support enterprise Hadoop and enterprise HBase, and provide those full capabilities for ease of use, for dependability, and for performance. We've seen a lot of companies test on one distribution and switch to MapR, and we'll continue to help with that in the future.

>> Well, we've been covering this big data space going on four years now, Dave and I, and we've watched all the players pivot a few times. You guys have not. You've been true to your mission from day one, and we know where you stand. Everyone knows where you stand: enterprise grade. It's a good strategy. I think everyone's putting that on their label now. "Enterprise-grade washing," we call it. Congratulations, MapR. This is theCUBE. We'll be right back with our next guest here on day three of wall-to-wall coverage at O'Reilly Media.
We do our news hour next, from 12 to 1. We'll be right back after this short break.
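
Editor's sketch: the redirect-on-write snapshots Jack describes in this interview can be illustrated with a toy model. The code below is not MapR's implementation; the `Volume` class and all of its method names are hypothetical, invented only to show why a snapshot that tracks block pointers copies no data, and why snapshotting unchanged data adds no storage.

```python
class Volume:
    """Toy block store with redirect-on-write snapshots (illustrative only)."""

    def __init__(self):
        self.blocks = {}      # physical block id -> bytes; never overwritten in place
        self.live = {}        # logical block number -> physical block id
        self.snapshots = {}   # snapshot name -> frozen pointer map
        self._next_id = 0

    def write(self, logical, data):
        # Redirect the write to a fresh physical block. The old block is left
        # untouched, so any snapshot still pointing at it stays valid.
        pid = self._next_id
        self._next_id += 1
        self.blocks[pid] = data
        self.live[logical] = pid

    def snapshot(self, name):
        # Only pointers are recorded; no data block is copied.
        # (This toy copies the whole pointer map; a real system would share
        # the pointer structure to make the operation near-instant.)
        self.snapshots[name] = dict(self.live)

    def read(self, logical, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[logical]]

    def physical_block_count(self):
        return len(self.blocks)


vol = Volume()
vol.write(0, b"good data")
vol.snapshot("t0")

# Snapshotting again while nothing has changed stores no new data blocks.
blocks_before = vol.physical_block_count()
vol.snapshot("t1")
assert vol.physical_block_count() == blocks_before

# A later mistake overwrites the live view but not the snapshot.
vol.write(0, b"user error")
print(vol.read(0))                 # live view after the mistake
print(vol.read(0, snapshot="t0"))  # point-in-time view from before it
```

Reading through the snapshot name is what makes the "go back two minutes earlier" recovery possible in this model: replication would have copied the mistake, while the snapshot still points at the old block.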

Published Date : Mar 4 2013



Jack Norris | Strata-Hadoop World 2012

>> Okay, we're back here, live in New York City for big data week. This is SiliconANGLE.tv's exclusive coverage of Strata + Hadoop World, a big event in big data week. We just wrote a blog post on siliconangle.com calling this the South by Southwest for data geeks, and it's my prediction that this is going to turn into quite the geek fest. The crowd here is enormous, it's packed, and it's an amazing event. We're excited. This is siliconangle.com. I'm the founder, John Furrier, and I'm joined by my co-host...

>> Dave Vellante of Wikibon.org, where people go for free research and peers collaborate to solve problems. And we're here with Jack Norris, who's the vice president of marketing at MapR, a company that we've been tracking for quite some time. Jack, welcome back to theCUBE.

>> Thank you, Dave.

>> I've got to hand it to you. We met quite a while ago now, well over a year ago, and we were pushing at you guys about open source, and you said, "Look, we're solving problems for customers. We've got the right model, we think. This is our strategy, and we're sticking to it. Watch what happens." And like I said, I have to hand it to you: you guys really have some great traction in the market, and you're doing what you said. So congratulations on that. I know you've got a lot more work to do, but...

>> Yeah, and actually the topic of openness is a pretty interesting one. If you look at the different options out there, all of them are combining open source with some proprietary pieces. In the case of some distributions it's very small, like an ODBC driver with a proprietary driver. But I think it shows that making any such combined solution more open is important. So what we've done is make innovations, but where we've made those innovations, we've opened them up and provided APIs: NFS for standard access, REST, ODBC drivers, et cetera.

>> So it's a spectrum. We were at Oracle OpenWorld a few weeks ago, and you listen to Larry Ellison talk about the Oracle public cloud, and he makes a very strong case that it's open: you can move data, it's all Java, so it's all about standards. But it's really all about the business value; that's the bottom line. We had your CEO, John Schroeder, on yesterday, and John and I both were very impressed with what he described as your philosophy: we have customers when we announce a product. That's impressive.

>> He was also giving some good feedback to startup entrepreneurs out there; there's obviously a lot of action going on in the startup community. He basically said the same thing: get customers. That's it. Use your tech, but don't be so locked into the tech. Get the customers, understand their needs, and then deliver on them. So you guys have done great. I want to talk about the show here, because you guys have a big booth and a big presence here. What are you learning? How's the positioning, how's the news hitting? Give us a quick update.

>> A lot of news. It started on Tuesday, when we announced the M7 edition. I brought a demo here for you all, because the big thing about M7 is what we don't have. We're not demoing region servers, we're not demoing compactions, and we're not demoing a lot of manual administrative tasks. What that really means is that we took this stack... If you look at HBase, today about half of Hadoop users are adopting HBase.
So there's a lot of momentum in the market, and it's used for everything from real-time analytics to lightweight OLTP processing. But it's an infrastructure that sits on top of a JVM, that stores its data in the Hadoop Distributed File System, that sits on a JVM, that stores its data in a Linux file system, that writes to disk. And so a lot of the complexity is in that stack. As an administrator, you have to worry about how data gets written across it. You've got region servers to keep up. When you're doing writes, you have things called compactions, which increase response time. So it's a complex environment, and we've spent quite a bit of time collapsing that infrastructure. With the M7 edition, you've got files and tables together in the same layer, writing directly to disk. So there are no region servers, no compactions to deal with, no pre-splitting of tables and trying to do manual merges. It just makes it much, much simpler.

>> Let's talk about some of your customers and their profile. I'm assuming, and correct me if I'm wrong, that you're not selling to the tire kickers. You're selling to the guys who actually have some experience with Hadoop and have run into some of the limitations, and you come in and say, "Hey, we can solve some of those problems." Is that right? Can you talk about that a little bit?

>> Fair characterization. I think part of it is that when you're in the evaluation process and you first hear about Hadoop, it's kind of like the Gartner hype curve: this stuff does everything. Of course you've got data protection, because you've got things replicated across the cluster. Of course you've got scalability, because you can just add nodes, and so forth.
Well, once you start using it, you realize that yes, I've got data replicated across the cluster, but if I accidentally delete something, or if I've got some corruption, that's replicated across the cluster too. So things like snapshots are really important, so you can return to what it was five minutes before; performance, where you can get the most out of your hardware; and ease of administration, where I can cut this up into logical volumes and have policies at that volume level instead of on an individual file.

So there's a bunch of features that really resonate with users after they've had some experience, and those tend to be our key customers. There's another phase, too. When you're testing Hadoop, you're looking at what's possible with this platform: what type of analytics can I do? When you go into production, all of a sudden you're looking at: how does this fit in with my SLAs? How does this fit in with my data protection policies? How do I integrate with my different data sources? And can I leverage existing code? We had one customer, a large systems integrator for the federal government. They have a million lines of code that they were told to rewrite to run with other distributions, and that they could use just out of the box with MapR.

>> So let's talk about some of those customers. Can you name some names?

>> Sure. We had a keynote today, and we had this beautiful customer video that had to be cut because of time. It's running in our booth and it's streaming on our website, and I think we've actually got some of it inserted in the bumper here. But I want to give a shout-out to those customers, because they ended up on the cutting-room floor.

>> We're running it here.

>> Yeah.
So one was Rubicon Project, and they're an interesting company. They're a real-time advertising platform and auction network. They recently passed Google in terms of number-one ad reach, as measured by comScore, and there was a lot of press on that. I particularly liked the headline that mentioned those three companies, because it was measured by comScore, and comScore is a customer too, a MapR customer. And Google's a key partner.

And yesterday we announced a world record for the Hadoop TeraSort running on Google. So M7 for Rubicon allows them to address and replace different point solutions that were running alongside Hadoop. It potentially simplifies their architecture, because now they have more things done with a single platform; it increases performance and simplifies administration. Another customer is Ancestry.com; maybe you've seen their ads or heard some of their radio spots. They do a tremendous amount of data processing to help with family services and genealogy, and to figure out family backgrounds. One of the things they do is DNA testing. For an internet service to do that kind of advanced technology is pretty impressive. It's $99, I believe, and they'll send you a DNA kit. You spit in the tube and send it back, and then they process that, match it, and give you insights into your family background. So for them, simplifying HBase meant additional performance, so they could do matches faster, and really simplified administration. In Melinda Graham's words, it's simpler because those components are just not there.

>> Jack, I want to ask you about enterprise-grade Hadoop, and then about Ted Dunning, because he was mentioned by Tim Estes in his keynote speech. So you have some rock stars in the company.
I've seen the management team. We've had your CEO on, and we interviewed M.C. Srivas at Google I/O, where we were on a panel together. So we know your team; it's a solid team. Let's talk about Ted in a minute, but I want to ask you about the enterprise-grade Hadoop conversation. What does that mean now? You guys were very successful at first. Again, we were skeptics at first, but your traction and your performance have proven there's a market for that kind of platform. What does it mean now, at this event today, as this is evolving and the Hadoop ecosystem is not just Hadoop anymore? It's other things.

>> Yeah, there are three dimensions to enterprise grade. The first is ease of use, and ease of use from an administrator's standpoint: how easily does it integrate into an existing environment? How easily does it fit into my IT policies? Do you run in a lights-out data center? Does the Hadoop distribution fit into that? So that's one whole dimension. A key to that is complete NFS support, so it functions like standard storage. A second dimension is dependability and reliability. It's not just, do you have a checkbox HA feature? It's: do you have automated stateful failover? Do you have self-healing? Can you handle multiple failures with automated recovery? In a lights-out data center, can you actually go there once a week and just replace drives? A great example of that is one of our customers who had a test cluster with MapR. It was a POC, and they went on and did other things. They had a power failure, they came back a week later, and the cluster was up and running, and they hadn't done any manual tasks there.
And they were just blown away, compared to the recovery process for the other distributions, which is a long laundry list of...

>> So I've got to ask you...

>> The third one.

>> What's the third one?

>> The third one is performance, and performance is raw speed, but it's also: how do you leverage the infrastructure? Can you take advantage of the network infrastructure, multiple NICs? Can you take advantage of heterogeneous hardware? Can you mix and match for different workloads? It's really about sharing a cluster across different use cases and different users, and there are a lot of features there. It's not just raw speed.

>> So: fitting into the existing IT infrastructure and policies; what happens when something goes wrong, and can you automate that; and then...

>> Easy, dependable, fast. And with M7 it's the same thing: making HBase easy, dependable, and fast.

>> So the talk of the show right now, from the keynote this morning, is that MapR marketing has dropped the "big data" term and is going with "datacosm." Is that true? Joe Hellerstein just had a tweet. Joe's a famous Cal Berkeley computer science professor who's now CEO of a startup, Trifacta, and he's had a couple of epic tweets this week. So shout-out to Joe Hellerstein. His tweet says MapR marketing has decided to drop the term "big data" and go with "datacosm," with a shout-out to George Gilder. So it's kind of intellectual humor. What's your response to that? Is it true? What's happening? You're the VP of marketing.

>> Well, if you look at the big data term, I think there's a lot of "big data washing" going on, where architectures that have been out there for 30 years are suddenly all about big data. So I think there's a need for a more descriptive term.
The purpose of "datacosm" was not to try to coin something or to change the big data label. It was just to get people to take a step back and think, and to realize that we are in a massive paradigm shift. It's a shout-out to George Gilder, acknowledging that he recognized what the impact of making compute available meant, and he recognized with Telecosm what bandwidth would mean. If you look at the combination, we've got all this compute efficiency and bandwidth, and now datacosm is basically taking those resources, unleashing them, and changing the way we do things.

And I think one of the ways to look at that is the new things that will be possible. There's been a lot of focus on SQL interfaces on top of Hadoop, which are important. But I think some of the more interesting use cases are taking this machine-generated data that's being produced very, very rapidly and having automated operational analytics that can respond very quickly to change how you do business: how you're communicating with customers, how you're responding to different risk factors in the environment, for fraud, et cetera, or just improving your response time to cost events.

>> Someone we met earlier called it actionable insight. He said that by assigning intent, you're able to respond. It's interesting that you mention George Gilder, because we like to riff and get into abstract concepts, and he was also very big on supply-side economics. If you look at the business value conversation, one of the things we pointed out yesterday and in this morning's opening review was the top conversations: insight and analytics as the killer app. Right now, the app market has not developed.
And that's why we like companies like Continuuity, and what you guys are doing, under the hood is being worked on right now at many levels, performance being one of those three things, but analytics is a no-brainer, insight, but the other one's business value. So when you look at that kind of datacosm, I can see where you're going with that. >>Um, and that's kind of what people want, because it's not so much like I'm Republican because he's Republican, George Gilder, and he bought American Spectator. Everyone knows that. So obviously he's a Republican, but politics aside, the business side of what big data is implementing is massive. Now, I guess that's a Republican concept. Um, but not really. I mean, business is, uh, all parties. So relative to datacosm, I mean, no one talks about e-business anymore. We were talking to IBM at the IBM conference and they were saying, hey, that was a great marketing campaign, but no one says, hey, uh, are you an e-business today? So we think that big data is going to have the same effect, which is, hey, do you have big data? No, it's just assumed. Yeah. So that's what you're basically trying to establish, that it's not just about big. >>Yeah. Let me give you one small example, um, from a business value standpoint, and, uh, Ted Dunning, you mentioned Ted earlier, chief application architect, um, and one of the coauthors of, uh, the Mahout book, which deals with machine learning, uh, he dealt with one of our large financial services, uh, companies, and, uh, you know, one of the techniques on Hadoop is clustering, uh, you know, k-nearest neighbors, uh, you know, different algorithms. And they looked at a particular process and they sped up that process by 30,000 times. So there's a blog post, uh, that's on our website. You can find out additional information on that. 
And I, >>There's one >>Point on this one point, but I think, you know, to your point about business value and, you know, what does datacosm really mean? That's an incredible speed-up, uh, in terms of performance, and it changes how companies can react in real time. It changes how they can do pattern recognition. And Google did a really interesting paper called The Unreasonable Effectiveness of Data. And in there they say simple algorithms on big data, on massive amounts of data, beat a complex model every time. And so I think what we'll see is a movement away from data sampling and trying to do an 80/20 to looking at all your data and identifying where are the exceptions that we want to increase because they're, you know, revenue exceptions, or that we want to address because it's a cost or a fraud. >>Well, that's what I, I would give a shout out to, uh, the guys at Digital Reasoning. Tim Estes, he's plugged, uh, Ted. He's idolized him in terms of his work. Obviously his work is awesome, but he brought up this concept of the understanding gap, and he showed an interesting chart in his keynote, which was the data explosion, you know, it's up and, you know, straight up, right. It's a massive amount of data, 64% unstructured by his calculation. Then he showed a flat line called attention. So as data's been exploding over time, going up, attention, meaning user attention, is flat, with some uptick maybe. So users and humans, they can't expand their mind fast enough. So machine learning technologies have to bridge that gap. That's analytics, that's insight. >>Yeah. There's a big conversation now going on about more data, better models, people trying to squint through some of the comments that Google made and say, all right, does that mean we just throw out >>The models, and data trumps algorithms, data >>Trumps algorithms, but the question I have is, do you think, and are your customers talking about, okay, well, now they have more data. 
Can I actually develop better algorithms that are simpler? And is it a virtuous cycle? >>Yeah, I think, I mean, uh, there's a lot of debate here, a lot of information, but I think one of the interesting things is, given the compute cycles, given, you know, kind of that compute efficiency that we have, and given the bandwidth, you can take a model and then iterate very quickly on it and kind of arrive at insight. And in the past, it was just that amount of data and that amount of time to process. Okay. That could take you 40 days to get to the point that you can reach now in hours. Right. >>Right. So, I mean, the great example is fraud detection, right? So we used to sample; six months later, hey, your credit card might've been hacked. And now it's, you know, you get a phone call, you know, or you can't use your credit card or whatever it is. And so, uh, but there's still a lot of use cases where, you know, weather is an example, where modeling and better modeling would be very helpful. Uh, excellent. So, um, so datacosm, are you planning other marketing initiatives around that? Or is this sort of tongue-in-cheek fun? Throw it out there, a little red meat, some chum in the waters? >>You know, what really motivated us was, um, you know, theCUBE's here talking, you know, for the whole day; what could we possibly do to help give them a topic of conversation? >>Okay. Datacosm. Now of course, we found that on our proprietary HBase tools. Jack Norris, thanks for coming in. We appreciate your support. You guys have been great. We've been following you and will continue to follow. You've been a great supporter of theCUBE. Want to thank you personally while we're here. Uh, MapR has been a generous underwriter, supportive of our great independent editorial. We want to recognize you guys. Thanks for your support. And we continue to look forward to watching you guys grow and kick ass. So thanks for all your support. 
And we'll be right back with our next guest after this short break. >>Thank you. >>10 years ago, the video news business believed the internet was a fad. The science is settled. We all know the internet is here to stay. Bubbles and busts come and go. But the industry deserves a news team that goes the distance. Coming up on Social Angle are some interesting new metrics for measuring the worth of a customer on the web. Every morning, we're on the air to bring you the most up-to-date information on the tech industry, with scrutiny on releases of the day and news of industry-wide trends. We're here daily with breaking analysis from the best minds in the business. Join me, Kristin Filetti, daily at the news desk on SiliconANGLE TV, your reference point for tech innovation 18 months.

Published Date : Oct 25 2012


ENTITIES

Entity	Category	Confidence
Joe Hellerstein	PERSON	0.99+
George Gilder	PERSON	0.99+
Ted Dunning	PERSON	0.99+
Kristin Filetti	PERSON	0.99+
Joel Hellison	PERSON	0.99+
John Schroeder	PERSON	0.99+
Joe	PERSON	0.99+
Jack	PERSON	0.99+
Larry Ellison	PERSON	0.99+
Jack Norris	PERSON	0.99+
John	PERSON	0.99+
40 days	QUANTITY	0.99+
Melinda Graham	PERSON	0.99+
64%	QUANTITY	0.99+
$99	QUANTITY	0.99+
comScore	ORGANIZATION	0.99+
Tim	PERSON	0.99+
Dave	PERSON	0.99+
Tuesday	DATE	0.99+
Matt BARR	PERSON	0.99+
Hellerstein	PERSON	0.99+
Google	ORGANIZATION	0.99+
George Gilder	PERSON	0.99+
Ted	PERSON	0.99+
John ferry	PERSON	0.99+
30 years	QUANTITY	0.99+
30,000 times	QUANTITY	0.99+
today	DATE	0.99+
IBM	ORGANIZATION	0.99+
a week later	DATE	0.99+
yesterday	DATE	0.99+
two	QUANTITY	0.99+
three companies	QUANTITY	0.99+
Dana	PERSON	0.99+
Tim SDS	PERSON	0.99+
one point	QUANTITY	0.99+
Java	TITLE	0.99+
first	QUANTITY	0.99+
six months later	DATE	0.99+
one	QUANTITY	0.99+
Oracle	ORGANIZATION	0.99+
one customer	QUANTITY	0.99+
Linux	TITLE	0.98+
once a week	QUANTITY	0.98+
18 months	QUANTITY	0.98+
Rubicon	ORGANIZATION	0.98+
HBase	TITLE	0.98+
Kozum	PERSON	0.98+
Gartner	ORGANIZATION	0.98+
this morning	DATE	0.97+
Telekom	ORGANIZATION	0.97+
this week	DATE	0.97+
10 years ago	DATE	0.97+
second dimension	QUANTITY	0.97+
both	QUANTITY	0.97+
Kozum	ORGANIZATION	0.95+
third one	QUANTITY	0.95+
One	QUANTITY	0.94+
three things	QUANTITY	0.94+
a year ago	DATE	0.94+
Hadoop	TITLE	0.93+
siliconangle.com	OTHER	0.93+
Knicks	ORGANIZATION	0.93+
Regents	ORGANIZATION	0.92+

Jack Norris - Strata Conference 2012 - theCUBE

>>Hi everybody. We're back. This is Dave Vellante from Wikibon.org. We're live at Strata in Santa Clara, California. This is SiliconANGLE TV's continuous coverage of the Strata conference. So O'Reilly Media is a great partner of ours. And thanks to them for allowing us to be here. We've been going all week 'cause it's day three for us. I'm here with Jeff Kelly, Wikibon's lead big data analyst. And we're here with Jack Norris, who's the VP of marketing at MapR. Jack, welcome to theCUBE. Thank you, Dave. Thanks very much for coming on. And you know, we've been going all week. You guys are a great sponsor of ours. Thank you for the support. We really appreciate it. How's the show going for you? >>Great. A lot of attention, a lot of focus, a lot of discussion about Hadoop and big data. >>Yeah. So you guys are getting a lot of traffic. I mean, I hear there's 2,500 people here, up from 1,400 last year. So that's >>Yeah, we've had like five, six people deep in the booth. So I think there's a lot of interest. >>You know, when we were here last year, when you looked at the infrastructure and the competitive landscape, there wasn't a lot going on, and in just a very short time, that's completely changed. And you guys have had your hand in that. So that's good. Competition is a good thing, right? And obviously customers want choice, but so we want to talk about that a little bit. We want to talk about MapR, the kind of problems you're solving. So why don't we start there? What is MapR all about? And you've got your own distribution of enterprise Hadoop. You make Hadoop enterprise ready. Let's start there. >>Okay. 
Yeah, I mean, we invested heavily in creating an alternative distribution, one that took the best of the open source community with the best of the MapR innovations, and really it's about making Hadoop more applicable: broader use cases, more mission-critical support, you know, being able to sit in and work in a lights-out data center environment. >>Okay. So what was the problem that you set out to solve? Why do we need another distribution of Hadoop? Let me ask it that way. Get nice and close to. >>So there are some just big issues with Hadoop. >>What are those issues? Let's talk about that. There's >>Some ease of use issues. There's some dependability issues. There's some performance. So, you know, let's take those in order. Right now, if you look at some of the distributions, Apache Hadoop, great technology, but it requires a programmer, right? To get access to the data, it's through the Hadoop API; you can't really see the data. So there's a lot of focus on, you know, what do I do once the data's in there? Opening that up, providing full file-based access, right? So I can look at it and treat it like enterprise storage, see the data, use my standard tools, standard commands, you know, drag and drop from a file browser. You can do that with MapR. You can't do that with other distributions. >>Talking about mounting HDFS as NFS, correct? >>Example. Correct. And then just the underlying storage services. The fact that it's append-only instead of full random read-write, you know, causes some issues. So, you know, that's some of the ease of use features. There's a whole lot we could discuss there. Big picture for reliability, dependability, is there's a single point of failure, multiple single points of failure within Hadoop. So you risk data loss. So people have looked at Hadoop traditionally as batch-oriented, scratchpad, right. We were out to solve that, right? 
We want to make sure that you can use it for mission-critical data, that you don't have a risk of data loss, that you've got full high availability. You've got the full data protection in terms of snapshots and mirroring that you would expect with enterprise products. >>Let's get back to when you guys were, you know, thinking about doing this. I'm not even sure you were at the company at the time, but your DNA was there and you're familiar with it. So you guys saw this big data movement. You saw this Hadoop movement and you said, okay, this is cool. It's going to be big. And it's gonna take a long time for the community to fix all these problems. We can fix them now. Let's go do that. Is that the general discussion? Yeah. >>You know, I think what's different about this: this is the first open source package, the first open source project, that's created a market. If you look at the other open source, you know, Linux, MySQL, et cetera, it was really late in the life cycle of a product. Everyone knew what the features were. It was about, you know, giving an alternative choice, a better Unix. Here, the focus is on innovation, and our founders, you know, have deep enterprise background. Our CTO was at Google in charge of Bigtable, understands MapReduce at scale, spent time as chief software architect at Spinnaker, which was kind of the fastest clustered NAS on the planet. So they recognized that the underlying layers of Hadoop needed some rearchitecture and needed some deep investment, and to do that effectively and do that quickly required a whole lot of focus. And we thought that was the best way to go to market. >>Talk about the early validation from customers. Obviously you guys didn't just do this in a vacuum, I presume. So you went out and talked to some customers. Yeah. >>We had all sorts of conversations with customers while we were in stealth mode. We were probably the loudest stealth >>As you were nodding. 
And I mean, what were they telling you at the time? Yeah, please go do this. >>The, what we addressed weren't secrets. There've been JIRAs open for four or five years on these issues. >>Yeah. But at the same time, Jack, you've got this purist community out there that says, I don't want to rip out HDFS. You know, I want it to be pure. What do you say to those guys? Do you just say, okay, thank you, we understand you're not a prospect? >>And I think that, you know, Hadoop has a huge amount of momentum. And I think a lot of that momentum is that there isn't any risk to adopting Hadoop, right? It's not like the fractured NoSQL market where there's 122 different entrants; which one's going to win? Hadoop's got the ecosystem. So when you say pure, it's about the APIs. It's about making sure that if I create a MapReduce job, it's going to run on Apache, it's going to run on MapR, it's going to run on the other distributions. That's where I think the heat and the focus is. Now, to do that, you also have to have innovation occurring up and down the stack that provides choice and alternatives for. >>So when I'm talking about purists, I agree with you, the whole lock-in thing, which is the elephant in the room here. People will worry about lock-in. >>Pun intended. >>No, no, but good one, good catch. But so you're basically saying, hey, we're no more locked in than Cloudera. Right. I mean, they've got their own >>Actually, I think we're less, because it's so easy to get data in and out with our NFS. There's probably less so. >>So, and I'm gonna come back to that. But so for instance, when I say purists, I mean some users and ISVs, some guys we've had on here. We had Abhi Mehta from Tresata on the other day, for instance; he's one who said, I just don't have time to mess with that stuff and figure out all that API integration. 
I mean, there are people out there that just don't want to go that route. Okay. But you're saying, I'm inferring, there's plenty who do, right? >>And by the API route, I want to make sure I understand what you're saying. You >>Talked about, hey, it's all about the API integration. It's not >>It's about the APIs being consistent, a hundred percent compatible. Right. So if I, you know, write a program that's going after HDFS and the HDFS API, I want to make sure that that'll run on other distributions. Right. >>And that's your promise. Yeah. Okay. All right. So now where I was going with this was, again, there are some purists who say, oh, I just don't want to mess with all that. Now let's talk about what that means, to mess with all that. So comScore was a big, high-profile case study for you guys. They were a Cloudera customer. They basically, my understanding is, in a couple of days migrated from Cloudera to MapR. And the impetus was, let's talk about that. Why'd they do that? >>Performance, data protection, ease of use. >>License fee issues? There were some license issues there as well, right? Your maintenance pricing was more attractive. Is that true? Or >>It was more mainly about price performance and reliability, and, you know, they tested our stuff and it worked real well in a test environment; they put it in the production environment. Didn't actually tell all their users. They had one guy debug the software for half a day because something was wrong: it finished so quickly. >>So it took them a couple of days to migrate and then boom. >>Boom. And they handle about 30 billion objects a day. So, you know, the use of that really high performance support for streaming data flows, you know, they're doing forecasts and insights into web behavior, and, you know, the earlier they can do that, the better off they are. 
So >>Great. >>So talk about the implications of your approach in terms of the customer base. So I'm imagining that your customers are perhaps more advanced than a lot of your typical Hadoop users who are just getting started tinkering with Hadoop. Is it fair to say, you know, your customers know what they want and they want performance and they want it now, and they're a little more advanced than perhaps some of the typical early adopters? >>We've got people who go to our website and download the free version. And some of them are just starting off and getting used to Hadoop, but we did specifically target those very experienced Hadoop users that, you know, were kind of, you know, stubbing their toes on the issues. And so they're very receptive to the message of: we've made it faster, we've made it more reliable, you know, we've added a lot of ease of use to Hadoop. >>So let me interrupt, go back to what I was saying before. I found this comment online from Mike Brown, comScore's CTO, I presume. I mean, he said MapR's Direct Access NFS feature, which exposes Hadoop Distributed File System data as NFS files, can then be easily mounted, modified, or overwritten. So that's a data access simplification. He also said, we could capitalize on the purchase of MapR with an annual maintenance charge versus a yearly cost per node. NFS allowed our enterprise systems to easily access the data in the cluster. So does that make sense to you, that annual maintenance charge versus yearly cost per node? I didn't get that. >>Oh, I think he's talking about, some organizations prefer to do a perpetual license versus a subscription model. That's >>Oh, okay. 
So the traditional way of licensing software. >>And that you have to do it basically reinforces the fact that we've really invested in and have kind of a product, you know, orientation rather than just services on top of some open source. >>Okay. So you go in, you license it and then, yeah, perpetual license. >>Then you can also start with the free edition that does all the performance, NFS support; kick the tires >>Before you buy it. Sorry. Sorry, Jeff. Sorry to interrupt. No, no problem >>At all. So another topic of a lot of interest is security, making Hadoop enterprise ready. One of the pillars there is security, making sure access controls, for instance. Let's talk about how you guys approach that and maybe how you differentiate from some of the other vendors out there, or the other >>Full Kerberos support. We link in to enterprise standards for access, LDAP, et cetera. We leverage the Linux PAM security, and we also provide volume control. So, you know, right now in Apache Hadoop and other distributions, you put policies at the file level or the entire cluster. And we see many organizations having separate physical clusters because of that limitation, right? And we provide volumes. So you can define a volume, and in that volume control access, administrative privileges, data protection class, and, you know, in a sense kind of segregate that content. And that provides a lot of control and a lot more, you know, security and protection and separation of data. >>Is that scenario, the comScore scenario, common, where somebody's moving off an existing distribution onto MapR, or are you more seeing demand from new customers that are saying, hey, what's this big data thing, I really want to get into it? How's it shake out there? >>Right now, there's this huge pent-up demand for these features. And we're seeing a lot of people that have run on other distributions switch to MapR. >>A little bit of everything. 
How about, can you talk a little bit about your channel? Your go-to-market strategy, maybe even some of your ecosystem and partnerships, in the little time. >>Sure. So EMC is a big partner. The EMC Greenplum MR edition is basically MapR; you can start with any of our editions and upgrade to Greenplum with just a license key. That gives us worldwide service and support. It's been a great partnership. >>We hear a lot of proofs of concept out there >>For, yeah. And then it just hit the news today about EMC's distribution, the MR distribution, being available with Cisco's UCS gear. So now that's further expanded the footprint that we have. >>Okay. So there's the EMC relationship. Anything else that you can share with us? >>We have other announcements coming out and >>Then you want to pre-announce on theCUBE? >>Oops. Did I let that slip? >>It's live. So be careful. And so, in terms of your channel strategy, are you guys mostly selling direct, indirect, a combination? >>It's kind of an indirect model through these large partners with a direct assist. >>Yeah. Okay. So you guys come in and help evangelize. Yep. Excellent. All right. Do you have anything else before we gotta roll here? >>Yeah, I did wonder if you could talk a little bit about, you mentioned EMC Greenplum, so there's a lot of talk about the data warehouse market, the MPP data warehouses, versus Hadoop. Based on that relationship, I'm assuming that MapR thinks, well, they're certainly complementary. Can you just touch on that? And, you know, as opposed to some who think, well, Hadoop is going to be the platform where we go, >>Well, there's just, I mean, if you look at the typical organization, they're just really trying to get their, excuse me, their arms around a lot of this machine-generated content, this, you know, unstructured data that's just growing like wildfire. So there's a lot of Hadoop-specific use cases that are being rolled out. 
They're also kind of data lakes, data oceans, whatever you want to call it, large pools where that information is then being extracted and loaded into data warehouses for further analysis. And I think the big pivot there is, if it's well understood what the issue is, you define the schema; then there's a whole host of data warehouse applications out there that can be deployed. But there's many things where you don't really understand that yet; having Hadoop, where you don't need to define a schema, is a big value. >>Jack, I'm sorry. We have to go; we're running a couple of minutes behind. Thank you very much for coming on theCUBE. Great story. Good luck with everything. And it sounds like things are really going well and the market's heating up and you're in the right place at the right time. So thank you again. Thank you to Jeff. And we'll be right back, everybody, to the Strata conference, live in Santa Clara, California, right after this word from our.
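
Jack's closing point, that a data warehouse wants the schema defined up front while Hadoop lets you land data before you fully understand it, is often described as schema-on-write versus schema-on-read. The following is a minimal Python sketch of the schema-on-read side only. The tab-separated log lines, the field names, and the `read_with_schema` helper are hypothetical illustrations, not anything from MapR or from the interview: raw lines are kept untouched, and a schema is applied only when a question is asked.

```python
# Schema-on-read sketch: raw records are kept as-is, and a schema is
# applied only at query time. Field names and sample lines are hypothetical.
RAW_LINES = [
    "2012-02-28T10:00:01\tu1\t/home\t200",
    "2012-02-28T10:00:02\tu2\t/cart\t500",
    "garbled line with no tabs",
    "2012-02-28T10:00:05\tu1\t/cart\t200",
]


def read_with_schema(lines, schema):
    """Yield one dict per line that matches the schema; skip the rest."""
    for line in lines:
        parts = line.split("\t")
        if len(parts) != len(schema):
            continue  # the raw line stays in storage; this query ignores it
        yield {name: cast(value) for (name, cast), value in zip(schema, parts)}


# Two different questions, two different schemas over the same raw data.
STATUS_SCHEMA = [("ts", str), ("user", str), ("path", str), ("status", int)]
VISIT_SCHEMA = [("ts", str), ("user", str), ("path", str), ("status", str)]

# Question 1: which paths returned a server error?
errors = [r["path"] for r in read_with_schema(RAW_LINES, STATUS_SCHEMA)
          if r["status"] >= 500]

# Question 2: how many requests did each user make?
visits_by_user = {}
for r in read_with_schema(RAW_LINES, VISIT_SCHEMA):
    visits_by_user[r["user"]] = visits_by_user.get(r["user"], 0) + 1

print(errors)
print(visits_by_user)
```

The design choice being illustrated is that the malformed line is never rejected at load time; each query decides for itself what counts as a usable record.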

Published Date : Apr 27 2012


ENTITIES

Entity	Category	Confidence
Dave	PERSON	0.99+
Jeff	PERSON	0.99+
Jack Norris	PERSON	0.99+
five	QUANTITY	0.99+
Dave Volante	PERSON	0.99+
Jack	PERSON	0.99+
EMC	ORGANIZATION	0.99+
last year	DATE	0.99+
Matt BARR	PERSON	0.99+
four	QUANTITY	0.99+
UCS	ORGANIZATION	0.99+
2,500 people	QUANTITY	0.99+
Santa Clara, California	LOCATION	0.99+
Greg	PERSON	0.99+
Google	ORGANIZATION	0.99+
Mike Brown	PERSON	0.99+
half a day	QUANTITY	0.99+
Spinnaker	ORGANIZATION	0.99+
Hadoop	TITLE	0.99+
comScore	ORGANIZATION	0.99+
five years	QUANTITY	0.99+
Riley	ORGANIZATION	0.98+
EMC Greenplum	ORGANIZATION	0.98+
Abby Mehta	PERSON	0.98+
Linux	TITLE	0.97+
strata conference	EVENT	0.97+
SQL	TITLE	0.97+
One	QUANTITY	0.97+
one guys	QUANTITY	0.97+
today	DATE	0.97+
Raleigh	ORGANIZATION	0.97+
122 different entrance	QUANTITY	0.97+
six people	QUANTITY	0.97+
Skipio	PERSON	0.96+
Jeff Kelly	PERSON	0.95+
single point	QUANTITY	0.95+
about 30 billion objects a day	QUANTITY	0.94+
Strata Conference 2012	EVENT	0.93+
ECS	ORGANIZATION	0.93+
hundred percent	QUANTITY	0.91+
Triceda	ORGANIZATION	0.9+
Apache	TITLE	0.9+
firs	QUANTITY	0.9+
Paducah	LOCATION	0.89+
Greenplum	ORGANIZATION	0.89+
single points	QUANTITY	0.88+
day three	QUANTITY	0.88+
NFS	TITLE	0.87+
Wiki bond.org	OTHER	0.87+
1400	QUANTITY	0.85+
Unix	TITLE	0.85+
Wiki bonds	ORGANIZATION	0.84+
Silicon angle	ORGANIZATION	0.83+
Mapbox	ORGANIZATION	0.78+
Apache	ORGANIZATION	0.76+
MapReduce	ORGANIZATION	0.75+
Kerberos	ORGANIZATION	0.75+
first open	QUANTITY	0.74+
Pam	TITLE	0.73+
Matt bar	ORGANIZATION	0.73+
Nazanin	ORGANIZATION	0.61+
Cloudera	TITLE	0.59+
moon	LOCATION	0.58+
Cisco	ORGANIZATION	0.54+
one	QUANTITY	0.53+
days	QUANTITY	0.52+
MapReduce	TITLE	0.47+

Frederick Reiss, IBM STC - Big Data SV 2017 - #BigDataSV - #theCUBE

>> Narrator: Live from San Jose, California it's theCUBE, covering Big Data Silicon Valley 2017. (upbeat music) >> Big Data SV 2017, day two of our wall to wall coverage of Strata Hadoop Conference, Big Data SV, really what we call Big Data Week because this is where all the action is going on down in San Jose. We're at the historic Pagoda Lounge in the back of the Fairmont, come on by and say hello, we've got a really cool space and we've never been in this space before, so we're excited to be here. So we got George Gilbert here from Wikibon, we're really excited to have our next guest, he's Fred Reiss, he's the chief architect at IBM Spark Technology Center in San Francisco. Fred, great to see you. >> Thank you, Jeff. >> So I remember when Rob Thomas, we went up and met with him in San Francisco when you guys first opened the Spark Technology Center a couple of years ago now. Give us an update on what's going on there, I know IBM's putting a lot of investment in this Spark Technology Center in the San Francisco office specifically. Give us kind of an update of what's going on. >> That's right, Jeff. Now we're in the new Watson West building in San Francisco at 505 Howard Street, colocated, we have about a 50 person development organization. Right next to us we have about 25 designers and on the same floor a lot of developers from Watson doing a lot of data science, from the Weather Underground, doing weather and data analysis, so it's a really exciting place to be, lots of interesting work in data science going on there. >> And it's really great to see how IBM is taking the core Watson, obviously enabled by Spark and other core open source technology and now applying it, we're seeing Watson for Health, Watson for Autonomous Vehicles, Watson for Marketing, Watson for this, and really bringing that type of machine learning power to all the various verticals in which you guys play. 
>> Absolutely, that's been what Watson has been about from the very beginning, bringing the power of machine learning, the power of artificial intelligence to real world applications. >> Jeff: Excellent. >> So let's tie it back to the Spark community. Most folks understand how Databricks builds out the core or does most of the core work for, like, the SQL workload, the streaming and machine learning, and I guess graph is still immature. We were talking earlier about IBM's contributions in helping to build up the machine learning side. Help us understand what the Databricks core technology for machine learning is and how IBM is building beyond that. >> So the core technology for machine learning in Apache Spark comes out, actually, of the machine learning department at UC Berkeley as well as a lot of different members from the community. Some of those community members also work for Databricks. We actually at the IBM Spark Technology Center have made a number of contributions to the core Apache Spark and the libraries, for example recent contributions in neural nets. In addition to that, we also work on a project called Apache SystemML, which used to be proprietary IBM technology, but the IBM Spark Technology Center has turned SystemML into Apache SystemML, it's now an open Apache incubating project that's been moving forward out in the open. You can now download the latest release online and that provides a piece that we saw was missing from Spark and a lot of other similar environments: an optimizer for machine learning algorithms. So in Spark, you have the Catalyst optimizer for data analysis, DataFrames, SQL, you write your queries in terms of those high level APIs and Catalyst figures out how to make them go fast.
In SystemML, we have an optimizer for high level languages like R and Python where you can write algorithms in terms of linear algebra, in terms of high level operations on matrices and vectors and have the optimizer take care of making those algorithms run in parallel, run at scale, taking account of the data characteristics. Does the data fit in memory? If so, keep it in memory. Does the data not fit in memory? Stream it from disk. >> Okay, so there was a ton of stuff in there. >> Fred: Yep. >> And if I were to refer to that as so densely packed as to be a black hole, that might come across wrong, so I won't refer to that as a black hole. But let's unpack that, so the, and I meant that in a good way, like high bandwidth, you know. >> Fred: Thanks, George. >> Um, so the traditional Spark, the machine learning that comes with Spark's MLlib, one of its distinguishing characteristics is that the models, the algorithms that are in there, have been built to run on a cluster. >> Fred: That's right. >> And very few have, very few others have built machine learning algorithms to run on a cluster, but as you were saying, you don't really have an optimizer for finding something where a couple of the algorithms would be fit optimally to solve a problem. Help us understand, then, how SystemML solves a more general problem for, say, ensemble models and for scale out, I guess I'm, help us understand how SystemML fits relative to Spark's MLlib and the more general problems it can solve. >> So, MLlib and a lot of other packages such as Sparkling Water from H2O, for example, provide you with a toolbox of algorithms and each of those algorithms has been hand tuned for a particular range of problem sizes and problem characteristics. This works great as long as the particular problem you're facing as a data scientist is a good match to that implementation that you have in your toolbox.
What SystemML provides is less like having a toolbox and more like having a machine shop. You can, you have a lot more flexibility, you have a lot more power, you can write down an algorithm as you would write it down if you were implementing it just to run on your laptop and then let the SystemML optimizer take care of producing a parallel version of that algorithm that is customized to the characteristics of your cluster, customized to the characteristics of your data. >> So let me stop you right there, because I want to use an analogy that others might find easy to relate to for all the people who understand SQL and scale out SQL. So, the way you were describing it, it sounds like oh, if I were a SQL developer and I wanted to get at some data on my laptop, I would find it pretty easy to write the SQL to do that. Now, let's say I had a bunch of servers, each with its own database, and I wanted to get data from each database. If I didn't have a scale out database, I would have to figure out physically how to go to each server in the cluster to get it. What I'm hearing for SystemML is it will take that query that I might have written on my one server and it will transparently figure out how to scale that out, although in this case not queries, machine learning algorithms. >> The database analogy is very apt. Just like SQL and query optimization, by allowing you to separate that logical description of what you're looking for from the physical description of how to get at it, lets you have a parallel database with the exact same language as a single machine database. In SystemML, because we have an optimizer that separates that logical description of the machine learning algorithm from the physical implementation, we can target a lot of parallel systems, we can also target a large server and the code, the code that implements the algorithm stays the same. >> Okay, now let's take that a step further.
You refer to matrix math and I think linear algebra and a whole lot of other things that I never quite made it to since I was a humanities major but when we're talking about those things, my understanding is that those are primitives that Spark doesn't really implement so that if you wanted to do neural nets, which rely on some of those constructs for high performance, >> Fred: Yes. >> Then, um, that's not built into Spark. Can you get to that capability using SystemML? >> Yes. SystemML, at its core, provides you with a library, provides you as a user with a library of machine, rather, linear algebra primitives, just like a language like R or a library like NumPy gives you matrices and vectors and all of the operations you can do on top of those primitives. And just to be clear, linear algebra really is the language of machine learning. If you pick up a paper about an advanced machine learning algorithm, chances are the specification for what that algorithm does and how that algorithm works is going to be written in the paper literally in linear algebra and the implementation that was used in that paper is probably written in a language where linear algebra is built in, like R, like NumPy. >> So it sounds to me like Spark has done the work of sort of the blocking and tackling of machine learning to run in parallel. And that's I mean, to be clear, since we haven't really talked about it, that's important when you're handling data at scale and you want to train, you know, models on very, very large data sets. But it sounds like when we want to go to some of the more advanced machine learning capabilities, the ones that today are making all the noise with, you know, speech to text, text to speech, natural language understanding, those neural network based capabilities are not built into the core Spark MLlib, that, would it be fair to say you could start getting at them through SystemML?
>> Yes, SystemML is a much better way to do scalable linear algebra on top of Spark than the very limited linear algebra that's built into Spark. >> So alright, let's take the next step. Can SystemML be grafted onto Spark in some way or would it have to be in an entirely new API that doesn't take, integrate with all the other Spark APIs? In a way, that has differentiated Spark, where each API is sort of accessible from every other. Can you tie SystemML in or do the Spark guys have to build more primitives into their own sort of engine first? >> A lot of the work that we've done with the Spark Technology Center as part of bringing SystemML into the Apache ecosystem has been to build a nice, tight integration with Apache Spark so you can pass Spark DataFrames directly into SystemML and you can get DataFrames back. Your SystemML algorithm, once you've written it in terms of one of SystemML's domain-specific languages, just plugs into Spark like all the algorithms that are built into Spark. >> Okay, so that's, that would keep Spark competitive with more advanced machine learning frameworks for a longer period of time, in other words, it wouldn't hit the wall the way it would if it encountered TensorFlow from Google for Google's way of doing deep learning, Spark wouldn't hit the wall once it needed, like, a TensorFlow as long as it had SystemML so deeply integrated the way you're doing it. >> Right, with a system like SystemML, you can quickly move into new domains of machine learning.
So for example, this afternoon I'm going to give a talk with one of our machine learning developers, Mike Dusenberry, about our recent efforts to implement deep learning in SystemML, like full scale, convolutional neural nets running on a cluster in parallel processing many gigabytes of images, and we implemented that with very little effort because we have this optimizer underneath that takes care of a lot of the details of how you get that data into the processing, how you get the data spread across the cluster, how you get the processing moved to the data or vice versa. All those decisions are taken care of in the optimizer, you just write down the linear algebra parts and let the system take care of it. That let us implement deep learning much more quickly than we would have if we had done it from scratch. >> So it's just this ongoing cadence of basically removing the infrastructure management from the data scientists and enabling them to concentrate really where their value is, on the algorithms themselves, so they don't have to worry about how many clusters it's running on, and that configuration, kind of typical DevOps that we see on the regular development side, but now you're really bringing that into the machine learning space. >> That's right, Jeff. Personally, I find all the minutiae of making a parallel algorithm work really fascinating but a lot of people working in data science really see parallelism as a tool. They want to solve the data science problem and SystemML lets you focus on solving the data science problem because the system takes care of the parallelism. >> You guys could go on in the weeds for probably three hours but we don't have enough coffee and we're going to set up a follow up time because you're both in San Francisco.
But before we let you go, Fred, as you look forward into 2017, kind of the advances that you guys have done there at the IBM Spark Center in the city, what's kind of the next couple great hurdles that you're looking to cross, new challenges that are getting you up every morning that you're excited to come back a year from now and be able to say wow, these are the one or two things that we were able to take down in 2017? >> We're moving forward on several different fronts this year. On one front, we're helping to get the notebook experience with Spark notebooks consistent across the entire IBM product portfolio. We helped a lot with the rollout of notebooks on Data Science Experience on z, for example, and we're working actively with the Data Science Experience and with the Watson Data Platform. On the other hand, we're contributing to Spark 2.2. There are some exciting features, particularly in SQL, that we're hoping to get into that release as well as some new improvements to MLlib. We're moving forward with Apache SystemML, we just cut Version 0.13 of that. We're talking right now on the mailing list about getting SystemML out of incubation, making it a full, top level project. And we're also continuing to help with the adoption of Apache Spark technology in the enterprise. Our latest focus has been on deep learning on Spark. >> Well, I think we found him! Smartest guy in the room. (laughter) Thanks for stopping by and good luck on your talk this afternoon. >> Thank you, Jeff. >> Absolutely. Alright, he's Fred Reiss, he's George Gilbert, and I'm Jeff Frick, you're watching the Cube from Big Data SV, part of Big Data Week in San Jose, California. (upbeat music) (mellow music) >> Hi, I'm John Furrier, the cofounder of SiliconANGLE Media, cohost of the Cube. I've been in the tech business since I was 19, first programming on minicomputers.
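
As a rough illustration of the optimizer idea Fred describes in this interview (keep the data in memory if it fits, otherwise stream it, while the code that states the algorithm stays the same), here is a minimal Python sketch. It is not SystemML code and does not use any SystemML API: the function names, the 8-bytes-per-value size estimate, and the memory budget parameter are all assumptions made up for the example.

```python
from typing import Callable, Iterator, List

Row = List[float]
BYTES_PER_VALUE = 8  # assumption: each value is stored as a 64-bit float


def choose_plan(n_rows: int, n_cols: int, memory_budget_bytes: int) -> str:
    """Pick a physical plan from the data's size, not from the algorithm."""
    estimated_bytes = n_rows * n_cols * BYTES_PER_VALUE
    return "in_memory" if estimated_bytes <= memory_budget_bytes else "streaming"


def column_means_in_memory(rows: List[Row]) -> Row:
    """Plan A: the whole (non-empty) matrix is materialised as a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def column_means_streaming(rows: Iterator[Row]) -> Row:
    """Plan B: a single pass that holds one row at a time (non-empty input)."""
    totals: Row = []
    count = 0
    for row in rows:
        if not totals:
            totals = [0.0] * len(row)
        for j, value in enumerate(row):
            totals[j] += value
        count += 1
    return [t / count for t in totals]


def column_means(read_rows: Callable[[], Iterator[Row]], n_rows: int,
                 n_cols: int, memory_budget_bytes: int) -> Row:
    """The 'logical' operation: callers never say how it should run."""
    if choose_plan(n_rows, n_cols, memory_budget_bytes) == "in_memory":
        return column_means_in_memory(list(read_rows()))
    return column_means_streaming(read_rows())


def read_rows() -> Iterator[Row]:
    """Hypothetical stand-in for reading a 3 x 2 matrix from storage."""
    yield from ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0])


# Same call, two different budgets: only the chosen plan changes.
small_budget = column_means(read_rows, 3, 2, memory_budget_bytes=16)
large_budget = column_means(read_rows, 3, 2, memory_budget_bytes=1024)
```

The caller's code is identical in both calls; only the budget, and therefore the plan picked by `choose_plan`, differs. A real optimizer would weigh far more than a single size estimate, so treat this only as a picture of the logical/physical split discussed above.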

Published Date : Mar 15 2017

SUMMARY :

it's the Cube, covering Big Data Silicon Valley 2017. in the back of the Fairmont, come on by and say hello, in the San Francisco office specifically. and on the same floor a lot of developers from Watson to all the various verticals in which you guys play. of machine learning, the power of artificial intelligence or does most of the core work for, like, the SQL workload and have the optimizer take care of making those algorithms and I meant that in a good way, is that the models, the algorithms that are in there, and the more general problems it can solve. to that implementation that you have in your toolbox. in the cluster to get it. and the code, the code that implements the algorithm so that if you wanted to do neural nets, Can you get to that capability using SystemML? and all of the operations you can do the ones that today are making all the noise with, you know, linear algebra on top of Spark than the very limited So alright, let's take the next step. SystemML into the Apache ecosystem has been to build so deeply integrated the way you're doing it. and let the system take care of it. is on the algorithms themselves, so they don't have to worry because the system takes care of the parallelism. into 2017, kind of the advances that you guys have done of Apache Spark technology in the enterprise. Smartest guy in the room. and I'm Jeff Frick, you're watching the Cube cohost of the Cube.

ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
Jeff Rick	PERSON	0.99+
George	PERSON	0.99+
Jeff	PERSON	0.99+
Fred Rice	PERSON	0.99+
Mike Dusenberry	PERSON	0.99+
IBM	ORGANIZATION	0.99+
2017	DATE	0.99+
San Francisco	LOCATION	0.99+
John Furrier	PERSON	0.99+
San Jose	LOCATION	0.99+
Rob Thomas	PERSON	0.99+
505 Howard Street	LOCATION	0.99+
Google	ORGANIZATION	0.99+
Frederick Reiss	PERSON	0.99+
Spark Technology Center	ORGANIZATION	0.99+
Fred	PERSON	0.99+
IBM Spark Technology Center	ORGANIZATION	0.99+
one	QUANTITY	0.99+
San Jose, California	LOCATION	0.99+
Spark 2.2	TITLE	0.99+
three hours	QUANTITY	0.99+
Watson	ORGANIZATION	0.99+
UC Berkeley	ORGANIZATION	0.99+
one server	QUANTITY	0.99+
Spark	TITLE	0.99+
SiliconANGLE Media	ORGANIZATION	0.99+
Python	TITLE	0.99+
each server	QUANTITY	0.99+
both	QUANTITY	0.99+
each	QUANTITY	0.99+
each database	QUANTITY	0.98+
Big Data Week	EVENT	0.98+
Pagoda Lounge	LOCATION	0.98+
Strata Hadoob Conference	EVENT	0.98+
System ML	TITLE	0.98+
Big Data SV	EVENT	0.97+
each API	QUANTITY	0.97+
ML Live	TITLE	0.96+
today	DATE	0.96+
Thomas Vehicles	ORGANIZATION	0.96+
Apache System ML	TITLE	0.95+
Big Data	EVENT	0.95+
Apache Spark	TITLE	0.94+
Watson for Marketing	ORGANIZATION	0.94+
Sparking Water	TITLE	0.94+
first	QUANTITY	0.94+
one front	QUANTITY	0.94+
Big Data SV 2016	EVENT	0.94+
IBM Spark Technology Center	ORGANIZATION	0.94+
about 25 designers	QUANTITY	0.93+

Ben Sharma, Tony Fisher, Zaloni - BigData SV 2017 - #BigDataSV - #theCUBE

>> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. (rhythmic music) >> Hey, welcome back, everyone. We're live in Silicon Valley for Big Data SV, Big Data Silicon Valley in conjunction with Strata + Hadoop. This is the week where it all happens in Silicon Valley around the emergence of Big Data as it goes to the next level. The Cube is actually on the ground covering it like a blanket. I'm John Furrier. My cohost, George Gilbert with Wikibon. And our next guest, we have two executives from Zaloni, Ben Sharma, who's the founder and CEO, and Tony Fisher, SVP of strategy. Guys, welcome back to The Cube. Good to see you. >> Thank you for having us back. >> You guys are great guests. You were in New York for Big Data NYC, and a lot is going on, certainly, here, and it's just getting kicked off with Strata + Hadoop, they got the sessions today, but you guys have already got some news out there. Give us the update. What's the big discussion at the show? >> So yeah, 2016 was a great year for us. A lot of growth. We tripled our customer base, and a lot of interest in data lake, as customers are going from say pilots and POCs into production implementation so far though. And in conjunction with that, this week we launched what we call a solution named Data Lake in a Box, appropriately, right? So what that means is we're bringing the full stack together to customers, so that we can get a data lake up and running in an eight-week time frame, with enterprise-grade data ingestion from their source systems hydrated into the data lake and ready for analytics. >> So is it a pretty big box, and is it waterproof? (all laughing) I mean, this is the big discussion now, pun intended. But the data lake is evolving, so I wanted to get your take on it. This is kind of been a theme that's been leading up and now front and center here on The Cube.
Already the data lake has changed, also we've heard, I think Dave Vellante in New York said data swamp. But using the data is critical on a data lake. So as it goes to a more mature model of leveraging the data, what are the key trends right now? What are you guys seeing? Because this is a hot topic that everyone is talking about. >> Well, that's a good distinction that we like to make, is the difference between a data swamp and a data lake. >> And a data lake is much more governed. It has the rigor, it has the automation, it has a lot of the concepts that people are used to from traditional architectures, only we apply them in the scale-out architecture. So we put together a maturity model that really maps out a customer's journey throughout the big data and the data lake experience. And each phase of this, we can see what the customer's doing, what their trends are and where they want to go, and we can advise them on the right way to move forward. And so a lot of the customers we see are kind of in what we call the ignore stage. I'd say most of the people we talk to are just ignoring. They don't have things active, but they're doing a lot of research. They're trying to figure out what's next. And we want to move them from there. The next stage up is called store. And store is basically just the sandbox environment. "I'm going to stick stuff in there. I'm going to hope something comes out of it." No collaboration. But then, moving forward, there's the managed phase, the automated phase, and the optimized phase. And our goal is to move them up into those phases as quickly as possible. And Data Lake in a Box is an effort to do that, to leapfrog them into a managed data lake environment. >> So that's kind of where the swamp analogy comes in, because the data lake, the swamp is kind of dirty, where you can almost think, "Okay, the first step is store it."
And then they get busy or they try to figure out how to operationalize it, and then it's kind of like, "Uh ..." So your point, they're trying to get to that. So you guys get 'em to that set up, and then move them quickly to value? Is that kind of the approach? >> Yeah. So, time to value is critical, right? So how do you reduce the time to insight from the time the data is produced by the data producer, till the time you can make the data available to the data consumer for analytics and downstream use cases. So that's kind of our core focus in bringing these solutions to the market. >> Dave Vellante and I were talking, and George always talks about the value of data at the right time at the right place, is the critical linchpin for the value, whether it's an app-driven, or whatever. So the data lake, you never know what data in the data lake will need to be pulled out and put into either real time or an app. So you have to assume at any given moment there's going to be data value. >> Sure. >> So that, conceptually, people can get that. But how do you make that happen? Because that's a really hard problem. How do you guys tackle that when a customer says, "Hey, I want to do the data lake. I've got to have the coverage. I got to know who's accessing stuff. But at the end of the day, I got to move the data to where it's valuable." >> Sure. So the approach we have taken is with an integrated platform with a common metadata layer. Metadata is the key. So, using this common metadata layer, being able to do managed ingestion from various different sources, being able to do data validation and data quality, being able to manage the life cycle of the data, being able to generate these insights about the data itself, so that you can use that effectively for data science or for downstream applications and use cases is critical based on our experience of taking these applications from, say, a POC pilot phase into a production phase.
>> And what's the next step, once you guys get to that point with the metadata? Because, like, I get that, it's like everyone's got the metadata focus. Now, I'm the data engineer, the data NG or the geek, the supergeek and then you've got the data science, then the analysts, then there will probably be a new category, a bot or something AI will do something. But you can have a spectrum of applications on the data side. How do they get access to the metadata? Is it through the machine learning? Do you guys have anything unique there that makes that seamless or is that the end goal? >> Sure, do you want to take that? >> Yes sure, it's a multi-pronged answer, but I'll start and you can jump in. One of the things we provide as part of our overall platform is a product called Mica. And Mica is really the kind of on-ramp to the data. And all those people that you just named, we love them all, but their access to the data is through a self-service data preparation product, and key to that is the metadata repository. So, all the metadata is out there; we call it a catalog at that point, and so they can go in, look at the catalog, get a sense for the data, get an understanding for the form and function of the data, see who uses it, see where it's used, and determine if that's the data that they want. If it is, they have the ability to refine it further, or they can put it in a shopping cart. If they have access to it, they can get it immediately and refine it; if they don't have access to it, there's an automatic request so that they can get access to it. And so it's an on-ramp concept, of having a card catalog of all the information that's out there, how it's being used, how it's been refined, to allow the end user to make sure that they've got the right data, so they can be positioned for their ultimate application.
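
The catalog and shopping cart flow Tony describes here (browse the metadata, add a data set to the cart if you have access, otherwise trigger an automatic access request) might look roughly like the Python sketch below. This is not Zaloni's product or API: every class, field, and data set name is hypothetical and exists only to make the flow concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DatasetEntry:
    """Hypothetical catalog record: metadata about a data set, not the data."""
    name: str
    description: str
    allowed_users: Set[str] = field(default_factory=set)


@dataclass
class Catalog:
    """Toy metadata catalog with a shopping-cart style access check."""
    entries: Dict[str, DatasetEntry] = field(default_factory=dict)
    access_requests: List[Tuple[str, str]] = field(default_factory=list)

    def register(self, entry: DatasetEntry) -> None:
        self.entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return names of data sets whose name or description mentions keyword."""
        needle = keyword.lower()
        return sorted(
            name for name, entry in self.entries.items()
            if needle in name.lower() or needle in entry.description.lower()
        )

    def add_to_cart(self, user: str, name: str, cart: List[str]) -> bool:
        """Add the data set if the user has access; otherwise file a request."""
        if user in self.entries[name].allowed_users:
            cart.append(name)
            return True
        self.access_requests.append((user, name))
        return False


catalog = Catalog()
catalog.register(DatasetEntry("sales_2016", "Daily sales extracts", {"alice"}))
catalog.register(DatasetEntry("claims", "Insurance claims, restricted", set()))

cart: List[str] = []
granted = catalog.add_to_cart("alice", "sales_2016", cart)  # access exists
pending = catalog.add_to_cart("alice", "claims", cart)      # request is filed
```

The point of the sketch is that the user only ever talks to the catalog: access decisions and requests are driven by metadata, and the data itself is never touched until the cart is provisioned.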
>> And just to add to what Tony said, because we are using this common metadata layer, and capturing metadata at every instance, if you will, we are serving it up to the data consumers, using a rich catalog, so that a lot of our enterprise customers are now starting to create what they consider a data marketplace or a data portal within their organization, so that they're able to catalog not just the data that's in the data lake, but also data that's in other data stores. And provide one single unified view of these data sets, so that your data scientists can come in and see is this a data set that I can use for my model building? What are the different attributes of this data set? What is the quality of the data? How fresh is the data? And those kind of traits, so that they are effective in their analytical journey. >> I think that's the key thing that's interesting to me, is that you're seeing the big data explosions over the past ten years, eight years, we've been covering The Cube since the Hadoop world started. But now, it's the data set world, so it's a big data set in this market. The data sets are the key because that's what data scientists want to wrangle around with, and sling data sets with whatever tooling they want to use. Is that kind of the same trend that you guys see? >> That's correct. And also what we're seeing in the marketplace, is that customers are moving from a single architecture to a distributed architecture, where they may have a hybrid environment with some things being instantiated in the Cloud, some things being on-prem. So how do you now provide a unified interface across these multiple environments, and in a governed way, so that the right people have access to the right data, and it's not the data swamp. >> Okay, so let's go back to the maturity model because I like that framework. So now you've just complicated the heck out of it.
'Cause now you've got Cloud, and then on-prem, and then now, how do you put that prism of maturity model, on now hybrid, so how does that cross-connect there? And a second follow-up to that is, where are the customers on this progress bar? I'm sure they're different by customer but, so, maturity model to the hybrid, and then trends in the customer base that you're seeing? >> Alright, I'll take the second one, and then you can take the first one, okay? So, the vast majority of the people that we work with, and the people, the prospects, customers, analysts we've talked to, other industry dignitaries, they put the vast majority of the customers in the ignore stage. Really just doing their research. So a good 50% plus of most organizations are still in that stage. And then, the data swamp environment, that I'm using it to store stuff, hopefully I'll get something good out of it. That's another 25% of the population. And so, most of the customers are there, and we're trying to move them kind of rapidly up and into a managed and automated data lake environment. The other trend along these lines that we're seeing, that's pretty interesting, is the emergence of IT in the big data world. It used to be a business user's world, and business users built these sandboxes, and business users did what they wanted to. But now, we see organizations that are really starting to bring IT into the fold, because they need the governance, they need the automation, they need the type of rigor that they're used to in other data environments, and that has been lacking in the big data environment. >> And you've got the IoT code cracking the code on the IoT side which has created another dimension of complexity. On the numbers of the 50% that ignore, is that profile more for Fortune 1000? >> It's larger companies, it's Fortune, and Global 2000.
>> Got it, okay, and in terms of the hybrid maturity model, how's that, and add a third dimension, IoT, we've got a multi-dimensional chess game going here. >> I think the way we think about it is, that there are different patterns of data sets coming in. So they could be batched, they could be files, or database extracts, or they could be streams, right? So as long as you think about a converged architecture that can handle these different patterns, then you can map different use cases whether they are IoT and streaming use cases versus what we are seeing is that a lot of companies are trying to replace their operational analytics platforms with a data lake environment, and they're building their operational analytics on top of the data lake, correct? So you need to think more from an abstraction layer, how do you abstract it out? Because one of the challenges that we see customers facing, is that they don't want to get sticky with one Cloud service provider because they may have multiple Cloud service providers, >> John: It's a multi-Cloud world right now. >> So how do you leverage that, where you have one Cloud service provider in one geo, another Cloud service provider in another geo, and still being able to have an abstraction layer on top of it, so that you're building applications? >> So do you guys provide that data layer across that abstraction? >> That is correct, yes, so we leverage the ecosystem, but what we do is add the data management and data governance layer, we provide that abstraction, so that you can be on-prem, you can be in Cloud service provider one, or Cloud service provider two. You still have the same controls, and same governance functions as you build your data lake environment. >> And this is consistent with some of the Cube interviews we had all day today, and other Cube interviews, where when you had the Cloud, you're renting basically, but you own your data. You get to have a nice ...
And that metadata seems to be the key, that's the key, right? For everything. >> That's right. And now what we're seeing is that a lot of our Enterprise customers are looking at bringing in some of the public cloud infrastructure into their on-prem environment as they are going to be available in appliances and things like that, right? So how do you then make sure that whatever you're doing in a non-enterprise cloud environment you are also able to extend it to the enterprise-- >> And the consequences to the enterprise is that the enterprise multiple jobs, if they don't have a consistent data layer ... >> Sure, yeah. >> It's just more redundancy. >> Exactly. >> Not redundancy, duplication actually. >> Yeah, duplication and difficulty of rationalizing it together. >> So let me drill down into a little more detail on the transition between these sort of maturity phases? And then the movement into production apps. I'm curious to know, we've heard Tableau, Excel, Power BI, Qlik I guess, being-- sort of adapting to being front ends to big data. But they don't, for their experience to work they can't really handle big data sets. So you need the MPP SQL database on the data lake. And I guess the question there is is there value to be gotten or measurable value to be gotten just from turning the data lake into you know, interactive BI kind of platform? And sort of as the first step along that maturity model. >> One of the patterns we were seeing is that the serving layer is becoming more and more mature in the data lake, so that earlier it used to be mainly batch type of workloads. Now, with MPP engines running on the data lake itself, you are able to connect your existing BI applications, whether it's Tableau, Qlik, Power BI, and others, to these engines so that you are able to get low-latency query response times and are able to slice-and-dice your data sets in the data lake itself. >> But you're essentially still, you have to sample the data.
You can't handle the full data set unless you're working with something like Zoomdata. >> Yeah, so there are physical limitations obviously. And then there is also this next generation of BI tools which work in a converged manner in the data lake itself. So there's like Zoomdata, Arcadia, and others that are able to kind of run inside the data lake itself instead of you having to have an external environment like the other BI tools, so we see that as a pattern. But if you already are an enterprise, you have on board a BI platform, how do you leverage that with the data lake as part of the next-generation architecture is a key trend that we are seeing. >> So that your metadata helps make that from swamp to curated data lake. >> That's right, and not only that what we have done, as Tony was mentioning, in our Mica product we have a self-service catalog and then we provide a shopping cart experience where you can actually source data sets into the shopping cart, and we let them provision a sandbox. And when they provision the sandbox, they can actually launch Tableau or whatever the BI tool of choice is on that sandbox, so that they can actually-- and that sandbox could exist in the data lake or it could exist on a relational data store or an MPP data store that's outside of the data lake. That's part of your modern data architecture. >> But further to your point, if people have to throw out all of their decision support applications and their BI applications in order to change their data infrastructure, they're not going to do it. >> Understood. >> So you have to make that environment work and that's what Ben's referring to with a lot of the new accelerator tools and things that will sit on top of the data lake. >> Guys, thanks so much for coming on The Cube. Really appreciate it. I'll give you guys the final word in the segment ... What do you expect this week? I mean, obviously, we've been seeing the consolidation.
You're starting to see the swim lanes with Spark and open source, and you see the cloud and IoT colliding, there's a huge intersection with deep learning, AI is certainly hyped up now beyond all recognition but it's essentially deep learning. Neural networks meets machine learning. That's been around before, but now freely available with cloud and compute. And so kind of an interesting dynamic that's rockin' the big data world. Your thoughts on what we're going to see this week and how that relates to the industry? >> I'll take a stab at it and you may feel free to jump in. I think what we'll see is that a lot of customers that have been playing with big data for a couple of years are now getting to a point where what worked for one or two use cases now needs to be scaled out and provided at an enterprise scale. So they're looking at a managed and a governance layer to put on top of the platform. So they can enable machine learning and AI and all those use cases, because business is asking for them. Right? Business is asking for how they can bring in TensorFlow and run on the data lake itself, right? So we see those kind of requirements coming up more and more frequently. >> Awesome. Tony? >> What he said. >> And enterprise readiness certainly has to be table-- there's a lot of table stakes in the enterprise. It's not, like, easy to get into, you can see Google kind of just putting their toe in the water with the Google Cloud, TensorFlow, great highlight, they got Spanner, so all these other things like latency rearing their heads again. So these are all kind of table stakes. >> Yeah, and the other thing, moving forward with respect to machine learning and some of the advanced algorithms, what we're doing now and some of the research we're doing is actually using machine learning to manage the data lake, which is a new concept, so when we get to the optimized phase of our maturity model, a lot of that has to do with self-correcting and self-automating.
>> I need some machine learning and some AI, so does George, and we need machine learning to watch the machine learn, and then algorithmists for algorithms. It's a crazy world, exciting time for us. >> Are we going to have a bot next time when we come here? (all laughing) >> We're going to chat off of Messenger, we just came from South by Southwest. Guys, thanks for coming on The Cube. Great insight and congratulations on the continued momentum. This is The Cube breakin' it down with experts, CEOs, entrepreneurs, all here inside The Cube. Big Data SV, I'm John Furrier with George Gilbert. We'll be back after this short break. Thanks! (upbeat electronic music)

Published Date : Mar 14 2017
