Niall Fitzgerald, Spark NZ | Red Hat Summit 2019
>> Man: Live from Boston, Massachusetts, it's theCUBE, covering Red Hat Summit 2019. Brought to you by Red Hat. >> And we are back live in Boston as we continue our coverage here on theCUBE of Red Hat Summit 2019. It is our sixth year here at the show and this year obviously some huge announcements. A significant moment it's been for Red Hat, we heard from Jim Whitehurst a little bit ago. Stu Miniman, John Walls, we're now joined as well by Niall Fitzgerald, who is the GM of IT Application Architecture and Design at Spark NZ. Niall, good afternoon, or I guess good morning still, we're in an Eastern time zone. >> Yeah, it's the middle of the night in New Zealand I'd say. >> Yeah, so Spark NZ, New Zealand. Tell us a little bit first off about Spark NZ, what the folks back home are doing right now, work-wise, and your role with the company. >> Yeah, so Spark is the largest provider of telecommunication services in New Zealand. All the traditional type of services you'd expect, mobile, broadband, et cetera. We came out of the traditional kind of post office, so we've a lot of heritage, and about four years ago we rebranded from Telecom New Zealand into Spark, to represent that we were changing from being a telco into a much broader range of digital services. Our purpose is to help all New Zealanders win big in the digital world. >> Niall, step back for a second. Talk to our audience that might not know the telecom industry as well as you. I've been an observer and participant in the industry, but you know, back in the dot com boom it was like limitless bandwidth and we're gonna do all these wonderful things, and cloud and digitization have put some new opportunities as well as stresses and strains on your industry, so, you know, what's going on, and you said you rebranded? >> Yeah, look, I think it's well-known it's been a tough last few years for most telcos in the world.
I was listening to Red Hat talking yesterday about 60 consecutive quarters or more of growth, and I don't think there's any telco in the world that probably has the same story. Like most, we're facing kinda decline in all the traditional revenues like voice and text and things like that, so we're all having to kinda rebrand ourselves and deliver much higher levels of customer service. People expect the same levels of service from us that they do from Amazon, Google, and everyone else. In Spark what that means to us is we've moved into lots of new things as you said, things like ICT, we're now very big in cloud, we've recently launched a Spark Sports brand and we've got streaming rights to key events like Formula 1. We're going to stream the Rugby World Cup, which is a massive event for New Zealanders, so looking forward to seeing Ireland and the All Blacks in the final in September this year. So yeah, a lot going on. Tough times, but forcing us to keep changing every year. >> And so, about these changes that you're making, whether technologically based, let's just deal with that. What is that ultimately going to do for you in terms of better customer service delivery? So, you've got inherent challenges, you've talked about them all, that the world's changing, how we use this medium, this communication opportunity is changing, and you've been just a little behind the wave, hard to keep up with it, so rapidly changing. How much of a challenge is that? And then how are you going to address this going forward? How do you stay relevant? >> Yeah, I think we're lucky in one regard because if I look back about five, seven years ago we were like most traditional telcos. We had a spaghetti, for want of a better description, of systems, and then we had multiples of everything; at the time we had 19 integration layers and 10 billing systems, and it wasn't uncommon.
But way back in 2012 we actually embarked on a massive transformation program, and we spent five years consolidating all of that infrastructure, so going into about 2017 we were very lucky in that we had a massive foundation laid already. So what that then enabled us to do was to actually push away calls from our contact centers into mobile apps, into digital adoption. We've been a big embracer of things like big data and robotic process automation as well to try and take cost out of our industry. So, I think we're quite well placed. Now that allows us to do things like innovate new products for our customers, so we bundle things like Spotify and Netflix. It allows us to introduce things like the Spark Sports brand, which we couldn't have done five years ago before the transformation. We just wouldn't have been able to enable these things with our existing kind of legacy IT estate. >> So how's open-source play into all this for you? >> Yeah, open-source, I suppose our first foray into open-source was when we went to start embracing big data and automation. So we started using things like Hadoop and various other things, and our entire platform is based around open-source. We changed to an IMS network recently and we started embracing things like OpenStack, and then it really took us to a new level recently when we started working on Red Hat's Fuse and OpenShift, and we started implementing that. >> Okay, so the OpenStack show for many years, the last few years we saw the telcos coming in specifically for network function virtualization or NFV. Is that what you're using in that space? >> Yeah, we are. Interestingly, at this conference I've heard a lot of people talk about OpenShift and OpenStack, obviously, particularly in the telco game. We actually came at it a bit differently, from the application space. So we had an integration platform that we had put in through this transformation phase which had served us well, and was connecting all of our 40, 50 systems together.
But it was coming up to a life cycle event, and we decided we'd look externally and see had we options beyond just upgrading it. So we started looking around, and we effectively found Fuse, and in bringing in Fuse we then brought OpenShift in, which is quite different to what I've seen from a number of other people; they're bringing in things like OpenShift and building on top of it. We did it the other way around, you know? And we did it primarily for cost reasons, you know? >> Yeah, so talk a little bit about that impact of Fuse and OpenShift, what that means. Were you already down the containerization journey, or did that help drive >> Niall: No, no some of that modernization? >> That's exactly what happened. If I'm honest we hadn't really explored containerization too much because we had come to the end of our kind of transformation journey. Open-source and containerization weren't around when we went through that. So we kinda needed some really core reasons to move on, so, yeah, effectively what happened was we looked at Fuse, I was gonna say primarily for cost, but we were looking for something that we could migrate to where it makes sense. We were looking for something that wasn't a massive lift for the people who worked in our integration already, so they could be reskilled into it, and interestingly we went agile recently, which has changed the way we look at the needs of our systems. So our old integration platform, if we needed to deploy a change we had to take an outage, which was fine when we had a centralized IT department who deployed once a month and took a two hour outage, but when you have 20 tribes all developing features in isolation and they wanna go straight through to production, if everybody took an outage then our systems wouldn't really be up very often. So one of the key things that we were looking at for our new integration platform was can we deploy hot and can we scale? So that's basically where Fuse came in to us. >> Okay, so can you?
>> We can and we do. Still a little bit nervous about pressing the button mid-day and doing stuff >> Right, simultaneously and thinking this has really gotta work, right? >> Yeah, then normally, >> We saw it today though on the demo stage, on the keynote. You know, simultaneous operations going on. >> No, we do it, and they normally don't tell me when they're doing it, they just do it and tell me it worked afterwards, but no, it's actually been really successful, and you can imagine connecting 40 or 50 systems together is effectively the equivalent of about 2,000 APIs, and we managed to migrate, we're about 70% of the way through. But we've managed to migrate those without actually impacting the systems that use them, and that's probably been one of our most successful IT projects that I've seen. >> It's funny, you said we were towards the end of our transformation journey, and of course I think we all understand, it is just, I might've reached a marker in my journey, but it needs to be a continuous process. And you went through an agile transformation. So bring us in a little bit. Organizationally, what happened there. Some of the good, the bad, and the ugly of agile, 'cause I mean agile's always an ongoing thing. >> It is, yeah. So about the start of last year we started to think about agile and the need to change our ways of working. And we looked at a number of models overseas, and companies like Spotify and various banks, and we settled on a model of chapters and tribes. And we took about six months in looking at what that meant for us as an organization and all of the things that we needed to change. Everything from people's contracts to people's titles. We got rid of all complex titles and moved down to simple things like Developer, Tester, et cetera. We had to train our people in agile so we ran boot camps for over 2,000 people. We had one with 500 people attend.
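Stepping back to the deployment point for a moment: the "deploy hot" capability Niall describes, pushing a change through to production without taking an outage, is what container platforms such as OpenShift provide through rolling updates, where instances are replaced one at a time so some capacity is always serving. A toy sketch of that idea (the fleet size, version labels, and function name are invented for illustration; this is not Spark's or OpenShift's actual code):

```python
# Toy sketch of a rolling ("hot") deployment: replace service instances one
# at a time so capacity never drops to zero. All names and numbers here are
# invented for illustration.

def rolling_update(replicas, new_version):
    """Swap each replica to new_version in turn, yielding serving capacity."""
    for i in range(len(replicas)):
        yield len(replicas) - 1    # one instance is out of service during the swap
        replicas[i] = new_version  # instance comes back on the new version
    yield len(replicas)            # every instance upgraded and serving again

fleet = ["v1"] * 4                 # four running instances of an integration service
capacity = list(rolling_update(fleet, "v2"))

print(fleet)     # ['v2', 'v2', 'v2', 'v2'] -- all upgraded
print(capacity)  # [3, 3, 3, 3, 4] -- at least 3 of 4 instances serving throughout
```

On a real platform the scheduler handles this rather than hand-written loops; the sketch only shows why a per-instance swap avoids the all-or-nothing outage that the old once-a-month deployment model required.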
We had to review all of our processes and see where we had centralized things like IT governance or procurement. How do you actually manage this when you have up to 20 different people effectively, or tribes, doing their own developments? So over a period of about six months we went through all of these. We started with a concept of some forerunner tribes so we could figure out how this thing actually works, you know? And get some lessons. And then on the first of July last year, about 2,000 people in various buildings packed up their stuff at their desks and moved into a new world, into their tribes, with different working spaces and different collaboration areas and all the tools that we need. So, yeah, we're about nine months down that journey now and it's been good. >> How many total employees? >> We have about 5,000 in total. >> 5,000, so you had 500 at one time. 10% of your workforce in training at one time. >> That's right, yeah. Absolutely. >> How do you keep the wheels on the bus rolling? Because I mean you're asking people not only to learn new skills, but learn them in a new environment, and learn them literally in a new place. I mean that's just massive change and I think, we're human beings. We're creatures of habit to a certain extent. You had to hit a lot of bumps along the way. >> Yeah, so one of the key things we did upfront was we said the operate part of our business, which is effectively things like our contact center, our sales staff, our service desks, we will not go agile with those on the first day, because they operate in a slightly different way of working. The people in our stores, et cetera. So we had a concept of agile light and agile heavy. So we kinda parked them for a minute so that we wouldn't do exactly what you say and let the wheels fall off the trolley.
And we took the people that were the IT developers, the product development staff, and all of that, which came to just over about 2,000 people, and we firstly flipped those 2,000 people and put those through bootcamp. But even, as you say, scheduling the boot camps, we made sure that we always had the right people on the ground, and we would schedule smaller boot camps for them later if we needed to do it, but yeah. >> So nine months in now. You're talking to your peers, if they're gonna go through. Any key learnings, what were some of the most challenging things that you ran into? >> I think probably the major one is that agile at its heart is a way of working, and despite the name it's actually quite prescriptive in how you should work, you know? When you pick up the agile book it tells you all the ceremonies you need to run and the processes that you need to run as well. And I think you need to be pragmatic in how you implement it because there are so many different flavors of agile. The one flavor, even within an organization of Spark's size, it doesn't work. So the tribes and squads that are building out new products, compared to the tribes that are doing things like upgrading systems, they will work in different ways. So I think the first thing is be pragmatic: take the goodness and the intent of agile, but implement it in how it works for you. And there's some other practical considerations. Like prior to being agile, we had quite a large number of our technology partners based offshore in India, and you know, it's quite difficult to run a 10 AM stand-up in New Zealand, setting the priorities for the day and the sprint plans, when, you know, four members of your team are asleep in India. You know, they're missing out on all of the goodness and the co-location and the sharing, so that was one of the things we had anticipated, and luckily enough we had moved a lot of those people onshore in advance of agile, you know?
But it is a big cultural change for everyone in the organization, not least the leadership teams as well. >> John: Well, you got through it. >> We got through it, but there's no going back. >> Absolutely, no, you're in the deep end now. Well, Niall, thanks for being with us, we appreciate the time joining us here on theCUBE, and I think that an Irishman is always welcome in Boston. >> Thank you very much! We've been enjoying the hospitality. >> Yeah, the door's always open. >> Thank you very much. >> Thank you very much. Niall Fitzgerald, joining us from Spark NZ. Back with more here on theCUBE, you're watching this live at the Red Hat Summit 2019.
Vishy Gopalakrishnan, AT&T | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco, it's theCUBE. Covering AT&T Spark. (upbeat music) >> Hi, I'm Maribel Lopez, the founder of Lopez Research, and I am guest hosting theCUBE at the AT&T Spark event in San Francisco. And I have the great pleasure of being with Vishy Gopalakrishnan. He is the VP of ecosystems and innovation at AT&T. And Vishy, I've known you for a long time now. I've known you through companies as diverse as SAP and AT&T. Could you tell us a little bit about what a VP of ecosystem and innovation does, and this concept of the foundry that AT&T has? >> Sure. First of all, nice to see you again, Maribel. >> Paths cross. No new people, just different business cards. >> Exactly. So ecosystem and innovation. So this organization has been around at AT&T for about seven years or so. And it was set up to fundamentally answer this question: How can AT&T systematically tap into innovation that happens outside the company and then bring it inside, and then over a period of time become as good at adopting some of those principles of innovative thinking, innovative principles of problem-solving into the company itself? So if you think about ecosystem and innovation, there are three key pillars to ecosystem and innovation. One of them is called ecosystem outreach. So this is a part of the organization that acts as the interface to the broader startup and VC community. >> Right. >> Right. So this allows us to keep on top of innovation happening across a wide variety of technology waterfronts. Networking, security, virtualization, all the way up to AR, VR, AI, machine-learning, et cetera. >> It wouldn't be innovation if they weren't together, right? People try to really parse them, but true innovation comes of looking at some of the intersections of technology. >> Absolutely. And we're also agnostic in some sense about where the innovation comes from. 'Cause all we're trying to do is apply innovation to a particular business problem.
And the foundry is the second component of the ecosystem and innovation organization. Think of the foundries as centers of innovation. There are six of them around the globe. Four in the US, one in Tel Aviv, Israel, and the newest one in Mexico City that we opened in March. And these foundries represent fundamentally an environment within AT&T where we can rapidly prototype new technologies, de-risk new technologies before we introduce them into the rest of the organization, and actually also provide a way for us to proactively bring new, promising areas of technology to the rest of the business. So the foundries, if you will, serve as the leading edge of technology innovation within a company like AT&T. >> Well, I've been in The Valley for more than 10 years now, and I came from the East Coast, and the concept of an innovation lab or innovation foundry isn't new. We've seen it come and go with established companies and with new companies. So I remember the launch of the foundry. You said it's about seven years ago now. I can't believe it's even been that long. What have you learned in that time, and how are you making it work? Because I think everybody wants to be innovative, and they want to take, particularly established companies, these innovations and bring them back into the corporation. Can you give us a little more color and context on what you think you've done well and what surprised you? >> That's a great point to make about the relative longevity of the organization within a company like AT&T. >> And it's grown, apparently, with all the new innovation centers. >> Yes. And we've expanded to other locations outside. I think some of the lessons we've learned are that no organization stands still. >> True. >> AT&T as we know it today is different from what AT&T was seven years ago. The kinds of businesses we're in, the kinds of capabilities that we have to bring to bear, markedly different from what it was seven years ago.
And the nature of the competitive waterfront is also dramatically different. So, which means that as an innovation organization, we've had to evolve almost in lockstep with, and sometimes ahead of, the organization itself. So that's been one thing that we've done, is that we've made sure that we always are aware of where the company's going, so that as we look at what kinds of innovation might apply, might be relevant, might be material for the corporation, we know that it's always grounded in what the company wants to do now and about two years from now. >> So forget the science projects and try to get something that's practical to the business, but also a bit edgy, right? >> Yes. >> You want to be edgy. >> Yes, and it's an art and a science. We like to focus on innovation that's in context. So pure innovation is kind of interesting, but we always like to bring it back to either an internal stakeholder or an external customer as a stakeholder, to sign off and be almost the voice of reason to say yes, this is interesting technology, but this is how it might or might not apply to my business problem. >> Do they ask you for things? Does the organization come to you and say, "Hey, we're looking for blah and..."? >> Absolutely. In fact, a big part of what we do as an organization is actually keep the dialogue with the internal stakeholders kind of ongoing and active, so that we always need to be aware of, from a business standpoint, what are the imperatives that a business leader is facing. 'Cause let's face it, a lot of these business leaders within a corporation as large as AT&T are running P&Ls that are pretty large. So for us to bring relevant and material innovation to them, we have to be aware of what are the two or three top, key problem areas that they're looking at. Is it cost reduction? Is it operational simplification? If it's a big part for the network organization, what parts of network optimization are they most interested in?
So being aware of that informs us better and in some sense helps us curate what kinds of innovative solutions we bring to them. >> Now, you were talking about how you put these innovation engines around the globe, and I imagine that you are learning and gaining different things and insights from these different groups, because there are phenomenally different ways people use technology depending on where they are in the world. So can you share a little bit with us about what's exciting, what you're seeing in the labs today? Are there geographic differences that we should be aware of as business leaders when we think of trying to roll out technologies? >> Sure. I'll give a two-part answer. One part of it is the areas of focus for us. >> Okay. >> One, as we just finished the panel on edge compute, so that's a big focus for the foundry organization: trying to understand the use cases in which edge computing might actually give a pretty dramatic improvement in user experience, and what is the role of the network edge in doing that, so working with a broad ecosystem of partners, both established and start-ups, to actually make that happen. So that's one big area of focus. The other thing we're doing is... A big part of AT&T's business is actually focused on the enterprise side, AT&T Business. So we have two foundry locations, one in Plano and one in Houston, that are focused exclusively on customer co-creation with our enterprise customers. For the past five years, we focused exclusively on IoT and used the Plano foundry to co-create around IoT for customers. In terms of differences across geographies, I think the most salient one is the one in Mexico City. We actually started that with the very explicit intent of innovating for emerging markets. Emerging markets have the need for high-performing, high-quality solutions. >> At a low cost. >> Exactly. So you need to deliver them at a much, much lower cost than the emerging markets actually will bear.
So which means that you have to frame the problem differently, you have to go about innovation very differently, and oftentimes you'll have to tap into the local innovation ecosystem as well. So that's a big, big part of what we're doing in Mexico as well. Trying to tap into the global network that we have as a company through all of the six foundry locations, but making sure that we're tailoring it to what the local Mexican market needs. >> I'm actually very excited to see how innovation has been rolling out around the world. One of the things that comes up in every dialogue I have around innovation right now, and frankly in most products, is AI. Do you see a role for AI happening in the foundry today? >> Yeah, we've been doing work on AI for quite some time. In fact, we've been doing a series of projects for our internal organization around applying machine-learning techniques to some very complex network optimization problems. And we've been doing that for about 18 months or so. And we've been looking at even ways to apply reinforcement learning to some very classic network problems as well. As part of some of the work that we're doing around edge, we're looking at ways to do inferencing at the edge. For a variety of use cases, including, for example, a public safety or a first-responder kind of a use case. So absolutely, AI and machine-learning continue to be one of the areas that we spend a lot of time on. >> Well, Vishy, it's been great talking to you today here at AT&T Spark, and I look forward to seeing you again soon. >> Thank you, Maribel. Likewise. >> Maribel Lopez, speaking with theCUBE. Thank you. (upbeat music)
Maribel Lopez, Lopez Research | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco, it's theCUBE covering AT&T Spark. (techy music) Now here's Jeff Frick. >> Hey, welcome back everybody. Jeff Frick here at theCUBE. We're at AT&T's Spark event, it's up in San Francisco at the Palace of Fine Arts. It's really all about 5G, and we're excited to be here, you know, there's been a lot of conversation about 5G for a very, very long time, and we're super excited to have the expert in the field. Maribel Lopez has been following this forever. So Maribel, first off, thanks for stopping by, thanks for hosting a few segments and great to catch up. >> Excited to be here. >> Absolutely, so 5G, you've made a funny comment before we went on. You said, "Jeff, this 5G's been going on forever and ever and ever, but now it's finally starting to come to reality, to fruition." >> Yeah, I got to see all the Gs: the 2G, the 3G, the 4G, now the 5G, and you know, for a couple of years we were just talking about standards, and what's really exciting to me is that now people are talking about doing production stuff, you know, not just rolling in a test van and prototype equipment, but actual things that we might be able to see deployed within the coming year. >> Right. >> People are talking about lighting up cities. AT&T announced another five cities that they were going to put, actually seven, I think, on the calendar. >> Up to a dozen, I think, now, then they had another-- >> Yes, they had seven, they added another five-- >> Seven after that, right. >> And then another seven, so we're really starting to see momentum in 5G, it's going to happen. >> Right, so there's a bunch of things with 5G that are fundamentally different than the last G. >> Right. >> And the first one, right, is it wasn't really developed just for faster voice. That was not the objective of 5G. >> Yeah. >> It's really to take advantage of IoT and this whole kind of machine to machine world in which we're in right now. >> Yeah.
>> That's a fundamental difference in terms of the applications that it can open up. >> Yeah, we're seeing... To your point, I mean, we talked a lot about bandwidth before. Yes, you get more bandwidth, but you also get lower latency, and that's the thing of how fast something can travel, and that opens up a huge amount of new applications like autonomous driving. If you want a wireless connection in autonomous driving you need 5G so you have that, you know, really sharp response time to make it happen. If you're doing remote medicine, you know, 5G gives you both bandwidth, but also the latency to see if something's happening so that you can do things that are real-time in nature. So, I think it's that real-time in nature with high speed that everybody's talking about. We saw eSports and gaming listed today, and the discussion about how you could now do it on a low-end PC because between your 5G network and new software you've got this huge opportunity with the cloud to just do a whole new, different way of gaming and entertainment, so lots of great applications are coming out with 5G. >> Yeah, it's pretty interesting on that demo, because it was an NVIDIA guy talking about-- >> Yes. >> Having basically an NVIDIA data center to do all the graphic computation back in the cloud at the NVIDIA data center-- >> Yeah. >> And then delivering it to whatever kind of low-end edge device that you had, in this case a laptop. The funny thing about the latency that I thought really kind of struck home for me was they talked about when your audio and your video are slightly out of sync when you're watching a video. >> Exactly. >> When it's just off a little bit. >> Mm-hm. >> Not enough like, "Wait, this is broken," but enough to actually get nausea. >> Yeah. >> You actually have a physical reaction, so I think that was really interesting. That is what's going to go away when we have the better connectivity speeds, everything else with 5G. 
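Maribel's autonomous-driving point is easy to put numbers on: what matters is how far a vehicle travels while a message makes one network round trip. A back-of-envelope sketch, using assumed latency figures (roughly 50 ms for a 4G round trip and 5 ms as a commonly cited 5G target; neither number comes from this conversation):

```python
# Back-of-envelope: how far does a car travel while one network round trip
# completes? The latency figures are illustrative assumptions (about 50 ms
# for 4G, 5 ms as a commonly cited 5G target), not measurements.

def distance_during_rtt(speed_kmh, latency_ms):
    """Metres travelled at speed_kmh during latency_ms of round-trip delay."""
    speed_m_per_s = speed_kmh * 1000 / 3600   # km/h -> m/s
    return speed_m_per_s * latency_ms / 1000  # delay in seconds -> metres

highway = 100  # km/h

print(round(distance_during_rtt(highway, 50), 2))  # 1.39 m before a 4G-ish response arrives
print(round(distance_during_rtt(highway, 5), 2))   # 0.14 m on a 5 ms link
```

At highway speed the difference is roughly a car travelling 1.4 metres versus 14 centimetres before a response can even arrive, which is why the low-latency side of 5G, not just the bandwidth, keeps coming up for real-time applications like driving and remote medicine.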
>> And I think that's one of the things that's been holding back the immersive nature of new applications like VR, so that disconnect that you talked about is really important to get rid of, and you can get rid of part of that with wireless and part of it with low latency. So, if we get the headsets a little smaller and we get more content I think we'll start to get a better vision of what's happening there. I also think we're starting to see these things come into the enterprise. You know, the enterprises are really taking 5G seriously. They're looking at doing things like their own private 5G networks in things like manufacturing and robotics, for example.
So, the days of building, like, your whole stack from scratch are over, and opensource is really important, and what I found really interesting about that was the takeaway that so many companies, even competitors of each other, had all thrown in on this concept of this opensource technology so that they could basically bootstrap their innovation. >> Right, the other kind of theme that kind of came up, which I found really interesting, is if you've ever seen Jeff Bezos speak on his investment in Blue Origin. >> Yeah. >> He talks very specifically that he wants to put a platform in play-- >> Mm-hm. >> Leveraging the winnings that he's gotten from Amazon to enable future entrepreneurs to have an infrastructure in which they can build cool applications-- >> Absolutely. >> In this case for space. We heard the same message here within this kind of 5G, that the concept of, you know, kind of infinite compute, infinite bandwidth-- >> Right. >> And infinite storage asymptotically approaching zero, what applications would you build in that world, and really this constant conversation of experience, whether that be a business experience, a consumer experience-- >> Yeah. >> A first responder experience, is really what's behind kind of the excitement on this 5G conversation. >> I think there was always a disconnect of when you get data, and how quickly you can analyze that data and get it back to somebody to do something meaningful, so this whole experience is about even if you are not holding a 5G handset or some 5G thing in your hand or elsewhere, what that will do is because they've built the 5G infrastructure you get the opportunity to make 4G better for everybody. So, I think people think, "Oh, I've got to wait for 5G." It's like, "No, you're going to see the benefits "of 5G long before everybody's ubiquitously deployed, "long before everybody has 5G devices." >> Right. 
>> Things are just going to work better, and you can get that data faster and new experiences faster, so I'm excited for it. >> Right, and then the other piece that we hear over and over, right, is AI and machine learning, and again-- >> Absolutely, mm-hm. >> It's not AI and machine learning just for the sake of AI and machine learning. It's baked into all these other applications to make them all work better, and again, that's another big thing that we hear here at the keynotes. >> Yeah, I think the AI and machine learning is interesting because we've had it for a long time, but now everybody has access to it, right? We've got cloud services that give you algorithms, we've got massive compute, and now we've got the ability to take all the data from IoT sensors and other things and get it back to either a centralized place, or to do edge compute on it, which I think is really exciting. >> Right, so just to wrap, get your kind of your final impressions on kind of the show-- >> Yeah. >> And again, you said you'd been here for all the Gs, (laughs) so is a 5G, is this a big difference from our prior step functions? >> I think it is because of that latency that we talked about and the ability to do much more real-time, data intensive apps. So, you've always had this concept of moving to more data, but it had lower latency, it might've had a higher cost. Now we're getting that right kind of combination of cost, bandwidth, real-time nature, so I think every G gets better and 5G is just better than 4G, but in different ways, so-- >> All right, well Maribel, thanks again for stopping by, and also for helping us out guest hosting a few segments. >> Thank you. >> All right, (chuckles) she's Maribel, I'm Jeff, you're watching theCUBE. We're at AT&T Spark in San Francisco, thanks for watching. (techy music)
Mazin Gilbert, AT&T | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco, it's theCUBE! Covering AT&T Spark. (bubbly music) >> Hello! I'm Maribel Lopez, the Founder of Lopez Research, and I am here today at the AT&T Spark event in San Francisco and I have great pleasure and honor of interviewing Mazin Gilbert, who is the VP of Advanced Technology and Systems at AT&T. We've been talking a lot today, and welcome Mazin. >> Thank you, Maribel for having us. >> We've been talking a lot today about 5G, 5G is like the first and foremost topic on a lot of people's mind that came to the event today, but I thought we might step back for those that aren't as familiar with 5G, and maybe we could do a little 5G 101 with Mazin. What's going on with 5G? Tell us about what 5G is and why it's so important to our future. >> 5G is not another G. (Maribel laughs) It really is a transformational and a revolution really to not to what we're doing as a company, but to society and humanity in general. It would really free us to be mobile, untethered, and to explore new experiences that we've never had before. >> Do I think of this as just faster 4G? Because we had 2G, then 3G, then 4G, is 5G something different? When you say it allows us to be mobile and untethered, don't we already have that? >> No we don't. There are a lot of experiences that are not possible to do today. So imagine that having multiple teenagers experiencing virtual reality, augmented reality, all mobile, while they are in the car all in different countries; we can't have that kind of an experience today. Imagine cars as we move towards autonomous cars, we cannot do autonomous cars today without the intelligence, the speeds and the latency with 5G, so that all cars connect and talk to each other in a split of a second. >> See, I think that's one of the real benefits of this concept of 5G. So when you talk about 5G, 5G is yes more bandwidth, but also lower latency, and that's going to allow the things that you're talking about. 
I know that you also mentioned things such as telemedicine and the FirstNet network, any other examples that you're seeing that you think are really going to make a difference in people's lives going forward as they look at 5G? >> 5G is a key enabler in terms of how these experiences are going to really be transformed. But when you bring in 5G with the edge compute. Today, think of compute, and storage, and securing everything, is sitting somewhere, and as you're talking, that something goes to some unknown place. In the 5G era, with the edge, think of compute and storage as following you. And now-- >> So you're your own data center. (laughs) >> You're pretty much your own data center. Wherever you go with every corner, there's a data center following you right there. And now add to that, we're transforming our network to be programmable with our software-defined network, and add AI into that, bringing all of this diamond together, the 5G, the edge, programming the network with software-defined, and AI, and that is what the new experiences is. This is when you'll start seeing really an autonomous world. A world in which that we're able to experience drones flying and repairing cell sites, or repairing oil tanks, without us really being involved, from being in our living room watching a movie. This is a world that is extremely fascinating, a world in which that people can interact and experience family reunion, all virtually in the same room, but they're all in different countries. >> I do think there's this breakthrough power of connectivity. We've talked about it in the next generation of telemedicine, you mentioned some of the dangerous jobs that we'd be able to use drones for, not just for sort of hovering over people's gorgeous monuments or other things that we've seen as the initial deployments, but something that's really meaningful.
Now I know the other topic that has come up quite a bit, is this topic of opensource, and you're in the advanced technology group, and sometimes I think that people don't equate the concept of opensource with large established organizations, like an AT&T, but yet, you made the case that this was foundational and critical for your innovation, can you tell us a little bit more about that? >> Opensource is really part of our DNA. If you look at the inventions of the Unix, C, C++, all originated from AT&T Labs and Bell Labs, we've always been part of that opensource community. But really in the past five years, I think opensource has moved to a completely another level. Because now we're not just talking about opensource, we're talking about open platforms, we're talking about open APIs. What that means is that, we're now into-- >> A lot of open here. (laughs) >> Everything open in here. And what that really means is that we no longer as one company, no one company in the world can make it on their own. The world-- >> K, this is a big difference. >> It's a big difference. The world is getting smaller, and companies together, for us to really drive these transformational experiences companies need to collaborate and work with each other. And this is really what opensource is, is that, think of what we've done with our software-defined network, what we called ONAP in the opensource, we started as a one company, and there was another, one of the Chinese mobile companies also had a source code in there. In the past one year, we now have a hundred companies, some of the biggest brand companies, all collaborating to building open APIs. But why the opensource and open API is important, enables collaboration, expedite innovation, we've done more in the past one year than what we could've done alone for 10 years, and that's really the power of opensource and open platforms >> I totally agree with you on this one. 
One of the things that we've really seen happen is as newer companies, these theoretically innovative companies have come online, cloud native companies, they've been very big on this open proponent, but we're also seeing large established companies move in the same direction, and it's allowing every organization to have that deeply innovative, flexible architecture that allows them to build new services without things breaking, so I think it's very exciting to see the breadth of companies that you had on stage talking about this, and the breadth of companies that are now in that. And the other thing that's interesting about it is they're competitors as well, right? So, there's that little bit of a edgy coopetition that's happening, but it's interesting to see that everybody feels that there's room for intense innovation in that space as well. So we've talked a little bit about opensource, we've talked about 5G, you are in advanced technology, and I think we'd be remiss to not talk about the big two letter acronym that's in the room that's not 5G, which would be AI. Tell me what's going on with AI, how are you guys thinking about it, what advice do you have for other organizations that are approaching it? Because you are actually a huge developer of AI across your entire organization, so maybe you could tee up a little bit about how that works. >> AI is transformational, and fundamental for AT&T. We have always developed AI solutions, and we were the first to deploy a AI in call centers 20 years ago. >> 20 years ago, really? >> 20 years ago. >> You were doing AI 20 years ago? >> 20 years ago. >> See, just goes to show. >> 20 years ago. I mean AI really, if you go to the source of AI, it really goes in the '40s and '50s with pioneers like Shannon and others. But the first deployment in a commercial call center, not a pilot, was really by AT&T. >> An actual implementation, yeah. >> With a service, we called it how may I help you. 
And the reason we put that out, because our customers were annoyed with press one for this and press two for billing, they wanted to speak naturally. And so we put the system that says "How may I help you?" and how may I help you allowed the customer to tell us in their own language, in their own words, what is it that they want from us as opposed to really dictating to them what they have to say Now today, it's really very hard for you to call any company in the world, without getting a service that uses some form of speech recognition or speech understanding. >> Thankfully. (laughs) >> But where we're applying it today and have been for the past two, three years, we're finding some really amazing opportunities that we've never imagined before. AI in its essence, is nothing more than automation leveraging data. So using your data as the oil, as the foundation, and driving automation, and that automation could be complete automation of a service, or it could be helping the human to doing their job better and faster. It could be helping a doctor in finding information about patients that they couldn't have done by themselves by processing a million records all together. We're doing the same thing at AT&T. The network is the most complex project ever to be created on the planet. And it's a complex project that changes every second of the day as people move around, and they try different devices. And so to be able to optimize that experience, is really an AI problem, so we apply it today to identify where to build the next cell sites all the way to what's the right ad to show to the customer, or, how do we really make your life easier with our services without you really calling our call center, how do I diagnose and repair your setup box before you're calling us? All of that foundation is really starting to be driven by AI technologies, very exciting. 
>> Well I'm actually very excited to see where AI takes us, and I'm excited to hear about what you're doing in the future. Thanks for takin' the time to come here today, >> It's my pleasure. >> And be with us on theCUBE. Thank you. >> It's always a pleasure talking to you, thank you very much. >> I'm Maribel Lopez closin' out at AT&T Spark, thank you. (bubbly music)
Jeff McAffer, Microsoft | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco it's theCUBE. Covering AT&T Spark, now here's Jeff Frick. >> Hey, welcome back everybody, Jeff Frick here with theCUBE. We're at the Palace of Fine Arts in San Francisco at the AT&T Spark event and it's all about 5G. 5G is this huge revolution and I haven't got a definitive number but it's something on the order of a hundred X improvement of speed and data throughput. There's a lot of excitement but one of the things that is less talked about here but it was actually up on the keynote was really the role of opensource and AT&T talked a lot about opensource and how important it is and really redefining the company around the speed of software development versus the speed of hardware development and that's a big piece of it. We're excited to have somebody who knows all about opensource our next guest he's Jeff McAffer. He's the Director of Opensource Progress Officer at Microsoft, did I get that right Jeff? >> No, well it said Opensource Programs Officer.
>> Yeah, well I mean it's been a long road but it's really the way software's happening today, you mentioned in the intro about the speed of software versus hardware and software's just going so fast and you know, you can aspire to be world class but when everybody else starts there with opensource, you know it's really hard to start from zero and get to there. So we're really happy to be you know, using opensource and contributing. One of the real challenges we've had going forward is the scale, like simply we've got literally millions of uses of opensource across all of our products and services. And managing that, keeping track of it, engaging with those, all those communities and everything is a real big challenge. So we've been building policies and tools and changing the culture to understand that you know, you need to engage, push fixes back, all those sorts of things. And then when we look at releasing our software, we have thousands of opensource repositories on GitHub, thousands of developers at Microsoft working on GitHub repositories, our own and others in the community. So it's just managing all of that has been a really big challenge.
So it ranges everywhere from people who are fully opensource to folks who are just you know, using a little bit of stuff here and there within their products. >> Right, what if you could speak a little to opensource and the role that it plays in employee happiness, employee retention cause you know, there's so much goodness and you see it at these shows. >> Absolutely. >> Where there's particular contributors that you know, they're rock stars in their community. They've made super important contributions. >> Yeah. >> They've managed the community and I always think back, if you're the person managing that person back at the office you know, how much time do they put into their opensource effort? >> Sure. >> How much time do they put in their company efforts? How much of their time is really the company software that's built on top of that opensource. >> Yeah. >> And how do you manage that because it is a really important piece for a lot of people's personas. >> Absolutely. >> And their self values. >> Yeah, well and there's been a lot of research that says also that high performing teams, one of the traits of high performing teams is engaging in opensource. And at the personal level like individuals, there's kind of a different set of possibilities there, you know, either you're engaging in opensource for part of your product work, right, so that's sponsored by the company. Or you might be doing some things on the side or some tangential range in between there, right? >> Right. >> And sort of all of those you need to drive to the appropriate level, the folks who are working on it day to day for their, for the company. There's some really interesting dynamics that can get setup. Super exciting for the team, sometimes it can get a little waylaid maybe but you know, you want to keep them, keep them on task. But then also the, the folks who are doing it of their own volition, like on their own time and that sort of thing. 
That also brings back a bunch of energy and everything into the workplace. New technologies that they'll discover in their area and they'll bring back the energy and the excitement about engaging back to the regular team. >> Right. >> So there's lots of possibilities there. >> So what brings you here, what brings you to AT&T Spark today? >> Well they invited me to speak on a panel earlier today about opensource and the future of opensource and so I had a, there were a couple of other people from Linux Foundation and from AT&T. So we had a good conversation on stage. >> Yeah it's pretty interesting how, pretty much all these projects you know, eventually get put in to the Linux Foundation. That they, you know, they've just kind of become this defacto steward for a wide variety of opensource projects. >> Yeah, well there's a number of different foundations, Linux Foundation's certainly one of the better known ones, the Eclipse Foundation, Apache. >> Right, Apache yeah, right. >> Been around lots of times doing lots of good things. So there's a ton of amazing projects out there in all of these foundations. And it's just super exciting to see them all be engaging like in this sort of cohesive right, and with a good governance model. >> Right. >> Yep. >> So I'll give you the last word, one of my favorite lines always that's opensource is opensource is free like a puppy. >> Yes, it's totally free like a puppy. >> So, you know, you're living in that world, what is one of the things about opensource that most people miss, one of the really positive attributes that most people just don't see. And then what's one of the big, you know kind of biggest, kind of ongoing challenges, that's just part of operating in this opensource world? >> Well I mean, I phrase it in challenges and opportunities, right, there are obviously lots of challenges, like I was saying with scale and managing security. And the culture change that goes around collaboration and that sort of thing. 
The opportunities, I think are boundless really, I mean there's, one of the most gratifying things that you can see as an opensource project, is people take your technology and use it in ways you never imagined. Right, so there's, we can think of that as our products too and we take our products and they've got opensource APIs. They've got opensource frameworks and such. And people take them and do amazing things with them that we never imagined possible. And that just, that part is really exciting and invigorating. >> Yeah, alright Jeff well thanks for taking a few minutes. >> Sure. >> Congrats on all your work and I guess we'll see you in Orlando in a month or so. >> Okay, possibly. >> Alright, he's Jeff, I'm Jeff, we're all Jeffs here and we're at the Palace of Fine Arts at AT&T Spark, thanks for watching, see you next time. (upbeat music)
Alicia Abella, AT&T | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco, it's theCUBE, covering AT&T Spark. Now here's Jeff Frick. >> Hey, welcome back, everybody. Jeff Frick here with theCUBE. We're at the Palace of Fine Arts in San Francisco at the AT&T Spark event. It's really all about 5G and what 5G is going to enable. You know, this is a really big technology that's very, very close. I think a lot closer than most people understand. And one of the most important components of 5G is it was designed from the ground up really not so much for people-to-people communications as much as machine-to-machine communications. So we're really excited to have someone who's right in the thick of that and talk about the implications, especially another topic that we hear all the time, which is Edge computing. So it's Alicia Abella. She is the VP of Operational Automation in Program Management from AT&T Labs. Alicia, welcome. >> Thank you for having me, Jeff. >> Absolutely. So we were talking a little bit before we turn on the cameras about 5G and Edge computing. And how the two, while not directly tied together, are huge enablers of one another. I wonder if you can unpack a little bit about why is 5G such an important component to kind of the vision of Edge computing? >> Sure, absolutely. Yeah, happy to do so. So Edge computing is really about bringing processing power closer to the end device, closer to the end user, where a lot of the processing data analytics has to occur. And you want to do that because you want to be able to deliver the services and applications close to the edge, close to where the customer is, so that you can deliver on the speeds that those applications need. 5G plays a role because 5G is promising to be very fast and also very reliable and very secure. So now you've got three things to your advantage paired up with Edge to be able to deliver on a lot of these use cases that we hear a lot about when we talk about 5G, when we talk about Edge. 
Some example use cases are the autonomous vehicle. The autonomous vehicle is a classic example for Edge computing as well as 5G. And in fact, it illustrates a kind of continuum, because you can have processing that always has to remain in the car. Anything related to safety? That processing has to happen right on that device. The device in this case being the car. But there are other processing capabilities, like maybe updates to real-time maps. That could happen on the Edge. You still have to be near real-time, so you want to have that kind of processing and updating happening at the Edge. Then maybe you have something where you want to download some new entertainment, a movie to your car. Well, that can reside back at the data center, further away from where the device or the car is. So you've got this continuum. >> So really, what the 5G does is really open up the balance of how you can distribute that store computing and communications. It's always about latency. At the end of the day, it's always about latency. And as much as we want to get as much compute close, oh, and also, I guess power. Power and latency. >> Power and Edge actually go hand-in-hand as well. >> It's a big deal, right? >> Mhm. >> So what you're saying is, because of 5G, and the fact that now you have a much lower latency, faster connectivity port, you can now have some of that stuff maybe not at the Edge and enable that Edge device to do more, other things? >> Yes, so I often like to say that we are unleashing the device away from having it be tethered to the compute processor that's handling it and now you can go mobile. Because now what you do is, if the processing is happening on the Edge and not on the device, you save on battery life, but you also make the device more lightweight, easier to manage, easier to move around. The form factor can become smaller. So there's also an advantage to Edge computing to the device as well. >> Right. It's pretty interesting. 
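The device/edge/cloud continuum Alicia describes (safety-critical work on the car, near-real-time work at the edge, latency-tolerant work in a central data center) can be sketched as a simple placement rule. This is a hypothetical illustration, not AT&T's actual architecture; the tier names and the 50 ms edge latency budget are assumptions chosen for the example.

```python
# Toy placement rule for the continuum described above. The 50 ms edge
# budget and the tier names are illustrative assumptions, not real values.

def place_workload(latency_budget_ms: float, safety_critical: bool) -> str:
    """Pick a processing tier for a workload given its latency budget."""
    if safety_critical:
        return "device"   # e.g. collision avoidance must stay in the car
    if latency_budget_ms < 50:
        return "edge"     # e.g. near-real-time map updates
    return "cloud"        # e.g. downloading a movie to the car

workloads = [
    ("collision_avoidance", 5, True),
    ("map_update", 20, False),
    ("movie_download", 5000, False),
]
for name, budget_ms, critical in workloads:
    print(f"{name}: {place_workload(budget_ms, critical)}")
```

The point of the sketch is only that placement falls out of two questions, latency budget and criticality, which is the balance the interview keeps returning to.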
There was an NVIDIA demo in the keynote of running a video game on the NVIDIA chips in a data center and pumping a really high resolution experience back out to the laptop screen, I think is what he was using it for. And it's a really interesting use case in how, when you do have these fast, reliable networks, you can shift the compute, and not just pure compute, but the graphics, et cetera, and really start to redistribute that in lots of different ways that were just not even fathomable before. Before, you had to buy the big gaming machine. You had to buy the big, giant GPU. You had to have that locally, and all that was running on your local machine. You just showed a demo where it's all running back in their data center in Santa Clara. Really opens up a huge amount of opportunity. >> That's right. So Edge computing is really distributed in nature. I mean, it is all about distribution. And distributing that compute power wherever you need it. Sprinkling it across the country where you need it. So we've gone, there's been this pendulum shift, where we started with the mainframe, big rooms, lots of air conditioning, and then the pendulum swung over to the PC. And that client-server model. Where now you had your PC and you did your computing locally. And then it swung back the other way for Cloud computing where everything was centralized again and all that compute power was centralized. And now the pendulum is swinging back again the other way to this distributed model where now you've got your compute capabilities distributed across the country where you need it. >> Right. So interesting. I mean, networking was the last part of the platform to be virtualized: storage and compute, and then finally networking. But if you really start to think of a world with basically infinite power, compute, infinite store, and infinite networking, basically asymptotically approaching zero pricing. Think of the world that way. We're not there.
We're never going to get to that absolute place, but it really opens up a lot of different ways to think about what you could do with that power. So I wonder if there's some other things you can share with us. At Labs, you guys are looking forward to this 5G world. What are some of the things that you see that just, wow, I would have never even thought that was even in the realm of possibility that some people are coming up with? >> Yeah. >> Any favorites? >> Oh, I think one of our favorites is certainly looking at the case of manufacturing. Even though you would think of manufacturing as very fixed, the challenge with manufacturing is that a lot of those robotics capabilities that are in the manufacturing assembly lines, for example, they're all based on wires and they can't change and upgrade what they're doing very quickly. So being able to deliver 5G, have things that are wireless, and have Edge compute capabilities that are very powerful means that they can now shift and move around their assembly lines very quickly. So that's going to help the economy. Help those businesses be able to adapt more quickly to changes in their businesses. And so that's one that is quite exciting to us. And I would say the next one that's also exciting for us would be, we talked about autonomous vehicles already, 'cause that one's kind of far out, right? >> I don't think it's as far as most people think, actually. We covered a lot of autonomous vehicle companies, and there's just so much research being done now. I don't think it's as far out as people think. >> Yes, and so I think we are definitely committed to deploying Edge compute. And in the process, from a more technical perspective, I think one of the things that we are going to be interested in doing is, and you alluded to it before, is how do you manage all of those applications and services and distribute them in a way that is economical, that we can do it at scale, that we can do it on demand? 
So that too is part of what's exciting about being able to deploy Edge. >> Yeah. It's pretty interesting, the manufacturing example, 'cause it came up again in the keynote: really embracing software-defined, embracing open source. And the takeaway was moving at the speed of software development, not moving at the speed of hardware development. Because software moves a lot faster. And can be more flexible. It's easy to respond to market demands, or competitive demands, or just to innovate a lot faster. So really taking that approach, and obviously a lot of conversation about you guys in the OpenStack community and the open-source projects, enables you and your customers to start to adapt to software-defined innovation as opposed to just pure hardware-defined innovation. >> That's right. That's right, yup. >> Alright, Alicia, I'll give you the final word. Any surprises? Oh, no, you've got a chat coming up, so why don't you give us a quick preview for what your conversation is going to be about later today? >> Yeah, thank you, Jeff. So yeah, later I'll be talking about AT&T's initiatives around encouraging women to pursue STEM fields. In particular, computer science. It turns out that the number of women getting undergraduate degrees in computer science peaked in the mid-80s. And it's been going downhill since. Last year, only 17% of computer science degrees went to women. So AT&T's mission, and what we announced today, was a million dollar donation to the Girls Who Code organization. That's one of many different non-profit organizations that AT&T is involved with to make sure that we continue to encourage young women and also underrepresented minorities and others who want to get into STEM fields to get involved, because technology is changing very quickly. We need people who can understand the technology, who can develop the software we talked about, and we need to get that pipeline filled up.
And so we're very committed to helping the community and helping to encourage young girls to pursue degrees in STEM. >> That's great. Girls Who Code is a fantastic organization. We've had 'em on. Anita Borg, I mean, there's so much good work that goes on out there, so that's a great announcement. And congratulations. >> Thank you. >> And I'm sure that's a meaningful contribution. >> Yeah, thank you. >> So Alicia, thanks for stopping by, and good luck this afternoon, and we'll see you next time. >> Thank you, Jeff. >> Alright. >> Appreciate it. >> She's Alicia, I'm Jeff. You're watching theCUBE. We're at AT&T Spark in downtown San Francisco. Thanks for watching. (upbeat electronic music)
Gordon Mansfield, AT&T | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco, it's theCUBE covering AT&T Spark. (techy music) Now here's Jeff Frick. >> Hey, welcome back, everybody, Jeff Frick here with theCUBE. We're at the Palace of Fine Arts in San Francisco at the AT&T Spark event. It's all about 5G, you know we've been hearing about 5G for a long, long time, that 5G is coming, it's in cities, there's more cities that it's rolling out to, there's lots of special networks, so we're excited to be here as it becomes real, and we've got a guy who's right in the middle of the weeds, right in all the devices. He's Gordon Mansfield, the VP of Converged Access and Device Technology at AT&T, Gordon, welcome. >> Thank you. >> So, what do you think? You've probably been looking at this 5G stuff for a long, long time. It feels like we're finally getting pretty close. >> We're getting really close, you know, we're gearing up to launch our first 12 markets this year, and just this past weekend we made the first end-to-end call across our production network with a mobile form factor device, so we're real close and we're real excited. >> So, that just happened, right, this first call? >> It just happened this past weekend. >> So, what were some of the final hurdles to finally get that little milestone that you guys have probably been looking forward to for a while? >> Yeah, so the final hurdle is really getting the device modems into, you know, that form factor device, that mobile form factor to where it can be portable, you can carry it-- >> Right. >> And make these fantastic mobile data calls, and so getting that technology working together, communicating with the network infrastructure, that work just finished, or there's multiple stages, but a critical stage just got completed last week. We were able to take that technology straight to the field in Waco, Texas, and start demonstrating and working with it live in our production network.
>> So, do you get the dog out and he can hear his master's voice when you do that first phone call? >> Well... (laughs) You know... >> The old RCA. >> It's pretty close. >> I know, nobody knows what we're even talking about, right, too old. >> (chuckles) They don't, do they? >> So, the other thing that's really interesting about 5G compared to the other, prior roll-outs is really the focus on devices, and you're in charge of devices and devices is a lot more than just handsets, right? >> Yeah. >> This was really designed for the industrial internet and IoT, and really a whole swath of device-to-device communication. How did that kind of change the way you look at your job? >> Yeah, so you know, we've been working on IoT and modules in the IoT space, but with 5G you start to enable lots of new capabilities with very high bandwidth, low latent applications, which allows us to revolutionize various vertical industries, and so now it's no longer just about the smartphone or the tablet, but it's about anything and everything that you can imagine-- >> Right. >> And so, you know, I tell people all the time, you know, when we first start talking about technology we really think about some cool things, but the reality is we barely touch the surface, and so you know, people will just begin to imagine the capabilities that 5G will unleash and you'll start to put, you know, the capabilities into everything from a refrigerator to robot arms on a manufacturing floor and all kinds of points in between. >> Right, you know it's funny, we go to a lot of tech conferences, and we were just at VMworld a couple weeks ago and you know, Michael Dell said on air that, you know, the edge will actually be bigger than the cloud, and right, it's been all about cloud for the last several years. >> Yeah. >> Now it's all about edge. Well, the key to edge is connectivity, and that's a really important piece of the 5G story. 
>> Absolutely, if you take your compute power and you push it further to the edge you've got to then connect, and so you can put very low-cost, low horsepower components on the edge, connect them, you know, so in a device, connect them to the edge and come up with some pretty powerful capabilities. >> Yeah, and the other interesting thing from your guys' point of view, having dealt with handsets for so long, is just the whole low power, and a lot of the edge type applications are going to be in remote areas, difficult to get to areas, difficult to plumb areas, so the whole experience with low power combined with the low latency is really a big game changer. >> That's absolutely correct, so when you take low power you can put battery devices that last years-- >> Right. >> And have them in remote locations, sensors, et cetera, and have them connect in a low-latent, high-bandwidth way to deliver, you know, anything that you can imagine. >> Right, so it feels to me that there's really not the buzz around 5G that there should be, and I don't know because we've kind of heard about it for a while and it's kind of been in extended development or people just aren't paying attention, but what's interesting, a lot of conversations in the keynotes talking about experiences. >> Mm-hm. >> Really changing the way you can think about developing applications for experiences based on this technology. We saw the NVIDIA demo where they're running NVIDIA processors in their cloud and sending it to a laptop here, where before you'd have to spend thousands of dollars on a local machine. As you look back, what are some of the things that you've seen, either in testing or conversations, that maybe people just don't have any perception of how this is going to change some of their day-to-day activities? >> So, I don't think people, you know, unfortunately we've become immune. 
The devices, right, the processing power that we put in devices that people carry in their pocket, they keep going up and up. The reality is at some point you've got to flatten that to... From a consumer perspective you've got to flatten that to have a device that people can afford. >> Right. >> And so, with 5G and you start putting things to the edge you start taking away some of the processing power that physically is in the phone and you put that at the edge, to where now people can have really high speed, high capabilities in a relatively low-cost device. >> That's pretty interesting, you're the first person. So, it is really this redistribution of, you know, networking, compute, and store-- >> Mm-hm. >> That's now enabled with this fast networking, where before your options were really not so great. >> Yeah, it's always a balance, but today your only option is to continue to put more and more horsepower into the device itself-- >> Right. >> More processing, compute storage, into the device. By spreading that and having some of it maintained in the network you can maintain, you can manage cost in the end user device that people carry in their pocket. >> Okay, so give you the last word, when you are at a cocktail party on the weekend talking to some people about what you do, what surprises people most about 5G once you tell them it's this new thing that's coming down the pike? >> Well, you know, look, in my job I get to see lots of cool things, and when I start describing some virtual or augmented reality, imagine walking down the street with a pair of glasses and suddenly images, right, start, you know, being fed on top of what you're really looking at. You start, you know, you can imagine a day where, you know, an advertisement may pop up in your field of view, or you know, points of interest that you might want to see, and you know, obviously we've got to control that and manage it to consumer expectations, but that's not as far away as people might imagine. 
>> Right, and just to recap, you're in 12 markets. >> 12 markets-- >> You're in seven, five more, and then another seven coming, right? >> That's right, so 12 by the end of '18, and seven more in early '19. We're off to a fast start and looking to grow from there. >> All right, Gordon, well congratulations on progress to date and good luck with the roll-out. >> All right, thank you. >> (chuckles) All right, he's Gordon, I'm Jeff, you're watching theCUBE. We're at AT&T Spark in San Francisco, thanks for watching. (techy music)
Chris Sambar, AT&T | AT&T Spark 2018
>> From the Palace of Fine Arts in San Francisco, it's theCUBE, covering AT&T Spark. Now here's Jeff Frick. >> Hey welcome back everybody, Jeff Frick here with theCUBE. We're in San Francisco, at the historic Palace of Fine Arts, it's a beautiful spot, it's redone, they moved the Exploratorium out a couple years ago, so now it's in a really nice event space, and we're here for the AT&T Spark event, and the conversation's all around 5G. But we're excited to have our first guest, and he's working on something that's a little bit tangential to 5G, not absolutely connected, but really important work, it's Chris Sambar, he is the SVP of FirstNet at AT&T, Chris, great to see you. >> Thanks Jeff, great to be here, I appreciate it. >> Yeah, so you had a nice keynote presentation, talking about FirstNet. So for people who missed it, that aren't familiar, what is AT&T FirstNet? >> Sure, I'll give a quick background. As I was mentioning up there, tomorrow is the 17-year anniversary of 9/11. So 17 years ago tomorrow, a big problem in New York City. Lots of first responders descended on the area. All of them were trying to communicate with each other, they were trying to use their radios, which are, you know, typically what you see a first responder using, and the wireless networks in the area. Unfortunately, there were challenges; it wasn't working. They were having trouble communicating with each other, their existing wireless networks were getting congested, and so the 9/11 Commission came out with a report years later, and they said we need a dedicated communications network, just for First Responders. So they spun all this up and they said, we're going to dedicate some Spectrum, 20 megahertz of D-Block Spectrum, which is really prime Spectrum, and seven billion dollars, and we're going to set up this Federal entity, called the FirstNet Authority, and they're going to create a Public Safety Network across America.
So the FirstNet Authority spent a few years figuring out how to do it, and they landed on what we have today, which is a Public/Private Partnership between AT&T and Public Safety throughout America, and we're building them a terrific network across the country. It is literally a separate network. When I think of wireless in America, I think of four main commercial carriers: AT&T, Verizon, T-Mobile, Sprint. This is the 5th carrier, this is Public Safety's Wireless Network just for them. >> So when you say an extra network, so it's a completely separate, obviously you're leveraging infrastructure, like towers and power and those types of things. But it's a completely separate network from the existing four that you mentioned. >> Yeah, so if you walk into our data centers throughout the country, you're going to see separate hardware, physical infrastructure that is just for FirstNet, that's the core network just for this network. On the RAN, the Radio Access Network, we've got antennas that have Band 14 on them, that's Public Safety's Band, dedicated just for them when they need it. So yeah, it's literally a physically separate network. The SIM card that goes into a FirstNet device is a different SIM card than our commercial users would use, because it's separate. >> So one of the really interesting things about 5G, and kind of the evolution of wireless, is taking some of the load that has been taken by like WiFi, and other options for fast, always-on connectivity. I would assume radio, and I don't know that much about radio frequencies that have been around forever with communications in First Responders. Is the vision that 5G will eventually take over that type of communication as well? >> Yeah, absolutely. If you look at the evolution of First Responder and Public Safety Communications, for many years now they've used radios. Relatively small, narrow Spectrum bands for Narrowband Voice, right, just voice communications.
It really doesn't do data, maybe a little bit, but really not much. Now they're going to expand to this Spectrum, the D-Block Spectrum, which is 700 megahertz, it's a low-band Spectrum, that'll provide them with Dedicated Spectrum, and then the next step, as you say, is 5G, so take the load off as Public Safety comes into the new Public Safety Communications space that they've really been wanting for years and years, they'll start to utilize 5G as well on our network. >> So where are you on the development of FirstNet, where are you on the rollout, what's the sequence of events? >> The first thing we did, after the award last year in March 2017, was build out the core network. When I talked about all that physical infrastructure, that basically took a year to build out, and it was pretty extensive, about a half a billion dollars, so that was the first thing we did, and that completed earlier this year. >> Was that nationwide or major metro cities or how-- >> Nationwide, everywhere in the country. >> Nationwide, okay. >> So now what we're doing is, we are putting the Spectrum that we were given, or I should say was leased to us for 25 years, up across our towers all over the country. So that will take five years, it's a five-year build-out; tens of thousands of towers across America will get this Public Safety Spectrum, for Public Safety and for their use. >> Right. Will you target by GEO, by Metro area, I mean, how's it going to actually happen? That's a huge rollout, five years is a long time. How do you kind of prioritize, how are you really going to market with this? >> The Band 14 Spectrum is being rolled out in the major dense areas across the country. I will tell you that by the end of the rollout, five years from now, 99% of the population of America will have Band 14 Spectrum, so the vast majority of the population.
Other areas where we don't roll it out, rural areas for example, all of the features that Public Safety wants, we call them (mumbles) and priority, which are the features that allow them to always have access to the network whenever they need it. Those features will be on our regular commercial Spectrum. So if Band 14 isn't there, the network will function exactly as if it were there for them. >> Right. Then how do you roll it out to the agencies, all the First Responders, the Fire, the Police, the EMTs, et cetera? How do they start to take advantage of this opportunity? >> Sure, so we started that earlier this year. We really started in a March-April timeframe in earnest, signing up agencies, and the uptake's been phenomenal. It's over 2,500 Public Safety Agencies across America, over 150,000, and that number grows by thousands every week. That's actually a pretty old number, but they are signing up in droves. In fact, one of the problems we were having initially was handling the volume of First Responders that wanted to sign up, and the reason is they're seeing that, whether it's a fire out in Oregon, and they need connectivity in the middle of nowhere, in a forest where there's no wireless connectivity at all, we'll bring a vehicle out there, put up an antenna and provide them connectivity. Whether it's a Fourth of July show, or a parade, or an active shooter, wherever large groups of people combine together and the network gets congested, they're seeing that wow, my device works no matter what. I can always send a text message, I can send a video, it just works. Where it didn't work before. So they love it, and they're really, they're really signing up in droves, it's great. >> It's really interesting that this was triggered as part of the post-9/11 activity to make things better, and make things safer.
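The always-on access features Chris mentions, priority for first responders even on a congested cell, can be sketched as a toy admission-control model. This is a hypothetical illustration with made-up capacity numbers and class names, not FirstNet's actual admission logic: when the cell is full, the most preemptible lower-priority session is dropped so first-responder traffic always gets through.

```python
# Toy priority/preemption model for the behavior described above.
# Capacity and priority values are illustrative assumptions.
import heapq

FIRST_RESPONDER, COMMERCIAL = 0, 1  # lower number = higher priority

class Cell:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.active = []  # heap of (-priority, session_id); root is most preemptible

    def admit(self, session_id: str, priority: int) -> bool:
        if len(self.active) < self.capacity:
            heapq.heappush(self.active, (-priority, session_id))
            return True
        # Cell is full: preempt the lowest-priority active session
        # only if the new session outranks it.
        worst_neg_priority, _ = self.active[0]
        if priority < -worst_neg_priority:
            heapq.heapreplace(self.active, (-priority, session_id))
            return True
        return False

cell = Cell(capacity=2)
cell.admit("phone-a", COMMERCIAL)
cell.admit("phone-b", COMMERCIAL)
print(cell.admit("firetruck-1", FIRST_RESPONDER))  # prints True: a commercial session is preempted
```

A commercial session arriving at the same full cell would be refused rather than allowed to preempt, which is the asymmetry the dedicated-network terms are meant to guarantee.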
But there was a lot of buzz, especially out here in the West, with First Responders in the news who were running out of bandwidth. As you said, the Firefighters, the fire's been burning out here, it seems like forever, and really nobody thinking about those, or obviously they're probably roaming on their traditional data plan, and they're probably out there for weeks and weeks at a time, and that wasn't part of their allocation when they figured out what plan they should be on. So the timing is pretty significant, and there's clearly a big demand for this.
A lot of people probably experience it at, at big events right, is that still today, WiFi and traditional LTE, has a hard time in super-dense environments, where there's just tons and tons and tons of bodies I imagine, absorbing all that signal, as much as anything else, so to have a separate Spectrum in those type of environments which are usually chaotic when you got First Responders, or some of these mass events that you outlined, is a pretty important feature, to not get just completely wiped out by everybody else happening to be there at the same time. >> Exactly. I'll give you two quick examples, that'll illustrate what you just said. The first one is, on the Fourth of July, in downtown Washington D.C. You can imagine that show. It's an awesome show, but there are hundreds of thousands of people that gather around that Washington Monument, to watch the show. And the expectation is at the peak of the show, when all those people are there, you're not really going to be sending text messages, or calling people, the network's probably just not going to work very well. That's, we've all gotten used to that. >> Right, right. >> This year, I had First Responders, who were there during the event, sending me videos of the fireworks going off. Something that never would've been possible before, and them saying oh my gosh. It actually works the way it's supposed to work, we can use our phones. Then the second example, which is a really sad example. There was a recent school shooting down in Florida. You had Sheriffs, Local Police, Ambulances. You even had some Federal Authorities that showed up. They couldn't communicate with each other, because they were on different radio networks. Imagine if they had that capability of FirstNet, where they could communicate with each other, and the network worked, even though there were thousands of people that were gathering around that scene, to see what was going on.
So that's the capability we're bringing to Public Safety, and it's really good for all of us. >> Do you see that this is kind of the, the aggregator of the multi-disparate systems that exist now, as you mentioned in, in your Keynote, and again there's different agencies, they've got different frequencies, they've got Police, Fire, Ambulance, Federal Agencies, that now potentially this, as just kind of a unified First Responder network, becomes the de facto way, that I can get in touch with anyone regardless of where they come from, or who they're associated with? >> That is exactly the vision of FirstNet. In major cities across America, Police, Fire, Emergency Medical typically, are on three different radio networks, and it's very difficult for them to communicate with each other. They may have a shared frequency or two between them, but it's very challenging for them. Our goal is to sign all of them up, put them on one LTE network, the FirstNet Network, customized for them, so they can all communicate with each other, regardless of how much congestion is on the network. So that's the vision of FirstNet. >> Then that's even before you get into the 5G impacts, which will be the data impacts, whereas I think again, you showed in some of your examples, the enhanced amount of data that they can bring to bear, on solving a problem, whether it's a layout of a building for the Fire Department or drone footage from up above. We talked to Menlo Park Fire, they're using drones more and more to give eyes over the fire to the guys down on the ground. So there's a lot of really interesting applications where you can get better data, to drive better applications through that network, to help these guys do their job.
>> Yeah, you've got it, the smart city's cameras, don't you want those to be able to stream over the network, and give it to First Responders, so they know what they're going to encounter, when they show up to the scene of whatever issue's going on in the city, of course you do, and you need a really reliable, stable network to provide that on. >> Well Chris, this is not only interesting work, but very noble and important work, so we appreciate all of the efforts that you're putting in, and thanks for stopping by. >> I appreciate it Jeff, it's been great talking with you. >> Alright, he's Chris, I'm Jeff, you're watching theCUBE, we're in San Francisco at the Palace of Fine Arts, at AT&T Spark. Thanks for watching, we'll see you next time. (bright music)
Mark Grover & Jennifer Wu | Spark Summit 2017
>> Announcer: Live from San Francisco, it's the Cube covering Spark Summit 2017, brought to you by databricks. >> Hi, we're back here where the Cube is live, and I didn't even know it. Welcome, we're at Spark Summit 2017. Having so much fun talking to our guests I didn't know the camera was on. We are doing a talk with Cloudera, a couple of experts that we have here. First is Mark Grover, who's a software engineer and an author. He wrote the book, "Hadoop Application Architectures." Mark, welcome to the show. >> Mark: Thank you very much. Glad to be here. And just to his left we also have Jennifer Wu, and Jennifer's director of product management at Cloudera. Did I get that right? >> That's right. I'm happy to be here, too. >> Alright, great to have you. Why don't we get started talking a little bit more about what Cloudera is maybe introducing new at the show? I saw a booth over here. Mark, do you want to get started? >> Mark: Yeah, there are two exciting things that we've launched at least recently. There's Cloudera Altus, which is for transient workloads and being able to do ETL-like workloads, and Jennifer will be happy to talk more about that. And then there's Cloudera Data Science Workbench, which is this tool that allows folks to use data science at scale. So, get away from doing data science in silos on your personal laptops, and do it in a secure environment on cloud. >> Alright, well, let's jump into Data Science Workbench first. Tell me a little bit more about that, and you mentioned it's for exploratory data science. So give us a little more detail on what it does. >> Yeah, absolutely. So, there was a private beta for Cloudera Data Science Workbench earlier in the year and then it was GA a few months ago. And it's like you said, an exploratory data science tool that brings data science to the masses within an enterprise. Previously people used to have, it was this dichotomy, right? As a data scientist, I want to have the latest and greatest tools.
I want to use the latest version of Python, the latest notebook kernel, and I want to be able to use R and Python to be able to crunch this data and run my models in machine learning. However, on the other side of this dichotomy is the IT organization, where they want to make sure that all tools are compliant and that your clusters are secure, and your data is not going into places that are not secured by state of the art security solutions, like Kerberos for example, right? And of course if the data scientists are putting the data on their laptops and taking the laptop around to wherever they go, that's not really a solution. So, that was one problem. And the other one was if you were to bring them all together in the same solution, data scientists have different requirements. One may want to use Python 2.6. Another one may want to use 3.2, right? And so Cloudera Data Science Workbench is a new product that allows data scientists to visualize and do machine learning through this very nice notebook-like interface, share their work with the rest of their colleagues in the organization, but also allows you to keep your clusters secure. So it allows you to run against a Kerberized cluster, allows single sign-on to your web interface to Data Science Workbench, and provides a really nice developer experience in the sense that my workflow and my tools and my version of Python do not conflict with Jennifer's version of Python. We all have our own Docker and Kubernetes-based infrastructure that makes sure that we use the packages that we need, and they don't interfere with each other. >> We're going to go to Jennifer on Altus in just a few minutes, but George, first I'll give you a chance to maybe dig in on Data Science Workbench.
>> Two questions on the data science side: some of the really toughest nuts to crack have been sort of a common environment for the collaborators, but also the ability to operationalize the models once you've sort of agreed on them, and manage the lifecycle across teams, you know? Like, challenger champion, promote something, or even before that doing the A/B testing, and then sort of what's in production is typically in a different language from what, you know, it was designed in and sort of integrating it with the apps. Where is that on the road map? Cause no one really has a good answer for that. >> Yeah, that's an excellent question. In general I think it's the problem to crack these days. How do you productionalize something that was written by a data scientist in a notebook-like system onto the production cluster, right? And I think the part where the data scientist works in a different language than the language that's in production, I think that problem, the best I can say right now is to actually have someone rewrite that. Have someone rewrite that in the language you're going to use in production, right? I don't see that to be the more common part. I think the more widespread problem is even when the language is the same in production, how do you go about making the part that the data scientist wrote, the model or whatever that would be, run on a production cluster? And so, Data Science Workbench in particular runs on the same cluster that is being managed by Cloudera Manager, right? So this is a tool that you install, but that is available to you as a web server, as a web interface, and so that allows you to move your development machine learning algorithms from your Data Science Workbench to production much more easily, because it's all running on the same hardware and same systems. There's no separate Cloudera Managers that you have to use to manage the workbench compared to your actual cluster. >> Okay.
A tangential question, but one of the, the difficulties of doing machine learning is finding all the training data and, and sort of data science expertise to sit with the domain expert to, you know, figure out a proper model of features, things like that. One of the things we've seen so far from the cloud vendors is they take their huge datasets in terms of voice, you know, images. They do the natural language understanding, speech or rather text to speech, you know, facial recognition. Cause they have such huge datasets they can train on. We're hearing noises that they're going to take that down to the more mundane statistical kind of machine learning algorithms, so that you wouldn't be, like, here's an algorithm to do churn, you know, go to town, but that they might have something that's already kind of pre-populated that you would just customize. Is that something that you guys would tackle, too? >> I can't speak for the road map in that sense, but I think some of that problem needs to be tackled by projects like Spark for example. So I think as the stack matures, it's going to raise the level of abstraction as time goes on. And I think whatever benefits the Spark ecosystem will have will come directly to distributions like Cloudera. >> George: That's interesting. >> Yeah >> Okay >> Alright, well let's go to Jennifer now and talk about Altus a little bit. Now you've been on the Cube show before, right? >> I have not. >> Okay, well, familiar with your work. Tell us again, you're the product manager for Altus. What does it do, and what was the motivation to build it? >> Yeah, we're really excited about Cloudera Altus. So, we released Cloudera Altus in its first GA form in April, and we launched Cloudera Altus in a public environment at Strata London about two weeks ago, so we're really excited about this and we are very excited to now open this up to all of the customer base.
And what it is is a platform as a service offering designed to leverage, basically, the agility and the scale of cloud, and make a very easy to use type of experience to expose Cloudera capacity for, in particular, data engineering type of workloads. So the end user will be able to very easily, in a very agile manner, get data engineering capacity on Cloudera in the cloud, and they'll be able to do things like ETL and large scale data processing, productionized machine learning workflows in the cloud with this new data engineering as a service experience. And we wanted to abstract away the cloud, and cluster operations, and make the end user experience really easy. So, jobs and workloads are first-class objects. You can do things like submit jobs, clone jobs, terminate jobs, troubleshoot jobs. We wanted to make this very, very easy for the data engineering end user. >> It does sound like you've sort of abstracted away a lot of the infrastructure that you would associate with on-prem, and sort of almost make it, like, programmable and invisible. But, um, I guess my, one of my questions is when you put it in a cloud environment, when you're on-prem you have a certain set of competitors which is kind of restrictive, because you are the standalone platform. But when you go on the cloud, someone might say, "I want to use Redshift on Amazon," or Snowflake, you know, as the MPP SQL database at the end of a pipeline. And it's not just, I'm using those as examples. There's, you know, dozens, hundreds, thousands of other services to choose from. >> Yes. >> What happens to the integrity of that platform if someone carves off one piece? >> Right. So, interoperability and a unified data pipeline is very important to us, so we want to make sure that we can still service the entire data pipeline all the way from ingest and data processing to analytics.
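Jennifer's "jobs and workloads as first-class objects" idea — submit, clone, terminate as operations on a job handle — can be sketched abstractly. The following is a toy model for illustration only; the class and method names here are invented for this sketch and are not the real Cloudera Altus client API.

```python
# Toy model of "jobs as first-class objects" (submit, clone, terminate).
# Purely illustrative: these names are invented for this sketch and are
# NOT the actual Cloudera Altus API.
from dataclasses import dataclass, field, replace
from itertools import count

_next_id = count(1)

@dataclass
class Job:
    name: str
    entry_point: str                     # e.g. the ETL script to run
    status: str = "SUBMITTED"
    job_id: int = field(default_factory=lambda: next(_next_id))

def submit(name: str, entry_point: str) -> Job:
    return Job(name, entry_point)

def clone(job: Job) -> Job:
    # A clone reuses the job definition but gets a fresh id and lifecycle.
    return replace(job, job_id=next(_next_id), status="SUBMITTED")

def terminate(job: Job) -> Job:
    job.status = "TERMINATED"
    return job

etl = submit("nightly-etl", "etl_job.py")
rerun = clone(etl)
terminate(etl)
print(etl.status, rerun.status)  # TERMINATED SUBMITTED
```

The design point the sketch captures is that the user manipulates jobs directly, while cluster provisioning and teardown stay behind the abstraction.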
So our team has 24 different open source components that we deliver in the CDH distribution, and we have committers across the entire stack. We know the application, and we want to make sure that everything's interoperable, no matter how you deploy the cluster. So if you deploy data engineering clusters through Cloudera Altus, but you deployed Impala clusters for data marts in the cloud through Cloudera Director or through any other format, we want all these clusters to be interoperable, and we've taken great pains in order to make everything work together well. >> George: Okay. So how do Altus and Data Science Workbench interoperate with Spark? Maybe start with >> You want to go first with Altus? >> Sure, so, we, in terms of interoperability we focus on things like making sure there are no data silos so that the data that you use for your entire data lake can be consumed by the different components in our system, the different compute engines and different tools, and so if you're processing data you can also look at this data and visualize this data through Data Science Workbench. So after you do data ingestion and data processing, you can use any of the other analytic tools and then, and this includes Data Science Workbench. >> Right, and Data Science Workbench runs, for example, with the latest version of Spark you could pick, the currently latest released version of Spark, Spark 2.1. Spark 2.2 support is on the way, of course, and will be integrated soon after its release. For example you could use Data Science Workbench with your flavor of Spark 2 and you can run PySpark or Scala jobs on this notebook-like interface, be able to share your work, and because you're using Spark underneath the hood it uses YARN for resource management, and the Data Science Workbench itself uses Docker for configuration management, and Kubernetes for managing the resources of these Docker containers.
>> What would be, if you had to describe sort of the edge conditions and the sweet spot of the application, I mean you talked about data engineering. One thing we were talking to Matei Zaharia and Reynold Xin about, and Ali Ghodsi as well, was if you put Spark on a database, or at least a, you know, sophisticated storage manager, like Kudu, all of a sudden there's a whole new class of jobs or applications that open up. Have you guys thought about what that might look like in the future, and what new applications you would tackle? >> I think a lot of that benefit, for example, could be coming from the underlying storage engine. So let's take Spark on Kudu, for example. The inherent characteristics of Kudu today allow you to do updates without having to either deal with the complexity of something like HBase, or the crappy performance of dealing with HDFS compactions, right? So the sweet spot comes from Kudu's capabilities. Of course it doesn't support transactions or anything like that today, but imagine putting something like Spark on it and being able to use the machine learning libraries and, we have been limited so far in the machine learning algorithms that we have implemented in Spark by the storage system sometimes, and, for example new machine learning algorithms or the existing ones could be rewritten to make use of the update features, for example, in Kudu. >> And so, it sounds like it makes it, the machine learning pipeline might get richer, but I'm not hearing that, and maybe this isn't sort of in the near term sort of roadmap, the idea that you would build sort of operational apps that have these sophisticated analytics built in, you know, where the analytics, um, you've done the training but at run time, you know, the inferencing influences a transaction, influences a decision. Is that something that you would foresee? >> I think that's totally possible.
Again, at the core of it is the part that now you have one storage system that can do scans really well, and it can also do random reads and writes any place, right? And so that allows applications which were previously siloed, because one application ran off of HDFS and another application ran off of HBase and you had to correlate them, to just be one single application that you can use to train and then also use the trained data to then make decisions on the new transactions that come in. >> So that's very much within the sort of scope of imagination, or scope. That's part of sort of the ultimate plan? >> Mark: I think it's definitely conceivable now, yeah. >> Okay. >> We're up against a hard break coming up in just a minute, so you each get a 30-second answer here, so it's the same question. You've been here for a day and a half now. What's the most surprising thing you've learned that you think should be shared more broadly with the Spark community? Let's start with you. >> I think one of the great things that's happening in Spark today is people have been complaining about latency for a long time. So if you saw the keynote yesterday, you would see that Spark is making forays into reducing that latency. And if you are interested in Spark, using Spark, it's very exciting news. You should keep tabs on it. We hope to deliver lower latency as a community sooner. >> How long is one millisecond? (Mark laughs) >> Yeah, I'm largely focused on cloud infrastructure and I found here at the conference that, like, many many people are very much prepared to actually start taking more, you know, more POCs and more interest in cloud, and the response in terms of all of this on Altus has been very encouraging. >> Great. Well, Jennifer, Mark, thank you so much for spending some time here on the Cube with us today. We're going to come by your booth and chat a little bit more later. It's some interesting stuff.
And thank you all for watching the Cube today here at Spark Summit 2017, and thanks to Cloudera for bringing us these two experts. And thank you for watching. We'll see you again in just a few minutes with our next interview.
Reynold Xin, Databricks - #Spark Summit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad here with George Gilbert, George. >> Good to be here. >> Thanks for hanging with us. Well here's the other man of the hour here. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit. There are a lot of interesting things, a lot of interesting people who I meet with. >> Well I know you're a really humble guy but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time right? >> Yes, I've been contributing to Spark for about five or six years, and that's probably the most number of commits to the project, and lately I'm working more with other people to help design the roadmap for both Spark and Databricks. >> Well let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments? >> So, I think in general if we look at Spark, there are three directions I would say we're doubling down on. One, the first direction is deep learning. Deep learning is extremely hot and it's very capable, but as we alluded to earlier in a blog post, deep learning has reached sort of a mass-produced point in which it shows tremendous potential but the tools are very difficult to use. And we are hoping to democratize deep learning and do what Spark did to big data, to deep learning, with this new library called Deep Learning Pipelines.
What it does, it integrates different deep learning libraries directly in Spark and can actually expose models in SQL. So, even the business analysts are capable of leveraging that. So, that's one area, deep learning. The second area is streaming. Streaming, again, I think that a lot of customers have aspirations to actually shorten the latency and increase the throughput in streaming. So, the structured streaming effort is going to be generally available, and last month alone on the Databricks platform, I think our customers processed three trillion records, last month alone, using structured streaming. And we also have a new effort to actually push down the latency all the way to some millisecond range. So, you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing I think is a very mature area outside of big data, but from a big data point of view it's still pretty new, and there's a lot of use cases popping up there. And Spark, with approaches like the CBO, and also DBIO here in the Databricks Runtime, is actually substantially improving the performance and the capabilities of data warehousing features. >> We're going to dig in to some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind maybe about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debugability of the big data jobs. So many of them are fairly technical engineers and some of them are less sophisticated engineers and they have written jobs and sometimes the job runs slow. And so the performance engineer in me would think so how do I make the job run fast? A different way to actually solve that problem is how can we expose the right information so the customer can actually understand and figure it out themselves.
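For rough scale, the three-trillion-records-a-month streaming figure mentioned above implies an average rate on the order of a million records per second. A back-of-the-envelope calculation, assuming a 30-day month and perfectly uniform load (which real streaming traffic never is):

```python
# Back-of-the-envelope: average throughput implied by 3 trillion records
# processed in one month. Assumes a 30-day month and perfectly even load;
# real streaming traffic is bursty, so peaks are far higher than the mean.
records = 3_000_000_000_000
seconds_per_month = 30 * 24 * 60 * 60    # 2,592,000 seconds
avg_rate = records / seconds_per_month
print(f"{avg_rate:,.0f} records/second on average")
```

That works out to roughly 1.16 million records per second, averaged across every second of the month.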
This is why my job is slow and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, debugability. >> Debugability. >> Reynold: And visibility, yeah. >> Alright, awesome, George. >> So, let's go back and unpack some of those kind of juicy areas that you identified, on deep learning you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster but the really hard part, the compute intensive stuff, was training across a cluster. And so DeepLearning4J and I think Intel's BigDL, they were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are on their native environments? >> Yeah so, this is a very interesting question, obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras and Caffe. What the Deep Learning Pipelines library does is actually expose all these single-node deep learning tools, highly optimized for say even GPUs or CPUs, to be available as an estimator, like a module in a pipeline of the machine learning pipeline library in Spark. So, now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. So, when you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. You usually have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out what is the best model I can get. This is where actually Spark really shines. When you combine Spark with some deep learning library, be it BigDL or be it MXNet, be it TensorFlow, you could be using Spark to distribute that training and then do cross validation on it. So you can actually find the best model very quickly.
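The "run over a hundred experiments" pattern Reynold describes — sweep a parameter grid, score every candidate, keep the best — is exactly what Spark's ML tuning tools (ParamGridBuilder plus CrossValidator) parallelize across a cluster. A single-machine sketch of the idea, with a made-up scoring function standing in for "train a model with these parameters and cross-validate it":

```python
# Single-machine sketch of a hyperparameter grid search. score() is a
# stand-in for real model training + cross-validation; Spark's contribution
# is evaluating the candidates (and CV folds) as parallel tasks.
from itertools import product

def score(params):
    # Hypothetical validation score that peaks at lr=0.1, depth=5.
    return 1.0 - abs(params["lr"] - 0.1) - 0.01 * abs(params["depth"] - 5)

# 3 learning rates x 3 depths = 9 candidate configurations.
grid = [{"lr": lr, "depth": d}
        for lr, d in product([0.01, 0.1, 0.5], [3, 5, 7])]

best = max(grid, key=score)   # Spark would run this sweep across the cluster
print(best)  # {'lr': 0.1, 'depth': 5}
```

The "best model" selection step is the same whether the trials ran in a loop or as distributed tasks; only the wall-clock time changes.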
And Spark takes care of all the job scheduling, all the fault tolerance properties, and how you read data in from different data sources. >> And without my dropping too much in the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all that now can be managed by Spark? >> In that library, Spark will be able to actually take care of picking the best model out of it. And there are different ways you can design how you define the best. The best could be some average of some different models. The best could be just pick one out of these. The best could be maybe there's a tree of models that you classify it on. >> George: And that's a hyperparameter configuration choice? >> So that is actually built-in functionality in Spark's machine learning pipeline. And now what we're doing is now you can actually plug all those deep learning libraries directly into that as part of the pipeline to be used. Another, maybe just to add, >> Yeah, yeah, >> Another really cool functionality of the deep learning pipeline is transfer learning. So as you said, deep learning takes a very long time, it's very computationally demanding. And it takes a lot of resources, expertise to train. But with transfer learning what we allow the customers to do is they can take an existing deep learning model that was trained in a different domain, and then retrain it on a very small amount of data very quickly, and they can adapt it to a different domain. That's sort of how the demo with the James Bond car works. So there is a general image classifier that we trained on probably just a few thousand images. And now we can actually detect whether a car is James Bond's car or not. >> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model for a similar situation.
I want to, in the time we have, there's always been this debate about whether Spark should manage state, whether it's a database or a key-value store. Tell us how the thinking about that has evolved, and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define. But the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantage of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of building on top of existing storage systems. There's actually a lot of opportunity for optimization in how you can leverage the specific properties of the underlying storage system to get maximum performance. For example, how are you doing intelligent caching, how do you start thinking about building indexes against the data that's stored, for scan workloads. >> With Tungsten, you take advantage of the latest hardware, and we're getting more memory-intensive systems, and now that the Catalyst optimizer has a cost-based optimizer, or will, and large memory. Can you change how you go about knowing what data you're managing in the underlying system and therefore achieve a tremendous acceleration in performance?
>> This is actually one area we invested in with the DBIO module as part of Databricks Runtime, and what DBIO does, a lot of this is still in progress, but for example, we're adding some form of indexing capability to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups, or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're actually adding indexing functionality on top of it as part of DBIO. >> And so what would be the application profiles? Is it just for the analytic queries, or can you do the point look-ups and updates in that sort of scenario too? >> So it's interesting you're talking about updates. Updates is another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize, for both use cases of doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates, rather than doing transactional updates in which every time you swipe a credit card, some record gets updated. That probably belongs more on the transactional databases like Oracle, or MySQL even. >> What about when you think about people who are going to run, they started out with Spark on prem, they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like? >> Really interesting, it's kind of two questions maybe. The first is the hybrid on prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute.
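The data-skipping idea Reynold sketches for DBIO, summarize each storage block so that point look-ups and selective scans touch only the relevant blocks, is the classic zone-map technique. A minimal stdlib sketch (the block layout and numbers are invented; actual DBIO internals are not public):

```python
def build_zone_map(blocks):
    # One (min, max) summary per storage block; tiny compared to the data.
    return [(min(b), max(b)) for b in blocks]

def point_lookup(blocks, zone_map, value):
    # Skip any block whose [min, max] range cannot contain the value,
    # so a point look-up touches only the blocks that might match.
    hits, scanned = [], 0
    for block, (lo, hi) in zip(blocks, zone_map):
        if lo <= value <= hi:
            scanned += 1
            hits.extend(v for v in block if v == value)
    return hits, scanned

# Data laid out in sorted blocks, as a clustered storage layout would produce.
blocks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
zmap = build_zone_map(blocks)
hits, scanned = point_lookup(blocks, zmap, 742)
print(hits, scanned)  # → [742] 1
```

Ten blocks exist, but only one is read; on S3, where every block touched is a network round trip, that pruning is where the speedup comes from.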
So when you want to move, for example, workloads from on prem to the cloud, the one you care the most about is probably actually the data, 'cause the compute, it doesn't really matter that much where you run it, but data's the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on prem, relying on the caching solution we have that minimizes the data transfer over time. And that is one route I would say is pretty popular. Another one is, with Amazon you can literally just give them a Snowball. You give them hard drives, with trucks; the trucks will ship your data directly, and it gets put into S3. With IoT, a common pattern we see is a lot of the edge devices would be actually pushing the data directly into some firehose like Kinesis or Kafka, or, I'm sure Google and Microsoft both have their own variants of that. And then you use Spark to directly subscribe to those topics and process them in real time with structured streaming. >> And so would Spark be down, let's say, at the site level, if it's not on the device itself? >> It's an interesting thought, and maybe one thing we should actually consider more in the future is how do we push Spark to the edges. Right now it's more of a centralized model in which the devices push data into Spark, which is centralized somewhere. I've seen, for example, I don't remember the exact use case, but it has to do with some scientific experiment in the North Pole. And of course there you don't have a great uplink for all the data to be transferred back to some national lab, so rather they would do smart parsing there and then ship the aggregated result back. There's another one, but it's less common. >> Alright, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on for the benefit of everybody?
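The IoT ingest pattern Reynold describes, devices publish into a firehose and a streaming job drains it in micro-batches while maintaining per-key state, can be sketched with a stdlib stand-in for the Kafka/Kinesis topic (all names and numbers here are invented):

```python
from collections import deque

class Firehose:
    # Stand-in for a Kafka/Kinesis topic: devices append, a consumer drains.
    def __init__(self):
        self.topic = deque()
    def publish(self, record):
        self.topic.append(record)
    def poll(self, max_records):
        batch = []
        while self.topic and len(batch) < max_records:
            batch.append(self.topic.popleft())
        return batch

def process_micro_batch(batch, state):
    # The streaming job keeps running aggregates per device, the way a
    # structured-streaming query would maintain keyed state.
    for device, reading in batch:
        count, total = state.get(device, (0, 0.0))
        state[device] = (count + 1, total + reading)

hose = Firehose()
for i in range(10):
    hose.publish(("sensor-a" if i % 2 == 0 else "sensor-b", float(i)))

state = {}
while True:
    batch = hose.poll(max_records=4)  # micro-batch size bounds latency
    if not batch:
        break
    process_micro_batch(batch, state)

print(state)  # → {'sensor-a': (5, 20.0), 'sensor-b': (5, 25.0)}
```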
>> In general, Spark came along with two focuses. One is performance, the other one's ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think is actually pretty usable. The systems are fast enough. Now we should work on actually making (mumbles) even easier to use. That's also what we focus a lot on at Databricks. >> David: Democratizing access, right? >> Absolutely. >> Alright, well, Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show. >> Thank you very much, David and George. >> Thank you all for watching, here we are at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.
Jags Ramnarayan, SnappyData - Spark Summit 2017 - #SparkSummit - #theCUBE
(techno music) >> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> You are watching the Spark Summit 2017 coverage by theCUBE. I'm your host David Goad, joined by George Gilbert. How you doing, George? >> Good to be here. >> And honored to introduce our next guest, the CTO from SnappyData, wow, we were lucky to get this guy. >> Thanks for having me. >> David: Jags Ramnarayan, Jags, thanks for joining us. >> Thanks, thanks for having me. >> And for people who may not be familiar, maybe tell us, what does SnappyData do? >> So SnappyData, in a nutshell, is taking Spark, which is a compute engine, and in some sense augmenting the guts of Spark so that Spark truly becomes a hybrid database. A single data store that's capable of taking Spark streams, doing transactions, providing mutable state management in Spark, but most importantly being able to turn around and run analytical queries on that state that is continuously merging. That's it in a nutshell. Let me just say a few things: SnappyData itself is a startup that is spun out of Pivotal. We've been out of Pivotal for roughly about a year, so the technology itself was to a great degree incubated within Pivotal. It's a product called GemFire within VMware and Pivotal. So we took the guts of GemFire, which is an in-memory database designed for transactional, low-latency, high-concurrency scenarios, and we are sort of fusing it, that's the key thing, fusing it into Spark, so that now Spark becomes significantly richer, not just as a compute platform, but as a store. >> Great, and we know this is not your first Spark Summit, right? How many have you been to? Lost count? >> Boy, let's see, three or four Spark Summits now; if I include this year, four to five. >> Great, so an active part of the community. What were you expecting to learn this year, and have you been surprised by anything?
You know, it's always wonderful to see, I mean, every time I come to Spark, it's just a new set of innovations, right? I mean, when I first came to Spark, it was a mix of, let's talk about data frames, all of these, let's optimize my queries. Today you come, I mean, there is such a wide spectrum of amazing new things that are happening. It's just mind boggling. Right from AI techniques, structured streaming, and the real-time paradigm, and sort of this confluence that Databricks brings more to it. How can I create a confluence through a unified mechanism, where it is really brilliant, is what I think. >> Okay, well, let's talk about how you're innovating at SnappyData. What are some of the applications or current projects you're working on? >> So, a number of things. I mean, GE is an investor in SnappyData, so we're trying to work with GE in the industrial IoT space. We're working with large health care companies, also in the IoT space. So the pattern with SnappyData is one where there are a lot of high-velocity streams of data emerging, where the streams could be, for instance, Kafka streams driving Spark streams, but streams could also be operational databases. Your Postgres instance and your Cassandra database instance, and they're all generating continuous changes to data that's emerging in an operational world. Can I suck that in and almost create a replica of the state that might be emerging in the source operational environment, and still allow interactive analytics at scale for a number of concurrent users on live data? Not cube data, not pre-aggregated data, but live data itself, right? Being able to almost give you Google-like speeds on live data. >> George, we've heard people talking about this quite a bit.
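The "replica of the source state, queryable live" idea Jags describes can be sketched in a few lines of stdlib Python: change events stream in from the operational side, and analytics run directly on the mutable in-memory state rather than on an exported cube (the event format and numbers are invented for the example):

```python
class LiveReplica:
    # In-memory mutable replica of an operational table: change events
    # stream in (insert/update/delete) and analytics run on live state.
    def __init__(self):
        self.rows = {}

    def apply(self, event):
        op, key, value = event
        if op == "delete":
            self.rows.pop(key, None)
        else:  # insert and update look the same against a keyed replica
            self.rows[key] = value

    def avg(self):
        # Interactive query over live data: no batch export, no pre-cube.
        return sum(self.rows.values()) / len(self.rows)

replica = LiveReplica()
changes = [("insert", "t1", 100.0), ("insert", "t2", 300.0),
           ("update", "t1", 200.0), ("insert", "t3", 100.0),
           ("delete", "t3", None)]
for ev in changes:
    replica.apply(ev)
print(replica.avg())  # → 250.0
```

The query sees the effect of every update and delete the moment it is applied, which is the property that pre-aggregated cubes give up.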
Yeah, so Jags, as you said upfront, Spark was conceived as sort of a general-purpose, I guess, analytic compute engine, and adding a DBMS to it, like sort of not bolting it on, but deeply integrating it, so that the core data structures now have DBMS properties, like transactionality, that must make a huge change in the scope of applications that are applicable. Can you describe some of those for us? >> Yeah. The classic paradigm today that we find time and again is the so-called SMACK stack, right? I mean, there was the lambda stack, now there's a SMACK stack. Which is really about Spark running on Mesos, but really using Spark streaming as an ingestion capability, and there is continuous state that is emerging that I want to write into Cassandra. So what we find very quickly is that the moment the state is emerging, I want to throw a business intelligence tool on top and immediately do live dashboarding on that state that is continuously changing and emerging. So what we find is that the first part, which is the high-speed ingest, the ability to transform these data sets, cleanse the data sets, get the cleansed data into Cassandra, works really well. What is missing is this ability to say, well, how am I going to get insight? How can I ask interesting, insightful questions and get responses immediately on that live data, right? And so the common problem there is, the moment I have Cassandra working, let's say, with Spark, every time I run an analytical query, you only have two choices. One is use the parallel connector to pull in the data set from Cassandra, right, and now unfortunately, when you do analytics, you're working with large volumes. And every time I run even a simple query, all of a sudden I could be pulling in 10 gigabytes, 20 gigabytes of data into Spark to run the computation. Hundreds of seconds lost. Nothing like interactive, it's all about batch querying.
So how can I turn around and say that if stuff changes in Cassandra, I can have an immediate real-time reflection of that mutable state in Spark, on which I can run queries rapidly. That's a very key aspect to us. >> So you were telling me earlier that you didn't see, necessarily, a need to replace entirely the Cassandra in the SMACK stack, but to complement it. >> Jags: That's right. >> Elaborate on that. >> So our focus, much like Spark, is all about in-memory state management and in-memory processing. And Cassandra, realistically, is really designed to say, how can I scale to the petabyte, right, for key-value operations, semi-structured data, what have you. So we think there are a number of scenarios where you still want Cassandra to be your store, because in some sense a lot of these guys have already adopted Cassandra in a fairly big way. So you want to say, hey, leave your petabyte-level volume in there, and you can essentially work with the real-time state, which could still be many terabytes of state, essentially in main memory, and that's what we work with and specialize in. And we're also, I mean, I can touch on this approximate query processing technology, which is the other key part here, to say, hey, I can't really use 1,000 cores and 1,000 machines just so that you can do your job really well. So one of the techniques we are adopting, which even the Databricks guys started with BlinkDB, essentially an approximate query processing engine: we have our own approximate query processing engine as an adjunct, essentially, to our store. What that essentially means is to say, can I take a billion records and synthesize something really, really small, using smart sampling techniques, sketching techniques, essentially statistical structures, that can be stored along with Spark, in Spark memory itself, and fuse it with the Spark Catalyst query engine.
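The sampling half of that synopsis idea is easy to make concrete: a one-pass reservoir sample is thousands of times smaller than the data yet answers aggregate queries to within a small error. A stdlib sketch (the data and sizes are invented; SnappyData's actual engine uses stratified samples and sketches, not this naive version):

```python
import random

def reservoir_sample(stream, k, rng):
    # Classic reservoir sampling: one pass, O(k) memory, uniform sample.
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample

rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(200_000)]

# The synopsis is hundreds of times smaller than the data,
# yet its average tracks the true average closely.
sample = reservoir_sample(population, 1000, rng)
approx = sum(sample) / len(sample)
exact = sum(population) / len(population)
print(round(approx, 1), round(exact, 1))
```

The appeal is that the synopsis can sit in Spark memory and answer dashboard-style aggregates interactively, while the full data stays in the store.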
So that as you run your query, we can very smartly figure out whether we can use the approximate data structures to answer the question extremely quickly. Even when the data would be in petabyte volume, I have these data structures that now take maybe gigabytes of storage only. >> So, hopefully not getting too, too technical: the Spark Catalyst query optimizer, like an Oracle query optimizer, knows about the data that it's going to query, only in your case, you're taking what Catalyst knows about Spark and extending it with what's stored in your native, also Spark-native, data structures. >> That's right, exactly. So think about it: an optimizer always takes a query plan and says, here are all the possible plans you can execute, and here is the cost estimate for these plans. We essentially inject more plans into that, and hopefully our plan is even more optimized than the plans that the Spark Catalyst engine came up with. And Spark is beautiful because the Catalyst engine is a very pluggable engine. So you can essentially augment that engine very easily. >> So you've been out in the marketplace, whether in alpha, beta, or now production, for enough time that the community is aware of what you've done. What are some of the areas that you're being pulled into that people didn't associate Spark with? >> So more often, we end up in situations where they're looking at SAP HANA, as an example, maybe MemSQL, maybe just Postgres, and all of a sudden, there are these hybrid workloads, which is the Gartner term of HTAP, so there are a lot of HTAP use cases we get pulled into. So there's no Spark, but we get pulled into it because we're just a hybrid database. That's what people look at us as, essentially. >> Oh, so you pull Spark in because that's just part of your solution. >> Exactly, right. So think about it: Spark is not just data frames and a rich API, it also has a SQL interface, right. I can essentially execute SQL, select SQL.
Of course we augment that SQL so that now you can do what you expect from a database: an insert, an update, a delete, can I create a view, can I run a transaction? So all of a sudden, it's not just a Spark API, but what we provide looks like a SQL database itself. >> Okay, interesting. So tell us, in the work with GE, they're among the first that have sort of educated the world that there's so much data coming off devices, that we have to be intelligent about what we filter and send to the cloud; we train models, potentially, up there, we run them closer to the edge, so that we get low-latency analytics. But you were telling us earlier that there are alternatives, especially when you have such an intelligent database working both at the edge and in the cloud. >> Right, so that's a great point. See, what's happening with sort of a lot of these machine learning models is that these models are learned on historical data sets. And quite often, especially if you look at predictive maintenance, those classes of use cases in industrial IoT, the patterns could evolve very rapidly, right? Maybe because of climate changes and, let's say, for a windmill farm, there are a few windmills that are breaking down so rapidly it's affecting everything else, in terms of the power generation. So being able to sort of update the model itself, incrementally and in near real-time, is becoming more and more important. >> David: Wow. >> It's still a fairly academic research kind of area, but for instance, we are working very closely with the University of Michigan to sort of say, can we use some of these approximate techniques to incrementally also learn a model. Right, sort of incrementally augment a model, potentially at the edge, or even inside the cloud, for instance. >> David: Wow.
>> So if you're doing it at the edge, would you be updating the instance of the model associated with that locale, and then would the model in the cloud be sort of like the master, and then that gets pushed down, until you have an instance and a master. >> That's right. See, most typically what will happen is you have computed a model using a lot of historical data. You have typically supervised techniques to compute a model. And you take that model and inject it potentially into the edge, so that it can compute that model, which is the easy part, everybody does that. So you continue to do that, right, because you really want the data scientists to be poring through those paradigms, looking at and sort of tweaking those models. But for a certain number of models, even the models injected at the edge, can I re-tweak that model in an unsupervised way, is kind of the play we're also kind of venturing into slowly, but that's all in the future. >> But if you're doing it unsupervised, do you need metrics that sort of flag, like what is the champion challenger, and figure out-- >> I should say that, I mean, not all of these models can work in this very real-time manner. So, for instance, we've been looking at saying, can we reclassify NPC, the name place classifier, to essentially do incremental classification, or incrementally learn the model. Clustering approaches can actually be done in an unsupervised way in an incremental fashion. Things like that. There's a whole spectrum of algorithms that really need to be thought through for approximate algorithms to actually apply. So it's still active research. >> Really great discussion, guys. We've just got about a minute to go before the break, really great stuff. I don't want to interrupt you. But maybe switch real quick to business drivers. Maybe with SnappyData or with other peers you've talked to today. What business drivers do you think are going to affect the evolution of Spark the most?
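The incremental, unsupervised updating Jags mentions for clustering-style approaches rests on a simple building block: an online mean that absorbs one observation at a time instead of retraining from scratch. A stdlib sketch (the points are invented):

```python
def update_centroid(centroid, count, point):
    # Online mean update: new_mean = mean + (x - mean) / n.
    # This is the building block that lets a clustering-style model
    # absorb each new observation without revisiting old data.
    count += 1
    centroid = [c + (x - c) / count for c, x in zip(centroid, point)]
    return centroid, count

centroid, n = [0.0, 0.0], 0
for point in [(2.0, 0.0), (4.0, 0.0), (0.0, 6.0)]:
    centroid, n = update_centroid(centroid, n, point)
print(centroid, n)  # → [2.0, 2.0] 3
```

Because each update is O(dimensions) and needs no history, the same model instance can be refreshed at the edge as readings arrive and periodically reconciled with a master in the cloud.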
I mean, for us, as a small company, the single biggest challenge we have, it's like what one of you guys said, analysts, it's raining databases out there. And the ability to constantly educate people on how you can realize a very next-generation data pipeline in a very simplified manner is the challenge we are running into, right. I mean, I think the business model for us is primarily how many people are going to go and say, yes, batch-related analytics is important, but incrementally, for competitive reasons, we want to be playing that real-time analytics game a lot more than before, right? So that's going to be big for us, and hopefully we can play a big part there, along with Spark and Databricks. >> Great, well, we appreciate you coming on the show today and sharing some of the interesting work that you're doing. George, thank you so much. And Jags, thank you so much for being on theCUBE. >> Thanks for having me on, I appreciate it. Thanks, George. >> And thank you all for tuning in. Once again, we have more to come, today and tomorrow, here at Spark Summit 2017, thanks for watching. (techno music)
Matthew Hunt | Spark Summit 2017
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, we're talking about data science and engineering at scale, and we're having a great time, aren't we, George? >> We are! >> Well, we have another guest now we're going to talk to, I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg. Matt, thanks for joining us! >> My pleasure. >> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark Summits have you been to? >> Almost all of them, actually, it's quite amazing to see the 10th one, yes. >> And you're pretty actively involved with the user group on the east coast? >> Matt: Yeah, I run the New York users group. >> Alright, well, what's that all about? >> We have some 2,000 people in New York who are interested in finding out what goes on, which technologies to use, and what people are working on. >> Alright, so hopefully you saw the keynote this morning with Matei? >> Yes. >> Alright, any comments or reactions from the things that he talked about as priorities? >> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming in advance, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people-- >> Well, the one millisecond today was kind of a wow one-- >> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. >> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down on some of the details. >> Sure.
And Bloomberg is a large company with 4,000-plus developers, we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news in the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there's some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess, trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, that's a natural application, another one would be regulatory, there's a regulatory system called MiFID, or MiFID II, the regulations required for Europe, you have to be able to record every trade for seven years, provide daily reports, there's clearly a lot around that, and then I would also just say, our other internal databases have significant analytics that can be done, which is just kind of scraping the surface. >> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far? >> I would definitely say that we have some things that are latency constrained, it tends to be not like high frequency trading, where you care about microseconds, but milliseconds are important, how long does it take to get an answer, but I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always. >> And so when you say coupled, is it because it's a trade-off, or 'cause you need both? >> Right, so it's a little bit of both, for a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often, greater efficiencies. 
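The latency-versus-efficiency coupling Matt describes is easiest to see with batching: a fixed per-batch overhead gets amortized over more records as batches grow, while a record may wait longer before being processed. A toy cost model in plain Python (every constant here is invented purely for illustration):

```python
def total_cost(n_records, batch_size, per_batch_overhead=1.0, per_record=0.01):
    # Fixed overhead is paid once per batch, so bigger batches raise
    # throughput; but a record may sit a whole batch before it's handled.
    n_batches = -(-n_records // batch_size)  # ceiling division
    processing = n_batches * per_batch_overhead + n_records * per_record
    worst_wait = batch_size * per_record + per_batch_overhead
    return processing, worst_wait

for size in (1, 10, 100):
    processing, wait = total_cost(10_000, size)
    print(size, round(processing), round(wait, 2))
```

Total processing cost falls sharply as the batch grows while the worst-case wait rises, which is exactly the knob Matt is describing: there is an upper latency threshold you can accept, and below it you buy efficiency with batch size.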
Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other. >> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building? >> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one, but part of what's useful is, often when you sit down to try and determine what is the main cause of latency, you have to look at the full profile of the stack it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say a lot of what the Databricks guys and the Spark community have worked on over the years is connected to that, Project Tungsten and so on, all these things that make things much slower, much less efficient than they need to be, and we can close that gap a lot, I would say, from the very beginning. >> This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take end-to-end control of continuous apps, for simplicity and performance, and so there's this, we'll write with transactional consistency, so we're assuring the customer of exactly-once semantics when we write to a file system or database or something like that.
But, Spark has never really done native storage, whereas Matei came here on the show earlier today and said, "Well, Databricks as a company is going to have to do something in that area," and he talks specifically about databases, and he said, he implied that Apache Spark, separate from Databricks, would also have to do more in state management, I don't know if he was saying key value store, but how would that open up a broader class of apps, how would it make your life simpler as a developer? >> Right. Interesting and great question, this is kind of a subject that's near and dear to my own heart, I would say. So part of that, when you take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted to be, which is a form of a universal computation engine. So there's a lot of value, if you can learn one small skillset, but it can work in a wide variety of use cases, whether it's streaming or at rest or analytics, and plug other things in. As always, there's a gap in any such system between theory and reality, and how much can you close that gap, but as for storage systems, this is something that, you and I have talked about this before, and I've written about it a fair amount too, Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need a file store, a distributed file store, you need a database with generally transactional semantics because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and then there's how close can you get to that, versus how much you have to fit other parts together, very interesting question.
>> So, so far, they've sort of outsourced that to DIY, do-it-yourself, but if they can find a sufficiently scalable relational database, they can do the sort of analytical queries, and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that, like, Cassandra would be, sort of, the database that would handle all updates, and then some amount of those would be filtered through to a multi-model DBMS. When I say multi-model, I mean handles transactions and analytics. Knowing that you would have the option to drop that out, what applications would you undertake that you couldn't do right now, where the theme was, we're going to take big data apps into production, and then the competition that they show for streaming is Kafka and Flink, so what does that do to that competitive balance? >> Right, so how many pieces do you need, and how well do they fit together is maybe the essence of that question, and people ask that all the time, and one of the limits has been, how mature is each piece, how efficient is it, and do they work together? And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options, so Kafka, for example, has a lot of usage, and the industry really seems to be settling on it as what people are using for inbound streaming data, for ingest, I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases.
So at Bloomberg, we actually developed our own relational database that was designed for low latency and very high reliability, so we actually just open-sourced that a few weeks ago, it's called ComDB2, and the reason we had to do that was the industry solutions at the time, when we started working on that, were inadequate to our needs, but we look at how long that took to develop for these other systems and think, that's really hard for someone else to get right, and so, if you need a database, which everyone does, how can you make that work better with Spark? And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming and integrating a database directly, although there's interesting possibilities with that too. How do you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take data frames, the core data structure in Spark, and give it transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributed key value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems, versus relational databases, which generally started as single image on a single machine, and are moving towards federation and distribution, and there's been a lot of that with Postgres, for example. One of the questions would be, is it just knobs, or why don't they work well together?
And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done, just as we have database query optimizers, why not, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is very efficient copying of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing, as soon as you hop from one system to another, all of a sudden, you have the serialization computational expense, that's a problem, we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know, where should a query be executed and which operator should I push down, you need something that I think of as a meta-optimizer, and also, knowledge about the shape of the data, or underlying statistics, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable.
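The pushdown idea being described — run the filter where the data lives so fewer rows have to cross the system boundary — can be sketched in miniature. The "database" here is just a Python list and the transfer counter stands in for the serialization cost; both are invented for illustration:

```python
# Toy illustration of predicate pushdown: the same query executed with the
# filter applied at the source versus after shipping every row across the
# system boundary. The "transferred" count stands in for copy/serialization cost.

ROWS = [{"sym": "IBM", "px": 100 + i} for i in range(1000)] + \
       [{"sym": "AAPL", "px": 200 + i} for i in range(10)]

def query_without_pushdown(pred):
    transferred = len(ROWS)                 # ship everything, filter locally
    result = [r for r in ROWS if pred(r)]
    return result, transferred

def query_with_pushdown(pred):
    result = [r for r in ROWS if pred(r)]   # filter runs at the source
    transferred = len(result)               # only matching rows cross over
    return result, transferred

pred = lambda r: r["sym"] == "AAPL"
res_a, moved_a = query_without_pushdown(pred)
res_b, moved_b = query_with_pushdown(pred)
```

Both plans return identical results; they differ only in how many rows pay the boundary-crossing cost, which is the quantity a meta-optimizer of the kind described would reason about.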
The hardest thing of all, anywhere, in any organization, is to get people working together, but the more people work together to enable these pieces, how do I efficiently work with databases, or have these better optimizations make streaming more mature, the more people can use it in practice, and that's why people develop software, is to actually tackle these real-world problems, so, I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segment, Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE, please stay with us, as Spark Summit 2017 will be back in a few moments.
Rob Lantz, Novetta - Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco it's the CUBE covering Spark Summit 2017 brought to you by Databricks. >> Welcome back to the CUBE, we're continuing to talk with people who are not just talking about things but doing things. We're happy to have, from Novetta, the Director of Predictive Analytics, Mr. Rob Lantz. Rob, welcome to the show. >> Thank you. >> And off to my right, George, how are you? >> Good. >> We've introduced you before. >> Yes. >> Well let's talk to the guest. Let's get right to it. I want to talk to you a little bit about what does Novetta do and then maybe what apps you're building using Spark. >> Sure, so Novetta is an advanced analytics company, we're medium sized and we develop custom hardware and software solutions for our customers who are looking to get insights out of their big data. Our primary offering is a hardened entity resolution engine. We scale up to billions of records and we've done that for about 15 years. >> So you're in the business end of analytics, right? >> Yeah, I think so. >> Alright, so talk to us a little bit more about entity resolution, and that's all Spark right? This is your main priority? >> Yes, yes, indeed. Entity resolution is the science of taking multiple disparate data sets, traditional big data, and taking records from those and determining which of those are actually the same individual or company or address or location and which of those should be kept separate. We can aggregate those things together and build profiles and that enables a more robust picture of what's going on for an organization. >> Okay, and George? >> So what did you do... What was the solution looking like before Spark and how did it change once you adopted Spark? >> Sure, so with Spark, it enabled us to get a lot faster. Obviously those computations scaled a lot better. Before, we were having to write a lot of custom code to get those computations out across a grid.
When we moved to Hadoop and then Spark, that made us, let's say, able to scale those things and get it done overnight or in hours and not weeks. >> So when you say you had to do a lot of custom code to distribute across the cluster, does that include when you were working with MapReduce, or was this even before the Hadoop era? >> Oh it was before the Hadoop era and that predates my time so I won't be able to speak expertly about it, but to my understanding, it was a challenge for sure. >> Okay so this sounds like a service that your customers would then themselves build on. Maybe an ETL customer would figure out master data from a repository that is not as carefully curated as the data warehouse or similar applications. So who is your end customer and how do they build on your solution? >> Sure, so the end customer typically is an enterprise that has large volumes of data that deal in particular things. They collect, it could be customers, it could be passengers, it could be lots of different things. They want to be able to build profiles about those people or companies, like I said, or locations, any number of things can be considered an entity. The way they build upon it then is how they go about quantifying those profiles. We can help them do that, in fact, some of the work that I manage does that, but oftentimes they do it themselves. They take the resolved data, and that gets resolved nightly or even hourly. They build those profiles themselves for their own purpose. >> Then, to help us think about the application or the use case holistically, once they've built those profiles and essentially harmonized the data, what does that typically feed into? >> Oh gosh, any number of things really. Oh, shoot. We've got deployments in AWS in the cloud, we've got deployments, lots of deployments on premises obviously. That can go anywhere from relational databases to graph query language databases. Lots of different places from there for sure.
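A toy sketch of the record-matching step at the heart of entity resolution — the alias table, normalization rule, and records below are invented for illustration, and a production engine like Novetta's adds fuzzy matching, scoring, and transitive merging far beyond this:

```python
# Minimal entity-resolution sketch: normalize each record to a key, then
# group records that share a key as candidate matches. Records with the
# same key are treated as the same person; different keys stay separate.

from collections import defaultdict

NICKNAMES = {"rob": "robert", "bob": "robert"}  # tiny illustrative alias table

def normalize(record):
    first, last = record["name"].lower().split()
    first = NICKNAMES.get(first, first)     # fold nicknames onto one form
    return (first, last, record["zip"])

def resolve(records):
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec)].append(rec["id"])
    return list(clusters.values())

records = [
    {"id": 1, "name": "Rob Lantz",    "zip": "22102"},
    {"id": 2, "name": "Robert Lantz", "zip": "22102"},  # alias: same entity
    {"id": 3, "name": "Robert Lantz", "zip": "94105"},  # different zip: kept separate
]
clusters = resolve(records)
```

Records 1 and 2 collapse into one entity while record 3 stays separate — the "which are the same, which should be kept separate" decision described above, in its simplest possible form.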
>> Okay so, this actually sounds like everyone talks now about machine learning informing every category of software. This sounds like you take the old style ETL, where master data was a value add layer on top, and that was, it took a fair amount of human judgment to do. Now, you're putting that service on top of ETL and you're largely automating it, probably with, I assume, some supervised guidance, supervised training. >> Yes, so we're getting into the machine learning space as far as entity extraction and resolution and recognition because more and more data is unstructured. But machine learning isn't necessarily a baked-in part of that. Actually entity resolution is a prerequisite, I think, for quality machine learning. So if Rob Lantz is a customer, I want to be able to know what has Rob Lantz bought in the past from me. And maybe what is Rob Lantz talking about in social media? Well I need to know how to figure out who those people are, and whether Rob Lantz and Robert Lantz are the same person or completely different people; I don't want to collapse those two things together. Then I would build machine learning on top of that to say, right, now what's his behavior going to be in the future. But once I have that robust profile built up, I can derive a lot more interesting features with which to apply the machine learning. >> Okay, so you are a Databricks customer and there's also a burgeoning partnership. >> Rob: Yeah, I think that's true. >> So talk to us a little bit about what are some of the frustrations you had before adopting Databricks and maybe why you chose it. >> Yeah, sure. So the frustrations primarily with a traditional Hadoop environment involved having to go from one customer site to another customer site with an incredibly complex technology stack and then do a lot of the cluster management for those customers even after they'd already set it up because of all the inner workings of Hadoop and that ecosystem.
Getting our Spark application installed there, we had to penetrate layers and layers of configuration in order to tune it appropriately to get the performance we needed. >> David: Okay, and were you at the keynote this morning? >> I was not, actually. >> Okay, I'm not going to ask you about that then. >> Ah. >> But I am going to ask you a little bit about your wishlist. You've been talking to people maybe in the hallway here, you just got here today but, what do you wish the community would do or develop, what would you like to learn while you're here? >> Learning while I'm here, I've already picked up a lot. So much going on and it's such a fast paced environment, it's really exciting. I think if I had a wishlist, I would want a more robust ML Lib, machine learning library. All the things that you can get in traditional scientific computing stacks, moved onto Spark ML Lib for easier access on a cluster, would be great. >> I thought several years ago ML Lib took over from Mahout as the most active open source community for adding, really, I thought, scale-out machine learning algorithms. If it doesn't have it all now, or maybe all is something you never reach, kind of like the Red Queen effect, you know? >> Rob: For sure, for sure. >> What else is attracting these scale-out implementations of the machine learning algorithms? >> Um? >> In other words, what are the platforms? If it's not Spark then... >> I don't think it exists frankly, unless you write your own. I think that would be the way to go. That's the way to go about it now. I think what organizations are having to do with machine learning in a distributed environment is just go with good enough, right. Whereas maybe some of the ensemble methods that are, actually aren't even really cutting edge necessarily, but you can really do a lot of tuning on those things, doing that tuning distributed at scale would be really powerful.
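The "tuning distributed at scale" idea reduces to evaluating many independent hyperparameter combinations, which is exactly the shape of work a cluster parallelizes well. A serial toy version, with a made-up scoring function standing in for cross-validated model quality:

```python
# Toy hyperparameter grid search: score every parameter combination and keep
# the best. Each combination is an independent task, which is why this
# pattern distributes so naturally. The score function is a stand-in for
# a real cross-validated metric and peaks at an arbitrary chosen point.

from itertools import product

def score(n_trees, max_depth):
    # Pretend quality metric: best at n_trees=100, max_depth=8.
    return -abs(n_trees - 100) - 5 * abs(max_depth - 8)

grid = list(product([10, 50, 100, 200], [4, 8, 16]))

# On a cluster, each grid point would be shipped to a worker; here, a loop.
results = [(params, score(*params)) for params in grid]
best_params, best_score = max(results, key=lambda kv: kv[1])
```

With twelve combinations this is trivial; with thousands of ensemble configurations and large training sets, farming each evaluation out to a worker is what makes the "good enough" compromise unnecessary.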
I read somewhere, and I'm not going to be able to quote exactly where it was but, actually throwing more data at a problem is more valuable than tuning a perfect algorithm frankly. If we could combine the two, I think that would be really powerful. That is, finding the right algorithm and throwing all the data at it would get you a really solid model that would pick up on that signal that underlies any of these phenomena. >> David: Okay well, go ahead George. >> I was going to ask, I think that goes back to, I don't know if it was Google Paper, or one of the Google search quality guys who's a luminary in the machine learning space says, "data always trumps algorithms." >> I believe that's true and that's true in my experience certainly. >> Once you had this machine learning and once you've perhaps simplified the multi-vendor stack, then what is your solution start looking like in terms of broadening its appeal, because of the lower TCO. And then, perhaps embracing more use cases. >> I don't know that it necessarily embraces more use cases because entity resolution applies so broadly already, but what I would say is will give us more time to focus on improving the ER itself. That's I think going to be a really, really powerful improvement we can make to Novetta entity analytics as it stands right now. That's going to go into, we alluded to before, the machine learning as part of the entity resolution. Entity extraction, automated entity extraction from unstructured information and not just unstructured text but unstructured images and video. Could be a really powerful thing. Taking in stuff that isn't tagged and pulling the entities out of that automatically without actually having to have a human in the loop. Pulling every name out, every phone number out, every address out. Go ahead, sorry. 
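"Pulling every phone number out" of untagged text is, in its simplest form, pattern extraction. A minimal sketch with an invented pattern and sample text — real extraction of names and addresses needs trained models, not regexes:

```python
# Minimal entity-extraction sketch: pull US-style phone numbers out of free
# text with a regular expression. Rigid patterns like phone numbers suit
# regexes; names and addresses need trained sequence models.

import re

PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def extract_phones(text):
    return PHONE_RE.findall(text)

sample = "Call Rob at 703-555-0142 or the office at 415.555.0199 tomorrow."
phones = extract_phones(sample)
```

This is the fully automated, no-human-in-the-loop extreme; the tagged training sets discussed next are what let learned extractors handle the entities that have no fixed surface pattern.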
>> This goes back to a couple conversations we've had today where people say data trumps algorithms, even if they don't say it explicitly, so the cloud vendors who are sitting on billions of photos, many of which might have house street addresses and things like that, or faces, how do you make better... How do you extract better tuning for your algorithms from data sets that I assume are smaller than the cloud vendors'? >> They're pretty big. We employ data engineers that are very experienced at tagging that stuff manually. What I would envision would happen is we would apply somebody for a week or two weeks, to go in and tag the data as appropriate. In fact, we have products that go in and do concept tagging already across multiple languages. That's going to be the subject of my talk tomorrow as a matter of fact. But we can tag things manually or with machine assistance and then use that as a training set to go apply to the much larger data set. I'm not so worried about the scale of the data, we already have a lot, a lot of data. I think it's going to be getting that proof set that's already tagged. >> So what you're saying is, it actually sounds kind of important. That actually almost ties into what we hear about Facebook training their messenger bot where we can't do it purely just on training data so we're going to take some data that needs semi-supervision, and that becomes our new labeled set, our new training data. Then we can run it against this broad, unwashed mass of training data. Is that the strategy? >> Certainly we would get there. We would want to get there and that's the beauty of what Databricks promises, is that ability to save a lot of the time that we would spend doing the grunt work on cluster management to innovate in that way and we're really excited about that. >> Alright, we've got just a minute to go here before the break, so I wanted to ask you maybe, the wish list question, I've been asking everybody today, what do you wish you had?
Whether it's in entity resolution or some other area in the next couple of years for Novetta, what's on your list? >> Well I think that would be the more robust machine learning library, all in Spark, kind of native, so we wouldn't have to deploy that ourselves. Then, I think everything else is there, frankly. We are very excited about the platform and the stack that comes with it. >> Well that's a great ending right there, George do you have any other questions you want to ask? Alright, we're just wrapping up here. Thank you so much, we appreciate you being on the show Rob, and we'll see you out there in the Expo. >> I appreciate it, thank you. >> Alright, thanks so much. >> George: It's good to meet you. >> Thanks. >> Alright, you are watching the CUBE here at Spark Summit 2017, stay tuned, we'll be back with our next guest.
Dr. Jisheng Wang, Hewlett Packard Enterprise, Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017 brought to you by Databricks. >> You are watching theCUBE at Spark Summit 2017. We continue our coverage here talking with developers, partners, customers, all things Spark, and today we're honored now to have our next guest Dr. Jisheng Wang who's the Senior Director of Data Science at the CTO Office at Hewlett Packard Enterprise. Dr. Wang, welcome to the show. >> Yeah, thanks for having me here. >> All right and also to my right we have Mr. Jim Kobielus who's the Lead Analyst for Data Science at Wikibon. Welcome, Jim. >> Great to be here like always. >> Well let's jump into it. At first I want to ask about your background a little bit. We were talking about the organization, maybe you could do a better job (laughs) of telling me where you came from and you just recently joined HPE. >> Yes. I actually recently joined HPE earlier this year through the Niara acquisition, and now I'm the Senior Director of Data Science in the CTO Office of Aruba. Actually, Aruba, you probably know, like two years back HP acquired Aruba as a wireless networking company, and now Aruba takes charge of the whole enterprise networking business in HP, which is over three billion in annual revenue every year now. >> Host: That's not confusing at all. I can follow you (laughs). >> Yes, okay. >> Well all I know is you're doing some exciting stuff with Spark, so maybe tell us about this new solution that you're developing. >> Yes, actually most of my experience with Spark goes back to the Niara time, so Niara was a three and a half year old startup that reinvented enterprise security using big data and data science. So the problem we tried to solve in Niara is called UEBA, user and entity behavioral analytics. So I'll just try to be very brief here.
Most of the traditional security solutions focus on detecting attackers from outside, but what if the origin of the attacker is inside the enterprise, say Snowden, what can you do? So you probably heard of many cases today of employees leaving the company and stealing lots of the company's IP and sensitive data. So UEBA is a new solution that tries to monitor the behavioral change of the enterprise users to detect both this kind of malicious insider and also compromised users. >> Host: Behavioral analytics. >> Yes, so it's a native analytics capability which we run like a product. >> Yeah and Jim you've done a lot of work in the industry on this, so any questions you might have for him around UEBA? >> Yeah, give us a sense for how you're incorporating streaming analytics and machine learning into that UEBA solution and then where Spark fits into the overall approach that you take? >> Right, okay. So when we started three and a half years back, when we developed the first version of the data pipeline, we used a mix of Hadoop, YARN, Spark, even Apache Storm for different kinds of stream and batch analytics work. But soon after, with increased maturity and also the momentum from this open source Apache Spark community, we migrated all our stream and batch, you know the ETL and data analytics work, into Spark. And it's not just Spark. It's Spark, Spark Streaming, MLlib, the whole ecosystem of that. So there are at least a couple of advantages we have experienced through this kind of transition. The first thing which really helped us is the simplification of the infrastructure and also the reduction of the DevOps efforts there. >> So simplification around Spark, the whole stack of Spark that you mentioned. >> Yes. >> Okay. >> So for the Niara solution originally, we supported, even here today, we supported both the on-premise and the cloud deployment. For the cloud we also supported the public cloud like AWS, Microsoft Azure, and also private cloud.
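The behavioral-change monitoring described above can be reduced, in toy form, to baselining each user's activity and flagging large deviations — the counts and the 3-sigma threshold below are illustrative, not Niara's actual method:

```python
# Toy UEBA sketch: build a per-user baseline of daily download counts, then
# flag a new day whose count deviates from the baseline by more than three
# standard deviations. Real systems model many signals, not one count.

from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    mu, sigma = mean(history), stdev(history)
    z = (today - mu) / sigma
    return abs(z) > threshold

baseline = [10, 12, 9, 11, 10, 13, 10, 11, 12, 10]  # normal daily downloads
normal_day = is_anomalous(baseline, 12)
exfil_day = is_anomalous(baseline, 500)  # e.g. mass download before leaving
```

An ordinary day scores well inside the baseline while a sudden mass download does not — the insider and compromised-account cases both surface as the same kind of statistical break from a user's own history.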
So you can understand, if we have to maintain a stack of different open source tools over this many different deployments, the overhead of doing the DevOps work to monitor, alarm on, and debug this kind of infrastructure over different deployments is very hard. So Spark provides us a unified platform. We can integrate the streaming, you know batch, real-time, near real-time, or even long-term batch jobs all together. So that heavily reduced both the expertise and also the effort required for the DevOps. This is one of the biggest advantages we experienced, and certainly we also experienced things like the scalability, performance, and also the convenience for developers to develop new applications, all of this, from Spark. >> So are you using the Spark structured streaming runtime inside of your application? Is that true? >> We actually use Spark in the streaming processing when the data comes in, so like in the UEBA solution, the first thing is collecting a lot of the data from different data sources, network data, cloud application data. So when the data comes in, the first thing is a streaming job for the ETL, to process the data. Then after that, we also develop analytics jobs at different frequencies, like one minute, 10 minutes, one hour, one day, on top of that. And even recently we have started some early adoption of deep learning into this, how to use deep learning to monitor the user behavior change over time, especially after a user gives notice: is the user going to access more servers or download some of the sensitive data? So all of this requires very complex analytics infrastructure. >> Now there were some announcements today here at Spark Summit by Databricks of adding deep learning support to their core Spark code base. What are your thoughts about the deep learning pipelines API that they announced this morning?
It's new news, so I'll understand if you haven't digested it totally, but you probably have some good thoughts on the topic. >> Yes, actually this is also news for me, so I can just speak from my current experience. How to integrate deep learning into Spark has actually been a big challenge for us, because for the deep learning piece we used TensorFlow, and certainly most of our other stream and data massaging or ETL work is done by Spark. So in this case, there are a couple of ways to manage this. One is to set up two separate resource pools, one for Spark, the other one for TensorFlow, but among our deployments there are some very small on-premise deployments which have only like a four or five node cluster. It's not efficient to split resources in that way. So we are actually also looking for some closer integration between deep learning and Spark. One thing we looked at before is called TensorFlow on Spark, which was open sourced a couple months ago by Yahoo. >> Right. >> So maybe this is certainly more exciting news for the Spark team to develop this native integration. >> Jim: Very good. >> Okay and we talked about the UEBA solution, but let's go back to a little broader HPE perspective. You have this concept called the intelligent edge, what's that all about? >> So that's a very cool name. Actually, let me come back a little bit. I come from the enterprise background, and enterprise applications actually lag behind consumer applications in terms of the adoption of new data science technology. There are some native challenges for that. For example, collecting and storing large amounts of this enterprise sensitive data is a huge concern, especially in European countries. Also, for a similar reason, when you develop enterprise applications you normally lack a good quantity and quality of training data.
So these are some native challenges when you develop enterprise applications, but even despite this, HPE and Aruba recently made several acquisitions of analytics companies to accelerate the adoption of analytics into different product lines. Actually, that intelligent edge comes from IOT, the internet of things, which is expected to be the fastest growing market in the next few years. >> So are you going to be integrating the UEBA behavioral analytics and Spark capability into your IOT portfolio at HP? Is that a strategy or direction for you? >> Yes, for the big picture that certainly is. I think some of the Gartner reports expect the number of IOT devices to grow over 20 billion by 2020. Since all of these IOT devices are connected to either an intranet or the internet, either wired or wireless, as a networking company we have the advantage of collecting data and even taking some actions in the first place. So the idea of this intelligent edge is we want to turn each of these small IOT devices, like IP cameras, like those motion detectors, into a distributed sensor for the data collection and also some inline actor to make some real-time or close to real-time decisions. For example, behavior anomaly detection is a very good example here. If an IOT device is compromised, if the IP camera has been compromised and used to steal your internal data, we should detect and stop that in the first place. >> Can you tell me about the challenges of putting deep learning algorithms natively on resource constrained endpoints in the IOT? That must be really challenging to get them to perform well considering that there may be just a little bit of memory or flash capacity or whatever on the endpoints. Any thoughts about how that can be done effectively and efficiently? >> Very good question >> And at low cost. >> Yes, very good question.
So there are two aspects to this. First is the global training of the intelligence, which is not going to be done on each of the devices. In that case, each device is more like a sensor for the data collection. So we are going to collect the data, send it to the cloud, or build this giant pool of computing resource, to train the classifier, to train the model. But once we train the model, we are going to ship the model, so the inference and the detection of those behavioral anomalies really happen on the endpoint. >> Do the training centrally and then push the trained algorithms down to the edge devices. >> Yes. But even then, the second aspect, as you said, is that for some of the devices, say people try to put those small chips in a spoon in a hospital to make it more intelligent, you cannot put even just the detection piece there. So we are also looking at some new technology. I know Caffe recently released some lightweight deep learning models. Also there's, you probably know, some improvement from the chip industry. >> Jim: Yes. >> How to optimize the chip design for this kind of more analytics driven task. So we are looking at these different areas now. >> We have just a couple minutes left, and Jim you get one last question after this, but I got to ask you, what's on your wishlist? What do you wish you could learn or maybe what did you come to Spark Summit hoping to take away? >> I've always treated myself as a technical developer. One thing I am very excited about these days is the emergence of new technology, like Spark, like TensorFlow, like Caffe, even BigDL which was announced this morning. So the first goal, when I come to these big industry events, is I want to learn the new technology.
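The split he describes, train centrally and then ship the model so inference happens on the endpoint, can be sketched as follows. This is a deliberately tiny illustration with an invented z-score baseline and invented field names; the real systems train deep models on far richer features:

```python
import json
import statistics

def train_baseline(download_bytes):
    # Central training: learn a per-user baseline from historical data.
    return {"mean": statistics.mean(download_bytes),
            "stdev": statistics.pstdev(download_bytes)}

def ship(model):
    # "Ship the model": serialize it into a compact payload for the endpoint.
    return json.dumps(model)

def detect_on_edge(payload, observed_bytes):
    # Edge-side inference only: flag activity far above the learned baseline.
    model = json.loads(payload)
    z = (observed_bytes - model["mean"]) / (model["stdev"] or 1.0)
    return z > 3.0

payload = ship(train_baseline([100, 120, 110, 130]))
alarm = detect_on_edge(payload, 5000)   # anomalously large download
quiet = detect_on_edge(payload, 115)    # normal download
```

The endpoint never sees the training data, only the serialized model, which is what makes the scheme workable on the resource-constrained devices discussed above.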
And the second thing is mostly to share our experience about adopting this new technology, and also to learn from colleagues in different industries how people change lives and disrupt old industries by taking advantage of the new technologies here. >> The community's growing fast. I'm sure you're going to receive what you're looking for. And Jim, final question? >> Yeah, I heard you mention DevOps and Spark in the same context, and that's a huge theme we're seeing, more DevOps being wrapped around the lifecycle of development and training and deployment of machine learning models. If you could have your ideal DevOps tool for Spark developers, what would it look like? What would it do in a nutshell? >> Actually, I can just share my personal experience. In Niara, we actually developed a lot of in-house DevOps tools. For example, when you run a lot of different Spark jobs, stream and batch, like a one minute batch versus a one day batch job, how do you monitor the status of those workflows? How do you know when the data stops coming? How do you know when the workflow failed? Monitoring is a big thing, and then alarming: when you have some failure or something wrong, how do you alarm on it? And also debugging is another big challenge. So I certainly see the growing effort from both Databricks and the community on different aspects of that. >> Jim: Very good. >> All right, so I'm going to ask you for kind of a soundbite summary. I'm going to put you on the spot here, you're in an elevator and I want you to answer this one question. Spark has enabled me to do blank better than ever before. >> Certainly, certainly. As I explained before, it has helped a lot, both for developers and for a start-up trying to disrupt an industry. It helps a lot, and I'm really excited to see this deep learning integration and the different road map items, you know, down the road. I think they're on the right track. >> All right. Dr.
Wang, thank you so much for spending some time with us. We appreciate it and go enjoy the rest of your day. >> Yeah, thanks for being here. >> And thank you for watching the Cube. We're here at Spark Summit 2017. We'll be back after the break with another guest. (easygoing electronic music)
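The DevOps gap Dr. Wang described, knowing when data stops coming or a workflow fails, often reduces to a heartbeat check over the last successful batch per workflow. A minimal sketch, with invented workflow names and staleness thresholds:

```python
def stale_workflows(last_batch_ts, now, max_lag):
    """Return the workflows whose most recent successful batch is older
    than their allowed lag, i.e. the 'data stopped coming' alarm."""
    return sorted(name for name, ts in last_batch_ts.items()
                  if now - ts > max_lag[name])

# Last successful batch time (epoch seconds) per workflow.
heartbeats = {"etl-1min": 990, "behavior-1hr": 100, "ml-daily": 200}
# Allowed staleness per workflow, scaled to its batch frequency.
lags = {"etl-1min": 120, "behavior-1hr": 7200, "ml-daily": 600}

alarms = stale_workflows(heartbeats, now=1000, max_lag=lags)
```

Each frequency gets its own threshold, so a one-minute stream job alarms within minutes while a daily job is not flagged until its day has clearly passed.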
Wikibon Big Data Market Update pt. 2 - Spark Summit East 2017 - #SparkSummit - #theCUBE
(lively music) >> [Announcer] Live from Boston, Massachusetts, this is the Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Spark Summit in Boston, everybody. This is the Cube, the worldwide leader in live tech coverage. We've been here two days, wall-to-wall coverage of Spark Summit. George Gilbert, my cohost this week, and I are going to review part two of the Wikibon Big Data Forecast. Now, it's very preliminary. We're only going to show you a small subset of what we're doing here. And so, well, let me just set it up. So, these are preliminary estimates, and we're going to look at different ways to triangulate the market. So, at Wikibon, what we try to do is focus on disruptive markets, and try to forecast those over the long term. What we try to do is identify where the traditional market research estimates really, we feel, might be missing some of the big trends. So, we're trying to figure out, what's the impact, for example, of real time. And, what's the impact of this new workload that we've been talking about around continuous streaming. So, we're beginning to put together ways to triangulate that, and we're going to show you, give you a glimpse today of what we're doing. So, if you bring up the first slide, we showed this yesterday in part one. This is our last year's big data forecast. And, what we're going to do today, is we're going to focus in on that line, that S-curve. That really represents the real time component of the market. Spark would be in there. Streaming analytics would be in there. Add some color to that, George, if you would. >> [George] Okay, for 60 years, since the dawn of computing, we have had two ways of interacting with computers. You put your punch cards in, or whatever else, and you come back and you get your answer later. That's batch. Then, starting in the early 60's, we had interactive, where you're at a terminal.
And then, the big revolution in the 80's was you had a PC, but you were still either interactive with a terminal or batch, typically for reporting and things like that. What's happening is the rise of a new interaction mode, which is continuous processing. Streaming is one way of looking at it, but it might be more effective to call it continuous processing, because you're not going to get rid of batch or interactive; your apps are going to have a little of each. So, what we're trying to do, since this is early, early in its life cycle, is look at that streaming component from a couple of different angles. >> Okay, as I say, that's represented by this ogive curve, or the S-curve. On the next slide, we're at the beginning when you think about these continuous workloads. We're at the early part of that S-curve, and of course, most of you or many of you know how the S-curve works. It's slow, slow, slow. For a lot of effort, you don't get much in return. Then you hit the steep part of that S-curve. And that's really when things start to take off. So, the challenge is, things are complex right now. That's really what this slide shows. And Spark is designed, really, to reduce some of that complexity. We've heard a lot about that, but take us through this. Look at this data flow from ingest, to explore, to process, to serve. We talked a lot about that yesterday, but this underscores the complexity in the marketplace. >> [George] Right, and while we're just looking mostly at numbers today, the point of the forecast is to estimate when the barriers, representing complexities, start to fall. And then, when we can put all these pieces together, ingest, explore, process, serve, when that becomes an end-to-end pipeline, when you can start taking the data in on one end, get a scientist to turn it into a model, inject it into an application, and that process becomes automated, that's when it's mature enough for the knee in the curve to start.
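George's distinction can be made concrete with a toy example: the same total computed batch-style, with all the data present up front, versus continuous-style, where state updates as each event arrives and the current answer is available at any moment. The order amounts are invented:

```python
def batch_total(orders):
    # Batch: the full data set exists before the answer is computed.
    return sum(orders)

class ContinuousTotal:
    # Continuous: state is updated per event, and the current answer
    # is available at any moment in between events.
    def __init__(self):
        self.total = 0

    def on_event(self, amount):
        self.total += amount
        return self.total

stream = ContinuousTotal()
running = [stream.on_event(amount) for amount in [10, 20, 5]]
```

Both arrive at the same final answer; the difference is *when* an answer exists, which is why real applications end up with a little of each mode.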
>> And that's when we think the market's going to explode. But now so, how do you bound this. Okay, when we do forecasts, we always try to bound things. Because if they're not bounded, then you get no foundation. So, if you look at the next slide, we're trying to get a sense of real-time analytics. How big can it actually get? That's what this slide is really trying to-- >> [George] So this one was one firm's take on real-time analytics, where by 2027, they see it peaking just under-- >> [Dave] When you say one firm, you mean somebody from the technology district? >> [George] Publicly available data. And we take it as as a, since they didn't have a lot of assumptions published, we took it as, okay one data point. And then, we're going to come at it with some bottoms-up end top-down data points, and compare. >> [Dave] Okay, so the next slide we want to drill into the DBMS market and when you think about DBMS, you think about the traditional RDBMS and what we know, or the Oracle, SQL Server, IBMDB2's, etc. And then, you have this emergent NewSQL, and noSQL entrance, which are, obviously, we talked today to a number of folks. The number of suppliers is exploding. The revenue's still relatively small. Certainly small relative to the RDBMS marketplace. But, take us through what your expectations is here, and what some of the assumptions are behind this. >> [George] Okay, so the first thing to understand is the DBMS market, overall, is about $40 billion of which 30 billion goes to online transaction processing supporting real operational apps. 10 billion goes to Orlap or business intelligence type stuff. The Orlap one is shrinking materially. The online transaction processing one, new sales is shrinking materially but there's a huge maintenance stream. >> [Dave] Yeah which companies like Oracle and IBM and Microsoft are living off of that trying to fund new development. 
>> We modeled that declining gently and beginning to accelerate more going out into the latter years of the tenure period. >> What's driving that decline? Obviously, you've got the big sucking sound of a dup in part, is driving that. But really, increasingly it's people shifting their resources to some of these new emergent applications and workloads and new types of databases to support them right? But these are still, those new databases, you can see here, the NewSQL and noSQL still, relatively, small. A lot of it's open source. But then it starts to take off. What's your assumption there? >> So here, what's going on is, if you look at dollars today, it's, actually, interesting. If you take the noSQL databases, you take DynamoDB, you take Cassandra, Hadoop, HBase, Couchbase, Mongo, Kudu and you add all those up, it's about, with DynamoDB, it's, probably, about 1.55 billion out of a $40 billion market today. >> [Dave] Okay but it's starting to get meaningful. We were approaching two billion. >> But where it's meaningful is the unit share. If that were translated into Oracle pricing. The market would be much, much bigger. So the point it. >> Ten X? >> At least, at least. >> Okay, so in terms of work being done. If there's a measure of work being done. >> [George] We're looking at dollars here. >> Operations per second or etcetera, it would be enormous. >> Yes, but that's reflective of the fact that the data volumes are exploding but the prices are dropping precipitously. >> So do you have a metric to demonstrate that. We're, obviously, not going to show it today but. >> [George] Yes. >> Okay great, so-- >> On the business intelligence side, without naming names, the data warehouse appliance vendors are charging anywhere from 25,000 per terabyte up to, when you include running costs, as high as 100,000 a terabyte. That their customers are estimating. That's not the selling cost but that's the cost of ownership per terabyte. 
Whereas, if you look at, let's say Hadoop, which is comparable for the off loading some of the data warehouse work loads. That's down to the 5K per terabyte range. >> Okay great, so you expect that these platforms will have a bigger and bigger impact? What's your pricing assumption? Is prices going to go up or is it just volume's going to go through the roof? >> I'm, actually, expecting pricing. It's difficult because we're going to add more and more functionality. Volumes go up and if you add sufficient functionality, you can maintain pricing. But as volumes go up, typically, prices go down. So it's a matter of how much do these noSQL and NewSQL databases add in terms of functionality and I distinguish between them because NewSQL databases are scaled out version of Oracle or Teradata but they are based on the more open source pricing model. >> Okay and NoSQL, don't forget, stands for not only SQL, not not SQL. >> If you look at the slides, big existing markets never fall off a cliff when they're in the climb. They just slowly fade. And, eventually, that accelerates. But what's interesting here is, the data volumes could explode but the revenue associated with the NoSQL which is the dark gray and the NewSQL which is the blue. Those don't explode. You could take, what's the DBMS cost of supporting YouTube? It would be in the many, many, many billions of dollars. It would support 1/2 of an Oracle itself probably. But it's all open source there so. >> Right, so that's minimizing the opportunity is what you're saying? >> Right. >> You can see the database market is flat, certainly flattish and even declining but you do expect some growth in the out years as part of that evasion, that volume, presumably-- >> And that's the next slide which is where we've seen that growth come from. >> Okay so let's talk about that. So the next slide, again, I should have set this up better. The X-axis year is worldwide dollars and the horizontal axis is time. 
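The per-terabyte economics cited here, roughly $25,000 to $100,000 per terabyte of ownership cost for warehouse appliances versus about $5,000 for Hadoop, imply the kind of gap sketched below; the 100 TB cluster size is an arbitrary example:

```python
def cost_of_ownership(terabytes, cost_per_tb):
    return terabytes * cost_per_tb

TB = 100  # hypothetical cluster size for comparison

appliance_low = cost_of_ownership(TB, 25_000)    # low-end appliance estimate
appliance_high = cost_of_ownership(TB, 100_000)  # high end, with running costs
hadoop = cost_of_ownership(TB, 5_000)            # Hadoop offload estimate

low_ratio = appliance_low // hadoop    # 5x even at the appliance low end
high_ratio = appliance_high // hadoop  # 20x at the high end
```

That 5x to 20x spread is the arithmetic behind the argument that unit volumes can explode while dollar revenue stays roughly flat.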
And we're talking here about these continuous application work loads. This new work load that you talked about earlier. So take us through the three. >> [George] There's three types of workloads that, in large part, are going to be driving most of this revenue. Now, these aren't completely comparable to the DBMS market, because some of these don't use traditional databases. Or if they do, they're Torry databases and I'll explain that. >> [Dave] Sure but if I look at the IoT Edge, the Cloud and the micro services and streaming, that's a tail wind to the database forecast in the previous slide, is that right? >> [George] It's, actually, interesting but the application and infrastructure telemetry, this is what Splunk pioneered. Which is all the torrents of data coming out of your data center and your applications and you're trying to manage what's going on. That is a database application. And we know Splunk, for 2016, was 400 million. In software revenue Hadoop was 750 million. And the various other management vendors, New Relic, AppDynamics, start ups and 5% of Azure and AWS revenue. If you add all that up, it comes out to $1.7 billion for 2016. And so, we can put a growth rate on that. And we talked to several vendors to say, okay, how much will that work load be compared to IoT Edge Cloud. And the IoT Edge Cloud is the smart devices at the Edge and the analytics are in the fog but not counting the database revenue up in the Cloud. So it's everything surrounding the Cloud. And that, actually, if you look out five years, that's, maybe, 20% larger than the app and infrastructure telemetry but growing much, much faster. Then the third one, where you were asking was this a tail wind to the database: micro services and streaming are very different ways of building applications from what we do now. Now, people build their logic for the application and everyone then stores their data in this centralized external database.
In micro services, you build a little piece of the app and whatever data you need, you store within that little piece of the app. And so the database requirements are, rather, primitive. And so that piece will not drive a lot of database revenue. >> So if you could go back to the previous slide, Patrick. What's driving database growth in the out years? Why wouldn't database continue to get eaten away and decline? >> [George] In broad terms, the overall database market, it staying flat. Because as prices collapse but the data volumes go up. >> [Dave] But there's an assumption in here that the NoSQL space, actually, grows in the out years. What's driving that growth? >> [George] Both the NoSQL and the NewSQL. The NoSQL, probably, is best serving capturing the IoT data because you don't need lots of fancy query capabilities for concurrency. >> [Dave] So it is a tail wind in a sense in that-- >> [George] IoT but that's different. >> [Dave] Yeah sure but you've got the overall market growing. And that's because the new stuff, NewSQL and NoSQL is growing faster than the decline of the old stuff. And it's not in the 2020 to 2022 time frame. It's not enough to offset that decline. And then they have it start growing again. You're saying that's going to be driven by IoT and other Edge use cases? >> Yes, IoT Edge and the NewSQL, actually, is where when they mature, you start to substitute them for the traditional operational apps. For people who want to write database apps not who want to write micro service based apps. >> Okay, alright good. Thank you, George, for setting it up for us. Now, we're going to be at Big Data SV in mid March? Is that right? Middle of March. And George is going to be releasing the actual final forecast there. We do it every year. We use Spark Summit to look at our preliminary numbers, some of the Spark related forecasts like continuous work loads. And then we harden those forecasts going into Big Data SV. 
We publish our big data report like we've done for the past five, six, seven years. So check us out at Big Data SV. We do that in conjunction with the Strata events. So we'll be there again this year at the Fairmont Hotel. We've got a bunch of stuff going on all week there. Some really good programs going on. So check out siliconangle.tv for all that action. Check out Wikibon.com. Look for new research coming out. You're going to be publishing this quarter, correct? And of course, check out siliconangle.com for all the news. And, really, we appreciate everybody watching. George, been a pleasure co-hosting with you. As always, really enjoyable. >> Alright, thanks Dave. >> Alright, so that's a wrap from Spark Summit. We're going to try to get out of here, hit the snow storm and work our way home. Thanks everybody for watching. A great job everyone here. Seth, Ava, Patrick and Alex. And thanks to our audience. This is the Cube. We're out, see you next time. (lively music)
Bill Peterson, MapR - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts Dave Vellante and George Gilbert. >> Welcome back to Boston, everybody, this is theCUBE, the leader in live tech coverage. We're here in Boston, in snowy Boston. This is Spark Summit. Spark Summit does an East Coast version, they do a West Coast version, and they've got one in Europe this year. theCUBE has been a partner with Databricks as the live broadcast partner. Our friend Bill Peterson is here. He's the head of partner marketing at MapR. Bill, good to see you again. >> Thank you, thanks for having me. >> So how's the show going for you? >> It's great. >> Give us the vibe. We're kind of windin' down day two. >> It is. The show's been great, we've got a lot of traffic coming by, a lot of deep technical questions which is-- >> Dave: Hardcore at the show-- >> It is, it is. I spend a lot of time there smiling and going, "Yeah, talk to him." (laughs) But it's great. We're getting those deep technical questions and it's great. We actually just got one on Lustre, which I had to think for a minute, oh, HPC. It was like way back in there. >> Dave: You know, Cray's on the floor. >> Oh, yeah that's true. But a lot of our customers as well. UnitedHealth Group, Wells Fargo, AMEX coming by. Which is great to see them and talk to them, but also they've got some deep technical questions for us. So it's moving the needle with existing customers but also new business, which is great. >> So I got to ask a basic question. What is MapR? MapR started in the early days of Hadoop as a distro vendor, one of the big three. When somebody says to you what is MapR, what do you say? >> My answer today is MapR is an enterprise software company that delivers a converged data platform. That converged data platform consists of a file system, a NoSQL database, a Hadoop distribution, a Spark distribution, and a set of data management tools.
And as a customer of MapR, you get all of those. You can turn 'em all on if you'd like. You can just turn on the file system, for example, if you wanted to just use the file system for storage. But the enterprise software piece of that is all the hardening we do behind the scenes on things like snapshots, mirroring, data governance, multi-tenancy, ease of use performance, all of that baked in to the solution, or the platform as we're calling it now. So as you're kind of alluding to, a year ago now we kind of got out of that business of saying okay, lead 100% with Hadoop and then while we have your attention, or if we don't, hey wait, we got all this other stuff in the basket we want to show you, we went the platform play and said we're going to include everything and it's all there and then the baseline underneath is the hardening of it, the file system, the database, and the streaming product, actually, which I didn't mention, which is kind of the core, and everything plays off of there. And that honestly has been really well-received. And it just, I feel, makes it so much easier because-- It happened here, we get the question, okay, how are you different from Cloudera or Hortonworks? And some of it here, given the nature of the attendees, is very technical, but there's been a couple of business users that I've talked to. And when I talk about us as an enterprise software company delivering a plethora of solutions versus just Hadoop, you can see the light going on sometimes in people's eyes. And I got it today, earlier, "I had no idea you had a file system," which, to me, just drives me insane because the file system is pretty cool, right? >> Well you guys are early on in investing in that file system and recovery capabilities and all the-- >> Two years in stealth writing it. >> Nasty, gnarly, hard stuff that was kind of poo-pooed early on. >> Yeah, yeah. MapR was never patient about waiting for the open source community to just figure it out and catch up. 
You always just said all right, we're going to solve this problem and go sell. >> And I'm glad you said that. I want to be clear. We're not giving up on open source or anything, right? Open source is still a big piece. 50% of our engineers' time is working on open source projects. That's still super important to us. And then back in November-ish last year we announced the MapR Ecosystem Packs, which is our effort to help our customers that are using open source components to stay current. 'Cause that's a pain in the butt. So this is a set of packages that have a whole bunch of components. We lead with Spark and Drill, and that was by customer request, that they were having a hard time keeping current with Spark and Drill. So the packs allow them to come up to current level within the converged data platform for all of their open source components. And that's something we're going to do at the dot level, so I think we're at 2.1 or 2 now. The dot levels will bring you up on everything and then the big ones, like the 3.0s, the 4.0s, will bring Spark and Drill current. And so we're going to kind of leapfrog those. So that's still a really important part of our business and we don't want to forget that part, but what we're trying to do here, via the platform, is deliver all of that in one entity, right?
This week, in addition to Spark Summit we're doing our yearly customer advisory board so we've got, like a lot of vendors, we've got a 30 plus company customer advisory board that we bring in and we sit down with them for a couple of days and they give us feedback on what we should and shouldn't be doing and where, directional and all that, which is super important. And that's where a lot of this converged data platform came out of is the need for... There was just too much, it's kind of confusing. I'll give the example of streams, right? We came out with our streaming product last year and okay, I'm using Hadoop, I'm using your file system, I'm using NoSQL, now you're adding streams, this is great, but now, like MEP, the Ecosystem Packages, I have to keep everything current. You got to make it easier for me, you got to make my life easier for me. So for existing customers it's a stay current, I like this, the model, I can turn on and off what I want when I want. Great model for them, existing business. For new business it gets us out of that Hadoop-only mode, right? I kind of jokingly call us Hadoop plus plus plus plus. We keep adding solutions and add it to a single, cohesive data platform that we keep updated. And as I mentioned here, talking to new customers or new prospects, our potential new business, when I describe the model you can just see the light going on and they realize wow, there's a lot more to this than I had imagined. I got it earlier today, I thought you guys only did Hadoop. Which is a little infuriating as a marketer, but I think from a mechanism and a delivery and a message and a story point of view, it's really helped. >> More Cube time will help get this out there. (laughs) >> Well played, well played. >> It's good to have you back on. Okay, so Spark comes along a couple years ago and it was like ah, what's going to happen to Hadoop? So you guys embraced Spark. 
Talk more specifically about Spark, where it fits in your platform and the ecosystem generally. >> Spark, Hadoop, others as an entity to bring data into the converged data platform, that's one way to think about it. Way oversimplified, obviously, but that's a really great way, I think, to think about it if we're going to provide this platform that anybody can query on, you can run analytics against. We talk a lot now about converged applications. So taking historical data, taking operational data, so streaming data, great example. Putting those together and you could use the Data Lake example if you want, that's fine. But putting them into a converged application in the middle where they overlap, kind of typical Venn diagram where they overlap, and that middle part is the converged application. What's feeding that? Well, Spark could be feeding that, Hadoop could be feeding that. Just yesterday we announced support for Docker containers, and that could be feeding into the converged data platform as well. So we look at all of these things as an opportunity for us to manage data and to make data accessible at the enterprise level. And then that enterprise level goes back to what I was talkin' about before, it's got to have all of those things, like multi-tenancy and snapshots and mirroring and data governance, security, et cetera. But Spark is a big component of that. All of the customers who came by here that I mentioned earlier, which are some really good names for us, are all using Spark to drive data into the converged data platform. So we look at it as we can help them build new applications within the converged data platform with that data. So whether it's Spark data, Hadoop data, container data, we don't really care.
>> So along those lines, if the focus of intense interest right now is on Spark, and Spark says oh, and we work with all these databases, data stores, file systems, if you approach a customer who's Spark first, what's the message relative to all the other data stores that they can get to through, without getting too techy, their API? >> Sure, sure. I think as you know, George, we support a whole bunch of APIs. So I guess for us it's the breadth. >> But I'm thinking of Spark in particular. If someone says specifically, I want to run Databricks, but I need something underneath it to capture the data and to manage it. >> Well I think that's the beauty of our file system there. As I mentioned, if you think about it from an architectural point of view, our file system along the bottom, or it could be our database or our streaming product, but in this instance-- >> George: That's what I'm getting at too, all three. >> Picture that as the bottom layer as your storage-- I shouldn't say storage layer but as the bottom layer. 'Cause it's not just storage, it's more than storage. Middle layer is maybe some of your open source tools and the like, and then above that is what I call your data delivery mechanisms. Which would be Spark, for example, one bucket. Another bucket could be Hadoop, and another bucket could be these microservices we're talking about. Let me draw the picture another way using a partner, SAP. One of the things we've had some success with SAP on is SAP HANA sitting up here. SAP would love to have you put all your data in HANA. It's probably not going to happen. >> George: Yeah, good luck. >> Yeah, good luck, right? But what if you, hey customer, what if you put zero to two years' worth of data, historical data, in HANA. Okay, maybe the customer starts nodding their head like you just did. Hey customer, what if you put two to five years' worth of data in Business Warehouse. Guess what, you already own that.
You've been an SAP customer for a while, you already have it. Okay, the customer's now really nodding their head. You got their attention. To your original question, whether it's Spark or whatever, five plus years, put it in MapR. >> Oh, and then like HANA Vora could do the query. >> Drill can query across all of them. >> Oh, right, including the Business Warehouse, okay. >> So we're running in the file system. That, to me, and we do this obviously with our joint SAP MapR customers, that to me is kind of a really cool vision. And to your original question, if that was Spark at the top feeding it rather than SAP, sure, right? Why not? >> What can you share with us, Bill, about business metrics around MapR? However you choose to share it, head count, want to give us gross margins by product, that's great, but-- (laughs) >> Would you like revenues too, Dave? >> We know they're very high because you're a software company, so that's actually a bad question. I've already profit-- (laughs) >> You don't have to give us top line revenues-- >> So what are you guys saying publicly about the company, its growth? >> That's fair. >> Give us the latest. >> Fantastic, number one. Hiring like crazy, we're well north of 500 people now. I actually, you want to hear a funny story? I yesterday was texting in the booth, with a candidate for my team, back and forth on salary. Did the salary negotiation on text right there in the booth and closed her, she starts on the 27th, so. >> Dave: Congratulations. >> I'm very excited about that. So moving along on that. Seven, 800 plus customers as we talk about... We just finished our fiscal year on January 31st, so we're on a Feb one fiscal year. And we always do a momentum press release, which will be coming out soon. Hiring, again, like crazy, as I mentioned, executive staff is all filled in and built to scale, which we're really excited about.
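Stepping back to the SAP example above: the HANA / Business Warehouse / MapR pitch amounts to routing data by age, with Drill then able to query across all three tiers. A minimal sketch of that routing rule, where the thresholds come from the conversation but the function and tier labels are our own invention:

```python
# Illustrative sketch of the age-based tiering described above: recent data
# in SAP HANA, mid-aged data in Business Warehouse, everything older in MapR.
# Thresholds follow the conversation; names here are assumptions.

def storage_tier(age_years: float) -> str:
    """Route a record to a storage tier based on its age in years."""
    if age_years < 2:
        return "HANA"                 # 0-2 years: hot, in-memory
    if age_years < 5:
        return "Business Warehouse"   # 2-5 years: warm, already owned
    return "MapR"                     # 5+ years: cold, commodity storage

print(storage_tier(0.5))
print(storage_tier(3))
print(storage_tier(7))
```

The appeal of the pitch is that only the routing changes, not the query layer: Drill (or Spark) queries across whichever tier the data landed in.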
We talk a lot about the kind of uptake of-- it used to be of the file system, Hadoop, et cetera on its own, but now in this one, the momentum release we'll be doing, we'll talk about the converged data platform and the uplift we've seen from that. So we obviously can't talk revenue numbers and the like, but everything... David, I got to tell you, we've been doin' this a long time, all of that is just all moving in the right direction. And then the other example I'll give you from my world, in the partner world. Last year I rebranded our partner program to the converged partner program. We're going with this whole converged thing, right? And we established three levels, elite, preferred, and affiliate with different levels there. But also, there's revenue requirements at each level, so elite, preferred, and affiliate, and there's resell and influence revenues, we have MDF funds, not only from the big guys coming to us, but we're paying out MDF funds now to select partners as well. So all of this stuff I always talk about as the maturity of the company, right? We're maturing in our messaging, we're maturing in the level of people who are joining, and we're maturing in the customers and the deals, the deal sizes and volumes that we're seeing. It's all movin' in the right direction. >> Dave: Great, awesome, congratulations. >> Bill: Thank you, yeah, I'm excited. >> Can you talk about number of customers or number of employees relative to last year? >> Oh boy. Honestly, George, I don't know off the top of my head. I apologize, I don't know the metric, but I know it's north of 500 today, of employees, and it's like seven, 800 customers. >> Okay, okay. >> Yeah, yeah. >> And a little bit more on this partner, elite, preferred, and affiliate. >> Affiliate, yeah. >> What did you call it, the converged partners program? >> Converged-- Yeah, yeah. >> What are some of the details of that? >> Sure. So the elites are invite only, and those are some of the bigger ones.
So for us, we're-- >> Dave: Like, some examples. >> Cisco, SAP, AWS, others, but those are some of the big ones. And they were looking at things like resell and influence revenue. That's what I track in my... I always jokingly say at MapR, even though we're kind of a big startup now, you have three jobs. You have the job you were hired for, you have your Thursday night job, and you have your Sunday night job. (Dave and George laugh) In the job that I was hired for, partner marketing, I track influence and resell revenue. So at the elite level, we're doing both. Like Cisco resells us, so the S-Series, we're in their SKU, their sales reps can go sell an S-Series for big data workloads or analytical workloads with MapR on it, off you go. Our job then is cashing checks, which I like. That's a good job to have in this business. At the preferred level it's kind of that next tier of big players, but their revenue thresholds haven't moved them into elite yet. Partners in there, like the MicroStrategies of the world, we're doing a lot with them, Tableau, Talend, a lot of the BI vendors in there. And then the affiliates are the smaller guys who maybe we'll do one piece of a campaign during the year with them. So I'll give you an example, Attunity, you guys know those guys right here? >> Sure. >> Yeah, yeah. >> Last year we were doing a campaign on DWO, data warehouse offload. We wanted to bring them in but this was a MapR campaign running for a quarter, and we're typical, like a lot of companies, we run four campaigns a year and then my partner in field stuff kind of opts into that and we run stuff to support it. And then corporate marketing does something. Pretty traditional. But what I try and do is pull these partners into those campaigns. So we did a webinar with Attunity as part of that campaign.
So at the affiliate level, the lower level, we're not doing a full go-to-market like we would with the elites at the top, but they're being brought into our campaigns and then obviously hopefully, we hope on the other side they're going to pull us in as well. >> Great, last question. What should we pay attention to, what's comin' up? >> Yeah, so-- >> Let's see, we got some events, we got Strata coming up you'll be out your way, or out MapR way. >> As my Twitter handle says, seat 11A. That's where I am. (laughs) Yeah, I mean the Docker announcement we're really excited about, and microservices. You'll see more from us on the whole microservices thing. Streaming is still a big one, we think, for this year. You guys probably agree. That's why we announced the MapR streaming product last year. So again, from a go-to-market point of view and kind of putting some meat behind streaming not only MapR but with partners, so streaming as a component and a delivery model for managing data in CDP. I think that's a big one. Machine learning is something that we're seeing more and more touching us from a number of customers but also from the partner perspective. I see all the partner requests that come in to join the partner program, and there's been an uptick in the machine learning customers that want to come in and-- Excuse me, partners, that want to be talking to us. Which I think is really interesting. >> Where you would be the sort of prediction serving layer? >> Exactly, exactly. Or a data store. A lot of them are looking for just an easy data store that the MapR file system can do. >> Infrastructure to support that, yeah. >> Commodity, right? The whole old promise of Hadoop or just a generic file system is give me easy access to storage on commodity hardware. The machine learning-- >> That works. >> Right. The existing machine learning vendors need an answer for that. 
When the customer asks them, they want just an easy answer, say oh, we just use MapR FS for that and we're done. Okay, that's fine with me, I'll take that one. >> So that's the operational end of that machine learning pipeline that we call DevOps for data scientists? >> Correct, right. I guess the nice synergy there is the whole, going back to the Docker microservices one, there's a DevOps component there as well. So, might be interesting marrying those together. >> All right, we got to go, Bill, thanks very much, good to see you again. >> All right, thank you. >> All right, George and I will be back to wrap. We're going to part two of our big data forecast right now, so stay with us, right back. (digital music) (synth music)
Joel Cumming, Kik - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is the Cube, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Boston, everybody, where it's a blizzard outside and a blizzard of content coming to you from Spark Summit East, #SparkSummit. This is the Cube, the worldwide leader in live tech coverage. Joel Cumming is here. He's the head of data at Kik. Kicking butt at Kik. Welcome to the Cube. >> Thank you, thanks for having me. >> So tell us about Kik, this cool mobile chat app. Checked it out a little bit. >> Yeah, so Kik has been around since about 2010. We're, as you mentioned, a mobile chat app, start-up based in Waterloo, Ontario. Kik really took off in 2010, when it got 2 million users in the first 22 days of its existence. So it was insanely popular, specifically with U.S. youth, and the reason for that really is Kik started off in a time where chatting through text cost money. Text messages cost money back in 2010, and really not every kid has a phone like they do today. So if you had an iPod or an iPad all you needed to do was sign up, and you had a user name and now you could text with your friends, so kids could do that just like their parents could with Kik, and that's really where we got our entrenchment with U.S. youth.
When I was there, we grew from three million to 80 million customers, from three thousand employees to 17 thousand employees, and of course, things went sideways for Blackberry, but conveniently at the end I was working in BBM, leading a team of data scientists and data engineers there. And BBM, if you're not familiar with it, is a chat app as well, and across town is where Kik is headquartered. The appeal to me of moving to Kik was a company that was very small and fast moving, but they actually weren't leveraging data at all. So when I got there, they had a pile of logs sitting in S3, waiting for someone to take advantage of them. They were good at measuring events, and looking at those events and how they tracked over time, but not really combining them to understand or personalize any experience for their end customers. >> So they knew enough to keep the data. >> They knew enough to keep the data. >> They just weren't sure what to do with it. Okay so, you come in, and where did you start? >> So the first day that I started was the first day I used any AWS product, so I had worked on the big data tools at the old place, with Hadoop and Pig and Hive and Oracle and those kinds of things, but had never used an AWS product until I got there and it was very much sink or swim, and on my first day our CEO in the meeting said, "Okay, you're the data guy here now. I want you to tell me in a week why people leave Kik." And I'm like, man we don't even have a database yet. The first thing I did was I fired up a Redshift cluster. First time I had done that, looked at the tools that were available in AWS to transform the data using EMR and Pig and those kinds of things, and was lucky enough, fortunate enough that I could figure that out in a week, and I didn't give him the full answer of why people left, but I was able to give him some ideas of places we could go based on some preliminary exploration.
So I went from leading this team of about 40 people to being a team of one and writing all the code myself. Super exciting, not the experience that everybody wants, but for me it was a lot of fun. Over the last three years I've built up the team. Now we have three data engineers and three data scientists, and indeed it's a lot more important to people every day at Kik. >> What sort of impact has your team had on the product itself and the customer experience? >> So in the beginning it was really just trying to understand the behaviors of people across Kik, and that took a while to really wrap our heads around, and any good data analysis combines behaviors that you have to ask people their opinion on and also behaviors that we see them do. So I had an old boss that used to work at Rogers, which is a telecom provider in Canada, and he said if you ask people the things that they watch they tell you documentaries and the news and very important stuff, but if you see what they actually watch it's reality TV and trashy shows, and so the truth is really somewhere in the middle. There's an aspirational element. So for us really understanding the data we already had, instrumenting new events, and then in the last year and a half, building out an A/B testing framework is something that's been instrumental in how we leverage data at Kik. So we were making decisions by gut feel in the very beginning, then we moved into this era where we were doing A/B testing and very focused on statistical significance, and rigor around all of our experiments, but then stepping back and realizing maybe the bets that we have aren't big enough. So we need to maybe bet a little bit more on some bigger features that have the opportunity to move the needle. So we've been doing that recently with a few features that we've released, but data is super important now, both to stimulate creativity of our product managers as well as to measure the success of those features.
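The statistical-significance rigor described above is commonly a two-proportion z-test on conversion counts from the control and variant buckets. A stdlib-only sketch of that check (the experiment numbers are invented; Kik's actual framework is not described here):

```python
from math import sqrt, erf

# Two-sided two-proportion z-test: did variant B convert differently from A?
def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, p_value) comparing conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical test: 200/1000 conversions on control, 250/1000 on the variant.
z, p = two_proportion_z(200, 1000, 250, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up numbers the difference clears the conventional 0.05 threshold; the "bigger bets" point in the conversation is that a rigorous test on a tiny effect can still be a wasted experiment.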
>> And how do you map to the product managers who are defining the new features? Are you a central group? Are you sort of point guards within the different product groups? How does that work? You make the evidence-based recommendations, but presumably they ultimately make the decisions. What's the dynamic? >> So it's a great question. In my experience, it's very difficult to build a structure that's perfect. So in the purely centralized model you've got this problem of people are coming to you to ask for something, and they may get turned away because you're too busy, and then in the decentralized model you tend to have lots of duplication and overlap and maybe not sharing all the things that you need to share. So we tried to build a hybrid of both. And so we had our data engineers centralized and we tried doing what we called tours of duty, so our data scientists would be embedded with various teams within the company, so it could be the core messenger team. It could be our bot platform team. It could be our anti-spam team. And they would sit with them and it's very easy for product managers and developers to ask them questions and for them to give out answers, and then we would rotate those folks through a different tour of duty after a few months and they would sit with another team. So we did that for a while, and it worked pretty well, but one of the major problems we found was that there's no good checkpoint to confirm that what they're doing is right. So in software development you're releasing a version of software. There's QA, there's code review, and there's structure in place to ensure that yes, this number I'm providing is right. When you've got a data scientist who's out with a team, it's difficult for him to come back to the team and get that peer review. So now we're kind of reevaluating that. We use an agile approach, and we have primes for each of these groups, but now we all sit together.
>> So the accountability is after the data scientist made a recommendation that the product manager agrees with, how do you ensure that it measured up to the expectation? Like sort of after the fact. >> Yeah, so in those cases with our A/B tests it's nice to have that unbiased data resource on the team that's embedded with them that can step back and say yes, this idea worked, or it didn't work. So that's the approach that we're taking. It's not a dedicated resource, but a prime resource for each of these teams that's a subject matter expert and then is evaluating the results in an unbiased kind of way. >> So you've got this relatively small, even though it's quadruple the size it was when you started, data team and then the application development team as sort of colleagues, or how do you interact with them? >> Yeah, we're actually part of the engineering organization at Kik, part of R and D, and in different times in my life I've been part of different organizations whether it's marketing or whether it's I.T. or whether it's R and D, and R and D really fits nicely. And the reason why I think it's the best is because if there's data that you need to understand users more there's much more direct control over getting that element instrumented within a product when you're part of R and D. If you're in marketing, you're like hey, I'd love to know how many times people tap on that red button, but no event fires when that red button is tapped. Good luck trying to get the software developers to put that in. But when there's an inherent component of R and D that's dependent on data, and data has that direct path to those developers, getting that kind of thing done is much easier. >> So from a tooling standpoint, thinking about data scientists and data engineers, a lot of the tools that we've seen in this so-called big data world have been quite bespoke. Different interfaces, different experience. How are you addressing that? Does Spark help with that?
Maybe talk about that a bit more. >> Yeah, so I was fortunate enough to do a session today that sort of talked about data V1 at Kik versus data V2 at Kik, and we drew this kind of a line in the sand. So when I started it was just me. I'm trying to answer these questions very quickly on these three or five day timelines that we get from our CEO. >> Vellante: You've been here a week, come on! >> Yeah exactly, so you sacrifice data engineering and architecture when you're living like that. So you can answer questions very quickly. It worked well for a while, but then all of a sudden we come up and we have 300 data pipelines. They're a mess. They're hard to manage and control. We've got code sometimes in SQL or sometimes in Python scripts, or sometimes on people's laptops. We have no real plan for GitHub integration. And then, you know, real scalability issues out of Redshift. We were doing a lot of our workloads in Redshift to do transformations just because, get the data into Redshift, write some SQL and then have your results. We're running into contention problems with that. So what we decided to do is sort of stop, step back and say, okay so how are we going to house all of this atomic data that we have in a way that's efficient? So we started with Redshift, our database was 10 terabytes. Now it's 100, except we get five terabytes of new data coming in per day, so putting that all in Redshift, it doesn't make sense. It's not all that useful. So we needed to cull that data under supervision: we don't want to get rid of the atomic data, so how do we control that data under supervision? So we decided to go the data lake route, even though we hate the term data lake, but basically a folder structure within S3 that's stored in a query-optimized format like Parquet, and now we can access that data very quickly at an atomic level, at a cleansed level and also at an aggregate level.
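One way to picture the layered "folder structure within S3" described above is a path convention: each dataset kept at three levels (atomic, cleansed, aggregate) as day-partitioned Parquet. A small sketch of such a convention — the bucket, dataset, and partition names are assumptions for illustration, not Kik's actual layout:

```python
from datetime import date

# Hypothetical layered data-lake layout: atomic, cleansed, and aggregate
# copies of each dataset, as day-partitioned Parquet under one S3 bucket.
LAYERS = ("atomic", "cleansed", "aggregate")

def lake_path(layer: str, dataset: str, day: date,
              bucket: str = "example-data-lake") -> str:
    """Build the S3 prefix where one day's Parquet files for a dataset live."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"s3://{bucket}/{layer}/{dataset}/dt={day.isoformat()}/"

print(lake_path("cleansed", "chat_events", date(2017, 2, 8)))
```

The `dt=YYYY-MM-DD` directory style is the common Hive-style partitioning that engines like Spark and Drill can prune on, which is what makes the atomic layer cheap to keep but still queryable.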
So for us, this data V2 was the evolution of stopping doing a lot of things the way we used to do, which was lots of data pipelines, kind of code that was all over the place, and then aggregations in Redshift, and starting to use Spark, specifically Databricks. Databricks we think of in two ways. One is kind of managed Spark, so that we don't have to do all the configuration that we used to have to do with EMR, and then the second is notebooks that we can align with all the work that we're doing and have revision control and GitHub integration as well. >> A question to clarify, when you've put the data lake, which is the file system and then the data in Parquet format, or Parquet files, so this is where you want to have some sort of interactive experience for business intelligence. Do you need some sort of MPP server on top of that to provide interactive performance, or, because I know a lot of customers are struggling at that point where they got all the data there, and it's kind of organized, but then if they really want to munge through that huge volume they find it slows to slower than a crawl. >> Yeah, it's a great point. And we're at the stage right now where, at the top layer of our data lake where we aggregate and normalize, we also push that data into Redshift. So with Redshift what we're trying to do is make it a read-only environment, so that our analysts and developers know they have consistent read performance on Redshift, where before, when it was a mix of batch jobs as well as read workload, they didn't have that guarantee. So you're right, and we think what will probably happen over the next year or so is the advancements in Spark will make it much more capable as a data warehousing product, and then you'd have to start to question, do I need both Redshift and Spark for that kind of thing?
But today, I think some of the cost-based optimizations that are coming, or at least the promise of them coming, I would hope that those would help Spark become more of a data warehouse, but we'll have to see. >> So carry that thread a little further through. I mean, in terms of things that you'd like to see in the Spark roadmap, things that could be improved, what's your feedback to Databricks? >> We're fortunate, we work with them pretty closely. We've been a customer for about half a year, and they've been outstanding working with us. So structured streaming is a great example of something we worked pretty closely with them on. We're really excited about it. We have certain pockets within our company that require very real-time data, so obviously your operational components. Are your servers up or down, as well as our anti-spam team. They require very low latency access to data. Typically, if we batch every hour, that's fine in most cases, but with structured streaming, our data streams are coming in now through Kinesis Firehose, and we can process those without having to worry about checking to see if it's time we should start this, or whether all the data is there so we can run this batch. Structured streaming solves a lot of those problems, it simplifies a lot of that workload for us. So that's something we've been working with them on. The other things that we're really interested in, we've got a bit of a list, but the other major one is how do you start to leverage this data to use it for personalization back in the app? So today we think of data in two ways at Kik. It's data as KPIs, so it's the things you need to run your business, maybe it's A/B testing results, maybe it's how many active users you had yesterday, that kind of thing. And then the second is data as a product, and how do you provide personalization at an individual level, based on your data science models, back out to the app.
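The structured-streaming point above, processing records as they arrive instead of waiting to confirm an hourly batch is complete, can be illustrated with a toy incremental aggregation. This is plain Python with an invented record shape, standing in for what Spark's structured streaming does at scale:

```python
from collections import defaultdict

# Toy illustration of the incremental model behind structured streaming:
# state is folded forward as each record arrives, rather than waiting for
# a batch window to be declared complete. Record shape is invented.
class RunningCounts:
    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, record):
        """Fold one arriving record into the state; return a result snapshot."""
        self.counts[record["user"]] += 1
        return dict(self.counts)

agg = RunningCounts()
for rec in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    latest = agg.update(rec)
print(latest)  # {'a': 2, 'b': 1}
```

The result is always current as of the last record seen, which is exactly the property he wants for the anti-spam and operational use cases.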
So we do that, and I should point out, at Kik we don't see anybody's messages. We don't read your messages. We don't have access to those. But we have the metadata around the transactions that you have, like most companies do. So that helps us improve our products and services under our privacy policy, to say okay, who's building good relationships and who's leaving the platform and why are they doing it. But we can also surface components that are useful for personalization, so if you've chatted with three different bots on our platform, that's important for us to know if we want to recommend another bot to you. Or, you know, the classic "people you may know" recommendations. We don't do that right now, but behind the scenes we have the kind of information that could help personalize that experience for you. So those two things are very different. In a lot of companies there's an R and D element, like at Blackberry, the App World recommendation engine was something that a team ran in production, but our team was helping those guys tweak and tune their models. So it's the same kind of thing at Kik, where our data scientists are building models for personalization, and then we need to surface them back up to the rest of the company. And the process right now of taking the results of our models and then putting them into a real-time serving system isn't that clean, so we do batches every day on things that don't need to be near real-time, things like predicted gender. If we know your first name, we've downloaded the list of baby names from the U.S. Social Security website, and we can say the name Pat is male 80 percent of the time and female 20 percent, but Joel is male 99 percent of the time and female one percent, so based on your tolerance for whatever you want to use this personalization for, we can give you our degree of confidence on that.
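The predicted-gender idea he describes amounts to a frequency lookup with a caller-supplied confidence tolerance. A minimal sketch, where the two entries simply echo the numbers quoted above and a real table would be built from the published Social Security name counts:

```python
# Frequencies below just mirror the examples in the conversation;
# a production table would come from the full baby-name dataset.
NAME_FREQ = {
    "pat":  {"male": 0.80, "female": 0.20},
    "joel": {"male": 0.99, "female": 0.01},
}

def predicted_gender(first_name, threshold=0.9):
    """Return (gender, confidence), or None if unknown or below tolerance."""
    freq = NAME_FREQ.get(first_name.lower())
    if freq is None:
        return None
    gender = max(freq, key=freq.get)  # majority label
    confidence = freq[gender]
    return (gender, confidence) if confidence >= threshold else None

print(predicted_gender("Joel"))            # ('male', 0.99)
print(predicted_gender("Pat"))             # None at the default 0.9 tolerance
print(predicted_gender("Pat", 0.75))       # ('male', 0.8)
```

The `threshold` parameter is the "tolerance" he mentions: a use case that can absorb mistakes can accept lower-confidence predictions.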
That's one example of what we surface right now in our API back to our own first-party components of our app. But in the future, with more real-time data coming in from Spark streaming, with more real-time model scoring, and then the ability to push that over into some sort of capability that can be surfaced up through an API, it gives our data team the capability of being much more flexible and fast at surfacing things that can provide personalization to the end user, as opposed to what we have now, which is all this batch processing and then loading once a day and then knowing that we can't react on the fly. >> So if I were to try and turn that into a sort of a roadmap, a Spark roadmap, it sounds like the process of taking the analysis and doing perhaps even online training to update the models, or just rescoring if you're doing something slightly less fresh, but then serving it up from a high-speed serving layer, that's when you can take data that's coming in from the game and send it back to improve the game in real time. >> Exactly. Yep. >> That's what you're looking for. >> Yeah. >> You and a lot of other people. >> Yeah, I think so. >> So how's the event been for you? >> It's been great. There are some really smart people here. It's humbling when you go to some of these sessions, and you know, we're fortunate where we try not to have to think about a lot of the details that people are explaining here, but it's really good to understand them and know that there are some smart people fixing these problems. As with all events, there have been some really good sessions, but the networking is amazing, so I'm meeting lots of great people here and hearing their stories too. >> And you're hoping to go to the hockey game tonight. >> Yeah, I'd love to go to the hockey game. See if we can get through the snow. >> Who are the Bruins playing tonight? >> San Jose. >> Oh, good. >> It could be a good game. >> Yeah, the rivalry. You guys into the hockey game? Alright, good.
Alright, Joel, listen, thanks very much for coming on theCUBE. Great segment. I really appreciate your insights and sharing. >> Okay, thanks for having me. >> You're welcome. Alright, keep it right there, everybody. George and I will be back right after this short break. This is theCUBE. We're live from Spark Summit in Boston.
John Landry, HP - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts Dave Vellante and George Gilbert. >> Welcome back to Boston everyone. It's snowing like crazy outside, it's a cold mid-winter day here in Boston, but we're here with theCUBE, the worldwide leader in tech coverage. We are live covering Spark Summit. This is wall-to-wall coverage, this is our second day here. John Landry is with us, he's the distinguished technologist for HP's personal systems data science group within Hewlett Packard. John, welcome. >> Thank you very much for having me here. >> So I was saying, I was joking, we do a lot of shows with HPE, it's nice to have HP back on theCUBE, it's been a while. But I want to start there. The company split up just over a year ago and it's seemingly been successful for both sides, but you were describing to us that you've gone through an IT transformation of sorts within HP. Can you describe that? >> In the past, we took basically a data warehousing type of approach, with reporting and what have you coming out of data warehouses, using Vertica. But recently, we made an investment into more of a programming platform for analytics, and our transformation to the cloud is about that. Instead of investing into our own data centers, because really, with the split, our data centers went with Hewlett Packard Enterprise, we're building our software platform in the cloud, and that software platform includes analytics. In this case, we're building big data on top of Spark, and that transformation is huge for us, but it's also enabled us to move a lot faster and to better match the velocity of our business. Like I said, it's mainly around the software development really more than anything else. >> Describe your role in a little bit more detail inside of HP.
>> My role is I'm the leader in our big data investments, and so I've been leading teams internally and also collaborating across HP with our print group, and what we've done is we've managed to put together a strategy around our cloud-based solution to that. One of the things that was important was that we had a common platform, because when you put a programming platform in place, if it's not common, then we can't collaborate. Our investment could be fractured, we could have a lot of little side efforts going on and what have you. So my role is to provide the leadership and the direction for that, and also, one of the reasons I'm here today is to get involved in the Spark community, because our investment is in Spark. So that's another part of my role, to get involved with the industry and to be able to connect with the experts in the industry so we can leverage off of that, because we don't have that expertise internally. >> What are the strategic and tactical objectives of your analytics initiatives? Is it to get better predictive maintenance on your devices? Is it to create new services for customers? Can you describe that? >> It's two-fold, internal and external. Internally, we've got millions of dollars of opportunity to better our products with cost, and also to optimize our business models, and the way we can do that is by using the data that comes back from our products, our services, and our customers, combining that together and creating models around it that are then automated and can be turned into apps that can be used internally by our organizations. The second part is to take the same approach, same data, but apply that back towards our customers. With the split, our enterprise services group also went with Hewlett Packard Enterprise, and so now we have a dedicated effort towards creating managed services for the commercial environment. And that's both on the print side and on the personal systems side, so to basically fuel that, analytics is a big part of the story.
So we've had different things that you'll see out there; Touchpoint Manager is one of our services we're delivering in personal systems. >> Dave: What is that? >> Touchpoint Manager is aimed at providing management services for SMB and for commercial environments. So for instance, in Touchpoint Manager, we can provide predictive types of capabilities for support. A number of different services that companies are looking for when they buy our products. Another thing we're going after is device as a service. That's another thing we've announced recently that we're invested in, and obviously, if you're delivering devices as a service, you want to do that as optimally as possible. Being able to understand the devices, what's happening with them, being able to do predictive support on them, being able to optimize the usage of those devices, that's all important. >> Dave: A lot of data. >> The data really helps us out, right? So the data that we can collect back from our devices, and being able to take that and turn it around into applications that are delivering information inside or outside, is huge for us, a huge opportunity. >> It's interesting, you talk about internal initiatives and managed services, which sound like they're mostly external, but on the internal ones, you were talking about taking customer data and internal data and turning those into live models. Can you elaborate on that? >> Sure, I can give you a great example: our mobile products all have batteries. All of our batteries are instrumented as smart batteries, and that's an industry standard, but HP actually goes a step further with the information that we put into our batteries. So by monitoring those batteries and their usage in the field, we can tell how optimally they're performing, but also how they're being used and how we can better design batteries going forward.
So in addition, we can actually provide information back into our supply chain. For instance, there's a cell supplier for the battery, there's a pack supplier, there's our unit manufacturer for the product, and a lot of what we've been able to uncover is that we can go and improve process. And improving process alone helps to improve the quality of what we deliver and the quality of the experience for our customers. So that's one example of just using the data and turning it around into a model. >> Is there an advantage to having such high volume, such market share, in getting not just more data, but sort of more of the bell curve, so you get the edge conditions? >> Absolutely. It's really interesting, because when we started out on this, everybody was used to doing reporting, which is absolute numbers and how much did you ship and all that kind of stuff. But we're doing big data, right? So in big data, you just need a good sample population. Turn the data scientists loose on that, and they've got their statistical algorithms to run against it. They give you the confidence factor based upon the data that you have, so it's absolutely a good factor for us, because we don't have to see all the platforms out there. The other thing is, when you look at populations, we see variances across different customers. One of our populations that's very valuable to us is our own, so we take the 60 thousand units that we have internally at HP, and that's one of our sample populations. What better way to get information on your own products? But you take that to one of our other customers, and their population is going to look slightly different. Why? Because they use the products differently. So one of the things is just usage of the products, the environment they're used in, how they use them. Our sample populations are great in that respect.
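The supply-chain feedback he describes, rolling battery telemetry up by supplier so each supplier can see how its product performs in the field, might look, in miniature, like this. The records, field names, and supplier labels are invented for illustration:

```python
from statistics import mean

# Invented telemetry records; a real feed would carry many more fields
# (cycle count, temperature, design capacity) per smart-battery standard.
telemetry = [
    {"supplier": "A", "full_charge_capacity_pct": 91},
    {"supplier": "A", "full_charge_capacity_pct": 88},
    {"supplier": "B", "full_charge_capacity_pct": 79},
    {"supplier": "B", "full_charge_capacity_pct": 83},
]

def capacity_by_supplier(records):
    """Average remaining-capacity percentage per cell supplier."""
    by_supplier = {}
    for r in records:
        by_supplier.setdefault(r["supplier"], []).append(
            r["full_charge_capacity_pct"])
    return {s: mean(vals) for s, vals in by_supplier.items()}

print(capacity_by_supplier(telemetry))  # {'A': 89.5, 'B': 81}
```

An app built on a rollup like this is what lets a supplier self-monitor instead of waiting for a meeting to hear about a problem.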
Of course, the other thing, very important to point out: we only collect data under the rules and regulations that are out there, so we absolutely follow that, and we absolutely keep our data secure, and that's important. People sometimes get a little bit spooked around that, but the case is that our services are provided based on customers signing up for them. >> I'm guessing you don't collect more data than Google. >> No, we're nowhere near Google. >> So, if you're not spooked at Google - >> That's what I tell people. I say if you've got a smartphone, you're giving up a lot more data than we're collecting. >> Buy something from Amazon. Spark, where does Spark fit into all of this? >> Spark is great because we needed a programming platform that could scale, and in our previous approaches, we didn't have a programming platform. We started with Hadoop, but Hadoop was very complex. It really gets down to the hardware when you're programming and trying to distribute that load across clusters, whereas you pick up Spark and you immediately get abstraction. The other thing is it allows me to hire people that can actually program on top of it. I don't have to get someone that knows MapReduce. I can sit there and it's like, what do you know? You know R, Scala, you know Python, it doesn't matter. I can run all of that on top of it. So that's huge for us. The other thing is flat out the speed, because as you start getting going with this, we get this pull all of a sudden. It's like, well, I only need the data like once a month, then it's I need it once a week, I need it once a day, I need the output of this by the hour now. So the scale and the speed of that is huge, and then when you put that on a cloud platform, you know, Spark on a cloud platform like Amazon, now I've got access to all the compute instances. I can scale that, I can optimize it, because I don't always need all the power.
The flexibility of Spark and being able to deliver that is huge for our success. >> So, I've got to ask some Columbo questions, and George, maybe you can help me sort of frame it. So you mentioned you were using Hadoop. Like a lot of Hadoop practitioners, you found it very complex. Now, Hewlett Packard has resources; many companies don't. But you mentioned people doing Python and R and Scala and MapReduce. Are you basically saying, okay, we're going to unify portions of our Hadoop complexity with Spark, and that's going to simplify our efforts? >> No, what we actually did was we started on the Hadoop side of it. The first thing we did was try to move from a data warehouse to more of a data lake approach, or repository, and that was internal, right? >> Dave: And that was a cost reduction? >> That was a cost reduction, but also data accessibility. >> Dave: Yeah, okay. >> The other thing we did was ingesting the data. When you're starting to bring data in from millions of devices, we had a problem coming through the firewall-type approach, and you've got to have something in front of that, like a Kafka or something, that can handle it. So when we moved to the cloud, we didn't even try to put up our own, we just used Kinesis, so that we didn't have to spend any resources to go solve that problem. Well, the next thing was, when we got the data, you need to ingest it, and as our data's coming in, we want to split it out and we need to clean it. We actually started out running Java, and then we ran Java on top of Hadoop, but then we came across Spark and we said, that's it. For us to go to the next step of really getting into Hadoop, we were going to have to get some more skills, and to find the skills to actually program in Hadoop was going to be complex. And to train them organically was going to be complex. We've got a lot of smart people, but- >> Dave: You got a lot of stuff to do, too.
>> That's the thing, we wanted to spend more time getting information out of the data, as opposed to the framework of getting it to run and everything. >> Dave: Okay, so there's a lot of questions coming out. You mentioned Kinesis, so you've replaced that? >> Yeah, when we went to the cloud, we used as many Amazon services as we could, as opposed to growing something for ourselves. So when we got onto Amazon, you know, getting data into an S3 bucket through Kinesis was a no-brainer. When we transferred over to the cloud, it took us less than 30 days to point our devices at Kinesis, and we had all our data flowing into S3. So that was like, wow, let's go do something else. >> So I've got to ask you something else. Again, I love when practitioners come on. One of the complaints that I hear sometimes from AWS users, and I wonder if you see this, is that the data pipeline is getting more and more complex. I've got an API for Kinesis, one for S3, one for DynamoDB, one for Elasticsearch. There must be 15 proprietary APIs that are primitives, and again, it gets complicated, and sometimes it's hard to even figure out what's the right cost model to use. Is that increasingly becoming more complex, or is it just so much simpler than what you had before that you're in nirvana right now? >> When you mentioned costs, just the cost of moving to the cloud was a major cost reduction for us. >> Reduction? >> So now it's - >> You had that HP corporate tax on you before - >> Yeah, now we're going from data centers and software licenses. >> So that was a big win for you? >> Yeah, huge, and that freed us up to go spend dollars on resources to focus on the data science aspect. So when we started looking at it, we continually optimized, don't get me wrong. But the point is, if we can bring it up real quickly, that's going to save us a lot of money, even if you don't have to maintain it.
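The ingest step he describes, raw records landing from Kinesis into S3 and then being split out and cleaned, might look, in miniature, something like this. The record shapes and field names are invented:

```python
import json

# Invented raw records, standing in for lines delivered via Kinesis to S3.
raw = [
    '{"device": "laptop", "metric": "battery", "value": 88}',
    'not-json',  # malformed records happen at device-fleet scale
    '{"device": "printer", "metric": "pages", "value": 1200}',
]

def split_and_clean(lines):
    """Parse raw lines, drop malformed ones, and split records by device type."""
    out = {}
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # cleaning step: discard records that don't parse
        out.setdefault(rec["device"], []).append(rec)
    return out

buckets = split_and_clean(raw)
print(sorted(buckets))  # ['laptop', 'printer']
```

In the pipeline he describes, this clean-and-split logic is exactly the part that moved from hand-rolled Java on Hadoop to Spark jobs.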
So we want to focus on creating the code inside of Spark that's actually doing the real work, as opposed to the infrastructure. So that cost savings was huge. Now, when you look at it over time, we could have over-analyzed that and everything else, but what we did was use a rapid prototyping approach, and then from there we continued to optimize. What's really good about the cloud is you can predict the cost, whereas with internal data centers and software licenses and everything else, you can't predict the cost, because everybody's trying to figure out who's paying for what. But in the case of the cloud, you get your bill and you understand what you're paying. >> And then you can adjust accordingly? >> We continue to optimize, so we use the services, but if for some reason something is going to deliver us an advantage, we'll go develop it. But right now, our advantage is we've got umpteen opportunities to create AI-type code and applications to basically automate these services; we don't even have enough resources to do it right now. But the common programming platform is going to help us. >> Can you drill into those umpteen examples? Just some of them, because - >> I mentioned the battery one, for instance. So take that across the whole system: now you've got your storage devices, you've got your software that's running on there, and we've got security monitoring built into our system at the firmware level. Just basically connecting into that and adding AI around it is huge, because now we can see attacks that may be happening on your fleet, and we can create services out of that. Anything that you can automate around that is money in our pocket or money in our customers' pockets, so if we can save them money with these new services, they're going to be more willing to come to HP for products. >> It's actually more than just automation, because it's the stuff you couldn't do with 1,000 monkeys trying to write Shakespeare.
You have data that you could not get before. >> You're right. The automation is helping us uncover things that we would never have seen, and you're right, it's the whole gorilla walking through the room; I could show you tons of examples of where we were missing the boat. Even when we brought up our first data sets and started looking at them, some of the stuff we looked at, we thought, this is just bad data, and actually it wasn't, it was bad product. >> People talk about dark data - >> We had no data models, we had no data model to say is it good or bad? And now we have data models, and we're continuing to create them. You create the data model, and then you can continue to teach it, and that's where we create the apps around it. Our primitives are the data models that we're creating from the device data that we have. >> Are there some of these apps where some of the intelligence lives on the device, and it can, like in a security attack, it's a big surface area, you want to lock it down right away? >> We do. A good example on the security side is we've built something into our products called Sure Start. Essentially, we have the ability to monitor the firmware layer, and there's a local process, running independently of everything else, that's monitoring what's happening at that firmware level. If there's an attack, it's going to immediately prevent the attack or recover from it. That's built into the product. >> But it has to have a model of what this anomalous behavior is. >> Well, in our case, we're monitoring what the firmware should look like, and if we see that the firmware, you know, you take checksums from the firmware or the pattern - >> So the firmware does not change? >> Well, basically we can take the characteristics of the firmware and monitor it. If we see that changing, then we know something's wrong.
Now, it can get corrupted through hardware failure, because glitches can happen. I mean, solar flares can cause problems sometimes. So the point is, we found that customers sometimes had problems where basically their firmware would get corrupted and they couldn't start their system. So we're like, are we getting attacked? Is this a hardware issue? Could it be bad flash devices? There are all kinds of things that could cause that. Well, now we monitor it and we know what's going on. Now, the other cool thing is we create logs from that, so when those events occur, we can collect those logs, and we're monitoring those events, so now we can have something monitor the logs that are monitoring all the units. If you've got millions of units out there, how are you going to do that manually? You can't, and that's where the automation comes in. >> So the logs give you the ability, up in the cloud or at HP, to look at the ecosystem of devices, but there is intelligence down on the - >> There's intelligence to protect the device and auto-recover, which is really cool. In the past, you had to go get a repair. Imagine if someone attacked your fleet of notebooks. Say you've got 10 thousand of them, and basically it brought every single one of them down in one day. What would you do? >> Dave: Freak. >> And everything you've got to replace. It was just an attack, and it could happen, so we basically protect against that with our products, and at the same time, we can see that it may be occurring, and then from the footprints of it, we can do analysis and determine: was that malicious, is this happening because of a hardware issue, is this happening because maybe we tried to update the firmware and something happened there? What caused that to happen? And that's where collecting the data from the population helps us, and then we mix that with other things like service events. Are we seeing service events being driven by this?
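The checksum-style firmware monitoring he describes can be illustrated in miniature. HP Sure Start's actual mechanism is proprietary, so this is only the general hash-against-known-good idea, plus the event record that would feed the fleet-level log monitoring he mentions; all names here are hypothetical:

```python
import hashlib

# Reference hash of a known-good firmware image (illustrative only).
GOLDEN = hashlib.sha256(b"firmware-image-v1").hexdigest()

def check_firmware(image):
    """Compare the current image hash to the known-good reference."""
    current = hashlib.sha256(image).hexdigest()
    if current == GOLDEN:
        return {"status": "ok"}
    # A mismatch produces a log event that fleet-wide monitoring can ingest.
    return {"status": "corrupt", "event": "firmware_mismatch", "hash": current}

print(check_firmware(b"firmware-image-v1")["status"])  # ok
print(check_firmware(b"tampered-image")["status"])     # corrupt
```

Collecting those mismatch events across millions of units is what lets the cloud side distinguish a one-off hardware glitch from a coordinated attack on a fleet.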
Thermal, we can look at the thermal data. Maybe there's some kind of heat issue that's causing this to happen. So we started mixing that in. >> Did Samsung come calling to buy this? >> Well, what's funny is Samsung is actually a supplier of ours, a battery supplier. So by monitoring the batteries, what's interesting is we're helping them out, because we go back to them. One of the things I'm working on is creating apps that can go back to them, so they can see the performance of the product that they're delivering to us. So instead of us having to call a meeting and saying, hey guys, let's talk about this, we've got some problems here - imagine how much time that takes - if they can self-monitor, then they're going to want to keep supplying to us, and they're going to better their product. >> That's huge. What a productivity boost, because you're like, hey, we've got a problem, let's meet and talk about it, and then you take an action to go and figure out what it is. Now, if you need a meeting, it's like, let's look at the data. >> Yeah, you don't have enough people. >> But there's also potentially a shift in pricing power. I would imagine it shifts a little more in your favor if you have all the data that indicates the quality of their product. >> That's an interesting thing. I don't know that we've reached that point. I think that in the future, it could be something that's included in the contracts. The world is the way it is today, and data is a big part of that, so going forward, absolutely, the fact that you have that data helps you to have a better relationship with your suppliers. >> And your customers. I mean, it used to be that the brand had all the information. The internet obviously changed all that, but this whole digital transformation and IoT and all this log data, that sort of levels the playing field back to the brand. >> John: It actually changes it.
>> You can now add value for the consumer that you couldn't before. >> And that's what HP's trying to do. We're investing to do exactly that, to really improve and increase the value of our brand. We have a strong brand today, but - >> What do you guys do with - we've got to wrap - but what do you do with Databricks? What's the relationship there? >> Databricks - again, we decided that we didn't want to be the experts on managing the whole Spark thing. The other part was that we were going to be involved with Spark and help them drive the direction as far as our use cases and what have you. Databricks and Spark go hand in hand. They've got the experts there, and our relationship, being able to work with these guys, has been huge. But I recognize the fact that, going back to software development and everything else, we don't want to spare resources on that. We've got too many other things to do, and the less I have to worry about my Spark code running and scaling, and the cost of it, and being able to put code in production, the better. So having that layer there is saving us a ton of money and resources, and a ton of time. Just imagine time to market, it's just huge. >> Alright, John, sorry, we've got to wrap. Awesome having you on, thanks for sharing your story. >> It's great to talk to you guys. >> Alright, keep it right there, everybody. We'll be back with our next guest. This is theCUBE, live from Spark Summit East, we'll be right back.
Manish Gupta, Redis Labs | Spark Summit East 2017
>> Announcer: Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts Dave Vellante and George Gilbert. >> Welcome back to snowy Boston, everybody. This is theCUBE, the leader in live tech coverage. We're here at Spark Summit East, hashtag SparkSummit. Manish Gupta is here, he's the CMO at Redis Labs. Manish, welcome to theCUBE. >> Thank you, good to be here. >> So, you know, 10 years ago you'd say you're in the database business and everybody would yawn. Now you're the life of the party. >> Yeah, the world has changed. I think the party has lots and lots of players. We are happy to be on the top of that heap. >> It is a crowded space, so how does Redis Labs differentiate? >> Redis Labs is the company behind the massively popular open source Redis, and Redis became popular because of its performance primarily, and then simplicity. Developers could very easily spin up an instance of Redis, solve some very hairy problems, and time to market was a big issue for them. Redis Enterprise took that forward and enabled it to be mission critical, ready for the largest workloads, ready for the things that enterprises need in a highly distributed, clustered environment. So they have resilience and they benefit from the performance of Redis. >> And your claim to fame, as you say, is that top-gun performance; you guys will talk about some of the benchmarks later. We're talking about use cases like fraud detection, as an example. Obviously ad serving would be another one. But add some color to that if you would. >> Wherever you need to make real time real, Redis plays a very important role. It is able to deliver millions of operations per second with sub-millisecond latency, and that's the hallmark.
With the data structures that comprise Redis, you can solve these problems, and the reason you can get that performance is because the data structures take some very complex issues and simplify the operation. Depending on the use case, you could use one of the data structures, or you can mix and match the data structures; that's the power of Redis. We're used for IoT, for machine learning, for metering and billing in telecommunications environments, for personalization, for ad serving with companies like Groupon and others, and the list goes on and on. >> Yeah, you've got a big list on your website of all your customers, so you can check that out. Let's get the business model piece out of the way. Everybody's always fascinated. Okay, you've got open source, how do you make money? How does Redis make money? >> Yeah, you know, we believe strategically fostering the growth of open source is foundational to our business model, and we invest heavily in both R&D and marketing to do that. On top of that, to enable enterprise success and deployment of Redis, we have the mission-critical, highly available Redis Enterprise offerings. Our monetization is entirely based on the Redis Enterprise platform, which takes advantage of the data structures and performance of core Redis, but layers on top the management capabilities that make things like auto-recovery and auto-sharding much, much easier for the enterprise. We make that available in four deployment models. The enterprise can select Redis Cloud, which runs on public infrastructure on any of the four major platforms. We also allow the enterprise to select a VPC environment in their own private clouds. They can also get the software and self-manage it, or get our software and we can manage it for them. Those four deployment options are the modalities through which enterprise customers help us monetize. >> When you said four major platforms, you meant cloud platforms? >> That's right.
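The data-structure argument Gupta makes can be made concrete. Below is a toy in-memory stand-in for Redis's sorted-set commands (ZINCRBY / ZREVRANGE), showing why a metering or leaderboard query can be answered directly instead of by scanning and re-sorting; the class and the customer names are illustrative only, not Redis Labs code, and a real deployment would issue these commands to a Redis server through a client library.

```python
# Toy in-memory stand-in for a Redis sorted set (ZINCRBY / ZREVRANGE).
# Illustrative only -- a real deployment would send these commands to a
# Redis server via a client library, not reimplement them.
class ToySortedSet:
    def __init__(self):
        self.scores = {}  # member -> score

    def zincrby(self, amount, member):
        # Increment a member's score, creating it if absent.
        self.scores[member] = self.scores.get(member, 0) + amount
        return self.scores[member]

    def zrevrange(self, start, stop, withscores=False):
        # Return members ranked by descending score, Redis-style inclusive
        # range with support for negative "stop" indices.
        ranked = sorted(self.scores.items(), key=lambda kv: -kv[1])
        if stop < 0:
            stop = len(ranked) + stop
        sliced = ranked[start:stop + 1]
        return sliced if withscores else [m for m, _ in sliced]

# Metering example: count API calls per customer, then ask for the top talkers.
usage = ToySortedSet()
for customer in ["acme", "globex", "acme", "initech", "acme", "globex"]:
    usage.zincrby(1, customer)

print(usage.zrevrange(0, 1, withscores=True))  # top two customers by call count
```

Because the structure keeps members ordered by score as writes arrive, the "top N" question is a direct range read rather than a query-time sort, which is the kind of simplification the data-structure approach buys.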
AWS, >> So, AWS, Azure >> Azure, Google, and IBM. >> Is IBM software, got there in the fourth, alright. >> That's right, all four. >> Go to the whip IBM. Go ahead, George. >> Along the lines of the business model, and we were sort of starting to talk about this earlier offline, you're just one component in building an application, and there's always this challenge of, well, I can manage my component better than anyone else, but it's got to fit with a bunch of other vendors' components. How do you make that seamless to the customer, so that it's not defaulting over to a cloud vendor who has to build all the components themselves to make it work together? >> Certainly, you know, the database is an integral part of your stack, of your application stack, but it is a stack, so there are other components. Redis and Redis Labs have a very, very large ecosystem within which we operate. We work closely with others on interfaces, on connectors, on interoperability, and that's a sustained environment that we invest in on a continuous basis. >> How do you handle application consistency? A lot of, in the no-SQL world, even in the AWS world, you hear about eventual consistency, but in the real-time world there's a need for something more rigorous. What's your philosophy there, how do you approach that? >> I think that's an issue that many no-SQL vendors have not been able to crack. Redis Labs has been at the forefront of that. We are taking an approach, and we are offering, what we call tuneable consistency. Depending on the economics and the business model and the use case, the needs for consistency vary. In some cases, you do need immediate consistency. In other cases, you don't ever need consistency. And to give that flexibility to the customer is very important, so we've taken the approach where you can go from loose consistency to what we call strong eventual consistency.
That approach is based on a fairly well-trusted architecture and approach called CRDT, Conflict-free Replicated Data Types. That approach allows us, regardless of what the cluster magnitude or the geographic distribution looks like, to deliver strong eventual consistency, which meets the needs of the majority of customers. >> What are you seeing in terms of, you know, also in that, a discussion about ACID properties, and how many workloads really need ACID properties. What are you seeing now, as you get more cloud-native workloads and more no-SQL-oriented workloads, in terms of the requirement for those ACID properties? >> First of all, we truly believe and agree that not all environments require ACID support. Having said that, to be a truly credible database, you must support ACID, and we do. Redis is ACID-compliant, and Redis Labs certainly supports that. >> I remember on a stage once with Curt Monash, I'm sure you know Curt, right? Very famous database person. And he basically had a similar answer. But you would say that increasingly there are workloads, the growth workloads, that don't necessarily require that, is that a fair statement? >> That's a fair statement I would say. >> Dave: Great, good. >> There's a trade-off, though, when you talked about strong eventual consistency; potentially you have to wait for, presumably, a quorum of the partitions, I'm getting really technical here, but in other words, you've got a copy of the data here-- >> Dave: Good CMO question. (laughing) >> But your value proposition to the customers, we get this stuff done fast, but if you have to wait for a couple other servers to make sure that they've got the update, that can slow things way down. How does that trade-off work? >> I think that's part of the power of our architecture.
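The CRDT idea behind that "strong eventual consistency" claim can be sketched in a few lines. A grow-only counter (G-Counter) is one of the simplest CRDTs: each replica accepts writes independently, and because the merge is commutative, associative, and idempotent, replicas converge no matter the order in which they exchange state. This is a generic textbook sketch, not Redis Labs' implementation.

```python
# Minimal G-Counter CRDT sketch: each replica increments only its own slot,
# and merge takes the element-wise max, so merging is commutative,
# associative, and idempotent -- replicas converge regardless of exchange
# order. Generic illustration, not Redis Labs code.
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self, amount=1):
        # Writes touch only this replica's slot, so no coordination is needed.
        self.slots[self.replica_id] += amount

    def merge(self, other):
        # Element-wise max: applying the same merge twice changes nothing.
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

    def value(self):
        return sum(self.slots)

# Two geo-distributed replicas accept writes concurrently...
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(3)  # 3 writes land in region A
b.increment(2)  # 2 writes land in region B

# ...and converge to the same total regardless of merge order.
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # both report 5
```

Real CRDT systems layer the same algebraic property onto richer types (sets, maps, counters with decrements), which is what lets geographically distributed clusters converge without a synchronous quorum on every write.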
We have a shared-nothing, single-proxy architecture where all of the replication, the disaster recovery, and the consistency management on the back end is handled by the proxy, and we ensure that the performance is not degraded while you are working through the consistency challenges. That's where a significant amount of IP is, in the development of that proxy. >> I'll take that as a, let's go into it even more offline. >> Manish: Sounds good. >> And I have some other CMO questions, if I may. A lot of young companies like yours, especially in the open source world, when they go to get the word out, they rely on their community, their open source community, and that's the core, and that makes a lot of sense, it's their peeps. As you grow more into enterprise-grade apps and workloads, how do you extend beyond that? What is Redis Labs doing to sort of reach that C-Suite, are you even trying to reach that C-Suite, up-level the messaging? How do you as a CMO deal with those challenges? >> Maybe I'll begin by talking about the personas that matter to us in the ecosystem. At the enterprise level, the architects and the developers are the primary target, which we try to influence in the early part of the decision cycle; it's at the architectural level. The teams that ultimately manage, run, and operate the infrastructure are certainly the DevOps, or the operations teams, and we spend time there. All along, for some of the enterprise engagements, CIOs, chief data officers, and CTOs tend to play a very important role in the decisions and the selection process, and so we do influence and interact with the C-Suite quite heavily. What the power of open source gives us is that groundswell of love for Redis. Literally you can walk around a developer environment, such as the Spark Summit here, and you'll find people wearing Redis Geek shirts.
And we get emails from Kazakhstan and strange places all over the world where we don't necessarily have a salesforce, requesting t-shirts, "send us stickers." Because people love Redis, and the word of mouth, that ground-level love for the technology, makes the decisions so much easier and smoother. We're not convincing; it's not a philosophical battle anymore. It's simply about the use case and the solution where Redis Enterprise fits or doesn't fit. >> Okay, so it really is that core developer community that are your advocates, and they're able to internally sell to the C-Suite. A lot of times the C-Suite, not the CTO so much, but certainly the CIO, CDO are like, "Yeah, yeah, they're geekin' out on some new hot thing. What's the business impact?" Do you get that question a lot, and how do you address it? >> I think then you get to some of the very basic tools, ROI calculators and the value proposition. For the C-level, the message is very simple. We are the least risky bet. We are the best long-term proposition, and we are the best cost answer for their implementation. Particularly as the needs are increasingly becoming more real-time in nature, they are not batch-processed. Yes, there will always be some of that, but as the workloads evolve, there is a need for faster processing, there is a need for quick insights, and real-time is not a moniker anymore, right. Real-time truly needs to be delivered today. And so, I think those three propositions for the C-Suite are resonating very well. >> Let's talk about ROI calculators for a second. I love talking about it because it underscores what a company feels as though its core value proposition is. I would think with Redis Labs part of the value proposition is you are enabling new types of workloads and new types of, whether it's sources of revenue or productivity.
And these are generally telephone numbers as compared to some of the cost savings head-to-head against your competition, which of course you want to stress as well because the CFO cares about the CAPEX. What do you emphasize in that, and we don't have to get into the calculator itself, but in the conceptual model, what's the emphasis? Is it on those sort of business-value attributes, is it on the sort of cost savings? How do you translate performance into that business value? A lot of questions there, but if you could summarize, that'd be great. >> Well, I think you can think of it in three dimensions. The very first one is, does the performance support the use case or the solution that is required? In our books, that's operations per second and the latency. The second piece is the cost side, and that has two components to it. The first component is, what are the compute requirements? So, what is the infrastructure underneath that has to support it? And the efficiency that Redis and Redis Enterprise have is dramatically superior to the alternatives. And so, the economics show up. To run a million operations per second, we can do that on two nodes, as opposed to an alternative which might need 50 nodes or 300 nodes. >> You can utilize your assets on the floor much better than maybe the competition can. >> This is where the data structures come into play quite a bit. That's one part of-- >> Dave: That's one part of the cost. >> Yeah. The other part of the cost is the human cost. >> Dave: People, yeah. >> And because, and this goes back to the open source, because of the people available with the talent, the competency, and the appreciation for Redis, it's easy to procure those people, and your cost of acquisition and deployment goes down quite a bit. So, there's a human cost to it. The third dimension to this whole equation is time to market. And time to market is measured in many ways.
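The compute dimension of that ROI argument reduces to back-of-the-envelope arithmetic. The node counts below echo the figures quoted in the interview; the per-node annual cost is a made-up placeholder, not a published price from any vendor.

```python
# Back-of-the-envelope infrastructure cost comparison for a fixed workload
# (one million ops/sec). Node counts echo the interview; the per-node
# annual cost is a hypothetical placeholder for illustration only.
NODE_COST_PER_YEAR = 10_000  # hypothetical fully loaded annual cost per node

def annual_infra_cost(nodes, cost_per_node=NODE_COST_PER_YEAR):
    return nodes * cost_per_node

redis_cost = annual_infra_cost(2)    # "two nodes" from the interview
alt_cost = annual_infra_cost(300)    # "300 nodes" from the interview

print(f"Redis: ${redis_cost:,}  alternative: ${alt_cost:,}  "
      f"ratio: {alt_cost / redis_cost:.0f}x")
```

Whatever the real per-node cost, the ratio between the two footprints is what drives the CAPEX side of the calculator; the human-cost and time-to-market dimensions stack on top of it.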
Is it lost revenue if it takes you longer to get there? And Redis consistently, in multiple analysts' reports, gets top ranking for the fastest way to get to market because of how simple it is. Beyond performance, simplicity is a second hallmark. >> That's a benefit acceleration, and you can quantify that. >> Absolutely, absolutely. And that's a revenue parameter, right. >> For years, people have been saying this Cambrian explosion of databases is unsustainable, and sort of in response we've gotten a squaring of the Cambrian explosion. The question is, with your sort of very flexible, I don't want to get too geeky, 'cause Dave'll cut me off, but the idea that you can accommodate time series and all these different types of data, are we approaching a situation where customers can start consolidating their database choices and have fewer vendors, fewer products in their landscape? >> I think not only are we getting there, but we must get there. You've got over 300 databases in the marketplace, and imagine a CIO or an architect having to sort through that to make a decision; it's difficult, and you certainly cannot support it from a training standpoint or from an investment, CAPEX, and all that standpoint. What we have done with Redis is introduce something called Redis Modules. We released that at the last RedisConf in May in San Francisco. And a Redis Module is a very simple concept but a very powerful concept. It's an API which can be utilized to take an existing development effort, written in C/C++, and port it onto the Redis data structures. This gives you the flexibility, without having to reinvent the wheel every single time, to take that investment, port it on top of Redis, and you get the performance, and now Redis becomes a multi-model database. And I'm going to get to your answer of how you address the multiple needs so you don't need multiple databases.
To give you some examples, since the introduction of Redis Modules, we now have over 50 modules that have been published by a variety of places, not just Redis Labs, which indicates how simple and how powerful this model is. We took Lucene and developed the world's fastest full-text search engine as a module. We have very recently introduced Redis machine learning as a module that works with Spark ML and serves as a great serving layer in the machine learning domain. Just two very simple examples, but work that's being done is ported over onto Redis data structures, and now you have the ability to do some very powerful things because of what Redis is. And this is the way the future's going to be. I think every database is trying to offer multi-functionality, to be multi-model in nature, but instead of doing it one step at a time, this approach gives us the ability to leverage the entire ecosystem. >> Your point being consolidation's inevitable in this business as well. >> Manish: Architectural consolidation. >> Yes, but also, you would think, company consolidation, isn't that going to follow? What do you make of the market, and tell me, if you look back on the database market and what Oracle was able to achieve in the face of, maybe not as many players, but you had Sybase and Informix, and certainly DB2's still around, and SQL Server's still around, but Oracle won, and maybe it was SQL standards that. It's great to be lucky and good. Can we learn from that, or is this a whole different world? Are there similarities, and how do you see that consolidation potentially shaking out, if you agree that there will be consolidation? >> Yeah, there has to be, first and foremost, an architectural approach that solves the OPEX and CAPEX challenge for the enterprise. But beyond that, no industry can sustain the diversity and the fragmentation that exists in the database world. I think there will always be new things coming out of universities particularly.
There's great innovation and research happening, and that is required to augment. But at the end of the day, the commercial enterprises cannot sustain the fragmented volume that we have today in the database world, so there is going to be some consolidation, and it's not unnatural. I think it's natural, it's expected; time will tell what that looks like. We've seen some of our competitors acquire smaller companies to add graph functionality, to add search functionality. We just don't think that's the level of consolidation that really moves the needle for the industry. It's got to be at a higher level of consolidation. >> I don't want to, don't take this the wrong way, don't hate me for saying it, but is Oracle sort of the enemy, if I can say that. I mean, it's like, no, okay. >> Depends how you define enemy. >> I'm not going to go do many of the workloads that you're talking about on Oracle, despite what Larry tells me at Oracle OpenWorld. And I'm not going to make Oracle my choice for any of the workloads that you guys are working on. I guess in terms, I mean, everybody who's in the database business looks at that and says, "Hey, we can do it cheaper, better, more productively," but could you respond to that, and what do you make of Amazon's moves in the database world? Does that concern you? >> We think of Amazon and Oracle as two very different philosophies, if you can use that word. The approach we have taken is really a forward-looking approach and philosophy. We believe that the needs of the market need to be solved in new ways, and new ways should not be encumbered by old approaches. We're not trying to go and replicate what was done in the SQL world or in a relational database world. Our approach is, how do you deliver a multi-model database that has the real-time attribute attached to it, in a way that requires very limited compute horsepower and very few resources to manage?
You take all of those things as kind of the core philosophy, which is a forward-looking philosophy. We are definitely not trying to replicate what an Oracle used to be. AWS I think is a very different animal. >> Dave: Interesting, though. >> They have defined the cloud, and I think they play a very important role. We are a strong partner of theirs; much of our traffic runs on AWS infrastructure, and certainly also on other clouds. I think AWS is one to watch in how they evolve. They have database offerings, including Redis offerings. However, we fully recognize, and the industry recognizes, that that's not to the same capability as Redis Enterprise. It's open source Redis managed by AWS, and that's fine as a cache, but you cannot persist, and you really cannot have a multi-model capability that's a full database in that approach. >> And you're in the marketplace. >> Manish: We are in the marketplace. >> Obviously. >> And actually, we announced a few weeks ago that you can buy and get Redis Cloud access, which is Redis Enterprise in the cloud, on AWS through the integrated billing approach on their marketplace. You can have an AWS account and get our service, the true Redis Enterprise service. >> And as a software company, you'd figure, okay, the cloud infrastructure is a service, we don't care what infrastructure it runs on. Whatever the customer wants. But you see AWS making these moves up-market, you've got to obviously be paying attention to that. >> Manish: Certainly, certainly. >> Go ahead, last question. >> Interesting that you were saying that to solve this problem of proliferation of choice it has to be multi-model with speed and low resource requirements.
If I were to interpret that from an old-style database perspective, the multi-model part is something you are addressing now with the extensibility, but the speed means taking out that abstraction layer that was the query optimizer, sort of, and working almost at the storage layer, or having an option to do that. Would that be a fair way to say it? >> No, I don't think that necessarily needs to be the case. For us, speed translates from the simplicity and the power of the data structures. Instead of having to serialize and deserialize before you process data in a Spark context, or instead of having to look for data that is perhaps not put in sorted sets for a use case that you might be running a query on, if the data is already handled through one of the data structures, you now have a much faster query time; you now have the ability to reach the data in the right approach. And again, this is no-SQL, right, so it's schema-less on write and it sets your schema as you want it to be on read. We marry that with the data structures, and that gives you the ultimate speed. >> We have to leave it there, but Manish, I'll give you the last word. Things we should be paying attention to for Redis Labs this year, events, announcements? >> I think the big thing I would leave the audience with is RedisConf 2017. It's May 31 to June 2 in San Francisco. We are expecting over 1,000 people. The brightest minds around Redis and the database world will be there, and anybody who is considering deploying a next-generation database should attend. >> Dave: Where are you doing that? >> It's the Marriott Marquis in San Francisco. >> Great, is that on Howard Street, across from the--? >> It is right across from Moscone. >> Great, awesome location. People know it, easy to get to. Well, congratulations on the success. We'll be lookin' for outputs from that event, and hope to see you again on theCUBE. >> Thank you, enjoyed the conversation. >> Alright, good.
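The "schema-less on write, schema on read" point from that exchange can be illustrated in a few lines: heterogeneous records are stored exactly as they arrive, and each reader projects the shape it needs at query time. This is a generic sketch of the pattern, not Redis-specific code; the field names are made up for illustration.

```python
# Schema-on-read sketch: writes accept heterogeneous records unchanged;
# each reader applies its own schema (projection + defaults) at read time.
# Generic illustration of the pattern -- not Redis or Redis Labs code.
store = []  # stand-in for a schema-less store

def write(record):
    store.append(record)  # no schema enforced on write

def read(schema):
    # schema maps field -> default; each reader fills in what it needs
    return [{f: rec.get(f, d) for f, d in schema.items()} for rec in store]

write({"user": "ada", "clicks": 3})
write({"user": "lin", "clicks": 7, "country": "NZ"})  # extra field is fine

# One reader only cares about clicks; another also wants country.
print(read({"user": None, "clicks": 0}))
print(read({"user": None, "country": "unknown"}))
```

The write path never rejects a record for having a new or missing field, which is the flexibility Gupta is pointing at; the cost is that every reader must decide how to interpret missing data.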
Keep it right there, everybody, we'll be back with our next guest. This is theCUBE, we're live from Spark Summit East. Be right back. (upbeat electronic rock music)
Bryan Duxbury, StreamSets | Spark Summit East 2017
>> Announcer: Live from Boston, Massachusetts. This is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts Dave Vellante and George Gilbert. >> Welcome back to snowy Boston everybody. This is theCUBE, the leader in live tech coverage. This is Spark Summit East, hashtag SparkSummit. Bryan Duxbury's here. He's the vice president of engineering at StreamSets. Cleveland boy! Welcome to theCUBE. >> Thanks for having me. >> You're very welcome. Tell us, let's start with StreamSets. We're going to talk about Spark and some of the use cases that it's enabling and some of the integrations you're doing. But what does StreamSets do?
>> I remember when I broke into this industry, you know, in the days of the mainframe. You used to read about them and they had this high-speed data mover. And it was this key component. And it had to be integrated. It had to be able to move, back then, large amounts of data fast. Today, especially with the advent of Hadoop, people say okay, don't move the data, keep it in place. Now that's not always practical. So talk about the sort of business case for starting a company that basically moves data. >> We handle basically the one step before. I agree with you completely. In many data analytical situations today, where you're doing the true, business-oriented detail, where you're actually analyzing data and producing value, you can do it in place. Which is to say in your cluster, in your Spark cluster, all the different environments you can imagine. The problem is that if it's not there already, then it's a pretty monumental effort to get it there. A lot of people think, oh, I can just write a SQL script, right? And that works for the first two to 20 tables you want to deploy. But for instance, in my background, I used to work at Square. I ran a data platform there. We had 500 tables we had to move on a regular basis. Coupled with a whole variety of other data sources. So at some point it becomes really impractical to hand-code these solutions. And even when you build your own framework, and you start to build tools internally, you know, it's not the job of these companies to build a world class data movement tool. It's their job to make the data valuable, right? Data movement is like a utility, and if you're providing the utility, really the thing to do is be productive and cost effective, right? So the reason why we built StreamSets, the reason why this thing is a thing in the first place, is because we think people shouldn't be in the business of building data movement tools.
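The hand-coded approach Bryan describes hits a wall at hundreds of tables. A minimal, stdlib-only Python sketch (hypothetical, using SQLite as a stand-in for real sources) shows the generic copy loop teams end up writing, and hints at everything it leaves out: drivers, type mapping, retries, and schema change handling.

```python
import sqlite3

def copy_table(src, dst, table):
    # Read the table's DDL from the source and recreate it in the destination.
    ddl = src.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()[0]
    dst.execute(ddl)
    # Move the rows generically, whatever the schema happens to be.
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    if rows:
        marks = ",".join("?" * len(rows[0]))
        dst.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    dst.commit()

def copy_all(src, dst):
    # At 500 tables, per-table scripts are unmanageable; a generic loop
    # is the usual first framework people build in-house.
    tables = [r[0] for r in src.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        copy_table(src, dst, t)
    return tables
```

Even this toy version assumes a single driver and identical SQL dialects on both ends, which is exactly the assumption that breaks in a real multi-source shop.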
They should be in the business of moving their data and then getting on with it. Does that make sense? >> Yeah absolutely. So talk about how it all fits in with Spark generally, and specifically Spark coming to the enterprise. >> Well in terms of how StreamSets connects to stuff, we deploy in every way you can imagine, whether you want to run on premise, on your own machines, or in the Cloud. It's up to you to deploy however you like. We're not prescriptive about that. We often get deployed on the edge of clusters, whether it's your Hadoop cluster or your Spark cluster. And basically we try not to get in the way of these analysis tools. There are many great analytical tools out there; Spark is a great example. We focus really on the moving of data. So what you'll see is someone will build a Spark streaming application or some big Spark SQL thing that actually produces the reports. And we plug in ahead of that. So if your data is being collected from, you know, edge web logs or some Kafka thing or a third party API or a scripting website, we do the first collection. And then it's usually picked up from there with the next tool. Whether it's Spark or other things. I'm trying to think about the right way to put this. I think that people who write Spark should focus on the part that's the business value for them. They should be doing the thing that actually applies the machine learning model, or produces the report that the CEO or CTO wants to see. And move away from the ingest part of the business. Does that make sense? >> Yeah. >> Yeah. The Spark guys sort of aspire to that by saying you don't have to worry about exactly-once delivery. And, you know, you've got guarantees that data will get from point A to point B. >> Bryan: Yeah. >> Things like that.
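The "first collection" step described here, turning raw edge web logs into structured records before a Spark job or Kafka consumer picks them up, can be sketched like this (illustrative only; the log format, field names, and queue hand-off are assumptions, not StreamSets internals):

```python
import re
from queue import Queue

# Common-log-style pattern; a real collector handles many more formats.
LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def ingest(lines, out: Queue):
    """First-mile collection: normalize raw access-log lines into
    structured records and hand them off to whatever consumes them
    next (a Spark job, a Kafka topic, etc.)."""
    for line in lines:
        m = LOG.match(line)
        if m:  # drop unparseable lines rather than crash the pipeline
            rec = m.groupdict()
            rec["status"] = int(rec["status"])
            out.put(rec)
```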
But all those sources of data and all those targets, writing all those adapters is, I mean, that's been a La Brea tar pit for many companies over time. >> In essence that is our business. I think that you touch on a good point. Spark can actually do some of these things, right? There's not complete, but significant, overlap in some cases. But the important difference is that Spark is a cluster tool for working with cluster data. And we're not going to beat you running a Spark application for consuming from Kafka to do your analysis. But you want to use Spark for reading local files? Do you want to use Spark for reading from a mainframe? These are the things that StreamSets is built for. And that library of connectors you're talking about, it's our bread and butter. It's not your job as a data scientist, you know, applying Spark, to build a library of connectors. So actually the challenge is not the difficulty of building any one connector, because we have that down to an art now. It's that we can afford to invest, we can build a portfolio of connectors. But you as a user of Spark can only afford to do it on demand. Reactive. And so that turnaround time, the cost it might take you to build that connector, is pretty significant. And actually I often see the flip side. This is a problem I faced at Square: when people asked me to integrate new data sources, I had to say no. Because it was too rare, it was too unusual for what we had to do. We had other things to support. So the problem with that is that I have no idea what kind of opportunity cost I left behind. Like what kind of data we didn't get, what kind of analysis we couldn't do. And with an approach like StreamSets, you can solve that problem sort of up front even. >> So, sort of two follow-ups. One is it would seem to be an evergreen effort to maintain the existing connectors. >> Bryan: Certainly. >> And two, is there a way to leverage connectors that others have built, like the Kafka Connect type stuff.
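The "library of connectors" idea is essentially a registry behind one uniform read interface, so adding a source doesn't touch any consumer. A toy Python version (the schemes and the in-memory backing store are made up for illustration):

```python
CONNECTORS = {}

def connector(scheme):
    """Register a reader under a URI scheme, building up a portfolio
    of connectors behind one uniform read() interface."""
    def wrap(fn):
        CONNECTORS[scheme] = fn
        return fn
    return wrap

_STORE = {"demo": ["a", "b"]}  # stand-in for a real backing system

@connector("mem")
def read_mem(key):
    return _STORE[key]

@connector("file")
def read_file(path):
    with open(path) as f:
        return f.read().splitlines()

def read(uri):
    # Consumers only ever call read(); new sources are new registrations.
    scheme, _, rest = uri.partition("://")
    if scheme not in CONNECTORS:
        raise ValueError(f"no connector for scheme {scheme!r}")
    return CONNECTORS[scheme](rest)
```

This is also why community contributions work in such a model: a new connector is one registered function, not a change to the engine.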
>> Truthfully we are a heavy-duty user of open source software, so our actual product, if you dig in to what you see, it's a framework for executing pipelines. And it's for connecting other software into our product. So it's not like when we integrate Kafka we built a brand new blue-sky Kafka connector. We actually integrate what's out there. So our idea is to bring as much of that stuff in there as we can. And really be part of the community. You know, our product is also open source. So we play well with the community. We have had people contribute connectors. People who say we love the product, we need it to connect to this other database. And then they do it for us. So it's been a pretty exciting situation. >> We were talking earlier off-camera, George and I have been talking all week about the batch workloads, interactive workloads, and now you've got these sort of new emerging workloads, continuous streaming workloads, which is in the name. What are you seeing there? And what kind of use cases is that enabling? >> So we're focused mostly on the continuous delivery workload. We also deliver the batch stuff. What we're finding is people are moving farther and farther away from batch in general. Because batch was not the goal, it was a means to the end. People wanted to get their data into their environment, so they could do their analysis. They want to run their daily reports, things like that. But ask any data scientist, they would rather the data show up immediately. So we're definitely seeing a lot of customers who want to do things like moving data live from a log file into Hadoop where they can read it immediately, on the order of minutes. We're trying to do our best to enable those kinds of use cases. In particular we're seeing a lot of interest in the Spark arena, obviously that's kind of why we're here today. You know, people want to add their event processing, or their aggregation, and analysis, like Spark, especially like Spark SQL.
And they want that to be almost happening at the time of ingest. Not once it's landed, but when it's happening. So we're starting to build integration. We have kind of our foot in the door there, with our Spark processor, which allows you to put a Spark workflow right in the middle of your data pipeline. Or as many of them as you want, in fact. And we also sort of manage the lifecycle of that. And do all those connections as required to make your pipeline pretend to have a Spark processor in the middle. We really think that with that kind of workload, you can do your ingest, but you can also capture your real-time analytics along the way. And that doesn't replace batch reporting per se, that'll happen after the fact. Or your daily reports, or what have you. But it makes it that much easier for your data scientists to have, you know, a piece of intelligence that they had in flight. You know? >> I love talking to someone who's a practitioner now working for a company that's selling technology. What do you see, from both perspectives, as Spark being good at? You know, what's the best fit? And what's it not good at? >> Well I think that Spark is following the arc of Hadoop, basically. It started out as infrastructure for engineers, for building really big scary things. But it's becoming more and more a productivity tool for analysts, data scientists, machine-learning experts. And we see that popping up all the time. And it's really exciting, frankly, to think about these streaming analytics that can happen. These scoring machine-learning models. Really bringing a lot more power into the hands of these people who are not engineers. People who are much more focused on the semantic value of the data. And not the garbage-in-garbage-out value of the data. >> You were talking before about how it's really hard, data movement, and the data's not always right. Data quality continues to be a challenge. >> Bryan: Yeah. >> Maybe comment on that.
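The "processor in the middle of your data pipeline" pattern, with analytics captured in flight rather than after landing, can be sketched with plain functions standing in for pipeline stages (the stage names and the metrics captured are illustrative, not a StreamSets API):

```python
def run_pipeline(source, stages):
    """Push each record through a chain of processors; a stage may
    transform a record or drop it by returning None."""
    out = []
    for rec in source:
        for stage in stages:
            rec = stage(rec)
            if rec is None:
                break
        else:
            out.append(rec)
    return out

# Analytics captured along the way, not after the data has landed.
stats = {"n": 0, "total": 0.0}

def enrich(rec):
    # A middle stage that enriches each record in flight.
    return dict(rec, amount_usd=rec["cents"] / 100)

def inflight_metrics(rec):
    # A pass-through stage that updates running aggregates.
    stats["n"] += 1
    stats["total"] += rec["amount_usd"]
    return rec
```

In the real product the middle stage would be a Spark workflow rather than a Python function, but the shape, ingest plus in-flight enrichment plus pass-through analytics, is the same.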
The state of data quality, and how is the industry dealing with that problem? >> It is hard, it is hard. I think that the traditional approach to data quality is to try and specify the quality up front. We take the opposite approach. We basically say that it's impossible to know that your data will be correct at all times. So we have what we call schema drift tools. We take an intent-driven approach to interacting with your data, rather than a schema-driven approach. So of course your data has an implicit schema as it's passing through the pipeline. But rather than saying, let's transform column three, we want you to use the name. We want you to be aware of what it is you're trying to actually change and affect. And the rest just kind of flows along with it. There's no magic bullet for every kind of data-quality issue or schema change that could possibly come into your pipeline. We try our best to make it easy for you to do effectively the best practice. The thing that will survive the future, that builds robust data pipelines. This is one of the biggest challenges, I think, with home-grown solutions. It's really easy to build something that works. It's not easy to build something that works all the time. It's very easy to not imagine the edge cases. 'Cause it might take you a year until you've actually encountered, you know, the first big problem. The real gotcha that you didn't consider when you were building your own thing. And those of us at StreamSets who have been in the industry and on the user side, we've had some of these experiences. So we're trying to export that knowledge into the product. >> Dave: Who do you guys sell to? >> Everybody. (laughing) We see a lot of success today with what we call Hadoop replatforming. Which is people who are moving from their huge variety of data sources into a Hadoop data-lake kind of environment. Also Cloud, people are moving into the Cloud.
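The intent-driven, name-based transform described here, as opposed to a positional, schema-bound script, can be sketched as follows (illustrative; `uppercase_field` is a made-up stage, not a StreamSets API):

```python
def uppercase_field(records, field):
    """Intent-driven transform: address one field by name and let the
    rest of each record flow through untouched, so added or missing
    columns don't break the pipeline."""
    for rec in records:
        if field in rec:  # tolerate records where the field is absent
            rec = {**rec, field: rec[field].upper()}
        yield rec
```

A positional script ("transform column three") would silently corrupt data the first time a column was added or reordered upstream; the name-based version just keeps flowing, which is the essence of surviving schema drift.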
They need a way for their data to get from wherever it is to where they want it to be. And certainly people could script these things manually. They could build their own tools for this. But it's just so much more productive to do it quickly in a UI. >> Is it an architect who's buying your product? Is it a developer? >> It's a variety. So I think our product resonates greatly with a developer. But also people who are higher up in the chain. People who are trying to design their whole topology. The thing I love to talk about is, everyone, when they start on a data project, they sit down and they draw this beautiful diagram with boxes and arrows that says here's where the data's going to go. But a month later, it works, kind of, but it's never that thing. >> Dave: Yeah because the data is just everywhere. >> Exactly. And the reality is that what you have to do to make it work correctly within SLA guidelines and things like that is so not what you imagined. But then you can almost never go backwards. You can never say, based on what I have, give me the boxes and arrows, because it's a systems analysis effort that no one has the time to engage in. But since StreamSets actually instruments every step of the pipeline, and we have a view into how all your pipelines actually fit together, we can give you that. We can just generate it. So we actually have a product. We've been talking about the StreamSets data collector, which is the core data movement product. We have our enterprise edition, which is called the Dataflow Performance Manager, or DPM. It basically gives you a lot of collaboration and enterprise grade authentication. And access control, and the command and control features. So it aggregates your metrics across all your data collectors. It helps you visualize your topology. So people like your director of analytics, or your CIO, who want to know is everything okay? We have a dashboard for them now. And that's really powerful. It's a beautiful UI.
And it's really a platform for us to build visualizations with more intelligence. That looks across your whole infrastructure. >> Dave: That's good. >> Yeah. And then the thing is, this is strangely kind of unprecedented. Because, you know, again, the engineer who wants to build this himself would say, I could just deploy Graphite. And all of a sudden I've got graphs, it's fine, right? But they're missing the details. What about the systems that aren't under your control? What about the failure cases? All these things, these are the things we tackle. 'Cause it's our business, we can afford to invest massively and make this a really first-class data engineering environment. >> Would it be fair to say that Kafka sort of as it exists today is just data movement built on a log, but that it doesn't do the analytics. And it doesn't really yet, maybe it's just beginning to do some of the monitoring, you know, with a dashboard, or that's a statement of direction. Would it be fair to say that you can layer on top of that? Or you can substitute on top of it with all the analytics? And then when you want the really fancy analytic soup, you know, call out to Spark. >> Sure, I would say that for one thing we definitely want to stay out of the analytics space. We think there are many great analytics tools out there, like Spark. We also are not a storage tool. In fact, we're queue-like, but we view ourselves more like, if there's a pipe and a pump, we're the pump. And Kafka is the pipe. I think that from a monitoring perspective, we monitor Kafka indirectly. 'Cause if we know what's coming out, and we know what's going in later, we can give you the stats. And that's actually what's important. This is actually one of the challenges of having sort of a home-grown or disconnected solution: stitching it together so you understand the end to end is extremely difficult.
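Monitoring an opaque stage "indirectly", from what goes in and what comes out at its boundaries, can be sketched with a counting wrapper (purely illustrative; this is not how DPM is implemented):

```python
class Meter:
    """Wrap an iterator and count the records crossing a boundary, so
    an opaque middle stage (a Kafka topic, a Spark job) can be
    monitored from its ins and outs without instrumenting it."""
    def __init__(self, it, counts, name):
        self.it, self.counts, self.name = iter(it), counts, name
        counts[name] = 0

    def __iter__(self):
        return self

    def __next__(self):
        rec = next(self.it)  # StopIteration propagates naturally
        self.counts[self.name] += 1
        return rec
```

Comparing the "in" and "out" counters tells you throughput and loss across the middle stage even though you never looked inside it, which is exactly the indirect-monitoring idea.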
'Cause if you have a relational database, and a Kafka, and a Hadoop, and a Spark job, sure, you can monitor all those things. They all have their own UIs. But if you can't understand what the state of the whole system is, you're left with four windows open trying to figure out where things connect. And it's just too difficult. >> So just from a sort of positioning point of view, for someone who's trying to make sense out of all the choices they have, to what extent would you call yourself a management framework for someone who's building these pipelines, whether from scratch, or buying components. And to what extent is it, I guess, when you talk about a pump, that would be almost like the run time part of it. >> Bryan: Yeah, yeah. >> So you know there's a control plane and then there's a data plane. >> Bryan: Sure. >> What's the mix? >> Yeah, well, we do both for sure. I mean, I would say that the data plane for us is StreamSets data collector. We move data, we physically move the data. We have our own internal pipeline execution engine. So it doesn't presuppose any other existing technologies; it's not dependent on Hadoop or Spark or Kafka or anything. You know, to some degree data collector is also the control plane for small deployments. Because it does give you start-to-stop command and control. Some metrics monitoring, things like that. Now, people need to expand beyond the realm of a single data collector when they have enterprises with more than one business unit, or data center, or security zone, things like that. You don't just deploy one data collector, you deploy a bunch, dozens or hundreds. And in that case, that's where Dataflow Performance Manager again comes in, as that control plane. Now Dataflow Performance Manager has no data in it. It does not pass your actual business data. But it does again aggregate all of your metrics from all your data collectors and gives you a unified view across your whole enterprise. >> And one more follow-up along those lines.
When you have a multi-vendor stack, or a multi-vendor pipeline. >> Bryan: Yeah. >> What gives you the meta view? >> Well, we're at the ins and outs. We see the interfaces. So in theory someone could consume data out of Kafka and do something, right? Then there's another job later, like a Spark job. >> George: Yeah. >> So we don't have automatic visibility for that. But our plan in the future is to expand Dataflow Performance Manager to take third party metric sources, effectively. To broaden the view of your entire enterprise. >> You've got a bunch of stuff on your website here which is kind of interesting. Talking about some of the things we talked about. You know, taming data drift is one of your papers. The silent killer of data integrity. And some other good resources. So just in sort of closing, how do we learn more? What would you suggest? >> Sure, yeah, please visit the website. The product is open source and free to download. Data collector is free to download. I would encourage people to try it out. It's really easy to take for a spin. And if you love it you should check out our community. We have a very active Slack channel and Google group, which you can find from the website as well. And there's also a blog full of tutorials. >> Yeah, well, you're solving gnarly problems that a lot of companies just don't want to deal with. That's good, thanks for doing the dirty work, we appreciate it. >> Yeah, my pleasure. >> Alright, Bryan, thanks for coming on "The Cube." >> Thanks for having me. >> Good to see you. You're welcome. Keep right there, buddy, we'll be back with our next guest. This is "The Cube," live from Spark Summit East in Boston, #SparkSummit. We'll be right back.
Robbie Strickland, IBM - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from Boston, Massachusetts, this is theCube, covering Spark Summit East 2017, brought to you by Databricks. Now here are your hosts Dave Vellante and George Gilbert. >> Welcome back to theCube, everybody, we're here in Boston. The Cube is the worldwide leader in live tech coverage. This is Spark Summit, hashtag #SparkSummit. And Robbie Strickland is here. He's the Vice President of Engines & Pipelines, I love that title, for the Watson Data Platform at IBM Analytics, formerly with The Weather Company, which was acquired by IBM. Welcome to theCube, good to see you. >> Thank you, good to be here. >> So, my standing tongue-in-cheek line is the industry's changing: Dell buys EMC, IBM buys The Weather Company. [Robbie] That's right. >> Wow! That sort of says it all, right? But it was kind of a really interesting blockbuster acquisition. Great for the folks at The Weather Company, great for IBM, so give us the update. Where are we at today? >> So, it's been an interesting first year. Actually, we just hit our first anniversary of the acquisition and a lot has changed. Part of my new role at IBM, having come from The Weather Company, is a byproduct of the two companies bringing together our best analytics work. I don't know if we have some water, but that would be great. So, (coughs) excuse me. >> Dave: So, let me chat for a bit. >> Thanks. >> Feel free to clear your throat. So, you were at IBM, the conference at the time was called IBM Insight. It was the day before the acquisition was announced and we had David Kenny on. David Kenny was the CEO of The Weather Company. And I remember we were talking, and I was like, wow, you have such an interesting business model. Off camera, I was like, what do you want to do with this company, you guys are like prime. Are you going public, are you going to sell this thing, I know you have an MBA background. And he goes, "Oh, yeah, we're having fun."
Next day was the announcement that IBM bought The Weather Company. I saw him later and I was like, "Aha!" >> And now he's the leader of the Watson Group. >> That's right. >> Which is part of that, The Weather Company joined The Watson Group. >> And The Cloud and analytics groups have come together in recognition that analytics and The Cloud are peanut butter and jelly. >> Robbie: That's absolutely right. >> And David's running that organization, right? >> That is absolutely right. So, it's been an exciting year, it's been an interesting year, a lot of challenges. But I think where we are now with the Watson Data Platform is a real recognition that the use case, where we want to try to make data and analytics and machine learning, and operationalizing all of those, that that's not easy for people. And we need to make that easy. And our experience doing that at The Weather Company and all the challenges we ran into have informed the organization, have informed the road map and the technologies that we're using to kind of move forward on that path. >> And The Watson Data Platform was announced in, I believe, October. >> Robbie: That's right. >> You guys had a big announcement in New York City. And you took many sort of components that were viewed as individual discrete functions-- >> Robbie: That's right. >> And brought them together in a single data pipeline. Is that right? >> Robbie: That's right. >> So, maybe describe that a little bit for our audience. >> So, the vision is, you know, one of the things that's missing in the market today is the ability to easily grab data from some source, whether it's a database or a Kafka stream, or some sort of streaming data feed, which is actually something that's often overlooked. Usually you have platforms that are oriented around streaming data, data feeds, or oriented around data at rest, batch data. One of the things that we really wanted to do was sort of combine those two together because we think that's really important.
So, to be able to easily acquire data at scale, bring it into a platform, orchestrate complex workflows around that, with the objective, of course, of data enrichment. Ultimately, what you want to be able to do is take those raw signals, whatever they are, and turn that into some sort of enriched data for your organization. And so, for example, we may take signals in from a mobile app, things like beacons, usage beacons on a mobile app, and turn that into a recommendation engine so we can feed real time content decisions back into a mobile platform. Well, that's really hard right now. It requires lots of custom development. It requires you to essentially stitch together your pipeline end to end. It might involve a machine learning pipeline that runs a training pipeline. It might involve, it's all batch oriented, so you land your data somewhere, you run this machine learning pipeline maybe in Spark or Hadoop or whatever you've got. And then the results of that get fed back into some data store that gets merged with your online application. And then you need to have a restful API or something for your application to consume that and make decisions. So, our objective was to take all of the manual work of standing up those individual pieces and build a platform where that is just what it's designed to do. It's designed to orchestrate those multiple combinations of real time and batch flows. And then with a click of a button and a few configuration options, stand up a restful service on top of whatever the results are. You know, either at an interim stage or at the end of the line. >> And you guys gave an example. You actually showed a demo at the announcement. And I think it was a retail example, and you showed a lot of what would traditionally be batch processes, and then real time, a recommendation came up and completed the purchase. The inference was this is an out of the box software solution. >> Robbie: That's right.
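The batch-enrichment-plus-online-lookup flow Robbie describes (usage beacons in, a recommendation model out, results served back to the app) can be sketched end to end (entirely illustrative; the function names and the trivial co-occurrence "model" are assumptions, not the Watson Data Platform's actual machinery):

```python
from collections import Counter, defaultdict

def train(events):
    """Batch side: turn raw usage beacons (user, item) pairs into a
    small per-user top-items model."""
    seen = defaultdict(Counter)
    for user, item in events:
        seen[user][item] += 1
    return {u: [i for i, _ in c.most_common(3)] for u, c in seen.items()}

def recommend(model, user, fallback=("popular",)):
    """Online side: the restful endpoint just looks up the precomputed
    enrichment and answers immediately."""
    return model.get(user, list(fallback))
```

The point of the platform pitch is that everything around this core, scheduling the training run, landing the model, standing up the REST endpoint, is exactly the glue that otherwise gets hand-built per project.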
>> And that's really what you're saying you've developed. A lot of people would say, oh, it's IBM, they've cobbled together a bunch of their old products, stuck them together, put an abstraction layer on, and wrapped a bunch of services around it. I'm hearing from you-- >> That's exactly, that's just WebSphere. It's WebSphere repackaged. >> (laughing) Yeah, yeah, yeah. >> No, it's not that. So, one of the things that we're trying to do is, if you look at our cloud strategy, I mean, this is really part and parcel, I mean, the nexus of the cloud strategy is the Watson Data Platform. What we could have done is we could have said let's build a fantastic cloud and compete with Amazon or Google or Microsoft. But what we realized is that there is a certain niche there of people who want to take individual services and compose them together and build an application. Mostly on top of just raw VMs with some additional, you know, let's stitch together something with Lambda or stitch together something with SQS, or whatever it may be. Our objective was to sort of elevate that a bit, not try to compete on that level. And say, how do we bring Enterprise grade capabilities to that space. Enterprise grade data management capabilities end-to-end application development, machine learning as a first class citizen, in a cohesive experience. So that, you know, the collaboration is key. We want to be able to collaborate with business users, data scientists, data engineers, developers, API developers, the consumers of the end results of that, whether they be mobile developers or whatever. One of the things that is sort of key, I think, to the vision is that these roles that we've traditionally looked at. If you look at the way that tool sets are built, they're very targeted to specific roles. The data engineer has a tool, the data scientist has a tool. And what's been the difficult part is the boundaries between those have been very firm and the collaboration has been difficult. 
And so, we draw the personas as a Venn diagram. Because it's very difficult, especially if you look at a smaller company, and even sometimes larger companies, the data engineer is the data scientist. The developer who builds the mobile application is the data scientist. And then in some larger organizations, you have very large teams of data scientists that have these artificial barriers between the data scientist and the data engineer. So, how do we solve both cases? And I think the answer for us was a platform that allows for seamless collaboration, where there are not these clean lines between the personas, where the tool sets easily move from one to the other. And if you're one of those hybrid people that works across lines, the tool feels like one tool for you. But if you're two different teams working together, you can easily hand off. So, that was one of the key objectives we're trying to answer. >> Definitely an innovative component of the announcement, for sure. Go ahead, George. >> So, help us sort of bracket how mature this end-to-end tool suite is in terms of how much of the pipeline it addresses. You know, from the data origin all the way to a trained model and deploying that model. Sort of what's there now, what's left to do. >> So, there are a few things we've brought to market. Probably the most significant is the data science experience. The data science experience is oriented around data science and has, as its sort of central interface, Jupyter Notebooks, as well as RStudio, and those sorts of things. The idea there being that we'll start with the collaboration around data scientists. So, data scientists can use their language of choice, collaborate around data sets, save out the results of their work and have it consumed either publicly or by some other group of data scientists. But the collaboration among data scientists, that was sort of step one.
There's a lot of work going on that's sort of ongoing, not ready to bring to market, around how do we simplify machine learning pipelines specifically, how do we bring governance and lineage, and catalog services and those sorts of things. And then the ingest, one of the things we're working on that we have brought to market is our product called Lift which connects, as well. And that's bringing large amounts of data easily into the platform. There are a few components that have sort of been brought to market. dashDB, of course, is a key data source in the cloud. So, one of the things that we're working on is some of these existing technologies that actually really play well into the ecosystem, trying to tie them well together. And then add the additional glue pieces. >> And some of your information management and governance components, as well. Now, maybe that is a little bit more legacy but they're proven. And I don't know if the exits and entries into those systems are as open, I don't know, but there's some capabilities there. >> Speaking of openness, that's actually a great point. If you look at the IIG suite, it's a great On-Premise suite. And one of the challenges that we've had in sort of past IBM cloud offerings is a lot of what has been the M.O. in the past is take a great On-Prem solution and just try to stand it up as a service in the cloud. Which in some cases has been successful, in other cases, less so. One of the things we're trying to look at with this platform is how do we leverage (a) open source. So that whatever you may already be running open source, on-prem or in some other provider, that it's very easy to move your workloads. So, we want to be able to say if you've got 10,000 lines of fraud detection code in MapReduce, you don't need to rewrite that in anything. You can just move it.
And the other thing is where our existing legacy tech doesn't necessarily translate well to the cloud, our first strategy is to see if there's any traction around an existing open source project that satisfies that need, and try to see if we can build on that. Where there's not, we go cloud first and we build something that's tailor-made for the cloud. >> So, who's the first one or two customers for this platform? Is it like IBM Global Business Services where they're building the semi-custom industry apps? Or is it the very, very big and sophisticated, like banks and Telcos who are doing the same? Or have you gotten to the point where you can push it out to a much wider audience? >> That's a great question, and it's actually one that is a source of lots of conversation internally for us. If you look at where the data science experience is right now, it's a lot of individual data scientists, you know, small companies, those sorts of things coming together. And a lot of that is because some of the sophistication that we expect for Enterprise customers is not quite there yet. So, we wouldn't expect Enterprise customers to necessarily be onboarded as quickly at the moment. But if we look at sort of the, so I guess there's maybe a medium term answer and a long term answer. I think the long term answer is definitely the Enterprise customers, you know, leveraging IBM's huge entry point into all of those customers today, there's definitely a play to be made there. And one of the things that we're differentiating, we think, over an AWS or Google, is that we're trying to answer that use case in a way that they really aren't even trying to answer it right now. And so, that's one thing. The other is, you know, going beta with a launch customer that's a healthcare provider or a bank where they have all sorts of regulatory requirements, that's more complicated.
And so, we are looking at, in some cases, we're looking at those banks or healthcare providers and trying to carve off a small niche use case that doesn't actually fall into the category of all those regulatory requirements. So that we can get our feet wet, get the tires kicked, those sorts of things. And in some cases we're looking for less traditional Enterprise customers to try to launch with. So, that's an active area of discussion. And one of the other key ones is The Weather Company. Trying to take The Weather Company workloads and move The Weather Company workloads. >> I want to come back to The Weather Company. When you did that deal, I was talking to one of your executives and he said, "Why do you think we did the deal?" I said, "Well, you've got 1500 data scientists, "you've got all this data, you know, it's the future." He goes, "Yeah, it's also going to be a platform "for IOT for IBM." >> Robbie: That's right. >> And I was like, "Hmmm." I get the IOT piece, how does it become a platform for IBM's IOT strategy? Is that really the case? Is that transpiring and how so? >> It's interesting because that was definitely one of the key tenets behind the acquisition. And what we've been working on so hard over the last year, as I'm sure you know, sometimes boxes and arrows on an architecture diagram and reality are more challenging. >> Dave: (laughing) Don't do that. >> And so, what we've had to do is reconcile a lot of what we built at The Weather Company, existing IBM tech, and the new things that were in flight, and try to figure out how can we fit all those pieces together. And so, it's been complicated but also good. In some cases, it's just people and expertise. And bringing those people and expertise and leaving some of the software behind. And other cases, it's actually bringing software. So, the story is, obviously, where the rubber meets the road, more complicated than what it sounds like in the press release. 
But the reality is we've combined those teams and they are all moving in the same direction together with various bits and pieces from the different teams. >> Okay, so, there's vision and then the road map to execute on that, and it's going to unfold over several years. >> Robbie: That's right. >> Okay, good. Stuff at the event here, I mean, what are you seeing, what's hot, what's going on with Spark? >> I think one of the interesting things with what's going on with Spark right now is a lot of the optimizations, especially things around GPUs and that. And we're pretty excited about that, being a hardware manufacturer, that's something that is interesting to us. We run our own cloud. Where some people may not be able to immediately leverage those capabilities, we're pretty excited about that. And also, we're looking at some of those, you know, taking Spark and running it on Power and those sorts of things to try to leverage the hardware improvements. So, that's one of the things we're doing. >> Alright, we have to leave it there, Robbie. Thanks very much for coming on theCube, really appreciate it. >> Thank you. >> You're welcome. Alright, keep it right there, everybody. We'll be right back with our next guest. This is theCube. We're live from Spark Summit East, hashtag #SparkSummit. Be right back. >> Narrator: Since the dawn of The Cloud, theCube.
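The hand-off Strickland describes, a data scientist saving out results with enough context that another persona can discover and consume them, can be sketched minimally in Python. The file layout and manifest fields below are invented for illustration; they are not the Watson Data Platform's actual API, just one plausible shape for such a hand-off.

```python
import csv
import json
import pathlib
import tempfile

# A data scientist "saves out" a result set plus a small manifest that a
# downstream persona (an API developer, say) can use to find and read it.
workdir = pathlib.Path(tempfile.mkdtemp())

scores = [{"customer": "c1", "churn_risk": 0.82},
          {"customer": "c2", "churn_risk": 0.13}]

data_path = workdir / "churn_scores.csv"
with data_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["customer", "churn_risk"])
    writer.writeheader()
    writer.writerows(scores)

# Hypothetical manifest: name, producer, and schema travel with the data,
# so the consumer never has to ask the producer what the columns mean.
manifest_path = workdir / "manifest.json"
manifest_path.write_text(json.dumps({
    "name": "churn_scores",
    "produced_by": "data-science-team",
    "schema": ["customer", "churn_risk"],
}))

# The consuming persona reads the manifest first, then the data.
manifest = json.loads(manifest_path.read_text())
with data_path.open() as f:
    consumed = list(csv.DictReader(f))
```

The point of the manifest is exactly the "seamless collaboration" he argues for: the boundary between personas becomes a shared artifact rather than a conversation.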
Arun Murthy, Hortonworks - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> [Announcer] Live, from Boston, Massachusetts, it's the Cube, covering Spark Summit East 2017, brought to you by Databricks. Now, your hosts, Dave Vellante and George Gilbert. >> Welcome back to snowy Boston everybody, this is The Cube, the leader in live tech coverage. Arun Murthy is here, he's the founder and vice president of engineering at Hortonworks, father of YARN, can I call you that, godfather of YARN, is that fair, or? (laughs) Anyway. He's so, so modest. Welcome back to the Cube, it's great to see you. >> Pleasure to have you. >> Coming off the big keynote, (laughs) you ended the session this morning, so that was great. Glad you made it in to Boston, and uh, lot of talk about security and governance, you know we've been talking about that for years, it feels like it's truly starting to come into the main stream Arun, so. >> Well I think it's just a reflection of what customers are doing with the tech now. Now, three, four years ago, a lot of it was pilots, a lot of it was, you know, people playing with the tech. But increasingly, it's about, you know, people actually applying stuff in production, having data, system of record, running workloads both on prem and on the cloud, cloud is sort of becoming more and more real at mainstream enterprises. So a lot of it means, as you take any of the examples today any interesting app will have some sort of real time data feed, it's probably coming out from a cell phone or sensor which means that data is actually not, in most cases not coming on prem, it's actually getting collected in a local cloud somewhere, it's just more cost effective, why would you put up 25 data centers if you don't have to, right? So then you got to connect that data, production data you have or customer data you have or data you might have purchased and then join them up, run some interesting analytics, do geo-based real time threat detection, cyber security.
A lot of it means that you need a common way to secure data, govern it, and that's where we see the action, I think it's a really good sign for the market and for the community that people are pushing on these dimensions of the broader, because, getting pushed in this dimension because it means that people are actually using it for real production workloads. >> Well in the early days of Hadoop you really didn't talk that much about cloud. >> Yeah. >> You know, and now, >> Absolutely. >> It's like, you know, duh, cloud. >> Yeah. >> It's everywhere, and of course the whole hybrid cloud thing comes into play, what are you seeing there, what are things you can do in a hybrid, you know, or on prem that you can't do in a public cloud and what's the dynamic look like? >> Well, it's definitely not an either or, right? So what we're seeing is increasingly interesting apps need data which are born in the cloud and they'll stay in the cloud, but they also need transactional data which stays on prem, you might have an EDW for example, right? >> Right. >> There's not a lot of, you know, people want to solve business problems and not just move data from one place to another, right? Or back from one place to another, so it's not interesting to move an EDW to the cloud, and similarly it's not interesting to bring your IOT data or sensor data back into on-prem, right? Just makes sense. So naturally what happens is, you know, at Hortonworks we talk of a kind of modern app, or a modern data app, which means a modern data app has to span, has to sort of, you know, it can span both on-prem data and cloud data. >> Yeah, you talked about that in your keynote years ago. Furrier said that data is the new development kit. And now you're seeing the apps are just so dang rich, >> Exactly, exactly. >> And they have to span >> Absolutely. >> physical locations, >> Yeah.
>> But then this whole thing of IOT comes up, we've been having a conversation on The Cube, last several Cubes of, okay, how much stays out, how much stays in, there's a lot of debates about that, there's reasons not to bring it in, but you talked today about some of the important stuff will come back. >> Yeah. >> So the way this is, this all is going to be, you know, there's a lot of data that should be born in the cloud and stay there, the IOT data, but then what will happen increasingly is, key summaries of the data will move back and forth, so key summaries of your EDW will move to the cloud, sometimes key summaries of your IOT data, you know, you want to do some sort of historical training and analytics, that will come back on-prem, so I think there's a bi-directional data movement, but it just won't be all the data, right? It'll be key interesting summaries of the data but not all of it. >> And a lot of times, people say well it doesn't matter where it lives, cloud should be an operating model, not a place where you put data or applications, and while that's true and we would agree with that, from a customer standpoint it matters in terms of performance and latency issues and cost and regulation, >> And security and governance. >> Yeah. >> Absolutely. >> You need to think those things through. >> Exactly, so I mean, so that's what we're focused on, to make sure that you have a common security and governance model regardless of where data is, so you can think of it as, infrastructure you own and infrastructure you lease. >> Right. >> Right? Now, the details matter of course, when you go to the cloud you use S3 for example or ADLS from Microsoft, but you got to make sure that there's a common sort of security and governance layer on top of it, in front of it, as an example one of the things that, you know, in the open source community, Ranger's a really sort of key project right now from a security authorization and authentication standpoint.
We've done a lot of work with our friends at Microsoft to make sure, you can actually now manage data in WASB, which is their object store, and Data Lake Store, natively with Ranger, so you can set a policy that says only Dave can access these files, you know, George can access these columns, that sort of stuff is natively done on the Microsoft platform thanks to the relationship we have with them. >> Right. >> So that's actually really interesting for the open source communities. So you've talked about sort of commodity storage at the bottom layer and even if they're different sort of interfaces and implementations, it's still commodity storage, and now what's really helpful to customers is that they have a common security model, >> Exactly. >> Authorization, authentication, >> Authentication, lineage, provenance, >> Oh okay. >> You want to make sure all of these are common services across. >> But you've mentioned all of the different data patterns, like the stuff that might be streaming in on the cloud, what, assuming you're not putting it into just a file system or an object store, and you want to sort of merge it with >> Yeah. >> Historical data, so what are some of the data stores other than the file system, in other words, newfangled databases to manage this sort of interaction? >> So I think what you're saying is, we certainly have the raw data, the raw data is going to land in whatever cloud native storage, >> Yeah. >> It's going to be Amazon, WASB, ADLS, Google Storage. But then increasingly you want, so now the patterns change so you have raw data, you have some sort of an ETL process, what's interesting in the cloud is that even the processed data or, if you take the unstructured raw data and structure it, that structured data also needs to live on the cloud platform, right? The reason that's important is because A, it's cheaper to use the native platform rather than set up your own database on top of it.
The other one is you also want to take advantage of all the native services that the cloud storage provides, so for example, linking your application. So automatically data in WASB, you know, if you can set up a policy and easily say this structured data table that I have, which is a summary of all the IOT activity in the last 24 hours, you can, using the cloud provider's technologies you can actually make it show up easily in Europe, like you don't have to do any work, right? So increasingly what we Hortonworks focused a lot on is to make sure that we, all of the compute engines, whether it's Spark or Hive or, you know, or MapReduce, it doesn't really matter, they're all natively working on the cloud provider's storage platform. >> [George] Okay. >> Right, so, >> Okay. >> That's a really key consideration for us. >> And the follow up to that, you know, there's a bit of a misconception that Spark replaces Hadoop, but it actually can be a processing, a compute engine for, >> Yeah. >> That can complement or replace some of the compute engines in Hadoop, help us frame, how you talk about it with your customers. >> For us it's really simple, like in the past, the only option you had on Hadoop to do any computation was MapReduce, that was, I started working in MapReduce 11 years ago, so as you can imagine, it's a pretty good run for any technology, right? Spark is definitely the interesting sort of engine for sort of the, anything from machine learning to ETL for data on top of Hadoop. But again, what we focus a lot on is to make sure that every time we bring in, so right now, when we started on HDP, the first HDP had about nine open source projects, literally just nine. Today, the last one we shipped was 2.5, HDP 2.5 had about 27 I think, like it's a huge sort of explosion, right? But the problem with that is not just that we have 27 projects, the problem is that you have to make sure each of the 27 works with all the 26 others. >> It's a QA nightmare.
>> Exactly. So that integration is really key, so same thing with Spark, we want to make sure you have security and YARN (mumbles), like you saw in the demo today, you can now run Spark SQL but also make sure you get low level (mumbles) masking, all of the enterprise capabilities that you need, and I was at a financial services firm three or four weeks ago in Chicago. Today, to do the equivalent of what I showed today on demo, they need literally, they have a classic EDW, and they have to maintain anywhere between 1500 to 2500 views of the same database, that's a nightmare as you can imagine. Now the fact that you can do this on the raw data using whether it's Hive or Spark or Pig or MapReduce, it doesn't really matter, it's really key, and that's the thing we push to make sure things like YARN security work across all the stacks, all the open source techs. >> So that makes life better, a simplification use case if you will, >> Yeah. >> What are some of the other use cases that you're seeing things like Spark enable? >> Machine learning is a really big one. Increasingly, every product is going to have some, people call it, machine learning and AI and deep learning, there's a lot of techniques out there, but the key part is you want to build a predictive model, in the past (mumbles) everybody wants to build a model and score what's happening in the real world against the model, but equally important is to make sure the model gets updated as more data comes in, and actually as the model's scores degrade over time. So that's something we see all over, so for example, even within our own product, it's not just us enabling this for the customer, for example at Hortonworks we have a product called SmartSense which allows you to optimize how people use Hadoop. Where the, what are the opportunities for you to explore deficiencies within your own Hadoop system, whether it's Spark or Hive, right? So we now put machine learning into SmartSense.
And show you that customers who are running queries like you are running, Mr. Customer X, other customers like you are tuning Hadoop this way, they're running this sort of config, they're using these sort of features in Hadoop. That allows us to actually make the product itself better all the way down the pipe. >> So you're improving the scoring algorithm or you're sort of replacing it with something better? >> What we're doing there is just helping them optimize their Hadoop deploys. >> Yep. >> Right? You know, configuration and tuning and kernel settings and network settings, we do that automatically with SmartSense. >> But the customer, you talked about scoring and trying to, >> Yeah. >> They're tuning that, improving that and increasing the probability of it's accuracy, or is it? >> It's both. >> Okay. >> So the thing is what they do is, you initially come with a hypothesis, you have some amount of data, right? I'm a big believer that over time, more data, you're better off spending more, getting more data into the system than to tune that algorithm financially, right? >> Interesting, okay. >> Right, so you know, for example, you know, talk to any of the big guys on Facebook because they'll do the same, what they'll say is it's much better to get, to spend your time getting 10x data to the system and improving the model rather than spending 10x the time and improving the model itself on day one. >> Yeah, but that's a key choice, because you got to >> Exactly. >> Spend money on doing either, >> One of them. >> And you're saying go for the data. >> Go for the data. >> At least now. >> Yeah, go for data, what happens is the good part of that is it's not just the model, it's the, what you got to really get through is the entire end to end flow. >> Yeah. 
>> All the way from data aggregation to ingestion to collection to scoring, all that aspect, you're better off sort of walking through the paces like building the entire end to end product rather than spending time in a silo trying to make a lot of change. >> We've talked to a lot of machine learning tool vendors, application vendors, and it seems like we got to the point with Big Data where we put it in a repository then we started doing better at curating it and understanding it then starting to do a little bit exploration with business intelligence, but with machine learning, we don't have something that does this end to end, you know, from acquiring the data, building the model to operationalizing it, where are we on that, who should we look to for that? >> It's definitely very early, I mean if you look at, even the EDW space, for example, what is EDW? EDW is ingestion, ETL, and then sort of fast query layer, Olap BI, on and on and on, right? So that's the full EDW flow, I don't think as a market, I mean, it's really early in this space, not only as an overall industry, we have that end to end sort of industrialized design concept, it's going to take time, but a lot of people are ahead, you know, the Google's a world ahead, over time a lot of people will catch up. >> We got to go, I wish we had more time, I had so many other questions for you but I know time is tight in our schedule, so thanks so much Arun, >> Appreciate it. For coming on, appreciate it, alright, keep right there everybody, we'll be back with our next guest, it's The Cube, we're live from Spark Summit East in Boston, right back. (upbeat music)
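The contrast Murthy draws, thousands of hand-maintained warehouse views versus one masking rule applied at query time, can be sketched with a single masking view. SQLite stands in here for Spark SQL or Hive, and the table, columns, and masking rule are made up for illustration; in his demo the masking would come from a Ranger policy rather than a hand-written view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT, ssn TEXT, balance REAL)")
cur.executemany("INSERT INTO accounts VALUES (?, ?, ?)", [
    ("alice", "123-45-6789", 1000.0),
    ("bob",   "987-65-4321", 2500.0),
])

# One masking definition over the raw data, instead of a per-audience
# copy: non-privileged readers see only the last four SSN digits.
cur.execute("""
    CREATE VIEW accounts_masked AS
    SELECT name,
           'XXX-XX-' || substr(ssn, -4) AS ssn,
           balance
    FROM accounts
""")

rows = cur.execute(
    "SELECT name, ssn FROM accounts_masked ORDER BY name"
).fetchall()
```

Because the rule lives in one place, every engine that reads through it (Hive, Spark, Pig in his telling) gets the same redaction, which is the point of pushing security down to a shared layer like Ranger instead of maintaining 2500 views by hand.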
Day Two Kickoff - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to day two in Boston where it is snowing sideways here. But we're all here at Spark Summit #SparkSummit, Spark Summit East, this is theCUBE, SiliconANGLE's flagship product. We go out to the event, we program for our audience, we extract the signal from the noise. I'm here with George Gilbert, day two, at Spark Summit, George. We're seeing the evolution of so-called big data. Spark was a key part of that, designed to really both simplify and speed up big data oriented transactions and really help fulfill the dream of big data, which is to be able to affect outcomes in near real time. A lot of those outcomes, of course, are related to ad tech and selling and retail oriented use cases, but we're hearing more and more around education and deep learning and affecting consumers and human life in different ways. We're now 10 years into the whole big data trend, what's your take, George, on what's going on here? >> Even if we started off with ad tech, which is what most of the big internet companies did, we always start off in any new paradigm with one application that kind of defines that era. And then we copy and extend that pattern. For me, on the rethinking your business front, the McGraw-Hill interview we did yesterday was the most amazing thing, because they took, what they had was a textbook business for their education unit and they're re-thinking the business, as in what does it mean to be an education company? And they take cognitive science about how people learn and then they take essentially digital assets and help people on a curriculum, not the centuries old sort of teacher, lecture, homework kind of thing, but individualized education where the patterns of reinforcement are consistent with how each student learns.
And it's not just a break up the lecture into little bits, it's more of a how do you learn most effectively? How do you internalize information? >> I think that is a great example, George, and there are many, many examples of companies that are transforming digitally. Years and years ago people started to think about okay, how can I instrument or digitize certain assets that I have, certain physical assets? I remember a story when we did the MIT event in London with Andy McAfee and Erik Brynjolfsson, they were giving the example of McCormick, the spice company, who digitized by turning what they were doing into recipes and driving demand for their product and actually building new communities. That was kind of an interesting example, but sort of mundane. The McGraw-Hill education example is massive. Their chief data scientist, chief data scientist? I don't know, the head of engineering, I guess, is who he was. >> VP of Analytics and Data Science. >> VP of Analytics and Data Science, yeah. He spoke today and got a big round of applause when he sort of led off about the importance of education at the keynote. He's right on, and I think that's a classic example of a company that was built around printing presses and distributing dead trees that is completely transformed and it's quite successful. Just over the last two years they brought in a new CEO. So that's good, but let's bring it back to Spark specifically. When Spark first came out, George, you were very enthusiastic. You're technical, you love the deep tech. And you saw the potential for Spark to really address some of the problems that we faced with Hadoop, particularly the complexity, the batch orientation. Even some of the costs -- >> The hidden costs. >> Associated with that, those hidden costs. So you were very enthusiastic, in your mind, has Spark lived up to your initial expectations?
>> That's a really good question, and I guess techies like me are often a little more enthusiastic than the current maturity of the technology. Spark doesn't replace Hadoop, but it carves out a big chunk of what Hadoop would do. Spark doesn't address storage, and it doesn't really have any sort of management bits. So you could sort of hollow out Hadoop and put Spark in. But it's still got a little ways to go in terms of becoming really, really fast to respond in near real time. Not just human real time, but like machine real time. It doesn't work sort of deeply with databases yet. It's still teething, and sort of every release, which is approximately every 12 to 18 months, it gets broader in its applicability. So there's no question sort of everyone is piling on, which means that'll help it mature faster. >> When Hadoop was first sort of introduced to the early masses, not the main stream masses, but the early masses, the profundity of Hadoop was that you could leave data in place and bring compute to the data. And people got very excited about that because they knew there was so much data and you just couldn't keep moving it around. But the early insiders of Hadoop, I remember, they would come to theCUBE and everybody was, of course, enthusiastic and lot of cheerleading going on. But in the hallway conversations with Hadoop, with the real insiders you would have conversations about, people are going to realize how much this sucks some day and how hard this is and it's going to hit a wall. Some of the cheerleaders would say, no way, Hadoop forever. Now you've started to see that in practice. And the number of real hardcore transformations as a result of Hadoop in and of itself have been quite limited. The same is true for most technologies, if not virtually every technology. 
I'd say the smartphone was pretty transformative in and of itself, but nonetheless, we are seeing that sort of progression and we're starting to see a lot of the same use cases that you hear about like fraud detection and retargeting as coming up again. I think what we're seeing is those are improving. Like fraud detection, I talked yesterday about it used to be six months before you'd even detect fraud, if you ever did. Now it's minutes or seconds. But you still get a lot of false positives. So we're going to just keep turning that crank. Mike Gualtieri today talked about the efficacy of today's AI and he gave some examples of Google, he showed a plane crash and he said, it said plane and it accurately identified that, but also the API said it could be wind sports or something like that. So you can see it's still not there yet. At the same time, you see things like Siri and Amazon Alexa getting better and better and better. So my question to you, kind of long-winded here, is, is that what Spark is all about? Just making better the initial initiatives around big data, or is it more transformative than that? >> Interesting question, and I would come at it with a couple different answers. Spark was a reaction to the fact that you can't have multiple different engines to attack all the different data problems, because you would do a part of the analysis here, push it into a disk, pull it out of a disk to another engine, all of that would take too long or be too complex a pipeline to go from one end to the other. Spark was like, we'll do it all in our unified engine and you can come at it from SQL, you can come at it from streaming, so it's all in one place. That changes the sophistication of what you can do, the simplicity, and therefore how many people can access it and apply it to these problems. And the fact that it's so much faster means you can attack a qualitatively different set of problems. 
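The "one engine, many entry points" idea George describes can be sketched in miniature. This is a toy Python illustration, not Spark's actual API; every function name here is hypothetical. The point is only that a single transformation definition can serve both a batch pass over data at rest and a micro-batch stream, with the two paths agreeing:

```python
# Toy sketch of a unified engine: one analysis definition, two execution
# modes. Not Spark code; names are illustrative only.

def word_count(records):
    """One transformation, reusable for batch or streaming input."""
    counts = {}
    for line in records:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(state, delta):
    """Fold a micro-batch result into the running (streaming) state."""
    for word, n in delta.items():
        state[word] = state.get(word, 0) + n
    return state

# Batch mode: run once over all the data at rest.
batch_result = word_count(["spark spark hadoop", "spark streaming"])

# Streaming mode: the same function applied to each arriving micro-batch,
# with results folded into continuously updated state.
state = {}
for micro_batch in (["spark spark hadoop"], ["spark streaming"]):
    state = merge_counts(state, word_count(micro_batch))

print(batch_result == state)  # the two paths agree
```

The design point this toy mirrors is that when batch and streaming share one engine and one transformation definition, you avoid the fragmented pipeline of handing partial results between engines that George describes.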
>> I think as well it really underscores the importance of Open Source and the ability of the Open Source community to launch projects that both stick and can attract serious investment. Not only with IBM, but that's a good example. But entire ecosystems that collectively can really move the needle. Big day today, George, we've got a number of guests. We'll give you the last word at the open. >> Okay, what I thought, this is going to sound a little bit sort of abstract, but a couple of two takeaways from some of our most technical speakers yesterday. One was with Ion Stoica who sort of co-headed the lab that was the genesis of Spark at Berkeley. >> AMPLabs. >> The AMPLab at Berkeley. >> And now Rise Labs. >> And then also with the IBM Chief Data Officer for the Analytics Unit. >> Seth Filbrun. >> Filbrun, yes. When we look at what's the core value add ultimately, it's not these infrastructure analytic frameworks and that sort of thing, it's the machine learning model in its flywheel feedback state where it's getting trained and re-trained on the data that comes in from the app and then as you continually improve it. That was the whole rationale for data lakes, but now with models. It was put all the data there because you're going to ask questions you couldn't anticipate. So here it's collect all the data from the app because you're going to improve the model in ways you didn't expect. And that beating heart, that living model that's always getting better, that's the core value add. And that's going to belong to end customers and to application companies. >> One of the speakers today, AI kind of invented in the 50s, a lot of excitement in the 70s, kind of died in the 80s and it's coming back. It's almost like it's being reborn. And it's still in its infant stages, but the potential is enormous. All right, George, that's a wrap for the open. Big day today, keep it right there, everybody. 
We've got a number of guests today, and as well, don't forget, at the end of the day today George and I will be introducing part two of our Wikibon Big Data forecast. This is where we'll release a lot of our numbers and George will give a first look at that. So keep it right there everybody, this is theCUBE. We're live from Spark Summit East, #SparkSummit. We'll be right back. (techno music)
Wikibon Big Data Market Update Pt. 1 - Spark Summit East 2017 - #sparksummit - #theCUBE
>> [Announcer] Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> We're back, welcome to Boston, everybody, this is a special presentation that George Gilbert and I are going to provide to you now. SiliconANGLE Media is the umbrella brand of our company, and we've got three sub-brands. One of them is Wikibon, it's the research organization that George works in, and then of course, we have theCUBE and then SiliconANGLE, which is the tech publication, and then we extensively, as you may know, use CrowdChat and other social data, but we want to drill down now on the Wikibon, Wikibon research side of things. Wikibon was the first research company ever to do a big data forecast. Many, many years ago, our friend Jeff Kelly produced that for several years, we opensourced it, and it really, I think helped the industry a lot, sort of framing the big data opportunity, and then George last year did the first Spark forecast, really Spark adoption, so what we want to do now is talk about some of the trends in the marketplace, this is going to be done in two parts, today's part one, and we're really going to talk about the overall market trends and the market conditions, and then we're going to go to part two tomorrow, where you're going to release some of the numbers, right? And we'll share some of the numbers today. So, we're going to start on the first slide here, we're going to share with you some slides. The Wikibon forecast review, and George is going to, I'm going to ask you to talk about where we are at with big data apps, everybody's saying it's peaked, big data's now going mainstream, where are we at with big data apps? >> [George] Okay, so, I want to quote, just to provide context, the former CTO of VMware, Steve Herrod. He said, "In the end, it wasn't big data, "it was big analytics." 
And what's interesting is that when we start thinking about it, there have traditionally been two classes of workloads, with a third now emerging: one is batch, and in the context of analytics, that means running reports in the background, doing offline business intelligence, but then there was also the interactive-type work. What's emerging is something that's continuously happening, and it doesn't mean that all apps are going to be always on, it just means that all apps will have a batch component, an interactive component, like with the user, and then a streaming, or continuous component. >> [Dave] So it's a new type of workload. >> Yes. >> Okay. Anything else you want to point out here? >> Yeah, what's worth mentioning, this is, it's not like it's going to burst fully-formed out of the clouds, and become sort of a new standard, there are two things that have to happen: the technology has to mature, so right now you have some pretty tough trade-offs between integration, which provides simplicity, and choice and optimization, which gives you fragmentation, and then skillset, and both of those need to develop. 
That never goes away, but on top of it, you layer what we were calling last year systems of engagement, which is where you've got the interactive machine learning component helping to anticipate and influence a user's decision, and then on top of that, which was the aqua color, was the self-tuning systems, which is probably more IIoT stuff, where you've got a whole ecosystem of devices and intelligence in the cloud and at the edge, and you don't necessarily need a human in the loop. But, these now, when you look at them, you can break them down as having three types of workloads, the batch, the interactive, and the continuous. >> Okay, and that is sort of a new workload here, and this is a real big theme of your research now is, we all remember, no, we don't all remember, I remember punch cards, that's the ultimate batch, and then of course, the terminals were interactive, and you think of that as closer to real time, but now, this notion of continuous, if you go to the next slide, Patrick, we can take a look at how workloads are changing, so George, take us through that dynamic. >> [George] Okay so, to understand where we're going, sometimes it helps to look at where we've come from, and the traditional workloads, if we talk about applications, they were divided into, now, we talked about sort of batch versus interactive, but now, they were also divided into online transaction processing, operational application, systems of record, and then there was the analytic side, which was reporting on it, but this was sort of backward-looking reporting, and we begin to see some convergence between the two with web and mobile apps, where a user was interacting both with the analytics that informed an interaction that they might have. That's looking backwards, and we're going to take a quick look at some of the new technologies that augmented those older application patterns. Then we're going to go look at the emergent workloads and what they look like. 
>> Okay so, let's have a quick conversation about this before we go on to the next segment. Hadoop obviously was batch. It really was a way, as we've talked about today and many other dates in theCUBE, a way to reduce the expense of doing data warehousing and business intelligence, I remember we were interviewing Jeff Hammerbacher, and he said, "When I was at Facebook, "my mission was to break the dependency "and the container, the storage container." So he really wanted to, needed to reduce costs, he saw that infrastructure needed to change, so if you look at the next slide, which is really sort of talking to Hadoop doing batch in traditional BI, take us through that, and then we'll sort of evolve to the future. >> Okay, so this is an example of traditional workloads, batch business intelligence, because Hadoop has not really gotten to the maturity point of view where you can really do interactive business intelligence. It's going to take a little more work. But here, you've basically put in a repository more data than you could possibly ever fit in a data warehouse, and the key is, this environment was very fragmented, there were many different engines involved, and so there was a high developer complexity, and a high operational complexity, and we're getting to the point where we can do somewhat better on the integration, and we're getting to the point where we might be able to do interactive business intelligence and start doing a little bit of advanced analytics like machine learning. >> Okay. Let's talk a little bit about why we're here, we're here 'cause it's Spark Summit, Spark was designed to simplify big data, simplify a lot of the complexity in Hadoop, so on the next slide, you've got this red line of Spark, so what is Spark's role, what does that red line represent? >> Okay, so the key takeaway from this slide is, couple things. 
One, it's interesting, but when you listen to Matei Zaharia, who is the creator of Spark, he said, "I built this to be a better MapReduce than MapReduce," which was the old crufty heart of Hadoop. And of course, they've stretched it far beyond their original intentions, but it's not the panacea yet, and if you put it in the context of a data lake, it can help you with what a data engineer does with exploring and munging the data, and what a data scientist might do in terms of processing the data and getting it ready for more advanced analytics, but it doesn't give you an end-to-end solution, not even within the data lake. The point of explaining this is important, because we want to explain how, even in the newer workloads, Spark isn't yet mature to handle the end-to-end integration, and by making that point, we'll show where it needs still more work, and where you have to substitute other products. >> Okay, so let's have a quick discussion about those workloads. Workloads really kind of drive everything, a lot of decisions for organizations, where to put things, and how to protect data, where the value is, so in this next slide you've got, you're juxtaposing traditional workloads with emerging workloads, so let's talk about these new continuous apps. >> Okay, so, this tees it up well, 'cause we focused on the traditional workloads. The emerging ones are where data is always coming in. You could take a big flow of data and sort of end it and bucket it, and turn it into a batch process, but now that we have the capability to keep processing it, and you want answers from it very near real time, you don't want to stop it from flowing, so the first one that took off like this was collecting telemetry about the operation and performance of your apps and your infrastructure, and Splunk sort of conquered that workload first. 
And then the second one, the one that everyone's talking about now is sort of Internet of Things, but more accurately, the Industrial Internet of Things, and that stream of data is, again, something you'll want to analyze and act on with as little delay as possible. The third one is interesting, asynchronous microservices. This is difficult, because this doesn't necessarily require a lot of new technology, so much as a new skillset for developers, and that's going to mean it takes off fairly slowly. Maybe new developers coming out of school will adopt it whole cloth, but this is where you don't rely on a big central database, this is where you break things into little pieces, and each piece manages itself. >> So you say the components of these arrows that you're showing in just explore processor, these are all sort of discrete elements of the data flow that you have to then integrate as a customer? >> [George] Yes, frankly, these are all steps that could be an end-to-end integrative process, but it's not yet mature enough really to do it end-to-end. For example, we don't even have a data store that can go all the way from ingest to serve, and by ingest, I mean taking the millions, potentially millions or more, events per second coming in from your Internet of Things devices, the explorer would be in that same data store, letting you visualize what's there, and process doing the analysis, and serving then is, from that same data store, letting your industrial devices, or your business intelligence workloads get real-time updates. For this to work as one whole, we need a data store, for example, that can go from end-to-end, in addition to the compute and analytic capabilities that go end-to-end. The point of this is, for continuous workloads, we do want to get to this integrated point somehow, sometime, but we're not there yet. 
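The ingest, explore, process, serve chain George keeps returning to can be sketched as a toy. This is a minimal, purely illustrative Python example of a continuous workload, not any real system's API; the class, the threshold rule standing in for a trained model, and every name here are hypothetical:

```python
# Toy sketch of a continuous ingest -> process -> serve loop for an
# IIoT-style workload. A real system would use separate, scalable
# components for each stage; the threshold rule stands in for a model.

from collections import deque

class ContinuousPipeline:
    def __init__(self, window_size=3, temp_limit=90.0):
        self.window = deque(maxlen=window_size)  # only recent readings
        self.temp_limit = temp_limit
        self.latest_decision = None              # what "serve" hands out

    def ingest(self, reading):
        # Ingest: accept one telemetry event without stopping the flow.
        self.window.append(reading)

    def process(self):
        # Process: the "model" flags a device whose windowed average
        # runs hot, e.g. shut a turbine down before it fails.
        if not self.window:
            return
        avg = sum(self.window) / len(self.window)
        self.latest_decision = "shut_down" if avg > self.temp_limit else "ok"

    def serve(self):
        # Serve: low-latency read path for devices or dashboards.
        return self.latest_decision

pipeline = ContinuousPipeline()
for reading in [85.0, 92.0, 99.0]:   # telemetry arriving continuously
    pipeline.ingest(reading)
    pipeline.process()

print(pipeline.serve())  # -> shut_down
```

The point of the sketch is the shape of the loop, data never stops flowing and decisions are refreshed on every event, rather than being batched up and answered later; the end-to-end integrated store George says is missing would sit underneath all three stages.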
>> Okay, let's go deeper, and take a look at the next slide, you've got this data feedback loop, and you've got this prediction on top of this, what does all that mean, let's double-click on that. >> Okay, so now we're unpacking the slide we just looked at, in that we're unpacking it into two different elements, one is what you're doing when you're running the system, and the next one will be what you're doing when you're designing it. And so for this one, what you're doing when you're running the system, I've grayed out the where's the data coming from and where's it going to, just to focus on how we're operating on the data, and again, to repeat the green part, which is storage, we don't have an end-to-end integrated store that could cost-effectively, scalably handle this whole chain of steps, but what we do have is that in the runtime, you're going to ingest the data, you're going to process it and make it ready for prediction, then there's a step that's called devops for data science, we know devops for developers, but devops for data science, as we're going to see, actually unpacks a whole 'nother level of complexity, but this devops for data science, this is where you get the prediction, of, okay, so, if this turbine is vibrating and has a heat spike, it means shut it down because something's going to fail. That's the prediction component, and the serve part then takes that prediction, and makes sure that that device gets it fast. >> So you're putting that capability in the hands of the data science component so they can effect that outcome virtually instantaneously? >> Yes, but in this case, the data scientist will have done that at design time. We're still at run time, so this is, once the data scientist has built that model, here, it's the engineer who's keeping it running. >> Yeah, but it's designed into the process, that's the devops analogy. 
Okay great, well let's go to that sort of next piece, which is design, so how does this all affect design, what are the implications there? >> So now, before we had ingest process, then prediction with devops for data science, and then serving, now when you're at design time, you ingest the data, and there's a whole unpacking of steps, which requires a handful, or two fistfuls of tools right now to make operate. This is to acquire the data, explore it, prepare it, model it, assess it, distribute it, all those things are today handled by a collection of tools that you have to stitch together, and then you have process at which could be typically done in Spark, where you do the analysis, and then serving it, Spark isn't ready to serve, that's typically a high-speed database, one that either has tons of data for history, or gets very, very fast updates, like a Redis that's almost like a cache. So the point of this is, we can't yet take Spark as gospel from end to end. >> Okay so, there's a lot of complexity here. >> [George] Right, that's the trade-off. >> So let's take a look at the next slide, which talks to where that complexity comes from, let's look at it first from the developer side, and then we'll look at the admin, so, so on the next slide, we're looking at the complexity from the dev perspective, explain the axes here. >> Okay, okay. So, there's two axes. If you look at the x-axis at the bottom, there's ingest, explore, process, serve. Those were the steps at a high level that we said a developer has to master, and it's going to be in separate products, because we don't have the maturity today. 
Then on the y-axis, we have some, but not all, this is not an exhaustive list of all the different things a developer has to deal with, with each product, so the complexity is multiplying all the steps on the y-axis, data model, addressing, programming model, persistence, all the stuff's on the y-axis, by all the products he needs on the x-axis, it's a mess, which is why it's very, very hard to build these types of systems today. >> Well, and why everybody's pushing on this whole unified integration, that was a major thing that we heard throughout the day today. What about from the admin's side, let's take a look at the next slide, which is our last slide, in terms of the operational complexity, take us through that. >> [George] Okay, so, the admin is when the system's running, and reading out the complexity, or inferring the complexity, follows the same process. On the y-axis, there's a separate set of tasks. These are admin-related. Governance, scheduling and orchestration, a high availability, all the different types of security, resource isolation, each of these is done differently for each product, and the products are on the x-axis, ingest, explore, process, serve, so that when you multiply those out, and again, this isn't exhaustive, you get, again, essentially a mess of complexity. >> Okay, so we got the message, if you're a practitioner of these so-called big data technologies, you're going to be dealing with more complexity, despite the industry's pace of trying to address that, but you're seeing new projects pop up, but nonetheless, it feels like the complexity curve is growing faster than customer's ability to absorb that complexity. Okay, well, is there hope? >> Yes. But here's where we've had this conundrum. 
The Apache opensource community has been the most amazing source of innovation I think we've ever seen in the industry, but the problem is, going back to the amazing book, The Cathedral and the Bazaar, about opensource innovation versus top-down, the cathedral has this central architecture that makes everything fit together harmoniously, and beautifully, with simplicity. But the bazaar is so much faster, 'cause it's sort of this free market of innovation. The Apache ecosystem is the bazaar, and the burden is on the developer and the administrator to make it work together, and it was most appropriate for the big internet companies that had the skills to do that. Now, the companies that are distributing these Apache opensource components are doing a Herculean job of putting them together, but they weren't designed to fit together. On the other hand, you've got the cloud service providers, who are building, to some extent, services that have standard APIs that might've been supported by some of the Apache products, but they have proprietary implementations, so you have lock-in, but they have more of the cathedral-type architecture that-- >> And they're delivering 'em their services, even though actually, many of those data services are discrete APIs, as you point out, are proprietary. Okay, so, very useful, George, thank you, if you have questions on this presentation, you can hit Wikibon.com and fire off a question to us, we'll make sure it gets to George and gets answered. This is part one, part two tomorrow is we're going to dig into some of the numbers, right? So if you care about where the trends are, what the numbers look like, what the market size looks like, we'll be sharing that with you tomorrow, all this stuff, of course, will be available on-demand, we'll be doing CrowdChats on this, George, excellent job, thank you very much for taking us through this. 
Thanks for watching today, it is a wrap of day one, Spark Summit East, we'll be back live tomorrow from Boston, this is theCUBE, so check out siliconangle.com for a review of all the action today, all the news, check out Wikibon.com for all the research, siliconangle.tv is where we house all these videos, check that out, we start again tomorrow at 11 o'clock east coast time, right after the keynotes, this is theCUBE, we're at Spark Summit, #SparkSummit, we're out, see you tomorrow. (electronic music jingle)
Ion Stoica, Databricks - Spark Summit East 2017 - #sparksummit - #theCUBE
>> [Announcer] Live from Boston, Massachusetts. This is theCUBE. Covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert. >> [Dave] Welcome back to Boston everybody, this is Spark Summit East #SparkSummit And this is theCUBE. Ion Stoica is here. He's Executive Chairman of Databricks and Professor of Computer Science at UC Berkeley. The smarts is rubbing off on me. I always feel smart when I co-host with George. And now having you on is just a pleasure, so thanks very much for taking the time. >> [Ion] Thank you for having me. >> So loved the talk this morning, we learned about RISELabs, we're going to talk about that. Which is the son of AMP. You may be the father of those two, so. Again welcome. Give us the update, great keynote this morning. How's the vibe, how are you feeling? >> [Ion] I think it's great, you know, thank you and thank everyone for attending the summit. It's a lot of energy, a lot of interesting discussions, and a lot of ideas around. So I'm very happy about how things are going. >> [Dave] So let's start with RISELabs. Maybe take us back, to those who don't understand, so the birth of AMP and what you were trying to achieve there and what's next. >> Yeah, so the AMP was a six-year project at Berkeley, and it involved around eight faculty members and, over the duration of the lab, around 60 students and postdocs. And the mission of the AMPLab was to make sense of big data. AMPLab started at the end of 2009, and the premise is that in order to make sense of this big data, we need a holistic approach, which involves algorithms, in particular machine-learning algorithms, machines, means systems, large-scale systems, and people, crowd sourcing. And more precisely the goal was to build a stack, a data analytic stack for interactive analytics, to be used across industry and academia. And, of course, being at Berkeley, it has to be open source. 
(laughs) So that's basically what AMPLab was, and it was the birthplace of Apache Spark, which is why you are all here today. And a few other open-source systems like Mesos, Apache Mesos, and Alluxio, which was previously called Tachyon. And so AMPLab ended in December last year, and in January, this January, we started a new lab which is called RISE. RISE stands for Real-time Intelligent Secure Execution. And the premise of the new lab is that actually the real value in the data is the decision you can make on the data. And you can see this more and more at almost every organization. They want to use their data to make some decision to improve their business processes, applications, services, or come up with new applications and services. But then if you think about that, what does it mean that the emphasis is on the decision? It means that you want the decision to be fast, because fast decisions are better than slower decisions. You want decisions to be on fresh data, on live data, because decisions on the data I have right now are better than decisions on the data from yesterday, or last week. And then you also want to make targeted, personalized decisions, because decisions on personal information are better than decisions on aggregate information. So that's the fundamental premise. So therefore you want to build platforms, tools and algorithms to enable intelligent real-time decisions on live data, with strong security. And security is a big emphasis of the lab, because it means to provide privacy, confidentiality and integrity, and you hear about data breaches or things like that every day. So for an organization, it is extremely important to provide privacy and confidentiality to their users, and it's not only because the users want that; it also indirectly can help them to improve their service. Because if I guarantee your data is confidential with me, you are probably much more willing to share some of your data with me.
And if you share some of the data with me, I can build and provide better services. So that's basically, in a nutshell, what the lab is and what the focus is. >> [Dave] Okay, so you said three things: fast, live and targeted. So fast means you can affect the outcome. >> Yes. >> Live data means it's better quality. And then targeted means it's relevant. >> Yes. >> Okay, and then my question on security. I felt like when cloud and Big Data came to the fore, security became a do-over. (laughter) Is that a fair assessment? Are you doing it over? >> [George] Or as Bill Clinton would call it, a Mulligan. >> Yeah, you get a Mulligan on security. >> I think security is, it's always a difficult topic, because it means so many things for so many people. >> Hmm-mmm. >> So there are instances, and actually the cloud is quite secure. The cloud can actually be more secure than some on-prem deployments. In fact, if you hear about these data leaks or security breaches, you don't hear of them happening in the cloud. And there is some reason for that, right? It is because they have trained people, you know, they are paranoid about this, they do inspection maybe much more often, and things like that. But still, you know, the state of security is not that great. Right? For instance, if I compromise your operating system, whether it's in the cloud or not in the cloud, I can do anything. Right? Or your VM, right? On all these clouds you run on a VM, and now you are going to run on some containers. Right? So there are a lot of attacks, sophisticated attacks, which means even if your data is encrypted, if I can look at the access patterns, how much data you transferred, or how much data you access from memory, then I can infer something about what you are doing, about your queries, right? If it's more data, maybe it's a query on New York. If it's less data, it's probably something smaller, like maybe something at Berkeley.
So you can infer from multiple queries just by looking at the access patterns. So it's a difficult problem. But fortunately, again, there are some new technologies which are being developed, and some new algorithms, which give us some hope. One of the most interesting technologies which is happening today is hardware enclaves. With hardware enclaves you can execute code within an enclave which is hardware-protected. And even if your operating system or VM is compromised, you cannot access the code which runs inside this enclave. Intel has Intel SGX, and we are working and collaborating with them actively. ARM has TrustZone, and AMD also announced they are going to have a similar technology in their chips. So that's a very interesting and very promising development. I think the other aspect, and it's a focus of the lab, is that even if you have the enclaves, it doesn't automatically solve the problem. Because the code itself can have a vulnerability. Yes, I can run the code in a hardware enclave, but the code can send >> Right. >> data outside. >> Right, the enclave is a more granular perimeter. Right? >> Yeah. So the security experts in our lab are looking at this, maybe how to split the application so you run only a small part in the enclave, which is the critical part, and you can make sure that that code is secure, and the rest of the code you run outside. But the rest of the code is only going to work on data which is encrypted. Right? So there is a lot of interesting research there, but that's good. >> And does Blockchain fit in there as well? >> Yeah, I think Blockchain is a very interesting technology. And again, it's real-time, and there are also very interesting directions in that area. >> Yeah, right. >> Absolutely. >> So you guys, I want George, you've shared with me sort of what you were calling a new workload. So you had batch and you have interactive and now you've got continuous- >> Continuous, yes.
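The application-splitting idea Stoica describes — a small critical section inside the enclave, everything else touching only encrypted data — can be sketched as a toy. Nothing below uses SGX, and the XOR "cipher" is a stand-in for real encryption; the function names and the key are invented purely to illustrate the partitioning:

```python
import hashlib

# Toy key and cipher for illustration only -- a real enclave design
# (e.g. Intel SGX) relies on hardware-protected memory and real crypto.
KEY = b"enclave-demo-key"

def xor_cipher(data: bytes, key: bytes = KEY) -> bytes:
    # XOR is its own inverse, so the same call encrypts and decrypts
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def untrusted_pipeline(ciphertexts):
    """Runs OUTSIDE the enclave: may batch, route, reorder --
    but only ever sees encrypted bytes."""
    return sorted(ciphertexts, key=lambda c: hashlib.sha256(c).digest())

def enclave_sum(ciphertexts):
    """The small critical section: decrypts and computes inside
    the (simulated) protected region."""
    return sum(int(xor_cipher(c).decode()) for c in ciphertexts)

records = [xor_cipher(str(n).encode()) for n in (10, 20, 12)]
routed = untrusted_pipeline(records)  # untrusted code, ciphertext only
total = enclave_sum(routed)           # trusted code, minimal footprint
print(total)  # 42
```

The point of the split is exactly what the conversation raises: the enclave is a more granular perimeter, so you keep the code inside it as small and auditable as possible.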
>> And I know that's a topic that you want to discuss, and I'd love to hear more about that. But George, tee it up. >> Well, okay. So we were talking earlier, and the objective of RISE is fast and continuous-type decisions. And this is different from the traditional, where you either do it batch or you do it interactive. So maybe tell us about some applications where that is one workload among the other traditional workloads. And then let's unpack that a little more. >> Yeah, so I'll give you a few applications. So it's more than continuously interacting with the environment; you also learn continuously. I'll give you some examples. So for instance, in one example, think about wanting to detect a network security attack, and respond, diagnose and defend in real time. What this means is that you need to continuously get logs from the network, and from the more endpoints you can get them, the better. Right? Because more data will help you to detect things faster. But then you need to detect the new patterns and you need to learn the new patterns. Because new security attacks, the ones that are effective, are slightly different from the past ones, because you hope that you already have the defenses in place for the past ones. So now you are going to learn that, and then you are going to react. You may push patches in real time. You may push filters, installing new filters to firewalls. So that's one application that's going on in real time. Another application can be about self-driving. Now self-driving has made tremendous strides, and a lot of very smart algorithms are now implemented on the cars; all the system is on the cars. Right? But imagine now that you want to continuously get the information from these cars, aggregate and learn, and then send back the information you learned to the cars.
Like for instance if there's an accident, or a roadblock, or an object which is dropped on the highway, you can learn from the other cars what they've done in that situation. It may mean in some cases the driver took an evasive action, right? Maybe you can also monitor the cars which are not self-driving, but driven by humans. And then you learn that in real time, and then the other cars which follow through the same route, confronted with the same situation, now know what to do. Right? So this is, again, I want to emphasize this: not only continuously sensing the environment and making the decisions, but a very important component is learning. >> Let me take you back to the security example as I sort of process the auto one. >> Yeah, yeah. >> So in the security example, it doesn't sound like, I mean if you have a vast network, you know, endpoints, software, infrastructure, you're not going to have one God model looking out at everything. >> Yes. >> So I assume that means there are models distributed everywhere, and they don't know what an entirely new attack pattern looks like. So in other words, for that isolated model, it doesn't know what it doesn't know. I don't know if that's what Rumsfeld called it. >> Yes. (laughs) >> How does it know what to pass back for retraining? >> Yes. Yes. Yes. So there are many aspects and there are many things you can look at. And again, it's a research problem, so I cannot give you the solution now; I can hypothesize and give you some examples. But for instance, you can correlate by observing the effects. Some of the effects of the attack are visible. In some cases, a denial of service attack, that's pretty clear. Other attacks may cause computers to crash, right? So once you see some of these kinds of anomalies, right, anomalies on the end devices, end hosts and things like that. Maybe reported by humans, right?
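The continuous detect-and-learn loop described here — score each new observation against the model built so far, then fold it into the model — can be sketched with an online mean/variance estimator (Welford's algorithm). The traffic numbers and the threshold are made up for illustration, not taken from any real intrusion-detection system:

```python
import math

class StreamingAnomalyDetector:
    """Welford's online mean/variance: flag observations that look
    unlike the stream so far, then keep learning from every point."""
    def __init__(self, threshold: float = 3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold  # z-score cutoff

    def observe(self, x: float) -> bool:
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        # Learn continuously -- the model updates on every observation
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = StreamingAnomalyDetector()
traffic = [100, 102, 98, 101, 99, 103, 100, 5000]  # e.g. bytes/sec
flags = [detector.observe(x) for x in traffic]
print(flags)  # only the final spike is flagged
```

A real deployment would of course use richer features and models, but the shape is the same: detection and learning happen in one continuous pass over live data, not in a separate batch phase.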
Then you can try to correlate with what kind of traffic you've got. Right? And from that correlation, probably, and hopefully, you can develop some models to identify what kind of traffic, where it comes from, what the content is, and so forth, which causes the anomalous behavior. >> And where is that correlation happening? >> I think it will happen everywhere, right? Because- >> At the edge and at the center. >> Absolutely. >> And then I assume that the models, both at the edge and at the center, are ensemble models. >> Yes. >> Because you're tracking different behavior. >> Yes. You are going to track different behavior, and, I think that's a good hypothesis, you are going to ensemble them to come up with the best decision. >> Okay, so now let's wind forward to the car example. >> Yeah. >> So it sounds like there's a mesh network, at least, Peter Levine's sort of talk was there's near-local compute resources, and you can use bitcoin to pay for it, or Blockchain, or however it works. But that sort of topology, we haven't really encountered before in computing, have we? And how imminent is that sort of ... >> I think that some of this stuff you can do today in the cloud. If you're after super-low latency you probably need to have more computation towards the edges, but if I'm thinking that I want reactions on tens to hundreds of milliseconds, in theory you can do it today with the cloud infrastructure we have. And in many cases, if you can do it within a few hundred milliseconds, it's still super useful. Right? To avoid this object which has dropped on the highway, you know, if I have a few hundred milliseconds, many cars will effectively avoid it, having that information.
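The edge/center ensemble hypothesis in this exchange can be sketched minimally: independent edge detectors each vote, and the center aggregates the votes. The three toy "models" and the event fields below are invented for illustration; a real system would weight models by historical accuracy rather than use a flat majority vote:

```python
from collections import Counter

# Three hypothetical edge detectors, each watching a different signal
def edge_model_a(event): return "attack" if event["rate"] > 1000 else "normal"
def edge_model_b(event): return "attack" if event["entropy"] < 0.2 else "normal"
def edge_model_c(event): return "attack" if event["new_endpoint"] else "normal"

def center_ensemble(event, models):
    """The center aggregates edge verdicts with a majority vote."""
    votes = Counter(m(event) for m in models)
    return votes.most_common(1)[0][0]

models = [edge_model_a, edge_model_b, edge_model_c]
suspicious = {"rate": 4000, "entropy": 0.9, "new_endpoint": True}
benign = {"rate": 10, "entropy": 0.9, "new_endpoint": False}
print(center_ensemble(suspicious, models))  # attack (2 of 3 votes)
print(center_ensemble(benign, models))      # normal
```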
So there's a debate in our community about how much data will stay at the edge and how much will go into the cloud. David Floyer said 90% of it will stay at the edge. Your comment was, it depends on the value. What do you mean by that? >> I think that depends on who I am and how I perceive the value of the data. And, you know, what can the value of the data be? This is what I was saying: I think the value of the data is fundamentally what kind of decisions, what kind of actions, it will enable me to take. Right? So here I'm not just talking about, you know, credit card information or things like that, where exactly there is an action somebody's going to take on it. If I do believe that the data can provide me with the ability to take better actions or make better decisions, I think that I want to keep it. And it's not only about the decisions it enables me to make now; everyone is going to continuously improve their algorithms and develop new algorithms. And when you do that, how do you test them? You test on the old data. Right? So I think that for all these reasons, a lot of data, valuable data in this sense, is going to go to the cloud. Now, is there a lot of data that should remain on the edges? I think that's fair. But again, if a cloud provider, or someone who provides a service in the cloud, believes that the data is valuable, I do believe that eventually it is going to get to the cloud. >> So if it's valuable, it will be persisted and will eventually get to the cloud? And we talked about latency, but latency, the example of evasive action. You can't send that back to the cloud and make the decision; you have to make it in real time. But eventually that data, if it's important, will go back to the cloud. The other question is, of all this data that we are now processing on a continuous basis, how much actually will get persisted? Most of it, much of it, probably does not get persisted. Right? Is that a fair assumption?
>> Yeah, I think so. And probably all the data is not equal. All right? It's like, even if you take a continuous video, all right? The cars continuously have video from multiple cameras, and radar and lidar, all of this stuff. This is continuous. And if you think about it, I would assume that you don't want to send all the data to the cloud. But the data around the interesting events, you may want to, right? So before and after the car has a near-accident, or took an evasive action, or the human had to intervene. In all these cases, probably I want to send the data to the cloud. But for most cases, probably not. >> That's good. We have to leave it there, but I'll give you the last word on things that are exciting you, things you're working on, interesting projects. >> Yeah, so what really excites me is how we are going to have these continuous applications: you are going to continuously interact with the environment, you are going to continuously learn and improve. And here there are many challenges, and I just want to mention a few more, which we haven't discussed. One, in general, is explainability. Right? If these systems augment the human decision process, if these systems are going to make decisions which impact you as a human, you want to know why. Right? Like I gave this example: assuming you have machine-learning algorithms making a diagnosis on your MRI, or x-ray, you want to know what in this x-ray caused that decision. If you go to the doctor, they are going to point and show you: okay, this is why you have this condition. So I think this is very important, because as a human you want to understand. And you want to understand not only why the decision happened, but also what you have to do, what you need to do, to do better in the future, right? Like if your mortgage application is turned down, I want to know why that is.
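The idea of shipping only the data around interesting events — a near-accident, an evasive action — rather than the whole stream can be sketched with a rolling buffer. The window size, frame names and `uploads` list are illustrative, not any real telematics API:

```python
from collections import deque

class EventBuffer:
    """Keep only a rolling window of frames; on an interesting event,
    snapshot the window (the context 'before') for upload to the cloud."""
    def __init__(self, window: int = 5):
        self.frames = deque(maxlen=window)  # old frames fall off the back
        self.uploads = []

    def record(self, frame, interesting: bool = False):
        self.frames.append(frame)
        if interesting:
            # Ship the surrounding context, not the whole stream
            self.uploads.append(list(self.frames))

buf = EventBuffer(window=3)
for t in range(10):
    buf.record(f"frame-{t}", interesting=(t == 6))  # near-miss at t=6
print(buf.uploads)  # [['frame-4', 'frame-5', 'frame-6']]
```

A production version would also capture frames for some interval *after* the event before snapshotting, but the economics are the same: the stream stays at the edge, and only the windows around decisions worth learning from go back to the cloud.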
Because next time when I apply for the mortgage, I want to have a higher chance of getting it through. So I think that's a very important aspect. And the last thing I will say, which is super important, is about having algorithms which can say "I don't know." Right? It's like, okay, I have never seen this situation in the past, so I don't know what to do. This is much better than giving you just the wrong decision. Right? >> Right, or a low-probability answer that you don't know what to do with. (laughs) >> Yeah. >> Excellent. Ion, thanks again for coming on theCUBE. It was really a pleasure having you. >> Thanks for having me. >> You're welcome. All right, keep it right there everybody. George and I will be back to do our wrap right after this short break. This is theCUBE. We're live from Spark Summit East. Right back. (techno music)