Ash Munshi, Pepperdata - #SparkSummit - #theCUBE

(upbeat music) >> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, it's day two at the Spark Summit 2017. I'm David Goad and I'm here with George Gilbert from Wikibon, George. >> George: Good to be here. >> Alright and the guest of honor of course, is Ash Munshi, who is the CEO of Pepperdata. Ash, welcome to the show. >> Thank you very much, thank you. >> Well you have an interesting background, I want you to just tell us real quick here, not give the whole bio, but you got a great background in machine learning, you were an early user of Spark, tell us a little bit about your experience. >> So I'm actually a mathematician originally, a theoretician who worked for IBM Research, and then subsequently Larry Ellison at Oracle, and a number of other places. But most recently I was CTO at Yahoo, and then subsequent to that I did a bunch of startups, that involved different types of machine learning, and also just in general, sort of a lot of big data infrastructure stuff. >> And go back to 2012 with Spark right? You had an interesting development. >> Right, so 2011, 2012, when Spark was still early, we were actually building a recommendation system, based on user-generated reviews. That was a project that was done with Nando de Freitas, who is now at DeepMind, and Peter Cnudde, who's one of the key guys that runs infrastructure at Yahoo. We started that company, and we were one of the early users of Spark, and what we found was, that we were analyzing all the reviews at Amazon. So Amazon allows you to crawl all of their reviews, and we basically had natural language processing, that would allow us to analyze all those reviews. When we were doing sort of MapReduce stuff, it was taking us a huge number of nodes, and 24 hours to actually go do analysis. And then we had this little project called Spark, out of AMPLab, and we decided to spin it up, and see what we could do.
It had lots of issues at that time, but we were able to actually spin it up on to, I think it was in the order of 100,000 nodes, and we were able to take our times for running our algorithms from you know, sort of tens of hours, down to sort of an hour or two, so it was a significant improvement in performance. And that's when we realized that, you know, this is going to be something that's going to be really important once this set of issues, where it, once it was going to get mature enough to make happen, and I'm glad to see that it's actually happened now, and it's actually taken over the world. >> Yeah that little project became a big deal, didn't it? >> It became a big deal, and now everybody's taking advantage of the same thing. >> Well bring us to the present here. We'll talk about Pepperdata and what you do, and then George is going to ask a little bit more about some of the solutions that you have. >> Perfect, so Pepperdata was a company founded by two gentlemen, Sean Suchter and Chad Carson. Sean used to run Yahoo Search, and was one of the first guys who actually helped develop Hadoop next to Eric14 and that team. And then Chad was one of the first guys who actually figured out how to monetize clicks, and was the data science guy around the whole thing. So those are the two guys that actually started the company. I joined the company last July as CEO, and you know, what we've done recently, is we've sort of expanded our focus of the company to addressing DevOps for big data. And the reason why DevOps for big data is important, is because what's happened in the last few years, is people have gone from experimenting with big data, to taking big data into production, and now they're actually starting to figure out how to actually make it so that it actually runs properly, and scales, and does all the other kinds of things that are there, right?
So, it's that transition that's actually happened, so, "Hey, we ran it in production, and it didn't quite work the way we wanted to, now we actually have to make it work correctly." That's where we sort of fit in, and that's where DevOps comes in, right? DevOps comes in when you're actually trying to make production systems that are going to perform in the right way. And the reason for DevOps is it shortens the cycle between developers and operators, right? So the tighter the loop, the faster you can get solutions out, because business users are actually wanting that to happen. That's where we're squarely focused, is how do we make that work? How do we make that work correctly for big data? And the difference between, sort of classic DevOps and DevOps for big data, is that you're now dealing with not just, you know, a set of computers solving an isolated sort of problem. You're dealing with thousands of machines that are solving one problem, and the amount of data is significantly larger. So the classical methodologies that you have, while, you know, agile and all that still works, the tools don't work to actually figure out what you can do with DevOps, and that's where we come in. We've got a set of tools that are focused on performance effectively, 'cause that's the big difference between distributed systems performance I should say, that's the big difference between that, and sort of classic even scaled out computing, right? So if you've got web servers, yes performance is important, and you need data for those, but that can actually be sharded nicely. This is one system working on one problem, right? Or a set of systems working on one problem. That's much harder, it's a different set of problems, and we help solve those problems. >> Yeah, and George you look like you're itching to dig into this, feel free.
(exclaims loudly) >> Well so, it was, so one of the big announcements at the show, and the sort of the headline announcement today, was Spark serverless, like so it's not just someone running Spark in the cloud sort of as a managed service, it's up there as a, you know, sort of SaaS application. And you could call it platform as a service, but it's basically a service where, you know, the infrastructure is invisible. Now, for all those customers who are running their own clusters, which is pretty much everyone I would imagine at this point, how far can you take them in hiding much of the overhead of running those clusters? And by the overhead I mean, you know, primarily performance and maximizing, you know, sort of maximizing resource efficiency. >> So, you have to actually sort of double-click on to the kind of resources that we're talking about here, right? So there's the number of nodes that you're going to need to actually do the computation. There is, you know, the amount of disk storage and stuff that you're going to need, what type of CPUs you're going to need. All of that stuff is sort of part of the costing if you will, of running an infrastructure. If somebody hides all that stuff, and makes it so that it's economical, then you know, that's a great thing, right? And if it can actually be made so that it works for huge installations, and hides it appropriately so I don't pay too much of a tax, that's a wonderful thing to do. But we have, our customers are enterprises, typically Fortune 200 enterprises, and they have both a mixture of cloud-based stuff, where they actually want to control everything about what's going on, and then they have infrastructure internally, which by definition they control everything that's going on, and for them we're very, very applicable. I don't know how we'd be applicable in this, sort of new world as a service that grows and shrinks.
I can certainly imagine that whoever provides that service would embed us, to be able to use the stuff more efficiently. >> No, you answered my question, which is, for the people who aren't getting the turnkey you know, sort of SaaS solution, and they need help managing, you know, what's a fairly involved stack, they would turn to you? >> Ash: Yes. >> Okay. >> Can I ask you about the specific products? >> George: Oh yes. >> I saw you at the booth, and I saw you were announcing a couple of things. Well what is new-- >> Ash: Correct. >> With the show? >> Correct, so at the show we announced Code Analyzer for Apache Spark, and what that allows people to do, is really understand where performance issues are actually happening in their code. So, one of the wonderful things about Spark, compared to MapReduce, is that it abstracts the paradigm that you actually write against, right? So that's a wonderful thing, 'cause it makes it easier to write code. The problem when we abstract, is what does that abstraction do down in the hardware, and where am I losing performance? And being able to give that information back to the user. So you know, in Spark, you have jobs that can run in parallel. So an app consists of jobs, jobs can run in parallel, and each one of these things can consume resources, CPU, memory, and you see that through sort of garbage collection, or a disk or a network, and what you want to find out, is which one of these parallel tasks was dominating the CPU? Why was it dominating the CPU? Which one actually caused the garbage collector to actually go crazy at some point? While the Spark UI provides some of that information, what it doesn't do, is give you a time series view of what's going on. So it's sort of a blow-by-blow view of what's going on. By imposing the time series view on sort of an enhanced version of the Spark UI, you now have much better visibility about which offending stages are causing the issue.
And the nice thing about that is, once you know that, you know exactly which piece of code that you actually want to go and look at. So classic example would be, you might have two stages that are running in parallel. The Spark UI will tell you that it's stage three that's causing the problem, but if you look at the time series, you'll find out that stage two actually runs longer, and that's the one that's pegging the CPU. And you can see that because we have the time series, but you couldn't see that any other way. >> So you have a code analyzer and also the app profiler. >> So the app profiler is the other product that we announced a few months ago. We announced that I guess about three months ago or so. And the app profiler, what it does, is it actually looks after the run is done, it actually looks at all the data that the run produces, so the Spark history server produces, and then it actually goes back and analyzes that and says, "Well you know what? Your executors here are not working as efficiently, these are the executors that aren't working as efficiently." It might be using too much memory or whatever, and then it allows the developer to basically be able to click on it and say, "Explain to me why that's happening?" And then it gives you a little, you know, a little fix-it if you will. It's like, if this is happening, you probably want to do these things, in order to improve performance. So, what's happening with our customers, is our customers are asking developers to run the application profiler first, before they actually put stuff on production. Because if the application profiler comes back and says, "Everything is green." That there's no critical issues there. Then they're saying, "Okay fine, put it on my cluster, on the production cluster, but don't do it ahead of time." The application profiler, to be clear, is actually based on some work that, on an open source project called Dr. Elephant, which comes out of LinkedIn.
And now we're working very closely together to make sure that we actually can advance the set of heuristics that we have, that will allow developers to understand and diagnose more and more complex problems. >> The Spark community has the best code names ever. Dr. Elephant, I've never heard of that one before. (laughter) >> Well Dr. Elephant, actually, is not just the Spark community, it's actually also part of the MapReduce community, right? >> David: Ah, okay. >> So yeah, I mean remember Hadoop? >> David: Yes. >> The elephant thing, so Dr. Elephant, and you know. >> Well let's talk about where things are going next, George? >> So, you know, one of the things we hear all the time from customers and vendors, is, "How are we going to deal with this new era of distributed computing?" You know, where we've got the cloud, on-prem, edge, and like so, for the first question, let's leave out the edge and say, you've got your Fortune 200 client, they have, you know, production clusters or even if it's just one on-prem, but they also want to work in the cloud, whether it's for elastic stuff, or just for, they're gathering a lot of data there. How can you help them manage both, you know, environments? >> Right, so I think there's a bunch of time still, before we get into most customers actually facing that problem. What we see today is, that a lot of the Fortune 200, or our customers, I shouldn't say a lot of the Fortune 200, a lot of our customers have significant, you know, deployments internally on-prem. They do experimentation on the cloud, right? The current infrastructure for managing all these, and sort of orchestrating all this stuff, is typically YARN. What we're seeing, is that more than likely they're going to wind up, or at least our intelligence tells us that it's going to wind up being Kubernetes that's actually going to wind up managing that. So, what will happen is-- >> George: Both on-prem and-- >> Well let me get to that, alright? >> George: Okay.
>> So, I think YARN will be replaced certainly on-prem with Kubernetes, because then you can do multi data center, and things of that sort. The nice thing about Kubernetes, is it in fact can span the cloud as well. So, Kubernetes as an infrastructure, is certainly capable of being able to both handle a multi data center deployment on-prem, along with whatever actually happens on the cloud. There is infrastructure available to do that. It's very immature, most of the customers aren't anywhere close to being able to do that, and I would say even before Kubernetes gets accepted within the environment, it's probably 18 months, and there's probably another 18 months to two years, before we start facing this hybrid cloud, on-prem kind of problem. So we're a few years out I think. >> So, would, for those of us including our viewers, you know, who know the acronym, and know that it's a, you know, scheduler slash cluster manager, resource manager, would that give you enough of a control plane and knowledge of sort of the resources out there, for you to be able to either instrument or deploy an instrument to all the clusters (mumbles). >> So we are actually leading the effort right now for big data on Kubernetes. So there is a group of, there's a small group working. It's Google, us, Red Hat, Palantir, Bloomberg now has joined the group as well. We are actually today talking about our effort on getting HDFS working on Kubernetes, so we see the writing on the wall. We clearly are positioning ourselves to be a player in that particular space, so we think we'll be ready and able to take that challenge on. >> Ash this is great stuff, we've just got about a minute before the break, so I wanted to ask you just a final question. You've been in the Spark community for a while, so what of their open source tools should we be keeping our eyes out for? >> Kubernetes. >> David: That's the one? >> To me that is the killer that's coming next. >> David: Alright.
>> I think that's going to make life, it's going to unify the microservices architecture, plus the sort of multi data center and everything else. I think it's really, really good. Borg works, it's been working for a long time. >> David: Alright, and I want to thank you for that little Pepper pen that I got over at your booth, as the coolest-- >> Come and get more. >> Gadget here. >> We also have Pepper sauce. >> Oh, of course. (laughter) Well there sir-- >> It's our sauce. >> There's the hot news from-- >> Ash: There you go. >> Pepperdata, Ash Munshi. Thank you so much for being on the show, we appreciate it. >> Ash: My pleasure, thank you very much. >> And thank you for watching theCUBE. We're going to be back with more guests, including Ali Ghodsi, CEO of Databricks, coming up next. (upbeat music) (ocean roaring)

Published Date : Jun 7 2017

SUMMARY :

brought to you by Databricks. and here with George Gilbert from Wikibon, George. Alright and the guest of honor of course, I want you to just tell us real quick here, and then subsequent to that I did a bunch of startups, and it's actually taken over the world. and now everybody's taking advantage of the same thing. about some of the solutions that you have. So the classical methodologies that you have, Yeah, and George you look like And by the overhead I mean, you know, is sort of part of the costing if you will, and I saw you were announcing a couple of things. And the nice thing about that is, once you know that, And then it gives you a little, The Spark community has the best code names ever. is not just the Spark community, and like so, for the first question, that a lot of the Fortune 200, or our customers, and there's probably another 18 months to two years, and know that it's a, you know, scheduler Bloomberg now has joined the group as well. so I wanted to ask you just a final question. plus the sort of multi data center Oh, of course. Thank you so much for being on the show, we appreciate it. And thank you for watching theCUBE.

ENTITIES

Entity	Category	Confidence
David Goad	PERSON	0.99+
Ash Munshi	PERSON	0.99+
George	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
Larry Ellison	PERSON	0.99+
George Gilbert	PERSON	0.99+
Google	ORGANIZATION	0.99+
Sean Suchter	PERSON	0.99+
David	PERSON	0.99+
Sean	PERSON	0.99+
Ash	PERSON	0.99+
Red Hat	ORGANIZATION	0.99+
Oracle	ORGANIZATION	0.99+
Yahoo	ORGANIZATION	0.99+
Peter Cnudde	PERSON	0.99+
2011	DATE	0.99+
DeepMind	ORGANIZATION	0.99+
Bloomberg	ORGANIZATION	0.99+
San Francisco	LOCATION	0.99+
two guys	QUANTITY	0.99+
Pepperdata	ORGANIZATION	0.99+
24 hours	QUANTITY	0.99+
first question	QUANTITY	0.99+
Spark UI	TITLE	0.99+
Amazon	ORGANIZATION	0.99+
DevOps	TITLE	0.99+
2012	DATE	0.99+
Chad Carson	PERSON	0.99+
two years	QUANTITY	0.99+
18 months	QUANTITY	0.99+
one	QUANTITY	0.99+
two	QUANTITY	0.99+
one problem	QUANTITY	0.99+
last July	DATE	0.99+
Databricks	ORGANIZATION	0.99+
LinkedIn	ORGANIZATION	0.99+
Spark Summit 2017	EVENT	0.99+
Code Analyzer	TITLE	0.99+
Spark	TITLE	0.98+
100,000 nodes	QUANTITY	0.98+
today	DATE	0.98+
Palantir	ORGANIZATION	0.98+
an hour	QUANTITY	0.98+
IBM Research	ORGANIZATION	0.98+
Both	QUANTITY	0.98+
two gentlemen	QUANTITY	0.98+
Chad	PERSON	0.98+
two stages	QUANTITY	0.98+
first guys	QUANTITY	0.98+
both	QUANTITY	0.97+
thousands of machines	QUANTITY	0.97+
each one	QUANTITY	0.97+
tens of hours	QUANTITY	0.95+
Kubernetes	ORGANIZATION	0.95+
MapReduce	TITLE	0.95+
Yahoo Search	ORGANIZATION	0.94+

Day 2 Kickoff - #SparkSummit - #theCUBE

[Narrator] Live from San Francisco it's theCUBE covering Spark Summit 2017 brought to you by Databricks. >> Welcome to theCUBE. My name is David Goad and I'm your host and we are here at Spark day two. It's the Spark Summit and I am flanked by a couple of consultants here from-- sorry, analysts from Wikibon. I got to get this straight. To my left we have Jim Kobielus who is our lead analyst for Data Science. Jim, welcome to the show. >> Thanks David. >> And we also have George Gilbert who is the lead analyst for Big Data and Analytics. I'll get this right eventually. So why don't we start with Jim. Jim just kicking off the show here today, we wanted to get some preliminary thoughts before we really jump into the rest of the day. What are the big themes that we're going to hear about? >> Yeah, today is the Enterprise day at Spark Summit. So Spark for the Enterprise. Yesterday was focused on Spark, the evolution, extension of Spark to support for native development of deep learning as well as speeding up Spark to support sub-millisecond latencies. But today it's all about Spark and the Enterprise really what I call wrapping dev-ops around Spark, making it more productionizable, supportable. The Databricks serverless announcement, though it was announced yesterday, the press release went up they're going into some depth right now in the keynote about serverless and really serverless is all about providing an in cloud Spark, essentially a sandbox for teams of developers to scale up and scale out enough resources to do the modeling, the training, the deployment, the iteration, the evaluation of Spark jobs in essentially a 24 by seven multi-tenant fully supported environment.
So it's really about driving this continuous Spark development and iteration process into a 24 by seven model in the Enterprise, which is really what's happening is that data scientists, Spark developers are becoming an operational function that businesses are building, strategic infrastructure around things like recommendation engines, and e-commerce environments, absolutely demand 24 by seven resilience Spark team based collaboration environments, which is really what the serverless announcement is all about. >> David: So getting increasing demand on mission critical problems so that optimization is a big deal. >> Yeah, data science is not just an R&D function, it's an operational IT function as well. So that's what it's all about. >> David: Awesome, well let's go to George. I saw you watching the keynote. I think still watching it again this morning, so taking notes feverishly. What were some of the things that stuck out to you from the keynote speaker this morning? >> There are some things that are sort of going to bleed over from yesterday where we can explore some more. We're going to have on the show, the chief architect, Reynold Xin, and the CEO, Ali Ghodsi, and some of the things that we want to understand is how the scope of applications that are appropriate for Spark are expanding. We've got sort of unofficial guidance yesterday that, you know, just because Spark doesn't handle key value stores or databases all that tightly right now, that doesn't mean it won't in the future on the Apache Spark side through better APIs and on the Databricks side, perhaps custom integration and the significance of that is that you can open up a whole class of operational apps, apps that run your business and that now incorporate, you know, rich analytics as well.
Another thing that we'll want to be asking about is, keying off what Jim was saying, now that this becomes not a managed service where you just take the labor that the end customer was applying to get the thing running but it's now automated and you don't even know the infrastructure. We'll want to know what does that mean for the edge, you know, where we're doing analytics close to internet of things and people and sort of if there has to be a new configuration of Spark to work with that. And then of course what do we do about the whole data science process and the dev-ops for data science when you have machine learning distributed across the cloud and edge and on-prem. >> Jim: In fact, I know we have Pepperdata coming on right after this, who might be able to talk about that exact dev-ops in terms of performance optimization in a distributed Spark environment, yeah. >> George, I want to follow up with that. We had Matt Fryer from Hotels.com, he's going to be on our show later but he was on the keynote stage this morning. He talked about going all cloud, all Spark, and how data science is even a competitive advantage for Hotels.com. What do you want to dig into when we get him on the show? >> That's a really good question because if you look at business strategy, you don't really build a sustainable advantage just by doing one thing better than everyone else. That's easier to pick off. The sustainable strategic advantages come from not just doing one thing better than everyone else but many things and then orchestrating their improvement over time and I'd like to dig into how they're going to do that. 'Cause remember Hotels.com it's the internet equivalent descendant of the original travel reservation systems, which did confer competitive advantage on the early architects and deployers of that technology. >> Great and then Pepperdata wanted to come back and we're going to have them on the show here in just a moment. What would you like to learn from them?
What do you think will benefit the community the most? >> Jim: Actually, keying off something George said, I'd like to get a sense for how you optimize Spark deployments in a radically distributed IOT edge environment. Whether they've got any plans, or what their thoughts are in terms of the challenges there. As more of the intelligence gets pushed to the edge much of that will be on machine learning and deep learning models built into Spark. What are the challenges there? I mean, if you've got thousands to millions of end points that are all autonomous and intelligent and they're all running Spark, just what are the orchestration requirements, what are the resource management requirements, how do you monitor end-to-end in an environment like that and optimize the passing of data and the transfer of the control flow or orchestration across all those dispersed points. >> Okay, so 30 seconds now, why should the audience tune into our show today? What are they going to get? >> I think what they're going to get is a really good sense for the emerging best practices for optimizing Spark in a distributed fog environment out to the edge where not just the edge devices but everything, all nodes, will incorporate machine learning and deep learning. They'll get a sense for what's been done today, what the tooling is to enable dev-ops in that kind of environment. As well as, sort of the emerging best practices for compressing more of these algorithms and the data itself as well as doing training in a theoretically federated environment. I'm hoping to hear from some of the vendors who are on the show today. >> David: Fantastic and George, closing thoughts on the opening segment? 30 seconds. >> Closing thoughts on the opening segment.
Like Jim is, we want to think about Spark holistically and it has traditionally been best positioned as sort of this-- as Tay acknowledged yesterday sort of this offline branch of analytics that you apply to data like sort of repository that you accumulated and now we want to see it put into production but to do that you need more than just what Spark is today. You need basically a database or key value kind of option so that you're storing your work as it goes along so you can go back and analyze it either simple analysis or complex analysis. So I want to hear about that. I want to hear about their plans for IOT. Spark is kind of a heavyweight environment, so you're probably not going to put it in the boot of your car or at least not likely anytime soon. >> Jim: Intelligent edge. I mean, Microsoft Build a few weeks ago was really deep on intelligent edge. HP, who we're doing their show actually I think it's in Vegas, right? They're also big on intelligent edge. In fact, we had somebody on the show yesterday from HP going into some depth on that. I want to hear what Databricks has to say on that theme. >> Yeah, and which part of the edge, is it the gateway, the edge gateway, which is really a slimmed-down server, or the edge device, which could be a 32 bit meg RAM network card. >> Yeah. >> All right, well gentlemen appreciate the little insight here before we get started today and we're just getting started. Thank you both for being on the show and thank you for watching theCUBE. We'll be back in a little while with our CEO from Databricks. Thanks for watching. (upbeat music)

Published Date : Jun 7 2017

SUMMARY :

brought to you by databricks. It's the Spark Summit and I am flanked by What are the big themes that we're going to hear about? So Spark for the Enterprise. so that optimization is a big deal. So that's what it's all about. from the key note speaker this morning? and some of the things that we want to understand is Jim: In fact, I know we have Pepperdata coming on and how data science is and I'd like to dig into how they're going to do that. What would you like to learn from them? As more the intelligence gets pushed to the edge and the data itself David: Fantastic and George, but to do that you need more than just what Spark is today. I want to hear what databricks has to say on that theme. or the edge device, and thank you for watching the Cube.

ENTITIES

Entity	Category	Confidence
Jim	PERSON	0.99+
Jim Kobielus	PERSON	0.99+
David	PERSON	0.99+
George	PERSON	0.99+
George Gilbert	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
David Goad	PERSON	0.99+
Matt Fryer	PERSON	0.99+
Reynold Xin	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
San Francisco	LOCATION	0.99+
thousands	QUANTITY	0.99+
30 seconds	QUANTITY	0.99+
Hotels.com	ORGANIZATION	0.99+
yesterday	DATE	0.99+
Vegas	LOCATION	0.99+
32 bit	QUANTITY	0.99+
today	DATE	0.99+
24	QUANTITY	0.99+
HP	ORGANIZATION	0.99+
Spark	TITLE	0.99+
seven	QUANTITY	0.98+
Yesterday	DATE	0.98+
both	QUANTITY	0.98+
Spark Summit	EVENT	0.98+
Tay	PERSON	0.97+
Sparks Summit 2017	EVENT	0.96+
one	QUANTITY	0.96+
this morning	DATE	0.96+
Pepperdata	ORGANIZATION	0.96+
Day 2	QUANTITY	0.95+
Wikibon	ORGANIZATION	0.94+
Sparks Summit	EVENT	0.93+
databricks	ORGANIZATION	0.91+
day two	QUANTITY	0.87+
Spark	ORGANIZATION	0.86+
few weeks ago	DATE	0.86+
millions of end points	QUANTITY	0.81+
Big Data	ORGANIZATION	0.81+
Cube	COMMERCIAL_ITEM	0.68+
sub	QUANTITY	0.6+
Apache Spark	TITLE	0.55+
Analytics	ORGANIZATION	0.53+
