Ash Munshi, Pepperdata - #SparkSummit - #theCUBE

(upbeat music) >> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, it's day two at the Spark Summit 2017. I'm David Goad and here with George Gilbert from Wikibon, George. >> George: Good to be here. >> Alright and the guest of honor of course, is Ash Munshi, who is the CEO of Pepperdata. Ash, welcome to the show. >> Thank you very much, thank you. >> Well you have an interesting background, I want you to just tell us real quick here, not give the whole bio, but you got a great background in machine learning, you were an early user of Spark, tell us a little bit about your experience. >> So I'm actually a mathematician originally, a theoretician who worked for IBM Research, and then subsequently Larry Ellison at Oracle, and a number of other places. But most recently I was CTO at Yahoo, and then subsequent to that I did a bunch of startups, that involved different types of machine learning, and also just in general, sort of a lot of big data infrastructure stuff. >> And go back to 2012 with Spark right? You had an interesting development. >> Right, so 2011, 2012, when Spark was still early, we were actually building a recommendation system, based on user-generated reviews. That was a project that was done with Nando de Freitas, who is now at DeepMind, and Peter Cnudde, who's one of the key guys that runs infrastructure at Yahoo. We started that company, and we were one of the early users of Spark, and what we found was, that we were analyzing all the reviews at Amazon. So Amazon allows you to crawl all of their reviews, and we basically had natural language processing, that would allow us to analyze all those reviews. When we were doing sort of MapReduce stuff, it was taking us a huge number of nodes, and 24 hours to actually go do analysis. And then we had this little project called Spark, out of AMPLab, and we decided to spin it up, and see what we could do. 
It had lots of issues at that time, but we were able to actually spin it up onto, I think it was in the order of 100,000 nodes, and we were able to take our times for running our algorithms from you know, sort of tens of hours, down to sort of an hour or two, so it was a significant improvement in performance. And that's when we realized that, you know, this is going to be something that's going to be really important once this set of issues, where it, once it was going to get mature enough to make happen, and I'm glad to see that it's actually happened now, and it's actually taken over the world. >> Yeah that little project became a big deal, didn't it? >> It became a big deal, and now everybody's taking advantage of the same thing. >> Well bring us to the present here. We'll talk about Pepperdata and what you do, and then George is going to ask a little bit more about some of the solutions that you have. >> Perfect, so Pepperdata was a company founded by two gentlemen, Sean Suchter and Chad Carson. Sean used to run Yahoo Search, and was one of the first guys who actually helped develop Hadoop next to Eric14 and that team. And then Chad was one of the first guys who actually figured out how to monetize clicks, and was the data science guy around the whole thing. So those are the two guys that actually started the company. I joined the company last July as CEO, and you know, what we've done recently, is we've sort of expanded our focus of the company to addressing DevOps for big data. And the reason why DevOps for big data is important, is because what's happened in the last few years, is people have gone from experimenting with big data, to taking big data into production, and now they're actually starting to figure out how to actually make it so that it actually runs properly, and scales, and does all the other kinds of things that are there, right? 
So, it's that transition that's actually happened, so, "Hey, we ran it in production, and it didn't quite work the way we wanted to, now we actually have to make it work correctly." That's where we sort of fit in, and that's where DevOps comes in, right? DevOps comes in when you're actually trying to make production systems that are going to perform in the right way. And the reason for DevOps is it shortens the cycle between developers and operators, right? So the tighter the loop, the faster you can get solutions out, because business users are actually wanting that to happen. That's where we're squarely focused, is how do we make that work? How do we make that work correctly for big data? And the difference between, sort of classic DevOps and DevOps for big data, is that you're now dealing with not just, you know, a set of computers solving an isolated sort of problem. You're dealing with thousands of machines that are solving one problem, and the amount of data is significantly larger. So the classical methodologies that you have, while, you know, agile and all that still works, the tools don't work to actually figure out what you can do with DevOps, and that's where we come in. We've got a set of tools that are focused on performance effectively, 'cause that's the big difference between distributed systems performance I should say, that's the big difference between that, and sort of classic even scaled out computing, right? So if you've got web servers, yes performance is important, and you need data for those, but that can actually be sharded nicely. This is one system working on one problem, right? Or a set of systems working on one problem. That's much harder, it's a different set of problems, and we help solve those problems. >> Yeah, and George you look like you're itching to dig into this, feel free. 
(exclaims loudly) >> Well so, it was, so one of the big announcements at the show, and the sort of the headline announcement today, was Spark serverless, like so it's not just someone running Spark in the cloud sort of as a managed service, it's up there as a, you know, sort of SaaS application. And you could call it platform as a service, but it's basically a service where, you know, the infrastructure is invisible. Now, for all those customers who are running their own clusters, which is pretty much everyone I would imagine at this point, how far can you take them in hiding much of the overhead of running those clusters? And by the overhead I mean, you know, primarily performance and maximizing, you know, sort of maximizing resource efficiency. >> So, you have to actually sort of double-click on to the kind of resources that we're talking about here, right? So there's the number of nodes that you're going to need to actually do the computation. There is, you know, the amount of disk storage and stuff that you're going to need, what type of CPUs you're going to need. All of that stuff is sort of part of the costing if you will, of running an infrastructure. If somebody hides all that stuff, and makes it so that it's economical, then you know, that's a great thing, right? And if it can actually be made so that it works for huge installations, and hides it appropriately so I don't pay too much of a tax, that's a wonderful thing to do. But we have, our customers are enterprises, typically Fortune 200 enterprises, and they have both a mixture of cloud-based stuff, where they actually want to control everything about what's going on, and then they have infrastructure internally, which by definition they control everything that's going on, and for them we're very, very applicable. I don't know how we'd be applicable in this, sort of new world as a service that grows and shrinks. 
I can certainly imagine that whoever provides that service would embed us, to be able to use the stuff more efficiently. >> No, you answered my question, which is, for the people who aren't getting the turnkey you know, sort of SaaS solution, and they need help managing, you know, what's a fairly involved stack, they would turn to you? >> Ash: Yes. >> Okay. >> Can I ask you about the specific products? >> George: Oh yes. >> I saw you at the booth, and I saw you were announcing a couple of things. Well what is new-- >> Ash: Correct. >> With the show? >> Correct, so at the show we announced Code Analyzer for Apache Spark, and what that allows people to do, is really understand where performance issues are actually happening in their code. So, one of the wonderful things about Spark, compared to MapReduce, is that it abstracts the paradigm that you actually write against, right? So that's a wonderful thing, 'cause it makes it easier to write code. The problem when we abstract, is what does that abstraction do down in the hardware, and where am I losing performance? And being able to give that information back to the user. So you know, in Spark, you have jobs that can run in parallel. So an app consists of jobs, jobs can run in parallel, and each one of these things can consume resources, CPU, memory, and you see that through sort of garbage collection, or a disk or a network, and what you want to find out, is which one of these parallel tasks was dominating the CPU? Why was it dominating the CPU? Which one actually caused the garbage collector to actually go crazy at some point? While the Spark UI provides some of that information, what it doesn't do, is give you a time series view of what's going on. So it's sort of a blow-by-blow view of what's going on. By imposing the time series view on sort of an enhanced version of the Spark UI, you now have much better visibility about which offending stages are causing the issue. 
And the nice thing about that is, once you know that, you know exactly which piece of code that you actually want to go and look at. So classic example would be, you might have two stages that are running in parallel. The Spark UI will tell you that it's stage three that's causing the problem, but if you look at the time series, you'll find out that stage two actually runs longer, and that's the one that's pegging the CPU. And you can see that because we have the time series, but you couldn't see that any other way. >> So you have a code analyzer and also the app profiler. >> So the app profiler is the other product that we announced a few months ago. We announced that I guess about three months ago or so. And the app profiler, what it does, is it actually looks after the run is done, it actually looks at all the data that the run produces, so the Spark history server produces, and then it actually goes back and analyzes that and says, "Well you know what? Your executors here are not working as efficiently, these are the executors that aren't working as efficiently." It might be using too much memory or whatever, and then it allows the developer to basically be able to click on it and say, "Explain to me why that's happening?" And then it gives you a little, you know, a little fix-it if you will. It's like, if this is happening, you probably want to do these things, in order to improve performance. So, what's happening with our customers, is our customers are asking developers to run the application profiler first, before they actually put stuff on production. Because if the application profiler comes back and says, "Everything is green." That there's no critical issues there. Then they're saying, "Okay fine, put it on my cluster, on the production cluster, but don't do it ahead of time." The application profiler, to be clear, is actually based on some work that, on an open source project called Dr. Elephant, which comes out of LinkedIn. 
And now we're working very closely together to make sure that we actually can advance the set of heuristics that we have, that will allow developers to understand and diagnose more and more complex problems. >> The Spark community has the best code names ever. Dr. Elephant, I've never heard of that one before. (laughter) >> Well Dr. Elephant, actually, is not just the Spark community, it's actually also part of the MapReduce community, right? >> David: Ah, okay. >> So yeah, I mean remember Hadoop? >> David: Yes. >> The elephant thing, so Dr. Elephant, and you know. >> Well let's talk about where things are going next, George? >> So, you know, one of the things we hear all the time from customers and vendors, is, "How are we going to deal with this new era of distributed computing?" You know, where we've got the cloud, on-prem, edge, and like so, for the first question, let's leave out the edge and say, you've got your Fortune 200 client, they have, you know, production clusters or even if it's just one on-prem, but they also want to work in the cloud, whether it's for elastic stuff, or just for, they're gathering a lot of data there. How can you help them manage both, you know, environments? >> Right, so I think there's a bunch of time still, before we get into most customers actually facing that problem. What we see today is, that a lot of the Fortune 200, or our customers, I shouldn't say a lot of the Fortune 200, a lot of our customers have significant, you know, deployments internally on-prem. They do experimentation on the cloud, right? The current infrastructure for managing all these, and sort of orchestrating all this stuff, is typically YARN. What we're seeing, is that more than likely they're going to wind up, or at least our intelligence tells us that it's going to wind up being Kubernetes that's actually going to wind up managing that. So, what will happen is-- >> George: Both on-prem and-- >> Well let me get to that, alright? >> George: Okay. 
>> So, I think YARN will be replaced certainly on-prem with Kubernetes, because then you can do multi data center, and things of that sort. The nice thing about Kubernetes, is it in fact can span the cloud as well. So, Kubernetes as an infrastructure, is certainly capable of being able to both handle a multi data center deployment on-prem, along with whatever actually happens on the cloud. There is infrastructure available to do that. It's very immature, most of the customers aren't anywhere close to being able to do that, and I would say even before Kubernetes gets accepted within the environment, it's probably 18 months, and there's probably another 18 months to two years, before we start facing this hybrid cloud, on-prem kind of problem. So we're a few years out I think. >> So, would, for those of us including our viewers, you know, who know the acronym, and know that it's a, you know, scheduler slash cluster manager, resource manager, would that give you enough of a control plane and knowledge of sort of the resources out there, for you to be able to either instrument or deploy an instrument to all the clusters (mumbles). >> So we are actually leading the effort right now for big data on Kubernetes. So there is a group of, there's a small group working. It's Google, us, Red Hat, Palantir, Bloomberg now has joined the group as well. We are actually today talking about our effort on getting HDFS working on Kubernetes, so we see the writing on the wall. We clearly are positioning ourselves to be a player in that particular space, so we think we'll be ready and able to take that challenge on. >> Ash this is great stuff, we've just got about a minute before the break, so I wanted to ask you just a final question. You've been in the Spark community for a while, so what of their open source tools should we be keeping our eyes out for? >> Kubernetes. >> David: That's the one? >> To me that is the killer that's coming next. >> David: Alright. 
>> I think that's going to make life, it's going to unify the microservices architecture, plus the sort of multi data center and everything else. I think it's really, really good. Borg works, it's been working for a long time. >> David: Alright, and I want to thank you for that little Pepper pen that I got over at your booth, as the coolest-- >> Come and get more. >> Gadget here. >> We also have Pepper sauce. >> Oh, of course. (laughter) Well there sir-- >> It's our sauce. >> There's the hot news from-- >> Ash: There you go. >> Pepperdata Ash Munshi. Thank you so much for being on the show, we appreciate it. >> Ash: My pleasure, thank you very much. >> And thank you for watching theCUBE. We're going to be back with more guests, including Ali Ghodsi, CEO of Databricks, coming up next. (upbeat music) (ocean roaring)
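
[Editor's sketch] Ash's point about two stages running in parallel, where a single summary number can point at stage three while the time series shows stage two running longer and pegging the CPU, can be illustrated with a small example. This is only a sketch under stated assumptions: the sample data, the `(second, stage_id, cpu_percent)` layout and the function names are hypothetical, and nothing here reflects how Pepperdata's Code Analyzer or the Spark UI is actually implemented.

```python
from collections import defaultdict

# Hypothetical per-second CPU samples: (second, stage_id, cpu_percent).
# The numbers are made up for illustration; they are not real Spark metrics.
SAMPLES = [
    (0, 2, 80), (0, 3, 95),
    (1, 2, 85), (1, 3, 95),
    (2, 2, 88),
    (3, 2, 90),
    (4, 2, 90),
    (5, 2, 87),
]


def peak_cpu_stage(samples):
    """Summary view: the stage that produced the single highest CPU reading."""
    return max(samples, key=lambda sample: sample[2])[1]


def dominant_stage_by_second(samples):
    """Time series view: for each second, the stage using the most CPU."""
    by_second = defaultdict(dict)
    for second, stage, cpu in samples:
        by_second[second][stage] = cpu
    return {
        second: max(stages, key=stages.get)
        for second, stages in sorted(by_second.items())
    }


def seconds_dominated(samples):
    """Count how many seconds each stage was the top CPU consumer."""
    counts = defaultdict(int)
    for stage in dominant_stage_by_second(samples).values():
        counts[stage] += 1
    return dict(counts)


if __name__ == "__main__":
    print("stage with highest peak:", peak_cpu_stage(SAMPLES))
    print("top stage per second:", dominant_stage_by_second(SAMPLES))
    print("seconds dominated:", seconds_dominated(SAMPLES))
```

In this made-up data, the peak-based summary singles out stage three, while the per-second view shows stage two holding the CPU for most of the run, which is the kind of difference the time series is meant to surface.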

Published Date : Jun 7 2017

ENTITIES

Entity	Category	Confidence
David Goad	PERSON	0.99+
Ash Munshi	PERSON	0.99+
George	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
Larry Ellison	PERSON	0.99+
George Gilbert	PERSON	0.99+
Google	ORGANIZATION	0.99+
Sean Suchter	PERSON	0.99+
David	PERSON	0.99+
Sean	PERSON	0.99+
Ash	PERSON	0.99+
Red Hat	ORGANIZATION	0.99+
Oracle	ORGANIZATION	0.99+
Yahoo	ORGANIZATION	0.99+
Peter Cnudde	PERSON	0.99+
2011	DATE	0.99+
DeepMind	ORGANIZATION	0.99+
Bloomberg	ORGANIZATION	0.99+
San Francisco	LOCATION	0.99+
two guys	QUANTITY	0.99+
Pepperdata	ORGANIZATION	0.99+
24 hours	QUANTITY	0.99+
first question	QUANTITY	0.99+
Spark UI	TITLE	0.99+
Amazon	ORGANIZATION	0.99+
DevOps	TITLE	0.99+
2012	DATE	0.99+
Chad Carson	PERSON	0.99+
two years	QUANTITY	0.99+
18 months	QUANTITY	0.99+
one	QUANTITY	0.99+
two	QUANTITY	0.99+
one problem	QUANTITY	0.99+
last July	DATE	0.99+
Databricks	ORGANIZATION	0.99+
LinkedIn	ORGANIZATION	0.99+
Spark Summit 2017	EVENT	0.99+
Code Analyzer	TITLE	0.99+
Spark	TITLE	0.98+
100,000 nodes	QUANTITY	0.98+
today	DATE	0.98+
Palantir	ORGANIZATION	0.98+
an hour	QUANTITY	0.98+
IBM Research	ORGANIZATION	0.98+
Both	QUANTITY	0.98+
two gentlemen	QUANTITY	0.98+
Chad	PERSON	0.98+
two stages	QUANTITY	0.98+
first guys	QUANTITY	0.98+
both	QUANTITY	0.97+
thousands of machines	QUANTITY	0.97+
each one	QUANTITY	0.97+
tens of hours	QUANTITY	0.95+
Kubernetes	ORGANIZATION	0.95+
MapReduce	TITLE	0.95+
Yahoo Search	ORGANIZATION	0.94+

Adam Wilson & Joe Hellerstein, Trifacta - Big Data SV 17 - #BigDataSV - #theCUBE

>> Commentator: Live from San Jose, California. It's theCUBE covering Big Data Silicon Valley 2017. >> Okay, welcome back everyone. We are here live in Silicon Valley for Big Data SV (mumbles) event in conjunction with Strata + Hadoop. Our companion event, the Big Data NYC and we're here breaking down the Big Data world as it evolves and goes to the next level up on the step function, AI machine learning, IOT really forcing people to really focus on a clear line of sight on the data. I'm John Furrier with our analyst from Wikibon, George Gilbert and our next guests are two executives from Trifacta. The founder and Chief Strategy Officer, Joe Hellerstein and Adam Wilson, the CEO. Guys, welcome to theCUBE. Welcome back. >> Great to be here. >> Good to be here. >> Founder, co-founder? >> Co-founder. >> Co-founder. He's one of multiple co-founders. I remember it 'cause you guys were one of the first sites that have the (mumbles) in the about section on all the management team. Just to show you how technical you guys are. Welcome back. >> And if you're Trifacta, you have to have three founders, right? So that's part of the tri, right? >> The triple threat, so to speak. Okay, so a big year for you guys. Give us the update. I mean, also we had Alation announce this partnering going on and some product movement. >> Yup. >> But there's a turbulent time right now. You have a lot of things happening in multiple theaters, from technical theater to business theater. And also within the customer base. It's a land grab, it seems to be on the metadata and who's going to control what. What's happening? What's going on in the market place and what's the update from you guys? >> Yeah, yeah. Last year was an absolutely spectacular year for Trifacta. It was four times growth in bookings, three times growth in customers. 
You know, it's been really exciting for us to see the technology get in the hands of some of the largest companies on the planet and to see what they're able to do with it. From the very beginning, we really believed in this idea of self service and democratization. We recognize that the wrangling of the data is often where a lot of the time and the effort goes. In fact, up to 80% of the time and effort goes in a lot of these analytic projects and to the extent that we can help take the data from (mumbles) in a more productive way and to allow more people in an organization to do that. That's going to create information agility that we feel really good about, and our customers are telling us it is having an impact on their use of Big Data and Hadoop. And I think you're seeing that transition where, you know, in the very beginning there was a lot of offloading, a lot of like, hey we're going to grab some cost savings but then at some point, people scratched their heads and said, well, wait a minute. What about the strategic asset that we were building? That was going to change the way people work with the data. Where is that piece of it? And I think as people started figuring out in order to get our (mumbles), we got to have users and use cases on these clusters and the data lake itself is not a use case. Tools like Trifacta have been absolutely instrumental in really fueling that maturity in the market and we feel great about what's happening there. >> I want to get some more drill down before we get to some of these questions for Joe too because I think you mentioned, you got some quotes. I just want to double-click on that. It always comes up in the business model question for people. What's your business model? >> Sure. >> And doing democratization is really hard. Sometimes democratization doesn't appear until years later so it's one of those elusive things. You see it and you believe it but then making it happen are two different things. 
>> Yeah, sure. >> So. And appreciate that the vision they-- (mumbles) But ultimately, at the end of the day, that business model comes down to how you organized. Proof points. >> Yup. >> Customers, partnerships. >> Yeah. >> We had Alation on Stephanie (mumbles). Can you share just and connect the dots on the business model? >> Sure. >> With respect to the product, customers, partners. How was that specifically evolving? >> Adam: Sure. >> Give some examples. >> Sure, yeah. And I would say kind of-- we felt from the beginning that, you know, we wanted to turn what was traditionally a very complex messy problem dealing with data, you know, into a user experience problem that was powered by machine learning and so, a lot of it was down to, you know, how we were going to build and architect the technology needed (mumbles) for really getting the power in the hands of the people who know the data best. But it's important, and I think this is often lost in Silicon Valley where the focus on innovation is all around technology, to recognize that the business model also has to support democratization so one of the first things we did coming in was to release a free version of the product. So Trifacta Wrangler that is now being used by over 4500 companies, tens of thousands of users and the power of that in terms of getting people something of value that they could start using right away on spreadsheets and files and small data and allowing them to get value but then also for us, the exchange is that we're actually getting a chance to curate at scale usage data across all of these-- >> Is this a (mumbles) product? >> It's a hybrid product. >> Okay. >> So the data stays local. It never leaves their local laptop. The metadata is hashed and put into the cloud and now we're-- >> (mumbles) to that. >> Absolutely. And so now we can use that as training data so that actually, as more people wrangle, the product itself gets smarter based on that. >> That's good. 
>> So that's creating real tangible value for customers and for us is a source of very strategic advantage and so we think that combination of the technology innovation but also making sure that we can get this in the hands of users and they can get going and as their problem grows up to be bigger and more complicated, not just spreadsheets and files on the desktop but something more complicated, then we're right there along with them for products that would have been modified. >> How about partnerships with Alation? How they (mumbles)? What are all the deals you got going on there? >> So Alation has been a great partner for us for a while and we've really deepened the integration with the announcements today. We think that cataloging and data wrangling are very complementary and they're a natural fit. We've got customers like Munich Re, like eBay as well as MarketShare that are using both solutions in concert with one another and so, we really felt that it was natural to tighten that coupling and to help people go from inventorying what's going on in their data lakes and their clusters to then cleansing, standardizing. Essentially making it fit for purpose and then ensuring that metadata can roundtrip back into the catalog. And so that's really been an extension of what we're doing also at the technical level with technologies like Cloudera Navigator with Atlas and with the project that Joe's involved with at Berkeley called Ground. So I don't know if you want to talk-- >> Yeah, tell him about Ground. >> Sure. So part of our outlook on this and this speaks to the kind of way that the landscape in the industry's shaping out is that we're not going to see customers buying into sort of lock-in on the key components of the area for (mumbles). So for example, storage, HD (mumbles). This is open and that's key, I think, for all the players in this space. HDFS is not a product from a storage vendor. 
It's an open platform and you can change vendors along the way and you could roll your own and so on. So metadata, to my mind, is going to move in the same direction. That the storage of metadata, the basic component tree that keeps the metadata, that's got to be open to give people the confidence that they're going to pour the basic descriptions of what's in their business and what their people are doing into a place that they know they can count on and it will be vendor neutral. So the catalog vendors are, in my mind, providing a functionality above that basic storage that relates to how do you search the catalog, what does the catalog do for you to suggest things, to suggest data sets that you should be looking at. So that's a value we have on top but below that what we're seeing is, we're seeing Horton and Cloudera coming out with either products or open source in sort of the metadata space and what would be a shame is if the two vendors ended up kind of pointing guns inward and kind of killing the metadata storage. So one of the things that I got interested in as my dual role as a professor at Berkeley and also as a founder of a company in this space was we want to ensure that there's a free open vendor neutral metadata solution. So we began building out a project called Ground which is both a platform for metadata storage that can be sitting underneath catalog vendors and other metadata value adds. And it's also a platform for research much as we did with Spark previously at Berkeley. So Ground is a project in our new lab at Berkeley. The RISELab which is the successor to the AMPLab that gave us Spark. And Ground has now got, you know, collaborators from Cloudera, from LinkedIn. Capital One has significantly invested in Ground and is putting engineers behind it and contributors are coming also from some startups to build out an open-sourced platform for metadata. >> How long has Ground been around? >> Joe: Ground's been around for about 12 months. 
It's very-- >> So it's brand new. How do people get involved? >> Brand new. >> Just standard similar to the way the AMPLab was? Just jump in and-- >> Yeah, you know-- >> Go away and-- >> It comes up on GitHub. There's (mumbles) to go download and play with. It's in alpha. And you know, we hope we (mumbles) and the usual opensource still. >> This is interesting. I like this idea because one thing you've been riffing on theCUBE all the time is how do you make data addressable? Because ultimately, you know, real time you need to have access to data really really low (mumbles) to see the insight to make it work. Hence the data swamp problem right? So, how do you guys see that? 'Cause now I can just pop in. I can hear the objections. Oh, security! You know. How do you guys see the protections? I'd love to help get my data in there and get something back in return in a community model. Security? Is it the hashing? What's the-- How do you get any security (mumbles)? Or what are the issues? >> Yeah, so I mean the straightforward issues are the traditional issues of authorization and encryption and those are issues that are reasonably well-plumbed out in the industry and you can go out and you can take the solutions from people like Cloudera or from Horton and those solutions plug in quite nicely actually to a variety of platforms. And I feel like that level of enterprise security is understood. It's work for vendors to work with that technology so when we went out, we made sure we were Kerberized in all the right ways at Trifacta to work with these vendors and that we integrated well with Navigator, we integrated with Atlas. That was, you know, there was some labor there but it's understood. There's also-- >> It's solvable basically. >> It's solvable basically and pluggable. There are research questions there which, you know, on another day we could talk about but for instance if you don't trust your cloud hosting service what do you do? 
And that's like an open area that we're working on at Berkeley. Intel SGX is a really interesting technology and that's probably a topic for another day. >> But you know, I think it's important-- >> The sooner we get you out of the studio, Palo Alto would love to drill on that. >> I think it's important though that, you know, when we talk about self service, the first question that comes up is I'm only going to let you self service as far as I can govern what's going on, right? And so I think those things-- >> Restrictions, guard rails-- >> Really go hand in hand here. >> About handcuffs. >> Yeah so, right. Because that's always a first thing that kind of comes out where people say, okay wait a minute now is this-- if I've now got, you know-- you've got an increasing number of knowledge workers who think that is their-- and believe that it is their unalienable right to have access to data. >> Well that's the (mumbles) democratization. That's the top down, you know, governance control point. >> So how do you balance that? And I think you can't solve for one side of that equation without the other, right? And that's really really critical. >> Democratization is anarchization, right? >> Right, exactly. >> Yes, exactly. But it's hard though. I mean, and you look at all the big trends where there was, you know, web one data, web (mumbles), all had those democratization trends but they took six years to play out and I think there might be more acceleration with cloud, to your point about this new stack. Okay George, go ahead. You might get in there. >> I wanted to ask you about, you know, what we were talking about earlier and what customers are faced with which is, you know, a lot of choice and specialization because building something end to end and having it fully functional is really difficult. So... 
What are the functional points where you start driving the guard rails in that IT cares about and then what are the user experience points where you have critical mass so that the end users then draw other compliant tools in. You with me? On sort of the IT side and the user side and then which tools start pulling those standards? >> Well, I would say at the highest level, to me what's been very interesting especially with what's happened in open source is that people have now gotten accustomed to the idea that like I don't have to go buy a big monolithic stack where the innovation moves only as fast as the slowest product in the stack or the portfolio. I can grab onto things and I can download them today and be using them tomorrow. And that has, I think, changed the entire approach that companies like Trifacta are taking to how we build and release product to market, how we interoperate with partners like Alation and Waterline and how we integrate with the platform vendors like Cloudera, MapR, and Hortonworks because we recognize that we are going to have to be maniacally focused on one piece of this puzzle and to go very very deep but then play incredibly well, you know, with all the rest of the ecosystem and so I think that has really colored our entire product strategy and how we go to market and I think customers, you know, they want the flexibility to change their minds and the subscription model is all about that, right? You got to earn it every single year. >> So what's the future of (mumbles)? 'Cause that brings up a good point we were kind of critical of Google and you mentioned you guys had-- I saw in some news that you guys were involved with Google. >> Yup. >> Being enterprise ready is not just, hey we have the great tech and you buy from us, damn it we're Google. >> Right. >> I mean, you have to have sales people. You have to have automation mechanism to create great product.
Will the future of wrangling and data prep go into-- where does it end up? Because enterprises want, they want certain things. They're finicky about things. >> Right, right. >> As you guys know. So how does the future of data prep deal with the, I won't say the slowness of the enterprise, but they're more conservative, more SLA driven than they are price performance. >> But they're also more fragmented than ever before and you know, while that may not be a great thing for the customers, for a company that's all about harmonizing data that's actually a phenomenal opportunity, right? Because we want to be the decision that customers make that guarantees that all their other decisions are changeable, right? And I go and-- >> Well they have legacy systems of record. This is the challenge, right? So I got the old Oracle monolithic-- >> That's fine. And that's good-- >> So how do you-- >> The more the merrier, right? >> Does that impact you guys at all? How did you guys handle that situation? >> To me, to us that is more fragmentation which creates more need for wrangling because that introduces more complexity, right? >> You guys do well in that environment. >> Absolutely. And that, you know, is only getting bigger, worse, and more complicated. And especially as people go from (mumbles) to cloud, as people start thinking about moving from just looking at transactions to interactions to now looking at behavior data and the IoT-- >> You welcome that environment. >> So we welcome that. In fact, that's where-- we went to solve this problem for Hadoop and Big Data first because we wanted to solve the problems at scale that were the most complicated and over time we can always move downstream to sort of more structured and smaller data and that's kind of what's happened with our business.
>> I guess I want to circle back to this issue of which part of this value chain of refining data is-- if I'm understanding you right, the data wrangling is the anchor and once a company has made that choice then all the other tool choices have to revolve around it? Is that a-- >> Well think about it this way, I mean, the bulk of the time when you talk to the analysts and also the bulk of the labor cost in these things is in getting the data from its raw form into usage. That whole process of wrangling which is not really just data prep. It's all the things you do all day long to kind of massage these data sets and get 'em from here to there and make 'em work. That space is where the labor cost is. That also means that space is where the value add is because that's where your people power or your business context is really getting poured in to understand what do I have, what am I doing with it and what do I want to get out of it. As we move from bottom line IT to top line value generation with data, it becomes all the more so, right? Because now it's not just the matter of getting the reports out every month. It's also what did that brilliant person in sales do to that dataset to get that much lift? I need to learn from her and do a similar thing. Alright? So, that whole space is where the value is. What that means is that, you know, you don't want that space to be tied to a particular BI tool or a particular execution engine. So when we say that we want to make a decision in the middle that enables all the other decisions, what you really want to make sure is that that work process in there is not tightly bound to the rest of the stack. Okay? And so you want to particularly pick technologies in that space that will play nicely with different storage, that play nicely with different execution environments. Today it's Hadoop, tomorrow it's Amazon, the next day it's Google and they have different engines back there potentially.
And you certainly want it to play nicely with all the analytics and visualization-- >> So decouple from all that? >> You want to decouple that and you want to not lock yourself in 'cause that's where the creativity's happening on the consumption side and that's where the mess that you talked about is just growing on the production side so data production is just getting more complicated. Data consumption's getting more interesting. >> That's actually a really really cool good point. >> Elaborating on that, does that mean that you have to open up interfaces with either the UI layer or at the sort of data definition layer? Or does that just mean other companies have to do the work to tie in to the styles? The styles and structures that you have already written? >> In fact it's sort of the opposite. We do the work to tie in to a lot of these other decisions in this infrastructure, you know. We don't pretend for a minute that people are going to sort of pick a solution like Trifacta and then build their organization around it. To your point, there's tons of legacy technology out there. There is all kinds of things moving. Absolutely. So a big part of being the decoder ring for data, for Trifacta, is saying, listen, we are going to interoperate with your existing investments and we're going to make sure that you can always get at your data, you can always take it from whatever state it's in to whatever state you need it to be in, you can change your mind along the way. And that puts a lot of onus on us and that's the reason why we have to be so focused on this space and not jump into visualization and analytics and not jump into storage and processing and not try to do the other things to the right or left. Right? >> So final question. I'd like you guys both to take a stab at it. You know, just going to pivot off of what Joe was saying.
Some of the most interesting things are happening in the data exploration kind of discovery area from creativity to insights to game changing stuff. >> Yup. >> Ventures potentially. >> Joe: Yup. >> The problem of the complexity, that's conflict. >> Yeah. >> So how do we resolve this? I mean, besides the Trifacta solution which you guys are taming, creating a platform for that, how do people in industry work together to solve that problem? What's the approach? >> So I think actually there's a couple sort of heartening trends on this front that make me pretty optimistic. One of these is that the incentive structures in the enterprises we work with are becoming quite aligned between IT and the line of business. It's no longer the case that the line of business are these annoying people that are distracting IT from their bottom line function. IT's bottom line function is being translated into a what's your value for the business question? And the answer for a savvy IT management person is, I will try to empower the people around me to be rabid fans and I will also try to make sure that they do their own work so I don't have to learn how to do it for them. Right? And so, that I think is happening-- >> Guys to this (mumbles) business guys, a bunch of annoying guys who don't get what I need, right? So it works both ways, right? >> It does, it does. And I see that that's improving sort of in the industry as the corporate missions around data change, right? So it's no longer that the IT guys really only need to take care of executives and everyone else doesn't matter. Their function really is to serve the business and I see that alignment. The other thing that I think is a huge opportunity and part of why we're excited to be so tightly coupled with Google and also have our stuff running in Amazon and at Microsoft. It's that as people replatform to the cloud, a lot of legacy gets shed or at least becomes deprecated.
And so there is a real-- >> Or containerized or some sort of microservice. >> Yeah. >> Right, right. >> And so, people are peeling off business function and as part of that cost savings to migrate it to the cloud, they're also simplified. And you know, things will get complicated again. >> What's (mumbles) solution architects out there that kind of re-boot their careers because the old way was, hey I got networks, I got apps and stacks and so that gives the guys who could be the new heroes coming in. >> Right. >> And thinking differently about enabling that creativity. >> In the midst of all that, everything you said is true. IT is a massive place and it always will be. And tools that can come in and help are absolutely going to be (mumbles). >> This is obvious now. The tension's obviously eased a bit in the sense that there's clear line of sight that top line and bottom line are working together now on. You mentioned that earlier. Okay. Adam, take a stab at it. (mumbling) >> I was just going to-- hey, I know it's great. I was just going to give an example, I think, that illustrates that point so you know, one of our customers is Pepsi. And Pepsi came to us and they said, listen we work with retailers all over the world and their reality is that, when they place orders with us, they often get it wrong. And sometimes they order too much and then they return it, it spoils and that's bad for us. Or they order too little and they stock out and we miss revenue opportunities. So they said, we actually have to be better at demand planning and forecasting than the orders that are literally coming in the door. So how do we do that? Well, we're getting all of the customers to give us their point of sale data. We're combining that with geospatial data, with weather data. 
We're like looking at historical data and industry averages but as you can see, they were like-- we're stitching together data across a whole variety of sources and they said the best people to do this are actually the category managers and the people responsible for the brands 'cause they literally live inside those businesses and they understand it. And so what happened was they-- the IT organization was saying, look listen, we don't want to be the people doing the janitorial work on the data. We're going to give that work over to people who understand it and they're going to be more productive and get to better outcomes with that information and that frees us up to go find new and interesting sources and I think that collaborative model that you're starting to see emerge where they can now be the data heroes in a different way by not being the ones being the bottleneck on provisioning but rather can go out and figure out how do we share the best stuff across the organization? How do we find new sources of information to bring in that people can leverage to make better decisions? That's an incredibly powerful place to be and you know, I think that that model is really what's going to be driving a lot of the thinking at Trifacta and in the industry over the next couple of years. >> Great. Adam Wilson, CEO of Trifacta. Joe Hellerstein, CTO-- Chief Strategy Officer of Trifacta and also a professor at Berkeley. Great story. Getting the (mumbles) right is hard but under the hood stuff's complicated and again, congratulations about sharing the Ground project. Ground open source. Open source lab kind of thing at-- in Berkeley. Exciting new stuff. Thanks so much for coming on theCUBE. I appreciate great conversation. I'm John Furrier, George Gilbert. You're watching theCUBE here at Big Data SV in conjunction with Strata and Hadoop. Thanks for watching. >> Great. >> Thanks guys.

Published Date : Mar 16 2017

ENTITIES

Entity	Category	Confidence
Joe Hellerstein	PERSON	0.99+
George	PERSON	0.99+
Joe	PERSON	0.99+
George Gilbert	PERSON	0.99+
Joe Hellestein	PERSON	0.99+
John Furrier	PERSON	0.99+
Trifacta	ORGANIZATION	0.99+
Pepsi	ORGANIZATION	0.99+
Adam Wilson	PERSON	0.99+
Adam	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
Waterline	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
Berkeley	LOCATION	0.99+
Silicon Valley	LOCATION	0.99+
San Jose, California	LOCATION	0.99+
Alation	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
Stephanie	PERSON	0.99+
Horton	ORGANIZATION	0.99+
LinkedIn	ORGANIZATION	0.99+
six years	QUANTITY	0.99+
one	QUANTITY	0.99+
MapR	ORGANIZATION	0.99+
tomorrow	DATE	0.99+
Capital One	ORGANIZATION	0.99+
first question	QUANTITY	0.99+
Today	DATE	0.99+
One	QUANTITY	0.99+
Last year	DATE	0.99+
two executives	QUANTITY	0.99+
Trifacta	PERSON	0.99+
Cloudera	ORGANIZATION	0.99+
one piece	QUANTITY	0.98+
both solutions	QUANTITY	0.98+
today	DATE	0.98+
over 4500 companies	QUANTITY	0.98+
Intel	ORGANIZATION	0.98+
both ways	QUANTITY	0.98+
both	QUANTITY	0.98+
three founders	QUANTITY	0.97+
two vendors	QUANTITY	0.97+
first sites	QUANTITY	0.97+
Ground	ORGANIZATION	0.97+
Munich Re	ORGANIZATION	0.97+
about 12 months	QUANTITY	0.97+
NYC	LOCATION	0.96+
first thing	QUANTITY	0.96+
four times	QUANTITY	0.96+
eBay	ORGANIZATION	0.95+
Wikibon	ORGANIZATION	0.95+
Paolo Alto	PERSON	0.95+
next day	DATE	0.95+
three times	QUANTITY	0.94+
ten of thousands of users	QUANTITY	0.93+
one side	QUANTITY	0.93+
years later	DATE	0.92+

Day Two Kickoff - Spark Summit East 2017 - #SparkSummit - #theCUBE

>> Narrator: Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to day two in Boston where it is snowing sideways here. But we're all here at Spark Summit #SparkSummit, Spark Summit East, this is theCUBE. SiliconANGLE's flagship product. We go out to the event, we program for our audience, we extract the signal from the noise. I'm here with George Gilbert, day two, at Spark Summit, George. We're seeing the evolution of so-called big data. Spark was a key part of that. Designed to really both simplify and speed up big data oriented transactions and really help fulfill the dream of big data, which is to be able to affect outcomes in near real time. A lot of those outcomes, of course, are related to ad tech and selling and retail oriented use cases, but we're hearing more and more around education and deep learning and affecting consumers and human life in different ways. We're now 10 years into the whole big data trend, what's your take, George, on what's going on here? >> Even if we started off with ad tech, which is what most of the big internet companies did, we always start off in any new paradigm with one application that kind of defines that era. And then we copy and extend that pattern. For me, on rethinking your business, the McGraw-Hill interview we did yesterday was the most amazing thing because they took, what they had was a textbook business for their education unit and they're re-thinking the business, as in what does it mean to be an education company? And they take cognitive science about how people learn and then they take essentially digital assets and help people on a curriculum, not the centuries old sort of teacher, lecture, homework kind of thing, but individualized education where the patterns of reinforcement are consistent with how each student learns.
And it's not just break up the lecture into little bits, it's more of a how do you learn most effectively? How do you internalize information? >> I think that is a great example, George, and there are many, many examples of companies that are transforming digitally. Years and years ago people started to think about okay, how can I instrument or digitize certain assets that I have for certain physical assets? I remember a story when we did the MIT event in London with Andy McAfee and Erik Brynjolfsson, they were giving the example of McCormick Spice, the spice company, who digitized by turning what they were doing into recipes and driving demand for their product and actually building new communities. That was kind of an interesting example, but sort of mundane. The McGraw-Hill education is massive. Their chief data scientist, chief data scientist? I don't know, the head of engineering, I guess, is who he was. >> VP of Analytics and Data Science. >> VP of Analytics and Data Science, yeah. He spoke today and got a big round of applause when he sort of led off about the importance of education at the keynote. He's right on, and I think that's a classic example of a company that was built around printing presses and distributing dead trees that has completely transformed and it's quite successful. Over the last only two years brought in a new CEO. So that's good, but let's bring it back to Spark specifically. When Spark first came out, George, you were very enthusiastic. You're technical, you love the deep tech. And you saw the potential for Spark to really address some of the problems that we faced with Hadoop, particularly the complexity, the batch orientation. Even some of the costs -- >> The hidden costs. >> Associated with that, those hidden costs. So you were very enthusiastic, in your mind, has Spark lived up to your initial expectations?
>> That's a really good question, and I guess techies like me are often a little more enthusiastic than the current maturity of the technology. Spark doesn't replace Hadoop, but it carves out a big chunk of what Hadoop would do. Spark doesn't address storage, and it doesn't really have any sort of management bits. So you could sort of hollow out Hadoop and put Spark in. But it's still got a little ways to go in terms of becoming really, really fast to respond in near real time. Not just human real time, but like machine real time. It doesn't work sort of deeply with databases yet. It's still teething, and sort of every release, which is approximately every 12 to 18 months, it gets broader in its applicability. So there's no question sort of everyone is piling on, which means that'll help it mature faster. >> When Hadoop was first sort of introduced to the early masses, not the main stream masses, but the early masses, the profundity of Hadoop was that you could leave data in place and bring compute to the data. And people got very excited about that because they knew there was so much data and you just couldn't keep moving it around. But the early insiders of Hadoop, I remember, they would come to theCUBE and everybody was, of course, enthusiastic and lot of cheerleading going on. But in the hallway conversations with Hadoop, with the real insiders you would have conversations about, people are going to realize how much this sucks some day and how hard this is and it's going to hit a wall. Some of the cheerleaders would say, no way, Hadoop forever. Now you've started to see that in practice. And the number of real hardcore transformations as a result of Hadoop in and of itself have been quite limited. The same is true for virtually, most anyway, technology, not any technology. 
I'd say the smartphone was pretty transformative in and of itself, but nonetheless, we are seeing that sort of progression and we're starting to see a lot of the same use cases that you hear about like fraud detection and retargeting as coming up again. I think what we're seeing is those are improving. Like fraud detection, I talked yesterday about it used to be six months before you'd even detect fraud, if you ever did. Now it's minutes or seconds. But you still get a lot of false positives. So we're going to just keep turning that crank. Mike Gualtieri today talked about the efficacy of today's AI and he gave some examples of Google, he showed a plane crash and he said, it said plane and it accurately identified that, but also the API said it could be wind sports or something like that. So you can see it's still not there yet. At the same time, you see things like Siri and Amazon Alexa getting better and better and better. So my question to you, kind of long-winded here, is, is that what Spark is all about? Just making better the initial initiatives around big data, or is it more transformative than that? >> Interesting question, and I would come at it with a couple different answers. Spark was a reaction to you can't, you can't have multiple different engines to attack all the different data problems because you would do a part of the analysis here, push it into a disk, pull it out of a disk to another engine, all of that would take too long or be too complex a pipeline to go from one end to the other. Spark was like, we'll do it all in our unified engine and you can come at it from SQL, you can come at it from streaming, so it's all in one place. That changes the sophistication of what you can do, the simplicity, and therefore how many people can access it and apply it to these problems. And the fact that it's so much faster means you can attack a qualitatively different set of problems.
>> I think as well it really underscores the importance of Open Source and the ability of the Open Source community to launch projects that both stick and can attract serious investment. Not only with IBM, but that's a good example. But entire ecosystems that collectively can really move the needle. Big day today, George, we've got a number of guests. We'll give you the last word at the open. >> Okay, what I thought, this is going to sound a little bit sort of abstract, but a couple of takeaways from some of our most technical speakers yesterday. One was with Ion Stoica who sort of co-headed the lab that was the genesis of Spark at Berkeley. >> AMPLab. >> The AMPLab at Berkeley. >> And now RISELab. >> And then also with the IBM Chief Data Officer for the Analytics Unit. >> Seth Dobrin. >> Dobrin, yes. When we look at what's the core value add ultimately, it's not these infrastructure analytic frameworks and that sort of thing, it's the machine learning model in its flywheel feedback state where it's getting trained and re-trained on the data that comes in from the app and then as you continually improve it, that was the whole rationale for data lakes, but not with models. It was put all the data there because you're going to ask questions you couldn't anticipate. So here it's collect all the data from the app because you're going to improve the model in ways you didn't expect. And that beating heart, that living model that's always getting better, that's the core value add. And that's going to belong to end customers and to application companies. >> One of the speakers today, AI kind of invented in the 50s, a lot of excitement in the 70s, kind of died in the 80s and it's coming back. It's almost like it's being reborn. And it's still in its infant stages, but the potential is enormous. All right, George, that's a wrap for the open. Big day today, keep it right there, everybody.
We got a number of guests today, and as well, don't forget, at the end of the day today George and I will be introducing part two of our Wikibon Big Data forecast. This is where we'll release a lot of our numbers and George will give a first look at that. So keep it right there everybody, this is theCUBE. We're live from Spark Summit East, #SparkSummit. We'll be right back. (techno music)

Published Date : Feb 9 2017

ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Mike Gualtieri	PERSON	0.99+
George	PERSON	0.99+
Juan Stoyka	PERSON	0.99+
Boston	LOCATION	0.99+
IBM	ORGANIZATION	0.99+
Eric Binyolsen	PERSON	0.99+
London	LOCATION	0.99+
yesterday	DATE	0.99+
10 years	QUANTITY	0.99+
Siri	TITLE	0.99+
Berkeley	LOCATION	0.99+
Google	ORGANIZATION	0.99+
McCormick Spice	ORGANIZATION	0.99+
Boston, Massachusetts	LOCATION	0.99+
Rise Labs	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
today	DATE	0.99+
Seth Filbrun	PERSON	0.99+
80s	DATE	0.98+
50s	DATE	0.98+
each student	QUANTITY	0.98+
two takeaways	QUANTITY	0.98+
70s	DATE	0.98+
Spark	ORGANIZATION	0.98+
Spark Summit East 2017	EVENT	0.98+
first	QUANTITY	0.97+
both	QUANTITY	0.97+
Andy MacAfee	PERSON	0.97+
#SparkSummit	EVENT	0.97+
One	QUANTITY	0.96+
1	QUANTITY	0.96+
day two	QUANTITY	0.95+
one application	QUANTITY	0.95+
Spark	TITLE	0.95+
McGraw-Hill	PERSON	0.94+
AMPLabs	ORGANIZATION	0.94+
Years	DATE	0.94+
one place	QUANTITY	0.93+
Hadoop	TITLE	0.93+
Alexa	TITLE	0.93+
Databricks	ORGANIZATION	0.93+
Spark Summit East	EVENT	0.93+
12	QUANTITY	0.91+
two years	QUANTITY	0.91+
Spark Summit East	LOCATION	0.91+
six months	QUANTITY	0.9+
SQL	TITLE	0.89+
Chief Data Officer	PERSON	0.89+
Hadoop	PERSON	0.85+
much	QUANTITY	0.84+
Spark Summit	EVENT	0.84+
Anglo	OTHER	0.81+
first look	QUANTITY	0.75+
8 months	QUANTITY	0.72+
WikiBon	ORGANIZATION	0.69+
part two	QUANTITY	0.69+
Hill	ORGANIZATION	0.68+
Kickoff	EVENT	0.64+
couple	QUANTITY	0.64+
McGraw-	PERSON	0.64+

Ion Stoica, Databricks - Spark Summit East 2017 - #sparksummit - #theCUBE

>> [Announcer] Live from Boston, Massachusetts. This is theCUBE. Covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert. >> [Dave] Welcome back to Boston everybody, this is Spark Summit East #SparkSummit And this is theCUBE. Ion Stoica is here. He's Executive Chairman of Databricks and Professor of Computer Science at UC Berkeley. The smarts is rubbing off on me. I always feel smart when I co-host with George. And now having you on is just a pleasure, so thanks very much for taking the time. >> [Ion] Thank you for having me. >> So loved the talk this morning, we learned about RISELab, we're going to talk about that. Which is the son of AMP. You may be the father of those two, so. Again welcome. Give us the update, great keynote this morning. How's the vibe, how are you feeling? >> [Ion] I think it's great, you know, thank you and thank everyone for attending the summit. It's a lot of energy, a lot of interesting discussions, and a lot of ideas around. So I'm very happy about how things are going. >> [Dave] So let's start with RISELab. Maybe take us back, to those who don't understand, so the birth of AMP and what you were trying to achieve there and what's next. >> Yeah, so the AMP was a six-year project at Berkeley, and it involved around eight faculty and over the duration of the lab around 60 students and postdocs. And the mission of the AMPLab was to make sense of big data. AMPLab started in 2009, at the end of 2009, and the premise is that in order to make sense of this big data, we need a holistic approach, which involves algorithms, in particular machine-learning algorithms, machines, means systems, large-scale systems, and people, crowd sourcing. And more precisely the goal was to build a stack, a data analytic stack for interactive analytics, to be used across industry and academia. And, of course, being at Berkeley, it has to be open source.
(laugh) So that's basically what was AMPLab and it was a birthplace for Apache Spark, that's why you are all here today. And a few other open-source systems like Mesos, Apache Mesos, and Alluxio which was previously called Tachyon. And so AMPLab ended in December last year and in January, this January, we started a new lab which is called RISE. RISE stands for Real-time Intelligent Secure Execution. And the premise of the new lab is that actually the real value in the data is the decision you can make on the data. And you can see this more and more at almost every organization. They want to use their data to make some decision to improve their business processes, applications, services, or come up with new applications and services. But then if you think about that, what does it mean that the emphasis is on the decision? Then it means that you want the decision to be fast, because fast decisions are better than slower decisions. You want decisions to be on fresh data, on live data, because decisions on the data I have right now are better than decisions on the data from yesterday, or last week. And then you also want to make targeted, personalized decisions. Because the decisions on personal information are better than aggregate information. So that's the fundamental premise. So therefore you want to build platforms, tools and algorithms to enable intelligent real-time decisions on live data with strong security. And the security is a big emphasis of the lab because it means to provide privacy, confidentiality and integrity, and as you hear about data breaches or things like that every day. So for an organization, it is extremely important to provide privacy and confidentiality to their users and it's not only because the users want that, but it also indirectly can help them to improve their service. Because if I guarantee your data is confidential with me, you are probably much more willing to share some of your data with me.
And if you share some of the data with me, I can build and provide better services. So that's basically in a nutshell what the lab is and what the focus is. >> [Dave] Okay, so you said three things: fast, live and targeted. So fast means you can affect the outcome. >> Yes. >> Live data means it's better quality. And then targeted means it's relevant. >> Yes. >> Okay, and then my question on security, I felt like when cloud and Big Data came to the fore, security became a do-over. (laughter) Is that a fair assessment? Are you doing it over? >> [George] Or as Bill Clinton would call it, a Mulligan. >> Yeah, if you get a Mulligan on security. >> I think security is, it's always a difficult topic because it means so many things for so many people. >> Hmm-mmm. >> So there are instances where actually cloud is quite secure. Actually, cloud can be more secure than some on-prem deployments. In fact, if you hear about these data leaks or security breaches, you don't hear them happening in the cloud. And there is some reason for that, right? It is because they have trained people, you know, they are paranoid about this, they do a specification maybe much more often and things like that. But still, you know, the state of security is not that great. Right? For instance, if I compromise your operating system, whether it's in the cloud or not in the cloud, I can do anything. Right? Or your VM, right? On all these clouds you run on a VM. And now you are also going to run on some containers. Right? So there are a lot of attacks, or there are attacks, sophisticated attacks, where your data is encrypted, but if I can look at the access patterns, how much data you transferred, or how much data you access from memory, then I can infer something about what you are doing, about your queries, right? If it's more data, maybe it's a query on New York. If it's less data it's probably maybe something smaller, like maybe something at Berkeley.
So you can infer from multiple queries just looking at the access. So it's a difficult problem. But fortunately again, there are some new technologies which are being developed and some new algorithms which give us some hope. One of the most interesting technologies which is happening today is hardware enclaves. So with hardware enclaves you can execute the code within this enclave which is hardware protected. And even if your operating system or VM is compromised, an attacker cannot access your code which runs in this enclave. And Intel has Intel SGX and we are working and collaborating with them actively. ARM has TrustZone and AMD also announced they are going to have a similar technology in their chips. So that's kind of a very interesting and very promising development. I think the other aspect, it's a focus of the lab, is that even if you have the enclaves, it doesn't automatically solve the problem. Because the code itself may have a vulnerability. Yes, I can run the code in a hardware enclave, but the code can send out >> Right. >> data outside. >> Right, the enclave is a more granular perimeter. Right? >> Yeah. So yeah, so we are looking, and the security experts in our lab are looking at this, maybe how to split the application so you run only a small part in the enclave, which is the critical part, and you can make sure that also that code is secure, and the rest of the code you run outside. But the rest of the code, it's only going to work on data which is encrypted. Right? So there is a lot of interesting research but that's good. >> And does Blockchain fit in there as well? >> Yeah, I think Blockchain, it's a very interesting technology. And again it's real-time and the area is also very interesting directions. >> Yeah, right. >> Absolutely. >> So you guys, I want George, you've shared with me sort of what you were calling a new workload. So you had batch and you have interactive and now you've got continuous- >> Continuous, yes.
>> And I know that's a topic that you want to discuss and I'd love to hear more about that. But George, tee it up. >> Well, okay. So we were talking earlier and the objective of RISE is fast and continuous-type decisions. And this is different from the traditional, you either do it batch or you do it interactive. So maybe tell us about some applications where that is one workload among the other traditional workloads. And then let's unpack that a little more. >> Yeah, so I'll give you a few applications. So it's more than continuously interacting with the environment; you also learn continuously. I'll give you some examples. So for instance in one example, think about you want to detect a network security attack, and respond and diagnose and defend in real time. So what this means is that you need to continuously get logs from the network, and the more endpoints you can get them from, the better. Right? Because more data will help you to detect things faster. But then you need to detect the new patterns and you need to learn the new patterns. Because new security attacks, which are the ones that are effective, are slightly different from the past ones because you hope that you already have the defense in place for the past ones. So now you are going to learn that and then you are going to react. You may push patches in real time. You may push filters, installing new filters to firewalls. So that's kind of one application that's going in real time. Another application can be about self driving. Now self driving has made tremendous strides. And a lot of algorithms, you know, very smart algorithms, now they are implemented on the cars. Right? All the system is on the cars. But imagine now that you want to continuously get the information from these cars, aggregate and learn and then send back the information you learned to the cars.
Like for instance if there's an accident or a roadblock, an object which is dropped on the highway, so you can learn from the other cars what they've done in that situation. It may mean in some cases the driver took an evasive action, right? Maybe you can monitor also the cars which are not self-driving, but driven by the humans. And then you learn that in real time and then the other cars which follow, confronted with the same situation, they now know what to do. Right? So this is again, I want to emphasize this. Not only continuously sensing the environment, and making the decisions, but a very important component is about learning. >> Let me take you back to the security example as I sort of process the auto one. >> Yeah, yeah. >> So in the security example, it doesn't sound like, I mean if you have a vast network, you know, end points, software, infrastructure, you're not going to have one God model looking out at everything. >> Yes. >> So I assume that means there are models distributed everywhere and they don't necessarily know what an entirely new attack pattern looks like. So in other words, for that isolated model, it doesn't know what it doesn't know. I don't know if that's what Rumsfeld called it. >> Yes (laughs). >> How does it know what to pass back for retraining? >> Yes. Yes. Yes. So there are many aspects and there are many things you can look at. And it's again, it's a research problem, so I cannot give you the solution now, I can hypothesize and I give you some examples. But for instance, you can look at, and you correlate by observing, the effect. Some of the effects of the attack are visible. In some cases, denial of service attack. That's pretty clear. Even the And so forth, they maybe cause computers to crash, right? So once you see some of this kind of anomaly, right, anomalies on the end devices, end hosts and things like that. Maybe reported by humans, right?
Then you can try to correlate with what kind of traffic you've got. Right? And from there, from that correlation, probably you can, and hopefully, you can develop some models to identify what kind of traffic, where it comes from, what is the content, and so forth, which causes behavior, anomalous behavior. >> And where is that correlation happening? >> I think it will happen everywhere, right? Because- >> At the edge and at the center. >> Absolutely. >> And then I assume that it sounds like the models both at the edge and at the center are ensemble models. >> Yes. >> Because you're tracking different behavior. >> Yes. You are going to track different behavior and you are going to, I think that's a good hypothesis. And then you are going to assemble them, assemble to come up with the best decision. >> Okay, so now let's wind forward to the car example. >> Yeah. >> So it sounds like there's a mesh network, at least, Peter Levine's sort of talk was there's near-local compute resources and you can use bitcoin to pay for it or Blockchain or however it works. But that sort of topology, we haven't really encountered before in computing, have we? And how imminent is that sort of ... >> I think that some of the stuff you can do today in the cloud. I think if you want super-low latency probably you need to have more computation towards the edges, but if I'm thinking that I want kind of reactions in tens, hundreds of milliseconds, in theory you can do it today with the cloud infrastructure we have. And if you think about it, in many cases, if you can do it within a few hundreds of milliseconds, it's still super useful. Right? To avoid this object which has dropped on the highway. You know, if I have a few hundred milliseconds, many cars will effectively avoid that, having that information. >> Let's have that conversation about the edge a little further. The one we were having off camera.
So there's a debate in our community about how much data will stay at the edge, how much will go into the cloud. David Floyer said 90% of it will stay at the edge. Your comment was, it depends on the value. What do you mean by that? >> I think that depends on who I am and how I perceive the value of the data. And, you know, what can be the value of the data? This is what I was saying. I think that the value of the data is fundamentally what kind of decisions, what kind of actions it will enable me to take. Right? So here I'm not just talking about, you know, credit card information or things like that, even exactly there is an action somebody's going to take on that. So if I do believe that the data can provide me with the ability to take better actions or make better decisions, I think that I want to keep it. And why I want to keep it is also because it's not only the decision it enables me to make now, but everyone is going to continuously improve their algorithms. Develop new algorithms. And when you do that, how do you test them? You test on the old data. Right? So I think that for all these reasons, a lot of data, valuable data in this sense, is going to go to the cloud. Now, is there a lot of data that should remain on the edges? And I think that's fair. But it's, again, if a cloud provider, or someone who provides a service in the cloud, believes that the data is valuable, I do believe that eventually it is going to get to the cloud. >> So if it's valuable, it will be persisted and will eventually get to the cloud? And we talked about latency, but latency, the example of evasive action. You can't send the data back to the cloud and make the decision, you have to make it in real time. But eventually that data, if it's important, will go back to the cloud. The other question is, of all this data that we are now processing on a continuous basis, how much actually will get persisted? Most of it, much of it, probably does not get persisted. Right? Is that a fair assumption?
>> Yeah, I think so. And probably all the data is not equal. All right? It's like you want to maybe, even if you take a continuous video, all right? On the cars, they continuously have videos from multiple cameras and radar and lidar, all of this stuff. This is continuous. And if you think about this one, I would assume that you don't want to send all the data to the cloud. But the data around the interesting events, you may want to, right? So before and after the car has a near-accident, or took an evasive action, or the human had to intervene. So in all these cases, probably I want to send the data to the cloud. But for most cases, probably not. >> That's good. We have to leave it there, but I'll give you the last word on things that are exciting you, things you're working on, interesting projects. >> Yeah, so I think what really excites me is how we are going to have these continuous applications, you are going to continuously interact with the environment. You are going to continuously learn and improve. And here there are many challenges. And I just want to mention a few more, which we haven't discussed. One, in general it's about explainability. Right? If these systems augment the human decision process, if these systems are going to make decisions which impact you as a human, you want to know why. Right? Like I gave this example, assuming you have machine-learning algorithms, you're making a diagnosis on your MRI, or x-ray. You want to know why. What in this x-ray causes that decision? If you go to the doctor, they are going to point and show you. Okay, this is why you have this condition. So I think this is very important. Because as a human you want to understand. And you want to understand not only why the decision happens, but you want also to understand what you need to do to do better in the future, right? Like if your mortgage application is turned down, I want to know why is that?
Because next time when I apply for the mortgage, I want to have a higher chance to get it through. So I think that's a very important aspect. And the last thing I will say, and this is super important, is about having algorithms which can say I don't know. Right? It's like, okay, I have never seen this situation in the past. So I don't know what to do. This is much better than giving you just the wrong decision. Right? >> Right, or a low probability that you don't know what to do with. (laughs) >> Yeah. >> Excellent. Ion, thanks again for coming on theCUBE. It was really a pleasure having you. >> Thanks for having me. >> You're welcome. All right, keep it right there everybody. George and I will be back to do our wrap right after this short break. This is theCUBE. We're live from Spark Summit East. We'll be right back. (techno music)

Published Date : Feb 8 2017
