Anthony Lye, NetApp & Amiram Shachar, Spot by NetApp | AWS re:Invent 2021

(upbeat music) >> Welcome back to theCUBE's continuing coverage of AWS re:Invent 2021, live from Las Vegas. I'm Lisa Martin. We are doing one of the most important industry events, hybrid events, this year with Amazon and its massive ecosystem of partners, some of which are joining me next. We've got two live sets, two remote sets, over 100 guests on the program, and I'm going to be talking about the next decade in Cloud innovation. I'm pleased to welcome back Anthony Lye to the program, the Executive Vice President and General Manager of Public Cloud at NetApp. Anthony, good to see you. >> Nice to see you again, thanks for... >> Nice to see you in person. >> I know... >> It's been a couple of years. And Amiram Shachar is here, the VP and GM of Spot by NetApp. Amiram, it's great to have you on the program, welcome. >> Likewise, thank you. >> So the acquisition, the Spot acquisition, was during the pandemic, mid 2020. Amiram, talk to me about that: why NetApp, how's it going? Give us the lay of the land. >> I think that's the, it's one of the greatest things that NetApp has done, and I think it's one of the most amazing outcomes we could have as a company. And if you think about it at first sight, when you look at a storage company and a compute company, what's the connection? But the thing is that NetApp is a company that is going through a huge transformation into Cloud. And by doing this acquisition, it's really signaling where it's going. It's going way beyond, and honestly I just wanted to be part of it. >> And what's the customer sentiment been in the 18 months or so post acquisition? >> I think NetApp, specifically with Anthony leading that acquisition, has done a phenomenal job of keeping Spot as a business unit, an independent business unit. So our customers didn't really feel that something had happened. The only thing we told them is we're going to have more funding, so. >> I'm sure they like that. 
Anthony, talk to us about NetApp's transformation, transition, Spot as part of that. And then of course CloudCheckr, which acquisition was just announced, I believe, yesterday? >> We closed on actually November 7th. >> Lisa: Okay. >> So it's almost been a month now since we closed, but I've been at NetApp, my gosh, it'll be five years in February. And you know, I think that the company had a real desire to sort of re-imagine itself and to embrace the public Clouds and to give its customers, you know, what I think it's done incredibly well is this idea of symmetry. That we wanted to build something on Amazon that was as good or maybe a little bit better than on-premise. And customers really, I think, appreciate that desire for us to do those kinds of things. Now of course, CloudCheckr was my ninth acquisition in four years. Just to build on what Amiram said, I mean, CloudCheckr we acquired for Spot, and we acquired what? Four companies in the last 12 months for Spot. So we really believe that as a company now we can address all of their potential opportunities, whether it's in a legacy application, whether it's a virtual desktop, whether it's a Cloud native application, or we just went and announced Ocean for Apache Spark. So Spot now has an optimization and automation solution for Spark on AWS which we announced, I think, just yesterday. >> Correct. >> But I'd like to get both of your perspectives on keeping Spot as a brand. Anthony, we'll start with you and then Amiram, we'll go to you. >> Amiram is the founder, and he was the CEO of the company and built a fantastic company. And NetApp, I think, has a phenomenal brand, but a brand that's associated with the sort of traditional IT organization. And as you note, in the Cloud the buyers are slightly different. They're sort of the application owners, or they operate in a sort of construct that most people call CloudOps or DevOps. 
And we felt that Spot represented that new buyer in ways that NetApp didn't and probably couldn't. And so we really liked the idea of having the structure of the big N supported by a little pink and a little blue and a more sort of Cloud native brand. >> And that's key, especially given the dynamics in the market that we've seen in the last 22 months with the rapid changes, the pivot to Cloud, customers that weren't that digital needing to go in that direction to survive in the very beginning. I imagine this was really kind of core to NetApp's strategy, but also helping both of your customers to survive initially and then to be able to thrive and identify some of those key areas where they can cut costs and be far more efficient. >> Okay I think you are in here, if you were born physical you're now digital, and if you weren't born physical you were born digital. And you know, digital is a very effective medium, accelerated by the pandemic because, as you said, we couldn't really get close to each other. And you just look at the innovation around us here at Amazon, it's just amazing to watch. And we've just been really, really good partners with Amazon now for many, many years. And we continue to see just huge, huge opportunities. >> Well, Adam Selipsky this morning in his keynote, one of the partners he called out was NetApp. >> Yeah, I know. I mean, I'll talk a little bit later on maybe with Yancey and I, but you know, Amazon now sells our product. They haven't done that with anybody. So ONTAP is now a product that Amazon sells. >> Lisa: Okay. >> Amazon supports, Amazon bills, Amazon runs. So we've really, really demonstrated, I think, not just to our customers, that sort of high rate of innovation and an opportunity to accelerate their businesses, but we've demonstrated it to Amazon themselves, that we can operate like them. And we can develop with them at a speed that they are comfortable with. 
That maybe a few years ago many people would have doubted that a legacy company could operate this way. >> Right, one of the things we know about Amazon is the speed, but also their focus on the customer. It's laser-focused, that whole flywheel of Amazon. Everything that was being announced this morning was exciting, to your point Anthony, but it's also showing how involved the customers and the partners are in the ecosystem and that flywheel. Amiram, talk to me from your perspective: from a visionary standpoint, what are some of the things that you're looking forward to going forward with CloudCheckr, but also knowing how deeply connected and integrated NetApp is with a big powerhouse like AWS? >> Yeah, so a few things about that. I think the first thing is also my take from today, listening to the keynote and looking at all the new announcements. I think the trend is that deployment to the Cloud is becoming easier, but operations is becoming messier. And I think when we look at our category and where we aspire, where we want to be and where we're going, with the CloudCheckr acquisition we're expanding into an area that we haven't been in. Because there are two categories in Cloud cost: there is optimization and there is cost management. What we've done, what we've built, the business we had is in the optimization space. It's actively reducing and optimizing resources for customers. And there are very few companies in that category, as far as I can say. But right now we're expanding into that area of cost management, so we can meet our customers sooner. And you can see us doing it in multiple areas, not only here. If we look at a customer journey in the Cloud, it starts with bringing workloads into the Cloud, deploying them, and then securing them, and then automating them, and then optimizing them. Nobody moves to the Cloud and optimizes. 
So we're typically meeting customers at the end of their journey, we're meeting customers where they need an optimization and they have everything already set up. And right now with Ocean for Apache Spark, Ocean continuous delivery, Spot security, we're meeting customers sooner in their journey so we can provide a much more holistic solution and platform to customers wherever they are in their migration to the Cloud and scaling into Cloud. And CloudCheckr is also taking us to a whole new world of cost management. So, I think we're scaling and ramping and doing all these things, and it's so amazing to realize that we haven't unleashed even 1% of what we can do. >> Really, so there's much more under the covers that we're still waiting for? >> I think the good news is, you know, to comment more on what you said, our roadmaps are now largely being driven by customers. And that's just so refreshing, to know that you've not only solved a problem for a particular customer, but the customer wants you to solve more problems and that they trust us to be that sort of organization that can help them. So, we're full steam ahead. You know, we're going to continue to acquire in areas where we think we can get acceleration. But our acquisition of Spot was very much about, as Amiram said, bringing not just a great company into the business, but investing significantly in it. And that's really proven, I think, to be, as Amiram said, one of the most, if not the most, successful acquisitions NetApp has ever done. >> Well congratulations, that's fantastic. But it also sounds like from that customer focus there's clear, strong alignment with how AWS operates, how it values its customers, from NetApp's perspective and I imagine from Spot's as well. >> You know, if there's one thing I was really proud of during the acquisition, it's that I got a phone call from a customer, the largest food delivery company in South America, and they were very worried about this acquisition, and I asked them why. 
And they told me, "Because your customer service, Spot's customer service, is the best customer service I've ever gotten, and if I'm not going to continue to get this customer service, I need to look at finding another vendor." And they told me that when they want to tell AWS which company they can learn from, they're always pointing at Spot. So that was a very refreshing moment for me, to realize how much at Spot we care about our customers, and not only as a gimmick; customer obsession is something that we really live. And it was interesting to see that that was a concern for our customers when we got acquired. >> Well that's proof in the pudding, because you're right, it's one thing to say it. Companies can always say, "We're customer obsessed, we're customer first, we're customer focused." It's one thing to say it as a marketing term, it's a whole other thing to actually live it and demonstrate it, and actually have people coming to you saying, "We want to model that." I'm curious, Anthony, what did you pull over from that? What has NetApp learned from this? >> I always tell Amiram that the idea was that they would essentially take us over. You know, we sort of loved their culture, we loved their people and their process. And we literally changed a lot of how NetApp operated to operate along the Spot model. So we really did, as Amiram said earlier on, we let them not just exist, but we let them thrive. And we encouraged them to point at other areas of NetApp that they thought we should change to be more like them. And it's raised the bar across everything we do now. And so, we now have a lot of the Spot business processes, a lot of the Spot culture, sort of seeping into the whole of the company. >> That's a very empathetic approach, and that's one of the things that we've learned in the last year and a half: it's key to leadership, it's key to anything, that empathy. 
But the ability to recognize where there are things within an organization that can be improved, and looking at leaders like Spot to go, "Let's actually make this really symbiotic and bi-directional." And I imagine with CloudCheckr it's going to be the same type of influence? >> Well as I've always said, and I say this to the employees and to the acquisitions that we make, what we are acquiring is people. You know, the logo, the software, even in many ways the customer base is really very much, I think, a function of the people. And we work incredibly hard to retain the people, but we do so by empowering them and encouraging them to lead. We really don't want to have the historical perspective of acquisitions, where the big company swamps the little company. And I think we've tried very hard to make that a part of our acquisition strategy. And so CloudCheckr is very early in the process, but we're very much following those things, and even Amiram and his team are learning from them. If they're doing something a little better than Spot is, then that's something we'll pick up from them. >> And that's just from a very open cultural perspective. That's a big change for NetApp, but it's also a smart way to go, 'cause you're right, you're acquiring people. And we often talk about people, process, technology. But to be honest with you, it's rare that we hear companies talking about the people focus as being that critical. It's because of our people that we have successful support, happy, successful customers. So that people focus is (inaudible). >> You know, company culture is not something you can manufacture. It's something that happens, and it happens, I think, through people. And the important thing is if you can establish an organization with the right kinds of people, and again, all credit goes to Amiram as the founder and CEO of the company. 
I think you sort of demanded a kind of person and a kind of culture that set you apart from so many other companies. >> I think the focus on culture was, I was very obsessed with it from very early on in the process, so much that even Spot investors were questioning it, like, how come you are so obsessed with culture so early on? And I think it paid off big time. There was a book I read while being a CEO that really helped me to scale from quarter to quarter, because I really believe that as a CEO of a startup, every quarter you're basically applying again for your job, because you're getting a new company every quarter. And about people, processes, technology: at Spot it was a little bit different through the book I read, which is "The Hard Thing About Hard Things" by Ben Horowitz. It's people, product, revenue, PPR. And you need to take care of the people, and if you don't take care of the people, nothing else matters, like nothing else just... >> Right. >> And if the people and the product are not working well, the revenue is not going to come. So revenue was always for us something that is coming, it's trailing after a good product and good people. >> I love that. What a great, honest focus and vision you guys both have. Congratulations on the CloudCheckr acquisition. But also just the cultural alignment that you've done, that's really driven by your people and the customers, it's really refreshing to hear that. And congrats on NetApp's continued partnership with AWS. We look forward to having you on again next time we can see you in person and talk more about customer successes. >> Thank you very much for hosting us. >> My pleasure, guys. >> Thank you. >> For my guests, I'm Lisa Martin. You're watching theCUBE, the global leader in live tech coverage. (upbeat music)

Published Date : Dec 1 2021



Did HPE GreenLake Just Set a New Bar in the On-Prem Cloud Services Market?

>> Welcome back to theCUBE's coverage of HPE's GreenLake announcements. My name is Dave Vellante and you're watching theCUBE. I'm here with Holger Mueller, who is an analyst at Constellation Research. And Matt Maccaux is the global field CTO of Ezmeral software at HPE. We're going to talk data. Gents, great to see you. >> Holger: Great to be here. >> So, Holger, what do you see happening in the data market? Obviously data's hot, you know, digital, I call it the forced march to digital. Everybody realizes wow, digital business, that's a data business. We've got to get our data act together. What do you see in the market as the big trends, the big waves? >> We are all young enough or old enough to remember when people were saying data is the new oil, right? Nothing has changed, right? Data is the key ingredient, which matters to the enterprise, which they have to store, which they have to enrich, which they have to use for their decision-making. It's the foundation of everything. If you want to go into machine learning or (indistinct). It's growing very fast, right? We have the capability now to look at all the data in the enterprise, which we weren't able to do 10 years ago. So data is central to everything. >> Yeah, it's even more valuable than oil, I think, right? 'Cause oil, you can only use once. Data, you can, it's kind of polyglot. It can go in different directions and it's amazing, right? >> It's the beauty of digital products, right? They don't get consumed, right? They don't get fired up, right? And no carbon footprint, right? "Oh wait, wait, we have to think about carbon footprint." Different story, right? So to get to the data, you have to spend some energy. >> So it's that simple, right? I mean, it really is. Data is fundamental. It's got to be at the core. And so Matt, what are you guys announcing today, and how does that play into what Holger just said? >> What we're announcing today is that organizations no longer need to make a difficult choice. 
Prior to today, organizations were thinking, if I'm going to do advanced machine learning and really exploit my data, I have to go to the cloud. But all my data's still on premises because of privacy rules, industry rules. And so what we're announcing today, through GreenLake Services, is a cloud services way to deliver that same cloud-based analytical capability, machine learning, data engineering, through hybrid analytics. It's a unified platform to tie together everything from data engineering to advanced data science. And we're also announcing the world's first Kubernetes native object store that is hybrid cloud enabled. Which means you can keep your data connected across clouds in a data fabric, or Dave, as you say, mesh. >> Okay, can we dig into that a little bit? So, you're essentially saying that you're going to have data in both places, right? Public cloud, edge, on-prem, and you're saying HPE is announcing a capability to connect them. I think you used the term fabric. I'm cool, by the way, with the term fabric, we'll parse that out another time. >> I'd love for you to discuss textiles. Fabrics vs. mesh. For me, every fabric breaks down to mesh if you put it under a microscope. It's the same thing. >> Oh wow, now that's really, that's too detailed for my brain right this moment. But you're saying you can connect all those different estates, because data by its very nature is everywhere. You're going to unify that, and what, you can manage that through sort of a single view? >> That's right. So, the management is centralized. We need to be able to know where our data is being provisioned. But again, we don't want organizations to feel like they have to make the trade-off. If they want to use cloud service A in Azure, and cloud service B in GCP, why not connect them together? Why not allow the data to remain in sync or not, through a distributed fabric? Because we use that term fabric over and over again. 
But the idea is, let the data be where it most naturally makes sense, and exploit it. Monetization is an old tool, but exploit it in a way that works best for your users and applications. >> In sync or not, that's interesting. So it's my choice? >> That's right. Because the back of an automobile could be a teeny tiny, small edge location. It's not always going to be in sync until it connects back up with a training facility. But we still need to be able to manage that. And maybe that data gets persisted to a core data center. Maybe it gets pushed to the cloud, but we still need to know where that data is, where it came from, its lineage, what quality it has, what security we're going to wrap around that. That all should be part of this fabric. >> Okay. So, you've got essentially a governance model, or at least maybe you're working toward that, and maybe it's not all baked today, but that's the north star. Is this fabric connected, a single management view, governed in a federated fashion? >> Right. And it's available through the most common APIs that these applications are already written in. So, everybody today's talking S3. I've got to get all of my data, I need to put it into an object store, it needs to be S3 compatible. So, we are extending this capability to be S3 native. But it's optimized for performance. Today, when you put data in an object store, it's kind of one size fits all. Well, we know for those streaming analytical capabilities, those high performance workloads, it needs to be tuned for that. So, how about I give you a very small object on the very fastest disk in your data center, and maybe that cheaper location somewhere else. And so we're giving you that balance as part of the overall management estate. >> Holger, what's your take on this? I mean, Frank Slootman says we'll never, we're not going halfway house. We're never going to do on-prem, we're only in the cloud. So that basically says, okay, he's ignoring a pretty large market by choice. 
You're not, Matt, you must love those words. But what do you see as the public cloud players' kind of moves on-prem, particularly in this realm? >> Well, we've seen lots of cloud players who were only cloud coming back towards on-premise, right? We call it the next generation compute platform, where I can move data and workloads between on-premise and, ideally, multiple clouds, right? Because I don't want to be locked into public cloud vendors. And we see two trends, right? One trend is the traditional hardware supplier of on-premise has not scaled to cloud technology in terms of big data analytics. They just missed the boat for that in the past. This is changing. You guys are a traditional player and changing this, so congratulations. The other thing is there's been no innovation for the on-premise tech stack, right? The only technology stack to run modern applications has been invested in for a long time in the cloud. So what we've seen over the last two, three years, right? With the first one being Google with Kubernetes, that had GKE on-premise, then Anthos, right? Bringing their tech stack with compromises to on-premises, right? Acknowledging exactly what we're talking about: the data is everywhere, data is important. Data gravity is there, right? It's just the network's fault, where the networks are too slow, right? If you could just move everything anywhere we want, like juggling two balls, then we'd be in a different place. But there's been not enough investment from the traditional IT players in that stack, and the modern stack being there. And now every public cloud player has an on-premise offering with different flavors, different capabilities. >> I want to give you guys Dave's story of kind of history and you can kind of course correct, and tell me how this, Matt, maybe fits into what's happened with customers. 
So, you know, before Hadoop, obviously you had to buy a big Oracle database and, you know, you're running Unix, and you buy some big storage subsystem, and if you had any money left over, you know, you maybe, you know, do some actual analytics. But then Hadoop comes in, lowers the cost, and then S3 kneecaps the entire Hadoop market, right? >> I wouldn't say that, I wouldn't agree. Sorry to jump on your history. Because the fascinating thing is what Hadoop brought to the enterprise for the first time. You're absolutely right, it was affordable, right, to do that. But it's not only about affordability, because S3 has the affordability. The big thing is you can store information without knowing how to analyze it, right? So, you mentioned Snowflake, right? Before, it was like an Oracle database. It was star schema for the data warehouse, and so on. You had to make decisions on how to store that data because compute capabilities, storage capabilities, were too limited, right? That's what Hadoop blew away. >> I agree, no schema on write. But then that created data lakes, which created data swamps, and that whole mess, and then Spark comes in and helps clean it out, okay, fine. So, we're cool with that. But the early days of Hadoop, companies would have a Hadoop monolith, they probably had their data catalog in Excel or Google Sheets, right? And so now, my question to you, Matt, is there's a lot of customers that are still in that world. What do they do? They've got an option to go to the cloud. I'm hearing that you're giving them another option? >> That's right. So we know that data is going to move to the cloud, as I mentioned. So let's keep that data in sync, and governed, and secured, like you expect. But for the data that can't move, let's bring those cloud native services to your data center. And so a big part of this announcement is this unified analytics. 
So that you can continue to run the tools that you want to today while bringing those next generation tools based on Apache Spark, using libraries like Delta Lake, so you can go anywhere from Tableau through Presto SQL, to advanced machine learning in your Jupyter notebooks, on-premises where you know your data is secured. And if it happens to sit in an existing Hadoop data lake, that's fine too. We don't want our customers to have to make that trade-off as they go from one to the other. Let's give you the best of both worlds, or as they say, you can eat your cake and have it too. >> Okay, so. Now let's talk about sort of developers on-prem, right? They've been kind of... If they really wanted to go cloud native, they had to go to the cloud. Do you feel like this changes the game? Do on-prem developers, do they want that capability? Will they lean into that capability? Or will they say no, no, the cloud is cool. What's your take? >> I love developers, right? But it's about who makes the decision, who pays the developers, right? So the CXOs in the enterprises, they need exactly this. This is why we call it the next-gen computing platform, where you can move your code assets. It's very hard to build software, so it's very valuable to an enterprise. I don't want to have it limited to one single location or certain computing infrastructure, right? Luckily, we have Kubernetes to be able to move that, but I want to be able to deploy it on-premise if I have to. I want to be able to deploy it in the multiple clouds which are available. And that's the key part. And that makes developers happy too, because the code you write has got to run in multiple places. So you can build more code, better code, instead of building the same thing in multiple places, because of a little compiler change here, a little compiler change there. Nobody wants to do portability testing and rewriting, recertifying for certain platforms. 
>> The head of application development or application architecture and the business are ultimately going to dictate that, number one. Number two, you're saying that developers shouldn't care because they can write once, run anywhere. >> That is the promise, and that's the interesting thing which is available now, 'cause people know, thanks to Kubernetes as a container platform and the abstraction which containers provide, and that makes everybody's life easier. But it goes much higher than the Head of Apps, right? This is the digital transformation strategy, the next generation application the company has to build as a response to a pandemic, as a pivot, as digital transformation, as digital disruption capability. >> I mean, I see a lot of organizations basically modernizing by building some kind of abstraction to their backend systems, modernizing it through cloud native, and then saying, hey, as you were saying Holger, run it anywhere you want, or connect to those cloud apps, or connect across clouds, connect to other on-prem apps, and eventually out to the edge. Is that what you see? >> It's so much easier said than done though. Organizations have struggled so much with this, especially as we start talking about those data intensive apps and workloads. Kubernetes and Hadoop? Up until now, organizations haven't been able to deploy those services. So, what we're offering as part of these GreenLake unified analytics services is a Kubernetes runtime. It's not ours. It's top of branch open source. And open source operators like Apache Spark, bringing in Delta Lake libraries, so that if your developer does want to use cloud native tools to build those next generation advanced analytics applications, but prod is still on-premises, they should just be able to pick that code up, and because we are deploying 100% open-source frameworks, the code should run as is. >> So, it seems like the strategy is to basically build, now that's what GreenLake is, right? It's a cloud. 
It's like, hey, here's your options, use whatever you want. >> Well, and it's your cloud. That's what's so important about GreenLake: it's your cloud, in your data center or co-lo, with your data, your tools, and your code. And again, we know that organizations are going to go to a multi or hybrid cloud location, and through our management capabilities we can reach out. If you don't want us to control those, that's okay, but we should at least be able to monitor and audit the data that sits in those other locations, the applications that are running. Maybe I register your GKE cluster. I don't manage it, but at least through a central pane of glass, I can tell the Head of Applications what that person's utilization is across these environments. >> You know, and you said something, Matt, that resonated with me, which is this is not trivial. I mean, it's not simple to do. I mean, what you see with a lot of customers or companies, what they're doing, vendors, they'll wrap their stack in Kubernetes, shove it in the cloud, and it's essentially a hosted stack, right? And you're kind of taking a different approach. You're saying, hey, we're essentially building a cloud that's going to connect all these estates. And the key is you're going to have to keep, and you are, I think that's probably part of the reason why we're here, announcing stuff very quickly. A lot of innovation has to come out to satisfy that demand that you're essentially talking about. >> Because we've oversimplified things with containers, right? Because containers don't have what matters for data, and what matters for the enterprise, which is persistence, right? I have to be able to turn my systems down, or I don't know when I'm going to use that data, but it has to stay there. And that's not solved in the container world by itself. 
And that's what's coming now, the heavy lifting is done by people like HPE, to provide that persistence of the data across the different deployment platforms. And then, there's just a need to modernize my on-premise platforms. Right? I can't run on a server which is two, three years old, right? It's no longer safe, it doesn't have trusted identity, all the good stuff that you need these days, right? It cannot be operated remotely, or whatever happens there, and two, three years is long enough for a server to have run its course, right? >> Well, you're a software guy, you hate hardware anyway, so just abstract that hardware complexity away from you. >> Hardware is the necessary evil, right? It's like TSA. I want to go somewhere, but I have to go through TSA. >> But that's a key point, let me buy a service, if I need compute, give it to me. And if I don't, I don't want to hear about it, right? And that's kind of the direction that you're headed. >> That's right. >> Holger: That's what you're offering. >> That's right, and specifically the services. So GreenLake's been offering infrastructure, virtual machines, IaaS, as a service. And we want to stop talking about that underlying capability because it's a dial tone now. What organizations and these developers want is the service. Give me a service or a function, like I get in the cloud, but I need to get going today. I need it within my security parameters, access to my data, my tools, so I can get going as quickly as possible. And then beyond that, we're going to give you those cloud billing practices. Because, just because you're deploying a cloud native service, if it's still being deployed via CapEx, you're not solving a lot of problems. So we also need to have that cloud billing model. >> Great. Well Holger, we'll give you the last word, bring us home. >> It's very interesting to have the cloud qualities of subscription-based pricing maintained by HPE as the cloud vendor from somewhere else.
And that gives you that flexibility. And that's very important because data is essential to enterprise processes. And there's three reasons why data doesn't go to the cloud, right? We know that. It's privacy residency requirement, there is no cloud infrastructure in the country. It's performance, because network latency plays a role, right? Especially for critical apps. And then there's not invented here, right? Remember Charles Phillips saying, how old is the CIO? Then I know if they're going to go to the cloud or not, right? So, it was not invented here. These are the things which keep data on-premise. You know that load, and HPE is coming on with a very interesting offering. >> It's physics, it's laws, it's politics, and sometimes it's cost, right? Sometimes it's too expensive to move and migrate. Guys, thanks so much. Great to see you both. >> Matt: Dave, it's always a pleasure. >> All right, and thank you for watching theCUBE's continuous coverage of HPE's big GreenLake announcements. Keep it right there for more great content. (calm music begins)

Published Date : Sep 28 2021

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Frank Slootman	PERSON	0.99+
Matt	PERSON	0.99+
Matt Maccaux	PERSON	0.99+
Holger	PERSON	0.99+
Dave	PERSON	0.99+
Holger Mueller	PERSON	0.99+
two	QUANTITY	0.99+
100%	QUANTITY	0.99+
Charles Phillips	PERSON	0.99+
Constellation Research	ORGANIZATION	0.99+
HPE	ORGANIZATION	0.99+
Excel	TITLE	0.99+
HP	ORGANIZATION	0.99+
today	DATE	0.99+
three years	QUANTITY	0.99+
GreenLake	ORGANIZATION	0.99+
three reasons	QUANTITY	0.99+
Today	DATE	0.99+
Google	ORGANIZATION	0.99+
two balls	QUANTITY	0.98+
first	QUANTITY	0.98+
Oracle	ORGANIZATION	0.98+
10 years ago	DATE	0.98+
Ezmeral	ORGANIZATION	0.98+
both worlds	QUANTITY	0.98+
first time	QUANTITY	0.98+
S3	TITLE	0.98+
One trend	QUANTITY	0.98+
GreenLake Services	ORGANIZATION	0.98+
first one	QUANTITY	0.98+
Snowflake	TITLE	0.97+
both places	QUANTITY	0.97+
Kubernetes	TITLE	0.97+
once	QUANTITY	0.96+
both	QUANTITY	0.96+
two trends	QUANTITY	0.96+
Delta Lake	TITLE	0.95+
Google	TITLE	0.94+
Hadoop	TITLE	0.94+
CapEx	ORGANIZATION	0.93+
Tableaux	TITLE	0.93+
Azure	TITLE	0.92+
GKE	ORGANIZATION	0.92+
Cubes	ORGANIZATION	0.92+
Unix	TITLE	0.92+
one single location	QUANTITY	0.91+
single view	QUANTITY	0.9+
Spark	TITLE	0.86+
Apache	ORGANIZATION	0.85+
pandemic	EVENT	0.82+
Hadoop	ORGANIZATION	0.81+
three years old	QUANTITY	0.8+
single	QUANTITY	0.8+
Kubernetes	ORGANIZATION	0.74+
big waves	EVENT	0.73+
Apache Spark	ORGANIZATION	0.71+
Number two	QUANTITY	0.69+

Maria Colgan & Gerald Venzl, Oracle | June CUBEconversation

(upbeat music) >> Developers have become the new kingmakers in the world of digital and cloud. The rise of containers and microservices has accelerated the transition to cloud native applications. A lot of people will talk about application architecture and the related paradigms and the benefits they bring for the process of writing and delivering new apps. But a major challenge continues to be the how and the what when it comes to accessing, processing and getting insights from the massive amounts of data that we have to deal with in today's world.
And with me are two experts from the data management world who will share with us how they think about the best techniques and practices based on what they see at large organizations who are working with data and developing so-called data-driven apps. Please welcome Maria Colgan and Gerald Venzl, two distinguished product managers from Oracle. Folks, welcome, thanks so much for coming on. >> Thanks for having us Dave. >> Thank you very much for having us. >> Okay, Maria let's start with you. So, we throw around this term data-driven, data-driven applications. What are we really talking about there? >> So data-driven applications are applications that work on a diverse set of data. So anything from spatial to sensor data, document data as well as your usual transaction processing data. And what they're going to do is they'll generate value from that data in very different ways to a traditional application. So for example, they may use machine learning to be able to do product recommendations in the middle of a transaction. Or we could use graph to be able to identify an influencer within the community so we can target them with a specific promotion. It could also use spatial data to be able to help find the nearest stores to a particular customer. And because these apps are deployed on multiple platforms, everything from mobile devices as well as standard browsers, they need a data platform that's going to be secure, reliable and scalable. >> Well, so when you think about how the workloads are shifting I mean, we're not talking about, you know it's not anymore a world of just your ERP or your HCM or your CRM, you know kind of the traditional operational systems. You really are seeing an explosion of these new data oriented apps. You're seeing, you know, modeling in the cloud, you are going to see more and more inferencing, inferencing at the edge.
But Maria maybe you could talk a little bit about sort of the benefits that customers are seeing from developing these types of applications. I mean, why should people care about data-driven apps? >> Oh, for sure, there's massive benefits to them. I mean, probably the most obvious one for any business regardless of the industry, is that they not only allow you to understand what your customers are up to, but they allow you to be able to anticipate those customer's needs. So that helps businesses maintain that competitive edge and retain their customers. But it also helps them make data-driven decisions in real time based on actual data rather than on somebody's gut feeling or basing those decisions on historical data. So for example, you can do real-time price adjustments on products based on demand and so forth, that kind of thing. So it really changes the way people do business today. >> So Gerald, you think about the narrative in the industry everybody wants to be a platform player all your customers they are becoming software companies, they are becoming platform players. Everybody wants to be like, you know name a company that is huge trillion dollar market cap or whatever, and those are data-driven companies. And so it would seem to me that data-driven applications, there's nobody, no company really shouldn't be data-driven. Do you buy that? >> Yeah, absolutely. I mean, data-driven, and that naturally the whole industry is data-driven, right? It's like we all have information technologies about processing data and deriving information out of it. But when it comes to app development I think there is a big push to kind of like we have to do machine learning in our applications, we have to get insights from data. And when you actually look back a bit and take a step back, you see that there's of course many different kinds of applications out there as well that's not to be forgotten, right? 
So there are the usual front-end user interfaces where really all the application does is enter some piece of information that's stored somewhere, or perhaps a microservice that's not attached to a data tier at all but just receives or makes calls (indistinct). So I think it's not necessarily so important for every developer to kind of go on a bandwagon that they have to be data-driven. But I think it's equally important for those applications and those developers that build applications that drive the business, that make business critical decisions as Maria mentioned before. Those guys should take really a close look into what data-driven apps means and what the data tier can actually give to them. Because what we see also happening a lot is that a lot of the things that are well known and out there just ready to use are being reimplemented in the applications. And for those applications, they essentially just end up spending more time writing code that is already there and then have to maintain and debug the code as well rather than just going to market faster. >> Gerald can you talk to the prevailing approaches that developers take to build data-driven applications? What are the ones that you see? Let's dig into that a little bit more and maybe differentiate the different approaches and talk about that? >> Yeah, absolutely. I think right now the industry is like in two camps, it's like sort of a religious war going on that you'll see often happening with different architectures and so forth going on. So we have single purpose databases or data management technologies. Which are technologies that are, as the name suggests, built around a single purpose. So it's like, you know a typical example would be your ordinary key-value store. And a key-value store all it does is it allows you to store and retrieve a piece of data whatever that may be really, really fast but it doesn't really go beyond that.
And then the other side of the house or the other camp would be multi-model databases, multi-model data management technologies. Those are technologies that allow you to store different types of data, different formats of data in the same technology in the same system alongside. And, you know, when you look at the geographics out there of what we have from technology, pretty much any relational database or any database really has evolved into such a multi-model database. Whether that's MySQL that allows you to store JSON alongside relational or even a MongoDB that gives you native graph support since (mumbles) as well, alongside the JSON support. >> Well, it's clearly a trend in the industry. We've talked about this a lot in theCUBE. We know where Oracle stands on this. I mean, you just mentioned MySQL but I mean, Oracle Database, you've been extending, you've mentioned JSON, we've got blockchain now in there, you're infusing, you know ML and AI into the database, graph database capabilities, you know on and on and on. We talked a lot about, we compared that to Amazon which is kind of the right tool, the right job approach. So maybe you could talk about, you know, your point of view, the benefits for developers of using that converged database, if I can use that word, approach, being able to store multiple data formats? Why do you feel like that's a better approach? >> Yeah, I think on a high level it comes down to complexity. You are actually avoiding additional complexity, right? So not every use case that you have necessarily warrants having yet another data management technology or yet another special-built technology for managing that data, right? It's like many use cases that we see out there happily want to just store a JSON document, a piece of JSON in a database, and then perhaps retrieve it again afterwards, so write some simple queries over it.
And you really don't have to get a new database technology or a NoSQL database into the mix if you already have some to just fulfill that exact use case. You could just happily store that information as well in the database you already have. And what it really comes down to is the learning curve for developers, right? So it's like, as you use the same technology to store other types of data, you don't have to learn a new technology, you don't have to familiarize yourself with and learn new drivers. You don't have to find new frameworks and you don't have to know how to necessarily operate or best model your data for that database. You can essentially just reuse your knowledge of the technology as well as the libraries and code you have already built in house perhaps in another application, perhaps, you know a framework that you used against the same technology because it is still the same technology. So, it kind of all comes down again to avoiding complexity rather than fragmenting, you know, across the many different technologies we have. If you were to look at the different data formats that are out there today it's like, you know, you would end up with many different databases just to store them if you were to fully religiously follow the single purpose best built technology for every use case paradigm, right? And then you would just end up having to manage many different databases more than actually focusing on your app and getting value to your business or to your user. >> Okay, so I get that and I buy that by the way. I mean, especially if you're a larger organization and you've got all these projects going on but before we go back to Maria, Gerald, I want to just, I want to push on that a little bit. Because the counter to that argument would be in the analogy. And I wonder if you, I'd love for you to, you know knock this analogy off the blocks. The counter would be okay, Oracle is the Swiss Army knife and it's got, you know, all in one.
But sometimes I need that specialized long screwdriver and I go into my toolbox and I grab that. It's better than the screwdriver in my Swiss Army knife. Why, are you the Swiss Army knife of databases? Or are you the all-in-one that has that best of breed screwdriver for me? How do you think about that? >> Yeah, that's a fantastic question, right? And I think first of all, you have to separate between Oracle the company, that has actually multiple data management technologies and databases out there as you said before, right? And Oracle Database. And I think Oracle Database is definitely a Swiss Army knife that has gained many capabilities over the last 40 years, you know, we've seen object support coming that's still in the Oracle Database today. We have seen XML coming, it's still in the Oracle Database, graph, spatial, et cetera. And so you have many different ways of managing your data and then on top of that, going into the converged part, not only do we allow you to store the different data models in there but we actually allow you also to apply all the security policies and so forth on top of it, something Maria can talk more about, the mission around converged database. I would also argue though that for some aspects, we do actually have that screwdriver that you talked about as well. So especially in the relational world people get very quickly hung up on this idea that, oh, if you only do rows and columns, well, that's kind of what you put down on disk. And that was never true, the relational model is actually a logical model. What's actually being put down on disk is blocks that align themselves nicely with block storage and always has been. So that allows you to actually model and process the data sort of differently.
And one common example or one good example that we have that we introduced a couple of years ago was when columnar databases were very strong and you know, the competition came and it's like, yeah, we have in-memory column stores now, they're so much better. And we were like, well, orienting the data row-based or column-based really doesn't matter in the sense that we store them as blocks on disks. And so we introduced the in-memory technology which gives you an in-memory columnar representation of your data as well alongside your relational. So there is an example where you go like, well, actually you know, if you have this use case of columnar analytics all in-memory, I would argue Oracle Database is also that screwdriver you want to go down to and gives you that capability. Because it not only gives you the representation in columnar, but also, which many people then forget, all the analytic power on top of SQL. It's one thing to store your data columnar, it's a completely different story to actually be able to run analytics on top of that and having all the built-in functionalities and stuff that you want to do with the data on top of it as you analyze it. >> You know, that's a great example, the columnar, 'cause I remember there was like a lot of hype around it. Oh, it's the Oracle killer, you know, Vertica. Vertica is still around but, you know it never really hit escape velocity. But you know, good product, good company, whatever. Netezza, it kind of got buried inside of IBM. ParAccel kind of became, you know, Redshift with that deal so that kind of went away. Teradata bought a company, I forget which company it bought but. So that hype kind of dissipated and now it's like, oh yeah, columnar. It's kind of like in-memory, we've had in-memory databases ever since we've had databases you know, it's kind of a feature not a sector. But anyway, Maria, let's come back to you. You've got a lot of customer experience.
And you speak with a lot of companies, you know during your time at Oracle. What else are you seeing in terms of the benefits to this approach that might not be so intuitive and obvious right away? >> I think one of the biggest benefits to having a multimodel multiworkload or as we call it a converged database, is the fact that you can get greater data synergy from it. In other words, you can utilize all these different techniques and data models to get better value out of that data. So things like being able to do real-time machine learning, fraud detection inside a transaction or being able to do a product recommendation by accessing three different data models. So for example, if I'm trying to recommend a product for you Dave, I might use graph analytics to be able to figure out your community. Not just your friends, but other people on our system who look and behave just like you. Once I know that community then I can go over and see what products they bought by looking up our product catalog which may be stored as JSON. And then on top of that I can then see using the key-value what products inside that catalog those community members gave a five star rating to. So that way I can really pinpoint the right product for you. And I can do all of that in one transaction inside the database without having to transform that data into different models or God forbid, access different systems to be able to get all of that information. So it really simplifies how we can generate that value from the data. And of course, the other thing our customers love is when it comes to deploying data-driven apps, when you do it on a converged database it's much simpler because it is that standard data platform. So you're not having to manage multiple independent single purpose databases. You're not having to implement the security and the high availability policies, you know across a bunch of different diverse platforms. 
All of that can be done much simpler with a converged database 'cause the DBA team of course, is going to just use that standard set of tools to manage, monitor and secure those systems. >> Thank you for that. And you know, it's interesting, you talk about simplification and you are in Juan's organization so you've big focus on mission critical. And so one of the things that I think is often overlooked well, we talk about all the time is recovery. And if things are simpler, recovery is faster and easier. And so it's kind of the hallmark of Oracle is like the gold standard of the toughest apps, the most mission critical apps. But I wanted to get to the cloud Maria. So because everything is going to the cloud, right? Not all workloads are going to the cloud but everybody is talking about the cloud. Everybody has cloud first mentality and so yes, it's a hybrid world. But the natural next question is how do you think the cloud fits into this world of data-driven apps? >> I think just like any app that you're developing, the cloud helps to accelerate that development. And of course the deployment of these data-driven applications. 'Cause if you think about it, the developer is instantly able to provision a converged database that Oracle will automatically manage and look after for them. But what's great about doing something like that if you use like our autonomous database service is that it comes in different flavors. So you can get autonomous transaction processing, data warehousing or autonomous JSON so that the developer is going to get a database that's been optimized for their specific use case, whatever they are trying to solve. And it's also going to contain all of that great functionality and capabilities that we've been talking about. 
So what that really means to the developer though is as the project evolves and inevitably the business needs change a little, there's no need to panic when one of those changes comes in because your converged database or your autonomous database has all of those additional capabilities. So you can simply utilize those to be able to address those evolving changes in the project. 'Cause let's face it, none of us normally know exactly what we need to build right at the very beginning. And on top of that they also kind of get a built-in buddy in the cloud, especially in the autonomous database. And that buddy comes in the form of built-in workload optimizations. So with the autonomous database we do things like automatic indexing where we're using machine learning to be that buddy for the developer. So what it'll do is it'll monitor the workload and see what kind of queries are being run on that system. And then it will actually determine if there are indexes that should be built to help improve the performance of that application. And not only does it build those indexes but it verifies that they help improve the performance before publishing them to the application. So by the time the developer is finished with that app and it's ready to be deployed, it's actually also been optimized by the developer's buddy, the Oracle autonomous database. So, you know, it's a really nice helping hand for developers when they're building any app especially data-driven apps. >> I like how you sort of gave us, you know the truth here is you don't always know where you're going when you're building an app. It's like it goes from you are trying to build it and they will come to start building it and we'll figure out where it's going to go. With Agile that's kind of how it works. But so I wonder, can you give some examples of maybe customers or maybe genericize them if you need to.
Data-driven apps in the cloud where customers were able to drive more efficiency, where the cloud buddy allowed the customers to do more with less? >> No, we have tons of these but I'll try and keep it to just a couple. One that comes to mind straight away is Retraced. These folks built a blockchain app in the Oracle Cloud that allows manufacturers to actually share the supply chain with the consumer. So the consumer can see exactly, who made their product? Using what raw materials? Where they were sourced from? How it was done? All of that is visible to the consumer. And in order to be able to share that they had to work on a very diverse set of data. So they had everything from JSON documents to images as well as your traditional transactions in there. And they store all of that information inside the Oracle autonomous database, they were able to build their app and deploy it on the cloud. And they were able to do all of that very, very quickly. So, you know, that ability to work on multiple different data types in a single database really helped them build that product and get it to market in a very short amount of time. Another customer that's doing something really, really interesting is MineSense. So these guys operate the largest mines in Canada, Chile, and Peru. But what they do is they put these x-ray devices on the massive mechanical shovels that are at the cove or at the mine face. And what that does is it senses the contents of the buckets inside these mining machines. And it's looking at that content to see how it can optimize the processing of the ore inside that bucket. So they're looking to minimize the amount of power and water that it's going to take to process that. And also of course, minimize the amount of waste that's going to come out of that project. So all of that sensor data is sent into an autonomous database where it's going to be processed by a whole host of different users.
So everything from the mine engineers to the geo scientists, to even their own data scientists utilize that data to drive their business forward. And what I love about these guys is they're not happy with building just one app. MineSense actually use our built-in low-code development environment, APEX, that comes as part of the autonomous database and they actually produce applications constantly for different aspects of their business using that technology. And it's actually able to accelerate those new apps to the business. It takes them now just a couple of days or weeks to produce an app instead of months or years to build those new apps. >> Great, thank you for that Maria. Gerald, I'm going to push you again. So, I said upfront and talked about microservices and the cloud and containers and you know, anybody in the developer space follows that very closely. But some of the things that we've been talking about here people might look at that and say, well, they're kind of antithetical to microservices. This is Oracle's monolithic approach. But when you think about the benefits of microservices, people want freedom of choice, technology choice, seen as a big advantage of microservices and containers. How do you address such an argument? >> Yeah, that's an excellent question and I get that quite often. With the microservices architecture in general, as I said before with architectures, Linux distributions, et cetera, it's kind of always a bit like there's an academic approach and there's a pragmatic approach. And when you look at the microservices, the original definitions that came out in the early 2010s, they actually never said that each microservice has to have a database. And they also never said that if a microservice has a database, you have to use a different technology for each microservice. Just like they never said, you have to write a microservice in a different programming language, right?
So where I'm going with this is like, yes you know, sometimes when you look at some vendors out there, some niche players, they push this message or they jump on this academic approach of like each microservice has the best tool at hand or I'd use a different database for your purpose, et cetera. Which most often comes across like, you know, we want to stay part of the conversation. Nothing stops a developer from, you know using a multi-model database for the microservice and just using that as a document store, right? Or just using that as a relational database. And, you know, sometimes I mean, it was actually something that happened that was really interesting yesterday, I don't know whether you followed it, Dave, or not. But Facebook had an outage yesterday, right? And Facebook is one of those companies that are seen as the Silicon Valley, you know, know how to do microservices companies. And when you read through the outage, well, what happened, right? Some unfortunate logical error with configuration and so forth that took a database cluster down. So, you know, there you have it where you go like, well, maybe not every microservice is actually in fact talking to its own database or its own special purpose database. I think, you know, rather than focusing so much on this argument of which technology to use, what's the right tool for a job, the industry should more ask itself, what business problem are we actually trying to solve? And therefore what's the right approach and the right technology for this. And so therefore, just as I said before, you know multi-model databases, they do have strong benefits. They have many built-in functionalities that are already there and they allow you to reduce this complexity of having to know many different technologies, right?
And so it's not only about storing different data models, where, you know, you treat a multimodal database as a JSON document store or a relational database, and most databases have been multimodal for 20-plus years. It's also that if you store that data together, you can perhaps actually derive additional value for somebody else, even if not for your application. For example, if you were to use Oracle Database you can actually write queries on top of all of that data. It doesn't really matter for our query engine whether the data is formatted in JSON or the data is formatted in rows and columns, you can just run a query over it. And that's actually very powerful for those guys that have to, you know, get the reporting done at the end of the day, the end of the week. And for those guys that are the data scientists that want to figure out, you know, which product performed really well or can we tweak something here and there. When you look into that space you still see a huge divergence between the guys that put data in, kind of the OLTP style, and the guys that try to derive new insights. And there's still a lot of ETL going around and, you know, we have big data technologies, some of which came and went and some of which came and are still around, like Apache Spark, which is still like a SQL engine on top of any of your data, kind of going back to the same concept. And so I will say that, you know, for developers when we look at microservices it's like, first of all, is the argument you are making because the vendor or the technology you want to use tells you this argument or, you know, you kind of want to have an argument to use a specific technology? Or is it really more because it is the best technology to use for this given use case, for this given application that you have? And if so there's of course also nothing wrong with using a single purpose technology either, right?
>> Yeah, I mean, whenever I talk about Oracle I always come back to the most important applications, the mission critical. It's very difficult to architect databases with microservices and containers. You have to be really, really careful. And so again, it comes back to what we were talking about before with Maria, the complexity and the recovery. But Gerald I want to stay with you for a minute. So there's other data management technologies popping up out there. I mean, I've seen some people saying, okay just leave the data in an S3 bucket. We can query that, then we've got some magic sauce to do that. And so why are you optimistic about, you know, traditional database technology going forward? >> I would say because of the history of databases. So one thing that once struck me when I came to Oracle and then got to meet great people like Juan Loaiza and Andy Mendelsohn who had been here for a long, long time. I came to the realization that relational databases have been around for about 45 years now. And, you know, I was like, I'm too young to have been around then, right? So I was like, what else has been around for 45 years? Just in the tech stack that we have today, what does this look like? Well, Linux only came out in '93. Well, databases pre-date Linux by a lot. And as I started digging I saw a lot of technologies come and go, right? And you mentioned before the technologies, the data management systems that we had that came and went, like the columnar databases or XML databases, object databases. And even before relational databases, before Codd gave us the relational model, there were apparently these network stores, network databases, which to some extent look very similar to JSON documents. That was another way of storing data in a hierarchical format.
And, you know, when you then start actually reading the Codd paper and diving a little bit more into the relational model, there's I think one important crux in there that most of the industry keeps forgetting or hasn't been around to even know. And that is that when Codd created the relational model, he actually focused not so much on the application putting the data in, but on future users and applications still being able to make sense out of the data, right? And that's kind of like I said before, we had those network models, we had XML databases, you have JSON document stores. And the one thing that they all have in common is that the application that puts the data in decides the structure of the data. And that's all well and good if you are the developer writing the application. It can become really tricky when 10 years later you still want to look at that data and the application or the developer is no longer around, and then you go like, what does this all mean? Where is the structure defined? What is this attribute? What does it mean? How does it correlate to others? And the one thing that people tend to forget is that it's actually the data that's here to stay, not so much the applications that wrote it. Ideally, every company wants to store every single byte of data that they have because there might be future value in it. Economically it may not always make sense, but it's now much more feasible than just years ago. But if you could, why wouldn't you want to store all your data, right? And sometimes you actually have to store the data for seven years or whatever because the laws require you to. And so coming back then, you know, like 10 years from now and looking at the data and making sense of that data can actually become a lot more difficult and a lot more challenging than having to first figure out how we store this data for general use. And that kind of was what the relational model was all about.
We decompose the data structures into tables and columns with relationships amongst each other. So that therefore if somebody wants to, you know, typical example would be, well, you store some purchases from your web store, right? There's a customer attribute in it. There's some credit card payment information in it, there's some product information on what the customer bought. Well, in the relational model if you just want to figure out which products were sold on a given day or week, you just would query the payment and products tables to get the sense out of it. You don't need to touch the customer and so forth. And with the hierarchical model you have to first sit down and understand how is the structure, what is the customer? Where is the payment? You know, does the document start with the payment or does it start with the customer? Where do I find this information? And then in the very early days those databases even struggled with not having to scan all the documents to get the data out. So coming back to your question a bit, I apologize for going on here. But you know, it's like relational databases have been around for 45 years. I actually argue it's one of the most successful software technologies that we have out there when you look at the overall industry, right? 45 years, in IT terms it's like from a star being born to it going supernova. You have said it before that many technologies came and went, right? And one more really interesting example by the way is Hadoop and HDFS, right? They kind of gave us this additional promise of like, you know, the 2010s, like 2012, 2013, the hype of Hadoop and so forth and (mumbles) and HDFS. And people were just like, just put everything into HDFS and worry about the data later, right? And we can query it and map reduce it and whatever.
And we had customers actually coming to us and they were like, great, we have half a petabyte of data on an HDFS cluster and we have no clue what's stored in there. How do we figure this out? What are we going to do now? Now you had a big data cleansing problem. And so I think that is why databases and also data modeling is something that will not go away anytime soon. And I think databases and database technologies are here to stay for quite a while. Because many people don't think about what's happening to the data five years from now. And many of the niche players, and frankly even Amazon, you know, following with this single purpose thing, it's like, just use the right tool for the job for your application, right? Just put the data in there the way you want it. And it's like, okay, so you use technologies all over the place and then five years from now you have your data fragmented everywhere in different formats and, you know, inconsistencies, and, and, and. And when you come back to these data-driven, business critical decision applications, that is usually the worst case scenario you can have, right? Because now you need an army of people to actually do data cleansing. And it's not a coincidence that data science has become very, very popular in recent years as we kind of went on with this proliferation of different database or data management technologies, some of which are not even databases. But I think I'll leave it at that. >> It's an interesting talk track because you're right. I mean, no schema on write was alluring, but it definitely created some problems. It also created an entire, you know, you referenced the hyper specialized roles and the data cleansing component. I mean, maybe technology will eventually solve that problem but it hasn't, at least up to now. Okay, last question, Maria maybe you could start off and Gerald if you want to chime in as well it'd be great.
I mean, it's interesting to watch this industry when Oracle sort of won the top database mantle. I mean, I watched it, I saw it. It was, remember it was Informix and it was (indistinct) too and of course, Microsoft, you got to give them credit with SQL Server, but Oracle won the database wars. And then everything got kind of quiet for a while, database was sort of boring. And then it exploded, you know, all the, you know, NoSQL and the key-value stores and the cloud databases, and this is really a hot area now. And when we looked at Oracle we said, okay, Oracle, it's all about Oracle Database, but we've seen the kind of resurgence in MySQL, which everybody thought, you know, once Oracle bought Sun they were going to kill MySQL. But now we see you investing in HeatWave, TimesTen, we talked about In-Memory databases before. So where do those fit in, Maria, in the grand scheme? How should we think about Oracle's database portfolio? >> So there's lots of places where you'd use those different things. 'Cause just like any other industry there are going to be new and boutique use cases that are going to benefit from a more specialized product or single purpose product. So good examples off the top of my head of the kind of systems that would benefit from that would be things like a stock exchange system or a telephone exchange system. Both of those are latency critical transaction processing applications where they need microsecond response times. And that's going to exceed perhaps what you might normally get or deploy with a converged database. And so Oracle's TimesTen database, our In-Memory database, is perfect for those kinds of applications. But there's also a host of MySQL applications out there today and you said it yourself there Dave, HeatWave is a great place to provision and deploy those kinds of applications because it's going to run 100 times faster than AWS (mumbles).
So, you know, there really is a place in the market and in our customers' systems and the needs they have for all of these different members of our database family here at Oracle. >> Yeah, well, the internet is basically running on the LAMP stack so I don't see MySQL going away. All right Gerald, we'll give you the final word, bring us home. >> Oh, thank you very much. Yeah, I mean, as Maria said, I think it comes back to what we discussed before. There are obviously still needs for special technologies or different technologies than a relational database or multimodal database. Oracle has actually many more databases than people may first think of. Not only the three that we have already mentioned, but there's even Oracle's NoSQL database. And, you know, on a high level Oracle is a data management company, right? And we want to give our customers the best tools and the best technology to manage all of their data. And therefore there has to be a need, or there should be a part of the business, that also focuses on these highly specialized systems and these highly specialized technologies that address those use cases. And I think it makes perfect sense. It's like, you know, when the customer comes to Oracle they're not only getting this, take this one product, you know, and if you don't like it it's your problem, but actually you have choice, right? And choice allows you to make a decision based on what's best for you and not necessarily best for the vendor you're talking to. >> Well guys, really appreciate your time today and your insights. Maria, Gerald, thanks so much for coming on theCUBE. >> Thank you very much for having us. >> And thanks for watching this Cube conversation, this is Dave Vellante and we'll see you next time. (upbeat music)
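
A minimal sketch of the web-store example Gerald walks through above, assuming Python's built-in sqlite3 module as a stand-in for any relational database. The table names, column names, and sample rows are all hypothetical and chosen only for illustration; nothing here is Oracle-specific. It shows the point made in the conversation: with the data decomposed into tables, the "which products sold on a given day" question touches only payments and products, while the document-style version has to know how each order document is nested.

```python
import sqlite3

# In-memory database with hypothetical customers, products, and payments tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE payments (
    payment_id  INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    product_id  INTEGER REFERENCES products(product_id),
    paid_on     TEXT,
    amount      REAL
);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
conn.executemany("INSERT INTO products VALUES (?, ?)", [(10, "Keyboard"), (11, "Monitor")])
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?, ?)",
    [
        (100, 1, 10, "2021-06-23", 49.0),
        (101, 2, 11, "2021-06-23", 199.0),
        (102, 1, 11, "2021-06-24", 199.0),
    ],
)


def products_sold_on(day):
    """Relational version: joins payments and products, never reads customers."""
    return conn.execute(
        """
        SELECT p.name, COUNT(*)
        FROM payments AS pay
        JOIN products AS p ON p.product_id = pay.product_id
        WHERE pay.paid_on = ?
        GROUP BY p.name
        ORDER BY p.name
        """,
        (day,),
    ).fetchall()


# Document version: the writing application decided this nesting, so every
# reader has to know it and walk each order document to answer the question.
orders = [
    {"customer": {"name": "Ada"}, "payment": {"paid_on": "2021-06-23"},
     "items": [{"product": "Keyboard"}]},
    {"customer": {"name": "Lin"}, "payment": {"paid_on": "2021-06-23"},
     "items": [{"product": "Monitor"}]},
    {"customer": {"name": "Ada"}, "payment": {"paid_on": "2021-06-24"},
     "items": [{"product": "Monitor"}]},
]


def products_sold_on_docs(day):
    """Document version: scans every order and follows its nesting by hand."""
    counts = {}
    for doc in orders:
        if doc["payment"]["paid_on"] == day:
            for item in doc["items"]:
                counts[item["product"]] = counts.get(item["product"], 0) + 1
    return sorted(counts.items())
```

Both functions answer the same question for a given day. The difference is where the knowledge of the structure lives: in the schema for the relational version, and in the reading code for the document version.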

Published Date : Jun 24 2021


Ram Venkatesh, Hortonworks & Sudhir Hasbe, Google | DataWorks Summit 2018

>> Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2018. Brought to you by Hortonworks. >> We are wrapping up Day One of coverage of DataWorks here in San Jose, California on theCUBE. I'm your host, Rebecca Knight, along with my co-host, James Kobielus. We have two guests for this last segment of the day. We have Sudhir Hasbe, who is the director of product management at Google and Ram Venkatesh, who is VP of Engineering at Hortonworks. Ram, Sudhir, thanks so much for coming on the show. >> Thank you very much. >> Thank you. >> So, I want to start out by asking you about a joint announcement that was made earlier this morning about using some Hortonworks technology deployed onto Google Cloud. Tell our viewers more. >> Sure, so basically what we announced was support for the Hortonworks DataPlatform and Hortonworks DataFlow, HDP and HDF, running on top of the Google Cloud Platform. So this includes deep integration with Google's cloud storage connector layer as well as a certified distribution of HDP to run on the Google Cloud Platform. >> I think the key thing is a lot of our customers have been telling us they like the familiar environment of the Hortonworks distribution that they've been using on-premises and as they look at moving to cloud, like in GCP, Google Cloud, they want the similar, familiar environment. So, they want the choice to deploy on-premises or Google Cloud, but they want the familiarity of what they've already been using with Hortonworks products. So this announcement actually helps customers pick and choose whether they want to run the Hortonworks distribution on-premises, they want to do it in cloud, or they want to build this hybrid solution where the data can reside on-premises, can move to cloud, and build these common, hybrid architectures. So, that's what this does. >> So, HDP customers can store data in the Google Cloud.
They can execute ephemeral workloads, analytic workloads, machine learning in the Google Cloud. And there's some tie-in between Hortonworks's real-time or low latency or streaming capabilities from HDF in the Google Cloud. So, could you describe, at a full sort of detail level, the degrees of technical integration between your two offerings here? >> You want to take that? >> Sure, I'll handle that. So, essentially, deep in the heart of HDP, there's the HDFS layer that includes the Hadoop-compatible file system, which is a pluggable file system layer. So, what Google has done is they have provided an implementation of this API for the Google Cloud Storage Connector. So this is the GCS Connector. We've taken the connector and we've actually continued to refine it to work with our workloads and now Hortonworks is actually bundling, packaging, and making this connector available as part of HDP. >> So bilateral data movement between them? Bilateral workload movement? >> No, think of this as being very efficient when our workloads are running on top of GCP. When they need to get at data, they can get at data that is in the Google Cloud Storage buckets in a very, very efficient manner. So, since we have fairly deep expertise on workloads like Apache Hive and Apache Spark, we've actually done work in these workloads to make sure that they can run efficiently, not just on HDFS, but also on the cloud storage connector. This is a critical part of making sure that the architecture is actually optimized for the cloud. So, at our scale, as our customers are moving their workloads from on-premise to the cloud, it's not just functional parity, but they also need sort of the operational and the cost efficiency that they're looking for as they move to the cloud. So, to do that, we need to enable this fundamental disaggregated storage pattern. See, on-prem, the big win with Hadoop was we could bring the processing to where the data was.
In the cloud, we need to make sure that we work well when storage and compute are disaggregated and they're scaled elastically, independent of each other. So this is a fairly fundamental architectural change. We want to make sure that we enable this in a first-class manner. >> I think that's a key point, right. I think what cloud allows you to do is scale the storage and compute independently. And so, with storing data in Google Cloud Storage, you can scale that horizontally and then just leverage that as your storage layer. And the compute can independently scale by itself. And what this is allowing customers of HDP and HDF to do is store the data on GCP, on the cloud storage, and then just scale the compute side of it with HDP and HDF. >> So, if you'll indulge me to name another Hortonworks partner, just for a hypothetical. Let's say one of your customers is using IBM Data Science Experience to do TensorFlow modeling and training, can they then, inside of HDP on GCP, use the compute infrastructure inside of GCP to do the actual modeling, which is more compute intensive, and then the separate decoupled storage infrastructure to do the training, which is more storage intensive? Is that a capability that would be available to your customers? With this integration with Google? >> Yeah, so where we are going with this is we are saying, IBM DSX and other solutions that are built on top of HDP, they can transparently take advantage of the fact that they have HDP compute infrastructure to run against. So, you can run your machine learning training jobs, you can run your scoring jobs and you can have the same unmodified DSX experience whether you're running against an on-premise HDP environment or an in-cloud HDP environment. Further, that's sort of the benefit for partners and partner solutions.
From a customer standpoint, the big value prop here is that customers, they're used to securing and governing their data on-prem in their particular way with HDP, with Apache Ranger, Atlas, and so forth. So, when they move to the cloud, we want this experience to be seamless from a management standpoint. So, from a data management standpoint, we want all of their learning from a security and governance perspective to apply when they are running in Google Cloud as well. So, we've had this capability on Azure and on AWS, so with this partnership, we are announcing the same type of deep integration with GCP as well. >> So Hortonworks is that one pane of glass across all your product partners for all manner of jobs. Go ahead, Rebecca. >> Well, I just wanted to ask about, we've talked about the reason, the impetus for this. It's more familiar for customers, it offers the seamless experience. But, can you delve a little bit into the business problems that you're solving for customers here? >> A lot of times, our customers are at various points on their cloud journey. For some of them, it's very simple, they're like there's a broom coming by and the datacenter is going away in 12 months and I need to be in the cloud. So, this is where there is a wholesale movement of infrastructure from on-premise to the cloud. Others are exploring individual business use cases. So, for example, one of our large customers, a travel partner, they are exploring their new pricing model and they want to roll out this pricing model in the cloud. They have on-premise infrastructure, they know they have that for a while. They are spinning up new use cases in the cloud, typically for reasons of agility. So, typically many of our customers, they operate large, multi-tenant clusters on-prem. That's nice for, so a very scalable compute for running large jobs.
But, if you want to run, for example, a new version of Spark, you have to upgrade the entire cluster before you can do that. Whereas in this sort of model, what they can say is, they can bring up a new workload and just have the specific versions and dependencies that it needs, independent of all of their other infrastructure. So this gives them agility where they can move as fast as... >> Through the containerization of the Spark jobs or whatever. >> Correct, and so containerization as well as even spinning up an entire new environment. Because, in the cloud, given that you have access to elastic compute resources, they can come and go. So, your workloads are much more independent of the underlying cluster than they are on-premise. And this is where sort of the core business benefits around agility, speed of deployment, things like that come into play. >> And also, if you look at the total cost of ownership, really take an example where customers are collecting all this information through the month. And, at month end, you want to do closing of books. And so that's a great example where you want ephemeral workloads. So this is like do it once in a month, finish the books and close the books. That's a great scenario for cloud where you don't have to create an infrastructure on-premises and keep it ready. So that's one example where now, in the new partnership, you can collect all the data on-premises if you want throughout the month. But, move that and leverage cloud to go ahead and scale and do this workload and finish the books and all. That's one. The second example I can give is, a lot of customers, like they run their e-commerce platforms and all on-premises, let's say they're running it. They can still collect all these events through HDP that may be running on-premises with Kafka and then, what you can do is, in-cloud, in GCP, you can deploy HDP, HDF, and you can use the HDF from there for real-time stream processing.
So, collect all these clickstream events, use them, make decisions like, hey, which products are selling better? Should we go ahead and give? How many people are looking at that product? Or how many people have bought it? That kind of aggregation in real-time at scale, now you can do in-cloud and build these hybrid architectures that are there. And enable scenarios where in the past, to do that kind of stuff, you would have to procure hardware, deploy hardware, all of that. Which all goes away. In-cloud, you can do that much more flexibly and just use whatever capacity you have. >> Well, you know, ephemeral workloads are at the heart of what many enterprise data scientists do. Real-world experiments, ad-hoc experiments, with certain datasets. You build a TensorFlow model or maybe a model in Caffe or whatever and you deploy it out to a cluster and so the life of a data scientist is often nothing but a stream of new tasks that are all ephemeral in their own right but are part of an ongoing experimentation program that's, you know, they're building and testing assets that may or may not be deployed in the production applications. That's you know, so I can see a clear need for that, well, that capability of this announcement in lots of working data science shops in the business world. >> Absolutely. >> And I think coming down to, if you really look at the partnership, right. There are two or three key areas where it's going to have a huge advantage for our customers. One is analytics at-scale at a lower cost, like total cost of ownership, reducing that, running at-scale analytics. That's one of the big things. Again, as I said, the hybrid scenarios. Most customers, enterprise customers have huge deployments of infrastructure on-premises and that's not going to go away. Over a period of time, leveraging cloud is a priority for a lot of customers but they will be in these hybrid scenarios.
And what this partnership allows them to do is have these scenarios that can span across cloud and on-premises infrastructure that they are building and get business value out of all of these. And then, finally, we at Google believe that the world will be more and more real-time over a period of time. Like, we already are seeing a lot of these real-time scenarios with IoT events coming in and people making real-time decisions. And this is only going to grow. And this partnership also provides the whole streaming analytics capabilities in-cloud at-scale for customers to build these hybrid plus also real-time streaming scenarios with this package. >> Well it's clear from Google what the Hortonworks partnership gives you in this competitive space, in the multi-cloud space. It gives you that ability to support hybrid cloud scenarios. You're one of the premier public cloud providers, as we all know. And clearly now that you've got the Hortonworks partnership, you have that ability to support those kinds of highly hybridized deployments for your customers, many of whom I'm sure have those requirements. >> That's perfect, exactly right. >> Well a great note to end on. Thank you so much for coming on theCUBE. Sudhir, Ram, thank you so much. >> Thank you, thanks a lot. >> Thank you. >> I'm Rebecca Knight for James Kobielus, we will have more tomorrow from DataWorks. We will see you tomorrow. This is theCUBE signing off. >> From sunny San Jose. >> That's right.
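
A minimal sketch of the clickstream aggregation Sudhir describes above (how many people are looking at a product, how many have bought it), assuming plain Python in place of a real stream processor. The event fields (`ts`, `product`, `action`), the function name, and the one-minute window are hypothetical choices for illustration; this is not the HDF, Kafka, or any Hortonworks API, only the shape of the computation such a pipeline would run continuously.

```python
from collections import defaultdict


def aggregate_clicks(events, window_seconds=60):
    """Count actions per (window start, product) over fixed, non-overlapping windows."""
    totals = defaultdict(lambda: {"view": 0, "purchase": 0})
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        totals[(window_start, event["product"])][event["action"]] += 1
    return dict(totals)


# Hypothetical events; ts is seconds since some arbitrary start.
events = [
    {"ts": 5, "product": "shoes", "action": "view"},
    {"ts": 20, "product": "shoes", "action": "purchase"},
    {"ts": 42, "product": "hat", "action": "view"},
    {"ts": 61, "product": "shoes", "action": "view"},
]

summary = aggregate_clicks(events)
```

In a streaming deployment the same grouping would be applied incrementally as events arrive rather than over a finished list, and the results for each window would be emitted once that window closes.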

Published Date : Jun 20 2018


Sumit Gupta & Steven Eliuk, IBM | IBM CDO Summit Spring 2018

(music playing) >> Narrator: Live, from downtown San Francisco, it's theCUBE. Covering IBM Chief Data Officer Strategy Summit 2018. Brought to you by IBM. >> Welcome back to San Francisco everybody, we're at the Parc 55 in Union Square. My name is Dave Vellante, and you're watching theCUBE, the leader in live tech coverage, and this is our exclusive coverage of IBM's Chief Data Officer Strategy Summit. They hold these both in San Francisco and in Boston. It's an intimate event, about 150 Chief Data Officers really absorbing what IBM has done internally and IBM transferring knowledge to its clients. Steven Eliuk is here. He is one of those internal practitioners at IBM. He's the Vice President of Deep Learning in the Global Chief Data Office at IBM. We just heard from him and some of his strategies and use cases. He's joined by Sumit Gupta, a Cube alum, who is the Vice President of Machine Learning and Deep Learning within IBM's cognitive systems group. Sumit. >> Thank you. >> Good to see you, welcome back. Steven, let's get into it. So, I was paying close attention when Bob Picciano took over the cognitive systems group. I said, "Hmm, that's interesting". Recently a software guy, of course I know he's got some hardware expertise. But bringing in someone who's deep into software and machine learning, and deep learning, and AI, and cognitive systems into a systems organization. So you guys specifically set out to develop solutions to solve problems like Steven's trying to solve. Right, explain that. >> Yeah, so I think there's a revolution going on in the market, the computing market, where we have all these new machine learning and deep learning technologies that are having meaningful impact or promise of having meaningful impact. But these new technologies are actually significantly, I would say, complex and they require very complex and high performance computing systems.
You know, I think Bob, and I think in particular IBM, saw the opportunity and realized that we really need to architect a new class of infrastructure, both software and hardware, to address what data scientists like Steve are trying to do in the space, right? The open source software that's out there: TensorFlow, Caffe, Torch. These things are truly game changing. But they also require GPU accelerators. They also require multiple systems like... In fact, interestingly enough, you know, some of the supercomputers that we've been building for the scientific computing world, those same technologies are now coming into the AI world and the enterprise. >> So, the infrastructure for AI, if I can use that term? It's got to be flexible. Steven, we were sort of talking about that, elastic versus, I'm even extending it to plastic. As Sumit you just said, it's got to have that tooling, got to have that modern tooling, you've got to accommodate alternative processor capabilities, um, and so, that forms what you've used, Steven, to sort of create new capabilities, new business capabilities within IBM. I wanted to, we didn't touch upon this before, but we touched upon your data strategy before, but tie it back to the line of business. You essentially are, I presume, a liaison between the line of business and the chief data office >> Steven: Yeah. >> Officer office. How did that all work out, and shake out? Did you define the business outcomes, the requirements, how did you go about that? >> Well, actually, surprisingly, we have very few new use cases that we're generating internally from my organization. Because there's so many to pick from already throughout the organization, right? There's all these business units coming to us and saying, "Hey, now the data is in the data lake and now we know there's more data, now we want to do this. How do we do it?" You know, so that's where we come in, that's where we start touching and massaging and enabling them.
And that's the main efforts that we have. We do have some derivative works that have come out, that have been like new offerings that you'll see here. But mostly we already have so many use cases from those business units that we're really trying to heighten and bring extra value to those domains first. >> So, a lot of organizations, sounds like IBM was similar, you created the data lake, you know, things like Hadoop made it lower cost to just put stuff in the data lake. But then, it's like "okay, now what?" >> Steven: Yeah. >> So is that right? So you've got the data and this bog of data and you're trying to make more sense out of it but get more value out of it? >> Steven: Absolutely. >> That's what they were pushing you to do? >> Yeah, absolutely. And with that, with more data you need more computational power. And actually Sumit and I go pretty far back and I can tell you from my previous roles I highlighted to him many years ago some of the deficiencies in the current architecture in X86 etc. and I said, "If you hit these points, I will buy these products." And what they went back and they did is they addressed all of the issues that I had. Like there's certain issues... >> That's when you were, sorry to interrupt, that's when you were a customer, right? >> Steven: That's when I was... >> An external customer >> Outside. I'm still an internal customer, so I've always been a customer I guess in that role, right? >> Yep, yep. >> But, I need to get data to the computational device as quickly as possible. And with certain older gen technologies, like PCIe Gen3 and certain issues around, um, x86, I couldn't get that data there for, like, high fidelity imaging for autonomous vehicles, for, you know, high fidelity image analysis. But, with certain technologies in Power we have like NVLink, directly to the CPU. And we also have PCIe Gen4, right?
So, so these are big enablers for me so that I can really keep the utilization of those very expensive compute devices higher. Because they're not starved for data. >> And you've also put a lot of emphasis on IO, right? I mean that's... >> Yeah, you know, if I may break it down, right, there's actually, I would say, three different pieces to the puzzle here, right? The highest level, from Steve's perspective, from Steven's team's perspective or any data scientist's perspective, is they need to just do their data science and not worry about the infrastructure, right? They actually don't want to know that there's an infrastructure. They want to say, "launch job", right? That's the level of granularity we want, right? In the background, they want our schedulers, our software, our hardware to just seamlessly use either one system or scale to 100 systems, right? To use one GPU or to use 1,000 GPUs, right? So that's where our offerings come in, right. We went and built this offering called PowerAI, and PowerAI essentially is open source software like TensorFlow, like Caffe, like Torch, but with performance and capabilities added to make it much easier to use. So for example, we have an extremely terrific scheduling software that manages jobs called Spectrum Conductor for Spark. So as the name suggests, it uses Apache Spark. But again the data scientist doesn't know that. They say, "launch job". And the software actually goes and scales that job across tens of servers or hundreds of servers. The IT team can determine how many servers they're going to allocate for data scientists. They can have all kinds of user management, data management, model management software. We take the open source software, we package it. You know, surprisingly, uh, most people don't realize this, the open source software like TensorFlow has primarily been built on a (mumbles). And most of our enterprise clients, including Steven, are on Red Hat. So we, we engineered Red Hat to be able to manage TensorFlow.
And you know I chose those words carefully, there was a little bit of engineering both on Red Hat and on TensorFlow to make that whole thing work together. Sounds trivial, took several months, and huge value proposition to the enterprise clients. And then the last piece, I think that Steven was referencing too, is we're also trying to go and make AI more accessible for non-data scientists or I would say even data engineers. So we, for example, have a software called PowerAI Vision. This takes images and videos, and automatically creates a trained deep learning model for them, right. So we analyze the images, you of course have to tell us in these images, for these hundred images here are the most important things. For example, you've identified: here are people, here are cars, here are traffic signs. But if you give us some of that labeled data, we automatically do the work that a data scientist would have done, and create this pre-trained AI model for you. This really enables very rapid prototyping for a lot of clients who either can't afford to have data scientists or don't want to have data scientists. >> So just to summarize that, the three pieces: it's making it simpler for the data scientists, just run the job; um, the backend piece, which is the schedulers, the hardware, the software doing its thing; and then it's making that data science capability more accessible. >> Right, right, right. >> Those are the three layers. >> So you know, I'll resay it in my words maybe >> Yeah please. >> Ease of use, right, hardware software optimized for performance and capability, and point and click AI, right. AI for non-data scientists, right. It's like the three levels that I think of when I'm engaging with data scientists and clients. >> And essentially it's embedded AI, right? I've been making the point today that a lot of the AI is going to be purchased from companies like IBM, and I'm just going to apply it. I'm not going to try to go build my own, own AI, right?
I mean, is that... >> No absolutely. >> Is that the right way to think about it as a practitioner >> I think, I think we talked a little bit about it on the panel earlier, but if we can, if we can leverage these pre-built models and just apply a little bit of training data, it makes it so much easier for the organizations and so much cheaper. They don't have to invest in a crazy amount of infrastructure, all the labeling of data, they don't have to do that. So, I think it's definitely steering that way. It's going to take a little bit of time, we have some of them there. But as we, as we iterate, we are going to get more and more of these types of, you know, commodity type models that people could utilize. >> I'll give you an example, so we have a software called Intelligent Video Analytics at IBM. It's very good at taking any surveillance data and for example recognizing anomalies or, you know, if people aren't supposed to be in a zone. Uh, and we had a client who wanted to do worker safety compliance. So they want to make sure workers are wearing their safety jackets and their helmets when they're in a construction site. So we used surveillance data, created a new AI model using PowerAI Vision. We were then able to plug into this IVA, Intelligent Video Analytics software. So they have the nice GUI-based software for the dashboards and the alerts, yet we were able to do incremental training on their specific use case, which by the way, with their specific, you know, equipment and jackets and stuff like that. And create a new AI model, very quickly. For them to be able to apply and make sure their workers are actually compliant with all of the safety requirements they have on the construction site. >> Hmm, interesting. So when I, sometimes it's like a new form of CAPTCHA that says identify "all the pictures with bridges", right, that's the kind of thing you're capable to do with these video analytics. >> That's exactly right.
You, every, clients will have all kinds of uses. I was at a, talking to a client, who's a major car manufacturer in the world, and he was saying it would be great if I could identify the make and model of what cars people are driving into my dealership. Because I bet I can draw a, uh, correlation between what they drive in and what they're going to drive out of, right. Marketing insights, right. And, uh, so there's a lot of things that people want to do which would really be bespoke to their use cases. And build on top of existing AI models that we have already. >> And you mentioned X86 before. And not to start a food fight but um >> Steven: And we use both internally too, right. >> So let's talk about that a little bit, I mean where do you use X86, where do you use IBM Cognitive and Power Systems? >> I have a mix of both, >> Why, how do you decide? >> There's certain workloads I will delegate over to Power, just because, you know, they're data starved and we are noticing computation is being impacted by it. Um, but because we deal with so many different organizations, certain organizations optimize for X86 and some of them optimize for Power, and I can't pick, I have to have everything. Just like I mentioned earlier, I also have to support cloud and on-prem, I can't pick just to be on-prem, right, so. >> I imagine the big cloud providers are in the same boat, which I know some are your customers. You're betting on data, you're betting on digital, and it's a good bet. >> Steven: Yeah, 100 percent. >> We're betting on data and AI, right. So I think data, you got to do something with the data, right? And analytics and AI is what people are doing with that data. We have an advantage both at the hardware level and at the software level in these two, I would say, workloads or segments, which is data and AI, right. And we fundamentally have invested in the processor architecture to improve the performance and capabilities, right.
You could offer much larger AI models on a Power system that you use than you can on an X86 system that you use. Right, that's one advantage. You can train an AI model four times faster on a Power system than you can on an Intel-based system. So the clients who have a lot of data, who care about how fast their training runs, are the ones who are committing to Power Systems today. >> Mm-hmm. >> Latency requirements, things like that, really really big deal. >> So what that means for you as a practitioner is you can do more with less, or is it, I mean >> I can definitely do more with less, but the real value is that I'm able to get an outcome quicker. Everyone says, "Okay, you can just roll out more GPUs, more GPUs, but run more experiments, run more experiments". No, no, that's not actually it. I want to reduce the time for an experiment. Get it done as quickly as possible so I get that insight. 'Cause then what I can do, I can possibly cancel out a bunch of those jobs that are already running 'cause I already have the insight, knowing that that model is not doing anything. Alright, so it's very important to get the time down. Jeff Dean said it a few years ago, he uses the same slide often. But, you know, when things are taking months, you know, that's what happened basically from the 80's up until, you know, 2010. >> Right >> We didn't have the computation, we didn't have the data. Once we were able to get that experimentation time down, we're able to iterate very very quickly on this. >> And throwing GPUs at the problem doesn't solve it because it's too much complexity or? >> It helps the problem, there's no question. But when my GPU utilization goes from 95% down to 60%, you know, I'm getting only a two-thirds return on investment there. It's a really really big deal, yeah. >> Sumit: I mean the key here I think, Steven, and I'll draw it out again, is this time to insight. Because time to insight actually is time to dollars, right.
People are using AI either to make more money, right, by providing better customer products, better products to the customers, giving better recommendations. Or they're saving on their operational costs, right, they're improving their efficiencies. Maybe they're routing their trucks in the right way, they're routing their inventory in the right place, they're reducing the amount of inventory that they need. So in all cases you can actually correlate AI to a revenue outcome or a dollar outcome. So the faster you can do that, you know, I tell most people that I engage with, the hardware and software they get from us pays for itself very quickly. Because they make that much more money or they save that much more money, using Power Systems. >> We, we even see this internally. I've heard stories and all that, Sumit kind of commented on this, but there's actually sales people that take this software and hardware out and they're able to get an outcome sometimes in certain situations where they just take the client's data, and they're sales people, they're not data scientists, they train it, it's so simple to use, then they present the client with the outcomes the next day and the client is just like blown away. This isn't just a one time occurrence, like sales people are actually using this, right. So it's getting to the area that it's so simple to use, you're able to get those outcomes, that we're even seeing it, you know, deals close quicker. >> Yeah, that's powerful. And Sumit, to your point, the business case is actually really easy to make. You can say, "Okay, this initiative that you're driving, what's your forecast for how much revenue?" Now let's make an assumption for how much faster we're going to be able to deliver it. And if I can show them a one day turnaround on a corpus of data, okay, let's say two months times whatever, my time to break. I can run the business case very easily and communicate to the CFO or whomever, the line of business head, so. >> That's right.
I mean just, I was at a retailer, at a grocery store, a local grocery store in the Bay Area recently, and he was telling me how in California we've passed legislation that does not allow plastic bags anymore. You have to pay for it. So people are bringing their own bags. But that's actually increased theft for them. Because people bring their own bag, put stuff in it and walk out. And he wanted to have an analytics system that can detect if someone puts something in a bag and then did not buy it at purchase. So it's, in many ways they want to use the existing camera systems they have but automatically be able to detect fraudulent behavior or, you know, anomalies. And it's actually quite easy to do with a lot of the software we have around PowerAI Vision, around video analytics from IBM, right. And that's what we were talking about, right? Take existing trained AI models on vision and enhance them for your specific use case and the scenarios you're looking for. >> Excellent. Guys, we got to go. Thanks Steven, thanks Sumit for coming back on, and appreciate the insights. >> Thank you >> Glad to be here >> You're welcome. Alright, keep it right there everybody, we'll be back with our next guest. You're watching "The Cube" at IBM's CDO Strategy Summit from San Francisco. We'll be right back. (music playing)
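
Editor's note: the utilization remark in the conversation above (95% down to 60%, "only a two-thirds return") can be checked with simple arithmetic. The sketch below is only an illustration, not anything IBM described; the function name is hypothetical, and it assumes that useful work scales linearly with GPU utilization.

```python
def relative_return(actual_utilization: float, baseline_utilization: float) -> float:
    """Fraction of the baseline useful GPU work still delivered.

    Assumption (not from the interview): useful work is proportional
    to utilization, so the ratio of the two utilizations is the
    fraction of the original return that remains.
    """
    if not 0 < baseline_utilization <= 1:
        raise ValueError("baseline_utilization must be in (0, 1]")
    if not 0 <= actual_utilization <= 1:
        raise ValueError("actual_utilization must be in [0, 1]")
    return actual_utilization / baseline_utilization


# Figures quoted in the conversation: 95% falling to 60%.
ratio = relative_return(0.60, 0.95)
print(f"Return relative to baseline: {ratio:.2f}")
```

Under that assumption the ratio is 0.60 / 0.95, a little under two-thirds, which is consistent with the figure quoted.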

Published Date : May 1 2018

SUMMARY :

Brought to you by: IBM and the Global Chief Data Office at IBM. So you guys specifically set out to develop solutions and realized that we really need to architect between the line of business and the chief data office how did you go about that? And that's the main efforts that we have. to just put stuff in the data lake. and I can tell you from my previous roles so I've always been a customer I guess in that role right? so that I can really keep the utilization And you've also put a lot of emphasis on IO, right? That's the level of granularity we want, right? So just to summarize that, the three pieces: It's like the three levels that I think of a lot of the AI is going to be purchased about it on the panel earlier but if we can, and for example recognizing anomalies or you know that's the kind of thing you're capable to do And build on top of existing AI models that we have And not to start a food fight but um and I can't pick, I have to have everything. I imagine the big cloud providers are in the same boat and at the software level in these two I would say really really big deal. but the real value is that We didn't have the computation we didn't have the data. It helps the problem, there's no question. So the faster you can do that, you know, and they're able to get an outcome sometimes and communicate to the CFO or whomever and the scenarios you're looking for. appreciate the insights. with our next guest.

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Steven Eluk	PERSON	0.99+
Steve	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Bob Picciano	PERSON	0.99+
Steven	PERSON	0.99+
Sumit	PERSON	0.99+
Jeff Dean	PERSON	0.99+
Sumit Gupta	PERSON	0.99+
California	LOCATION	0.99+
Boston	LOCATION	0.99+
Bob	PERSON	0.99+
San Francisco	LOCATION	0.99+
Steven Eliuk	PERSON	0.99+
three pieces	QUANTITY	0.99+
100 systems	QUANTITY	0.99+
two months	QUANTITY	0.99+
100 percent	QUANTITY	0.99+
2010	DATE	0.99+
hundred images	QUANTITY	0.99+
1,000 GPUs	QUANTITY	0.99+
95%	QUANTITY	0.99+
The Cube	TITLE	0.99+
one GPU	QUANTITY	0.99+
two	QUANTITY	0.99+
60%	QUANTITY	0.99+
Denzoflo	ORGANIZATION	0.99+
one system	QUANTITY	0.99+
both	QUANTITY	0.99+
one	QUANTITY	0.99+
tens of servers	QUANTITY	0.99+
two-thirds	QUANTITY	0.99+
Parc 55	LOCATION	0.99+
one day	QUANTITY	0.98+
hundreds of servers	QUANTITY	0.98+
one time	QUANTITY	0.98+
X86	COMMERCIAL_ITEM	0.98+
IBM Cognitive	ORGANIZATION	0.98+
80's	DATE	0.98+
three levels	QUANTITY	0.98+
today	DATE	0.97+
Both	QUANTITY	0.97+
CDO Strategy Summit	EVENT	0.97+
Spark	TITLE	0.96+
one advantage	QUANTITY	0.96+
Spectrum Conductor	TITLE	0.96+
Torch	TITLE	0.96+
X86	TITLE	0.96+
Vice President	PERSON	0.95+
three different pieces	QUANTITY	0.95+
PCIe Gen4	COMMERCIAL_ITEM	0.94+
three layers	QUANTITY	0.94+
Union Square	LOCATION	0.93+
TensorFlow	TITLE	0.93+
Torch	ORGANIZATION	0.93+
PCIe Gen3	COMMERCIAL_ITEM	0.92+
Efi	TITLE	0.92+
Strategy Summit 2018	EVENT	0.9+

Holden Karau, Google | Flink Forward 2018

>> Narrator: Live from San Francisco, it's the Cube, covering Flink Forward, brought to you by Data Artisans. (tech music) >> Hi, this is George Gilbert, we're at Flink Forward, the user conference for the Apache Flink Community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very eminent guest, with a long pedigree, Holden Karau, formerly of IBM, and Apache Spark fame, putting Apache Spark and Python together. >> Yes. >> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors. >> Yeah. >> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced. >> Yes. >> So, Holden, tell us about the objective of putting these together. What type of use cases.... >> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark, with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot of people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job, to do similar data preparation for when they want to make real predictions.
And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way, that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models. >> Okay, so help us tie that back to, in this case, Flink. >> Yes. >> And also to clarify, that data prep.... My impression was data prep was a different activity. It was like design time and serving was run time. But you're saying that they can be better integrated? >> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time. >> So, by having common Beam API and the common provider underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving. >> Yes, very true. >> So, tell us what type of customers want to write to the Beam API and have that portability? >> Yeah, so that's a really good question. 
So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies, but they don't want to be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs. >> Does Spark Structured Streaming support the Beam API? >> So, right now, the Spark support for Beam is limited. It's in the old DStream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing. Because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see, is like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of cool things that are being done nowadays, are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python, across these different engines. Flink supports Python, but it's maybe not a first class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go. >> And to support Go. >> Which is a little out of left field.
>> So, would it be fair to say that the value of Beam, for potential Flink customers, they can work and start on Google Cloud Platform. They can start on one of several stream processors. They can move to another one later, and they also inherit the better language support, or bindings from the Beam API? >> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally. >> So, what type of customers do you see who have moved into production with machine learning? >> So the.... >> And the streaming pipelines? >> The biggest customer that's in production is obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify. >> Without the names, the sort of use cases and the general industry.... >> I don't want to get in trouble. >> Okay. >> I'm just going to ... sorry. >> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to MapReduce? >> Yes, so.... >> And is that a competitor then to Flink? >> I think Flink and Dataflow can be used in some of the same cases. But, I think they're more complementary. Flink is something you can run on-prem. You can run it in different vendors. And Dataflow is very much like, "I can run this on Google Cloud."
And part of the idea with Beam is to make it so that people who want to write Dataflow jobs but maybe want the flexibility to go back to something else later can still have that. Yeah, we could swap in Flink or Dataflow execution engines if we're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? >> George: Okay. >> It's probably one of those, sort of, friendly competitions. Where we both push each other to do better and add more features that the respective projects have. >> Okay, 30 second question. >> Cool. >> Do you see people building stream processing applications with machine learning as part of it to extend existing apps or for ground up new apps? >> Totally. I mostly see it as extending existing apps. This is obviously, possibly a bias, just for the people that I talk to. But, going ground up with both streaming and machine learning, at the same time, like, starting both of those projects fresh is a really big hurdle to get over. >> George: For skills. >> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been King for a long time and streaming is getting its day in the sun. So, we could start seeing people becoming more adventurous and doing both, at the same time. >> Holden, on that note, we'll have to call it a day. That was most informative. >> It's really good to see you again. >> Likewise. So this is George Gilbert.
We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)
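
Editor's note: the data-preparation point made in the conversation above (tokenizing and hashing must behave identically at training time and at serving time) can be illustrated with a small sketch. This is plain Python, not tf.Transform or the Beam API; every name in it (`NUM_BUCKETS`, `tokenize`, `hash_token`, `prepare`) is hypothetical and chosen only for illustration.

```python
import hashlib

NUM_BUCKETS = 1000  # hypothetical size of the hashed feature space


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def hash_token(token: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a token to a bucket with a hash that is stable across processes.

    Python's built-in hash() is salted per process for strings, so a
    training job and a serving job could disagree; hashlib avoids that.
    """
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def prepare(text: str) -> list[int]:
    """The single preparation function shared by training and serving."""
    return sorted({hash_token(token) for token in tokenize(text)})


# Training time: prepare a batch of records.
training_features = [prepare(r) for r in ["Flink runs streams", "Beam runs on Flink"]]

# Serving time: a raw, differently cased record goes through the same function.
serving_features = prepare("flink RUNS streams")
print(serving_features == training_features[0])
```

Because both paths call the same `prepare` function, there is no second copy of the logic to drift out of sync, which is the failure mode described in the interview. A framework such as tf.Transform on Beam aims at the same goal at pipeline scale, though its actual API differs from this sketch.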

Published Date : Apr 11 2018

SUMMARY :

Narrator: Live from San Francisco, it's the Cube, This is the second Flink Forward conference focused on the Beam API, which is an API And Holden has been working on integrating it So, Holden, tell us about the objective of the same preparation code available to use And also to clarify, that data prep.... I don't have to do that at serving time. and the common provider underneath it, in being able to switch streaming engines, And the Beam Python support is still a work in progress. or bindings from the Beam API? But if people are looking to play with it, I didn't have a chance to go through my customer list the sort of use cases and the general industry.... as their sort of strategic successor to MapReduce? And part of the idea with Beam is to make it so that and add more features that the respective projects have. at the same time, and streaming is getting its day in the sun. Holden, on that note, we'll have to call it a day. We're on the ground at Flink Forward,

ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
George	PERSON	0.99+
San Francisco	LOCATION	0.99+
IBM	ORGANIZATION	0.99+
Holden Karau	PERSON	0.99+
Data Artisans	ORGANIZATION	0.99+
Python	TITLE	0.99+
Java	TITLE	0.99+
Holden	PERSON	0.99+
Google	ORGANIZATION	0.99+
Spotify	ORGANIZATION	0.99+
both	QUANTITY	0.99+
two paths	QUANTITY	0.99+
TensorFlow	TITLE	0.99+
One	QUANTITY	0.99+
Spark	TITLE	0.99+
GitHub	ORGANIZATION	0.98+
today	DATE	0.98+
Dataflow	TITLE	0.97+
Flink	ORGANIZATION	0.97+
one variable	QUANTITY	0.97+
a day	QUANTITY	0.97+
Go	TITLE	0.97+
Flink Forward	EVENT	0.96+
Flink	TITLE	0.96+
30 second question	QUANTITY	0.96+
one place	QUANTITY	0.95+
Beam	TITLE	0.95+
second	QUANTITY	0.95+
Google Cloud	TITLE	0.94+
Apache	ORGANIZATION	0.94+
one group	QUANTITY	0.94+
one	QUANTITY	0.93+
this afternoon	DATE	0.9+
Dstream	TITLE	0.88+
2018	DATE	0.87+
first	QUANTITY	0.79+
Beam API	TITLE	0.75+
Beam	ORGANIZATION	0.74+
Apache Flink Community	ORGANIZATION	0.72+

Ken King & Sumit Gupta, IBM | IBM Think 2018

>> Narrator: Live from Las Vegas, it's the Cube, covering IBM Think 2018, brought to you by IBM. >> We're back at IBM Think 2018. You're watching the Cube, the leader in live tech coverage. My name is Dave Vellante and I'm here with my co-host, Peter Burris. Ken King is here; he's the general manager of OpenPOWER from IBM, and Sumit Gupta, PhD, who is the VP, HPC, AI, ML for IBM Cognitive. Gentlemen, welcome to the Cube >> Sumit: Thank you. >> Thank you for having us. >> So, really, guys, a pleasure. We had dinner last night, talked about Picciano who runs the OpenPOWER business, appreciate you guys comin' on, but, I got to ask you, Sumit, I'll start with you. OpenPOWER, Cognitive systems, a lot of people say, "Well, that's just the power system. "This is the old AIX business, it's just renaming it. "It's a branding thing.", what do you say? >> I think we had a fundamental strategy shift where we realized that AI was going to be the dominant workload moving into the future, and the systems that have been designed today or in the past are not the right systems for the AI future. So, we also believe that it's not just about silicon and even a single server. It's about the software, it's about thinking at the rack level and the data center level. So, fundamentally, Cognitive Systems is about co-designing hardware and software with an open ecosystem of partners who are innovating to maximize the data and AI support at a rack level. >> Somebody was talkin' to Steve Mills, probably about 10 years ago, and he said, "Listen, if you're going to compete with Intel, "you can copy them, that's not what we're going to do." You know, he didn't like the SPARC strategy. "We have a better strategy.", is what he said, and "Our strategy is, we're going to open it up, "we're going to try to get 10% of the market. "You know, we'll see if we can get there.", but, Ken, I wonder if you could sort of talk about, just from a high level, the strategy and maybe go into the segments.
>> Yeah, absolutely, so, yeah, you're absolutely right on the strategy. You know, we have completely opened up the architecture. Our focus on growth is around having an ecosystem and an open architecture so everybody can innovate on top of it effectively and everybody in the ecosystem can profit from it and gain good margins. So, that's the strategy, that's how we design the OpenPOWER ecosystem, but, you know, our segments, our core segments, AIX and Unix is still a core, very big core segment of ours. Unix itself is flat to declining, but AIX is continuing to take share in that segment through all the new innovations we're delivering. The other segments are all growth segments, high growth segments, whether it's SAP HANA, our cognitive infrastructure and modern data platform, or even what we're doing in the hyperscale data centers. Those are all significant growth opportunities for us, and those are all Linux based, and, so, that is really where a lot of the OpenPOWER initiatives are driving growth for us and leveraging the fact that, through that ecosystem, we're getting a lot of incremental innovation that's occurring and it's delivering competitive differentiation for our platform. I say for our platform, but that doesn't mean just for IBM, but for all the ecosystem partners as well, and a lot of that was on display on Monday when we had our OpenPOWER summit. >> So, to talk more about the OpenPOWER summit, what was that all about, who was there? Give us some stats on OpenPOWER and ecosystem. >> Yeah, absolutely. So, it was a good day, we're up to well over 300 members. We have over 50 different systems that are coming out in the market from IBM or our partners. Over 20 different manufacturers out there actually developing OpenPOWER systems.
A lot of announcements or a lot of statements that were made at the summit that we thought were extremely valuable, first of all, we got the number one server vendor in Europe, Atos, designing and developing P9, the number one in Japan, Hitachi, the number one in China, Inspur. We got top ODMs like Supermicro, Wistron, and others that are also developing their POWER9. We have a lot of different component providers on the new PCIe gen four, on the OpenCAPI capabilities, a lot of announcements made by a number of component partners and accelerator partners at the summit as well. The other thing I'm excited about is we have over 70 ISVs now on the platform, and a number of statements were made and announcements on Monday from people like MapD, Anaconda, H2O, Kinetica and others who are leveraging those innovations brought on the platform like NVLink and the coherency between GPU and CPU to do accelerated analytics and accelerated GPU database kind of capabilities, but the thing that had me the most excited on Monday were the end users. I've always said, and the analysts always ask me the questions of when are you going to start penetration in the market? When are you going to show that you've got a lot of end users deploying this? And there were a lot of statements by a lot of big players on Monday. Google was on stage and publicly said the IO was amazing, the memory bandwidth is amazing.
We are deploying Zaius, which is the POWER9 server, in our data centers and we're ready for scale, and it's now Google strong which is basically saying that this thing is hardened and ready for production, but we also (laughs) had a number of other significant ones, Tencent talkin' about deploying OpenPOWER, 30% better efficiency, 30% less server resources required, the cloud arm of Alibaba talkin' about how they're putting it on their X-Dragon, they have it in a pilot program, they're asking everybody to use it now so they can figure out how do they go into production. PayPal made statements about how they're using it for machine learning and deep learning to do fraud detection, and we even had Limelight, who is not as big a name, but >> CDN, yeah. >> They're a CDN tool provider to people like Netflix and others. They were talkin' about the great capability with the IO and the ability to reduce the buffering and improve the streaming for all these CDN providers out there. So, we were really excited about all those end users and all the things they're saying. That demonstrates the power of this ecosystem. >> Alright, so just to comment on the architecture and then, I want to get into the Cognitive piece. I mean, you guys did, years ago, little endian, recognizing you got to get software based to be compatible. You mentioned, Ken, bandwidth, IO bandwidth, CAPI stuff that you've done. So, there's a lot of incentives, especially for the big hyperscale guys, to be able to do more with less, but, to me, let's get into the AI, the Cognitive piece. Bob Picciano comes over from running a $15 billion analytics business, so, obviously, he's got some knowledge. He's bringin' in people like you with all these cool buzzwords in your title. So, talk a little bit about infrastructure for AI and why Power is the right platform.
>> Sure, so, I think we all recognize that the performance advantages and even power advantages that we were getting from Dennard scaling, also known as Moore's law, is over, right. So, people talk about the end of Moore's Law, and that's really the end of gaining processor performance with Dennard scaling and the Moore's Law. What we believe is that to continue to meet the performance needs of all of these new AI and data workloads, you need accelerators, and not just compute accelerators, you actually need accelerated networking. You need accelerated storage, you need high-density memory sitting very close to the compute power, and, if you really think about it, what's happened is, again, system view, right, we're not silicon view, we're looking at the system. The minute you start looking at the silicon you realize you want to get the data to where the compute is, or the compute where the data is. So, it all becomes about creating bigger pipelines, fatter pipelines, to move data around to get to the right compute piece. For example, we put much more emphasis on a much faster memory system to make sure we are getting data from the system memory to the CPU. >> Coherently. >> Coherently, that's the main memory. We put interfaces on POWER9 including NVLink, OpenCAPI, and PCIe gen four, and that enabled us to get that data either from the network to the system memory, or out back to the network, or to storage, or to accelerators like GPUs. We built and embedded these high-speed interconnects into POWER9, into the processor. Nvidia put NVLink into their GPU, and we've been working with partners like Xilinx and Mellanox on getting OpenCAPI onto their components. >> And we're seeing up to 10x for both memory bandwidth and IO over x86 which is significant. You should talk about how we're seeing up to 4x improvement in training of ML/DL algorithms over x86 which is dramatic in how quickly you can get from data to insight, right?
You could take training and turn it from weeks to days, or days to hours, or even hours to minutes, and that makes a huge difference in what you can do in any industry as far as getting insight out of your data which is the competitive differentiator in today's environment. >> Let's talk about this notion of architecture, or systems especially. The basic platform for how we've been building systems has been relatively consistent for a long time. The basic approach to how we think about building systems has been relatively consistent. You start with the database manager, you run it on an Intel processor, you build your application, you scale it up based on SMP needs. There's been some variations; we're going into clustering, because we do some other things, but you guys are talking about something fundamentally different, and flash memory, the ability to do flash storage, which dramatically changes the relationship between the processor and the data, means that we're not going to see all of the organization of the workloads around the server, see how much we can do in it. It's really going to be much more of a balanced approach. How is Power going to provide that more balanced systems approach as we distribute data, as we distribute processing, as we create a cloud experience that isn't in one place, but is in more places? >> Well, this ties exactly to the point I made around it's not just accelerated compute, which we've all talked about a lot over the years, it's also about accelerated storage, accelerated networking, and accelerated memories, right. This is really, the point being, that the compute, if you don't have a fast pipeline into the processor from all of this wonderful storage and flash technology, there's going to be a choke point in the network, or there'll be a choke point once the data gets to the server, you're choked then.
So, a lot of our focus has been, first of all, partnering with a company like Mellanox which builds extremely high bandwidth, high-speed >> And EOF. >> Right, right, and I'm using one as an example right. >> Sure. >> I'm using one as an example and that's where the large partnerships, we have like 300 partnerships, as Ken talked about in the OpenPOWER foundation. Those partnerships is because we brought together all of these technology providers. We believe that no one company can own the agenda of technology. No one company can invest enough to continue to give us the performance we need to meet the needs of the AI workloads, and that's why we want to partner with all these technology vendors who've all invested billions of dollars to provide the best systems and software for AI and data. >> But fundamentally, >> It's the whole construct of data centric systems, right? >> Right. >> I mean, sometimes you got to process the data in the network, right? Sometimes you got to process the data in the storage. It's not just at the CPU, the GPUs a huge place for processing that data. >> Sure. >> How do you do that all coherently and how do things work together in a system environment is crucial versus a vertically integrated capability where the CPU provider continues to put more and more into the processor and disenfranchise the rest of the ecosystem. >> Well, that was the counter building strategies that we want to talk about. You have Intel who wants to put as much on the die as possible. It's worked quite well for Intel over the years. You had to take a different strategy. If you tried to take Intel on with that strategy, you would have failed. So, talk about the different philosophies, but really I'm interested in what it means for things like alternative processing and your relationship in your ecosystem. >> This is not about company strategies, right. I mean, Intel is a semiconductor company and they think like a semiconductor company. 
We're a systems and software company, we think like that, but this is not about company strategy. This is about what the market needs, what client workloads need, and if you start there, you start with a data centric strategy. You start with data centric systems. You think about moving data around and making sure there is heterogeneous compute, there is accelerated compute, you have very fast networks. So, we just built the US's fastest supercomputer. We're currently building the US's fastest supercomputer which is the project name is Coral, but there are two supercomputers, one at Oak Ridge National Labs and one at Lawrence Livermore. These are the ultimate HPC and AI machines, right. Compute's a very important part of them, but networking and storage is just as important. The file system is just as important. The cluster management software is just as important, right, because if you are serving data scientists and a biologist, they don't want to deal with, "How many servers do I need to launch this job on? "How do I manage the jobs, how do I manage the server?" You want them to just scale, right. So, we do a lot of work on our scalability. We do a lot of work in using Apache Spark to enable cluster virtualization and user virtualization. >> Well, if we think about, I don't like the term data gravity, it's wrong from a lot of different perspectives, but if we think about it, you guys are trying to build systems in a world that's centered on data, as opposed to a world that's centered on the server. >> That's exactly right. >> That's right. >> You got that, right? >> That's exactly right. >> Yeah, absolutely. >> Alright, you guys got to go, we got to wrap, but I just want to close with, I mean, always says infrastructure matters. You got Z growing, you got Power growing, you got storage growing, it's given a good tailwind to IBM, so, guys, great work. Congratulations, got a lot more to do, I know, but thanks for >> It's going to be a fun year.
comin' on the Cube, appreciate it. >> Thank you very much. >> Thank you. >> Appreciate you having us. >> Alright, keep it right there, everybody. We'll be back with our next guest. You're watching the Cube live from IBM Think 2018. We'll be right back. (techno beat)

Published Date : Mar 21 2018

ENTITIES

Entity	Category	Confidence
Peter Burris	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Ken King	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Steve Mills	PERSON	0.99+
Ken	PERSON	0.99+
Sumit	PERSON	0.99+
Bob Picciano	PERSON	0.99+
China	LOCATION	0.99+
Monday	DATE	0.99+
Europe	LOCATION	0.99+
Mellanox	ORGANIZATION	0.99+
PayPal	ORGANIZATION	0.99+
10%	QUANTITY	0.99+
Alibaba	ORGANIZATION	0.99+
Japan	LOCATION	0.99+
Sumit Gupta	PERSON	0.99+
OpenPOWER	ORGANIZATION	0.99+
30%	QUANTITY	0.99+
$15 billion	QUANTITY	0.99+
one	QUANTITY	0.99+
Nvidia	ORGANIZATION	0.99+
Hitachi	ORGANIZATION	0.99+
Conetica	ORGANIZATION	0.99+
Xilinx	ORGANIZATION	0.99+
Las Vegas	LOCATION	0.99+
OpenPOWER	EVENT	0.99+
Google	ORGANIZATION	0.99+
Netflix	ORGANIZATION	0.99+
Atos	ORGANIZATION	0.99+
Picciano	PERSON	0.99+
300 partnerships	QUANTITY	0.99+
Intel	ORGANIZATION	0.99+
Anaconda	ORGANIZATION	0.99+
Inspur	ORGANIZATION	0.98+
two supercomputers	QUANTITY	0.98+
Linux	TITLE	0.98+
Moore's Law	TITLE	0.98+
over 300 members	QUANTITY	0.98+
US	LOCATION	0.98+
SAP HANA	TITLE	0.97+
AIX	ORGANIZATION	0.97+
over 50 different systems	QUANTITY	0.97+
Wistron	ORGANIZATION	0.97+
both	QUANTITY	0.97+
Limelight	ORGANIZATION	0.97+
H2O	ORGANIZATION	0.97+
Unix	TITLE	0.97+
over 70 ISVs	QUANTITY	0.97+
Over 20 different manufacturers	QUANTITY	0.97+
billions of dollars	QUANTITY	0.96+
MapD	ORGANIZATION	0.96+
Dennard	ORGANIZATION	0.95+
OpenCAPI	TITLE	0.95+
Moore's law	TITLE	0.95+
today	DATE	0.95+
single server	QUANTITY	0.94+
Lawrence	LOCATION	0.93+
Oak Ridge National Labs	ORGANIZATION	0.93+
IBM Cognitive	ORGANIZATION	0.93+
Tencent	ORGANIZATION	0.93+
nine	QUANTITY	0.92+
one place	QUANTITY	0.91+
up to 10x	QUANTITY	0.9+
X-Dragon	COMMERCIAL_ITEM	0.9+
30% less	QUANTITY	0.9+
P9	COMMERCIAL_ITEM	0.89+
last night	DATE	0.88+
Coral	ORGANIZATION	0.88+
AIX	TITLE	0.87+
Cognitive Systems	ORGANIZATION	0.86+

Dheeraj Pandey, Nutanix | Nutanix NEXT Nice 2017

>> Narrator: Live, from Nice, France. It's theCUBE. Covering .NEXT Conference 2017 Europe. Brought to you by Nutanix. (techno music) >> Welcome back, I'm Stu Miniman and this is SiliconANGLE Media's production of theCUBE. Happy to welcome back to the program, CEO and Founder of Nutanix, Dheeraj Pandey. The keynote this morning, talking about how Nutanix is really going from a traditional enterprise infrastructure company to really becoming its goal of being an iconic software company. So, Dheeraj, bring us up to speed as to, you know, how Nutanix positioned itself for this future. >> Yeah, I think it's been a rite of passage because you can't start from AWS in day one. You have to sell books, and sell eCommerce. You know, you being in the eCommerce space. It was a 20 years journey for them before they could get into computing and people took them seriously. I mean, look at Apple with iPod, and then iPhone, and the iPad, and then iTunes and app store. And all that stuff was a journey of 15 years. You know, before they could really see that they've arrived. I think for us we had to build the form factor of an iPhone four so that people realize what this hyperconvergence thing was. Before we could go and ship an android as an operating system. 'Cause if the android operating system had come first... Just like Windows Mobile operating system was around for a while and nobody really understood how to really go and make money on it. I think we had to build a form factor first. And now that people grok it, now we can really go and make software out of this. And be swell software and make the android version of the iOS itself. And that's the thing. I think, as a company we're challenged to balance these paradoxes. Oh, I thought you were an appliance company and you believe in this Apple like finesse. Polish and attention to detail. How do you apply that to an android like distribution model where you leave it to others to go and build handsets and so on.
I think that's the challenge that we've taken upon ourselves. Now inside, with the cloud service, we have a lot of control. With appliances, we have somewhat control because we at least know what our hardware is running on. But software we open it up. And opening it up, and yet not giving up on the attention to detail is the challenge that this company has to, actually, really go and undertake. We are looking at a lot of our tools and bill for certifications, and you know, passing the test. The litmus test for hardware and we're trying to figure out how to automate the heck out of it. Make them into cloud services. So that customers can now go and crowdsource certifications. So, there'll be some new paradigms that will emerge and the reason why we are well placed for those kinds of things is because our heritage is appliance. So now when we think of doing software a lot of the tooling, a lot of the automations, certifications, the attention to detail we had we'll need to go and make them into cloud services. We have some of them, like Sizer is a cloud service. X-Ray is a cloud service. Foundation is a cloud service. So a lot of these services will then go and make the job of certifying an unknown piece of hardware easier, actually. I mean in fact, even day two and beyond we have what we call NCC which is a service that runs from within Prism to do health checks. And every two hours you can do health checks. So if there's a new piece of hardware that we thought we just certified, we need to keep paranoid about it. Stay paranoid about it, and say, look is the hardware really the hardware we wanted it to be. So there's lots of really innovative things we can do as a company that really had the heritage of appliance to go and do software, as well. >> Yeah, absolutely people have always underestimated the interoperability required. Remember when server virtualization rolled out up the BIOS. You know, could make everything go horribly.
Even, you know, containers could give you portability and run everywhere. Oh wait, networking and storage. There's considerations there. Do you think it's getting to a point from a maturation of the market that the software... You know, can you in the future take Nutanix to be a fully software company where you kind of let somebody else take care of the hardware pieces and then you just become their software. And then there's service software services. That seem like a likely future? >> Yeah, I think with the right tools, right level of automation, right level of machine learning, right level of talk-back. You know, I say talk-back, I mean the fact that the heartbeats are coming to us we understand what the customers are doing. And with the right level of paranoia day two and beyond. Which is NCC for example, it's, We call it Nutanix Cluster Check. And it does like 350 odd health checks on a periodic basis. And it erases the load, and some things like that. With the right level of paranoia I think we can really go and make this work. And by the way, that's where design comes in. Like, how do you think of X-Ray as a service, and Foundation, and Sizer and NCC and so on. I think that's where the real design of a software company that is also not being callous about hardware comes in, actually. So I'm really looking forward to it. I think... it's not just about tech and products. It's also about go-to-market because go-to-market has to change too. I mean, the kind of packaging, and the kind of pricing, the kind of ELA's, sales compensation, channel programs, a lot of those things have to be revisited as well. As upstream engineering, you talk about, there's a lot of downstream go-to market engineering as well, that needs to be done. >> Now, when it comes to go-to-market, partnerships are key of course. There's the channel. You want to grow your sales channel, and grow a piece. But also from a technology standpoint, there's a comment I heard you make earlier this week.
You know, Google has the opportunity to be kind of that next partner. As like Dell was a partner to give you pre-IPO credibility Dell's trusted you. Dell, you have Lenovo, you have IBM up on stage there. As a software company, who are the partners that help Nutanix kind of through this next phase? >> I think you mentioned some of them already. You know, the cloud vendors, though, obviously open up. And there will be new ones that'll open up over time as well. Where we're thinking about ways to blur the lines between public and private. Because I think the world, including the public cloud vendors have come to realize that. You know, you can't have silos. You can't have a public cloud that's separate from the private and so on. So being able to blur the lines, there'll be a lot of cloud partners for us as well. I think on the hardware side, we already talked about all of them, actually. Now, HP and Cisco are right now partners, in double quotes, because we go and make our software work on it, you know. But on some levels they'll probably also have to open up. And there are networking partners that we've been working with you know, Arista is a good case in point. Lexi's another one. And security partners, like Palo Alto could be a large one over time because we think about what firewalls need to look like in the next five years, and so on, you know. I think in every way, I look at even Apache foundation. Which is not really a company but the fact that we can really co-opt a lot of open source and build Calm marketplace apps. Where the apps could be spun up in an on-prem environment and a single-tenant on-prem environment. And you can drag and drop them into a multi-tenant environment. I think being able to go and do more with Apache. To me it's the... I would say, the biggest game changer for the company would be what else can we do with Apache? You know, 'cause we did a lot the first eight years.
I mean, obviously, Linux is a big piece of our overall story, you know. Not just as hypervisor but a controller, and things like that is all Linux based. Which drives the pace of innovation of this company, actually. But beyond Linux, we've used Cassandra and ZooKeeper, RocksDB and things like that. What else can we do with Apache Spark, and Kafka, and MariaDB, and things like that. I think we need to go and elevate the definition of infrastructure. To include databases and NoSQL systems, and batch processing Hadoop, and things like that. All those things become a part of the overall marketplace story for us, you know. And that's where the really interesting stuff really comes in. >> How do you look at open source from a strategic standpoint from Nutanix? >> I think it's been phenomenal because we have then operated as a company that's bigger than we are. 'Cause otherwise, I mean, look at VMware. They don't have that goodness. Nor does Microsoft actually. I mean, Amazon is the only one that really goes and makes the best out of open source. >> Explain that, we say Microsoft had a huge push into open source. Especially, you know, kind of publicly the last two or three years. But they've been working on it, they've, you know, heavily embraced containers. You know, they've gone Kubernetes. You know, heavily. >> I'm going to give you examples. I think there's a lot of marchitecture. And what Microsoft is doing is open source. But, of course you know, Linux has to work on Hyper-V. So, that's a given. They cannot make a relevant stack without really making Linux work in Hyper-V. But they tried Hadoop on Windows. And Hortonworks actually ported Hadoop to Windows but there are not too many takers, as you see, you know. Containers will probably continue to make a lot of progress on Linux because of the LXD and LXC engines, and things like that.
And there's a lot more momentum on the Linux side of containers than there'll be on the Windows side of containers. >> And even Azure is running more Linux than they are Windows these days. >> Absolutely, now that being said, Azure Stack is still Azure Stack. It's still Hyper-V. It's still system centered, not user-centered and things like that. I think Microsoft software will really, really have to find itself. And change a lot of its thinking to really go and say we truly embrace open source like the way Amazon does. And like the way Facebook does. Like the way Nutanix does, I think. You know, it's a very different way we look at open source. We are much like Facebook and Amazon than someone else. I mean, VMware is way farther away from open source, in that sense. I mean vSphere, overall You know, I mean I would say that it probably is Linux based. ESX is Linux based from 17, 18 years ago. I am sure that code path has been forked forever. And it's very hard for them to go and uptake from open source from overall upstream stuff actually. That we build, you know I mean, our stuff runs on a palm sized server. A palm sized server, imagine it. And that's where we put in a drone and that's the foundation of an edge cloud for us, in some sense. Our stuff runs on IBM power system because IBM was doing a lot of work with open source KVM that made it easy for us to port it to AHV, and so on. And so, I think AHV has a lot more momentum because it shares that overall core base of open source, as well. And I think, over time we'll do many more things with open source. Including in the platform space. >> Okay, how's Nutanix doing globally. You know, what more do you want to be doing. How would you rate yourself on kind of Nutanix as a global company? >> I think it's a great question and it's one of those that's a double edged sword, actually. And I'll tell you what I mean by that.
So when you stop growing, non-US business becomes 50%, 'cause that's pretty much the reflection of IT spend. Half the spend is outside the US, half the spend is within the US. Right from here is 65/35. Which is a very healthy place to be in, actually. I don't want to just think to change to like 50/50 end because that's a proxy for are we stop growing, actually. At the same time, I'd love to be shipping everywhere, because again, I've said that the definition of an enterprise cloud is even more relevant. And, you know, parts of the world that is not US, actually. In that sense, just being able to go and maintain that customer base outside the US. I mean, being able to do it. I mean, you know we recently sold a system in Myanmar, actually. And I was telling my friends that look, now I can die in peace because we have a system in Myanmar, you know. But the very fact that there are partners, and there's the channel community, and there's technology champions and there are experts. There are certified people in these remote parts of the world. And the fact that we can support these customers successfully, says a lot about the overall reach of the technology. The fact that it's reliable, the fact that it's easy to use and spin up, and the fact that it's easy to get certified on. I think is the core of Nutanix, so I feel good about those things, actually. >> You've reached a certain maturity of product marketed option and we've seen Nutanix starting to spread out into certain things sometimes we call adjacencies. You've talked about some of the different software pieces. How do you manage the growth, the spread and make sure that, you know, simplicity. We were talking to Sunil this morning about absolutely you want simplicity but you also want to, you know. Where does Nutanix play and where don't they play? You know, where >> That's a great question. So, there's a really good book that I was introduced to about two years ago. And it's also...
There's some videos on YouTube about this book. It's called, The Founder's Mentality. The YouTube video is called The Founder's Mentality, as well. And it talks about this very phenomenon that as companies grow they become complex. So they introduce a problem. It's called the Paradox of Growth. The thing that you want to do, really do, was grow. And that thing that you coveted kills you. 'Cause growth creates complexity and complexity is a silent killer of growth. So the thing that you coveted is the thing that kills you. And that is the Paradox of Growth, actually. You know, in very simple terms. And then it goes on to talk about what are the things you need to do because you started an insurgent company over time you started acting like you've arrived and you're incumbent now, all of a sudden. And the moment you start thinking like an incumbent you're done, in some sense. What are the headwinds, and what are the tailwinds that you can actually produce to actually stay an insurgent. I think there's some great lessons there about an insurgent mindset, and an owner's mentality and then finally, this obsession for the front line. How do you think about customers as the first, last thing. So, I think that's one of the guiding principles of the company. In how can we continue to imbibe the founder's mentality in there as well. Where every employee can be a founder, actually, without really having the founder's tag, and so on. And then internally, there's a lot of things we could do differently, in the way that we do engineering, in the way we do collaboration. I mean, these are all good things to revisit design. Not just the product design piece, but organizational design like what does it mean to have a two-pizza team, and microservices, and product managers, and Prism developers and Calm developers, assigned to a two-pizza team, and so on. QA developers and so on. So there's a lot of structure that we can put at scale.
That continues to make us look small, continues to have accountability at a product manager level so that they act like GM's, as opposed to PM's. Where each of these two-pizza teams are like a quasi P&L. You know they, you can look at them very objectively and you can fund them. If they start to become too big you need to split them. If they are not doing too well, you need to go and kill them, actually. >> Alright, Dheeraj, last question I have for you. Enterprise cloud, I think, you know when it first came out as a term, we said, it was a little bit aspirational. What should we be looking for in a year to really benchmark and show as proof points that it's becoming reality. You know, from Nutanix. >> That's a great point. You know, obviously, when Gartner starts to use the term very close term, you know what I say. Used the term enterprise cloud operating system. And in one of the recent discourses I saw, enterprise cloud operating model. That's very similar to system, vs model, but the operating model of the enterprise cloud is based on the tenets of you know, web-scale engineering you know, the fact that things are on commodity servers. Everything is pure software and you have zero differentiation in hardware. And all those differentiation comes in pure software. Infrastructure is code. All those things are not going away. Now how it becomes easy to use, so that you don't need PhD's to manage it is where consumer grade design comes in. And where you have this notion of Prism and Calm that actually come to really help make it easy to use. I think this is the core of enterprise cloud itself, you know. I think, obviously, every layer in this overall cake needs more features, more capability, and so on. But foundationally, it's about web-scale engineering, consumer grade design. And if you're doing these two things getting more workloads, getting more geographies, getting more platforms, getting more features...
All those things are basically a rite of passage. You know, you need to continue to do them all the time, actually. >> Alright, Dheeraj, I had a customer on. Said the reason he bought Nutanix was for that fullness of vision. So, always appreciate catching up with you. And we'll be back with lots more coverage here from Nutanix .NEXT, here in Nice, France. I'm Stu Miniman, and you're watching theCUBE.

Published Date : Nov 8 2017


ENTITIES

Entity	Category	Confidence
Dheeraj	PERSON	0.99+
Myanmar	LOCATION	0.99+
Amazon	ORGANIZATION	0.99+
Cisco	ORGANIZATION	0.99+
Dell	ORGANIZATION	0.99+
NCC	ORGANIZATION	0.99+
Nutanix	ORGANIZATION	0.99+
IBM	ORGANIZATION	0.99+
HP	ORGANIZATION	0.99+
US	LOCATION	0.99+
Google	ORGANIZATION	0.99+
Microsoft	ORGANIZATION	0.99+
Lenovo	ORGANIZATION	0.99+
20 years	QUANTITY	0.99+
android	TITLE	0.99+
Dheeraj Pandey	PERSON	0.99+
50%	QUANTITY	0.99+
Apache	ORGANIZATION	0.99+
Apple	ORGANIZATION	0.99+
iPhone	COMMERCIAL_ITEM	0.99+
Cicer	ORGANIZATION	0.99+
15 years	QUANTITY	0.99+
Stu Miniman	PERSON	0.99+
iPad	COMMERCIAL_ITEM	0.99+
Palo Alto	ORGANIZATION	0.99+
iOS	TITLE	0.99+
The Founder's Mentality	TITLE	0.99+
iPod	COMMERCIAL_ITEM	0.99+
Hadoop	TITLE	0.99+
first	QUANTITY	0.99+
Linux	TITLE	0.99+
each	QUANTITY	0.99+
Gartner	ORGANIZATION	0.99+
Windows	TITLE	0.99+
Facebook	ORGANIZATION	0.99+
ESX	TITLE	0.99+
AWS	ORGANIZATION	0.99+
Nice, France	LOCATION	0.99+
Costco	ORGANIZATION	0.99+
Azure Stack	TITLE	0.98+
H-V	TITLE	0.98+
iPhone four	COMMERCIAL_ITEM	0.98+
first eight years	QUANTITY	0.98+
two PIDs	QUANTITY	0.97+
iTunes	TITLE	0.97+
one	QUANTITY	0.97+
vSphere	TITLE	0.97+
350 odd health checks	QUANTITY	0.97+
YouTube	ORGANIZATION	0.97+

Ross Turk, Red Hat | Open Source Summit 2017

(upbeat music) >> Announcer: Live from Los Angeles, it's theCUBE covering Open Source Summit, North America 2017, brought to you by the Linux Foundation and Red Hat. >> Okay, welcome back everyone. Live here in Los Angeles, is theCUBE's exclusive coverage of the Open Source Summit, North America. I'm John Furrier, your host, with my cohost, Stu Miniman with Wikibon. Our next guest is Ross Turk, Director of Evangelism at Red Hat. Welcome to theCUBE, good to see you again. >> Good to see you. >> So, evangelizing is now going to be super more important as Open Source Summit, formerly called LinuxCon, Linux kernel. So, Linux is really now the foundation. So, now all these new products are emerging, hence the new name Open Source Summit. You guys are in the middle of it. >> Ross: Mm-hmm. >> What's the themes that you guys are pumping out there right now from an evangelist standpoint? Give me the order of operations in terms of priorities. >> Well, gosh, we're trying to tell stories about how people operate infrastructure in today's modern world, right, which is a lot about making sure that, you know, dealing with ephemeral infrastructure, dealing with containerized applications, and that sort of thing. It gives a lot more flexibility to people who are doing modern operations. It's about applications that spill over across multiple machines and doing so in a way that doesn't require a lot of heavy lifting or wiring things up by hand. So, there's this whole modern operations experience thing we talk about, but we also talk about a modern developer experience. What does it mean to build applications today? And, of course, you combine those together it turns into DevOps, right. But, the large companies still work in these two separate worlds. But, people are building technology differently than they ever did in the past, and they're deploying it differently than they did in the past. So, there's lots of stories that can come out of that.
>> Well, let's start with the story that we love. Stu and I were talking about serverless at the beginning because you have the DevOps movement certainly is going mainstream. You're seeing a lot of enterprises looking at that as viable. Now, they're operationalizing it, and they need to have that industrial strength Red Hat Linux. But, now Kubernetes and serverless, the younger developers, they just want an infrastructure as code. >> That seems to be a very hot story here, and Kubernetes, serverless is kind of in the hallway conversations. How do you guys bring that to bear? >> Well, I think that what Red Hat does is we give an operating environment that can sit underneath all of it with RHEL and everything else we build that is stable and secure and reliable. And, you need that in order to have all of this chaos happening above it with developers deploying microservices and moving things around, and demands changing and all these other things, you need to have something really stable and reliable underneath that, something that you know can be if the applications and virtual machines and containers aren't long running, what sits underneath of them is long running, and it still needs to be stable and reliable. So, a lot of the work we've been doing for the past 20 years around Linux Engineering, I think, contributes to making this stable environment for a modern developer. >> Yeah, Ross, one of the challenges in scaling is usually I've got to worry things like storage. You know, state is there, you know data gravity is something we need to be concerned about. It's great to say ephemeral and I want everything anywhere, and, I can put it in this cloud or use it in that application, but at the end of the day it's tough to build some of these pieces. How's Red Hat helping there as containerization and scale, how does that fit into kind of this storage discussion moving on?
>> It's a real struggle right, because you can talk to people and they say oh, every single one of the microservices held over and they scale out, and all this, and they talk about this really elaborate infrastructure like well, where is all your data being stored? Oh, it's sitting in Oracle, you know, so you find this sort of like dissonance between how data is managed and how applications are managed. At Red Hat, we believe that storage should be another microservice alongside all the other microservices that make up an application. So, that's why we put a lot of engineering effort into making things like Ceph and Red Hat Gluster Storage work well alongside OpenShift so that a developer can provision storage as needed without having to go to an ops person, and that when that storage gets provisioned it's in containers alongside other containers that are providing the other things that your application needs. >> Software defined storage was the answer, it's the Holy Grail. We've heard software defined data center. We've been covering this also in the VMworld, heard an awful lot about that. But, that still is a key part of the software, and now you have hardware stacks, so IoT and Cloud are opening up these new use cases for enterprises where whoa, we actually kind of didn't test that hardware with that software, so it's kind of interesting dynamic because software defined is still super important. What's your view on software defined storage, in particular, is that an answer, is it stable, what's your thoughts? >> Well, I think it's an answer, but it depends on what the question is, just to be kind of-- >> What is software defined storage? Let's start with that one. >> Well, so, what is software defined storage? Software defined storage is, okay, so I'll say it in more like what it isn't.
>> The traditional storage, traditional storage solutions get deployed as appliances, which are vertically integrated hardware and software solutions that are built to do one thing, and to do that one thing well. And, that one thing is to store data. They're kind of like big refrigerator-sized things that you bring into your data center with a forklift and it's a big operation, and then they provide storage for any number of applications. What software defined storage does is it implements those same services and those same capabilities, but it does it entirely in software. So, instead of being this vertically integrated software, hardware solution, you end up with software that lets you build it on any hardware, and that hardware can be physical hardware so you can build a storage cluster made up of 1,000 bare metal servers, or you could build that same cluster on a thousand VMs inside of a public cloud. So, in making storage no longer a hardware problem, like it used to be, I mean fundamentally it's a hardware problem, you get down bits are stored somewhere, but, the management of storage is no longer a hardware concern, it's a software concern, now, and that means it's a little bit more flexible. You can containerize it. You can deploy it in the public cloud. You can deploy it in VMs. You can deploy it on bare metal. So, that's what software defined storage is doing is it's changing things around, but it requires different skills. >> Come on Ross, I want a storageless environment, can we get on that? >> A storageless environment? Sure, I guess. Storage has become somebody else's problem at that point. >> Absolutely, how about, how is containers changing that whole discussion?
You know, it took us like a decade to kind of get storage working in a virtualized environment, networking seems to be really tackling the container piece, storage seems a little further behind, you know, what're you seeing some of the big challenges there and how are we looking to solve that? >> Well, here there's when you look at containers and storage, there are really two things to consider. The first is how do you make storage such that a containerized environment can consume it easily, right. This is what at Red Hat we call container ready. So, we call a storage solution container ready, what it means that your container platform knows how to consume it. Most storage is container ready, all it takes is a Kubernetes volume driver to be container ready, and that's one half of it, and that's really, really important. It's the same kind of thing we had to do with virtualization, making sure every hypervisor could talk to every storage system. Now, we're making sure every container platform can talk to every storage system. That's important, but it's only half the puzzle, 'cause the other half is now that you have storage as a software thing, a distributed software thing, you can actually deploy that storage inside the same containers that you're using, that are driving the demand for that storage. So, it's this kind of weird, you know, snake eating its own tail thing where you as a developer, let's say I'm deploying an application, I need a database, I need a web server, blah, blah, and a bunch of other things, and I need a scale out storage system, I can deploy that in containers just alongside everything else, and it uses the local storage of each of the container hosts to build that shared storage that then is used to provide services to other containerized applications. So, it's the ability to have storage in containers, which is really strange. We call that container native storage.
>> It's interesting the market's going pretty crazy, so if you kind of take the DevOps and say assume for a minute infrastructure is programmable. >> Mm-hmm. >> But, then you look at the developer action right now on the App side, we've seen all kinds of new stuff Apple has their announcement today with the new iPhone 8. We've been covering that on siliconangle.com. Forbes has got great stories as well. New ARKit, so augmented reality is a huge deal, virtual reality obviously still hyped up, is still promised, those are going to require new chips. That's going to require consumer behavior change, so, the developers are staring at a different market than worrying about provisioning storage, right. So, but, these are now new pressures. New hardware, new opportunities, as a developer, advocate, and evangelist, and an industry participant, and user, how do you look at that, and how is that impacting the developer market because Android's got good stuff coming down, too, not just Apple, Samsung? >> Ross: Yeah. >> It's all multimedia, I mean. >> Well, what's interesting about ARKit is that if you go just back five years that same capability required a very, very particular type of phone, you know, like the Project Tango stuff required all these depth cameras and like Kinect-style stuff to do the ARKit, and Apple was able to solve a lot of that in software just using two cameras, right, and in software. And, I think that's really-- >> John: On a phone? >> On a phone, on a phone no less, and I think what's amazing about that is all of the capabilities that we walk around with in our pocket now were really hard to get a long time ago. >> Well, this is interesting, your point, let's stay on this because this really illustrates the point. ARKit, for example is proving that the iPhone now is smart enough and with software, enough horsepower to do that kind of thing, but that's replicable across all devices now as an IoT device.
The Internet of Things is going to be a freight train coming down the tracks, security, endpoint security, whether it's, I mean all kinds of coolness, but yet threats are there. So, software has to do all this, right. So, how's that going to impact the cloud game, your business, you guys you have to move faster on hardening things, be more organic on the innovation side, not business-wise, but technical strategy. >> Well, I think a lot of it is enabling developers to work more quickly and build features more quickly, also, educating developers on the security and privacy ramifications of the things that they build because it's really easy to just go out in front and advance and innovate and forget about all of that stuff. So, it's about changing developer culture so that you consider security and privacy first, as opposed to later. And, also, maybe you want to consider storage as well if you're talking about machine learning or IOT and all of these types of things, you're -- >> Videos, I mean this is video, software rendering. That's a storage nightmare. >> It's all got to live somewhere, and once you put it in that place where it lives, it's really hard to move it. So, this is a thing you want to plan from the very beginning. >> And, I think that's what's cool about AI, too, and self-driving cars it's a consumer, you know, flashy, coolness that can say hey, this is happening. I mean how fast is happening, but the developer is now bringing it to the businesses and say, okay, we don't have an AR virtual reality strategy for our retail, for instance, you potentially could be out of business. So, these are the kind of thoughts that are going at the C-level that now are going into what used to be IT, but all of IT, how do you handle this? This is an architectural question, so your thoughts on that, because that seems to be a conversation we see a lot. Architectural that's going to solve problems today, not foreclose future opportunities. 
>> Well, it's cultural, too, inside of the company, like everywhere inside of a company there used to be Internet teams in companies, remember. We used to be like oh, go talk to the Internet team because something's wrong with the Web or whatever, now, there's no Internet team, everybody's the Internet team. Every single team in an organization is thinking about how to leverage the Internet to make their job more effective. The same is going to be true for everything that we're talking about, you know. Security, interestingly enough, so many people always thought security was somebody else's problem. But just this week, we were reminded that it's everybody's problem, hundreds of millions of people's problem, security. So, I think that as these things kind of advance-- >> John: Security first, and privacy first is critical. >> It is absolutely critical, and there used to be, I mean, I think at some point maybe there won't be a security team inside of a company because everybody's going to be the security team, but it's like everybody's the Internet team now, and I always felt the same way about open source communities. I thought there would never, you know, always everybody-- >> Well, people are rolling their own security now. You have these LifeLock or whatever they call them, these services for a password protection because you can't trust even all these databases that are out there. You have blockchain with immutability, yeah, certainly the wallets are not yet, but I mean certainly this is where it might be a future scenario. >> Yeah, and I think for all of these things agility is going to be key. The ability to go down a path a certain distance and realize whoa I've run into a privacy problem, back up a little bit, continue down another path. I think that the faster we can make the development process, I think the less risky we make going into all these new frontiers.
>> Yeah, Ross, one of the things we've really liked watching the last kind of five years or so is storage turning into a discussion of data and how can we leverage that data, real-time data, you know, decisions at the edge, analytics, what's exciting you the most about kind of the storage world these days? >> Oh boy. Well, you know, I just spent about five years in the storage infrastructure world, so a lot of what kind of kept me going day and night was saving people money, making things faster, making things easier, but also, giving storage platforms that were elastic enough to handle all of this really interesting stuff that happens on top of them. So, there's all kinds of new big data stacks that I find particularly interesting, a lot of the real time analysis stuff like Apache Spark and things like that. There's so much going into visualization right now, as well, how you handle large amounts of time series data and that sort of thing. There's been a lot of advancements in exactly that. Personally, I'm really excited lately in all the data of this stuff, all the ways you can extract meaning from all this data, you know, the ways that you can give it a business context that allows you to make better decisions with it. >> Not a lot of data conversations here at this conference as is open source software, but I mean data I mean I've said and I wrote a blog post in 2008, Dave always, Dave Vellante always jokes with me because I always reference it, I said data is the new development kit, meaning data is going to be part of the software development model, and it actually is with big data, but, you're not hearing a lot of it here because most people are talking about their communities, their projects, but the role of data is fundamental at the edge. >> Ross: Absolutely. >> And, so, how is that going to change some of these conversations and can data be developed on, and is data now part of the software development life cycle that's coming to fruition in the new way.
>> Interesting, I think that's an interesting observation that as we see sort of Dev and Ops coming together, right, the world of the operator and the world of the developer coming together, I think we'll probably, at some point, see the world of the developer come together with the world of the data scientist because as I kind of wrack my brain I'm thinking okay, what type of future developer wouldn't have to be dealing with large amounts of data wouldn't have to have that kind of skill to be able to deal with it. So, I think we're going to start to see more software developers getting more involved in big data, machine learning, data analytics, and things like that for sure. >> Well, either way, this open source growth that's coming is going to be exponential. Data is already there. I mean we have a joke in our office software is eating the world as Marc Andreessen would say years ago, but, data is eating software. So, in terms of how you look at it someone's eating somebody, but, this becomes interesting for the IoT developer, or the industrial developer. Those systems were never connected to IT in the past. It was like they ran their own stuff from their own terminals. >> And, there's this idea that everybody's heard that data has gravity, right. And, I actually was talking to somebody about this and they said, well, actually the data has inertia, and I'm like, no, that's not really it 'cause once it's moving it's not hard to stop it. The idea that data has gravity means that let's say I'm putting together this new IoT application, or whatever, I'm gathering data from a bunch of sensors or whatever, and I've got the data in that place. Now, having all that data in that place is more meaningful to me than most of the software that I wrote. You know, it's like that is the value, the kernel of the data is there, and data having gravity means that it's hard to move once it's in a certain place, but, it also means that it attracts workloads to it, right.
So, it used to be that software was king, and software created data and managed data, and now data is king, and it brings software to it, I think. >> I totally agree with you, and I think they might even call this the open data summit soon, but it's beyond open source. Now, this is going to be great. They work hand in hand. Software and data are going to be great. Stu what's your thoughts on the role that data's not being talked much here? >> Yeah, John at Amazon weighed in last year. When we talked to Andy Jassy it was the customers were the flywheel, and I think data's going to be that next flywheel of really feeding into that data gravity discussion that you were having, Ross. You know, when Hadoop came out it was like oh, we're going to bring the code to the data. Well, we know if I'm going to have more data I'm going to have my data sources, I'm going to have third party data sources that I want to be able to work and interact with those, so, data absolutely huge opportunities there, and the companies that can leverage that and get more value out of it is going to be a-- >> Well, we already see it's a competitive advantage, no doubt, but it's the privacy issue still the big debate like we know in our immediate businesses. Look at Facebook, I've got a free App I get to see all my friends' photos, their vacations, everyone's living a great life on Facebook, but, then all of a sudden I give my data away for free for the privilege to use that App, but all the sudden they start injecting fake news at me. I don't want that anymore, and you're still making money off of my data, so that's interesting. Facebook makes money off of my data. >> Yeah, that's-- >> That's my contract with them. >> Yeah, If you ask what their asset is, one person might think it's traffic, you know, or eyeballs, but, I think it's data. 
>> So, they're using data, I might not like it, so that might be an opportunity for somebody else so your point Stu, so if you start thinking about it differently, data decisions are going to be an architectural challenge. >> Yeah, absolutely. I think enterprise architecture thinking, even today, you're seeing enterprise architects thinking more and more and more about data than they have in the past. >> Ross, what do you think about the show, final word in the segment, what's going on Open Source Summit, share with the folks that are watching? >> The vibe here, it's now a new name, but it's still the same game, multiple events come together. >> Yeah, multiple events together. I like Open Source Summit as a name. I think it's a good name. It's properly named for what's going on here. It's been an interesting experience for me because I've been in this community for a really long time. So, I come here and I run into all kinds of old friends, the hallway track's always a good track for me. The content is fantastic, but the hallway track is always really good, and I can't think of anywhere else where you can go and get this selection of people, right. You have people who're working on all layers of the problem, and they can all come together and talk. So, I don't know-- >> It's really a cross-fertilization, cross-pollination, whatever word you want to use. I think this event's going to be in the 30,000 pretty quickly. I mean this is going to be. Well, if you look at the growth, the numbers, you know, presented on stage, Jim Zemlin was pointing out the growth, by 2026, 400 million libraries. I mean people still think that's underestimated. >> Yeah. >> So, that's a lot of growth. >> I think it could get there, and I think these folks organize great shows, so I look forward to seeing them scale up to 30,000. >> Ross, thanks for your commentary, appreciate the perspective, and the insight here on theCUBE. >> Thank you. >> Thanks for joining us.
This is theCUBE live coverage from Open Source Summit, North America. I'm John, for Stu Miniman, back with more after this short break. (upbeat music)

Published Date : Sep 12 2017


ENTITIES

Entity	Category	Confidence
Jim Zemlin	PERSON	0.99+
Marc Andreessen	PERSON	0.99+
Ross	PERSON	0.99+
John Furrier	PERSON	0.99+
Stu Miniman	PERSON	0.99+
Dave Vellante	PERSON	0.99+
John	PERSON	0.99+
Samsung	ORGANIZATION	0.99+
Apple	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
Ross Turk	PERSON	0.99+
2008	DATE	0.99+
Facebook	ORGANIZATION	0.99+
Andy Jassy	PERSON	0.99+
Linux Foundation	ORGANIZATION	0.99+
Red Hat	ORGANIZATION	0.99+
Los Angeles	LOCATION	0.99+
iPhone	COMMERCIAL_ITEM	0.99+
last year	DATE	0.99+
iPhone 8	COMMERCIAL_ITEM	0.99+
Stu	PERSON	0.99+
Open Source Summit	EVENT	0.99+
two cameras	QUANTITY	0.99+
two things	QUANTITY	0.99+
theCUBE	ORGANIZATION	0.99+
first	QUANTITY	0.99+
Oracle	ORGANIZATION	0.98+
400 million libraries	QUANTITY	0.98+
today	DATE	0.98+
North America	LOCATION	0.97+
five years	QUANTITY	0.97+
Linux	TITLE	0.96+
Androids	TITLE	0.96+
Dave	PERSON	0.96+
about five years	QUANTITY	0.96+
this week	DATE	0.96+
Open Source Summit 2017	EVENT	0.96+
Linux kernel	TITLE	0.94+
2026	DATE	0.93+
two separate worlds	QUANTITY	0.93+
siliconangle.com	OTHER	0.93+
one person	QUANTITY	0.92+
one	QUANTITY	0.91+
one thing	QUANTITY	0.91+
each	QUANTITY	0.91+
hundreds of millions of people	QUANTITY	0.91+
up to 30,000	QUANTITY	0.89+
America	LOCATION	0.88+
half	QUANTITY	0.88+
Kubernetes	ORGANIZATION	0.88+
The Linux Con	EVENT	0.86+
30,000	QUANTITY	0.86+
Wikibon	ORGANIZATION	0.86+
2017	EVENT	0.85+
one thing	QUANTITY	0.84+
Forbes	ORGANIZATION	0.83+
1,000 bare metal servers	QUANTITY	0.82+
one half	QUANTITY	0.8+

Day 2 Kickoff - #SparkSummit - #theCUBE

[Narrator] Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome to theCUBE. My name is David Goad and I'm your host and we are here at Spark day two. It's the Spark Summit and I am flanked by a couple of consultants here from-- sorry, analysts from Wikibon. I got to get this straight. To my left we have Jim Kobielus who is our lead analyst for Data Science. Jim, welcome to the show. >> Thanks David. >> And we also have George Gilbert who is the lead analyst for Big Data and Analytics. I'll get this right eventually. So why don't we start with Jim. Jim just kicking off the show here today, we wanted to get some preliminary thoughts before we really jump into the rest of the day. What are the big themes that we're going to hear about? >> Yeah, today is the Enterprise day at Spark Summit. So Spark for the Enterprise. Yesterday was focused on Spark, the evolution, extension of Spark to support for native development of deep learning as well as speeding up Spark to support sub-millisecond latencies. But today it's all about Spark and the Enterprise, really what I call wrapping DevOps around Spark, making it more productionizable, supportable. The Databricks serverless announcement, though it was announced yesterday, the press release went up, they're going into some depth right now in the keynote about serverless, and really serverless is all about providing an in-cloud Spark, essentially a sandbox for teams of developers to scale up and scale out enough resources to do the modeling, the training, the deployment, the iteration, the evaluation of Spark jobs in essentially a 24 by seven multi-tenant fully supported environment.
So it's really about driving this continuous Spark development and iteration process into a 24 by seven model in the Enterprise, which is really what's happening is that data scientists, Spark developers are becoming an operational function that businesses are building, strategic infrastructure around things like recommendation engines, and e-commerce environments, absolutely demand 24 by seven resilience Spark team based collaboration environments, which is really what the serverless announcement is all about. >> David: So getting increasing demand on mission critical problems so that optimization is a big deal. >> Yeah, data science is not just an R&D function, it's an operational IT function as well. So that's what it's all about. >> David: Awesome, well let's go to George. I saw you watching the keynote. I think still watching it again this morning, so taking notes feverishly. What were some of the things that stuck out to you from the keynote speaker this morning? >> There are some things that are sort of going to bleed over from yesterday where we can explore some more. We're going to have on the show, the chief architect, Reynold Xin, and the CEO, Ali Ghodsi, and some of the things that we want to understand is how the scope of applications that are appropriate for Spark are expanding. We've got sort of unofficial guidance yesterday that, you know, just because Spark doesn't handle key value stores or databases all that tightly right now, that doesn't mean it won't in the future on the Apache Spark side through better APIs and on the Databricks side, perhaps custom integration and the significance of that is that you can open up a whole class of operational apps, apps that run your business and that now incorporate, you know, rich analytics as well.
Another thing that we'll want to be asking about is, keying off what Jim was saying, now that this becomes not a managed service where you just take the labor that the end customer was applying to get the thing running but it's now automated and you don't even know the infrastructure. We'll want to know what does that mean for the edge, you know, where we're doing analytics close to internet of things and people and sort of if there has to be a new configuration of Spark to work with that. And then of course what do we do about the whole data science process and the dev-ops for data science when you have machine learning distributed across the cloud and edge and On-Prem. >> Jim: In fact, I know we have Pepperdata coming on right after this, who might be able to talk about that exact dev-ops in terms of performance optimization into distributed Spark environment, yeah. >> George, I want to follow up with that. We had Matt Fryer from Hotels.com, he's going to be on our show later but he was on the key note stage this morning. He talked about going all cloud, all Spark, and how data science is even competitive advantage for Hotels.com. What do you want to dig into when we get him on the show? >> That's a really good question because if you look at business strategy, you don't really build a sustainable advantage just by doing one thing better than everyone else. That's easier to pick off. The sustainable strategic advantages come from not just doing one thing better than everyone else but many things and then orchestrating their improvement over time and I'd like to dig into how they're going to do that. 'Cause remember Hotels.com it's the internet equivalent descendant of the original travel reservation systems, which did confer competitive advantage on the early architects and deployers of that technology. >> Great and then Pepperdata wanted to come back and we're going to have them on the show here in just a moment. What would you like to learn from them? 
What do you think will benefit the community the most? >> Jim: Actually, keying off something George said, I'd like to get a sense for how you optimize Spark deployments in a radically distributed IOT edge environment. Whether they've got any plans, or what their thoughts are in terms of the challenges there. As more of the intelligence gets pushed to the edge, much of that will be on machine learning and deep learning models built into Spark. What are the challenges there? I mean, if you've got thousands to millions of end points that are all autonomous and intelligent and they're all running Spark, just what are the orchestration requirements, what are the resource management requirements, how do you monitor end-to-end in an environment like that and optimize the passing of data and the transfer of the control flow or orchestration across all those dispersed points? >> Okay, so 30 seconds now, why should the audience tune into our show today? What are they going to get? >> I think what they're going to get is a really good sense for the emerging best practices for optimizing Spark in a distributed fog environment out to the edge, where not just the edge devices but everything, all nodes, will incorporate machine learning and deep learning. They'll get a sense for what's been done today, what the tooling is to enable dev-ops in that kind of environment. As well as, sort of, the emerging best practices for compressing more of these algorithms and the data itself as well as doing training in a theoretically federated environment. I'm hoping to hear from some of the vendors who are on the show today. >> David: Fantastic and George, closing thoughts on the opening segment? 30 seconds. >> Closing thoughts on the opening segment. 
Like Jim, we want to think about Spark holistically. It has traditionally been best positioned as, as Matei acknowledged yesterday, sort of this offline branch of analytics that you apply to data, like a repository that you've accumulated. Now we want to see it put into production, but to do that you need more than just what Spark is today. You need basically a database or key value kind of option so that you're storing your work as it goes along, so you can go back and analyze it with either simple analysis or complex analysis. So I want to hear about that. I want to hear about their plans for IOT. Spark is kind of a heavyweight environment, so you're probably not going to put it in the boot of your car, or at least not likely anytime soon. >> Jim: Intelligent edge. I mean, Microsoft Build a few weeks ago was really deep on intelligent edge. HP, whose show we're doing, actually I think it's in Vegas, right? They're also big on intelligent edge. In fact, we had somebody on the show yesterday from HP going into some depth on that. I want to hear what Databricks has to say on that theme. >> Yeah, and which part of the edge, is it the gateway, the edge gateway, which is really a slimmed-down server, or the edge device, which could be a 32 bit meg RAM network card. >> Yeah. >> All right, well gentlemen, appreciate the little insight here before we get started today, and we're just getting started. Thank you both for being on the show and thank you for watching the Cube. We'll be back in a little while with our CEO from Databricks. Thanks for watching. (upbeat music)

Published Date : Jun 7 2017

SUMMARY :

brought to you by databricks. It's the Spark Summit and I am flanked by What are the big themes that we're going to hear about? So Spark for the Enterprise. so that optimization is a big deal. So that's what it's all about. from the key note speaker this morning? and some of the things that we want to understand is Jim: In fact, I know we have Pepperdata coming on and how data science is and I'd like to dig into how they're going to do that. What would you like to learn from them? As more the intelligence gets pushed to the edge and the data itself David: Fantastic and George, but to do that you need more than just what Spark is today. I want to hear what databricks has to say on that theme. or the edge device, and thank you for watching the Cube.

ENTITIES

Entity	Category	Confidence
Jim	PERSON	0.99+
Jim Kobielus	PERSON	0.99+
David	PERSON	0.99+
George	PERSON	0.99+
George Gilbert	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
David Goad	PERSON	0.99+
Matt Fryer	PERSON	0.99+
Reynold Xin	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
San Francisco	LOCATION	0.99+
thousands	QUANTITY	0.99+
30 seconds	QUANTITY	0.99+
Hotels.com	ORGANIZATION	0.99+
yesterday	DATE	0.99+
Vegas	LOCATION	0.99+
32 bit	QUANTITY	0.99+
today	DATE	0.99+
24	QUANTITY	0.99+
HP	ORGANIZATION	0.99+
Spark	TITLE	0.99+
seven	QUANTITY	0.98+
Yesterday	DATE	0.98+
both	QUANTITY	0.98+
Spark Summit	EVENT	0.98+
Matei	PERSON	0.97+
Sparks Summit 2017	EVENT	0.96+
one	QUANTITY	0.96+
this morning	DATE	0.96+
Pepperdata	ORGANIZATION	0.96+
Day 2	QUANTITY	0.95+
Wikibon	ORGANIZATION	0.94+
Sparks Summit	EVENT	0.93+
databricks	ORGANIZATION	0.91+
day two	QUANTITY	0.87+
Spark	ORGANIZATION	0.86+
few weeks ago	DATE	0.86+
millions of end points	QUANTITY	0.81+
Big Data	ORGANIZATION	0.81+
Cube	COMMERCIAL_ITEM	0.68+
sub	QUANTITY	0.6+
Apache Spark	TITLE	0.55+
Analytics	ORGANIZATION	0.53+

Day One Wrap - #SparkSummit - #theCUBE

>> Announcer: Live from San Francisco, it's the CUBE covering Spark Summit 2017, brought to you by Databricks. (energetic music plays) >> And what an exciting day we've had here at the CUBE. We've been at Spark Summit 2017, talking to partners, to customers, to founders, technologists, data scientists. It's been a load of information, right? >> Yeah, an overload of information. >> Well, George, you've been here in the studio with me talking with a lot of the guests. I'm going to ask you to maybe recap some of the top things you've heard today for our guests. >> Okay so, well, Databricks laid down, sort of, three themes that they wanted folks to take away. Deep learning, Structured Streaming, and serverless. Now, deep learning is not entirely new to Spark. But they've dramatically improved their support for it. I think, going beyond the frameworks that were written specifically for Spark, like Deeplearning4j and BigDL by Intel, now TensorFlow, which is the open source framework from Google, has gotten much better support. Structured Streaming, it was not clear how much more news we were going to get, because it's been talked about for 18 months. And they really, really surprised a lot of people, including me, where they took, essentially, the processing time for an event or a small batch of events down to 1 millisecond. Whereas, before, it was in the hundreds if not higher. And that changes the type of apps you can build. And also, the Databricks guys had coined the term continuous apps, which means they operate on a never-ending stream of data, which is different from what we've had in the past where it's batch or with a user interface, request-response. So they definitely turned up the volume on what they can do with continuous apps. And serverless, they'll talk about more tomorrow. And Jim, I think, is going to weigh in. But it, basically, greatly simplifies the ability to run this infrastructure, because you don't think of it as a cluster of resources. 
You just know that it's sort of out there, and you ask requests of it, and it figures out how to fulfill it. I will say, the other big surprise for me was when we had Matei, who's the creator of Spark and the chief technologist at Databricks, come on the show, and we asked him about how Spark was going to deal with, essentially, more advanced storage of data so that you could update things, so that you could get queries back, so that you could do analytics, and not just of stuff that's stored in Spark but stuff that Spark stores essentially below it. And he said, "You know, Databricks, you can expect to see come out with or partner with a database to do these advanced scenarios." And I got the distinct impression, and after listening to the tape again, that he was talking about for Apache Spark, which is separate from Databricks, that they would do some sort of key-value store. So in other words, when you look at competitors or quasi-competitors like Confluent with Kafka or data Artisans with Flink, they're not perfect competitors. They overlap some. Now Spark is pushing its way more into overlapping with some of those solutions. >> Alright. Well, Jim Kobielus. And thank you for that, George. You've been mingling with the masses today. (laughs) And you've been here all day as well. >> Educated masses, yeah, (David laughs) who are really engaged in this stuff, yes. >> Well, great, maybe give us some of your top takeaways after all the conversations you've had today. >> They're not all that dissimilar from George's. What Databricks, Databricks of course being the center, the developer, the primary committer in the Spark open source community. They've done a number of very important things in terms of the announcements today at this event that push Spark, the Spark ecosystem, where it needs to go to expand the range of capabilities and their deployability into production environments. 
I feel the deep-learning side, the announcement in terms of the deep-learning pipeline API, is very, very important. Now, as George indicated, Spark has been used in a fair number of deep-learning development environments. But not as a modeling tool so much as a training tool, a tool for in-memory distributed training of deep-learning models that were developed in TensorFlow, in Caffe, and other frameworks. Now this announcement is essentially bringing support for deep learning directly into the Spark modeling pipeline, the machine-learning modeling pipeline, being able to call out to deep learning, you know, TensorFlow and so forth, from within MLlib. That's very important. That means that Spark developers, of which there are many, far more than there are TensorFlow developers, will now have an easy path to bring more deep learning into their projects. That's critically important to democratize deep learning. I hope, and from what I've seen what Databricks has indicated, that they have support currently in API reaching out to both TensorFlow and Keras, that they have plans to bring in API support for access to other leading DL toolkits such as Caffe, Caffe 2, which is Facebook-developed, such as MXNet, which is Amazon-developed, and so forth. That's very encouraging. Structured Streaming is very important in terms of what they announced, which is an API to enable access to faster, or higher-throughput Structured Streaming in their cloud environment. And they also announced that they have gone beyond, in terms of the code that they've built, the micro-batch architecture of Structured Streaming, to enable it to evolve into a more true streaming environment to be able to contend credibly with the likes of Flink. 'Cause I think that the Spark community has, sort of, had their back against the wall with Structured Streaming that they couldn't fully provide a true sub-millisecond end-to-end latency environment heretofore. 
But it sounds like with this R&D that Databricks is addressing that, and that's critically important for the Spark community to continue to evolve in terms of continuous computation. And then the serverless-apps announcement is also very important, 'cause I see it as really being, it's a fully-managed multi-tenant Spark-development environment, as an enabler for continuous Build, Deploy, and Testing DevOps within a Spark machine-learning and now deep-learning context. The Spark community as it evolves and matures needs robust DevOps tools to production-ize these machine-learning and deep-learning models. Because really, in many ways, many customers, many developers are now using, or developing, Spark applications that are real 24-by-7 enterprise application artifacts that need a robust DevOps environment. And I think that Databricks has indicated they know where this market needs to go and they're pushing it with R&D. And I'm encouraged by all those signs. >> So, great. Well thank you, Jim. I hope both you gentlemen are looking forward to tomorrow. I certainly am. >> Oh yeah. >> And to you out there, tune in again around 10:00 a.m. Pacific Time. We're going to be broadcasting live here. From Spark Summit 2017, I'm David Goad with Jim and George, saying goodbye for now. And we'll see you in the morning. (sparse percussion music playing) (wind humming and waves crashing).

Published Date : Jun 7 2017

SUMMARY :

Announcer: Live from San Francisco, it's the CUBE to customers, to founders, technologists, data scientists. I'm going to ask you to maybe recap And that changes the type of apps you can build. And thank you for that, George. after all the conversations you've had today. for the Spark community to continue to evolve I hope both you gentlemen are looking forward to tomorrow. And to you out there, tune in again

ENTITIES

Entity	Category	Confidence
Jim Kobielus	PERSON	0.99+
Jim	PERSON	0.99+
George	PERSON	0.99+
David	PERSON	0.99+
David Goad	PERSON	0.99+
San Francisco	LOCATION	0.99+
Matei	PERSON	0.99+
tomorrow	DATE	0.99+
Amazon	ORGANIZATION	0.99+
Databricks	ORGANIZATION	0.99+
hundreds	QUANTITY	0.99+
Spark	TITLE	0.99+
both	QUANTITY	0.98+
Google	ORGANIZATION	0.98+
Intel	ORGANIZATION	0.98+
Spark Summit 2017	EVENT	0.98+
18 months	QUANTITY	0.98+
Flink	ORGANIZATION	0.97+
Facebook	ORGANIZATION	0.97+
Confluent Kafka	ORGANIZATION	0.97+
Caffe	ORGANIZATION	0.96+
today	DATE	0.96+
TensorFlow	TITLE	0.94+
three themes	QUANTITY	0.94+
10:00 a.m. Pacific Time	DATE	0.94+
CUBE	ORGANIZATION	0.94+
Deeplearning4j	TITLE	0.94+
Spark	ORGANIZATION	0.93+
1 millisecond	QUANTITY	0.93+
Keras	ORGANIZATION	0.91+
Day One	QUANTITY	0.81+
BigDL	TITLE	0.79+
TensorFlow	ORGANIZATION	0.79+
7	QUANTITY	0.77+
MLlib	TITLE	0.73+
Caffe 2	ORGANIZATION	0.7+
Caffe	TITLE	0.7+
24-	QUANTITY	0.68+
MXNet	ORGANIZATION	0.67+
Apache Spark	ORGANIZATION	0.54+

Matthew Hunt | Spark Summit 2017

>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, we're talking about data science and engineering at scale, and we're having a great time, aren't we, George? >> We are! >> Well, we have another guest now we're going to talk to, I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg, Matt, thanks for joining us! >> My pleasure. >> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark Summits have you been to? >> Almost all of them, actually, it's quite amazing to see the 10th one, yes. >> And you're pretty actively involved with the user group on the east coast? >> Matt: Yeah, I run the New York users group. >> Alright, well, what's that all about? >> We have some 2,000 people in New York who are interested in finding out what goes on, and which technologies to use, and what are people working on. >> Alright, so hopefully, you saw the keynote this morning with Matei? >> Yes. >> Alright, any comments or reactions from the things that he talked about as priorities? >> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming in advance, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people-- >> Well, the one millisecond today was kind of a wow one-- >> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. >> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down some of the details. >> Sure. 
And Bloomberg is a large company with 4,000-plus developers, we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news in the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there's some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess, trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, that's a natural application, another one would be regulatory, there's a regulatory system called MiFID, or MiFID II, the regulations required for Europe, you have to be able to record every trade for seven years, provide daily reports, there's clearly a lot around that, and then I would also just say, our other internal databases have significant analytics that can be done, which is just kind of scraping the surface. >> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far? >> I would definitely say that we have some things that are latency constrained, it tends to be not like high frequency trading, where you care about microseconds, but milliseconds are important, how long does it take to get an answer, but I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always. >> And so when you say coupled, is it because it's a trade-off, or 'cause you need both? >> Right, so it's a little bit of both, for a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often, greater efficiencies. 
Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other. >> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building? >> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one, but part of what's useful is, often when you sit down to try and determine what is the main cause of latency, you have to look at the full profile of a stack of what it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say, a lot of what the Databricks guys in the Spark community have worked on over the years is connected to that, Project Tungsten and so on, well, all these things that make things much slower, much less efficient than they need to be, and we can close that gap a lot, I would say that from the very beginning. >> This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take end-to-end control of continuous apps, for simplicity and performance, and so there's this, we'll write with transactional consistency, so we're assuring the customer of exactly-once semantics when we write to a file system or database or something like that. 
But, Spark has never really done native storage, whereas Matei came here on the show earlier today and said, "Well, Databricks as a company is going to have to do something in that area," and he talked specifically about databases, and he implied that Apache Spark, separate from Databricks, would also have to do more in state management, I don't know if he was saying key value store, but how would that open up a broader class of apps, how would it make your life simpler as a developer? >> Right. Interesting and great question, this is kind of a subject that's near and dear to my own heart, I would say. So part of that, when you take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted to be, which is a form of a universal computation engine. So there's a lot of value, if you can learn one small skillset, but it can work in a wide variety of use cases, whether it's streaming or at rest or analytics, and plug other things in. As always, there's a gap in any such system between theory and reality, and how much can you close that gap, but as for storage systems, this is something that, you and I have talked about this before, and I've written about it a fair amount too, Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need file store, you need a distributed file store, you need a database with generally transactional semantics because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and there's how close can you get to that, versus how much do you have to fit other parts that come together, very interesting question. 
>> So, so far, they've sort of outsourced that to DIY, do-it-yourself, but if they can find a sufficiently scalable relational database, they can do the sort of analytical queries, and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that, like Cassandra would be the, sort of the database that would handle all updates, and then some amount of those would be filtered through to a multi-model DBMS. When I say multi-model, I mean handles transactions and analytics. Knowing that you would have the option to drop that out, what applications would you undertake that you couldn't use right now, where the theme was, we're going to take big data apps into production, and then the competition that they show for streaming is of Kafka and Flink, so what does that do to that competitive balance? >> Right, so how many pieces do you need, and how well do they fit together is maybe the essence of that question, and people ask that all the time, and one of the limits has been, how mature is each piece, how efficient is it, and do they work together? And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options, so in the, Kafka, for example, has a lot of usage, and it seems to really be, the industry seems to be settling on that is what people are using for inbound streaming data, for ingest, I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases. 
So at Bloomberg, we actually developed our own relational databases that were designed for low latency and very high reliability, so we actually just open sourced that a few weeks ago, it's called ComDB2, and the reason we had to do that was the industry solutions at the time, when we started working on that, were inadequate to our needs, but we look at how long that took to develop for these other systems and think, that's really hard for someone else to get right, and so, if you need a database, which everyone does, how can you make that work better with Spark? And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming and integrating a database directly, although there's interesting possibilities with that too. How do you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take DataFrames, the core data structure in Spark, and give it transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributed key value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems, versus relational databases, which generally started as single image on single machine, and are moving towards federation and distribution, and there's been a lot of that with Postgres, for example. One of the questions would be, is it just knobs, or why don't they work well together? 
And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done, just as we have database query optimizers, why not, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is the very efficient copy of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing, as soon as you hop from one system to another, all of a sudden, you have the semantic computational expense, that's a problem, we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know, where should a query be executed and which operator should I push down, you need something that I think of as a meta-optimizer, and also, knowledge about the shape of the data, or statistics underlying, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable. 
The hardest thing of all, anywhere in any organization, is to get people working together, but the more people work together to enable these pieces, how do I efficiently work with databases, or have these better optimizations make streaming more mature, the more people can use it in practice, and that's why people develop software, is to actually tackle these real-world problems, so, I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segment, Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE, please stay with us, as Spark Summit 2017 will be back in a few moments.

Published Date : Jun 6 2017

SUMMARY :

covering Spark Summit 2017, brought to you by Databricks. it's quite amazing to see the 10th one, yes. and what are people working on. that you don't already know is coming in advance, and I would say that the one thing and then I want you to join in, George, you have to be able to record every trade for seven years, Has that been a constraint on what you can build so far? where you care about microseconds, On the other hand, if you have a really large volume and then you discover other inefficiencies and so there's this, we'll write and there's how close can you get to that, what applications would you undertake and so, if you need a database, which everyone does, and give it transactional semantics. it's just going to explode, because it lets you and I wanted to give you at least 30 seconds and that's why people develop software, Alright, thank you so much.

ENTITIES

Entity	Category	Confidence
George	PERSON	0.99+
Matt Hunt	PERSON	0.99+
Bloomberg	ORGANIZATION	0.99+
Matthew Hunt	PERSON	0.99+
Matt	PERSON	0.99+
Matei	PERSON	0.99+
New York	LOCATION	0.99+
San Francisco	LOCATION	0.99+
30 years	QUANTITY	0.99+
seven years	QUANTITY	0.99+
each piece	QUANTITY	0.99+
Databricks	ORGANIZATION	0.99+
one	QUANTITY	0.99+
one millisecond	QUANTITY	0.99+
5,000 skills	QUANTITY	0.99+
both	QUANTITY	0.99+
two	QUANTITY	0.99+
two things	QUANTITY	0.99+
One	QUANTITY	0.99+
Oracle	ORGANIZATION	0.99+
Spark	TITLE	0.98+
Europe	LOCATION	0.98+
Spark Summit 2017	EVENT	0.98+
DB2	TITLE	0.98+
200 different products	QUANTITY	0.98+
Spark Summits	EVENT	0.98+
Spark Summit	EVENT	0.98+
today	DATE	0.98+
one system	QUANTITY	0.97+
next year	DATE	0.97+
4,000-plus developers	QUANTITY	0.97+
first	QUANTITY	0.96+
HBase	ORGANIZATION	0.95+
second	QUANTITY	0.94+
decades	QUANTITY	0.94+
MiFID II	TITLE	0.94+
one topic	QUANTITY	0.92+
this morning	DATE	0.92+
single machine	QUANTITY	0.91+
One of	QUANTITY	0.91+
ComDB2	TITLE	0.9+
few weeks ago	DATE	0.9+
Cassandra	PERSON	0.89+
earlier today	DATE	0.88+
10th one	QUANTITY	0.88+
2,000 people	QUANTITY	0.88+
one thing	QUANTITY	0.87+
Kafka	TITLE	0.87+
single image	QUANTITY	0.87+
MiFID	TITLE	0.85+
Spark	ORGANIZATION	0.81+
Splice Machine	TITLE	0.81+
Project Tungsten	ORGANIZATION	0.78+
theCUBE	ORGANIZATION	0.78+
at least 30 seconds	QUANTITY	0.77+
Cassandra	ORGANIZATION	0.72+
Apache Spark	ORGANIZATION	0.71+
questions	QUANTITY	0.7+
things	QUANTITY	0.69+
Apache Arrow	ORGANIZATION	0.69+
SnappyData	TITLE	0.66+

Matei Zaharia, Databricks - #SparkSummit - #theCUBE

>> Narrator: Live from San Francisco, it's theCUBE. Covering Spark Summit 2017, brought to you by Databricks. (upbeat music) >> Welcome back to Spark Summit 2017, you're watching theCUBE and we have an honored guest here today, his name is Matei Zaharia and Matei is the creator of Spark, Chief Technologist, and Co-Founder of Databricks, did I get all that right? >> Yeah, thanks a lot for having me again. Excited to be here. >> Yeah Matei we were watching your keynote this morning and we're all excited to hear about better support for deep learning, about some of the structured streaming apps now being in production. I want to ask you what happened after the keynote? What kind of feedback have you heard from people in the hallways? >> Yeah definitely, so the feedback has definitely been super positive. I think people really like the direction that we're moving in with Apache Spark and with these libraries, such as the deep learning pipeline one. So we've gotten a lot of questions about the deep learning library, when will it support more types and so on. It's really good at supporting images right now. And also with streaming, I think people are just excited to try out the low latency streaming. >> Any other priorities people asked you about that maybe you haven't focused on yet? >> That I haven't focused on in the keynote, so I think that's a good question, I think overall some of the things we keep seeing are people just want to make it easier to just operate Spark at scale and simplify things like monitoring and debugging and so on, so that's a constant theme that we're seeing. And then another thing that's generally been going on, I didn't focus on it this time, is increasing usage by Python and R users. So there's a lot of work in the latest release to continue improving that, to make it easier to use in those languages.
>> Okay, we were watching the demo, the impressive demos, this morning, in fact George was watching the keynote, he saw the one millisecond latency, he said wow. George, you want to ask a little more about that? >> So yeah let's talk about, 'cause there's this rise of continuous apps, which I think you guys named. >> Matei: Yeah. >> And resonates with everyone to go along with batch and request response. And in the past, so people were saying, well Spark was doing many micro-batches, latency was a couple hundred milliseconds. So now that you're down at one millisecond, what does that change in terms of the class of apps that you're appropriate for or you know, some people have talked about criticality of event processing. Where is Spark on that now? >> Yeah definitely, so yeah, so the goal of this is exactly to support the full range of latency possible, all the way down to sub-millisecond latency. And give users the same programming model for them so they don't have to use a different system or a lower level programming model to get that low latency. And so basically since we began structured streaming, we moved, we tried to make sure the API is not tied in with micro-batching in any way. And so this is the next step to actually eliminate that from the engine and be able to execute these computations. And what are the new applications? So I think this really enables two types of things we've seen. One is kind of automated decision making systems, so this would be something, it could be even on say, a website or you know, say when someone's applying for a loan or something like that, could be making decisions but it could even be an even lower latency, like say stock market style of place or internet of things, or like industrial monitoring, and making decisions there. That's one thing.
And then the other thing we see people doing is a lot of kind of stream to stream ETL, which is a bit more boring in some way, but as you set that up, it's nice to have these very low latency transformations that can produce new streams from an existing one, because then nothing downstream from them is affected in terms of latency. >> So in this last example, it's sort of to help build microservice type applications. >> Yeah, exactly, yeah. Well in general, there's this whole, basically this whole architecture of saying all my data will be streamed and then I'll have some applications that just produce a new stream. And then later that stuff can go into a data lake or into a real time system or whatever. So it's basically keeping it low latency while it remains in stream form. >> So we were talking earlier and we've been talking to the SnappyData folks and the Splice Machine folks. And they built Spark into a DBMS. So that like it's immutable. I'm sorry, mutable. >> Matei: Mutable, yeah. >> Like a data frame is updateable. So what does that make possible, even if you can do the same things with Spark, without it? What does it make easier? >> So that's also in the same spirit of continuous applications, it's saying you should have a single programming model and interface for doing both your transactional work and your analytics after, and then maybe serving the results of the analytics. So that makes a lot of sense and an example of that would be, you know, I keep going back to say the financial or credit card type of use cases, but it would be something where users are conducting transactions and maybe you learn stuff about them from that. You say okay, here's where they're located, now here's what they're purchasing, whatever. And then you also want to know, I'll have to make a decision. For example, do I allow them to go past the limit on their credit card or something like that. Or is this a normal use of it or is this a fraudulent one?
So that's where it helps to integrate these and you can do these things. So there are products like SnappyData that integrate a specific database with Spark. And we're also trying to make sure in Spark, the API, so that people can integrate their own system, whatever database or key value store they want. >> So would you have to jump through hoops if you didn't want to integrate any other store other than talking to a file system, or? >> Yeah if you want to do these transactions on a file system, there will be basically some performance constraints to doing that. It depends on the rate, it's definitely the simplest thing and if you have a low enough rate of updates it could actually be fine. But if you want more fine grained ones, then it becomes a problem. >> It would seem like if you tack on a product for ingest, not that you really want to get into that, think Kafka, which could also stretch into the transforms and some basic analytics. And you mentioned, I think on the Spark East keynote, Redis for serving, you've got like now a multi sort of vendor product stack. And so there's complexity to that. >> Matei: Yeah definitely yeah. >> Do you foresee a scenario where you could see that as a high volume solution and it's something that you would take ownership of? >> I see, so well, do you mean from the Apache Spark side or from the Databricks side? >> George: Actually either. >> Yeah so I think from the Spark side, basically so far the project doesn't provide storage, it just provides computation and it plugs into different storage engines. And so it would be kind of a big shift, it might be possible, but it would be kind of a big shift to say, okay, we'll also provide persistent storage. I think the more likely thing that will happen is better and better integrations with the most widely used open source storage systems. So Redis is one. Apache Kafka, there's a lot of work on integrating that better and so on.
From the Databricks side, that is different because that is a fully managed cloud service and it definitely makes sense there that you'd have a turnkey solution for that. Right now we actually built that for people who want that we can build it, sometimes with other vendors or with just services built into Amazon, but that makes a lot of sense. >> And Matei, something I read a press release on, but I didn't hear it in the keynote this morning. I hate to steal thunder from tomorrow, but can you give us a sneak preview on serverless apps? What's that about? >> Yeah, so this is actually we put out a press release today and we'll actually put out, well we'll have a full keynote tomorrow morning and also a lot more details on our website. So this is Databricks Serverless. It's basically a serverless platform for running Apache Spark and data science. So not to steal away too much thunder, but you know serverless computing is this idea of users can just submit a query or computation, they don't have to configure the hardware at all and they just get high performance and they get results. And so far it's been very successful with stateless workloads such as SQL or Amazon Lambda, which is you know just functions serving a webpage or something like that. So this is going to be the first offering that actually extends that model to data science and in general to Spark workloads. So you can have machine learning users, you can have these streaming applications, all these things, on that kind of environment. So yeah, we'll have a lot more detail on that tomorrow, it's something that we're excited about. >> I want to circle back to IoT apps. You know there's sort of, beyond an emerging consensus, that we're going to do a lot of training in the cloud 'cause we have access to big compute and lots of data.
But then the issue on the edge, in the near to medium term, the footprint, like a lot of people are telling us high volume devices will have 3 megs of memory and a gateway server would have like two gigs and two cores. So can you carve Spark up into fitting on one of the... >> That's a good question, I think for that, it's again, the most likely way that would happen is through data sources. For example, there are these projects like Apache NiFi and other projects as well that let you build up a data pipeline from IoT devices all the way to the cloud. And you can imagine some computation through those. So I think, yeah I don't have a very concrete answer, I think here it is something that's coming up a bunch though, so we do want to support this type of like splitting the computation. >> But in terms of splitting the computation, you could take a trained model, model training is fat compute and then the trained model-- >> You can definitely push the model and do inference. >> Would that inference thing have to happen in a Spark run time or could it be somewhere? >> I think it could happen anywhere else also. And actually like we do see a lot of people wanting to export basically machine learning pipelines or models from Spark into another environment. So it can happen somewhere else too. Yeah and then the other aspect of it is also data collection. So if you can push something that says here is when the data is exciting, like when the data is interesting you should remember these and send them on. That would also help, because otherwise you know, say it's like a video camera or something, most of the time it's looking at nothing. I mean you don't want to send all that back. >> That's actually a key point, which is some folks like especially in the IT ops area where you know, training wheels for IoT 'cause they're doing machine learning on infrastructure. >> Matei: Yeah which is there.
>> Yeah, they say oh anything outside two standard deviations of the band of expectations, but there's more of an answer to that, I gather, from what you're saying. >> Yeah I mean I think you can create, for example, you can create a small machine learning model that decides whether what it's seeing is unusual and sends it back or you can even make it query specific, like you can count, like I want to find this type of object that's going by the camera. And try to find that. So I think there's a lot of room to improve that. >> Okay, well we have just a couple of minutes left here, want to draw into the future a little bit. And there's been some great progress since the summit last year to this one. What would you say is the next boundary that needs to be pushed to get Spark to the next level, whatever that may be? >> Yeah definitely yeah, well okay so again on the, so first of all in terms of the project today I think the big workloads that we are seeing come up all the time are deep learning and stream processing. These are the big emerging ones. I mean there's still a lot of data warehousing, ETL and so on, that's still there. But these are the new ones, so that's what we're focusing on on our team at least. And we'll continue building out the stuff that you saw announced today. I think beyond that, I do think that part of the problem and this is more on the Databricks side, part of the problem is also just making it much easier for teams or businesses to begin using these technologies at all. And that's where we think cloud computing or software as a service is the way because you just turn it on and you can immediately start doing things. But that's basically, the way that I view that, is right now the barrier to do any project with data science or machine learning, or even like simple kind of analytics on unstructured data, the barrier is really high. So companies can only do it on a few projects.
There might be like 100 things they could be trying, but they can only afford to spend on two or three of them. So if you lower that barrier, there'll be a lot more of them and everyone will be able to quickly try one of these applications and see whether it actually works. >> And this ties into some of your graduate studies, like with model management and things like that? >> Yeah, so on the research side. So I'm also you know, doing research at Stanford and on that side we have this lab called DAWN, which is about usable machine learning. It's exactly these things. Like how do you enable an order of magnitude more people to try to do things with machine learning. So actually we're also doing the video push down thing I mentioned, that's one thing we're looking at. A bunch of other stuff as well. >> Matei we could talk to you all day, but we don't have all day. We're up against the break here, but I wanted to thank you very much for coming and sharing a few moments here and look forward to seeing you in the hallways here at Spark right? >> Yeah thanks again for having me. >> Thanks for joining us and thank you all for watching, here we are at theCUBE at Spark 2017, thanks for watching. (upbeat music)
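
The edge-filtering idea raised in the interview, where a device forwards only readings that fall outside two standard deviations of what it has seen so far, can be sketched in a few lines of Python. This is an illustrative sketch only: it is not code shown in the interview and it is not part of Spark. The function name `unusual_readings` and the `warmup` and `threshold` parameters are hypothetical, and a running mean and standard deviation stand in for the "small machine learning model" Matei mentions.

```python
from statistics import mean, stdev


def unusual_readings(readings, warmup=5, threshold=2.0):
    """Yield (index, value) for readings worth sending upstream.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of all readings seen before it.
    `warmup` and `threshold` are hypothetical tuning knobs.
    """
    history = []
    for index, value in enumerate(readings):
        # Only judge a reading once enough history exists for a baseline.
        if len(history) >= warmup:
            baseline = mean(history)
            spread = stdev(history)
            # A spread of zero means no variation yet, so nothing is flagged.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                yield index, value
        # Every reading, flagged or not, becomes part of the baseline.
        history.append(value)


if __name__ == "__main__":
    sensor = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0]
    for index, value in unusual_readings(sensor):
        print(f"forward reading {index}: {value}")
```

Keeping the whole history in a list is the simplest choice for a sketch; a device with only a few megabytes of memory would more plausibly keep running sums instead.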

Published Date : Jun 6 2017

SUMMARY :

Covering Spark Summit2017, brought to you by Databricks. Excited to be here. I want to ask you what happened after the keynote? Yeah definitely, so the feedback has definitely That I haven't focused on in the keynote, George, you want to ask a little more about that? of continuous apps, which I think you guys named. And in the past, so people were saying, And so this is the next step to actually eliminate So in this last example, it's sort of to help build So it's basically keeping it low latency So that like it's immutable. even if you can do the same things with Spark, And then you also want to know, the simplest thing and if you have a low for ingest, not that you really want to get into that, and it definitely makes sense there that'd you have I hate to steal thunder from tomorrow, but can you give us So you can have machine learning users, So can you carve Spark up into fitting on And you can imagine some computation through those. You can definitely push the model So if you can push something that says like especially in the IT ops area where you know, but there's more of an answer to that, I gather, Yeah I mean I think you can create, for example, What would you say is the next boundary So if you lower that barrier, there'll be a lot So I'm also you know, doing research at Stanford and look forward to seeing you in the hallways Thanks for joining us and thank you all for watching,

ENTITIES

Entity	Category	Confidence
George	PERSON	0.99+
Matei	PERSON	0.99+
Matei Zaharia	PERSON	0.99+
one millisecond	QUANTITY	0.99+
two gigs	QUANTITY	0.99+
Databricks	ORGANIZATION	0.99+
3 megs	QUANTITY	0.99+
two cores	QUANTITY	0.99+
tomorrow morning	DATE	0.99+
three	QUANTITY	0.99+
today	DATE	0.99+
tomorrow	DATE	0.99+
100 things	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
Python	TITLE	0.99+
Spark	TITLE	0.99+
last year	DATE	0.99+
two	QUANTITY	0.98+
San Francisco	LOCATION	0.98+
Spark Summit 2017	EVENT	0.98+
two types	QUANTITY	0.98+
Spark	ORGANIZATION	0.98+
One	QUANTITY	0.98+
both	QUANTITY	0.98+
Apache	ORGANIZATION	0.97+
Stanford	ORGANIZATION	0.97+
first offering	QUANTITY	0.97+
one thing	QUANTITY	0.96+
this morning	DATE	0.96+
couple hundred milliseconds	QUANTITY	0.95+
Lambda	TITLE	0.94+
Spark Summit2017	EVENT	0.93+
one	QUANTITY	0.89+
two standard	QUANTITY	0.87+
#theCUBE	ORGANIZATION	0.81+
single programming model	QUANTITY	0.8+
Databricks	PERSON	0.78+
R	TITLE	0.78+
Snappy Data	ORGANIZATION	0.77+
of minutes	QUANTITY	0.67+
first	QUANTITY	0.66+
Spark East	ORGANIZATION	0.63+
Kafka	TITLE	0.62+
Apache Spark	TITLE	0.61+
Sequel	TITLE	0.6+
Spark 2017	EVENT	0.58+
Narrator:	TITLE	0.57+
theCUBE	ORGANIZATION	0.56+
Redis	TITLE	0.55+
Redis	ORGANIZATION	0.5+
theCUBE	TITLE	0.46+
#SparkSummit	TITLE	0.35+

Stephanie McReynolds, Alation & Lee Paries, Think Big Analytics - #BigDataSV - #theCUBE

>> Voiceover: San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (techno music) >> Hey, welcome back everyone. Live in Silicon Valley for Big Data SV. This is theCUBE coverage in conjunction with Strata + Hadoop. I'm John Furrier with George Gilbert at Wikibon. Two great guests. We have Stephanie McReynolds, Vice President of startup Alation, and Lee Paries who is the VP of Think Big Analytics. Thanks for coming back. Both been on theCUBE, you have been on theCUBE before, but Think Big has been on many times. Good to see you. What's new, what are you guys up to? >> Yeah, excited to be here and to be here with Lee. Lee and I have a personal relationship that goes back quite a ways in the industry. And then what we're talking about today is the integration between Kylo, which was recently announced as an open source project from Think Big, and Alation's capability to sit on top of Kylo and together increase the velocity of data lake initiatives, kind of going from zero to 60 in a pretty short amount of time to get both technical value from Kylo and business value from Alation. >> So talk about Alation's traction, because you guys have been an interesting startup, a lot of great press. George is a big fan. He's going to jump in with some questions, but some good product fit with the market. What's the update? What's some of the status on the traction in terms of the company and customers and whatnot? >> Yeah, we've been growing pretty rapidly for a startup. We've doubled our production customer count from last time we talked. Some great brand names. Munich Reinsurance this morning was talking about their implementation. So they have 600 users of Alation in their organization. We've entered Europe, not only with Munich Reinsurance but Tesco is a large account of ours in Europe now.
And here in the States we've seen broad adoption across a wide range of industries, everyone from Pfizer in the healthcare space to eBay, who's been our longest standing customer. They have about 1,000 weekly users on Alation. So not only a great increase in number of logos, but also organic growth internally at many of these companies across data scientists, data analysts, business analysts, a wide range of users of the product, as well. >> It's been interesting. What I like about your approach, and we talk about Think Big about it before, we let every guest come in so far that's been in the same area is talking about metadata layers, and so this is interesting, there's a metadata data addressability if you will for lack of a better description, but yet human usable has to be integrating into human processes, whether it's virtualization, or any kind of real time app or anything. So you're seeing this convergence between I need to get the data into an app, whether it's IoT data or something else, really really fast, so really kind of the discovery pieces now, the interesting layer, how competitive is it, and what's the different solutions that you guys see in this market? >> Yeah, I think it's interesting, because metadata has kind of had a revival, right? Everyone is talking about the importance of metadata and open integration with metadata. I think really our angle as Alation is that having open transfer of technical metadata is very important for the foundation of analytics, but what really brings that technical metadata to life is also understanding what is the business context of what's happening technically in the system? What's the business context of data? What's the behavioral context of how that data has been used that might inform me as an analyst? >> And what's your unique approach to that? Because that's like the Holy Grail. It's like translating geek metadata, indexing stuff into like usable business outcomes.
It's been a cliche for years, you know. >> The approach is really based on machine learning and AI technology to make recommendations to business users about what might be interesting to them. So we're at a state in the market where there is so much data that is available and that you can access, either in Hadoop as a data lake or in a data warehouse in a database like Teradata, that today what you need as state of the art is the system to start to recommend to you what might be interesting data for you to use as a data scientist or an analyst, and not just what's the data you could use, but how accurate is that data, how trustworthy is it? I think there's a whole nother theme of governance that's rising that's tied to that metadata discussion, which is it's not enough to just shove bits and bytes between different systems anymore. You really need to understand how has this data been manipulated and used and how does that influence my security considerations, my privacy considerations, the value I'm going to be able to get out of that data set? >> What's your take on this, 'cause you guys have a relationship. How is Think Big doing? Then talk about the partnership you guys have with Alation. >> Sure, so I mean when you look at what we've done specifically to an open source project it's the first one that Teradata has fully sponsored and released based on Apache 2.0 called Kylo, it's really about the enablement of the full data lake platform and the full framework, everywhere from ingest, to securing it, to governing it, which part of that is collecting is part of that process, the basic technical and business metadata so later you can hand it over to the user so they could sample, they could profile the data, they can find, they can search in a Google like manner, and then you can enable the organization with that data. 
So when you look at it from a standpoint of partnering together, it's really about collecting that data specifically within Hadoop to enable it, yet with the ability then to hand it off to more the enterprise wide solution like Alation through API connections that connect to that, and then for them they enrich it in a way that they go about it with the social collaboration and the business to extend it from there. >> So that's the accelerant then. So you're accelerating the open source project in through this new, with Alation. So you're still going to rock and roll with the open source. >> Very much going to rock and roll with the open source. So it's really been based on five years of Think Big's work in the marketplace over about 150 data lakes. The IP we've built around that to do things repeatedly, consistently, and then releasing that in the last two years, dedicated development based on Apache Spark and NiFi to stand that out. >> Great work by the way. Open source continues to be more relevant. But I got to get your perspective on a meme that's been floating around day one here, and maybe it's because of the election, but someone said, "We got to drain the data swamp, and make data great again." And not a play on Trump, but the data lake is going through a transition and saying, "Okay, we've got data lakes," but now this year it's been a focus on making that much more active and cleaner and making sure it doesn't become a swamp if you will. So there's been a focus of taking data lake content and getting it into real time, and IoT has kind of I think been a forcing function. But you guys, do you guys have a perspective on that on where data lakes are going? Certainly it's been trending conversation here at the show. >> Yeah, I think IoT has been part of draining that data swamp, but I think also now you have a mass of business analysts that are starting to get access to that data in the lake.
These Hadoop implementations are maturing to the stage where you have-- >> John: To value coming out of it. >> Yeah, and people are trying to wring value out of that lake, and sometimes finding that it is harder than they expected because the data hasn't been pre-prepared for them. This old world of IT would pre-prepare the data, and then I got a single metric or I got a couple metrics to choose from is now turned on its head. People are taking a more exploratory, discovery oriented approach to navigating through their data and finding that the nuances of data really matter when trying to evolve an insight. So the literacy in these organizations and their awareness of some of the challenges of a lake are coming to the forefront, and I think that's a healthy conversation for us all to have. If you're going to have a data driven organization, you have to really understand the nuances of your data to know where to apply it appropriately to decision making. >> So (mumbles) actually going back quite a few years when he started at Microsoft said, Internet software has changed paradigm so much in that we have this new set of actions where it was discover, learn, try, buy, recommend, and it sounds like as a consumer of data in a data lake we've added or prepended this discovery step. Where in a well curated data warehouse it was learn, you had your X dimensions that were curated and refined, and you don't have that as much with the data lake. I guess I'm wondering, it's almost like if you're going to take, as we were talking to the last team with AtScale and moving OLAP to be something you consume on a data lake the way you consume on a data warehouse, it's almost like Alation and a smart catalog is as much a requirement as a visualization tool is by itself on a data warehouse?
>> I think what we're seeing is this notion of data needing to be curated, and including many brains and many different perspectives in that curation process is something that's defining the future of analytics and how people use technical metadata, and what does it mean for the devops organization to get involved in draining that swamp? That means not only looking at the elements of the data that are coming in from a technical perspective, but then collaborating with the business to curate the value on top of that data. >> So in other words it's not just to help the user, the business analyst, navigate, but it's also to help the operational folks do a better job of curating once they find out who's using it, who's using the data and how. >> That's right. They kind of need to know how this data is going to be used in the organization. The volumes are so high that they couldn't possibly curate every bit and byte that is stored in the data lake. So by looking at how different individuals in the organization and different groups are trying to access that data that gives early signal to where should we be spending more time or less time in processing this data and helping the organization really get to their end goals of usage. >> Lee, I want to ask you a question. On your blog post, I just was pointed out earlier, you guys quote a Gartner stat which says, which is pretty doom and gloom, which said, "70% of Hadoop deployments in 2017 will either fail to deliver their estimated cost savings or their predicted revenue." And then it says, "That's a dim view, but not shared by the Kylo community." How are you guys going to make the Kylo data lake software work well? What's your thoughts on that? Because I think people, that's the number one, again, question that I highlighted earlier is okay, I don't want a swamp, so that's fear, whether they get one or not, so they worry about data cleansing and all these things.
So what's Kylo doing that's going to accelerate, or lower that number of fails in the data lake world? >> Yeah sure, so again, a lot of it's through experience of going out there and seeing what's done. A lot of people have been doing a lot of different things within the data lakes, but when you go in there there's certain things they're not doing, and then when you're doing them it's about doing them over consistently and continually improving upon that, and that's what Kylo is, it's really a framework that we keep adding to, and as the community grows and other projects come in that can enhance it we bring the value. But a lot of times when we go in, it's basically end users can't get to the data, either one because they're not allowed to because maybe it's not secured and relied to turn it over to them and let them drive with it, or they don't know the data is there, which goes back to basic collecting the basic metadata and data (mumbles) to know it's there to leverage it. So a lot of times it's going back and looking at and leveraging what we have to build that solid foundation so IT and operations can feel like they can hand that over in a template format so business users could get to the data and start acting off of that. >> You just lost your mic there, but Stephanie, I got to ask you a question. So just on a point of clarification, so you guys, are you supporting Kylo? Is that the relationship, or how does that work? >> So we're integrated with Kylo. So Kylo will ingest data into the lake, manage that data lake from a security perspective giving folks permissions, enables some wrangling on that data, and what Alation is receiving then from Kylo is that technical metadata that's being created along that entire path. >> So you're certified with Kylo? How does that all work from the customer standpoint? >> That's a very much integration partnership that we'd be working together.
>> So from a customer standpoint it's clean and you then provide the benefits on the other side? >> Correct. >> Yeah, absolutely. We've been working with data lake implementations for some time, since our founding really, and I think this is an extension of our philosophy that the data lakes are going to play an important role that are going to complement databases and analytics tools, business intelligence tools, and the analytics environment, and the open source is part of the future of how folks are building these environments. So we're excited to support the Kylo initiative. We've had a longstanding relationship with Teradata as a partner, so it's a great way to work together. >> Thanks for coming on theCUBE. Really appreciate it, and thank... What do you think of the show you guys so far? What's the current vibe of the show? >> Oh, it's been good so far. I mean, it's one day into it, but very good vibe so far. Different topics and different things-- >> AI, machine learning. You couldn't be happier with that machine learning-- >> Great to see machine learning taking a forefront, people really digging into the details around what it means when you apply it. >> Stephanie, thanks for coming on theCUBE, really appreciate it. More CUBE coverage after the show break. Live from Silicon Valley, I'm John Furrier with George Gilbert. We'll be right back after this short break. (techno music)

Published Date : Mar 15 2017

ENTITIES

Entity	Category	Confidence
Stephanie McReynolds	PERSON	0.99+
George Gilbert	PERSON	0.99+
Europe	LOCATION	0.99+
Stephanie	PERSON	0.99+
Lee	PERSON	0.99+
Tesco	ORGANIZATION	0.99+
Lee Paries	PERSON	0.99+
George	PERSON	0.99+
Trump	PERSON	0.99+
2017	DATE	0.99+
John	PERSON	0.99+
Pfizer	ORGANIZATION	0.99+
five years	QUANTITY	0.99+
Microsoft	ORGANIZATION	0.99+
Think Big	ORGANIZATION	0.99+
John Furrier	PERSON	0.99+
70%	QUANTITY	0.99+
San Jose, California	LOCATION	0.99+
Alation	ORGANIZATION	0.99+
Teradata	ORGANIZATION	0.99+
Think Big Analytics	ORGANIZATION	0.99+
Silicon Valley	LOCATION	0.99+
Gartner	ORGANIZATION	0.99+
zero	QUANTITY	0.99+
Kylo	ORGANIZATION	0.99+
60	QUANTITY	0.99+
600 users	QUANTITY	0.98+
AtScale	ORGANIZATION	0.98+
eBay	ORGANIZATION	0.98+
Google	ORGANIZATION	0.98+
today	DATE	0.98+
first one	QUANTITY	0.98+
Hadoop	TITLE	0.98+
Both	QUANTITY	0.98+
both	QUANTITY	0.97+
Two great guests	QUANTITY	0.97+
this year	DATE	0.97+
about 1,000 weekly users	QUANTITY	0.97+
one day	QUANTITY	0.95+
single metric	QUANTITY	0.95+
Apache Spark	ORGANIZATION	0.94+
Kylo	TITLE	0.93+
Wikibon	ORGANIZATION	0.93+
NiFi	ORGANIZATION	0.92+
about 150 data lakes	QUANTITY	0.92+
Apache 2.0	TITLE	0.89+
this morning	DATE	0.88+
couple	QUANTITY	0.86+
Big Data Silicon Valley 2017	EVENT	0.84+
day one	QUANTITY	0.83+
Vice President	PERSON	0.81+
Strata	TITLE	0.77+
Kylo	PERSON	0.77+
#theCUBE	ORGANIZATION	0.76+
Big Data	ORGANIZATION	0.75+
last two years	DATE	0.71+
one	QUANTITY	0.7+
Munich Reinsurance	ORGANIZATION	0.62+
CUBE	ORGANIZATION	0.52+

Frederick Reiss, IBM STC - Big Data SV 2017 - #BigDataSV - #theCUBE

>> Narrator: Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (upbeat music) >> Big Data SV 2017, day two of our wall to wall coverage of Strata Hadoop Conference, Big Data SV, really what we call Big Data Week because this is where all the action is going on down in San Jose. We're at the historic Pagoda Lounge in the back of the Fairmont, come on by and say hello, we've got a really cool space and we're excited and never been in this space before, so we're excited to be here. So we got George Gilbert here from Wikibon, we're really excited to have our next guest, he's Fred Reiss, he's the chief architect at IBM Spark Technology Center in San Francisco. Fred, great to see you. >> Thank you, Jeff. >> So I remember when Rob Thomas, we went up and met with him in San Francisco when you guys first opened the Spark Technology Center a couple of years ago now. Give us an update on what's going on there, I know IBM's putting a lot of investment in this Spark Technology Center in the San Francisco office specifically. Give us kind of an update of what's going on. >> That's right, Jeff. Now we're in the new Watson West building in San Francisco on 505 Howard Street, colocated, we have about a 50-person development organization. Right next to us we have about 25 designers and on the same floor a lot of developers from Watson doing a lot of data science, from the Weather Underground, doing weather and data analysis, so it's a really exciting place to be, lots of interesting work in data science going on there. >> And it's really great to see how IBM is taking the core Watson, obviously enabled by Spark and other core open source technology and now applying it, we're seeing Watson for Health, Watson for autonomous vehicles, Watson for Marketing, Watson for this, and really bringing that type of machine learning power to all the various verticals in which you guys play. 
>> Absolutely, that's been what Watson has been about from the very beginning, bringing the power of machine learning, the power of artificial intelligence to real world applications. >> Jeff: Excellent. >> So let's tie it back to the Spark community. Most folks understand how Databricks builds out the core or does most of the core work for, like, the SQL workload, the streaming and machine learning, and I guess graph is still immature. We were talking earlier about IBM's contributions in helping to build up the machine learning side. Help us understand what the Databricks core technology for machine learning is and how IBM is building beyond that. >> So the core technology for machine learning in Apache Spark comes out, actually, of the machine learning department at UC Berkeley as well as a lot of different members from the community. Some of those community members also work for Databricks. We actually at the IBM Spark Technology Center have made a number of contributions to the core Apache Spark and the libraries, for example recent contributions in neural nets. In addition to that, we also work on a project called Apache System ML, which used to be proprietary IBM technology, but the IBM Spark Technology Center has turned System ML into Apache System ML, it's now an open Apache incubating project that's been moving forward out in the open. You can now download the latest release online and that provides a piece that we saw was missing from Spark and a lot of other similar environments, an optimizer for machine learning algorithms. So in Spark, you have the Catalyst optimizer for data analysis, data frames, SQL, you write your queries in terms of those high level APIs and Catalyst figures out how to make them go fast. 
In System ML, we have an optimizer for high level languages like R and Python where you can write algorithms in terms of linear algebra, in terms of high level operations on matrices and vectors and have the optimizer take care of making those algorithms run in parallel, run at scale, taking account of the data characteristics. Does the data fit in memory, and if so, keep it in memory. Does the data not fit in memory? Stream it from disk. >> Okay, so there was a ton of stuff in there. >> Fred: Yep. >> And if I were to refer to that as so densely packed as to be a black hole, that might come across wrong, so I won't refer to that as a black hole. But let's unpack that, so the, and I meant that in a good way, like high bandwidth, you know. >> Fred: Thanks, George. >> Um, so the traditional Spark, the machine learning that comes with Spark's MLlib, one of its distinguishing characteristics is that the models, the algorithms that are in there, have been built to run on a cluster. >> Fred: That's right. >> And very few have, very few others have built machine learning algorithms to run on a cluster, but as you were saying, you don't really have an optimizer for finding something where a couple of the algorithms would be fit optimally to solve a problem. Help us understand, then, how System ML solves a more general problem for, say, ensemble models and for scale out, I guess I'm, help us understand how System ML fits relative to Spark's MLlib and the more general problems it can solve. >> So, MLlib and a lot of other packages such as Sparkling Water from H2O, for example, provide you with a toolbox of algorithms and each of those algorithms has been hand tuned for a particular range of problem sizes and problem characteristics. This works great as long as the particular problem you're facing as a data scientist is a good match to that implementation that you have in your toolbox. 
What System ML provides is less like having a toolbox and more like having a machine shop. You can, you have a lot more flexibility, you have a lot more power, you can write down an algorithm as you would write it down if you were implementing it just to run on your laptop and then let the System ML optimizer take care of producing a parallel version of that algorithm that is customized to the characteristics of your cluster, customized to the characteristics of your data. >> So let me stop you right there, because I want to use an analogy that others might find easy to relate to for all the people who understand SQL and scale out SQL. So, the way you were describing it, it sounds like oh, if I were a SQL developer and I wanted to get at some data on my laptop, I would find it pretty easy to write the SQL to do that. Now, let's say I had a bunch of servers, each with its own database, and I wanted to get data from each database. If I didn't have a scale out database, I would have to figure out physically how to go to each server in the cluster to get it. What I'm hearing for System ML is it will take that query that I might have written on my one server and it will transparently figure out how to scale that out, although in this case not queries, machine learning algorithms. >> The database analogy is very apt. SQL and query optimization, by allowing you to separate that logical description of what you're looking for from the physical description of how to get at it, let you have a parallel database with the exact same language as a single-machine database. In System ML, because we have an optimizer that separates that logical description of the machine learning algorithm from the physical implementation, we can target a lot of parallel systems, we can also target a large server and the code, the code that implements the algorithm stays the same. >> Okay, now let's take that a step further. 
You refer to matrix math and I think linear algebra and a whole lot of other things that I never quite made it to since I was a humanities major but when we're talking about those things, my understanding is that those are primitives that Spark doesn't really implement so that if you wanted to do neural nets, which relies on some of those constructs for high performance, >> Fred: Yes. >> Then, um, that's not built into Spark. Can you get to that capability using System ML? >> Yes. System ML, at its core, provides you with a library, provides you as a user with a library of machine, rather, linear algebra primitives, just like a language like R or a library like NumPy gives you matrices and vectors and all of the operations you can do on top of those primitives. And just to be clear, linear algebra really is the language of machine learning. If you pick up a paper about an advanced machine learning algorithm, chances are the specification for what that algorithm does and how that algorithm works is going to be written in the paper literally in linear algebra and the implementation that was used in that paper is probably written in a language where linear algebra is built in, like R, like NumPy. >> So it sounds to me like Spark has done the work of sort of the blocking and tackling of machine learning to run in parallel. And that's I mean, to be clear, since we haven't really talked about it, that's important when you're handling data at scale and you want to train, you know, models on very, very large data sets. But it sounds like when we want to go to some of the more advanced machine learning capabilities, the ones that today are making all the noise with, you know, speech to text, text to speech, natural language understanding, those neural network based capabilities are not built into the core Spark MLlib, that, would it be fair to say you could start getting at them through System ML? 
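[Editor's note: Fred's remark that linear algebra is the language of machine learning can be made concrete with a minimal sketch. This is not System ML or Spark code. It is plain Python with no libraries, and the helper names (transpose, matvec, fit_least_squares), the learning rate, and the sample data are all assumptions chosen for illustration. It fits a straight line by gradient descent using the update w = w - lr * X^T (X w - y) / n, which a language with built-in linear algebra would express in roughly one line.]

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def fit_least_squares(X, y, lr=0.1, steps=2000):
    """Gradient descent for least squares, written as matrix-vector steps.

    Each iteration computes the gradient X^T (X w - y) / n and moves w
    a small step in the opposite direction.
    """
    n = len(X)
    Xt = transpose(X)
    w = [0.0] * len(X[0])
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(X, w), y)]   # X w - y
        grad = [g / n for g in matvec(Xt, residual)]          # X^T r / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]         # update step
    return w


# Illustrative data: the first column is a constant 1 (intercept term),
# and y was generated from y = 1 + 2 * x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
weights = fit_least_squares(X, y)   # approaches [1.0, 2.0]
```

[The algorithm itself is only the three commented lines inside the loop. Everything else is bookkeeping that a linear algebra library, or an optimizer of the kind Fred describes, would take over.]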
>> Yes, System ML is a much better way to do scalable linear algebra on top of Spark than the very limited linear algebra that's built into Spark. >> So alright, let's take the next step. Can System ML be grafted onto Spark in some way or would it have to be in an entirely new API that doesn't take, integrate with all the other Spark APIs? In a way, that has differentiated Spark, where each API is sort of accessible from every other. Can you tie System ML in or do the Spark guys have to build more primitives into their own sort of engine first? >> A lot of the work that we've done with the Spark Technology Center as part of bringing System ML into the Apache ecosystem has been to build a nice, tight integration with Apache Spark so you can pass Spark data frames directly into System ML, you can get data frames back. Your System ML algorithm, once you've written it, in terms of one of System ML's domain-specific languages, it just plugs into Spark like all the algorithms that are built into Spark. >> Okay, so that's, that would keep Spark competitive with more advanced machine learning frameworks for a longer period of time, in other words, it wouldn't hit the wall the way it would if it encountered TensorFlow from Google for Google's way of doing deep learning, Spark wouldn't hit the wall once it needed, like, a TensorFlow as long as it had System ML so deeply integrated the way you're doing it. >> Right, with a system like System ML, you can quickly move into new domains of machine learning. 
So for example, this afternoon I'm going to give a talk with one of our machine learning developers, Mike Dusenberry, about our recent efforts to implement deep learning in System ML, like full scale, convolutional neural nets running on a cluster in parallel processing many gigabytes of images, and we implemented that with very little effort because we have this optimizer underneath that takes care of a lot of the details of how you get that data into the processing, how you get the data spread across the cluster, how you get the processing moved to the data or vice versa. All those decisions are taken care of in the optimizer, you just write down the linear algebra parts and let the system take care of it. That let us implement deep learning much more quickly than we would have if we had done it from scratch. >> So it's just this ongoing cadence of basically removing the infrastructure gut management from the data scientists and enabling them to concentrate really where their value is, on the algorithms themselves, so they don't have to worry about how many clusters it's running on, and that configuration, kind of typical DevOps that we see on the regular development side, but now you're really bringing that into the machine learning space. >> That's right, Jeff. Personally, I find all the minutiae of making a parallel algorithm work really fascinating but a lot of people working in data science really see parallelism as a tool. They want to solve the data science problem and System ML lets you focus on solving the data science problem because the system takes care of the parallelism. >> You guys could go on in the weeds for probably three hours but we don't have enough coffee and we're going to set up a follow up time because you're both in San Francisco. 
But before we let you go, Fred, as you look forward into 2017, kind of the advances that you guys have done there at the IBM Spark Center in the city, what's kind of the next couple great hurdles that you're looking to cross, new challenges that are getting you up every morning that you're excited to come back a year from now and be able to say wow, these are the one or two things that we were able to take down in 2017? >> We're moving forward on several different fronts this year. On one front, we're helping to get the notebook experience with Spark notebooks consistent across the entire IBM product portfolio. We helped a lot with the rollout of notebooks on Data Science Experience on z, for example, and we're working actively with the Data Science Experience and with the Watson Data Platform. On the other hand, we're contributing to Spark 2.2. There are some exciting features, particularly in SQL, that we're hoping to get into that release as well as some new improvements to MLlib. We're moving forward with Apache System ML, we just cut Version 0.13 of that. We're talking right now on the mailing list about getting System ML out of incubation, making it a full, top level project. And we're also continuing to help with the adoption of Apache Spark technology in the enterprise. Our latest focus has been on deep learning on Spark. >> Well, I think we found him! Smartest guy in the room. (laughter) Thanks for stopping by and good luck on your talk this afternoon. >> Thank you, Jeff. >> Absolutely. Alright, he's Fred Reiss, he's George Gilbert, and I'm Jeff Frick, you're watching theCUBE from Big Data SV, part of Big Data Week in San Jose, California. (upbeat music) (mellow music) >> Hi, I'm John Furrier, the cofounder of SiliconANGLE Media, cohost of theCUBE. I've been in the tech business since I was 19, first programming on mini computers.
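
[Editor's note: Earlier in the interview Fred describes an optimizer that separates the logical description of an algorithm from its physical execution, keeping data in memory when it fits and streaming it otherwise. The toy below sketches that idea only. It is not how System ML's optimizer is implemented. The function names, the row-count "memory budget", and the chunk size are all hypothetical. The caller asks for one logical result (a sum of squares) and a small planner picks the physical strategy.]

```python
def in_memory_plan(source):
    """Physical plan 1: materialize all rows, then compute."""
    data = list(source())
    return sum(x * x for x in data)


def streaming_plan(source, chunk_size=4):
    """Physical plan 2: hold at most chunk_size rows at a time."""
    total = 0
    chunk = []
    for x in source():
        chunk.append(x)
        if len(chunk) == chunk_size:
            total += sum(v * v for v in chunk)
            chunk.clear()
    total += sum(v * v for v in chunk)   # leftover partial chunk
    return total


def choose_plan(n_rows, memory_budget_rows):
    """Toy planner: decide purely on whether the rows fit the budget."""
    return "in_memory" if n_rows <= memory_budget_rows else "streaming"


def sum_of_squares(source, n_rows, memory_budget_rows):
    """Logical operation: the caller never says how to execute it."""
    plan = choose_plan(n_rows, memory_budget_rows)
    if plan == "in_memory":
        return in_memory_plan(source), plan
    return streaming_plan(source), plan


def rows():
    """Hypothetical data source that can be re-read from the start."""
    return iter(range(10))


small_cluster = sum_of_squares(rows, n_rows=10, memory_budget_rows=5)
large_server = sum_of_squares(rows, n_rows=10, memory_budget_rows=100)
```

[Both calls return the same numeric answer while reporting different plans, which mirrors the point made in the interview: the code that states the algorithm stays the same when the target system changes.]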

Published Date : Mar 15 2017

ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
Jeff Rick	PERSON	0.99+
George	PERSON	0.99+
Jeff	PERSON	0.99+
Fred Rice	PERSON	0.99+
Mike Dusenberry	PERSON	0.99+
IBM	ORGANIZATION	0.99+
2017	DATE	0.99+
San Francisco	LOCATION	0.99+
John Furrier	PERSON	0.99+
San Jose	LOCATION	0.99+
Rob Thomas	PERSON	0.99+
505 Howard Street	LOCATION	0.99+
Google	ORGANIZATION	0.99+
Frederick Reiss	PERSON	0.99+
Spark Technology Center	ORGANIZATION	0.99+
Fred	PERSON	0.99+
IBM Spark Technology Center	ORGANIZATION	0.99+
one	QUANTITY	0.99+
San Jose, California	LOCATION	0.99+
Spark 2.2	TITLE	0.99+
three hours	QUANTITY	0.99+
Watson	ORGANIZATION	0.99+
UC Berkeley	ORGANIZATION	0.99+
one server	QUANTITY	0.99+
Spark	TITLE	0.99+
SiliconANGLE Media	ORGANIZATION	0.99+
Python	TITLE	0.99+
each server	QUANTITY	0.99+
both	QUANTITY	0.99+
each	QUANTITY	0.99+
each database	QUANTITY	0.98+
Big Data Week	EVENT	0.98+
Pagoda Lounge	LOCATION	0.98+
Strata Hadoob Conference	EVENT	0.98+
System ML	TITLE	0.98+
Big Data SV	EVENT	0.97+
each API	QUANTITY	0.97+
ML Live	TITLE	0.96+
today	DATE	0.96+
Thomas Vehicles	ORGANIZATION	0.96+
Apache System ML	TITLE	0.95+
Big Data	EVENT	0.95+
Apache Spark	TITLE	0.94+
Watson for Marketing	ORGANIZATION	0.94+
Sparking Water	TITLE	0.94+
first	QUANTITY	0.94+
one front	QUANTITY	0.94+
Big Data SV 2016	EVENT	0.94+
IBM Spark Technology Center	ORGANIZATION	0.94+
about 25 designers	QUANTITY	0.93+

Aaron Colcord & David Favela, FIS Global - Spark Summit East 2017 - #sparksummit - #theCUBE

>> Narrator: Live, from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, David Vellante and George Gilbert. >> Back to Boston, everybody, where the city is bracing for a big snowstorm. Still euphoric over the Patriots' big win. Aaron Colcord is here, he's the director of engineering at FIS Global, and he's joined by Dave Favela, who's the director of BI at FIS Global. Gentlemen, welcome to theCUBE. It's good to see you. >> Yeah, thank you. >> Thank you very much. >> Thanks so much for coming on. So Dave, set it up. FIS Global, the company that does a ton of work in financial services that nobody's ever heard of. >> Yeah, absolutely, absolutely. Yeah, we serve and touch virtually every credit union or bank in the United States, and have services that extend globally, and that ranges anywhere from back office services to technology services that we provide by way of mobile banking or online banking. And so, we're a Fortune 500 company with a reach, like I said, throughout the nation and globally. >> So, you're a services company that provides, sort of, end-to-end capabilities for somebody who wants to start a bank, or upgrade their infrastructure? >> Absolutely, yeah. So, whether you're starting a bank or whether you're an existing bank looking to offer some type of technology, whether it's back-end processing services, mobile banking, bill pay, peer-to-peer payments, so, we are considered a FinTech company, and one of the largest FinTech companies there is. >> And Aaron, your role as the director of engineering, maybe talk about that a little bit. 
>> My role is primarily about the mobile data analytics, about creating a product that's able to not only be able to give the basic behavior of our mobile application, but be able to actually dig deeper and create interesting analytics, insights into the data, to give our customers understanding about not only the mobile application, but be able to even, as we're building right now, a use case for being able to take action on that data. >> So, I mean, mobile obviously is sweeping the banking industry by storm, I mean, banks have always been, basically, IT companies, when you think about it, a huge component of IT, but now mobile comes in and, maybe talk a little bit about, sort of the big drivers in the business, and how, you know, mobile is fitting in. >> Absolutely. So, first of all, you see a shift that's happening with the end user: you, David, as a user of mobile banking, right? You probably have gone to the branch maybe once in the last 90 days, but have logged into mobile banking 10 times. So, we've seen anywhere from an eight to nine time shift in usage and engagement on the digital channel, and what that means is, more interactions and more touch points that the bank is getting off of the consumer behavior. And so, what we're trying to do here is turn that into getting to know the customer profile better, so that they could better serve in this digital channel, where there's a lot more interactions occurring. >> Yeah, I mean, you look at the demographic, too. I mean, my kids don't even use cheques. Right, I mean, it's all, everything's done on mobile, Venmo, or whatever, the capabilities they have. So, what's the infrastructure behind that that enables it? I mean, it can't be what it used to be. I mean, probably back-end still is, but what else do you have to create to enable that? 
>> Well, it's been a tremendous amount of transformation on the back-ends over the last ten years, and particularly when we talk about how that interaction has changed, from becoming a more formal experience to becoming a more intimate experience through the mobile client. But, more specifically to the back-end, we have actually implemented Apache Spark as one of our platforms, to actually help transform and move the data faster. Mobile actually creates a tremendous amount of back-end activity, sometimes even more than what we were able to see in other channels. >> Yeah, and if you think about it, if you just kind of step back a little bit, this is about core banking, right, and as you speak to IT systems, and so, if you think about all the transactions that happen on the daily, whether you're in branch, at ATM, on a mobile device, it's processed through a core banking system, and so one of the challenges that, I think, this industry and FinTech is up against is that, you've got all these legacy old systems that have been built that can't compute all this data at a fast enough rate, and so for us, bringing in Aaron, this is about, how do you actually leverage new technology, and take the technical data of the old systems, data schemas and models, and marry the two to provide data, key data that's been generated. >> Dave: Without shutting down the business. >> Without shutting down the business. >> Because that's the hard part. >> Can you elaborate on that, because that's non-trivial. It used to be when banks merged, it could take years for the back-office systems to come together. So now, let's say a bank comes to you, they have their, I don't want to say legacy systems, it's the systems they've built up over time, but they want the more modern capabilities. How do you marry the two? >> Would you take a first stab? >> Well, it is actually a very complicated process, because you always have to try to understand data itself, and how to put those two things together. 
More specifically on the mobile client, because of the way that we are able to think about how data can be transformed and transported, we came up with a very flexible mechanism to allow data to actually be interpreted on the fly, and processed, so that when you talk about two different banks, by transforming it into this type of format, we're able to kind of reinterpret it and process it. >> Would this be, could you think of this as a very, very smart stream processor that, where ETL would be at the most basic layer, and then you're adding meaning to the data so that it shows up to the mobile client in a way that coheres to the user model that the user is experiencing on their device? >> I think that's a really good way of putting it, yeah. I mean, there's a, we like to think of it, I call it a semantic layer, of how you, one, treat ETL as one process, and then you have a semantic layer that you basically transform the bottom bits, so to speak, into components that you can then assemble semantically so that it starts making sense to the end user. >> And to that point, you know, to your integration question, it is very challenging, because you're trying to marry the old with the new, and we'll tease the section for tomorrow in which Aaron will talk about that, but for us, at enterprise grade, it has to be done very cautiously, right? And we're under heavy regulation and compliance and security, and so, it's not about abandoning the old, right? It's trying to figure out, how do we take that, what's been in place and been stable, and then couple it with the new technology that we're introducing. >> Which is interesting conversation, the old versus new, and I look at your title, Dave, and it's got 'BI' in it. I remember I interviewed Christian Chabot, who was then CEO of Tableau, and he's like, "Old, slow, BI", okay, now you guys are here talking about Spark. Spark's all about real-time and speed and memory, and everything else. 
Talk about the transformation in your role as this industry has transformed. >> Yeah, absolutely, so, when we think about business intelligence and creating that intelligence layer, we elected the mobile channel, right? Because we're seeing that most interactions happen there. So for us, an intelligent BI solution is not just, you know, a data management and analytics platform. There has to be the fulfillment. You talk a lot about actioning on your data. So for us, it's, if we could actually create, you know, an intelligence layer at the analytics level, how can we feed marketing solutions with this intelligence to have the full circle and insights back? I believe, the gentlemen, they were talking about the RISE Lab in this morning's session. >> Dave: The follow-on to AMP, basically. >> Yeah, exactly. So, there it was all about that feedback loop, right? And so, for us, when we think about BI, the whole loop is from data management to end-to-end marketing solutions, and then back, so that we can serve the mobile customer. >> Well, so, you know, the original promise of the data warehouse was this 360, what you just described, right? And being able to effect business outcomes, and that is now the promise of so-called big data, even though people don't really like that term anymore, so, my question is, is it same wine, new bottle, or is it really transformational? Are we going to live up to that challenge this time around? As practitioners, I'd really love your input on that. >> I think I'd love to expand on that. >> Absolutely. >> Yeah, I mean, I don't think it's, I think it's a whole new bottle and a whole new wine. David here is from wine country, and, there's definitely the, data warehouse introduced the important concepts, which is a tremendous foundation for us to stand on. You know, you always like to stand on the shoulders of giants. It introduced a concept, but in the case of marrying the new with the old, there's a tremendous extra third dimension, okay? 
So, we have a velocity dimension when we start talking about Apache Spark. We can accelerate it, make it go quick, and we can get that data. There's another aspect there when we start talking about, for example, hey, different banks have different types of ways that they like to talk to it, so now we're kind of talking about, there's variation in people's data, and Apache Spark, actually, is able to give that capability to process data that is different than each other, and then being able to marry it, down the pipe, together. And then the additional, what I think is actually making it into a new wine is, when we start talking about data, the traditional mechanism, data warehousing, that 360 view of the customer, they were thinking more of data as in, I like to think of it as, let's count beans, right? Let's just come up with how many people were doing X, how many were doing this? >> Dave: Accurate reporting, yeah. >> Exactly, and if you think about it, it was driving the business through the rear-view mirror, because all you had to do was base it off of the historical information, and that's how we're going to drive the business. We're going to look in the rear-view mirror, we're going to look at what's been going on, and then we're going to see what's going on. And I think the transformation here is taking technologies and being able to say, how do we put not only predictive analytics into play, but how do we actually allow the customer to take control and actually move forward? And then, as well, expand those use cases for variation, use that same technology to look for, between the data points, are there more data points that can be actually derived and moved forward on? >> George, I loved that description. You have, in one of your reports, I remember, George had this picture of this boat, and he said, "Oh, imagine trying to drive the boat", and it was looking at the wake (laughs), you know, right? Rather than looking in the rear-view mirror. 
>> But in addition to that, yeah, it's like driving the rear-view mirror, but you also said something interesting about, sort of, I guess the words I used to use were anticipating and influencing the customer. >> Aaron: Exactly. >> Can you talk about how much of that is done offline, like scoring profiles, and how much of that is done in real-time with the customer? >> Go ahead. >> Well, a lot of it still is still being done offline, mostly because, you know, as trying to serve a bank, you have to also be able to serve their immediate needs. So, really, we're evolving to actually build that use case around the real-time. We actually do have the technology already in place. We built the POCs, we built the technology inside, we're being able to move real-time, and we're ready to go there. >> So, what will be the difference? Me as a consumer, how will that change my experience? >> I think that would probably be best for you. >> Yeah, well, just got to step back a little bit, too, because, you know, what we're representing here is the digital channel mobile analytics, right? But, there's other areas within FIS Global that handles real-time payments with real-time analytics, such as a credit card division, right? So, both are happening sort of in parallel right now. For us, from our perspective on the mobile and digital front, the experience and how that's going to change is that, if you were a bank, and as a bank or a credit union you're receiving this behavioral data from our product, you want to be able to offer up better services that meet your consumer profile, right? And so, from our standpoint, we're working with other teams within FIS Global via Spark and Cloud, to essentially get that holistic profile to offer up those services that are more targeted, that are, I think, more meaningful to the consumer when they're in the mobile banking application. 
>> So, does FIS provide that sort of data service, that behavioral service, sort of as a turnkey service, or as a service, or is that something that you sort of teach the bank or the credit union how to fish? >> That's a really good question. We stated our mission statement as helping these institutions, creating a culture of being data-driven, right? So, give them the taste of data in a way that, you know, democratizing data, if you will, as we talked about this morning. >> Dave: Yeah, that's right. >> That concept's really important to us, because with that comes, give FIS more data, right? Send them more data, or have them teach us how to manage all this data, to have a data science experience, where we can go in and play with the data to create our own sub-targeting, because our belief is that, you know, our clients know their customers the best, so we're here to serve them with tools to do that. >> So, I want to come back to the role of Spark. I mean, Hadoop was profound, right, I mean, shipped five megabytes of code, a petabyte a day, no doubt about it. But at the same time, it was a heavy lift. It still is a heavy lift. So talk about the role of Spark in terms of catalyzing that vision that we've been talking about. >> Oh, definitely. So, Apache Spark, when we talk in terms of big data, big data got started with Hadoop, and MapReduce was definitely an interesting concept, but Apache Spark really lifted and accelerates the entire vision of big data. When you look at, for example, MapReduce, you need to go get a team of trained engineers, who are typically going to work in a lower level language like Java, and they no longer focus in on what the business objectives are. They're focusing on the programming objectives, the requirements. With Spark, because it takes a more high-level abstraction of how we process data, it means that you're more focusing on, what's the actual business case? How are we actually abstracting the data? How are we moving data? 
But then it also gives you that same capability to go inside the actual APIs, get a little bit lower, to modify it for what's your specific needs. So, I think the true transformation with Apache Spark is basically allowing us, now, like for example, in the presentation this morning, it was, there's a lot of people who are using Scala. We use Scala, ourselves. There's now a lot of people who are using Python, and everybody's using SQL. How does SQL, something that has survived so robustly for almost 30, 40 years, still keep on coming back like a boomerang on us? And it's because a language composed of four simple keywords is just so easy to use, and so descriptive and declarative, that it allows us to actually just concentrate on the business, and I think that's actually the acceleration that Apache Spark actually brings to the business, is being able to just focus in on what you're actually trying to do, and focus in on your objectives, and it actually lowers the actual, that same team of engineers that you're using for MapReduce now become extremely more productive. I mean, when I look at the number of lines of code that we had to do to figure out machine learning in Hadoop, to the amount of lines that you have to do in Apache Spark, it's tremendously, it's like, five lines in Apache Spark, 30 in MapReduce, and the system just responds and gives it to you a hundred times faster. >> Why Spark, too?
I mean, Spark, when we saw it two years ago, to your point of this tidal wave of data, we saw more mobile phone adoption, we saw those people that were on mobile banking using it more, logging in more, and then we're seeing the proliferation of devices, right, in IoT, so for us, these are all these interaction and data points that is a tsunami that's coming our way, so that's when we strategically elected to go Spark, so we could handle the volume and compute storage- >> And Aaron, what you just described is, all the attention used to be on just making it work, and now it's putting to work, is really- >> Aaron: Right, exactly. >> You're seeing that in your businesses. >> Quick question. Do you see, now that you have this, sort of, lower and lower latency analytics and ability to access more of the, what previously were data silos, do you see services that are possible that banks couldn't have thought of before, beyond just making different products recommended at the appropriate moment, are there new things that banks can offer? >> It's interesting. On one hand, you free up their time from an analysis standpoint, to where they could actually start to get out of the weeds to think about new products and services, so, from that component, yes. From the standpoint of seeing pattern recognition in the data, and seeing what it can do aside from target marketing, our products are actually often used by our product owners internally to understand, what are the consumers doing on the device, so that they could actually come up with better services to ultimately serve them, aside from marketing solutions. >> Notwithstanding your political affiliations, we won't go there, but there's certainly a mood of, and a trend toward, deregulation, that's presumably good news for the financial services industry. Can you comment on that, or, what's the narrative going on in your customer base? Are they excited about fewer regulations, or is that just all political nonsense? 
Any thoughts? >> Yeah (laughs), you know, on one hand, why people come to FIS is because we do adhere to a compliance and regulation standpoint, right? >> Dave: Complexity is your friend, then (laughs). >> Absolutely, right, so they can trust us in that regard, right? And so, from our vantage point, will it go away entirely? No, absolutely not, right. I think Cloud introduces a whole new layer of complexity, because how do you handle Cloud computing and NPI, and PII data in the Cloud, and our customers look to us to make sure that, first and foremost, security for the end consumer is in place, and so, but I think it's an interesting question, and one that you are seeing end users click through without even viewing agreements or whatnot, they just want to get to product, right? So, you know, will it go away, or do we see it going away? No, but ... >> You guys don't read all that text, do you? (laughing) >> No comment? >> Required, required to. >> You know, no matter where it goes with the politics, I think there's a theme over the last 10 years, and the 10 years before. Things are transforming, things are evolving in ways, and sometimes going extremely, extremely fast in ways that we don't, surely can't anticipate. I think, if we were to think about just a mobile application, or the mobile bank experience 10 years ago, all we wanted was just to be able to see just the bank balance, and now we're able to take that same application and not only see our bank balance, but be able to deposit our cheque, or even replace the card in our pocket completely, with the mobile app, and I think we're going to see the exact same types of transformations over the industry over the next 10 years. Whether or not it's more regulation or different regulation, I think it's going to still speak to the same services, which FIS is there to help deliver. 
>> Yeah, and you're right, there are going to be new regulations, because they'll evolve, maybe out with the old, in with the new, you see, and global regulations are on run book, and you've got your Cloud, there's data locality, and you know, it's never-ending. That's great for your business. Fantastic. >> It comes down to trust, ultimately, right? I mean, they still, our customers still go to banks and credit unions because they trust them with their data, if you will, or their online currency, in some regards. So, you know, that's not going to change. >> Right, yeah. Well, Aaron, Dave, thanks very much for coming to theCUBE, it was great to have you. >> Thanks so much for talking with us. >> Absolutely, good luck with everything. >> Alright, keep it right there, buddy. We'll be back with our next guest. This is theCUBE. We're live from Boston, Spark Summit East, #SparkSummit. Be right back.

Published Date : Feb 8 2017

SUMMARY :

brought to you by Databricks. It's good to see you. FIS Global, the company that does a ton of work and have services that extend globally, and one of the largest FinTech companies there is. maybe talk about that a little bit. but be able to actually dig deeper and how, you know, mobile is fitting in. that the bank is getting off of the consumer behavior. but what else do you have to create to enable that? and particularly when we talk about and so one of the challenges that, I think, it's the systems they've built up over time, and how to put those two things together. so that it starts making sense to the end user. and so, it's not about abandoning the old, right? Talk about the transformation in your role and creating that intelligence layer, and then back, so that we can serve the mobile customer. and that is now the promise of so-called big data, and then being able to marry it, down the pipe, together. Exactly, and if you think about it, and it was looking at the wake (laughs), you know, right? But in addition to that, yeah, We built the POCs, we built the technology inside, the experience and how that's going to change is that, you know, democratizing data, if you will, because our belief is that, you know, But at the same time, it was a heavy lift. and the system just responds and gives it to you and ability to access more of the, so that they could actually come up with better services for the financial services industry. and one that you are seeing end users click through and the 10 years before. and you know, it's never-ending. because they trust them with their data, if you will, it was great to have you. We'll be back with our next guest.

ENTITIES

Entity	Category	Confidence
Dave	PERSON	0.99+
Aaron	PERSON	0.99+
Dave Favela	PERSON	0.99+
David Vellante	PERSON	0.99+
Aaron Colcord	PERSON	0.99+
George Gilbert	PERSON	0.99+
David	PERSON	0.99+
FIS Global	ORGANIZATION	0.99+
David Favela	PERSON	0.99+
Christian Chabot	PERSON	0.99+
George	PERSON	0.99+
10 times	QUANTITY	0.99+
FIS	ORGANIZATION	0.99+
Boston	LOCATION	0.99+
five megabytes	QUANTITY	0.99+
Scala	TITLE	0.99+
two	QUANTITY	0.99+
Tableau	ORGANIZATION	0.99+
Java	TITLE	0.99+
eight	QUANTITY	0.99+
Python	TITLE	0.99+
RISE Lab	ORGANIZATION	0.99+
SQL	TITLE	0.99+
two different banks	QUANTITY	0.99+
Boston, Massachusetts	LOCATION	0.99+
United States	LOCATION	0.99+
five lines	QUANTITY	0.99+
two things	QUANTITY	0.99+
two years ago	DATE	0.99+
tomorrow	DATE	0.98+
MapReduce	TITLE	0.98+
Spark	TITLE	0.98+
Apache	ORGANIZATION	0.98+
both	QUANTITY	0.98+
10 years ago	DATE	0.98+
third dimension	QUANTITY	0.97+
360 view	QUANTITY	0.97+
Patriots'	ORGANIZATION	0.97+
one	QUANTITY	0.97+
Databricks	ORGANIZATION	0.96+
Spark Summit East 2017	EVENT	0.96+
nine time	QUANTITY	0.96+
first stab	QUANTITY	0.95+
#SparkSummit	EVENT	0.94+
Hadoop	TITLE	0.94+
a petabyte a day	QUANTITY	0.93+
one process	QUANTITY	0.92+
Venmo	ORGANIZATION	0.9+
30	QUANTITY	0.9+
almost 30	QUANTITY	0.9+
10 years	DATE	0.9+
Apache Spark	ORGANIZATION	0.89+
FinTech	ORGANIZATION	0.89+
Global	EVENT	0.89+
once	QUANTITY	0.89+
#sparksummit	EVENT	0.88+
theCUBE	ORGANIZATION	0.88+
Cloud	TITLE	0.86+
first	QUANTITY	0.86+
four simple keywords	QUANTITY	0.86+
this morning	DATE	0.84+
last 10 years	DATE	0.84+

Nick Pentreath, IBM STC - Spark Summit East 2017 - #sparksummit - #theCUBE

>> Narrator: Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Boston, everybody. Nick Pentreath is here, he's a principal engineer at the IBM Spark Technology Center in South Africa. Welcome to theCUBE. >> Thank you. >> Great to see you. >> Great to see you. >> So let's see, it's a different time of year here than you're used to. >> I've flown from, I don't know the Fahrenheit equivalent, but 30 degrees Celsius heat and sunshine to snow and sleet, so. >> Yeah, yeah. So it's a lot chillier here. Wait until tomorrow. But, so we were joking. You probably get the T-shirt for the longest flight here, so welcome. >> Yeah, I actually need the parka, or like a beanie. (all laugh) >> Little better. Long sleeve. So Nick, tell us about the Spark Technology Center, STC is its acronym, and your role there. >> Sure, yeah, thank you. So Spark Technology Center was formed by IBM a little over a year ago, and its mission is to focus on the Open Source world, particularly Apache Spark and the ecosystem around that, and to really drive forward the community and to make contributions to both the core project and the ecosystem. The overarching goal is to help drive adoption, yeah, and particularly among enterprise customers, the kind of customers that IBM typically serves. And to harden Spark and to make it really enterprise ready. >> So why Spark? I mean, we've watched IBM do this now for several years. The famous example that I like to use is Linux. When IBM put $1 billion into Linux, it really went all in on Open Source, and it drove a lot of IBM value, both internally and externally for customers. So what was it about Spark? I mean, you could have made a similar bet on Hadoop. You decided not to, you sort of waited to see that market evolve. What was the catalyst for having you guys all go in on Spark? >> Yeah, good question.
I don't know all the details, certainly, of what the internal drivers were, because I joined STC a little under a year ago, so I'm fairly new. >> Translate the hallway talk, maybe. (Nick laughs) >> Essentially, I think you raise very good parallels to Linux and also Java. >> Absolutely. >> So Spark, sorry, IBM, made these investments in Open Source technologies that it had seen to be transformational and kind of game-changing. And I think, you know, most people will probably admit within IBM that they maybe missed the boat, actually, on Hadoop and saw Spark as the successor and actually saw a chance to really dive into that and kind of almost leapfrog and say, "We're going to back this as the next generation analytics platform and operating system for analytics and big data in the enterprise." >> Well, I don't know if you happened to watch the Super Bowl, but there's a saying that it's sometimes better to be lucky than good. (Nick laughs) And that sort of applies, and so, in some respects, maybe missing the window on Hadoop was not a bad thing for IBM. >> Yeah, exactly, because not a lot of people made a ton of dough on Hadoop and they're still sort of struggling to figure it out. And now along comes Spark, and you've got this more real time nature. IBM talks a lot about bringing analytics and transactions together. They've made some announcements about that and affecting business outcomes in near real time. I mean, that's really what it's all about and one of your areas of expertise is machine learning. And so, talk about that relationship and what it means for organizations, your mission. >> Yeah, machine learning is a key part of the mission. And you've seen the kind of big data in enterprise story, starting with the kind of Hadoop and data lakes. And that's evolved into, now we've, before we just dumped all of this data into these data lakes and these silos and maybe we had some Hadoop jobs and so on.
But now we've got all this data we can store, what are we actually going to do with it? So part of that is the traditional data warehousing and business intelligence and analytics, but more and more, we're seeing there's a rich value in this data, and to unlock it, you really need intelligent systems. You need machine learning, you need AI, you need real time decision making that starts transcending the boundaries of all the rule-based systems and human-based systems. So we see machine learning as one of the key tools and one of the key unlockers of value in these enterprise data stores. >> So Nick, perhaps paint us a picture of someone who's advanced enough to be working with machine learning with IBM, and we know that the tool chain's kind of immature. Although, IBM with DataWorks or DataFirst has a fairly broad end-to-end sort of suite of tools, but what are the early use cases? And what needs to mature to go into higher volume production apps or higher-value production apps? >> I think the early use cases for machine learning in general and certainly at scale are numerous and they're growing, but classic examples are, let's say, recommendation engines. That's an area that's close to my heart. In my previous life before IBM, I built a startup that had a recommendation engine service targeting online stores and e-commerce players and social networks and so on. So this is a great kind of example use case. We've got all this data about, let's say, customer behavior in your retail store or your video-sharing site, and in order to serve those customers better and make more money, if you can make good recommendations about what they should buy, what they should watch, or what they should listen to, that's a classic use case for machine learning and unlocking the data that is there, so that is one of the drivers of some of these systems, players like Amazon, they're sort of good examples of the recommendation use case.
Another is fraud detection, and that is a classic example in financial services, enterprise, which is a kind of staple of IBM's customer base. So these are a couple of examples of the use cases, but the tool sets, traditionally, have been kind of cumbersome. So Amazon built everything from scratch themselves using customized systems, and they've got teams and teams of people. Nowadays, you've got this built into Apache Spark, you've got it in Spark's machine learning library, you've got good models to do that kind of thing. So I think from an algorithmic perspective, there's been a lot of advancement and there's a lot of standardization and almost commoditization of the model side. So what is missing? >> George: Yeah, what else? >> And what are the shortfalls currently? So there's a big difference between the current view, I guess the hype of the machine learning as you've got data, you apply some machine learning, and then you get profit, right? But really, there's a hugely complex workflow that involves this end-to-end story. You've got data coming from various data sources, you have to feed it into one centralized system, transform and process it, extract your features and do your sort of hardcore data science, which is the core piece that everyone sort of thinks about as the only piece, but that's kind of in the middle and it makes up a relatively small proportion of the overall chain. And once you've got that, you do model training and selection testing, and you now have to take that model, that machine-learning algorithm and you need to deploy it into a real system to make real decisions. And that's not even the end of it because once you've got that, you need to close the loop, what we call the feedback loop, and you need to monitor the performance of that model in the real world. You need to make sure that it's not deteriorating, that it's adding business value. All of these kind of things.
So I think that is the real, the piece of the puzzle that's missing at the moment is this end-to-end, delivering this end-to-end story and doing it at scale, securely, enterprise-grade. >> And the business impact of that presumably will be a better-quality experience. I mean, recommendation engines and fraud detection have been around for a while, they're just not that good. Retargeting systems are too little too late, and kind of cumbersome fraud detection. Still a lot of false positives. Getting much better, certainly compressing the time. It used to be six months, >> Yes, yes. >> Now it's minutes or seconds, but a lot of false positives still, so, but are you suggesting that by closing that gap, that we'll start to see from a consumer standpoint much better experiences? >> Well, I think that's imperative because if you don't see that from a consumer standpoint, then the mission is failing because ultimately, it's not magic that you just simply throw machine learning at something and you unlock business value and everyone's happy. You have to, you know, there's a human in the loop, there. You have to fulfill the customer's need, you have to fulfill consumer needs, and the better you do that, the more successful your business is. You mentioned the time scale, and I think that's a key piece, here. >> Yeah. >> What makes better decisions? What makes a machine-learning system better? Well, it's better data and more data, and faster decisions. So I think all of those three are coming into play with Apache Spark, the end-to-end story, streaming systems, and the models are getting better and better because they're getting more data and better data. >> So I think we've, the industry, has pretty much attacked the time problem, certainly for fraud detection and recommendation systems. The quality issue, are we close?
I mean, are we talking about 6-12 months before we really sort of start to see a major impact to the consumer and ultimately, to the company who's providing those services? >> Nick: Well, >> Or is it further away than that, you think? >> You know, it's always difficult to make predictions about timeframes, but I think there's a long way to go to go from, yeah, as you mentioned where we are, the algorithms and the models are quite commoditized. The time gap to make predictions is kind of down to this real-time nature. >> Yeah. >> So what is missing? I think it's actually less about the traditional machine-learning algorithms and more about making the systems better and getting better feedback, better monitoring, so improving the end user's experience of these systems. >> Yeah. >> And that's actually, I don't think it's, I think there's a lot of work to be done. I don't think it's a 6-12 month thing, necessarily. I don't think that in 12 months, certainly, you know, everything's going to be perfectly recommended. I think there's areas of active research in the kind of academic fields of how to improve these things, but I think there's a big engineering challenge to bring in more disparate data sources, to better, to improve data quality, to improve these feedback loops, to try and get systems that are serving customer needs better. So improving recommendations, improving the quality of fraud detection systems. Everything from that to medical imaging and cancer detection. I think we've got a long way to go. >> Would it be fair to say that we've done a pretty good job with traditional application lifecycle in terms of DevOps, but we now need the DevOps for the data scientists and their collaborators? >> Nick: Yeah, I think that's >> And where is IBM along that?
>> Yeah, that's a good question, and I think you kind of hit the nail on the head, that the enterprise applied machine learning problem has moved from the kind of academic to the software engineering and actually, DevOps. Internally, someone mentioned the word train ops, so it's almost like, you know, the machine learning workflow and actually professionalizing and operationalizing that. So recently, IBM, for one, has announced Watson Data Platform and now, Watson Machine Learning. And that really tries to address that problem. So really, the aim is to simplify and productionize these end-to-end machine-learning workflows. So that is the product push that IBM has at the moment. >> George: Okay, that's helpful. >> Yeah, and right. I was at the Watson Data Platform announcement, you called it DataWorks. I think they changed the branding. >> Nick: Yeah. >> It looked like there were numerous components that IBM had in its portfolio that are now strung together to create that end-to-end system that you're describing. Is that a fair characterization, or is it underplaying? I'm sure it is. The work that went into it, but help us maybe understand that better. >> Yeah, I should caveat it by saying we're fairly focused, very focused at STC on the Open Source side of things. So my work is predominantly within the Apache Spark project and I'm less involved in the data bank. >> Dave: So you didn't contribute specifically to Watson Data Platform? >> Not to the product line, so, you know, >> Yeah, so it's really not an appropriate question for you? >> I wouldn't want to kind of, >> Yeah. >> To talk too deeply about it >> Yeah, yeah, so that, >> Simply because I haven't been involved. >> Yeah, that's, I don't want to push you on that because it's not your wheelhouse, but then, help me understand how you will commercialize the activities that you do, or is that not necessarily the intent?
>> So the intent with STC particularly is that we focus on Open Source and a core part of that is that we, being within IBM, we have the opportunity to interface with other product groups and customer groups. >> George: Right. >> So while we're not directly focused on, let's say, the commercial aspect, we want to effectively leverage the ability to talk to real-world customers and find the use cases, talk to other product groups that are building this Watson Data Platform and all the product lines and the features, Data Science Experience, it's all built on top of Apache Spark and platform. >> Dave: So your role is really to innovate? >> Exactly, yeah. >> Leverage Open Source and innovate. >> Both innovate and kind of improve, so improve performance, improve efficiency. When you are operating at the scale of a company such as IBM and other large players, your customers and you as product teams and builders of products will come into contact with all the kind of little issues and bugs >> Right. >> And performance >> Make it better. >> Problems, yeah. And that is the feedback that we take on board and we try and make it better, not just for IBM and their customers. Because it's an Apache product and everyone benefits. So that's really the idea. Take all the feedback and learnings from enterprise customers and product groups and centralize that in the Open Source contributions that we make. >> Great. Would it be, so would it be fair to say you're focusing on making the core Spark, Spark ML and Spark MLlib capabilities, sort of machine learning libraries and the pipeline, more robust? >> Yes. >> And if that's the case, we know there needs to be improvements in its ability to serve predictions in real time, like high speed. We know there's a need to take the pipeline and sort of share it with other tools, perhaps. Or collaborate with other tool chains. >> Nick: Yeah. >> What are some of the things that the Enterprise customers are looking for along the lines?
>> Yeah, that's a great question and very topical at the moment. So both from an Open Source community perspective and Enterprise customer perspective, this is one of the, if not the key, I think, kind of missing pieces within the Spark machine-learning kind of community at the moment, and it's one of the things that comes up most often. So it is a missing piece, and we as a community need to work together and decide, is this something that we build within Spark and provide that functionality? Is it something where we try and adopt open standards that will benefit everybody and that provides a kind of one standardized format, or way of serving models? Or is it something where there's a few Open Source projects out there that might serve for this purpose, and do we get behind those? So I don't have the answer because this is ongoing work, but it's definitely one of the most critical kind of blockers, or, let's say, areas that needs work at the moment. >> One quick question, then, along those lines. IBM, the first thing IBM contributed to the Spark community was Spark ML, which is, as I understand it, it was an ability to, I think, create an ensemble sort of set of models to do a better job or create a more, >> So are you referring to SystemML, I think it is? >> SystemML. >> SystemML, yeah, yeah. >> What are they, I forgot. >> Yeah, so, so. >> Yeah, where does that fit? >> SystemML started out as an IBM research project and perhaps the simplest way to describe it is, just as a SQL optimizer takes SQL queries and decides how to execute them in the most efficient way, SystemML takes a kind of high-level mathematical language and compiles it down to an execution plan that runs in a distributed system. So in much the same way as your SQL operators allow this very flexible and high-level language, you don't have to worry about how things are done, you just tell the system what you want done.
SystemML aims to do that for mathematical and machine learning problems, so it's now an Apache project. It's been donated to Open Source and it's an incubating project under very active development. And that is really, there's a couple of different aspects to it, but that's the high-level goal. The underlying execution engine is Spark. It can run on Hadoop and it can run locally, but really, the main focus is to execute on Spark and then expose these kind of higher level APIs that are familiar to users of languages like R and Python, for example, to be able to write their algorithms and not necessarily worry about how do I do large scale matrix operations on a cluster? SystemML will compile that down and execute that for them. >> So really quickly, follow up, what that means is it's a higher level way for people who aren't sort of cluster aware to write machine-learning algorithms that are cluster aware? >> Nick: Precisely, yeah. >> That's very, very valuable. When it works. >> When it works, yeah. So it does, again, with the caveat that I'm mostly focused on Spark and not so much the SystemML side of things, so I'm definitely not an expert. I don't claim to be an expert in it. But it does, you know, it works at the moment. It works for a large class of machine-learning problems. It's very powerful, but again, it's a young project and there's always work to be done, so exactly the areas that I know that they're focusing on are these areas of usability, hardening up the APIs and making them easier to use and easier to access for users coming from the R and Python communities who, again are, as you said, they're not necessarily experts on distributed systems and cluster awareness, but they know how to write a very complex machine-learning model in R, for example. And it's really trying to enable them with a set of API tools.
So in terms of the underlying engine, they are, I don't know how many hundreds of thousands, millions of lines of code and years and years of research that's gone into that, so it's an extremely powerful set of tools. But yes, a lot of work still to be done there and ongoing to make it, in a way to make it user ready and Enterprise ready in a sense of making it easier for people to use it and adopt it and to put it into their systems and production. >> So I wonder if we can close, Nick, just a few questions on STC, so the Spark Technology Centers in Cape Town, is that a global expertise center? Is is STC a virtual sort of IBM community, or? >> I'm the only member visiting Cape Town, >> David: Okay. >> So I'm kind of fairly lucky from that perspective, to be able to kind of live at home. The rest of the team is mostly in San Francisco, so there's an office there that's co-located with the Watson west office >> Yeah. >> And Watson teams >> Sure. >> That are based there in Howard Street, I think it is. >> Dave: How often do you get there? >> I'll be there next week. >> Okay. >> So I typically, sort of two or three times a year, I try and get across there >> Right. And interface with the team, >> So, >> But we are a fairly, I mean, IBM is obviously a global company, and I've been surprised actually, pleasantly surprised there are team members pretty much everywhere. Our team has a few scattered around including me, but in general, when we interface with various teams, they pop up in all kinds of geographical locations, and I think it's great, you know, a huge diversity of people and locations, so. >> Anything, I mean, these early days here, early day one, but anything you saw in the morning keynotes or things you hope to learn here? Anything that's excited you so far? 
>> A couple of the morning keynotes, but had to dash out to kind of prepare for, I'm doing a talk later, actually on feature hashing for scalable machine learning, so that's at 12:20, please come and see it. >> Dave: A breakout session, it's at what, 12:20? >> 20 past 12:00, yeah. >> Okay. >> So in room 302, I think, >> Okay. >> I'll be talking about that, so I needed to prepare, but I think some of the key exciting things that I have seen that I would like to go and take a look at are kind of related to the deep learning on Spark. I think that's been a hot topic recently and one of the areas where, again, Spark, perhaps, hasn't been the strongest contender, let's say, but there's some really interesting work coming out of Intel, it looks like. >> They're talking here on The Cube in a couple hours. >> Yeah. >> Yeah. >> I'd really like to see their work. >> Yeah. >> And that sounds very exciting, so yeah. I think every time I come to a Spark summit, there are always new projects from the community, various companies, some of them big, some of them startups that are pushing the envelope, whether it's research projects in machine learning, whether it's adding deep learning libraries, whether it's improving performance for kind of commodity clusters or for single, very powerful single nodes, there's always people pushing the envelope, and that's what's great about being involved in an Open Source community project and being part of those communities, so yeah. That's one of the talks that I would like to go and see. And I think I, unfortunately, had to miss some of the Netflix talks on their recommendation pipeline. That's always interesting to see. >> Dave: Right. >> But I'll have to check them on the video (laughs). >> Well, there's always another project in Open Source land. Nick, thanks very much for coming on The Cube and good luck. >> Cool, thanks very much. Thanks for having me. >> Have a good trip, stay warm, hang in there. (Nick laughs) Alright, keep it right there.
My buddy George and I will be back with our next guest. We're live. This is The Cube from Spark Summit East, #sparksummit. We'll be right back. (upbeat music) (gentle music)

Published Date : Feb 8 2017

ENTITIES

Entity	Category	Confidence
David	PERSON	0.99+
George Gilbert	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Dave Valente	PERSON	0.99+
George	PERSON	0.99+
Dave	PERSON	0.99+
Nick Pentreath	PERSON	0.99+
Howard Street	LOCATION	0.99+
San Francisco	LOCATION	0.99+
Nick Pentry	PERSON	0.99+
$1 billion	QUANTITY	0.99+
Nick	PERSON	0.99+
Amazon	ORGANIZATION	0.99+
HTC	ORGANIZATION	0.99+
two	QUANTITY	0.99+
Cape Town	LOCATION	0.99+
South Africa	LOCATION	0.99+
Java	TITLE	0.99+
Linux	TITLE	0.99+
12 months	QUANTITY	0.99+
six months	QUANTITY	0.99+
next week	DATE	0.99+
Boston	LOCATION	0.99+
Boston, Massachusetts	LOCATION	0.99+
IBM Spark Technology Center	ORGANIZATION	0.99+
BMI	ORGANIZATION	0.99+
Python	TITLE	0.99+
Spark	TITLE	0.99+
12:20	DATE	0.99+
three	QUANTITY	0.99+
6-12 month	QUANTITY	0.99+
Watson	ORGANIZATION	0.98+
tomorrow	DATE	0.98+
Spark Technology Center	ORGANIZATION	0.98+
one	QUANTITY	0.98+
Spark Technology Centers	ORGANIZATION	0.98+
this year	DATE	0.97+
Hadoop	TITLE	0.97+
hundreds of thousands	QUANTITY	0.97+
both	QUANTITY	0.97+
30 degrees Celsius	QUANTITY	0.97+
Data First	ORGANIZATION	0.97+
Super Bowl	EVENT	0.97+
single	QUANTITY	0.96+

Joel Horwitz, IBM & David Richards, WANdisco - Hadoop Summit 2016 San Jose - #theCUBE

>> Narrator: From San Jose, California, in the heart of Silicon Valley, it's theCUBE. Covering Hadoop Summit 2016. Brought to you by Hortonworks. Here's your host, John Furrier. >> Welcome back everyone. We are here live in Silicon Valley at Hadoop Summit 2016, actually San Jose. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. Our next guests, David Richards, CEO of WANdisco. And Joel Horwitz, strategy and business development, IBM Analytics. Guys, welcome back to theCUBE. Good to see you guys. >> Thank you for having us. >> It's great to be here, John. >> Give us the update on WANdisco. What's the relationship with IBM and WANdisco? 'Cause, you know. I can just almost see it, but I'm not going to predict. Just tell us. >> Okay, so, I think the last time we were on theCUBE, I was sitting with Re-ti-co who works very closely with Joel. And we began to talk about how our partnership was evolving. And of course, we were negotiating an OEM deal back then, so we really couldn't talk about it very much. But this week, I'm delighted to say that we announced, I think it's called IBM Big Replicate? >> Joel: Big Replicate, yeah. We have a big everything and Replicate's the latest addition. >> So it's going really well. It's OEM'd into IBM's analytics, big data products, and cloud products. >> Yeah, I'm smiling and smirking because we've had so many conversations, David, on theCUBE with you on and following your business through the bumpy road or the wild seas of big data. And it's been a really interesting tossing and turning of the industry. I mean, Joel, we've talked about it too. The innovation around Hadoop and then the massive slowdown and realization that cloud is now on top of it. The consumerization of the enterprise created a little shift in the value proposition, and then a massive rush to build enterprise grade, right? And you guys had that enterprise grade piece of it. IBM, certainly you're enterprise grade.
You have enterprise everywhere. But the ecosystem had to evolve really fast. What happened? Share with the audience this shift. >> So, it's classic product adoption lifecycle and the buying audience has changed over that time continuum. In the very early days when we first started talking more at these events, when we were talking about Hadoop, we all really cared about whether it was Pig and Hive. >> You once had a distribution. That's a throwback. Today's Thursday, we'll do that tomorrow. >> And the buying audience has changed, and consequently, the companies involved in the ecosystem have changed. So where we once used to really care about all of those different components, we don't really care about the machinations below the application layer anymore. Some people do, yes, but by and large, we don't. And that's why cloud for example is so successful because you press a button, and it's there. And that, I think, is where the market is going to very, very quickly. So, it makes perfect sense for a company like WANdisco who've got 20, 30, 40, 50 sales people to move to a company like IBM that have 4 or 5,000 people selling our analytics products. >> Yeah, and so this is an OEM deal. Let's just get that news on the table. So, you're an OEM. IBM's going to OEM their product and brand it IBM, Big Replication? >> Yeah, it's part of our Big Insights Portfolio. We've done a great job at growing this product line over the last few years, with last year talking about how we decoupled all the value-adds from the core distribution. So I'm happy to say that we're both part of the ODPI. It's an ODPI-certified distribution. That is Hadoop that we offer today for free. But then we've been adding not just in terms of the data management capabilities, but the partnership here that we're announcing with WANdisco and how we branded it as Big Replicate is squarely aimed at the data management market today. But where we're headed, as David points out, is really much bigger, right?
We're talking about support for not only distributed storage and data, but we're also talking about a hybrid offering that will get you to the cloud faster. So not only does Big Replicate work with HDFS, it also works with the Swift object store, which is, as you know, kind of the underlying storage for our cloud offering. So what we're hoping to see from this great partnership is as you see around you, Hadoop is a great market. But there's a lot more here when you talk about managing data that you need to consider. And I think hybrid is becoming a lot larger of a story than simply distributing your processing and your storage. It's becoming a lot more about okay, how do you offset different regions? How do you think through that there are multiple, I think there's this idea that there's one Hadoop cluster in an enterprise. I think that's factually wrong. I think what we're observing is that there's actually people who are spinning up, you know, multiple Hadoop distributions at the line of business for maybe a campaign or for maybe doing fraud detection, or maybe doing log file, whatever. And managing all those clusters, and they'll have Cloudera. They'll have Hortonworks. They'll have IBM. They'll have all of these different distributions that they're having to deal with. And what we're offering is sanity. It's like give me sanity for how I can actually replicate that data. >> I love the name Big Replicate, fantastic. Big Insights, Big Replicate. And so go to market, you guys are going to have bigger sales force. It's a nice pop for you guys. I mean, it's good deal. >> We were just talking before we came on air about sort of a deal flow coming through. It's coming through, this potential deal flow coming through, which has been off the charts. I mean, obviously when you turn on the tap, and then suddenly you enable thousands and thousands of sales people to start selling your products. I mean, IBM, are doing a great job.
And I think IBM are in a unique position where they own both cloud and on-prem. There are very few companies that own both the on-prem-- >> They're going to need to have that connection for the companies that are going hybrid. So hybrid cloud becomes interesting right now. >> Well, actually, it's, there's a theory that says okay, so, and we were just discussing this, the value of data lies in analytics, not in the data itself. It lies in you've been able to pull out information from that data. Most CIOs-- >> If you can get the data. >> If you can get the data. Let's assume that you've got the data. So then it becomes a question of, >> That's a big assumption. >> Yes, it is. (laughs) >> I just had Nancy Handling on about metadata. No, that's an issue. People have data they store they can't do anything with it. >> Exactly. And that's part of the problem because what you actually have to have is CPU slash processing power for an unknown amount of data at any one moment in time. Now, that sounds like an elastic use case, and you can't do elastic on-prem. You can only do elastic in cloud. That means that virtually every distribution will have to be a hybrid distribution. IBM realized this years ago and began to build this hybrid infrastructure. We're going to help them to move data, completely consistent data, between on-prem and cloud, so when you query things in the cloud, it's exactly the same results and the correct results you get. >> And also the stability too on that. There's so many potential, as we've discussed in the past, that sounds simple and logical. To do it enterprise grade is pretty complex. And so it just gives a nice, stable enterprise grade component. >> I mean, the volumes of data that we're talking about here are just off the charts. >> Give me a use case of a customer that you guys are working with, or has there been any go-to-market activity or an ideal scenario that you guys see as a use case for this partnership?
>> We're already seeing a whole bunch of things come through. >> What's the number one pattern that bubbles up to the top? Use case-wise. >> As Joel pointed out, that he doesn't believe that any one company just has one version of Hadoop behind their firewall. They have multiple vendors. >> 100% agree with that. >> So how do you create one, single cluster from all of those? >> John: That's one problem you solved. >> That's of course a very large problem. Second problem that we're seeing in spades is I have to move data to cloud to run analytics applications against it. That's huge. That required completely guaranteed consistent data between on-prem and cloud. And I think those two use cases alone account for pretty much every single company. >> I think there's even a third here. I think the third is actually, I think frankly there's a lot of inefficiencies in managing just HDFS and how many times you have to actually copy data. If I looked across, I think the standard right now is having like three copies. And actually, working with Big Replicate and WANdisco, you can actually have more assurances and actually have to make fewer copies across the cluster and actually across multiple clusters. If you think about that, you have three copies of the data sitting in this cluster. Likely, analysts have dragged a bunch of the same data into other clusters, so that's another multiple of three. So there's an amount of waste in terms of the same data living across your enterprise. That I think there's a huge cost-savings component to this as well. >> Does this involve anything with Project Atlas at all? You guys are working with, >> Not yet, no. >> That project? It's interesting. We're seeing a lot of opening up the data, but all they're doing is creating versions of it. And so then it becomes version control of the data. You see a master or a centralization of data? Actually, not centralize, pull all the data in one spot, but why replicate it? Do you see that going on?
I guess I'm not following the trend here. I can't see the mega trend going on. >> It's cloud. >> What's the big trend? >> The big trend is I need an elastic infrastructure. I can't build an elastic infrastructure on-premise. It doesn't make economic sense to build massive redundancy maybe three or four times the infrastructure I need on premise when I'm only going to use it maybe 10, 20% of the time. So the mega trend is cloud provides me with a completely economic, elastic infrastructure. In order to take advantage of that, I have to be able to move data, transactional data, data that changes all the time, into that cloud infrastructure and query it. That's the mega trend. It's as simple as that. >> So moving data around at the right time? >> And that's transaction. Anybody can say okay, press pause. Move the data, press play. >> So if I understand this correctly, and just, sorry, I'm a little slow. End of the day today. So instead of staging the data, you're moving data via the analytics engines. Is that what you're getting at? >> You use data that's being transformed. >> I think you're accessing data differently. I think today with Hadoop, you're accessing it maybe through like Flume or through Oozie, where you're building all these data pipelines that you have to manage. And I think that's obnoxious. I think really what you want is to use something like Apache Spark. Obviously, we've made a large investment in that earlier, actually, last year. To me, what I think I'm seeing is people who have very specific use cases. So, they want to do analysis for a particular campaign, and so they may just pull a bunch of data into memory from across their data environment. And that may be on the cloud. It may be from a third-party. It may be from a transactional system. It may be from anywhere. And that may be done in Hadoop. It may not, frankly.
>> Yeah, this is the great point, and again, one of the themes on the show is, this is a question that's kind of been talked about in the hallways. And I'd love to hear your thoughts on this. Is there are some people saying that there's really no traction for Hadoop in the cloud. And that customers are saying, you know, it's not about just Hadoop in the cloud. I'm going to put in S3 or object store. >> You're right. I think-- >> Yeah, I'm right as in what? >> Every single-- >> There's no traction for Hadoop in the cloud? >> I'll tell you what customers tell us. Customers look at what they actually need from storage, and they compare whatever it is, Hadoop or any on-premise proprietary storage array and then look at what S3 and Swift and so on offer to them. And if you do a side-by-side comparison, there isn't really a difference between those two things. So I would argue that it's a fact that functionally, storage in cloud gives you all the functionality that any customer would need. And therefore, the relevance of Hadoop in cloud probably isn't there. >> I would add to that. So it really depends on how you define Hadoop. If you define Hadoop by the storage layer, then I would say for sure. Like HDFS versus an object store, that's going to be a difficult one to find some sort of benefit there. But if you look at Hadoop, like I was talking to my friend Blake from Netflix, and I was asking him so I hear you guys are kind of like replatforming on Spark now. And he was basically telling me, well, sort of. I mean, they've invested a lot in Pig and Hive. So if you think now about Hadoop as this broader ecosystem which you brought up Atlas, we talk about Ranger and Knox and all the stuff that keeps coming out, there's a lot of people who are still invested in the peripheral ecosystem around Hadoop as that central point. My argument would be that I think there's still going to be a place for distributed computing kind of projects.
And now whether those will continue to interface through YARN and then down to HDFS, or whether that'll be YARN on say an object store or something and those projects will persist on their own. To me that's kind of more of how I think about the larger discussion around Hadoop. I think people have made a lot of investments in terms of that ecosystem around Hadoop, and that's something that they're going to have to think through. >> Yeah. And Hadoop wasn't really designed for cloud. It was designed for commodity servers, deployment with ease and at low cost. It wasn't designed for cloud-based applications. Storage in cloud was designed for storage in cloud. Right, that's with S3. That's what Swift and so on were designed specifically to do, and they fulfill most of those functions. But Joel's right, there will be companies that continue to use-- >> That's my whole argument. My whole argument is that why would you want to use Hadoop in the cloud when you can just do that? >> Correct. >> There's object store out. There's plenty of great storage opportunities in the cloud. They're mostly shoe-horning Hadoop, and I think that's, anyway. >> There are two classes of customers. There were customers that were born in the cloud, and they're not going to suddenly say, oh you know what, we need to build our own server infrastructure behind our own firewall 'cause they were born in the cloud. >> I'm going to ask you guys this question. You can choose to answer or not. Joel may not want to answer it 'cause he's from IBM and gets his wrist slapped. This is a question I got on DM. Hadoop ecosystem consolidation question. People are mailing in the questions. Now, keep sending me your questions if you don't want your name on it. Hold on, Hadoop system ecosystem. When will this start to happen? What is holding back the M and A? >> So, that's a great question.
First of all, consolidation happens when you sort of reach that tipping point or leveling off, that inflection point where the market levels off, and we've reached market saturation. So there's no more market to go after. And the big guys like IBM and so on come in-- >> Or there was never a market to begin with. (laughs) >> I don't think that's the case, but yes, I see the point. Now, what's stopping that from happening today, and you're a naughty boy by the way for asking this question, is a lot of these companies are still very well funded. So while they still have cash on the balance sheet, of course, it's very, very hard for that to take place. >> You picked up my next question. But that's a good point. The VCs held back in 2009 after the crash of 2008. Sequoia's memo, you know, the good times roll, or RIP good times. They stopped funding companies. Companies are getting funded, continually getting funding. Joel. >> So I don't think you can look at this market as like an isolated market like there's the Hadoop market and then there's a Spark market. And then even there's like an AI or cognitive market. I actually think this is all the same market. Machine learning would not be possible if you didn't have Hadoop, right? I wouldn't say it. It wouldn't have a resurgence that it has had. Mahout was one of the first machine learning languages that caught fire from Ted Dunning and others. And that kind of brought it back to life. And then Spark, I mean if you talk to-- >> John: I wouldn't say it creates it. Incubated. >> Incubated, right. >> And created that Renaissance-like experience. >> Yeah, deep learning, Some of those machine learning algorithms require you to have a distributed kind of framework to work in. And so I would argue that it's less of a consolidation, but it's more of an evolution of people going okay, there's distributed computing.
Do I need to do that on-premise in this Hadoop ecosystem, or can I do that in the cloud, or in a growing Spark ecosystem? But I would argue there's other things happening. >> I would agree with you. I love both areas. My snarky comment there was never a market to begin with, what I'm saying there is that the monetization of commanding the hill that everyone's fighting for was just one of many hills in a bigger field of hills. And so, you could be in a cul-de-sac of being your own champion of no paying customers. >> What you have-- >> John: Or a free open-source product. >> Unlike the dotcom era where most of those companies were in the public markets, and you could actually see proper valuations, most of the companies, the unicorns now, most are not public. So the valuations are really difficult to, and the valuation metrics are hard to come by. There are only few of those companies that are in the public market. >> The cash story's right on. I think to Joel's point, it's easy to pivot in a market that's big and growing. Just 'cause you're in the wrong corner of the market pivoting or vectoring into the value is easier now than it was 10 years ago. Because, one, if you have a unicorn situation, you have cash in the bank. So they have a good flush cash. Your runway's so far out, you can still do your thing. If you're a startup, you can get time to value pretty quickly with the cloud. So again, I still think it's very healthy. In my opinion, I kind of think you guys have good analysis on that point. >> I think we're going to see some really cool stuff happen working together, and especially from what I'm seeing from IBM, in the fact that in the IT crowd, there is a behavioral change that's happening that Hadoop opened the door to. That we're starting to see more and more IT professionals walk through. In the sense that, Hadoop has opened the door to not thinking of data as a liability, but actually thinking about data differently as an asset.
And I think this is where this market does have an opportunity to continue to grow as long as we don't get carried away with trying to solve all of the old problems that we solved for on-premise data management. Like if we do that, then we're just, then there will be a consolidation. >> Metadata is a huge issue. I think that's going to be a big deal. And on the M and A, my feeling on the M and A is that, you got to buy something of value, so you either have revenue, which means customers, and or intellectual property. So, in a market of open source, it comes back down to the valuation question. If you're IBM or Oracle or HP, they can pivot too. And they can be agile. Now slower agile, but you know, they can literally throw some engineers at it. So if there's no customers and IP, they can replicate, >> Exactly. >> That product. >> And we're seeing IBM do that. >> They don't know what they're buying. My whole point is if there's nothing to buy. >> I think it depends on, ultimately it depends on where we see people deriving value, and clearly in WANdisco, there's a huge amount of value that we're seeing our customers derive. So I think it comes down to that, and there is a lot of IP there, and there's a lot of IP in a lot of these companies. I think it's just a matter of widening their view, and I think WANdisco is probably the earliest to do this frankly. Was to recognize that for them to succeed, it couldn't just be about Hadoop. It actually had to expand to talk about cloud and talk about other data environments, right? >> Well, congratulations on the OEM deal. IBM, great name, Big Replicate. Love it, fantastic name. >> We're excited. >> It's a great product, and we've been following you guys for a long time, David. Great product, great energy. So I'm sure there's going to be a lot more deals coming on your. Good strategy is OEM strategy thing, huh? >> Oh yeah. >> It reduces sales cost. >> Gives us tremendous operational leverage.
Getting 4,000, 5,000-- >> You get a great partner in IBM. They know the enterprise, great stuff. This is theCUBE bringing all the action here at Hadoop. IBM OEM deal with WANdisco all happening right here on theCUBE. Be back with more live coverage after this short break.

Published Date : Jul 1 2016

ENTITIES

Entity	Category	Confidence
David	PERSON	0.99+
Joel	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Oracle	ORGANIZATION	0.99+
Joe	PERSON	0.99+
David Richards	PERSON	0.99+
Joel Horowitz	PERSON	0.99+
2009	DATE	0.99+
John	PERSON	0.99+
4	QUANTITY	0.99+
WANdisco	ORGANIZATION	0.99+
John Furrier	PERSON	0.99+
20	QUANTITY	0.99+
San Jose	LOCATION	0.99+
HP	ORGANIZATION	0.99+
thousands	QUANTITY	0.99+
Joel Horwitz	PERSON	0.99+
Ted Dunning	PERSON	0.99+
Big Replicate	ORGANIZATION	0.99+
last year	DATE	0.99+
Silicon Valley	LOCATION	0.99+
Big Replicate	ORGANIZATION	0.99+
40	QUANTITY	0.99+
30	QUANTITY	0.99+
Silicon Valley	LOCATION	0.99+
third	QUANTITY	0.99+
today	DATE	0.99+
Hadoop	TITLE	0.99+
San Jose, California	LOCATION	0.99+
three	QUANTITY	0.99+
two things	QUANTITY	0.99+
2008	DATE	0.99+
5,000 people	QUANTITY	0.99+
Hortonworks	ORGANIZATION	0.99+
100%	QUANTITY	0.99+
David Richards	PERSON	0.99+
Blake	PERSON	0.99+
4,000, 5,000	QUANTITY	0.99+
S3	TITLE	0.99+
two classes	QUANTITY	0.99+
tomorrow	DATE	0.99+
Second problem	QUANTITY	0.99+
both areas	QUANTITY	0.99+
three copies	QUANTITY	0.99+
Hadoop Summit 2016	EVENT	0.99+
Swift	TITLE	0.99+
both	QUANTITY	0.99+
Big Insights	ORGANIZATION	0.99+
one problem	QUANTITY	0.98+
Today	DATE	0.98+

Search Results for Apache Spark: