Arun Murthy, Hortonworks | theCUBE NYC 2018


 

>> Live from New York, it's theCUBE, covering CubeNYC 2018, brought to you by SiliconANGLE Media and its ecosystem partners.
>> Okay, welcome back everyone, here live in New York City for CubeNYC, formerly Big Data NYC, now called CubeNYC. The topic has moved beyond big data. It's about cloud, it's about data, it's also about potentially blockchain in the future. I'm John Furrier, with Dave Vellante. We're happy to have a special guest here, Arun Murthy. He's the co-founder and chief product officer of Hortonworks, been in the ecosystem from the beginning, at Yahoo, already been on theCUBE many times, but great to see you, thanks for coming in,
>> My pleasure,
>> appreciate it.
>> thanks for having me.
>> Super smart to have you on here, because a lot of people have been squinting through the noise of the marketplace. You guys have been on this DataPlane idea for a few years now. Cloudera actually launched Hadoop first; you came after, out of Yahoo, the second of the two big players. You evolved it quickly; you guys saw early on that this is bigger than Hadoop. And now all the conversations are about what you guys have been talking about for three years. Give us the update, what's the product update? How is hybrid a big part of that, what's the story?
>> We started off being the Hadoop company, and Rob, our CEO, who was here on theCUBE a couple of hours ago, calls that sort of the phase one of the company, where we were a Hadoop company. We very quickly realized we had to help enterprises manage the entire lifecycle of data, all the way from the edge to the data center, to the cloud, and in between, right. Which is why we did the acquisition of Onyara, which we've been talking about, and which became the basis of our Hortonworks DataFlow product. And then as we went through that phase of the journey, it was quickly obvious to us that enterprises had to manage data and applications in a hybrid manner, right, which is both on-prem and public cloud, and increasingly the edge, which is where we spend a lot of time these days, with IoT and everything from autonomous cars to video monitoring, all these aspects coming in. Which is why we built the DataPlane architecture; it allows you to get to a consistent security and governance model. There's a lot of, I'll call it a lot of, a lot of fighting about the cloud being insecure and so on; I don't think there's anything inherently insecure about the cloud. The issue that we see is lack of skills. Our enterprises know how to manage the data on-prem; they know how to do LDAP, groups, and Kerberos, and AAD, and what have you. They just don't have the skill sets yet to be able to do it on the public cloud, which leads to mistakes occasionally.
>> Um-hm.
>> And data breaches and so on. So we recognized really early that part of DataPlane was to get that consistent security and governance model, so you don't have to worry about how you set up IAM roles on Amazon versus LDAP on-prem versus something else on Google.
>> It's operating consistency.
>> It's operating consistency, exactly. I've talked about this in the past. So getting DataPlane out was that journey, and what we announced this week was that we wanted to take that a step further; we've been able to kind of allow enterprises to manage this hybrid architecture, on-prem, multiple public clouds.
>> And the edge.
>> In a connected manner. The issue we saw early on, and it's something we've been working on for a long while,
is that we've been able to connect the architectures. Hadoop, when it started, was more of an on-premise architecture, right, and I was there in 2005, 2006 when it started. Hadoop was born in the web world; we had a gigabit of Ethernet to the node, and from the rack on up we had only eight gigs, so if you have a 2,000-node cluster you're dealing with eight gigs of connection (the arithmetic is sketched after this exchange).
>> Bottleneck.
>> Huge bottleneck. Fast forward to today, you have at least ten if not one hundred gigabits, moving toward terabit architectures. And what's happening is that everything in that world upends the assumptions we had in Hadoop. The good news is that when the cloud came along, the cloud already had decoupled storage and compute architectures. As we've helped customers navigate the two worlds with DataPlane, it's been a journey that's been reasonably successful, and I think we have an opportunity to provide identical, consistent architectures both on-prem and on the cloud. So it's almost like we took Hadoop and adapted it to the cloud; I think we can adapt the cloud architecture back on-prem, too, to have consistent architectures.
>> So talk about the cloud native architecture. So you have a post that just got published. Cloud native architecture for big data and the data center. No, cloud native architecture for big data in the data center. That's hybrid; explain the hybrid model, how do you define that?
>> Like I said, for us it's really important to be able to have consistent architectures, consistent security, consistent governance, a consistent way to manage data, and a consistent way to actually develop and port applications. So portability for data is important, which is why having security and governance consistently is key. And then portability for the applications themselves is important, which is why we are so excited to be kind of first to embrace the whole containerize-the-ecosystem initiative. We've announced the Open Hybrid Architecture Initiative, which is about decoupling storage and compute and then leveraging containers for all the big data apps, for the entire ecosystem. And this is where we are really excited to be working with both IBM and Red Hat, especially Red Hat, given their investments in Kubernetes and OpenShift. We see that much like you have S3 and EC2, S3 for storage and EC2 for compute, and the same thing with ADLS and Azure compute, you'll actually have the next-gen HDFS and Kubernetes.
>> So is this a massive architectural rewrite, or is it more sort of management around the core?
>> Great question. So part of it is evolution of the architecture. Whether it's Spark or Kafka or any of these open source projects, we need to do some evolution in the architecture to make them work in the containerized world. So we are containerizing every one of the 28, 30 animals in the zoo, right. That's a lot of work, but we know how to do it; we've done it in the past. And to your point, it's not enough to just have the architecture; you need to have a consistent fabric to be able to manage and operate it, which is really where DataPlane comes in again. That was really the point of DataPlane all along. This is a multi-year roadmap; you know, when we sit down, we are thinking about what we'll do in '22 and '23. But we really have to execute on a multi-year roadmap.
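
To put rough numbers on the rack-uplink bottleneck described above, here is a back-of-the-envelope sketch in Python; the 40-nodes-per-rack figure is an assumption typical for that era, not something stated in the interview.

```python
# Back-of-the-envelope math for the rack-uplink bottleneck Murthy describes.
# Assumption: ~40 nodes per rack (typical for that era; not stated above).

nodes = 2000
nodes_per_rack = 40
racks = nodes // nodes_per_rack               # 50 racks

node_nic_gbps = 1                             # 1 GbE per node, circa 2006
rack_uplink_gbps = 8                          # 8 Gb/s from each rack to the core

# Bandwidth a node can actually get when the whole rack talks off-rack:
per_node_offrack_gbps = rack_uplink_gbps / nodes_per_rack   # 0.2 Gb/s
oversubscription = (nodes_per_rack * node_nic_gbps) / rack_uplink_gbps  # 5x

print(f"{racks} racks, {per_node_offrack_gbps:.2f} Gb/s per node off-rack, "
      f"{oversubscription:.0f}:1 oversubscription")
# This is why early Hadoop moved compute to the data: off-rack bandwidth was
# scarce. With 100 Gb/s fabrics that assumption inverts, which is what makes
# decoupled storage and compute practical.
```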
>> And DataPlane was a linchpin.
>> Well, it was like the sharp edge of the sword, right, it was the tip of the spear, but really the idea was always that we had to get DataPlane in to kind of get that hybrid product out there. And then we can get to an intergenerational DataPlane which will work with the next generation of the big data ecosystem itself.
>> Do you see Kubernetes and things like Kubernetes, you've got Istio, a few service meshes up the stack,
>> Absolutely.
>> are going to play a pretty instrumental role around orchestrating workloads and providing new stateless and stateful applications with data? So now, data, you've got more data being generated there. So this is a new dynamic; it sounds like that's a fit for what you guys are doing.
>> Which is something we've seen for a while now. Containers are something we've tracked for a long time, and we're really excited to see Docker and Red Hat, all the work that they are doing with containers, getting the security right and so on. It's the maturing of that ecosystem, and now, the ability to build and port applications. And the really cool part for me is that we will definitely see Kubernetes and OpenShift on-prem, but even if you look at the cloud, the really nice part is that each of the cloud providers themselves provides a Kubernetes service. Whether it's GKE on Google or Fargate on Amazon or AKS on Microsoft, we will be able to take identical architectures and leverage them. When we containerize Hive or Spark, we will be able to run them on Kubernetes with OpenShift on-prem, but also on GKE and Fargate and AKS in the public cloud.
>> What's interesting about the Red Hat relationship, and I think you guys are smart to do this, is that by partnering with Red Hat, customers can run their analytical workloads in the same production environment that Red Hat is in. But with kind of differentiation, if you will.
>> Exactly, with DataPlane.
>> DataPlane is just a wonderful thing there. So again, good move there. Now, around the ecosystem: who else are you partnering with? What else do you see out there? Who is in your world that is important?
>> You know, again, our friends at IBM, who we've had a long relationship with. We are doing a lot of work with IBM to integrate DataPlane and also ICP for Data, which is IBM Cloud Private for Data, which brings along all of the IBM ecosystem, whether it's Db2 or IGC, the Information Governance Catalog; all of that can work back in this world. What we also believe this will give a fillip to is the whole continued standardization of security and governance. So you guys remember the old ODPi; it caused a bit of a flutter a few years ago. (anxious laughing)
>> We know how that turned out.
>> What we did was we kind of said, the old ODPi was based on the old distributions; now it's ODPi's turn to be more about metadata and governance. So we are collaborating with IBM on ODPi, more on metadata and governance, because again we see that as being very critical in this multi-cloud, on-prem, edge world.
>> Well, the narrative was always, why do you need it? But it's clear that these three companies have succeeded dramatically; when you look at the financials, there have been statements made about IBM's contribution of seven-figure deals to you guys. We had Red Hat on, and you guys are birds of a feather.
[Murthy] Exactly.
>> It certainly worked for you three, which presumably means it confers value to your customers.
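
As a concrete illustration of the "identical architectures on any Kubernetes" point: a minimal sketch of submitting a containerized Spark job, where only the API-server URL changes between an on-prem OpenShift cluster, GKE, or AKS. The endpoint and image names are hypothetical placeholders, not details from the interview.

```python
import subprocess

# Hypothetical cluster endpoint; swap in OpenShift, GKE, or AKS without
# changing anything else about the job definition.
K8S_MASTER = "k8s://https://my-cluster.example.com:6443"
SPARK_IMAGE = "example.com/bigdata/spark:2.4"   # placeholder container image

# These spark-submit flags are the standard Spark-on-Kubernetes ones
# introduced in Spark 2.3+.
subprocess.run([
    "spark-submit",
    "--master", K8S_MASTER,
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=2",
    "--conf", f"spark.kubernetes.container.image={SPARK_IMAGE}",
    "local:///opt/spark/examples/jars/spark-examples.jar",
], check=True)
```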
>> Which is really important, right. From a customer standpoint, something we really focus on is that the benefit of the bargain is that now they understand that some of their key vendor partners, that's us and IBM and Red Hat, have a shared roadmap, so now they can be much more sure about the fact that they can go to containers and Kubernetes and so on, because all of the tools they depend on and all the partners they depend on are working together.
>> So they can place bets.
>> So they can place bets, and the important thing is that they can place longer-term bets. Not a quarterly bet; we hear customers talking about building their next-gen data centers with Kubernetes in mind.
>> They have to.
>> They have to, right, and it's more than just standing machines up, because what happens in this world is things like networking change; the way you do networking in this world with Kubernetes is different than the way you did it before. So now they have to place longer-term bets, and they can do that now with the guarantee that the three of us will work together to deliver on the architecture.
>> Well, Arun, great to have you on theCUBE, great to see you. Final question for you: you guys have a good long-term plan, which is very cool. Short term, customers are realizing the set-up phase is over; okay, now they're in usage mode. So the data has got to deliver value, so there is real pressure for ROI. We would give people a little bit of a pass earlier on, because you had to set up everything, set up the data lakes, do all this stuff, get it all operationalized. But now, with AI and machine learning front and center, that's a signal that people want to start putting this to work. What have you seen customers gravitate to from the product side? Where are they going? Is it the streaming, is it Kafka, what products are they gravitating to?
>> Yeah, definitely. In my role, I look at these in terms of use cases, right. We are certainly seeing a continued push towards the real-time analytics space, which is why we placed a longer-term bet on HDF and Kafka and so on. What's been really heartening, kind of back to your sentiment, is we are seeing a lot of push right now on security and governance. That's why, for GDPR, we introduced a bunch of capabilities in DataPlane with DSS, the Data Steward Studio; James Kobielus wrote about this earlier in the year. We are seeing customers really push us for key aspects like GDPR. This is, for me, a reflection of the maturing of the ecosystem. It means that it's no longer something on the side that you play with; the whole ecosystem is now more a system of record instead of a system of augmentation. That is really heartening, but it also brings a sharper focus and more sort of responsibility onto our shoulders.
>> Awesome. Well, congratulations, you guys have a stock price at a 52-week high. Congratulations.
>> Those things take care of themselves.
>> Good products, and stock prices take care of themselves.
>> Okay, theCUBE coverage here in New York City. I'm John Furrier; stay with us for more live coverage of all things data happening here in New York City. We will be right back after this short break. (digital beat)

Published Date : Sep 12 2018


Arun Murthy, Hortonworks | DataWorks Summit 2018


 

>> Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2018, brought to you by Hortonworks.
>> Welcome back to theCUBE's live coverage of DataWorks here in San Jose, California. I'm your host, Rebecca Knight, along with my cohost, Jim Kobielus. We're joined by Aaron Murphy, Arun Murthy, sorry. He is the co-founder and chief product officer of Hortonworks. Thank you so much for returning to theCUBE. It's great to have you on.
>> Yeah, likewise. It's been a fun time getting back, yeah.
>> So you were on the main stage this morning in the keynote, and you were describing the journey, the data journey, that so many customers are on right now, and you were talking about the cloud, saying that the cloud is part of the strategy but it really needs to fit into the overall business strategy. Can you describe a little bit of your approach to that?
>> Absolutely, and the way we look at this is we help customers leverage data to actually deliver better capabilities, better services, better experiences to their customers, and that's the business we are in. Now, with that, we obviously look at cloud as a really key part of the overall strategy, in terms of how you want to manage data on-prem and on the cloud. We kind of joke that we ourselves live in a world of real-time data. We just live in it, and data is everywhere. You might have trucks on the road, you might have drones, you might have sensors, and you have it all over the world. At this point, we've kind of got to a place where enterprises understand that they could manage all the infrastructure themselves, but in a lot of cases it will make a lot more sense to actually lease some of it, and that's the cloud. It's the same way, if you're delivering packages, you don't go buy planes and lay out roads; you go to FedEx and actually let them handle that for you. That's kind of what the cloud is. So that is why we really fundamentally believe that we have to help customers leverage infrastructure wherever it makes sense pragmatically, both from an architectural standpoint and from a financial standpoint, and that's why we talk about how your cloud strategy is part of your data strategy, which is actually fundamentally part of your business strategy.
>> So how are you helping customers to leverage this? What is on their minds, and what's your response?
>> Yeah, it's really interesting. Like I said, cloud is cloud, and infrastructure management is certainly something that's at the top of the mind for every CIO today. And what we've consistently heard is they need a way to manage all this data and all this infrastructure in a hybrid, multi-tenant, multi-cloud fashion. Because in some geos you might not have your favorite cloud vendor. You know, parts of Asia are a great example; you might have to use one of the Chinese clouds. You go to parts of Europe, especially with things like GDPR, the data residency laws and so on, and you have to be very, very cognizant of where your data gets stored and where your infrastructure is present. And that is why we fundamentally believe it's really important to give enterprises a fabric with which they can manage all of this, and hide the details of all of the underlying infrastructure from them as much as possible.
>> And that's DataPlane Services.
>> And that's DataPlane Services, exactly, the Hortonworks DataPlane Service we launched in October of last year. Actually, I was on theCUBE talking about it back then too.
We see a lot of interest, a lot of excitement around it, because now they understand that, again, this doesn't mean that we drive it down to the least common denominator. It is about helping enterprises leverage the key differentiators of each of the cloud vendors' products. For example, Google, with which we announced a partnership, is really strong on AI and ML. So if you are running TensorFlow and you want to deal with things like Kubernetes, GKE is a great place to do it. And, for example, you can now go to Google Cloud and get TPUs, which work great for TensorFlow. Similarly, a lot of customers run on Amazon for a bunch of the operational stuff, Redshift as an example. So in the world we live in, we want to help the CIO leverage the best pieces of the cloud, but then give them a consistent way to manage and govern that data. We were joking on stage that IT has just about learned how to deal with Kerberos and Hadoop, and now we're telling them, "Oh, go figure out IAM on Google," which is also IAM on Amazon, but they are completely different; the only thing that's consistent is the name. So I think we have a unique opportunity, especially with the open source technologies like Atlas, Ranger, Knox and so on, to be able to draw a consistent fabric over this for security and governance, and help the enterprise leverage the best parts of the cloud to put a best-fit architecture together, which also happens to be a best-of-breed architecture.
>> So the fabric is everything you're describing: all the Apache open source projects in which Hortonworks is a primary committer and contributor are able to share schemas and policies and metadata and so forth across this distributed, heterogeneous fabric of public and private cloud segments within a distributed environment.
>> Exactly.
>> That's increasingly being containerized, in terms of the applications, for deployment to edge nodes. Containerization is a big theme in HDP 3.0, which you announced at this show.
>> Yeah.
>> So, if you could give us a quick sense for how that containerization capability plays into more of an edge focus for what your customers are doing.
>> Exactly, great point. And again, the fabric: obviously the core parts of the fabric are the open source projects, but we've also done a lot of net new innovation with DataPlane, which, by the way, is also open source. It's a new product and a new platform that you can actually leverage, to lay over the open source ones you're familiar with. And again, like you said, containerization is what is actually driving the fundamentals of this. The details matter; at the scale at which we operate, we're talking about thousands of nodes, terabytes of data, and the details really matter, because a 5% improvement at that scale leads to millions of dollars in optimization for capex and opex. So all of that, the details, are being fueled and driven by the community, which is kind of what we did with HDP 3. And the key ones, like you said, are containerization, because now we can actually get complete agility in terms of how you deploy the applications. You get isolation not only at the resource management level with containers, but you also get it at the software level, which means, if two data scientists wanted to use a different version of Python or Scala or Spark or whatever it is, they get that consistently and holistically. Now they can actually go from the test/dev cycle into production in a completely consistent manner.
So that's why containers are so big, because now we can actually leverage them across the stack, and you see things like MiNiFi showing up. We can actually--
>> Define MiNiFi before you go further. What is MiNiFi, for our listeners?
>> Great question. Yeah, so we've always had NiFi--
>> Real-time
>> Real-time data flow management, and NiFi was still sort of within the data center. What MiNiFi is, is actually a really, really small layer, a small thin library if you will, that you can throw on a phone, a doorbell, a sensor, and that gives you all the capabilities of NiFi, but at the edge.
>> Mmm.
Right? And it's actually not just data flow; what is really cool about NiFi is that it's actually command and control. You can actually do bidirectional command and control, so you can actually change, in real-time, the flows you want, the processing you do, and so on. So what we're trying to do with MiNiFi is actually not just collect data from the edge, but also push the processing as much as possible to the edge, because we really do believe a lot more processing is going to happen at the edge, especially with the ASICs and so on coming out. There will be custom hardware that you can throw in and essentially leverage at the edge to actually do this processing. And we believe, you know, we want to do that even at the cost of the data not actually landing at rest, because at the end of the day we're in the insights business, not in the data storage business.
>> Well, I want to get back to that. You were talking about innovation and how so much of it is driven by the open source community, and you're a veteran of the big data open source community. How do we maintain that? How does that continue to be the fuel?
>> Yeah, and a lot of it starts with just being consistent. From day one, James was around back then, in 2011 when we started, we've always said, "We're going to be open source," because we fundamentally believed that the community is going to out-innovate any one vendor, regardless of how much money they have in the bank. So we really do believe that's the best way to innovate, mostly because there is a sense of shared ownership of that product. It's not just one vendor throwing some code out there trying to shove it down the customers' throats. And we've seen this over and over again, right. Three years ago, a lot of the DataPlane stuff we talk about, it comes from Atlas and Ranger and so on; none of these existed. These actually came from the fruits of the collaboration with the community, with actually some very large enterprises being a part of it. So it's a great example of how we continue to drive it, because we fundamentally believe that that's the best way to innovate, and we continue to believe so.
>> Right. And the community, the Apache community as a whole, has so many different projects. For example, in streaming, there is Kafka,
>> Okay.
>> and there are others that address a core set of common requirements but in different ways, supporting different approaches, for example, doing streaming with stateless transactions and so forth, or stateless semantics and so forth. It seems to me that Hortonworks is shifting towards being more of a streaming-oriented vendor, away from data at rest. Though, I should say, HDP 3.0 has got great scalability and storage efficiency capabilities baked in.
I wonder if you could just break it down a little bit: what are the innovations or enhancements in HDP 3.0 for those of your core customers, which is most of them, who are managing massive multi-terabyte, multi-petabyte, distributed, federated big data lakes? What's in HDP 3.0 for them?
>> Oh, lots. Again, like I said, we obviously spend a lot of time on the streaming side, because that's where we see things going; we live in a real-time world. But again, we don't do it at the cost of our core business, which continues to be HDP. And as you can see, the community continues to drive it. We talked about containerization, a massive step up for the Hadoop community. We've also added support for GPUs. Again, think about doing machine learning at scale.
>> Graphics processing units,
>> Graphical--
>> AI, deep learning
>> Yeah, it's huge. Deep learning, TensorFlow and so on really, really need custom hardware, sort of GPUs, if you will. So that's coming; that's in HDP 3. We've added a whole bunch of scalability improvements with HDFS. We've added federation, because now you can go over a billion files, a billion objects, in HDFS. We also added capabilities for--
>> But you indicated yesterday, when we were talking, that very few of your customers need that capacity yet, but you think they will, so--
>> Oh, for sure. Again, part of this is, as we enable more sources of data in real-time, that's the fuel which drives it, and that was always the strategy behind the HDF product. It was about, can we leverage the synergies between the real-time world, feed that into what you do today in your classic enterprise with data at rest, and that is what is driving the necessity for scale.
>> Yes.
>> Right. We've done that. We spent a lot of work, again, lowering the total cost of ownership, the TCO, so we added erasure coding.
>> What is that, exactly?
>> Yeah, so erasure coding is a classic storage concept. You know, HDFS has always had three replicas, for redundancy, fault tolerance, and recovery. Now, it sounds okay having three replicas because it's cheap disk, right. But when you start to think about our customers running 70, 80 petabytes of data, those three replicas add up, because you've now gone from 80 petabytes of effective data to actually a quarter of an exabyte in terms of raw storage. So now what we can do with erasure coding is, instead of storing the three blocks, we actually store parity; we store the encoding of it, which means we can actually go down from three to like two, one and a half, whatever we want to do. So, if we can get from three copies to one and a half, especially for your core data,
>> Yeah,
>> the data you're not accessing every day, it results in massive savings in terms of your infrastructure costs (the arithmetic is sketched below). And that's kind of what we're in the business of doing, helping customers do better with the data they have, whether it's on-prem or on the cloud; we want to help customers be comfortable getting more data under management, along with security and a lower TCO. The other big piece I'm really excited about in HDP 3 is all the work that's happened in the Hive community, for what we call the real-time database.
>> Yes.
>> As you guys know, you've followed the whole SQL wars in the Hadoop space.
>> And Hive has changed a lot in the last several years; this is very different from what it was five years ago.
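
The erasure coding arithmetic above, worked through as a minimal sketch. It uses Reed-Solomon RS(6,3), the default erasure coding policy in Hadoop 3's HDFS, as the example scheme; the interview doesn't name a specific policy, so that choice is an assumption.

```python
# Storage overhead: 3x replication vs. Reed-Solomon erasure coding.
# RS(6,3) (6 data blocks + 3 parity blocks) is the default HDFS EC policy
# in Hadoop 3; the interview doesn't name it, so treat it as an assumption.

effective_pb = 80                       # logical data, in petabytes

# Classic HDFS: every block stored three times.
replicated_raw_pb = effective_pb * 3    # 240 PB, roughly a quarter exabyte

# RS(6,3): for every 6 data blocks, store 3 parity blocks -> 1.5x overhead.
data_blocks, parity_blocks = 6, 3
ec_overhead = (data_blocks + parity_blocks) / data_blocks   # 1.5
ec_raw_pb = effective_pb * ec_overhead                      # 120 PB

print(f"3x replication: {replicated_raw_pb} PB raw")
print(f"RS({data_blocks},{parity_blocks}) erasure coding: {ec_raw_pb:.0f} PB raw")
print(f"Savings: {replicated_raw_pb - ec_raw_pb:.0f} PB of disk")
```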
>> The only thing that's the same from five years ago is the name. (laughing)
>> So again, the community has done a phenomenal job. We used to call it, like, a SQL engine on HDFS; from there, with Hive 3, which is part of HDP 3, it's now a full-fledged database. It's got full ACID support. In fact, the ACID support is so good that writing ACID tables is at least as fast as writing non-ACID tables now (a short example follows this exchange). And you can do that not only on--
>> Transactional database.
>> Exactly. Not only can you do it on-prem, you can do it on S3, so you can actually drive the transactions through Hive on S3. We've done a lot of work, you were there yesterday when we were talking about some of the performance work we've done with LLAP and so on, to actually give consistent performance both on-prem and in the cloud, and this is a lot of effort, simply because the performance characteristics you get from the storage layer with HDFS versus S3 are significantly different. So now we have been able to bridge those with things like LLAP. We've also done a lot of work to enhance the security model around it, governance and security. So now you get things like column-level masking, row-level filtering, all the standard stuff that you would expect, and more, from an enterprise warehouse. We talk to a lot of our customers who are maintaining literally tens of thousands of views, because they didn't have the capabilities that exist in Hive now.
>> Mmm-hmm. And I'm sitting here kind of being amazed that, for an open source set of tools, to have the best security and governance at this point is pretty amazing, coming from where we started off.
>> And it's absolutely essential for GDPR compliance, and compliance with HIPAA and every other mandate and sensitivity that requires you to protect personally identifiable information, so very important. So in many ways Hortonworks has one of the premier big data catalogs for all manner of compliance requirements that your customers are chasing.
>> Yeah, and James, you wrote about it in the context of Data Steward Studio, which we introduced.
>> Yes.
>> You know, things like consent management, having--
>> A consent portal,
>> A consent portal,
>> in which the customer can indicate the degree to which
>> Exactly.
>> they require controls over the management of their PII, possibly to be forgotten, and so forth.
>> Yeah, the right to be forgotten, and consent even for analytics. Within the context of GDPR, you have to allow the customer to opt out of analytics, of them being part of an analytic itself, right.
>> Yeah.
>> So things like those are now something we enable through the enhanced security models that are done in Ranger. So now, the really cool part of what we've done with GDPR is that we can get all these capabilities on existing data and existing applications by just adding a security policy, not rewriting them. It's a massive, massive, massive deal, which I cannot tell you how much customers are excited about, because they now understand. They were sort of freaking out that "I have to go to 30, 40, 50 thousand enterprise apps and change them" to take advantage, to actually provide consent and the right to be forgotten. The fact that you can do that now by changing a security policy with Ranger is huge for them.
>> Arun, thank you so much for coming on theCUBE. It's always so much fun talking to you.
>> Likewise. Thank you so much.
>> I learned something every time I listen to you.
>> Indeed, indeed.
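
To make the ACID point concrete: a minimal sketch of creating and updating a transactional Hive 3 table from Python via PyHive. The host, port, and table names are hypothetical placeholders, not details from the interview.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Hypothetical HiveServer2 endpoint; adjust host/port/auth for a real cluster.
conn = hive.connect(host="hive.example.com", port=10000, database="default")
cur = conn.cursor()

# In Hive 3, full-ACID tables are ORC tables marked transactional.
# (In HDP 3, managed ORC tables default to transactional; this is explicit.)
cur.execute("""
    CREATE TABLE IF NOT EXISTS customer_consent (
        customer_id BIGINT,
        consented   BOOLEAN
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true')
""")

# Row-level UPDATE and DELETE, which classic Hive could not do, now work:
cur.execute("INSERT INTO customer_consent VALUES (42, true)")
cur.execute("UPDATE customer_consent SET consented = false WHERE customer_id = 42")
cur.execute("DELETE FROM customer_consent WHERE customer_id = 42")
```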
I'm Rebecca Knight, for James Kobielus; we will have more from theCUBE's live coverage of DataWorks just after this. (Techno music)

Published Date : Jun 19 2018


Arun Murthy, Hortonworks | BigData NYC 2017


 

>> Host: Live from midtown Manhattan, it's theCUBE, covering BigData New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. (upbeat electronic music)
>> Welcome back, everyone. We're here, live, on day two of our three days of coverage of BigData NYC. This is our event that we put on every year. It's our fifth year doing BigData NYC, in conjunction with Hadoop World, which evolved into Strata Conference, which evolved into Strata Hadoop, now called Strata Data. Probably next year it will be called Strata AI, but we're still theCUBE, we'll always be theCUBE, and this is our BigData NYC, our eighth year covering the big data world since Hadoop World. And then as Hortonworks came on, we started covering Hortonworks' data summit.
>> Arun: DataWorks Summit.
>> DataWorks Summit. Arun Murthy, my next guest, co-founder and chief product officer of Hortonworks. Great to see you, looking good.
>> Likewise, thank you. Thanks for having me.
>> Boy, what a journey. Hadoop, years ago,
>> 12 years now.
>> I still remember, you guys came out of Yahoo, you put Hortonworks together, and since then you've gone public, first to go public, then Cloudera just went public. So the Hadoop world is pretty much out there, everyone knows where it's at, it's got a nice use case, but the whole world's moved around it. You guys were really the first of the Hadoop players, before even Cloudera, on this notion of data in flight, or, as I call it, real-time data, but I think you guys call it data-in-motion. Batch, we all know what Batch does, there's a lot of things to do with Batch, you can optimize it, it's not going anywhere, it's going to grow. Real-time data-in-motion is a huge deal. Give us the update.
>> Absolutely. You know, we've obviously been in this space; personally, I've been in this for about 12 years now. So we've had a lot of time to think about it.
>> Host: Since you were 12?
>> Yeah. (laughs) Almost. Probably look like it. So, back in 2014 and '15, when we went public and we started looking around, the thesis always was, yes, Hadoop is important, we're going to help you manage lots and lots of data, but a lot of the stuff we've done since the beginning, starting with YARN and so on, was really to enable the use cases beyond the traditional transactions and analytics. And Rob, our CEO, calls it, his vision's always been, we've got to get into a pre-transactional world, if you will, rather than the post-transactional analytics and BI and so on. So that's where it started. And increasingly, the obvious next step was to say, look, enterprises want to be able to get insights from data, but increasingly they want to get those insights and deal with them in real-time. You know, while you're in your shopping cart, they want to make sure you don't abandon your shopping cart. If you're at a retailer, and you're in an aisle and you're about to walk away from a dress, they want to be able to do something about it. So this notion of real-time is really important, because it helps the enterprise connect with the customer at the point of action, if you will, and provide value right away, rather than having to try to do this post-transaction (a toy sketch of this pattern appears below). So it's been a really important journey. We went and bought this company called Onyara, which is a bunch of geeks like us who started off with the government and built this Apache NiFi thing; huge community, it's just, like, taking off at this point. It's been a fantastic thing to join hands, join the team, and keep pushing in the whole streaming data space.
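
The shopping cart example translates naturally into a streaming pattern. Here is a toy sketch of the idea in plain Python; the event names and threshold are invented for illustration, and a production version would run on something like NiFi or Kafka, as discussed above.

```python
import time

ABANDON_AFTER_SECONDS = 15 * 60        # invented threshold for illustration
last_cart_activity = {}                # customer_id -> timestamp of last event

def on_event(customer_id, event_type):
    """Feed each clickstream event in as it arrives."""
    if event_type in ("add_to_cart", "update_cart"):
        last_cart_activity[customer_id] = time.time()
    elif event_type == "checkout":
        last_cart_activity.pop(customer_id, None)   # cart converted

def sweep_for_abandonment():
    """Run periodically: act while the customer is still reachable."""
    now = time.time()
    for customer_id, ts in list(last_cart_activity.items()):
        if now - ts > ABANDON_AFTER_SECONDS:
            del last_cart_activity[customer_id]
            send_offer(customer_id)    # e.g. push a discount, in real-time

def send_offer(customer_id):
    print(f"nudging customer {customer_id} before the cart is abandoned")
```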
We went and bought this company called Onyara, which is a bunch of geeks like us who started off with the government, built this batching NiFi thing, huge community. Its just, like, taking off at this point. It's been a fantastic thing to join hands and join the team and keep pushing in the whole streaming data style. >> There's a real, I don't mean to tangent but I do since you brought up community I wanted to bring this up. It's been the theme here this week. It's more and more obvious that the community role is becoming central, beyond open-source. We all know open-source, standing on the shoulders before us, you know. And Linux Foundation showing code numbers hitting up from $64 million to billions in the next five, ten years, exponential growth of new code coming in. So open-source certainly blew me. But now community is translating to things you start to see blockchain, very community based. That's a whole new currency market that's changing the financial landscape, ICOs and what-not, that's just one data point. Businesses, marketing communities, you're starting to see data as a fundamental thing around communities. And certainly it's going to change the vendor landscape. So you guys compare to, Cloudera and others have always been community driven. >> Yeah our philosophy has been simple. You know, more eyes and more hands are better than fewer. And it's been one of the cornerstones of our founding thesis, if you will. And you saw how that's gone on over course of six years we've been around. Super-excited to have someone like IBM join hands, it happened at DataWorks Summit in San Jose. That announcement, again, is a reflection of the fact that we've been very, very community driven and very, very ecosystem driven. >> Communities are fundamentally built on trust and partnering. >> Arun: Exactly >> Coding is pretty obvious, you code with your friends. You code with people who are good, they become your friends. There's an honor system among you. You're starting to see that in the corporate deals. So explain the dynamic there and some of the successes that you guys have had on the product side where one plus one equals more than two. One plus one equals five or three. >> You know IBM has been a great example. They've decided to focus on their strengths which is around Watson and machine learning and for us to focus on our strengths around data management, infrastructure, cloud and so on. So this combination of DSX, which is their data science work experience, along with Hortonworks is really powerful. We are seeing that over and over again. Just yesterday we announced the whole Dataplane thing, we were super excited about it. And now to get IBM to say, we'll get in our technologies and our IP, big data, whether it's big Quality or big Insights or big SEQUEL, and the word has been phenomenal. >> Well the Dataplane announcement, finally people who know me know that I hate the term data lake. I always said it's always been a data ocean. So I get redemption because now the data lakes, now it's admitting it's a horrible name but just saying stitching together the data lakes, Which is essentially a data ocean. Data lakes are out there and you can form these data lakes, or data sets, batch, whatever, but connecting them and integrating them is a huge issue, especially with security. >> And a lot of it is, it's also just pragmatism. We start off with this notion of data lake and say, hey, you got too many silos inside the enterprise in one data center, you want to put them together. 
But then increasingly, as Hadoop has become more and more mainstream, and I can't remember the last time I had to explain what Hadoop is to somebody, a couple of things have happened. One is, we talked about streaming data. We see it all the time, especially with HDF. We have customers streaming data from autonomous cars. You have customers streaming from security cameras. You can put a small MiNiFi agent in a security camera or smart phone and stream it all the way back. Then you get into physics; you're up against the laws of physics. If you have a security camera in Japan, why would you want to move the data all the way to California to process it? You'd rather do it right there, right? So this notion of a regional data center becomes really important.
>> And that talks to the edge as well.
>> Exactly, right. So you want to have something in Japan that collects all of the security cameras in Tokyo, and you do the analysis there and push what you want back here, right. So that's physics. The other thing we are increasingly seeing is, with data sovereignty rules, especially things like GDPR, there are now regulatory reasons why data has to naturally stay in different regions. Customer data from Germany cannot move to France, or vice versa, right.
>> Data governance is a huge issue, and this is the problem I have with data governance: I am really looking for a solution, so if you can illuminate this it would be great. So there is going to be an Equifax out there again.
>> Oh, for sure.
>> And the problem is, is that going to force some regulation change? What we see, and I see it personally, is that you can almost see that something else will happen that'll force some policy regulation or governance. You don't want to screw up your data. You also don't want to rewrite your applications or rewrite your machine learning algorithms. So there's a lot of waste potential in not structuring the data properly. Can you comment on what's the preferred path?
>> Absolutely, and that's why we've been working on things like DataPlane for almost a couple of years now. Which is to say, you have to have data and policies which make sense given a context. And the context is going to change by application, by usage, by compliance, by law. So now, to manage 20, 30, 50, a 100 data lakes, would it be better, not saying lakes, data ponds,
>> [Host] Any data.
>> Any data
>> Any data pool, stream, river, ocean, whatever. (laughs)
>> Jacuzzis. Data jacuzzis, right. So what you want is a holistic fabric; I like the term, you know, Forrester uses, they call it the fabric.
>> Host: Data fabric.
>> Data fabric, right? You want a fabric over these so you can actually control and maintain governance and security centrally, but apply it with context. Last but not least, you want to do this whether it's on-prem or on the cloud, or multi-cloud. So we've been working with a bank. They were based in Germany, but for GDPR they had to stand up something in France now. They had French customers, and for a bunch of new regulatory reasons, they had to stand up something in France. So they had their own data center, and then they had one cloud provider, right, who I won't name. And they were great, things were working well. Now they want to expand a similar offering to customers in Asia. It turns out their favorite cloud vendor was not available in Asia, or not available in a time frame which made sense for the offering.
So they had to go with cloud vendor two. So now, although each of the vendors will do their job in terms of giving you all the security and governance and so on, the fact that you have to manage it three ways, one for on-prem, one for cloud vendor A, and one for cloud vendor B, was really hard, too hard for them. So this notion of a fabric across these things, which is DataPlane, and which, by the way, is based on all the open source technologies we love, like Atlas and Ranger; that is also what IBM is betting on, and what the entire ecosystem is, and it seems like a no-brainer at this point. That was the kind of reason why we foresaw the need for something like DataPlane, and obviously we couldn't be more excited to have something like that in the market today as a net new service that people can use.
>> You get the catalogs, security controls, data integration.
>> Exactly.
>> Then you get the cloud, whatever, pick your cloud scenario, you can do that. Killer architecture, I liked it a lot. I guess the question I have for you personally is, what's driving the product decisions at Hortonworks? And the second part of that question is, how does that change your ecosystem engagement? Because you guys have been very friendly in a partnering sense, and also very good with the ecosystem. How are you guys deciding the product strategies? Does it bubble up from the community? Is there an ivory tower, let's go take that hill?
>> It's both, because what typically happens is, obviously we've been in the community now for a long time. Working publicly now with well over 1,000 customers not only puts a lot of responsibility on our shoulders, but it's also very nice, because it gives us a vantage point which is unique. That's number one. The second one is being in the community, where we also see that people are starting to solve the problems. So it's another lens for us. So you have one, the enterprise side, where we see what the enterprises are facing, which is kind of where DataPlane came in, but we also saw in the community where people were starting to ask us, hey, can you do multi-cluster Atlas? Or multi-cluster Ranger? Put two and two together and say, there is a real need.
>> So you get some consensus.
>> You get some consensus, and you also see that on the enterprise side. Last but not least is, when we went to friends like IBM and said, hey, we're doing this, this is where we can position it, right, so we can actually bring in IGC, you can bring in BigQuality, and bring in all these types,
>> [Host] So things had clicked with IBM?
>> Exactly.
>> Rob Thomas was thinking the same thing. Bring in the Power Systems and the horsepower.
>> Exactly, yep. We announced something, for example; we have been working with the Power guys and NVIDIA, for deep learning, right. That sort of stuff is what clicks: if you're in the community long enough, if you have the vantage point of the enterprise long enough, it feels like the two of them click. And that's, frankly, my job.
We all cheer for AI but the reality is, everyone knows that's pretty much b.s. except for core machine learning is on the front edge of innovation. So that's cool, but value. [Laughs] Hey I've got the integrate and operationalize my data so that's the big wave that's coming. Comment on the community piece because enterprises now are realizing as open source becomes the dominant source of value for them, they are now really going to the next level. It used to be like the emerging enterprises that knew open source. The guys will volunteer and they may not go deeper in the community. But now more people in the enterprises are in open source communities, they are recruiting from open source communities, and that's impacting their business. What's your advice for someone who's been in the community of open source? Lessons you've learned, what is the best practice, from your standpoint on philosophy, how to build into the community, how to build a community model. >> Yeah, I mean, the end of the day, my best advice is to say look, the community is defined by the people who contribute. So, you get advice if you contribute. Which means, if that's the fundamental truth. Which means you have to get your legal policies and so on to a point that you can actually start to let your employees contribute. That kicks off a flywheel, where you can actually go then recruit the best talent, because the best talent wants to stand out. Github is a resume now. It is not a word doc. If you don't allow them to build that resume they're not going to come by and it's just a fundamental truth. >> It's self governing, it's reality. >> It's reality, exactly. Right and we see that over and over again. It's taken time but it as with things, the flywheel has changed enough. >> A whole new generation's coming online. If you look at the young kids coming in now, it is an amazing environment. You've got TensorFlow, all this cool stuff happening. It's just amazing. >> You, know 20 years ago that wouldn't happen because the Googles of the world won't open source it. Now increasingly, >> The secret's out, open source works. >> Yeah, (laughs) shh. >> Tell everybody. You know they know already but, This is changing some of the how H.R. works and how people collaborate, >> And the policies around it. The legal policies around contribution so, >> Arun, great to see you. Congratulations. It's been fun to watch the Hortonworks journey. I want to appreciate you and Rob Bearden for supporting theCUBE here in BigData NYC. If is wasn't for Hortonworks and Rob Bearden and your support, theCUBE would not be part of the Strata Data, which we are not allowed to broadcast into, for the record. O'Reilly Media does not allow TheCube or our analysts inside their venue. They've excluded us and that's a bummer for them. They're a closed organization. But I want to thank Hortonworks and you guys for supporting us. >> Arun: Likewise. >> We really appreciate it. >> Arun: Thanks for having me back. >> Thanks and shout out to Rob Bearden. Good luck and CPO, it's a fun job, you know, not the pressure. I got a lot of pressure. A whole lot. >> Arun: Alright, thanks. >> More Cube coverage after this short break. (upbeat electronic music)

Published Date : Sep 28 2017

SUMMARY :

the number three tech investment Brought to you by SiliconANGLE Media This is our event that we put on every year. Co-Founder and Chief Product Officer of Hortonworks. Thanks for having me. Boy, what a journey. You guys have been, really the first of the Hadoop players, Absolutely, you know, we've obviously been in this space, at the point of action, if you will, standing on the shoulders before us, you know. And it's been one of the cornerstones Communities are fundamentally built on that you guys have had on the product side and the word has been phenomenal. So I get redemption because now the data lakes, I can't remember the last time I had to explain and you do analysis and push what you want back here, right. so if you can illuminate this it would be great. I see it personally is that, you can almost see that We is to say, you have to have data and policies Any data pool, stream, river, ocean, whatever. I like the term, you know Forrester uses, the fact that you are to manage it three ways, I guess the question I have for you personally is So you have one as the enterprise side, and you also see that on the enterprise side. Bring in the power system and the horsepower. if you have the vantage point of the enterprise long enough, is on the front edge of innovation. and so on to a point that you can actually the flywheel has changed enough. If you look at the young kids coming in now, because the Googles of the world won't open source it. This is changing some of the how H.R. works And the policies around it. and you guys for supporting us. Thanks and shout out to Rob Bearden. More Cube coverage after this short break.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Asia | LOCATION | 0.99+
France | LOCATION | 0.99+
Arun | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
Rob Bearden | PERSON | 0.99+
Germany | LOCATION | 0.99+
Arun Murthy | PERSON | 0.99+
Japan | LOCATION | 0.99+
NVIDIA | ORGANIZATION | 0.99+
Tokyo | LOCATION | 0.99+
2014 | DATE | 0.99+
California | LOCATION | 0.99+
12 | QUANTITY | 0.99+
five | QUANTITY | 0.99+
Frank Quattrone | PERSON | 0.99+
three | QUANTITY | 0.99+
two | QUANTITY | 0.99+
Onyara | ORGANIZATION | 0.99+
$64 million | QUANTITY | 0.99+
Microsoft | ORGANIZATION | 0.99+
San Jose | LOCATION | 0.99+
O'Reilly Media | ORGANIZATION | 0.99+
each | QUANTITY | 0.99+
Morgan Stanley | ORGANIZATION | 0.99+
Linux Foundation | ORGANIZATION | 0.99+
One | QUANTITY | 0.99+
fifth year | QUANTITY | 0.99+
Atlas | ORGANIZATION | 0.99+
20 | QUANTITY | 0.99+
one | QUANTITY | 0.99+
Rob Thomas | PERSON | 0.99+
three days | QUANTITY | 0.99+
eighth year | QUANTITY | 0.99+
yesterday | DATE | 0.99+
SiliconANGLE Media | ORGANIZATION | 0.99+
six years | QUANTITY | 0.99+
Equifax | ORGANIZATION | 0.99+
next year | DATE | 0.99+
NYC | LOCATION | 0.99+
Hortonworks | ORGANIZATION | 0.99+
second part | QUANTITY | 0.99+
both | QUANTITY | 0.99+
Ranger | ORGANIZATION | 0.99+
50 | QUANTITY | 0.98+
30 | QUANTITY | 0.98+
Yahoo | ORGANIZATION | 0.98+
Strata Conference | EVENT | 0.98+
DataWorks Summit | EVENT | 0.98+
Hadoop | TITLE | 0.98+
'15 | DATE | 0.97+
20 years ago | DATE | 0.97+
Forrester | ORGANIZATION | 0.97+
GDPR | TITLE | 0.97+
second one | QUANTITY | 0.97+
one data center | QUANTITY | 0.97+
Github | ORGANIZATION | 0.96+
about 12 years | QUANTITY | 0.96+
three ways | QUANTITY | 0.96+
Manhattan | LOCATION | 0.95+
day two | QUANTITY | 0.95+
this week | DATE | 0.95+
NiFi | ORGANIZATION | 0.94+
Dataplane | ORGANIZATION | 0.94+
BigData | ORGANIZATION | 0.94+
Hadoop World | EVENT | 0.93+
billions | QUANTITY | 0.93+

Arun Murthy, Hortonworks | DataWorks Summit 2017


 

>> Announcer: Live from San Jose, in the heart of Silicon Valley, it's theCUBE covering DataWorks Summit 2017. Brought to you by Hortonworks. >> Good morning, welcome to theCUBE. We are live at day 2 of the DataWorks Summit, and have had a great day so far, yesterday and today. I'm Lisa Martin with my co-host George Gilbert. George and I are very excited to be joined by a multi-time CUBE alumnus, the co-founder and VP of Engineering at Hortonworks, Arun Murthy. Hey, Arun. >> Thanks for having me, it's good to be back. >> Great to have you back. So yesterday, great energy at the event. You could see and hear behind us, great energy this morning. One of the things that was really interesting yesterday, besides the IBM announcement, and we'll dig into that, was that we had your CEO on, as well as Rob Thomas from IBM, and Rob said, you know, one of the interesting things over the last five years was that there have been only 10 companies that have beaten the S&P 500, outperforming it in each of the last five years, and those companies have made big bets on data science and machine learning. And as we heard yesterday, these four mega-trends: IoT, cloud, streaming analytics, and now the fourth big leg, data science. Talk to us about what Hortonworks is doing, you've been here from the beginning, as a co-founder I've mentioned, you've been with Hadoop since it was a little baby. How is Hortonworks evolving to become one of those big users making big bets on helping your customers, and yourselves, leverage machine learning to really drive the business forward? >> Absolutely, a great question. So, you know, if you look at some of the history of Hadoop, it started off with this notion of a data lake, and then, I'm talking about the enterprise side of Hadoop, right? I've been working on Hadoop for about 12 years now, you know, the last six of them have been as a vendor selling Hadoop to enterprises. They started off with this notion of a data lake, and as people have adopted that vision of a data lake, you know, you bring all the data in, and now you're starting to get governance and security, and all of that. Obviously, one of the best ways to get value out of the data is the notion of, you know, can you, sort of, predict what is going to happen in your world, with your customers, and, you know, whatever it is with the data that you already have. So that notion of, you know, Rob, our CEO, talks about how we're trying to move from a post-transactional world to a pre-transactional world, and doing the analytics and data science is obviously key to that. We could talk about, and there's so many applications of it, something as simple as, you know, we did a demo last year of, you know, of how we're working with a freight company, and we're starting to show them, you know, predict which drivers and which routes are going to have issues, as they're trying to move, alright? Four years ago we did the same demo, and we would show that this driver had an issue on this route, but now, we can actually predict ahead of time and let you know to take preventive measures up front. Similarly internally, you know, you can take things from, you know, machine learning, and log analytics, and so on; we have an internal problem, you know, where we have to test two different versions of HDP itself, and as you can imagine, it's a really, really hard problem.
We have to support 10 operating systems, seven databases; like, if you multiply that matrix, it's, you know, tens of thousands of options. So, if you do all that testing, we now use machine learning internally to look through the logs, and kind of predict where the failures were, and help our own, sort of, software engineers understand where the problems were, right? An extension of that has been, you know, the work we've done in Smartsense, which is a service we offer our enterprise customers. We collect logs from their Hadoop clusters, and then we can actually help them understand where they can either tune their applications, or even tune their hardware, right? They might have, you know, we have this example I really like where, at a really large enterprise Financial Services client, they had literally, you know, hundreds, you know, thousands of machines on HDP, and we, using Smartsense, actually found that there were 25 machines which had bad NIC configuration, and we proved to them that by fixing those, we got them 30% throughput back on their cluster. At that scale, it's a lot of money, it's a lot of CapEx, it's a lot of OpEx. So, as a company, we try it ourselves as much as we, kind of, try to help our customers adopt it, does that make sense? >> Yeah, let's drill down on that even a little more, cause it's pretty easy to understand what's the standard telemetry you would want out of hardware, but as you, sort of, move up the stack the metrics, I guess, become more custom. So how do you learn, not just from one customer, but from many customers, especially when you can't standardize what you're supposed to pull out of them? >> Yeah so, we're sort of really big believers in, sort of, dogfooding our own stuff, right? So, we talk about the notion of a data lake; we actually run a Smartsense data lake where we actually get data across, you know, the hundreds of our customers, and we can actually do predictive machine learning on that data in our own data lake. Right? And to your point about how we go up the stack, this is, kind of, where we feel like we have a natural advantage because we work on all the layers, whether it's the SQL engine, or the storage engine, or, you know, above and beyond the hardware. So, as we build these models, we understand that we need more, or different, telemetry, right? And we put that back into the product so the next version of HDP will have the metrics that we wanted. And, now we've been doing this for a couple of years, which means we've done three, four, five turns of the crank, obviously something we always get better at, but I feel like, compared to where we were a couple of years ago when Smartsense first came out, it's actually matured quite a lot, from that perspective.
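To make the logs-to-predictions idea concrete, here is a minimal sketch of the approach, not Hortonworks's actual SmartSense code; the table path, the feature names, and the 0/1 "failed" label are all hypothetical stand-ins:

    # Hypothetical sketch: train a classifier on cluster-telemetry features
    # to flag configurations likely to fail. Not actual SmartSense code.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("telemetry-failure-model").getOrCreate()

    # One row per test run / cluster snapshot, with numeric-encoded settings
    # and a 0/1 label extracted from that run's logs (all names invented).
    runs = spark.read.parquet("hdfs:///telemetry/test_runs")

    assembler = VectorAssembler(
        inputCols=["os_id", "db_id", "heap_mb", "open_file_limit", "num_nodes"],
        outputCol="features")
    train = assembler.transform(runs).select("features", "failed")

    model = LogisticRegression(labelCol="failed").fit(train)

    # Score the runs and surface the configurations predicted to fail,
    # so engineers look there first instead of combing through every log.
    scored = model.transform(assembler.transform(runs))
    scored.filter(scored.prediction == 1.0).show(10)

The point of the sketch is the shape of the workflow, learn from past failures and score new configurations, rather than any particular model choice.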
So, there's a couple different paths you can add to this, which is customers might want, as part of their big data workloads, some non-Hortonworks, you know, services or software when it's on-prem, and then can you also extend this management to the Cloud if they want a hybrid setup where, in the not too distant future, the Cloud vendor will also be a provider for this type of management. >> So absolutely, in fact it's true today; you know, Microsoft's a great partner of ours. We work with them to enable Smartsense on HDI, which means we can actually get the same telemetry back, whether you're running the data on an on-prem HDP, or you're running this on HDI. Similarly, we shipped a version of our Cloud product, our Hortonworks Data Cloud, on Amazon, and again Smartsense is pre-integrated there, so whether you're on Amazon, or Microsoft, or on-prem, we get the same telemetry, we get the same data back. We can actually, if you're a customer using many of these products, we can actually give you that telemetry back. Similarly, you guys probably know this, you were probably there at the analyst event when we announced the Flex Support subscription, which means that now we can actually take the support subscription you get from Hortonworks, and you can actually use it on-prem or on the Cloud. >> So in terms of transforming, HDP for example, just want to make sure I'm understanding this, you're pulling in data from customers to help evolve the product, and that data can be on-prem, it can be in Microsoft Azure, it can be in AWS? >> Exactly. The HDP can be running in any of these, we will actually pull all of them into our data lake, and we actually do the analytics there and then present it back to the customers. So, in our support subscription, the way this works is we do the analytics in our lake, and it pushes it back, in fact, into our support team tickets, and our Salesforce, and all the support mechanisms. And they get a set of recommendations saying, hey, we know these are the workloads you're running, we see these are the opportunities for you to do better, whether it's tuning the hardware, tuning an application, tuning the software; we sort of send the recommendations back, and the customer can go and say, oh, that makes sense, accept that, and we'll, you know, update the recommendation for you automatically. Or you can say, maybe I don't want to change my kernel parameters, let's have a conversation. And if the customer, you know, is going through with that, then they can go and change it on their own. We do that, sort of, back and forth with the customer. >> One thing that just pops into my mind is, we talked a lot yesterday about data governance, are there particular, and also yesterday on stage were >> Arun: With IBM >> Yes exactly, when we think of, you know, really data-intensive industries, retail, financial services, insurance, healthcare, manufacturing, are there particular industries where you're really leveraging this, kind of, bi-directional, because there's no governance restrictions, or maybe I shouldn't say none, but. Give us a sense of which particular industries are really helping to fuel the evolution of the Hortonworks data lake.
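As a rough sketch of what "the same telemetry, wherever HDP runs" can look like in practice (the bucket, container, and column names here are invented; only the hdfs://, s3a://, and wasb:// schemes are the standard Hadoop storage connectors):

    # Sketch: pull telemetry from on-prem and cloud deployments into one
    # analysis. Paths and field names are made up for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("telemetry-lake").getOrCreate()

    sources = [
        "hdfs://onprem-nn:8020/smartsense/metrics",                        # on-prem HDP
        "s3a://example-telemetry/hdc/metrics",                             # Amazon
        "wasb://telemetry@exampleacct.blob.core.windows.net/hdi/metrics",  # Azure HDInsight
    ]

    frames = [spark.read.parquet(path) for path in sources]
    telemetry = frames[0]
    for frame in frames[1:]:
        telemetry = telemetry.union(frame)  # assumes the same schema everywhere

    # One question asked across every deployment, e.g. average GC pause per
    # cluster, feeding the kind of recommendations described above.
    telemetry.groupBy("cluster_id").avg("gc_pause_ms").show()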
So, I think that's what we're really excited about the portion with IBM, because we feel like the two of us can help a lot of customers, especially in countries where they're significantly, highly regulated, than the United States, to actually get leverage our, sort of, giant portfolio of products. And IBM's been a great company to atlas, they've adopted wholesale as you saw, you know, in the announcements yesterday. >> So, you're doing a Keynote tomorrow, so give us maybe the top three things, you're giving the Keynote on Data Lake 3.0, walk us through the evolution. Data Lakes 1.0, 2.0, 3.0, where you are now, and what folks can expect to hear and see in your Keynote. >> Absolutely. So as we've, kind of, continued to work with customers and we see the maturity model of customers, you know, initially people are staying up a data lake, and then they'd want, you know, sort of security, basic security what it covers, and so on. Now, they want governance, and as we're starting to go to that journey clearly, our customers are pushing us to help them get more value from the data. It's not just about putting the data lake, and obviously managing data with governance, it's also about Can you help us, you know, do mission-learning, Can you help us build other apps, and so on. So, as we look to there's a fundamental evolution that, you know, Hadoop legal system had to go through was with advance of technologies like, you know, a Docker, it's really important first to help the customers bring more than just workloads, which are sort of native to Hadoop. You know, Hadoop started off with MapReduce, obviously Spark's went great, and now we're starting to see technologies like Flink coming, but increasingly, you know, we want to do data science. To mass market data science is obviously, you know, people, like, want to use Spark, but the mass market is still Python, and R, and so on, right? >> Lisa: Non-native, okay. >> Non-native. Which are not really built, you know, these predate Hadoop by a long way, right. So now as we bring these applications in, having technology like Docker is really important, because now we can actually containerize these apps. It's not just about running Spark, you know, running Spark with R, or running Spark with Python, which you can do today. The problem is, in a true multi-tenant governed system, you want, not just R, but you want specifics of a libraries for R, right. And the libraries, you know, George wants might be completely different than what I want. And, you know, you can't do a multi-tenant system where you install both of them simultaneously. So Docker is a really elegant solution to problems like those. So now we can actually bring those technologies into a Docker container, so George's Docker containers will not, you know, conflict with mine. And you can actually go to the races, you know after the races, we're doing data signs. Which is really key for technologies like DSX, right? Because with DSX if you see, obviously DSX supports Spark with technologies like, you know, Zeppelin which is a front-end, but they also have Jupiter, which is going to work the mass market users for Python and R, right? So we want to make sure there's no friction whether it's, sort of, the guys using Spark, or the guys using R, and equally importantly DSX, you know, in the short map will also support things like, you know, the classic IBM portfolio, SBSS and so on. 
So bringing all of those things in together, making sure they run with the data in the data lake, and also the compute in the data lake, is really big for us. >> Wow, so it sounds like your Keynote's going to be very educational for the folks that are attending tomorrow, so last question for you. One of the themes that occurred in the Keynote this morning was sharing a fun fact about the speakers. What's a fun fact about Arun Murthy? >> Great question. I guess, you know, people have been looking for folks with, you know, 10 years of experience on Hadoop. I'm here finally, right? There's not a lot of people, but, you know, it's fun to be one of those people who've worked on this for about 10 years. Obviously, I look forward to working on this for another 10 or 15 more, but it's been an amazing journey. >> Excellent. Well, we thank you again for sharing time with us on theCUBE. You've been watching theCUBE live on day 2 of the DataWorks Summit, hashtag DWS17, for my co-host George Gilbert. I am Lisa Martin, stick around, we've got great content coming your way.

Published Date : Jun 14 2017


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
George Gilbert | PERSON | 0.99+
Lisa Martin | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
Rob | PERSON | 0.99+
Hortonworks | ORGANIZATION | 0.99+
Rob Thomas | PERSON | 0.99+
George | PERSON | 0.99+
Lisa | PERSON | 0.99+
30% | QUANTITY | 0.99+
San Jose | LOCATION | 0.99+
Microsoft | ORGANIZATION | 0.99+
Amazon | ORGANIZATION | 0.99+
25 machines | QUANTITY | 0.99+
10 operating systems | QUANTITY | 0.99+
hundreds | QUANTITY | 0.99+
Arun Murthy | PERSON | 0.99+
Silicon Valley | LOCATION | 0.99+
two | QUANTITY | 0.99+
Aetna | ORGANIZATION | 0.99+
10 years | QUANTITY | 0.99+
Arun | PERSON | 0.99+
today | DATE | 0.99+
Spark | TITLE | 0.99+
yesterday | DATE | 0.99+
AWS | ORGANIZATION | 0.99+
both | QUANTITY | 0.99+
Python | TITLE | 0.99+
last year | DATE | 0.99+
Four years ago | DATE | 0.99+
15 | QUANTITY | 0.99+
tomorrow | DATE | 0.99+
CUBE | ORGANIZATION | 0.99+
three | QUANTITY | 0.99+
DataWorks Summit | EVENT | 0.99+
seven databases | QUANTITY | 0.98+
four | QUANTITY | 0.98+
DataWorks Summit 2017 | EVENT | 0.98+
United States | LOCATION | 0.98+
Dataworks Summit | EVENT | 0.98+
10 | QUANTITY | 0.98+
Europe | LOCATION | 0.97+
10 companies | QUANTITY | 0.97+
One | QUANTITY | 0.97+
one customer | QUANTITY | 0.97+
thousands of machines | QUANTITY | 0.97+
about 10 years | QUANTITY | 0.96+
GDPR | TITLE | 0.96+
Docker | TITLE | 0.96+
Smartsense | ORGANIZATION | 0.96+
about 12 years | QUANTITY | 0.95+
this morning | DATE | 0.95+
each | QUANTITY | 0.95+
two different versions | QUANTITY | 0.95+
five turns | QUANTITY | 0.94+
R | TITLE | 0.93+
four meta-trains | QUANTITY | 0.92+
day 2 | QUANTITY | 0.92+
Data Lakes 1.0 | COMMERCIAL_ITEM | 0.92+
Flink | ORGANIZATION | 0.91+
first | QUANTITY | 0.91+
HDP | ORGANIZATION | 0.91+

Arun Murthy, Hortonworks - Spark Summit East 2017 - #SparkSummit - #theCUBE


 

>> [Announcer] Live, from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, your hosts, Dave Vellante and George Gilbert. >> Welcome back to snowy Boston everybody, this is theCUBE, the leader in live tech coverage. Arun Murthy is here, he's a co-founder and vice president of engineering at Hortonworks, father of YARN, can I call you that, godfather of YARN, is that fair, or? (laughs) Anyway. He's so, so modest. Welcome back to theCUBE, it's great to see you. >> Pleasure to have you. >> Coming off the big keynote, (laughs) you ended the session this morning, so that was great. Glad you made it in to Boston, and uh, lot of talk about security and governance, you know we've been talking about that for years, it feels like it's truly starting to come into the mainstream, Arun, so. >> Well, I think it's just a reflection of what customers are doing with the tech now. You know, three, four years ago, a lot of it was pilots, a lot of it was, you know, people playing with the tech. But increasingly, it's about, you know, people actually applying stuff in production, having data as a system of record, running workloads both on prem and on the cloud; cloud is sort of becoming more and more real at mainstream enterprises. So a lot of it means, as you take any of the examples today, any interesting app will have some sort of real-time data feed, it's probably coming out from a cell phone or sensor, which means that data is actually not, in most cases, coming on prem, it's actually getting collected in a local cloud somewhere, it's just more cost effective; why would you put up 25 data centers if you don't have to, right? So then you've got to connect that data, production data you have or customer data you have or data you might have purchased, and then join them up, run some interesting analytics, do geo-based real-time threat detection, cybersecurity. A lot of it means that you need a common way to secure data, govern it, and that's where we see the action; I think it's a really good sign for the market and for the community that people are pushing on these dimensions, because it means that people are actually using it for real production workloads. >> Well, in the early days of Hadoop you really didn't talk that much about cloud. >> Yeah. >> You know, and now, >> Absolutely. >> It's like, you know, duh, cloud. >> Yeah. >> It's everywhere, and of course the whole hybrid cloud thing comes into play, what are you seeing there, what are things you can do in a hybrid, you know, or on prem that you can't do in a public cloud and what's the dynamic look like? >> Well, it's definitely not an either-or, right? So what we're seeing is, increasingly, interesting apps need data which is born in the cloud and will stay in the cloud, but they also need transactional data which stays on prem, you might have an EDW for example, right? >> Right. >> There's not a lot of, you know, people want to solve business problems and not just move data from one place to another, right? Or back from one place to another, so it's not interesting to move an EDW to the cloud, and similarly it's not interesting to bring your IoT data or sensor data back on-prem, right? Just makes sense. So naturally what happens is, you know, at Hortonworks we talk of a kind of modern app, or a modern data app, which means a modern data app has to span, has to sort of, you know, work across both on-prem data and cloud data.
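To make the span-both-worlds point concrete, here is a hedged sketch (the paths and column names are invented for illustration) of an app joining cloud-born sensor data with on-prem transactional data:

    # Sketch: one job reading from cloud storage and on-prem HDFS,
    # joining the two, in the spirit of the hybrid apps described above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hybrid-join").getOrCreate()

    # Sensor data born in, and collected by, the cloud...
    sensor = spark.read.parquet("s3a://example-iot/landing/")

    # ...joined with transactional customer data that stays on-prem.
    customers = spark.read.parquet("hdfs://onprem-nn:8020/warehouse/customers")

    # e.g. per-customer, per-region event counts for threat detection
    (sensor.join(customers, "customer_id")
           .groupBy("customer_id", "region")
           .count()
           .orderBy("count", ascending=False)
           .show(20))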
Yeah, you talked about that in your keynote years ago. Furio said that the data is the new development kit. And now you're seeing the apps are just so dang rich, >> Exactly, exactly. >> And they have to span >> Absolutely. >> physical locations, >> Yeah. >> But then this whole thing of IoT comes up, we've been having a conversation on theCUBE, last several Cubes, of, okay, how much stays out, how much stays in, there's a lot of debates about that, there's reasons not to bring it in, but you talked today about how some of the important stuff will come back. >> Yeah. >> So the way this is all going to be, you know, is there's a lot of data that should be born in the cloud and stay there, the IoT data, but then what will happen increasingly is, key summaries of the data will move back and forth: key summaries of your EDW will move to the cloud, sometimes key summaries of your IoT data, you know, you want to do some sort of historical training and analytics, will come back on-prem. So I think there's a bi-directional data movement, but it just won't be all the data, right? It'll be key, interesting summaries of the data, but not all of it. >> And a lot of times, people say, well, it doesn't matter where it lives, cloud should be an operating model, not a place where you put data or applications, and while that's true and we would agree with that, from a customer standpoint it matters in terms of performance and latency issues and cost and regulation, >> And security and governance. >> Yeah. >> Absolutely. >> You need to think those things through. >> Exactly, so I mean, so that's what we're focused on, to make sure that you have a common security and governance model regardless of where data is, so you can think of it as infrastructure you own and infrastructure you lease. >> Right. >> Right? Now, the details matter of course; when you go to the cloud you use S3, for example, or ADLS from Microsoft, but you've got to make sure that there's a common sort of security and governance layer on top of it, in front of it. As an example, one of the things that, you know, in the open source community, Ranger's a really sort of key project right now from a security authorization and authentication standpoint. We've done a lot of work with our friends at Microsoft to make sure you can actually now manage data in WASB, which is their object store, natively with Ranger, so you can set a policy that says only Dave can access these files, you know, George can access these columns; that sort of stuff is natively done on the Microsoft platform thanks to the relationship we have with them. >> Right. >> So that's actually really interesting for the open source communities. So you've talked about sort of commodity storage at the bottom layer, and even if they're different sorts of interfaces and implementations, it's still commodity storage, and now what's really helpful to customers is that they have a common security model, >> Exactly. >> Authorization, authentication, >> Authentication, lineage, provenance, >> Oh okay. >> You want to make sure all of these are common across sources. >> But you've mentioned a few of the different data patterns, like the stuff that might be streaming in on the cloud, what, assuming you're not putting it into just a file system or an object store, and you want to sort of merge it with >> Yeah. >> Historical data, so what are some of the data stores other than the file system, in other words, newfangled databases, to manage this sort of interaction?
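Before moving on, a simplified illustration of the kind of policy Ranger centralizes; this is not Ranger's actual policy schema or API, just the shape of the idea, with invented resources and users:

    # Toy policy model: one decision point regardless of where data lives.
    policies = [
        {"resource": "s3a://finance/reports/", "users": {"dave"},
         "access": {"read"}},
        {"resource": "sales_db.orders.email", "users": {"george"},
         "access": {"select"}},  # a column-level grant
    ]

    def allowed(user: str, resource: str, action: str) -> bool:
        # The same check applies whether the resource is on-prem HDFS,
        # WASB, or S3; that uniformity is the whole point.
        return any(resource.startswith(p["resource"])
                   and user in p["users"]
                   and action in p["access"]
                   for p in policies)

    print(allowed("dave", "s3a://finance/reports/q3.csv", "read"))    # True
    print(allowed("george", "s3a://finance/reports/q3.csv", "read"))  # False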
So I think what you're saying is, we certainly have the raw data; the raw data is going to land in whatever cloud-native storage, >> Yeah. >> It's going to be Amazon, WASB, ADLS, Google Storage. But then increasingly you want, so now the patterns change: you have raw data, you have some sort of an ETL process, and what's interesting in the cloud is that even the processed data, if you take the unstructured raw data and structure it, that structured data also needs to live on the cloud platform, right? The reason that's important is because, A, it's cheaper to use the native platform rather than set up your own database on top of it. The other one is you also want to take advantage of all the native services that the cloud platform provides, so for example, linking your application. So automatically, for data in WASB, you know, you can set up a policy and easily say, this structured data table that I have, which is a summary of all the IoT activity in the last 24 hours, you can, using the cloud provider's technologies, actually make it show up easily in Europe, like you don't have to do any work, right? So increasingly what we at Hortonworks have focused a lot on is to make sure that all of the compute engines, whether it's Spark or Hive or, you know, MapReduce, it doesn't really matter, are all natively working on the cloud provider's storage platform. >> [George] Okay. >> Right, so, >> Okay. >> That's a really key consideration for us. >> And the follow-up to that, you know, there's a bit of a misconception that Spark replaces Hadoop, but it actually can be a processing, a compute engine for, >> Yeah. >> That can complement or replace some of the compute engines in Hadoop; help us frame how you talk about it with your customers. >> For us it's really simple: like, in the past, the only option you had on Hadoop to do any computation was MapReduce; that was, I started working on MapReduce 11 years ago, so as you can imagine, it's been a pretty good run for any technology, right? Spark is definitely the interesting sort of engine for, sort of, anything from machine learning to ETL for data on top of Hadoop. But again, what we focus a lot on is to make sure that every time we bring something in, so right now, when we started on HDP, the first HDP had about nine open source projects, literally just nine. Today, the last one we shipped was 2.5; HDP 2.5 had about 27, I think, like it's a huge sort of explosion, right? But the problem with that is not just that we have 27 projects, the problem is that you've got to make sure each of the 27 works with all the 26 others. >> It's a QA nightmare. >> Exactly. So that integration is really key, so same thing with Spark: we want to make sure you have security and YARN (mumbles), like you saw in the demo today, you can now run Spark SQL but also make sure you get low-level (mumbles) masking, all of the enterprise capabilities that you need. And I was at a financial services firm three or four weeks ago in Chicago. Today, to do the equivalent of what I showed in the demo, they have a classic EDW, and they have to maintain anywhere between 1,500 to 2,500 views of the same database; that's a nightmare, as you can imagine. Now the fact that you can do this on the raw data, using whether it's Hive or Spark or Pig or MapReduce, it doesn't really matter, is really key, and that's the thing we push to make sure things like YARN security work across all the stacks, all the open source tech.
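A hedged sketch of the masked-view point: one masking expression over the raw data instead of thousands of hand-maintained views. Table and column names are invented, and where the platform applies such masking dynamically per user via Ranger, this static view only shows the shape:

    # Sketch: a single masked view over raw data in cloud storage,
    # instead of 1,500+ hand-built EDW views.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("masking-demo").getOrCreate()
    spark.read.parquet("s3a://example-bank/accounts/") \
         .createOrReplaceTempView("accounts")

    spark.sql("""
      CREATE OR REPLACE TEMPORARY VIEW accounts_masked AS
      SELECT
        account_id,
        sha2(ssn, 256)                    AS ssn,    -- hash the sensitive column
        concat('***-', substr(phone, -4)) AS phone,  -- partial mask
        balance
      FROM accounts
    """)

    spark.sql("SELECT * FROM accounts_masked LIMIT 5").show()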
So that makes life better, a simplification use case if you will, >> Yeah. >> What are some of the other use cases that you're seeing things like Spark enable? >> Machine learning is a really big one. Increasingly, every product is going to have some, people call it machine learning and AI and deep learning, there's a lot of techniques out there, but the key part is you want to build a predictive model. In the past (mumbles) everybody wanted to build a model and score what's happening in the real world against the model, but it's equally important to make sure the model gets updated as more data comes in, and actually gets better as it scores more data over time. So that's something we see all over; so, for example, even within our own product, it's not just us enabling this for the customer. For example, at Hortonworks we have a product called SmartSense which allows you to optimize how people use Hadoop. What are the opportunities for you to explore deficiencies within your own Hadoop system, whether it's Spark or Hive, right? So we now put machine learning into SmartSense. And show you that customers who are running queries like you are running, Mr. Customer X, other customers like you are tuning Hadoop this way, they're running this sort of config, they're using these sorts of features in Hadoop. That allows us to actually make the product itself better all the way down the pipe. >> So you're improving the scoring algorithm or you're sort of replacing it with something better? >> What we're doing there is just helping them optimize their Hadoop deploys. >> Yep. >> Right? You know, configuration and tuning and kernel settings and network settings, we do that automatically with SmartSense. >> But the customer, you talked about scoring and trying to, >> Yeah. >> They're tuning that, improving that and increasing the probability of its accuracy, or is it? >> It's both. >> Okay. >> So the thing is, what they do is, you initially come with a hypothesis, you have some amount of data, right? I'm a big believer that over time, more data, you're better off spending more on getting more data into the system than on fine-tuning the algorithm further, right? >> Interesting, okay. >> Right, so you know, for example, you know, talk to any of the big guys at Facebook, because they'll do the same; what they'll say is it's much better to spend your time getting 10x the data into the system and improving the model that way, rather than spending 10x the time improving the model itself on day one. >> Yeah, but that's a key choice, because you got to >> Exactly. >> Spend money on doing either, >> One of them. >> And you're saying go for the data. >> Go for the data. >> At least now. >> Yeah, go for data. What happens is, the good part of that is it's not just the model; what you've got to really get right is the entire end-to-end flow. >> Yeah. >> All the way from data aggregation to ingestion to collection to scoring, all those aspects; you're better off sort of walking through the paces, like building the entire end-to-end product, rather than spending time in a silo trying to make a lot of change.
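A toy sketch of the more-data-over-more-tuning point (the path, feature names, and label are invented): the same simple model is fit on a 10% sample and on the full training set, and the full-data version is usually the better investment.

    # Sketch: compare the same model trained on 10% vs. all of the data.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("more-data-demo").getOrCreate()
    events = spark.read.parquet("s3a://example/scored-events/")  # illustrative

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"],
                                outputCol="features")
    data = assembler.transform(events).select("features", "label")
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    evaluator = BinaryClassificationEvaluator()  # area under ROC by default
    small = train.sample(False, 0.1, 42)

    for name, subset in [("10% of data", small), ("all data", train)]:
        model = LogisticRegression().fit(subset)
        auc = evaluator.evaluate(model.transform(test))
        print(name, "AUC:", auc)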
We've talked to a lot of machine learning tool vendors, application vendors, and it seems like we got to the point with Big Data where we put it in a repository, then we started doing better at curating it and understanding it, then started to do a little bit of exploration with business intelligence; but with machine learning, we don't have something that does this end to end, you know, from acquiring the data, to building the model, to operationalizing it. Where are we on that, who should we look to for that? >> It's definitely very early. I mean, if you look at even the EDW space, for example, what is EDW? EDW is ingestion, ETL, and then sort of a fast query layer, OLAP, BI, on and on and on, right? So that's the full EDW flow. I don't think, as a market, we have that end-to-end sort of industrialized design concept yet; it's really early in this space, not only for us but as an overall industry. It's going to take time, but a lot of people are ahead, you know, the Googles of the world are a way ahead; over time a lot of people will catch up. >> We got to go, I wish we had more time, I had so many other questions for you, but I know time is tight in our schedule, so thanks so much, Arun, >> Appreciate it. >> For coming on, appreciate it. Alright, keep right there everybody, we'll be back with our next guest, it's theCUBE, we're live from Spark Summit East in Boston, right back. (upbeat music)

Published Date : Feb 9 2017


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Dave | PERSON | 0.99+
George Gilbert | PERSON | 0.99+
Dave Alante | PERSON | 0.99+
Arun Murthy | PERSON | 0.99+
Europe | LOCATION | 0.99+
Microsoft | ORGANIZATION | 0.99+
10x | QUANTITY | 0.99+
Boston | LOCATION | 0.99+
Chicago | LOCATION | 0.99+
Amazon | ORGANIZATION | 0.99+
George | PERSON | 0.99+
Arun | PERSON | 0.99+
Wasabi | ORGANIZATION | 0.99+
25 data centers | QUANTITY | 0.99+
Today | DATE | 0.99+
Hadoop | TITLE | 0.99+
Wasabi | LOCATION | 0.99+
YARN | ORGANIZATION | 0.99+
Facebook | ORGANIZATION | 0.99+
ADLS | ORGANIZATION | 0.99+
Hortonworks | ORGANIZATION | 0.99+
Horton Works | ORGANIZATION | 0.99+
today | DATE | 0.99+
Data Breaks | ORGANIZATION | 0.99+
1500 | QUANTITY | 0.98+
SmartSense | TITLE | 0.98+
S3 | TITLE | 0.98+
Boston, Massachusetts | LOCATION | 0.98+
One | QUANTITY | 0.98+
27 projects | QUANTITY | 0.98+
three | DATE | 0.98+
Google | ORGANIZATION | 0.98+
Furio | PERSON | 0.98+
Spark | TITLE | 0.98+
2500 views | QUANTITY | 0.98+
first | QUANTITY | 0.97+
Spark Summit East | LOCATION | 0.97+
both | QUANTITY | 0.97+
Spark SQL | TITLE | 0.97+
Google Storage | ORGANIZATION | 0.97+
26 | QUANTITY | 0.96+
Ranger | ORGANIZATION | 0.96+
four weeks ago | DATE | 0.95+
one | QUANTITY | 0.94+
each | QUANTITY | 0.94+
four years ago | DATE | 0.94+
11 years ago | DATE | 0.93+
27 work | QUANTITY | 0.9+
MapReduce | TITLE | 0.89+
Hive | TITLE | 0.89+
this morning | DATE | 0.88+
EDW | TITLE | 0.88+
about nine open source | QUANTITY | 0.88+
day one | QUANTITY | 0.87+
nine | QUANTITY | 0.86+
years | DATE | 0.84+
Olap | TITLE | 0.83+
Cube | ORGANIZATION | 0.81+
a lot of data | QUANTITY | 0.8+

John Kreisa, Hortonworks | DataWorks Summit 2018


 

>> Live from San José, in the heart of Silicon Valley, it's theCUBE! Covering DataWorks Summit 2018. Brought to you by Hortonworks. (electro music) >> Welcome back to theCUBE's live coverage of DataWorks here in sunny San José, California. I'm your host, Rebecca Knight, along with my co-host, James Kobielus. We're joined by John Kreisa. He is the VP of marketing here at Hortonworks. Thanks so much for coming on the show. >> Thank you for having me. >> We've enjoyed watching you on the main stage, it's been a lot of fun. >> Thank you, it's been great. It's been great general sessions, some great talks. Talking about the technology, we've heard from some customers, some third parties, and most recently from Kevin Slavin from The Shed, which was really amazing. >> So I really want to get into this event. You have 2,100 attendees from 23 different countries, 32 different industries. >> Yep. >> This started as a small, >> That's right. >> tiny little thing! >> Didn't Yahoo start it in 2008? >> It did, yeah. >> You changed names a few years ago, but it's still the same event, looming larger and larger. >> Yeah! It's been great, it's gone international as you've said. It's actually the 17th total event that we've done, if you count the ones we've done in Europe and Asia. It's a global community around data, so it's no surprise. The growth has been phenomenal, the energy is great, and the innovations that the community is talking about, that the ecosystem is talking about, are really great. It just continues to evolve as an event, it continues to bring new ideas and share those ideas. >> What are you hearing from customers? What are they buzzing about? Every morning on the main stage, you do different polls that ask, how much are you using machine learning? What portion of your data are you moving to the cloud? What are you learning? >> So it's interesting, because we've done similar polls at our show in Berlin, and the results are very similar. We did the cloud poll, and there's a lot of buzz around cloud. What we're hearing is there are a lot of companies that are thinking about, or are somewhere along, their cloud journey. It's exactly what their overall plans are, and there's a lot of news about how maybe cloud will eat everything, but if you look at the poll results, something like 75% of the attendees said they have cloud in their plans. Only about 12% said they're going to move everything to the cloud, so a lot of hybrid with cloud. It's how to figure out which workloads to run where, how to think about that strategy in terms of where to deploy the data, where to deploy the workloads, and what that should look like, and that's one of the main things that we're hearing and talking a lot about. >> We've been seeing that too; Wikibon, in our recent update to the market forecast, showed that public cloud will dominate increasingly in the coming decade, but hybrid cloud will be a long transition period for many or most enterprises, who are still firmly rooted in on-premises deployments, so forth and so on. Clearly, the bulk of your customers, both of your customer deployments, are on premises. >> They are.
>> Predominantly on premises, but many of them here at this show want to sustain their investment in a vendor that provides them with that flexibility as they decide they want to use Google or Microsoft or AWS or IBM for a particular workload that their existing investment to Hortonworks doesn't prevent them from facilitating. It moves that data and those workloads. >> That's right. The fact that we want to help them do that, a lot of our customers have, I'll call it a multi-cloud strategy. They want to be able to work with an Amazon or a Google or any of the other vendors in the space equally well and have the ability to move workloads around and that's one of the things that we can help them with. >> One of the things you also did yesterday on the main stage, was you talked about this conference in the greater context of the world and what's going on right now. This is happening against the backdrop of the World Cup, and you said that this is really emblematic of data because this is a game, a tournament that generates tons of data. >> A tremendous amount of data. >> It's showing how data can launch new business models, disrupt old ones. Where do you think we're at right now? For someone who's been in this industry for a long time, just lay the scene. >> I think we're still very much at the beginning. Even though the conference has been around for awhile, the technology has been. It's emerging so fast and just evolving so fast that we're still at the beginning of all the transformations. I've been listening to the customer presentations here and all of them are at some point along the journey. Many are really still starting. Even in some of the polls that we had today talked about the fact that they're very much at the beginning of their journey with things like streaming or some of the A.I. machine learning technologies. They're at various stages, so I believe we're really at the beginning of the transformation that we'll see. >> That reminds me of another detail of your product portfolio or your architecture streaming and edge deployments are also in the future for many of your customers who still primarily do analytics on data at rest. You made an investment in a number of technologies NiFi from streaming. There's something called MiNiFi that has been discussed here at this show as an enabler for streaming all the way out to edge devices. What I'm getting at is that's indicative of Arun Murthy, one of your co-founders, has made- it was a very good discussion for us analysts and also here at the show. That is one of many investments you're making is to prepare for a future that will set workloads that will be more predominant in the coming decade. One of the new things I've heard this week that I'd not heard in terms of emphasis from you guys is more of an emphasis on data warehousing as an important use case for HDP in your portfolios, specifically with HIVE. The HIVE 3.0 now in- HDP3.0. >> Yes. >> With the enhancements to HIVE to support more real time and low latency, but also there's ACID capabilities there. I'm hearing something- what you guys are doing is consistent with one of your competitors, Cloudera. They're going deeper into data warehousing too because they recognize they've got to got there like you do to be able to absorb more of your customers' workloads. I think that's important that you guys are making that investment. You're not just big data, you're all data and all data applications. Potentially, if your customers want to go there and engage you. >> Yes. 
I think that was a significant, subtle emphasis that I, as an analyst, noticed. >> Thank you. There were so many enhancements in 3.0 that were brought from the community that it was hard to talk about everything in depth, but you're right. The enhancements to Hive in terms of performance have really enabled it to take on a greater set of workloads and interactivity that we know our customers want. The advantage being that you have a common data layer in the back end, and you can run all this different work. It might be data warehousing, high-speed query workloads, but you can do it on that same data with Spark and data-science-related workloads. Again, it's that common pooled backend of the data lake, and having the ability to do it with common security and governance. It's one of the benefits our customers are telling us they really appreciate. >> One of the things we've also heard this morning was talk about data analytics in terms of brand value and, importantly, brand protection. FedEx, exactly. Talking about, the speaker said, we've all seen these apology commercials. What do you think, is it damage control? What is the customer motivation here? >> Well, a company can have billions of dollars of market cap wiped out by breaches in security, and we've seen it. This is not theoretical, these are actual occurrences that we've seen. Really, they're trying to protect the brand and the business and continue to be viable. They can get knocked back so far that it can take years to recover from the impact. They're looking at the security aspects of it, the governance of their data, regulations like GDPR. These things you've mentioned have real financial impact on the businesses, and I think it's brand and the actual operations and finances of the businesses that can be impacted negatively. >> When you're thinking about Hortonworks's marketing messages going forward, how do you want to be described now, and then how do you want customers to think of you five or 10 years from now? >> I want them to think of us as a partner to help them with their data journey, all aspects of their data journey, whether they're collecting data from the edge, you mentioned NiFi and things like that. Bringing that data back, processing it in motion, as well as processing it at rest, regardless of where that data lands. On premise, in the cloud, somewhere in between, the hybrid, multi-cloud strategy. We really want to be thought of as their partner in their data journey. That's really what we're doing. >> Even going forward, one of the things you were talking about earlier is the company sort of saying, "we want to be boring. We want to help you do all the stuff-" >> There's a lot of money in boring. >> There's a lot of money, right! Exactly! As you said, a partner in their data journey. Is it "we'll do anything and everything"? Are you going to do niche stuff? >> That's a good question. Not everything. We are focused on the data layer. The movement of data, the processing and storage, and truly the analytic applications that can be built on top of the platform. We've stuck to our strategy. It's been very consistent since the beginning of the company, in terms of taking these open source technologies, making them enterprise-viable, developing an ecosystem around it, and fostering a community around it. That's been our strategy since before the company even started. We want to continue to do that, and we will continue to do that.
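A minimal sketch of the common-data-layer point (the table and column names are invented): the same governed table serves both a warehouse-style SQL query and a data-science workload.

    # Sketch: one shared table, two kinds of work on the same data.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("shared-layer-demo")
             .enableHiveSupport()  # reads the same metastore tables Hive uses
             .getOrCreate())

    # A warehouse-style, BI-flavored query...
    spark.sql("""
      SELECT region, SUM(amount) AS revenue
      FROM sales
      GROUP BY region
      ORDER BY revenue DESC
    """).show()

    # ...and a data-science workload on the exact same governed table.
    pdf = spark.table("sales").sample(False, 0.01, 7).toPandas()
    print(pdf.describe())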
There's so much innovation happening in the community that we quickly bring it into the products and make sure it's available in a trusted, enterprise-tested platform. That's really one of the things we see with our customers: over and over again they select us because we bring innovation to them quickly, in a safe and consumable way. >> Before we came on camera, I was telling Rebecca that Hortonworks has done a sensational job of continuing to align your product roadmaps with those of your leading partners. IBM, AWS, Microsoft. In many ways, your primary partners are not them, but the entire open source community: 26 open source projects that Hortonworks has incorporated into its product portfolio and in which you are a primary player and committer. You're a primary ingester of innovation from all the communities in which you operate. >> We are. >> That is your core business model. >> That's right.
For me, I want attendees to- a satisfied attendee would be one that learned about the things they came to learn so that they could go back to achieve the goals that they have when they get back. Whether it's business transformation, technology transformation, some combination of the two. To me, that's what I hope that everyone is taking away and that they want to come back next year when we're in Washington, D.C. and- >> My stomping ground. >> His hometown. >> Easy trip for you. They'll probably send you out here- (laughs) >> Yeah, that's right. >> Well John, it's always fun talking to you. Thank you so much. >> Thank you very much. >> We will have more from theCUBE's live coverage of DataWorks right after this. I'm Rebecca Knight for James Kobielus. (upbeat electro music)

Published Date : Jun 20 2018


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
James Kobielus | PERSON | 0.99+
Rebecca Knight | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
Rebecca | PERSON | 0.99+
Microsoft | ORGANIZATION | 0.99+
Tim Leonard | PERSON | 0.99+
AWS | ORGANIZATION | 0.99+
Arun Murthy | PERSON | 0.99+
Jim | PERSON | 0.99+
Kevin Slavin | PERSON | 0.99+
Europe | LOCATION | 0.99+
John Kreisa | PERSON | 0.99+
Berlin | LOCATION | 0.99+
Amazon | ORGANIZATION | 0.99+
John | PERSON | 0.99+
Google | ORGANIZATION | 0.99+
2008 | DATE | 0.99+
Washington, D.C. | LOCATION | 0.99+
Asia | LOCATION | 0.99+
75% | QUANTITY | 0.99+
Rob | PERSON | 0.99+
five | QUANTITY | 0.99+
San José | LOCATION | 0.99+
next year | DATE | 0.99+
Yahoo | ORGANIZATION | 0.99+
Silicon Valley | LOCATION | 0.99+
32 different industries | QUANTITY | 0.99+
World Cup | EVENT | 0.99+
yesterday | DATE | 0.99+
23 different countries | QUANTITY | 0.99+
one | QUANTITY | 0.99+
1,400 customers | QUANTITY | 0.99+
today | DATE | 0.99+
two | QUANTITY | 0.99+
2,100 attendees | QUANTITY | 0.99+
Fedex | ORGANIZATION | 0.99+
10 years | QUANTITY | 0.99+
26 open source projects | QUANTITY | 0.99+
Hortonworks | ORGANIZATION | 0.98+
17th | QUANTITY | 0.98+
both | QUANTITY | 0.98+
One | QUANTITY | 0.98+
billions of dollars | QUANTITY | 0.98+
Cloudera | ORGANIZATION | 0.97+
about 12% | QUANTITY | 0.97+
theCUBE | ORGANIZATION | 0.97+
this week | DATE | 0.96+
DataWorks Summit 2018 | EVENT | 0.95+
NiFi | ORGANIZATION | 0.91+
this morning | DATE | 0.89+
HIVE 3.0 | OTHER | 0.86+
Spark | TITLE | 0.86+
few year ago | DATE | 0.85+
Wikiban | ORGANIZATION | 0.85+
The Shed | ORGANIZATION | 0.84+
San José, California | LOCATION | 0.84+
tons | QUANTITY | 0.82+
H.D.P | LOCATION | 0.82+
DataWorks | EVENT | 0.81+
things | QUANTITY | 0.78+
DataWorks | ORGANIZATION | 0.74+
MiNiFi | TITLE | 0.62+
data | QUANTITY | 0.61+
Moore | TITLE | 0.6+
years | QUANTITY | 0.59+
coming decade | DATE | 0.59+
Trumble | ORGANIZATION | 0.59+
GVPR | ORGANIZATION | 0.58+
3.0 | OTHER | 0.56+

Kickoff - Spark Summit East 2017 - #sparksummit - #theCUBE


 

>> Narrator: Live from Boston, Massachusetts, this is theCUBE covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Everybody, the euphoria is still palpable here; we're in downtown Boston at the Hynes Convention Center for Spark Summit East, #SparkSummit. My co-host George Gilbert and I will be unpacking what's going on for the next two days. George, it's good to be working with you again. >> Likewise. >> I always like working with my man, George Gilbert. We go deep, George goes deeper. Fantastic action going on here in Boston, actually quite a good crowd here; it was packed this morning in the keynotes. The rage is streaming; everybody's talking about streaming. Let's sort of go back a little bit though, George. When Spark first came onto the scene, you saw these projects coming out of Berkeley; it was the hope of bringing real-timeness to big data, dealing with some of the memory constraints that we found going from batch to real-time interactive, and now streaming, and you're going to talk about that a lot. Then you had IBM come in and put a lot of dough behind Spark, basically giving it a stamp, IBM's imprimatur-- >> George: Yeah. >> Much in the same way it did with Linux-- >> George: Yeah. >> Kind of elbowing its way in-- >> George: Yeah. >> The marketplace and sort of gaining a foothold. Many people at the time thought that Hadoop needed Spark more than Spark needed Hadoop. A lot of people thought that Spark was going to replace Hadoop. Where are we today? What's the state of big data? >> Okay, so to set some context: when Hadoop V1, classic Hadoop, came out, it was a file system, a commodity file system, keep everything really cheap, don't have to worry about shared storage, which is very expensive, and the processing model, the execution engine for munging through the data, was MapReduce. We're all familiar with those-- >> Dave: Complicated but dirt cheap. >> Yes. >> Dave: Relative to a traditional data warehouse. >> Yes. >> Don't buy a big Oracle Unix box or Linux box, buy this new file system, figure out how to make it work, and you'll save a ton of money. >> Yeah, but unlike the traditional RDBMSs, it wasn't really that great for doing interactive business intelligence and things like that. It was really good for big batch jobs that would run overnight, or over periods of hours, things like that. The irony is, when Matei Zaharia, the creator of Spark and co-founder of Databricks, which is the steward of Spark, when he created the language and the execution environment, his objective was to do a better MapReduce than MapReduce: make it faster, take advantage of memory. But he did such a good job of it that he was able to extend it to be a uniform engine, not just for MapReduce-type batch stuff, but for streaming stuff. >> Dave: So originally they started out thinking, if I get this right-- >> Yeah. >> It was sort of a microbatch, leveraging memory more effectively, and then it extended beyond-- >> The microbatch is their current way to address the streaming stuff. >> Dave: Okay. >> It takes MapReduce-style jobs, which would be big, long-running jobs, and slices them up, so each little slice turns into an element in the stream. >> Dave: Okay, so the point is it was an improvement upon these big, long batch jobs-- >> George: Yeah. >> They're taking it from batch to interactive to real-time, so let's go back to big data for a moment here. >> George: Yeah.
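For readers who want to see the microbatch idea in code, here is the canonical Structured Streaming word-count example in PySpark; under the hood, Spark runs it as a series of small batch jobs over the unbounded input:

    # Minimal Structured Streaming sketch of the microbatch idea.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

    # Toy source: lines arriving on a socket (run `nc -lk 9999` to feed it).
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each microbatch updates the running counts, printed to the console.
    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()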
>> Big data was the hottest topic in the world three or four years ago, and now it's sort of waned as a buzzword, but big data is now becoming more mainstream. We've talked about that a lot. A lot of people think it's done. Is big data done? >> George: No, it's more that it's sort of-- it's boring for us pundits to talk about, because it's becoming part of the fabric. The use cases are what's interesting. It started out as a way to collect all data into this really cheap storage repository, and then once you did that, this was the data you couldn't afford to put into your Teradata data warehouse at $25,000 per terabyte, with running costs a multiple of that. You put all your data in here, your data scientists and data engineers started munging the data, and you started taking workloads off your data warehouse, like ETL, things that didn't belong there. Now people are beginning to experiment with business intelligence exploration and reporting on Hadoop, taking more workloads off the data warehouse. There are limitations there that will get solved by putting MPP SQL back ends on it, and we're working on that step, but the one that comes after that is making it easier for data scientists to use this data to create predictive models-- >> Dave: Okay, so I often joke that the ROI on big data was "reduction on investment," lowering the denominator-- >> George: Yeah. >> In the expense equation, and I think it's fair to say that big data and Hadoop succeeded in achieving that. But then the question becomes, what's the real business impact? Clearly big data has not, except in some edge cases, and there are a number of edge cases and examples, lived up to the promise of real time: affecting outcomes, taking the human out of the decision, bringing transactions and analytics together. Now we're hearing a lot of that talk around AI and machine learning, and of course IoT is the next big thing; that's where streaming fits in. Is it the same wine in a new bottle? Or is it sort of the evolution of the data meme? >> George: It's an evolution, but it's not just a technology evolution to make it work. We've been talking about big data as efficiency, like low cost, cost reduction for the existing type of infrastructure, but when it starts going into machine learning, you're doing applications that are more strategic and more top-line focused. That means your C-level execs actually have to get involved, because they have to talk about the strategic objectives, like growth versus profitability, or which markets you want to target first. >> So has Spark been a headwind or a tailwind to Hadoop? >> I think it's very much been a tailwind, because it simplified a lot of things that took many, many engines in Hadoop. That's something that Matei, the creator of Spark, has been talking about for a while. >> Dave: Okay, something I learned today, and actually I had heard this before, but the way I phrased it in my tweet: genomics is kicking Moore's Law's ass. >> George: Yeah. >> The price performance of sequencing a gene improves 3x every year, versus what is essentially a doubling every 18 months for Moore's Law. The amount of data that's being created is just enormous; I think we heard from the Broad Institute that they create 17 terabytes a day-- >> George: Yeah. >> As compared to YouTube, which is 24 terabytes a day. >> And in a few years it will be-- >> It will be dwarfing YouTube. >> Yeah.
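As a quick sanity check on that comparison, here is the back-of-the-envelope arithmetic in Python. The 3x-per-year and 18-month-doubling rates are the ones quoted above; the three-year horizon is an arbitrary choice for illustration.

```python
# Compare genomics price-performance growth (3x per year) against
# Moore's Law (doubling every 18 months) over the same horizon.
years = 3
genomics_gain = 3 ** years          # 3x each year -> 27x in 3 years
moore_gain = 2 ** (years / 1.5)     # one doubling per 18 months -> 4x

print(f"genomics: {genomics_gain}x, Moore's Law: {moore_gain:.0f}x "
      f"over {years} years")
```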
>> Of course Twitter you couldn't even see-- >> Yeah. >> So what do you make of that? Is that just a fun fact, is that a new use case, or is that really where this whole market is headed? >> It's not just a fun fact. For years and years we've been hearing this story about data doubling every 18 to 24 months, but that's coming from the legacy storage guys, who can only double their capacity every 18 to 24 months. The reality is that when we take what was analog data and make it digitally accessible, the only thing preventing us from capturing all this data is the cost to acquire and manage it. The available data is growing much, much faster than 40% every 18 months. >> Dave: So what you're saying is that-- I mean, this industry has marched to the cadence of Moore's Law for decades, and what you're saying is that linear curve is actually reshaping; it's becoming exponential. >> George: For data-- >> Yes. >> George: So the pressure is on for compute, which is now the bottleneck, to get cleverer and cleverer about how to process it-- >> So that says innovation has to come from elsewhere, not just Moore's Law. It's got to come from a combination of-- Thomas Friedman talks a lot about Moore's Law being one of the fundamentals, but there are others. >> George: Right. >> So from a data perspective, what are those combinatorial effects that are going to drive innovation forward? >> George: There was a big meetup for Spark last night, and the focus was this new database called SnappyData that spun out of Pivotal and is being mentored by Paul Maritz, ex-head of development at Microsoft in the 90s and former head of VMware. The interesting thing about this database, and we'll start seeing it in others, is that you don't necessarily want to query and analyze petabytes at once; it will take too long, sort of like munging through data of that size on Hadoop took too long. You can do things that approximate the answer and get it much faster. We're going to see more tricks like that. >> Dave: It's interesting you mention Maritz. I heard a lot of messaging this morning that talked about essentially real-time analysis, being able to make decisions on data you've never seen before and actually affect outcomes. This narrative I first heard from Maritz many, many years ago when they launched Pivotal. He launched Pivotal to be this platform for building big data apps, and now you're seeing Databricks and others sort of usurp that messaging and actually seem to be at the center of that trend. What's going on there? >> I think there are two, what would you call it, two centers of gravity, and our CTO David Floyer talks about this. The edge is becoming more intelligent, because there's a huge bandwidth and latency gap between these smart devices at the edge, whether the smart device is a car or a drone or just a bunch of sensors on a turbine. Those things need to analyze and respond in near real-time or hard real-time, like how to tune themselves, things like that, but they also have to send a lot of data back to the cloud to learn about how these things evolve. In other words, it would be like sending the data to the cloud to figure out how the weather patterns are changing. >> Dave: Mm-hmm. >> That's the analogy. You need them both. >> Dave: Okay. >> So Spark right now is really good in the cloud, but they're doing work so that they can take a lighter-weight version and put it at the edge. We've also seen Amazon put some stuff at the edge, and Azure as well. >> Dave: I want you to comment.
We're going to talk about this later; George and I are going to do a two-part series at this event. We're going to talk about the state of the market, and then we're going to give a glimpse of our big data numbers: our Spark forecast, our streaming forecast-- I mention streaming because we talk about batch, we talk about interactive/real-time, you know, you're at a terminal-- anybody who's as old as I am remembers that. But now you're talking about streaming. Streaming is a new workload type; you call these things continuous apps, like streams of events coming into a call center, for example. >> George: Yeah. >> As one example that you used. Add some color to that. Talk about that new workload type and the role of streaming, and how it potentially fits into IoT. >> Okay, so for the last 60 years, since the birth of digital computing, we've had one of two workloads. They were either batch, which is jobs that ran offline: you put your punch cards in, and sometime later the answer comes out. Or we've had interactive, which originally was green screens and now is PCs and mobile devices. The third one coming up now is continuous, or streaming, data that you act on in near real-time. It's not that those apps will replace the previous ones; it's that you'll have apps that mix continuous processing, batch processing, and interactive. An example today would be all the information about how your applications and data center infrastructure are operating; that's a lot of streams of data that Splunk took first and did very well with-- so that you're looking in real-time and able to figure out if something goes wrong. That type of stuff, all the telemetry from your data center, is a training wheel for the Internet of Things, where you've got lots of stuff out at the edge. >> Dave: It's interesting you mention Splunk. Splunk doesn't actually use the big data term in its marketing, but they actually are big data and they are streaming. They're not talking about it, they're just doing it, but anyway-- Alright George, thanks for that overview. We're going to break now and bring back our first guest, Arun Murthy, co-founder of Hortonworks, so keep it right there everybody. This is theCUBE, we're live from Spark Summit East, #SparkSummit, we'll be right back. (upbeat music)
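To ground George's earlier point about approximating an answer rather than scanning petabytes (the SnappyData discussion above), here is a toy sampling sketch in plain Python. It illustrates the general idea only, not how SnappyData itself works; the dataset, distribution, and sample size are all invented for the example.

```python
# Toy illustration of approximate query answering: estimate an average
# from a small random sample instead of scanning the full dataset.
import random

random.seed(7)
full_table = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

sample = random.sample(full_table, k=1_000)     # scan ~0.1% of the rows

approx_avg = sum(sample) / len(sample)          # fast, approximate answer
exact_avg = sum(full_table) / len(full_table)   # slow, exact answer
print(f"approx={approx_avg:.2f}  exact={exact_avg:.2f}")
```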

Published Date: Feb 8, 2017
