Arpit Joshipura, Linux Foundation | CUBEConversation, May 2019

>> From our studios, in the heart of Silicon Valley, Palo Alto, California, this is a CUBE Conversation. >> Welcome to this CUBE Conversation here in Palo Alto, California. I'm John Furrier, host of theCUBE. We are here with Arpit Joshipura, GM of Networking, Edge, IoT for the Linux Foundation. Arpit, great to see you again, welcome back to theCUBE, thanks for joining us. >> Thank you, thank you. Happy to be here. >> So obviously, we love the Linux Foundation. We've been following all the events; we've chatted in the past about networking. Compute, storage, and networking just don't seem to go away with cloud and on-premise hybrid cloud, multicloud, but open-source software continues to surpass expectations, growth, geographies outside the United States and North America, just overall, just greatness in software. Everything's an abstraction layer now; you've got Kubernetes, Cloud Native- so many good things going on with software, so congratulations. >> Well thank you. No, I think we're excited too. >> So you guys got a big event coming up in China: OSS, Open Source Summit, plus KubeCon. >> Yep. >> A lot of exciting things, I want to talk about that in a second. But I want to get your take on a couple key things. Edge and IoT, deep learning and AI, and networking. I want to kind of drill down with you. Tell us what's the updates on the projects around Linux Foundation. >> Okay. >> The exciting ones. I mean, we know Cloud Native CNCF is going to take up more logos, more members, keeps growing. >> Yep. >> Cloud Native clearly has a lot of opportunity. But the classic in the set, certainly, networking, compute, and storage are still kicking butt. >> Yeah. So, let me start off by Edge. And the fundamental assumption here is that what happened in the cloud and core is going to move to the Edge. And it's going to be 50, 100, 200 times larger in terms of opportunity, applications, spending, et cetera. 
And so what LF did was we announced a very exciting project called Linux Foundation Edge, as an umbrella, earlier in January. And it was announced with over 60 founding members, right. It's the largest founding member announcement we've had in quite some time. And the reason for that is very simple- the project aims at unifying the fragmented edge in IoT markets. So today, edge is completely fragmented. If you talk to clouds, they have a view of edge. Azure, Amazon, Baidu, Tencent, you name it. If you talk to the enterprise, they have a view of what edge needs to be. If you talk to the telcos, they are bringing the telecom stack close to the edge. And then if you talk to the IoT vendors, they have a perception of edge. So each of them are solving the edge problems differently. What LF Edge is doing, is it is unifying a framework and set of frameworks, that allow you to create a common life cycle management framework for edge computing. >> Yeah. >> Now the best part of it is, it's built on five exciting technologies. So people ask, "You know, why now?" So, there are five technologies that are converging at the same time. 5G, low latency. NFV, network function virtualization, so on demand. AI, so predictive analytics for machine learning. Container and microservices app development, so you can really write apps really fast. And then, hardware development: TPU, GPU, NPU. Lots of exciting different size and shapes. All five converging; put it close to the apps, and you have a whole new market. >> This is, first of all, complicated in the sense of... cluttered, fragmented, shifting grounds, so it's an opportunity. >> It's an opportunity. >> So, I get that- fragmented, you've got the clouds, you've got the enterprises, and you've got the telcos all doing their own thing. >> Yep. >> So, multiple technologies exploding. 5G, Wi-Fi 6, a bunch of other things you laid out, >> Mhmm. >> all happening. But also, you have all those suppliers, right? >> Yes. 
>> And, so you have different manufacturers-- >> And different layers. >> So it's multiple dimensions to the complexity. >> Correct, correct. >> What are you guys seeing, in terms of, as a solution, what's motivating the founding members; when you say unifying, what specifically does that mean? >> What that means is, the entire ecosystem from those markets is coming together to solve common problems. And I always sort of joke around, but it's true- the common problems are really the plumbing, right? It's the common life cycle management, how do you start, stop, boot, load, log, you know, things like that. How do you abstract? Now in the Edge, you've got 400, 500 interfaces that come into an IoT or an edge device. You know, Zigbee, Bluetooth, you've got protocols like MQTT; things that are legacy and new. Then you have connectivity to the clouds. Devices of various forms and shapes. So there's a lot of N-by-N problems, as we call it. So, the cloud players. So for LF Edge for example, Tencent and Baidu and the cloud leaders are coming together and saying, "Let's solve it once." The industrial IoT players, like Dianomic and OSIsoft, they're coming in saying, "Let's solve it once." The telcos- AT&T, NTT, they're saying "Let's solve it once. And let's solve this problem in open-source. Because we all don't need to do it, and we'll differentiate on top." And then of course, the classic system vendors that support these markets are all joining hands. >> Talk about the business pressure real quick. I know, you look at, say, Alibaba for instance, and the folks you mentioned, Tencent, in China. They're perfecting the edge. You've got videos at the edge; all kinds of edge devices; people. >> Correct. >> So there's business pressures, as well. >> The business pressure is very simple. The innovation has to speed up. The cost has to go down. And new apps are coming up, so extra revenue, right? 
So because of these five technologies I mentioned, you've got the top killer apps in edge are anything that is, kind of, video but not YouTube. So, anything that the video comes from 360 venues, or drones, things like that. Plus, anything that moves, but that's not a phone. So things like connected cars, vehicles. All of those are edge applications. So in LF Edge, we are defining edge as an application that requires 20 milliseconds or less latency. >> I can't wait for someone to define- software define- "edge". Or, it probably is defined. A great example- I interviewed an R&D engineer at VMware yesterday in San Francisco, it was at the RADIO event- and we were just riffing on 5G, and talking about software at the edge. And one of the advances >> Yes. >> that's coming is splicing the frequency so that you can put software in the radios at the antennas, >> Correct. Yeah. >> so you can essentially provision, in real time. >> Correct, and that's a telco use case, >> Yeah. >> so our projects at the LF Edge are EdgeX Foundry, Akraino, Edge Virtualization Engine, Open Glossary, Home Edge. There's five and growing. And all of these software projects can allow you to put edge blueprints. And blueprints are really reference solutions for smart cities, manufacturing, telcos, industrial gateways, et cetera et cetera. So, lots of-- >> It's kind of your fertile ground for entrepreneurship, too, if you think about it, >> Correct; startups are huge. >> because, just the radio software that splices the radio spectrum is going to potentially maybe enable a service provider market, and towers, right? >> Correct, correct. >> Own my own land, I can own the tower and rent it out, one radio. >> Yep. >> So, business model innovations also an opportunity, >> It's a huge-- >> not just the business pressure to have an edge, but-- >> Correct. So technology, business, and market pressures. All three are colliding. >> Yeah, perfect storm. 
>> So edge is very exciting for us, and we had some new announcements come out in May, and more exciting news to come out in June, as well. >> And so, going back to Linux Foundation. If I want to learn more. >> LFEdge.org. >> That's kind of the CNCF of edge, if you will, right? Kind of thing. >> Yeah. It's an umbrella with all the projects, and that's equivalent to the CNCF, right. >> Yeah. >> And of course it's a huge group. >> So it's kind of momentum. 64 founding members-- >> Huge momentum. Yeah, now we are at 70 founding members, and growing. >> And how long has it been around? >> The umbrella has been around for about five months; some of the projects have been around for a couple of years, as they incubate. >> Well let us know when the events start kicking in. We'll get theCUBE down there to cover it. >> Absolutely. >> Super exciting. Again, multiple dimensions of innovation. Alright, next topic, one of my favorites, is AI and deep learning. AI's great. If you don't have data you can't really make AI work; deep learning requires data. So this is a data conversation. What's going on in the Linux Foundation around AI and deep learning? >> Yeah. So we have a foundation called LF Deep Learning, as you know. It was launched last year, and since then we have significantly moved it forward by adding more members, and obviously the key here is adding more projects, right. So our goal in the LF Deep Learning Foundation is to bring the community of data scientists, researchers, entrepreneurs, academia, and users to collaborate. And create frameworks and platforms that don't require a PhD to use. >> So a lot of data ingestion, managing data, so not a lot of coding, >> Platforms. >> more data analyst, and/or applications? >> It's more, I would say, platforms for use, right? >> Yeah. >> So frameworks that you can actually use to get business outcomes. 
So projects include Acumos, which is a machine learning framework and a marketplace which allows you to, sort of, use a lot of use cases that can be commonly put. And this is across all verticals. But I'll give you a telecom example. For example, there is a use case, which is drones inspecting base stations-- >> Yeah. >> And doing analytics for maintenance. That can be fed into a marketplace, used by other operators worldwide. You don't have to repeat that. And you don't need to understand the details of machine learning algorithms. >> Yeah. >> So we are trying to do that. There are projects that have been contributed from Tencent, Baidu, Uber, et cetera. Angel, Elastic Deep Learning, Pyro. >> Yeah. >> It's a huge investment for us. >> And everybody wins when there's contribution, because data's one of those things where if there's available, it just gets smarter. >> Correct. And if you look at deep learning, and machine learning, right. I mean obviously there's the classic definition; I won't go into that. But from our perspective, we look at data and how you can share the data, and so from an LF perspective, we have something called a CDLA license. So, think of an Apache for data. How do you share data? Because it's a big issue. >> Big deal. >> And we have solved that problem. Then you can say, "Hey, there's all these machine learning algorithms," you know, TensorFlow, and others, right. How can you use it? And have plugins to this framework? Then there's the infrastructure. Where do you run these machine learning? Like if you run it on edge, you can run predictive maintenance before a machine breaks down. If you run it in the core, you can do a lot more, right? So we've done that level of integration. >> So you're treating data like code. You can bring data to the table-- >> And then-- >> Apply some licensing best practices like Apache. >> Yes, and then integrate it with the machine learning, deep learning models, and create platforms and frameworks. 
Whether it's for cloud services, for sharing across clouds, elastic searching-- >> And Amazon does that in terms of they vertically integrate SageMaker, for instance. >> That's exactly right. >> So it's a similar-- >> And this is the open-source version of it. >> Got it- oh, that's awesome. So, how does someone get involved here, obviously developers are going to love this, but-- >> LF Deep Learning is the place to go, under Linux Foundation, similar to LF Edge, and CNCF. >> So it's not just developers. It's also people who have data, who might want to expose it in. >> Data scientists, databases, algorithmists, machine learning, and obviously, a whole bunch of startups. >> A new kind of developer, data developer. >> Right. Exactly. And a lot of verticals, like the security vertical, telecom vertical, enterprise verticals, finance, et cetera. >> You know, I've always said- you and I talked about this before, and I always rant on theCUBE about this- I believe that there's going to be a data development environment where data is code, kind of like what DevOps did with-- >> It's the new currency, yeah. >> It's the new currency. >> Yeah. Alright, so final area I want to chat with you before we get into the OSS China thing: networking. >> Yeah. >> Near and dear to your heart. >> Near and dear to my-- >> Networking's hot now, because if you bring IoT, edge, AI, networking, you've got to move things around-- >> Move things around, (laughs) right, so-- >> And you still need networking. >> So we're in the second year of the LF Networking journey, and we are really excited at the progress that has happened. So, projects like ONAP, OpenDaylight, Tungsten Fabric, OPNFV, FDio, I mean these are now, I wouldn't say household names, but business enterprise names. And if you've seen, pretty much all the telecom providers- almost 70% of the subscribers covered, enabled by the service providers, are now participating. Vendors are completely behind it. 
So we are moving into a phase which is really the deployment phase. And we are starting to see, not just PoCs [Proofs of Concept], but real deployments happening at some of the major carriers now. Very excited, you know, Dublin, ONAP's Dublin release is coming up, OPNFV just released the Hunter release. Lots of exciting work in FD.io, to sort of connect-- >> Yeah. >> multiple projects together. So, we're looking at it, the big news there is the launch of what's called OVP. It's a compliance and verification program that cuts down the deployment time of a VNF by half. >> You know, it's interesting, Stu and I always talk about this- Stu Miniman, CUBE cohost with me- about networking, you know, virtualization came out and it was like, "Oh networking is going to change." It's actually helped networking. >> It helped networking. >> Now you're seeing programmable networks come out, you see Cisco >> And it's helped. >> doing a lot of things, Juniper as well, and you've got containers and Kubernetes right around the corner, so again, this is not going to change the need, it's going to- It's not going to change >> It's just a-- >> the desire and need of networking, it's going to change what networking is. How do you describe that to people? Someone saying, "Yeah, but tell me what's going on in networking? Virtualization, we got through that wave, now I've got the container, Kubernetes, service mesh wave, how does networking change?" >> Yeah, so it's a four-step process, right? The first step, as you rightly said, virtualization, moved into VMs. Then came disaggregation, which was enabled by the technology SDN, as we all know. Then came orchestration, which was last year. And that was enabled by projects like ONAP and automation. So now, all of the networks are automated, fully running, self healing, closed-loop feedback control, all that stuff. And networks have to be automated before 5G and IoT and all of these things hit, because you're no longer talking about phones. 
You're talking about things that get connected, right. So that's where we are today. And that journey continues for another two years, and beyond. But very heavily focused on deployment. And while that's happening, we're looking at the hybrid version of VMs and containers running in the network. How do you make that happen? How do you translate one from the other? So, you know, VNFs, CNFs, everything going at the same time in your network. >> You know what's exciting is with the software abstractions emerging, the hard problems are starting to emerge because as it gets more complicated, N-by-N problems, as you said, there's a lot of new costs and complexities, for instance, the big conversation at the Edge is, you don't want to move data around. >> No, no. >> So you want to move compute to the edge, >> You can, yeah-- >> But it's still a networking problem, you've still got edge, so edge, AI, deep learning, networking all tied together-- >> They're all tied together, right, and this is where Linux Foundation, by developing these projects, in umbrellas, but then allowing working groups to collaborate between these projects, is a very simple governance mechanism we use. So for example, we have edge working groups in Kubernetes that work with LF Edge. We have Hyperledger SIGs that work for telecoms. So LFN and Hyperledger, right? Then we have Automotive Grade Linux, which has connected cars working on the edge. Massive collaboration. But, that's how it is. >> Yeah, you connect the dots but you don't, kind of, force any kind of semantic, or syntax >> No. >> into what people can build. >> Each project is autonomous, >> Yeah. >> and independent, but related. >> Yeah, it's smart. You guys have a good view, I'm a big fan of what you guys are doing. Okay, let's talk about the Open Source Summit and KubeCon, happening in China, the week of the 24th of June. >> Correct. 
>> What's going on, there's a lot of stuff going on beyond Cloud Native and Linux, what are some of the hot areas in China that you guys are going to be talking about? I know you're going over. >> Yeah, so, we're really excited to be there, and this is, again, life beyond Linux and Cloud Native; there's a whole dimension of projects there. Everything from the edge, and the excitement of IoT, cloud edge. We have keynotes from Tencent, and VMware, and all the Chinese- China Mobile and others, that are all focusing on the explosive growth of open-source in China, right. >> Yeah, and they have a lot of use cases; they've been very aggressive on mobility, Netdata, >> Very aggressive on mobility, data, right, and they have been a big contributor to open-source. >> Yeah. >> So all of that is going to happen there. A lot of tracks on AI and deep learning, as a lot more algorithms come out of the Tencents and the Baidus and the Alibabas of the world. So we have tracks there. We have huge tracks on networking, because 5G and implementation of ONAP and network automation is all part of the umbrella. So we're looking at a cross-section of projects in Open Source Summit and KubeCon, all integrated in Shanghai. >> And a lot of use cases are developing, certainly on the edge, in China. >> Correct. >> A lot of cross pollination-- >> Cross pollination. >> A lot of fragmentation has been addressed in China, so they've kind of solved some of those problems. >> Yeah, and I think the good news is, as a global community, which is open-source, whether it's Europe, Asia, China, India, Japan, the developers are coming together very nicely, through a common governance which crosses boundaries. >> Yeah. >> And building on use cases that are relevant to their community. 
>> And what's great about what you guys have done with Linux Foundation is that you're not taking positions on geographies, because let the clouds do that, because clouds have-- >> Clouds have geographies, >> Clouds, yeah they have agents-- >> Edge may have geography, they have regions. >> But software's software. (laughs) >> Software's software, yeah. (laughs) >> Arpit, thanks for coming in. Great insight, loved talking about networking, the deep learning- congratulations- and obviously the IoT Edge is hot, and-- >> Thank you very much, excited to be here. >> Have a good trip to China. Thanks for coming in. >> Thank you, thank you. >> I'm John Furrier here for CUBE Conversation with the Linux Foundation; big event in China, Open Source Summit, and KubeCon in Shanghai, week of June 24th. It's a CUBE Conversation, thanks for watching.

Published Date : May 17 2019


Randy Bias, Juniper Networks | OpenStack Summit 2018

>> Announcer: Live, from Vancouver, Canada it's theCUBE, covering OpenStack Summit North America 2018, brought to you by Red Hat, the OpenStack Foundation, and its ecosystem partners. >> Welcome back, I'm Stu Miniman and my cohost John Troyer and you're watching theCUBE, the worldwide leader in tech coverage. Happy to welcome back to the program longtime friend of theCUBE back from the earliest days, Randy Bias, Vice President with Juniper, Randy, great to see you. >> Absolutely, great to be back with you guys. >> All right, so Randy, we've been talking about, you know, community, and everything's going good and attendance might be down a little bit but how we fit in with containers and Kubernetes, and everything, so we expect you to tear everything up for us and tell us the reality of what's happening in this community. >> I'll do my best (laughing). >> All right, so before we get to the Kubernetes stuff, you're working on, we used to call it OpenContrail? Which you were involved in before Juniper acquired it, went through a rebranding recently, Tungsten, which I was looking up, came from the word heavy stone, give us the update from the networking side. >> Yeah, so the short history is that there was a company called Contrail, and they created a software defined networking controller, it was acquired by Juniper in 2012, 2013, and then that was open sourced, so Juniper for a long time was running with sort of two editions, Contrail which was the commercial offering, and OpenContrail which was the open source, and then shortly after I joined Juniper, identified that, you know, we really needed to go back to the drawing board on the way that we had organized the community, and transition it from being Juniper-led to community led, and so over the past year, I spearheaded that effort, and then that culminated in us announcing at the end of March at ONS that, you know, OpenContrail was now Tungsten Fabric. 
We renamed it, we moved it into the Linux Foundation, under its governance, and now Juniper is one of many people of the community that have a seat at the table for the management, both from a business and technical perspective, and we're moving forward with a new reinvigorated community. >> Yeah, so networking sits at really the intersection of this multi-cloud world that we're living in. There's so many players trying to be there, you know Cisco, really moving to become more of a software company, when I interviewed their number two guy at their show, he's like, when you think of Cisco in the future, we're not even going to be a networking company, we'll be a software company. VMware, of course, pushed heavy through, then the Nicira acquisition, where does Tungsten fit, kind of compare and contrast for us, where it fits among some of these other offerings out there in the marketplace. >> Yeah, I mean, I think most enterprise vendors are in a similar transition from being hardware companies to software companies. We're no different than any of the rest. I think we have a pretty significant advantage in that we have a lot of growth in the cloud sector, so a lot of the large public clouds are our customers and we're selling a tremendous amount of hardware to them, so I think we've got a lot longer runway. But, you know, we just recently hired a CTO, Bikash Koley, out of Google, and we're starting to see some additional folks out of Google, like my new boss, Morgan, and what that's bringing with it is very much a software first type perspective. So Bikash and Morgan really built everything for the Google network from the top of rack all the way out to the WAN and it's almost all software-based, disaggregated, hardware, software, open source software running on top of white boxes, and so that kind of perspective is now really deep, beginning to become embedded in Juniper. And at the head of that is Tungsten. 
So we see Tungsten Fabric as being sort of a tool that we use to create, you know, a global ubiquitous network fabric, that anybody can use anywhere, without talking to Juniper at all, without knowing that Juniper's part of Tungsten, and then as they grow up and they get to a point where they need multi-cloud, they need federation, or they need kind of day two enterprise operations, you know, we have a commercial version and a commercial distribution that they can use. >> Randy, we talked a little bit about OpenContrail and last year, at OpenStack Summit and moving it to a more of a community based governance model, and now that's happened with the Linux Foundation, can you talk a little bit about the role of open source governance, and corporate governance, and then foundations, and just going forward, you know, what's an effective model for 2018 going forward, for a foundation-led project and maybe in the context of Tungsten Fabric, and how is that looking? >> Yeah, so again, OpenContrail's now Tungsten Fabric, might be new for some of the viewers, lot of people still coming to terms with that. And so one of the things that we noticed is that, and when many people go and they say, hey, we want open source first, the AT&T's of this world, part of what they're saying, one of the aspects of being open source versus we want to be one of many around the table, we want to have a seat at the table, we want to have the option to contribute code back, and we want to feel like it's a group effort. And so that was a big factor, right? 
It was an open source project, but it was largely the governance was carried by Juniper, all the testing infrastructure was Juniper, you know, all of the people who made architectural decisions were Juniper, all of the lead contributors were Juniper, and so, going to Linux Foundation was critical to us having a legal framework, for the trademarks, the code, the licenses, the contributor license agreements, are all owned and operated by the Linux Foundation and not by Juniper, so we basically have a trusted third party who can mediate all those things and create a structure, a governance structure where Juniper has one seat at the table, and all the other community members do as well. So it was really key to moving to that model to increase people's interest in the project and to really go the next level. There just wasn't any way to do it without doing this. >> All right, so, Randy, let's talk about OpenStack. You were watching the keynote yesterday, you were, you know, in the Twitter stream, >> Randy: I don't usually watch keynotes, man. >> Stu: But you know this community, so-- >> I do know this community (laughing). >> Give us kind of the good, the bad, and the ugly from your standpoint as to, you know, where we've gone, you know, what's doing well, and what you're frustrated as heck that we still haven't fixed yet. >> Well, I mean, it's great that we have so many inroads amongst the carriers, it's great that, you know, that there's a segment that OpenStack has been able to land in. I mean, at some points when I was feeling particularly pessimistic on some days, I was like, oh man, this thing's never going to go anywhere, so that's great. On the other hand, you know, the promise that we had of sort of being the Linux operating system of the data center, and you know, really gaining inroads into private cloud and enterprise, that just hasn't materialized and I don't see a path to that. 
A lot of that has to do with history, I'm not sure how much of that I want to go into here, but I see those as being bright lights. I see the Kata Containers effort and sort of having this alternative structure that's more or less like the umbrella structure that I lobbied for while I was on the board. So for several years on the board, I said we need to really look more like the Apache Software Foundation, we need to look less like the Linux operating system in terms of how we think about things. Not this big integrated monolithic release, you need more competition between projects and that just wasn't really embraced. And I think that that, in a way, that was one of several things that really kind of limited our ability to capture the market that we really wanted, which is the enterprise market. >> Yeah, well, I know, and one of those sticking points there that I've talked to you many times over the years about is how do I actually deploy this? You know, getting a base configuration and scaling this out, simplicity is tough, getting to those environments, you know, getting it up in two weeks, is good for some environments, but maybe not for others. >> Yeah, I mean I think there's sort of a spectrum, right? At one end of the spectrum, you say hey, I'm going to have a very opinionated approach like Kubernetes does, and we're going to limit what we say we can do, you know, we're not all things to all people. And I think that opinionated approach, like the Linux operating system worked very, very well.
And then other end of the spectrum is we've got no opinion like the Apache Software Foundation, and then it's up to vendors to go and cherry pick the pieces they want and turn that into some kind of commercial offering, whether it's Hortonworks, or Thi-dare or Du-per or whatever it is, the problem is that OpenStack wound up in the middle where it had the sort of integrated monolithic release cycle which it still does, which started to be all things to all people, and it was never as great as it could be, so it's like we got to support Hyper-V, we got to support VMware, and as the laundry list of all things we have to support grew longer, it became more and more difficult to have a compelling, easy to use, easy to scale offering that any enterprise could consume. >> Randy, a lot of talk this week about edge computing, with several different definitions, right? But it does strike me that, you know, there's a certain set of apps, that you write 'em and that they live fine in a big public cloud, and a big data center somewhere. But there's a lot of hardware that's going to be living out in the world, whether that's at the base of a radio tower, or in a wall, or in my shoe, that is going to be running hardware, and is going to be running something, and sometimes that something can be OpenStack, and we're seeing some examples of it, many examples of that already. Is that an area of growth for OpenStack? Is that an interesting part of how this fabric is going to expand? >> Well, I probably have a contrarian view here. So, I spent a bunch of time at Juniper, one of the things I worked on for a while was edge computing and we're still trying to decide what we want to do there and you know, kind of to the first point you made is everybody's edge is different, right? Is it on the mobile phone, is it back in the data center, the difference is that the real estate gets more expensive as you move out, right? 
And it's in terms of latency, and it's in terms of bandwidth and it's also in terms of cost of storage and compute. There's a move closer to the mobile device that becomes progressively more expensive, and so that's why a lot of people sort of look and say hey, wouldn't it be nice if we can get you out the closer lower latency and bandwidth and so on but as we looked at it, a lot of the different use cases it became really interesting in that, it wasn't clear if there was that much value between 5 milliseconds and 20 milliseconds, right? I mean, that's pretty, either one's pretty close, sure there's a lot of difference between 20 and a 100, but maybe not so much between 5 and 20. And so we kind of came to the conclusion that at least for right now, probably, the bulk of use cases are fine with 20 milliseconds, and what that means is that regional systems like AWS's Lambda at the Edge, they're in metro, those are probably good for most cases. I don't know that you need to be on the tower, I don't know that you need to be in the central office, so I think edge computing is still nascent, we don't know exactly what all those use cases are, but I think you might be able to service most of them from regional data centers, and then the question really becomes what does that stack need to be and if you have a regional data center that's got plenty of power, plenty of space, then it might be that OpenStack is a good solution, but if you're trying to scale down onto the tower, I got to have some doubts about whether OpenStack can really scale down that far. >> Randy, analytics is something we've been seeing, the networking people used for many years, at this show, starting to hear a lot of discussion about AI and ML, would love your view point as to what you're seeing in that space. >> You know I have some friends who started off in AI in very early days and he had a very pessimistic view. 
He said, you know this stuff comes and goes, but I'm actually very positive and optimistic about it because the way I look at this is there's a renaissance happening which is that, you know, now ML is really available to masses and you're seeing people do really interesting things like, we have a product called AppFormix, and what they do is they take ML and they apply it to operations and I love this because as an operations guy, you know, I used to have these problems in production where something would go out and the first thing I'd do, is I'm trying to do correlation and then root cause analysis, like, what was the actual failure? Like I can see the symptom on this end and now I have to get all the way back to what caused it, and the reality is that machine learning, AI techniques and protocols can do all the heavy lifting for operators very, very quickly and basically surface a problem for somebody to do the final analysis on. And so I do think that ML and AI apply to very specific vertical problems, it is just a place where we're going to see a tremendous amount of revolution in the next couple years. >> All right, and that hits right at really that intersection between kind of the developers and the operators there-- >> Absolutely. >> What are you seeing from an organizational standpoint, companies you're talking to these days, how are they doing adopting that change, dealing with that, you know, often schism or are they bringing those groups together? >> Well, I think you remember that like in the early days, I used bring my deck along and I would talk about assembly line IT versus the robotics spectrum all of IT and I would sort of make that sort of analogy to sort of the car manufacturing process, and I think what machine learning is really going to do is take us to that next level past that right? 
So we had the assembly line where we have all the specialists, we had the robotics factory where we had people who know how to build robots and software, and it's really sort of like, just churning out with a lot of people on the line, and I think the next level after that is, you know, completely fully automated applications driving themselves, you know, self-driving applications, and I think that's when things get really interesting, and maybe we start to remove the traditional operator out of the equation and it really becomes about empowering developers with tools that are comfortable and that leverage all the cloud era and stuff that we built. >> All right, so Randy, you're credited with the pets versus cattle analogy, what's the latest, you were talking about some of the previous slide decks, what's Randy Bias looking on down the road? >> I mean, the stuff just comes to me, man. I can't like predict, but the thing I've been talking about a lot lately is services as a platform, I think we might've talked about that last time, which is just this notion that if we look at where Amazon's invested and what's interesting, it's certainly not at the infrastructure layer and it's really not at the PaaS layer, it's that thick layer in between with like database as a service and NoSQL as a service, and messaging service, and DNS and so on, where you can kind of cherry pick those things as you're assembling your own PaaS for your application, and I still think that's the area that is under-discussed, and the reason is that people back into basically doing that, building kind of the services as a platform system, but they're not like going into it, kind of like eyes wide open. >> Yeah, so just following up on that last piece, one of the criticisms I have this week is when you talk about multi-cloud, most of the people talk about, oh well people are clawing things back to their data centers.
Juniper plays across the board, strong partnership with Amazon, yet you're here, what are you hearing from customers, you know, what do you see as kind of the balance there and, you know, the public cloud's role in the world? >> I mean, they're still winning, right? I don't think there's any doubt, I haven't seen a decline back here talking about, but we are starting to enter into the era of, okay, this stuff is out there, and it's running, but I need to find my governance model, I need to understand who's using what, I need to understand what it's costing me, and that's the sign of the maturation process. And so I think that, you know, we saw in the early days of cloud, people jumping the gun, creating compliance services, and you know, SaaS products that would basically measure how much you're spending and I think that it's time for that stuff to come back in vogue again, because the tools need to be there for people to manage these extended supply chains of IT vendors which include the public cloud. And I think that the idea that we would claw them back as opposed to like just see that as a holistic part of what we're trying to accomplish doesn't make any sense. >> Well learned. Well, Randy Bias, always a pleasure to catch up with you. >> John. >> For John Troyer, I'm Stu Miniman, getting towards the end of two days of three days of live coverage. Thanks for staying with theCUBE. (bubbly electronic music)

Published Date : May 23 2018

SUMMARY :

brought to you by Red Hat, the Open Stack Foundation, the worldwide leader in tech coverage. and everything, so we expect you to All right, so before we get to the kubernetic stuff, Yeah, so the short history is that Yeah, so networking sits at really the intersection and so that kind of perspective is now really deep, all the testing infrastructure was Juniper, you know, you were, you know, in the Twitter stream, where we've gone, you know, what's doing well, On the other hand, you know, the promise that we had there that I've talked to you many times and as the laundry list of all things we have to support and is going to be running something, kind of to the first point you made is the networking people used for many years, and now I have to get all the way back to what caused it, and that leverage all the cloud era and stuff that we built. and it's really not at the PAS layer, as kind of the balance there and, you know, and you know, SAS products that would basically Well, Randy Bias, always a pleasure to catch up with you. Thanks for staying with the CUBE.

ENTITIES

Entity	Category	Confidence
Amazon	ORGANIZATION	0.99+
Stu Miniman	PERSON	0.99+
John Troyer	PERSON	0.99+
2012	DATE	0.99+
Cisco	ORGANIZATION	0.99+
2018	DATE	0.99+
Linux Foundation	ORGANIZATION	0.99+
Randy	PERSON	0.99+
Randy Bias	PERSON	0.99+
Red Hat	ORGANIZATION	0.99+
Juniper Networks	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
AWS	ORGANIZATION	0.99+
2013	DATE	0.99+
Juniper	ORGANIZATION	0.99+
Apache Software Foundation	ORGANIZATION	0.99+
20 milliseconds	QUANTITY	0.99+
AT&T	ORGANIZATION	0.99+
three days	QUANTITY	0.99+
John	PERSON	0.99+
Vancouver, Canada	LOCATION	0.99+
two days	QUANTITY	0.99+
Open Stack Foundation	ORGANIZATION	0.99+
5 milliseconds	QUANTITY	0.99+
yesterday	DATE	0.99+
Tungsten Fabric	ORGANIZATION	0.99+
last year	DATE	0.99+
Contrail	ORGANIZATION	0.99+
OpenStack Summit	EVENT	0.98+
end of March	DATE	0.98+
Nicira	ORGANIZATION	0.98+
SAS	ORGANIZATION	0.98+
Tungsten	ORGANIZATION	0.98+
20	QUANTITY	0.98+
two editions	QUANTITY	0.98+
Hortonworks	ORGANIZATION	0.98+
Ocata	ORGANIZATION	0.98+
one	QUANTITY	0.98+
Hyper-V	TITLE	0.98+
two weeks	QUANTITY	0.98+
CUBE	ORGANIZATION	0.98+
OpenStack Summit North America 2018	EVENT	0.98+
5	QUANTITY	0.98+
Linux	TITLE	0.97+
this week	DATE	0.97+
100	QUANTITY	0.97+
VMware	ORGANIZATION	0.96+
both	QUANTITY	0.96+
first point	QUANTITY	0.96+
OpenStack	ORGANIZATION	0.95+
Stu	PERSON	0.95+
Thi-dare	ORGANIZATION	0.95+
Vice President	PERSON	0.95+
Bikash Koley	PERSON	0.94+
OpenStack Summit 2018	EVENT	0.94+
first thing	QUANTITY	0.93+
Apache	ORGANIZATION	0.92+

Reynold Xin, Databricks - #Spark Summit - #theCUBE

>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad here with George Gilbert, George. >> Good to be here. >> Thanks for hanging with us. Well here's the other man of the hour here. We just talked with Ali, the CEO at Databricks and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit. There are a lot of interesting things, a lot of interesting people to meet. >> Well I know you're a really humble guy but I had to ask Ali what should I ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time right? >> Yes, I've been contributing to Spark for about five or six years and that's probably the largest number of commits to the project and lately I'm working more with other people to help design the roadmap for both Spark and Databricks with them. >> Well let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments? >> So, I think in general if we look at Spark, there are three directions I would say we're doubling down on. The first direction is deep learning. Deep learning is extremely hot and it's very capable but as we alluded to earlier in a blog post, deep learning has reached sort of a mass produced point in which it shows tremendous potential but the tools are very difficult to use. And we are hoping to democratize deep learning and do what Spark did to big data, to deep learning with this new library called Deep Learning Pipelines.
What it does, it integrates different deep learning libraries directly in Spark and can actually expose models in SQL. So, even the business analysts are capable of leveraging that. So, that's one area, deep learning. The second area is streaming. Streaming, again, I think that a lot of customers have aspirations to actually shorten the latency and increase the throughput in streaming. So, the structured streaming effort is going to be generally available and last month alone on the Databricks platform, I think our customers processed three trillion records, last month alone using structured streaming. And we also have a new effort to actually push down the latency all the way to the sub-millisecond range. So, you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing I think is a very mature area from the outside-of-big-data point of view, but from a big data one it's still pretty new and there's a lot of use cases that are popping up there. And Spark, with approaches like the CBO and also here in the Databricks Runtime with DBIO, we're actually substantially improving the performance and the capabilities of data warehousing features. >> We're going to dig in to some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind maybe about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debugability of the big data jobs. So many of them are fairly technical engineers and some of them are less sophisticated engineers and they have written jobs and sometimes the job runs slow. And so the performance engineer in me would think so how do I make the job run fast? The different way to actually solve that problem is how can we expose the right information so the customer can actually understand and figure it out themselves.
This is why my job is slow and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, Debugability. >> Debugability. >> Reynold: And visibility, yeah. >> Alright, awesome, George. >> So, let's go back and unpack some of those kind of juicy areas that you identified, on deep learning you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster but the really hard part, the compute intensive stuff, was training across a cluster. And so Deeplearning4j and I think Intel's BigDL, they were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are on their native environments? >> Yeah so, this is a very interesting question, obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras and Caffe. What the Deep Learning Pipelines library does, is actually exposes all these single-node deep learning tools, which are highly optimized for say even GPUs or CPUs, to be available as an estimator or like a module in a pipeline of the machine learning pipeline library in Spark. So, now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. So, when you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. You usually have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out what is the best model I can get. This is where actually Spark really shines. When you combine Spark with some deep learning library be it BigDL or be it MXNet, be it TensorFlow, you could be using Spark to distribute that training and then do cross validation on it. So you can actually find the best model very quickly.
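[Editor's sketch] The idea of running many independent experiments and using cross validation to pick the best model can be illustrated without Spark at all. The following is a minimal, hypothetical sketch in plain Python, not the Deep Learning Pipelines API: the model is a toy one-dimensional ridge regression, the function names (`fit_ridge`, `cv_error`, `grid_search`) and the data are invented for illustration, and a local thread pool stands in for the cluster that would run the experiments in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression through the origin: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(lam, xs, ys, k=3):
    """Mean squared error of one hyperparameter setting, averaged over k folds."""
    total = 0.0
    for fold in range(k):
        # Every k-th point is held out; the rest is used for training.
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        held = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = fit_ridge([x for x, _ in train], [y for _, y in train], lam)
        total += sum((y - w * x) ** 2 for x, y in held) / len(held)
    return total / k


def grid_search(lams, xs, ys):
    """Evaluate each setting independently (here on local threads) and keep the best."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        errors = list(pool.map(lambda lam: cv_error(lam, xs, ys), lams))
    return min(zip(errors, lams))[1]


# Toy, noise-free data generated from y = 2x (an assumption made for this example).
xs = [float(i) for i in range(1, 13)]
ys = [2.0 * x for x in xs]
best = grid_search([0.0, 1.0, 10.0, 100.0], xs, ys)
print("best regularization strength:", best)
```

Each call to `cv_error` is independent of the others, which is the property that lets a scheduler spread a hundred such experiments across a cluster.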
And Spark takes care of all the job scheduling, all the fault tolerance properties and how do you read data in from different data sources. >> And without my dropping too much in the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all that now can be managed by Spark? >> In that library, Spark will be able to actually take care of picking the best model out of it. And there are different ways you can design how you define the best. The best could be some average of some different models. The best could be just pick one out of this. The best could be maybe there's a tree of models that you classify it on. >> George: And that's a hyperparameter configuration choice? >> So that is actually built-in functionality in Spark's machine learning pipeline. And now what we're doing is now you can actually plug all those deep learning libraries directly into that as part of the pipeline to be used. Another maybe just to add, >> Yeah, yeah, >> Another really cool functionality of the deep learning pipeline is transfer learning. So as you said, deep learning takes a very long time, it's very computationally demanding. And it takes a lot of resources, expertise to train. But with transfer learning what we allow the customers to do is they can take an existing deep learning model that was trained in a different domain and they can retrain it on a very small amount of data very quickly and they can adapt it to a different domain. That's sort of how the demo on the James Bond car worked. So there is a general image classifier that we train on probably just a few thousand images. And now we can actually detect whether a car is James Bond's car or not. >> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model of a similar situation.
I want to, in the time we have, there's always been this debate about whether Spark should manage state, whether it's database, key value store. Tell us how the thinking about that has evolved and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark which is the catalog of tables that customers can define. But the actual storage sits somewhere else. And I don't think that will change in the near future because we do see that the storage systems have matured significantly in the last few years and I just wrote a blog post last week about the advantage of S3 over HDFS for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of building on top of existing storage systems. There's actually a lot of opportunities for optimization on how you can leverage the specific properties of the underlying storage system to get to maximum performance. For example, how are you doing intelligent caching, how do you start thinking about building indexes actually against the data that's stored for scan workloads. >> With Tungsten, you take advantage of the latest hardware and where we get more memory intensive systems and now that the Catalyst Optimizer has a cost based optimizer or will be, and large memory. Can you change how you go about knowing what data you're managing in the underlying system and therefore, achieve a tremendous acceleration in performance?
>> This is actually one area we invested in the DBIO module as part of Databricks Runtime, and what DBIO does, a lot of this is still in progress, but for example, we're adding some form of indexing capability to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups. Or if the user is doing a scan heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're adding actually indexing functionalities on top of it as part of DBIO. >> And so what would be the application profiles? Is it just for the analytic queries or can you do the point look-ups and updates in that sort of scenario too? >> So it's interesting you're talking about updates. Updates is another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize for both use cases of doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about for example maybe bulk updates or low throughput updates rather than doing transactional updates in which every time you swipe a credit card, some record gets updated. That probably belongs more on the transactional databases like Oracle or MySQL even. >> What about when you think about people who are going to run, they started out with Spark on prem, they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like?
So when you want to move for example, workloads from on prem to the cloud, the one you care the most about is probably actually the data 'cause the compute, it doesn't really matter that much where you run it but data's the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on prem, relying on the caching solution we have that minimizes the data transfer over time. And that is one route I would say is pretty popular. Another one is, with Amazon you can literally use the Snowball functionality. You give them hard drives with trucks, and the trucks will ship your data and put it directly in S3. With IoT, a common pattern we see is a lot of the edge devices would be actually pushing the data directly into some fire hose like Kinesis or Kafka or, I'm sure Google and Microsoft both have their own variants of that. And then you use Spark to directly subscribe to those topics and process them in real time with structured streaming. >> And so would Spark be down, let's say at the site level, if it's not on the device itself?
>> In general Spark came along with two focuses. One is performance, the other one's ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all these tools. I would say, we might have already addressed performance to a degree that I think it's actually pretty usable. The systems are fast enough. Now, we should work on actually making (mumbles) even easier to use. It's what we also focus a lot on at Databricks here. >> David: Democratizing access right? >> Absolutely. >> Alright well Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights and good luck with the rest of the show. >> Thank you very much David and George. >> Thank you all for watching, here we're at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.

Published Date : Jun 7 2017

SUMMARY :

Brought to you by Databricks. I'm David Goad here with George Gilbert, George. Well here's the other man of the hour here. How are you doing? David: Awesome. It's the largest Summit. Reynold is one of the biggest contributors to Spark. and that's probably the most number of the new developments that you want So, the structured streaming effort is going to be And so the performance engineer in me would think kind of juicy areas that you identified, all the tolerance properties and how do you read data of the averaging of what was done out on the cluster. And there are different ways you an design as part of the pipeline to be used. of the deep learning pipeline is transfer learning. Oh, and the implications there are huge, of the underlying storage system and now that the Catalyst Optimizer The storage system is still the same storage system, That's probably more belongs on the transactional databases the one you care the most about if it's not on the device itself? And of course there you don't have a great uplink so I'm going to give you a chance One is performance, the other one's ease of use. Want to appreciate you coming by theCUBE Thank you all for watching here were at theCUBE

ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
Reynold	PERSON	0.99+
Ali	PERSON	0.99+
David	PERSON	0.99+
George	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
David Goad	PERSON	0.99+
Databricks	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
North Pole	LOCATION	0.99+
San Francisco	LOCATION	0.99+
Reynold Xin	PERSON	0.99+
last month	DATE	0.99+
10X	QUANTITY	0.99+
two questions	QUANTITY	0.99+
three trillion records	QUANTITY	0.99+
second area	QUANTITY	0.99+
today	DATE	0.99+
last week	DATE	0.99+
Spark	TITLE	0.99+
Spark Summit 2017	EVENT	0.99+
first direction	QUANTITY	0.99+
One	QUANTITY	0.99+
James Bond	PERSON	0.98+
Spark	ORGANIZATION	0.98+
both	QUANTITY	0.98+
first	QUANTITY	0.98+
one	QUANTITY	0.98+
Tungsten	ORGANIZATION	0.98+
two focuses	QUANTITY	0.97+
three directions	QUANTITY	0.97+
one minute	QUANTITY	0.97+
one area	QUANTITY	0.96+
three	QUANTITY	0.96+
about five	QUANTITY	0.96+
DBIO	ORGANIZATION	0.96+
six years	QUANTITY	0.95+
one thing	QUANTITY	0.94+
over a hundred experiments	QUANTITY	0.94+
Oracle	ORGANIZATION	0.92+
Theano	TITLE	0.92+
single note	QUANTITY	0.91+
Intel	ORGANIZATION	0.91+
one route	QUANTITY	0.89+
theCUBE	ORGANIZATION	0.88+
Office	TITLE	0.87+
TensorFlow	TITLE	0.87+
S3	TITLE	0.87+
MXNet	TITLE	0.85+

Matthew Hunt | Spark Summit 2017

>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, we're talking about data science and engineering at scale, and we're having a great time, aren't we, George? >> We are! >> Well, we have another guest now we're going to talk to, I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg, Matt, thanks for joining us! >> My pleasure. >> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark Summits have you been to? >> Almost all of them, actually, it's quite amazing to see the 10th one, yes. >> And you're pretty actively involved with the user group on the east coast? >> Matt: Yeah, I run the New York users group. >> Alright, well, what's that all about? >> We have some 2,000 people in New York who are interested in finding out what goes on, and which technologies to use, and what are people working on. >> Alright, so hopefully, you saw the keynote this morning with Matei? >> Yes. >> Alright, any comments or reactions from the things that he talked about as priorities? >> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming in advance, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people-- >> Well, the one millisecond today was kind of a wow one-- >> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. >> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down some of the details. >> Sure.
And Bloomberg is a large company with 4,000-plus developers, we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news in the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there's some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess, trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, that's a natural application, another one would be regulatory, there's a regulatory system called MiFID, or MiFID II, the regulations required for Europe, you have to be able to record every trade for seven years, provide daily reports, there's clearly a lot around that, and then I would also just say, our other internal databases have significant analytics that can be done, which is just kind of scraping the surface. >> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far? >> I would definitely say that we have some things that are latency constrained, it tends to be not like high frequency trading, where you care about microseconds, but milliseconds are important, how long does it take to get an answer, but I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always. >> And so when you say coupled, is it because it's a trade-off, or 'cause you need both? >> Right, so it's a little bit of both, for a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often, greater efficiencies. 
Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other. >> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building? >> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one, but part of what's useful is, often when you sit down to try and determine what is the main cause of latency, you have to look at the full profile of a stack of what it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say, a lot of what the Databricks guys in the Spark community have worked on over the years is connected to that, Project Tungsten and so on, well, all these things that make things much slower, much less efficient than they need to be, and we can close that gap a lot, I would say that from the very beginning. >> This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take end-to-end control of continuous apps, for simplicity and performance, and so there's this, we'll write with transactional consistency, so we're assuring the customer of exactly-once semantics when we write to a file system or database or something like that. 
But, Spark has never really done native storage, whereas Matei came here on the show earlier today and said, "Well, Databricks as a company is going to have to do something in that area," and he talks specifically about databases, and he said, he implied that Apache Spark, separate from Databricks, would also have to do more in state management, I don't know if he was saying key value store, but how would that open up a broader class of apps, how would it make your life simpler as a developer? >> Right. Interesting and great question, this is kind of a subject that's near and dear to my own heart, I would say. So part of that, when you take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted to be, which is a form of a universal computation engine. So there's a lot of value, if you can learn one small skillset, but it can work in a wide variety of use cases, whether it's streaming or at rest or analytics, and plug other things in. As always, there's a gap in any such system between theory and reality, and how much can you close that gap, but as for storage systems, this is something that, you and I have talked about this before, and I've written about it a fair amount too, Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need file store, you need a distributed file store, you need a database with generally transactional semantics because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and there's how close can you get to that, versus how much do you have to fit other parts that come together, very interesting question. 
>> So, so far, they've sort of outsourced that to DIY, do-it-yourself, but if they can find a sufficiently scalable relational database, they can do the sort of analytical queries, and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that, like Cassandra would be the, sort of the database that would handle all updates, and then some amount of those would be filtered through to a multi-model DBMS. When I say multi-model, I mean handles transactions and analytics. Knowing that you would have the option to drop that out, what applications would you undertake that you couldn't use right now, where the theme was, we're going to take big data apps into production, and then the competition that they show for streaming is of Kafka and Flink, so what does that do to that competitive balance? >> Right, so how many pieces do you need, and how well do they fit together is maybe the essence of that question, and people ask that all the time, and one of the limits has been, how mature is each piece, how efficient is it, and do they work together? And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options, so in the, Kafka, for example, has a lot of usage, and it seems to really be, the industry seems to be settling on that is what people are using for inbound streaming data, for ingest, I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases. 
So at Bloomberg, we actually developed our own databases for relational databases that were designed for low latency and very high reliability, so we actually just open-sourced that a few weeks ago, it's called ComDB2, and the reason we had to do that was the industry solutions at the time, when we started working on that, were inadequate to our needs, but we look at how long that took to develop for these other systems and think, that's really hard for someone else to get right, and so, if you need a database, which everyone does, how can you make that work better with Spark? And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming and integrating a database directly, although there's interesting possibilities with that too. How do you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take data frames, the core data structure, in Spark, and give it transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributed key-value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems, versus relational databases, which generally started as single image on single machine, and are moving towards federation and distribution, and there's been a lot with that with Postgres, for example. One of the questions would be, is it just knobs, or why don't they work well together? 
And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done, just as we have database query optimizers, why not, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is the very efficient copy of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing, as soon as you hop from one system to another, all of a sudden, you have the semantic computational expense, that's a problem, we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know, where should a query be executed and which operator should I push down, you need something that I think of as a meta-optimizer, and also, knowledge about the shape of the data, or statistics underlying, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable. 
The hardest thing of all, anywhere, in any organization, is to get people working together, but the more people work together to enable these pieces, how do I efficiently work with databases, or have these better optimizations make streaming more mature, the more people can use it in practice, and that's why people develop software, is to actually tackle these real-world problems, so, I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segment, Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE, please stay with us, as Spark Summit 2017 will be back in a few moments.

Published Date : Jun 6 2017

ENTITIES

Entity	Category	Confidence
George	PERSON	0.99+
Matt Hunt	PERSON	0.99+
Bloomberg	ORGANIZATION	0.99+
Matthew Hunt	PERSON	0.99+
Matt	PERSON	0.99+
Matei	PERSON	0.99+
New York	LOCATION	0.99+
San Francisco	LOCATION	0.99+
30 years	QUANTITY	0.99+
seven years	QUANTITY	0.99+
each piece	QUANTITY	0.99+
Databricks	ORGANIZATION	0.99+
one	QUANTITY	0.99+
one millisecond	QUANTITY	0.99+
5,000 skills	QUANTITY	0.99+
both	QUANTITY	0.99+
two	QUANTITY	0.99+
two things	QUANTITY	0.99+
One	QUANTITY	0.99+
Oracle	ORGANIZATION	0.99+
Spark	TITLE	0.98+
Europe	LOCATION	0.98+
Spark Summit 2017	EVENT	0.98+
DB2	TITLE	0.98+
200 different products	QUANTITY	0.98+
Spark Summits	EVENT	0.98+
Spark Summit	EVENT	0.98+
today	DATE	0.98+
one system	QUANTITY	0.97+
next year	DATE	0.97+
4,000-plus developers	QUANTITY	0.97+
first	QUANTITY	0.96+
HBase	ORGANIZATION	0.95+
second	QUANTITY	0.94+
decades	QUANTITY	0.94+
MiFID II	TITLE	0.94+
one topic	QUANTITY	0.92+
this morning	DATE	0.92+
single machine	QUANTITY	0.91+
One of	QUANTITY	0.91+
ComDB2	TITLE	0.9+
few weeks ago	DATE	0.9+
Cassandra	PERSON	0.89+
earlier today	DATE	0.88+
10th one	QUANTITY	0.88+
2,000 people	QUANTITY	0.88+
one thing	QUANTITY	0.87+
Kafka	TITLE	0.87+
single image	QUANTITY	0.87+
MiFID	TITLE	0.85+
Spark	ORGANIZATION	0.81+
Splice Machine	TITLE	0.81+
Project Tungsten	ORGANIZATION	0.78+
theCUBE	ORGANIZATION	0.78+
at least 30 seconds	QUANTITY	0.77+
Cassandra	ORGANIZATION	0.72+
Apache Spark	ORGANIZATION	0.71+
questions	QUANTITY	0.7+
things	QUANTITY	0.69+
Apache Arrow	ORGANIZATION	0.69+
SnappyData	TITLE	0.66+
