Matt Fryer, Hotels.com - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.
>> David: theCUBE is live once again from Spark Summit 2017. I'm David Goad, your host, here with George Gilbert, and we are interviewing many of the speakers we saw on stage this morning at the keynote. Happy to introduce our next guest on the show. His name is Matt Fryer. Matt, how're you doing?
>> Matt: Very well.
>> David: You're the chief, Chief Data Science Officer. I don't see many CDSOs out there, is that a common--
>> Matt: It's a newer title, and it's coming, I think, at companies that feel the use of data, data science, and algorithms is fundamental to their futures. They're creating a mix of commercial, technical, and algorithmic skill sets in one team that executes together, and that's where the title came from. There are more coming, Facebook have a few, for example, but it's a newer title, and I think it's going to become larger and larger as time goes on.
>> David: So, the CDSO for Hotels.com. Something else we learned about you that you may not want me to reveal, but I heard you were the inspiration for Captain Obvious, is that true?
>> Matt: Uh, that's not true. (laughter) I think Captain Obvious is an expression of our brand, and there's an awesome brand team at our office in Dallas. (crosstalk) We all love the captain, he has some good humorous moments, and he keeps us all happy.
>> David: Oh yeah, he states the obvious. We're going to talk about some of the obvious, and maybe some of the not-so-obvious, here in this interview. So let's talk a little bit about company culture, because you talked a lot on stage this morning about a customer-first kind of approach, rather than a "Ooh, look what I can do with the technology." Talk a little bit more about the culture at Hotels.com.
>> Matt: That's important. We're a very data-driven culture, and I think most tech companies, and travel technology companies, have that kind of ethos. But fundamentally, the focus and the reason we exist is the customer. Actually, even more than that, I think it's the people. On the customer side, if we do the right thing by the customer, we fundamentally want you to use our platform time and time again. Whatever need you have in booking lodging and travel, please use our platform. That's the crucial win. To do that, we have to delight you in every experience you have with us. And it's equally about the people, about the team. We have an internal concept called being supportive. The whole point of our team culture is that everybody helps everybody else out. We don't single things out, we're all part of the same team, and we all win if we pull together. That makes it a great place, a fun place to work. We get to play with some new technologies, and tech is important to us, but the people are even more important.
>> David: In part why you love the Spark Summit then, huh? Same kind of spirit here, right?
>> Matt: It's great. I think it's my third Spark Summit, my second time over in San Francisco, and the size of it is very impressive now. I just love meeting other people, learning about some of the things they're up to, how we can apply those back to our business, and hopefully sharing a little bit of what we're up to.
>> David: Let's dive into how you're applying it to your business. You talked about this evolution toward becoming an algorithm business. What does that mean, and what part does Spark play in that?
>> Matt: Think about the journey. Historically, a lot of the opportunity came in constantly building new features, almost a semi arms race of who could build more and more features. The crucial thing going forward, particularly now that over half our traffic comes from people using smartphones, on both the app and mobile web, is to be more targeted in understanding your journey. People are short on time, so speed is much more important, people expect things to be right there when they need them, and relevance matters much more. We need to bring all those things together to offer a much more targeted, much more real-time experience. People expect you to have understood what they did milliseconds ago and to respond to that. The only way you can do that is with data science and algorithms. You balance that against the business operations side: how do you scale? The analogy I use is anomaly detection, which is a crucial feature for enterprises. We used to have a large business intelligence function, lots of reports, pages of paper. Now people have things like Tableau and Power BI; those are great, and you need them to start with. But really, as a business leader, you want to know: "Tell me what's broken, tell me what's changed, because if it's changed, something caused the change. Tell me why it's slowly moving, and most importantly, tell me where the opportunity is." That transforms the conversation, because algorithms can surface that to users. And it's about organic intelligence, not just artificial intelligence: how you bring together the people and the advances in technology to really do a great job for customers.
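Matt's "tell me what's changed" framing maps naturally onto a simple statistical check over daily business metrics. A minimal PySpark sketch of that idea follows; the table and column names (`daily_bookings`, `day`, `bookings`) are hypothetical, and a production system would use far more robust detection than a trailing z-score.

```python
# Minimal anomaly-detection sketch: flag days where a business metric
# deviates sharply from its trailing 28-day mean. Table and column names
# are invented for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("anomaly-sketch").getOrCreate()

daily = spark.table("daily_bookings")  # columns: day (date), bookings (long)

# Trailing 28-day window, excluding the current day.
w = Window.orderBy("day").rowsBetween(-28, -1)

scored = (daily
    .withColumn("mu", F.avg("bookings").over(w))
    .withColumn("sigma", F.stddev("bookings").over(w))
    .withColumn("zscore", (F.col("bookings") - F.col("mu")) / F.col("sigma")))

# "Tell me what's changed": only surface days more than 3 sigma off trend.
anomalies = scored.filter(F.abs(F.col("zscore")) > 3)
anomalies.orderBy("day").show()
```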
>> David: Well, you mentioned AI. You made a big bold claim about AI, and I'm going to ask George to weigh in on this in just a moment. You said AI was going to be the next big thing in the travel industry. Can you explain?
>> Matt: One of the next big things, I think. It's already happening. In fact, our chairman, Mr. Diller, made that statement very recently, backed up by both the CEO and the brand president. If you think about 20 years ago, one of the things both Expedia and Hotels.com, and the online travel space generally, did was democratize price information and make it transparent to users. Previously the power was with the travel agents; that power moved to the user, because now they had the information. That's evolved over time, and what we feel with artificial intelligence, particularly organic intelligence, with enablers like mobile, messaging, and conversations, and machine learning to make it happen, is that you can turn the screen around and empower users again, almost a second revolution. They get the advice and the benefits you had a number of years ago from travel agents. They already had the price transparency; now they get the other part, which is the content, the advice, and what's most relevant to help them. We can listen to what they're saying to us as customers and replay the perfect information back to them, or increasingly perfect as time goes on. (crosstalk)
>> George: That is fascinating, because of the way you broke that out. It wasn't actually only travel; over the last couple of decades, price transparency became an issue for many industries. But what you're saying now is, by giving the content to surprise and delight the customer, as long as you're collecting the data breadcrumbs to help you do that, you're not giving up control, you're actually creating stickiness.
>> Matt: We're empowering, is the language I use. If you empower the user, they're more likely to come back to use your service in the future, and that's really what we want: happy customers.
>> George: Tell us a little bit, at the risk of dropping into the weeds, about how you empower. In other words, how do you know what type of content to serve up, and how do you measure how they engage with it?
>> Matt: It's a great question, and it's quite an embryonic part of the world right now. Have we made some great developments? Yes, but as I said, it's a long journey. Great data science, and this is true across data science and machine learning, is fundamental to having great feedback loops. There are lots of different techniques and tactics for discovering those feedback loops, and customers demand that you use their data to help them. So we need to get faster, and streaming is one way that's becoming feasible. The advances in streaming, and it's great that Databricks is working on that, allow us to feed that loop, taking real-time signals as well as previous signals to figure out what you're trying to do today. Netflix and Amazon were pioneers in this space. If you use the Netflix service, you often go, "How the hell did they know this video was going to be right for me?" What they're actually doing is looking at microsegments. Previously everyone talked about customer segments as these very large groups, and they have their place, but increasingly machine learning allows you to build microsegments. I can discover, from the behavior of others, very relevant things you're going to be interested in, and help inspire you to discover things you didn't even know existed. By filling that gap and using those microsegments, as well as true personalization, I can bring that together to offer you a much more enhanced service.
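Microsegments of the kind Matt describes are often built by clustering users on behavioral features. This is a hedged sketch of that pattern with Spark ML, not Hotels.com's actual pipeline; the table and feature names are invented.

```python
# Sketch of building "microsegments" by clustering users on behavioral
# features. Many small clusters, rather than a handful of broad segments.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("microsegment-sketch").getOrCreate()

users = spark.table("user_features")  # hypothetical: one row per user

assembler = VectorAssembler(
    inputCols=["searches_30d", "bookings_12m", "avg_price", "mobile_share"],
    outputCol="features")
features = assembler.transform(users)

# k in the hundreds: each cluster is a microsegment, not a broad audience.
model = KMeans(k=200, seed=42, featuresCol="features").fit(features)

segmented = model.transform(features)  # adds a "prediction" cluster id
segmented.groupBy("prediction").count().orderBy("count", ascending=False).show(10)
```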
>> George: And so, help make that concrete. Say I want to plan a vacation for the summer, I have my five-and-a-half-inch iPhone, and that's my primary device. In banking, everything has moved from being tied to the checking account to tying every interaction to your mobile device. So what would you show me on my mobile device that would get me really engaged about going to some location?
>> Matt: A lot of it is about where you are in that journey. There are so many different routes customers can take through the buying decision, and it depends on the trip type: whether it's a leisure trip or seeing family and friends, how much knowledge you have about the destination, whether you've been there before. We look for all those signals to try and help inspire. A great example might be: if you've stayed in a hotel from our site before, and you liked that hotel, and you come back and search again, we try to make it easy to continue by putting that hotel at the top. We're trying to make it easy to task-complete. We have a trip planner capability on the home screen, which lets you record and play back some of your previous searches, so you can quickly see and compare where you've been and what's interesting to you. On top of that, we can use the signals. We have a very advanced filter list, which is key, and we're looking at how we do conversations through chatbots, which is the sort of future where we can say, "Hey, here's a list of hotels, built from a mix of the preferences we understand about you and the wider context: where you are in the world, what's going on, what time of day." We take hundreds of different signals to figure out the right list for you, and the great thing is that most people interact with that list and give us more signals about exactly what they wanted. We can hone and hone and repeat, because as I said at the start, the majority of customers will do multiple searches. They want to understand what the market is; they may not be set on one particular place, they may be weighing a set of places instead. We've now moved further up the funnel, investing behind figuring out what destination you're interested in. You may not even know yet, or there might be other destinations you didn't know about that are very relevant for your use case. Particularly if you're going on vacation, we can help you find that hidden gem you may not have known existed. We can also do a much better job of showing you how busy the market is and how fast you should be looking to book: if it's a very compressed, busy market, you need to get in there quickly to lock your price in, and we're now providing that information to help you make a better decision. We can mine all that data to empower you to make smart decisions with smart data.
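Blending "hundreds of different signals" into a ranked list is usually framed as a learning-to-rank problem. As a rough stand-in, here is a pointwise sketch that scores candidate hotels by predicted booking probability; every table, column, and signal name is hypothetical, and `vector_to_array` assumes Spark 3.0 or later.

```python
# Pointwise ranking sketch: train on historical impressions labeled with
# whether the user booked, then sort candidates by predicted probability.
# A toy stand-in for a real learning-to-rank system.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

spark = SparkSession.builder.appName("ranking-sketch").getOrCreate()

impressions = spark.table("search_impressions")  # hypothetical log table

signals = ["price_rank", "review_score", "prior_stays", "distance_km",
           "market_demand"]  # stand-ins for "hundreds of signals"

model = Pipeline(stages=[
    VectorAssembler(inputCols=signals, outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="booked"),
]).fit(impressions)

# Score and rank this user's current candidate set.
ranked = (model.transform(spark.table("current_candidates"))
    .withColumn("p_book", vector_to_array("probability")[1])
    .orderBy(F.desc("p_book")))
ranked.select("hotel_id", "p_book").show(25)
```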
>> David: I want to clarify something I saw in your demonstration this morning. You were talking about detecting the differences between photos and user-generated content. So do you have users actually posting their own photos of the hotel, right next to the photoshopped pictures of the hotel?
>> Matt: We do, yeah.
>> David: What are the ramifications of that?
>> Matt: It's an interesting advancement we've made. In the last year, we've started offering, and asking, users to submit their photos to help other users. One of the crucial things is being authentic. Over the years we've had tens of millions of testimonial reviews, text reviews, and we can see they're crucially important to users and their buying decisions.
>> David: It scares the hotel owners to death though, doesn't it?
>> Matt: I think it does, but the testimony of the customer is key, and one of the key things is that we have verified reviews: to leave a review on our site, you have to have stayed in that hotel. We think that's a crucial step in really being able to say, "These are your customers." In recent times we've taken that product further: within a few hours of you actually arriving at the hotel, we'll ask you what your first impressions were. We'll ask if you want to share that with the hotel owner, to give the owner a chance to rectify any early challenges, so you can have a great stay. What's really, really important is that users and customers have a great stay. That reflects on our Net Promoter Score and their view of us, and we need to close that cycle and make sure we have happy users. So that real-time review is super crucial. If hotels want happy customers as well, it helps them course-correct when there's an issue, and we can step in to help the user if it's a really deep issue. Then with the photos, the key thing is how to navigate and understand what the photo is. The user helps us by tagging it, which is great, but--
>> David: Possibly mistagging it.
>> Matt: Possibly mistagging it on occasion. That's something we've built some skill around, as you've heard. The crucial thing is how to bring these together. If you're on a mobile device, you've got to scan through each photo, and in places around the world with limited bandwidth you have limited time to go through them. So what we're now working on is assessing the quality of those photos. With our customer focus, we want to give the customer the most realistic view of the experience they'll have: the best photos, the most representative of what's actually going to happen, and the most diverse. You don't want to see three photos that are exactly the same. You can swipe left and swipe right, and we're working at the moment on how that display evolves over time, but it's exciting.
>> David: Very exciting, fascinating stuff. Sorry that we're up against a hard break coming here in just a moment, but I wanted to give you 30 seconds to sum up, maybe the next big technical challenge you're looking at that involves Spark, and we'll close with that.
>> Matt: Cool, it's a great question, and I talked a little bit about that in the keynote: the rollout challenge, how to scale models out. There's been great advance in how to stream data into platforms, and Spark is a core part of that. The platforms we've been building, both internally and partnering with Databricks and using their platform, have given us a large boost going forward. But turning those algorithms, that competitive algorithmic advantage, into a live production environment, whether it's marketplaces, ad-tech marketplaces, websites, call centers, or social media, wherever the platform needs to go, that's a hard problem right now. I think it's too hard a problem right now. We're going to invest behind that, and I'd love to see a transformation so that this time next year it's no longer a problem, and is actually an asset.
>> David: Well, I hope I'm not Captain Obvious to say, I know you're up to the challenge. Thank you so much, Matt Fryer, we appreciate you being on the show, thank you for sharing what's going on at Hotels.com. And thank you all for watching theCUBE. We'll be back in a few moments with our next guest, here at Spark Summit 2017. (electronic music) (wind blowing)
Wesley Kerr, Riot Games - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.
>> David: Getting close to the end of the day here at Spark Summit, but we saved the best for last, I think. I'm pretty sure about that. I'm David Goad, your host here on theCUBE, and we now have a data scientist from Riot Games, yes, Riot Games. His name is Wesley Kerr. Wesley, thanks for joining us.
>> Wesley: Thanks for having me.
>> David: What's the best money-making game at Riot Games?
>> Wesley: Well, we only have one game. We're known for League of Legends. It came out in 2009, and it has been growing and well received by our fans since then.
>> David: And what's your role there? It says data scientist, but what do you really do?
>> Wesley: We build models to look at things like in-game behavior. We build models to help players engage with our store and buy our content. We look at different ways we can improve our player experience.
>> David: Alright, well let's talk a little more under the hood here. How are you deploying Spark in the game?
>> Wesley: We rely on Databricks for all of our deployment. We run many different clusters. We have about 14 data scientists that work with us, and each one is able to manage their own clusters: spin 'em up, tear 'em down, find their data that way, and work with it through Databricks.
>> David: So what else will you cover? You had a keynote session this morning, right?
>> Wesley: Yep.
>> David: Give a recap for theCUBE audience of what you talked about.
>> Wesley: We talked about our efforts in player behavior, where we build and deploy models that watch chat between players. We evaluate whether or not players are being unsportsmanlike, and come up with ways to help them curb that behavior and be more sportsmanlike in our game.
>> David: Oh wow, unsportsmanlike. How do you define that? Is it people being abusive?
>> Wesley: What we saw was that about one to two percent of our games have some form of serious abuse, and that comes in the form of hate speech, racism, sexism, things that have no place in the game. We want players to realize that that language is bad and they shouldn't be using it.
>> David: Is it all keyword-driven, or are there other behaviors or things that can indicate?
>> Wesley: Right now it's purely based on things said in chat, but we're currently investigating other ways of measuring that behavior, how it occurs in game, and how it could influence what people are saying.
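Riot hasn't published the models behind this, but a minimal version of "classify chat lines as abusive or not" in Spark ML looks like the sketch below: a generic tokenize, hash, and logistic-regression pipeline. The training rows are invented; in practice the labels would come from human review at much larger scale.

```python
# Minimal text-classification sketch for chat moderation with Spark ML.
# Not Riot's actual model; label 1.0 marks an abusive line.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("chat-moderation-sketch").getOrCreate()

labeled = spark.createDataFrame([
    ("gg well played", 0.0),
    ("nice ward, thanks for the save", 0.0),
    ("report this team, total garbage humans", 1.0),
    ("uninstall the game you worthless feeder", 1.0),
], ["chat_line", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="chat_line", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(labeled)
model.transform(labeled).select("chat_line", "prediction").show(truncate=False)
```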
>> David: Maybe like tweets coming from The White House? (laughing) Okay, so George.
>> Wesley: We should be able to measure that as well.
>> David: So how about those Warriors? (laughing) No, George, did you want to talk a little bit more
>> George: Sure.
>> David: about the technical achievements here?
>> George: When you look at trying to measure engagement, and, it sounds like, converting high engagement to store purchases, tell us a little more about how that works.
>> Wesley: Our game is completely free to play. Players can download it, play it all the way through, and we really try to create a very engaging game that they want to come back to, and then everything they can buy in the store is just cosmetics. So we really hope to build content that our players love and are happy to spend money on. As far as engagement goes, we want it to come from players coming back, playing, and having a good time, and it's less about converting high engagement into monetization, because we've seen that players who are happy and loving the game are happy to spend their money.
>> George: So tell us more about how you build some of these models. Not turning it into Spark code, but how do you analyze it, and what's the database mechanism, given that the storage layer in Spark is just the file system?
>> Wesley: Sure, yeah, absolutely. We are a worldwide game, played by over 100 million players around the world.
>> David: Wow.
>> Wesley: And that data comes flowing in from all around the world into our centralized data warehouse. That warehouse has gameplay data, so we know how you did in game. It also has time-series events, things that occurred in each game. Our game is really session-based: players come play for an hour, that's one game, and then they leave and come back and play again. So we're able to look at those models and how they did. I'll give you an example around our content recommendations. We look at the champions you've been playing recently to predict which champions you're likely to play next. We can just query the database, build our collaborative filtering models on top of it, and then recommend champions you may not play now but may be interested in playing, or we may give you a special discount on a champion if we think it will resonate well with you.
>> George: And in this case, just to be clear, the champions you're talking about are other players, not models?
>> Wesley: It's actually the in-game avatar, the champion that a player plays. We have 130 unique champions, and in each game you choose which champion you want to play, and then that plays out. It's much more like a sport than a game: five v five, online, competitive. There are different objectives on the map, and you work with your team to complete those objectives and beat the other team. We like to think of it like basketball, but with magic and in a virtual world.
>> George: And the teams stay together? Or are they constantly recombining?
>> Wesley: They can disband, yeah. Your next game may find nine other people, but if you're playing with your friends you can keep queuing up with them as well. The champions they control happen to be who you're playing with in that game.
>> George: And when you're trying to anticipate champions that someone might play in the future, what are the variables you're trying to guess, and how long did it take you to build those models?
>> Wesley: Yeah, it's a good question. We're able to leverage the power of our players, and we have 100 million of them. In our game there are roles. For instance, like there's a center in basketball, we have a bot lane: bottom-lane support and bottom-lane ADC. A support character is there to make sure your ADC can defeat the other team. If you play a lot of support, odds are there are other players in the world who play a lot of support too, so we find similar players and see whether they engaged with the same sorts of champions you play. For instance, I'm a Leona main, so I play her a lot. If I look at what other people played in addition to Leona, it could be things like Braum, and so we would recommend Braum as a champion to try out that you've maybe not played yet.
>> David: Okay.
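The collaborative filtering Wesley describes has a direct analogue in Spark ML's ALS recommender. A hedged sketch with invented data: play counts are treated as implicit feedback, the way "players like you also play Braum" would be computed, though Riot's actual features and scale are far richer.

```python
# Collaborative-filtering sketch with Spark ML's ALS: recommend champions
# from per-player play counts. Data is invented; implicitPrefs treats the
# counts as preference signals rather than explicit ratings.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("champion-recs-sketch").getOrCreate()

plays = spark.createDataFrame([
    (1, 101, 40), (1, 102, 12),   # player 1 mains champion 101 (say, Leona)
    (2, 101, 35), (2, 103, 20),   # player 2 overlaps on 101, also plays 103
    (3, 103, 25), (3, 104, 15),
], ["player_id", "champion_id", "games_played"])

als = ALS(userCol="player_id", itemCol="champion_id",
          ratingCol="games_played", implicitPrefs=True,
          rank=16, coldStartStrategy="drop")
model = als.fit(plays)

# Top 3 champions per player, including ones they haven't tried yet.
model.recommendForAllUsers(3).show(truncate=False)
```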
>> George: And what's the data warehouse that you use as the ultimate repository for all this?
>> Wesley: All the data flows into a Hive data warehouse, stored in S3. We have two different ways of interacting with it. One, we can run queries against Hive, which tends to be a bit slower for our use cases. Our data scientists tend to access all that data through Databricks and Spark instead, which runs much quicker for our use cases.
>> George: Do you take what's in S3 and put it into a Parquet format to accelerate?
>> Wesley: Sometimes, so we do some of those rewrites. We do a lot of secondary ETLs, where we're joining across multiple tables and writing back out. We'll optimize those for our Spark use cases: read from S3, do some transformations, write back to S3.
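The secondary-ETL pattern Wesley describes is a common Spark idiom. A sketch under assumed names follows; the Hive tables, join key, and S3 path are all hypothetical.

```python
# Secondary-ETL sketch: read from the Hive warehouse on S3, join, and
# write back as partitioned Parquet so downstream Spark jobs run faster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("etl-sketch")
    .enableHiveSupport()
    .getOrCreate())

games = spark.table("raw.game_results")      # per-game outcomes
events = spark.table("raw.in_game_events")   # time-series events per game

joined = games.join(events, "game_id")

(joined.write
    .mode("overwrite")
    .partitionBy("game_date")                # lets later reads prune by date
    .parquet("s3a://riot-warehouse/derived/game_events_joined/"))
```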
>> George: And how latency-sensitive is this? Are you trying to make decisions as the player moves along, level by level?
>> Wesley: Historically we've been batch. Our recommendations are updated weekly, so we haven't needed a much higher cadence. But we're moving to a point where I want to see us make recommendations on the client, immediately after a game. You've just finished a game with, say, Leona: here's an offer for Braum, go check it out, give it a try in your next game.
>> David: So Wesley, what would you like to see developed that hasn't been developed yet, that would really help in your business specifically?
>> Wesley: One thing that's really exciting for gaming right now is procedural generation and artificial intelligence. There are a lot of opportunities here. You've seen collaborations between DeepMind and Blizzard, where they're learning to play StarCraft. I think there's a similar world for us, where we have a game with different sorts of mechanics: there's a large social piece to our game, and teamwork is required. Understanding how we can leverage that and help influence the future of artificial intelligence is something I want to see us do.
>> David: Did you talk with anybody here at the Spark Summit about that?
>> Wesley: Anyone who would listen. (laughing) We chatted some with the teams at Blizzard and Twitch about some of the things they're doing for natural language as well.
>> David: Alright, so what was the most useful conversation you had here at the summit?
>> Wesley: The most useful one was with the Databricks team. It was kind of serendipitous. At the end of my keynote, I was talking about some work we had done with deep learning, doing hyperparameter searches over our worker nodes, so we could quickly try out many different models. And in the announcement that morning, before my keynote, Tim talked about how they now have deep learning pipelines. It was based on a conversation we had had, so I was very excited to see it come to fruition, and now it's open source and we can leverage it.
>> David: Awesome. Well, we're up against a hard break here.
>> Wesley: Okay.
>> David: We're almost at the end of the day. Wesley, it's been a riot talking to you. We really appreciate it, and thank you for coming on the show and sharing your knowledge.
>> Wesley: You bet, thanks for having me.
>> David: Alright, and that's it, we're going to wrap it up today. We have a wrap-up coming up, as a matter of fact, in just a few minutes. My name is David Goad. You're watching theCUBE at Spark Summit. (upbeat music)
Reynold Xin, Databricks - #Spark Summit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.
>> David: Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad, here with George Gilbert. George.
>> George: Good to be here.
>> David: Thanks for hanging with us. Well, here's the other man of the hour. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you?
>> Reynold: I'm good. How are you doing?
>> David: Awesome. Enjoying yourself here at the show?
>> Reynold: Absolutely, it's fantastic. It's the largest Summit, with a lot of interesting things and a lot of interesting people to meet.
>> David: Well, I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time, right?
>> Reynold: Yes, I've been contributing to Spark for about five or six years, and that's probably the most commits to the project. Lately I'm working more with other people to help design the roadmap for both Spark and Databricks.
>> David: Well, let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments?
>> Reynold: In general, if we look at Spark, there are three directions I'd say we're doubling down on. The first direction is deep learning. Deep learning is extremely hot and very capable, but as we alluded to earlier in a blog post, it has reached sort of a mass-production point, in which it shows tremendous potential but the tools are very difficult to use. We are hoping to democratize deep learning, and do for deep learning what Spark did for big data, with this new library called Deep Learning Pipelines. It integrates different deep learning libraries directly into Spark, and can actually expose models in SQL, so even business analysts are capable of leveraging them. So that's one area, deep learning. The second area is streaming. A lot of customers have aspirations to shorten the latency and increase the throughput of streaming. The structured streaming effort is going to be generally available, and last month alone on the Databricks platform our customers processed three trillion records using structured streaming. We also have a new effort to push the latency all the way down to the millisecond range, so you can do blazingly fast streaming analytics. Last but not least is the SQL and data warehousing area. Data warehousing is a very mature area from the classic point of view, but from a big data point of view it's still pretty new, and a lot of use cases are popping up there. With approaches like the CBO, and also in the Databricks runtime with DBIO, we're substantially improving the performance and the capabilities of data warehousing features.
>> David: We're going to dig into some of those technologies here in just a second with George. But have you heard anything here so far, from anyone, that's changed your mind about what to focus on next?
>> Reynold: One thing I've heard from a few customers is visibility and debuggability of big data jobs.
Many of them are fairly technical engineers, and some are less sophisticated engineers, and they've written jobs, and sometimes a job runs slow. The performance engineer in me would ask: how do I make the job run fast? A different way to solve that problem is: how can we expose the right information so customers can understand and figure it out themselves? "This is why my job is slow, and this is how I can tweak it to make it faster." Rather than giving people the fish, you give them the tools to fish.
>> David: If you can call that bugability.
>> Reynold: Yeah, debuggability.
>> David: Debuggability.
>> Reynold: And visibility, yeah.
>> David: Alright, awesome. George.
>> George: So, let's go back and unpack some of those juicy areas you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. Deeplearning4j, and I think Intel's BigDL, were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are in their native environments?
>> Reynold: Yeah, this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras, and others. What the Deep Learning Pipelines library does is expose all these single-node deep learning tools, highly optimized for GPUs or CPUs, as estimators, like modules in Spark's machine learning pipeline library. So now users can leverage Spark's capability to, for example, do hyperparameter tuning. When you're building a machine learning model, it's fairly rare that you run something once and are done. Usually you have to fiddle with a lot of the parameters; you might run over a hundred experiments to figure out the best model you can get. This is where Spark really shines. When you combine Spark with a deep learning library, be it BigDL, MXNet, or TensorFlow, you can use Spark to distribute that training and then do cross-validation on it, so you can find the best model very quickly. And Spark takes care of all the job scheduling, all the fault-tolerance properties, and how you read data in from different data sources.
>> George: And without my dropping too much into the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying all of that can now be managed by Spark?
>> Reynold: In that library, Spark will be able to take care of picking the best model. And there are different ways you can design what "best" means. The best could be some average of different models; the best could be just picking one of them; the best could be a tree of models that you classify on.
>> George: And that's a hyperparameter configuration choice?
>> Reynold: That is built-in functionality in Spark's machine learning pipeline. And now you can plug all those deep learning libraries directly into it, as part of the pipeline, to be used.
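The hyperparameter search Reynold describes is what Spark's `CrossValidator` does: it fans candidate models out across the cluster and keeps the best one. A small self-contained sketch with synthetic data follows; it uses plain logistic regression for brevity, where a Deep Learning Pipelines estimator could slot into the same place.

```python
# Hyperparameter-search sketch: a 12-model grid evaluated with 3-fold
# cross-validation, with candidates trained in parallel on the cluster.
import random
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("cv-sketch").getOrCreate()

# Synthetic two-class data standing in for a real training set.
rows = [(Vectors.dense([random.random(), random.random() + cls]), float(cls))
        for cls in (0, 1) for _ in range(50)]
train = spark.createDataFrame(rows, ["features", "label"])

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
    .addGrid(lr.regParam, [0.001, 0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build())  # 4 x 3 = 12 candidate models

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=4)  # fit candidates concurrently (Spark 2.3+)

best = cv.fit(train).bestModel
print(best.getRegParam(), best.getElasticNetParam())
```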
Maybe one more thing to add.
>> George: Yeah, yeah.
>> Reynold: Another really cool piece of functionality in Deep Learning Pipelines is transfer learning. As you said, deep learning takes a very long time, it's very computationally demanding, and it takes a lot of resources and expertise to train. With transfer learning, customers can take an existing deep learning model, well trained in a different domain, retrain it on a very small amount of data very quickly, and adapt it to a different domain. That's the demo with the James Bond car: there's a general image classifier, we retrained it on probably just a few thousand images, and now we can detect whether a car is James Bond's car or not.
>> George: Oh, and the implications there are huge, which is that you don't have to have huge training data sets for adapting a model to a similar situation.
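The James Bond car demo was built on the Deep Learning Pipelines (sparkdl) library. Below is a sketch of the transfer-learning pattern roughly as the library documented it around this release; the `sparkdl` calls reflect that early API and may differ across versions, and the image paths are placeholders.

```python
# Transfer-learning sketch in the spirit of the James Bond car demo: a
# frozen pretrained InceptionV3 network supplies image features, and only
# a small logistic-regression head is trained on a few labeled images.
# Assumes the sparkdl library (Deep Learning Pipelines, circa 2017);
# exact APIs may have changed since.
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer, readImages

bond = readImages("/data/cars/bond").withColumn("label", F.lit(1.0))
other = readImages("/data/cars/other").withColumn("label", F.lit(0.0))
train = bond.union(other)

pipeline = Pipeline(stages=[
    DeepImageFeaturizer(inputCol="image", outputCol="features",
                        modelName="InceptionV3"),  # pretrained, not retrained
    LogisticRegression(labelCol="label", featuresCol="features"),
])
model = pipeline.fit(train)  # fast: only the small head is being fit
```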
>> George: I want to, in the time we have, ask about something there's always been debate over: whether Spark should manage state, whether it's a database or a key-value store. Tell us how the thinking about that has evolved, and how the integration interfaces for achieving it have evolved.
>> Reynold: One of the advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, HBase, HDFS, or S3. There is metadata management functionality in Spark, the catalog of tables that customers can define, but the actual storage sits somewhere else. I don't think that will change in the near future, because storage systems have matured significantly in the last few years. I just wrote a blog post last week about the advantages of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, there's a lot of building on top of existing storage systems, a lot of opportunities for optimization in how you leverage the specific properties of the underlying storage system to get maximum performance. For example, how you do intelligent caching, and how you start thinking about building indexes against the stored data for scan workloads.
>> George: With Tungsten, you take advantage of the latest hardware, we're getting more memory-intensive systems, and the Catalyst optimizer has, or will have, a cost-based optimizer and large memory. Can you change how you go about knowing what data you're managing in the underlying system, and thereby achieve a tremendous acceleration in performance?
>> Reynold: This is actually one area we've invested in with the DBIO module, part of Databricks Runtime, and a lot of this is still in progress. For example, we're adding some form of indexing capability to the system, so we can quickly skip and prune out all the irrelevant data when the user is doing simple point lookups, or doing a scan-heavy workload with some predicates. That has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're adding indexing functionality on top of it as part of DBIO.
>> George: And so what would be the application profiles? Is it just for analytic queries, or can you do the point lookups and updates in that sort of scenario too?
>> Reynold: It's interesting you mention updates. Updates are another thing we've had a lot of feature requests for, and we're actively thinking about how to support update workloads. That said, for both use cases, point lookups and updates, we're still talking about an analytic environment. So we'd be talking about, for example, bulk updates or low-throughput updates, rather than transactional updates in which a record gets updated every time you swipe a credit card. That probably belongs more on transactional databases like Oracle, or MySQL even.
>> George: What about people who started out with Spark on-prem, then realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, are going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like?
>> Reynold: Really interesting, it's kind of two questions, maybe. The first is the hybrid on-prem and cloud solution. One of the nice advantages of Spark is the decoupling of storage and compute. When you want to move workloads from on-prem to the cloud, the part you care most about is the data, because it doesn't really matter much where you run the compute, but data is hard to move. We do have customers leveraging Databricks in the cloud while reading data directly from on-prem, relying on the caching solution we have to minimize the data transfer over time, and that's one route that's pretty popular. Another is that with Amazon you can literally use the Snowball service: you give them hard drives, the trucks ship your data, and it gets put directly into S3. With IoT, a common pattern we see is that a lot of the edge devices push data directly into some firehose like Kinesis or Kafka, and I'm sure Google and Microsoft both have their own variants of that. Then you use Spark to directly subscribe to those topics and process them in real time with structured streaming.
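The subscribe-and-process pattern Reynold describes is the standard structured streaming idiom. A sketch follows; the broker address, topic, and JSON fields are placeholders, and running it requires the spark-sql-kafka connector package.

```python
# Structured streaming sketch of the IoT pattern: devices push JSON events
# into Kafka, Spark subscribes and aggregates in near real time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iot-stream-sketch").getOrCreate()

readings = (spark.readStream
    .format("kafka")  # needs the spark-sql-kafka package on the classpath
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json", "timestamp"))

parsed = readings.select(
    F.get_json_object("json", "$.device_id").alias("device_id"),
    F.get_json_object("json", "$.temp").cast("double").alias("temp"),
    "timestamp")

# Per-device averages over 1-minute event-time windows, tolerating
# events that arrive up to 5 minutes late.
agg = (parsed
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "device_id")
    .agg(F.avg("temp").alias("avg_temp")))

query = agg.writeStream.outputMode("update").format("console").start()
```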
>> George: And so would Spark be down, let's say, at the site level, if it's not on the device itself?
>> Reynold: It's an interesting thought, and maybe one thing we should consider more in the future is how to push Spark to the edges. Right now it's more of a centralized model, in which the devices push data into Spark, which is centralized somewhere. I've seen, for example, and I don't remember the exact use case, something to do with a scientific experiment in the North Pole. There you don't have a great uplink for transferring all the data back to some national lab, so instead they do smart parsing there and ship the aggregated results back. There's another one, but it's less common.
>> David: Alright, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on, for the benefit of everybody?
>> Reynold: In general, Spark came along with two focuses. One is performance, the other is ease of use. I still think big data tools are too difficult to use, and deep learning tools even harder; the barrier to entry is very high for all of these tools. I'd say we might have already addressed performance to a degree that it's pretty usable, and the systems are fast enough. Now we should work on making them (mumbles) even easier to use. That's also what we focus a lot on at Databricks.
>> David: Democratizing access, right?
>> Reynold: Absolutely.
>> David: Alright, well, Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show.
>> Reynold: Thank you very much, David and George.
>> David: Thank you all for watching here at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.
Ash Munshi, Pepperdata - #SparkSummit - #theCUBE
(upbeat music)
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.
>> David: Welcome back to theCUBE. It's day two at Spark Summit 2017. I'm David Goad, here with George Gilbert from Wikibon. George.
>> George: Good to be here.
>> David: Alright, and the guest of honor, of course, is Ash Munshi, who is the CEO of Pepperdata. Ash, welcome to the show.
>> Ash: Thank you very much, thank you.
>> David: Well, you have an interesting background. I want you to tell us real quick here, not the whole bio, but you've got a great background in machine learning, and you were an early user of Spark. Tell us a little bit about your experience.
>> Ash: I'm actually a mathematician originally, a theoretician who worked for IBM Research, and then subsequently for Larry Ellison at Oracle, and a number of other places. Most recently I was CTO at Yahoo, and subsequent to that I did a bunch of startups that involved different types of machine learning, and, in general, a lot of big data infrastructure work.
>> David: And go back to 2012 with Spark, right? You had an interesting development.
>> Ash: Right, so in 2011, 2012, when Spark was still early, we were building a recommendation system based on user-generated reviews. That was a project done with Nando de Freitas, who is now at DeepMind, and Peter Cnudde, who's one of the key guys who runs infrastructure at Yahoo. We started that company, and we were one of the early users of Spark. We were analyzing all the reviews at Amazon. Amazon allows you to crawl all of their reviews, and we had natural language processing that would analyze them. When we were doing the MapReduce version, it was taking us a huge number of nodes and 24 hours to run the analysis. Then we had this little project called Spark, out of AMPLab, and we decided to spin it up and see what we could do. It had lots of issues at that time, but we were able to spin it up, I think on the order of 100,000 nodes, and we took the times for running our algorithms from tens of hours down to an hour or two. It was a significant improvement in performance, and that's when we realized this was going to be really important once it got mature enough. I'm glad to see that it's actually happened now, and it's taken over the world.
>> David: Yeah, that little project became a big deal, didn't it?
>> Ash: It became a big deal, and now everybody's taking advantage of the same thing.
>> David: Well, bring us to the present here. We'll talk about Pepperdata and what you do, and then George is going to ask a little bit more about some of the solutions that you have.
>> Ash: Perfect. Pepperdata is a company founded by two gentlemen, Sean Suchter and Chad Carson. Sean used to run Yahoo Search and was one of the first guys who helped develop Hadoop, next to Eric14 and that team. Chad was one of the first guys who figured out how to monetize clicks, and was the data science guy around that whole thing. Those are the two guys who started the company. I joined last July as CEO, and what we've done recently is expand the focus of the company to addressing DevOps for big data.
The reason DevOps for big data is important is that in the last few years people have gone from experimenting with big data to taking it into production, and now they're figuring out how to make it run properly, scale, and do all the other things it needs to do. It's that transition: "Hey, we ran it in production, and it didn't quite work the way we wanted; now we have to make it work correctly." That's where we fit in, and that's where DevOps comes in. DevOps is about making production systems that perform in the right way, and it shortens the cycle between developers and operators. The tighter the loop, the faster you can get solutions out, because business users want that. That's where we're squarely focused: how do we make that work correctly for big data? The difference between classic DevOps and DevOps for big data is that you're no longer dealing with a set of computers solving isolated problems. You're dealing with thousands of machines solving one problem, and the amount of data is significantly larger. The classic methodologies, agile and all that, still work, but the tools don't, and that's where we come in. We have a set of tools focused on performance, because that's the big difference with distributed-systems performance versus classic scaled-out computing. If you've got web servers, performance is important too, but the work can be sharded nicely. Here it's one system, or a set of systems, working on one problem. That's much harder, it's a different set of problems, and we help solve those problems.
>> David: Yeah, and George, you look like you're itching to dig into this, feel free.
>> George: Well, so one of the big announcements at the show, the headline announcement today, was Spark serverless. So it's not just someone running Spark in the cloud as a managed service; it's up there as a SaaS application. You could call it platform as a service, but it's basically a service where the infrastructure is invisible. Now, for all those customers who are running their own clusters, which is pretty much everyone at this point, I would imagine, how far can you take them in hiding much of the overhead of running those clusters? And by the overhead I mean primarily performance and maximizing resource efficiency.
>> Ash: You have to double-click on the kind of resources we're talking about here. There's the number of nodes you're going to need for the computation. There's the amount of disk storage you're going to need, and what type of CPUs. All of that is part of the costing, if you will, of running an infrastructure. If somebody hides all that and makes it economical, then that's a great thing, right?
And if it can actually be made so that it works for huge installations, and hides it appropriately so I don't pay too much of a tax, that's a wonderful thing to do. But our customers are enterprises, typically Fortune 200 enterprises, and they have both a mixture of cloud-based stuff, where they actually want to control everything about what's going on, and then they have infrastructure internally, which by definition they control everything that's going on, and for them we're very, very applicable. I don't know how applicable we'd be in this sort of new world, as a service that grows and shrinks. I can certainly imagine that whoever provides that service would embed us, to be able to use the stuff more efficiently. >> No, you answered my question, which is, for the people who aren't getting the turnkey, you know, sort of SaaS solution, and they need help managing, you know, what's a fairly involved stack, they would turn to you? >> Ash: Yes. >> Okay. >> Can I ask you about the specific products? >> George: Oh yes. >> I saw you at the booth, and I saw you were announcing a couple of things. Well what is new-- >> Ash: Correct. >> With the show? >> Correct, so at the show we announced Code Analyzer for Apache Spark, and what that allows people to do, is really understand where performance issues are actually happening in their code. So, one of the wonderful things about Spark, compared to MapReduce, is that it abstracts the paradigm that you actually write against, right? So that's a wonderful thing, 'cause it makes it easier to write code. The problem when we abstract, is what does that abstraction do down in the hardware, and where am I losing performance? And being able to give that information back to the user. So you know, in Spark, you have jobs that can run in parallel. So an app consists of jobs, jobs can run in parallel, and each one of these things can consume resources, CPU, memory, and you see that through sort of garbage collection, or a disk or a network, and what you want to find out, is which one of these parallel tasks was dominating the CPU? Why was it dominating the CPU? Which one actually caused the garbage collector to go crazy at some point? While the Spark UI provides some of that information, what it doesn't do, is give you a time series view of what's going on. So it's sort of a blow-by-blow view of what's going on. By imposing the time series view on sort of an enhanced version of the Spark UI, you now have much better visibility about which offending stages are causing the issue. And the nice thing about that is, once you know that, you know exactly which piece of code you actually want to go and look at. So a classic example would be, you might have two stages that are running in parallel. The Spark UI will tell you that it's stage three that's causing the problem, but if you look at the time series, you'll find out that stage two actually runs longer, and that's the one that's pegging the CPU. And you can see that because we have the time series, but you couldn't see that any other way. >> So you have a code analyzer and also the app profiler. >> So the app profiler is the other product that we announced a few months ago. We announced that I guess about three months ago or so. And the app profiler, what it does, is after the run is done, it actually looks at all the data that the run produces, that the Spark history server produces, and then it actually goes back and analyzes that and says, "Well you know what?
"You're executors here, are not working as efficiently, "these are the executors "that aren't working as efficiently." It might be using too much memory or whatever, and then it allows the developer to basically be able to click on it and say, "Explain to me why that's happening?" And then it gives you a little, you know, a little fix-it if you will. It's like, if this is happening, you probably want to do these things, in order to improve performance. So, what's happening with our customers, is our customers are asking developers to run the application profiler first, before they actually put stuff on production. Because if the application profiler comes back and says, "Everything is green." That there's no critical issues there. Then they're saying, "Okay fine, put it on my cluster, "on the production cluster, "but don't do it ahead of time." The application profiler, to be clear, is actually based on some work that, on open source project called Dr. Elephant, which comes out of LinkedIn. And now we're working very closely together to make sure that we actually can advance the set of heuristics that we have, that will allow developers to understand and diagnose more and more complex problems. >> The Spark community has the best code names ever. Dr. Elephant, I've never heard of that one before. (laughter) >> Well Dr. Elephant, actually, is not just the Spark community, it's actually also part of the MapReduce community, right? >> David: Ah, okay. >> So yeah, I mean remember Hadoop? >> David: Yes. >> The elephant thing, so Dr. Elephant, and you know. >> Well let's talk about where things are going next, George? >> So, you know, one of the things we hear all the time from customers and vendors, is, "How are we going to deal with this new era "of distributed computing?" You know, where we've got the cloud, on-prem, edge, and like so, for the first question, let's leave out the edge and say, you've got your Fortune 200 client, they have, you know, production clusters or even if it's just one on-prem, but they also want to work in the cloud, whether it's for elastics stuff, or just for, they're gathering a lot of data there. How can you help them manage both, you know, environments? >> Right, so I think there's a bunch of times still, before we get into most customers actually facing that problem. What we see today is, that a lot of the Fortune 200, or our customers, I shouldn't say a lot of the Fortune 200, a lot of our customers have significant, you know, deployments internally on-prem. They do experimentation on the cloud, right? The current infrastructure for managing all these, and sort of orchestrating all this stuff, is typically YARN. What we're seeing, is that more than likely they're going to wind up, or at least our intelligence tells us that it's going to wind up being Kubernetes that's actually going to wind up managing that. So, what will happen is-- >> George: Both on-prem and-- >> Well let me get to that, alright? >> George: Okay. >> So, I think YARN will be replaced certainly on-prem with Kupernetes, because then you can do multi data center, and things of that sort. The nice thing about Kupernetes, is it in fact can span the cloud as well. So, Kupernetes as an infrastructure, is certainly capable of being able to both handle a multi data center deployment on-prem, along with whatever actually happens on the cloud. There is infrastructure available to do that. 
It's very immature, most of the customers aren't anywhere close to being able to do that, and I would say even before Kubernetes gets accepted within the environment, it's probably 18 months, and there's probably another 18 months to two years, before we start facing this hybrid cloud, on-prem kind of problem. So we're a few years out I think. >> So, would, for those of us including our viewers, you know, who know the acronym, and know that it's a, you know, scheduler slash cluster manager, resource manager, would that give you enough of a control plane and knowledge of sort of the resources out there, for you to be able to either instrument or deploy an instrument to all the clusters (mumbles). >> So we are actually leading the effort right now for big data on Kubernetes. So there's a small group working on it: Google, us, Red Hat, Palantir, and Bloomberg has now joined the group as well. We are actually today talking about our effort on getting HDFS working on Kubernetes, so we see the writing on the wall. We clearly are positioning ourselves to be a player in that particular space, so we think we'll be ready and able to take that challenge on. >> Ash this is great stuff, we've just got about a minute before the break, so I wanted to ask you just a final question. You've been in the Spark community for a while, so what other open source tools should we be keeping our eyes out for? >> Kubernetes. >> David: That's the one? >> To me that is the killer that's coming next. >> David: Alright. >> I think that's going to make life easier, it's going to unify the microservices architecture, plus the sort of multi data center and everything else. I think it's really, really good. Borg works, it's been working for a long time. >> David: Alright, and I want to thank you for that little Pepper pen that I got over at your booth, as the coolest-- >> Come and get more. >> Gadget here. >> We also have Pepper sauce. >> Oh, of course. (laughter) Well there sir-- >> It's our sauce. >> There's the hot news from-- >> Ash: There you go. >> Pepperdata's Ash Munshi. Thank you so much for being on the show, we appreciate it. >> Ash: My pleasure, thank you very much. >> And thank you for watching theCUBE. We're going to be back with more guests, including Ali Ghodsi, CEO of Databricks, coming up next. (upbeat music) (ocean roaring)
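For readers who want to see the failure mode the Code Analyzer discussion above is getting at, here is a minimal, hedged PySpark sketch of one application running two jobs in parallel. It uses only stock Spark APIs; the workload shapes and the fair-scheduler setting are illustrative assumptions, not anything Pepperdata requires.

```python
# A minimal sketch (not Pepperdata's tooling): one Spark application
# submitting two jobs from separate threads, so their stages overlap.
# Aggregate per-stage totals in the Spark UI can then point at the wrong
# culprit; only a wall-clock (time series) view shows which stage is
# pegging CPU at a given moment.
import threading
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallel-jobs-sketch")
         # FAIR scheduling lets the two jobs interleave instead of queuing.
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())
sc = spark.sparkContext

def cpu_heavy_job():
    # CPU-bound tasks: tends to peg cores.
    sc.parallelize(range(1000000), 100) \
      .map(lambda x: sum(i * i for i in range(500))) \
      .count()

def shuffle_heavy_job():
    # Wide shuffle: tends to stress memory and garbage collection.
    sc.parallelize(range(1000000), 100) \
      .map(lambda x: (x % 1000, x)) \
      .groupByKey() \
      .count()

threads = [threading.Thread(target=cpu_heavy_job),
           threading.Thread(target=shuffle_heavy_job)]
for t in threads:
    t.start()
for t in threads:
    t.join()
spark.stop()
```

With both jobs in flight, the stages interleave on the same executors, which is exactly the attribution problem a time-series view is meant to untangle.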
Day 2 Kickoff - #SparkSummit - #theCUBE
[Narrator] Live from San Francisco it's the Cube covering Spark Summit 2017 brought to you by Databricks. >> Welcome to the Cube. My name is David Goad and I'm your host and we are here at Spark day two. It's the Spark Summit and I am flanked by a couple of consultants here from-- sorry, analysts from Wikibon. I got to get this straight. To my left we have Jim Kobielus who is our lead analyst for Data Science. Jim, welcome to the show. >> Thanks David. >> And we also have George Gilbert who is the lead analyst for Big Data and Analytics. I'll get this right eventually. So why don't we start with Jim. Jim, just kicking off the show here today, we wanted to get some preliminary thoughts before we really jump into the rest of the day. What are the big themes that we're going to hear about? >> Yeah, today is the Enterprise day at Spark Summit. So Spark for the Enterprise. Yesterday was focused on Spark the evolution, extension of Spark to support native development of deep learning as well as speeding up Spark to support sub-millisecond latencies. But today it's all about Spark and the Enterprise, really what I call wrapping dev-ops around Spark, making it more productionizable, supportable. The Databricks serverless announcement, though it was announced yesterday when the press release went up, they're going into some depth right now in the key note about serverless, and really serverless is all about providing an in-cloud Spark, essentially a sandbox for teams of developers to scale up and scale out enough resources to do the modeling, the training, the deployment, the iteration, the evaluation of Spark jobs in essentially a 24 by seven multi-tenant fully supported environment. So it's really about driving this continuous Spark development and iteration process into a 24 by seven model in the Enterprise. What's really happening is that data scientists and Spark developers are becoming an operational function that businesses are building strategic infrastructure around; things like recommendation engines in e-commerce environments absolutely demand 24 by seven, resilient, Spark team-based collaboration environments, which is really what the serverless announcement is all about. >> David: So getting increasing demand on mission critical problems so that optimization is a big deal. >> Yeah, data science is not just an R&D function, it's an operational IT function as well. So that's what it's all about. >> David: Awesome, well let's go to George. I saw you watching the key note. I think you're still watching it again this morning, taking notes feverishly. What were some of the things that stuck out to you from the key note speaker this morning? >> There are some things that are sort of going to bleed over from yesterday where we can explore some more. We're going to have on the show the chief architect, Reynold Xin, and the CEO, Ali Ghodsi, and some of the things that we want to understand is how the scope of applications that are appropriate for Spark is expanding. We got sort of unofficial guidance yesterday that, you know, just because Spark doesn't handle key value stores or databases all that tightly right now, that doesn't mean it won't in the future, on the Apache Spark side through better APIs and on the Databricks side perhaps custom integration. And the significance of that is that you can open up a whole class of operational apps, apps that run your business and that now incorporate, you know, rich analytics as well.
Another thing that we'll want to be asking about is, keying off what Jim was saying, now that this becomes not just a managed service, where you take the labor that the end customer was applying to get the thing running, but it's now automated and you don't even see the infrastructure. We'll want to know what does that mean for the edge, you know, where we're doing analytics close to internet of things and people, and sort of whether there has to be a new configuration of Spark to work with that. And then of course what do we do about the whole data science process and the dev-ops for data science when you have machine learning distributed across the cloud and edge and on-prem. >> Jim: In fact, I know we have Pepperdata coming on right after this, who might be able to talk about that exact dev-ops question in terms of performance optimization in a distributed Spark environment, yeah. >> George, I want to follow up with that. We had Matt Fryer from Hotels.com, he's going to be on our show later but he was on the key note stage this morning. He talked about going all cloud, all Spark, and how data science is even a competitive advantage for Hotels.com. What do you want to dig into when we get him on the show? >> That's a really good question because if you look at business strategy, you don't really build a sustainable advantage just by doing one thing better than everyone else. That's easier to pick off. The sustainable strategic advantages come from not just doing one thing better than everyone else but many things, and then orchestrating their improvement over time, and I'd like to dig into how they're going to do that. 'Cause remember, Hotels.com is the internet-era descendant of the original travel reservation systems, which did confer competitive advantage on the early architects and deployers of that technology.
I'm hoping to hear from some of the vendors who are on the show today. >> David: Fantastic and George, closing thoughts on the opening segment? 30 seconds. >> Closing thoughts on the opening segment. Like Jim, we want to think about Spark holistically, and it has traditionally been best positioned as sort of this-- as Matei acknowledged yesterday-- sort of this offline branch of analytics that you apply to data in a sort of repository that you accumulated, and now we want to see it put into production, but to do that you need more than just what Spark is today. You need basically a database or key value kind of option so that you're storing your work as it goes along, so you can go back and analyze it, either simple analysis or complex analysis. So I want to hear about that. I want to hear about their plans for IOT. Spark is kind of a heavyweight environment, so you're probably not going to put it in the boot of your car, or at least not likely anytime soon. >> Jim: Intelligent edge. I mean, Microsoft Build a few weeks ago was really deep on intelligent edge. HP, whose show we're actually doing, I think it's in Vegas, right? They're also big on intelligent edge. In fact, we had somebody on the show yesterday from HP going into some depth on that. I want to hear what Databricks has to say on that theme. >> Yeah, and which part of the edge, is it the gateway, the edge gateway, which is really a slimmed-down server, or the edge device, which could be a 32-bit, one-meg-of-RAM network card. >> Yeah. >> All right, well gentlemen, appreciate the little insight here before we get started today, and we're just getting started. Thank you both for being on the show and thank you for watching the Cube. We'll be back in a little while with our CEO from Databricks. Thanks for watching. (upbeat music)
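To make the Structured Streaming thread above concrete, here is a hedged PySpark sketch of a "continuous app": an always-on query over a never-ending stream. It uses the standard micro-batch engine with a short trigger; the millisecond-latency continuous mode discussed at the summit shipped in later Spark releases (the trigger(continuous=...) option in Spark 2.3), so treat that part as forward-looking rather than something to copy at the time.

```python
# Sketch of a "continuous app": an always-on Structured Streaming query
# doing a running word count. Micro-batch engine with a short trigger;
# the sub-millisecond continuous mode discussed on stage shipped later
# (trigger(continuous=...) in Spark 2.3), so this shows the shape of the
# API, not the announced latency numbers.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("continuous-app-sketch").getOrCreate()

# Read lines from a local socket (for testing: `nc -lk 9999`).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# A never-ending aggregation over the stream.
counts = (lines
          .select(explode(split(lines.value, " ")).alias("word"))
          .groupBy("word")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="1 second")  # micro-batch cadence
         .start())
query.awaitTermination()
```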
Day One Wrap - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's the CUBE covering Spark Summit 2017, brought to you by Databricks. (energetic music plays) >> And what an exciting day we've had here at the CUBE. We've been at Spark Summit 2017, talking to partners, to customers, to founders, technologists, data scientists. It's been a load of information, right? >> Yeah, an overload of information. >> Well, George, you've been here in the studio with me talking with a lot of the guests. I'm going to ask you to maybe recap some of the top things you've heard today for our guests. >> Okay so, well, Databricks laid down, sort of, three themes that they wanted folks to take away. Deep learning, Structured Streaming, and serverless. Now, deep learning is not entirely new to Spark. But they've dramatically improved their support for it. I think, going beyond the frameworks that were written specifically for Spark, like Deeplearning4j and BigDL by Intel. And now TensorFlow, which is the open source framework from Google, has gotten much better support. Structured Streaming, it was not clear how much more news we were going to get, because it's been talked about for 18 months. And they really, really surprised a lot of people, including me, where they took, essentially, the processing time for an event or a small batch of events down to 1 millisecond. Whereas, before, it was in the hundreds if not higher. And that changes the type of apps you can build. And also, the Databricks guys had coined the term continuous apps, which means they operate on a never-ending stream of data, which is different from what we've had in the past where it's batch or with a user interface, request-response. So they definitely turned up the volume on what they can do with continuous apps. And serverless, they'll talk about more tomorrow. And Jim, I think, is going to weigh in. But it, basically, greatly simplifies the ability to run this infrastructure, because you don't think of it as a cluster of resources. You just know that it's sort of out there, and you make requests of it, and it figures out how to fulfill them. I will say, the other big surprise for me was when we had Matei, who's the creator of Spark and the chief technologist at Databricks, come on the show and say, when we asked him about how Spark was going to deal with, essentially, more advanced storage of data so that you could update things, so that you could get queries back, so that you could do analytics, and not just of stuff that's stored in Spark but stuff that Spark stores essentially below it. And he said, "You know, Databricks, you can expect to see come out with or partner with a database to do these advanced scenarios." And I got the distinct impression, after listening to the tape again, that he was talking about Apache Spark, which is separate from Databricks, and that they would do some sort of key-value store. So in other words, when you look at competitors or quasi-competitors like Confluent with Kafka or data Artisans with Flink, they don't, they're not perfect competitors. They overlap some. Now Spark is pushing its way more into overlapping with some of those solutions. >> Alright. Well, Jim Kobielus. And thank you for that, George. You've been mingling with the masses today. (laughs) And you've been here all day as well. >> Educated masses, yeah, (David laughs) who are really engaged in this stuff, yes. >> Well, great, maybe give us some of your top takeaways after all the conversations you've had today. >> They're not all that dissimilar from George's.
What Databricks-- Databricks of course being the center, the developer, the primary committer in the Spark open source community-- they've done a number of very important things in terms of the announcements today at this event that push Spark, the Spark ecosystem, where it needs to go to expand the range of capabilities and their deployability into production environments. I feel the deep-learning side announcement, in terms of the deep-learning pipeline API, is very, very important. Now, as George indicated, Spark has been used in a fair number of deep-learning development environments. But not as a modeling tool so much as a training tool, a tool for in-memory distributed training of deep-learning models that were developed in TensorFlow, in Caffe, and other frameworks. Now this announcement is essentially bringing support for deep learning directly into the Spark modeling pipeline, the machine-learning modeling pipeline, being able to call out to deep learning, you know, TensorFlow and so forth, from within MLlib. That's very important. That means that Spark developers, of which there are many, far more than there are TensorFlow developers, will now have an easy path to bring more deep learning into their projects. That's critically important to democratize deep learning. I hope, and from what I've seen Databricks has indicated, that they currently have API support reaching out to both TensorFlow and Keras, and that they have plans to bring in API support for access to other leading DL toolkits such as Caffe, Caffe 2, which is Facebook-developed, such as MXNet, which is Amazon-developed, and so forth. That's very encouraging. Structured Streaming is very important in terms of what they announced, which is an API to enable access to faster, or higher-throughput, Structured Streaming in their cloud environment. And they also announced that they have gone beyond, in terms of the code that they've built, the micro-batch architecture of Structured Streaming, to enable it to evolve into a more true streaming environment, to be able to contend credibly with the likes of Flink. 'Cause I think that the Spark community has, sort of, had their back against the wall with Structured Streaming, in that they couldn't fully provide a true sub-millisecond end-to-end latency environment heretofore. But it sounds like with this R&D that Databricks is addressing that, and that's critically important for the Spark community to continue to evolve in terms of continuous computation. And then the serverless-apps announcement is also very important, 'cause I see it as really being a fully-managed multi-tenant Spark-development environment, an enabler for continuous build, deploy, and test DevOps within a Spark machine-learning and now deep-learning context. The Spark community as it evolves and matures needs robust DevOps tools to productionize these machine-learning and deep-learning models. Because really, in many ways, many customers, many developers are now using, or developing, Spark applications that are real 24-by-7 enterprise application artifacts that need a robust DevOps environment. And I think that Databricks has indicated they know where this market needs to go and they're pushing it with R&D. And I'm encouraged by all those signs. >> So, great. Well thank you, Jim. I hope both you gentlemen are looking forward to tomorrow. I certainly am. >> Oh yeah. >> And to you out there, tune in again around 10:00 a.m. Pacific Time. We're going to be broadcasting live here.
From Spark Summit 2017, I'm David Goad with Jim and George, saying goodbye for now. And we'll see you in the morning. (sparse percussion music playing) (wind humming and waves crashing).
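As a concrete companion to the deep-learning pipeline discussion above, here is a hedged sketch of the transfer-learning flow from Databricks' Deep Learning Pipelines library (sparkdl), which is what the announcement covered. Class and function names follow the examples Databricks published around the launch (readImages, DeepImageFeaturizer) and may have shifted in later releases; the image path and the toy label are placeholders.

```python
# Hedged sketch of Deep Learning Pipelines (sparkdl) transfer learning:
# a pre-trained CNN becomes a featurizer stage inside a standard MLlib
# pipeline, so a plain classifier trains on deep features. Names follow
# the launch-era sparkdl examples and may differ in later releases.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from sparkdl import DeepImageFeaturizer, readImages  # assumes sparkdl installed

spark = SparkSession.builder.appName("dl-pipelines-sketch").getOrCreate()

# Placeholder path; real code would join image paths to real labels
# instead of the constant toy label used here.
train_df = readImages("/data/images/train").withColumn("label", lit(1.0))

featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")  # pre-trained CNN
lr = LogisticRegression(maxIter=20, labelCol="label")

# An MLlib pipeline calling out to a deep-learning model as just another stage.
model = Pipeline(stages=[featurizer, lr]).fit(train_df)
```

The design point Jim highlights is visible here: the deep network is wrapped as an ordinary pipeline stage, so the many existing MLlib developers get deep features without learning TensorFlow directly.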
Jags Ramnarayan, SnappyData - Spark Summit 2017 - #SparkSummit - #theCUBE
(techno music) >> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> You are watching the Spark Summit 2017 coverage by theCUBE. I'm your host David Goad, joined by George Gilbert. How you doing George? >> Good to be here. >> And honored to introduce our next guest, the CTO from SnappyData, wow we were lucky to get this guy. >> Thanks for having me >> David: Jags Ramnarayan, Jags thanks for joining us. >> Thanks, thanks for having me. >> And for people who may not be familiar, maybe tell us what does SnappyData do? >> So SnappyData in a nutshell is taking Spark, which is a compute engine, and in some sense augmenting the guts of Spark so that Spark truly becomes a hybrid database. A single data store that's capable of taking Spark streams, doing transactions, providing mutable state management in Spark, but most importantly being able to turn around and run analytical queries on that state that is continuously emerging. That's in a nutshell. Let me just say a few things: SnappyData itself is a startup that is spun out of Pivotal. We've been out of Pivotal for roughly about a year, so the technology itself was to a great degree incubated within Pivotal. It's a product called GemFire within VMware and Pivotal. So we took the guts of GemFire, which is an in-memory database designed for transactional, low-latency, high-concurrency scenarios, and we are sort of fusing it, that's the key thing, fusing it into Spark, so that now Spark becomes significantly richer, not just as a compute platform, but as a store. >> Great, and we know this is not your first Spark Summit, right? How many have you been to? Lost count? >> Boy, let's see, three, four Spark Summits now, if I include the Spark Summit this year, four to five. >> Great, so an active part of the community. What were you expecting to learn this year, and have you been surprised by anything? >> You know, it's always wonderful to see, I mean, every time I come to Spark, it's just a new set of innovations, right? I mean, when I first came to Spark, it was a mix of, let's talk about data frames, all of these, let's optimize my queries. Today you come, I mean there is such a wide spectrum of amazing new things that are happening. It's just mind boggling. Right from AI techniques, structured streaming, and the real-time paradigm, and sort of this confluence that Databricks brings to it. How they create that confluence through a unified mechanism is really brilliant, I think. >> Okay, well let's talk about how you're innovating at SnappyData. What are some of the applications or current projects you're working on? >> So, a number of things. I mean, GE is an investor in SnappyData. So we're trying to work with GE in the industrial IoT space. We're working with large health care companies, also in their IoT space. So the pattern we see with SnappyData is one that has a lot of high velocity streams of data emerging, where the streams could be, for instance, Kafka streams driving Spark streams, but streams could also be operational databases. Your Postgres instance and your Cassandra database instance, they're all generating continuous changes to data in an operational world; can I suck that in and almost create a replica of the state that might be emerging in the SQL operational environment, and still allow interactive analytics at scale for a number of concurrent users on live data.
Not cube data, not pre-aggregated data, but on live data itself, right? Being able to almost give you Google-like speeds on live data. >> George, we've heard people talking about this quite a bit. >> Yeah, so Jags, as you said upfront, Spark was conceived as sort of a general purpose, I guess, analytic compute engine, and adding DBMS to it, like sort of not bolting it on, but deeply integrating it, so that the core data structures now have DBMS properties, like transactionality, that must make a huge change in the scope of applications that are possible. Can you describe some of those for us? >> Yeah. The classic paradigm today that we find time and again is the so-called SMACK stack, right? I mean there was the lambda stack, now there's a SMACK stack. Which is really about Spark running on Mesos, but really using Spark streaming as an ingestion capability, and there is continuous state that is emerging that I want to write into Cassandra. So what we find very quickly is that the moment the state is emerging, I want to throw in a business intelligence tool on top and immediately do live dashboarding on that state that is continuously changing and emerging. So what we find is that the first part, which is the high-speed ingest, the ability to transform these data sets, cleanse the data sets, get the cleansed data into Cassandra, works really well. What is missing is this ability to say, well, how am I going to get insight? How can I ask interesting, insightful questions, get responses immediately on that live data, right? And so the common problem there is the moment I have Cassandra working, let's say, with Spark, every time I run an analytical query, you only have two choices. One is use the parallel connector to pull in the data sets from Cassandra, right, and now unfortunately, when you do analytics, you're working with large volumes. And every time I run even a simple query, all of a sudden I could be pulling in 10 gigabytes, 20 gigabytes of data into Spark to run the computation. Hundreds of seconds lost. Nothing like interactive, it's all about batch querying. So how can I turn around and say that if stuff changes in Cassandra, I can have an immediate real-time reflection of that mutable state in Spark on which I can run queries rapidly. That's a very key aspect to us. >> So you were telling me earlier that you didn't see, necessarily, a need to replace the Cassandra in the SMACK stack entirely, but to complement it. >> Jags: That's right. >> Elaborate on that. >> So our focus, much like Spark, is all about in-memory state management and in-memory processing. And Cassandra, realistically, is really designed to say how can I scale to the petabyte, right, for key value operations, semi-structured data, what have you. So we think there are a number of scenarios where you still want Cassandra to be your store, because in some sense a lot of these guys have already adopted Cassandra in a fairly big way. So you want to say, hey, leave your petabyte-level volume in there, and you can essentially work with the real-time state, which could still be many terabytes of state, essentially in main memory, which is what we specialize in.
And we're also, I mean I can touch on this approximate query processing technology, which is the other key part here, to say hey, I can't really throw 1,000 cores and 1,000 machines at this just so that you can do your job really well. So one of the techniques we are adopting, which even the Databricks guys started with BlinkDB, essentially, is an approximate query processing engine; we have our own approximate query processing engine as an adjunct, essentially, to our store. What that essentially means is to say, can I take a billion records and synthesize something really, really small, using smart sampling techniques, sketching techniques, essentially statistical structures, that can be stored along with Spark, in Spark memory itself, and fuse it with the Spark Catalyst query engine. So that as you run your query, we can very smartly figure out, can I use the approximate data structures to answer the questions extremely quickly. Even when the data would be in petabyte volume, I have these data structures that are now taking maybe gigabytes of storage only. >> So hopefully not getting too, too technical, the Spark Catalyst query optimizer, like an Oracle query optimizer, knows about the data that it's going to query, only in your case, you're taking what Catalyst knows about Spark, and extending it with what's stored in your native, also Spark native, data structures. >> That's right, exactly. So think about it: an optimizer always takes a query plan and says, here are all the possible plans you can execute, and here is the cost estimate for these plans; we essentially inject more plans into that and hopefully, our plan is even more optimized than the plans that the Spark Catalyst engine came up with. And Spark is beautiful because the Catalyst engine is a very pluggable engine. So you can essentially augment that engine very easily. >> So you've been out in the marketplace, whether in alpha, beta, or now, production, for enough time so that the community is aware of what you've done. What are some of the areas that you're being pulled into that people didn't associate Spark with? >> So more often, we end up in situations where they're looking at SAP HANA, as an example, maybe a MemSQL, maybe just Postgres, and all of the sudden, there are these hybrid workloads, which is the Gartner term HTAP, so there are a lot of HTAP use cases that we get pulled into. So there's no Spark, but we get pulled into it because we're just a hybrid database. That's how people look at us, essentially. >> Oh, so you pull Spark in because that's just part of your solution. >> Exactly, right. So think about it: Spark is not just data frames and a rich API, it also has a SQL interface, right. I can essentially execute SQL, select SQL. Of course we augment that SQL so that now you can do what you expect from a database, which is an insert, an update, a delete, can I create a view, can I run a transaction? So all of a sudden, it's not just a Spark API, but what we provide looks like a SQL database itself. >> Okay, interesting.
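To give a flavor of the approximate-query idea in code: stock Spark, quite apart from SnappyData's own engine, already exposes sketch-backed approximations, so a minimal sketch looks like the following. The table is synthetic and the error bounds are illustrative.

```python
# The flavor of approximate query processing on stock Spark (this is
# vanilla Spark SQL, not SnappyData's AQP engine): sketch-backed
# aggregates and sampling trade a small, bounded error for big speedups.
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct, avg

spark = SparkSession.builder.appName("aqp-flavor-sketch").getOrCreate()

# Synthetic table standing in for a large fact table.
df = spark.range(0, 100000000).selectExpr(
    "id", "id % 97 AS user", "rand() AS amount")

# HyperLogLog-based distinct count: tiny memory footprint, ~2% error.
df.select(approx_count_distinct("user", rsd=0.02)).show()

# Quantile sketch: approximate median without a full sort of the column.
print(df.stat.approxQuantile("amount", [0.5], 0.01))

# Uniform 1% sample: run the aggregate on a fraction of the rows.
df.sample(False, 0.01, 42).select(avg("amount")).show()
```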
So tell us, in the work with GE, they're among the first that have sort of educated the world that there's so much data coming off devices that we have to be intelligent about what we filter and send to the cloud; we train models, potentially, up there, we run them closer to the edge, so that we get low latency analytics. But you were telling us earlier that there are alternatives, especially when you have such an intelligent database working both at the edge and in the cloud. >> Right, so that's a great point. See what's happening with sort of a lot of these machine learning models is that these models are learned on historical data sets. And quite often, especially if you look at predictive maintenance, those classes of use cases, in industrial IoT, the patterns could evolve very rapidly, right? Maybe because of climate changes and let's say, for a windmill farm, there are a few windmills that are breaking down so rapidly it's affecting everything else, in terms of the power generation. So being able to sort of update the model itself, incrementally and in near real-time, is becoming more and more important.
I mean, for us, as a small company, the single biggest challenge we have, it's like what one of you guys said, analysts, it's raining databases out there. And there's ability to constantly educate people how you can essentially realize a very next generation, like data pipeline, in a very simplified manner, is the challenge we are running into, right. I mean, I think the business model for us is primarily how many people are going to go and say, yes, batch related analytics is important, but incrementally, for competitive reasons, want to be playing that real-time analytics game lot more than before, right? So that's going to be big for us, and hopefully we can play a big part there, along with Spark and Databricks. >> Great, well we appreciate you coming on the show today, and sharing some of the interesting work that you're doing. George, thank you so much. and Jags, thank you so much for being on theCUBE. >> Thanks for having me on, I appreciate it. Thanks, George. And thank you all for tuning in. Once again, we have more to come, today and tomorrow, here at Spark Summit 2017, thanks for watching. (techno music)
Octavian Tanase, NetApp - #SparkSummit - #theCUBE
(upbeat music) >> Announcer: Live from San Francisco, it's theCUBE, covering the Spark Summit 2017. Brought to you by Databricks. >> You are watching theCUBE at Spark Summit 2017. I'm David Goad here with my friend George Gilbert. How you doing, George? >> Good. >> All right, but the man of the hour is over to my left. I'd like to introduce a Databricks partner, and his name is Octavian Tanase, he's the SVP for the Data ONTAP Software and Systems Group at NetApp. Octavian. >> Thank you for having us. >> All right well you have kind of an interesting background. We were chatting before, you started as an engineer, developer? >> Yeah, so I'm in an executive role right now but I have an interesting trajectory. Most people in a similar role come from a product management or sales background. I'm a former engineer and, you know, somebody that has a passion for technology, and now for customers and building interesting technologies. >> Okay, well if you have a passion for this technology then, I'd like to get your take on the marketplace a little bit. Tell us about the evolution of the mainstream and what you see changing. >> I think your data is the new currency of the 21st century. You have a desire and a thirst to get more out of your data. You have developers, you have analysts looking to build the next great application, to mine your data for great business outcomes. NetApp as a data management company is very much interested in working with companies like Databricks and a bunch of hyperscalers to enable the type of solutions that either enable in-place analytics or data lakes or, you know, solutions that really enable developers and analysts to harness the power of that data. >> Mhmm. So ... Maybe walk us through what you've seen to date in terms of the mainstream use cases for big data, and then tell us where you think they're going, but what walls need to be pushed back with the confluence of technologies to get there. >> Originally what I've seen is a lot of people investing in data lake technologies. Data lakes in a nutshell are massive containers that are simple to manage, scalable, performant, where you can aggregate a bunch of data sources, and then you can run a MapReduce type of workload to correlate that data, to harness the power of that data, to draw conclusions. That was sort of the original track. Over time, I think there's a desire, given how dynamic and diverse the data is, to build a lot of this analytics in-line, in real time. That's where companies like Databricks come in, and that's where the cloud comes in to enable both the agility as well as the type of real-time behavior to get at those analytics. >> Now this is your first Spark Summit? >> Absolutely, happy to be here. >> Oh I know it's just the first day, but what have you learned so far? Any great questions from other participants? >> Well I think I see a lot of people innovating very fast. I see both established players paying attention, I see new companies looking to take advantage of this revolution that is happening, you know, around data and the data services and data analytics. >> Maybe tell us a little more what we were talking about before we started, about how some customers who are very sensitive to their data want to keep it in their data centers, or Equinix, which still counts as pretty much theirs, but the compute is often the cloud somewhere.
>> As you can imagine, we work with a lot of enterprise customers, and one thing that I've learned in the last couple of years is that their thought process has evolved, you know, banks, large financial institutions. Two years ago, they were not even considering the cloud. And I see that now changing, and I see them wanting to operate like a cloud provider, I see them wanting to take advantage of the flexibility and the agility of the cloud. I see them being more comfortable with the type of security capabilities that the cloud offers today. Security has been probably the most troublesome issue that folks have looked to overcome, and then the gravity of the data. The reality is that the data is very distributed and dynamic, diverse in nature as I mentioned earlier. There's data created at the edge, data created in the data center, and people want to be able to process that data in real time regardless of where the data is, without necessarily having to move it in some cases. Everybody's looking for data management solutions that enable mobility, you know, governance, management of that data, and then enabling analytics, wherever that data is. >> You said some really interesting things in there, which is, I mean I can see where the customer's data center extended to Equinix, where they want to bring the compute to the data because the data's heavier than the compute, but what about on the edge? Does it make sense to bring, is there enough data there to keep it there and bring compute down to the edge, or do you co-locate compute persistently? And then how much of the compute is done at the edge?
You'll see ONTAP running in the tactical sphere, and we have projects that I can't really tell you about, but you'll find it broadly deployed on the edge. >> George: Okay. >> Yeah, let's talk a little bit about NetApp. What are some of the business outcomes you're looking for here? Do you have good executive sponsorship of these initiatives? >> We are very excited to be here. NetApp has been in the data management realm for a very, very long time. Yeah, analytics is a natural place, a great adjacency for us. We've been very fortunate to work with NoSQL type of companies. We've been very happy to collaborate with some of the leaders in analytics such as Databricks. We are entering the IOT space and enabling solutions that are really edge focused. So overall, this is a great fit for us and we're very excited to participate at the Summit. >> What do you think will be ... We've heard from Mata that sort of the state of the art in terms of, I hate to say the word, its fantasy, but like experimentation perhaps, is structured streaming, so continuous apps which are calling on deep learning models. Where would you play in that and what do you think ... What are the barriers there? What comes next? >> I think any complete analytics solution will need a bunch of services and some infrastructure that lends itself for that type of a workload, that type of a use case so you need, in some cases, very fast storage with super low latencies. In some cases you will need tremendous throughput. In some cases you will need that small footprint of an operating system running at the edge to enable some of that in-line processing. I think the market will evolve very fast. The solutions will evolve very fast and you will need the type of industry sponsorship by companies that really understand data management and that have made it their business for a very, very long time. I see that synergy that is being created between the innovation in analytics, the innovation that happens in the cloud, and the innovation that a company like NetApp does around a data fabric and around the type of services that are required to govern, to move, to secure, to protect that data in a very cost efficient way. >> This is kind of key, because people are struggling with having some sort of commonality in their architecture between the edge, on PRAM, and the cloud, but it could be at many different levels. What's your sweet spot for offering that? I mean, you talked about deduping and ... >> Compression and compaction. >> Compression and snapshots or whatever. Having that available in different form factors, what does that enable a customer to do, perhaps using different software on top? >> I'm glad that you asked. The reality is that we want to enable customers to consolidate both second and third platform applications on the ONTAP operating system. Customers will find not only flexibility, but consistency on the data management regardless of where data is. Whether it's in the cloud, near the cloud, or on the edge. We believe that we have the most flexible solution to enable data analytics, data management, that lends itself for all these use cases that enable next generation type of applications. >> Okay but if that predicated on having not just data ONTAP, but also a common application architecture on top? >> I think we wanted to enable a variety of solutions being based there. In some cases we're building glue. What do I mean by glue? 
It's, for example, an NFS to HDFS connector that enables that translation from the native format for most of the data in a Hadoop or Spark type of EMR system. We're investing in enabling that flexibility and enabling the innovation that will happen at many of the companies that we see here on the floor today. >> George: Okay, that makes sense. >> We have just a minute to go here before the break. If you could talk to the entire Spark community, and you are right now on theCUBE, what's on your wish list? What do you wish people would do more of? Or if you could get help with something, what would it be? >> I think my ask is: continue to innovate. Push boundaries, and continue to be clever in partnering, both with small vendors that are innovating at a tremendous pace, as well as with established vendors that have made data management their business for many years and are now looking to participate in the ecosystem. >> Let's innovate together. >> All right, very good. >> Octavian, thank you so much for taking some time here out of your busy day to share with theCUBE, and we appreciate you being here. >> Very good. >> Thank you so much. >> Pleasure. >> Thanks, Octavian. >> That's right, you're watching theCUBE here at Spark Summit 2017. We'll see you in a few minutes with our next guest. (upbeat electronic music)
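To make the "glue" concrete: the point of an NFS-to-HDFS connector is that a Spark job can analyze data where it already lives on the filer instead of copying it into HDFS first. Below is a minimal PySpark sketch of that in-place pattern, assuming a hypothetical NFS export mounted on the worker nodes; it is an illustration of the idea only, not NetApp's actual connector, which plugs into the Hadoop FileSystem API rather than relying on a local mount.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-place-analytics").getOrCreate()

# Hypothetical NFS export, assumed mounted at the same path on every
# worker node; a real connector would instead register its own Hadoop
# FileSystem scheme (e.g. an nfs:// URI) so that no local mount is needed.
logs = spark.read.json("file:///mnt/filer/exports/logs/")

# Analyze the data where it lives rather than first loading it into HDFS;
# only the much smaller result leaves the storage system.
logs.groupBy("status").count().show()
```

This is exactly the consolidation story Octavian describes: the same ONTAP-managed data can serve a second-platform application over NFS and a Spark job at once, with no copy step in between.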
Kickoff - #SparkSummit - #theCUBE
>> Announcer: Live, from San Francisco, it's theCUBE! Covering Spark Summit 2017. Brought to you by Databricks. (energetic techno music) >> Welcome to theCUBE! I'm your host, David Goad, and we're here at Spark Summit 2017 in San Francisco, where it's all about data science and engineering at scale. Now, I know there have been a lot of great technology shows here at Moscone Center, but this is going to be one of the greatest, I think. We are joined here by George Gilbert, who is the lead analyst for big data and analytics at Wikibon. George, welcome to theCUBE. >> Good to be here, David. >> All right, so, I know this is kind of like reporting in real time, 'cause you're listening to the keynote right now, right? >> George: Yeah. >> Well, I wanted to get us started with some of the key themes that you've heard. You've done a lot of work recently on how applications are changing with machine learning, as well as the new distributed computing. So, as you listen to what Matei is talking about, and some of the other keynote presenters, what are some of the key themes you're hearing so far? >> There are two big things that they are emphasizing so far this year, at this Spark Summit. One is structured streaming, which they've been talking about more and more over the last 18 months, but it officially goes production-ready in the 2.2 release of Spark, which is imminent. But they also showed something really, really interesting with structured streaming. There have always been other streaming products, and the relevance of streaming is that we're more and more building applications that process data continuously, not in either big batches or just request-response with a user interface. Your streaming capabilities dictate the class of apps you're appropriate for. Spark structured streaming had a lot of overhead in it, 'cause it had to manage a cluster, and it was working with a query optimizer, so it would basically batch up events in groups that would go through, like, once every 200 milliseconds to a full second, which is near real-time but not considered true real-time. And I know I'm driving into the details a bit, but it's very significant. They demoed on stage today-- >> David: I saw the demo. >> They showed structured streams running at one millisecond latency. That's a big breakthrough, because it means, essentially, you can do per-event processing, which is true streaming. >> And so this contributes to deep learning, right? Low-latency streaming. >> Well, it can complement it, because when you do machine learning, or deep learning, you basically have a model and you want to predict something. The stream is flowing along, and for every data element in the stream, you might want a prediction, or a classification, or something like that. Spark had okay support for deep learning before, but that's the other big emphasis now. Before, they could sort of serve models in production, but training models was somewhat more difficult for deep learning. That took parallelization they didn't have. >> I noticed there were three demos that kind of tied together in a little bit of a James Bond story. So, maybe the first one was talking about image classification and transfer learning; tell me a little bit more about what you heard there. I know you need to mute your keynote. The first demo was from Tim Hunter.
>> The demo, James Bond being among my favorite movies, was about cars: they're learning to label cars, and then they're showing cars that appeared in James Bond movies, so they're training the model to predict, was this car seen in a James Bond movie? And then they were also joining it with data that showed where the car was last seen, so it's sort of like a James Bond sighting. And then they trained that model and ran it in production, essentially, at real-time speed. >> And the continuous processing demo showed how fast that could be run. >> Right, right. That was a cool demo. That was a nice visual. >> And then we had the gentleman from Stanford, Christopher Re, come up to talk more about the applications for machine learning. Is it really going to be radically easier to use? >> We didn't make it all the way through that keynote, but yes, there are things that can make machine learning easier to use. For one thing, if you take the old statistical machine learning stuff, it's still very hard to identify the features, or the variables, that you're going to use in the model. Many people expect deep learning, over the next few years, to be able to help with that, so that the features become something a data scientist identifies in collaboration with a domain expert. And deep learning, just the way it learns the features of a cat, like, here's the nose, here's the ears, here's the whiskers, there's the expectation that deep learning will help identify the features for models. So you turn machine learning on itself, and it helps things. Among other things that should get easier. >> We're going to get to talk to several of those keynoters a little bit later in the show, so we'll do a deeper dive on that. Maybe talk to us just generally about who's here at this show, and what do you think they're looking for in the Spark community? >> Spark was always a bottom-up, adoption-first technology, because it fixed some really difficult problems with the predecessor technology, which was called MapReduce, the compute engine in Hadoop. That was not familiar to most programmers, whereas with Spark, there's an API for machine learning, there's an API for batch processing, for stream processing, for graph processing, and you can use SQL over all of those, which made it much more accessible. And now machine learning's built in, streaming's built in. All those things together mean that the old MapReduce was the equivalent of assembly language, and this is at a SQL-level language. >> And so you were here at Spark Summit 2016, right? >> George: Yeah. >> We've seen some advances. Would you say they're incremental advances, or are we really making big leaps? >> Well, Spark 2.0 was a big leap, and we're just approaching 2.2. I would say that getting structured streaming down to such low latency is a big, big deal, and so is adding good support for deep learning, which is now all the rage. Most people are using it for, essentially, vision, listening, speaking, and natural language processing, but it'll spread to other use cases. >> Yeah, we're going to hear about some more of those use cases throughout the show. We've got customers coming in, I won't name them all right now, but they'll be rolling in. What do you want to know most from those customers? >> The real thing is, Spark started out as offline analytic preparation of data that was in data lakes.
And it's moving more into the mainstream of production apps. The big thing is, what's the sweet spot? What types of apps, and where are the edge conditions? That's what I think we'll be looking for. >> And when Matei came out on stage, what did you hear from him? What was the first thing he was prioritizing? Feel free to check the notes you were taking! >> He did the state of the union, as he normally does. The astonishing figure is that there are something like 375,000, I think, Spark Meetup members-- >> David: Wow. >> Yeah. And that's grown over the last four years, basically, from almost zero. So his focus really was on deep learning and on streaming, and those are the things we want to drill down on a little bit, in the context of what you can build with both. >> Well, we're coming up on our first break here, George. I'm really looking forward to interviewing some more of the guests today. So, thanks very much, and I invite you to stay with us here on theCUBE. We'll see you soon. (energetic techno music)
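To make George's point about per-event processing concrete, here is a minimal sketch of a continuous structured streaming query. One hedge: continuous processing only shipped, as an experimental feature, in Spark 2.3, after this on-stage demo, so the API below reflects the released form, and the source, sink, and trigger interval are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-streaming-sketch").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows for testing;
# a production pipeline would typically read from Kafka instead.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Continuous mode only allows map-like operations (select, filter, map);
# dropping micro-batches and aggregations is what makes roughly
# millisecond, per-event latency possible.
flagged = events.filter("value % 2 = 0").selectExpr("timestamp", "value AS even_value")

query = (flagged.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/continuous-ckpt")
         # In continuous mode this is the checkpoint interval, not a
         # batch interval; records still flow through one event at a time.
         .trigger(continuous="1 second")
         .start())

query.awaitTermination()
```

The design point George highlights is visible in the trigger: instead of scheduling a micro-batch every few hundred milliseconds, the engine keeps long-running tasks that process each record as it arrives, and the interval only governs how often progress is checkpointed.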