Ali Ghodsi, Databricks - #SparkSummit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE. Covering Spark Summit 2017. Brought to you by Databricks. (upbeat music) >> Welcome back to theCUBE, day two at Spark Summit. It's very exciting. I can't wait to talk to this gentleman. We have the CEO of Databricks, Ali Ghodsi, joining us. Ali, welcome to the show. >> Thank you so much. >> David: Well, we sat here and watched the keynote this morning with Databricks, and you delivered some big announcements. Before we get into some of that, I want to ask you: it's been about a year and a half since you transitioned from VP of Products and Engineering into the CEO role. What's the most fun part of that, and maybe what's the toughest part? >> Oh, I see. That's a good question, and that's a tough question too. The most fun part is... you know, you touch many more facets of the business. In engineering, it's all the tech and you're dealing only with engineers, mostly. Customers are one hop away; there's a product management layer between you and the customers. So you're very inward-focused. As a CEO you're dealing with marketing, finance, sales, these different functions. And then, externally, with media, with stakeholders, a lot of customer calls. There are many, many more facets of the business that you're seeing. And it also gives you a perspective that you couldn't have before. You see how the pieces fit together, so you actually can have a better perspective and see further out than you could before. Before, I was in more of a siloed situation where I was seeing just the things relating to engineering, so that's the best part. >> You're obviously working closely with customers. You introduced a few customers this morning up on stage. But after the keynote, did you hear any reactions from people? What are they saying? >> Yes, the keynote was recent, so on my way here I've had multiple people sort of... a couple of people high-fived me just before I got up on stage here. On serverless, people are really excited about that. Less DevOps, less configuration, let them focus on the innovation; they want that. So that's something that's celebrated. Yesterday-- >> Recap that real quickly for our audience here, what the serverless offering is. >> Absolutely, so it's very simple. We want lots of data scientists to be able to do machine learning without having to worry about the infrastructure underneath it. So we have something called serverless pools, and with serverless pools you can just have lots of data scientists share them. Under the hood, this pool of resources shrinks and expands automatically. It adds storage, if needed. And you don't have to worry about the configuration of it. And it also makes sure that it's isolating the different data scientists. So if one data scientist happens to run something that takes much more resources, it won't affect the other data scientists that are sharing the pool. So the short story of it is: you cut costs significantly, you can now have 3,000 people share the same resources, and it enables them to move faster because they don't have to worry about all the DevOps that they otherwise would have to do. >> George, is that a really big deal? >> Well, we know whenever there's infrastructure that gets between a developer or data scientist and their outcomes, that's friction. I'd be curious to say, let's put that into a bigger perspective, which is: if you go back several years, what were the class of apps that Spark was being used for, and in conjunction with what other technologies?
Then bring us forward to today, and then maybe look out three years. >> Ali: Yeah, that's a great question. So from the very beginning, data is key for any of these predictive analytics that we are doing. So that was always a key thing. But back then we saw more Hadoop data lakes. There were data lakes, data reservoirs, data marts that people were building out. We saw also a lot of traditional data warehousing. These days, we see more and more things moving to the cloud. The Hadoop data lake, oftentimes at enterprises, is being transformed into cloud blob storage. That's cheaper, it's geo-replicated, it's on many continents. That's something that we've seen happen. And we work across any of these, frankly. From the very beginning, one of Spark's strengths is that it integrates really well wherever your data is. And there's a huge community of developers around it, over 1,000 people now that have contributed to it. Many of these people are in other organizations; they're employed by other companies, and their job is to make sure that Databricks or Spark works really, really well with, say, Cassandra or with S3. That's a shift that we're seeing. In terms of applications people are building, it's moving more into production. Four years ago much more of it was interactive, exploratory. Now we're seeing production use cases. The fraud analytics use case that I mentioned, that's running continuously, and the requirements there are different. You can't go down for ten minutes on a Saturday morning at 4 a.m. when you're doing credit card fraud detection, because that's a lot of fraud and that affects the business of, say, Capital One. So that's much more crucial for them. >> So what would be the surrounding infrastructure and applications to make that whole solution work? Would you plug into a traditional system of record at the sales order entry kind of process point? Are you working off sort of semi-real-time or near-real-time data? And did you train the models on the data lake? How do the pieces fit together? >> Unfortunately the answer depends on the particular architecture that the customer has. Every enterprise is slightly different. But it's not uncommon that the data is coming in, and they're using Spark Structured Streaming in Databricks to get it into S3, so that's one piece of the puzzle. Then when it ends up there, from then on it funnels out to many different use cases. It could be a data warehousing use case, where they're just using interactive SQL on it. So that's the traditional interactive use case. But it could be a real-time use case, where it's actually taking the data that it's processed, detecting anomalies, and putting triggers in other systems, and then those systems downstream will react to those triggers for anomalies. But it could also be that it's periodically training models and storing the models somewhere. Oftentimes it might be in a Cassandra, or in a Redis, or something of that sort. It will store the model there, and then some web application can take it from there, do point queries to it and say okay, I have a particular user that came in -- here's George now -- quickly look up his feature vector, figure out which product recommendations we should show to this person, and then it takes it from there. >> So in those cases, Cassandra or Redis, they're playing the serving layer. But generating the prediction model is coming from you, and they're just doing the inferencing, the prediction itself.
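To make the pipeline Ali describes concrete, here is a minimal PySpark sketch of the ingestion step: Structured Streaming lands incoming events in S3, and the interactive SQL, anomaly detection, and model training jobs all fan out from that location. The bucket paths, schema, and JSON source are illustrative assumptions, not details from the interview.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("ingest-to-s3").getOrCreate()

# Assumed schema for incoming transaction events.
schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read a stream of JSON events; the source directory is hypothetical.
events = (spark.readStream
          .schema(schema)
          .json("s3a://example-bucket/raw/transactions/"))

# Land parsed events in S3 as Parquet. Downstream consumers -- interactive
# SQL, anomaly detection, periodic model training -- all read from here.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/curated/transactions/")
         .option("checkpointLocation",
                 "s3a://example-bucket/checkpoints/transactions/")
         .start())
```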
>> So if you look out several years, without asking you the roadmap, which you can feel free to answer, how do you see that scope of apps expanding, or the share of an existing app like that? >> Yeah, there are two interesting trends that I believe in; I'll be foolish enough to make predictions. One is that I think that data warehousing, as we know it today, will continue to exist. However, it will be transformed, and all the data warehousing solutions that we have today will add predictive capabilities or they will disappear. So let me motivate that. If you have a data warehouse with customer data in it and a fact table, you have all your transactions there, you have all your products there. Today, you can plug in BI tools and, on top of that, see what's my business health today and yesterday. But you can't ask it: tell me about tomorrow. Why not? The data is there. Why can I not ask this customer data: you tell me which of these customers are going to churn, or which one of them should I reach out to because I can possibly upsell them? Why wouldn't I want to do that? I think everyone would want to do that, and every data warehousing solution in ten years will have these capabilities. Now with Spark SQL you can do that, and the announcement yesterday showed you also how you can take machine learning models and export them so a SQL analyst can just access them directly with no machine learning experience. It's just a simple function call and it just works. So that's one prediction I'll make. The second prediction I'll make is that we're going to see lots of revolutions in different industries, beyond the traditional 'get people to click on ads' and understanding social behavior. We're going to go beyond that. For those use cases it will be closer to the things I mentioned, like Shell, and what you need to do there is involve these domain experts. The domain experts will come in -- the doctors, or the machine specialists -- you have to involve them in the loop. And they'll be able to transform maybe much less exotic applications. It's not the super high-tech Silicon Valley stuff, but it's nevertheless extremely important to every enterprise, to every vertical, on the planet. That's, I think, the exciting part of where predictions will go in the next decade or two. >> If I were to try and pick out the most man-bites-dog kind of observation in there -- you know, it's supposed to be the unexpected thing -- I would say where you said all data warehouses are going to become predictive services. Because what we've been hearing is sort of the other side of that coin, which is all the operational databases will get all the predictive capabilities. But you said something very different. I guess my question is, are you seeing the advanced analytics going to the data warehouse because the repository of data is going to be bigger there, so you can build better models, or because it's not burdened with transaction SLAs, so you can serve up predictions quicker? >> Data warehousing has been about basic statistics. SQL, the language that is used, is there to get descriptive statistics: tables with averages and medians, that's statistics. Why wouldn't you want to have advanced statistics, which now does predictions, on it? It just so happens that SQL alone is not the right interface for that. So it's going to be very natural for people who have already been asking statistical questions for the last 30 years of their customer data, these massive troves of data that they have stored, to also say: 'Okay, now give me more advanced statistics. I'm not an expert on advanced statistics, but you, the system, tell me what I should watch out for. Which of these customers do I talk to? Which of the products are in trouble? Which parts of my business are not doing well right now? Predict the future for me.'
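As a rough, hedged illustration of the pattern Ali describes -- train a model with Spark ML, then let a SQL analyst query its output like any other table -- here is a minimal PySpark sketch. The table name, feature columns, and churn framing are assumptions for illustration, not the mechanism from the announcement.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-in-sql").getOrCreate()

# Hypothetical warehouse table with customer attributes and a numeric
# churn label (NULL where the outcome is not yet known).
customers = spark.table("warehouse.customers")

assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features")

train = assembler.transform(customers.filter("churned IS NOT NULL"))
model = LogisticRegression(labelCol="churned").fit(train)

# Score every customer and register the result as a view, so an analyst
# can query predicted churn with ordinary SQL -- no ML experience needed.
scored = model.transform(assembler.transform(customers))
scored.createOrReplaceTempView("customer_churn_scores")

spark.sql("""
    SELECT customer_id, prediction AS likely_to_churn
    FROM customer_churn_scores
    WHERE prediction = 1.0
""").show()
```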
>> George: When you're doing that, though, you're now doing it on data that has a fair amount of latency built into it, because that's how it got into the data warehouse. Whereas if it's in the operational database, it's really low-latency, typically low-latency stuff. Where and why do you see that distinction? >> I do think also that we'll see more and more real-time engines take over. If you do things in real time you can do it for a fraction of the cost. So we'll also see those capabilities come in. So you don't have to... your question is, why would you want to batch everything once a week into a central warehouse, and I agree with that. It will be streaming in live, and then on that you can do predictions, you can do basic analytics. I think basically the lines will blur between all these technologies that we're seeing. In some sense, Spark actually was the precursor to all that. Spark already was unifying machine learning, SQL, ETL, real-time, and you're going to see that appear everywhere. >> You mentioned Shell as an example, one of your customers; you also had HP, Capital One, and you developed this unified analytics platform that's solving some of their common problems. Now that you're in the mood to make predictions, what do you think are going to be the most compelling use cases or industries where you're going to see Databricks going in the future? >> That's a hard one. Right now, I think healthcare. There are a lot of data sets, there's a lot of gene sequencing data. They want to be able to use machine learning. In fact, I think those industries are being transformed slowly from using classical statistics into machine learning. We've actually helped some of these companies do that. We've set up workshops and they've gotten people trained. And now they're hiring machine learning experts that are coming in. So that's one, I think: the healthcare industry, whether it's for drug testing, clinical trials, even diagnosis. That's a big one. I do think industrial IoT is another. These are big companies with lots of equipment; they have tons of sensor data, massive data sets. There are a lot of predictions that they can do on that. So that's a second one, I would say. The financial industry, they've always been about predictions, so it makes a lot of sense that they continue doing that. Those are the biggest ones for Databricks. But I think now, as other verticals slowly move into the cloud, we'll see more of other use cases as well. But those are the biggest ones I see right now. It's hard to say where it will be ten years from now, or 15. Things are going so fast that it's hard to even predict six months out. >> David: Do you believe IoT is going to be a big business driver? >> Yes, absolutely. >> I want to circle back to where you said that we've got different types of databases but we're going to unify the capabilities. Without saying, it's not like one wins, one loses. >> Ali: Yes, I didn't want to do that. >> So describe maybe the characteristics of what a database that complements Spark really well might look like. >> That's hard for me to say. The capabilities of Spark, I think, are here to stay.
The ability to ETL a variety of data that doesn't have structure, where Structured Query Language, SQL, is not fit for it, that is really important, and it's going to become more important since data is the new oil, as they say. Well, then it's going to be very important to be able to work with all kinds of data and get that into these systems. There are more things every day being created -- devices, IoT, whatever it is -- that are spewing out this data in different forms and shapes. So being able to work with that variety, that's going to be an important property. So they'll have to do that. That's the ETL portion, or the ELT portion. The real-time portion: not having to do this in a batch manner once a week, because now time is a competitive advantage. So if I'm one week behind you, that means I'm going to lose out. So doing that in real time, or near real time, human real time, that's going to be really important. So that's going to come as well, I think, and people will demand that. That's going to be a competitive advantage. Wherever you can add that secret sauce, it's going to add value to the customers. And then finally the predictive stuff, adding the predictive stuff. But I think people will want to continue to also do all the old stuff they've been doing. I don't think that's going to go away. Those bring value to customers; they want to do all those traditional use cases as well. >> So what about now, where customers expect to have some, it's not clear how much, of an application platform like Spark on-prem, some in the cloud, now that you've totally reordered the TCO equation? But then also at the edge, for IoT-type use cases: do you have to slim down Spark to work at the edge? If you have serverless working in the cloud, does that mean you have to change the management paradigm on-prem? What does that mix look like? How does someone -- you know, how does a Fortune 200 company -- get their arms around that? >> Ali: Yeah, this is a surprising thing, the most surprising thing for me in the last year: how many of those Fortune 200s that I was talking to three years ago were saying 'no way, we're not going into the cloud. You don't understand the regulations that we are facing, or the amount of data that we have.' Or 'we can do it better,' or 'the security requirements that we have, no one can match that.' To now, those very same companies are saying 'absolutely, we're going.' It's not about if, it's about when. Now I would be hard-pressed to find any enterprise that says 'no, we're not going to go, ever.' And some companies we've even seen go from the cloud to on-prem, and then now back, because the prices are getting more competitive in the cloud. Because now there are three, at least, major players that are competing, and they're well-funded companies. In some sense, you have ad money and office money and retail money being thrown at this problem. Prices are getting competitive. Very soon, most IT folks will realize there's no way we can do this faster, or better, or more reliably or securely ourselves. >> David: We've got just a minute to go here before the break, so we're going to kind of wrap it up here. And we've got over 3,000 people here at Spark Summit, so it's the Spark community. I want you to talk to them for a moment. What problems do you want them to work on the most? And what are we going to be talking about a year from now at this table? >> The second one is harder. So I think the Spark community is doing a phenomenal job. I'm not going to tell them what to do.
They should continue doing what they are doing already, which is integrating Spark into the ecosystem, adding more and more integrations with the great technologies that are happening out there. Continue the innovation, and we're super happy to have them here. We'll continue it as well; we'll continue to host this event, and we look forward to also having a Spark Summit in Europe, and also on the East Coast soon. >> David: Okay, so I'm not going to ask you to make any more predictions. >> Alright, excellent. >> David: Ali, this is great stuff today. Thank you so much for taking some time and giving us more insight after the keynote this morning. Good luck with the rest of the show. >> Thank you. >> Thanks, Ali. And thank you for watching. That's Ali Ghodsi, CEO of Databricks. We are at Spark Summit 2017 here, on theCUBE. Thanks for watching, stay with us. (upbeat music)
Reynold Xin, Databricks - #SparkSummit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad here with George Gilbert. George. >> Good to be here. >> Thanks for hanging with us. Well, here's the other man of the hour here. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit. There are a lot of interesting things, a lot of interesting people who I meet. >> Well, I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with it for a long time, right? >> Yes, I've been contributing to Spark for about five or six years, and that's probably the largest number of commits to the project, and lately I'm working more with other people to help design the roadmap for both Spark and Databricks. >> Well, let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments? >> So, I think in general, if we look at Spark, there are three directions I would say we are doubling down on. The first direction is deep learning. Deep learning is extremely hot and it's very capable, but as we alluded to earlier in a blog post, deep learning has reached sort of a mass-production point in which it shows tremendous potential but the tools are very difficult to use. And we are hoping to democratize deep learning and do what Spark did to big data, to deep learning, with this new library called Deep Learning Pipelines. What it does is integrate different deep learning libraries directly in Spark, and it can actually expose models in SQL, so even business analysts are capable of leveraging them. So that's one area: deep learning. The second area is streaming. Streaming, again, I think a lot of customers have aspirations to actually shorten the latency and increase the throughput in streaming. So the Structured Streaming effort is going to be generally available, and last month alone on the Databricks platform, I think our customers processed three trillion records using Structured Streaming. And we also have a new effort to actually push down the latency all the way to the sub-millisecond range, so you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing, I think, is a very mature area outside of the big data world, but from a big data point of view it's still pretty new, and there are a lot of use cases popping up there. And Spark, with approaches like the CBO and also, in the Databricks Runtime, with DBIO, we're actually substantially improving the performance and the capabilities of data warehousing features. >> We're going to dig into some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind, maybe, about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debuggability of big data jobs.
Many of them are fairly technical engineers, and some of them are less sophisticated engineers, and they have written jobs and sometimes the job runs slow. And so the performance engineer in me would think: how do I make the job run fast? A different way to actually solve that problem is: how can we expose the right information so the customers can actually understand and figure it out themselves -- this is why my job is slow, and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, debuggability. >> Debuggability. >> Reynold: And visibility, yeah. >> Alright, awesome. George. >> So, let's go back and unpack some of those kind of juicy areas that you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. And so DeepLearning4J and, I think, Intel's BigDL were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are in their native environments? >> Yeah, so this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras, and others. What the Deep Learning Pipelines library does is actually expose all these single-node deep learning tools, highly optimized for GPUs or CPUs, to be available as an estimator, like a module in the machine learning pipeline library in Spark. So now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. When you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out what is the best model I can get. This is where Spark really shines. When you combine Spark with some deep learning library, be it BigDL or MXNet or TensorFlow, you can use Spark to distribute that training and then do cross-validation on it, so you can actually find the best model very quickly. And Spark takes care of all the job scheduling, all the fault-tolerance properties, and how you read data in from different data sources. >> And without my dropping too much into the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all that now can be managed by Spark? >> In that library, Spark will be able to actually take care of picking the best model out of it. And there are different ways you can design how you define the best. The best could be some average of different models. The best could be just picking one of them. The best could be, maybe, a tree of models that you classify on. >> George: And that's a hyperparameter configuration choice? >> So that is actually built-in functionality in Spark's machine learning pipelines. And what we're doing now is you can actually plug all those deep learning libraries directly into that as part of the pipeline to be used.
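To make the hyperparameter search Reynold describes concrete, here is a minimal PySpark sketch using Spark ML's built-in cross-validation; the training-data path and grid values are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("hyperparameter-tuning").getOrCreate()

# Assumed: a prepared training set with "features" and "label" columns.
train = spark.read.parquet("s3a://example-bucket/training-data/")

lr = LogisticRegression(featuresCol="features", labelCol="label")

# 5 x 5 x 4 = 100 candidate configurations, on the order of the "over a
# hundred experiments" mentioned above; Spark schedules every run.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.001, 0.01, 0.1, 0.5, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
        .addGrid(lr.maxIter, [10, 50, 100, 200])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

best_model = cv.fit(train).bestModel  # Spark picks the best configuration
```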
Another thing, maybe, just to add -- >> Yeah, yeah. >> Another really cool functionality of Deep Learning Pipelines is transfer learning. So as you said, deep learning takes a very long time; it's very computationally demanding, and it takes a lot of resources and expertise to train. But with transfer learning, what we allow customers to do is take an existing deep learning model that was trained on a different domain, retrain it on a very small amount of data very quickly, and adapt it to a new domain. That's roughly how the demo with the James Bond car worked. There's a general image classifier, and we retrained it on probably just a few thousand images. And now we can actually detect whether a car is James Bond's car or not.
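To ground the transfer-learning idea, here is a hedged sketch in the spirit of the Deep Learning Pipelines library under discussion; the sparkdl names, the choice of pre-trained model, and the paths follow the library's early public examples, but they should be treated as assumptions rather than the confirmed API.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import col, when
from sparkdl import DeepImageFeaturizer, readImages  # assumed sparkdl API

# Assumed: image files whose paths encode the class of interest.
images = readImages("s3a://example-bucket/car-images/")
labeled = images.withColumn(
    "label", when(col("filePath").contains("bond"), 1.0).otherwise(0.0))

# InceptionV3 was pre-trained on millions of generic images; here it only
# extracts features, so no expensive deep-network training is needed --
# just a small classifier fit on a few thousand labeled examples.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
classifier = LogisticRegression(maxIter=20, labelCol="label")

model = Pipeline(stages=[featurizer, classifier]).fit(labeled)
```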
>> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model of a similar situation. I want to, in the time we have -- there's always been this debate about whether Spark should manage state, whether it's a database or key-value store. Tell us how the thinking about that has evolved, and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define, but the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantages of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of building on top of existing storage systems. There are actually a lot of opportunities for optimization in how you can leverage the specific properties of the underlying storage system to get maximum performance. For example, how are you doing intelligent caching, and how do you start thinking about building indexes against the data that's stored, for scan workloads. >> With Tungsten, you take advantage of the latest hardware, and as we get more memory-intensive systems, and now that the Catalyst optimizer has a cost-based optimizer, or will, and large memory -- can you change how you go about knowing what data you're managing in the underlying system and therefore achieve a tremendous acceleration in performance? >> This is actually one area we invested in with the DBIO module as part of Databricks Runtime, and what DBIO does -- a lot of this is still in progress -- but for example, we're adding some form of indexing capability to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups, or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're actually adding indexing functionalities on top of it as part of DBIO. >> And so what would be the application profiles? Is it just for the analytic queries, or can you do the point look-ups and updates in that sort of scenario too? >> So it's interesting you're talking about updates. Updates are another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize, for both use cases of doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates, rather than doing transactional updates in which every time you swipe a credit card some record gets updated. That probably belongs more in the transactional databases, like Oracle or MySQL even. >> What about when you think about people who started out with Spark on-prem, and they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT-type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like? >> Really interesting; it's kind of two questions, maybe. The first is the hybrid on-prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute. So when you want to move, for example, workloads from on-prem to the cloud, the one you care the most about is probably actually the data, because the compute, it doesn't really matter that much where you run it, but data is the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on-prem, relying on the caching solution we have to minimize the data transfer over time. And that is one route, I would say, that's pretty popular. Another one is, with Amazon, you can literally use something like the Snowball service: you give them hard drives, the trucks will ship your data directly, and it's put into S3. With IoT, a common pattern we see is a lot of the edge devices pushing the data directly into some firehose like Kinesis or Kafka -- and I'm sure Google and Microsoft both have their own variants of that -- and then you use Spark to directly subscribe to those topics and process them in real time with Structured Streaming.
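A minimal sketch of the subscribe-and-process pattern Reynold describes, using Spark Structured Streaming's Kafka source; the broker address, topic, and event schema are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

# Assumed schema for device telemetry.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Subscribe to the topic the edge devices publish to (names are assumed).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "device-events")
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema)
                     .alias("e"))
          .select("e.*"))

# Near-real-time per-device aggregation over one-minute windows.
per_device = (events
              .withWatermark("event_time", "5 minutes")
              .groupBy(window("event_time", "1 minute"), "device_id")
              .avg("reading"))

query = (per_device.writeStream
         .outputMode("append")
         .format("console")
         .start())
```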
>> And so would Spark be down, let's say, at the site level, if it's not on the device itself? >> It's an interesting thought, and maybe one thing we should actually consider more in the future is how we push Spark to the edges. Right now it's more of a centralized model, in which the devices push data into Spark, which is centralized somewhere. I've seen, for example -- I don't remember the exact use case, but it had to do with some scientific experiment in the North Pole. And of course there you don't have a great uplink for transferring all the data back to some national lab, so rather they would do smart parsing there and then ship the aggregated results back. There's another one, but it's less common. >> Alright, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on, for the benefit of everybody? >> In general, Spark came along with two focuses. One is performance; the other one is ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think is actually pretty usable. The systems are fast enough. Now we should work on actually making them (mumbles) even easier to use. That's also what we focus a lot on at Databricks here. >> David: Democratizing access, right? >> Absolutely. >> Alright, well, Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show. >> Thank you very much, David and George. >> Thank you all for watching. We're here at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.
Day 2 Kickoff - #SparkSummit - #theCUBE
[Narrator] Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome to theCUBE. My name is David Goad, and I'm your host, and we are here at day two of Spark Summit, and I am flanked by a couple of consultants here from-- sorry, analysts from Wikibon. I've got to get this straight. To my left we have Jim Kobielus, who is our lead analyst for Data Science. Jim, welcome to the show. >> Thanks, David. >> And we also have George Gilbert, who is the lead analyst for Big Data and Analytics. I'll get this right eventually. So why don't we start with Jim. Jim, just kicking off the show here today, we wanted to get some preliminary thoughts before we really jump into the rest of the day. What are the big themes that we're going to hear about? >> Yeah, today is the enterprise day at Spark Summit, so Spark for the enterprise. Yesterday was focused on Spark: the evolution, the extension of Spark to support native development of deep learning, as well as speeding up Spark to support sub-millisecond latencies. But today it's all about Spark and the enterprise -- really what I call wrapping DevOps around Spark, making it more productionizable, supportable. The Databricks Serverless announcement -- though it was announced yesterday, the press release went up -- they're going into some depth right now in the keynote about serverless, and really serverless is all about providing an in-cloud Spark, essentially a sandbox for teams of developers to scale up and scale out enough resources to do the modeling, the training, the deployment, the iteration, the evaluation of Spark jobs in essentially a 24-by-seven, multi-tenant, fully supported environment. So it's really about driving this continuous Spark development and iteration process into a 24-by-seven model in the enterprise. What's really happening is that data scientists and Spark developers are becoming an operational function. Businesses are building strategic infrastructure -- things like recommendation engines -- and e-commerce environments absolutely demand 24-by-seven, resilient, team-based Spark collaboration environments, which is really what the serverless announcement is all about. >> David: So getting increasing demand on mission-critical problems, so that optimization is a big deal. >> Yeah, data science is not just an R&D function, it's an operational IT function as well. So that's what it's all about. >> David: Awesome. Well, let's go to George. I saw you watching the keynote -- I think still watching it again this morning -- taking notes feverishly. What were some of the things that stuck out to you from the keynote speaker this morning?
Another thing that we'll want to be asking about is, keying off what Jim was saying, now that this becomes not a managed service where you just take the labor that the end customer was applying to get the thing running but it's now automated and you don't even know the infrastructure. We'll want to know what does that mean for the edge, you know, where we're doing analytics close to internet of things and people and sort of if there has to be a new configuration of Spark to work with that. And then of course what do we do about the whole data science process and the dev-ops for data science when you have machine learning distributed across the cloud and edge and On-Prem. >> Jim: In fact, I know we have Pepperdata coming on right after this, who might be able to talk about that exact dev-ops in terms of performance optimization into distributed Spark environment, yeah. >> George, I want to follow up with that. We had Matt Fryer from Hotels.com, he's going to be on our show later but he was on the key note stage this morning. He talked about going all cloud, all Spark, and how data science is even competitive advantage for Hotels.com. What do you want to dig into when we get him on the show? >> That's a really good question because if you look at business strategy, you don't really build a sustainable advantage just by doing one thing better than everyone else. That's easier to pick off. The sustainable strategic advantages come from not just doing one thing better than everyone else but many things and then orchestrating their improvement over time and I'd like to dig into how they're going to do that. 'Cause remember Hotels.com it's the internet equivalent descendant of the original travel reservation systems, which did confer competitive advantage on the early architects and deployers of that technology. >> Great and then Pepperdata wanted to come back and we're going to have them on the show here in just a moment. What would you like to learn from them? What do you think will benefit the community the most? >> Jim: Actually, keying off something George said, I'd like to get a sense for how you optimize Spark deployments in a radically distributed IOT edge environment. Whether they've got any plans, or what their thoughts are in terms of the challenges there. As more the intelligence gets pushed to the edge much of that will be on machine learning and deep learning models built into Spark. What are the challenges there? I mean, if you've got thousands to millions of end points that are all autonomious and intelligent and they're all running Spark, just what are the orchestration requirements, what are the resource management requirements, how do you monitor end-to-end in and environment like that and optimize the passing of data and the transfer of the control flow or orchestration across all those dispersed points. >> Okay, so 30 seconds now, why should the audience tune into our show today? What are they going to get? >> I think what they're going to get is a really good sense for how the emerging best practices for optimizing Spark in a distributed fog environment out to the edge where not just the edge devices but everything, all nodes, will incorporate machine learning and deep learning. They'll get a sense for what's been done today, what's the tooling is to enable dev-ops in that kind of environment. As well as, sort of the emerging best practices for compressing more of these algorithms and the data itself as well as doing training in a theoretically federated environment. 
I'm hoping to hear from some of the vendors who are on the show today. >> David: Fantastic. And George, closing thoughts on the opening segment? 30 seconds. >> Closing thoughts on the opening segment. Like Jim, we want to think about Spark holistically, and it has traditionally been best positioned as sort of this -- as Matei acknowledged yesterday -- sort of this offline branch of analytics that you apply to data in sort of a repository that you've accumulated, and now we want to see it put into production. But to do that you need more than just what Spark is today. You need basically a database or key-value kind of option so that you're storing your work as it goes along, so you can go back and analyze it with either simple analysis or complex analysis. So I want to hear about that. I want to hear about their plans for IoT. Spark is kind of a heavyweight environment, so you're probably not going to put it in the boot of your car, or at least not likely anytime soon. >> Jim: The intelligent edge. I mean, Microsoft Build a few weeks ago was really deep on the intelligent edge. HP, whose show we're doing actually -- I think it's in Vegas, right? They're also big on the intelligent edge. In fact, we had somebody on the show yesterday from HP going into some depth on that. I want to hear what Databricks has to say on that theme. >> Yeah, and which part of the edge: is it the gateway, the edge gateway, which is really a slimmed-down server, or the edge device, which could be a 32-bit card with a meg of RAM on the network? >> Yeah. >> All right, well, gentlemen, appreciate the little insight here before we get started today, and we're just getting started. Thank you both for being on the show, and thank you for watching theCUBE. We'll be back in a little while with the CEO of Databricks. Thanks for watching. (upbeat music)
Dr. Jisheng Wang, Hewlett Packard Enterprise, Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> You are watching theCUBE at Spark Summit 2017. We continue our coverage here, talking with developers, partners, customers, all things Spark, and today we're honored to have our next guest, Dr. Jisheng Wang, who's the Senior Director of Data Science in the CTO Office at Hewlett Packard Enterprise. Dr. Wang, welcome to the show. >> Yeah, thanks for having me here. >> All right, and also to my right we have Mr. Jim Kobielus, who's the Lead Analyst for Data Science at Wikibon. Welcome, Jim. >> Great to be here, like always. >> Well, let's jump into it. First I want to ask about your background a little bit. We were talking about the organization; maybe you could do a better job (laughs) of telling me where you came from, and you just recently joined HPE. >> Yes. I actually recently joined HPE earlier this year through the Niara acquisition, and now I'm the Senior Director of Data Science in the CTO Office of Aruba. Aruba, you probably know, like two years back HP acquired Aruba as a wireless networking company, and now Aruba takes charge of the whole enterprise networking business in HPE, which is now about three billion in annual revenue. >> Host: That's not confusing at all. I can follow you (laughs). >> Yes, okay. >> Well, all I know is you're doing some exciting stuff with Spark, so maybe tell us about this new solution that you're developing. >> Yes. Actually, most of my experience with Spark goes back to the Niara time. Niara was a three-and-a-half-year-old startup that reinvented enterprise security using big data and data science. So the problem we tried to solve in Niara is called UEBA: user and entity behavioral analytics. I'll just try to be very brief here. Most of the traditional security solutions focus on detecting attackers from outside, but what if the origin of the attacker is inside the enterprise -- say Snowden -- what can you do? You've probably heard of many cases today of employees leaving the company and stealing lots of the company's IP and sensitive data. So UEBA is a new solution that tries to monitor the behavioral change of enterprise users to detect both this kind of malicious insider and also compromised users. >> Host: Behavioral analytics. >> Yes, so it's analytics, but we run it like a product. >> Yeah, and Jim, you've done a lot of work in the industry on this, so any questions you might have for him around UEBA? >> Yeah, give us a sense for how you're incorporating streaming analytics and machine learning into that UEBA solution, and then where Spark fits into the overall approach that you take. >> Right, okay. So actually, when we started three and a half years back, for the first version of the data pipeline we used a mix of Hadoop, YARN, Spark, even Apache Storm for different kinds of stream and batch analytics work. But soon after, with the increased maturity and also the momentum of the open-source Apache Spark community, we migrated all our stream and batch work -- you know, the ETL and data analytics work -- into Spark. And it's not just Spark. It's Spark, Spark Streaming, MLlib, the whole ecosystem of that. So there are at least a couple of advantages we have experienced through this kind of transition. The first thing which really helped us is the simplification of the infrastructure and also the reduction of the DevOps effort there.
>> So simplification around Spark, the whole stack of Spark that you mentioned. >> Yes. >> Okay. >> So for the Niara solution, originally -- and even here today -- we supported both on-premise and cloud deployment. For the cloud, we supported public clouds like AWS and Microsoft Azure, and also private cloud. So you can understand: if we have to maintain a stack of different open-source tools over these many different deployments, the overhead of doing the DevOps work -- monitoring, alarming, debugging this kind of infrastructure over different deployments -- is very hard. So Spark provides us a unified platform. We can integrate the streaming, you know, batch, real-time, near-real-time, or even long-term batch jobs all together. So that heavily reduced both the expertise and also the effort required for the DevOps. This is one of the biggest advantages we experienced, and certainly we also experienced things like the scalability, the performance, and also the convenience for developers to develop new applications, all of this, from Spark. >> So are you using the Spark Structured Streaming runtime inside of your application? Is that true? >> We actually use Spark in the streaming processing. So like in the UEBA solution, the first thing is collecting a lot of the data from different data sources: account data, network data, cloud application data. So when the data comes in, the first thing is a streaming job for the ETL, to process the data. Then after that, we also developed different-frequency -- like one-minute, 10-minute, one-hour, one-day -- analytics jobs on top of that. And even recently we have started some early adoption of deep learning into this: how to use deep learning to monitor the user behavior change over time, especially after a user gives notice -- is the user going to access more servers or download some of the sensitive data? So all of this requires a very complex analytics infrastructure.
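To make the kind of pipeline Dr. Wang describes concrete, here is a hedged PySpark sketch of one of those periodic behavioral aggregations; the event source, schema, paths, and window choice are illustrative assumptions, not details of Niara's implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, TimestampType)

spark = SparkSession.builder.appName("ueba-profiles").getOrCreate()

# Assumed: parsed authentication / file-access events in cloud storage.
schema = StructType([
    StructField("user", StringType()),
    StructField("action", StringType()),
    StructField("bytes", LongType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .schema(schema)
          .parquet("s3a://example-bucket/security-events/"))

# One of several frequencies (one-minute, 10-minute, one-hour, one-day)
# such a pipeline might compute: an hourly per-user activity profile.
profiles = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(F.window("event_time", "1 hour"), F.col("user"))
            .agg(F.count("*").alias("actions"),
                 F.sum("bytes").alias("bytes_moved")))

# A downstream scoring job would compare these profiles against each
# user's historical baseline and alert on large deviations.
query = (profiles.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3a://example-bucket/user-profiles/")
         .option("checkpointLocation",
                 "s3a://example-bucket/checkpoints/ueba/")
         .start())
```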
>> Okay, we've talked about the UEBA solution, but let's go back to the broader HPE perspective. You have this concept called the intelligent edge; what's that all about? >> So that's a very cool name. Let me step back a little bit. I come from an enterprise background, and enterprise applications actually lag behind consumer applications in adopting new data science technology. There are some inherent challenges there. For example, collecting and storing large amounts of sensitive enterprise data is a huge concern, especially in European countries. For similar reasons, when you develop enterprise applications you normally lack training data of good quantity and quality. So these are inherent challenges in developing enterprise applications, but despite this, HPE and Aruba recently made several acquisitions of analytics companies to accelerate the adoption of analytics across different product lines. The intelligent edge actually comes from IoT, the internet of things, which is expected to be the fastest-growing market in the next few years. >> So are you going to be integrating the UEBA behavioral analytics and Spark capability into your IoT portfolio at HP? Is that a strategy or direction for you? >> Yes, in the big picture it certainly is. Some of the Gartner reports expect the number of IoT devices to grow to over 20 billion by 2020. All of these IoT devices are connected to either an intranet or the internet, over wire or wireless, so as a networking company we have the advantage of collecting the data and even taking action in the first place. The idea of the intelligent edge is that we want to turn each of these small IoT devices, the IP cameras, the motion detectors, into distributed sensors for data collection and also inline actors that can make real-time or near-real-time decisions. Behavioral anomaly detection is a very good example here: if an IoT device is compromised, if an IP camera has been compromised and is being used to steal your internal data, we should detect and stop that in the first place. >> Can you tell me about the challenges of putting deep learning algorithms natively on resource-constrained endpoints in the IoT? That must be really challenging to get them to perform well, considering there may be just a little bit of memory or flash capacity on the endpoints. Any thoughts about how that can be done effectively and efficiently? >> Very good question. >> And at low cost. >> Yes, very good question. There are two aspects to this. First, the global training of the intelligence is not going to be done on each device; in that case each device acts more as a sensor for data collection. We collect the data, send it to the cloud, and use a giant pool of computing resources to train the classifier, to train the model. But once we have trained the model, we ship the model down, so the inference, the detection of those behavioral anomalies, really happens on the endpoint. >> Do the training centrally and then push the trained models down to the edge devices. >> Yes.
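A minimal sketch of that split follows: fit the behavioral model centrally on pooled data, ship only the small fitted artifact to the device, and score events locally. scikit-learn stands in here purely to keep the example short (Niara's deep learning work used TensorFlow); the feature vectors and threshold are hypothetical.

```python
# Central training / edge inference, sketched with scikit-learn as a
# stand-in for the real framework. Feature vectors are hypothetical.
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# --- Central side: fit an anomaly detector on pooled behavior features ---
X_train = np.random.rand(10000, 8)            # stand-in for real features
model = IsolationForest(n_estimators=100, contamination=0.01)
model.fit(X_train)
joblib.dump(model, "anomaly_model.joblib")    # small artifact pushed to devices

# --- Edge side: load the artifact, score events as they occur ---
edge_model = joblib.load("anomaly_model.joblib")
event = np.random.rand(1, 8)                  # one observed behavior vector
if edge_model.predict(event)[0] == -1:        # -1 means anomalous
    print("anomaly: flag or block this device/session")
```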
>> But even then, as you said, for some devices, say the small chips people try to put in a spoon in a hospital to make it more intelligent, you cannot fit even just the detection piece there. So we are also looking at new technology. I know Caffe recently released some lightweight deep learning models, and, as you probably know, there are also improvements coming from the chip industry... >> Jim: Yes. >> ...on how to optimize chip design for these more analytics-driven tasks. So we are looking at all these different areas now. >> We have just a couple of minutes left, and Jim, you get one last question after this, but I've got to ask you: what's on your wishlist? What did you come to Spark Summit hoping to take away? >> I've always considered myself a technical developer. One thing I'm very excited about these days is the emergence of new technology: Spark, TensorFlow, Caffe, even BigDL, which was announced this morning. That's the first thing when I come to a big industry event like this: I want to learn the new technology. The second thing is to share our experience adopting these new technologies, and also to learn from colleagues in different industries how people change lives and disrupt old industries by taking advantage of them. >> The community's growing fast. I'm sure you're going to find what you're looking for. And Jim, final question? >> Yeah, I heard you mention DevOps and Spark in the same context, and that's a huge theme we're seeing: more DevOps being wrapped around the lifecycle of development, training, and deployment of machine learning models. If you could have your ideal DevOps tool for Spark developers, what would it look like? What would it do, in a nutshell? >> I can just share my personal experience. At Niara we actually developed a lot of in-house DevOps tools. For example, when you run a lot of different Spark jobs, stream and batch, a one-minute batch versus a one-day batch job, how do you monitor the status of those workflows? How do you know when the data stops coming? How do you know when a workflow fails? Monitoring is a big thing, and then alarming: when something fails or goes wrong, how do you raise an alarm? And debugging is another big challenge. So I certainly see growing effort from both Databricks and the community on different aspects of that. >> Jim: Very good.
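As an illustration of the kind of in-house watchdog being described, here is a minimal sketch built on PySpark's structured-streaming query status APIs: poll every active query, alarm when one has failed or stopped making progress. The alert() helper and the stall threshold are hypothetical stand-ins for a real paging or alerting system.

```python
# Watchdog sketch: detect failed or stalled structured-streaming queries.
# The alert() helper and STALL_SECONDS threshold are hypothetical.
import time

STALL_SECONDS = 300  # alarm if no new batch completes for 5 minutes

def alert(msg: str) -> None:
    print(f"[ALARM] {msg}")  # stand-in: wire this to email/paging instead

def watch(spark) -> None:
    last_seen = {}  # query id -> (last batchId, time it last changed)
    while True:
        for q in spark.streams.active:
            if q.exception() is not None:      # the query has failed
                alert(f"query {q.name} failed: {q.exception()}")
                continue
            progress = q.lastProgress           # dict, or None before 1st batch
            batch = progress["batchId"] if progress else -1
            prev_batch, prev_t = last_seen.get(q.id, (None, time.time()))
            if batch != prev_batch:             # progress observed: reset clock
                last_seen[q.id] = (batch, time.time())
            elif time.time() - prev_t > STALL_SECONDS:
                alert(f"query {q.name} made no progress in {STALL_SECONDS}s")
        time.sleep(30)
```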
>> All right, so I'm going to ask you for kind of a soundbite summary. I'm going to put you on the spot here: you're in an elevator, and I want you to answer this one question. Spark has enabled me to do blank better than ever before. >> Certainly. As I explained before, it has helped a lot, both for developers and for startups trying to disrupt an industry. And I'm really excited to see this deep learning integration and the different roadmap items down the road. I think they're on the right track. >> All right. Dr. Wang, thank you so much for spending some time with us. We appreciate it, and go enjoy the rest of your day. >> Yeah, thanks for having me. >> And thank you for watching theCUBE. We're here at Spark Summit 2017. We'll be back after the break with another guest. (easygoing electronic music)