Ziya Ma, Intel | Big Data SV 2018
>> Live from San Jose, it's theCUBE! Presenting Big Data Silicon Valley, brought to you by SiliconANGLE Media and its ecosystem partners.
>> Welcome back to theCUBE, our continuing coverage of our event, Big Data SV. I'm Lisa Martin with my co-host George Gilbert. We're down the street from the Strata Data Conference, hearing a lot of interesting insights on big data: peeling back the layers, looking at opportunities, some of the challenges and barriers to overcome, but also the plethora of opportunities that enterprises can take advantage of. Our next guest is no stranger to theCUBE; she was just on with me a couple days ago at the Women in Data Science Conference. Please welcome back to theCUBE Ziya Ma, Vice President of the Software and Services Group and Director of Big Data Technologies at Intel. Hi Ziya!
>> Hi Lisa.
>> Long time, no see.
>> I know, it was really just two to three days ago.
>> It was. Well, now I can say happy International Women's Day.
>> The same to you, Lisa.
>> Thank you. It's great to have you here. So as I mentioned, we are down the street from the Strata Data Conference, and you've been up there over the last couple of days. What are some of the things that you're hearing with respect to big data? Trends, barriers, opportunities?
>> Yeah, so first it's very exciting to be back at the conference again. The one biggest trend, the one topic that was hit really hard by many presenters, is the power of bringing big data systems and data science solutions together. We're definitely seeing, in the last few years, the advancement of big data and the advancement of data science, you know, machine learning and deep learning, truly pushing forward business differentiation and improving our quality of life. So that's definitely one of the biggest trends. Another thing I noticed is there was a lot of discussion of big data and data science getting deployed into the cloud: what are the learnings, what are the use cases? I think that's another noticeable trend. And also, there were some presentations on doing data science, or having business intelligence, on edge devices. That's another noticeable trend. And of course, there was discussion of security and privacy for data science and big data, so that continued to be one of the topics.
>> So we were talking earlier, 'cause there are so many concepts and products to get your arms around. If someone is looking at AI and machine learning on the back end (we'll worry about edge intelligence some other time), we know that Intel has the Xeon CPU and the lower-power Atom. There's the GPU, there are ASICs and FPGAs, and then there are these software layers at higher abstraction levels. Help us put some of those pieces together for people who are saying: okay, I know I've got a lot of data, I've got to train these sophisticated models, explain this to me.
>> Right. So Intel is a complete solution provider for data science and big data. At the hardware level, George, as you mentioned, we offer a wide range of products, from general-purpose processors like Xeon to targeted silicon such as FPGAs and ASIC chips like Nervana. We also provide adjacencies like networking hardware and non-volatile memory; those are the other adjacent products that we offer. Now, on top of the hardware layer, we deliver a fully optimized software stack, from libraries and frameworks to tools and solutions.
The goal is to help engineers and developers create AI solutions with greater ease and productivity. For instance, we deliver the Intel-optimized Math Kernel Library, which leverages the latest instruction sets to give you a significant performance boost when you run your software on Intel hardware. We also deliver frameworks like BigDL, for Spark and big data customers who are looking for deep learning capabilities. And we optimize popular open source deep learning frameworks like Caffe, TensorFlow, MXNet, and a few others. So our goal is to provide all the necessary pieces so that, in the end, our customers can create the applications and solutions they really need to address their biggest pain points.
>> Help us think about the maturity level now. We know that the most sophisticated internet service providers have been all over this machine learning for quite a few years. Banks, insurance companies, people who've had statisticians and actuaries with that sort of skillset, are beginning to deploy some of these early production apps. Where are we in terms of getting this out to the mainstream? What are some of the things that have to happen?
>> To get it to mainstream, there are many things we could do. First, I think we will continue to see a wide range of silicon products, but there are a few things Intel is pushing. For example, we're developing the Nervana Graph compiler, which will encapsulate the hardware integration details and present a consistent API for developers to work with. This is one thing we hope can eventually help the developer community. We are also collaborating with end users from the enterprise segment. For example, we're working with the financial services industry, we're working with the manufacturing sector, and also with customers from the medical field and online retailers, trying to help them deliver or create data science and analytics solutions on Intel-based hardware and Intel-optimized software. That's another thing that we do, and we're seeing very good progress in this area. We're also collaborating with many cloud service providers. For instance, we work with some of the top seven cloud service providers, both in the U.S. and in China, to democratize not only our hardware but also our libraries and tools, BigDL, MKL, and other frameworks and libraries, so that our customers, including individuals and businesses, can easily access those building blocks from the cloud. So definitely we're working on multiple fronts.
>> So, last question in the last couple of minutes. Let's kind of vibe on this collaboration theme. Tell us a little bit about the collaboration that you're having with, as you mentioned, customers in some highly regulated industries, for example. Help us understand that symbiosis: what is Intel learning from your customers that's driving Intel's innovation of your technologies in big data?
>> That's an excellent question. So Lisa, maybe I can start by sharing a couple of customer use cases and the kinds of solutions we help our customers build. I think it's always wise not to start a conversation with the customer on the technology that you deliver. You want to understand the customer's needs first, so that you can provide a solution that really addresses their biggest pain point, rather than simply selling technology.
So for example, we have worked with an online retailer to better understand their customers' shopping behavior and to assess their customers' preferences and interests. Based upon that analysis, the online retailer made different product recommendations and maximized its customers' purchase potential, and it drove up the retailer's sales. That's one type of use case that we have worked on. We have also partnered with customers from the medical field. Actually, today at the Strata Conference we had a joint presentation with UCSF, where we helped the medical center automate the diagnosis and grading of meniscus lesions. Today that's all done manually by radiologists, but now the entire process is automated. The result is much more accurate, much more consistent, and much more timely, because you don't have to wait for the availability of a radiologist to read all the 3D MRI images; that can all be done by machines. So those are the areas where we work with our customers: understand their business need, and give them the solution they are looking for.
>> Wow, the impact there. I wish we had more time to dive into some of those examples, but we thank you so much, Ziya, for stopping by theCUBE twice in one week and sharing your insights. We look forward to having you back on the show in the near future.
>> Thanks. So thanks Lisa, thanks George, for having me.
>> And for my co-host George Gilbert, I'm Lisa Martin. We are live at Big Data SV in San Jose. Come down and join us for the rest of the afternoon; we're at this cool place called Forager Tasting and Eatery. We will be right back with our next guest after a short break. (electronic outro music)
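Ziya's point about instruction-set-level optimization is easy to see firsthand. The sketch below is an illustration, not an Intel benchmark: it checks whether your NumPy build is linked against MKL and times a large matrix multiply that dispatches to whatever BLAS library sits underneath. The matrix size and timing approach are arbitrary choices.

```python
# Check whether NumPy is backed by MKL, then time a BLAS-bound operation.
import time
import numpy as np

np.show_config()  # MKL-backed builds mention "mkl" in the BLAS/LAPACK sections

n = 4096  # arbitrary, but large enough that the BLAS kernel dominates
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.time()
c = a @ b  # dispatches to the underlying BLAS (MKL, OpenBLAS, ...)
print("%dx%d matmul took %.2f s" % (n, n, time.time() - start))
```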
Reynold Xin, Databricks - #Spark Summit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.
>> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad here with George Gilbert. George.
>> Good to be here.
>> Thanks for hanging with us. Well, here's the other man of the hour. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you?
>> I'm good. How are you doing?
>> David: Awesome. Enjoying yourself here at the show?
>> Absolutely, it's fantastic. It's the largest Summit: a lot of interesting things, a lot of interesting people to meet.
>> Well, I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time, right?
>> Yes, I've been contributing to Spark for about five or six years, and that's probably the largest number of commits to the project. Lately I'm working more with other people to help design the roadmap for both Spark and Databricks.
>> Well, let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard in the keynote this morning. What are some of the most exciting new developments?
>> So, I think in general, if we look at Spark, there are three directions I would say we are doubling down on. The first direction is deep learning. Deep learning is extremely hot and it's very capable, but as we alluded to earlier in a blog post, deep learning has reached a point where it shows tremendous potential but the tools are very difficult to use. We are hoping to democratize deep learning and do for deep learning what Spark did for big data, with this new library called Deep Learning Pipelines. What it does is integrate different deep learning libraries directly in Spark, and it can actually expose models in SQL, so even business analysts are capable of leveraging them. So that's one area, deep learning. The second area is streaming. Again, I think a lot of customers have aspirations to shorten the latency and increase the throughput of streaming. The Structured Streaming effort is going to be generally available, and last month alone on the Databricks platform, our customers processed three trillion records using Structured Streaming. We also have a new effort to push the latency all the way down to the millisecond range, so you can do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing is a very mature area from a traditional point of view, but from a big data one it's still pretty new, and there are a lot of use cases popping up there. With approaches like the cost-based optimizer, and in the Databricks runtime with DBIO, we're substantially improving the performance and the capabilities of data warehousing features.
>> We're going to dig in to some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind, maybe, about what to focus on next?
>> So, one thing I've heard from a few customers is actually visibility and debuggability of big data jobs.
Many of them are fairly technical engineers, and some of them are less sophisticated engineers, and they have written jobs, and sometimes the job runs slow. The performance engineer in me would think: how do I make the job run fast? A different way to solve that problem is to expose the right information so the customers can understand and figure it out themselves: this is why my job is slow, and this is how I can tweak it to make it faster. Rather than giving people the fish, you give them the tools to fish.
>> If you can call that bugability.
>> Reynold: Yeah, debuggability.
>> Debuggability.
>> Reynold: And visibility, yeah.
>> Alright, awesome. George.
>> So, let's go back and unpack some of those kind of juicy areas that you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. And so Deeplearning4j and, I think, Intel's BigDL were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are in their native environments?
>> Yeah, so this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras, and others. What the Deep Learning Pipelines library does is expose all these single-node deep learning tools, highly optimized for GPUs or CPUs, as an estimator, like a module in the machine learning pipeline library in Spark. So now users can leverage Spark's capability to, for example, do hyperparameter tuning. When you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to figure out the best model you can get. This is where Spark really shines. When you combine Spark with some deep learning library, be it BigDL or MXNet or TensorFlow, you can use Spark to distribute that training and then do cross-validation on it, so you can find the best model very quickly. And Spark takes care of all the job scheduling, all the fault tolerance properties, and how you read data in from different data sources.
>> And without my dropping too much into the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all that now can be managed by Spark?
>> In that library, Spark will be able to take care of picking the best model out of it. And there are different ways you can design how you define the best. The best could be some average of different models, the best could be just picking one out of the set, or maybe there's a tree of models that you classify on.
>> George: And that's a hyperparameter configuration choice?
>> So that is actually built-in functionality in Spark's machine learning pipeline. And now you can plug all those deep learning libraries directly into that, as part of the pipeline to be used.
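To make the tuning workflow Reynold describes concrete, here is a minimal PySpark sketch of distributed hyperparameter search with cross-validation. The column names, grid values, and input path are hypothetical, and in the deep learning case the estimator stage would be one of the wrapped frameworks rather than logistic regression; Spark schedules and evaluates the candidate models across the cluster either way.

```python
# Hyperparameter search with Spark ML: the grid of candidate models is
# trained and evaluated in parallel, and the best one is selected.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("cv-demo").getOrCreate()
df = spark.read.parquet("/data/train.parquet")  # hypothetical labeled data

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.maxIter, [10, 50])
        .build())  # six candidate configurations

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

best_model = cv.fit(df).bestModel  # Spark distributes training, picks the winner
```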
Another thing, maybe, just to add--
>> Yeah, yeah.
>> Another really cool functionality of the deep learning pipeline is transfer learning. As you said, deep learning takes a very long time; it's very computationally demanding, and it takes a lot of resources and expertise to train. But with transfer learning, what we allow customers to do is take an existing deep learning model that was trained in a different domain, retrain it on a very small amount of data very quickly, and adapt it to a different domain. That's how the demo of the James Bond car was done. There was a general image classifier that we retrained on probably just a few thousand images, and now we can actually detect whether a car is James Bond's car or not.
>> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model of a similar situation. I want to, in the time we have: there's always been this debate about whether Spark should manage state, whether it's a database, a key-value store. Tell us how the thinking about that has evolved, and then how the integration interfaces for achieving that have evolved.
>> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define, but the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantages of S3 over HDFS, for example. Storage prices are being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of building on top of existing storage systems. There are actually a lot of opportunities for optimization in how you can leverage the specific properties of the underlying storage system to get maximum performance. For example, how are you doing intelligent caching, and how do you start thinking about building indexes against the data that's stored, for scan workloads.
>> With Tungsten you take advantage of the latest hardware, and we get more memory-intensive systems, and now that the Catalyst optimizer has a cost-based optimizer, or will, and large memory: can you change how you go about knowing what data you're managing in the underlying system, and therefore achieve a tremendous acceleration in performance?
>> This is actually one area we invested in with the DBIO module, as part of Databricks Runtime. What DBIO does, and a lot of this is still in progress, is, for example, add some form of indexing capability to the system, so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups, or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're adding indexing functionality on top of it as part of DBIO.
>> And so what would be the application profiles? Is it just for the analytic queries, or can you do the point look-ups and updates in that sort of scenario too?
>> So, it's interesting you're talking about updates. Updates are another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize that for both use cases, doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates, rather than transactional updates in which every time you swipe a credit card some record gets updated. That probably belongs more on the transactional databases, like Oracle or MySQL even.
>> What about when you think about people who started out with Spark on prem, and they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like?
>> Really interesting; it's kind of two questions, maybe. The first is the hybrid on-prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute. So when you want to move, for example, workloads from on prem to the cloud, the one you care the most about is probably the data, 'cause with the compute it doesn't really matter that much where you run it, but data is the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on prem, relying on the caching solution we have to minimize the data transfer over time. That's one route, and I would say it's pretty popular. Another one is, with Amazon, you can literally use something like the Snowball service: you give them hard drives, with trucks, and the trucks will ship your data directly and put it in S3. With IoT, a common pattern we see is a lot of the edge devices pushing the data directly into some firehose like Kinesis or Kafka, and I'm sure Google and Microsoft both have their own variants of that. And then you use Spark to directly subscribe to those topics and process them in real time with Structured Streaming.
>> And so would Spark be down, let's say, at the site level, if it's not on the device itself?
>> It's an interesting thought, and maybe one thing we should consider more in the future is how we push Spark to the edges. Right now it's more of a centralized model, in which the devices push data into Spark, which is centralized somewhere. I've seen one example; I don't remember the exact use case, but it had to do with some scientific experiment in the North Pole. And of course there you don't have a great uplink to transfer all the data back to some national lab, so rather they would do smart parsing there and then ship the aggregated result back. There's another one, but it's less common.
>> Alright, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on, for the benefit of everybody?
>> In general, Spark came along with two focuses. One is performance, the other one is ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think is actually pretty usable; the systems are fast enough. Now we should work on actually making them (mumbles) even easier to use.
That's also what we focus a lot on at Databricks.
>> David: Democratizing access, right?
>> Absolutely.
>> Alright, well, Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show.
>> Thank you very much, David and George.
>> Thank you all for watching. We're here at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.
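The transfer-learning demo Reynold mentions, retraining a general image classifier to recognize one specific car, followed the pattern sketched below. This is a hedged sketch against the 2017-era Deep Learning Pipelines (sparkdl) API; the directory paths and labels are made up, and exact function names may have shifted across versions of the library.

```python
# Transfer learning with Deep Learning Pipelines: a pretrained InceptionV3
# featurizes the images, and only a small classifier is trained on top.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import lit
from sparkdl import DeepImageFeaturizer, readImages

# Hypothetical layout: a few thousand labeled images in two directories.
target = readImages("/data/cars/bond").withColumn("label", lit(1))
others = readImages("/data/cars/other").withColumn("label", lit(0))
train_df = target.union(others)

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")  # frozen backbone
lr = LogisticRegression(maxIter=20, regParam=0.05,
                        labelCol="label", featuresCol="features")

# Fast to fit: the deep network only runs forward; just lr is trained.
model = Pipeline(stages=[featurizer, lr]).fit(train_df)
```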
Day One Wrap - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. (energetic music plays)
>> And what an exciting day we've had here at theCUBE. We've been at Spark Summit 2017, talking to partners, to customers, to founders, technologists, data scientists. It's been a load of information, right?
>> Yeah, an overload of information.
>> Well, George, you've been here in the studio with me talking with a lot of the guests. I'm going to ask you to maybe recap some of the top things you've heard today for our guests.
>> Okay, so, well, Databricks laid down, sort of, three themes that they wanted folks to take away: deep learning, Structured Streaming, and serverless. Now, deep learning is not entirely new to Spark, but they've dramatically improved their support for it, I think going beyond the frameworks that were written specifically for Spark, like Deeplearning4j and BigDL by Intel. And now TensorFlow, which is the open source framework from Google, has gotten much better support. On Structured Streaming, it was not clear how much more news we were going to get, because it's been talked about for 18 months. And they really, really surprised a lot of people, including me, where they took, essentially, the processing time for an event or a small batch of events down to 1 millisecond, whereas before it was in the hundreds if not higher. And that changes the type of apps you can build. Also, the Databricks guys had coined the term continuous apps, which means they operate on a never-ending stream of data, which is different from what we've had in the past, where it's batch or, with a user interface, request-response. So they definitely turned up the volume on what they can do with continuous apps. And serverless they'll talk about more tomorrow, and Jim, I think, is going to weigh in. But it basically greatly simplifies the ability to run this infrastructure, because you don't think of it as a cluster of resources. You just know that it's sort of out there, you ask requests of it, and it figures out how to fulfill them. I will say, the other big surprise for me was when we had Matei, who's the creator of Spark and the chief technologist at Databricks, come on the show, and we asked him about how Spark was going to deal with, essentially, more advanced storage of data, so that you could update things, so that you could get queries back, so that you could do analytics, and not just of stuff that's stored in Spark but stuff that Spark stores essentially below it. And he said, "You know, Databricks, you can expect to see come out with or partner with a database to do these advanced scenarios." And I got the distinct impression, after listening to the tape again, that he was talking about Apache Spark, which is separate from Databricks, doing some sort of key-value store. So in other words, when you look at competitors or quasi-competitors, like Confluent with Kafka or data Artisans with Flink, they're not perfect competitors; they overlap some. Now Spark is pushing its way more into overlapping with some of those solutions.
>> Alright. Well, Jim Kobielus, and thank you for that, George, you've been mingling with the masses today. (laughs) And you've been here all day as well.
>> Educated masses, yeah, (David laughs) who are really engaged in this stuff, yes.
>> Well, great. Maybe give us some of your top takeaways after all the conversations you've had today.
>> They're not all that dissimilar from George's.
Databricks, of course, being the center, the developer, the primary committer in the Spark open source community, did a number of very important things in terms of the announcements today at this event that push Spark, the Spark ecosystem, where it needs to go, to expand the range of capabilities and their deployability into production environments. On the deep learning side, the announcement of the Deep Learning Pipelines API is very, very important. Now, as George indicated, Spark has been used in a fair number of deep learning development environments, but not as a modeling tool so much as a training tool: a tool for in-memory distributed training of deep learning models that were developed in TensorFlow, in Caffe, and in other frameworks. This announcement essentially brings support for deep learning directly into the Spark modeling pipeline, the machine learning modeling pipeline, being able to call out to deep learning, TensorFlow and so forth, from within MLlib. That's very important. That means that Spark developers, of which there are many, far more than there are TensorFlow developers, will now have an easy path to bring more deep learning into their projects. That's critically important for democratizing deep learning. I hope, and from what I've seen Databricks has indicated, that they currently have API support reaching out to both TensorFlow and Keras, and that they have plans to bring in API support for access to other leading DL toolkits such as Caffe, Caffe 2, which is Facebook-developed, and MXNet, which is Amazon-developed, and so forth. That's very encouraging. Structured Streaming is very important in terms of what they announced, which is an API to enable access to faster, higher-throughput Structured Streaming in their cloud environment. And they also announced that they have gone beyond, in terms of the code that they've built, the micro-batch architecture of Structured Streaming, to enable it to evolve into a true streaming environment, to be able to contend credibly with the likes of Flink. 'Cause I think the Spark community has, sort of, had their back against the wall with Structured Streaming, in that they couldn't fully provide a true sub-millisecond end-to-end latency environment heretofore. But it sounds like with this R&D, Databricks is addressing that, and that's critically important for the Spark community to continue to evolve in terms of continuous computation. And then the serverless apps announcement is also very important, 'cause I see it as really being a fully managed, multi-tenant Spark development environment, an enabler for continuous build, deploy, and testing DevOps within a Spark machine learning, and now deep learning, context. The Spark community, as it evolves and matures, needs robust DevOps tools to productionize these machine learning and deep learning models. Because really, in many ways, many customers, many developers are now using, or developing, Spark applications that are real 24-by-7 enterprise application artifacts that need a robust DevOps environment. And I think Databricks has indicated they know where this market needs to go, and they're pushing it with R&D. I'm encouraged by all those signs.
>> So, great. Well, thank you, Jim. I hope both you gentlemen are looking forward to tomorrow. I certainly am.
>> Oh yeah.
>> And to you out there, tune in again around 10:00 a.m. Pacific Time. We're going to be broadcasting live here.
From Spark Summit 2017, I'm David Goad with Jim and George, saying goodbye for now. And we'll see you in the morning. (sparse percussion music playing) (wind humming and waves crashing).
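For readers who want to see what a "continuous app" looks like in code, here is a minimal Structured Streaming sketch using Spark's built-in rate source: a never-ending query over a stream, aggregated by time window and written out as it runs. Note that the sub-millisecond latencies discussed above came from later R&D (the continuous processing trigger shipped after this summit), so the trigger below is the ordinary micro-batch one; the rate and window settings are arbitrary.

```python
# A minimal "continuous app": an unbounded streaming query in Spark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The built-in rate source generates (timestamp, value) rows for testing.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 1000)
          .load())

counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="1 second")  # micro-batch cadence
         .start())

query.awaitTermination()  # the app runs until explicitly stopped
```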
Ziya Ma, Intel - Spark Summit East 2017 - #sparksummit - #theCUBE
>> [Narrator] Live from Boston, Massachusetts, this is the Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert.
>> Welcome back to Boston, everybody. This is the Cube, and we're here live at Spark Summit East, #SparkSummit. Ziya Ma is here. She's the Vice President of Big Data at Intel. Ziya, thanks for coming to the Cube.
>> Thanks for having me.
>> You're welcome. So software is our topic, software at Intel. You know, people don't always associate Intel with software, but what's the story there?
>> So actually there are many things that we do for software. Since I manage the big data engineering organization, I'll say a little bit more about what we do for big data.
>> [Dave] Great.
>> So you know, Intel does all the processors, all the hardware. But when our customers are using the hardware, they like to get the best performance out of Intel hardware. So, for the big data space, we optimize the big data solution stack, including Spark and Hadoop, on top of Intel hardware, and make sure that we leverage the latest instruction set so that customers get the most performance out of the newest released Intel hardware. And also we collaborate very extensively with the open source community on big data ecosystem advancement. For example, we're a leading contributor to the Apache Spark ecosystem. We're also a top contributor to the Apache Hadoop ecosystem. And lately we're getting into the machine learning, deep learning, and AI space, especially integrating those capabilities into the big data ecosystem.
>> So I have to ask you a question, just sort of strategically. If we go back several years, during the Unix days you had a number of players developing hardware, microprocessors; there were RISC-based systems, remember MIPS, and of course IBM had one, and Sun, et cetera, et cetera. Some of those live on, but as a very, very small portion of the market. So Intel has dominated the general-purpose market. So as big data became more mainstream, was there a discussion: okay, we have to develop specialized processors, which I know Intel can do as well? Or did you say: okay, we can actually optimize through software? Is that how you got here? Am I understanding that right?
>> We believe definitely that optimizing through software is one thing that we do. That's why Intel actually has, you may not know this, one of the largest software divisions, which focuses on enabling and optimizing solutions on Intel hardware. And of course we also have a very aggressive product roadmap for continuously advancing our hardware products. And actually, you mentioned general-purpose computing: the CPU today still has more than 95% of the big data market. So that's still the biggest portion of the big data market, and we'll continue our advancement in that area. And obviously, as AI and machine learning, deep learning use cases get added into the big data domain, we are expanding our product portfolio into other silicon products.
>> And of course that was kind of the big bet of, we want to bet on Intel. And I guess, I guess--
>> You should still do.
>> And still do. And I guess, at the time, Seagate or other disk makers. Now flash comes in, and of course now Spark, with memory. It's really changing the game, isn't it? What does that mean for you and the software group?
>> Right, so what do we--
Actually, we still focus on the optimization. Obviously, at the hardware level, Intel now is not just offering computing capability. We also offer very powerful network capability, and we offer very good memory solutions, memory hardware, like these non-volatile memory technologies we keep talking about. So for big data, we're trying to leverage all of the newest hardware. We're already working with many of our customers to help them improve their big data memory solutions, the in-memory analytics type of capability, on Intel hardware, to give them the most optimal performance and the most secure results using Intel hardware. So that's definitely one thing we'll continue to do. That's still going to be our top priority. But we don't just limit our work to optimization, because giving users the best experience, giving users the complete experience on the Intel platform, is our ultimate goal. So we work with our customers from financial services companies, we work with folks from manufacturing, from transportation, and from other internet of things segments, to make sure that we give them the easiest big data analytics experience on Intel hardware. When they are running those solutions, they don't have to worry too much about how to make their applications work with Intel hardware, or how to make them more performant on Intel hardware, because the Intel software solutions bridge that gap. We do that part of the job, and that makes our customers' experience easier and more complete.
>> You serve as the accelerant to the marketplace. Go ahead, George.
>> [Ziya] That's right.
>> So Intel's BigDL is the new product as of the last month or so, an open source solution. Tell us how there are other deep learning frameworks that aren't as fully integrated with Spark yet, and where BigDL fits in, since we're at a Spark conference. How does it backfill some functionality, and how does it really take advantage of Intel hardware?
>> George, just like you said, BigDL, we just open sourced it a month ago. It's a deep learning framework that we organically built on top of Apache Spark, and it has quite some differences from the other mainstream deep learning frameworks, like Caffe, TensorFlow, Torch, and Theano, you name it. The reason we decided to work on this project was, again, our experience working with our analytics customers, especially big data analytics customers: as they build AI solutions, or AI modules within their analytics applications, we found it's getting more and more difficult to build and integrate AI capability into their existing big data analytics ecosystem. They had to set up a different cluster and build a different set of AI capabilities using, let's say, one of the deep learning frameworks, and later they had to overcome a lot of challenges, for example, moving the models and data between the two different clusters, and then making sure the AI results get integrated into the existing analytics platform or analytics application. So that was the primary driver: how do we make our customers' experience easier? Do they have to leave their existing infrastructure and build a separate AI module? Can we do something organic on top of the existing big data platform, let's say Apache Spark? Can we just do something like that, so that the user can leverage the existing infrastructure and make it a naturally integral part of the overall analytics ecosystem that they already have? So this was the primary driver.
The other benefit that we see from integrating the BigDL framework naturally with the big data platform is that it enables efficient scale-out, fault tolerance, elasticity, and dynamic resource management, benefits naturally brought by the big data platform. And today, just within this short period of time, we have already tested that BigDL can scale easily to tens or hundreds of nodes, so the scalability is also quite good. Another benefit of a solution like BigDL, especially because it eliminates the need to set up a separate cluster and move models between different hardware clusters, is that you reduce your total cost of ownership. You can just leverage your existing infrastructure; there is no need to buy an additional set of hardware and build another environment just for training the model. So that's another benefit that we see. And performance-wise, we also tested BigDL against Caffe, Torch, and TensorFlow. The performance of BigDL on a single-node Xeon is orders of magnitude faster than out-of-the-box open source Caffe, TensorFlow, or Torch. So definitely it's going to be very promising.
>> Without the heavy lifting.
>> And a useful solution, yeah.
>> Okay, can you talk about some of the use cases that you expect to see from your partners and your customers?
>> Actually, very good question. We already started a few engagements with some of the interested customers. The first customer is from the steel industry, where improving the accuracy of steel-surface defect recognition is very important to quality control. We worked with this customer over the last few months and built an end-to-end image recognition pipeline using BigDL and Spark. And the customer, just through phase one work, already improved its defect recognition accuracy to 90%, and they're seeing a very good yield improvement in steel production.
>> And it used to be done by humans?
>> It used to be done by humans, yes.
>> And you said, what was the degree of improvement?
>> Ninety. Nine, zero. So now the accuracy is up to 90%. Another use case is in financial services, especially for fraud detection. This customer, whom I'm not naming at their request (the financial industry is very sensitive about releasing names), was seeing its fraud risks increase tremendously with its wide range of products, services, and customer interaction channels. So they implemented an end-to-end deep learning solution using BigDL and Spark, and again, through phase one work, they are seeing the fraud detection rate improve 40 times. Four, zero times, through phase one work. We think there is more improvement we can make, because this is just a collaboration over the last few months, and we'll continue the collaboration with this customer. And we expect more use cases from other business segments, but those are the two that already have BigDL running in production today.
>> Well, the first one, that's amazing: essentially replacing the human, and much more accurate. The fraud detection is interesting, because fraud detection has come a long way in the last 10 years, as you know. It used to take six months to find fraud, and now it's minutes, seconds, but there are still a lot of false positives. So do you see this technology helping address that problem?
>> Yeah, continuously improving the prediction accuracy is one of the goals.
This is another reason why we need to bring AI and big data together. You need to train your model, train your AI capabilities, with more and more training data, so that you get much better training accuracy. Actually, this is the biggest lever for improving your training accuracy. So you need a huge infrastructure, a big data platform, so that you can host and manage your training data sets well, and so that they can feed into your deep learning solution or module for continuously improving your training accuracy. So yes.
>> This is a really key point, it seems like. I would like to unpack that a little bit. So when we talk to customers and application vendors, it's that training feedback loop that gets the models smarter and smarter. So if you had one cluster for training that was with another framework, and then Spark was the rest of your analytics, how would training with feedback data work when you had two separate environments?
>> You know, that's one of the drivers why we created BigDL. We did not come to BigDL at the very beginning: we tried to port the existing deep learning frameworks, like Caffe and TensorFlow, onto Spark. And you probably also saw some research papers, folks; there are other teams out there also trying to port Caffe, TensorFlow, and the other deep learning frameworks onto Spark, because you have that need. You need to bring the two capabilities together. But the problem is that those systems were developed in a very traditional way, with big data not yet in consideration when those frameworks were created and innovated. But now the need for converging the two becomes more and more clear, and more necessary. And that's why, when we ported them over, we said, gosh, this is so difficult. First, it's very challenging to integrate the two. And secondly, the experience after you've moved it over is awkward: you're literally using Spark as a dispatcher, and the integration is not coherent. It's like they're superficially integrated. So this is where we said we've got to do something different. We cannot just superficially integrate two systems together. Can we do something organic on top of the big data platform, on top of Apache Spark, so that the integration between the training system, the feature engineering, and the data management can be more consistent, more integrated? So that's exactly the driver for this work.
>> That's huge. Seamless integration is one of the most overused phrases in the technology business; superficial integration is maybe a better description for a lot of those so-called seamless integrations. You're claiming here that it's truly seamless integration. We're out of time, but last word: Intel and Spark Summit. What do you guys have going here? What's the vibe like?
>> So actually, tomorrow I have a keynote where I'm going to talk a little bit more about what we're doing with BigDL; this is one of the big things that we're doing. And of course, in order for a system like BigDL, or even other deep learning frameworks, to get optimum performance on Intel hardware, there's another item that we're highlighting: MKL, the Intel-optimized Math Kernel Library. It has a lot of common math routines, optimized for Intel processors using the latest instruction set, and that's already integrated into the BigDL ecosystem today. So that's another thing that we're highlighting. And another thing is that those are just software.
At the hardware level, during Intel's AI Day in November, our executives BK, Diane Bryant, and Doug Fisher also highlighted the Nervana product portfolio that's coming out. That will give you different hardware choices for AI. You can look at FPGAs, Xeon Phi, Xeon, and our new Nervana-based silicon like Lake Crest. Those are some good silicon products that you can expect in the future.
>> Intel, taking us to Nervana, touching every part of the ecosystem. Like you said, 95% share, and in all parts of the business. Thanks very much for coming on the Cube.
>> Thank you, thank you for having me.
>> You're welcome. Alright, keep it right there; George and I will be back with our next guest. This is Spark Summit, #SparkSummit. We're the Cube. We'll be right back.
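As a rough picture of what "organically built on top of Spark" means in practice, here is a sketch of a BigDL training flow. It is written against the early BigDL 0.x Python API as best as it can be reconstructed; module paths, the Optimizer signature, and parameter names such as learningrate varied across releases, so treat every identifier here as an assumption rather than a reference. The structural point stands regardless: the model, the training data (an ordinary RDD of BigDL Samples), and the distributed training itself all live on the same Spark cluster as the rest of the analytics.

```python
# A toy BigDL training run on Spark (assumed 0.x-era API; names may differ).
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())  # same cluster as the analytics jobs
init_engine()

# A small two-layer classifier over 784-dimensional inputs (MNIST-sized).
model = (Sequential()
         .add(Linear(784, 128))
         .add(ReLU())
         .add(Linear(128, 10))
         .add(LogSoftMax()))

# Training data as a plain Spark RDD of BigDL Samples; random stand-in here.
# ClassNLLCriterion follows the Torch convention of 1-based class labels.
train_rdd = sc.parallelize(range(1000)).map(
    lambda _: Sample.from_ndarray(np.random.rand(784),
                                  np.array([float(np.random.randint(1, 11))])))

optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=128)

trained_model = optimizer.optimize()  # distributed training across executors
```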