
Search Results for Sparks:

SPARKs: Succinct Parallelizable Arguments of Knowledge


 

>> Hello, everyone. Welcome to the NTT Research Summit. My name is Ilan Komargodski and I will talk about SPARKs, Succinct Parallelizable Arguments of Knowledge. This talk is based on joint work with Naomi Ephraim, Cody Freitag, and Rafael Pass. Let me start by telling you what succinct arguments are. A succinct argument is a special type of interactive protocol between a prover and a verifier who share some instance X, which is allegedly in some language. The goal of the protocol is for the prover to convince the verifier that X is indeed in the language. Completeness guarantees that if X is indeed in the language, the verifier will, at the end of the protocol, indeed be convinced. On the other hand, for soundness we require that if X is not in the language, then no matter what the prover does, as long as it is bounded to run in polynomial time, the verifier will not be convinced. There is a stronger notion of soundness called an argument of knowledge, which says that the only way for the prover to convince the verifier is by knowing some witness. There is a mathematical way to formalize this notion, but I will not get into it. As for efficiency, what makes this protocol succinct is that we require the verifier's running time, and the communication complexity between the prover and the verifier, to both be bounded by some polylogarithmic function in T, where T is the time it takes to verify the statement. In terms of the prover's running time, we don't require anything except that it is, for example, polynomial. The goal of this work is to improve this polynomial overhead of the prover. To explain why this is an important task, let me give you a motivating example, which is just the concept of delegation of computation. So consider some small device, like a laptop or a smartphone, that wants to perform some complicated computation which it cannot do. Since it is a small device, it wishes to delegate the computation to some service or cloud to perform the computation for it. Since the small device does not fully trust the service, it may want to ask the service to also issue a proof of correctness of the computation. And the problem is that if the proof takes much more time than just performing the computation, it's not clear that this is something that will be useful in practice. Think of an overhead which is the square of the time it takes to perform the computation: this will very quickly become a very big number, a very, very large delay for generating the proof. We're not the first to study this problem. It has been studied for several decades, and at least from a theoretical point of view, the problem is almost solved, or essentially solved. We have constructions of argument systems with great overhead, just polylogarithmic multiplicative overhead. This is obtained by combining efficient PCPs together with Kilian's argument. There is a huge open problem in complexity theory of constructing PCPs with constant overhead, namely running just in linear time in the running time of the computation itself. But we argue that even if we had such a PCP, and the constant was great, let's say it was just two, this would already be too much, because if you delegate a computation that takes a month to complete, then waiting another month just for the proof might not be so reasonable. There is a solution in the literature for this problem, using a PCP with a special property.
And I'll show that there is a PCP construction that has the following very useful property. Once you perform the computation itself, and write down the computation tableau, then there is a way to generate every symbol of the PCP in just polylogarithmic time. So this means that, after computing the function itself, you can in parallel compute the whole PCP in just polylogarithmic time. This gives us a great argument system with just T plus polylog T parallel time, instead of T times polylog T time. But for this we need about T parallel processors, which is prohibitively large. This is where SPARKs come in. We introduce the notion, or the paradigm, of computing the proof in parallel to the computation, not after the computation is done. Slightly more formally, what a SPARK is: it's just a succinct argument of knowledge, like what we said before, with the verifier and communication complexity being small, but now we also require a prover which is super efficient. Namely, it can be parallelizable, and it has to finish the proof together with the computation in time T plus polylog T, which is essentially the best you can hope for. And we want the prover to do so with only a polylogarithmic number of processors. You can also extend the definition to handle computations which are, to begin with, parallelizable, but I will not touch upon this in this talk; you can see the paper. As for results, we have two main results. The first main result is the construction of an interactive SPARK. It is just four rounds, and it assumes only collision-resistant hash functions. The second result is a non-interactive SPARK. This result also assumes collision-resistant hash functions and, in addition, the existence of any SNARK, namely a succinct non-interactive argument of knowledge, that does not have to be super efficient in terms of prover time. Slightly more generally, the two theorems follow from a combined framework, which takes essentially any argument of knowledge and turns it into a SPARK, assuming only collision-resistant hash functions, and the main idea behind the construction can be viewed as a trade-off between computation time and processors. We instantiate theorem one using Kilian's protocol, which is a four-round argument of knowledge, and we instantiate theorem two using a SNARK, which is an argument of knowledge just by definition. Let me tell you what are the main ideas underlying our construction. Before telling you the actual ideas, let me make some simplifying assumptions. The first assumption: I will only be talking about the non-interactive regime. The second simplifying assumption is that I'm going to assume a SNARK, which is a non-interactive succinct argument of knowledge, and I'll assume that the SNARK is super efficient, so it will run in time 2T for a computation that takes time T, so almost what we want, but not quite there yet. I will assume that the computation that we want to perform is sequential, and additionally I will assume that the computation has no space, namely it has zero, or very low, space. So think about a sequential computation which doesn't have a lot of space, or even zero space, for the time being. Later I will discuss how to remove these simplifying assumptions. So the starting idea is based on two earlier works from a couple of years ago.
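To keep the efficiency requirements straight before moving on to the construction, here is an informal restatement in symbols of what the talk says in words. This is only a paraphrase using a security parameter lambda, not the paper's exact definition:

```latex
% Succinct argument (T = time to verify the statement with its witness):
\mathrm{communication},\ \mathrm{time}_{\mathcal{V}} \;\le\; \mathrm{poly}(\lambda, |x|, \log T),
\qquad \mathrm{time}_{\mathcal{P}} \;\le\; \mathrm{poly}(\lambda, T).

% SPARK: the prover's parallel time must essentially match the computation,
% using only a small number of processors:
\mathrm{parTime}_{\mathcal{P}} \;\le\; T + \mathrm{poly}(\lambda, \log T)
\quad\text{with at most}\quad \mathrm{poly}(\lambda, \log T)\ \text{processors}.
```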
And here's how it works. So remember, we want to perform a time-T computation, generate a proof, and we need to finish roughly by time T. So the idea is to run half of the computation, which is what we can afford, because we have a SNARK that can generate a proof in an additional T over 2 steps, so we can complete half of the computation and prove that half of the computation, all in time T. And the idea is that now we can recursively compute and prove the rest of the computation in parallel. Here's how it looks. You run half of the computation and start its proof, and then you run half of the remaining computation, which is a quarter of the original one, and prove it. And in parallel again, you take another eighth of the computation, which is one half of what's left, and so on and so forth. As you can see, eventually we will finish the whole computation, and you only need something like logarithmically many parallel processors, and the communication complexity and the verifier's running time only grow by a logarithmic factor.
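To make the splitting schedule just described concrete, here is a rough sketch in Python. It only illustrates the scheduling idea with a made-up step count, not the paper's actual parameterization or any of the proof machinery: each sub-prover takes roughly half of whatever work remains, so the number of parallel sub-provers is only logarithmic in T.

```python
import math

def halving_schedule(total_steps):
    """Split a T-step computation into segments where each segment covers
    roughly half of the steps that remain, as in the talk's sketch."""
    segments = []
    start, remaining = 0, total_steps
    while remaining > 1:
        size = remaining // 2            # prove half of what is left...
        segments.append((start, start + size))
        start += size
        remaining -= size                # ...and recurse on the rest in parallel
    if remaining == 1:
        segments.append((start, start + 1))
    return segments

T = 1_000_000
segments = halving_schedule(T)
print(len(segments), "parallel sub-proofs for a computation of", T, "steps")
print("log2(T) is roughly", round(math.log2(T)))
```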
So this is the main idea. Let's go back to the simplifying assumptions we have. The first one was that I'm only going to talk about the non-interactive regime. You have to believe me that the same ideas extend to the interactive case, which is a little bit more massive in notation, but the ideas extend, so I will not talk about it anymore. The second assumption I had was that I have a super efficient SNARK, so it had overhead 2T for a time-T computation. Again, you have to believe me that if you work out the math, then the ideas extend to SNARKs with quasilinear overhead, namely SNARKs whose prover works in time T times polylog T. And then the result extends to any SNARK, because of a previous work of Bitansky et al., who showed that a SNARK whose prover runs in polynomial time can be generically translated into a SNARK whose prover runs with quasilinear overhead. So this gives a result from any SNARK, not only from very efficient SNARKs. The last bullet was about the fact that we're dealing only with sequential RAM computations. And again, you have to believe me that the ideas can be extended to parallel RAMs. And the last assumption, which is the focus of this work, is how to get rid of the small-space assumption. This is what I'm going to be talking about next. Let's see what goes wrong if the computation has space. Remember what we did a couple of slides ago: the construction was to perform half of the computation, prove it, then half of the remaining computation, prove it, and so on. If you write down the statement that each of these proofs proves, it's something like: a machine M, on input X, executed for some number of steps, starting from some state, ended at some other state. And if you notice, the statement itself depends on the space of the computation, and therefore, if the space of the computation is nontrivial, the statements are large, and therefore the communication will be large, and therefore the verifier's running time will be proportional to the space, and so on. So we don't even get a succinct argument if we do it naively. Here's a solution for this problem. You can say, well, you don't have to include the whole space in the statement; you can include only a digest of the space. Think about some hash function of the space. So indeed, you can modify the statement to not include the space, but only a digest. And now the statement will be a little bit more complicated. It will be that there exist some initial state and end state such that their hashes are consistent with the digests in the statement, and if you run the machine M for k steps starting from the initial space, you end up with the final space. So this is great: it indeed solves the communication complexity problem and the verifier complexity problem. But notice that from the prover's side we didn't actually do anything, because we just pushed the complexity into the witness. So the prover's running time is still very large with this solution. Our final solution, at a very high level, is to compress the witness. So instead of using the whole space as the witness, we will be using the computation itself, the computation that we ran, as the witness. So now the statement will be of the same form: it will still consist of two digests and a machine, but now the witness will not be the whole state; it will be the k steps that we performed. Namely, it will be that there exist k steps such that if I run the machine M for these k steps, starting with the initial digest, and I apply these k steps to the digest, I will end up with the final digest. In order to implement this, we need some sort of an updatable digest. This is not so hard to obtain, because you could just do something like a Merkle tree; it's not hard to see that you can update locations in a Merkle tree quite efficiently. But the problem is that we need to compute those updates: not only do we need to be able to update the hash, the digest, we also need to be able to compute the updates in parallel to the computation. And to this end, we introduce a variant of Merkle trees and show how to perform all of those updates level by level in the Merkle tree, in a pipelined fashion. So namely, we push the updates of the digest into the Merkle tree one after the other, without waiting for the previous ones to end, and here we're using the tree structure of Merkle trees. So that's all I'm going to say about the protocol. I'm just going to end by showing you how the final protocol looks. We run k1 steps of computation, and we compute the k1 updates for those k1 steps in parallel to the computation. So every time we run a step of computation, we also start an update of our digest. And once we are finished computing all the updates, we can start running a proof using those updates as the witness, and we recursively continue this way. As a conclusion, this results in a SPARK, namely a succinct argument system where the prover's running time is T plus polylog T, and all we need is something like a polylogarithmic number of processors. I would like to mention that this is a theoretical result and by no means should it be taken as a practical thing that should be implemented as is. But I think that it is important to work on it, and there are a lot of interesting questions on how to make this really practical and useful. So with that, I'm going to end. Thank you so much for inviting me, and enjoy the rest of the summit.
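As a side note on the updatable digest used above, here is a minimal Merkle-tree sketch. It is deliberately simplified and only meant to show why a single memory write touches just a logarithmic number of hashes; it is not the pipelined, level-by-level variant that the talk describes.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleDigest:
    """Digest of a fixed-size memory; the size must be a power of two here."""
    def __init__(self, cells):
        self.n = len(cells)
        # tree[1] is the root; leaves live at indices n .. 2n-1
        self.tree = [b""] * (2 * self.n)
        for i, cell in enumerate(cells):
            self.tree[self.n + i] = h(cell)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = h(self.tree[2 * i] + self.tree[2 * i + 1])

    def root(self) -> bytes:
        return self.tree[1]

    def update(self, index, value):
        """Write `value` into cell `index` and refresh only the leaf-to-root
        path: O(log n) hashes per step of the computation."""
        i = self.n + index
        self.tree[i] = h(value)
        i //= 2
        while i >= 1:
            self.tree[i] = h(self.tree[2 * i] + self.tree[2 * i + 1])
            i //= 2

memory = [b"\x00"] * 8
digest = MerkleDigest(memory)
old_root = digest.root()
digest.update(3, b"\x07")          # one computation step writes one cell
print(old_root != digest.root())   # the digest now commits to the new memory
```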

Published Date : Sep 24 2020

SUMMARY :

protocol between the prover and the verifier who share some instance X, In terms of the prover's running time, we don't require anything except that it's, for example, first to study this problem. extend the definition to handling computations, which are to begin with a and in addition, the existence of any SNARK, namely a succinct, is that I'm going to assume a SNARK, which is a non-interactive succinct argument So the starting idea is based on two earlier works. remaining half of the remaining computation, which is a quarter of the original one, and prove But the ideas extend so I will not talk about it anymore. out the math, then the ideas extend to SNARKs with quasilinear overhead. But notice that from the prover's side, we didn't actually do anything because we just But the problem is that we need to compute those updates.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Ellen Komarovsky | PERSON | 0.99+
Winston | PERSON | 0.99+
Killian | PERSON | 0.99+
Kettle | PERSON | 0.99+
20 | QUANTITY | 0.99+
two theories | QUANTITY | 0.99+
Raphael | PERSON | 0.99+
Frank | PERSON | 0.99+
first | QUANTITY | 0.99+
Freytag | PERSON | 0.99+
two | QUANTITY | 0.99+
Leslie | PERSON | 0.99+
Polly | PERSON | 0.99+
second assumption | QUANTITY | 0.99+
first one | QUANTITY | 0.99+
Cody | PERSON | 0.99+
four rounds | QUANTITY | 0.99+
eighth | QUANTITY | 0.98+
three times | QUANTITY | 0.98+
zero | QUANTITY | 0.98+
Lee | PERSON | 0.98+
second result | QUANTITY | 0.98+
each | QUANTITY | 0.97+
four round | QUANTITY | 0.97+
both | QUANTITY | 0.96+
two main results | QUANTITY | 0.96+
one steps | QUANTITY | 0.94+
over two steps | QUANTITY | 0.93+
half | QUANTITY | 0.91+
16 | QUANTITY | 0.91+
Half | QUANTITY | 0.91+
second example | QUANTITY | 0.9+
a month | QUANTITY | 0.88+
Merkle | OTHER | 0.87+
couple of years ago | DATE | 0.83+
Entities | EVENT | 0.82+
one half | QUANTITY | 0.79+
two T | QUANTITY | 0.77+
first main result | QUANTITY | 0.76+
half of | QUANTITY | 0.76+
40 time | QUANTITY | 0.74+
one | QUANTITY | 0.72+
1/16 | QUANTITY | 0.68+
Onley | PERSON | 0.62+
couple of | DATE | 0.6+
Summit | ORGANIZATION | 0.48+
several decades | QUANTITY | 0.47+

Abhiman Matlapudi & Rajeev Krishnan, Deloitte | Informatica World 2019


 

>> Live from Las Vegas. It's theCUBE. Covering Informatica World 2019, brought to you by Informatica. >> Welcome back everyone to theCUBE's live coverage of Informatica World. I am your host, Rebecca Knight, along with co-host, John Furrier. We have two guests for this segment. We have Abhiman Matlapudi. He is the Product Master at Deloitte. Welcome. >> Thanks for having us. >> And we have Kubalahm Rajeev Krishnan, Specialist Leader at Deloitte. Thank you both so much for coming on theCUBE. >> Thanks Rebecca, John. It's always good to be back on theCUBE. >> Love the new logos here, what's the pins? What's the new take on those? >> It looks like a honeycomb! >> Yeah, so interesting that you ask, so this is our joined Deloitte- Informatica label pin. You can see the Deloitte green colors, >> Nice! They're beautiful. >> And the Informatica colors. This shows the collaboration, the great collaboration that we've had over, you know, the past few years and plans, for the future as well. Well that's what we're here to talk about. So why don't you start the conversation by telling us a little bit about the history of the collaboration, and what you're planning ahead for the future. Yeah. So, you know, if we go like you know, ten years back the collaboration between Deloitte and Informatica has not always been that, that strong and specifically because Deloitte is a huge place to navigate, and you know, in order to have those meaningful collaborations. But over the past few years, we've... built solid relationships with Informatica and vise versa. I think we seek great value. The clear leaders in the Data Management Space. It's easy for us to kind of advise clients in terms of different facets of data management. You know, because no other company actually pulls together you know, the whole ecosystem this well. >> Well you're being polite. In reality, you know where it's weak and where it's real. I mean, the reality is there's a lot of fun out there, a lot of noise, and so, I got to ask you, cause this is the real question, because there's no one environment that's the same. Customers want to get to the truth faster, like, where's the deal? What's the real deal with data? What's gettable? What's attainable? What's aspirational? Because you could say "Hey, well I make data, data-driven organization, Sass apps everywhere." >> Yeah. Yeah absolutely. I mean every, every company wants to be more agile. Business agility is what's driving companies to kind of move all of their business apps to the Cloud. The uh, problem with that is that, is that people don't realize that you also need to have your data management governance house in order, right, so according to a recent Gartner study, they say by next year, 75% of companies who have moved their business apps to the Cloud, is going to, you know, unless they have their data management and data assets under control, they have some kind of information governance, that has, you know, context, or purview over all of these business apps, 50% of their data assets are going to erode in value. So, absolutely the need of the hour. So we've seen that great demand from our clients as well, and that's what we've been advising them as well. >> What's a modern MDM approach? Because this is really the heart of the conversation, we're here at Informatica World. What's- What does it look like? What is it? >> So I mean, there are different facets or functionalities within MDM that actually make up what is the holistic modern MDM, right. 
In the past, we've seen companies doing MDM to get to that 360-degree view. Somewhere along the line, the ball gets dropped. That 360 view doesn't get combined with your data warehouse and all of the transaction information, right, and, you know, your business uses don't get the value that they were looking for while they invested in that MDM platform. So in today's world, MDM needs to provide front office users with the agility that they need. It's not about someone at the back office doing some data stewardship. It's all about empowering the front office users as well. There's an aspect of AIML from a data stewardship perspective. I mean everyone wants cost take out, right, I mean there's fewer resources and more data coming in. So how how do you manage all of the data? Absolutely you need to have AIML. So Informatica's CLAIRE product helps with suggestions and recommendations for algorithms, matching those algorithms. Deloitte has our own MDM elevate solution that embeds AIML for data stewardship. So it learns from human data inputs, and you know, cuts through the mass of data records that have to be managed. >> You know Rajeev, it was interesting, last year we were talking, the big conversation was moving data around is really hard. Now there's solutions for that. Move the data integrity on premise, on Cloud. Give us an update on what's going on there, because there seems to be a lot of movement, positive movement, around that. In terms of, you know, quality, end to end. We heard Google up here earlier saying "Look, we can go into end to end all you want". This has been a big thing. How are you guys handling this? >> Yeah absolutely, so in today's key note you heard Anil Chakravarthy and Thomas Green up on the stage and Anil announced MDM on GCP, so that's an offering that Deloitte is hosting and managing. So it's going to be an absolutely white-glove service that gives you everything from advice to implement to operate, all hosted on GCP. So it's a three-way ecosystem offering between Deloitte, Informatica, and GCP. >> Well just something about GCP, just as a side note before you get there, is that they are really clever. They're using Sequel as a way to abstract all the under the hood kind of configuration stuff. Smart move, because there's a ton of Sequel people out there! >> Exactly. >> I mean, it's not structured query language for structured data. It's lingua franca for data. They've been changing the game on that. >> Exactly, it should be part of their Cloud journey. So organizations, when they start thinking about Cloud, first of all, what they need to do is they have to understand where all the data assets are and they read the data feeds coming in, where are the data lakes, and once they understand where their datas are, it's not always wise, or necessary to move all their data to the Cloud. So, Deloitte's approach or recommendation is to have a hybrid approach. So that they can keep some of their legacy datas, data assets, in the on premise and some in the Cloud applications. So, Informatica, MDM, and GCP, powered by Deloitte, so it acts as an MDM nimble hub. In respect of where your data assets are, it can give you the quick access to the data and it can enrich the data, it can do the master data, and also it can protect your data. And it's all done by Informatica. >> Describe what a nimble hub is real quick. What does a nimble hub mean? What does that mean? 
>> So it means that, in respect of wherever your data is coming in and going out, so it gives you a very light feeling that the client wouldn't know. All we- Informatica, MDM, on GCP powered by Deloitte, what we are saying is we are asking clients to just give the data. And everything, as Rajeev said, it's a white-glove approach. It's that from engagement, to the operation, they will just feel a seamless support from Deloitte. >> Yeah, and just to address the nimbleness factor right, so we see clients that suddenly need to get into new market, or they want to say, introduce a new product, so they need the nimbleness from a business perspective. Which means that, well suddenly you've got to like scale up and down your data workloads as well, right? And that's not just transactional data, but master data as well. And that's where the Cloud approach, you know, gives them a positive advantage. >> I want to get back to something Abhiman said about how it's not always wise or necessary to move to the Cloud. And this is a debate about where do you keep stuff. Should it be on on prem, and you said that Deloitte recommends a hybrid approach and I'm sure that's a data-driven recommendation. I'm wondering what evidence you have and what- why that recommendation? >> So, especially when it depends on the applications you're putting on for MDM, and the sources and data is what you are trying to get, for the Informatica MDM to work. So, it's not- some of your social systems are already tied up with so many other applications within your on premise, and they don't want to give every other data. And some might have concerns of sending this data to the Cloud. So that's when you want to keep those old world legacy systems, who doesn't want to get upgrades, to your on premise, and who are all Cloud-savy and they can all starting new. So they can think of what, and which, need a lot of compute power, and storage. And so those are the systems we want to recommend to the Cloud. So that's why we say, think where you want to move your data bases. >> And some of it is also driven by regulation, right, like GDPR, and where, you know, which providers offer in what countries. And there's also companies that want to say "Oh well my product strategy and my pricing around products, I don't want to give that away to someone." Especially in the high tech field, right. Your provider is going to be a confidere. >> Rajeev, one of the things I'm seeing here in this show, is clearly that the importance of the Cloud should not be understated. You see, and you guys, you mentioned you get the servers at Google. This is changing not just the customers opportunity, but your ability to service them. You got a white-glove service, I'm sure there's a ton more head room. Where do you guys see the Cloud going next? Obviously it's not going away, and the on premise isn't going away. But certainly, the importance of the Cloud should not be understated. That's what I'm hearing clearly. You see Amazon, Azure, Google, all big names with Informatica. But with respect to you guys, as you guys go out and do your services. This is good for business. For you guys, helping customers. >> Yeah absolutely, I think there's value for us, there's value for our clients. You know, it's not just the apps that are kind of going to the Cloud, right? I mean you see all data platforms that are going to the Cloud. For example, Cloudera. They just launched CDP. Being GA by July- August. 
You know, Snowflake's on the Cloud doing great, getting good traction in the market. So eventually what we're seeing is, whether it's business applications or data platforms, they're all moving to the Cloud. Now the key thing to look out for in the future is, how do we help our clients navigate a multi-Cloud environment, for example, because sooner or later, they wouldn't want to have all of their eggs invested in one basket, right? So, how do we help navigate that? How do we make that seamless to the business user? Those are the challenges that we're thinking about. >> What's interesting about Databricks and Snowflake, you mentioned them, is that it really is a tell sign that start-ups can break through and crack the enterprise with Cloud and the ecosystem. And you're starting to see companies that have a SaaS-like mindset with technology. Coming into an enterprise market with these ecosystems, it's a tough crowd believe me, you know the enterprise. It's not easy to break into the enterprise, so for Databricks and Snowflake, that's a huge tell sign. What's your reaction to that, because it's great for Informatica because it's validation for them, but also the start-ups are now growing very fast. I mean, I wouldn't just call Snowflake a 3 billion dollar start-up, a unicorn, but times three. But it's a tell sign. It's just something new we haven't seen. We've seen Cloudera break in. They kind of ramped their way in there with a lot of raise and they had a big field sales force. But Databricks and Snowflake, they don't have a huge field sales force. >> Yeah, I think it's all about clients and understanding what is the true value that someone provides. Is it someone that we can rely on to keep our data safe? Do they have the capacity to scale? If you can crack those things, then you'll be in the market. >> Who are you attracting to the MDM on Google Cloud? What's the early data look like? You don't have to name names, but what's some of the use cases that get the white glove service from Deloitte on the Google Cloud? Tell us about that. Give us more data on that. >> So we've just announced that, here at Informatica World, we've got about three to four mid to large enterprises. One large enterprise and about three mid-size companies that are interested in it. So we've been in talks with them in terms of how we want to do it. We don't want to open the flood gates. We'd like to make sure it's all stable, you know, clients are happy and there's word of mouth around. >> I'm sure the end to end management piece of it, that's probably attractive. The end to end... >> Exactly. I mean, Deloitte's clearly the leader in the data analytics space, according to Gartner reports. Informatica is the leader in their space. GCP has great growth plans, so the three of them coming together is going to be a winner. >> One of the most pressing challenges facing the technology industry is the skills gap and the difficulty in finding talent. Surveys show that I.T. managers can't find qualified candidates for open Cloud roles. What are Deloitte's thoughts on this and also, what are you doing as a company to address it? >> I mean, this is absolutely a good problem to have, for us. Right, which means that there is a demand. But unless we meet that demand, it's a problem. So we've been taking some creative ways, in terms of addressing that.
An example would be our analytics foundry offering, where we provide a pod of people that go from data engineers, you know, with Python and Spark skills, to, you know, Java associates, to front end developers. So a whole stack of developers, a full stack, we provide that full pod so that they can go and address a particular business analytics problem or some kind of visualization issues, in terms of what they want to get from the data. So, we then leverage that pod across multiple clients, and I think that's been helping us. >> If you could get an automated, full time employee, that would be great. >> Yeah, and this digital FTE concept is something that we'd be looking at, as well. >> I would like to add on that, as well. So, earlier, with the data disruption, Informatica is so busy, and because Informatica is so busy, Deloitte is so busy. Now, earlier we used plain Informatica folks and then, later on because of the Cloud disruption, we are training them on the Cloud concepts. Now what the organizations have to think, or the universities have to think, is about having the Cloud concepts in their curriculum, so that students get all their Cloud skills, and after, once they have their Cloud skills, we can train them on the Informatica skills. And Informatica has full training on that. >> I think it's a great opportunity for you guys. We were talking with Sally Jenkins and the team earlier, and the CEO. I was saying that it reminds me of the early days of VMware, with virtualization you saw the shift. Certainly the economics. You replaced servers, do a virtual change to the economics. With the data, although not directly, it's a similar concept where there's new operational opportunities, whether it's using leverage in Google Cloud for say, high-end, modern data warehousing to whatever. The community is going to respond. That's going to be a great ecosystem money making opportunity. The ability to add new services, give you guys more capabilities with customers to really move the needle on creating value. >> Yeah, and it's interesting you mention VMware because I actually helped, as VMware stood up their VMC on AWS offering on the Cloud. We actually helped them get ready for that GA and their data strategy, in terms of support, both for data and analytics friendliness. So we see a lot of such tech companies who are moving to a flexible consumption service. I mean, the challenges are different and we've got a whole practice around that flex consumption. >> I'm sure Informatica would love the VMware valuation. Maybe not a worry for Dell Technologies. >> We all would love that. >> Rajeev, Abhiman, thank you so much for joining us on theCUBE today. >> Thank you very much. Good talking to you. >> I'm Rebecca Knight for John Furrier. We will have more from Informatica World tomorrow.

Published Date : May 22 2019

SUMMARY :

brought to you by Informatica. He is the Product Master at Deloitte. Thank you both so much for coming on theCUBE. It's always good to be back on theCUBE. Yeah, so interesting that you ask, They're beautiful. to navigate, and you know, I mean, the reality is there's a lot of fun out there, is that people don't realize that you also need What does it look like? and all of the transaction information, right, "Look, we can go into end to end all you want". So it's going to be an absolutely white-glove service just as a side note before you get there, They've been changing the game on that. and it can enrich the data, What does that mean? It's that from engagement, to the operation, And that's where the Cloud approach, you know, and you said that Deloitte recommends a hybrid approach think where you want to move your data bases. right, like GDPR, and where, you know, is clearly that the importance of the Cloud Now the key things to look out for in the future is, and crack the enterprise with Cloud and the ecosystem. Do they have the capacity to scale? What's the early data look like? We'd like to make sure it's all stable, you know, I'm sure the end to end management piece of it, the data analytics space, according to Gartner Reports. One of the most pressing challenges facing the I mean, this is absolutely a good problem to have, for us. If you could get an automated, full time employee, Yeah, and this digital FD concept is something that the Cloud concepts in their universities and their and the CEO. Yeah, and it's interesting you mention VMware because I'm sure Informatica would love the VMware valuation. thank you so much for joining us on theCube today. Thank you very much. I'm Rebecca Knight for John Furrier.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Stephane Monoboisset | PERSON | 0.99+
Anthony | PERSON | 0.99+
Teresa | PERSON | 0.99+
AWS | ORGANIZATION | 0.99+
Rebecca | PERSON | 0.99+
Informatica | ORGANIZATION | 0.99+
Jeff | PERSON | 0.99+
Lisa Martin | PERSON | 0.99+
Teresa Tung | PERSON | 0.99+
Keith Townsend | PERSON | 0.99+
Jeff Frick | PERSON | 0.99+
Peter Burris | PERSON | 0.99+
Rebecca Knight | PERSON | 0.99+
Mark | PERSON | 0.99+
Samsung | ORGANIZATION | 0.99+
Deloitte | ORGANIZATION | 0.99+
Jamie | PERSON | 0.99+
John Furrier | PERSON | 0.99+
Jamie Sharath | PERSON | 0.99+
Rajeev | PERSON | 0.99+
Amazon | ORGANIZATION | 0.99+
Jeremy | PERSON | 0.99+
Ramin Sayar | PERSON | 0.99+
Holland | LOCATION | 0.99+
Abhiman Matlapudi | PERSON | 0.99+
2014 | DATE | 0.99+
Rajeem | PERSON | 0.99+
Jeff Rick | PERSON | 0.99+
Savannah | PERSON | 0.99+
Rajeev Krishnan | PERSON | 0.99+
three | QUANTITY | 0.99+
Savannah Peterson | PERSON | 0.99+
France | LOCATION | 0.99+
Sally Jenkins | PERSON | 0.99+
George | PERSON | 0.99+
Stephane | PERSON | 0.99+
John Farer | PERSON | 0.99+
Jamaica | LOCATION | 0.99+
Europe | LOCATION | 0.99+
Abhiman | PERSON | 0.99+
Yahoo | ORGANIZATION | 0.99+
130% | QUANTITY | 0.99+
Amazon Web Services | ORGANIZATION | 0.99+
2018 | DATE | 0.99+
30 days | QUANTITY | 0.99+
Cloudera | ORGANIZATION | 0.99+
Google | ORGANIZATION | 0.99+
183% | QUANTITY | 0.99+
14 million | QUANTITY | 0.99+
Asia | LOCATION | 0.99+
38% | QUANTITY | 0.99+
Tom | PERSON | 0.99+
24 million | QUANTITY | 0.99+
Theresa | PERSON | 0.99+
Accenture | ORGANIZATION | 0.99+
Accelize | ORGANIZATION | 0.99+
32 million | QUANTITY | 0.99+

Michael Bennett, Dell EMC | Dell EMC: Get Ready For AI


 

(energetic electronic music) >> Hey, welcome back everybody. Jeff Frick here with The Cube. We're in a very special place. We're in Austin, Texas at the Dell EMC HPC and AI Innovation Lab. High performance computing, artificial intelligence. This is really where it all happens. Where the engineers at Dell EMC are putting together these ready-made solutions for the customers. They got every type of application stack in here, and we're really excited to have our next guest. He's right in the middle of it, he's Michael Bennett, Senior Principal Engineer for Dell EMC. Mike, great to see you. >> Great to see you too. >> So you're working on one particular flavor of the AI solutions, and that's really machine learning with Hadoop. So tell us a little bit about that. >> Sure yeah, the product that I work on is called the Ready Solution for AI Machine Learning with Hadoop, and that product is a Cloudera Hadoop distribution on top of our Dell powered servers. And we've partnered with Intel, who has released a deep learning library, called Big DL, to bring both the traditional machine learning capabilities as well as deep learning capabilities to the product. Product also adds a data science workbench that's released by Cloudera. And this tool allows the customer's data scientists to collaborate together, provides them secure access to the Hadoop cluster, and we think all-around makes a great product to allow customers to gain the power of machine learning and deep learning in their environment, while also kind of reducing some of those overhead complexities that IT often faces with managing multiple environments, providing secure access, things like that. >> Right, cause the big knock always on Hadoop is that it's just hard. It's hard to put in, there aren't enough people, there aren't enough experts. So you guys are really offering a pre-bundled solution that's ready to go? >> Correct, yeah. We've built seven or eight different environments going in the lab at any time to validate different hardware permutations that we may offer of the product as well as, we've been doing this since 2009, so there's a lot of institutional knowledge here at Dell to draw on when building and validating these Hadoop products. Our Dell services team has also been going out installing and setting these up, and our consulting services has been helping customers fit the Hadoop infrastructure into their IT model. >> Right, so is there one basic configuration that you guys have? Or have you found there's two or three different standard-use cases that call for two or three different kinds of standardized solutions? >> We find that most customers are preferring the R7-40XC series. This platform can hold 12 3 1/2" form-factor drives in the front, along with four in the mid-plane, while still providing four SSDs in the back. So customers get a lot of versatility with this. It's also won several Hadoop benchmarking awards. >> And do you find, when you're talking to customers or you're putting this together, that they've tried themselves and they've tried to kind of stitch together and cobble together the open-source proprietary stuff all the way down to network cards and all this other stuff to actually make the solution come together? And it's just really hard, right? >> Yeah, right exactly. What we hear over and over from our product management team is that their interactions with customers, come back with customers saying it's just too hard. 
They get something that's stable and they come back and they don't know why it's no longer working. They have customized environments that each developer wants for their big data analytics jobs. Things like that. So yeah, overall we're hearing that customers are finding it very complex. >> Right, so we hear time and time again that same thing. And even though we've been going to Hadoop Summit and Hadoop World and Strata since 2010, the momentum seems to be a little slower in terms of the hype, but now we're really moving into heavy-duty real time production, and that's what you guys are enabling with this ready-made solution. >> So with this product, yeah, we focused on enabling Apache Spark on the Hadoop environment. And that Apache Spark distributed computing has really changed the game as far as what it allows customers to do with their analytics jobs. No longer are we writing things to disk, but multiple transformations are being performed in memory, and that's also a big part of what enables the BigDL library that Intel released for the platform to train these deep-learning models. >> Right, 'cause Spark enables the real-time analytics, right? Now you've got streaming data coming into this thing, versus the batch which was kind of the classic play of Hadoop. >> Right, and not only do you have streaming data coming in, but Spark also enables you to load your data in memory and perform multiple operations on it. And draw insights that maybe you couldn't before with traditional map-reduce jobs. >> Right, right. So what gets you excited to come to work every day? You've been playing with these big machines. You're in the middle of nerd nirvana I think-- >> Yeah exactly. >> With all of the servers and spinning disks. What gets you up in the morning? What are you excited about, as you see AI get more pervasive within the customers and the solutions that you guys are enabling? >> You know, for me, what's always exciting is trying new things. We've got this huge lab environment with all kinds of lab equipment. So if you want to test a new iteration, let's say tiered HDFS storage with SSDs and traditional hard drives, throw it together in a couple of hours and see what the results are. If we wanted to add new PCIe devices like FPGAs for the inference portion of the deep-learning development, we can put those in our servers and try them out. So I enjoy that: on top of the validated, thoroughly-worked-through solutions that we offer customers, we can also experiment, play around, and work towards that next generation of technology. >> Right, 'cause any combination of hardware that you basically have at your disposal to try together and test and see what happens? >> Right, exactly. And this is my first time actually working at an OEM, and so I was surprised, not only do we have access to anything that you can see out in the market, but we often receive test and development equipment from partners and vendors, that we can work with and collaborate with to ensure that once the product reaches market it has the features that customers need. >> Right, what's the one thing that trips people up the most? Just some simple little switch configuration that you think is like a minor piece of something, that always seems to get in the way? >> Right, or switches in general.
I think that people focus on the application because the switch is so abstracted from what the developer or even somebody troubleshooting the system sees, that oftentimes some misconfiguration or some typo that was entered during the switch configuration process that throws customers off or has somebody scratching their head, wondering why they're not getting the kind of performance that they thought. >> Right, well that's why we need more automation, right? That's what you guys are working on. >> Right yeah exactly. >> Keep the fat-finger typos out of the config settings. >> Right, consistent reproducible. None of that, I did it yesterday and it worked I don't know what changed. >> Right, alright Mike. Well thanks for taking a few minutes out of your day, and don't have too much fun playing with all this gear. >> Awesome, thanks for having me. >> Alright, he's Mike Bennett and I'm Jeff Frick. You're watching The Cube, from Austin Texas at the Dell EMC High Performance Computing and AI Labs. Thanks for watching. (energetic electronic music)
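As a side note on the in-memory, chained-transformation style of Spark that came up in this conversation, here is a small illustrative PySpark sketch. The dataset, field names, and path are hypothetical; the point is only that the chained steps are planned lazily and kept in memory (here via cache) rather than each stage writing intermediate results to disk the way classic MapReduce jobs do.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-sketch").getOrCreate()

# Hypothetical sensor readings; replace the path and fields with real ones.
readings = spark.read.json("hdfs:///data/sensor_readings")

hot = (
    readings.filter(readings.temperature > 90)   # assumes a 'temperature' field
            .groupBy("device_id")                # assumes a 'device_id' field
            .count()
            .cache()                             # keep this result in memory for reuse
)

hot.show()                                        # first action runs the whole pipeline
print(hot.filter(hot["count"] > 10).count())      # second action reuses the cached data
```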

Published Date : Aug 7 2018

SUMMARY :

at the Dell EMC HPC and AI Innovation Lab. of the AI solutions, and that's really that IT often faces with managing multiple environments, Right, cause the big knock always on Hadoop going in the lab at any time to validate in the front, along with four in the mid-plane, is that their interactions with customers, and that's what you guys are enabling has really changed the game as far as what it allows Right, cause the Sparks enables And draw insights that maybe you couldn't before You're in the middle of nerd nirvana I think-- that you guys are enabling? for the inference portion the deep-learning development that you can see out in the market, the kind of performance that they thought. That's what you guys are working on. Right, consistent reproducible. and don't have too much fun playing with all this gear. at the Dell EMC High Performance Computing and AI Labs.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Jeff Frick | PERSON | 0.99+
Michael Bennett | PERSON | 0.99+
two | QUANTITY | 0.99+
Mike Bennett | PERSON | 0.99+
Dell | ORGANIZATION | 0.99+
seven | QUANTITY | 0.99+
Mike | PERSON | 0.99+
Dell EMC | ORGANIZATION | 0.99+
The Cube | TITLE | 0.99+
yesterday | DATE | 0.99+
2010 | DATE | 0.99+
Austin, Texas | LOCATION | 0.98+
both | QUANTITY | 0.98+
Austin Texas | LOCATION | 0.98+
Spark | TITLE | 0.98+
2009 | DATE | 0.98+
R7-40XC | COMMERCIAL_ITEM | 0.98+
Intel | ORGANIZATION | 0.98+
each developer | QUANTITY | 0.98+
AI Innovation Lab | ORGANIZATION | 0.97+
Hadoop | TITLE | 0.97+
first time | QUANTITY | 0.96+
Dell EMC High Performance Computing | ORGANIZATION | 0.96+
four | QUANTITY | 0.95+
one | QUANTITY | 0.94+
Apache | ORGANIZATION | 0.94+
one thing | QUANTITY | 0.93+
The Cube | ORGANIZATION | 0.92+
12 3 1/2" | QUANTITY | 0.92+
Dell EMC HPC | ORGANIZATION | 0.9+
three different standard-use cases | QUANTITY | 0.9+
eight different environments | QUANTITY | 0.89+
three different | QUANTITY | 0.88+
Stratus | ORGANIZATION | 0.83+
Hadoop World | ORGANIZATION | 0.79+
one basic configuration | QUANTITY | 0.76+
AI Labs | ORGANIZATION | 0.74+
four SSDs | QUANTITY | 0.73+
Cloudera | TITLE | 0.71+
Hadoop Summit | EVENT | 0.69+
hours | QUANTITY | 0.67+
Hadoop benchmarking awards | TITLE | 0.67+
Sparks | COMMERCIAL_ITEM | 0.48+
Hadoop | COMMERCIAL_ITEM | 0.34+

Pandit Prasad, IBM | DataWorks Summit 2018


 

>> From San Jose, in the heart of Silicon Valley, it's theCUBE. Covering DataWorks Summit 2018. Brought to you by Hortonworks. (upbeat music) >> Welcome back to theCUBE's live coverage of DataWorks here in sunny San Jose, California. I'm your host Rebecca Knight along with my co-host James Kobielus. We're joined by Pandit Prasad. He handles analytics projects, strategy, and management at IBM Analytics. Thanks so much for coming on the show. >> Thanks Rebecca, glad to be here. >> So, why don't you just start out by telling our viewers a little bit about what you do, in terms of your relationship with Hortonworks and the other parts of your job. >> Sure, as you said I am in Offering Management, which is also known as Product Management for IBM, managing the big data portfolio from an IBM perspective. I was also working with Hortonworks on developing this relationship, nurturing that relationship, so it's been a year since the Hortonworks partnership. We announced this partnership exactly last year at the same conference. And now it's been a year, so this year has been a journey of aligning the two portfolios together. Right, so Hortonworks had HDP, HDF. IBM also had similar products, so we have, for example, Big SQL; Hortonworks has Hive, so how do Hive and Big SQL align together? IBM has Data Science Experience, where does that come into the picture on top of HDP? So it means before this partnership, if you look into the market, it has been: you sell Hadoop, you sell a SQL engine, you sell data science. So what this year has given us is more of a solution sell. Now with this partnership we go to the customers and say here is an end-to-end experience for you. You start with Hadoop, you put more analytics on top of it, you then bring Big SQL for complex queries and federation and virtualization stories, and then finally you put data science on top of it, so it gives you a complete end-to-end solution, the end-to-end experience for getting the value out of the data. >> Now IBM a few years back released a Watson data platform for team data science with DSX, Data Science Experience, as one of the tools for data scientists. Is Watson data platform still the core, I call it dev ops for data science and maybe that's the wrong term, that IBM provides to market, or is there sort of a broader dev ops framework within which IBM goes to market these tools? >> Sure, Watson data platform one year ago was more of a cloud platform, and it had many components of it, and now we are getting a lot of components on to the (mumbles), and Data Science Experience is one part of it, so Data Science Experience... >> So Watson analytics as well for subject matter experts and so forth. >> Yes. And again Watson has a whole suite of business offerings; Data Science Experience is more of a particular aspect of the focus, specifically on the data science, and that's now available on prem, and now we are building this on-prem stack, so we have HDP, HDF, Big SQL, Data Science Experience, and we are working towards adding more and more to that portfolio. >> Well you have a broader reference architecture and a stack of solutions, AI on Power and so forth, for more of the deep learning development. In your relationship with Hortonworks, are they reselling more of those tools into their customer base to supplement, extend what they already resell, DSX, or is that outside of the scope of the relationship?
>> No it is all part of the relationship, these three have been the core of what we announced last year and then there are other solutions. We have the whole governance solution, right, so again it goes back to the partnership: HDP brings with it Atlas. IBM has a whole governance portfolio including the governance catalog. How do you expand the story from being a Hadoop-centric story to an enterprise data-lake story, and then now we are taking that to the cloud, and that's what Truata is all about. Rob Thomas came out with a blog yesterday morning talking about Truata. If you look at it, it is nothing but a governed data-lake hosted offering, if you want to simplify it. That's one way to look at it; it caters to the GDPR requirements as well. >> For GDPR, for the IBM Hortonworks partnership, what is the lead solution for GDPR compliance? Is it Hortonworks Data Steward Studio or is it any number of solutions that IBM already has for data governance and curation, or is it a combination of all of that in terms of what you, as partners, propose to customers for soup to nuts GDPR compliance? Give me a sense for... >> It is a combination of all of those, so it has HDP, it has HDF, it has Big SQL, it has Data Science Experience, it has IBM governance catalog, it has IBM data quality, and it has a bunch of security products, like Guardium, and it has some new IBM proprietary components that are very specific towards data (cough drowns out speaker) and how do you deal with the personal data and sensitive personal data as classified by GDPR. I'm supposed to query some high level information, but I'm not allowed to query deep into the personal information, so how do you block those queries, how do you understand those? These are not necessarily part of Data Steward Studio. These are some of the proprietary components that are thrown into the mix by IBM. >> One of the requirements that is not often talked about under GDPR, Ricky of Hortonworks got into it a little bit in his presentation, was the notion that the requirement that if you are using an EU citizen's PII to drive algorithmic outcomes, that they have the right to full transparency. It's the algorithmic decision paths that were taken. I remember IBM had a tool under the Watson brand that wraps up a narrative of that sort. Is that something that IBM still, it was called Watson Curator a few years back, is that a solution that IBM still offers? Because I'm getting a sense right now that Hortonworks has a specific solution, not to say that they may not be working on it, that addresses that side of GDPR. Do you know what I'm referring to there? >> I'm not aware of something from the Hortonworks side beyond the Data Steward Studio, which offers basically identification of what some of the... >> Data lineage as opposed to model lineage. It's a subtle distinction. >> It can identify some of the personal information and maybe provide a way to tag it and hence, mask it, but the Truata offering is the one that is bringing some new research assets: after GDPR guidelines became clear, they got into the full details of how do we cater to those requirements. These are relatively new proprietary components, they are not even being productized, that's why I am calling them proprietary components that are going into this hosting service. >> IBM's got a big portfolio so I'll understand if you guys are still working out what position. Rebecca go ahead. >> I just wanted to ask you about this new era of GDPR.
The last Hortonworks conference was sort of before it came into effect and now we're in this new era. How would you say companies are reacting? Are they in the right space for it, in the sense of, do they really understand the ripple effects and how it's all going to play out? How would you describe your interactions with companies in terms of how they're dealing with these new requirements? >> They are still trying to understand the requirements and interpret the requirements, coming to terms with what that really means. For example I met with a customer and they are a multi-national company. They have data centers across different geos and they asked me: I have somebody from Asia trying to query the data, so the query should go to Europe, but the query processing should not happen in Asia, the query processing all should happen in Europe, and only the output of the query should be sent back to Asia. You won't be able to think in these terms before the GDPR guidance era. >> Right, exceedingly complicated. >> Decoupling storage from processing enables those kinds of fairly complex scenarios for compliance purposes. >> It's not just about the access to data, now you are getting into where the processing happens, where the results are getting displayed, so we are getting... >> Severe penalties for not doing that, so your customers need to keep up. There was an announcement at this show, at DataWorks 2018, of an IBM Hortonworks solution, IBM Hosted Analytics with Hortonworks. I wonder if you could speak a little bit about that, Pandit, in terms of what's provided. It's a subscription service? If you could tell us what subset of IBM's analytics portfolio is hosted for Hortonworks' customers? >> Sure, as you said, it is a hosted offering. Initially we are starting off as a base offering with three products: it will have HDP, Big SQL (IBM Db2 Big SQL), and DSX, Data Science Experience. Those are the three solutions. Again, as I said, it is hosted on IBM Cloud, so customers have a choice of different configurations they can choose, whether it be VMs or bare metal. I should say this is probably the only offering, as of today, that offers a bare metal configuration in the cloud. >> It's geared to data scientist developers, and machine-learning models, they'll build the models and train them in IBM Cloud, but in a hosted HDP in IBM Cloud. Is that correct? >> Yeah, I would rephrase that a little bit. There are several different offerings on the cloud today, and we can think about them, as you said, for ad-hoc or ephemeral workloads, also geared towards low cost. Think about this offering as taking your on-prem data center experience directly onto the cloud. It is geared towards very high performance. The hardware and the software are all configured, optimized for providing high performance, not necessarily for ad-hoc workloads or ephemeral workloads; they are capable of handling massive, sticky workloads. It's not meant for: I turn on this massive computing power for a couple of hours and then switch it off. Rather, I'm going to run these massive workloads as if it is located in my data center, that's number one. It comes with the complete set of HDP. If you think about it, currently in the cloud you have Hive and HBase, the SQL engines and the storage all separate; security is optional, governance is optional. This comes with the whole enchilada. It has security and governance all baked in.
It provides the option to use Big SQL, because once you get on Hadoop, the next experience is: I want to run complex workloads, I want to run federated queries across Hadoop as well as other data storage. How do I handle those? And then it comes with Data Science Experience, also configured for best performance and integrated together. As a part of this partnership, as I mentioned earlier, we have made progress towards providing this story of an end-to-end solution. The next steps of that are: yes, I can say that it's an end-to-end solution, but do the products look and feel as if they are one solution? That's what we are getting into, and we have featured some of those integrations. For example Big SQL, an IBM product, we have been working on integrating it very closely with HDP. It can be deployed through Ambari, it is integrated with Atlas and Ranger for security. We are improving the integrations with Atlas for governance. >> Say you're building a Spark machine learning model inside of DSX on HDP, within IHAH (mumbles), IBM Hosted Analytics with Hortonworks, on HDP 3.0, can you then containerize that machine learning Spark model and then deploy it into an edge scenario? >> Sure, first was Big SQL, the next one was DSX. DSX is integrated with HDP as well. We could run DSX workloads on HDP before, but what we have done now is: if you want to run the DSX workloads, I want to run a Python workload, I need to have Python libraries on all the nodes that I want to deploy to. Suppose you are running a big cluster, a 500-node cluster. I need to have Python libraries on all 500 nodes and I need to maintain the versioning of it. If I upgrade the versions then I need to go and upgrade and make sure all of them are perfectly aligned. >> In this first version will you be able to build a Spark model and a TensorFlow model and containerize them and deploy them? >> Yes. >> Across a multi-cloud and orchestrate them with Kubernetes to do all that meshing, is that a capability now or planned for the future within this portfolio? >> Yeah, we have that capability demonstrated here today, so that is a new integration. We can run a virtual, we call it a virtual Python environment. DSX can containerize it and run it against the data that's enclosed in the HDP cluster. Now we are making use of both the data in the cluster, as well as the infrastructure of the cluster itself, for running the workloads. >> In terms of the layers stacked, is it also incorporating the IBM distributed deep-learning technology that you've recently announced? Which I think is highly differentiated, because deep learning is increasingly becoming a set of capabilities that are across a distributed mesh, playing together as if they're one unified application. Is that a capability now in this solution, or will it be in the near future? DDL, distributed deep learning? >> No, we have not yet. >> I know that's on the Power AI platform currently, gotcha. >> It's what we'll be talking about at next year's conference. >> That's definitely on the roadmap. We are starting with the base configuration of bare metal and VM configurations; the next one is, depending on how the customers react to it, definitely we're thinking about bare metal with GPUs optimized for TensorFlow workloads. >> Exciting, we'll stay tuned in the coming months and years, I'm sure you guys will have that. >> Pandit, thank you so much for coming on theCUBE. We appreciate it. I'm Rebecca Knight for James Kobielus. We will have more from theCUBE's live coverage of DataWorks, just after this.
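As background to the "virtual Python environment" point above: a common way to avoid installing libraries by hand on every node is to ship a packed environment along with the Spark job itself. The sketch below assumes a YARN-backed cluster and an environment archive built with a tool such as conda-pack; the paths, names, and exact configuration keys are illustrative and vary by distribution and version, so treat this as a rough outline rather than the DSX mechanism itself.

```python
from pyspark.sql import SparkSession

# Hypothetical archive built beforehand (e.g. `conda pack -o myenv.tar.gz`) and
# uploaded to HDFS. The "#environment" suffix unpacks it as ./environment inside
# each container, so executors use that interpreter instead of a per-node install.
spark = (
    SparkSession.builder
    .appName("packed-env-sketch")
    .config("spark.yarn.dist.archives", "hdfs:///envs/myenv.tar.gz#environment")
    .config("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
    .getOrCreate()
)

# Libraries imported inside executor-side code now resolve against the shipped
# environment, so a 500-node cluster does not need hand-maintained Python installs.
```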

Published Date : Jun 19 2018



Chris Boots, Quadrocopter | Airworks 2017


 

>> Hey, welcome back everybody, Jeff Frick here with the Cube. We are here in Denver, Colorado at the DJI AirWorks show, it's their second show; about 600 people talking about commercial applications for the DJI drone platform. Really exciting agriculture, construction, public safety, no fun stuff, well, it's all kind of fun, really about the commercial applications, and we're excited to have Chris Boots with us, he is the chief engineer of Quadrocopter. Chris, good to see you. >> Good to see you. >> So we talked a little bit about what Quadrocopter does, and you're really into the enterprise space, these are not platforms that are generally available, you've got to get them through a dealer, they're expensive, they're complicated pieces of equipment, and that's a place you guys have been playing for a long time. >> Absolutely. For example, the Wind Series was unveiled today, the 4 and the 8; AirWorks 2016 introduced the Wind 1 and 2, and what these are, are basically universal platforms that allow customers to put various different, whether it be gimbals or sensors, it's kind of just a blank slate DJI product. That way you're not constrained to the limitations of an M200 or an Aspire or anything like that. When Quadrocopter began almost a decade ago, we prided ourselves on delivering custom-tailored systems to various different customer needs, so we felt right at home when DJI unveiled the Wind series. >> So really what you mean is it's kind of stripped down to its bare bones components so that you can design it at whatever payloads you want for the specific application, and they're also big, heavy lifters, right? We saw the agricultural one, I think it holds like two and a half gallons, 22 pounds of liquid, so these are also heavy lift machines, these are not little Mavics or Sparks. >> Yeah, precisely, yeah. If you need to lift something lightweight, there's the Wind 1. If you need to lift something extremely heavy, there's the Wind 4 and the Wind 8 which can lift well over 20 to 25 pounds of payload, so you're lifting some big stuff with this. >> So when you talk to enterprise customers, and kind of their journey into getting into and using a drone platform for their business process, how do they get started, you know, what do you see as kind of their first steps where people have some success and then you know, build into more of a fleet if you will, integrate it more to their processes? How do most companies get started? Do they say yeah this looks like a cool platform, how do we use it? >> That's kind of exactly how it happens. It just all starts with an idea. Most of our customers if they're not already existing UAS corporations and companies, they can be just somebody like you or I that comes up with an interesting UAV solution and you know, they do some Google searching, they do some research, they find something like this doesn't exist. Where do I go from here? So it doesn't take them very long to start making phone calls, and more often than not they call us at Quadrocopter, and one of my pet peeves is I don't like saying no to a customer when they have an idea, so that basically takes their idea, it takes our resources whether it be DJI or third party integrations, and making their dream a reality, so it's not always cinematography and cameras, it can be sensors or you name it, so yeah! 
So what are some of the more innovative uses that you've seen people use the DJI platform for, that you would have never thought of — most people on the street would never have an idea that this is a useful application for this platform? >> Sure, well, I'll talk a little bit about the latest Wind application that we designed this year. We utilized the larger of the four copters, the Wind 8, which is an octocopter, and the client had the idea of inspecting methane pipelines. Now these pipelines need to be inspected every six months per governmental regulation. Currently, the only way that most companies like BP and other gas companies are doing this is on foot, by ATV, with handheld sensors, or on a large scale with rotorcraft like helicopters and people hanging off the sides of them, again with handheld devices. >> And what, they've got specialty sensors that they're looking for leaks with, and this and that — it's not really visual inspection I take it, or is it both probably? >> A lot of times they use either a laser-based or a thermal-based handheld sensor, so like a FLIR thermal camera. In our case, we didn't want to be constrained by the environmental influences that thermal can sometimes have — whether it's cold or it's dark or bright out, it can really skew the results — so in our case, it was our goal to find something that isn't influenced by the external environment. So we officially landed on a laser-based methane detector and paired that with the Wind 8, which then flies the pipeline route in 10 to 20 foot segments, comes back, and that data is used in mapping software to find out what the results were along that pipeline. If it is found that something is leaking, the file that is pulled off the aircraft will say exactly where it was and how concentrated it was at that exact point, at which point somebody on the ground can inspect that further. It totally gets rid of the whole safety issue of somebody on the ground or in the air, and the expensive part of man power, of walking a pipeline. We can do it more efficiently, we can do it way more safely, and we get as good if not better results. >> The 10 to 20 feet doesn't sound like very long. Is that just because of, >> 20 miles. >> Oh, 20 miles. You said feet. So it's 10 to 20 mile runs, then in parts they take the data and run it again. And what was the weight of that payload? >> The sensor itself doesn't really weigh much, I'd say two or three pounds. Most of the payload on the Wind 8 is actually the batteries. So the whole all-up weight of the craft is somewhere around 30 pounds. It's not extremely heavy, but for endurance's sake, she'll fly for well over an hour. So at 10, 15 miles per hour, you can really cover some pipeline with battery to spare. >> So was that an initial trial for this customer, to try this solution? >> Yeah, this particular combination of sensor and copter had never been tried before, so it very much is an industry first in this regard, at least with DJI and the sensor. >> So where do they want to go next? I mean, it begs the question. The whole theme of today's keynote was like scale — no longer single operator, single machine, single data, but really starting to think in terms of fleets and multi-units — so is that somewhere where this particular customer wants to go, or how do you see it progressing for them?
>> This particular client is a third party, so they aren't directly with BP, but BP often, I don't want to speak on behalf of BP, but a lot of gas companies outsource their inspection services to other different companies, so this particular land surveying company will use this and meet their demands of inspecting whatever section of pipeline that they're designated every six months. >> Yeah, that's great, alright, Chris, thank you for spending a few minutes, I mean that's a great case study and using the big heavy lift stuff, much more fun probably than the Spark! >> Absolutely, yeah, if you guys have any questions, hit me up at quadrocopter.com! >> Alright, he's Chris Boots, I'm Jeff Frick, you are watching the Cube. We're at DJI AirWorks 2017 in Denver. Catch you next time; thanks for watching.
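As a rough sketch of the post-flight step Chris describes — geotagged methane readings pulled off the aircraft are mapped, and anything above a threshold is flagged with its location and concentration for a ground crew to verify — here is a small, hypothetical Python example. The record format, field names, and threshold are invented for illustration, not the actual sensor's output format.

```python
# Flag geotagged methane readings that exceed a concentration threshold.
from dataclasses import dataclass

@dataclass
class Reading:
    lat: float
    lon: float
    ppm_m: float          # path-integrated methane concentration (ppm·m)

LEAK_THRESHOLD_PPM_M = 100.0

def flag_leaks(readings: list) -> list:
    return [r for r in readings if r.ppm_m >= LEAK_THRESHOLD_PPM_M]

if __name__ == "__main__":
    flight_log = [
        Reading(46.87, -113.99, 12.0),
        Reading(46.88, -114.01, 240.0),   # suspicious spike
        Reading(46.89, -114.03, 8.5),
    ]
    for leak in flag_leaks(flight_log):
        print(f"possible leak at ({leak.lat}, {leak.lon}): {leak.ppm_m} ppm·m")
```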

Published Date : Nov 9 2017


Veeru Ramaswamy, IBM | CUBEConversation


 

(upbeat music) >> Hi, we're at the Palo Alto studio of SiliconANGLE Media and theCUBE. My name is George Gilbert, we have a special guest with us this week, Veeru Ramaswamy, who is VP of the IBM Watson IoT platform, and he's here to fill us in on the incredible amount of innovation and growth that's going on in that sector of the world, and we're going to talk more broadly about IoT and digital twins as a broad new construct that we're seeing in how to build enterprise systems. So Veeru, good to have you. Why don't you introduce yourself and tell us a little bit about your background. >> Thanks George, thanks for having me. I've been in the technology space for a long time and if you look at what's happening in the IoT, in the digital space, it's pretty interesting the amount of growth, the amount of productivity and efficiency the companies are trying to achieve. It is just phenomenal and I think we're now turning off the hype cycle and getting into real actions in a lot of businesses. Prior to joining IBM, I was an officer and senior VP of data science with Cablevision, where I led the data strategy for the entire company, and prior to that I was at GE, one of the first two guys who actually built the San Ramon digital center — GE's digital center, it's a center of excellence — looking at different kinds of IoT related projects and products, along with leading some of the UX and the analytics and the collaboration or the social integration. So that's the background. >> So just to set context, 'cause this is as we were talking before: there was another era when Steve Jobs was talking about the NeXT workstation and he talked about object orientation, and then everything was sprinkled with fairy dust about objects. So help us distinguish between IoT and digital twins, which GE was brilliant in marketing 'cause that concept everyone could grasp. Help us understand where they fit. >> The idea of digital twin is, how do you abstract the actual physical entity out there in the world, and create an object model out of it. So it's very similar in that sense to what happened in the 90s for Steve Jobs, and if you look at that object abstraction, it is what is now happening in the digital twin space from the IoT angle. The way we look at IoT is we look at every sensor which is out there which can actually produce a metric; every device which produces a metric we consider as a sensor, so it could be as simple as pressure, temperature, humidity sensors or it could be as complicated as cardio sensors in your healthcare, and so on and so forth. The concept of bringing these sensors into the digital world — the data from that physical world to the digital world — is what is making it even more abstract from a programming perspective. >> Help us understand, so it sounds like we're going to have these fire hoses of data. How do we organize that into something that someone who's going to work on that data, someone who is going to program to it — how do they make sense out of it the way a normal person looks at a physical object? >> That's a great question. We're looking at a sensor as a device that we can measure from, and that we call a device twin.
Taking the data that's coming from the device, we call that a device twin, and then your physical asset, the physical thing itself, which could be elevators, jet engines, anything — that physical asset we have is what we call the asset twin — and there's a hierarchical model that we believe has to exist for the digital twin to be actually constructed from an IoT perspective. The asset twins will basically encompass some of the device twins, and then we actually take that and represent the digital twin of that particular asset in the physical world. >> So that would be sort of like, as we were talking about earlier, like an elevator might be the asset but the devices within it might be the bricks and the pulleys and the panels for operating it. >> Veeru: Exactly. >> And it's then the hierarchy of these — or in manufacturing terms, the bill of materials — that becomes a critical part of the twin. What are some other components of this digital twin? >> When we talk about digital twin, we don't just take the blueprint as schematics. We also think about the system, the process, the operation that goes along with that physical asset, and when we capture that and are able to model that in the digital world, then that gives you the ability to do a lot of things where you don't have to do it in the physical world. For instance, you don't have to train your people on the physical world — if it is periodical systems and so on and so forth, you could actually train them in the digital world and then be able to allow them to operate on the physical world whenever it's needed. Or if you want to increase your productivity or efficiency doing predictive models and so forth, you can test all the models in your digital world and then you actually deploy them in your physical world. >> That's great for context setting. How would you think of — this digital twin is more than just a representation of the structure, but it's also got the behavior in there. So in a sense it's a sensor and an actuator, in that you could program the real world. What would that look like? What things can you do with that sort of approach? >> So when you actually have the data coming, this humongous amount of terabyte data that comes from the sensors, once you model it and you get the insights out of that, based on the insight you can take an actionable outcome — that could be turning off an actuator or turning on an actuator, and simple things like, in the elevator case, open the door, shut the door, move the elevator up, move the elevator down, etc. All of these things can be done from a digital world. That's where it makes a humongous difference. >> Okay, so it's a structured way of interacting with the highly structured world around us. >> Veeru: That's right. >> Okay, so it's not the narrow definition that many of us have been used to, like an airplane engine or the autonomous driving capability of a car. It's more general than that. >> Yeah, it is more general than that. >> Now let's talk about — having sort of set context with the definition so everyone knows we're talking about a broader sense that's going on — what are some of the business impacts in terms of operational efficiency, maybe just the first-order impact? But what about the ability to change products into more customizable services that have SLAs, or entirely new business models, including engineered to order instead of make to stock. Tell us something about that hierarchy of value. >> That's a great question.
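To make the device-twin / asset-twin hierarchy Veeru describes a bit more concrete, here is a minimal Python sketch: each sensor is mirrored by a device twin, an asset twin aggregates its device twins, and an action (open or close the elevator door) can be decided on the digital side. The class names, fields, and the toy decision rule are illustrative assumptions, not the Watson IoT Platform API.

```python
# Minimal device-twin / asset-twin hierarchy with a toy actuator decision.
from dataclasses import dataclass, field

@dataclass
class DeviceTwin:
    device_id: str
    metric: str
    last_value: float = 0.0

    def update(self, value: float) -> None:
        self.last_value = value

@dataclass
class AssetTwin:
    asset_id: str
    devices: dict = field(default_factory=dict)

    def ingest(self, device_id: str, value: float) -> None:
        self.devices[device_id].update(value)

    def decide_action(self) -> str:
        # Toy rule: if the door-load sensor reads high, hold the door open.
        load = self.devices["door_load"].last_value
        return "open_door" if load > 0.8 else "close_door"

if __name__ == "__main__":
    elevator = AssetTwin("elevator-42", {
        "door_load": DeviceTwin("door_load", "normalized load"),
        "cabin_temp": DeviceTwin("cabin_temp", "celsius"),
    })
    elevator.ingest("door_load", 0.93)
    print(elevator.decide_action())   # -> open_door
```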
You're talking about things like operations optimization and predictive maintenance and all of that, which you can actually do from the digital world — it's all on the digital twin. You also can look into various kinds of business models: now, instead of a product, you can actually have a service out of the product and then be able to have different business models like power by the hour, pay per use and those kinds of things. So these kinds of models, business models, can be tried out. Think about what's happening in the world of Airbnb and Uber — nobody owns any asset but they are still able to make revenue by pay per use or power by the hour. I think that's an interesting model. I don't think it's been tested out so much in the physical asset world, but I think that could be an interesting model that you could actually try. >> One thing that I picked up at the Genius of Things event in Munich in February was that we really have to rethink software markets, in the sense that IBM's customers become, in a way, your channel — sometimes because they sell to their customers, almost like a supply chain master or something similar — and also pricing changes: we've already migrated or are migrating from perpetual licenses to software as a service, but now we could do unit pricing or SLA-based pricing, in which case you as a vendor have to start getting very smart about the risk you owe your customers in meeting an SLA, so it's almost more like insurance, actuarial modeling. >> Correct, so the way we want to think about it is, how can we make our customers more — what do you call — monetizable, their products to be monetizable with their customers, and then in that case, when we enter into a service level agreement with our customers, there's always that risk of: what do we deliver to make their products and services more successful? There's always a risk component, which we will have to work on with the customers to make sure that the combined model of what our customers are going to deliver is going to be more beneficial, contributing more to both bottom line and top line. >> That implies that you're modeling — someone's modeling — the risk from you, the supplier, to your customer as vendor, to their customer. >> Right. >> That sounds tricky. >> I'm pretty sure we have a lot of financial risk modeling entered into our SLAs when we actually go to our customers. >> So that's a new business model for IBM, for IBM's sort of supply-chain-master type customers, if that's the right word. As this capability, this technology, pervades more industries, customers become software vendors or, if not software vendors, services vendors for software-enhanced products or service-enhanced products. >> Exactly, exactly. >> Another thing — I'd listened to a briefing by IBM Global Services where they thought, ultimately, this might end up where far more industries are engineered to order instead of make to stock. How would this enable that? >> I think the way we want to think about it is that most of the IoT-based services will actually start by co-designing and co-developing with your customers. And that's where you're going to start. That's how you're going to start. You're not going to say, here's my 100 data centers and you bring your billion devices and connect, and it's going to happen. We are going to start that way and then our customers are going to say, hey, by the way, I have these use cases that we want to start doing — so that's why the platform becomes so important.
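Purely as a back-of-the-envelope illustration of the actuarial flavor of SLA-based pricing George raises, here is a tiny hypothetical Python example: a pay-per-use charge adjusted by the expected SLA credit, weighted by an estimated probability of missing the SLA. All rates and probabilities are invented.

```python
# Expected revenue under usage-based pricing with a probabilistic SLA credit.
def expected_monthly_revenue(hours_used: float,
                             rate_per_hour: float,
                             sla_credit_fraction: float,
                             p_sla_breach: float) -> float:
    base_charge = hours_used * rate_per_hour                  # pay-per-use / power-by-the-hour
    expected_credit = base_charge * sla_credit_fraction * p_sla_breach
    return base_charge - expected_credit

if __name__ == "__main__":
    # 400 engine-hours at $50/hour, 20% credit if the SLA is missed,
    # with a 5% estimated chance of a breach.
    print(round(expected_monthly_revenue(400, 50.0, 0.20, 0.05), 2))  # 19800.0
```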
Once you have the platform, now you can scale individual silos as vertical use cases for them. We provide the platform and the use cases start driving on top of the platform. So the scale becomes much easier for the customers. >> So this sounds like the traditional way an application vendor might turn into a platform vendor — which is a difficult transition in itself — but you take a few use cases and then generalize into a platform. >> We call those application services. An application service basically draws on the core platform services, which actually provide you the capabilities. So for instance, take asset management: asset management can be done on an oil and gas rig, you can look at asset management in a power turbine, you can look at asset management in a jet engine. You can do asset management across many different verticals, but that is a common horizontal application, so most of the time you get 80% of your asset management APIs, if you will. Then you are able to scale across multiple different vertical applications and solutions. >> Hold that thought, 'cause we're going to come back to joint development and leveraging expertise from vendor and customer and sharing that. Let's talk just at a high level: one of the things that I keep hearing is that in Europe, Industry 4.0 is sort of the hot topic, and in the States, it's more digital twins. Help parse that out for us. >> So the way we believe the digital twin should be viewed is a component view. What we mean by the component view is that you have your knowledge graph representation of the real assets in the digital world, and then you bring in your IoT sensors and connections to the models; then you have your functional, logical, physical models that you want to bring into your knowledge graph; and then you also want to be able to give the ability to search, visualize, analyze — kind of an intelligent experience for the end consumer; and then you want to bring your simulation models — when you do the actual simulations in digital, to bring them in there — and then your enterprise asset management, your ERP systems, all of that. And then, when you're able to build a knowledge graph, that's when the digital twin really connects with your enterprise systems — sort of bringing the OT and the IT together. >> So this is sort of to try and summarize, 'cause there are a lot of moving parts in there. You've got the product hierarchy — which product people call the bill of materials — sort of the explosion of parts in an assembly, sub-assembly, and that provides like a structure, a data model; then the machine learning models and the different types of models that represent behavior; and then when you put a knowledge graph across that structure and behavior, is that what makes it simulation ready? >> Yes, so you're talking about entities and connecting these entities with the actual relationships between them. That's the graph that holds the relations between your nodes and your links. >> And then integrating the enterprise systems, which may be the lower-level operational systems — that's how you affect business processes. >> Correct. >> For efficiency or optimization, automation. >> Yes, take a look at what you can do with, like, a shop floor optimization.
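A small sketch of the "component view" knowledge graph described above: nodes for twins, models, and enterprise records, with typed edges so a query can walk from a physical asset to its simulation model or its ERP entry. The node names, edge types, and the tiny adjacency-list class are illustrative assumptions, not IBM's implementation.

```python
# Toy knowledge graph linking an asset twin to its devices, models, and ERP records.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.edges = defaultdict(list)           # node -> [(relation, node), ...]

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def neighbors(self, node: str, relation: str) -> list:
        return [dst for rel, dst in self.edges[node] if rel == relation]

if __name__ == "__main__":
    kg = KnowledgeGraph()
    kg.add_edge("asset:turbine-7", "has_device", "device:vibration-sensor-3")
    kg.add_edge("asset:turbine-7", "simulated_by", "model:thermal-sim-v2")
    kg.add_edge("asset:turbine-7", "tracked_in", "erp:work-order-8812")
    print(kg.neighbors("asset:turbine-7", "tracked_in"))   # ['erp:work-order-8812']
```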
You have all the bill of materials you need to know from your existing ERP systems, and then you will actually have the actual real parts that are coming to your shop floor to manage; and now, supposing, depending on whether you want to repair, you want to replace, you want an overhaul, you want to modify — whatever that is — you want to look at your existing bill of materials and see, okay, do I have it first, do we need more? Do we need to order more? So your auditing system naturally gets integrated into that, and then you have to integrate the data that's coming from these models and the availability of the existing assets with you. You can integrate it and say how fast can you actually start moving these out of your shop, into the... >> Okay, that's where you translate essentially what's more like intelligence about an object, or a rich object, into sort of operational implications. >> Veeru: Yes. >> Okay, operational process. Let's talk about customer engagement so far. There's intense interest in this. I remember in the Munich event, they were like, they had to shut off attendance because they couldn't find a big enough venue. >> Veeru: That's true. >> So what are the characteristics of some of the most successful engagements, or the ones that are promising — maybe it's a little early to say successful. >> So, I think the way you can definitely see success from customer engagement is two-fold. One is show what's possible — show what's possible, after all, the desire to connect, collection of data, all of that — so that's one part of it. The second part is understand the customer. The customer has certain requirements in their existing processes and operations. Understand that and then deliver based on what solutions they are expecting, what applications they want to build. How you bring them together is what we're thinking about — that Munich center you talked about. We are actually bringing in chip manufacturers, sensor manufacturers, device manufacturers. We are bringing in network providers. We are bringing in SIs, system integrators — all of them into the fold — and showing what is possible, and then your partners enable you to get to market faster. That's how we see the engagement with the customer should happen, in a much faster manner, and show them what's possible. >> It sounds like in the chip industry, Moore's law for many years — it wasn't deterministic that we would double things every 18 months or two years; it was actually an incredibly complex ecosystem web where everyone's sort of product release cycles were synchronized so as to enable that. And it sounds like you're synchronizing the ecosystem to keep up. >> Exactly. The success of a particular organization's IoT efforts is going to depend on how you build this ecosystem and how you establish that ecosystem to get to market faster. That's going to be extremely key for all your integration efforts with your customer. >> Let's start narrowly with you, IBM: what are the key skills that you feel you need to own, starting from sort of the base rocket scientists, you know, who not only work on machine learning models but come up with new algorithms on top of, say, TensorFlow or something like that, and all the way up to the guys who are going to work in conjunction with the customer to apply that science to a particular industry. How does that hold together? >> So it all starts on the platform.
On the platform side we have all the developers, the engineers who build this platform — all the device connection and all of that, to make the connections. So you need the highest-caliber software development engineers to build these on the platform, and then you also need the solution builders, who are in front of the customer understanding what kind of solutions they want to build. Solutions could be anything: it could be predictive maintenance, it could be as simple as asset management, it could be remote monitoring and diagnostics. It could be any of these solutions that you want to build, and then the solution builders and the platform builders work together to make sure that it's a holistic approach for the customer at the final deployment. >> And how much of the solution builder is typically, in the early stages, IBM, or is there some expertise that the customer has to contribute — almost like agile development, but not two programmers, but like 500 and 500 from different companies. >> 500 is a bit too much. (laughs) I would say this is the concept of co-designing and co-development. We definitely want the developers, the engineers, the subject matter experts from our customers, and we also need our analytics experts and software developers to come and sit together and understand what's the use case. How do we actually bring in that optimized solution for the customer. >> What level of expertise, or what type of expertise, are the developers who are contributing to this effort — in terms of, do they have to, if you're working with manufacturing, let's say auto manufacturing, do they have to have automotive software development expertise, or are they more generically analytics and the automotive customer brings in the specific industry expertise? >> It depends. In some cases we have GBS, for instance — we have dedicated services for that particular vertical, service providers — and we understand some of this industry knowledge. In some cases we don't; in some cases it actually comes from the customer. But it has to be an aggregation of the subject matter experts with our platform developers and solution developers sitting together, finding what's the solution. Literally going through — think about how we actually bring in the UX. What does a typical day of a persona look like? We always, by the way, believe it's augmented intelligence, which means the human and the machine work together, rather than a complete AI that gives you the answer for everything you ask for. >> It's a debate that keeps coming up. Doug Engelbart sort of had his own answer like 50 years ago, which was — he sort of set the path for modern computing by saying we're not going to replace people, we're going to augment them, and this is just a continuation of that. >> It's a continuation of that. >> Like UX design — it sounds like someone on the IBM side might be talking to the domain expert and the customer to say, how does this workflow work. >> Exactly. So we have these design thinking, design sessions with our customers, and then based on that we take that knowledge, take it back, we build our mock-ups, we build our wireframes, visual designs, and the analytics and software that goes behind it, and then we provide it on top of the platform. So most of the platform work — the standard, what do you call, table-stakes connections, collection of data — all of that is already existing; then it's one level above as to what particular solution a customer wants. That's when we actually...
>> In terms of getting the customer organization aligned to make this project successful, what are some of the different configurations? Who needs to be a sponsor? Where does budget typically come from? How long are the pilots? That sort of stuff so to set expectations. >> We believe in all the agile thinking, agile development and we believe in all of that. It's almost given now. So depending on where the customer comes from so the customer could actually directly come and sign up to our platform on the existing cloud infrastructure and then they will say, okay we want to build applications then there are some customers really big customers, large enterprises who want to say, give me the platform, we have our solution folks. We will want to work on board with you but we also want somebody who understands building solutions. We integrate with our solution developers and then we build on top of that. They build on top of that actually. So you have that model as well and then you have a GBS which actually does this, has been doing this for years, decades. >> George: Almost like from the silicon. >> All the way up to the application level. >> When the customer is not outsourcing completely, The custom app that they need to build in other words when when they need to go to GBS Global Business Services, whereas if they want a semi-packaged app, can they go to the industry solutions group? >> Yes. >> I assume it's the IoT, Industry Solutions Group. >> Solutions group, yes. >> They then take a it's almost maybe a framework or an existing application that needs customization. >> Exactly so we have IoT-4. IoT for manufacturing, IoT for retail, IoT for insurance IoT for you name it. We have all these industry solutions so there would be some amount of template which is already existing in some fashion so when GBS gets a request to say here is customer X coming and asking for a particular solution. They would come back to IoT solutions group to say, they already have some template solutions from where we can start from rather than building it from scratch. You speed to market again is much faster and then based on that, if it's something that is to be customizable, both of them work together with the customer and then make that happen, and they leverage our platform underneath to do all the connection collection data analytics and so on and so forth that goes along with that. >> Tell me this from everything we hear. There's a huge talent shortage. Tell me in which roles is there the greatest shortage and then how do different members of the ecosystem platform vendors, solution vendors sort of a supply-chain master customers and their customers. How do they attract and retain and train? >> It's a fantastic question. One of the difficulties both in the valley and everywhere across is that three is a skill gap. You want advanced data scientists you want advances machinery experts, you want advanced AI specialists to actually come in. Luckily for us, we have about 1000 data scientists and AI specialists distributed across the globe. >> When you say 1000 data scientists and AI specialists, help us understand which layer are they-- >> It could be all the way from like a BI person all the way to people who can build advanced AI models. >> On top of an engine or a framework. >> We have our Watson APIs from which we build then we have our data signs experience which actually has some of the models then built on top of what's in the data platform so we take that as well. 
There are many different ways by which we can actually bring the AM model missionary models to build. >> Where do you find those people? Not just the sort of band strengths that's been with IBM for years but to grow that skill space and then where are they also attracted to? >> It's a great question. The valley definitely has a lot of talent, then we also go outside. We have multiple centers of excellence in Israel, in India, in China. So we have multiple centers of excellence we gather from them. It's difficult to get all the talent just from US or just from one country so it's naturally that talent has to be much more improvement and enhanced all the wat fom fresh graduates from colleges to more experienced folks in the in the actual profession. >> What about when you say enhancing the pool talent you have. Could it also include productivity improvements, qualitative productivity improvements in the tools that makes machine learning more accessible at any level? The old story of rising obstruction layers where deep learning might help design statistical models by doing future engineering and optimizing the search for the best model, that sort of stuff. >> Tools are very, very hopeful. There are so many. We have from our tools to python tools to psychic and all of that which can help the data scientist. The key part is the knowledge of the data scientist so data science, you need the algorithm, the statistical background, then you need your applications software development background and then you also need the domestics for engineering background. You have to bring all of them together. >> We don't have too many Michaelangelos who are these all around geniuses. There's the issue of, how do you to get them to work more effectively together and then assuming even each of those are in short supply, how do you make them more productive? >> So making them more productive is by giving them the right tools and resources to work with. I think that's the best way to do it, and in some cases in my organization, we just say, okay we know that a particular person is skilled is up skilled in certain technologies and certain skill sets and then give them all the tools and resources for them to go on build. There's a constant education training process that goes through that we in fact, we have our entire Watson ED platform that can be learned on Kosera today. >> George: Interesting. >> So people can go and learn how to build a platform from a Kosera. >> When we start talking with clients and with vendors, things we hear is that and we were kind of I think early that calling foul but in the open source infrastructure big data infrastructure this notion of mix-and-match and roll your own pipeline sounded so alluring, but in the end it was only the big Internet companies and maybe some big banks and telcos that had the people to operate that stuff and probably even fewer who could build stuff on it. 
Do we do we need to up level or simplify some of those roles because mainstream companies can't have enough or won't will have enough data scientists or other roles needed to make that whole team work >> I think it will be a combination of both one is we need to up school our existing students with the stem background, that's one thing and the other aspect is, how do you up scale your existing folks in your companies with the latest tools and how can you automate more things so that people who may not be schooled will still be able to use the tool to deliver other things but they don't have to go to a rigorous curriculum to actually be able to deal with it. >> So what does that look like? Give us an example. >> Think of tools like today. There are a lot of BI folks who can actually build. BI is usually your trends and graphs and charts that comes out of the data which are simple things. So they understand the distribution and so on and so forth but they may not know what is the random model. If you look at tools today, that actually gives you to build them, once you give the data to that model, it actually gives you the outputs so they don't really have to go dig deep I have to understand the decision tree model and so on and so forth. They have the data, they can give the data, tools like that. There are so many different tools which would actually give you the outputs and then they can actually start building app, the analytics application on top of that rather than being worried about how do I write 1000 line code or 2000 line code to actually build that model itself. >> The inbuilt machine learning models in and intend, integrated to like pentaho or what's another example. I'm trying to think, I lost my, I having a senior moment. These happen too often now. >> We do have it in our own data science tools. We already have those models supported. You can actually go and call those in your web portal and be able to call the data and then call the model and then you'll get all that. >> George: Splank has something like that. >> Splank does, yes. >> I don't know how functional it is but it seems to be oriented towards like someone who built a dashboard can sort of wire up a model, it gives you an example of what type of predictions or what type of data you need. >> True, in the Splank case, I think it is more of BI tool actually supporting a level of data science moral support on the back. I do not know, maybe I have to look at this but in our case we have a complete data science experience where you actually start from the minute the data gets ingested, you can actually start the storage, the transformation, the analytics and all of that can be done in less than 10 lines of coding. You can just actually do the whole thing. You just call those functions then it will the right there in front of you. So in twin you can do that. That I think is much more powerful and there are tools, there are many many tools today. >> So you're saying that data science experience is an enter in pipeline and therefore can integrate what were boundaries between separate products. >> The boundary is becoming narrower and narrower in some sense. You can go all the way from data ingestion to the analytics in just few clicks or few lines of course. That's what's happening today. Integrated experience if you will. 
>> That's different from the specialized skills where you might have a tri-factor, prexada or something similar as for the wrangling and then something else for sort of the the visualizations like Altracks or Tavlo and then into modeling. >> A year or so ago, most of data scientists try to spend a lot of time doing data wrangling because some of the models, they can actually call very directly but the wrangling is actually where they spend their time. How do you get the data crawl the data, cleanse the data, etc. That is all now part of our data platform. It is already integrated into the platform so you don't have to go through some of these things. >> Where are you finding the first success for that tool suite? >> Today it is almost integrated with, for instance, I had a case where we exchange the data we integrate that into what's in the Watson data platform and the Watson APIs is a layer above us in the platform where we actually use the analytics tools, more advanced AI tools but the simple machinery models and so on and so forth is already integrated into as part of the Watson data platform. It is going to become an integrated experience through and through. >> To connect data science experience into eWatson IoT platform and maybe a little higher at this quasi-solution layer. >> Correct, exactly. >> Okay, interesting. >> We are doing that today and given the fact that we have so much happening on the edge side of things which means mission critical systems today are expecting stream analysts to get to get insights right there and then be able to provide the outcomes at the edge rather than pushing all the data up to your cloud and then bringing it back down. >> Let's talk about edge versus cloud. Obviously, we can't for latency and band width reasons we can't forward all the data to the cloud, but there's different use cases. We were talking to Matasa Harry at Sparks Summit and one of the use cases he talked about was video. You can't send obviously all the video back and you typically on an edge device wouldn't have heavy-duty machine learning, but for video camera, you might want to learn what is anomalous or behavior call out for that camera. Help us understand some of the different use cases and how much data do you bring back and how frequently do retrain the models? >> In the case of video, it's so true that you want to do a lot of any object ignition and so on and so forth in the video itself. We have tools today, we have cameras outside where if a van goes it detect the particular object in the video live. Realtime streaming analytics so we can do that today. What I'm seeing today in the market is, in the transaction between the edge and the cloud. We believe edge is an extension of the cloud, closer to the asset or device and we believe that models are going to get pushed from the cloud, closer to the edge because the compute capacity and storage and the networking capacity are all improving. We are pushing more and more computing to their devices. >> When you talk about pushing more of the processing. you're talking more about predicts and inferencing then the training. >> Correct. >> Okay. >> I don't think I see so much of the training needs to be done at the edge. >> George: You don't see it. >> No, not yet at least. We see the training happening in the cloud and then once a train, the model has been trained, then you come to a steady, steady model and then that is the model you want to push. When you say model, it could be a bunch of coefficients. 
That could be pushed onto the edge and then when a new data comes in, you evaluate, make decisions on that, create insights and push it back as actions to the asset and then that data can be pushed back into the cloud once a day or once in a week, whatever that is. Whatever the capacity of the device you have and we believe that edge can go across multiple scales. We believe it could be as small with 128 MB it could be one or two which I see sitting in your local data center on the premise. >> I've had to hear examples of 32 megs in elevators. >> Exactly. >> There might be more like a sort of bandwidth and latency oriented platform at the edge and then throughput and an volume in the cloud for training. And then there's the issue of do you have a model at the edge that corresponds to that instance of a physical asset and then do you have an ensemble meaning, the model that maps to that instance, plus a master canonical model. Does that work for? >> In some cases, I think it'll be I think they have master canonical model and other subsidiary models based on what the asset, it could be a fleet so you in the fleet of assets which you have, you can have, does one asset in the fleet behave similar to another asset in the fleet then you could build similarity models in that. But then there will also be a model to look at now that I have to manage this fleet of assets which will be a different model compared to action similarity model, in terms of operations, in terms of optimization if I want to make certain operations of that asset work more efficiently, that model could be completely different with when compared to when you look at similarity of one model or one asset with another. >> That's interesting and then that model might fit into the information technology systems, the enterprise systems. Let's talk, I want to go get a little lower level now about the issue of intellectual property, joint development and sharing and ownership. IBM it's a nuanced subject. So we get different sort of answers, definitive answers from different execs, but at this high level, IBM says unlike Google and Facebook we will not take your customer data and make use of it but there's more to it than that. It's not as black-and-white. Help explain that for so us. >> The way you want to think is I would definitely paired back what our chairman always says customers' data is customers' data, customer insights is customer insights so they way we look at it is if you look at a black box engine, that could be your analytics engine, whatever it is. The data is your inputs and the insights are our outputs so the insights and outputs belong to them. we don't take their data and marry it with somebody else's data and so forth but we use the data to train the models and the model which is an abstract version of what that engine should be and then more we train the more better the model becomes. And then we can then use across many different customers and as we improve the models, we might go back to the same customers and hey we have an improved model you want to deploy this version rather than the previous version of the model we have. We can go to customer Y and say, here is a model which we believe it can take more of your data and fine tune that model again and then give it back to them. It is true that we don't actually take their data and share the data or the insights from one customer X to another customer Y but the models that make it better. 
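As a rough Python sketch of the train-in-the-cloud, score-at-the-edge loop Veeru outlines — a trained model reduced to "a bunch of coefficients" is pushed to the edge, new readings are scored and acted on locally, and buffered data goes back to the cloud on a slow cadence for retraining — consider the following. The linear model, thresholds, and class names are illustrative assumptions, not the Watson IoT edge runtime.

```python
# Edge node holding cloud-trained coefficients: score locally, sync raw data daily.
from dataclasses import dataclass, field

@dataclass
class EdgeModel:
    weights: list
    bias: float
    threshold: float = 0.5

    def score(self, features: list) -> float:
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def act(self, features: list) -> str:
        return "raise_alert" if self.score(features) > self.threshold else "no_action"

@dataclass
class EdgeNode:
    model: EdgeModel
    buffer: list = field(default_factory=list)

    def on_reading(self, features: list) -> str:
        self.buffer.append(features)          # retained for the daily sync
        return self.model.act(features)

    def daily_sync(self) -> list:
        batch, self.buffer = self.buffer, []
        return batch                          # shipped to the cloud for retraining

if __name__ == "__main__":
    node = EdgeNode(EdgeModel(weights=[0.8, -0.2], bias=0.1))
    print(node.on_reading([0.9, 0.3]))        # scored locally -> raise_alert
    print(len(node.daily_sync()), "readings queued for the cloud")
```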
How do you make that model more intelligent is what out job is and that's what we do. >> If we go with precise terminology, it sounds like when we talk about the black box having learned from the customer data and the insights also belonging to the customer. Let's say one of the examples we've heard was architecture engineering consulting for large capital projects has a model that's coming obviously across that vertical but also large capital projects like oil and gas exploration, something like that. There, the model sounds like it's going to get richer with each engagement. And let's pin down so what in the model is sort of not exposed to the next customer and what part of the model that has gotten richer does the next customer get the balance of? >> When we actually build a model, when we pass the data, in some cases, customer X data, the model is built out of customer X data may not sometimes work with the customer Y's data so in which case you actually build it from scratch again. Sometimes it doesn't. In some case it does help because of the similarity of the data in some instance because if the data from company X in oil gas is similar to company Y in oil gas, sometimes the data could be similar so in which case when you train that model, it becomes more efficient and the efficiency goes back to both customers. we will do that but there are places where it would really not work. What we are trying to do is. We are in fact trying to build some kind of knowledge bundles where we can actually what used to be a long process to train the model can ow shortened using that knowledge bundle of what we have actually gained. >> George: Tell me more about how it works. >> In retail for instance, when we actually provide analytics, from any kind of IoT sense, whatever sense of data this comes in we train the model, we get analytics used for ads, pushing coupons, whatever it is. That knowledge, what you have gained off that retail, it could be models of models, it could be metamodels, whatever you built. That can actually serve many different customers but the first customer who is trying to engage with us, you don't have any data to the model. It's almost starting from ground zero and so that would actually take a longer time when you are starting with a new industry and you don't have the data, it'll take you a longer time to understand what is that saturation point or optimization point where you think the model cannot go any further. In some cases, once you do that, you can take that saturated model or near saturated model and improve it based on more data that actually comes from different other segments. >> When you have a model that has gotten better with engagements and we've talked about the black box which produces the insights after taking in the customer data. Inside that black box there's like at the highest level we might call it the digital twin with the broad definition that we started with, then there's a data model which a data model which I guess could also be incorporated into the knowledge graft for the structure and then would it be fair to call the operational model the behavior? >> Yes, how does the system perform or behave with respect the data and the asset itself. >> And then underpinning that, the different models that correspond to the behaviors of different parts of this overall asset. So if we were to be really precise about this black box, what can move from one customer to the next and what what won't? 
>> The overall model, supposing I'm using a random data retrieval model, that remains but actual the coefficients are the feature rector, or whatever I use, that could be totally different for customers, depending on what kind of data they actually provide us. In data science or in analytics you have a whole platora of all the way from simple classification algorithms to very advanced predictive modeling algorithms. If you take the whole class when you start with a customer, you don't know which model is really going to work for a specific user case because the customer might come and can say, you might get some idea but you will not know exactly this is the model that will work. How you test it with one customer, that model could remain the same kind of use case for some of other customer, but that actual the coefficients the degree of the digital in some cases it might be two level decision trees, in others case it might be a six level decision tree. >> It is not like you take the model and the features and then just let different customers tweak the coefficients for the features. >> If you can do that, that will be great but I don't know whether you can really do it the data is going to change. The data is definitely going to change at some point of time but in certain cases it might be directly correlated where it can help, in certain cases it might not help. >> What I'm taking away is this is fundamentally different from traditional enterprise applications where you could standardize business processes and the transactional data that they were producing. Here it's going to be much more bespoke because I guess the processes, the analytic processes are not standardized. >> Correct, every business processes is unique for a business. >> The accentures of the world we're trying to tell people that when SAP shipped packaged processes, which were pretty much good enough, but that convince them to spend 10 times as much as the license fee on customization. But is there a qualitative difference between the processes here and the processes in the old ERP era? I think it's kind of different in the ERP era and the processes, we are more talking about just data management. Here we're talking about data science which means in the data management world, you're just moving data or transforming data and things like that, that's what you're doing. You're taking the data. transforming to some other form and then you're doing basic SQL queries to get some response, blah blah blah. That is a standard process that is not much of intelligence attached to it but now you are trying to see from the data what kind of intelligence can you derive by modeling the characteristics of the data. That becomes a much tougher problem so it now becomes one level higher of intelligence that you need to capture from the data itself that you want to serve a particular outcome from the insights you get from is model. >> This sounds like the differences are based on one different business objectives and perhaps data that's not as uniform that you would in enterprise applications, you would standardize the data here, if it's not standardized. >> I think because of the varied the disparity of the businesses and the kinds of verticals and things like that you're looking at, to get complete unified business model, is going to be extremely difficult. 
>> Last question, back-office systems the highest level they got to were maybe the CFO 'cause you had a sign off on a lot of the budget for the license and a much much bigger budget for the SI but he was getting something that was like close you quarter in three days or something instead of two weeks. It was a control function. Who do you sell to now for these different systems and what's the message, how much more strategic how do you sell the business impact differently? >> The platforms we directly interact with the CIO and CTOs or the head of engineering. And the actual solutions or the insights, we usually sell it to the COOs or the operational folks. So because the COO is responsible for showing you productivity, efficiency, how much of savings can you do on the bottom line top line. So the insights would actually go through the COOs or in some sense go through their CTOs to COOs but the actual platform itself will go to the enterprise IT folks in that order. >> This sounds like it's a platform and a solution sell which requires, is that different from the sales motions of other IBM technologies or is this a new approach? >> IBM is transforming on its way. The days where we believe that all the strategies and predictives that we are aligned towards, that actually needs to be the key goal because that's where the world is going. There are folks who, like Jeff Boaz talks about in the olden days you need 70 people to sell or 70% of the people to sell a 30% product. Today it's a 70% product and you need 30% to actually sell the product. The model is completely changing the way we interact with customers. So I think that's what's going to drive. We are transforming that in that area. We are becoming more conscious about all the strategy operations that we want to deliver to the market we want to be able to enable our customers with a much broader value proposition. >> With the industry solutions group and the Global Business Services teams work on these solutions. They've already been selling, line of business CXO type solutions. So is this more of the same, it's just better or is this really higher level than IBM's ever gotten in terms of strategic value? >> This is possibly in decades I would say a high level of value which come from a strategic perspective. >> Okay, on that note Veeru, we'll call it a day. This is great discussion and we look forward to writing it up and clipping all the videos and showering the internet with highlights. >> Thank you George. Appreciate it. >> Hopefully I will get you back soon. >> I was a pleasure, absolutely. >> With that, this George Gilbert. We're in our Palo Alto studio for wiki bond and theCUBE and we've been talking to Veeru Ramaswamy who's VP of Watson IoT platform and we look forward to coming back with Veeru sometime soon. (upbeat music)

Published Date : Aug 23 2017


Ali Ghodsi, Databricks - #SparkSummit - #theCUBE


 

>> Narrator: Live from San Francisco, it's the Cube, covering Spark Summit 2017. Brought to you by Databricks. (upbeat music) >> Welcome back to the Cube, day two at Spark Summit. It's very exciting. I can't wait to talk to this gentleman. We have the CEO from Databricks, Ali Ghodsi, joining us. Ali, welcome to the show. >> Thank you so much. >> David: Well we sat here and watched the keynote this morning with Databricks and you delivered some big announcements. Before we get into some of that, I want to ask you: it's been about a year and a half since you transitioned from VP of Products and Engineering into a CEO role. What's the most fun part of that, and maybe what's the toughest part? >> Oh, I see. That's a good question and that's a tough question too. The most fun part is... you know, you touch many more facets of the business. In engineering, it's all the tech and you're dealing only with engineers, mostly. Customers are one hop away; there's a product management layer between you and the customers. So you're very inwardly focused. As a CEO you're dealing with marketing, finance, sales, these different functions. And then, externally, with media, with stakeholders, a lot of customer calls. There are many, many more facets of the business that you're seeing. And it also gives you a perspective that you couldn't have before. You see how the pieces fit together, so you actually can have a better perspective and see further out than you could before. Before, I was more in my own silo where I was seeing sort of just the things relating to engineering, so that's the best part. >> You're obviously working closely with customers. You introduced a few customers this morning up on stage. But after the keynote, did you hear any reactions from people? What are they saying? >> Yes, the keynote was recent, so on my way here I've had multiple people sort of... a couple of people high-fived me just before I got up on stage here. On serverless, people are really excited about that. Less devops, less configuration, let them focus on the innovation; they want that. So that's something that's celebrated. Yesterday-- >> Recap that real quickly for our audience here, what the serverless offering is. >> Absolutely, so it's very simple. We want lots of data scientists to be able to do machine learning without having to worry about the infrastructure underneath it. So we have something called serverless pools, and with serverless pools you can just have lots of data scientists use it. Under the hood, this pool of resources shrinks and expands automatically. It adds storage, if needed. And you don't have to worry about the configuration of it. And it also makes sure that it's isolating the different data scientists. So if one data scientist happens to run something that takes much more resources, it won't affect the other data scientists that are sharing that. So the short story of it is you cut costs significantly, you can now have 3000 people share the same resources, and it enables them to move faster because they don't have to worry about all the devops that they otherwise have to do. >> George, is that a really big deal? >> Well we know whenever there's infrastructure that gets between a developer, or data science, and their outcomes, that's friction. I'd be curious to say let's put that into a bigger perspective, which is: if you go back several years, what were the class of apps that Spark was being used for, and in conjunction with what other technologies?
Then bring us forward to today and then maybe look out three years. >> Ali: Yeah, that's a great question. So from the very beginning, data is key for any of these predictive analytics that we are doing. That was always a key thing. But back then we saw more Hadoop data lakes. There were more data lakes, data reservoirs, data marts that people were building out. We also saw a lot of traditional data warehousing. These days, we see more and more things moving to the cloud. The Hadoop data lake, oftentimes at enterprises, is being transformed into cloud blob storage. That's cheaper, it's durably replicated, it's on many continents. That's something that we've seen happen. And we work across any of these, frankly. From the very beginning, one of Spark's strengths is that it integrates really well wherever your data is. And there's a huge community of developers around it, over 1000 people now that have contributed to it. Many of these people are in other organizations, they're employed by other companies, and their job is to make sure that Databricks or Spark works really, really well with, say, Cassandra or with S3. That's a shift that we're seeing. In terms of applications people are building, it's moving more into production. Four years ago much more of it was interactive, exploratory. Now we're seeing production use cases. The fraud analytics use case that I mentioned, that's running continuously and the requirements there are different. You can't go down for ten minutes on a Saturday morning at 4 a.m. when you're doing credit card fraud, because that's a lot of fraud and that affects the business of, say, Capital One. So that's much more crucial for them. >> So what would be the surrounding infrastructure and applications to make that whole solution work? Would you plug into a traditional system of record at the sales order entry kind of process point? Are you working off sort of semi-real-time or near real-time data? And did you train the models on the data lake? How did the pieces fit together? >> Unfortunately the answer depends on the particular architecture that the customer has. Every enterprise is slightly different. But it's not uncommon that the data is coming in and they're using Spark structured streaming in Databricks to get it into S3, so that's one piece of the puzzle. Then when it ends up there, from then on it funnels out to many different use cases. It could be a data warehousing use case, where they're just using interactive SQL on it. So that's the traditional interactive use case. But it could be a real-time use case, where it's actually taking the data that it's processed and it's detecting anomalies and putting triggers in other systems, and then those systems downstream will react to those triggers for anomalies. But it could also be that it's periodically training models and storing the models somewhere. Oftentimes it might be in a Cassandra, or in a Redis, or something of that sort. It will store the model there and then some web application can take it from there, do point queries to it and say okay, I have a particular user that came in here, George, now quickly look up what his feature vector is, figure out what product recommendations we should show to this person, and then it takes it from there.
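A rough sketch of the ingest leg Ali describes, events streaming in and landing in S3 as the shared starting point for the downstream warehousing, anomaly-detection, and model-training use cases; the Kafka source, schema, and paths are illustrative assumptions rather than any particular customer's pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("ingest-to-s3").getOrCreate()

# Hypothetical schema for incoming card-transaction events.
schema = (StructType()
          .add("card_id", StringType())
          .add("amount", DoubleType())
          .add("merchant", StringType())
          .add("event_time", TimestampType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
       .option("subscribe", "transactions")                # assumed topic
       .load())

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Land the parsed events in S3; interactive SQL, periodic model training, and
# anomaly detection jobs all fan out from this one location downstream.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/events/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
         .outputMode("append")
         .start())
```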
>> So in those cases, Cassandra or Redis, they're playing the serving layer. But generating the prediction model is coming from you, and they're just doing the inferencing, the prediction itself. So if you look out several years, without asking you the roadmap, which you can feel free to answer, how do you see that scope of apps expanding, or the share of an existing app like that? >> Yeah, I think there are two interesting trends that I believe in; I'll be foolish enough to make predictions. One is that I think that data warehousing, as we know it today, will continue to exist. However, it will be transformed, and all the data warehousing solutions that we have today will add predictive capabilities or they will disappear. So let me motivate that. If you have a data warehouse with customer data in it and a fact table, you have all your transactions there, you have all your products there. Today, you can plug in BI tools and on top of that you can see what my business health is today and yesterday. But you can't ask it: tell me about tomorrow. Why not? The data is there. Why can I not ask it, on this customer data, to tell me which of these customers are going to churn, or which one of them I should reach out to because I can possibly upsell them? Why wouldn't I want to do that? I think everyone would want to do that, and every data warehousing solution in ten years will have these capabilities. Now with Spark SQL you can do that, and the announcement yesterday also showed you how you can bake machine learning models and export them so a SQL analyst can just access them directly with no machine learning experience. It's just a simple function call and it just works. So that's one prediction I'll make. The second prediction I'll make is that we're going to see lots of revolutions in different industries, beyond the traditional 'get people to click on ads' and understanding social behavior. We're going to go beyond that. For those use cases it will be closer to the things I mentioned, like Shell, and what you need to do there is involve the domain experts. The domain experts will come in, the doctors or the machine specialists; you have to involve them in the loop. And they'll be able to transform maybe much less exotic applications. It's not the super high-tech Silicon Valley stuff, but it's nevertheless extremely important to every enterprise, to every vertical, on the planet. That's, I think, the exciting part of where predictions will go in the next decade or two. >> If I were to try and pick out the most man-bites-dog kind of observation in there, you know, it's supposed to be the unexpected thing, I would say it's where you said all data warehouses are going to become predictive services. Because what we've been hearing is sort of the other side of that coin, which is that all the operational databases will get all the predictive capabilities. But you said something very different. I guess my question is: are you seeing the advanced analytics going to the data warehouse because the repository of data is going to be bigger there, and so you can either build better models, or because it's not burdened with transaction SLAs so you can serve up predictions quicker? >> Data warehousing has been about basic statistics. SQL, the language that is used, is there to get descriptive statistics: tables with averages and medians, that's statistics. Why wouldn't you want to have advanced statistics which now do predictions on it? It just so happens that SQL alone is not the right interface for that. So it's going to be very natural for people who have already been asking statistical questions of their customer data for the last 30 years, these massive troves of data that they have stored.
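The 'simple function call' Ali mentions might look roughly like this: train a model with Spark ML, expose its scoring as a registered SQL function, and a SQL analyst queries it like any other function. The customer table, feature columns, and function name are illustrative assumptions, not the exact mechanism that was announced.

```python
import math
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sql-scoring").getOrCreate()

# Hypothetical customer fact table with a 0/1 'churned' label for training.
customers = spark.read.parquet("s3a://example-bucket/warehouse/customers/")
assembler = VectorAssembler(inputCols=["tenure_months", "monthly_spend", "support_tickets"],
                            outputCol="features")
model = LogisticRegression(labelCol="churned").fit(assembler.transform(customers))

w, b = list(model.coefficients), float(model.intercept)

def churn_score(tenure, spend, tickets):
    """Apply the trained coefficients to one row; registered as a SQL function below."""
    z = b + w[0] * tenure + w[1] * spend + w[2] * tickets
    return 1.0 / (1.0 + math.exp(-z))

spark.udf.register("churn_score", churn_score, DoubleType())
customers.createOrReplaceTempView("customers")

# A SQL analyst can now ask the forward-looking question with a plain function call.
spark.sql("""
    SELECT customer_id,
           churn_score(tenure_months, monthly_spend, support_tickets) AS p_churn
    FROM customers
    ORDER BY p_churn DESC
    LIMIT 20
""").show()
```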
Why wouldn't they also want to say, 'okay, now give me more advanced statistics'? I'm not an expert on advanced statistics, but you ask the system: tell me what I should watch out for. Which of these customers do I talk to? Which of the products are in trouble? Which of the products are not, or which parts of my business are not doing well now? Predict the future for me. >> George: When you're doing that though, you're now doing it on data that has a fair amount of latency built into it, because that's how it got into the data warehouse. Whereas if it's in the operational database, it's really low latency, typically low-latency stuff. Where and why do you see that distinction? >> I do think also that we'll see more and more real-time engines take over. If you do things in real time you can do it for a fraction of the cost. So we'll also see those capabilities come in. So you don't have to... your question is, why would you want to batch everything once a week into a central warehouse, and I agree with that. It will be streaming in live, and then on that you can do predictions, you can do basic analytics. I think basically the lines will blur between all these technologies that we're seeing. In some sense, Spark actually was the precursor to all that. Spark already was unifying machine learning, SQL, ETL, real-time, and you're going to see that appear everywhere. >> You mentioned Shell as an example, one of your customers; you also had HP, Capital One, and you developed this unified analytics platform that's solving some of their common problems. Now that you're in the mood to make predictions, what do you think are going to be the most compelling use cases or industries where you're going to see Databricks going in the future? >> That's a hard one. Right now, I think healthcare. There are a lot of data sets, there's a lot of gene sequencing data. They want to be able to use machine learning. In fact, I think those industries are slowly being transformed from using classical statistics to using machine learning. We've actually helped some of these companies do that. We've set up workshops and they've gotten people trained. And now they're hiring machine learning experts who are coming in. So that's one, I think, in the healthcare industry, whether it's for drug testing, clinical trials, even diagnosis; that's a big one. I do think industrial IoT is another. These are big companies with lots of equipment, they have tons of sensor data, massive data sets. There are a lot of predictions that they can do on that. So that's a second one I would say. The financial industry, they've always been about predictions, so it makes a lot of sense that they continue doing that. Those are the biggest ones for Databricks. But I think now also, slowly, other verticals are moving into the cloud, so we'll see more of other use cases as well. But those are the biggest ones I see right now. It's hard to say where it will be ten years from now, or 15. Things are going so fast that it's hard to even predict six months. >> David: Do you believe IoT is going to be a big business driver? >> Yes, absolutely. >> I want to circle back to where you said that we've got different types of databases but we're going to unify the capabilities. Without saying it's a case of one wins, one loses. >> Ali: Yes, I didn't want to do that. >> So describe maybe the characteristics of what a database that complements Spark really well might look like. >> That's hard for me to say. The capabilities of Spark, I think, are here to stay. The ability to ETL a variety of data that doesn't have structure, so that Structured Query Language, SQL, is not fit for it, that is really important, and it's going to become more important since data is the new oil, as they say. Well, then it's going to be very important to be able to work with all kinds of data and get it into the systems. There are more things every day being created: devices, IoT, whatever it is that is spewing out this data in different forms and shapes. So being able to work with that variety, that's going to be an important property. So they'll have to do that. That's the ETL portion, or the ELT portion.
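As a small illustration of that ETL-of-variety point, the sketch below reads loosely structured JSON events, lets Spark infer a schema across whatever shapes it finds, and flattens them into analysis-ready columns; the source path and field names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, to_date

spark = SparkSession.builder.appName("variety-etl").getOrCreate()

# Semi-structured event logs from many devices/apps; no SQL-style schema up front.
raw = spark.read.json("s3a://example-bucket/raw/events/*.json")
raw.printSchema()   # Spark infers a schema across the variety it finds

# Flatten nested fields into the tabular form the rest of the pipeline expects.
flat = (raw
        .select("device_id",
                to_date(col("timestamp")).alias("day"),
                explode(col("readings")).alias("reading"))
        .select("device_id", "day",
                col("reading.metric").alias("metric"),
                col("reading.value").cast("double").alias("value")))

(flat.write
 .mode("overwrite")
 .partitionBy("day")
 .parquet("s3a://example-bucket/curated/readings/"))
```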
Then there's the real-time portion, not having to do this in a batch manner once a week, because now time is a competitive advantage. If I'm one week behind you, that means I'm going to lose out. So doing that in real time, or near real time, human real time, that's going to be really important. That's going to come as well, I think, and people will demand it. That's going to be a competitive advantage; wherever you can add that secret sauce it's going to add value to the customers. And then finally the predictive stuff, adding the predictive stuff. But I think people will want to continue to also do all the old stuff they've been doing. I don't think that's going to go away. Those bring value to customers; they want to do all those traditional use cases as well. >> So what about now, where customers expect to have some, it's not clear how much, on-prem application platform like Spark, some in the cloud now that you've totally reordered the TCO equation, but then also at the edge for IoT-type use cases: do you have to slim down Spark to work at the edge? If you have serverless working in the cloud, does that mean you have to change the management paradigm on-prem? What does that mix look like? How does someone, you know, how does a Fortune 200 company get their arms around that? >> Ali: Yeah, this is a surprising thing, the most surprising thing for me in the last year: how many of those Fortune 200s that I was talking to three years ago were saying 'no way, we're not going into the cloud. You don't understand the regulations that we are facing or the amount of data that we have.' Or 'we can do it better,' or 'the security requirements that we have, no one can match that.' To now, those very same companies are saying 'absolutely, we're going.' It's not about if, it's about when. Now I would be hard-pressed to find any enterprise that says 'no, we're not going to go, ever.' And some companies we've even seen go from the cloud to on-prem, and then now back. Because the prices are getting more competitive in the cloud. Because now there are at least three major players that are competing, and they're well-funded companies. In some sense, you have ad money and office money and retail money being thrown at this problem. Prices are getting competitive. Very soon, most IT folks will realize there's no way we can do this faster, or better, or more reliably and securely ourselves. >> David: We've got just a minute to go here before the break, so we're going to kind of wrap it up here. We've got over 3000 people here at Spark Summit, so it's the Spark community. I want you to talk to them for a moment. What problems do you want them to work on the most? And what are we going to be talking about a year from now at this table? >> The second one is harder. I think the Spark community is doing a phenomenal job. I'm not going to tell them what to do. 
They should continue doing what they are doing already, which is integrating Spark into the ecosystem, adding more and more integrations with the great technologies that are happening out there. Continue the innovation, and we're super happy to have them here. We'll continue as well; we'll continue to host this event and look forward to also having a Spark Summit in Europe, and also on the East Coast soon. >> David: Okay, so I'm not going to ask you to make any more predictions. >> Alright, excellent. >> David: Ali, this is great stuff today. Thank you so much for taking some time and giving us more insight after the keynote this morning. Good luck with the rest of the show. >> Thank you. >> Thanks, Ali. And thank you for watching. That's Ali Ghodsi, CEO of Databricks. We are at Spark Summit 2017 here, on the Cube. Thanks for watching, stay with us. (upbeat music)

Published Date : Jun 8 2017


Reynold Xin, Databricks - #Spark Summit - #theCUBE


 

>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCube at Spark Summit 2017. I'm David Goad here with George Gilbert. George. >> Good to be here. >> Thanks for hanging with us. Well here's the other man of the hour here. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit. There are a lot of interesting things, a lot of interesting people I meet. >> Well I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time, right? >> Yes, I've been contributing to Spark for about five or six years, and that's probably the largest number of commits to the project, and lately I'm working more with other people to help design the roadmap for both Spark and Databricks with them. >> Well let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments? >> So, I think in general if we look at Spark, there are three directions I would say we are doubling down on. The first direction is deep learning. Deep learning is extremely hot and it's very capable, but as we alluded to earlier in a blog post, deep learning has reached sort of an inflection point in which it shows tremendous potential but the tools are very difficult to use. And we are hoping to democratize deep learning and do what Spark did to big data, to deep learning, with this new library called Deep Learning Pipelines. What it does is it integrates different deep learning libraries directly in Spark and can actually expose models in SQL. So even the business analysts are capable of leveraging that. So that's one area, deep learning. The second area is streaming. Streaming, again, I think a lot of customers have aspirations to actually shorten the latency and increase the throughput in streaming. So the structured streaming effort is going to be generally available, and last month alone on the Databricks platform I think our customers processed three trillion records using structured streaming. And we also have a new effort to actually push down the latency all the way to the millisecond range, so you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing, I think, is a very mature area outside of the big data point of view, but from a big data point of view it's still pretty new, and there are a lot of use cases popping up there. And with approaches like the CBO in Spark and also the improvements in the Databricks runtime with DBIO, we're actually substantially improving the performance and the capabilities of the data warehousing features.
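For the streaming piece just mentioned, a minimal structured streaming job looks roughly like the following: a windowed aggregate recomputed on a short trigger. The built-in rate source stands in for a real feed, and the one-second trigger is only illustrative of the direction toward lower latency.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("streaming-agg").getOrCreate()

# The built-in rate source emits (timestamp, value) rows; a real job would read Kafka or Kinesis.
events = spark.readStream.format("rate").option("rowsPerSecond", 1000).load()

per_window = (events
              .withWatermark("timestamp", "30 seconds")
              .groupBy(window("timestamp", "10 seconds"))
              .agg(count("*").alias("events")))

query = (per_window.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="1 second")   # short micro-batches; tune for latency
         .start())
```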
>> We're going to dig in to some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind maybe about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debugability of the big data jobs. Many of them are fairly technical engineers and some of them are less sophisticated engineers, and they have written jobs and sometimes the job runs slow. And so the performance engineer in me would think, how do I make the job run fast? The different way to actually solve that problem is, how can we expose the right information so the customer can actually understand and figure it out themselves: this is why my job is slow, and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, debugability. >> Debugability. >> Reynold: And visibility, yeah. >> Alright, awesome, George. >> So, let's go back and unpack some of those kind of juicy areas that you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. And so Deeplearning4j and I think Intel's BigDL were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are in their native environments? >> Yeah, so this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras and Caffe. What the Deep Learning Pipelines library does is actually expose all these single-node deep learning tools, which are highly optimized for, say, GPUs or CPUs, to be available as an estimator, like a module in a pipeline of the machine learning pipeline library in Spark. So now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. When you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out what is the best model I can get. This is where Spark really shines. When you combine Spark with some deep learning library, be it BigDL or be it MXNet, be it TensorFlow, you could be using Spark to distribute that training and then do cross-validation on it. So you can actually find the best model very quickly. And Spark takes care of all the job scheduling, all the fault tolerance properties, and how you read data in from different data sources. >> And without my dropping too much in the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all that now can be managed by Spark? >> In that library, Spark will be able to actually take care of picking the best model out of it. And there are different ways you can design how you define the best. The best could be some average of different models. The best could be just picking one out of them. The best could be maybe a tree of models that you classify on. >> George: And that's a hyperparameter configuration choice? >> So that is actually built-in functionality in Spark's machine learning pipeline. And now what we're doing is you can actually plug all those deep learning libraries directly into that as part of the pipeline to be used. Another thing maybe just to add, >> Yeah, yeah, >> Another really cool functionality of the deep learning pipeline is transfer learning. So as you said, deep learning takes a very long time; it's very computationally demanding, and it takes a lot of resources and expertise to train. But with transfer learning, what we allow the customers to do is take an existing deep learning model that is well trained in a different domain, and then we retrain it on a very small amount of data very quickly and they can adapt it to a different domain. That's how we did the demo on the James Bond car. So there is a general image classifier, and we retrained it on probably just a few thousand images. And now we can actually detect whether a car is James Bond's car or not.
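A rough sketch of the transfer-learning recipe Reynold describes: a pre-trained network is used as a fixed featurizer inside a Spark ML pipeline, and only a light classifier is retrained on a small labeled set. The sparkdl package's readImages and DeepImageFeaturizer calls and the InceptionV3 backbone are assumptions based on how the Deep Learning Pipelines library was presented, not verified against a specific release.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer, readImages   # assumed sparkdl API

spark = SparkSession.builder.appName("transfer-learning").getOrCreate()

# A few thousand labeled images are enough, because the heavy lifting was done
# when the base network was trained on a general image corpus.
bond_cars = readImages("s3a://example-bucket/images/bond_car/").withColumn("label", lit(1))
other_cars = readImages("s3a://example-bucket/images/other_cars/").withColumn("label", lit(0))
train_df = bond_cars.union(other_cars)

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")        # assumed pre-trained backbone
classifier = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")

# The pre-trained layers stay fixed; only the small classifier on top is fit here.
model = Pipeline(stages=[featurizer, classifier]).fit(train_df)
```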
>> Oh, and the implications there are huge, which is that you don't have to have huge training data sets for modifying a model of a similar situation. I want to, in the time we have... there's always been this debate about whether Spark should manage state, whether it's a database or a key value store. Tell us how the thinking about that has evolved, and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define, but the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantage of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of building on top of existing storage systems. There are actually a lot of opportunities for optimization in how you can leverage the specific properties of the underlying storage system to get to maximum performance. For example, how are you doing intelligent caching, how do you start thinking about building indexes against the data that's stored, for scan workloads. >> With Tungsten, you take advantage of the latest hardware, and we're getting more memory-intensive systems, and now the Catalyst Optimizer has a cost-based optimizer, or will have, and large memory. Can you change how you go about knowing what data you're managing in the underlying system and therefore achieve a tremendous acceleration in performance? >> This is actually one area we invested in with the DBIO module as part of Databricks Runtime, and what DBIO does, and a lot of this is still in progress, is for example adding some form of indexing capability to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups, or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're actually adding indexing functionalities on top of it as part of DBIO. >> And so what would be the application profiles? Is it just for the analytic queries, or can you do the point look-ups and updates in that sort of scenario too?
>> So it's interesting you're talking about updates. Updates are another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize that for both use cases, doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates, rather than doing transactional updates in which every time you swipe a credit card some record gets updated. That probably belongs more on the transactional databases, like Oracle or MySQL even. >> What about when you think about people who are going to run... they started out with Spark on-prem, they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like? >> Really interesting, it's kind of two questions maybe. The first is the hybrid on-prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute. So when you want to move, for example, workloads from on-prem to the cloud, the one you care the most about is probably actually the data, because the compute, it doesn't really matter that much where you run it, but data is the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on-prem, relying on the caching solution we have that minimizes the data transfer over time. And that is one route I would say is pretty popular. Another one is, with Amazon you can literally use their Snowball functionality. You give them hard drives, with trucks; the trucks will ship your data and put it directly in S3. With IoT, a common pattern we see is that a lot of the edge devices would actually be pushing the data directly into some firehose like Kinesis or Kafka, or, I'm sure, Google and Microsoft both have their own variants of that. And then you use Spark to directly subscribe to those topics and process them in real time with structured streaming. >> And so would Spark be down, let's say, at the site level, if it's not on the device itself? >> It's an interesting thought, and maybe one thing we should actually consider more in the future is how we push Spark to the edges. Right now it's more of a centralized model in which the devices push data into Spark, which is centralized somewhere. I've seen, for example, I don't remember the exact use case, but it had to do with some scientific experiment in the North Pole. And of course there you don't have a great uplink for all the data to transfer back to some national lab, and rather they would do smart parsing there and then ship the aggregated result back. There's another one but it's less common.
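The edge-to-hub pattern just described, devices publishing telemetry into a firehose like Kafka or Kinesis and Spark subscribing with structured streaming, might look roughly like this on the Spark side; the broker, topic, schema, and alert threshold are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-subscribe").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("metric", StringType())
          .add("value", DoubleType())
          .add("event_time", TimestampType()))

telemetry = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "edge-hub:9092")   # assumed broker
             .option("subscribe", "device-telemetry")              # assumed topic
             .load()
             .select(from_json(col("value").cast("string"), schema).alias("t"))
             .select("t.*"))

# React in near real time: surface readings that look wrong so downstream systems can act.
alerts = telemetry.filter((col("metric") == "temperature") & (col("value") > 90.0))

query = alerts.writeStream.format("console").outputMode("append").start()
```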
>> Alright, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on for the benefit of everybody? >> In general, Spark came along with two focuses. One is performance; the other one is ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think is actually pretty usable. The systems are fast enough. Now we should work on actually making (mumbles) even easier to use. That's also what we focus a lot on at Databricks here. >> David: Democratizing access, right? >> Absolutely. >> Alright, well Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show. >> Thank you very much, David and George. >> Thank you all for watching here; we're at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.

Published Date : Jun 7 2017


Day 2 Kickoff - #SparkSummit - #theCUBE


 

[Narrator] Live from San Francisco it's the Cube covering Spark Summit 2017, brought to you by Databricks. >> Welcome to the Cube. My name is David Goad and I'm your host, and we are here at Spark day two. It's the Spark Summit and I am flanked by a couple of consultants here from-- sorry, analysts from Wikibon. I've got to get this straight. To my left we have Jim Kobielus, who is our lead analyst for Data Science. Jim, welcome to the show. >> Thanks David. >> And we also have George Gilbert, who is the lead analyst for Big Data and Analytics. I'll get this right eventually. So why don't we start with Jim. Jim, just kicking off the show here today, we wanted to get some preliminary thoughts before we really jump into the rest of the day. What are the big themes that we're going to hear about? >> Yeah, today is the Enterprise day at Spark Summit. So, Spark for the Enterprise. Yesterday was focused on Spark itself, the evolution and extension of Spark to support native development of deep learning, as well as speeding up Spark to support sub-millisecond latencies. But today it's all about Spark and the Enterprise, really what I call wrapping dev-ops around Spark, making it more productionizable, supportable. The Databricks serverless announcement, though it was announced yesterday when the press release went up, they're going into some depth right now in the keynote about serverless. And really, serverless is all about providing an in-cloud Spark, essentially a sandbox for teams of developers to scale up and scale out enough resources to do the modeling, the training, the deployment, the iteration, the evaluation of Spark jobs in essentially a 24-by-seven, multi-tenant, fully supported environment. So it's really about driving this continuous Spark development and iteration process into a 24-by-seven model in the Enterprise. What's really happening is that data scientists and Spark developers are becoming an operational function that businesses are building strategic infrastructure around; things like recommendation engines and e-commerce environments absolutely demand 24-by-seven resilient, Spark team-based collaboration environments, which is really what the serverless announcement is all about. >> David: So getting increasing demand on mission-critical problems, so that optimization is a big deal. >> Yeah, data science is not just an R&D function, it's an operational IT function as well. So that's what it's all about. >> David: Awesome, well let's go to George. I saw you watching the keynote. I think you're still watching it again this morning, taking notes feverishly. What were some of the things that stuck out to you from the keynote speaker this morning? >> There are some things that are sort of going to bleed over from yesterday where we can explore some more. We're going to have on the show the chief architect, Reynold Xin, and the CEO, Ali Ghodsi, and some of the things that we want to understand are how the scope of applications that are appropriate for Spark is expanding. We got sort of unofficial guidance yesterday that, you know, just because Spark doesn't handle key value stores or databases all that tightly right now, that doesn't mean it won't in the future, on the Apache Spark side through better APIs and on the Databricks side perhaps custom integration. And the significance of that is that you can open up a whole class of operational apps, apps that run your business and that now incorporate, you know, rich analytics as well. 
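One concrete version of that operational-apps point is a Spark job handing its analytic output to a key-value or wide-column store that the business application then queries at low latency. The Cassandra connector format string, keyspace, and table below are assumptions for illustration, not a claim about what Spark ships with natively.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serve-scores").getOrCreate()

# Batch-scored predictions produced by an upstream analytics job (path is illustrative).
scores = spark.read.parquet("s3a://example-bucket/predictions/churn/")

# Hand the results to an operational store so the app that runs the business
# can do fast point lookups against them (e.g. per customer_id).
(scores.write
 .format("org.apache.spark.sql.cassandra")      # assumed DataStax connector format
 .option("keyspace", "serving")
 .option("table", "churn_scores")
 .mode("append")
 .save())
```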
Another thing that we'll want to be asking about, keying off what Jim was saying, is that this now becomes not just a managed service, where you take the labor that the end customer was applying to get the thing running, but it's now automated and you don't even need to know the infrastructure. We'll want to know what that means for the edge, you know, where we're doing analytics close to the internet of things and people, and sort of whether there has to be a new configuration of Spark to work with that. And then of course, what do we do about the whole data science process and the dev-ops for data science when you have machine learning distributed across the cloud and edge and on-prem. >> Jim: In fact, I know we have Pepperdata coming on right after this, who might be able to talk about exactly that dev-ops in terms of performance optimization in a distributed Spark environment, yeah. >> George, I want to follow up with that. We had Matt Fryer from Hotels.com; he's going to be on our show later, but he was on the keynote stage this morning. He talked about going all cloud, all Spark, and how data science is even a competitive advantage for Hotels.com. What do you want to dig into when we get him on the show? >> That's a really good question, because if you look at business strategy, you don't really build a sustainable advantage just by doing one thing better than everyone else. That's easier to pick off. The sustainable strategic advantages come from not just doing one thing better than everyone else but many things, and then orchestrating their improvement over time, and I'd like to dig into how they're going to do that. Because remember, Hotels.com is the internet-equivalent descendant of the original travel reservation systems, which did confer competitive advantage on the early architects and deployers of that technology. >> Great, and then Pepperdata wanted to come back and we're going to have them on the show here in just a moment. What would you like to learn from them? What do you think will benefit the community the most? >> Jim: Actually, keying off something George said, I'd like to get a sense for how you optimize Spark deployments in a radically distributed IoT edge environment. Whether they've got any plans, or what their thoughts are in terms of the challenges there. As more of the intelligence gets pushed to the edge, much of that will be machine learning and deep learning models built into Spark. What are the challenges there? I mean, if you've got thousands to millions of endpoints that are all autonomous and intelligent and they're all running Spark, just what are the orchestration requirements, what are the resource management requirements, how do you monitor end-to-end in an environment like that and optimize the passing of data and the transfer of the control flow or orchestration across all those dispersed points? >> Okay, so 30 seconds now, why should the audience tune into our show today? What are they going to get? >> I think what they're going to get is a really good sense of the emerging best practices for optimizing Spark in a distributed fog environment out to the edge, where not just the edge devices but everything, all nodes, will incorporate machine learning and deep learning. They'll get a sense of what's been done today, what the tooling is to enable dev-ops in that kind of environment, as well as sort of the emerging best practices for compressing more of these algorithms and the data itself, as well as doing training in a theoretically federated environment. 
I'm hoping to hear from some of the vendors who are on the show today. >> David: Fantastic. And George, closing thoughts on the opening segment? 30 seconds. >> Closing thoughts on the opening segment. Like Jim, we want to think about Spark holistically. It has traditionally been best positioned as sort of this-- as Matei acknowledged yesterday, sort of this offline branch of analytics that you apply to data in a sort of repository that you've accumulated, and now we want to see it put into production. But to do that you need more than just what Spark is today. You need basically a database or key-value kind of option so that you're storing your work as it goes along, so you can go back and analyze it, either simple analysis or complex analysis. So I want to hear about that. I want to hear about their plans for IoT. Spark is kind of a heavyweight environment, so you're probably not going to put it in the boot of your car, or at least not likely anytime soon. >> Jim: Intelligent edge. I mean, Microsoft Build a few weeks ago was really deep on intelligent edge. HP, whose show we're actually doing, I think it's in Vegas, right? They're also big on intelligent edge. In fact, we had somebody on the show yesterday from HP going into some depth on that. I want to hear what Databricks has to say on that theme. >> Yeah, and which part of the edge: is it the gateway, the edge gateway, which is really a slimmed-down server, or the edge device, which could be a 32-bit network card with a few megs of RAM. >> Yeah. >> All right, well gentlemen, appreciate the little insight here before we get started today, and we're just getting started. Thank you both for being on the show and thank you for watching the Cube. We'll be back in a little while with the CEO from Databricks. Thanks for watching. (upbeat music)

Published Date : Jun 7 2017


Dr. Jisheng Wang, Hewlett Packard Enterprise, Spark Summit 2017 - #SparkSummit - #theCUBE


 

>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017, brought to you by Databricks. >> You are watching theCUBE at Spark Summit 2017. We continue our coverage here talking with developers, partners, customers, all things Spark, and today we're honored to have as our next guest Dr. Jisheng Wang, who's the Senior Director of Data Science at the CTO Office at Hewlett Packard Enterprise. Dr. Wang, welcome to the show. >> Yeah, thanks for having me here. >> All right, and also to my right we have Mr. Jim Kobielus, who's the Lead Analyst for Data Science at Wikibon. Welcome, Jim. >> Great to be here, like always. >> Well let's jump into it. First I want to ask about your background a little bit. We were talking about the organization; maybe you could do a better job (laughs) of telling me where you came from, and you just recently joined HPE. >> Yes. I actually recently joined HPE earlier this year through the Niara acquisition, and now I'm the Senior Director of Data Science in the CTO Office of Aruba. Actually, Aruba, as you probably know, about two years back HP acquired Aruba as a wireless networking company, and now Aruba takes charge of the whole enterprise networking business in HP, which is over about three billion in annual revenue every year now. >> Host: That's not confusing at all. I can follow you (laughs). >> Yes, okay. >> Well, all I know is you're doing some exciting stuff with Spark, so maybe tell us about this new solution that you're developing. >> Yes, actually most of my experience with Spark goes back to the Niara time. Niara was a three-and-a-half-year-old startup that reinvented enterprise security using big data and data science. The problem we tried to solve in Niara is called UEBA, user and entity behavioral analytics. I'll just try to be very brief here. Most of the traditional security solutions focus on detecting attackers from outside, but what if the origin of the attacker is inside the enterprise, say Snowden; what can you do? You have probably heard of many cases today of employees leaving the company and stealing lots of the company's IP and sensitive data. So UEBA is a new solution that tries to monitor the behavioral change of the enterprise users to detect both this kind of malicious insider and also the compromised user. >> Host: Behavioral analytics. >> Yes, so it sounds like it's native analytics which we run as a product. >> Yeah, and Jim, you've done a lot of work in the industry on this, so any questions you might have for him around UEBA? >> Yeah, give us a sense for how you're incorporating streaming analytics and machine learning into that UEBA solution, and then where Spark fits into the overall approach that you take. >> Right, okay. So actually when we started three and a half years back, when we developed the first version of the data pipeline, we used a mix of Hadoop, YARN, Spark, even Apache Storm for different kinds of stream and batch analytics work. But soon after, with the increased maturity and also the momentum of the open source Apache Spark community, we migrated all our stream and batch work, you know, the ETL and data analytics work, into Spark. And it's not just Spark. It's Spark, Spark Streaming, MLlib, the whole ecosystem. So there are at least a couple of advantages we have experienced through this kind of transition. The first thing which really helped us is the simplification of the infrastructure and also the reduction of the DevOps effort there. 
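Stepping back to the UEBA idea itself, the core of 'monitor the behavioral change of users' can be sketched very roughly as a per-user baseline plus a deviation test; the columns, the z-score heuristic, and the threshold are illustrative stand-ins, not Niara's actual models.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ueba-baseline").getOrCreate()

# Per-user, per-day activity aggregated from network and application logs (schema assumed).
activity = spark.read.parquet("s3a://example-bucket/user_activity_daily/")

baseline = (activity.groupBy("user")
            .agg(F.mean("bytes_downloaded").alias("mu"),
                 F.stddev("bytes_downloaded").alias("sigma")))

today = activity.filter(F.col("day") == "2017-06-06")

# Flag users whose behavior shifts far from their own baseline, e.g. mass downloads
# by a departing employee or a compromised account.
anomalies = (today.join(baseline, "user")
             .withColumn("z", (F.col("bytes_downloaded") - F.col("mu")) / F.col("sigma"))
             .filter(F.col("z") > 3.0))

anomalies.select("user", "day", "bytes_downloaded", "z").show()
```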
>> So simplification around Spark, the whole stack of Spark that you mentioned. >> Yes. >> Okay. >> So for the Niara solution originally, we supported, even here today, we supported both the on-premise and the cloud deployment. For the cloud we also supported the public cloud like AWS, Microsoft Azure, and also private cloud. So you can understand, if we have to maintain a stack of different open source tools over this kind of many different deployments, the overhead of doing the DevOps work to monitor, alarm, and debug this kind of infrastructure over different deployments is very hard. So Spark provides us a unified platform. We can integrate the streaming, you know batch, real-time, near real-time, or even long-term batch jobs all together. So that heavily reduced both the expertise and also the effort required for the DevOps. This is one of the biggest advantages we experienced, and certainly we also experienced something like the scalability, performance, and also the convenience for developers to develop new applications, all of this, from Spark. >> So are you using the Spark Structured Streaming runtime inside of your application? Is that true? >> We actually use Spark in the streaming processing when the data comes in, so like in the UEBA solution, the first thing is collecting a lot of the data: different data sources, account data, network data, cloud application data. So when the data comes in, the first thing is a streaming job for the ETL, to process the data. Then after that, we actually also develop analytics jobs at different frequencies, like one minute, 10 minute, one hour, one day, on top of that. And even recently we have started some early adoption of deep learning into this: how to use deep learning to monitor the user behavior change over time, especially after a user gives notice. Is the user going to access more servers or download some of the sensitive data? So all of this requires a very complex analytics infrastructure. >> Now there were some announcements today here at Spark Summit by Databricks of adding deep learning support to their core Spark code base. What are your thoughts about the Deep Learning Pipelines API that they announced this morning? It's new news, I'll understand if you haven't digested it totally, but you probably have some good thoughts on the topic. >> Yes, actually this is also news for me, so I can just speak from my current experience. How to integrate deep learning into Spark actually was a big challenge so far for us, because what we used so far for the deep learning piece is TensorFlow. And certainly most of our other stream and data massaging or ETL work is done by Spark. So in this case, there are a couple of ways to manage this. One is to set up two separate resource pools, one for Spark, the other one for TensorFlow, but in our deployment there are some very small on-premise installations which have only a four-node or five-node cluster. It's not efficient to split resources in that way. So we are actually also looking for some closer integration between deep learning and Spark. So one thing we looked at before is called TensorFlow on Spark, which was open sourced a couple months ago by Yahoo. >> Right. >> So maybe this is certainly more exciting news for the Spark team to develop this native integration. >> Jim: Very good. >> Okay and we talked about the UEBA solution, but let's go back to a little broader HPE perspective. You have this concept called the intelligent edge, what's that all about?
>> So that's a very cool name. Actually come a little bit back. I come from the enterprise background, and enterprise applications have some, actually a lag behind than consumer applications in terms of the adoption of the new data science technology. So there are some native challenges for that. For example, collecting and storing large amount of this enterprise sensitive data is a huge concern, especially in European countries. Also for the similar reason how to collect, normally weigh developer enterprise applications. You're lack of some good quantity and quality of the trending data. So this is some native challenges when you develop enterprise applications, but even despite of this, HPE and Aruba recently made several acquisitions of analytics companies to accelerate the adoption of analytics into different product line. Actually that intelligent age comes from this IOT, which is internet of things, is expected to be the fastest growing market in the next few years here. >> So are you going to be integrating the UEBA behavioral analytics and Spark capability into your IOT portfolio at HP? Is that a strategy or direction for you? >> Yes. Yes, for the big picture that certainly is. So you can think, I think some of the Gartner Report expected the number of the IOT devices is going to grow over 20 billion by 2020. Since all of this IOT devices are connected to either intranet or internet, either through wire or wireless, so as a networking company, we have the advantage of collecting data and even take some actions at the first of place. So the idea of this intelligent age is we want to turn each of these IOT devices, the small IOT devices like IP camera, like those motion detection, all of these small devices as opposed to the distributed sensor for the data collection and also some inline actor to do some real-time or even close to real-time decisions. For example, the behavior anomaly detection is a very good example here. If IOT devices is compromised, if the IP camera has been compromised, then use that to steal your internal data. We should detect and stop that at the first place. >> Can you tell me about the challenges of putting deep learning algorithms natively on resource constrained endpoints in the IOT? That must be really challenging to get them to perform well considering that there may be just a little bit of memory or flash capacity or whatever on the endpoints. Any thoughts about how that can be done effectively and efficiently? >> Very good question >> And at low cost. >> Yes, very good question. So there are two aspects into this. First is this global training of the intelligence which is not going to be done on each of the device. In that case, each of the device is more like the sensor for the data collection. So we are going to build a, collect the data sent to the cloud, or build all of this giant pool, like computing resource to trend the classifier, to trend the model, but when we trend the model, we are going to ship the model, so the inference and the detection of the model of those behavioral anomaly really happen on the endpoint. >> Do the training centrally and then push the trained algorithms down to the edge devices. >> Yes. But even like, the second as well even like you said, some of the device like say people try to put those small chips in the spoon, in the case of, in hospital to make it like more intelligent, you cannot put even just the detection piece there. So we also looking to some new technology. 
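The split Dr. Wang describes, train the model centrally and ship only the trained model so the endpoint just runs inference, can be sketched without any particular framework. The snippet below is an illustration under assumed details (a tiny logistic regression, JSON as the shipping format, made-up feature names); it is not the actual Niara/Aruba design.

```python
# Sketch only: centralized training, lightweight edge-side inference.
# Model choice, file format, and feature names are illustrative assumptions.
import json
import math

# --- runs centrally, where the labeled history and the compute live ---
def train_centrally(features, labels, lr=0.1, epochs=200):
    """Fit a tiny logistic regression with plain gradient descent."""
    n = len(features[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return {"weights": w, "bias": b}

def ship_model(model, path="model.json"):
    with open(path, "w") as f:
        json.dump(model, f)          # this file is what gets pushed to the device

# --- runs on the endpoint / IoT device, no training needed there ---
def score_on_device(event, path="model.json", threshold=0.9):
    with open(path) as f:
        m = json.load(f)
    z = m["bias"] + sum(w * x for w, x in zip(m["weights"], event))
    p = 1.0 / (1.0 + math.exp(-z))
    return p > threshold             # True -> flag as anomalous behavior

# toy usage: features might be (bytes_out, distinct_servers, off_hours_flag)
model = train_centrally([[0.1, 0.0, 0.0], [0.9, 1.0, 1.0]], [0, 1])
ship_model(model)
print(score_on_device([0.95, 1.0, 1.0]))
```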
I know like Caffe recently announced, released some of the lightweight deep learning models. Also there's some, your probably know, there's some of the improvement from the chip industry. >> Jim: Yes. >> How to optimize the chip design for this kind of more analytics driven task there. So we are all looking to this different areas now. >> We have just a couple minutes left, and Jim you get one last question after this, but I got to ask you, what's on your wishlist? What do you wish you could learn or maybe what did you come to Spark Summit hoping to take away? >> I've always treated myself as a technical developer. One thing I am very excited these days is the emerging of the new technology, like a Spark, like TensorFlow, like Caffe, even Big-Deal which was announced this morning. So this is something like the first go, when I come to this big advanced industry events, I want to learn the new technology. And the second thing is mostly to share our experience and also about adopting of this new technology and also learn from other colleagues from different industries, how people change life, disrupt the old industry by taking advantage of the new technologies here. >> The community's growing fast. I'm sure you're going to receive what you're looking for. And Jim, final question? >> Yeah, I heard you mention DevOps and Spark in same context, and that's a huge theme we're seeing, more DevOps is being wrapped around the lifecycle of development and training and deployment of machine learning models. If you could have your ideal DevOps tool for Spark developers, what would it look like? What would it do in a nutshell? >> Actually it's still, I just share my personal experience. In Niara, we actually developed a lot of the in-house DevOps tools like for example, when you run a lot of different Spark jobs, stream, batch, like one minute batch verus one day batch job, how do you monitor the status of those workflows? How do you know when the data stop coming? How do you know when the workflow failed? Then even how, monitor is a big thing and then alarming when you have something failure or something wrong, how do you alarm it, and also the debug is another big challenge. So I certainly see the growing effort from both Databricks and the community on different aspects of that. >> Jim: Very good. >> All right, so I'm going to ask you for kind of a soundbite summary. I'm going to put you on the spot here, you're in an elevator and I want you to answer this one question. Spark has enabled me to do blank better than ever before. >> Certainly, certainly. I think as I explained before, it helped a lot from both the developer, even the start-up try to disrupt some industry. It helps a lot, and I'm really excited to see this deep learning integration, all different road map report, you know, down the road. I think they're on the right track. >> All right. Dr. Wang, thank you so much for spending some time with us. We appreciate it and go enjoy the rest of your day. >> Yeah, thanks for being here. >> And thank you for watching the Cube. We're here at Spark Summit 2017. We'll be back after the break with another guest. (easygoing electronic music)
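On the DevOps wish list at the end of the conversation, knowing when data stops coming or a workflow dies, Structured Streaming does expose enough state to build a basic watchdog. Below is a hedged sketch that polls the progress of named streaming queries; the alert() hook is a hypothetical stand-in for whatever notification channel a team actually uses.

```python
# Sketch only: a minimal monitoring loop over named Structured Streaming
# queries, using the status/progress attributes PySpark exposes.
import time

def alert(message):
    print("ALERT:", message)         # stand-in for email, paging, ticketing, ...

def watch_streams(spark, expected_names, poll_seconds=60, max_idle_polls=5):
    idle = {name: 0 for name in expected_names}   # consecutive polls with no rows
    while True:
        active = {q.name: q for q in spark.streams.active}
        for name in expected_names:
            q = active.get(name)
            if q is None:
                alert("query %s is no longer running" % name)   # failed or stopped
                continue
            rows = (q.lastProgress or {}).get("numInputRows", 0)
            idle[name] = 0 if rows > 0 else idle[name] + 1
            if idle[name] >= max_idle_polls:
                alert("no new data on %s for %d polls" % (name, idle[name]))
        time.sleep(poll_seconds)
```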

Published Date : Jun 6 2017


Frederick Reiss, IBM STC - Big Data SV 2017 - #BigDataSV - #theCUBE


 

>> Narrator: Live from San Jose, California it's the Cube, covering Big Data Silicon Valley 2017. (upbeat music) >> Big Data SV 2016, day two of our wall to wall coverage of Strata Hadoob Conference, Big Data SV, really what we call Big Data Week because this is where all the action is going on down in San Jose. We're at the historic Pagoda Lounge in the back of the Faramount, come on by and say hello, we've got a really cool space and we're excited and never been in this space before, so we're excited to be here. So we got George Gilbert here from Wiki, we're really excited to have our next guest, he's Fred Rice, he's the chief architect at IBM Spark Technology Center in San Francisco. Fred, great to see you. >> Thank you, Jeff. >> So I remember when Rob Thomas, we went up and met with him in San Francisco when you guys first opened the Spark Technology Center a couple of years now. Give us an update on what's going on there, I know IBM's putting a lot of investment in this Spark Technology Center in the San Francisco office specifically. Give us kind of an update of what's going on. >> That's right, Jeff. Now we're in the new Watson West building in San Francisco on 505 Howard Street, colocated, we have about a 50 person development organization. Right next to us we have about 25 designers and on the same floor a lot of developers from Watson doing a lot of data science, from the weather underground, doing weather and data analysis, so it's a really exciting place to be, lots of interesting work in data science going on there. >> And it's really great to see how IBM is taking the core Watson, obviously enabled by Spark and other core open source technology and now applying it, we're seeing Watson for Health, Watson for Thomas Vehicles, Watson for Marketing, Watson for this, and really bringing that type of machine learning power to all the various verticals in which you guys play. >> Absolutely, that's been what Watson has been about from the very beginning, bringing the power of machine learning, the power of artificial intelligence to real world applications. >> Jeff: Excellent. >> So let's tie it back to the Spark community. Most folks understand how data bricks builds out the core or does most of the core work for, like, the sequel workload the streaming and machine learning and I guess graph is still immature. We were talking earlier about IBM's contributions in helping to build up the machine learning side. Help us understand what the data bricks core technology for machine learning is and how IBM is building beyond that. >> So the core technology for machine learning in Apache Spark comes out, actually, of the machine learning department at UC Berkeley as well as a lot of different memories from the community. Some of those community members also work for data bricks. We actually at the IBM Spark Technology Center have made a number of contributions to the core Apache Spark and the libraries, for example recent contributions in neural nets. In addition to that, we also work on a project called Apache System ML, which used to be proprietary IBM technology, but the IBM Spark Technology Center has turned System ML into Apache System ML, it's now an open Apache incubating project that's been moving forward out in the open. You can now download the latest release online and that provides a piece that we saw was missing from Spark and a lot of other similar environments and optimizer for machine learning algorithms. 
So in Spark, you have the Catalyst optimizer for data analysis, data frames, SQL: you write your queries in terms of those high level APIs and Catalyst figures out how to make them go fast. In System ML, we have an optimizer for high level languages like Spark and Python where you can write algorithms in terms of linear algebra, in terms of high level operations on matrices and vectors, and have the optimizer take care of making those algorithms run in parallel, run at scale, taking account of the data characteristics. Does the data fit in memory? If so, keep it in memory. Does the data not fit in memory? Stream it from disk. >> Okay, so there was a ton of stuff in there. >> Fred: Yep. >> And if I were to refer to that as so densely packed as to be a black hole, that might come across wrong, so I won't refer to that as a black hole. But let's unpack that, so the, and I meant that in a good way, like high bandwidth, you know. >> Fred: Thanks, George. >> Um, so the traditional Spark, the machine learning that comes with Spark's MLlib, one of its distinguishing characteristics is that the models, the algorithms that are in there, have been built to run on a cluster. >> Fred: That's right. >> And very few have, very few others have built machine learning algorithms to run on a cluster, but as you were saying, you don't really have an optimizer for finding which of the algorithms would fit optimally to solve a problem. Help us understand, then, how System ML solves a more general problem for, say, ensemble models and for scale out. I guess I'm, help us understand how System ML fits relative to Spark's MLlib and the more general problems it can solve. >> So, MLlib and a lot of other packages such as Sparkling Water from H2O, for example, provide you with a toolbox of algorithms, and each of those algorithms has been hand tuned for a particular range of problem sizes and problem characteristics. This works great as long as the particular problem you're facing as a data scientist is a good match to that implementation that you have in your toolbox. What System ML provides is less like having a toolbox and more like having a machine shop. You have a lot more flexibility, you have a lot more power, you can write down an algorithm as you would write it down if you were implementing it just to run on your laptop, and then let the System ML optimizer take care of producing a parallel version of that algorithm that is customized to the characteristics of your cluster, customized to the characteristics of your data. >> So let me stop you right there, because I want to use an analogy that others might find easy to relate to, for all the people who understand SQL and scale-out SQL. So, the way you were describing it, it sounds like, oh, if I were a SQL developer and I wanted to get at some data on my laptop, I would find it pretty easy to write the SQL to do that. Now, let's say I had a bunch of servers, each with its own database, and I wanted to get data from each database. If I didn't have a scale-out database, I would have to figure out physically how to go to each server in the cluster to get it. What I'm hearing is that for System ML, it will take that query that I might have written on my one server and it will transparently figure out how to scale that out, although in this case not queries, machine learning algorithms. >> The database analogy is very apt.
Just like SQL and query optimization: by allowing you to separate the logical description of what you're looking for from the physical description of how to get at it, it lets you have a parallel database with the exact same language as a single-machine database. In System ML, because we have an optimizer that separates that logical description of the machine learning algorithm from the physical implementation, we can target a lot of parallel systems, we can also target a large server, and the code, the code that implements the algorithm, stays the same. >> Okay, now let's take that a step further. You refer to matrix math and I think linear algebra and a whole lot of other things that I never quite made it to since I was a humanities major, but when we're talking about those things, my understanding is that those are primitives that Spark doesn't really implement, so that if you wanted to do neural nets, which rely on some of those constructs for high performance, >> Fred: Yes. >> Then, um, that's not built into Spark. Can you get to that capability using System ML? >> Yes. System ML, at its core, provides you as a user with a library of linear algebra primitives, just like a language like R or a library like NumPy gives you matrices and vectors and all of the operations you can do on top of those primitives. And just to be clear, linear algebra really is the language of machine learning. If you pick up a paper about an advanced machine learning algorithm, chances are the specification for what that algorithm does and how that algorithm works is going to be written in the paper literally in linear algebra, and the implementation that was used in that paper is probably written in a language where linear algebra is built in, like R, like NumPy. >> So it sounds to me like Spark has done the work of sort of the blocking and tackling of machine learning to run in parallel. And that's, I mean, to be clear, since we haven't really talked about it, that's important when you're handling data at scale and you want to train, you know, models on very, very large data sets. But it sounds like when we want to go to some of the more advanced machine learning capabilities, the ones that today are making all the noise with, you know, speech to text, text to speech, natural language understanding, those neural network based capabilities are not built into the core Spark MLlib. Would it be fair to say you could start getting at them through System ML? >> Yes, System ML is a much better way to do scalable linear algebra on top of Spark than the very limited linear algebra that's built into Spark. >> So alright, let's take the next step. Can System ML be grafted onto Spark in some way, or would it have to be an entirely new API that doesn't integrate with all the other Spark APIs? In a way, that has differentiated Spark, where each API is sort of accessible from every other. Can you tie System ML in, or do the Spark guys have to build more primitives into their own sort of engine first? >> A lot of the work that we've done with the Spark Technology Center as part of bringing System ML into the Apache ecosystem has been to build a nice, tight integration with Apache Spark, so you can pass Spark data frames directly into System ML and you can get data frames back. Your System ML algorithm, once you've written it in terms of one of System ML's main scripting languages, just plugs into Spark like all the algorithms that are built into Spark.
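To make that integration concrete, here is a hedged sketch of what calling SystemML from PySpark looks like: the algorithm is a few lines of linear algebra (closed-form, lightly regularized linear regression), a Spark DataFrame is passed in, and the SystemML optimizer decides how to execute it. The API shown reflects Apache SystemML around the 0.13 release and may differ in later versions; the data is synthetic.

```python
# Sketch only: Apache SystemML's Python MLContext API, roughly as of the
# 0.13 release. The DML script below is the whole "algorithm": normal
# equations for linear regression, written as linear algebra.
from pyspark.sql import SparkSession
from systemml import MLContext, dml

spark = SparkSession.builder.appName("systemmlSketch").getOrCreate()
ml = MLContext(spark)   # some versions take a SparkContext instead

script_text = """
    w = solve(t(X) %*% X + diag(matrix(1e-6, rows=ncol(X), cols=1)), t(X) %*% y)
"""

# Ordinary Spark DataFrames of doubles: two features and a target built from them.
df = (spark.range(0, 1000)
      .selectExpr("rand(1) as f1", "rand(2) as f2")
      .selectExpr("f1", "f2", "2.0 * f1 + 0.5 * f2 as label")
      .cache())
X_df = df.select("f1", "f2")
y_df = df.select("label")

prog = (dml(script_text)
        .input(X=X_df, y=y_df)
        .output("w"))
w = ml.execute(prog).get("w")
print(w.toNumPy())   # learned coefficients, roughly [2.0, 0.5]
```

The point of the sketch is the division of labor: the script says only what to compute, and whether that runs on one node or is distributed over the Spark cluster is the optimizer's decision, not the author's.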
>> Okay, so that's, that would keep Spark competitive with more advanced machine learning frameworks for a longer period of time, in other words, it wouldn't hit the wall the way if would if it encountered tensor flow from Google for Google's way of doing deep learning, Spark wouldn't hit the wall once it needed, like, a tensor flow as long as it had System ML so deeply integrated the way you're doing it. >> Right, with a system like System ML, you can quickly move into new domains of machine learning. So for example, this afternoon I'm going to give a talk with one of our machine learning developers, Mike Dusenberry, about our recent efforts to implement deep learning in System ML, like full scale, convolutional neural nets running on a cluster in parallel processing many gigabytes of images, and we implemented that with very little effort because we have this optimizer underneath that takes care of a lot of the details of how you get that data into the processing, how you get the data spread across the cluster, how you get the processing moved to the data or vice versa. All those decisions are taken care of in the optimizer, you just write down the linear algebra parts and let the system take care of it. That let us implement deep learning much more quickly than we would have if we had done it from scratch. >> So it's just this ongoing cadence of basically removing the infrastructure gut management from the data scientists and enabling them to concentrate really where their value is is on the algorithms themselves, so they don't have to worry about how many clusters it's running on, and that configuration kind of typical dev ops that we see on the regular development side, but now you're really bringing that into the machine learning space. >> That's right, Jeff. Personally, I find all the minutia of making a parallel algorithm worked really fascinating but a lot of people working in data science really see parallelism as a tool. They want to solve the data science problem and System ML lets you focus on solving the data science problem because the system takes care of the parallelism. >> You guys could go on in the weeds for probably three hours but we don't have enough coffee and we're going to set up a follow up time because you're both in San Francisco. But before we let you go, Fred, as you look forward into 2017, kind of the advances that you guys have done there at the IBM Spark Center in the city, what's kind of the next couple great hurdles that you're looking to cross, new challenges that are getting you up every morning that you're excited to come back a year from now and be able to say wow, these are the one or two things that we were able to take down in 2017? >> We're moving forward on several different fronts this year. On one front, we're helping to get the notebook experience with Spark notebooks consistent across the entire IBM product portfolio. We helped a lot with the rollout of notebooks on data science experience on z, for example, and we're working actively with the data science experience and with the Watson data platform. On the other hand, we're contributing to Spark 2.2. There are some exciting features, particularly in sequel that we're hoping to get into that release as well as some new improvements to ML Live. We're moving forward with Apache System ML, we just cut Version 0.13 of that. We're talking right now on the mailing list about getting System ML out of incubation, making it a full, top level project. 
And we're also continuing to help with the adoption of Apache Spark technology in the enterprise. Our latest focus has been on deep learning on Spark. >> Well, I think we found him! Smartest guy in the room. (laughter) Thanks for stopping by and good luck on your talk this afternoon. >> Thank you, Jeff. >> Absolutely. Alright, he's Fred Rice, he's George Gilbert, and I'm Jeff Rick, you're watching the Cube from Big Data SV, part of Big Data Week in San Jose, California. (upbeat music) (mellow music) >> Hi, I'm John Furrier, the cofounder of SiliconANGLE Media cohost of the Cube. I've been in the tech business since I was 19, first programming on mini computers.

Published Date : Mar 15 2017


Ted Julian, IBM Resilient - RSA Conference 2017 - #RSAC #theCUBE


 

(upbeat electronic music) >> Hey, welcome back everybody. Jeff Frick here with theCUBE. We are live in downtown San Francisco, Moscone Center at the RSA conference. It's one of the biggest conferences, I think after like Salesforce and Oracle that they have in Moscone on the tech scene. Over 40,000 professionals here talking about security, I think it was 34,000 last year. It's so busy they can't find a space for theCUBE, so we just have to make our way in. We're really excited by our next guest, Ted Julian from IBM Resistance, Resilience, excuse me. >> Thank you, it's alright. >> And you are the co-founder of VP Product Management. >> That's right. >> Welcome. >> Thanks, good to be here Jeff, thanks. >> And you said IBM actually purchased a company, >> Ted: A year ago. >> A year ago. So happy anniversary. >> Ted: Yeah, thanks. >> So how is that going? >> It's great. Business is really going well, it's been thrilling to get our product in place and a lot more customers and really see it help make a difference for them. >> Yeah we, Jesse Proudman is a many time CUBE alumni, his company is Blue Box, also bought by IBM. >> Ted: Yes. >> A little while ago, also had a really good experience of, kind of bringing all that horse power. >> They know what they are doing. >> To what his situation was. So let's jump into it. >> Sure. >> Security, it's kind of a dark and ominous keynote this morning. The attack's surface is growing with our homes and IOT. The bad guys are getting smarter, the governments are getting involved, there's just not necessarily bad guys. What's kind of your perspective as you see it year after year acquisition? 40,000 professionals here focused on this problem. >> We are not winning. >> We are not winning? >> Unfortunately, I mean, I guess as a species. Again, what is it? We saw a survey recently from the Ponemon Institute. 70% of organizations acknowledge they didn't have an incident response plan. So you talk about that stuff in the keynote where sort of a breach was inevitable. What are you going to do? Well the thing you'd need to have is a response plan to deal with it, and 70% don't. Cost of a breach also, according to Ponemon Institute is up to $4 million on average, obviously they can be a lot larger than that. >> Right. >> So there's a lot of work to be done to do better. >> And then you hook up a new device, and they are on that new device as soon as it plugs into the internet. They say within an hour, they ran a test today. So is the, I mean where are we winning, Where are we getting better? I mean, I've heard crazy stats that people don't even know they've been breached for like 245 days. >> Ted: Yeah. >> Is that coming down? Are we getting better? >> Certainly the best in the business are, and really the challenge I think as an industry is to percolate that down through the rest of the marketplace. Everybody is going to be breached, so it's not whether or not you are breached, it's how you deal with it come the day, that's really going to differentiate the good organizations from the bad ones. And that's where we've been able to help our customers quite a bit by using our platform to help them get a consistence and repeatable process for how they deal with that inevitable breach when it happens. >> That's interesting. So how much if it is you know kind of building a process for when these things happen versus just the cool, sexy technology that people like to talk about? >> Oh, it's everything. 
I mean one of the hottest trends that you're going to be seeing all over the show is automation and orchestration. Which is critically important as part of the sort of you get an alert and how do you enrich that to understand that, once you understand that how can you quickly come to sort of a course of action that you want to take. How can you implement that course of action very efficiently? Those things are all important. Computers can help a lot with that but at the end of the day it's smart people making good decisions that are going to be the success factor that determines how well you do. >> Right, right. Another kind of theme that we are hearing over and over is really collaboration amongst the companies amongst the competitors, sharing information about the threat profiles, about the threats that are coming in to kind of enable everybody to actually kind of be on the same team. That didn't always used to be the case, was it? >> Well, people have been working on this for a while but I think what's been a challenge is getting people to feel comfortable contributing their data into that data set. Naturally they are very sensitive about that, right? >> Right. >> This is some of our most confidential information that we've had a security issue and we're really not you know, dying to give that out to the general public. And so I think it's been, the industry's been trying to figure out how can we show enough value back when that information's contributed to some kind of a forum to make people feel more comfortable about doing that? So I think we've seen a little bit of progress over this last year and they'll be more going forward, but this is a, It's marathon not a sprint, I think to solve that problem. But, it is crucial because if we can get to that point that's what ultimately allows us to turn the tables on the bad guys. Because they cooperate, big time, they are sharing vulnerabilities, they are sharing tactics, they are sharing information about targets, and it's only when the good guys similarly share what they're experiencing that we'll have that opportunity to turn the table on them. >> It's funny we had a Verizon thing the other night and the guy said if you are from the investigator point of view, it's probably like a police investigator. They see the same pattern over and over and over and over and over it's only when it's the first time it's happen to you that's it's unique and different. So really the way to kind of short-circuit the whole response. >> How do you find out you've been breached? There is short list. One, Brian Crebs, very famous reporter happens to find out, he tells you. Number two, FBI. >> They tell you. >> Unfortunately, that's usually, it's usually external sources like that as oppose to organization internal systems that tip them off to a breach. Another example of how we are doing better but we need to do a lot better. >> And then there's this whole thing coming up called IOT, right. And 5G and all these connected device in the home, our cars, our nest, So the attacks surface gets giant. Like I said, they said in the keynote, you plug something in the internet they are on it within an hour. How does that really change the way that you kind of think about the problem? >> It makes it a lot harder. The attack surface gets harder, gets bigger, the potential risks go up quite a bit, right. I mean you are talking about heart implants, or things like that which may have connectivity to some degree, then obviously the stakes are severe. 
But the thing that makes those devices even trickier is so often they're embedded systems, and so unlike your Windows PC's or your Mac where, I mean it's updating itself all the time. >> Right, right. >> And you barely even think about it, you turn it on one morning and there is a new update. A little harder to make those update happen on IOT kinds of devices, either because they're harder to get to or the system's aren't as open or people aren't use to allowing those updates to occur. So even though we may know about the vulnerabilities patching them up is even harder in an IOT environment typically than in a traditional. >> It's crazy. Alright, so give us a little update on Resilient. What exactly is do you guys do inside this crazy eco-system of protecting us all? >> Sure. So five or six years ago, myself and my co-founder John started the company and it was really was acknowledging that we've gone through the era of prevention, to detection and now it's all about response. And at the end of the day when organizations were trying to deal with that we saw them using ticketing systems, spreadsheet, email, chat I mean a mess. And so we built our platform, the Resilient IRP from the ground up specifically to help them tie together the people processing in technology around incident response. And that's gone amazing. I mean the growth that we've seen even before the IBM acquisition but afterwards has been breath taking. And more recently we been adding more and more intelligence in automation and orchestration into the platform, to help not only advise people what to do, which we've done forever, but help them do it, click a bottom and we'll deploy that patch or we'll revoke that user's privileges or what have you. >> Right. Yeah a lot of conversation about kind of evolution of big data, evolution of things like Sparks so that you know can react in real time as opposed to kind of looking back after the fact and then trying to go and sell something. >> For sure. And for us it's really empowering that human. It's either the enrichment activity where they'd normally go to 10 different screens, to look up different data about a malware thread or about vulnerabilities, we just spoon feed that to them right within the platforms so they don't have to have those 10 tabs opened in the browser. And after they'd had a chance to evaluate that, and they want to know what to do, again they don't have to go to another tool and make that action happen, they can as click a button within Resilient and we'll do that for them. >> Alright. Ted Julian, we are rooting for you. >> Ted: Thanks, yeah. >> IBM, give him some more recourses. He's Ted Julian and I'm Jeff Frick. You're watching theCUBE at RSA Conference 2017, at Moscone Center, San Francisco. Thanks for watching.
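The enrich, decide, act loop Ted describes is easy to sketch in isolation. The snippet below is a generic illustration of that playbook shape; the lookup and remediation functions and the sample data are hypothetical stand-ins, not IBM Resilient's actual API.

```python
# Sketch only: one automated incident-response playbook step in the
# enrich -> decide -> act pattern discussed in the interview.
def enrich(alert, intel):
    """Attach the context an analyst would otherwise look up by hand."""
    indicator = alert["indicator"]
    alert["reputation"] = intel.get(indicator, {}).get("reputation", "unknown")
    alert["seen_before"] = indicator in intel
    return alert

def decide(alert):
    """Turn the enriched alert into a course of action."""
    if alert["reputation"] == "malicious":
        return "revoke_user"
    if alert["reputation"] == "unknown" and alert["severity"] >= 8:
        return "escalate_to_analyst"
    return "log_only"

def act(action, alert):
    # Stand-ins for orchestration hooks (patching, disabling accounts, ticketing).
    handlers = {
        "revoke_user": lambda a: print("revoking credentials for", a["user"]),
        "escalate_to_analyst": lambda a: print("opening incident for", a["indicator"]),
        "log_only": lambda a: None,
    }
    handlers[action](alert)

threat_intel = {"203.0.113.9": {"reputation": "malicious"}}
incoming = {"user": "jdoe", "indicator": "203.0.113.9", "severity": 6}
act(decide(enrich(incoming, threat_intel)), incoming)
```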

Published Date : Feb 15 2017


Wikibon Big Data Market Update pt. 2 - Spark Summit East 2017 - #SparkSummit - #theCUBE


 

(lively music) >> [Announcer] Live from Boston, Massachusetts, this is the Cube, covering Sparks Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Sparks Summit in Boston, everybody. This is the Cube, the worldwide leader in live tech coverage. We've been here two days, wall-to-wall coverage of Sparks Summit. George Gilbert, my cohost this week, and I are going to review part two of the Wikibon Big Data Forecast. Now, it's very preliminary. We're only going to show you a small subset of what we're doing here. And so, well, let me just set it up. So, these are preliminary estimates, and we're going to look at different ways to triangulate the market. So, at Wikibon, what we try to do is focus on disruptive markets, and try to forecast those over the long term. What we try to do is identify where the traditional market research estimates really, we feel, might be missing some of the big trends. So, we're trying to figure out, what's the impact, for example, of real time. And, what's the impact of this new workload that we've been talking about around continuous streaming. So, we're beginning to put together ways to triangulate that, and we're going to show you, give you a glimpse today of what we're doing. So, if you bring up the first slide, we showed this yesterday in part one. This is our last year's big data forecast. And, what we're going to do today, is we're going to focus in on that line, that S-curve. That really represents the real time component of the market. The Spark would be in there. The Streaming analytics would be in there. Add some color to that, George, if you would. >> [George] Okay, for 60 years, since the dawn of computing, we have two ways of interacting with computers. You put your punch cards in, or whatever else and you come back and you get your answer later. That's batch. Then, starting in the early 60's, we had interactive, where you're at a terminal. And then, the big revolution in the 80's was you had a PC, but you still were either interactive either with terminal or batch, typically for reporting and things like that. What's happening is the rise of a new interaction mode. Which is continuous processing. Streaming is one way of looking at it but it might be more effective to call it continuous processing because you're not going to get rid of batch or interactive but your apps are going to have a little of each. So, what we're trying to do, since this is early, early in its life cycle, we're going to try and look at that streaming component from a couple of different angles. >> Okay, as I say, that's represented by this Ogive curve, or the S-curve. On the next slide, we're at the beginning when you think about these continuous workloads. We're at the early part of that S-curve, and of course, most of you or many of you know how the S-curve works. It's slow, slow, slow. For a lot of effort, you don't get much in return. Then you hit the steep part of that S-curve. And that's really when things start to take off. So, the challenge is, things are complex right now. That's really what this slide shows. And Spark is designed, really, to reduce some of that complexity. We've heard a lot about that, but take us through this. Look at this data flow from ingest, to explore, to process, to serve. We talked a lot about that yesterday, but this underscores the complexity in the marketplace. 
>> [George] Right, and while we're just looking mostly at numbers today, the point of the forecast is to estimate when the barriers, representing complexities, start to fall. And then, when we can put all these pieces together, in just explore, process, serve. When that becomes an end-to-end pipeline. When you can start taking the data in on one end, get a scientist to turn it into a model, inject it into an application, and that process becomes automated. That's when it's mature enough for the knee in the curve to start. >> And that's when we think the market's going to explode. But now so, how do you bound this. Okay, when we do forecasts, we always try to bound things. Because if they're not bounded, then you get no foundation. So, if you look at the next slide, we're trying to get a sense of real-time analytics. How big can it actually get? That's what this slide is really trying to-- >> [George] So this one was one firm's take on real-time analytics, where by 2027, they see it peaking just under-- >> [Dave] When you say one firm, you mean somebody from the technology district? >> [George] Publicly available data. And we take it as as a, since they didn't have a lot of assumptions published, we took it as, okay one data point. And then, we're going to come at it with some bottoms-up end top-down data points, and compare. >> [Dave] Okay, so the next slide we want to drill into the DBMS market and when you think about DBMS, you think about the traditional RDBMS and what we know, or the Oracle, SQL Server, IBMDB2's, etc. And then, you have this emergent NewSQL, and noSQL entrance, which are, obviously, we talked today to a number of folks. The number of suppliers is exploding. The revenue's still relatively small. Certainly small relative to the RDBMS marketplace. But, take us through what your expectations is here, and what some of the assumptions are behind this. >> [George] Okay, so the first thing to understand is the DBMS market, overall, is about $40 billion of which 30 billion goes to online transaction processing supporting real operational apps. 10 billion goes to Orlap or business intelligence type stuff. The Orlap one is shrinking materially. The online transaction processing one, new sales is shrinking materially but there's a huge maintenance stream. >> [Dave] Yeah which companies like Oracle and IBM and Microsoft are living off of that trying to fund new development. >> We modeled that declining gently and beginning to accelerate more going out into the latter years of the tenure period. >> What's driving that decline? Obviously, you've got the big sucking sound of a dup in part, is driving that. But really, increasingly it's people shifting their resources to some of these new emergent applications and workloads and new types of databases to support them right? But these are still, those new databases, you can see here, the NewSQL and noSQL still, relatively, small. A lot of it's open source. But then it starts to take off. What's your assumption there? >> So here, what's going on is, if you look at dollars today, it's, actually, interesting. If you take the noSQL databases, you take DynamoDB, you take Cassandra, Hadoop, HBase, Couchbase, Mongo, Kudu and you add all those up, it's about, with DynamoDB, it's, probably, about 1.55 billion out of a $40 billion market today. >> [Dave] Okay but it's starting to get meaningful. We were approaching two billion. >> But where it's meaningful is the unit share. If that were translated into Oracle pricing. 
The market would be much, much bigger. So the point it. >> Ten X? >> At least, at least. >> Okay, so in terms of work being done. If there's a measure of work being done. >> [George] We're looking at dollars here. >> Operations per second or etcetera, it would be enormous. >> Yes, but that's reflective of the fact that the data volumes are exploding but the prices are dropping precipitously. >> So do you have a metric to demonstrate that. We're, obviously, not going to show it today but. >> [George] Yes. >> Okay great, so-- >> On the business intelligence side, without naming names, the data warehouse appliance vendors are charging anywhere from 25,000 per terabyte up to, when you include running costs, as high as 100,000 a terabyte. That their customers are estimating. That's not the selling cost but that's the cost of ownership per terabyte. Whereas, if you look at, let's say Hadoop, which is comparable for the off loading some of the data warehouse work loads. That's down to the 5K per terabyte range. >> Okay great, so you expect that these platforms will have a bigger and bigger impact? What's your pricing assumption? Is prices going to go up or is it just volume's going to go through the roof? >> I'm, actually, expecting pricing. It's difficult because we're going to add more and more functionality. Volumes go up and if you add sufficient functionality, you can maintain pricing. But as volumes go up, typically, prices go down. So it's a matter of how much do these noSQL and NewSQL databases add in terms of functionality and I distinguish between them because NewSQL databases are scaled out version of Oracle or Teradata but they are based on the more open source pricing model. >> Okay and NoSQL, don't forget, stands for not only SQL, not not SQL. >> If you look at the slides, big existing markets never fall off a cliff when they're in the climb. They just slowly fade. And, eventually, that accelerates. But what's interesting here is, the data volumes could explode but the revenue associated with the NoSQL which is the dark gray and the NewSQL which is the blue. Those don't explode. You could take, what's the DBMS cost of supporting YouTube? It would be in the many, many, many billions of dollars. It would support 1/2 of an Oracle itself probably. But it's all open source there so. >> Right, so that's minimizing the opportunity is what you're saying? >> Right. >> You can see the database market is flat, certainly flattish and even declining but you do expect some growth in the out years as part of that evasion, that volume, presumably-- >> And that's the next slide which is where we've seen that growth come from. >> Okay so let's talk about that. So the next slide, again, I should have set this up better. The X-axis year is worldwide dollars and the horizontal axis is time. And we're talking here about these continuous application work loads. This new work load that you talked about earlier. So take us through the three. >> [George] There's three types of workloads that, in large part, are going to be driving most of this revenue. Now, these aren't completely, they are completely comparable to the DBMS market because some of these don't use traditional databases. Or if they do, they're Torry databases and I'll explain that. >> [Dave] Sure but if I look at the IoT Edge, the Cloud and the micro services and streaming, that's a tail wind to the database forecast in the previous slide, is that right? 
>> [George] It's, actually, interesting but the application and infrastructure telemetry, this is what Splunk pioneered. Which is all the torrents of data coming out of your data center and your applications and you're trying to manage what's going on. That is a database application. And we know Splunk, for 2016, was 400 million. In software revenue Hadoop was 750 million. And the various other management vendors, New Relic, AppDynamics, start ups and 5% of Azure and AWS revenue. If you add all that up, it comes out to $1.7 billion for 2016. And so, we can put a growth rate on that. And we talked to several vendors to say, okay, how much will that work load be compared to IoT Edge Cloud. And the IoT Edge Cloud is the smart devices at the Edge and the analytics are in the fog but not counting the database revenue up in the Cloud. So it's everything surrounding the Cloud. And that, actually, if you look out five years, that's, maybe, 20% larger than the app and infrastructure telemetry but growing much, much faster. Then the third one where you were talking about was this a tail wind to the database. Micro server systems streaming are very different ways of building applications from what we do now. Now, people build their logic for the application and everyone then, stores their data in this centralized external database. In micro services, you build a little piece of the app and whatever data you need, you store within that little piece of the app. And so the database requirements are, rather, primitive. And so that piece will not drive a lot of database revenue. >> So if you could go back to the previous slide, Patrick. What's driving database growth in the out years? Why wouldn't database continue to get eaten away and decline? >> [George] In broad terms, the overall database market, it staying flat. Because as prices collapse but the data volumes go up. >> [Dave] But there's an assumption in here that the NoSQL space, actually, grows in the out years. What's driving that growth? >> [George] Both the NoSQL and the NewSQL. The NoSQL, probably, is best serving capturing the IoT data because you don't need lots of fancy query capabilities for concurrency. >> [Dave] So it is a tail wind in a sense in that-- >> [George] IoT but that's different. >> [Dave] Yeah sure but you've got the overall market growing. And that's because the new stuff, NewSQL and NoSQL is growing faster than the decline of the old stuff. And it's not in the 2020 to 2022 time frame. It's not enough to offset that decline. And then they have it start growing again. You're saying that's going to be driven by IoT and other Edge use cases? >> Yes, IoT Edge and the NewSQL, actually, is where when they mature, you start to substitute them for the traditional operational apps. For people who want to write database apps not who want to write micro service based apps. >> Okay, alright good. Thank you, George, for setting it up for us. Now, we're going to be at Big Data SV in mid March? Is that right? Middle of March. And George is going to be releasing the actual final forecast there. We do it every year. We use Spark Summit to look at our preliminary numbers, some of the Spark related forecasts like continuous work loads. And then we harden those forecasts going into Big Data SV. We publish our big data report like we've done for the past, five, six, seven years. So check us out at Big Data SV. We do that in conjunction with the Strada events. So we'll be there again this year at the Fairmont Hotel. 
We got a bunch of stuff going on all week there. Some really good programs going on. So check out siliconangle.tv for all that action. Check out Wikibon.com. Look for new research coming out. You're going to be publishing this quarter, correct? And of course, check out siliconangle.com for all the news. And, really, we appreciate everybody watching. George, been a pleasure co-hosting with you. As always, really enjoyable. >> Alright, thanks Dave. >> Alright, so that's a wrap from Spark Summit. We're going to try to get out of here, hit the snow storm and work our way home. Thanks everybody for watching. A great job by everyone here: Seth, Ava, Patrick and Alex. And thanks to our audience. This is the Cube. We're out, see you next time. (lively music)
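To make George's telemetry-workload sizing from earlier in the segment concrete, here is a rough back-of-the-envelope sketch. Only the Splunk and Hadoop figures come from the conversation; the remaining vendor and cloud revenue is treated as a single plug figure, and the growth rate is purely an illustrative assumption, not part of the Wikibon forecast.

```python
# Back-of-the-envelope sizing of the 2016 app/infrastructure telemetry workload.
# Splunk and Hadoop figures are from the conversation; everything else is a plug.
splunk_2016 = 0.40    # $B, software revenue
hadoop_2016 = 0.75    # $B, software revenue
total_2016 = 1.70     # $B, George's combined estimate

other_vendors_and_cloud = total_2016 - (splunk_2016 + hadoop_2016)
print("New Relic, AppDynamics, startups, ~5% of AWS/Azure, etc.:",
      round(other_vendors_and_cloud, 2), "$B")

# Projecting forward with an assumed growth rate (illustrative only).
assumed_cagr = 0.25
projection = total_2016
for year in range(2017, 2022):
    projection *= 1 + assumed_cagr
    print(year, round(projection, 2), "$B")
```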

Published Date : Feb 9 2017

Ziya Ma, Intel - Spark Summit East 2017 - #sparksummit - #theCUBE


 

>> [Narrator] Live from Boston, Massachusetts. This is the Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert. >> Back to Boston, everybody. This is the Cube and we're here live at Spark Summit East, #SparkSummit. Ziya Ma is here. She's the Vice President of Big Data at Intel. Ziya, thanks for coming to the Cube. >> Thanks for having me. >> You're welcome. So software is our topic. Software at Intel. You know, people don't necessarily always associate Intel with software, but what's the story there? >> So actually there are many things that we do for software. Since I manage the Big Data engineering organization, I'll just say a little bit more about what we do for Big Data. >> [Dave] Great. >> So you know Intel does all the processors, all the hardware. But when our customers are using the hardware, they like to get the best performance out of Intel hardware. So this is for the Big Data space. We optimize the Big Data solution stack, including Spark and Hadoop, on top of Intel hardware. And make sure that we leverage the latest instruction set so that the customers get the most performance out of the newest released Intel hardware. And also we collaborate very extensively with the open source community for Big Data ecosystem advancement. For example, we're a leading contributor to the Apache Spark ecosystem. We're also a top contributor to the Apache Hadoop ecosystem. And lately we're getting into the machine learning and deep learning and the AI space, especially integrating those capabilities into the Big Data ecosystem. >> So I have to ask you a question, just sort of strategically. If we go back several years, you look at during the Unix days, you had a number of players developing hardware, microprocessors, there were RISC-based systems, remember MIPS, and of course IBM had one, and Sun, et cetera, et cetera. Some of those live on, but with a very, very small portion of the market. So Intel has dominated the general purpose market. So as Big Data became more mainstream, was there a discussion: okay, we have to develop specialized processors, which I know Intel can do as well, or did you say, okay, we can actually optimize through software? Was that how you got here? Or am I understanding that? >> We believe definitely software optimization, optimizing through software, is one thing that we do. That's why Intel actually have, you may not know this, Intel has one of the largest software divisions that focus on enabling and optimizing the solutions on Intel hardware. And of course we also have a very aggressive product roadmap for advancing continuously our hardware products. And actually, you mentioned general purpose computing. CPU today, in the Big Data market, still has more than 95% of the market. So that's still the biggest portion of the Big Data market. And we'll continue our advancement in that area. And obviously, as the AI and machine learning and deep learning use cases are getting added into the Big Data domain, we are expanding our product portfolio into some other silicon products. >> And of course that was kind of the big bet of, we want to bet on Intel. And I guess, I guess-- >> You should still do. >> And still do. And I guess, at the time, Seagate or other disk makers. Now flash comes in. And of course now Spark with memory, it's really changing the game, isn't it? What does that mean for you and the software group? >> Right, so what do we... 
Actually, still we focus on the optimi-- Obviously at the hardware level, like Intel now, is not just offering the computing capability. We also offer very powerful network capability. We offer very good memory solutions, memory hardware. Like we keep talking about these non-volatile memory technologies. So for Big Data, we're trying to leverage all those newest hardware. And we're already working with many of our customers to help them to improve their Big Data memory solution, the in-memory analytics type of capability, on Intel hardware, give them the most optimum performance and most secure result using Intel hardware. So that's definitely one thing that we continue to do. That's going to be still our top priority. But we don't just limit our work to optimization. Because giving users the best experience, giving users the complete experience on the Intel platform is our ultimate goal. So we work with our customers from financial services companies. We work with folks from manufacturing. From transportation. And from other IoT, internet of things, segments. And to make sure that we give them the easiest Big Data analytics experience on Intel hardware. So when they are running those solutions they don't have to worry too much about how to make their application work with Intel hardware, and how to make it more performant with Intel hardware. Because that's the Intel software solution that's going to bridge the gap. We do that part of the job. And so that it will make our customers' experience easier and more complete. >> You serve as the accelerant to the marketplace. Go ahead, George. >> [Ziya] That's right. >> So Intel's BigDL is the new product, as of the last month or so, an open source solution. Tell us how there are other deep learning frameworks that aren't as fully integrated with Spark yet and where BigDL fits in, since we're at a Spark conference. How it backfills some functionality and how it really takes advantage of Intel hardware. >> George, just like you said, BigDL, we just open sourced a month ago. It's a deep learning framework that we organically built on top of Apache Spark. And it has quite some differences from the other mainstream deep learning frameworks like Caffe, TensorFlow, Torch and Theano, you name it. The reason that we decided to work on this project was, again, through our experience working with our analytics, especially Big Data analytics customers: as they build their AI solutions or AI modules within their analytics application, it's funny, it's getting more and more difficult to build and integrate AI capability into their existing Big Data analytics ecosystem. They had to set up a different cluster and build a different set of AI capabilities using, let's say, one of the deep learning frameworks. And later they have to overcome a lot of challenges, for example, moving the model and data between the two different clusters and then making sure that the AI result is getting integrated into the existing analytics platform or analytics application. So that was the primary driver. How do we make our customers' experience easier? Do they have to leave their existing infrastructure and build a separate AI module? And can we do something organic on top of the existing Big Data platform, let's say Apache Spark? Can we just do something like that? So that the user can just leverage the existing infrastructure and make it a naturally integral part of the overall analytics ecosystem that they already have. So this was the primary driver. 
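To make the "organic on top of Spark" point concrete, here is a minimal, hedged sketch in PySpark. It uses Spark's built-in MLlib classifier as a stand-in for the model, since the conversation doesn't show BigDL's own API; with BigDL, the training step would be swapped for a BigDL network and optimizer, but the shape is the same: feature preparation, training, and scoring all stay on the cluster that already manages the data, so nothing has to be copied to a separate deep learning environment. Table and column names are hypothetical.

```python
# Illustrative sketch: keep feature prep, training, and scoring on the same
# Spark cluster that already holds the data, rather than exporting the data
# to a separate deep learning cluster.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train-where-the-data-lives").getOrCreate()

# Hypothetical table of already-ingested records.
df = spark.table("analytics.defect_records")

# Feature engineering happens in Spark, next to the data.
features = VectorAssembler(
    inputCols=["sensor_1", "sensor_2", "sensor_3"],
    outputCol="features",
).transform(df)

# Stand-in model; a BigDL network and optimizer would slot in here,
# still reading the same distributed DataFrame.
model = LogisticRegression(featuresCol="features", labelCol="is_defect").fit(features)

# Predictions land back in the same ecosystem for downstream analytics.
model.transform(features).select("record_id", "prediction") \
    .write.mode("overwrite").saveAsTable("analytics.defect_predictions")
```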
And also the other benefit that we see by integrating this BigDL framework naturally with the Big Data platform is that it enables efficient scale-out and fault tolerance and elasticity and dynamic resource management. And those are the benefits that are naturally brought by the Big Data platform. And today, actually, just with this short period of time, we have already tested that BigDL can scale easily to tens or hundreds of nodes. So the scalability is also quite good. And another benefit with a solution like BigDL, especially because it eliminates the need of setting up a separate cluster and moving the model between different hardware clusters, is that you save on your total cost of ownership. You can just leverage your existing infrastructure. There is no need to buy an additional set of hardware and build another environment just for training the model. So that's another benefit that we see. And performance-wise, again, we also tested BigDL against Caffe, Torch and TensorFlow. So the performance of BigDL on single-node Xeon is orders of magnitude faster than out-of-the-box open source Caffe, TensorFlow or Torch. So definitely it's going to be very promising. >> Without the heavy lifting. >> And a useful solution, yeah. >> Okay, can you talk about some of the use cases that you expect to see from your partners and your customers? >> Actually, very good question. You know, we already started a few engagements with some of the interested customers. The first customer is from the steel industry, where improving the accuracy of steel-surface defect recognition is very important to its quality control. So we worked with this customer in the last few months and built an end-to-end image recognition pipeline using BigDL and Spark. And the customer, just through phase one work, already improved its defect recognition accuracy to 90%. And they're seeing a very good yield improvement with steel production. >> And it used to be done by humans? >> It used to be done by humans, yes. >> And you said, what was the degree of improvement? >> 90, nine, zero. So now the accuracy is up to 90%. And financial services, actually, is another use case, especially for fraud detection. So this customer, again, I'm not naming them at the customer's request; the financial industry, they're very sensitive with releasing their name. So the customer was seeing its fraud risks increasing tremendously, with its wide range of products, services and customer interaction channels. So they implemented an end-to-end deep learning solution using BigDL and Spark. And again, through phase one work, they are seeing the fraud detection rate improved 40 times, four, zero times, through phase one work. We think there is more improvement that we can do because this is just a collaboration over the last few months. And we'll continue this collaboration with this customer. And we expect more use cases from other business segments. But those are the two that already have BigDL running in production today. >> Well, so the first, that's amazing. Essentially replacing the human who used to have to inspect, and being much more accurate. The fraud detection is interesting because fraud detection has come a long way in the last 10 years, as you know. Used to take six months, if they found fraud. And now it's minutes, seconds, but there's a lot of false positives still. So do you see this technology helping address that problem? >> Yeah, actually, continuously improving the prediction accuracy is one of the goals. 
This is another reason why we need to bring AI and Big Data together. Because you need to train your model. You need to train your AI capabilities with more and more training data, so that you get much more improved training accuracy. Actually, this is the biggest way of improving your training accuracy. So you need a huge infrastructure, a big data platform, so that you can host and well manage your training data sets, and so that they can feed into your deep learning solution or module for continuously improving your training accuracy. So yes. >> This is a really key point, it seems like. I would like to unpack that a little bit. So when we talk to customers and application vendors, it's that training feedback loop that gets the models smarter and smarter. So if you had one cluster for training that was with another framework, and then Spark was your... rest of your analytics. How would training with feedback data work when you had two separate environments? >> You know, that's one of the drivers why we're creating BigDL. Because, we tried to port, we did not come to BigDL at the very beginning. We tried to port the existing deep learning frameworks like Caffe and TensorFlow onto Spark. And you also probably saw some research papers; there's other teams out there that are also trying to port Caffe, TensorFlow and other deep learning frameworks that are out there onto Spark. Because you have that need. You need to bring the two capabilities together. But the problem is that those systems were developed in a very traditional way, with Big Data not yet in consideration when those frameworks were created, were innovated. But now the need for converging the two becomes more and more clear, and more necessary. And that's why, when we ported it over, we said gosh, this is so difficult. First, it's very challenging to integrate the two. And secondly, the experience, after you've moved it over, is awkward. You're literally using Spark as a dispatcher. The integration is not coherent. It's like they're superficially integrated. So this is where we said, we've got to do something different. We cannot just superficially integrate two systems together. Can we do something organic on top of the Big Data platform, on top of Apache Spark? So that the integration between the training system, between the feature engineering, between data management, can be more consistent, can be more integrated. So that's exactly the driver for this work. >> That's huge. Seamless integration is one of the most overused phrases in the technology business. Superficial integration is maybe a better description for a lot of those so-called seamless integrations. You're claiming here that it's seamless integration. We're out of time, but last word: Intel and Spark Summit. What do you guys got going here? What's the vibe like? >> So actually tomorrow I have a keynote. I'm going to talk a little bit more about what we're doing with BigDL. Actually this is one of the big things that we're doing. And of course, in order for BigDL, a system like BigDL or even other deep learning frameworks, to get optimum performance on Intel hardware, there's another item that we're highlighting: MKL, the Intel optimized Math Kernel Library. It has a lot of common math routines that are optimized for Intel processors using the latest instruction set. And that's already, today, integrated into the BigDL ecosystem. So that's another thing that we're highlighting. And another thing is that those are just software. 
And at the hardware level, during November, at Intel's AI Day, our executives, BK, Diane Bryant and Doug Fisher, also highlighted the Nervana product portfolio that's coming out. That will give you different hardware choices for AI. You can look at FPGA, Xeon Phi, Xeon and our new Nervana-based silicon like Lake Crest. And those are some good silicon products that you can expect in the future. >> Intel, taking us to nirvana, touching every part of the ecosystem. Like you said, 95% share and in all parts of the business. Yeah, thanks very much for coming on the Cube. >> Thank you, thank you for having me. >> You're welcome. Alright, keep it right there. George and I will be back with our next guest. This is Spark Summit, #SparkSummit. We're the Cube. We'll be right back.
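Ziya's point about MKL is easy to check on a local machine: many scientific Python builds link the same Math Kernel Library, and NumPy will report which BLAS/LAPACK it was built against. This says nothing about BigDL itself, which links MKL natively on the JVM side; it is just a quick, hedged way to see whether your own numeric stack is picking up MKL. MKL-backed builds typically list something like 'mkl_rt' in the output.

```python
# Print the BLAS/LAPACK configuration NumPy was built with; MKL-backed
# builds usually mention libraries such as 'mkl_rt'.
import numpy as np

np.show_config()
```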

Published Date : Feb 8 2017

Nick Pentreath, IBM STC - Spark Summit East 2017 - #sparksummit - #theCUBE


 

>> Narrator: Live from Boston, Massachusetts, this is The Cube, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Boston, everybody. Nick Pentreath is here, he's a principal engineer at the IBM Spark Technology Center in South Africa. Welcome to The Cube. >> Thank you. >> Great to see you. >> Great to see you. >> So let's see, it's a different time of year here than you're used to. >> I've flown from, I don't know the Fahrenheit equivalent, but 30 degrees Celsius heat and sunshine to snow and sleet, so. >> Yeah, yeah. So it's a lot chillier here. Wait until tomorrow. But, so we were joking. You probably get the T-shirt for the longest flight here, so welcome. >> Yeah, I actually need the parka, or like a beanie. (all laugh) >> Little better. Long sleeve. So Nick, tell us about the Spark Technology Center, STC is its acronym, and your role there. >> Sure, yeah, thank you. So the Spark Technology Center was formed by IBM a little over a year ago, and its mission is to focus on the Open Source world, particularly Apache Spark and the ecosystem around that, and to really drive forward the community and to make contributions to both the core project and the ecosystem. The overarching goal is to help drive adoption, yeah, and particularly with enterprise customers, the kind of customers that IBM typically serves. And to harden Spark and to make it really enterprise ready. >> So why Spark? I mean, we've watched IBM do this now for several years. The famous example that I like to use is Linux. When IBM put $1 billion into Linux, it really went all in on Open Source, and it drove a lot of IBM value, both internally and externally for customers. So what was it about Spark? I mean, you could have made a similar bet on Hadoop. You decided not to, you sort of waited to see that market evolve. What was the catalyst for having you guys all go in on Spark? >> Yeah, good question. I don't know all the details, certainly, of what the internal drivers were, because I joined STC a little under a year ago, so I'm fairly new. >> Translate the hallway talk, maybe. (Nick laughs) >> Essentially, I think you raise very good parallels to Linux and also Java. >> Absolutely. >> So Spark, sorry, IBM, made these investments in Open Source technologies that it saw as transformational and kind of game-changing. And I think, you know, most people will probably admit within IBM that they maybe missed the boat, actually, on Hadoop and saw Spark as the successor, and actually saw a chance to really dive into that and kind of almost leapfrog and say, "We're going to back this as the next generation analytics platform and operating system for analytics and big data in the enterprise." >> Well, I don't know if you happened to watch the Super Bowl, but there's a saying that it's sometimes better to be lucky than good. (Nick laughs) And that sort of applies, and so, in some respects, maybe missing the window on Hadoop was not a bad thing for IBM-- >> Yeah, exactly. >> Because not a lot of people made a ton of dough on Hadoop and they're still sort of struggling to figure it out. And now along comes Spark, and you've got this more real time nature. IBM talks a lot about bringing analytics and transactions together. They've made some announcements about that and affecting business outcomes in near real time. I mean, that's really what it's all about, and one of your areas of expertise is machine learning. 
And so, talk about that relationship and what it means for organizations, your mission. >> Yeah, machine learning is a key part of the mission. And you've seen the kind of big data in the enterprise story, starting with the kind of Hadoop and data lakes. And that's evolved into, now we've, before we just dumped all of this data into these data lakes and these silos and maybe we had some Hadoop jobs and so on. But now we've got all this data we can store, what are we actually going to do with it? So part of that is the traditional data warehousing and business intelligence and analytics, but more and more, we're seeing there's a rich value in this data, and to unlock it, you really need intelligent systems. You need machine learning, you need AI, you need real time decision making that starts transcending the boundaries of all the rule-based systems and human-based systems. So we see machine learning as one of the key tools and one of the key unlockers of value in these enterprise data stores. >> So Nick, perhaps paint us a picture of someone who's advanced enough to be working with machine learning with IBM, and we know that the tool chain's kind of immature. Although IBM, with DataWorks or DataFirst, has a fairly broad end-to-end sort of suite of tools, but what are the early use cases? And what needs to mature to go into higher volume production apps or higher-value production apps? >> I think the early use cases for machine learning in general, and certainly at scale, are numerous and they're growing, but classic examples are, let's say, recommendation engines. That's an area that's close to my heart. In my previous life before IBM, I built a startup that had a recommendation engine service targeting online stores and e-commerce players and social networks and so on. So this is a great kind of example use case. We've got all this data about, let's say, customer behavior in your retail store or your video-sharing site, and in order to serve those customers better and make more money, if you can make good recommendations about what they should buy, what they should watch, or what they should listen to, that's a classic use case for machine learning and unlocking the data that is there. So that is one of the drivers of some of these systems; players like Amazon, they're sort of good examples of the recommendation use case. Another is fraud detection, and that is a classic example in financial services, enterprise, which is a kind of staple of IBM's customer base. So these are a couple of examples of the use cases, but the tool sets, traditionally, have been kind of cumbersome. So Amazon built everything from scratch themselves using customized systems, and they've got teams and teams of people. Nowadays, you've got this built into Apache Spark, you've got, in Spark, a machine learning library, you've got good models to do that kind of thing. So I think from an algorithmic perspective, there's been a lot of advancement and there's a lot of standardization and almost commoditization of the model side. So what is missing? >> George: Yeah, what else? >> And what are the shortfalls currently? So there's a big difference between the current view, I guess the hype, of machine learning as: you've got data, you apply some machine learning, and then you get profit, right? But really, there's a hugely complex workflow that involves this end-to-end story. 
You've got data coming from various data sources, you have to feed it into one centralized system, transform and process it, extract your features and do your sort of hardcore data science, which is the core piece that everyone sort of thinks about as the only piece, but that's kind of in the middle and it makes up a relatively small proportion of the overall chain. And once you've got that, you do model training and selection and testing, and you now have to take that model, that machine-learning algorithm, and you need to deploy it into a real system to make real decisions. And that's not even the end of it, because once you've got that, you need to close the loop, what we call the feedback loop, and you need to monitor the performance of that model in the real world. You need to make sure that it's not deteriorating, that it's adding business value. All of these kinds of things. So I think that is the real piece of the puzzle that's missing at the moment: delivering this end-to-end story and doing it at scale, securely, enterprise-grade. >> And the business impact of that presumably will be a better-quality experience. I mean, recommendation engines and fraud detection have been around for a while, they're just not that good. Retargeting systems are too little too late, and kind of cumbersome fraud detection. Still a lot of false positives. Getting much better, certainly compressing the time. It used to be six months, >> Yes, yes. >> Now it's minutes or seconds, but a lot of false positives still, so, but are you suggesting that by closing that gap, that we'll start to see from a consumer standpoint much better experiences? >> Well, I think that's imperative, because if you don't see that from a consumer standpoint, then the mission is failing, because ultimately, it's not magic that you just simply throw machine learning at something and you unlock business value and everyone's happy. You have to, you know, there's a human in the loop there. You have to fulfill the customer's need, you have to fulfill consumer needs, and the better you do that, the more successful your business is. You mentioned the time scale, and I think that's a key piece here. >> Yeah. >> What makes better decisions? What makes a machine-learning system better? Well, it's better data and more data, and faster decisions. So I think all of those three are coming into play with Apache Spark, the end-to-end story, streaming systems, and the models are getting better and better because they're getting more data and better data. >> So I think we've, the industry, has pretty much attacked the time problem. Certainly for fraud detection and recommendation systems, the quality issue. Are we close? I mean, are we talking about 6-12 months before we really sort of start to see a major impact to the consumer and ultimately, to the company who's providing those services? >> Nick: Well, >> Or is it further away than that, you think? >> You know, it's always difficult to make predictions about timeframes, but I think there's a long way to go, to go from, yeah, as you mentioned where we are, the algorithms and the models are quite commoditized. The time gap to make predictions is kind of down to this real-time nature. >> Yeah. >> So what is missing? I think it's actually less about the traditional machine-learning algorithms and more about making the systems better and getting better feedback, better monitoring, so improving the end user's experience of these systems. >> Yeah. 
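A compact sketch of the end-to-end story Nick describes, using Spark ML's Pipeline API. Column names and paths are hypothetical, and the single held-out evaluation at the end is only a stand-in for the ongoing monitoring and feedback loop he has in mind.

```python
# Minimal Spark ML pipeline: feature engineering, training, a persisted
# deployable artifact, and a basic held-out evaluation.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("end-to-end-pipeline").getOrCreate()

df = spark.read.parquet("/data/transactions")            # hypothetical path
train, test = df.randomSplit([0.8, 0.2], seed=7)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="merchant_category", outputCol="category_idx"),
    VectorAssembler(inputCols=["amount", "category_idx"], outputCol="features"),
    RandomForestClassifier(featuresCol="features", labelCol="is_fraud"),
])

model = pipeline.fit(train)
model.write().overwrite().save("/models/fraud_pipeline")  # deployable artifact

auc = BinaryClassificationEvaluator(labelCol="is_fraud").evaluate(model.transform(test))
print("holdout AUC:", auc)
```

Saving the fitted PipelineModel is what makes the "deploy it into a real system" step repeatable: the same artifact can be reloaded for batch scoring or wrapped behind a serving layer, and retrained periodically as feedback data accumulates.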
>> And that's actually, I don't think it's, I think there's a lot of work to be done. I don't think it's a 6-12 month thing, necessarily. I don't think that in 12 months, certainly, you know, everything's going to be perfectly recommended. I think there's areas of active research in the kind of academic fields of how to improve these things, but I think there's a big engineering challenge to bring in more disparate data sources, to better, to improve data quality, to improve these feedback loops, to try and get systems that are serving customer needs better. So improving recommendations, improving the quality of fraud detection systems. Everything from that to medical imaging and cancer detection. I think we've got a long way to go. >> Would it be fair to say that we've done a pretty good job with the traditional application lifecycle in terms of DevOps, but we now need the DevOps for the data scientists and their collaborators? >> Nick: Yeah, I think that's >> And where is IBM along that? >> Yeah, that's a good question, and I think you kind of hit the nail on the head, that the enterprise applied machine learning problem has moved from the kind of academic to the software engineering and actually, DevOps. Internally, someone mentioned the word train ops, so it's almost like, you know, the machine learning workflow and actually professionalizing and operationalizing that. So recently, IBM, for one, has announced Watson Data Platform and now Watson Machine Learning. And that really tries to address that problem. So really, the aim is to simplify and productionize these end-to-end machine-learning workflows. So that is the product push that IBM has at the moment. >> George: Okay, that's helpful. >> Yeah, and right. I was at the Watson Data Platform announcement, you call it the DataWorks. I think they changed the branding. >> Nick: Yeah. >> It looked like there were numerous components that IBM had in its portfolio that's now strung together, to create that end-to-end system that you're describing. Is that a fair characterization, or is it underplaying, I'm sure it is, the work that went into it? But help us maybe understand that better. >> Yeah, I should caveat it by saying we're fairly focused, very focused at STC on the Open Source side of things. So my work is predominantly within the Apache Spark project and I'm less involved in the data bank. >> Dave: So you didn't contribute specifically to Watson Data Platform? >> Not to the product line, so, you know, >> Yeah, so it's really not an appropriate question for you? >> I wouldn't want to kind of, >> Yeah. >> To talk too deeply about it >> Yeah, yeah, so that, >> Simply because I haven't been involved. >> Yeah, that's, I don't want to push you on that because it's not your wheelhouse, but then, help me understand how you will commercialize the activities that you do, or is that not necessarily the intent? >> So the intent with STC particularly is that we focus on Open Source, and a core part of that is that we, being within IBM, we have the opportunity to interface with other product groups and customer groups. >> George: Right. >> So while we're not directly focused on, let's say, the commercial aspect, we want to effectively leverage the ability to talk to real-world customers and find the use cases, talk to other product groups that are building this Watson Data Platform and all the product lines and the features, Data Science Experience; it's all built on top of Apache Spark and the platform. 
>> Dave: So your role is really to innovate? >> Exactly, yeah. >> Leverage Open Source and innovate. >> Both innovate and kind of improve, so improve performance, improve efficiency. When you are operating at the scale of a company such as IBM and other large players, your customers, and you as product teams and builders of products, will come into contact with all the kind of little issues and bugs >> Right. >> And performance >> Make it better. Problems, yeah. And that is the feedback that we take on board and we try and make it better, not just for IBM and their customers. Because it's an Apache product and everyone benefits. So that's really the idea. Take all the feedback and learnings from enterprise customers and product groups and centralize that in the Open Source contributions that we make. >> Great. Would it be, so would it be fair to say you're focusing on making the core Spark, Spark ML and Spark MLlib capabilities, sort of the machine learning libraries and the pipeline, more robust? >> Yes. >> And if that's the case, we know there needs to be improvements in its ability to serve predictions in real time, like high speed. We know there's a need to take the pipeline and sort of share it with other tools, perhaps. Or collaborate with other tool chains. >> Nick: Yeah. >> What are some of the things that the Enterprise customers are looking for along those lines? >> Yeah, that's a great question and very topical at the moment. So both from an Open Source community perspective and an Enterprise customer perspective, this is one of the, if not the key, I think, kind of missing pieces within the Spark machine-learning kind of community at the moment, and it's one of the things that comes up most often. So it is a missing piece, and we as a community need to work together and decide: is this something that we build within Spark and provide that functionality? Is it something where we try and adopt open standards that will benefit everybody and that provide a kind of one standardized format, or way of serving models? Or is it something where there's a few Open Source projects out there that might serve for this purpose, and do we get behind those? So I don't have the answer because this is ongoing work, but it's definitely one of the most critical kind of blockers, or, let's say, areas that need work at the moment. >> One quick question, then, along those lines. IBM, the first thing IBM contributed to the Spark community was Spark ML, which is, as I understand it, it was an ability to, I think, create an ensemble sort of set of models to do a better job or create a more, >> So are you referring to System ML, I think it is? >> System ML. >> System ML, yeah, yeah. >> What are they, I forgot. >> Yeah, so, so. >> Yeah, where does that fit? >> System ML started out as an IBM research project, and perhaps the simplest way to describe it is: as a kind of SQL optimizer is to take SQL queries and decide how to execute them in the most efficient way, System ML takes a kind of high-level mathematical language and compiles it down to an execution plan that runs in a distributed system. So in much the same way as your SQL operators allow this very flexible and high-level language, you don't have to worry about how things are done, you just tell the system what you want done. System ML aims to do that for mathematical and machine learning problems, so it's now an Apache project. It's been donated to Open Source and it's an incubating project under very active development. 
And that is really, there's a couple of different aspects to it, but that's the high-level goal. The underlying execution engine is Spark. It can run on Hadoop and it can run locally, but really, the main focus is to execute on Spark and then expose these kind of higher level APIs that are familiar to users of languages like R and Python, for example, to be able to write their algorithms and not necessarily worry about: how do I do large scale matrix operations on a cluster? System ML will compile that down and execute that for them. >> So really quickly, to follow up, what that means is it's a higher level way for people who aren't, sort of, cluster aware to write machine-learning algorithms that are cluster aware? >> Nick: Precisely, yeah. >> That's very, very valuable. When it works. >> When it works, yeah. So it does, again, with the caveat that I'm mostly focused on Spark and not so much the System ML side of things, so I'm definitely not an expert. I don't claim to be an expert in it. But it does, you know, it works at the moment. It works for a large class of machine-learning problems. It's very powerful, but again, it's a young project and there's always work to be done, so exactly the areas that I know that they're focusing on are these areas of usability, hardening up the APIs and making them easier to use and easier to access for users coming from the R and Python communities who, again, as you said, they're not necessarily experts on distributed systems and cluster awareness, but they know how to write a very complex machine-learning model in R, for example. And it's really trying to enable them with a set of API tools. So in terms of the underlying engine, there are, I don't know how many hundreds of thousands, millions of lines of code and years and years of research that's gone into that, so it's an extremely powerful set of tools. But yes, a lot of work still to be done there, and ongoing, to make it user ready and Enterprise ready, in the sense of making it easier for people to use it and adopt it and to put it into their systems and production. >> So I wonder if we can close, Nick, just a few questions on STC, so the Spark Technology Center is in Cape Town, is that a global expertise center? Is STC a virtual sort of IBM community, or? >> I'm the only member based in Cape Town, >> David: Okay. >> So I'm kind of fairly lucky from that perspective, to be able to kind of live at home. The rest of the team is mostly in San Francisco, so there's an office there that's co-located with the Watson West office >> Yeah. >> And Watson teams >> Sure. >> That are based there in Howard Street, I think it is. >> Dave: How often do you get there? >> I'll be there next week. >> Okay. >> So I typically, sort of two or three times a year, I try and get across there >> Right. And interface with the team, >> So, >> But we are a fairly, I mean, IBM is obviously a global company, and I've been surprised actually, pleasantly surprised, there are team members pretty much everywhere. Our team has a few scattered around, including me, but in general, when we interface with various teams, they pop up in all kinds of geographical locations, and I think it's great, you know, a huge diversity of people and locations, so. >> Anything, I mean, these are early days here, early day one, but anything you saw in the morning keynotes or things you hope to learn here? Anything that's excited you so far? 
I caught a couple of the morning keynotes, but had to dash out to kind of prepare for, I'm doing a talk later, actually, on feature hashing for scalable machine learning, so that's at 12:20, please come and see it. >> Dave: A breakout session, it's at what, 12:20? >> 20 past 12:00, yeah. >> Okay. >> So in room 302, I think, >> Okay. >> I'll be talking about that, so I needed to prepare, but I think some of the key exciting things that I have seen that I would like to go and take a look at are kind of related to the deep learning on Spark. I think that's been a hot topic recently, and one of the areas, again, where Spark, perhaps, hasn't been the strongest contender, let's say, but there's some really interesting work coming out of Intel, it looks like. >> They're talking here on The Cube in a couple hours. >> Yeah. >> Yeah. >> I'd really like to see their work. >> Yeah. >> And that sounds very exciting, so yeah. I think every time I come to a Spark summit, there are always new projects from the community, various companies, some of them big, some of them startups, that are pushing the envelope, whether it's research projects in machine learning, whether it's adding deep learning libraries, whether it's improving performance for kind of commodity clusters or for single, very powerful single nodes, there's always people pushing the envelope, and that's what's great about being involved in an Open Source community project and being part of those communities, so yeah. That's one of the talks that I would like to go and see. And I think I, unfortunately, had to miss some of the Netflix talks on their recommendation pipeline. That's always interesting to see. >> Dave: Right. >> But I'll have to check them on the video. (laughs) >> Well, there's always another project in Open Source land. Nick, thanks very much for coming on The Cube, and good luck. >> Cool, thanks very much. Thanks for having me. >> Have a good trip, stay warm, hang in there. (Nick laughs) Alright, keep it right there. My buddy George and I will be back with our next guest. We're live. This is The Cube from Spark Summit East, #sparksummit. We'll be right back. (upbeat music) (gentle music)
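Since feature hashing only gets a passing mention, a short note on what it is: rather than building and storing a dictionary that maps every feature to a column index, you hash each feature into a fixed number of buckets, which bounds memory and lets you featurize previously unseen terms, at the cost of occasional collisions. Spark ships this as HashingTF; the toy data in the sketch below is made up.

```python
# Feature hashing (the "hashing trick"): map raw terms into a fixed-size
# vector with a hash function instead of a precomputed vocabulary.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.appName("feature-hashing").getOrCreate()

docs = spark.createDataFrame(
    [(0, "user clicked red shoes"), (1, "user clicked blue jacket")],
    ["id", "text"],
)

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
hashed = HashingTF(inputCol="words", outputCol="features", numFeatures=1024) \
    .transform(tokens)

hashed.select("id", "features").show(truncate=False)
```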

Published Date : Feb 8 2017

Rob Thomas, IBM | BigDataNYC 2016


 

>> Narrator: Live from New York, it's the Cube. Covering Big Data New York City 2016. Brought to you by headline sponsors: Cisco, IBM, Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and Jeff Frick. >> Welcome back to New York City, everybody. This is the Cube, the worldwide leader in live tech coverage. Rob Thomas is here, he's the GM of products for IBM Analytics. Rob, always good to see you, man. >> Yeah, Dave, great to see you. Jeff, great to see you as well. >> You too, Rob. World traveller. >> Been all over the place, but good to be here, back in New York, close to home for one day. (laughs) >> Yeah, at least a day. So the whole community is abuzz with this article that hit. You wrote it last week. It hit NewCo Shift, I guess just today or yesterday: The End of Tech Companies. >> Rob: Yes. >> Alright, and you've got some really interesting charts in there, you've got some ugly charts. You've got HDP, you've got, let's see... >> Rob: You've got Imperva. >> Teradata, Imperva. >> Rob: Yes. >> Not looking pretty. We talked about this last year, just about a year ago. We said, the nose of the plane is up. >> Yep. >> Dave: But the planes are losing altitude. >> Yep. >> Dave: And when the funding dries up, look out. Interesting, some companies still are getting funding, so this makes rip currents. But in general, it's not pretty for pure play Hadoop companies. >> Right. >> Dave: Something that you guys predicted, a long time ago, I guess. >> So I think there's a macro trend here, and this is really, I did a couple months of research, and this is what went into that end of tech companies post. And it's interesting, so you look at it in the stock market today: the five highest valued companies are all tech companies, what we would call. And that's not a coincidence. The reality is, I think we're getting past the phase of there being tech companies, and tech is becoming the default, and either you're going to be a tech company, or you're going to be extinct. I think that's the MO that every company has to operate with, whether you're a retailer, or in healthcare, or insurance, in banking, it doesn't matter. If you don't become a tech company, you're not going to be a company. That's what I was getting at. And so some of the pressures I was highlighting was, I think what's played out in enterprise software is what will start to play out in other traditional industries over the next five years. >> Well, you know, it's interesting, we talk about these things years and years and years in advance and people just kind of ignore it. Like Benioff even said, more SaaS companies are going to come out of non-tech companies than tech companies, OK. We've been talking for years about how the practitioners of big data are actually going to make more money than the big data vendors. Peter Goldmacher was actually the first, that was one of his predictions that came true. Many of them didn't. (laughs) You know, Peter's a good friend-- >> Rob: Peter's a good friend of mine as well, so I always like pointing out what he says that's wrong. >> But, but-- >> Thinking of you, Peter. >> But we sort of ignored that, and now it's all coming to fruition, right? >> Right. >> Your article talks about, and it's a long read, but it's not too long to read, so please read it. But it talks about how basically every industry is, of course, getting disrupted, we know that, but every company is a tech company. >> Right. >> Or else. >> Right. 
And, you know, what I was, so John Battelle called me last week, he said hey, I want to run this, he said, because I think it's going to hit a nerve with people, and we were talking about why is that? Is it because of the election season, or whatever. People are concerned about the macro view of what's happening in the economy. And I think this kind of strikes at the nerve that says, one is you have to make this transition, and then I go into the article with some specific things that I think every company has to be doing to make this transition. It starts with, you've got to rethink your capital structure, because the investments you made, the distribution model that you had that got you here, is not going to be sufficient for the future. You have to rethink the tools that you're utilizing and the workforce, because you're going to have to adopt a new way to work. And that starts at the top, by the way. And so I go through a couple different suggestions of what I think companies should look at to make this transition, and I guess what scares me is, I visit companies all over the world, I see very few companies making these kinds of moves. 'Cause it's a major shake-up to culture, it's a major shake-up to how they run their business, and, you know, I use the Warren Buffett quote, "When the tide goes out, you can see who's been swimming naked." The tide may go out pretty soon here, you know, it'll be in the next five years, and I think you're going to see a lot of companies that thought they could never be threatened by tech, if you will, go the wrong way because they're not making those moves now. >> Well, let's stay on cognitive, now that we're on this subject, because you know, you're having a pretty frank conversation here. A lot of times when you talk to people inside of IBM about cognitive and the impact it's going to have, they don't want to talk about that. But it's real. Machines have always replaced humans, and now we're seeing that replacement of cognitive functions, so that doesn't mean value can't get created. In fact, way more value is going to be created than we can even imagine, but you have to change the way in which you do things in order to take advantage of that. >> Right, right. One thing I say in the article is I think we're on the cusp of the great reskilling, which is, you take all the traditional IT jobs, I think over the next decade half those jobs probably go away, but they're replaced by a new set of capabilities around data science and machine learning, and advanced analytics, things that are leveraging cognitive capabilities, but doing it with human focus as well. And so, you're going to see a big shift in skills. This is why we're partnering with companies like Galvanize, I saw Jim Deters when I was walking in. Galvanize is at the forefront of helping companies do that reskilling. We want to help them do that reskilling as well, and we're going to provide them a platform that automates the process of doing a lot of these analytics. That's what the new project Dataworks, the new Watson project, is all about: how we begin to automate what have traditionally been very cumbersome and difficult problems to solve in an organization, but we're helping clients that haven't done that reskilling yet, we're helping them go ahead and get an advantage through technology. >> Rob, I want to follow up too on that concept on the capital markets and how this stuff is measured, because as you pointed out in your article, valuations of the top companies are huge. 
That's not a multiple of data right now. We haven't really figured that out, and it's something that we're looking at, the Wikibon team is how do you value the data from what used to be liability 'cause you had to put it on machines and pay for it. Now it's really the driver, there's some multiple of data value that's driving those top-line valuations that you point out in that article. >> You know it's interesting, and nobody has really figured that out, 'cause you don't see it showing up, at least I don't think, in any stock prices, maybe CoStar would be one example where it probably has, they've got a lot of data around commercial real estate, that one sticks out to me, but I think about in the current era that we're in there's three ways to drive competitive advantage: one is economies of scale, low-cost manufacturing; another is through network effects, you know, a number of social media companies have done that well; but third is, machine learning on a large corpus of data is a competitive advantage. If you have the right data assets and you can get better answers, your models will get smarter over time, how's anybody going to catch up with you? They're not going to. So I think we're probably not too far from what you say, Jeff, which is companies starting to be looked at as a value of their data assets, and maybe data should be on the balance sheet. >> Well that's what I'm saying, eventually does it move to the balance sheet as something that you need to account for? Because clearly there's something in the Apple number, in the Alphabet number, in the Microsoft number, that's more than regular. >> Exactly, it's not just about, it's not just about the distribution model, you know, large companies for a long time, certainly in tech, we had a huge advantage because of distribution, our ability to get to other countries face to face, but as the world has moved to the Internet and digital sales and try/buy, it's changed that. Distribution can still be an advantage, but is no longer the advantage, and so companies are trying to figure out what are the next set of assets? It used to be my distribution model, now maybe it's my data, or perhaps it's the insight that I develop from the data. That's really changed. >> Then, in the early days of the sort of big data meme taking off, people would ask, OK, how can I monetize the data? As opposed to what I think they're really asking is, how could I use data to support making money? >> Rob: Right. Right. >> And that's something a lot of people I don't think really understood, and it's starting to come into focus now. And then, once you figure that out, you can figure out what data sources, and how to get quality in that data and enrich that data and trust that data, right? Is that sort of a logical sequence that companies are now going through? >> It's an interesting observation, because you think about it, the companies that were early on in purely monetizing data, companies like Dun & Bradstreet come to mind, Nielsen come to mind, they're not the super-fast-growing companies today. So it's kind of like, there was an era where data monetization was a viable strategy, and there's still some of that now, but now it's more about, how do you turn your data assets into a new business model? There was actually a great, new Clay Christensen article, it was published I think last week, talking about companies need to develop new business models. 
We're at the time where everybody's kind of developed into, we sell hardware, we sell software, we sell services, or whatever we sell, and his point was now is the time to develop a new business model, and those will, in my view, largely be formed on the basis of data, so not necessarily just monetizing the data, to your point, Dave, but on the basis of that data. >> I love the music industry, because they're always kind of out at the front of this evolving business model for digital assets in this new world, and it keeps jumping, right? It jumped, it was free, then people went ahead and bought stuff on iTunes, now Spotify has flipped it over to a subscription model, and the innovation of change in the business model, not necessarily the products that much, it's very different. The other thing that's interesting is just that digital assets don't have scarcity, right? >> Rob: Right. >> There's scarcity around the data, but not around the assets, per se. So it's a very different way of thinking about distribution and kind of holding back, how do you integrate with other people's data? It's not, not the same. >> So think about, that's an interesting example, because think about the music, there's a great documentary on Netflix about Tower Records, and how Tower Records went through the big spike and now is kind of, obviously no longer really around. Same thing goes for the Blockbusters of the world. So they got disrupted by digital, because their advantage was a distribution channel that was in the physical world, and that's kind of my assertion in that post about the end of tech companies is that every company is facing that. They may not know it yet, but if you're in agriculture, and your traditional dealer network is how you got to market, whether you know it or not, that is about to be disrupted. I don't know exactly what form that will take, but it's going to be different. And so I think every company to your point on, you know, you look at the music industry, kind of use it as a map, that's an interesting way to look at a lot of industries in terms of what could play out in the next five years. >> It's interesting that you say though in all your travels that people aren't, I would think they would be clamoring, oh my gosh, I know it's coming, what do I do, 'cause I know it's coming from an angle that I'm not aware of as opposed to, like you say a lot of people don't see it coming. You know, it's not my industry. Not going to happen to me. >> You know it's funny, I think, I hear two, one perception I hear is, well, we're not a tech company so we don't have to worry about that, which is totally flawed. Two is, I hear companies that, I'd say they use the right platitudes: "We need to be digital." OK, that's great to say, but are you actually changing your business model to get there? Maybe not. So I think people are starting to wake up to this, but it's still very much in its infancy, and some people are going to be left behind. >> So the tooling and the new way to work are sort of intuitive. What about capital structure? What's the implication to capital structures, how do you see that changing? >> So it's a few things. One is, you have to relook at where you're investing capital today. The majority of companies are still investing in what got them to where they are versus where they need to be. 
So you need to make a very conscious shift, and I use the old McKinsey model of horizon one, two and three, but I insert the idea that there should be a horizon zero, where you really think about what are you really going to start to just outsource, or just altogether stop doing, because you have to aggressively shift your investments to horizon two, horizon three, you've really got to start making bets on the future, so that's one is basically a capital shift. Two is, to attract this new workforce. When I talked about the great reskilling, people want to come to work for different reasons now. They want to come to work, you know, to work in the right kind of office in the right location, that's going to require investment. They want a new comp structure, they're no longer just excited by a high base salary like, you know, they want participation in upside, even if you're a mature company that's been around for 50 years, are you providing your employees meaningful upside in terms of bonus or stock? Most companies say, you know, we've always reserved that stuff for executives. That's not, there's too many other companies that are providing that as an alternative today, so you have to rethink your capital structure in that way. So it's how you spend your money, but also, you know, as you look at the balance sheet, how you actually are, you know, I'd say spreading money around the company, and I think that changes as well. >> So how does this all translate into how IBM behaves, from a product standpoint? >> We have changed a lot of things in IBM. Obviously we've made a huge move towards what we think is the future, around artificial intelligence and machine learning with everything that we've done around the Watson platform. We've made huge capital investments in our cloud capability all over the world, because that is an arms race right now. We've made a huge change in how we're hiring, we're rebuilding offices, so we put an office in Cambridge, downtown Boston. Put an office here in New York downtown. We're opening the office in San Francisco very soon. >> Jeff: The Sparks Center downtown. >> Yeah. So we've kind of come to urban areas to attract this new type of skill 'cause it's really important to us. So we've done it in a lot of different ways. >> Excellent. And then tonight we're going to hear more about that, right? >> Rob: Yes. >> You guys have a big announcement tonight? >> Rob: Big announcement tonight. >> Ritica was on, she showed us a little bit about what's coming, but what can you tell us about what we can expect tonight? >> Our focus is on building the first enterprise platform for data, which is steeped in artificial intelligence. First time you've seen anything like it. You think about it, the platform business model has taken off in some sectors. You can see it in social media, Facebook is very much a platform. You can see it in entertainment, Netflix is very much a platform. There hasn't really been a platform for enterprise data and IP. That's what we're going to be delivering as part of this new Watson project, which is Dataworks, and we think it'll be very interesting. Got a great ecosystem of partners that will be with us at the event tonight, that're bringing their IP and their data to be part of the platform. It will be a unique experience. 
>> What do you, I know you can't talk specifics on M&A, but just in general, in concept, in terms of all the funding, we talked last year at this event how the whole space was sort of overfunded, overcrowded, you know, and something's got to give. Do you feel like there's been, given the money that went in, is there enough innovation coming out of the Hadoop big data ecosystem? Or is a lot of that money just going to go poof? >> Well, you know, we're in an interesting time in capital markets, right? When you loan money and get back less than you loan, because interest rates are negative, it's almost, there's no bad place to put money. (laughing) Like you can't do worse than that. But I think, you know the Hadoop ecosystem, I think it's played out about like we envisioned, which is it's becoming cheap storage. And I do see a lot of innovation happening around that, that's why we put so much into Spark. We're now the number one contributor around machine learning in the Spark project, which we're really proud of. >> Number one. >> Yes, in terms of contributions over the last year. Which has been tremendous. And in terms of companies in the ecos-- look, there's been a lot of money raised, which means people have runway. I think what you'll see is a lot of people that try stuff, it doesn't work out, they'll try something else. Look, there's still a lot of great innovation happening, and as much as it's the easiest time to start a company in terms of the cost of starting a company, I think it's probably one of the hardest times in terms of getting time and attention and scale, and so you've got to be patient and give these bets some time to play out. >> So you're still sanguine on the future of big data? Good. When Rob turns negative, then I'm concerned. >> It's definitely, we know the endpoint is going to be massive data environments in the cloud, instrumented, with automated analytics and machine learning. That's the future, Watson's got a great headstart, so we're proud of that. >> Well, you've made bets there. You've also, I mean, IBM, obviously great services company, for years services led. You're beginning to automate a lot of those services, package a lot of those services into industry-specific software and other SaaS products. Is that the future for IBM? >> It is. I mean, I think you need it two ways. One is, you need domain solutions, verticalized, that are solving a specific problem. But underneath that you need a general-purpose platform, which is what we're really focused on around Dataworks, is providing that. But when it comes to engaging a user, if you're not engaging what I would call a horizontal user, a data scientist or a data engineer or developer, then you're engaging a line-of-business person who's going to want something in their lingua franca, whether that's wealth management and banking, or payer underwriting or claims processing in healthcare, they're going to want it in that language. That's why we've had the solutions focus that we have. >> And they're going to want that data science expertise to be operationalized into the products. >> Rob: Yes. >> It was interesting, we had Jim on and Galvanize and what they're doing. Sharp partnership, Rob, you guys have, I think made the right bets here, and instead of chasing a lot of the shiny new toys, you've sort of thought ahead, so congratulations on that. 
>> Well, thanks, it's still early days, we're still playing out all the bets, but yeah, we've had a good run here, and look forward to the next phase here with Dataworks. >> Alright, Rob Thomas, thanks very much for coming on the Cube. >> Thanks guys, nice to see you. >> Jeff: Appreciate your time today, Rob. >> Alright, keep it right there, everybody. We'll be back with our next guest right after this. This is the Cube, we're live from New York City, right back. (electronic music)

Published Date : Sep 28 2016


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
IBM | ORGANIZATION | 0.99+
Dave | PERSON | 0.99+
Nvidia | ORGANIZATION | 0.99+
Cisco | ORGANIZATION | 0.99+
Jeff | PERSON | 0.99+
Peter | PERSON | 0.99+
Rob Thomas | PERSON | 0.99+
John Battelle | PERSON | 0.99+
Dave Vellante | PERSON | 0.99+
Peter Goldmacher | PERSON | 0.99+
Rob | PERSON | 0.99+
San Francisco | LOCATION | 0.99+
Jeff Frick | PERSON | 0.99+
New York City | LOCATION | 0.99+
CoStar | ORGANIZATION | 0.99+
last week | DATE | 0.99+
yesterday | DATE | 0.99+
Cambridge | LOCATION | 0.99+
Apple | ORGANIZATION | 0.99+
New York | LOCATION | 0.99+
Benioff | PERSON | 0.99+
New York City | LOCATION | 0.99+
tonight | DATE | 0.99+
Warren Buffett | PERSON | 0.99+
Microsoft | ORGANIZATION | 0.99+
Galvanize | ORGANIZATION | 0.99+
first | QUANTITY | 0.99+
two | QUANTITY | 0.99+
one | QUANTITY | 0.99+
last year | DATE | 0.99+
Jim Deters | PERSON | 0.99+
today | DATE | 0.99+
last year | DATE | 0.99+
two ways | QUANTITY | 0.99+
Clay Christensen | PERSON | 0.99+
three ways | QUANTITY | 0.99+
Alphabet | ORGANIZATION | 0.99+
One | QUANTITY | 0.99+
Two | QUANTITY | 0.99+
third | QUANTITY | 0.99+
iTunes | TITLE | 0.99+
one day | QUANTITY | 0.99+
Jim | PERSON | 0.99+
Spotify | ORGANIZATION | 0.99+
Nielsen | ORGANIZATION | 0.99+
Tower Records | ORGANIZATION | 0.98+
IBM Analytics | ORGANIZATION | 0.98+
Netflix | ORGANIZATION | 0.98+
Wikibon | ORGANIZATION | 0.98+
one example | QUANTITY | 0.98+
McKinsey | ORGANIZATION | 0.98+
Facebook | ORGANIZATION | 0.98+
First time | QUANTITY | 0.98+

Neil Mendelson, Oracle - On the Ground - #theCUBE


 

>> Announcer: theCUBE presents "On the Ground." (light techno music) >> Hello there and welcome to SiliconANGLE's theCUBE, On the Ground, here at Oracle's Headquarters. I'm John Furrier, the host of theCUBE, and I'm here with Neil Mendelson, the Vice President of Product Management for the Big Data Team at Oracle. Welcome to On the Ground, thanks for having us here, at Headquarters. >> Good to be here. >> So big data, obviously a big focus of Oracle OpenWorld, which is right around the corner, but in general, the breadth of big data products from Oracle has been around for a while. What's your take on this? Because Oracle is doing very well with this new Cloud story. My interview with Mark Hurd, 100% of the code has been cloudified. Big data now is a big part of the Cloud dynamic. What are some of the things that you're seeing out in the marketplace around big data, and where does Oracle fit? >> Well, you know, when this whole big data thing started years ago, I mean Hadoop just hit its 10th anniversary, right? Everybody was talking about throwing everything out that they had and there was no reason for SQL anymore and you're just going to throw a bunch of stuff together yourself and put it together and off you go, right? And now I think people have realized that to get the real value out of these new technologies, it's not a question of just the new technologies alone, but how do you integrate those with your existing estates. >> So Oracle obviously is a big database business, you know, I mean Tom Curry, with "Hey the database, take your swim lane", but what's interesting is with Hadoop and some of these other ecosystems, what customers are looking for is to not just use Oracle database but to use whatever they might see as a feature of some use case. >> Neil: Absolutely. >> Hadoop for batch. So you guys have been connecting these systems, so could you just quickly explain for a minute how you guys look at this choice factor from a customer standpoint because there's a role for Hadoop, but Hadoop isn't going to take over the whole world as we see in the ecosystem. What's your role, vis-a-vis the database choice? >> Yeah, so we very much believe when Oracle started, it was all about Database, and it was all about SQL. And we believe now that the new normal is really one that includes both Hadoop, NoSQL, and Relational, right? SQL is of course still a factor, but so is the ability to interface via REST interfaces and scripting languages. So for us, it's really a big tent, and we've been taking what we had done previously in Database and really extending that to Data Management over Hadoop and NoSQL. >> We had a great chat at Oracle OpenWorld last year, and you talked about your history at Oracle before you did your run with start-ups. You've seen this movie go on early days with data warehousing, so I got to ask you, big data's not new to Oracle, obviously the database business has been thriving and changing with the Cloud around the corner and certainly here on the doorstep but could you explain Oracle's Database, I mean, big data product offerings? >> Sure. >> What was the first product? Take us through the lineage of where it is, because you guys have products. >> We do. >> And a slew of stuff is coming, I can imagine, I'm sure you can't share much about that but talk about the lineage right now. >> Okay, so we started about three years ago on the Hadoop side by making an appliance made for Hadoop, and then later followed on with Spark. 
And that appliance has been doing well on the marketplace for a number of years and we've obviously continued to enhance that. We then took what we perfected on premises and we moved that up to the Cloud, so we have a big data cloud service for customers that offers them high-performance access to Hadoop and Spark, without necessarily the need to actually manage security and all the things with it. At OpenWorld, we'll be making a series of announcements, we'll be creating yet another big data Cloud service. This one will be fully managed, fully elastic for customers who only want to take advantage of a Hadoop or Spark service, as an example, and don't want to deal with the ability to specifically tweak the environment, right? We also announced a little while ago, our family of Cloud Machines, right? So you'll see, the first Cloud Machine is one that provides Oracle IaaS and PaaS services and then we'll add to the family. >> John: That's shipping already, though. >> That's shipping already, right? And then we'll add to the family an Exadata Cloud Machine and a big data Cloud Machine, and the Cloud Machines are really kind of a cool concept. They're cool because for a lot of customers, from a regulatory point of view or otherwise, they're just not ready for the public Cloud, but everybody wants to take advantage of what the Cloud provides. So how do you do that behind your firewall, right? How do you provide IT as a service? So what Oracle has done essentially is to package up its Cloud services and be able to deliver that to customers behind the firewall, and they get the exact same technology that they have on the public Cloud, they build to one architecture and then deploy it wherever they choose. They get the advantages of the Cloud, it's a subscription service, right, but they can adhere to whatever data sovereignty issues they might have. >> So let's get to that regulatory dynamic in a second but I just want to back up, so Big Data Appliance, B-D-A you guys call it, Big Data Appliance, that's been out. Big data service... >> Neil: Cloud service started about a year ago. >> Done a year, that's out there. Those are the links that connect the Appliance that's on-prem with the Cloud. >> Neil: Right. >> And then now you have the cloud machine series of enhancements coming at Oracle OpenWorld. >> Right, as well as a fully elastic, fully managed cloud service that will add to the mix as well. >> Okay, so let's get down, so that's going to bring us fully cloud-enabled. >> Yep. >> Cloud on-premise, >> Both. >> All that kind of dynamic flexibility and an option for cloud configurations and depressuring. Okay, back to the regulatory thing. So what's the big deal about that, because you mentioned that most companies we talk to love the cloud, they love the economics, but there's a lot of FUD and fear internally amongst their own team about getting sued, losing data, you know, certain industries that they might have to play in; is that a fact and can you explain that for someone and what's important about that. >> Yeah I mean, for some customers it's a real concern, right, and the world is dynamically shifting, I mean, look at what happened a few months ago with you know the Brexit, right, I mean all of a sudden it was OK to have, you know, the data as long as it was in the EU, well the EU is now shifting, so where does the data go, right? 
So from a regulatory point of view we haven't fully settled in terms of where customer data can be held, exactly how it's treated, and you know those things are evolving. So for a number of companies, they want the advantages of the cloud but they don't necessarily want it on the public cloud, and that's why we're offering these new cloud machines, because they can essentially have their cake and eat it too. >> So interesting, the dynamic then is that this whole regulatory thing is a moving train. >> Right. >> Relative to the whole global landscape. >> Right. >> Who knows what's going to happen with China and other things, right? >> Right, and I think that's what's really terrific is that our history is, of course, we're a company that's been around for a while, so we started on premises and we moved up to the cloud, and our customers are ones that are going to have, kind of, this hybrid kind of a system, right. Other companies started much later and they're cloud only, and you know while that's great for companies that want the public cloud, what do you do if you're in a regulatory environment that isn't ready to go to the public cloud? Now you have to have two architectures, one for on-premises and one for cloud, and then how do you deal with a moving landscape where a year from now things that are on premises can move to the cloud and other things that are in the cloud may have to move back on premises, right? How do you deal with that dynamic going forward and not get stuck? >> So, is it fair to say that Oracle is a big data player in the cloud and on-premise? >> Absolutely, and not just for data management. I think that you know while we started at that core, that's our heritage, we've so much built out our portfolio, we have big data products in the data integration space, in the machine learning space, we have big data products that connect up with our IoT strategy, with data visualization, we've really blossomed as the marketplace has matured, bringing additional technology for customers to utilize. >> Okay, so let's get down to the reality and get into the weeds with customer deployments. How do you guys compare vis-a-vis the competition now you got the on-prem with the BDA, Big Data Appliance, with the cloud service, cloud machines to create some provisioning, flexibility on whatever architectures the customers may choose. >> Yeah. For whatever reason that they would have. >> Okay. How does that compare to the competition? >> On the on-premises side, if we start there, there was a recent Forrester Wave that looked at various Hadoop appliances and we took the number one category or the number one position across all the three categories that they looked at, they looked at the strategy, they looked at the market presence and they looked at the capability of what we offered, and we ended up number one in that space. On the cloud side, of course, we're maturing in terms of that offering as well, but you know we're really the only company out there that can offer the same architecture both on cloud and on-premises, where you don't necessarily have to go all in on one or the other, and for many companies that's exactly what they, you know, what they need, right. They can't necessarily go all in one way or the other. >> So I got to ask you kind of a, put your Oracle historian, tech historian hat on as well as your Oracle executive hat on and talk about some of the technologies that have come and gone over the years and how does that relate to some of the things that are hyped up now? 
I mean certainly Hadoop, what's supposed to be this new industry, it's going to disrupt the database and Oracle's going to be put out of business and this is how people are going to store stuff, MapReduce. Now people are saying, why even have Hadoop in the cloud when you've got object store. So, things come and go, I'm not saying Hadoop is going to come and go, but it's good for batch, but so, what are your comments on it? Can you point to industry technology and say okay, that's going to be a feature of something else, that's a real deal? What are some of the things that you look at that you can say... >> So you know we're seeing exactly as you described, a few years ago you go to a conference and it was all about MapReduce. Right now, a seminar in MapReduce, nobody goes, right. Everybody's going to Spark, right, and there's already things that potentially will replace Spark, things like Flink, and we're going to see that continual change, and a lot of what we focused on is to be able to provide some level of abstraction between the customer's architecture and these moving technologies. So, I'll give an example. Our data integration technology, historically that was, you know, you're able to visually describe a set of transformations and then we generated code in SQL or PL/SQL. Now we generate code, not only in SQL and PL/SQL, but we generate that same code in Spark. If tomorrow Spark gets replaced with Flink or something else, we simply replace the code generator underneath and all of what the customers built gets preserved and moved into the future. I think a lot of people are now becoming concerned that as they take advantage of open source really at the very low levels, they have the potential to essentially get stuck in a technology which has essentially become obsoleted, right? >> Yeah. >> As any new technology evolves we move from people who just code, right, with all the lower level stuff, up to a set of tools, and you know we talk to companies now that have huge amounts of now legacy MapReduce code, right, you think only a few years ago... >> It's kind of like COBOL. >> Neil: Yeah. (John laughing) >> Neil: So... >> It's going to be around but not really pervasive. >> Right. So how can you take advantage of these technologies, without necessarily having to get stuck to any one of them. >> So, I'm going to ask you the philosophical question, so Oracle database business has been the star over the years since the founding, but even now it seems to me that the role of the database becomes even more important as you connect subsystems, call it Hadoop, Spark, whatever technology's going to evolve, as a feature of an integrated system, if you will, software-based and/or engineered systems coming together. So that seems to be obvious that you can connect in an open way and give customers choice, but that's kind of different from the old Oracle. I have a database, everything runs on Oracle, Oracle on Oracle is great, certainly it runs well, but what's the philosophy internally? Obviously the database team's sitting there, it must be like, wow, big data is an opportunity for Oracle. Or do they go, no, the database business is different. How do you guys talk about that internally and then how do customers take away from that dynamic between the database crown jewel and the opening it up and being more big data driven? 
>> I think it's ironic because, externally, when you talk to people, they just assume that we're going to be like "Oh my god this is a threat" and we're going to just double down on what we're doing on the database side and we're just going to hunker down and, I don't know, try to hide, right? But that's exactly the opposite of what we've really been doing internally. We really have embraced these technologies of Hadoop and Spark and NoSQL, and we're essentially seeing data management evolve; that is the new normal. So rather than looking at, not only what we might have said, we did say when we introduced Oracle in the data warehousing market back in '95, we said "Put all your data in the Oracle database." We're not saying that anymore, because there are reasons to put data in Hadoop, there are reasons to put data in graph databases, in NoSQL databases; we need to be able to provide those choices while still integrating that data management platform as one integrated entity. >> Would you say then it was fair to say that, from a customer standpoint, having that open approach gives faster access to different data types in real time? >> Absolutely. >> John: Then isn't that the core value proposition of big data. >> Yeah, again when the new Hadoop craze first started it was all about unload and put everything in this one store, and for a lot of companies today, they still are faced with this conundrum which says, in order to analyze data, I have to put it all in one place. So that means that you have to move your operational data into one place, you have to move your data warehousing stuff into one place, but then at the same time you mentioned real time. How do you get into the business of moving data from Place A to Place B on a constant basis while still being able to offer real-time access and real-time analytics? The answer is you can't. >> And the value of the data, the data capital, as we've been talking about, an IoT piece of data from a turbine could have really big relevance to the system of record in another database, and that has to be exposed and integrated quickly to surface some insight about the quality of that... >> It's the thing that gives you context, right. Today what's going on is that we are getting access to all these rich data sources and rich data types that we didn't have before, whether that's text information or information coming off sensors and the like, and the relevance of that information is, when we combine it together with the corporate information, the stuff that we have in our existing systems, to really reap the true benefit. How do you know, when you get a log file, the log file doesn't have anything about the customer in it, the log file just has a number associating itself to a customer. You have to tie that together with the customer profile data, which might not exist in Hadoop, maybe it's in a NoSQL store. >> And certainly open source is booming with Oracle. You guys are actively involved in all the different open source ecosystems. >> Sure, we drive a number of open source projects, whether it's MySQL or Java or, the list goes on and on. Many people don't think of, you know, they're not even aware that Oracle's behind MySQL. As an example, right, I mean, I remember talking to my son recently, he says, "Do you know anything about MySQL" and I'm like, well, a little bit. And then as we're talking and we're looking through his code, finally I say, "You know this is an Oracle product," and he's like, no it's not. 
You know, 'cause... >> It's too cool to be Oracle. >> That's right. That's not a bad thing, right. >> Yeah. I mean the reality of it is, is that you know we've invested a whole lot of time and energy in these technologies and we're really looking to commercialize them, to mainstream them, to make them less scary for more people to be able to get value from. >> Well, your son's example's a great illustration of the new Oracle that's out there now, this whole new philosophy. Finally, to give you the last word real quick, for folks watching, what's one thing you'd want to share with them that they may or may not know about Oracle and its big data strategy? >> Give us a look. Right, I mean I think that when you think of big data and you think of these new technologies, you may not think of Oracle, right. You may think of the new companies that you're more familiar with, and the like. The reality of it is that Oracle has an extraordinarily rich portfolio of technology and services on the cloud, as well as cloud machines. So give us a look, I think you'll be surprised at how open we are, how much of the open source technology we've embedded in our products and how fast we're essentially evolving into what is the new normal. >> Neil, thanks so much for spending the time with me here On the Ground. I'm John Furrier, you're watching exclusive "On the Ground" coverage here at Oracle Headquarters. Thanks for watching. >> Neil: Thank you.
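The data integration point Neil makes in this interview, declare the transformation once and let interchangeable code generators target SQL or PL/SQL today and Spark (or Flink) tomorrow, can be illustrated with a small sketch. The code below is a generic illustration of that pattern, not Oracle Data Integrator's actual interfaces; the names TransformSpec, to_sql, and to_pyspark are hypothetical.

```python
# Minimal sketch of "describe the transformation once, swap the code generator
# underneath": hypothetical names, not Oracle Data Integrator's real API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TransformSpec:
    """Engine-neutral description of a simple filter-and-project transformation."""
    source: str                              # input table / dataset
    target: str                              # output table / dataset
    columns: List[str] = field(default_factory=list)
    predicate: str = ""                      # filter expression in a neutral syntax


def to_sql(spec: TransformSpec) -> str:
    """Generate a SQL statement from the spec (the 'SQL / PL/SQL' backend)."""
    cols = ", ".join(spec.columns) or "*"
    where = f" WHERE {spec.predicate}" if spec.predicate else ""
    return f"CREATE TABLE {spec.target} AS SELECT {cols} FROM {spec.source}{where};"


def to_pyspark(spec: TransformSpec) -> str:
    """Generate equivalent PySpark code from the same spec (the 'Spark' backend)."""
    cols = ", ".join(f'"{c}"' for c in spec.columns)
    lines = [
        f'df = spark.table("{spec.source}")',
        f'df = df.filter("{spec.predicate}")' if spec.predicate else "",
        f"df = df.select({cols})" if spec.columns else "",
        f'df.write.saveAsTable("{spec.target}")',
    ]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    spec = TransformSpec(
        source="web_logs",
        target="slow_requests",
        columns=["customer_id", "event_ts", "url"],
        predicate="response_ms > 500",
    )
    # Same spec, two engines: replacing an engine means writing a new generator,
    # not rewriting what was built on top of the spec.
    print(to_sql(spec))
    print()
    print(to_pyspark(spec))
```

Running this prints a SQL statement and an equivalent PySpark snippet from the same specification; if the execution engine changes, only the generator function is swapped, which is the kind of abstraction Neil describes for protecting customers from churn in the underlying open source stack.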

Published Date : Sep 6 2016


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Neil Mendelson | PERSON | 0.99+
Oracle | ORGANIZATION | 0.99+
John Furrier | PERSON | 0.99+
Mark Hurd | PERSON | 0.99+
John | PERSON | 0.99+
Neil | PERSON | 0.99+
Tom Curry | PERSON | 0.99+
john Furrier | PERSON | 0.99+
MySQL | TITLE | 0.99+
Java | TITLE | 0.99+
last year | DATE | 0.99+
100% | QUANTITY | 0.99+
EU | ORGANIZATION | 0.99+
Ooracle | ORGANIZATION | 0.99+
Today | DATE | 0.99+
OpenWorld | ORGANIZATION | 0.99+
Big Data Appliance | ORGANIZATION | 0.99+
one | QUANTITY | 0.99+
On the Ground | TITLE | 0.99+
NoSQL | TITLE | 0.99+
one place | QUANTITY | 0.99+
first | QUANTITY | 0.99+
Spark | TITLE | 0.98+
Both | QUANTITY | 0.98+
SQL | TITLE | 0.98+
Brexit | EVENT | 0.98+
three categories | QUANTITY | 0.98+
two architectures | QUANTITY | 0.98+
Hadoop | TITLE | 0.98+
tomorrow | DATE | 0.97+
first product | QUANTITY | 0.97+
both | QUANTITY | 0.97+
10th anniversary | QUANTITY | 0.97+
'95 | DATE | 0.96+
theCUBE | ORGANIZATION | 0.96+
SiliconANGLE | ORGANIZATION | 0.94+
few months ago | DATE | 0.93+