Frank Slootman, Snowflake | Snowflake Summit 2022
>>Hi, everybody. Welcome back to Caesars in Las Vegas. My name is Dave Vellante. We're here with the chairman and CEO of Snowflake, Frank Slootman. Good to see you again, Frank. Thanks for coming on. >>You as well, Dave. Good to be with you. >>Obviously everybody's excited to be back; you mentioned that in your keynote. The most amazing thing to me is the progression of what we're seeing here in the ecosystem and in your data cloud. You wrote a book, The Rise of the Data Cloud, and it was very cogent. You talked about network effects, but now you've executed on that. I call it the supercloud, I use that term. You have AWS, you're building on top of that, and now you have customers building on top of your cloud. So there are these layers of value that are unique in the industry. Was this by design? >>Well, when you are a data cloud, you have data, and people want to do things with that data. They don't want to just run data operations, populate dashboards, run reports. Pretty soon they want to build applications, and after they build applications, they want to build businesses on it. So it goes on and on, and it drives your development to enable more and more functionality on that data cloud. It didn't start out that way. We were very much focused on data operations; then it becomes application development, and then it becomes, hey, we're developing whole businesses on this platform. Similar to what happened to Facebook in many ways. >>There was some confusion, I think, and there still is in the community, particularly on Wall Street, about your quarter and the consumption model. I loved on the earnings call that one of the analysts asked Mike, do you ever consider going to a subscription model? And Mike cut him off before he could even finish: no, that would really defeat the purpose. There's also a narrative around, well, maybe Snowflake consumption is easier to dial down, maybe it's more discretionary. But I would say this: if you're building apps on top of Snowflake and you're actually monetizing, which is a big theme here now, your revenue is aligned with those cloud costs. And unless it costs more than you're selling it for, you're going to dial that up. That's the future I see for this ecosystem and your company. Is that fair? Do you buy that? >>Yeah, it is fair. Obviously the public cloud runs on a consumption model, so when you look at all the layers of the stack, Snowflake has to be a consumption model because we run on top of other people's consumption models. Otherwise you don't have alignment. We have conversations with people that build on Snowflake who have trouble with their financial model because they're not running a consumption model; it's a square peg in a round hole. So we all have to align ourselves. When a customer pays a dollar, a portion of that dollar goes to, let's say, AWS, a portion goes to Snowflake, and a portion goes to whatever the uplift is, the application value, the data value, whatever sits on top of that. The whole dollar gets allocated depending on whose value-add we're talking about. >>Yeah, but you sell value. So you're not a SaaS company.
At least I don't look at you that way. I've always felt like the SaaS pricing model is flawed because it's not aligned with customers: if you get stuck with orphaned licenses, too bad, pay us. >>Yeah. We're obviously a SaaS model in the sense that it is software as a service, but it's not a SaaS model in the sense that we don't sell use rights, and that's the big difference. When you buy so many users from Salesforce or ServiceNow or whoever, you have purchased the right for so many users to use that software for a period of time, and the revenue gets recognized ratably, one month at a time, the same amount. Now, we're not that different in that we still do a contract the exact same way a SaaS vendor does, but we don't recognize the revenue ratably. We recognize the revenue based on consumption. Over the term of the contract we recognize the entire amount; it just isn't neatly organized into those monthly buckets. >>So what happens if they underspend one quarter? Do they have to catch up by the end of the term? Is that how it works, or is that a negotiation? >>The spending is totally separate from the consumption itself, because of how they pay for the contract. Let's say they do a three-year contract; they will probably pay for it on an annual basis. But how they recognize their expenses for Snowflake, and how we recognize the revenue, is based on what they actually consume. It's not like on-demand, where you can just decide not to use it and then have no cost. Over the three-year period all of that needs to get consumed or it expires. It's the same way with Amazon: if I don't consume what I buy from Amazon, I still have to pay for it. >>Right. I guess you could buy by the drink, but it's way more expensive and nobody really does that. >>Correct. >>Yep. Okay. Phase one: a better, simpler cloud enterprise data warehouse. Phase two: you introduced the data cloud, and now we're seeing the rise of the data cloud. What does phase three look like? >>Phase three is all about applications. We learned from the beginning that people were trying to do this, but we weren't instrumented at all to support it. People would use ODBC and JDBC drivers and just use us as a database, so the entire application would happen outside of Snowflake; we were just a database. You connect to the database, you read or write data, you do data manipulations, and then the application processing all happens outside of Snowflake. Now, there are issues with that, because we start to exfiltrate data, meaning we take data out of Snowflake and put it in other places. There's risk in that: operational risk, governance exposure, security issues, all that kind of stuff. And the other problem is that data gets copied. It proliferates, and then the data scientists are like, well, I need that data to stay in one place. That's the whole idea behind the data cloud. We have very big infrastructure clouds.
We have very big application clouds, and data sort of became the victim there; it became more proliferated and more segmented than it's ever been. All we do all day is send data to the work. And we said, no, we're going to enable the work to get to the data, and the data stays more in place: we don't have latency issues, we don't have data quality issues, we don't have lineage issues. People have responded very well to the data cloud idea: yeah, as an enterprise or an institution, I'm the epicenter of my own data cloud, because it's not just my own data. It's also my ecosystem, the people I have data networking relationships with. For example, take an investment bank in New York City: they send data to Fidelity, they send data to BlackRock, they send data to Bank of New York, all the regulatory clearing houses, on and on and on. Every night they're running thousands, tens of thousands of jobs pushing that data out there. And they're all on Snowflake already, so it doesn't have to be this way. >>Yes. So I asked the guys last week, hey, what would you ask Frank? You might remember you came on our program during COVID and I asked how you were dealing with it, turn off the news, and that was cool. And I asked you at the time, would you ever go on-prem, and you said, look, I'll never say never, but it defeats the purpose. You said we're not going to do a halfway house. Actually, you were more declarative: we're not doing a halfway house, one foot in, one foot out. And then the guys said, well, what about that Dell deal and that Pure deal you just did? I think I know the answer, but I want to hear it from you: did a customer come to you, get you in a headlock, and say you've got to do this? >>It didn't happen that way. It started with a conversation with Michael Dell. It was supposed to be just a friendly chat, hey, how's it going? Obviously Dell is the owner of Data Domain, our first company. But it wasn't easy for Dell and Snowflake to have a conversation, because they're the epitome of the on-premises company and we're the epitome of a cloud company. It's like, what do we have in common here? What can we talk about? But Michael is a very smart, engaging guy, always looking for opportunity, and we decided to hook up our CTOs and our product teams and explore some ideas. We had some starts and restarts, because it's naturally not an easy thing to conceive of, but in the end it was like, you know what? It makes a lot of sense. We can virtualize Dell object storage as if it's S3 storage from Amazon, and then Snowflake, in its analytical processing, will just reference that data, because to us it looks like a file sitting on S3. We have such a thing; it's called an external table. That's how we basically project a Snowflake semantic and structural model onto an external object.
And we process against it exactly the same way as if it were an internal table. So we just extended that with our storage partners like Dell and Pure Storage, for it to happen across a network to an on-prem location. It's very elegant, and it becomes an enterprise architecture rather than just a cloud architecture. I don't know yet what will come of it, but I've already talked to customers who have to have data on premises; it just can't go anywhere, because they process against it where it originates, but there are analytical processes that want to reference attributes of that data. Well, this is what will do that. >>It is interesting. If I were Dell, I'd be talking to you about, hey, I'm going to try to separate compute from storage on-prem and maybe do some of the work there. I don't even know if it's technically feasible; I'll ask them. But to me that's an example of you extending your ecosystem. So you're talking now about applications, and that's an example of increasing your TAM. I don't know if you ever get to the edge, we'll see, we're not quite there yet, but as you've said before, there's no lack of market for you. >>Yeah. Obviously Snowflake's genesis was reinventing database management in a cloud computing environment, which is so different from a machine environment or a cluster environment. That's why we're not a fit for a machine-centric environment; it sort of defeats the purpose of how we were built. We are truly a native solution. Most products in the clouds are actually not cloud native; they originated in machine environments and you still see that. Almost everything you see in the cloud, by the way, is not cloud native. Our generation of applications only runs in the cloud, can only run in the cloud. They are cloud native. They don't know anything else. >>Yeah, you're right. A lot of companies would just wrap their stack in Kubernetes, throw it into the cloud, and say, we're in the cloud too. And basically you just shifted it. >>It didn't make sense. They throw it in a container and run it. Right. >>So, okay, that's cool, but what does that get you? It doesn't change your operational model. Coming back to software development and what you're doing in that regard: one of the things we said about supercloud is that in order to have a supercloud, you've got to have an ecosystem, you've got to have optionality. Hence you're doing things like Apache Iceberg; you said today, well, we're not sure where it's going to go, but we're offering options. But my question, as it pertains to software development specifically: one of the things we said is you have to have a super PaaS in order to have a supercloud ecosystem, a PaaS layer. That's essentially what you've introduced here, is it not? A platform for application development? >>Yeah. I mean, what happens today? How do you enable a developer on Snowflake without the developer reading the files out of Snowflake, processing against that data wherever they are, and then putting the result set God knows where? That's what happens today.
It's the wild west; it's completely ungoverned. And that's the reason why lots of enterprises will not allow Python anywhere near their enterprise data. We know that, and we also know it from Streamlit, the large acquisition we made this year, because they said, look, we have a lot of demand in the Python community, but that's the wild west. That's not the enterprise-grade, high-trust corporate environment. They are strictly segregated today. Now, do these things sometimes dribble up in the enterprise? Yes, they do, and the risk enterprises take with things being ungoverned is actually intolerable. The whole Snowflake strategy and promise is that when you're in Snowflake, it is an absolutely enterprise-grade environment and experience. That's really hard to do; it takes enormous investment, but that is what you buy from us. Just having Python is not particularly hard, we could do that in a week. This has taken us years to get to this level of governance and security, with all the risks around exfiltration and so on really understood and dealt with. That's also why these things run in private previews and public previews for so long: we have to squeeze out everything that may not have been understood or foreseen. >>So there are trade-offs to going into this Snowflake cloud. You get all this great functionality, but some people might think it's a walled garden. How would you respond to that? >>Yeah, and it's true: when you have a Snowflake object, like a Snowflake table, only Snowflake runs that table. It's very high function, very analogous to what Apple did: they are very high functioning, but you have to accept that other things outside Apple cannot get at those objects. That's the reason we introduced an open file format like Iceberg, because what Iceberg effectively does is allow any tool to access that particular object. We do it in such a way that a lot of the functionality of Snowflake will address the Iceberg format, which is great, because you're going to get much more function out of our Iceberg implementation than you would get from Iceberg on its own. So we do it in a very high value-added manner, but other tools can still access the same object in a read or write manner. It really delivers the original promise of the data lake, which is: I have all these objects, tools come and go, I can use what I want. So you get the best of both worlds, for the most part. >>It reminds me a little bit of VMware. I mean, VMware is a software mainframe; it's just better than doing it on your own. >>Yep. >>One of the other hallmarks of a cloud company, and you guys clearly are a cloud company, is startups and innovation. Of course you see that in the ecosystem, and maybe that's the answer to my question, but you guys are kind of whale hunters <laugh>; your customers tend to be bigger. Is the innovation now the extension of that, through the ecosystem? Is that by design?
>>Oh, we have an enormous ISV following, and we're going to have a whole separate conference like this, by the way, just for... >>For developers. I hope you guys will have us up there too. >>Yeah. The reason the ISV strategy is so important, for many reasons, is that ISVs are the people who are really going to unlock a lot of the value and a lot of the promise of data, because you can never do that on your own. And the problem has been that for ISVs it is so expensive and so difficult to build a product that can be used, because the entire enterprise platform infrastructure needs to be built by somebody. Are you really going to run infrastructure, database operations, security, compliance, scalability, economics? How do you do that as a software company, when really you only have your domain expertise that you want to deliver on a platform? You don't want to do all these things. First of all, you don't know how to do it, or how to do it well. So it is much easier and much faster when there is already a platform to build on; in the world of cloud, that otherwise just doesn't exist. And beyond that, okay, fine, building it is sort of step one. Now I've got to sell it, I've got to market it. How do I do that? Well, in the Snowflake community you already have a market <laugh>; there are thousands and thousands of customers that are also on Snowflake. So they can consume the service you just built: they can search it, try it, test it, and decide whether they want to consume it, and then we can monetize it. All you have to do is cash the check. The net effect is that we drastically lowered the barriers to entry into the world of software: two men or two women and a dog and a handful of files can build something that can then be sold. >>I wrote a piece in 2012 after the first re:Invent, and I put a big gorilla on the front page and asked, how do you compete with the Amazon gorilla? One of my answers was: you build data ecosystems and you verticalize. And that's what you're doing here. >>Yeah. Certain verticals are farther along than others, obviously, but for example financial services, which is our largest vertical, the data ecosystem is really developing hardcore now. That's because they rely so much on the relationships between all the big financial institutions and entities: regulators, clearing houses, investment bankers, retail banks, all that kind of stuff. It becomes a no-brainer. The network effects kick in so strongly because this is really the only way to do it. If you and I work at different companies, and we do, and we want to create a secure, compliant data network and connection between us, it would take forever to get our lawyers to agree that it's okay <laugh>. Now it's a matter of minutes to set it up if we're both on Snowflake. >>It's like procurement: do you have an MSA? Yeah, check. And it just sails right through, versus back and forth and endless negotiations. >>Today, data networking is becoming a core ecosystem in the world of computing. >>You talked about the network effects in The Rise of the Data Cloud. >>Correct.
>>Again, you weren't the first to come up with that notion, but you are applying it here. I want to switch topics a little bit. When I read your press releases I laugh every time, because it says "No HQ, Bozeman." I think I know where you land on hybrid work and remote work, but what are your thoughts? You saw Elon the other day say you can't work for us unless you come to the office. Where do you stand? >>Well, the first aspect is that we really wanted to separate from the idea of a headquarters location, because I feel it's very antiquated. We have many different hubs. There's not one place in the world where all the important people are and where we make all the important decisions; that whole way of thinking is obsolete. I am where I need to be, and that's many different places. It's not like I sit in one incredible place and everybody comes to me. No, we are constantly moving around, and we have engineering hubs, we have regional headquarters for sales, obviously we have them in Malaysia, we have them in Europe. So I wanted to get rid of the headquarters designation. The other issue, obviously, is that we were in California, but California is no longer the dominant place where we are resident. Forty percent of our engineering people are now in Washington state, we have hundreds of people in Poland, and we're going to have a very substantial location in Toronto. Obviously our customers are everywhere. So this idea that everything is happening in one state is just not correct. We wanted to go to no headquarters. Of course, the SEC doesn't let you do that, because they want you to have a street address where the government can send you mail. Then the question becomes, what's an acceptable location? Well, it has to be a place where the CEO and the CFO have residency, by hook or by crook. That happened to be Bozeman, Montana, because Mike and I are both there. It was not by design; we just did that because we were required to comply with government requirements, which of course we do, and that's why it says what it says. Now, on the topic of where we work, we are super situational about it. It's not, hey, everybody in the office, or everybody is remote; we're not categorical about it. It depends on the function and the location. But everybody is tethered to an office. In other words, everybody has a relationship with an office. There are only a few exceptions of people who are completely remote. If you get hired on with Snowflake, you will always have an office affiliation, and you can be called into the office by your manager, but for a purpose: a meeting, a training, an event. You don't get called in just to hang out, and the office is no longer your home away from home. We're now into hoteling, so you don't have a fixed place. >>Last question, then I'll let you go. You talked in your keynote a lot about customer alignment, obviously a big deal.
I have been watching; we go to a lot of events, and you'll see a technology company tell a story about their widget or their box, and then you'll see an outcome, and you look at it and shake your head and say, well, the difference between this and that is the square root of zero. When you talk about customer alignment today, we're talking about monetizing data, so that's a whole different conversation, and I wonder if you could close on how that's different. At ServiceNow you transformed IT, I get that; at Data Domain it was, okay, tape, blow it out. But this feels like a whole new vector or wave of growth. >>Yeah. Monetizing data becomes sort of a byproduct of having a data cloud. All of a sudden you become aware of the fact that, A, hey, I have data; B, that data might actually be quite valuable to other parties; and C, it's really easy to then sell it and monetize it. Because if it were hard, forget it, I don't have time for it. But if it's compliant and relatively effortless, it's pure profit. I just want to reference one or two attributes of what you have there. Hedge funds, by the way, have been into this sort of thing for a long time, because they procure data from hundreds and hundreds of sources; they are the original data scientists. But the bigger thing with data is that a lot of digital transformation is finally becoming real. For years it was arm-waving, conceptual, abstract, but it's becoming real. How do we run a supply chain? How do we run healthcare? How do we run cybersecurity? These things are being redefined as data problems and data challenges, and they have data solutions. That's why data strategies are insanely important: if the solution is through data, then you need to have a data strategy, and in our world that means you have a data cloud and all the enablement that allows you to do that. Hospitals are saying data science is going to have a bigger impact on healthcare than life science over the coming ten or twenty years. How do you enable that? I have conversations with hospital executives who are like, I've got generations of data: clinical, diagnostic, demographic, genomic. And I'm envisioning these predictive outcomes over here; I want to be able to predict when somebody is going to get what disease and what I have to do about it. How do I do that? <laugh> They go from "I have a lot of data" to "I have these outcomes," and then, do me a miracle somewhere in the middle. Well, that's where we come in. We're going to organize ourselves, unpack that, and then, through training models, we can start delivering some of these insights. The promise is extraordinary. We can change whole industries like pharma and healthcare. Through the effects of data the economics will change, and the societal outcomes, quality of life, disease, longevity of life, are quite extraordinary.
Supply chain management, that's all around us right now. >>Well, there are a lot of high-growth companies that were kind of COVID companies; valuations shot up and now they're trying to figure out what to do. You've been pretty clear, because of what you just talked about, that the opportunity is enormous. You're not slowing down, you're amping it up, pun intended. So Frank Slootman, thanks so much for coming on theCUBE. Really appreciate your time. >>My pleasure. >>All right. And thank you for watching. Keep it right there for more coverage from Snowflake Summit 2022. You're watching theCUBE.
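To make the external-table pattern Slootman describes above a bit more concrete, here is a minimal sketch using Snowflake's Python connector. The account, bucket, stage, and column names are hypothetical, credentials and storage-integration details are omitted, and the on-prem variant with Dell or Pure Storage would point the stage at an S3-compatible endpoint rather than AWS S3 itself, so treat this as an illustration of the idea rather than the actual integration.

```python
# Hypothetical names throughout; a minimal sketch of Snowflake's external-table pattern.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",   # hypothetical account identifier
    user="analyst",
    password="...",
    warehouse="ANALYTICS_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# A stage pointing at object storage Snowflake does not own.
# (Credentials / storage integration omitted; an on-prem setup would use an
#  S3-compatible endpoint instead of AWS S3.)
cur.execute("""
CREATE OR REPLACE STAGE ext_stage
  URL = 's3://my-bucket/events/'
""")

# External table: project a schema onto the files in that stage...
cur.execute("""
CREATE OR REPLACE EXTERNAL TABLE events_ext (
  event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP),
  account  STRING    AS (VALUE:account::STRING),
  amount   NUMBER    AS (VALUE:amount::NUMBER)
)
LOCATION = @ext_stage
FILE_FORMAT = (TYPE = PARQUET)
""")

# ...and query it exactly as if it were an internal table.
cur.execute("SELECT account, SUM(amount) FROM events_ext GROUP BY account LIMIT 10")
for row in cur.fetchall():
    print(row)
```

The last query is the point: once the files are projected as an external table, the engine processes them the same way it would an internal table, while the files themselves stay in an open format outside the warehouse.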
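The open-format argument around Iceberg cuts the other way as well: because the underlying objects are ordinary Parquet files, other tools can read the same data without going through Snowflake at all. Below is a rough sketch with PyArrow, again with hypothetical bucket and column names; note that an Iceberg table also carries its own metadata and manifests, so a production reader would go through an Iceberg-aware library rather than scanning files directly. This only illustrates the simpler point that the objects stay open.

```python
# Read the same Parquet data files directly from S3 with a generic tool (PyArrow).
# Bucket, prefix, and column names are hypothetical.
import pyarrow.dataset as ds
import pyarrow.fs as pafs

s3 = pafs.S3FileSystem(region="us-east-1")
events = ds.dataset("my-bucket/events/", filesystem=s3, format="parquet")

# Pull just two columns into memory and count the rows.
table = events.to_table(columns=["account", "amount"])
print(table.num_rows)
```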
Dipti Borkar, Ahana, and Derrick Harcey, Securonix | CUBE Conversation, July 2021
(upbeat music) >> Welcome to theCUBE Conversation. I'm John Furrier, host of theCUBE, here in Palo Alto, California, in our studios. We've got a great conversation around open data lake analytics on AWS with two great companies, Ahana and Securonix. Dipti Borkar, Co-founder and Chief Product Officer at Ahana, is here, great to see you, and Derrick Harcey, Chief Architect at Securonix. Thanks for coming on, really appreciate you guys spending the time. >> Yeah, thanks so much, John. Thank you for having us, and Derrick, hello again. (laughing) >> Hello, Dipti. >> We had a great conversation around our startup showcase, where you guys were featured last month, this year, 2021. The conversation continues, and a lot of people are interested in this idea of open systems and open source. Obviously open data lakes are driving a lot of value, especially with machine learning and whatnot, so this is a key point. Can you guys take a step back, before we get under the hood, and set the table on Securonix and Ahana? What's the big play here? What is the value proposition? >> Sure, I'll give a quick update. Securonix has been in the security business, first user and entity behavioral analytics and then a next-generation SIEM platform, for 10 years now. We really need to take advantage of some cutting-edge technologies in the open source community and drive adoption and momentum, so that we can not only bring in data from our customers so they can find security threats, but also store it in a way they can use for other purposes within their organization. That's where the open data lake is very critical. >> Yeah, and to add on to that, John, traditionally we've had data warehouses: operational systems move all of their data into the warehouse, and while those systems are really good and built for good use cases, the amount of data is exploding and the types of data are exploding, semi-structured as well as structured. So as companies like Securonix in the security space, as well as other verticals, look to get more insights out of their data, there's a new approach emerging where you have a data lake, which AWS has revolutionized and commoditized with S3, and analytics built on top of it. We're seeing a lot of good advantages come out of this new approach. >> It's interesting, EC2 and S3 are having their 15th birthday, as they say, Amazon's interesting teenage years. But while I've got you guys here I want to ask you to define the SIEM thing, because the SIEM market is exploding and it has just changed a little bit. Obviously it's security information and event management, but as data keeps proliferating, and it's not stopping anytime soon, and as cloud native applications emerge, why is this important? What is this SIEM category about? >> Yeah, thanks, I'll take that. SIEM has been around for about a couple of decades, and it really started with log collection and management and rule-based threat detection. What we call next-generation SIEM is the modernization of a security platform that includes streaming threat detection, behavioral analysis, and data analytics. We literally look for thousands of different threat detection techniques, we chain together sequences of events, and we stream everything in real time; it's very important to find threats as quickly as possible.
But with the momentum we see in the industry, and customers of massive size, we have made the transition from on-premise to the cloud, and we are literally processing tens of petabytes of data for our customers. It's critical that we can ingest data quickly, find threats quickly, and give customers the tools to respond to those security incidents quickly and really get a handle on their security posture. >> Derrick, if I ask you what's different about this next-gen SIEM, what would you say? What's the big a-ha moment? What's the key thing? >> The real key is taking off the boundaries of scale. We want to be able to ingest massive quantities of data, do instant threat detection, and search the entire forensic data set across all of the history of our customer base. In the past we had to make sacrifices, either on the amount of data we ingested or the amount of time we stored that data. The next-generation SIEM platform offers advanced capabilities on top of that data set because those boundaries are no longer barriers for us. >> Dipti, any comment before I jump into the question for you? >> Yeah, absolutely. It is about scale, and like I mentioned earlier, the amount of data is only increasing, and so are the types of information. The systems that were built to process this information in the past support maybe terabytes of data. That's where new open source engines like Presto come in, which were built to handle internet scale. Presto was created at Facebook to handle the petabytes Derrick is talking about, which every industry is now seeing as we move from gigs to terabytes to petabytes. That's where the analytics stack is moving. >> That's a great segue. I want to ask you, while I've got you here, because people love to hear the experts weigh in on definitions: what is open data lake analytics? How would you define it? And then talk about where Presto fits in. >> Yeah, that's a great question. The way I define open data lake analytics is that you have a data lake at the core, let's say S3, the most popular one, but on top of it there are open aspects. It is open formats: open formats play a very important role because different types of processing, SQL, machine learning, other workloads, all work on these open formats, versus a proprietary format where it's locked in. It is open interfaces: interfaces like SQL, JDBC, and ODBC are widely accessible to a range of tools, so it's everywhere. Open source is a very important part of it: as companies like Securonix pick these technologies for their mission-critical systems, they want to know this is going to be available and open for them for a long period of time, and that's why open source becomes important. And finally I would say open cloud, because at the end of the day, while AWS is where a lot of the innovation is happening and a lot of the market is, there are other clouds, and open cloud is something these engines were built for. So that's how I define open data lake analytics: analytics with query engines built on top of open formats, open source, open interfaces, and open cloud. Now, Presto comes in where you want to find the needle in the haystack.
And so when you have these deep questions, where did the threat come from, or who was it, you have to ask those questions of your data. Presto is an open source distributed SQL engine that allows data platform teams to run queries on their data lakes in a high-performance way, in memory, on these petabytes of data. That's where Presto fits in; it's one of the de facto query engines for SQL analysis on the data lake. Hopefully that answers the question and gives more context. >> Yeah, the joke about data lakes has been that you don't want a data swamp; that's what people don't want. >> That's right. >> But at the same time, the needle in the haystack: big data is like a needle in a haystack of needles. So there's a constant struggle to get the right data at the right time. What I learned in the last presentation your teams gave at the conference was the managed service approach. Could you talk about why that approach works well for you together? Because when people get to the cloud they replatform, then they start refactoring, and data becomes a really big part of that. Why is the managed service the best approach to solving these problems? >> Yeah, and interestingly, both Securonix and Ahana have a managed service approach, so maybe Derrick can go first and I can go after. >> I'll be happy to go first. In making the transition from on-premise to the cloud over the last decade for the majority of our customers, we've found that running a large open data lake requires a lot of different skill sets, and there are hundreds of technologies in the open source community to choose from. Choosing the right blend of skill sets and technologies to produce a comprehensive service is something customers can do, and many did, but it takes a lot of resources and effort. What we really want is to package up our security service, our next-generation SIEM platform, so our customers don't need to become experts in every aspect of it. An underlying component of that for us is how we store and access data in an open-standards way. So just as we want our customers to get immediate value from the security services we provide, we also want to take advantage of a search service offered and supported by a vendor like Ahana, where we can very quickly take advantage of that value within our core underlying platform. We really want to make it frictionless for our customers to achieve value as quickly as possible. >> That's great stuff. And on the Ahana side, open data lakes and the ease of use there: it sounds easy, but we know it's not easy to just put data in a data lake. At the end of the day a lot of customers want simplicity because they don't have the staffing; this comes up a lot. How do you leverage open source participation and get them stood up quickly so they can get some value? That seems to be the number one thing people want right now. Dipti, how does that work? How do people get value quickly? >> Yeah, absolutely. These open source engines like Presto and others came out of large internet companies that have a lot of distributed systems engineers, PhDs, very advanced teams.
They can manage these distributed systems, build onto them, and add features at large scale, but not every company can, and these engines are extremely powerful. When you combine the power of Presto with the cloud and a managed service, that's where value for everyone comes in. That's what I did with Ahana: I looked at Presto, which is a great engine, and converted it into a great user experience, so that whether it's a three-person or a five-person platform team, they still get the same benefit of Presto that a Facebook gets, but at much less operational complexity and cost, plus the ability to depend on a vendor who can drive the innovation and make it even better. That's where managed services really come in. There are thousands of configuration parameters that need to be tuned; with Ahana you get it out of the box, with the best practices followed at these larger companies. Our team comes from Facebook, Uber, and others, and you get that out of the box. With a few clicks you can get up and running, so you see value immediately: in 30 minutes you're up and running and you can create your data lake, versus Hadoop and prior systems, where it would take months to see real value. >> Yeah, we saw the Hadoop scar tissue: it's all great and good now, but it takes too much resource, standing up clusters, managing them, and you can't hire enough people. I've got to ask, while you're on that topic: do you ship templates? How do you solve the out-of-the-box problem? You mentioned some out-of-the-box capability; do you think of it as recipes, templates? What are your thoughts on what you're providing customers to get up and running? >> Yeah. In the case of Securonix, say they want to create a Presto cluster: they go into our SaaS console and essentially put in the number of nodes and workers they want. There's a lot of additional value we've built in, like caching capabilities if you want more performance, and built-in cataloging, again a single click. There isn't really so much a template; everybody gets the best-tuned Presto for their workloads. Now, certain workloads might be interactive, and others might be transformation or batch ETL, and what we're doing next is giving you the knobs so it comes pre-tuned for the type of workload you want to run, versus you figuring it out. That's what I mean by out of the box: you don't have to worry about these configuration parameters, you get the performance. And Derrick, maybe you can talk a little bit about the benefits of the managed service and the usage as well. >> Yeah, absolutely. I'll answer the same question and then tie back to what Dipti said. For our customers, we want it to be very easy to ingest security event logs, and there are hundreds of types of security event logs that we support natively out of the box. The key for us is a standard we call the open event format, which is a normalized schema. We take any data source into its normalized form, maybe via a collector device a customer runs on-premise; they send the data up to our cloud, and we do streaming analysis and data analytics to determine where the threats are. Once we do that, we send the data off to long-term storage in a standards-based Parquet file, and that Parquet file is natively read by the Ahana service.
So we simply deploy an Ahana cluster that uses the Presto engine, which natively supports our open, standard file format, and we have a normalized schema our application can immediately get value from. We handle the collection and streaming ingest, and we simply leverage the engine in Ahana to give us the appropriate scale. We can size up and down and control the cost to give users the experience they're paying for. >> I really love this topic because not only is it cutting edge, it's very relevant for modern applications. You mentioned next-gen SIEM, security information and event management, not SIM as in memory card, which I think of all the time because I always want to add more. This brings up the idea of streaming data in real time. As more services go to the cloud, Derrick, if you don't mind, share the journey you guys have gone through, because a lot of people are looking at the cloud, and I've been in a lot of these conversations about repatriation versus cloud. People aren't going that way; they're going for more innovation, with net new revenue models emerging from the value they're getting out of understanding events happening within the network and the apps, even when they're being stood up and torn down. There's a lot of cloud native action going on where controlling and understanding goes way beyond just putting stuff into an event log. It's a whole other animal. >> Well, there are a couple of paradigm shifts we've seen major patterns for in the last five or six years. Like I said, we started with the same streaming ingest platform on premise, using some different open source technologies. When we moved to the cloud, we adopted cloud native services as part of our underlying platform to modernize and make our service cloud native. Many customers wanted to focus on on-premise deployments, especially financial institutions and government institutions, because they are very risk-averse. Now even those customers are realizing it's very difficult to maintain the hundreds or thousands of servers it requires on premise and the large skilled staff required to keep it running. So a lot of those customers deployed packaged products like our own, and even our own customers are doing a mass migration to the cloud, because everything is handled for them as a service, and we maintain one team of experts to support all of our global customers rather than every customer having their own team that we then support on the back end. It's a much more efficient model. The other major path many of our customers went down is building their own security data lake. Many were somewhat successful at that, but in order to keep up with the innovation, if you look at the analyst groups, the Gartner Magic Quadrant in the SIEM space, the feature set provided by a packaged product is very large. Even if somebody put together all the open source technologies to meet 20% of those features, just maintaining that over time is very expensive and very difficult. So we want to provide a service that has all the best-in-class features, but also leverages the ability to innovate on the back end without the customer knowing.
So we can do a technology shift to Ahana and Presto from our previous technology set; the customer doesn't know the difference, but they see the value-add within the service we're offering. >> So if I get this right, Derrick, Presto is enabling you to do threat detection at a level you're super happy with, as well as giving you the option for self-service. Is that right? Is that kind of a... >> Well, let me clarify our definition. We do streaming threat detection: machine-learning-based behavioral analysis and threat detection, with rule-based correlation as well. So we do threat detection during the streaming process, but as part of managing cybersecurity, the customer has a team of security analysts that do threat hunting, and the threat hunting is where Ahana comes in. A human gets involved and starts searching the forensic logs to determine what happened over time that might be suspicious, investigating through a series of queries that give them the relevant information. Once they find something relevant, they package it up into an algorithm that runs on an ongoing basis as part of the stream processing. So it's really part of the life cycle of hunting and real-time threat detection. >> It's kind of like the old adage of hunters and farmers: you're farming through the streaming and hunting with the detection. I've got to ask, what would the alternative be if you went back? I know the cloud is great because you have cutting-edge applications and technologies, but without Presto, where would you be? What would life be like without these capabilities? What would have to happen? >> Well, we did have a similar feature set before we moved to Presto; the challenge was scale. The cost profile of continuing to grow from 100 terabytes to one petabyte to tens of petabytes was not only expensive, the scaling factors were not linear. So we not only had a problem with cost, we had a problem with performance tailing off and with keeping the service running. Our first incarnation of this used the Hive service on a large Hadoop cluster to query data in a MapReduce environment, a completely different technology that uses a distributed Hadoop compute cluster to do the query. It does work, but then we started to see resource contention with that and all the other things on the Hadoop platform. The beauty of the Presto engine is that not only was it designed for scale, it's purpose-built as a query engine, and that's providing the right tool for the job as opposed to a general-purpose tool. >> Derrick, you've got a very busy job as chief architect. What are you excited about going forward when you look at cloud technologies? What are you watching? What excites you, or what worries you? >> That's a good question. I'm leading a group called Securonix Innovation Labs, and we're looking at next-generation technologies. We analyze open source technologies and proprietary technologies, as well as building our own.
And that's where we came across Ahana, as part of a comprehensive analysis of different search engines, because we wanted to go through another round of search-engine modernization. We worked together in a partnership, and we're going to market together as part of the modernization efforts we're continuously going through. So I'm looking forward to iterative, continuous improvement over time. This next journey, because of the growth in cybersecurity, really requires new and innovative technologies that work together holistically. >> Dipti, you've got a great company that you co-founded. As co-founder and chief product officer, you're both the lead entrepreneur and the one with the keys to the kingdom on the product. You've got to balance that 20-mile stare into the future while driving product excellence, and you've got open source as a tailwind. What's on your mind as you go forward with your venture? >> Yeah, great question. It's been super exciting to have founded Ahana in this space, cloud, data, and open source; that's where the action is happening these days. There are two parts to it. One is making our customers successful and continuously delivering capabilities and features, continuing our ease-of-use theme and building a foundation so customers like Securonix and others get the most value out of their data as fast as possible. That's a continuum. In terms of longer-term innovation, the way I see the space there is a lot more innovation to be done: Presto itself can be made even better, and there's a next-gen Presto we're working on. Given that Presto is part of the Linux Foundation, a lot of this innovation happens collaboratively with Facebook and Uber, who are members of the foundation with us, and we look forward to making Securonix part of that foundation too. That shared innovation can then benefit the entire community as well as the customer base. It includes better performance, more built-in capabilities, caching and many other database innovations, as well as scaling and auto-scaling, and keeping up with the ease-of-use theme we're building on. So it's very exciting to work with all these companies, and with Securonix, who has been a fantastic partner. We work together, build features together, and I look forward to delivering those features and functionality to be used by the analysts, data scientists, and threat hunters, as Derrick called them. >> Great success, great partnership. I love the open innovation and open co-creation you guys are doing together, and open data lakes, great concept, open data analytics as well. This is the future: insights coming from the open, sharing, and actually having some standards. I love this topic. Dipti, thank you very much, and Derrick, thanks for coming on and sharing on this CUBE Conversation. >> Thank you so much, John. >> Thanks for having us. >> Thanks. Take care. Bye-bye. >> Okay, that's theCUBE Conversation here in Palo Alto, California. I'm John Furrier, your host of theCUBE. Thanks for watching. (upbeat music)
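To make the pattern Derrick describes concrete, normalized events landing as Parquet on S3 and an analyst asking ad hoc questions through a Presto endpoint, here is a minimal sketch using the open source presto-python-client. The endpoint, catalog, table, and column names are hypothetical; they are not Securonix's actual open event format schema or a real Ahana deployment's configuration.

```python
# Hypothetical threat-hunting query against Parquet-backed event data via Presto.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical Presto coordinator endpoint
    port=8080,
    user="threat_hunter",
    catalog="hive",       # catalog whose tables point at Parquet files on S3
    schema="security",
)
cur = conn.cursor()

# Look for accounts with an unusual number of authentication failures this week.
cur.execute("""
    SELECT src_ip, user_name, count(*) AS failures
    FROM security_events
    WHERE event_type = 'AUTH_FAILURE'
      AND event_ts > current_timestamp - INTERVAL '7' DAY
    GROUP BY src_ip, user_name
    HAVING count(*) > 50
    ORDER BY failures DESC
    LIMIT 20
""")
for src_ip, user_name, failures in cur.fetchall():
    print(src_ip, user_name, failures)
```

A query like this is the "hunting" half of the hunters-and-farmers split above: streaming detection flags threats in real time, while an analyst interrogates the full forensic history sitting in open-format files on the lake.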
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Securonix | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
Derrick Harcey | PERSON | 0.99+ |
Derrick | PERSON | 0.99+ |
Ahana | ORGANIZATION | 0.99+ |
Ahana | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
20% | QUANTITY | 0.99+ |
July 2021 | DATE | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Dipti | PERSON | 0.99+ |
100 terabytes | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
10 years | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Linux Foundation | ORGANIZATION | 0.99+ |
two parts | QUANTITY | 0.99+ |
thousands | QUANTITY | 0.99+ |
Securonix Innovation Labs | ORGANIZATION | 0.99+ |
tens of petabytes | QUANTITY | 0.99+ |
30 minutes | QUANTITY | 0.99+ |
one petabyte | QUANTITY | 0.99+ |
Dipti Borkar | PERSON | 0.99+ |
20 miles | QUANTITY | 0.99+ |
Palo Alto, California | LOCATION | 0.99+ |
five person | QUANTITY | 0.99+ |
First | QUANTITY | 0.99+ |
SQL | TITLE | 0.99+ |
last month | DATE | 0.99+ |
both | QUANTITY | 0.99+ |
One | QUANTITY | 0.98+ |
15th birthday | QUANTITY | 0.97+ |
two great companies | QUANTITY | 0.96+ |
HuBERT | ORGANIZATION | 0.96+ |
Hadoop | TITLE | 0.96+ |
S3 | TITLE | 0.96+ |
hundreds of technologies | QUANTITY | 0.96+ |
three person | QUANTITY | 0.95+ |
Parquet | TITLE | 0.94+ |
first incarnation | QUANTITY | 0.94+ |
first | QUANTITY | 0.94+ |
Presto | ORGANIZATION | 0.93+ |
Gartner | ORGANIZATION | 0.93+ |
last decade | DATE | 0.92+ |
terabytes of data | QUANTITY | 0.92+ |
first log | QUANTITY | 0.91+ |
single click | QUANTITY | 0.9+ |
Presto | PERSON | 0.9+ |
theCUBE | ORGANIZATION | 0.88+ |
Rob Harris, Stardog | Cube Conversation, March 2021
>>Hello. Welcome to this special Cube Conversation. I'm John Furrier, host of theCUBE here in Palo Alto, California, featuring Stardog, a great hot start-up. We've got a great guest, Rob Harris, vice president of solutions consulting for Stardog, here talking about some of the cloud growth, um, knowledge graphs, the role of data. Obviously there's a huge sea change. You're seeing real value coming out of this COVID era as companies come out of the pandemic, new opportunities, new use cases, new expectations, a highly accelerated shift happening, and we're here to break it down. Rob, thanks for joining us on the Cube Conversation. Great to be here. So I'm excited to talk to you guys about your company and specifically the value proposition. I've been talking about graph databases almost since 2007, when Neo4j came out, and looking at how data would be a real part of the developer mindset. Um, early on it was more of the developers; now it's mainstream, you're seeing value being created in graph structures. Okay, not just relational. This has been, uh, very well verified. You guys are in this business. So this is a really hot area, a lot of value being created. It's cool, and it's relevant. So tell us first, what is Stardog doing? What's, uh, what is the company about? >>Yeah, so I mean, we are an enterprise knowledge graph platform company. We help people be successful at standing up knowledge graphs of the data that they have both inside their company and using public data, and tying that all together in order to be able to leverage that connected data and really turn it into knowledge through context and understand it. >>So how did this all come about from a tech standpoint? What was the motivation around this? Because, um, obviously the unstructured wave hit, you're seeing successes like Databricks, for instance, just absolutely crushing it on their valuation and their relevance. You're seeing the same kind of wave hit, almost harking back to the Hadoop days with unstructured data. Is that a big part of it? Is it just evolution? What's the big driver here? >>Yeah, no, I think it's a great question. The driver early on is, as these data sets have increased, so many companies trying to really bring some understanding to it as they roll it out in their organizations, you know, they've tried to just centralize it, and that hasn't been sufficient in order to be able to unlock the value of most organizations' data. So being able to step beyond just, you know, pulling everything together into one place, but really putting that context and meaning around it that the graph can do. So that's where we really got started, uh, back in the day: we really looked at the inference and reasoning part of a knowledge graph. How do we bring more context and understanding that doesn't naturally exist within the data? And that really is how we launched off the product. >>I got to ask you around the use cases, because one of the things that's really relevant right now is you're seeing a lot of front end development around agile applications. DevOps has brought infrastructure as code. You're seeing kind of this huge tsunami of new applications, but one of the things that people are talking about in some of the developer circles, and it kind of hits the enterprise, is this notion of state, because you can have an application calling data, but if the data is not addressable, and then keeping state in real time, and all these kinds of new technical problems, how do you guys look at that? When you look at trying to create knowledge graphs, because maintaining that level of connection, you need data, a ton of it, it's gotta be exposed and addressable and then dealt with in real time. How do you guys look at it? >>Yeah, that's a great question. What we've done to try to kind of move the ball forward on this is move past trying to centralize that data into a knowledge graph that is separate from the rest of your data assets, but really build a data virtualization layer, which we have integrated into our product, to look at the data where it is, in the applications and the unstructured documents and the structured repositories, so that we can observe the state changes in that data and answer questions that are relevant at the time. And we don't have to worry about some sort of synchronous process, you know, loading information into the graph. So that ability to add that virtualization layer, uh, to the graph really enables you to get more of a real time look at your data as it evolves. >>Yeah. I definitely want to double click on that, but I want to just step back and kind of set the table for the folks that aren't, um, getting in the weeds yet on this. There's kind of a specific definition of enterprise knowledge graph. Could you just quickly define that? What is the enterprise knowledge graph? Sure. >>Yeah, we really see an enterprise knowledge graph as a connected set of data with context. So it's not just storing it like a graph, but connecting it and putting meaning around that data through structure, through definitions, et cetera, across the entire enterprise. So looking at not just data within a single application or within a single silo, but broadly through your enterprise: what does your data mean, how is it connected, and what does it look like within the context of each other? >>How should companies reuse their data? >>Boy, that's a broad question, right? Uh, you know, I mean, one of the things, uh, that I think is very important is so many companies have just collected data assets over the years; they collect more and more and more. We have customers that have eight petabytes of data within their data lake, and they're trying to figure out how to leverage it. By actually connecting and putting that context around the data, you can get a lot more meaning out of that old data, or the stale data, or the unknown data, than people are getting right today. So the ability to reuse the data assets within the context of meaning is where we see people really be able to make huge leaps forward in their organization, like drug companies being able to get drugs to market faster by looking at older studies they've done, where maybe the meaning was hidden because it was an old system. Nobody knew what the particular codes and meaning were in the context of today. So being able to reuse and bring that forward brings real life application to people solving business problems today. >>Rob, I got to get your thoughts on something that we always riff on here on theCUBE, which is, um, you know, do you take down the data silos or do you leverage them? And you know, this came up a lot many years ago when we first started discussing containers, for instance, and then we saw that you didn't have to kill the old to bring in the new. Um, there's one mindset of, you know, break the silos down, go horizontal scalability on the data, critical data plane, control plane; others saying, hey, you know what, just put a wrapper around those silos, and you know, I'm oversimplifying, but you get the idea. So how should someone who's really struggling with, or, or not struggling, but putting together an architecture around their future plans around dealing with data and data silos specifically, because certainly as new data comes in there's a mechanism for that. But as you have existing data silos, what do companies do? What's the strategy in your opinion? >>Yeah, you know, it is a really interesting question. I was in data warehousing for a long, long time and a big proponent of moving everything to one place. And, uh, then I really moved into looking into data virtualization and realized that neither of those solutions are complete. There are some things that have to be centralized and moved; the old systems aren't sufficient in order to be able to answer questions or process them. But there are many data silos that we've created within organizations that can be reused. You can leverage the compute, you can leverage the storage that already exists within them. And that's the approach we've taken at Stardog. We really want to be able to allow you to centralize the data where that makes sense, right, to get it out of those old systems that should be shut down just from a monetary perspective, but for the systems that have actual meaning, or that are too expensive to remove, leverage those data silos. And by letting you have both approaches in the same platform, we hope to make this not an either-or architectural decision, which is always the difficult question. >>Okay. So you got me on that one. So let me just say that I want to leverage my data silos. What do I do? Take me through the playbook. If I've got the data silos, what is the Stardog recommendation for me? >>Sure. So what we generally recommend is you start off with building kind of a model, uh, in the lingo we sometimes say an ontology, some sort of semantic understanding that puts context around what is my data and what does it mean. And then we allow you to map those data silos. We have a series of connectors in our product, so whether it's an application and you're connecting through a REST connector, or whether it's a database and you're connecting through ODBC or JDBC, you map that data into the platform. And then when you issue queries to the Stardog platform, we federate those queries out to the downstream systems and answer as if that data existed on the graph. So that way we're leveraging the silos where they are, without you having to move the data physically into the platform. So you guys are essentially building a >>Data fabric. >>We are, yeah. Data fabric is really the new term that's been popping up more and more with our customers when they come to us to say, how can we kind of get past the traditional ways of doing data integration and unifying data in a single place? Like you said, we don't think the answer is purely all about moving it all to one big lake. We don't think the answer is all about just creating this virtualization plane, but really being able to leverage the best of both. >>All right.
So if you believe that, then let's just go to the next level then. So if you believe that they don't have to move things around into one specific place, how does a customer deal with their challenge of hybrid cloud, and soon to be multi-cloud, because that's certainly on the horizon? People want choice. There are going to be architectural choices. I mean, certainly cloud operations will be in play, but there's on-premise and there's cloud, and then soon to be multiple clouds. How do you guys deal with that question? >>Yeah, that's a great question. And this is really an area that we're very excited about and we've been investing very heavily in: how to have multiple instances of Stardog running in different clouds, or on prem and in the cloud, coordinating to answer questions, to minimize data movement between the platforms. So we have the ability to run either an agent on prem, for example, if you're running the platform in the cloud, or vice versa, you can run it in the cloud, or two full instances of Stardog, where they will actually co-plan queries to understand where does the data live, where is it resident, and how do I minimize moving data around in order to answer the question? So we really are trying to create that unified data fabric across on-prem or multiple cloud providers, so that any of the nodes in the platform can answer questions from any of the data sets. >>You know, complexity is always the issue. People costs go up when you have complexity, and you guys are trying to tame it. This is a huge conversation. You bring up multi-cloud and hybrid cloud, and multi-cloud when you think about the IoT edge, and you don't want to move data around. This is what everyone's saying: why move it? Why move data? It's expensive to move data; process it where it is, and you kind of have this flexibility. So this idea of unification is a huge concept. Is that enough? And how should customers think about the unification? Because if you can get there, it almost is the kind of Holy Grail you're talking about here. So this is kind of the prospect of having kind of an ideal architecture of unification. So take me through that one step deeper. >>Well, it is kind of interesting, because as you really think about unifying your data and really bringing it together, of course it is the Holy Grail. And that's what people have been talking about, um, gosh, since I started in the industry over 20 years ago: how do I get this single pane view of my data, regardless of whether it's physically co-located or, uh, somehow stitched together? But one of the things that, you know, our founders really strongly believed in when they started the company was that it isn't enough. It isn't sufficient. There is more value in your data that you don't even know. And unlocking that through either machine learning, which, of course, we all know is very hot right now, to look at how do I derive new insights out of the data that I already have, or even through logical reasoning, right, and inference, looking at what do I understand about how that data is put together and how it's created in order to create more connections within the data and answer more questions. All those are ways to grow beyond just unifying your data, but actually getting more insights out of it. And I think that is the real Holy Grail that people are looking for: not just bringing all the data together, but actually being able to get business value and insights out of that data. >>Yeah, looking for it. You guys have obviously a pretty strong roster of clients that represent that. Um, but I got to ask you, since you brought up the founders: the company obviously having a founders' DNA, uh, mindset, um, tends to drive the culture of the company. What is the founders' culture inside Stardog? What is the vibe there? If you could, um, what do they talk about the most when they get in that mode of being founders, like, hey, you know, this is the North Star? What's the rap like? What's the vibe? Share it, take us through some Stardog culture. >>Sure. So our three founders came out of the University of Maryland, all in a PhD program around semantic reasoning and logical understanding, and being able to understand data and be able to communicate that as easily as possible is really the core and the fiber of their being. And that's what we see continually under discussion every single day: how can we push the limits to take this technology and make it easier to use, more available, bring more insights to the customers beyond what we've seen in the past. And I find that really exciting, to be able to constantly have conversations about how do we push the envelope? How do we look beyond even what Gartner says is five or eight years in the future, and look even further ahead? >>So they're into this whole data scene then, big time. >>They are, they are very active in the conferences and posts and, you know, all of that. >>Great. They love this agility. They got to love DevOps if you're into this knowledge graph scene. So I gotta ask you, what's the machine learning angle here? Obviously AI, we know what AI is, AI is essentially a combination of many things: machine learning and other computer science and data access. Um, what is the secret sauce behind the machine learning, and the vibe and the product? >>Yeah, a lot of times the way that we leverage machine learning, or the way that we look at it, is how do we create those connections between data? So you have multiple different systems and you're trying to bring all that data together. It's not always easy to tell: is this Rob Harris the same as that Rob Harris? Is this product the same as that product? So when possible we will leverage keys, or we'll leverage a very, uh, you know, systematic type of understanding that these things are the same, but sometimes you need to reach beyond that. And that's where we leverage a lot of machine learning within the platform, looking at things like linear regression or other approaches around the graph, you know, connectivity analysis, PageRank, things like that, to say where are things the same, so that we can build those connections and that connectivity as automatically as possible. >>You know, there's a lot of talk on theCUBE, and also now on the new Clubhouse app, where people are talking about misinformation; obviously we're in the media business. We love the digital network effect. Everything's networks, the network economy. You're starting to see this power of information and value. You guys carved out the knowledge graph. So I gotta ask you, when you look at this kind of future where you have this, um, complexity and the network effect, um, how are you guys looking at that data access? Because if you don't have the data, you're not going to have that insight, right? So you need to have that network connection. Is that a limitation for companies? Um, 'cause usually people's blind spot is their data, or their lack of data. So having things networked together is going to be more of the norm in the future. How do you guys see that playing out? >>Yeah, I think you're exactly right. And I think that as you look beyond where we are today, a lot of times we focus today on the data that a company already has: what do I know? Right, what do I know about you? How do I interact with you? How have I interacted with you? I think that as we look at the future, we're going to talk more about data sharing, about leveraging publicly available information, about being able to take these insights and leverage them not just within the walls of my own organization, but being able to share them and, uh, work together with other organizations to bring up a better understanding of you as a person or as a consumer that we could all interact with. Yeah, you're absolutely right. You know, Metcalfe's law still holds true that, you know, more network connections bring more value. I certainly see that growing in the future, probably more around, you know, more data sharing and more openness about leveraging publicly available data. >>You know, it's interesting. You mentioned you came from a data warehouse background. I remember when I broke into the business 30 years ago, when I started getting into computer science, you know, there was pain in having a product versus an enabling platform. You guys seem to have this enabling platform where there's no one use case. I mean, you have an unlimited use case landscape. Um, you could do anything with what you guys have. It's not so much, I mean, there's low-hanging fruit. So I got to ask you, if you have that, uh, enabling platform, you're creating value for customers. What are some of the areas you see developing, like now, in terms of low-hanging fruit, and where are the possibilities? How do you guys see that? I'm sure you've probably got a tsunami of activity around use cases, from media to every vertical. >>We do. And that's, you know, the exciting part of this job. Uh, part of the exciting part of knowledge graphs in general is to see all the different ways that they are being used. But we do see some use cases repeated over and over again. Uh, risk management is a very common one: how do I look at all the people and the assets within an organization, the interactions they have, to look at hotspots for risk, uh, that I need to correct within my organization? For pre-commercial pharma, that has been a very, very hot area for us recently: how do we look at all the data that's available within an organization, and that's publicly available, in order to accelerate drug development? In this post-COVID world, that's become more and more relevant, uh, for organizations to be able to move forward faster in the bio industry and life sciences. Um, that's a use case that we've seen repeated over and over again. And then there's this growing idea of the data fabric: looking at metadata within the organization to improve data integration processes, to really reduce the need for moving data out of or around the organization as much. Those are the use cases we've seen repeated over and over again over the last- >>Awesome, Rob. My last question before we wrap up is for the solution architect that's out there that has, you know, got a real tall order. They have to put together a scalable organization, people, process and technology around a data architecture. That's going to be part of, um, the next gen, next level activity. And they need headroom for IoT edge and industrial edge, uh, and all use cases. Um, what's your advice to them as they look out and start thinking about architecture? >>Yeah, that's a great question. Uh, I really think that it's important to keep your options open as the technology in the space continues to evolve, right? It's easy to get locked into a single vendor or a single mindset. Um, I've been an architect most of my career, and that's usually where a lot of the pitfalls are. Things like a knowledge graph are open and flexible. They adhere to standards, which then means you're not locked into a single vendor, and you're allowed to leverage this type of technology to grow beyond what was originally envisioned. So think about how you can take advantage of these modern techniques to look at things, and not just keep repeating what you've done in the past; the sins of the past, uh, you know, a lot of times do reappear. So fighting against that as much as possible is my encouragement. >>Awesome, great insight. And I love this, I love this area. I know you guys got a great trend you're riding on, very cool, very relevant. Final minute: just take a quick minute to give a plug for the company. What's the business model? How do I deploy this? How do I get the software? How do you charge for it? If I'm going to buy this solution or engage with Stardog, what do I do? Take me through that. Sure. >>Yeah. We, uh, as you've heard through this whole thing, we are an enterprise knowledge graph platform company. So we really help you get started with your business, uh, leveraging and using a knowledge graph for your organization. We have the ability to deploy on prem or in the cloud; we're in the AWS Marketplace today, so you can take a look at our software today. We generally are subscription-based, based on the size of the install. And we are happy to talk to you any time; just drop by our website and reach out, and we'll get you started. >>Rob, great. Thanks for coming on, I really appreciate it. That said, looking forward to seeing you in person when we get back to real life, hopefully as the vaccines are coming along. Thanks to, uh, companies like you guys providing awesome analytics and intelligence for these drug companies and pharma companies; now you have a few of them on your client roster. So congratulations, looking forward to following up. Great, great area, cool and relevant. Data architecture is changing: some of it's broken, some of it's being fixed. Stardog is one of the hot startups scaling up beautifully in this new era of cloud computing meets applications and data. I'm John Furrier with theCUBE. This is a Cube Conversation from Palo Alto, California. Thanks for watching.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Rob | PERSON | 0.99+ |
Rob Harris | PERSON | 0.99+ |
March 2021 | DATE | 0.99+ |
John ferry | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Maryland | LOCATION | 0.99+ |
five | QUANTITY | 0.99+ |
Palo Alto, California | LOCATION | 0.99+ |
eight years | QUANTITY | 0.99+ |
John | PERSON | 0.99+ |
Gartner | ORGANIZATION | 0.99+ |
three founders | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
one | QUANTITY | 0.98+ |
single | QUANTITY | 0.98+ |
30 years ago | DATE | 0.98+ |
2007 | DATE | 0.98+ |
both | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
one place | QUANTITY | 0.95+ |
over 20 years ago | DATE | 0.93+ |
both approaches | QUANTITY | 0.92+ |
rod Harris | COMMERCIAL_ITEM | 0.92+ |
single silo | QUANTITY | 0.92+ |
many years ago | DATE | 0.91+ |
Stardog | ORGANIZATION | 0.9+ |
pandemic | EVENT | 0.9+ |
single vendor | QUANTITY | 0.9+ |
double | QUANTITY | 0.9+ |
single application | QUANTITY | 0.88+ |
agile | TITLE | 0.86+ |
one step | QUANTITY | 0.84+ |
ODBC | TITLE | 0.84+ |
single day | QUANTITY | 0.82+ |
star DOE | ORGANIZATION | 0.81+ |
star dog | ORGANIZATION | 0.79+ |
two full instances | QUANTITY | 0.79+ |
Neo four J | ORGANIZATION | 0.79+ |
single plain | QUANTITY | 0.76+ |
eight petabytes of data | QUANTITY | 0.75+ |
prem | ORGANIZATION | 0.74+ |
JDBC | TITLE | 0.7+ |
one big Lake | QUANTITY | 0.69+ |
thing | QUANTITY | 0.56+ |
COVID | EVENT | 0.55+ |
StarTalk | ORGANIZATION | 0.53+ |
Metcons | TITLE | 0.52+ |
things | QUANTITY | 0.49+ |
playbook | COMMERCIAL_ITEM | 0.35+ |
Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives
>> Sue: Hello everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer them offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you. >> Tom: Hello everyone and thanks for joining us today for this talk. My name is Tom Wall and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third party integrations that enable the software ecosystem that surrounds Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how those can be really effective for you and make things easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help to grow the developer community and enable lots of exciting new use cases. So, every developer out there has probably had to deal with a problem like this. You have some business requirements, to maybe build some new Vertica-powered application. Maybe you have to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool, or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there's a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? Well, you have to get creative, using tools like PyODBC, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is a C library, and most programming languages know how to call C code, somehow.
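To make that bridging approach concrete, here is a minimal sketch of the PyODBC route. It assumes an ODBC driver is already installed and a DSN named VerticaDSN has been configured; the DSN name, credentials, and query are placeholders.

```python
# A minimal sketch of the ODBC bridging approach described above.
# Assumes a configured ODBC driver and a DSN named "VerticaDSN" in odbc.ini;
# both names are placeholders for whatever your environment actually uses.
import pyodbc

# Connect through the generic ODBC layer rather than a native client.
cnxn = pyodbc.connect("DSN=VerticaDSN;UID=dbadmin;PWD=secret", autocommit=True)
cursor = cnxn.cursor()

# Run a simple query and iterate over the results.
cursor.execute("SELECT version()")
for row in cursor.fetchall():
    print(row[0])

cnxn.close()
```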
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. So that's enough to get the job done but native integrations are usually a lot smoother and easier. So rather than, for example, in Python trying to fight with PyODBC, to configure things and get Unicode working, and to compile all the different pieces, the right way is to make it all work smoothly. It would be much better if you could just PIP install library and get to work. And with Vertica-Python, a new Python client library, you can actually do that. So that story, I assume, probably sounds pretty familiar to you. Sounds probably familiar to a lot of the audience here because we're all using Vertica. And our challenge, as Big Data practitioners is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs. Luckily for us, Vertica has been designed to be easy to build with and extended in this fashion. Databases as a whole had had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about it. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain. So implementing that business logic, solving that problem, without having to worry about all of these intense, sometimes details about what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what the answer is that you want. You don't tell Vertica how to get it. Vertica will figure out the right way to do it for you so that you don't have to worry about it. So this SQL abstraction is very nice because it's a well defined boundary where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems. This goes beyond though, what's accessible through SQL to Vertica. We've got well defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things write your own SQL functions, or extend database softwares with UDXs, you can do so. If you have a custom data format that might be a proprietary format, or some source system that Vertica doesn't natively support, we have extension points that allow you to use those. To make it very easy to do passive, parallel, massive data movement, loading into Vertica but also to export Vertica to send data to other systems. And with these new features in time, we also could do the same kinds of things with Machine Learning models, importing and exporting to tools like TensorFlow. 
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties that solve all different problems that are common in this big data processing world. Whether it's open source, streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also, BI tools and visualizers and things like that to view and use the data that you keep in your database on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want it to solve the problems that you need to solve. So Vertica has always employed open standards, and integrated it with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves. And we're using these new libraries to power some new integrations with some third party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts and exciting ways to solve new problems. And the code for all these things is available now on our GitHub page. And so you can use it however you like, and even help us make it better too. So the first such project that we have is called Vertica-Python. Vertica-Python began at our customer, Uber. And then in late 2018, we collaborated with them and we took it over and made Vertica-Python the first official open source client for Vertica You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years and it's very common language to solve lots of different problems and use cases in the Big Data space from things like DevOps admission and Data Science or Machine Learning, or just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 End Of Life, that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2. And also to provide a nice migration path for all of you our users that might be worried about the same problems with their own Python code. So Vertica-Python is used already for lots of different tools, including Vertica's admintools now starting with 9.3.1. It was also used by DataDog to build a Vertica-DataDog integration that allows you to monitor your Vertica infrastructure within DataDog. So here's a little example of how you might use the Python Client to do some some work. So here we open in connection, we run a query to find out what node we've connected to, and then we do a little DataLoad by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python Database Client before. So we implement the DB API 2.0 standard and it feels like a Python package. So that includes things like, it's part of the centralized package manager, so you can just PIP install this right now and go start using it. We also have our client for Go length. 
So this is called vertica-sql-go. And this is a very similar story, just in a different context and a different programming language. So vertica-sql-go began as a collaboration with the Micro Focus SecOps Group, who builds Micro Focus' security products, some of which use vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. So most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for vertica data. And Go is another programming language rising in popularity, 'cause it offers an interesting balance of different programming design trade-offs. So it's got good performance, good concurrency and memory safety. And we liked all those things and we're using it to power some internal monitoring stuff of our own. And here's an example of the code you can write with this client. So this is Go code that does a similar thing. It opens a connection, it runs a little test query, and then it iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it the way you usually package things with Go, by running that command there to acquire this package. And it's important to note here for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourselves, and you can collaborate directly with our engineering team and the other vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality compared to the core vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases with 40 customer reported issues filed on GitHub. That was done over 78 different pull requests and with lots of community engagement as we did so. So lots of people are using this already, as our GitHub badges show, with about 5000 downloads of this a day of people using it in their software. And again, we want to make this easy, not just to use but also to contribute and understand and collaborate with us. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this and we're really excited about where it can go in the future. 'Cause this offers some exciting opportunities for us to collaborate with you more directly than we have ever before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge and implementation details and various best practices. And so maybe you think, "Well, I don't use Python, "I don't use Go, so maybe it doesn't matter to me." But I would argue it really does matter.
Because even if you don't use these tools and languages, there's lots of amazing vertica developers out there who do. And these clients do act as low level building blocks for all kinds of different interesting tools, both in these Python and Go worlds, but also well beyond that. Because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these to understand exactly how that's the case and what you can do with these things. So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. So these database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs. So what does the programmer need to do to actually process those SQL queries? So these interfaces are specific to a particular language or a technology stack. But the use cases and the architectures and design patterns are largely the same between different languages. They all have a need to do some networking and connect and authenticate and create a session. They all need to be able to run queries and load some data and deal with problems and errors. And then they also have a lot of metadata and Type Mapping because you want to use these clients the way you use those programming languages. Which might be different than the way that vertica's data types and vertica's semantics work. So some of this client interfaces are truly standards. And they are robust enough in terms of what they design and call for to support a truly pluggable driver model. Where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. So most of these interfaces aren't as robust as a JDBC or ODBC but that's okay. 'Cause it's good as a standard is, every database is unique for a reason. And so you can't really expose all of those unique properties of a database through these standard interfaces. So vertica's unique in that it can scale to the petabytes and beyond. And you can run it anywhere in any environment, whether it's on-prem or on clouds. So surely there's something about vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need and common patterns that arise to solve these problems in similar ways. When there isn't enough of a standard to define those comments, semantics that different databases might have in common, what you often see is tools will invent plug in layers or glue code to compensate by defining application wide standard to cover some of these same semantics. Later on, we'll get into some of those details and show off what exactly that means. So if you connect to a vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls and some client library or tool. 
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask vertica to do the heavy lifting required for that particular API call. And so these API's usually do the same kinds of things although some of the details might differ between these different interfaces. But you do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing. Here's an example from vertica-python, which just goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There's actually a lot of moving parts in what happens during a connection. So let's walk through some of that and see what actually goes on. I might have my API call like this where I say Connect and I give it a DNS name, which is my entire cluster. And I give you my connection details, my username and password. And I tell the Python Client to get me a session, give me a connection so I can start doing some work. Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by pressing the connection string. and vertica being a distributed system, we want to provide high availability, so we might need to do some DNS look-ups to resolve that DNS name which might be an entire cluster and not just a single machine. So that you don't have to change your connection string every time you add or remove nodes to the database. So we do some high availability and DNS lookup stuff. And then once we connect, we might do Load Balancing too, to balance the connections across the different initiator nodes in the cluster, or in a sub cluster, as needed. Once we land on the node we want to be at, we might do some TLS to secure our connections. And vertica supports the industry standard TLS protocols, so this looks pretty familiar for everyone who've used TLS anywhere before. So you're going to do a certificate exchange and the client might send the server certificate too, and then you going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection, and secured it, then you can start actually beginning to request a session within vertica. So you going to send over your user information like, "Here's my username, "here's the database I want to connect to." You might send some information about your application like a session label, so that you can differentiate on the database with monitoring queries, what the different connections are and what their purpose is. And then you might also send over some session settings to do things like auto commit, to change the state of your session for the duration of this connection. So that you don't have to remember to do that with every query that you have. Once you've asked vertica for a session, before vertica will give you one, it has to authenticate you. and vertica has lots of different authentication mechanisms. So there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are, where you're coming from on the network. And then you'll do an auth-specific exchange depending on what the auth mechanism calls for until you are authenticated. 
Finally, vertica trusts you and lets you in, so you're going to establish a session in vertica, and you might do some note keeping on the client side just to know what happened. So you might log some information, you might record what the version of the database is, you might do some protocol feature negotiation. So if you connect to a version of the database that doesn't support all these protocols, you might decide to turn some functionality off and that sort of thing. But finally, after all that, you can return from this API call and then your connection is good to go. So that connection is just one example of many different APIs. And we're excited here because with vertica-python we're really opening up the vertica client wire protocol for the first time. And so if you're a low level vertica developer and have used Postgres before, you might know that some of vertica's client protocol is derived from Postgres. But they do differ in many significant ways. And this is the first time we've ever revealed those details about how it works and why. So not all Postgres protocol features work with vertica, because vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over. Whereas vertica doesn't really have very wide data values; you have varchars, you have long varchars, but that's about as wide as you can get. Similarly, the vertica protocol supports lots of features not present in Postgres. So Load Balancing, for example, which we just went through an example of: Postgres is a single node system, it doesn't really make sense for Postgres to have Load Balancing. But Load Balancing is really important for vertica because it is a distributed system. Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there but we'll get there soon. So this is really cool, 'cause not only do you now have a Python client implementation, and you have a Go client implementation of this, but you can use this protocol reference to do lots of other things, too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so if you need to. But beyond clients, it's also used for other things. So you might use it for mocking and testing things. So rather than connecting to a real vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. So Uber, for example, this blog here in this link tells a great story of how they route different queries to different vertica clusters by intercepting these protocol messages, parsing the queries in them and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
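As a rough illustration of how those connection steps surface on the client side, here is a minimal sketch using vertica-python. The hosts, credentials, and table are placeholders, and option names like connection_load_balance and backup_server_node can vary by client version, so treat this as a sketch rather than a definitive reference.

```python
# A minimal sketch of a vertica-python session that exercises the ideas
# from the connection walkthrough: load balancing, a session label,
# a query to see which node we landed on, and a small COPY load.
# Host names, credentials, and the table are placeholders.
import vertica_python

conn_info = {
    'host': 'vertica-node1.example.com',
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'VMart',
    # Ask the server to balance this connection across initiator nodes.
    'connection_load_balance': True,
    # Fall back to other nodes if the first host is unreachable.
    'backup_server_node': ['vertica-node2.example.com',
                           'vertica-node3.example.com'],
    # Label the session so it is easy to spot in monitoring queries.
    'session_label': 'ecosystem-demo',
    'autocommit': True,
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Which node did load balancing land us on?
    cur.execute("SELECT node_name FROM current_session")
    print(cur.fetchone()[0])

    # Load a couple of rows with a COPY statement streamed from the client.
    cur.copy("COPY demo_events (id, label) FROM STDIN DELIMITER ','",
             "1,hello\n2,world\n")
```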
And so we're very interested in hearing your ideas and requests and we're happy to offer advice and collaborate on building some of these things together. So let's take a look now at some of the things we've already built that do these things. So here's a picture of vertica's Grafana connector with some data powered from an example that we have in this blog link here. So this has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka which then gets loaded into vertica. And then finally, it gets visualized nicely here with Grafana. And Grafana's visualizations make it really easy to analyze the data with your eyes and see when something something happens. So in these highlighted sections here, you notice a drop in some of the activity, that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself. So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to do correctly. You got time zones, daylight savings, leap seconds, negative infinity timestamps, please don't ever use those. In every system, if it wasn't hard enough, just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries from Vertica to Grafana's back end architecture, which is implemented in Go on it's front end, which is implemented with JavaScript? Well, you read this from bottom up in terms of the processing. First, you select the timestamp and Vertica is timestamp has to be converted to a Go time object. And we have to reconcile the differences that there might be as we translate it. So Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. So that's not too big of a deal when you're querying data because you just see some extra zeros, not fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's into the Go process, it has to be converted further to render in the JavaScript UI. So that there, the Go time object has to be converted to a JavaScript Angular JS Date object. And there too, we have to reconcile those differences. So a lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date into a more human readable format, like we've done in this example here. Here's another picture. This is another picture of some time series data, and this one shows you can actually write your own queries with Grafana to provide answers. So if you look closely here you can see there's actually some functions that might not look too familiar with you if you know vertica's functions. Vertica doesn't have a dollar underscore underscore time function or a time filter function. So what's actually happening there? How does this actually provide an answer if it's not really real vertica syntax? Well, it's not sufficient to just know how to manipulate data, it's also really important that you know how to operate with metadata. So information about how the data works in the data source, Vertica in this case. 
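The same kind of reconciliation shows up if you handle Vertica timestamps in Python, which, like Vertica, stops at microsecond precision. This small sketch uses only the standard library; the sample value is made up and assumes a "+HH:MM" UTC offset.

```python
# A small sketch of the precision reconciliation described above.
# Python's datetime, like Vertica, carries microsecond precision, so a
# nanosecond-precision reading has to be truncated on the way in.
from datetime import datetime

raw = "2020-03-31T12:34:56.123456789+00:00"   # nine fractional digits

base, rest = raw.split(".")                   # "...56", "123456789+00:00"
frac, _, offset = rest.partition("+")         # "123456789", "", "00:00"
truncated = f"{base}.{frac[:6]}+{offset}"     # keep only microseconds

ts = datetime.fromisoformat(truncated)
assert ts.tzinfo is not None                  # timezone-aware, like TIMESTAMPTZ
print(ts, ts.microsecond)                     # ... 123456
```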
So Grafana needs to know how time works in detail for each data source, beyond doing the basic I/O that we just saw in the previous example. So it needs to know: how do you connect to the data source to get some time data? How do you know what time data types and functions there are and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you know which tables have time columns that might be worth rendering in this kind of UI? So Go's standard database interface doesn't actually offer many metadata hooks. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer whereby every data source can implement the hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points, with JavaScript in the front end plugin and Go in the back end plugin, to provide vertica connectivity inside Grafana. So the way this works is that the plugin framework defines those standardizing functions, like time and time filter, and it's our plugin that's going to rewrite them in terms of vertica syntax. So in this example, time gets rewritten to a vertica cast, and time filter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica. So let's now look at some other examples and reference architectures that we have out on our GitHub page. For some advanced integrations, there's clearly a need to go beyond these standards. So SQL and the surrounding standards, like JDBC and ODBC, were really critical in the early days of Vertica, because they enabled a lot of generic database tools. And those will always continue to play a really important role, but the Big Data technology space moves a lot faster than these old database standards can keep up with. So there's all kinds of new advanced analytics and query pushdown logic that was never possible 10 or 20 years ago, that Vertica can do natively. There are also all kinds of data-oriented application workflows doing things like streaming data, Parallel Loading, or Machine Learning. And we need to build software for all of these things, but we don't really have standards to go by. So what do we do there? Well, open source implementations make for easier integrations and applications all over the place. So even if you're not using Grafana, for example, other tools have similar challenges that you need to overcome, and it helps to have an example there to show you how to do it. Take Machine Learning, for example. There have been many excellent Machine Learning tools that have arisen over the years to make data science and the task of Machine Learning a lot easier. And a lot of those have basic database connectivity, but they generally only treat the database as a source of data. So they do lots of data I/O to extract data from a database like Vertica for processing in some other engine. We all know that's not the most efficient way to do it. It's much better if you can leverage Vertica's scale and bring the processing to the data. So a lot of these tools don't take full advantage of Vertica because there's not really a uniform way to do so with these standards.
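The macro rewriting just described can be illustrated with a toy sketch. The real Vertica-Grafana plugin does this in its Go back end; this Python version only mimics the idea of turning Grafana's $__time and $__timeFilter macros into plain Vertica syntax, and the table, column, and time range are made up.

```python
# Illustrative only: mimics the plugin's rewriting of Grafana time macros
# into Vertica SQL; table, column, and time range are placeholders.
def rewrite_grafana_macros(sql, time_col, t_from, t_to):
    """Rewrite Grafana time macros into Vertica syntax."""
    # $__time(col) becomes a cast aliased as "time".
    sql = sql.replace("$__time(%s)" % time_col,
                      "%s::TIMESTAMP AS time" % time_col)
    # $__timeFilter(col) becomes a BETWEEN predicate over the panel's range.
    sql = sql.replace("$__timeFilter(%s)" % time_col,
                      "%s BETWEEN '%s' AND '%s'" % (time_col, t_from, t_to))
    return sql

panel_query = ("SELECT $__time(flow_start), COUNT(*) "
               "FROM unique_flows WHERE $__timeFilter(flow_start) GROUP BY 1")
print(rewrite_grafana_macros(panel_query, "flow_start",
                             "2020-03-01 00:00:00", "2020-03-08 00:00:00"))
```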
So instead, we have a project called vertica-ml-python, and this serves as a reference architecture for how you can do scalable machine learning with Vertica. So this project establishes a familiar machine learning workflow that scales with vertica. It feels similar to a scikit-learn project, except all the processing, aggregation, and heavy lifting on the data happens in vertica. So this makes for a much more lightweight, scalable approach than you might otherwise be used to. So you can probably use vertica-ml-python yourself, but you can also see how it works. So if it doesn't meet all your needs, you can still see the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. This is an older GitHub project, we've actually had it for a couple of years, but it is really useful and important, so I wanted to plug it here. Our User Defined eXtensions framework, or UDX, allows you to extend the operators that vertica executes when it does a database load or a database query. So with UDXs, you can write your own domain logic in C++, Java, Python, or R, and you can call it within the context of a SQL query. And vertica brings your logic to the data, and makes it fast, scalable, fault tolerant, and correct for you. So you don't have to worry about all those hard problems. So our UDX examples demonstrate how you can use our SDK to solve interesting problems, and some of these examples might be complete, fully usable packages or libraries. So for example, we have a curl source that allows you to extract data from any curlable endpoint and load it into vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a vertica query, and all kinds of parsers and string processors and things like that. We also have more exciting and interesting things where you might not really think of vertica as being able to do that, like a heat map generator, which takes some XY coordinates and renders them on top of an image to show you the hotspots in it. So the image on the right was actually generated from one of our intern gaming sessions a few years back. So all these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or that are unique to your business and your needs. Another exciting benefit is with testing. So the test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone that's ever built a vertica integration or application probably has a need to write some integration tests, and that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic, and easy to use. So we've automated the download process and the installation and deployment process of a Vertica Community Edition. And with a single click, you can run through the tests locally and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we're running through all of them very quickly with ease, thanks to all this automation.
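In the spirit of the test approach described above, here is a minimal integration-test sketch. It assumes a Vertica Community Edition instance is already running locally (for example, one installed by the automated setup scripts), and the connection settings are placeholders.

```python
# A minimal integration-test sketch against a locally running Vertica CE;
# connection settings are placeholders.
import vertica_python

TEST_CONN = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='testdb')   # placeholder settings

def test_insert_and_select_round_trip():
    with vertica_python.connect(**TEST_CONN) as conn:
        cur = conn.cursor()
        cur.execute("DROP TABLE IF EXISTS t_roundtrip")
        cur.execute("CREATE TABLE t_roundtrip (i INT, s VARCHAR(32))")
        cur.execute("INSERT INTO t_roundtrip VALUES (1, 'hello')")
        conn.commit()
        cur.execute("SELECT i, s FROM t_roundtrip ORDER BY i")
        # vertica-python returns each row as a list by default.
        assert cur.fetchall() == [[1, 'hello']]
```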
So today, you can see how we do it in vertica-python; in the future, we might want to spin that out into its own stand-alone testbed starter project, so that if you're starting any new vertica integration, this might be a good starting point for you to get going quickly. So that brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. So in terms of the client stuff, for Python, we are marching towards our 1.0 release, which is when we aim to be protocol complete, to support all of vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is our new feature in vertica 10. We have some cursor enhancements to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it, so send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced auths, and other things. But there too, we want to work with you and focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's this question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or robust vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM or a vendor that resells vertica packaged as a black box piece of a larger solution, you might have to manage the whole operational lifecycle of vertica. There are even fewer standards for doing all these different things compared to the SQL clients. So we started with the SQL clients 'cause that's a well established pattern and there's lots of downstream work that it can enable. But there's also clearly a need for lots of other open source protocols, architectures, and examples to show you how to do these things, even where real standards don't exist yet. So we talked a little bit about how you could do UDXs or testing or Machine Learning, but there are all sorts of other use cases too. That's why we're excited to announce here awesome-vertica, which is a new collection of open source resources available on our GitHub page. So if you haven't heard of this awesome manifesto before, I highly recommend you check out the GitHub page on the right. We're not unique here; there are lots of awesome projects for all kinds of different tools and systems out there. And it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this itself is an open source project, an open source wiki, and you can contribute to it by submitting a PR yourself. So we've seeded it with some of our favorite tools and projects out there, but there's plenty more out there and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.
Model Management and Data Preparation
>> Sue: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica, Data Preparation and Model Management. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Waqas Dhillon. He's part of the Vertica Product Management Team at Vertica. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica Forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, and yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Waqas, over to you. >> Waqas: Thank you, Sue. Hi, everyone. My name is Waqas Dhillon and I'm a Product Manager here at Vertica. So today, we're going to go through data preparation and model management in Vertica. The session will essentially start with some introduction and go through some of the considerations when you're doing machine learning at scale. After that, we have two main sections here. The first one is on data preparation, where we'll go through what data preparation is, what the Vertica functions for data exploration and data preparation are, and then share an example with you. Similarly, in the second part of this talk we'll go through importing and exporting models using PMML and how that works with Vertica, and we'll share examples of that, as well. So yeah, let's dive right in. So, Vertica essentially is an open architecture with a rich ecosystem. You have a lot of options for data transformation and ingesting data from different tools, and then you also have options for connecting through ODBC, JDBC, and some other connectors to BI and visualization tools. There are a lot of them that Vertica connects to, and in the middle sits Vertica, which you can run on external tables or with in-place analytics, in the cloud or on prem, so that choice is yours. But essentially what it does is offer you a lot of options for performing your data analytics at scale, and within that data analytics, machine learning is also a core component, and you have a lot of options and functions for it. Now, machine learning in Vertica is actually built on top of the architecture that distributed data analytics offers, so it inherits a lot of those capabilities and builds on top of them: you eliminate the data transfer overhead when you're working with Vertica machine learning, you keep your data secure, and storing and managing the models is really easy and much more efficient.
You can serve a lot of concurrent users all at the same time, and then it's really scalable and avoids the maintenance cost of a separate system, so essentially there are a lot of benefits here. But one important thing to mention is that all the algorithms that you see, whether they're analytics functions, advanced analytics functions, or machine learning functions, are distributed not just across the cluster on different nodes: each node gets a distributed workload, and on each node, too, there might be multiple threads and multiple processes running for each of these functions. So it's a highly distributed solution and one of a kind in this space. So, when we talk about Vertica machine learning, it essentially covers the whole machine learning process, and we see it as something starting with data ingestion and doing data analysis and understanding, going through the steps of data preparation, modeling, evaluation, and finally deployment, as well. So, when you're using Vertica for machine learning, it takes care of all these steps and you can do all of that inside the Vertica database. But when we look at the three main pillars that Vertica machine learning aims to build on, the first one is to have Vertica as a platform for high performance machine learning. We have a lot of functions for data exploration and preparation, and we'll go through some of them here. We have distributed in-database algorithms for model training and prediction, we have scalable functions for model evaluation, and finally we have distributed scoring functions, as well. Doing all of this in the database is a really good thing, but we don't want it isolated in this space. We understand that a lot of our customers, our users, like to work with other tools alongside Vertica. So, they might use Vertica for data prep and another tool for model training, or use Vertica for model training and take those models out to other tools and do prediction there. So, integration is a really important part of our overall offering, and it's a pretty flexible system. We have been offering UDx in four languages, and a lot of people have found value there over the past few years, but the new capability of importing PMML models for in-database scoring, and exporting Vertica native models for external scoring, is something that we have recently added. Another talk will actually go through the TensorFlow integration, a really exciting and important milestone where you can bring TensorFlow models into Vertica for in-database scoring. For this talk, we'll focus on data exploration and preparation, importing PMML, and exporting PMML models, and finally, since Vertica is not just a query engine but also a data store, we have a lot of really good capability for model storage and management, as well. So, yeah. Let's dive into the first part, on machine learning at scale. So, when we say machine learning at scale, there are actually a few really important considerations and they have their own implications. The first one is that we want to have speed, but we also want it to come at a reasonable cost. So, it's really important for us to pick the right scaling architecture. Secondly, it's not easy to move big data around.
It might be easy to do that on a smaller data set, on an Excel sheet, or something of the like, but once you're talking about big data and data analytics at really big scale, it's really not easy to move that data around from one tool to another. So what you'd want to do is bring the models to the data instead of having to move the data to the tools. And the third thing here is that sub-sampling can actually compromise your accuracy. A lot of tools out there still force you to take smaller samples of your data because they can only handle so much, but that can impact your accuracy, and the need here is that you should be able to work with all of your data. We'll just go through each of these really quickly. So, the first factor here is scalability. Now, if you want to scale your architecture, you have two main options. The first is vertical scaling. Let's say you have a machine, a server essentially, and you can keep on adding resources, like RAM and CPU, and keep increasing the performance as well as the capacity of that system, but there's a limit to what you can do here, and you can hit that limit in terms of cost as well as in terms of technology. Beyond a certain point, you will not be able to scale more. So, the right solution to follow here is actually horizontal scaling, in which you can keep on adding more instances to get more computing power and more capacity. So, essentially what you get with this architecture is a supercomputer, which stitches together several nodes, and the workload is distributed on each of those nodes for massively parallel processing and really fast speeds, as well. The second aspect, having big data and the difficulty of moving it around, can actually be clarified with this example. So, what usually happens, and this is a simplified version, is that you have a lot of applications and tools from which you might be collecting the data, and this data then goes into an analytics database. That database in turn might be connected to some BI tools, dashboards, and applications, with some ad-hoc queries being run on the database. Then, you want to do machine learning in this architecture. What usually happens is that you have your machine learning tools, and the data that is coming into the analytics database is actually being exported out to the machine learning tools. You're training your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction. The results that you get from those tools usually end up back in the analytics database, because you want to put them on a dashboard or power up some applications with them. So, there's essentially a lot of data overhead involved here. There are cons to that, including data governance, data movement, and other complications that you need to resolve. One of the possible solutions to overcome that difficulty is to have machine learning as part of the distributed analytical database, as well, so you get the benefits of having it applied to all of the data that's inside the database and not having to care about all of the data movement. But if there are some use cases where it still makes sense to at least train the models outside, you can do your data preparation inside the database, then take the prepared data out, build your model, and then bring the model back to the analytics database. In this case, we'll talk about Vertica.
So, the model would be archived and hosted by Vertica, and then you can keep on applying predictions to the new data that's incoming into the database. So, the third consideration here for machine learning at scale is sampling versus the full data set. As I mentioned, a lot of tools cannot handle big data and you are forced to sub-sample, but what happens, as you can see in the leftmost figure, figure A, is that if you have a single data point, essentially any model can explain it. But if you have more data points, as in figure B, there is a smaller number of models that can explain them, and in figure C, with even more data points, an even smaller number of models can explain them. Fewer here also means that these models would probably be more accurate, and the objective of building machine learning models is mostly to have prediction and generalization capability on unseen data, so if you build a model that's accurate on one data point, it would not have very good generalization capability. The conventional wisdom with machine learning is that the more data points you have for learning, the better and more accurate the models you'll get out of your machine learning process. So, you need to pick a tool which can handle all of your data and does not force you to sub-sample it, and with all the data, even a simpler model might do much better than a more complex model. So, yeah. Let's go to the data exploration and data preparation part. Vertica's a really powerful tool and it offers a lot of scalability in this space, and as I mentioned, it supports the whole process. You can define the problem, gather your data, and construct your data set inside Vertica, and then continue with data preparation, training, modeling, deployment, and managing the model. But data preparation is a really critical step in the overall machine learning process; some estimates put it at between 60 and 80% of the overall effort of a machine learning project. So, there are a lot of functions here you can use as part of Vertica: data exploration, de-duplication, outlier detection, balancing, normalization, and potentially a lot more. You can actually go to our Vertica documentation and find them there. Within Vertica we divide them into two parts: within data prep, one is exploration functions, the second is transformation functions. Within exploration, you have a rich set of functions that you can use in-DB, and then if you want to build your own you can use the UDX to do that. Similarly, for transformation there are a lot of functions around time series, pattern matching, and outlier detection that you can use to transform the data, and this is just a snapshot of some of those functions that are available in Vertica right now. And again, the good thing about these functions is not just their presence in the database. The good thing is actually their ability to scale on really, really large data sets and to compute those results for you on that data in an acceptable amount of time, which makes your machine learning process really practical. So, let's go to an example and see how we can use some of these functions. As I mentioned, there's a whole lot of them and we'll not be able to go through all of them, but just for our understanding we can go through some of them and see how they work. So, we have here a sample data set of network flows. It simulates attacks from some source nodes, and then there are some victim nodes on which these attacks are happening.
So yeah, let's just look at the data here real quick. We'll load the data, we'll browse the data, compute some statistics around it, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction, per se, which is what we mostly do with machine learning algorithms, but to just go through the data prep process and see how easy it is to do that with Vertica and what kind of options might be there to help you through that process. So, the first step is loading the data. Since in this case we know the structure of the data, we create a table with the column names and data types, but let's say you have a data set for which you do not already know the structure; there's a really cool feature in Vertica called flex tables and you can use that to initially import the data into the database and then go through all of the variables and assign them variable types. You can also use that if your data is dynamic and changing, to load the data first and then create these definitions. So once we've done that, we load the data into the database. It's one week of data out of the whole data set right now. Once that's done, we'd like to look at the flows just to see how the data looks, and when we do a select star from flows with a limit here, we see that there's already some data duplication, and by duplication I mean rows which have the exact same data in each of the columns. So, as part of the cleaning process, the first thing we'd want to do is probably to remove that duplication. So, we create a table with the distinct flows, and you can see here we have about a million flows which are unique. So, moving on. The next step we want to do here: this is essentially time series data and these times span days of the week, so we want to look at the trends in this data. So, the network traffic that's there, you can call it flows. Based on the hours of the day, how does the traffic move and how does it differ from one day to another? It's part of an exploration process. There might be a lot of further exploration that you want to do, but we can start with this one and see how it goes, and you can see in the graph here that we have seven days of data, and the weekend traffic, which is in pink and purple here, seems a little different from the rest of the days. Pretty close to each other, but yeah, definitely something we can look into and see if there's a real difference and if there's something we want to explore further here. But the thing is that this is just data for one week, as I mentioned. What if we load data for 70 days? You'd have a longer graph, probably, but a lot of lines, and you would not really be able to make sense out of that data. It would be a really crowded plot, so we have to come up with a better way to explore that, and we'll come back to that in a little bit. So, what are some other things that we can do? We can get some statistics; we can take one sample flow and look at some of the values here. We see that the forward column and the ToS column have zero values, and when we explore further we see that there are a lot of records for which these columns are essentially zero, so they're probably not really helpful for our use case. Then, we can look at the flow end.
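Here is a sketch of the load-and-dedup step just described, issued through the Python client. The column layout is a guess at a NetFlow-style schema and the file path is a placeholder; it is not the exact table used in the demo.

```python
# Load one week of flow data and keep only distinct rows; schema, path,
# and connection settings are placeholders.
import vertica_python

conn_info = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='analytics')   # placeholder settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS flows (
            flow_start TIMESTAMP, flow_end TIMESTAMP, duration FLOAT,
            protocol VARCHAR(8), src_addr VARCHAR(64), src_port INT,
            dst_addr VARCHAR(64), dst_port INT, label VARCHAR(64))
    """)
    # Bulk-load the week of data; the path must be visible to the server.
    cur.execute("COPY flows FROM '/data/flows_week1.csv' DELIMITER ','")
    # Keep only distinct rows to drop the exact duplicates observed above.
    cur.execute("CREATE TABLE IF NOT EXISTS unique_flows AS "
                "SELECT DISTINCT * FROM flows")
    conn.commit()
```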
So, flow end is the end time when the last packet in a flow was sent, and you can do a select of the min and max of flow end to see when the data starts and ends, and you can see it's about one week of data, from the first to the eighth. Now, we also want to look at whether the data is balanced or not, because balanced data is really important for a lot of the classification use cases we want to try with this. When you look at source address, destination address, source port, and destination port, you see the data is highly imbalanced, and so is the source versus destination address space, so that's probably something we need to address. There are really powerful balancing functions that you can use within Vertica, with under-sampling, over-sampling, or hybrid sampling, and that can be really useful here. Another thing we can look at is statistics on these columns, so on the unique flows table that we created we just use the SUMMARIZE_NUMCOL function in Vertica and it gives us a lot of really useful statistics and percentile information. Now, if we look at the duration, which is the last record here, we can see that the mean is about 4.6 seconds, but when we look at the percentile information, we see that the median is about 0.27. So, there are a lot of short flows that have a duration of less than 0.27 seconds. Yes, there would be longer ones that bring the mean up to the 4.6 value, but the number of short flows is pretty high. We can ask some other questions of the data about the features. We can look at the protocols here and look at the count. We see that most of the traffic we have is TCP and UDP, which is sort of expected for a data set like this, and then we want to look at what the most popular network services are. So again, a simple query here, select destination port and count, and we get the destination port and count for each. We can see that most of the traffic here is web traffic, HTTP and HTTPS, followed by domain name resolution. So, let's explore some more. We can look at the label distributions. We see the labels that come with the data, because this is essentially data for which we already know whether a record was an anomaly or not, and we can build our algorithm based on that. So, we see that there's this background label with a lot of records, and then anomaly spam seems to be really high. There are anomaly UDP scans and SSH scans, as well. So, another question we can ask is how the labels are distributed among the SMTP flows, and we can see that anomaly spam is highest, and then comes the background spam. So, can we say from this that SMTP flows are spam, and maybe build a model that actually answers that question for us? That can be one machine learning model that you build out of this data set. Again, we can also verify the destination port of flows that were labeled as spam. You would expect port 25 for the SMTP service here, and we can see that SMTP with destination port 25 has a lot of counts, but there are some other destination ports for which the count is really low, and essentially, when we're doing an analysis at this scale, those data points might not really be needed. So, as part of the data prep slash data cleaning, we might want to get rid of those records. So now, what we can do is go back to the graph that I showed earlier and try to plot the daily trends by aggregating them.
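Two of the calls mentioned above can be sketched through the Python client as follows. The column and table names follow the earlier placeholder schema, and the exact parameter spellings of the balancing function should be checked against the documentation for your Vertica version.

```python
# Column statistics and label rebalancing; names and settings are placeholders.
import vertica_python

conn_info = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='analytics')   # placeholder settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Per-column statistics (count, mean, percentiles, ...) for numeric columns.
    cur.execute("SELECT SUMMARIZE_NUMCOL(duration, src_port, dst_port) OVER() "
                "FROM unique_flows")
    for row in cur.fetchall():
        print(row)

    # Rebalance the highly skewed label column into a view for later training.
    cur.execute("SELECT BALANCE('balanced_flows', 'unique_flows', "
                "'label', 'under_sampling')")
    print(cur.fetchall())
```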
Again, we take the unique flows and convert them into flow counts, a manageable number of data points that we can then feed into one of the algorithms. Now, PCA, principal component analysis, is a really powerful algorithm in Vertica. What it essentially does is that, a lot of the time, when you have a high number of columns which might be highly correlated with each other, you can feed them into the PCA algorithm and it will give you a list of principal components which are linearly independent from each other. Now, each of these components explains a certain extent of the variance of the overall data set that you have. So, you can see here component one explains about 73.9% of the variance, and component two explains about 16% of the variance. So, those two components alone get you to around 90% of the variance. Now, you can use PCA for a lot of different purposes, but in this specific example, we want to see: if we combine all the data points that we have together, and we do that by day of the week, what sort of information can we get out of it? Is there any insight that this provides? Because once you have two data points, it's really easy to plot them. So, we just apply the PCA: we first fit it, and then apply it to our data set, and this is the graph we get as a result. Now, you can see component one is on the x axis here, component two on the y axis, and each of these points represents a day of the week. With just two components it's easy to plot, and compare this to the graph that we saw earlier, which had a lot of lines; the more weeks or days we added, the more lines we'd have, versus this graph in which you can clearly tell that the five days of traffic from Monday to Friday are closely clustered together, so probably pretty similar to each other, while Saturday traffic sits apart from all of these days and is also further away from Sunday. So, these two days of traffic are different from the other days of traffic, and we can always dive deeper into this, look at exactly what's happening, and see how this traffic is actually different. But with just a few functions and some pretty simple SQL queries, we were already able to get a pretty good insight from the data set that we had. Now, let's move on to the next part of this talk, on importing and exporting PMML models to and from Vertica. So, the current common practice is that when you're putting your machine learning models into production, you'd have a dev or test environment, and in that you might be using a lot of different tools, scikit-learn, Spark, R, and once you want to deploy these models into production, you'd put them into containers, and there would be a pool of containers in the production environment talking to your database, which could be your analytical database, and all of the new incoming data would be coming into the database itself. So, as I mentioned in one of the slides earlier, there is a lot of data transfer happening between that pool of containers hosting your trained machine learning models and the database, from which you'd be getting data for scoring and then sending the scores back. So, why would you really need to transfer your models? The thing is that no machine learning platform provides everything.
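A sketch of that fit-then-apply PCA step is shown below. The daily_counts relation (one row per day, one numeric column per hourly traffic bucket) is assumed to have been built already and is named here only for illustration; the parameter names should be checked against the documentation.

```python
# Fit a PCA model on the aggregated daily counts and project each day onto
# the first two components; names and settings are placeholders.
import vertica_python

conn_info = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='analytics')   # placeholder settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Fit a PCA model on every numeric column except the day label.
    cur.execute("""
        SELECT PCA('flows_pca', 'daily_counts', '*'
                   USING PARAMETERS exclude_columns='day_of_week')
    """)

    # Project each day onto the first two components for easy 2D plotting.
    cur.execute("""
        SELECT APPLY_PCA(* USING PARAMETERS model_name='flows_pca',
                           key_columns='day_of_week',
                           exclude_columns='day_of_week',
                           num_components=2)
               OVER() FROM daily_counts
    """)
    for day, comp1, comp2 in cur.fetchall():
        print(day, comp1, comp2)
```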
One tool might have some really cool algorithms, but then Spark might have its own benefits in terms of some additional algorithms or some other capability that you're looking for, and that's the reason why a lot of these tools might be used in the same company at the same time. Then there might be some functional considerations, as well. You might want to isolate your data between the data science team and your production environment, and you might want to score your pre-trained models on some edge nodes, where you probably cannot host a big solution. So there are a whole lot of use cases where model movement, or model transfer from one tool to another, makes sense. Now, one of the common methods for transferring models from one tool to another is the PMML standard. It's an XML-based model exchange format, sort of a standard way to define statistical and data mining models, and it helps you share models between the different applications that are PMML compliant. It's a really popular standard, and that's the format of choice that we have for moving models to and from Vertica. Now, along with this model movement capability, there are a lot of model management capabilities that Vertica offers. Models are essentially first class citizens of Vertica. What that means is that each model is associated with a DB schema, so the user that initially creates a model is the owner of it, but they can transfer the ownership to other users and work with the ownership rights in the same way you would work with any other relation in a database. So, the same kinds of commands that you use for granting access to a relation, changing its owner, changing its name, or dropping it, you can use for models, as well. There are a lot of functions for exploring the contents of models, and that really helps in putting these models into production. The metadata of these models is also available for model management and governance, and finally, the import/export part enables you to apply all of these operations to models that you have imported or might want to export while they're in the database. I think it would be nice to actually go through an example to showcase some of these capabilities of our model management, including the PMML model import and export. So, the workflow for export would be that we train a logistic regression model on some data and save it as an in-DB Vertica model. Then, we'll explore the summary and attributes of the model, look at what's inside it, what the training parameters and coefficients are, and then we can export the model as PMML so an external tool can import it. And similarly, we'll go through an example for import. We'll have an external PMML model trained outside of Vertica, we'll import that PMML model, and from there on, essentially, we'll treat it as an in-DB PMML model. We'll explore the summary and attributes of the model in much the same way as an in-DB model, we'll apply the model for in-DB scoring and get the prediction results, and finally, we'll bring in some test data for which the scoring needs to be done and use the model on it. So first, we want to create a connection with the database. In this case, we are using a Python Jupyter Notebook.
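As a sketch of the "models as first-class objects" idea described above, the same kind of ownership and access statements used for tables can be issued for models. The user names below are hypothetical, and the exact privilege keywords for models should be verified against the documentation for your Vertica version.

```python
# Model ownership, renaming, and metadata lookup; names and settings are
# placeholders, and privilege keywords should be double-checked.
import vertica_python

conn_info = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='analytics')   # placeholder settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("ALTER MODEL myModel RENAME TO flow_spam_model")
    cur.execute("ALTER MODEL flow_spam_model OWNER TO data_science_user")
    cur.execute("GRANT USAGE ON MODEL flow_spam_model TO dashboard_user")
    # The models system table exposes the metadata used for governance.
    cur.execute("SELECT model_name, owner_name, category, model_type FROM models")
    print(cur.fetchall())
```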
We have the Vertica Python connector here that you can use, a really powerful connector that allows you to do a lot of cool stuff with the database using the Jupyter front end, but essentially, you can use any other SQL front end tool or, for that matter, any other Python IDE which lets you connect to the database. So, exporting a model. First, we'll create a logistic regression model here: select logistic regression, we give it a model name, then the input relation, which might be a table, temp table, or a view, then the response column and the predictor columns. So, we get a logistic regression model that we built. Now, we look at the models table and see that the model has been created. This is a table in Vertica that contains a list of all the models that are in the database. We can see here that myModel, which we just created, has Vertica models as its category, the model type is logistic regression, and we have some other metadata around this model, as well. So now, we can look at some of the summary statistics of the model. We can look at the details, which give us the predictors, coefficients, standard error, Z value, and P value. We can look at the regularization parameters; we didn't use any, so that shows a value of one, but if you had, it would show up here, along with the call string and additional information regarding iteration count, rejected row count, and accepted row count. Now, we can also look at the list of attributes of the model: select get model attribute using parameters, model name equals myModel. For this particular model that we just created, it gives us the names of all the attributes that are there. Similarly, you can look at the coefficients of the model in a column format. So, using parameters model name myModel, and in this case we add attribute name equals details because we want all the details for that particular model, and we get the predictor name, coefficient, standard error, Z value, and P value here. So now, what we can do is export this model. We use the select export models statement and give it a path to where we want the model to be exported. We give it the name of the model that needs to be exported, because you might essentially have a lot of models that you have created, and you give it the category here, which in our example is PMML, and you get a status message that the export was successful. So now, let's move on to the importing models example. In much the same way that we created a model in Vertica and exported it out, you might want to create a model outside of Vertica in another tool and then bring it into Vertica for scoring, because Vertica contains all of the actual data and it might make sense to host that model in Vertica, because scoring happens a lot more frequently than model training. So, in this particular case we do a select import models, and we are importing a logistic regression model that was created in Spark. The category here is again PMML. So, we get the status message that the import was successful. Now, let's look at the models table and see that the model is really present there. Previously when we ran this query we had only myModel there, so that was the only entry you saw, but now that this model is imported, you can see it as line item number two here, the Spark logistic regression, in the public schema.
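The train, inspect, export, and import sequence just described can be sketched through the Python client as follows. The table, columns, and file system paths are placeholders rather than the exact ones used in the session.

```python
# Train an in-DB logistic regression, inspect it, export it as PMML, and
# import an externally trained PMML model; names and paths are placeholders.
import vertica_python

conn_info = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='analytics')   # placeholder settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Train an in-database logistic regression model.
    cur.execute("""
        SELECT LOGISTIC_REG('myModel', 'training_table', 'y', 'x1, x2, x3')
    """)

    # Inspect it: overall summary, then the per-predictor details.
    cur.execute("SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myModel')")
    print(cur.fetchone()[0])
    cur.execute("""
        SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myModel',
                                   attr_name='details')
    """)
    print(cur.fetchall())

    # Export it as PMML; the directory must be writable on the server side.
    cur.execute("SELECT EXPORT_MODELS('/tmp/exported_models', 'myModel' "
                "USING PARAMETERS category='PMML')")

    # Import a PMML model trained elsewhere (for example, in Spark).
    cur.execute("SELECT IMPORT_MODELS('/tmp/spark_logistic_reg' "
                "USING PARAMETERS category='PMML')")
```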
The category here, however, is different, because it's not a native Vertica model but rather an imported model, so you get PMML here, and then other metadata regarding the model, as well. Now, let's do some of the same operations that we did with the in-DB model. We can look at the summary of the imported PMML model, and you can see the function name, data fields, predictors, and some additional information here. Moving on, let's look at the attributes of the PMML model: select get model attribute, essentially the same query that we applied earlier, the only difference being the model name. So, you get the attribute names, attribute fields, and number of rows. We can also look at the coefficients of the PMML model: name, exponent, and coefficient here. So yeah, pretty much similar to what you can do with an in-DB model. You can also perform all these operations on an imported model, and one additional thing we'd want to do here is to use this imported model for prediction. So in this case, we do a select predict PMML and give it some values, using parameters model name, the Spark logistic regression, and match by position, which is a really cool feature, set to true in this case. So, if you have a model being imported from another platform in which, let's say, you have 50 columns, the names of the columns in the environment in which you trained the model might be slightly different from the names of the columns that you have set up in Vertica, but as long as the order is the same, Vertica can actually match those columns by position and you don't need to have the exact same names for those columns. So in this case, we have set that to true, and we see that predict PMML gives us a status of one. Now, using the imported model, in this case we had given it a specific value, but you can use it on a table, as well. In that case, you get the predictions here and you can look at the evaluation metrics to see how well you did. Now, just sort of wrapping this up, it's really important to know the distinction between using your models in any single node tool that you might already be using, like Python or R, versus Vertica. What happens is, let's say you build a model in Python; it might be a single node solution. Now, after building that model, let's say you want to do prediction on really large amounts of data, and you don't want the overhead of moving that data out of the database every time you want to do a prediction. So, what you can do is import that model into Vertica, but what Vertica does differently than Python is that the PMML model would actually be distributed across each node in the cluster, so it would be applied on the data segments in each of those nodes, and there might be multiple threads running for that prediction. So, the prediction speed that you get here would be much, much faster. Similarly, once you build a model for machine learning in Vertica, the objective mostly is to use all of your data and build a model that's accurate, not one that's just using a sample of the data, but using all the data that's available to it, essentially. So, you can build that model, and the model building process would go through the same technique: it would actually be distributed across all nodes in the cluster, and it would use all the threads and processes available to it within those nodes.
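Scoring with the imported PMML model, including the match-by-position option described above, can be sketched like this. The test_data table and its columns are placeholders for whatever the imported model actually expects.

```python
# Score a table with the imported PMML model; match_by_pos lets Vertica
# match input columns by position instead of by name. Names are placeholders.
import vertica_python

conn_info = dict(host='localhost', port=5433, user='dbadmin',
                 password='', database='analytics')   # placeholder settings

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("""
        SELECT id,
               PREDICT_PMML(x1, x2, x3
                            USING PARAMETERS model_name='spark_logistic_reg',
                                             match_by_pos=true) AS prediction
        FROM test_data
    """)
    for row in cur.fetchmany(5):
        print(row)
```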
So, really fast model training. But let's say you wanted to deploy it on an edge node and maybe do prediction closer to where the data is being generated; you can export that model in PMML format and deploy it on the edge node. So, it's really helpful for a lot of use cases. And just some closing takeaways from our discussion today. So, Vertica's a really powerful tool for machine learning, for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or only some of them; either way, Vertica supports both approaches. In the upcoming releases, we are planning to have more import and export capability through PMML models. Initially, we're supporting kmeans, linear, and logistic regression, but we keep adding more algorithms, and the plan is to eventually move to supporting custom models. If you want to do that with the upcoming release, our TensorFlow integration is already there which you can use, but with PMML, this is the starting point for us and we will keep on improving it. Vertica models can be exported in PMML format for scoring on other platforms, and similarly, models that get built in other tools can be imported for in-DB machine learning and in-DB scoring within Vertica. There are a lot of critical model management tools provided in Vertica, and there are more on the roadmap, as well, which we will keep on developing. Many ML functions and algorithms are already part of the in-DB library, and we keep adding to that, as well. So, thank you so much for joining the discussion today, and if you have any questions we'd love to take them now. Back to you, Sue.
The Data-Driven Prognosis
>> Narrator: Hi, everyone, thanks for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Toward Zero Unplanned Downtime of Medical Imaging Systems Using Big Data. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Mauro Barbieri, lead architect of analytics at Philips. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Mauro, over to you. >> Thank you, good day everyone. So medical imaging systems such as MRI scanners, interventional guided therapy machines, CT scanners, and X-ray systems need to provide hospitals optimal clinical performance but also a predictable cost of ownership. So clinicians understand the need for maintenance of these devices, but they just want it to be non-intrusive and scheduled. And whenever there is a problem with the system, the hospital expects Philips service to resolve it fast, and at the first interaction with them. In this presentation you will see how we are using big data to increase the uptime of our medical imaging systems. I'm sure you have heard of the company Philips. Philips was founded 129 years ago, in 1891, in Eindhoven in the Netherlands, and they started by manufacturing light bulbs and other electrical products. The two brothers Gerard and Anton took an investment from their father Frederik, and they set out to manufacture and sell light bulbs. And as you may know, the key technologies for making light bulbs were glass and vacuum. So when you're good at making glass products, vacuum, and light bulbs, then it is an easy step to start making radio valves, like they did, but also X-ray tubes. So Philips actually entered the market of medical imaging and healthcare technology very early. And this is our core as a company, and it's also our future. So, healthcare: we are in a situation now in which everybody recognizes the importance of it. And we see incredible trends in a transition from what we call Volume Based Healthcare to Value Based, where the clinical outcomes are driving improvements in the healthcare domain. Where it's not enough to respond to healthcare challenges, we need to be involved in preventing and maintaining the population's wellness, and from a situation in which we are only episodically in touch with healthcare, we need to continuously monitor and continuously take care of populations. And from healthcare facilities and technology available to a few selected and rich countries, we want to make healthcare accessible to everybody throughout the world. And this, of course, poses incredible challenges.
And this is why we are transforming Philips to become a healthcare technology leader. So Philips has been a conglomerate active in many sectors and realizing many kinds of technologies, and we have been focusing on healthcare. And we have been transitioning from creating and selling products to making solutions that address healthcare challenges, and from selling boxes to creating long term relationships with our customers. And so, if you have known the Philips brand from shavers, from televisions to light bulbs, you probably now also recognize the involvement of Philips in the healthcare domain: in diagnostic imaging, in ultrasound, in image guided therapy systems, in digital pathology, non-invasive ventilation, as well as patient monitoring, intensive care, telemedicine, but also radiology, cardiology, and oncology informatics. Philips has become a powerhouse of healthcare technology. To give you an idea of this, these are the numbers from 2019: almost 20 billion in sales, 4% comparable sales growth with respect to the previous year, and about 10% of sales reinvested in R&D. This is also shown in the number of patent filings; last year we filed more than 1000 patents in the healthcare domain. And the company has about 80,000 employees, active globally in over 100 countries. So, let me focus now on the type of products that are in the scope of this presentation. This is a Philips magnetic resonance imaging scanner, the Ingenia 3.0 Tesla, and it is an incredible machine. Apart from being very beautiful, as you can see, it's a very powerful technology. It can make high resolution images of the human body without harmful radiation. And it's a complex machine. First of all, it's massive: it weighs 4,600 kilograms. And it has superconducting magnets cooled with liquid helium at -269 degrees Celsius. And it's actually full of software, millions and millions of lines of code. And it occupies three rooms. What you see in this picture is the examination room, but there is also a technical room which is full of equipment, custom hardware, and machinery that is needed to operate this complex device. This is another system, an interventional image guided therapy system where X-ray is used during interventions with the patient on the table. You see on the left what we call the C-arm, a robotic arm that moves and can take images of the patient while they are being operated on; it's used for cardiology interventions, neurological interventions, cardiovascular interventions. There's a table that moves in very complex ways, and again it occupies two rooms: the room that we see here but also a room full of cabinets, hardware, and computers. Another characteristic of this machine is that it is used during medical interventions, and so it has to interact with all kinds of other equipment. This is another system, a computed tomography scanner, the IQon, which is unique due to its special detection technology. It has an image resolution of up to 0.5 millimeters, making thousand by thousand pixel images. And it is also a complex machine. This is a picture of the inside of a comparable device, not exactly an IQon, but again it has a rotating gantry, which weighs about two and a half tons. It's a combination of an X-ray tube on top, high voltage generators to power the X-ray tube, and an array of detectors to create the images.
And this rotates at 220 rotations per minute, making 50 frames per second to create 3D reconstructions of the body. So a lot of technology, complex technology, and this technology is made for this situation: we make it for clinicians who are busy saving people's lives. Of course they want optimal clinical performance, the best technology to treat their patients, but they also want a predictable cost of ownership and predictable system operations. They want their clinical schedules not interrupted. They understand that these machines are complex and full of technology, that they may require maintenance, may require software updates, and sometimes may even require worn parts to be replaced, but they do not want it unplanned. They do not want unplanned downtime. They would hate having to send patients home and having to reschedule visits. So they understand maintenance; they just want it to be scheduled, predictable and non-intrusive. So already a number of years ago we started a transition from what we call reactive maintenance of these devices to proactive maintenance. Let me show you what we mean by this. Normally, if a system in the field has an issue, the traditional reactive workflow would be that the customer calls a call center and reports the problem. The company servicing the device would dispatch a field service engineer; the field service engineer would go on site, do troubleshooting (literally smell, listen for noise, watch for blinking LEDs or other unusual signs), troubleshoot the issue, find the root cause and perhaps decide that a spare part needs to be replaced. The engineer would order the spare part, and the part would have to be delivered to the site, either immediately, or the engineer would need to come back another day when the part is available and perform the repair. That means replacing the part, doing all the needed tests and validations, and finally releasing the system for clinical use. So as you can see, there are a lot of steps, and also handovers of information between different people, even between different organizations. Would it be better to actually keep monitoring the installed base, keep observing the machines and, based on the information collected, detect or even predict when an issue is going to happen? And then, instead of reacting to a customer calling, proactively approach the customer, schedule preventive service, and therefore avoid the problem. This is what we call proactive service, and this is what we have been transitioning to using big data. And big data is just one ingredient; in fact, there are more things that are needed. The devices themselves need to be designed for reliability and predictability. If the device is a black box that does not communicate its status to the outside world, if it does not transmit data, then of course it is not possible to observe it and therefore to predict issues. This also requires a remote service infrastructure, or an IoT infrastructure as it is called nowadays: the capability to connect the medical device to a data center, an enterprise infrastructure, collect the data and perform remote troubleshooting and predictions.
The right processes and the right organization also need to be in place, because an organization that is waiting for the customer to call, with a number of field service engineers available and a certain amount of spare parts in stock, is a different organization from one that is continuously observing the installed base and scheduling actions to prevent issues. Another pillar is knowledge management. In order to realize predictive models and predictive service actions, it is important to manage knowledge about failure modes and maintenance procedures very well, to have it standardized, digitalized and available. And last but not least, of course, the predictive models themselves. So we talked about transmitting data from the installed base of medical devices to an enterprise infrastructure that analyzes the data and generates predictions; those predictive models are exactly the last ingredient that is needed. This is not something I am telling you for the first time; it is actually a strategic intent of Philips, where we aim for zero unplanned downtime, and we market it that way. It is also not a secret that we do it by using big data. Of course, there could be other methods to achieve the same goal, but we started using big data quite a few years ago, and one of the reasons is that our medical devices are already wired to collect lots of data about their functioning. They collect events, error logs and sensor data. To give you an idea, just as an order of magnitude of the size of the data: one MRI scanner can log more than 1 million events per day, hundreds of thousands of sensor readings and tens of thousands of many other data elements. So this is truly big data. On the other hand, this data was not designed for predictive maintenance. You have to consider that a medical device of this type stays in the field for about 10 years, some a little longer, some shorter. So these devices were designed 10 years ago, and not all components were designed with predictive maintenance, IoT or the latest technology in mind; things were not so forward looking at the time. So the key challenge is taking the data which is already available, which is already logged by the medical devices, integrating it and creating predictive models. And if we dive a little more into the research challenges, these are some of them. How to integrate diverse data sources, and especially how to automate the costly process of data provisioning and cleaning? Once you have the data, how to create models that can predict failures and the degradation of performance of a single medical device? Once you have these models and alerts, another challenge is how to automatically recommend service actions based on the probabilistic information about these possible failures. And once you have the insights, even if you can recommend an action, recommending it should be done with the goal of planning maintenance for generating value. That means balancing costs and benefits, preventing unplanned downtime without scheduling unnecessary interventions, because every intervention, of course, is a disruption of the clinical schedule.
And there are many more applications that can be built on top of this, such as the optimal management of spare parts supplies. So how did we approach this problem? Our approach was to collect into one database, Vertica, a large amount of historical data: first of all historical data coming from the medical devices (event logs, parameter values, system configurations, sensor readings, all the data that we have at our disposal), and in the same database, records of failures: maintenance records, service work orders, part replacements, contracts, basically the evidence of failures. Once you have data from the medical devices and data about the failures in the same database, it becomes possible to correlate event logs, errors and sensor readings with records of failures, part replacements and maintenance operations. And we did that with a specific approach. We created integrated teams, and every integrated team had three figures, not necessarily three people; they were actually multiple people. There was at least one business owner from the service organization. The business owner is the person who knows what is relevant, which use cases are worth solving for a particular type of product or a particular market, basically what generates value or is worthwhile tackling as an organization. Then we have data scientists. Data scientists are the ones who can actually manipulate data: they can write the queries, they can build the models and do robust statistics, they can create visualizations; they are the ones who really work the data. Last but not least, and very important, are the subject matter experts. Subject matter experts are the people who know the failure modes and the functioning of the medical devices. Perhaps they even designed them; they come from the design side, or from the service innovation side, or even from the field: people who have been servicing the machines in real life for many, many years. So they are familiar with the failure modes, but also with the type of data that is logged, the processes, and how the systems actually behave, if you allow me, in the wild, in the field. The combination of these three figures was key, because data scientists alone (statisticians, basically, people who can do machine learning) are not very effective on their own, because the data is too complicated, too complex, and they would spend a huge amount of time just trying to figure out the data. Or perhaps they would spend time tackling things that are useless, whereas a subject matter expert knows much more quickly which data points are useful and which phenomena can or cannot be found in the data. So the combination of subject matter experts and data scientists is very powerful, and together, guided by a business owner, we could tackle the most useful use cases first. So these teams set to work, and they developed three things, mainly. First of all, they developed insights into the failure modes: by looking at the data and analyzing information about what happened in the field, they found out exactly how things fail, in a very pragmatic and quantitative way. They also, of course, set out to develop the predictive models with associated alerts and service actions. And a predictive model is not just an alert, not just a flag that turns on like a traffic light; there is much more to it than that.
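To make that correlation and modeling step concrete, the following is a minimal, illustrative sketch of the general technique: joining per-device log features with failure records to train a simple classifier. The file names, column names and seven-day label window are assumptions made for illustration, not Philips' actual schema or method.

```python
# Sketch: label device-days by whether a failure followed within 7 days,
# then train a classifier on aggregated log features. Names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical extracts: one row per device per day of log features,
# and one row per confirmed part failure.
events = pd.read_csv("device_event_features.csv", parse_dates=["day"])
failures = pd.read_csv("part_failures.csv", parse_dates=["failure_date"])

# Positive label: a failure occurred within the next 7 days for that device.
failures["label_start"] = failures["failure_date"] - pd.Timedelta(days=7)
events["will_fail_7d"] = 0
for _, f in failures.iterrows():
    mask = (
        (events["device_id"] == f["device_id"])
        & (events["day"] >= f["label_start"])
        & (events["day"] < f["failure_date"])
    )
    events.loc[mask, "will_fail_7d"] = 1

feature_cols = ["error_count", "warning_count", "sensor_mean", "sensor_max"]
X_train, X_test, y_train, y_test = train_test_split(
    events[feature_cols],
    events["will_fail_7d"],
    test_size=0.3,
    stratify=events["will_fail_7d"],
    random_state=0,
)

model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

In practice the value of such a model depends on pairing its output with evidence and a recommended action, which is exactly the point made next.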
Such an alert has to be interpreted and used by a highly skilled, trained engineer, for example in a call center, who needs to evaluate it and plan a service action. A service action may involve ordering a replacement for an expensive part; it may involve calling up the customer hospital and scheduling a period of downtime to replace a part. So it has, or could have, an impact on the clinical practice. It is therefore important that the alert is coupled with sufficient evidence and information for such a highly skilled, trained engineer to plan the service action efficiently. So it is a lot of work in terms of preparing data, preparing visualizations, and making sure that all information is represented correctly and in a compact form. Additionally, these teams gain insight into the failure modes, so they can provide input to the R&D organization to improve the products. To summarize this graphically: we took a lot of historical data coming from the medical devices, but also data from relational databases (the service work orders, the part replacements, the contract information), we integrated it, and we set up the data analytics. At that point we do not have value yet; value only starts appearing when we use the models that come out of the data analytics on live data. When we process live data with the models we can generate alerts, the alerts can be used to plan maintenance, and that planned maintenance, replacing unplanned downtime, is what creates value. I cannot show you the details of these predictive models, but to give you an idea, this is just a picture of some of the components of our medical devices for which we have models covering the failure modes: hard disks, clinical-grade monitors, X-ray tubes, and so forth. For MRI machines, a lot of custom hardware and other types of amplifiers and electronics. The alerts are then displayed in a dashboard, what we call a remote monitoring dashboard. We have a team of remote monitoring engineers that basically surveys the installed base, looks at this dashboard and picks up these alerts. An alert, as I said before, is not just a flag: it contains a lot of information about the failure and about the medical device. The remote monitoring engineers pick up these alerts, review them and create cases for the market organizations to handle. They see an alert coming in and they create a case, so that the call center in a particular country can call the customer and make an appointment to schedule a service action, or add a preventive action to the schedule of a field service engineer who is already supposed to visit that customer, for example. This is a high-level picture of the overall data processing architecture. At the bottom we have the installed base, formed by all our medical devices that are connected to the Philips remote service network. Data is transmitted in a secure way to our enterprise infrastructure, where we have a so-called data lake, which is basically an archive where we store the data as it comes from the customers; there it is scrubbed and protected.
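As an illustration of the kind of information such an alert needs to carry so that a remote monitoring engineer can act on it without re-querying raw logs, here is a small, hypothetical sketch; the field names are assumptions, not the actual alert schema used by Philips.

```python
# Illustrative alert payload: enough context and evidence for a trained
# engineer to judge the case and plan a service action. Fields are assumed.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class PredictiveAlert:
    device_id: str                 # which system in the installed base
    component: str                 # e.g. "x_ray_tube", "gradient_amplifier"
    failure_mode: str              # the failure mode the model was trained on
    probability: float             # model confidence for failure in the horizon
    horizon_days: int              # prediction window, e.g. failure within 14 days
    evidence: List[str] = field(default_factory=list)   # plots, log excerpts, trends
    recommended_action: str = ""   # e.g. "order part X, schedule 2h downtime"
    created_at: datetime = field(default_factory=datetime.utcnow)

alert = PredictiveAlert(
    device_id="MRI-000123",
    component="gradient_amplifier",
    failure_mode="coil_drift",
    probability=0.87,
    horizon_days=14,
    evidence=["sensor trend plot", "error burst on 2020-03-02"],
    recommended_action="order replacement amplifier; schedule 2h preventive slot",
)
print(alert)
```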
From there, we have ETL processes (Extract, Transform and Load) that in parallel analyze this information, parse all these files and all this data, and extract the relevant parameters. The reason is that the data coming from the medical devices is very verbose and in legacy formats, sometimes binary formats with strange legacy structures. So we parse it, we structure it and we make it easily usable by the data science teams. The results are stored in a Vertica cluster, in a data warehouse: the same data warehouse where we also store information from other enterprise systems, from all kinds of databases, Microsoft SQL Server, Teradata, SAP, Salesforce applications. So the enterprise IT systems are also connected to Vertica and their data is inserted into Vertica. And then from Vertica the data is pulled by our predictive models, which are Python and R scripts that run on our proprietary environment, HealthSuite Insights. From this proprietary environment we generate the alerts, which are then used by the remote monitoring application. It is not the only application; that is the case of remote monitoring, but we also have applications for reactive remote service. Whenever we cannot predict or prevent an issue from happening and we need to react to a customer call, we can still use the data to very quickly troubleshoot the system, find the root cause and advise on the best service action. Additionally, there are reliability dashboards, because all this data can also be used to perform reliability studies and improve the design of the medical devices, and it is used by R&D. And access is possible with all kinds of tools: Vertica gives the flexibility to connect with JDBC, to create dashboards using Power BI or QlikView, or to simply use R and Python directly to perform analytics. A little summary of the size of the data: at the moment we have integrated about 500 terabytes worth of data tables, about 30 trillion data points, from more than eighty different data sources, covering our complete connected installed base, including our customer relationship management system and SAP. We have also integrated data from the factory and from repair shops. This is very useful, because having information from the factory allows us to characterize components and devices when they are new, when they are still not used, so we can model degradation and predict failures much better. We also have many years of historical data and, of course, 24/7 live feeds. To get all this going, we have chosen very simple designs from the very beginning. The first system was developed back in 2015; at that time we went from scratch to production in eight months, and it is also a very stable system. To achieve that, we apply what we call exhaustive error handling. Most people attending this conference probably know that when you are dealing with big data, you face all kinds of corner cases you would think could never happen; just because of the sheer volume of the data, you find all kinds of strange things. And that is what you need to take care of if you want to have a stable platform, a stable data pipeline.
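A minimal sketch of what exhaustive error handling can look like in practice: a parser that never lets a single malformed line stop the pipeline, but quarantines it with a reason for later inspection. The file layout and field names are illustrative assumptions.

```python
# Sketch: parse a verbose device log; bad records are counted and quarantined
# (with file, line number and reason) instead of aborting the whole load.
import csv
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("parser")

def parse_line(line: str) -> dict:
    # Expect "timestamp;event_code;value"; raise on anything unexpected.
    ts, code, value = line.rstrip("\n").split(";")
    return {"timestamp": ts, "event_code": code, "value": float(value)}

def parse_file(path: str, quarantine_path: str) -> list:
    good, bad = [], 0
    with open(path, encoding="utf-8", errors="replace") as src, \
         open(quarantine_path, "w", newline="") as qfile:
        quarantine = csv.writer(qfile)
        for lineno, line in enumerate(src, start=1):
            try:
                good.append(parse_line(line))
            except Exception as exc:      # exhaustive: any corner case is captured
                bad += 1
                quarantine.writerow([path, lineno, repr(exc), line.strip()])
    log.info("%s: %d parsed, %d quarantined", path, len(good), bad)
    return good

records = parse_file("device_log.txt", "quarantine.csv")
```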
Another characteristic is that we need to handle live data, but we also need to be able to reprocess large historical datasets, because insights into the data are generated over time by the teams using it. Very often they find not only defects, but they also have change requests: new data to be extracted, or data to be extracted or aggregated in a different way. So basically the platform is continuously crunching data. Also, components have built-in monitoring capabilities. Transparency builds trust by showing how the platform behaves: people can trust that they have all the data which is available, and if they do not see the data, or if something is not functioning, they can see why and where the processing has stopped. A very important point is documentation of data sources. Every data point has so-called data provenance fields: not only the medical device it comes from, with all its identifiers, but also from which file, from which moment in time, from which row and from which byte offset that data point comes. Not only that, but also when the data point was created and by whom, "by whom" meaning which version of the platform and of the ETL created it. This allows us to identify issues, and when an issue is identified and fixed, it is possible to fix only the subset of the data that is impacted by that issue. Again, this builds trust in the data, which is essential for this type of application. We actually have different environments in our analytics solution. One, which we call the data science environment, is more or less what I have shown so far: it is deployed in our Philips private cloud, but it can also be deployed in a public cloud such as Amazon. It contains the years of historical data and it allows interactive data exploration and human queries, so it is a highly variable load. It is used for the training of machine learning algorithms, and it has been designed to allow rapid prototyping on large data volumes. Another environment is the so-called production environment, where we actually score the models with live data for the generation of the alerts. This environment does not require years of data, just months, because a model does not necessarily need years of data to make a prediction; some models need even just a couple of weeks, or a few months, three months, six months, depending on the type of data and on the failure being predicted. And it has highly optimized queries, because the applications are stable; it only changes when we deploy new models or new versions of the models. It is designed and optimized for low latency, high throughput and reliability: no human intervention, no human queries. And of course there are development and staging environments. Another characteristic of all this work is what we call data-driven service innovation: in all this work we use data in every step of the process. The first is business case creation. Some people ask how we managed to unlock the investment to create such a platform and to work on it for years; how did we start? Basically, we started with a business case, and for that business case, again, we used data.
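To illustrate why those provenance fields pay off, here is a hedged sketch of a selective fix using the open source vertica-python client: rather than rebuilding the whole table, only the slice written by a defective ETL release from the affected source files is deleted and then reloaded. The connection details, table and column names are placeholders, not the actual warehouse schema.

```python
# Sketch: use provenance columns (etl_version, source_file) to repair only the
# impacted subset of data. Connection details and schema are assumptions.
import vertica_python

conn_info = {
    "host": "vertica.example.internal",   # placeholder connection details
    "port": 5433,
    "user": "etl_user",
    "password": "***",
    "database": "analytics",
}

# A real pipeline would use bind parameters; values are inlined here for brevity.
delete_impacted = """
    DELETE FROM device_events
    WHERE etl_version = '5.1.3'           -- the release that introduced the defect
      AND source_file LIKE 'mri_%2020-02%.log'
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(delete_impacted)
    conn.commit()
# After the targeted delete, only those source files are re-parsed and reloaded.
```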
Of course, you need to start somewhere and you need to have some data, but you can use data to make a quantitative analysis of the current situation and an estimate, as accurate as possible, of the value creation. If you have that, you can justify the investment and you can start building. Next to that, data is used to decide where to focus your efforts. In this case, we decided to focus on the use cases that had the maximum estimated business impact, with business impact meaning here customer value as well as value for the company. We want to reduce unplanned downtime, we want to give value to our customers, but it would not be sustainable if, to create value, we started replacing parts without any consideration of the cost; it needs to be sustainable. Then we use data to analyze the failure modes, to actually dig into the data and understand how things fail, for visualization, and to do reliability analysis. And of course, data is key to doing feature engineering for the development of the predictive models, for training the models and for validation with historical data. So data is all over the place. And last but not least, this architecture generates new data about the alerts and about how good the alerts are: how well they can predict failures, how much downtime is being saved, how many issues have been prevented. This is also data that needs to be analyzed; it provides insight into the performance of these models and can be used to improve the models further. Once you have the performance of the models, you can use data to quantify as much as possible the value which is created. That is when you go back to the first step: you created the first business case with estimates, so can you actually show that you are creating value? The more you can close this feedback loop and quantify it, the better it is for having more and more impact. Among the key elements needed to realize this, I want to mention one about data documentation, a practice we started six years ago that has proven to be very valuable. We always document how data is extracted and how it is stored, in data model documents. Data model documents specify how data goes from one place to the other, in this case from device logs, for example, to a table in Vertica. They include things such as the definition of duplicates, queries to check for duplicates, and of course the logical design of the tables, the physical design of the tables and the rationale. Next to that, there is a data dictionary that explains, for each column in the data model, from a subject matter expert perspective, what it means: its definition and meaning; if it is a measurement, the unit of measure and the range; if it is some sort of label, the expected values; and whether the value is raw or calculated. This is essential for maximizing the value of data and for allowing people to use the data. Last but not least, there is an ETL design document that explains how the transformation happens from the source to the destination, including, very importantly, the failure handling strategy. For example, when you cannot parse part of a file, should you load only what you can parse or drop the entire file completely?
So: import best effort, or do all or nothing? How do you populate records for which there is no value, what are the default values, how is the data normalized or transformed, and how do you avoid duplicates? This again is very important in order to provide the users of the data with a full picture of the data itself. And this is a formal process: the documents are reviewed and approved by all the stakeholders, the subject matter experts and also the data scientists, and by a function that we have introduced called the data architect. And of course the documents are available to the end users of the data. We even have links to the documents from within the data warehouse. So if you get access to the database and you are doing your research and you see a table or a view, you think, well, that could be interesting, it looks like something I could use for my research; the data itself has a link to the document. From the database, while you are exploring the data, you can retrieve a link to the place where the document is available. This is just a quick summary of some of the results that I am allowed to share at this moment. This is about image-guided therapy: using our remote service infrastructure, for remotely connected systems with the right contracts, we have reduced downtime by 14%. More than one out of three cases are resolved remotely, without an engineer having to go on site. 82% is the first-time-right fix rate; that means the issue is fixed either remotely or, if a visit to the site is needed, only one visit is needed: at that moment the engineer has the right part and fixes it straight away. And this results on average in 135 hours more operational availability per year, and therefore the ability to treat more patients for the same cost. I would like to conclude by citing some nice testimonials from some of our customers, showing that the value that we have created is really high impact, and this concludes my presentation. Thanks for your attention so far. >> Thank you Mauro, very interesting. And we've got a number of questions that have come in, so let's get to them. The first one: how many devices has Philips connected worldwide? And how do you determine which related sensor data workloads get analyzed with Vertica? >> Okay, so these are two questions. The first question: how many devices are connected worldwide? Well, I am not allowed to tell you the precise number of connected devices worldwide, but what I can tell you is that we are in the order of tens of thousands of devices, and of all types, actually. And then, how do we determine which sensor data gets analyzed with Vertica? As I said in the presentation, it is a combination of two approaches: a data-driven approach and a knowledge-driven approach. A knowledge-driven approach because we make maximum use of our knowledge of the failure modes and the behavior of the medical devices and of their components to select what we think are promising data points and promising features. However, from that moment on, data science kicks in, and data science is used to look at the actual data and come up with quantitative information about what is really happening. So it could be that an expert is convinced that a particular range of values of a sensor is indicative of a particular failure.
And it may turn out that he was too optimistic, or the other way around, that in practice there are many other situations he was not aware of that could happen. So thanks to the data we get a better understanding of the phenomenon and we get better models. I hope that answers it; any other questions? >> Yeah, we have another question. Do you have plans to perform any analytics at the edge? >> Now that's a good question. I can't disclose our plans on this right now, but edge devices are certainly one of the options we look at to help our customers towards zero unplanned downtime. Not only that, but also to facilitate the integration of our solution with existing and future hospital IT infrastructure. We're talking about advanced security and privacy, and guaranteeing that the data is always safe, that patient data and clinical data remain inside the perimeter of the hospital, of course, while we enhance our functionality and provide more value with our services. So yes, edge is definitely a very interesting area of innovation. >> Another question: what are the most helpful Vertica features that you rely on? >> I would say the first that comes to mind at this moment is ease of integration. Basically, with Vertica we are able to load any data source in a very easy way, and it can be interfaced very easily with all kinds of applications. This, of course, is not unique to Vertica; nevertheless, the added value here is that it is coupled with incredible speed, incredible speed for loading and for querying. So it's basically a very versatile tool to innovate fast in data science. Another thing is multiple projections, advanced encoding and compression. This allows us to perform optimizations only when we need them, and without having to touch applications or queries: if we want to achieve higher performance, we basically spend a little effort on improving the projections, and very often we can achieve dramatic increases in performance. Another feature is Eon Mode, which is great for cloud deployment. >> Okay, another question. What is the number one lesson learned that you can share? >> My advice would be: document and control your entire data pipeline, end to end, and create positive feedback loops. What I hear often is that enterprises that are not digitally native, and Philips is one of them (Philips is 129 years old as a company, so you can imagine the legacy that we have; we were not born on the web like web companies, with everything online and everything digital), sometimes struggle to innovate with big data or to do data-driven innovation, because the data is not available or is in silos, data is controlled by different parts of the organization with different processes, and there is no super strong enterprise IT system providing all the data to everybody through APIs. So my advice is, from the very beginning, to create as soon as possible an end-to-end solution, from data creation to consumption, that creates value for all the stakeholders of the data pipeline. It is important that everyone in the data pipeline, from the producers of the data to the consumers, gets a piece of the value, a piece of the cake.
When the value is proven to all stakeholders, everyone will naturally contribute to keeping the data pipeline running and to keeping the quality of the data high. That's the lesson there. >> Yeah, thank you. And in the area of machine learning, what types of innovations do you plan to adopt to help with your data pipeline? >> So, in the area of machine learning, we're looking at things like automatically detecting the deterioration of models to trigger improvement actions, as well as, connected with that, active learning, again focused on improving the accuracy of our predictive models. Active learning is when additional human intervention, the labeling of difficult cases, is triggered. The machine learning classifier may not be able to classify correctly all the time, and instead of just randomly picking some cases for a human to review, you want the costly humans to only review the most valuable cases from a machine learning point of view, the ones that would contribute the most to improving the classifier. Another area is deep learning, and not only that, but also applications of more generic anomaly detection algorithms. The challenge with anomaly detection is that we are not only interested in finding anomalies but also in recommending the proper service actions, because without a proper service action, an alert generated because of an anomaly loses most of its value. So this is where I think we, you know. >> Go ahead. >> No, that's it, thanks. >> Okay, all right. So that's all the time that we have today for questions. I want to thank the audience for attending Mauro's presentation and also for your questions. If we weren't able to answer your question today, we'll respond via email. And again, our engineers will be on the Vertica forums awaiting your other questions. It would help us greatly if you could give us some feedback and rate the session before you sign off; your rating will help guide us when we're looking at content to provide for the next Vertica BDC. Also note that a replay of today's event and a PDF copy of the slides will be available on demand; we'll let you know when by email, hopefully later this week. And of course we invite you to share the content with your colleagues. Again, thank you for your participation today. This concludes this breakout session, and I hope you have a wonderful day. Thank you. >> Thank you
SUMMARY :
in the lower right corner of the slide. and perhaps decide that the spare part needs to be replaced. So let's get to them. and the behavior of the medical devices Do you have plans to perform any analytics at the edge? and guarantee that the data is always safe remains. on improving the projection. What is the number one lesson learned that you can share? from the producer of the data to the to the consumers, And in the area of machine learning, what types the deterioration of models to trigger improvement action, and a PDF copy of the slides will be available on demand,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Mauro Barbieri | PERSON | 0.99+ |
Philips | ORGANIZATION | 0.99+ |
Gerard | PERSON | 0.99+ |
Frederik | PERSON | 0.99+ |
Phillips | ORGANIZATION | 0.99+ |
Sue LeClaire | PERSON | 0.99+ |
2015 | DATE | 0.99+ |
two questions | QUANTITY | 0.99+ |
Mauro | PERSON | 0.99+ |
Eindhoven | LOCATION | 0.99+ |
4.6 thousand kilograms | QUANTITY | 0.99+ |
two rooms | QUANTITY | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
14% | QUANTITY | 0.99+ |
six months | QUANTITY | 0.99+ |
Anton | PERSON | 0.99+ |
4% | QUANTITY | 0.99+ |
135 hours | QUANTITY | 0.99+ |
three months | QUANTITY | 0.99+ |
2019 | DATE | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
last year | DATE | 0.99+ |
82% | QUANTITY | 0.99+ |
two approaches | QUANTITY | 0.99+ |
eight months | QUANTITY | 0.99+ |
three people | QUANTITY | 0.99+ |
three rooms | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
first question | QUANTITY | 0.99+ |
more than 1000 patents | QUANTITY | 0.99+ |
1891 | DATE | 0.99+ |
Today | DATE | 0.99+ |
Power BI | TITLE | 0.99+ |
Netherlands | LOCATION | 0.99+ |
one ingredient | QUANTITY | 0.99+ |
three figures | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
over 100 countries | QUANTITY | 0.99+ |
later this week | DATE | 0.99+ |
tens of thousands | QUANTITY | 0.99+ |
SQL | TITLE | 0.98+ |
about 10% | QUANTITY | 0.98+ |
about 80,000 employees | QUANTITY | 0.98+ |
six years ago | DATE | 0.98+ |
Python | TITLE | 0.98+ |
three | QUANTITY | 0.98+ |
two brothers | QUANTITY | 0.98+ |
millions | QUANTITY | 0.98+ |
first step | QUANTITY | 0.98+ |
about 30 trillion data points | QUANTITY | 0.98+ |
first one | QUANTITY | 0.98+ |
about 500 terabytes | QUANTITY | 0.98+ |
Microsoft | ORGANIZATION | 0.98+ |
first time | QUANTITY | 0.98+ |
each column | QUANTITY | 0.98+ |
hundreds of thousands | QUANTITY | 0.98+ |
this week | DATE | 0.97+ |
Salesforce | ORGANIZATION | 0.97+ |
first | QUANTITY | 0.97+ |
tens of thousands of devices | QUANTITY | 0.97+ |
first system | QUANTITY | 0.96+ |
about 10 years | QUANTITY | 0.96+ |
10 years ago | DATE | 0.96+ |
one visit | QUANTITY | 0.95+ |
Morrow | PERSON | 0.95+ |
up to 0.5 millimeters | QUANTITY | 0.95+ |
More than eighty different data sources | QUANTITY | 0.95+ |
129 years ago | DATE | 0.95+ |
first interaction | QUANTITY | 0.94+ |
one flag | QUANTITY | 0.94+ |
three things | QUANTITY | 0.93+ |
thousand | QUANTITY | 0.93+ |
50 frames per second | QUANTITY | 0.93+ |
First business | QUANTITY | 0.93+ |
Steve Wilkes, Striim | Big Data SV 2018
>> Narrator: Live from San Jose it's theCUBE. Presenting Big Data Silicon Valley. Brought to you by SiliconANGLE Media and its ecosystem partners. (upbeat music) >> Welcome back to San Jose everybody, this is theCUBE, the leader in live tech coverage, and you're watching BigData SV, my name is Dave Vellante. In the early days of Hadoop everything was batch oriented. About four or five years ago the market really started to focus on real time and streaming analytics to try to really help companies affect outcomes while things were still in motion. Steve Wilkes is here, he's the co-founder and CTO of a company called Striim, a firm that's been in this business for around six years. Steve welcome to theCUBE, good to see you. Thanks for coming on. >> Thanks Dave it's a pleasure to be here. >> So tell us more about that, you started about six years ago, a little bit before the market really started talking about real time and streaming. So what led you to the conclusion that you should co-found Striim way ahead of its time? >> It's partly our heritage. So the four of us that founded Striim, we were executives at GoldenGate Software. In fact our CEO Ali Kutay was the CEO of GoldenGate Software. So when we were acquired by Oracle in 2009, after having to work for Oracle for a couple years, we were trying to work out what to do next. And GoldenGate was replication software right? So it's moving data from one place to another. But customers would ask us in customer advisory boards, that data seems valuable, it's moving. Can you look at it while it's moving and analyze it while it's moving, get value out of that moving data? And so that was kind of set in our heads. And then we were thinking about what to do next, and that was kind of the genesis of the idea. So the concept around Striim when we first started the company was that we can't just give people streaming data, we need to give them the ability to process that data, analyze it, visualize it, play with it and really truly understand the data, as well as being able to collect it and move it somewhere else. And so the goal from day one was always to build a full end-to-end platform that did everything customers needed to do for streaming integration and analytics out of the box. And that's what we've done after six years. >> I've got to ask a really basic question, so you're talking about your experience at GoldenGate moving data from point a to point b and somebody said well why don't we put that to work. But is there change data or was it static data? Why couldn't I just analyze it in place? >> GoldenGate works on change data. >> Okay so that's why, there were changes going through. Why wait until it hits its target, let's do some work in real time and learn from that, get greater productivity. And now you guys have taken that to a new level. That new level being what? Modern tools, modern technologies? >> A platform built from the ground up to be inherently distributed, scalable, reliable, with exactly-once processing guarantees. And to be a complete end-to-end platform. There's a recognition that the first part of being able to do streaming data integration or analytics is that you need to be able to collect the data, right? And while change data capture from databases is the way to get data out of databases in a streaming fashion, you also have to deal with files and devices and message queues and anywhere else the data can reside. So you need a large number of different data collectors that all turn the enterprise data sources into streaming data.
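As a rough illustration of what such a collector does conceptually, turning a non-streaming source into a stream of events, here is a toy sketch that tails a growing log file and yields one event per new line. It is a generic, assumption-laden example, not Striim's adapter API.

```python
# Sketch: a minimal "collector" that converts a file into a stream of events.
import json
import time
from typing import Iterator

def tail_file(path: str, poll_seconds: float = 1.0) -> Iterator[dict]:
    """Yield one event per new line appended to the file."""
    with open(path, "r", encoding="utf-8") as f:
        f.seek(0, 2)                      # start at end of file, like CDC starting "now"
        while True:
            line = f.readline()
            if not line:
                time.sleep(poll_seconds)  # nothing new yet; poll again
                continue
            yield {"source": path, "raw": line.rstrip("\n"), "ts": time.time()}

if __name__ == "__main__":
    for event in tail_file("/var/log/app/orders.log"):
        print(json.dumps(event))
```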
And similarly if you want to store data somewhere you need a large collection of target adapters that deliver to things. Not just on premise but also in the cloud. So things like Amazon S3 or the cloud databases like Redshift and Google BigQuery. So the idea was really that we wanted to give customers everything they need, and that everything they need isn't trivial. It's not just, well we take Apache Kafka and then we stuff things into it and then we take things out. Pretty often, for example, you need to be able to enrich data, and that means you need to be able to join streaming data with additional context information, reference data. And that reference data may come from a database or from files or somewhere else. So you can't call out to the database and maintain the speeds of streaming data. We have customers that are doing hundreds of thousands of events per second. So you can't call out to a database for every event and ask for records to enrich it with. And you can't even do that with an external cache because it's just not fast enough. So we built in an in-memory data grid as part of our platform. So you can join streaming data with the context information in real time without slowing anything down. So when you're thinking about doing streaming integration, it's more than just moving data around. It's the ability to process it and get it in the right form, to be able to analyze it, to be able to do things like complex event processing on that data. And being able to visualize it and play with it is also an essential part of the whole platform. >> So I wanted to ask you about end-to-end. I've seen a lot of products from larger, maybe legacy companies that will say it's end-to-end but what it really is, is cobbled-together pieces that they bought in and then, this is our end-to-end platform, but it's not unified. Or I've seen others "Well we've got an end-to-end platform" oh really, can I see the visualization? "Well we don't have visualization "we use this third party for visualization". So convince me that you're end-to-end.
And you're right, the big vendors they have multiple different products and they're very happy to sell you consulting to put them all together. Even if you're trying to build this from open source and you know, organizations try and do that, you need five or six major pieces of open source, a lot of support in libraries, and a huge team of developers to just build a platform that you can start to build applications on. And most organizations aren't software platform companies, they're finance companies, oil and gas companies, healthcare companies. And they really want to focus on solving business problems and not on reinventing the wheel by building a software platform. So we can just go in there and say look; value immediately. And that really, really helps. >> So what are some of your favorite use cases, examples, maybe customer examples that you can share with me? >> So one of the great examples, one of my customers they have a lot of data in our HP non-stop system. And they needed to be able to get visibility into that immediately. And this was like order processing, supply chain, ERP data. And it would've taken a very large amount of time to do analytics directly on the HP nonstop. And finding resources to do that is hard as well. So they needed to get the data out and they need to get it into the appropriate place. And they recognize that use the right technology to ask the right question. So they wanted some of it in Hadoop so they could do some machine learning on that. They wanted some of it to go into Kafka so they could get real time analytics. And they wanted some of it to go into HBase so they could query it immediately and use that for reference purposes. So they utilized us to do change data capture against the HP nonstop, deliver that datastream out immediately into Kafka and also push some of it into HEFS and some of it into HBase. So they immediately got value out of that, because then they could also build some real-time analytics on it. It would sent out alerts if things were taking too long in their order processing system. And allowed them to get visibility directly into their process that they couldn't get before with much fewer resources and more modern technologies than they could have used before. So that's one example. >> Can I ask you a question about that? So you talked about Kafka, HBase, you talk about a lot of different open source projects. You've integrated those or you've got entries and exits into those? >> So we ship with Kafka as part of our product. It's an optional messaging bus. So, our platform has two different ways of moving data around. We have a high-speed, in-memory only message bus and that works almost network speed and it's great for a lot of different use cases. And that is what backs our data streams. So when you build a data flow, you have streams in between each step, that is backed by an in-memory bus. Pretty often though, in use cases, you need to be able to potentially rewind data for recovery purposes or have different applications running at different speeds and that's where a persistent message bus like Kafka comes in but you don't want to use a persistent message bus for everything because it's doing IO and it's slowing things down. So you typically use that at the beginning, at the sources, especially things like IOT where you can't rewind into them. Things like databases and files, you can rewind into them and replay and recover but IOT sources, you can't do that. 
So you would push that into a Kafka-backed stream and then subsequent processing is in-memory. So we have that as part of our product. We also have Elastic as part of our product for results storage. You can switch to other results storage but that's our default. And we have a few other key components that are part of our product, but then on the periphery we have adapters that integrate with a lot of the other things that you mentioned. So we have adapters to read and write HDFS, Hive, HBase, across Cloudera, Hortonworks, even MapR. So we have the MapR versions of the file system and MapR Streams and MapR-DB, and then there's lots of other more proprietary connectors, like CDC from Oracle, SQL Server, MySQL and MariaDB, and then database connectors for delivery to virtually any JDBC-compliant database. >> I took you down a tangent before you had a chance. You were going to give us another example. We're pretty much out of time but if you can briefly share either that or the last word, I'll give it to you. >> I think the last word would be that that is one example. We have lots and lots of other types of use cases that we do, including things like migrating data from on-premise to the cloud, being able to distribute log data, being able to analyze that log data, being able to do in-memory analytics and get real-time insights immediately and send alerts. It's a very comprehensive platform, but each one of those use cases is very easy to develop on its own and you can do them very quickly. And of course as the use case expands within a customer, they build more and more, and so they end up using the same platform for lots of different use cases within the same account. >> And how large is the company? How many people? >> We are around 70 people right now. >> 70 people and you're looking for funding? What rounds are you in? Where are you at with funding and revenue and all that stuff? >> Well I'd have to defer to my CEO for those questions. >> All right, so you've been around for what, six years you said? >> Yeah, we have a number of rounds of funding. We had initial seed funding, then we had the investment by Summit Partners that carried us through for a while. Then subsequent investment from Intel Capital, Dell EMC, Atlantic Bridge. And that's where we are right now. >> Good, excellent. Steve, thanks so much for coming on theCUBE, really appreciate your time. >> Great, it's awesome. Thank you Dave. >> Great to meet you. All right, keep it right there everybody, we'll be back with our next guest. This is theCUBE. We're live from BigData SV in San Jose. We'll be right back. (techno music)
SUMMARY :
Brought to you by SiliconANGLE Media the market really started to focus So what led you to that conclusion So it's moving data from one place to another. I got to ask a really basic question, And now you guys have taken that to a new level. and that means you need to be able to So I wanted to ask you about end-to-end. So our platform when you start with it And they needed to be able to get visibility So you talked about Kafka, HBase, So when you build a data flow, you have streams We're pretty much out of time but if you can briefly to develop on their own and you can do them very quickly. And that's where we are right now. really appreciate your time. Thank you Dave. Great to meet you.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Steve Wilks | PERSON | 0.99+ |
Steve | PERSON | 0.99+ |
2009 | DATE | 0.99+ |
Steve Wilkes | PERSON | 0.99+ |
five | QUANTITY | 0.99+ |
Intel Capital | ORGANIZATION | 0.99+ |
GoldenGate Software | ORGANIZATION | 0.99+ |
Ali Kutay | PERSON | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
hundreds | QUANTITY | 0.99+ |
GoldenGate | ORGANIZATION | 0.99+ |
Kafka | TITLE | 0.99+ |
San Jose | LOCATION | 0.99+ |
Stream | ORGANIZATION | 0.99+ |
MySQL | TITLE | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
Atlantic Bridge | ORGANIZATION | 0.99+ |
six years | QUANTITY | 0.99+ |
Steam | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
MapR | TITLE | 0.99+ |
HP | ORGANIZATION | 0.99+ |
four | QUANTITY | 0.99+ |
70 People | QUANTITY | 0.99+ |
Dell EMC | ORGANIZATION | 0.99+ |
MariaDB | TITLE | 0.99+ |
Striim | PERSON | 0.99+ |
SQL | TITLE | 0.99+ |
one | QUANTITY | 0.98+ |
each step | QUANTITY | 0.98+ |
Summit Partners | ORGANIZATION | 0.98+ |
two different ways | QUANTITY | 0.97+ |
first part | QUANTITY | 0.97+ |
around six years | QUANTITY | 0.97+ |
around 70 people | QUANTITY | 0.96+ |
HBase | TITLE | 0.96+ |
one example | QUANTITY | 0.96+ |
theCUBE | ORGANIZATION | 0.95+ |
BigData SV | ORGANIZATION | 0.94+ |
Big Data | ORGANIZATION | 0.92+ |
Hadoop | TITLE | 0.92+ |
one product | QUANTITY | 0.92+ |
each one | QUANTITY | 0.91+ |
six major pieces | QUANTITY | 0.91+ |
About four | DATE | 0.91+ |
CVC | TITLE | 0.89+ |
first | QUANTITY | 0.89+ |
about six years ago | DATE | 0.88+ |
day one | QUANTITY | 0.88+ |
Elastic | TITLE | 0.87+ |
Silicon Valley | LOCATION | 0.87+ |
Windows | TITLE | 0.87+ |
five years ago | DATE | 0.86+ |
S3 | TITLE | 0.82+ |
JDBC | TITLE | 0.81+ |
Azure | TITLE | 0.8+ |
CEO | PERSON | 0.79+ |
one place | QUANTITY | 0.78+ |
Redshift | TITLE | 0.76+ |
Autumn | ORGANIZATION | 0.75+ |
second | QUANTITY | 0.74+ |
thousands | QUANTITY | 0.72+ |
Big Data SV 2018 | EVENT | 0.71+ |
couple years | QUANTITY | 0.71+ |
ORGANIZATION | 0.69+ |
Jacque Istok, Pivotal | BigData NYC 2017
>> Announcer: Live from midtown Manhattan, it's the Cube, covering big data New York City 2017. Brought to you by Silicon Angle Media and its ecosystem sponsors. >> Welcome back everyone, we're here live in New York City for the week, three days of wall to wall coverage of big data NYC. It's big data week here in conjunction with Strata Hadoop, Strata Data, which is an event running right around the corner. This is the Cube, I'm John Furrier with my cohost, Peter Burris, and our next guest is Jacque Istok, who's the head of data at Pivotal. Welcome to the Cube, good to see you again. >> Likewise. >> You guys had big news we covered at VMware, obviously the Kubernetes craze is fantastic, you're starting to see cloud native platforms front and center even in some of these operational worlds like cloud and data. You guys have been here a while with Green Plum, and Pivotal's been adding more to the data suite, so you guys are a player in this ecosystem. >> Correct. >> As it grows to be much more developer-centric and enterprise-centric and AI-centric, what's the update? >> I'd like to talk about a couple things, just three quick things here, one focused primarily on simplicity. First and foremost, as you said, there's a lot of things going on on the Cloud Foundry side, a lot of things that we're doing with Kubernetes, etc., super exciting. I will say Tony Baer has written a nice piece about Green Plum in ZDNet, essentially calling Green Plum the best kept secret in the analytic database world. Why I think that's important is, what isn't really well known is that over the period of Pivotal's history, the last four and a half years, we focused really heavily on the Cloud Foundry side, on dev/ops, on getting users to actually be able to publish code. What we haven't talked about as much is what we're doing on the data side, and I find it very interesting to repeatedly tell analysts and customers that the Green Plum business has been and continues to be a profitable business unit within Pivotal. So as we're growing on the Cloud Foundry side, we're continuing to grow a business that many of the organizations that I see here at Strata are still looking to get to, that oft-forgotten profitability zone. >> There's a legacy around Green Plum, I'm not going to say they pivoted, pun intended, Pivotal. There's been added stuff around Green Plum, Green Plum might get lost in the messaging because it's been now one of many ingredients, right? >> It's true, and when we formed Pivotal, I think there were some 34 different SKUs that we have now focused in on over the last two years or so. What's super exciting is, again, over that time period, one of the things that we took to heart within the Green Plum side is this idea of extreme agile. As you guys know, Pivotal Labs, being a core part of the Pivotal mission, helps our customers figure out how to actually build software. We finally are drinking our own champagne, and over the last year and a half of Green Plum R&D we're shipping code, a complete data platform, on a cadence of about four to five weeks, which again is a little bit unheard of in the industry, being able to move at that pace. We work through the backlog, and what is also super exciting, and I'm glad that you guys are able to help me tell the world, is that we released version five last week.
Version five is actually the only parallel open source data platform that has native ANSI-compliant SQL, and I feel a little bit like I've rewound the clock 15 years in that I have to actually call out the ANSI compliance. But I think that in a lot of ways there are SQL alternatives out there in the world, and they are very much not ANSI compliant, and that hurts.
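Because Green Plum keeps a PostgreSQL base, a standard Postgres client is generally enough to exercise that ANSI SQL. The following is a hedged sketch, with connection details and the table as placeholders, showing an ANSI window-function query run through psycopg2:

```python
# Sketch: connect to a Green Plum cluster over the Postgres protocol and run
# an ANSI SQL query with window functions. Host, credentials and the
# monthly_sales table are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(
    host="greenplum-master.example.com", port=5432,
    dbname="analytics", user="gpadmin", password="***",
)

ANSI_QUERY = """
    SELECT region,
           order_month,
           revenue,
           SUM(revenue) OVER (PARTITION BY region ORDER BY order_month) AS running_total,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC)      AS month_rank
    FROM   monthly_sales
"""

with conn, conn.cursor() as cur:
    cur.execute(ANSI_QUERY)
    for row in cur.fetchmany(10):
        print(row)
conn.close()
```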
Just give a quick, for the audience watching, what's the relevance of it.
>> Sure, like you said, it is everywhere. It is the most full featured, actual database in the open source community. Arguably MySQL has "more" market share, but the MySQL projects that generally leverage it are not used for mission critical enterprise applications. Being able to have parity allows us not only to have that database technology baked into Greenplum, but it also gives us all of the community stuff with it. Everything from being able to leverage the most recent ODBC and JDBC libraries, but also integrations into everything from the PostGIS extension for geospatial to being able to connect to other types of data sources, etc.
>> It's a big community, shows that it's successful, but again,
>> And it doesn't come in a red box.
>> It does not come in a red box, that is correct.
>> Which is not a bad thing. Look, Postgres as a technology was developed a long time ago, largely in response to thinking about where analytics and transaction, or analytics and operating, applications might actually come together, and we're now living in a world where we can actually see the hardware and a lot of practices, etc. beginning to find ways where this may start to happen. With Greenplum and Postgres both MPP-based, by going to this, you're able to stay more modern, more up to date on all the new technology that's coming together to support these richer, more complex classes of applications.
>> You're spot on. I suppose I would argue that Postgres, I feel, came up as a response to Oracle in the past, of "we need an open source alternative to Oracle," but other than that, 100% correct.
>> There was always a difference between Postgres and MySQL. MySQL always was, okay, that's that, let's do that open source; Postgres, coming out of Berkeley and coming out of some other places, always had a slightly different notion of the types of problems it was going to take on.
>> 100% correct, 100%. But to your question before, what does this all mean to customers, I think the one thing that version five really gives us the confidence to say is, and a lot of times I hate lobbing when the ball's out like this, but we welcome and embrace with open arms any Teradata customers out there that are looking to save millions if not tens of millions of dollars on a modern platform that can actually run not only on premise, not only on bare metal, but virtually and off premise. We're truly the only MPP platform, the only open source MPP data platform, that can allow you to build analytics and move those analytics from Amazon to Azure to back on prem.
>> Talk about this, the Teradata thing, for a second, I want to get down and double click on that. Customers don't want to change code, so what specifically are you guys offering Teradata customers?
>> With the release of version five, with a lot of the development that we've done and some of the partnering that we've done, we are now able to take, without changing a line of code of your Teradata applications, you load the data within the Greenplum platform, you can point those applications directly to Greenplum and run them unchanged. So I think in the past, the reticence to move to any other platform was really the amount of time it would take to actually redevelop all of the stuff that you had. We offer an ability to go from an immediate ROI to a platform that, again, bridges that gap, allows you to really be modern.
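The Postgres parity Istok describes is easy to picture from the client side. The sketch below is illustrative only, not Pivotal code: because Greenplum speaks the Postgres wire protocol, a stock Postgres driver such as psycopg2 can connect and run ANSI SQL, window functions included, unchanged. The host, credentials, and sales table are hypothetical placeholders.

```python
# A minimal sketch, assuming a reachable Greenplum master and a "sales" table.
# Any Postgres-compatible driver works because Greenplum keeps wire parity.
import psycopg2

conn = psycopg2.connect(
    host="gpdb-master.example.com",  # hypothetical Greenplum master host
    port=5432,
    dbname="analytics",
    user="report_user",
    password="secret",
)

ansi_sql = """
    SELECT region,
           order_month,
           SUM(amount)                                   AS monthly_sales,
           RANK() OVER (PARTITION BY region
                        ORDER BY SUM(amount) DESC)       AS rank_in_region
    FROM   sales
    GROUP  BY region, order_month
"""

with conn, conn.cursor() as cur:
    cur.execute(ansi_sql)
    for region, month, total, rank in cur.fetchall():
        print(region, month, total, rank)

conn.close()
```

The same query text would run against any other ANSI-compliant warehouse, which is the portability argument being made above.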
>> Peter, I want to talk to you about that importance that we just said, because you've been studying the private cloud report, true private cloud, which is on premises, coming from a cloud operating model, automating away undifferentiated labor and shifting that to differentiated labor. But this brings up what customers want in hybrid cloud, and ultimately having public cloud and private cloud, so hybrid sits there. They don't want to change their code bases, this is a huge deal.
>> Obviously a couple things to go along with what Jacque said. The first thing is that you're right, people want the data to run where the data naturally needs to run or should run; that's the big argument about public versus hybrid versus what we call true private cloud. The idea that, increasingly, the workload needs to be located where the data naturally should be located, because of the physical, legal, regulatory, intellectual property attributes of the data, being able to do that is really, really important. The other thing that Jacque said that goes right into this question, John, is that ultimately, in too many domains in this analytics world, which is fundamentally predicated on the idea of breaking data out of applications so that you can use it in new and novel and more value creating ways, the data gets locked up in a data warehouse. What's valuable in a data warehouse is not the hardware. It's the data. By providing the facility for being able to point an application at a couple of different data sources, including one that's more modern, or which takes advantage of more modern technology, that can be considerably cheaper, it means the shop can elevate the story about the asset, and the asset here is the data and the applications that run against it, not the hardware and the system where the data's stored and located. One of the biggest challenges, we talked earlier, just to go on for a second, with a couple of other guests about the fact that the industry, your average person, still doesn't understand how to value data, how to establish a data asset, and one of the reasons is because it's so constantly co-mingled with the underlying hardware.
>> And actually I'd even go further, I think the advent of some of these cloud data warehouses forgets that notion of being able to run it in different places and provides one of the things that customers are really looking for, which is simplicity. The ability to spin up a quick MPP SQL system within, say, Amazon, for example; almost without a doubt, a lot of the business users that I speak to are willing to sacrifice capabilities within the platform for the simplicity of getting up and going. One of the things that we really focused on in V5 is being able to give that same turnkey feel, and so Greenplum exists within the Amazon marketplace, within the Azure marketplace, Google later this quarter, and then in addition to the simplicity, it has all of the functionality that is missing in those platforms, again, all the analytics, all the ability to reach out and federate queries against different types of data. I think it's exciting, as we continue to progress in our releases. Greenplum has, for a number of years, had this ability to seamlessly query HDFS, like a lot of the competitors, but HDFS isn't going away, and neither is a generic object store like S3.
But we continue to extend that to things like Spark, for example, so now the ability to actually house your data within a data platform and seamlessly integrate with Spark back and forth. If you want to use Spark, use Spark, but somewhere that data needs to be materialized so that other applications can leverage it as well.
>> But even then people have been saying, well, if you want to put it on this disk, then put it on this disk. The question about Spark versus another database manager is a higher level conversation than many of the shops who are investing millions and millions and millions of dollars in their analytic application portfolio, and all you're trying to do, as I interpret it, is trying to say, look, the value in the portfolio is the applications and the data. It's not the underlying elements. There's a whole bunch of new elements we can use, you can put it in the cloud, you can put it on premise if that's where the data belongs. Use some of these new and evolving technologies, but you're focused on how the data and the applications continue to remain valuable to the business over time, and not the traditional hardware assets.
>> Correct, and I'll again leverage a notion that we get from Labs, which is this idea of user-centric design, and so everything that we've been putting into the Greenplum database is around, ideally, the four primary users of our system. Not just the analysts and not just the data scientists, but also the operators and the IT folks. That is where I'd say the last tenet of where we're going really is this idea of coopetition. As the Pivotal Greenplum guy that's been around for 10 plus years, I would tell you very straight up that we are, again, an open source MPP data platform that can rival any other platform out there, whether it's Teradata, whether it's Hadoop, we can beat that platform.
>> Why should customers call you up? Why should they call you? There's all this other stuff out there, you got legacy, you got Teradata, might have other things, people are knocking at my door, they're getting pounded with sales messages, buy me, I'm better than the other guy. Why Pivotal Data?
>> The first thing I would say is, the latest reviews from Gartner, for example, well, actually, let me rewind. I will easily argue that Teradata has been the data warehouse platform for the last 30 years that everyone has tried to emulate. I'd even argue so much as that when Hadoop came on the scene eight years ago, what they did was change the dynamics, and what they're doing now is actually trying to emulate the Teradata success through things like SQL on top of Hadoop. What that has basically gotten us to is we're looking for a Teradata replacement at Hadoop-like prices; that's what Greenplum has to offer in spades. Now, if you actually extend that just a little bit, I still recognize that not everybody's going to call us, there are still 200 other vendors out there that are selling a similar product or similar kinds of stories. What I would tell you in response to those folks is that Greenplum has been around in production for the last 10 plus years, we're a proven technology for solving problems, and many of those are not. We work very well in this cooperative spirit of, Greenplum can be the end all be all, but I recognize it's not going to be the end all be all, so this is why we have to work within the ecosystem.
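As an aside, here is a minimal sketch of the Spark "back and forth" Istok describes above, using Spark's generic JDBC data source against Greenplum's Postgres-compatible endpoint. In practice the dedicated Greenplum-Spark connector would be the higher-throughput, parallel path; the URL, credentials, and table names below are hypothetical, and the Postgres JDBC driver is assumed to be on the Spark classpath.

```python
# Hedged sketch: read a Greenplum table into Spark, aggregate, write results
# back, so the materialized data stays available to other applications.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("greenplum-roundtrip").getOrCreate()

jdbc_url = "jdbc:postgresql://gpdb-master.example.com:5432/analytics"  # hypothetical
props = {"user": "report_user", "password": "secret",
         "driver": "org.postgresql.Driver"}

sales = spark.read.jdbc(url=jdbc_url, table="sales", properties=props)

monthly = (sales.groupBy("region", "order_month")
                .agg(F.sum("amount").alias("monthly_sales")))

monthly.write.jdbc(url=jdbc_url, table="sales_monthly_rollup",
                   mode="overwrite", properties=props)
```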
>> You have to, open source is dominating. At the Linux event, we just covered the Open Source Summit, 90% of software written will be open source libraries, 10% is where the value's being added.
>> For sure. If you were to start a new startup right now, would you go with a commercial product?
>> No, just the Postgres database is good. All right, final question to end the segment. This big data space that's now being called data, certainly Strata Hadoop is now Strata Data, just trying to keep that show going longer. But you got Microsoft Azure making a lot of waves right now with Microsoft Ignite, so cloud is into the play here, data's changed, so the question is how has this industry changed over the past eight years. You go back to 2010, I saw Greenplum coming prior to even getting bought out, but they were kicking ass, same product evolved. Where has the space gone? What's happened, how would you summarize it to someone who's walking in for the first year, like, hey, back in the old days, we used to walk to school in the snow with no shoes on, both ways. Now it's like, get off my lawn, you young developers. Seriously, what is the evolution of that, how would you explain it?
>> Again, I would start with Teradata started the industry, by far, and then folks like Netezza and Greenplum came around to really give a lower cost alternative. Hadoop came on the scene eight-some years ago, and what I pride myself in, being at Greenplum for this long, is Greenplum implemented the MapReduce paradigm as Hadoop was starting to build, and as it continued to build, we focused on building our own distribution and SQL on Hadoop. I think what we're getting down to is the brass tacks of, the business is tired of technological science experiments and they just want to get stuff done.
>> And a cost of ownership that's manageable.
>> And sustainable.
>> And sustainable, and not in a spot where they're going to be locked into a single vendor, hence the open source.
>> The ones that are winning today employed what strategy that ended up working out, and what strategy didn't end up working out? If you go back and say, the people who took this path failed, the people who took this approach won, what's the answer there?
>> Clearly anybody who was an appliance has long since drifted. I'd also say Greenplum's in this unique position where,
>> An appliance too though.
>> Well, pseudo appliance, yes, I still have to respond to that, we were always software.
>> You pivoted luckily.
>> But putting that aside, the hardware vendors have gone away, all of the software competitors that we had have actually either been sunset, sold off, and forgotten, and so Greenplum, here we sit as the sole standard-bearer that's been around for the long haul. We are now seeing a spot where we have no competition other than the forgotten, really legacy guys like Teradata. People are longing to get off of legacy and onto something modern; the trick will be whether that modern is some of these new and upcoming players and technologies, or whether it really focuses on solving problems.
>> What's the, what was the winning strategy?
Stick to your knitting, stick to what you know, or was it more of,
>> For us it was twofold: one, it was continuing to service our customers and make them successful, so that was how we built a profitable data platform business, and then the other was to double down on the strategies that seemed to be interesting to organizations, which were cloud, open source, and analytics. And like you said, I talked to one of the folks over at the Air Force, and he was mentioning how, to him, data's actually more important than fuel; being able to understand where the airplanes are, where the fuel is, where the people are, where the missiles are, etc., that's actually more important than the fuel itself. Data is the thing that powers everything.
>> Data's the currency of everything now. Great, Jacque, thanks so much for coming on the Cube. Pivotal Data Platform, Data Suite, Greenplum now with all these other adds, that's great, congratulations. Stay on the path helping customers, you can't lose.
>> Exactly.
>> The Cube here helping you figure out the big data noise, we're obviously at the Big Data New York City event for our annual Cube Wikibon event, in conjunction with Strata Data across the street. More live coverage here for three days here in New York City. I'm John Furrier with Peter Burris, we'll be back after this short break. (electronic music)
George Chow, Simba Technologies - DataWorks Summit 2017
>> (Announcer) Live from San Jose, in the heart of Silicon Valley, it's theCUBE covering DataWorks Summit 2017, brought to you by Hortonworks.
>> Hi everybody, this is George Gilbert, Big Data and Analytics Analyst with Wikibon. We are wrapping up our show on theCUBE today at DataWorks 2017 in San Jose. It has been a very interesting day, and we have a special guest to help us do a survey of the wrap-up, George Chow from Simba. We used to call him Chief Technology Officer, now he's Technology Fellow, but when he was explaining the difference in titles to me, I thought he said Technology Felon. (George Chow laughs) But he's since corrected me.
>> Yes, very much so
>> So George and I have been, we've been looking at both Spark Summit last week and DataWorks this week. What are some of the big advances that really caught your attention?
>> What's caught my attention actually is how much manufacturing has really, I think, caught onto the streaming data. I think last week was very notable that both Volkswagen and Audi actually had case studies for how they're using streaming data. And I think just before the break now, there was also a similar session from Ford, showcasing what they are doing around streaming data.
>> And are they using the streaming analytics capabilities for autonomous driving, or is it other telemetry that they're analyzing?
>> The, what is it, I think the Volkswagen study was production, because I still have to review the notes, but the one for Audi was actually quite interesting because it was for managing paint defects.
>> (George Gilbert) For paint--
>> Paint defects.
>> (George Gilbert) Oh.
>> So what they were doing, they were essentially recording the environmental conditions that they were painting the cars in, basically the entire pipeline--
>> To predict when there would be imperfections.
>> (George Chow) Yes.
>> Because paint is an extremely high-value sort of step in the assembly process.
>> Yes, what they are trying to do is to essentially make a connection between downstream defects, like future defects, and somewhat trying to pinpoint the causes upstream. So the idea is that if they record all the environmental conditions early on, they could turn around and hopefully figure it out later on.
>> Okay, this sounds really, really concrete. So what are some of the surprising environmental variables that they're tracking, and then what's the technology that they're using to build the model and then anticipate if there's a problem?
>> I think the surprising findings they mentioned were actually, I think it was humidity or fan speed, if I recall, at the time when the paint was being applied, because essentially, paint has to be... Paint is very sensitive to the conditions in which it is being applied to the body. So my recollection is that one of the findings was that there was a narrow window during which the paint conditions were, like, ideal, in terms of having the least amount of defects.
>> So, had they built a digital twin style model, where it's like a digital replica of some aspects of the car, or was it more of a predictive model that had telemetry coming at it, and when it's outside of certain bounds they know they're going to have defects downstream?
>> I think they're still working on the predictive model, or actually the model is still being built, because they are essentially trying to build that model to figure out how they should be tuning the production pipeline.
>> Got it, so this is sort of still in the development phase?
>> (George Chow) Yeah, yeah
>> And can you tell us, did they talk about the technologies that they're using?
>> I remember the... It's a little hazy now, because after a couple weeks of conferences I don't remember the specifics, because I was counting on the recordings to come out in a couple weeks' time. So I'll definitely share that. It's a case study to keep an eye on.
>> So tell us, were there other ones where this use of real-time or near real-time data had some applications that we couldn't do before, because we now can do things with very low latency?
>> I think that's the one that I was looking forward to with Ford. That was the session just earlier, I think about an hour ago. The session actually consisted of a demo that was being done live, you know. It was being streamed to us, where they were showcasing the data that was coming off a car that's been rigged up.
>> So what data were they tracking, and what were they trying to anticipate here?
>> They didn't give enough detail, but it was basically data coming off of the CAN bus of the car, so if anybody is familiar with the--
>> Oh that's right, you're a car guru, and you and I compare, well, our latest favorite is the Porsche Macan
>> Yes, yes.
>> SUV, okay.
>> But yeah, they were looking at streaming the performance data of the car as well as the location data.
>> Okay, and... Oh, this sounds more like a test case, like can we get telemetry data that might be good for insurance or for...
>> Well, they've built out the system enough using the Lambda Architecture with Kafka, so they were actually consuming the data in real time, and the demo was actually exactly seeing the data being ingested and being acted on. So in their case they were doing a simplistic visualization of just placing the car on Google Maps so you can basically follow the car around.
>> Okay so, what were the technical components in the car, and then, how much data were they sending to some, or where was the data being sent to, or how much of the data?
>> The data was actually sent, streamed, all the way into Ford's own data centers. So they were using NiFi with all the right proxy--
>> (George Gilbert) NiFi being from Hortonworks there.
>> Yeah, yeah
>> The Hortonworks DataFlow, okay
>> Yeah, with all the appropriate proxies and firewalls to bring it all the way into a secure environment.
>> Wow
>> So it was quite impressive from the point of view of, it was live data coming off of the 4G modem, well, actually being uploaded through the 4G modem in the car.
>> Wow, okay, did they say how much compute and storage they needed in the device, in this case the car?
>> I think they were using a very lightweight platform. They were streaming apparently from a Raspberry Pi.
>> (George Gilbert) Oh, interesting.
>> But they were very guarded about what was inside the data center because, you know, for competitive reasons, they couldn't share much about how big or how large a scale they could operate at.
>> Okay, so Simba has been doing ODBC and JDBC drivers to standard APIs, to databases, for a long time. That was all about, that was an era where either it was interactive or batch. So, how is streaming, sort of big picture, going to change the way applications are built?
>> Well, one way to think about streaming is that if you look at many of these APIs into these systems, Spark is a good example, where they're trying to harmonize streaming and batch, or rather, to take away the need to deal with it as a streaming system as opposed to a batch system, because it's obviously much easier to think about and reason about your system when it is traditional, like in the traditional batch model. So, the way that I see it also happening is that streaming systems will, you could say, adapt, will actually become easier to build, and everyone is trying to make it easier to build, so that you don't have to think about and reason about it as a streaming system.
>> Okay, so this is really important. But they have to make a trade-off if they do it that way. So there's the desire for leveraging skill sets, which were all batch-oriented, and then, presumably, SQL, which is a data manipulation language everyone's comfortable with, but then, if you're doing it batch-oriented, you have a portion of time where you're not sure you have the final answer. And I assume if you were in a streaming-first solution, you would explicitly know whether you have all the data or don't, as opposed to late arriving stuff that might come later.
>> Yes, but what I'm referring to is actually the programming model. All I'm saying is that more and more people will want streaming applications, but more and more people need to develop them quickly, without having to build them in a very specialized fashion. So when you look at, let's say, the example of Spark, when they focus on structured streaming, the whole idea is to make it possible for you to develop the app without having to write it from scratch. And the comment about SQL is actually exactly on point, because the idea is that you want to work with the data, you could say, not mindful, not with a lot of work to account for the fact that it is actually streaming data that could arrive out of order even. So the whole idea is that if you can build applications in a more consistent way, irrespective of whether it's batch or streaming, you're better off.
>> So, last week even though we didn't have a major release of Spark, we had like a point release, or a discussion about the 2.2 release, and that's of course very relevant for our big data ecosystem since Spark has become the compute engine for it. Explain the significance where the reaction time, the latency for Spark, went down from several hundred milliseconds to one millisecond or below. What are the implications for the programming model and for the applications you can build with it?
>> Actually, hitting that new threshold, the millisecond, is a very important milestone, because when you look at a typical scenario, let's say with AdTech where you're serving ads, you really only have, maybe, on the order of about 100 or maybe 200 milliseconds max to actually turn around.
>> And that max includes a bunch of things, not just the calculation.
>> Yeah, and that, let's say 100 milliseconds, includes transfer time, which means that in your real budget, you only have allowances for maybe under 10 to 20 milliseconds to compute and do any work. So being able to actually have a system that delivers millisecond-level performance actually gives you the ability to use Spark right now in that scenario.
>> Okay, so in other words, now they can claim, even if it's not per event processing, they can claim that they can react so fast that it's as good as per event processing, is that fair to say?
>> Yes, yes, that's very fair.
>> Okay, that's significant. So, what type... How would you see applications changing? We've only got another minute or two, but how do you see applications changing now that Spark has been designed for people that have traditional, batch-oriented skills, but who can now learn how to do streaming, real-time applications without learning anything really new. How will that change what we see next year?
>> Well, I think we should be careful to not pigeonhole Spark as something built for batch, because I think the idea is that, you could say, the originators of Spark know that it's all about the ease of development, and it's the ease of reasoning about your system. It's not the fact that the technology is built for batch, so the fact that you could use your knowledge and experience and an API that actually is familiar, you should leverage it for something that you can build for streaming. That's the power, you could say. That's the strength of what the Spark project has taken on.
>> Okay, we're going to have to end it on that note. There's so much more to go through. George, you will be back as a favorite guest on the show. There will be many more interviews to come.
>> Thank you.
>> With that, this is George Gilbert. We are at DataWorks 2017 in San Jose. We had a great day today. We learned a lot from Rob Bearden and Rob Thomas up front about the IBM deal. We had Scott Gnau, CTO of Hortonworks, on several times, and we've come away with an appreciation for a partnership now between IBM and Hortonworks that can take the two of them into a set of use cases that neither one on its own could really handle before. So today was a significant day. Tune in tomorrow, we have another great set of guests. Keynotes start at nine, and our guests will be on starting at 11. So with that, this is George Gilbert, signing out. Have a good night. (energetic, echoing chord and drum beat)
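As a hedged illustration of the pattern Chow describes in this segment, vehicle-style telemetry arriving on Kafka and aggregated over short event-time windows with Spark structured streaming, here is a minimal sketch. The broker, topic, and telemetry schema are invented, the job assumes the spark-sql-kafka-0-10 package is on the classpath, and the sub-second continuous mode discussed above would swap in a different trigger.

```python
# Minimal structured-streaming sketch under assumed broker/topic/schema names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("telemetry-stream").getOrCreate()

schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("speed_kph", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker.example.com:9092")  # hypothetical
       .option("subscribe", "vehicle-telemetry")                      # hypothetical
       .load())

telemetry = (raw.select(F.from_json(F.col("value").cast("string"), schema)
                         .alias("t"))
                .select("t.*"))

avg_speed = (telemetry
             .withWatermark("event_time", "30 seconds")
             .groupBy(F.window("event_time", "10 seconds"), "vehicle_id")
             .agg(F.avg("speed_kph").alias("avg_speed")))

query = (avg_speed.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="1 second")  # micro-batch; continuous mode differs
         .start())
query.awaitTermination()
```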
Holden Karau, IBM Big Data SV 17 #BigDataSV #theCUBE
>> Announcer: Big Data Silicon Valley 2017.
>> Hey, welcome back, everybody, Jeff Frick here with The Cube. We are live at the historic Pagoda Lounge in San Jose for Big Data SV, which is associated with Strata Hadoop World across the street, as well as Big Data week, so everything big data is happening in San Jose. We're happy to be here, love the new venue, if you're around, stop by, back of the Fairmont, Pagoda Lounge. We're excited to be joined in this next segment by, who's now become a regular, any time we're at a Big Data event, a Spark event, Holden always stops by. Holden Karau, she's the principal software engineer at IBM. Holden, great to see you.
>> Thank you, it's wonderful to be back yet again.
>> Absolutely, so the big data meme just keeps rolling, Google Cloud Next was last week, a lot of talk about AI and ML, and of course you're very involved in Spark, so what are you excited about these days? What are you, I'm sure you've got a couple presentations going on across the street.
>> Yeah, so my two presentations this week, oh wow, I should remember them. So the one that I'm doing today is with my co-worker Seth Hendrickson, also at IBM, and we're going to be focused on how to use structured streaming for machine learning. And sort of, I think that's really interesting, because streaming machine learning is something a lot of people seem to want to do but aren't yet doing in production, so it's always fun to talk to people before they've built their systems. And then tomorrow I'm going to be talking with Joey on how to debug Spark, which is something that, you know, a lot of people ask questions about, but I tend to not talk about, because it tends to scare people away, and so I try to keep the happy going.
>> Jeff: Bugs are never fun.
>> No, no, never fun.
>> Just picking up on that structured streaming and machine learning, so there's this issue of, as we move more and more towards the industrial internet of things, like having to process events as they come in, make a decision. There's a range of latency that's required. Where does structured streaming and ML fit today, and where might that go?
>> So structured streaming, for today, latency wise, is probably not something I would use for something like that right now. It's in the, like, sub-second range. Which is nice, but it's not what you want for, like, live serving of decisions for your car, right? That's just not going to be feasible. But I think it certainly has the potential to get a lot faster. We've seen a lot of renewed interest in MLlib-local, which is really about making it so that we can take the models that we've trained in Spark and really push them out to the edge and sort of serve them at the edge, and apply our models on end devices. So I'm really excited about where that's going. To be fair, part of my excitement is someone else is doing that work, so I'm very excited that they're doing this work for me.
>> Let me clarify on that, just to make sure I understand. So there's a lot of overhead in Spark, because it runs on a cluster, because you have an optimizer, because you have the high availability or the resilience, and so you're saying we can preserve the predict and maybe serve part and carve out all the other overhead for running in a very small environment.
>> Right, yeah. So I think for a lot of these IOT devices and stuff like that it actually makes a lot more sense to do the predictions on the device itself, right.
These models generally are megabytes in size, and we don't need a cluster to do predictions on these models, right. We really need the cluster to train them, but I think for a lot of cases, pushing the prediction out to the edge node is actually a pretty reasonable use case. And so I'm really excited that we've got some work going on there.
>> Taking that one step further, we've talked to a bunch of people, both like at GE, and at their Minds and Machines show, and IBM's Genius of Things, where you want to be able to train the models up in the cloud where you're getting data from all the different devices and then push the retrained model out to the edge. Can that happen in Spark, or do we have to have something else orchestrating all that?
>> So actually pushing the model out isn't something that I would do in Spark itself, I think that's better served by other tools. Spark is not really well suited to large amounts of internet traffic, right. But it's really well suited to the training, and I think with MLlib-local it'll essentially, we'll be able to provide both sides of it, and the copy part will be left up to whoever it is that's doing their work, right, because, like, if you're copying over a cell network you need to do something very different than if you're broadcasting over a terrestrial XM or something like that, and you need to do something very different for satellite.
>> If you're at the edge on a device, would you be actually running, like you were saying earlier, structured streaming, with the prediction?
>> Right, I don't think you would use structured streaming per se on the edge device, but essentially there would be a lot of code shared between structured streaming and the code that you'd be using on the edge device. And it's being factored out now so that we can have this code sharing in Spark machine learning. And you would use structured streaming maybe on the training side, and then on the serving side you would use your custom local code.
>> Okay, so tell us a little more about Spark ML today and how we can democratize machine learning, you know, for a bigger audience.
>> Right, I think machine learning is great, but right now you really need a strong statistical background to really be able to apply it effectively. And we probably can't get rid of that for all problems, but I think for a lot of problems, doing things like hyperparameter tuning can actually give really powerful tools to just, like, regular engineering folks who, they're smart, but maybe they don't have a strong machine learning background. And Spark's ML pipelines make it really easy to sort of construct multiple stages, and then just be like, okay, I don't know what these parameters should be, I want you to do a search over what these different parameters could be for me, and it makes it really easy to do this as just a regular engineer with less of an ML background.
>> Would that be like, just for those of us who don't know what hyperparameter tuning is, that would be the knobs, the variables?
>> Yeah, it's going to spin the knobs on, like, our regularization parameter on our regression, and it can also spin some knobs on maybe the n-gram sizes that we're using on the inputs to something else, right. And it can compare how these knobs sort of interact with each other, because often you can tune one knob, but you actually have six different knobs that you want to tune, and you don't know, if you just explore each one individually, you're not going to find the best setting for them working together.
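A hedged sketch of the knob-spinning Karau describes: a Spark ML pipeline plus a grid search that tunes the regularization parameter and the n-gram size together rather than one knob at a time. The input DataFrame, its path, and its columns ("text", "label") are hypothetical.

```python
# Minimal Spark ML pipeline with CrossValidator searching two knobs jointly.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("pipeline-tuning").getOrCreate()
train = spark.read.parquet("hdfs:///data/labeled_text")  # hypothetical path

tokenizer = Tokenizer(inputCol="text", outputCol="words")
ngram = NGram(inputCol="words", outputCol="ngrams")
hashing_tf = HashingTF(inputCol="ngrams", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, ngram, hashing_tf, lr])

# Search n-gram size and regularization strength together, not one at a time.
grid = (ParamGridBuilder()
        .addGrid(ngram.n, [1, 2, 3])
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

model = cv.fit(train)      # evaluates every knob combination via cross-validation
best = model.bestModel     # winning pipeline, ready for transform() on new data
```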
>> So this would make it easier for, as you're saying, someone who's not a data scientist to set up a pipeline that lets you predict.
>> I think so, very much. I think it does a lot of the, brings a lot of the benefits from sort of the SciPy world to the big data world. And SciPy is really wonderful about making machine learning really accessible, but it's just not ready for big data, and I think this does a good job of bringing these same concepts, if not the code, but the same concepts, to big data.
>> The SciPy, if I understand, is it a notebook that would run essentially on one machine?
>> SciPy can be put in a notebook environment, and generally it would run on, yeah, a single machine.
>> And so to make that sit on Spark means that you could then run it on a cluster--
>> So this isn't actually taking SciPy and distributing it, this is just, like, stealing the good concepts from SciPy and making them available for big data people. Because SciPy's done a really good job of making a very intuitive machine learning interface.
>> So just to put a fine sort of qualifier on one thing, if you're doing the internet of things and you have Spark at the edge and you're running the model there, it's the programming model, so structured streaming is one way of programming Spark, but if you don't have structured streaming at the edge, would you just be using the core batch Spark programming model?
>> So at the edge you'd just be using, you wouldn't even be using batch, right, because you're trying to predict individual events, right, so you'd just be calling predict with every new event that you're getting in. And you might have a queue mechanism of some type. But essentially if we had this batch, we would be adding additional latency, and I think at the edge we really, the reason we're moving the models to the edge is to avoid the latency.
>> So just to be clear then, is the programming model, so it wouldn't be structured streaming, and we're taking out all the overhead that forced us to use batch with Spark. So the reason I'm trying to clarify is a lot of people had this question for a long time, which is are we going to have a different programming model at the edge from what we have at the center?
>> So there's a lot of really exciting things coming out of IBM. And I'm obviously pretty biased. I spend a lot of time focused on Python support in Spark, and one of the most exciting things is coming from my co-worker Brian, I'm not going to say his last name in case I get it wrong, but Brian is amazing, and he's been working on integrating Arrow with Spark, and this can make it so that it's going to be a lot easier to sort of interoperate between JVM languages and Python and R, so I'm really optimistic about the sort of Python and R interfaces improving a lot in Spark and getting a lot faster as well. And we're also, in addition to the Arrow work, we've got some work around making it a lot easier for people in R and Python to get started. The R stuff is mostly actually the Microsoft people, thanks Felix, you're awesome. I don't actually know which camera I should have done that to but that's okay. >> I think you got it! >> But Felix is amazing, and the other people working on R are too. But I think we've both been pursuing sort of making it so that people who are in the R or Python spaces can just use like Pit Install, Conda Install, or whatever tool it is they're used to working with, to just bring Spark into their machine really easily, just like they would sort of any other software package that they're using. Because right now, for someone getting started in Spark, if you're in the Java space it's pretty easy, but if you're in R or Python you have to do sort of a lot of weird setup work, and it's worth it, but like if we can get rid of that friction, I think we can get a lot more people in these communities using Spark. >> Let me see, just as a scenario, the R server is getting fairly well integrated into Sequel server, so would it be, would you be able to use R as the language with a Spark execution engine to somehow integrate it into Sequel server as an execution engine for doing the machine learning and predicting? >> You definitely, well I shouldn't say definitely, you probably could do that. I don't necessarily know if that's a good idea, but that's the kind of stuff that this would enable, right, it'll make it so that people that are making tools in R or Python can just use Spark as another library, right, and it doesn't have to be this really special setup. It can just be this library and they point out the cluster and they can do whatever work it wants to do. That being said, the Sequel server R integration, if you find yourself using that to do like distributed computing, you should probably take a step back and like rethink what you're doing. >> George: Because it's not really scale out. >> It's not really set up for that. And you might be better off doing this with like, connecting your Spark cluster to your Sequel server instance using like JDBC or a special driver and doing it that way, but you definitely could do it in another inverted sort of way. >> So last question from me, if you look out a couple years, how will we make machine learning accessible to a bigger and bigger audience? And I know you touched on the tuning of the knobs, hyperparameter tuning, what will it look like ultimately? >> I think ML pipelines are probably what things are going to end up looking like. 
But I think the other part that we'll sort of see is we'll see a lot more examples of how to work with certain kinds of data, because right now, like, I know what I need to do when I'm ingesting some textural data, but I know that because I spent like a week trying to figure out what the hell I was doing once, right. And I didn't bother to write it down. And it looks like no one else bothered to write it down. So really I think we'll see a lot of tools that look very similar to the tools we have today, they'll have more options and they'll be a bit easier to use, but I think the main thing that we're really lacking right now is good documentation and sort of good books and just good resources for people to figure out how to use these tools. Now of course, I mean, I'm biased, because I work on these tools, so I'm like, yeah, they're pretty great. So there might be other people who are like, Holden, no, you're wrong, we need to rethink everything. But I think this is, we can go very far with the pipeline concept. >> And then that's good, right? The democratization of these things opens it up to more people, you get more creative people solving more different problems, that makes the whole thing go. >> You can like install Spark easily, you can, you know, set up an ML pipeline, you can train your model, you can start doing predictions, you can, people that haven't been able to do machine learning at scale can get started super easily, and build a recommendation system for their small little online shop and be like, hey, you bought this, you might also want to buy Boosh, he's really cute, but you can't have this one. No no no, not this one. >> Such a tease! >> Holden: I'm sorry, I'm sorry. >> Well Holden, that will, we'll say goodbye for now, I'm sure we will see you in June in San Francisco at the Spark Summit, and look forward to the update. >> Holden: I look forward to chatting with you then. >> Absolutely, and break a leg this afternoon at your presentation. >> Holden: Thank you. >> She's Holden Karau, I'm Jeff Frick, he's George Gilbert, you're watching The Cube, we're at Big Data SV, thanks for watching. (upbeat music)
Bryan Smith, Rocket Software - IBM Machine Learning Launch - #IBMML - #theCUBE
>> Announcer: Live from New York, it's theCUBE, covering the IBM Machine Learning Launch Event, brought to you by IBM. Now, here are your hosts, Dave Vellante and Stu Miniman.
>> Welcome back to New York City, everybody. We're here at the Waldorf Astoria covering the IBM Machine Learning Launch Event, bringing machine learning to the IBM Z. Bryan Smith is here, he's the vice president of R&D and the CTO of Rocket Software, powering the path to digital transformation. Bryan, welcome to theCUBE, thanks for coming on.
>> Thanks for having me.
>> So, Rocket Software, Waltham, Mass. based, close to where we are, but a lot of people don't know about Rocket, so pretty large company, give us the background.
>> It's been around for, this'll be our 27th year. Private company, we've been a partner of IBM's for the last 23 years. Almost all of that is in the mainframe space, or we focused on the mainframe space, I'll say. We have 1,300 employees, we call ourselves Rocketeers. It's spread around the world. We're really an R&D focused company. More than half the company is engineering, and it's spread across the world on every continent and most major countries.
>> You're essentially OEM-ing your tools as it were. Is that right, no direct sales force?
>> About half; there are different lenses to look at this, but about half of our go-to-market is through IBM with IBM-labeled, IBM-branded products. We've always been, for that side of the products, we've always been the R&D behind the products. The partnership, though, has really grown. It's more than just an R&D partnership now, now we're doing co-marketing, we're even doing some joint selling to serve IBM mainframe customers. The partnership has really grown over these last 23 years from just being the guys who write the code to doing much more.
>> Okay, so how do you fit in this announcement? Machine learning on Z, where does Rocket fit?
>> Part of the announcement today is a very important piece of technology that we developed. We call it data virtualization. Data virtualization is really enabling customers to open their mainframe to allow the data to be used in ways that it was never designed to be used. You might have these data structures that were designed 10, 20, even 30 years ago for a very specific application, but today they want to use that data in a very different way, and so the traditional path is to take that data and copy it, to ETL it someplace else so they can get some new use or build some new application. What data virtualization allows you to do is to leave that data in place but access it using APIs that developers want to use today. They want to use JSON access, for example, or they want to use SQL access. But they want to be able to do things like join across IMS, DB2, and VSAM all with a single query using an SQL statement. We can do that across relational databases and non-relational databases. It gets us out of this mode of having to copy data into some other data store through this ETL process; access the data in place, we call it moving the applications or the analytics to the data versus moving the data to the analytics or to the applications.
>> Okay, so in this specific case, and I have said several times today, as Stu has heard me, two years ago IBM had a big theme around the z13 bringing analytics and transactions together, and this sort of extends that. Great, I've got this transaction data that lives behind a firewall somewhere. Why the mainframe, why now?
>> Well, I would pull back to where I said where we see more companies and organizations wanting to move applications and analytics closer to the data. The data in many of these large companies, that core business-critical data, is on the mainframe, and so being able to do more real time analytics without having to look at old data is really important. There's this term data gravity. I love the visual that presents in my mind: you have these different masses, these different planets if you will, and the biggest, most massive planet in that solar system really is the data, and so it's pulling the smaller satellites, if you will, into this planet or this star by way of gravity, because data is, data's a new currency, data is what the companies are running on. We're helping in this announcement with being able to unlock and open up all mainframe data sources, even some non-mainframe data sources, and using things like Spark that's running on the platform, that's running on z/OS, to access that data directly without having to write any special programming or any special code to get to all their data.
>> And the preferred place to run all that data is on the mainframe obviously if you're a mainframe customer. One of the questions I guess people have is, okay, I get that, it's the transaction data that I'm getting access to, but if I'm bringing transaction and analytic data together, a lot of times that analytic data might be in social media, it might be somewhere else not on the mainframe. How do you envision customers dealing with that? Do you have tooling to help them do that?
>> We do, so this data virtualization solution that I'm talking about is one that is mainframe resident, but it can also access other data sources. It can access DB2 on Linux, UNIX, and Windows, it can access Informix, it can access Cloudant, it can access Hadoop through IBM's BigInsights. Other feeds like Twitter, like other social media, it can pull that in. The case where you'd want to do that is where you're trying to take that data and integrate it with a massive amount of mainframe data. It's going to be much more highly performant by pulling this other small amount of data in, next to that core business data.
>> I get the performance and I get the security of the mainframe, I like those two things, but what about the economics?
>> Couple of things. One, IBM, when they ported Spark to z/OS, they did it the right way. They leveraged the architecture; it wasn't just a simple port of recompiling a bunch of open source code from Apache, it was rewriting it to be highly performant on the Z architecture, taking advantage of specialty engines. We've done the same with the data virtualization component that goes along with that Spark on z/OS offering; it also leverages the architecture. We actually have different binaries that we load depending on which architecture of the machine we're running on, whether it be a z9, an EC12, or the big granddaddy of a z13.
>> Bryan, can you speak to the developers? I think about, you're talking about all this mobile and Spark and everything like that. There's got to be certain developers that are like, "Oh my gosh, there's mainframe stuff. I don't know anything about that." How do you help bridge that gap between where it lives and the tools that they're using?
>> The best example is talking about embracing this API economy. And so, developers really don't care where the stuff is at, they just want it to be easy to get to.
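A hedged sketch of the access-in-place pattern Smith describes, as it might look from a client application: a single ANSI SQL statement joining data that physically lives in DB2 and VSAM, issued through an ODBC DSN exposed by the data virtualization layer. The DSN, schema, and table names are invented for illustration; real object names would come from how the virtualization layer maps the underlying data sets.

```python
# Illustrative only: one SQL join across virtualized DB2 and VSAM objects,
# issued through an assumed ODBC data source for the data virtualization server.
import pyodbc

conn = pyodbc.connect("DSN=MAINFRAME_DV;UID=appuser;PWD=secret")  # hypothetical DSN

sql = """
    SELECT c.customer_id,
           c.customer_name,
           SUM(t.amount) AS total_spend
    FROM   dv.db2_customers     AS c   -- virtualized DB2 table (assumed name)
    JOIN   dv.vsam_transactions AS t   -- virtualized VSAM data set (assumed name)
           ON t.customer_id = c.customer_id
    GROUP  BY c.customer_id, c.customer_name
"""

cursor = conn.cursor()
for row in cursor.execute(sql):
    print(row.customer_id, row.customer_name, row.total_spend)

conn.close()
```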
They don't have to code up some specific interface or language to get to different types of data, right? IBM's done a great job with the z/OS Connect in opening up the mainframe to the API economy with ReSTful interfaces, and so with z/OS Connect combined with Rocket data virtualization, you can come through that z/OS Connect same path using all those same ReSTful interfaces pushing those APIs out to tools like Swagger, which the developers want to use, and not only can you get to the applications through z/OS Connect, but we're a service provider to z/OS Connect allowing them to also get to every piece of data using those same ReSTful APIs. >> If I heard you correctly, the developer doesn't need to even worry about that it's on mainframe or speak mainframe or anything like that, right? >> The goal is that they never do. That they simply see in their tool-set, again like Swagger, that they have data as well as different services that they can invoke using these very straightforward, simple ReSTful APIs. >> Can you speak to the customers you've talked to? You know, there's certain people out in the industry, I've had this conversation for a few years at IBM shows is there's some part of the market that are like, oh, well, the mainframe is this dusty old box sitting in a corner with nothing new, and my experience has been the containers and cool streaming and everything like that, oh well, you know, mainframe did virtualization and Linux and all these things really early, decades ago and is keeping up with a lot of these trends with these new type of technologies. What do you find in the customers that, how much are they driving forward on new technologies, looking for that new technology and being able to leverage the assets that they have? >> You asked a lot of questions there. The types of customers certainly financial and insurance are the big two, but that doesn't mean that we're limited and not going after retail and helping governments and manufacturing customers as well. What I find is talking with them that there's the folks who get it and the folks who don't, and the folks who get it are the ones who are saying, "Well, I want to be able "to embrace these new technologies," and they're taking things like open source, they're looking at Spark, for example, they're looking at Anaconda. Last week, we just announced at the Anaconda Conference, we stepped on stage with Continuum, IBM, and we, Rocket, stood up there talking about this partnership that we formed to create this ecosystem because the development world changes very, very rapidly. For a while, all the rage was JDBC, or all the rage was component broker, and so today it's Spark and Anaconda are really in the forefront of developers' minds. We're constantly moving to keep up with developers because that's where the action's happening. Again, they don't care where the data is housed as long as you can open that up. We've been playing with this concept that came up from some research firm called two-speed IT where you have maybe your core business that has been running for years, and it's designed to really be slow-moving, very high quality, it keeps everything running today, but they want to embrace some of their new technologies, they want to be able to roll out a brand-new app, and they want to be able to update that multiple times a week. And so, this two-speed IT says, you're kind of breaking 'em off into two separate teams. 
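Since the exchange above leans on z/OS Connect publishing applications and data as ReSTful APIs that show up in tools like Swagger, here is a minimal sketch of what that looks like from the developer's side. The host, path, credentials, and JSON fields are assumptions for illustration, not a documented z/OS Connect endpoint.

```python
# Hypothetical sketch of calling a ReST endpoint published through
# z/OS Connect. The URL, path, and response fields are assumptions.
import requests

BASE_URL = "https://zosconnect.example.com:9443"   # assumed gateway host

# A GET against an assumed service that fronts mainframe customer data.
resp = requests.get(
    f"{BASE_URL}/customers/12345",
    headers={"Accept": "application/json"},
    auth=("api_user", "api_password"),             # assumed credentials
)
resp.raise_for_status()

# The developer just sees JSON -- no 3270 screens, no COBOL copybooks.
customer = resp.json()
print(customer.get("name"), customer.get("accountBalance"))
```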
You don't have to take your existing infrastructure team and say, "You must embrace every Agile "and every DevOps type of methodology." What we're seeing customers be successful with is this two-speed IT where you can fracture these two, and now you need to create some nice integration between those two teams, so things like data virtualization really help with that. It opens up and allows the development teams to very quickly access those assets on the mainframe in this case while allowing those developers to very quickly crank out an application where quality is not that important, where being very quick to respond and doing lots of AB testing with customers is really critical. >> Waterfall still has its place. As a company that predominately, or maybe even exclusively is involved in mainframe, I'm struck by, it must've been 2008, 2009, Paul Maritz comes in and he says VMWare our vision is to build the software mainframe. And of course the world said, "Ah, that's, mainframe's dead," we've been hearing that forever. In many respects, I accredit the VMWare, they built sort of a form of software mainframe, but now you hear a lot of talk, Stu, about going back to bare metal. You don't hear that talk on the mainframe. Everything's virtualized, right, so it's kind of interesting to see, and IBM uses the language of private cloud. The mainframe's, we're joking, the original private cloud. My question is you're strategy as a company has been always focused on the mainframe and going forward I presume it's going to continue to do that. What's your outlook for that platform? >> We're not exclusively by the mainframe, by the way. We're not, we have a good mix. >> Okay, it's overstating that, then. It's half and half or whatever. You don't talk about it, 'cause you're a private company. >> Maybe a little more than half is mainframe-focused. >> Dave: Significant. >> It is significant. >> You've got a large of proportion of the company on mainframe, z/OS. >> So we're bullish on the mainframe. We continue to invest more every year. We invest, we increase our investment every year, and so in a software company, your investment is primarily people. We increase that by double digits every year. We have license revenue increases in the double digits every year. I don't know many other mainframe-based software companies that have that. But I think that comes back to the partnership that we have with IBM because we are more than just a technology partner. We work on strategic projects with IBM. IBM will oftentimes stand up and say Rocket is a strategic partner that works with us on hard problem-solving customers issues every day. We're bullish, we're investing more all the time. We're not backing away, we're not decreasing our interest or our bets on the mainframe. If anything, we're increasing them at a faster rate than we have in the past 10 years. >> And this trend of bringing analytics and transactions together is a huge mega-trend, I mean, why not do it on the mainframe? If the economics are there, which you're arguing that in many use cases they are, because of the value component as well, then the future looks pretty reasonable, wouldn't you say? >> I'd say it's very, very bright. At the Anaconda Conference last week, I was coming up with an analogy for these folks. It's just a bunch of data scientists, right, and during most of the breaks and the receptions, they were just asking questions, "Well, what is a mainframe? "I didn't know that we still had 'em, "and what do they do?" 
So it was fun to educate them on that. But I was trying to show them an analogy with data warehousing where, say that in the mid-'90s it was perfectly acceptable to have a separate data warehouse separate from your transaction system. You would copy all this data over into the data warehouse. That was the model, right, and then slowly it became more important that the analytics or the BI against that data warehouse was looking at more real time data. So then it became more efficiencies and how do we replicate this faster, and how do we get closer to, not looking at week-old data but day-old data? And so, I explained that to them and said the days of being able to do analytics against old data that's copied are going away. ETL, we're also bullish to say that ETL is dead. ETL's future is very bleak. There's no place for it. It had its time, but now it's done because with data virtualization you can access that data in place. I was telling these folks as they're talking about, these data scientists, as they're talking about how they look at their models, their first step is always ETL. And so I told them this story, I said ETL is dead, and they just look at me kind of strange. >> Dave: Now the first step is load. >> Yes, there you go, right, load it in there. But having access from these platforms directly to that data, you don't have to worry about any type of a delay. >> What you described, though, is still common architecture where you've got, let's say, a Z mainframe, it's got an InfiniBand pipe to some exit data warehouse or something like that, and so, IBM's vision was, okay, we can collapse that, we can simplify that, consolidate it. SAP with HANA has a similar vision, we can do that. I'm sure Oracle's got their vision. What gives you confidence in IBM's approach and legs going forward? >> Probably due to the advances that we see in z/OS itself where handling mixed workloads, which it's just been doing for many of the 50 years that it's been around, being able to prioritize different workloads, not only just at the CPU dispatching, but also at the memory usage, also at the IO, all the way down through the channel to the actual device. You don't see other operating systems that have that level of granularity for managing mixed workloads. >> In the security component, that's what to me is unique about this so-called private cloud, and I say, I was using that software mainframe example from VMWare in the past, and it got a good portion of the way there, but it couldn't get that last mile, which is, any workload, any application with the performance and security that you would expect. It's just never quite got there. I don't know if the pendulum is swinging, I don't know if that's the accurate way to say it, but it's certainly stabilized, wouldn't you say? >> There's certainly new eyes being opened every day to saying, wait a minute, I could do something different here. Muscle memory doesn't have to guide me in doing business the way I have been doing it before, and that's this muscle memory I'm talking about of this ETL piece. >> Right, well, and a large number of workloads in mainframe are running Linux, right, you got Anaconda, Spark, all these modern tools. The question you asked about developers was right on. If it's independent or transparent to developers, then who cares, that's the key. That's the key lever this day and age is the developer community. You know it well. >> That's right. Give 'em what they want. They're the customers, they're the infrastructure that's being built. 
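To make the "access the data in place instead of ETL-ing a copy" argument above concrete, here is a rough sketch of analytics running against a mainframe-fronting JDBC source from Spark. The JDBC URL, credentials, and table name are assumptions; only the pattern — push the work to where the data lives — comes from the discussion.

```python
# Hypothetical sketch: analytics against mainframe data in place via a
# JDBC source, instead of copying it into a separate warehouse first.
# The JDBC URL, table, and credentials are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-in-place").getOrCreate()

transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:dv://mainframe.example.com:1200/DVS")  # assumed URL
    .option("dbtable", "vsam_transactions")                     # assumed table
    .option("user", "analyst")
    .option("password", "secret")
    .load()
)

# The aggregation runs against live data; only the results come back,
# so there is no stale, day-old extract to reason about.
daily = transactions.groupBy("txn_date").sum("amount")
daily.show()
```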
>> Bryan, we'll give you the last word, bumper sticker on the event, Rocket Software, your partnership, whatever you choose. >> We're excited to be here, it's an exciting day to talk about machine learning on z/OS. I say we're bullish on the mainframe, we are, we're especially bullish on z/OS, and that's what this event today is all about. That's where the data is, that's where we need the analytics running, that's where we need the machine learning running, that's where we need to get the developers to access the data live. >> Excellent, Bryan, thanks very much for coming to theCUBE. >> Bryan: Thank you. >> And keep right there, everybody. We'll be back with our next guest. This is theCUBE, we're live from New York City. Be right back. (electronic keyboard music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
IBM | ORGANIZATION | 0.99+ |
Bryan | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Paul Maritz | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Stu Miniman | PERSON | 0.99+ |
Rocket Software | ORGANIZATION | 0.99+ |
50 years | QUANTITY | 0.99+ |
2009 | DATE | 0.99+ |
New York City | LOCATION | 0.99+ |
2008 | DATE | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
27th year | QUANTITY | 0.99+ |
first step | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
JDBC | ORGANIZATION | 0.99+ |
1,300 employees | QUANTITY | 0.99+ |
Continuum | ORGANIZATION | 0.99+ |
Last week | DATE | 0.99+ |
New York | LOCATION | 0.99+ |
Anaconda | ORGANIZATION | 0.99+ |
two things | QUANTITY | 0.99+ |
mid-'90s | DATE | 0.99+ |
Spark | TITLE | 0.99+ |
Rocket | ORGANIZATION | 0.99+ |
z/OS Connect | TITLE | 0.99+ |
10 | DATE | 0.99+ |
two teams | QUANTITY | 0.99+ |
Linux | TITLE | 0.99+ |
today | DATE | 0.99+ |
two-speed | QUANTITY | 0.99+ |
two separate teams | QUANTITY | 0.99+ |
Z. Bryan Smith | PERSON | 0.99+ |
SQL | TITLE | 0.99+ |
Bryan Smith | PERSON | 0.99+ |
z/OS | TITLE | 0.98+ |
two years ago | DATE | 0.98+ |
ReSTful | TITLE | 0.98+ |
Swagger | TITLE | 0.98+ |
last week | DATE | 0.98+ |
decades ago | DATE | 0.98+ |
DB2 | TITLE | 0.98+ |
HANA | TITLE | 0.97+ |
IBM Machine Learning Launch Event | EVENT | 0.97+ |
Anaconda Conference | EVENT | 0.97+ |
Hadoop | TITLE | 0.97+ |
Spark | ORGANIZATION | 0.97+ |
One | QUANTITY | 0.97+ |
Informix | TITLE | 0.96+ |
VMWare | ORGANIZATION | 0.96+ |
More than half | QUANTITY | 0.95+ |
z13 | COMMERCIAL_ITEM | 0.95+ |
JSON | TITLE | 0.95+ |
Jack Norris - Hadoop Summit 2014 - theCUBE - #HadoopSummit
>>theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks — we do Hadoop — and headline sponsor WANdisco — we make Hadoop invincible. >>Okay, welcome back everyone. We're live here in Silicon Valley in San Jose. This is Hadoop Summit. This is SiliconANGLE and Wikibon's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, joined by my cohost Jeff Kelly, top big data analyst in the community. Our next guest is Jack Norris, CMO of MapR. Security, enterprise grade — that's the buzz of the show, and it was the buzz of OpenStack Summit, another open source show. And here this year you're just seeing move after move. Let's talk about a couple of critical issues: enterprise-grade Hadoop. Hortonworks announced a big acquisition — went all in, as they said — and now Cloudera follows suit with their news today. Are you sitting back saying they're catching up to you guys? I mean, how do you look at that? Because you guys have the security stuff nailed down. So how do you feel about that now? >>I think if you look at the Hadoop market, it's definitely moving from a test, experimental phase into a production phase. We've got tremendous customers across verticals that are doing some really interesting production use cases. And we recognized very early on that really meeting the needs of customers required some architectural innovation. So it's combining the open source ecosystem packages with some innovations underneath to really deliver high availability, data protection, disaster recovery features — security is part of that. But if you can't protect the data, if you can't have multi-tenancy and separate workflows across the cluster, then it doesn't matter how secure it is. You know, you need those. >>I've got to ask you a direct question since we're here at Hadoop Summit, because we get this question all the time: SiliconANGLE Wikibon is so successful, but I just don't understand your business model — we're free content and we have some underwriters. So you guys have been very successful, yet people aren't really looking at MapR — you're kind of the quiet leader. How are you doing your business? You're making money. Jeff had some numbers with us that in the Hadoop community about 20% are paying subscriptions. That's unlike your business model. So explain to the folks out there the business model, and specifically the traction, because you have >>Customers. Yeah. We've got over 500 paying customers. We've got at least a $1 million customer in seven different verticals. So we've got breadth and depth, and our business model is simple: we're an enterprise software company that's looking at how to provide the best of open source as well as innovations underneath. >>The most open distribution of Hadoop, but you add that value separately to that, right? So it's not so much that you're proprietary at all, right? >>Okay, let me clarify that. If you look at this exciting ecosystem, Hadoop is fairly early in its life cycle. In a commoditization phase — like Linux, or relational databases with MySQL — open source kind of equates to the whole technology. Here, at the beginning of this life cycle, the early stages of the life cycle, there are some architectural innovations that are really required. If you look at Hadoop, it's an append-only file system relying on Linux, and that really limits the types of operations and the types of use cases that you can do. What MapR's done is provide some deep architectural innovations — a complete read-write file system, integrated data protection with snapshots and mirroring, et cetera. So there's a whole host of capabilities that make it easy to integrate, enterprise-secure, and scale much better. >>Do you think — I feel like you were maybe a little early to the market, in the sense that we heard Merv Adrian in his keynote this morning talk about, you know, it's at about 10 years that you start to get these questions about security and governance, and we're about nine years into Hadoop. Do you feel like maybe you guys were a little early, and now you're at a tipping point where, as more and more deployments get ready to go to production, this is going to become increasingly important? >>I think our timing has been spectacular, because we kind of came out at a time when there were some customers that were really serious about Hadoop. We were able to work closely with them and prove our technology, and now, as the market is just ramping, we're here with all of those features that they need. And the issue is that an incremental improvement to provide those kinds of key features is not really possible if the underlying architecture isn't there — it's hard to provide, you know, online, real-time capabilities on an underlying platform that's append-only. So the HDFS layer, written in Java, relying on the Linux file system, is kind of the weak underbelly, if you will, of the ecosystem. There are a lot of important developments happening — YARN on top of it, a lot of really exciting things that we're actively participating in, including Apache Drill — and on top of a complete read-write file system and an integrated Hadoop database, it just makes it all come to life. >>Yeah, I mean, those things on top are critical, but it's the underlying infrastructure — you know, we asked the Wikibon community about that: what are the things that are really holding you back from Hadoop in production? And the biggest challenges they cited were high availability, backup and recovery, and maintaining performance at scale. Those are the top three, and that's kind of where MapR has been focused, you know, since day one. >>So if you look at a major retailer: 2,000 nodes on MapR, 50 unique applications running on a single cluster, 10,000 jobs a day running on top of that. If you look at the Rubicon Project — they recently went public — a hundred billion ad auctions a day on top of that platform. Beats Music, which just got acquired for $3 billion — basically it's the underlying MapR engine that allowed them to scale and personalize that music service. So there are a lot of proof points in terms of how quickly we scale, the enterprise-grade features that we provide, and kind of the blending of deep predictive analytics in a batch environment with online capabilities. >>So I've got to ask you about your go-to-market. Obviously Cloudera and Hortonworks have different business models — we can talk about that — but Cloudera got the massive funding. So you get this question all the time: how do you counter that army, the arms race? >>I think someone just wrote an article in Forbes that says cash is not a strategy, and I think that was an excellent article. And he goes on: in this fast-growing market, an amount of money doesn't necessarily translate to architectural innovations or speed the development of them. This is a fairly fragmented ecosystem in terms of the stack that runs on top of it — there's no single application or single vendor that kind of drives value. So an acquisition strategy is — >>So your field sales force — is it direct or indirect, or both, mixed? How do you handle that? Because Cloudera has got feet on the street, and if they're parking sales reps and SEs in all the enterprise accounts, you know, every squirrel's going to find a nut once in a while, and they're going to actually try to engage the clients. So, you know, I guess it is a strategy if they're deploying sales and marketing, right? >>The beauty of that — and in fact we're all in this together in terms of sharing an API and driving an ecosystem — is that it's not a fragmented market. You can start with one distribution and move to another without recompiling or doing any sort of changes. So it's a fairly open community. If this were about vendor lock-in, then, you know, spending money on brand, et cetera, would be important. Our focus is on sales execution. In terms of direct sales — yes, we have direct sales. We also have partners, and it depends on the geography as to what that percentage is. >>And John Schroeder was on with HP at our fifth Big Data NYC — give us an update on the HP relationship. >>Oh, excellent. In fact, we just launched our application gallery, the App Gallery, which makes it very easy for administrators and developers and analysts to get access and understand what's available in the ecosystem. That's available directly on our website. And one of the featured applications there today is an integration with the MapR Sandbox and HP Vertica, so you can get early access, try it, and get the best of kind of enterprise-grade SQL. >>The first Hadoop app store, basically. >>Yeah, if you want to call it that way, right. >>Sure. We launched with close to 30 available, with, you know, a whole wave kind of following that. >>So talk a little bit about, speaking of Vertica, the SQL-on-Hadoop space. There's a lot of talk about that, some confusion about the different methods for applying SQL on Hadoop. MapR takes an open approach — I know you'll support things like Impala from a competitor, Cloudera. Talk about that approach from MapR's perspective. >>So I guess our perspective is kind of unbiased open source. We don't try to pick and choose and dictate what's the right open source based on either our participation or some community involvement. And the reality is, with multiple applications being run on the platform, there are different use cases where different ones make sense. So whether it's a Hive solution, or, you know, Drill — Drill's available — or HP Vertica, people have the choice. And it's part of a broad range of capabilities that you want to be able to run on the platform for your workflows, whether it's SQL access or MapReduce or a Spark framework, Shark, et cetera. >>So yeah, I mean, there are so many different options — there's Spark, you can run HP Vertica, you've got Impala, you've got Hive and the Stinger initiative. That whole SQL-on-Hadoop ecosystem is still working itself out. Are we going to have this many options a year or two years from now? Or are they complementary, and potentially, you know, each has its role? >>I think the major difference is kind of how each deals with the new data formats. Can it deal with self-describing data sources? Can it leverage JSON files without requiring centralized metadata? Those are some of the perspectives and advantages that, say, Apache Drill has — to expand the data sets that are possible and enable data exploration without dependency on an IT administrator to define that metadata. >>So another one — maybe not always as exciting — but taking workloads from existing systems and moving them to Hadoop is one of the ways a lot of people get started, whether it's the associated transformation workloads or something in that vein. I know you've announced a partnership with Syncsort, and that's one of the things they focus on — really making that as easy as possible. Talk a little bit about that partnership and why it makes sense for you. >>Well, I think it's a great proof point, because we announced that partnership around mainframe offload — we have comScore and Experian in that press release. And if you look at a workload on a mainframe going to Hadoop, that seems like it's really an oxymoron, but by having the capabilities that MapR has, and making that a system of record with that full high availability and that data protection, we're actually an option for mainframe offload, for SAN processing offload, and we provide a really cost-effective, scalable alternative. And we've got customers that had tried to offload from the mainframe multiple times in the past unsuccessfully, and have done it successfully with MapR. >>So talk a little bit more about the broader partnership strategy. I mean, we're here at Hadoop Summit — of course Hortonworks talks a lot about their partnerships and their reseller arrangements. You seem to take a little bit more of a direct approach. What's MapR's approach to partnering as it relates to resell arrangements and things like that? >>I think the App Gallery is probably a great proof point there. The strategy is an ecosystem approach: it's having a collection of tools and applications and management facilities, as well as applications on top. So it's a very open strategy. We focus on making sure that we have open APIs at that application layer, that it's very easy to get data in and out. And that's part of the architecture — by presenting a standard file system format, by allowing non-Java applications to run directly on our platform, by supporting standard database connections, ODBC and JDBC, to provide database functionality in addition to this kind of deep predictive analytics. Really it's about supporting the broadest set of applications on top of a single platform. What we're seeing in this kind of modern architecture is that data gravity matters, and the more processing you can do on a single platform, the better off you are — the more agile, the more competitive, right? >>So in terms of that, you're partnering with people like SAS, for example, to kind of bring some of the analytic capabilities into the platform. Can you tell us a little bit about any of those? >>Companies like SAS and Revolution Analytics and Skytree — I mean, just a whole host of companies on the analytics side, as well as on the tools and visualization side, et cetera. Yeah. >>Well, I bring up SAS because I think they get the whole data gravity situation — they've got to go to where the data is and not have the data come to them. So, you know, I give them credit for acknowledging that kind of big data truism, that it's >>All about going to the data, not bringing the data >>To the computer. Jack, talk about the success you've had with customers — some pretty impressive numbers, talking about 500 customers. Merv Adrian, the Gartner analyst, was on with us earlier, essentially reiterating — not mentioning MapR — he was just saying what you guys are doing is right where the puck is going. And some think some other vendors aren't even at the same rink. So I've got to give you props on that. So I want you to talk about the success you're having, specifically around where you're winning and where you're successful, and what you guys have struggled with or >>Need to improve on. Yeah, there's a whole class of applications that I think Hadoop is enabling, which is about operations and analytics. It's taking this high-arrival-rate, machine-generated data, doing analytics as it happens, and then impacting the business. So whether it's fraud detection or recommendation engines or, you know, supply chain applications using sensor data, it's happening very, very quickly. So a system that can tolerate and accept streaming data sources, that has real-time operations, that is 24-by-7 and highly available — that's what really moves the needle. And those are the examples I used, you know, with the Rubicon Project and, you know, cable TV. >>What's the primary outcome your clients want with your product? Is it stability in the platform, is it enabling development? Is there an outcome that's consistent across all your wins? >>Well, big picture, some of them are focused on revenues — how do we optimize revenue, whether it's a new data source, a new application, or an existing application where we're exploding the dataset. Some of it's reducing costs, so they want to do things like a mainframe offload or a data warehouse offload. And then there are some that are focused on risk mitigation. And if there's anything they have in common, it's that as they've moved from test and looked at production, it's the key capabilities they have in enterprise systems today that they want to make sure are in Hadoop. So it's not anything new. It's just like, hey, we've got SLAs, I've got data protection policies, and I've got a disaster recovery procedure — why can't I expect the same level of capabilities in Hadoop that I have today in those other systems? >>Final question: where are you guys heading this year? What are your key objectives? Obviously you've got this flurry of announcements, good success — state of the company. How many employees are you at? Give us a quick update on the numbers. >>So, you know, we just reported this incredible momentum — we've tripled our growth year over year, we've added a tremendous number of customers, we're over 500 now. So we're basically sticking to our knitting, focusing on the customers, elevating the proof points here. Some of the most significant customers we have in the telco and financial services and healthcare and retail areas, you know, view this as a strategic weapon, view it as a huge competitive advantage, and it's helping them impact their business. That's really spurring our success. We're growing at an incredible clip here, and it's a great time to have made those calls and those investments early on and to kind of be reaping the benefits. >>Now, I've always said, since the first Hadoop Summit, when Hortonworks came out of Yahoo and this whole community kind of burst open — you had Hadoop World, now O'Reilly runs it, and it's a whole different vibe of its own. Look at the developer vibe here. So I've got to ask you — and we've been a big fan — I mean, everyone has enough of a beachhead to be successful. It's not MapR versus Hortonworks or Cloudera, and this is why I always kind of smile when everyone goes, oh, Cloudera or Hortonworks — I mean, they're two different animals at this point; they do different things, and you guys are over here. Everyone has their, quote, swim lanes or beachheads — there's not a lot of direct competition. Do you think it's going to be this way for a while? What's your forecast — at what point do you see more competition, 10 years out? I mean, Merv was talking about a 10-year horizon for innovation. >>I think that the more people learn and understand about Hadoop, the more they'll appreciate the set of capabilities that matter in production and post-production, and that appreciation will migrate earlier. And as we focus on more developer tools, like our Sandbox, so people can easily get experience and understand kind of what MapR is, I think we'll start to see a lot more understanding and momentum. >>Awesome. Jack Norris here inside theCUBE — CMO of MapR, a very successful enterprise-grade Hadoop player, a leader in the space. Thanks for coming on, we really appreciate it. Right back after this short break — you're live in Silicon Valley at Hadoop Summit 2014. Be right back.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff Kelly | PERSON | 0.99+ |
Jack Norris | PERSON | 0.99+ |
John Schroeder | PERSON | 0.99+ |
HP | ORGANIZATION | 0.99+ |
Jeff | PERSON | 0.99+ |
$3 billion | QUANTITY | 0.99+ |
December, 2014 | DATE | 0.99+ |
Jason | PERSON | 0.99+ |
Matt BARR | PERSON | 0.99+ |
10,000 jobs | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
10 year | QUANTITY | 0.99+ |
Syncsort | ORGANIZATION | 0.99+ |
Dan | PERSON | 0.99+ |
Silicon valley | LOCATION | 0.99+ |
John barrier | PERSON | 0.99+ |
Java | TITLE | 0.99+ |
Yahoo | ORGANIZATION | 0.99+ |
10 years | QUANTITY | 0.99+ |
24 | QUANTITY | 0.99+ |
Hadoop | TITLE | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
this year | DATE | 0.99+ |
Jack | PERSON | 0.99+ |
fifth | QUANTITY | 0.99+ |
Linux | TITLE | 0.99+ |
Skytree | ORGANIZATION | 0.99+ |
each | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
today | DATE | 0.98+ |
one | QUANTITY | 0.98+ |
Merv | PERSON | 0.98+ |
about 10 years | QUANTITY | 0.98+ |
San Jose | LOCATION | 0.98+ |
Hadoop | EVENT | 0.98+ |
about 20% | QUANTITY | 0.97+ |
seven | QUANTITY | 0.97+ |
over 500 | QUANTITY | 0.97+ |
a year | QUANTITY | 0.97+ |
about 500 customers | QUANTITY | 0.97+ |
SQL | TITLE | 0.97+ |
seven different verticals | QUANTITY | 0.97+ |
two years | QUANTITY | 0.97+ |
single platform | QUANTITY | 0.96+ |
2014 | DATE | 0.96+ |
Apache | ORGANIZATION | 0.96+ |
Hadoop | LOCATION | 0.95+ |
SiliconANGLE | ORGANIZATION | 0.94+ |
comScore | ORGANIZATION | 0.94+ |
single vendor | QUANTITY | 0.94+ |
day one | QUANTITY | 0.94+ |
Salesforce | ORGANIZATION | 0.93+ |
about nine years | QUANTITY | 0.93+ |
Hadoop Summit 2014 | EVENT | 0.93+ |
Merv | ORGANIZATION | 0.93+ |
two different animals | QUANTITY | 0.92+ |
single application | QUANTITY | 0.92+ |
top three | QUANTITY | 0.89+ |
SAS | ORGANIZATION | 0.89+ |
Riley | PERSON | 0.88+ |
First | QUANTITY | 0.87+ |
Forbes | TITLE | 0.87+ |
single cluster | QUANTITY | 0.87+ |
Mapbox | ORGANIZATION | 0.87+ |
map R | ORGANIZATION | 0.86+ |
map | ORGANIZATION | 0.86+ |
Dr. Amr Awadallah - Interview 2 - Hadoop World 2011 - theCUBE
Okay, we're back with Amr Awadallah, the co-founder — back to back. This is theCUBE, SiliconANGLE.com, SiliconANGLE.tv's production of theCUBE, our flagship telecast. We go out to the events. That was a great conversation — really just cool. We could have probably hit on a few more things; obviously well read. Awesome. Co-founder of Cloudera. You did a good job teaming up with that co-founder, huh? Not bad on theCUBE, huh? He's not bad on theCUBE, isn't he? >>He reads the internet. >>That's what I'm saying. >>Anything that's going on. >>He's a CUBE star, you know. And >>Technology — Jeff knows it. Yeah. >>I tell you, I'm smarter just from being around Cloudera all those years. And I actually was following what he was saying, and it didn't bust my brain. So, okay, you're back. We were talking earlier with Mike Olson about the relational database thing, so I want to pick that up where we left off with you. You know, he was really excited — it's like, hey, we saw that relational database movement happen; he was part of that generation. And things are kind of happening the same way, in a similar way — still early. So I was trying to really peg with him: how early are we on the curve? You know, this is 1,400 people, it's not the Javits Center yet. Maybe Hadoop World next year might be at the Javits Center — 35,000, just don't go to Vegas. So I'm trying to figure out where we are on that curve. Are we on the upward slope, you know, or down here, not even hitting that? >>I think we're moving up quicker than previous waves. Actually, if you look, for example, at Oracle, I think it took them 15, 20 years until they really became a mature company. VMware, which started about, what, 12, 13 years ago — it took them maybe eight years to be a big company, a mature company. And I'm hoping we're gonna do it in five, so a couple more years. >>Highly accelerated. >>Yes. But yeah, I mean, I've been surprised by the growth, I have been. Right? I've been told and warned about enterprise software and that it takes long for production adoption to take place. >>But the consumerization trend is really changing that. I mean, it used to be that, yeah, the enterprise is always last. Why the shorter >>Cycle? I think the shorter cycle is coming from having the right solution for the right problem at the right time. I think that's a big part of it — so luck definitely is a big part of this. Now, in terms of why this is changing compared to a couple of decades ago, why the adoption is changing compared to a couple of decades ago — I think that's coming just because of how quickly the technology itself, the underlying hardware, is evolving. So right now, the fact that you can buy a single server and it has eight to 16 cores, and 12 hard drives at two terabytes each, is something that's just pushing the limits of what you can do with the existing systems, and hence making it more likely for new systems to disrupt them. >>Yeah. We could talk about a lot. It's very easy for people to actually start a big data >>Project. >>Yes. For >>Example. Yes. And the hardest part is, okay, what problem do I really need to solve? How am I gonna monetize it? Right? Those are the hard parts. It's not the underlying >>Technology. Yes, yes, that's true. That's true. I mean — >>You're saying — >>Because I'm seeing both so much. I'm seeing both. And like, I'm seeing cases where you're right: there are some companies that are like, oh, this Hadoop thing is so cool, what problem can I solve with it? And I see other companies that are like, I have this huge problem — and they don't know that Hadoop exists. And once they know, they just jump on it right away. It's like when you have a headache and you're searching for the medicine — aspirin. Wow, it >>Works. I was talking to Jeff Hammerbacher before he came on stage, and I didn't even get to it 'cause we were on such a nice riff there — right, a bunch of musicians playing the guitar together. But we talked about the IT dynamics, and he said something that I thought was right on the money — and SAP is talking about the same thing — they're going to the lines of business. Yes. Because IT is the gatekeeper. It's like selling minicomputers to a mainframe team, selling client-server to a minicomputer team. Yeah. >>We're seeing both as well — but more likely the former one, meaning that yes, lines of business and departments adopt the technology, and then IT comes in and they see there are already these five different departments having it, and they think, okay, now we need to formalize this across the organization. >>So what happens then? What are you seeing out there? Like when that happens — people get their hands on it: hey, we've got a problem to solve. Is that what it comes down to? Well, Hadoop exists — go get Hadoop. They plop it in there, and what does it do? >>So they pop it into their own installation or on the cloud, and they show that this actually is working and solving the problem for them. And when that happens, it's a very easy adoption from there on, because they just go tell IT: we need this right now, because it's solving this problem and it's gonna make us much >>More money. Move it right in. Yes, no problem. >>Is that another reason why the cycle's compressed? I mean, you think client-server — there was a lot of resistance from IT, and now — same thing with mobile. I mean, mobile has flipped, right? It's, okay, bring it in, we've gotta deal with it. Yep. I would think the same thing: we have a data problem, let's turn it into an >>Opportunity. Yeah. And it goes back to what I said earlier — the right solution for the right problem at the right time. Like, when you have large amounts of unstructured data, there isn't anything else out there that can even touch what Hadoop can >>Do. So Amr, I need to just change gears here a minute — the gaming stuff. So we're featured on justin.tv right now on the front page. Oh wow. But the numbers aren't coming in, because there's a competing stream of the recently released Modern Warfare 3. Yes. Yes. So >>I was looking for — we >>Have to compete with Modern Warfare 3. So can we talk about Modern Warfare 3 for a minute and share with the folks what you think of the current version, if you've played it? Yeah. So >>Unfortunately I'm waiting to get back home — I don't have my Xbox with me here. >>A little later — I'm talking about >>My lines of business. >>Boom. Modern Warfare's like a Christmas >>Tree here. Sorry. You know, I love it — I'm a big gamer, I'm a big video gamer at Cloudera.
We have, every Thursday at 5:30 in the office, a session where we play Call of Duty version four, which is Modern Warfare 1, actually. And I challenge people out there to come challenge our team. Just ping me on Twitter and we'll do a Cloudera-versus — >>Let's reframe that. Team out there: Amr Awadallah's company — these are the geeks that invent the future. Jeff Hammerbacher, at Facebook, now at Cloudera, leading the charge. These guys are gamers. So all the young gamers out there, Amr is saying they're gonna challenge you. At which version? >>Modern Warfare 1. >>Modern Warfare 1. Yes. How do they fire in? Can you set up an >>External — we'll >>We'll figure it out. We'll figure it out. Okay. >>Yeah, just ping me on Twitter and we'll — >>We can carry it live, actually, we can stream that. Yeah. >>That'd be great. >>Great. >>Yeah. So I'll tell you, some of our best Hadoop committers and Hadoop developers play. >>Paint a picture — Modern Warfare >>3? Going now to Modern Warfare 3: very excited about the game. I saw the trailers for it — the graphics look just amazing. I've loved the series since the first one that came out, and I'm looking forward to getting back home to playing the game. >>I can't play — my son won't let me play. I'm such a fumbler with the controller. I'm a keyboard guy; I can't work the Xbox controller. I have a coordination problem at my age, and I'm just a klutz, and it's like, Dad, sorry, charity's over — can I play with my friends on the Xbox? But I'm a big gamer at heart. >>But in terms of — I mean, something I wanted to bring up is how to link up gaming with big data and analysis and so on. So like, I'm a big gamer, I love playing games, but at the same time, whenever I play games I feel a little bit guilty, because it's kind of like wasted time. I mean, yeah, it's fun and I'm getting lots of enjoyment out of it, it makes my life much more cheerful. But still, how can we harness all of these hours that gamers spend playing a game like Modern Warfare 3? How can we collect and instrument all of the data that's coming from that and come up, for example, with something useful, something predictive? >>This is exactly the kind of application that's going mainstream — gaming. Yeah. Danny at Riot Games was telling me — we saw him at Oracle OpenWorld, he was up there for JavaOne — he said that they don't really have a big data platform, and their business is about understanding user behavior: tons of data about user playing time, who they're playing with, how they want to get into currency trading, you know. >>I can't mention the names, but some of the biggest gaming companies out there are using Hadoop right now, and depending on CDH, for doing exactly that kind of thing — creating >>A good user experience. >>Today they're doing it for the purpose of enhancing the user experience and improving retention. So they do track everything — like every single bullet you fire, every best headshot you get, every home run you do, and, in a three-type of game, every consecutive headshot you get — >>Everything. >>Everything is being tracked, yeah — every headshot you get and so on. But, as you said, they are using that information today to sell more products and retain their users. Now what I'm suggesting is: how can you harness that energy for the good as well? I mean, making money — money is good and everything — but how can you harness that for doing something useful, so that all of this entertainment time is also actually productive time as well? I think that'd be a holy grail in this environment if we >>Can achieve that. Yeah. It used to be that porn was the telegraph of the future of applications, but gaming really is now — if you look at gaming, you know, you get the headset on, it's a collaborative environment. Oh yeah. You've got unified communications. You've got play environments, very social, collaborative. Yeah. You know, what I'm saying is that that's the future work environment, with Skype evolving — our multiplayer game is called our job, right? Yeah. You know, so I'm big on gaming. So all the gamers out there — Amr has challenged you. Yeah. Got a big data example. What else are we seeing? So let's talk about the software. One of the things you were talking about that I really liked — you were going down the list. On Mike's slide he had all the new features. So around the core, can you just go down the core and rattle off your version of what it means and what it is? You start off with, say, HBase — we talked about that already. What are the other ones that are out there? >>So the projects that we have right there — >>The projects that are around those tools that are being built, 'cause — >>Yeah, so the foundational ones, as we mentioned before, are HDFS for storage and MapReduce for processing. And then the immediate layer above that is how to make MapReduce easier for the masses. Not everybody can learn MapReduce and Java, but everybody knows SQL, right? So one of the most successful projects right now, the one with the highest attach rate — meaning when people install Hadoop they usually install it as well — is Hive. Jeff Hammerbacher, my co-founder — when he was at Facebook, his team built the Hive system. Essentially Hive takes SQL, so you don't have to learn a new language — you already know SQL — and then converts it into MapReduce for you. That not only expands the developer base, how many people can use Hadoop, but also makes it easier to integrate Hadoop, through ODBC and JDBC, with BI tools like MicroStrategy and Tableau and Informatica, et cetera, et cetera. >>You mentioned R too — the R programming language. >>As well. Yeah, R — that's one of our best partnerships; we're very, very happy with them. So that's one of the very key projects, Hive. A sister project to Hive is called Pig. Pig Latin is a language that Yahoo invented; you do have to learn the language, but it's very easy to learn compared to MapReduce. And once you learn it, you can specify very deep data pipelines, right? SQL is good for queries; it's not good for data pipelines, because it becomes very convoluted — it becomes very hard for the human brain to understand. So Pig is much more natural to the human; it's more like Perl, very similar to scripting kinds of languages. So with Pig you can write very, very long data pipelines. Again, a very successful project, doing very, very well. Another key project is HBase, like you said. HBase allows you to do low latencies, so you can do very, very quick lookups, and it also allows you to do transactions.
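A small sketch of the Hive workflow Amr describes just above — you write ordinary SQL and Hive compiles it into MapReduce jobs under the covers. Here it is submitted through a PyHive-style connection; the host, port, database, and table names are assumptions for illustration only.

```python
# Hypothetical sketch of the Hive usage described above: plain SQL in,
# MapReduce jobs out, no hand-written Java. Connection details and the
# table name are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

# An ordinary aggregate query; Hive handles turning it into MapReduce.
cursor.execute("""
    SELECT page_url, COUNT(*) AS views
    FROM   web_logs                -- assumed table of raw click logs
    WHERE  dt = '2011-11-08'
    GROUP  BY page_url
    ORDER  BY views DESC
    LIMIT  10
""")

for page_url, views in cursor.fetchall():
    print(page_url, views)
```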
So you can do updates in inserts and deletes. So one of the talks here that had World we try to recommend people watch when the videos come out is the Talk by Jonathan Gray from Facebook. And he talked about how they use Edge Base, >>Jonathan, something on here in the Cube later. Yeah. So >>Drill him on that. So they use Edge Base now for many, many things within Facebook. They have a big team now committed to building an improving edge base with us and with the community at large. And they're using it for doing their online messaging system. The live mail system in Facebook is powered by Edge Base right now. Again, Pro and eBay, The Casini project, they gave a keynote earlier today at the conference as well is using Edge Base as well. So Edge Base is definitely one of the projects that's growing very, very quickly right now within the Hudu system. Another key project that Jeff alluded to earlier when he was on here is Flum. So Flume is very instrumental because you have this nice system had, but Hadoop is useless unless you have data inside it. So how do you get the data inside do? >>So Flum essentially is this very nice framework for having these agents all over your infrastructure, inside your web servers, inside your application servers, inside your mobile devices, your network equipment that collects all of that data and then reliably and, and materializes it inside Hado. So Flum does that. Another good project is Uzi, so many of them, I dunno how, how long you want me to keep going here, But, but Uzi is great. Uzi is a workflow processing system. So Uzi allows you to define a series of jobs. Some of them in Pig, some of them in Hive, some of them in map use. You can define a series of them and then link them to each other and say, only start this job when these other jobs, two jobs finish because I'm waiting for the input from them before I can kick off and so on. >>So Uzi is a very nice framework that will will do that. We'll manage the whole graph of jobs for you and retry things when they fail, et cetera, et cetera. Another good project is where W H I R R and where allows you to very easily start ADU cluster on top of Amazon. Easy two on top of Rackspace, virtualized environ. It's more for kicking off, it's for kicking off Hadoop instances or edge based instances on any virtual infrastructure. Okay. VMware, vCloud. So that it supports all of the major vCloud, sorry, all of the me, all of the major virtualized infrastructure systems out there, Eucalyptus as well, and so on. So that's where W H I R R ARU is another key project. It's one, it's duck cutting's main kind of project right now. Don of that gut cutting came on stage with you guys has, So Aru ARO is a project about how do we encode with our files, the schema of these files, right? >>Because when you open up a text file and you don't know how to what the columns mean and how to pars it, it becomes very hard to work for it. So ARU allows you to do that much more easily. It's also useful for doing rrp. We call rtc remove procedure calls for having different services talk to each other. ARO is very useful for that as well. And the list keeps going on and on Maha. Yeah. Which we just, thanks for me for reminding me of my house. We just added Maha very recently actually. What is that >>Adam? I'm not >>Familiar with it. So Maha is a data mining library. So MAHA takes some of the most popular data mining algorithms for doing clustering and regression and statistical modeling and implements them using the map map with use model. 
>>They have, they have machine learning in it too or Yes, yes. So that's the machine learning. >>So, So yes. Stay vector to machines and so on. >>What Scoop? >>So Scoop, you know, all of them. Thanks for feeding me all the names. >>The ones I don't understand, >>But there's so many of them, right? I can't even remember all of them. So Scoop actually is a very interesting project, is short for SQL to Hadoop, hence the name Scoop, right? So SQ from SQL and Oops from Hadoop and also means Scoop as in scooping up stuff when you scoop up ice cream. Yeah. And the idea for Scoop is to make it easy to move data between relational systems like Oracle metadata and it is a vertical and so on and Hadoop. So you can very simply say, Scoop the name of the table inside the relation system, the name of the file inside Hadoop. And the, the table will be copied over to the file and Vice and Versa can say Scoop the name of the file in Hadoop, the name of the table over there, it'll move the table over there. So it's a connectivity tool between the relational world and the Hadoop world. >>Great, great tutorial. >>And all of these are Apache projects. They're all projects built. >>It's not part of your, your unique proprietary. >>Yes. But >>These are things that you've been contributing >>To, We're contributing to the whole ecosystem. Yes. >>And you understand very well. Yes. And >>And contribute to your knowledge of the marketplace >>And Absolutely. We collaborate with the, with the community on creating these projects. We employ committers and founders for many of these projects. Like Duck Cutting, the founder of He works in Cloudera, the founder for that UIE project. He works at Calera for zookeeper works at Calera. So we have a number of them on stuff >>Work. So we had Aroon from Horton Works. Yes. And and it was really good because I tell you, I walk away from that conversation and I gotta say for the folks out there, there really isn't a war going on in Apache. There isn't. And >>Apache, there isn't. I mean isn't but would be honest. Like, and in the developer community, we are friends, we're working together. We want to achieve the, there's >>No war. It's all Kumbaya. Everyone understands the rising tide floats, all boats are all playing nice in the same box. Yes. It's just a competitive landscape in Horton. Works >>In the business, >>Business business, competitive business, PR and >>Pr. We're trying to be friendly, as friendly as we can. >>Yeah, no, I mean they're, they're, they're hying it up. But he was like, he was cool. Like, Hey, you know, we know each other. Yes. We all know each other and we're just gonna offer free Yes. And charge with support. And so are they. And that's okay. And they got other things going on. Yes. But he brought up the question. He said they're, they're launching a management console. So I said, Tyler's got a significant lead. He kind of didn't really answer the question. So the question is, that's your core bread and butter, That's your yes >>And no. Yes and no. I mean if you look at, if you look at Cloudera Enterprise, and I mentioned this earlier and when we talked in the morning, it has two main things in it. Cloudera Enterprise has the management suite, but it also has the, the the the support and maintenance that we provide to our customers and all the experience that we have in our team part That subscription. Yes. For a description. And I, I wanna stress the point that the fact that I built a sports car doesn't mean that I'm good at running that sports car. 
The driver of the car usually is much better at driving the car than the guy who built the car, right? So yes, we have many people on staff that are helping build Hadoop, but we have many more people on staff that help run Hadoop at large scale: in the financial industry, retail industry, telecom industry, media industry, health industry, et cetera, et cetera. So that's very, very important for our customers, all that experience that we bring in on how to run the system technically within these verticals. >>But their strategy is clear: we're gonna create an open source project within Apache for a management console. >>Yes. >>And we sell support too. >>Yes. >>So there'll be a free alternative to your management suite. >>So we'll have to see. But I mean, if we look at the product, our product... >>It's gotta come down to product differentiation. >>Our product has been in the market for two years, and they just started building their product. It's... >>Alpha. It's just Alpha. >>The product is in Alpha right now. Yeah. Okay. >>Well, the Apache product, it is... >>Apache, right? Yeah. The Apache project is out, so we'll see how it compares to ours. But I think ours is way, way ahead of anything else out there. I encourage people to try that for themselves and see.
>>Essentially, John, when I asked Arun why the world needs Hortonworks, you know, eventually the answer we got was, well, it's free, it needs to be more open, Hadoop needs to be more open. >>No, that's not really the reason why Hortonworks... >>No, they want to go make money. >>Exactly. I wasn't gonna say it for them, but when I kept pushing and pushing, that's ultimately the closest we could get. >>Twelve open source projects. Yes. >>I mean, yeah, you can't get much more open. Look at the management console. But they're not running all of those. >>I mean, not only are we... >>No, no, we absolutely are. >>No, you are contributing. But those aren't all your projects. There are other people involved. >>Yeah, we didn't start all of these projects. >>That's true. But you contribute heavily to all of them. >>Yes, we are. >>And that's clear. Todd Lipcon said that, you know, he contributed his first patch to Apache in 2008. >>Yes. >>So I mean, you go back through the ranks of your people, and Todd now is a committer on HBase and a committer on Hadoop itself, and so on, on a number of projects. >>You're clearly the lead, and, you know... >>But there is a concern, and we've heard it, and I want to just ask you. There's a concern that if I build processes around a proprietary management console, I'm gonna end up being locked into that proprietary management, CA all over again. >>Now, this is so far from CA. >>Right. But that's a concern that some people have expressed, and I think it's one of the reasons why Hortonworks is getting so much attention. So talk about that. >>Yes. It's a very good observation to make, actually.
There are two separate things here. There's the platform, where all the data sits, and then there's this management console beside the platform. Now, why did we make the management console, why did Cloudera make the management console? Because it makes our job of supporting the customers much more achievable. When a customer calls in and says, we have a problem, help us fix this problem.
When they go to our management console, there is a button they click that gives us a dump of the state of the cluster, and that's what allows us to very quickly debug what's going on and, within minutes, tell them you need to do this and you need to do that. Without that, we just can't offer the support services. >>There's real value there. >>Yes. But you have to keep in mind that the underlying platform is completely open source and free. CDH is completely a hundred percent open source, a hundred percent free, a hundred percent Apache. So a year from now, when it comes time to renew with us, if the customer is not happy with our management suite, is not happy with our support, they can go to Hortonworks. >>People are afraid of that. >>Or they can go to IBM. >>You can take the data... >>You don't even need to take the data. You're not gonna move the data. It's the same system, the same software. Everything in CDH is Apache, right? We're not putting anything in CDH which is not Apache. So a year from now, if you're not happy with our service to you and the value that we're providing, you can switch. There is no lock-in. There is no lock-in. >>Your argument would be the switching costs... >>The only lock-in is happiness. The only lock-in is... >>Happiness, satisfaction, customer delight. Which, by the way, we just wrote a piece about those wars, and we said the risk of lock-in is low. We made that statement. We got some heat for it. >>Yes. >>This is sort of at scale, though. What the people throwing the tomatoes are saying is, again, in theory, at scale the customers get so comfortable with the console that they don't switch. Now, my argument was... >>Yes, but that means they're happy with it. That means they're satisfied and happy with it. And it's more economical for them than going and hiring people full-time on staff. Yeah. >>So you're always in check, as long as the customer doesn't feel like it's Oracle. >>Yeah. See, that's different. Oracle is very... >>Oracle is different, right? Yeah. Here it's like Cisco routers: they get nested into the environment and provide value. That's just good competitive product strategy. Yes. If they're happy. >>It's called open washing with Oracle. >>I mean, our number one core attribute in the company, the number one value for us, is customer satisfaction, keeping our customers happy with the service that we provide. >>So differentiate in the product. Yes. Keep the commanding lead. That's the strategy. That's what's happening. That's your goal. Yes. >>That's what's happening. Absolutely. >>Okay. The co-founder of Cloudera, always a pleasure to have you on the Cube. We really appreciate all the hospitality over the past year and a half, and I want to personally thank you for letting us sit in your office. We'll miss you. >>And we'll miss you too. >>We'll see you at the Cube events, swing by. Thanks for coming on the Cube, great to see you, and congratulations on all your success. >>Thank you. And thanks for the review on Modern Warfare 3. >>Yeah, yeah. Call me again if there's any gaming stuff, you know.