
Search Results for ODBC:

Frank Slootman, Snowflake | Snowflake Summit 2022


 

>>Hi, everybody. Welcome back to Caesars in Las Vegas. My name is Dave Vellante. We're here with the chairman and CEO of Snowflake, Frank Slootman. Good to see you again, Frank. Thanks for coming on.
>>Yeah, you as well, Dave. Good to be with you.
>>It's awesome to be here; obviously everybody's excited to be back, you mentioned that in your keynote. The most amazing thing to me is the progression of what we're seeing here in the ecosystem and in your data cloud. You wrote a book, The Rise of the Data Cloud, and it was very cogent. You talked about network effects, but now you've executed on that. I call it the supercloud. You have AWS, I use that term, AWS, you're building on top of that, and now you have customers building on top of your cloud. So there are these layers of value that are unique in the industry. Was this by design, or...
>>Well, when you are a data cloud, you have data, and people want to do things with that data. They don't want to just run data operations, populate dashboards, run reports. Pretty soon they want to build applications, and after they build applications, they want to build businesses on it. So it goes on and on, and it drives your development to enable more and more functionality on that data cloud. It didn't start out that way. We were very, very much focused on data operations; then it becomes application development, and then it becomes, hey, we're developing whole businesses on this platform. So similar to what happened to Facebook in many ways.
>>There was some confusion, I think, and there still is in the community, particularly on Wall Street, about your quarter and the consumption model. I loved on the earnings call, one of the analysts asked Mike, do you ever consider going to a subscription model? And Mike cut him off before he could even finish: no, that would really defeat the purpose. And so there's also a narrative around, well, maybe Snowflake consumption is easier to dial down, maybe it's more discretionary. But I would say this: if you're building apps on top of Snowflake and you're actually monetizing, which is a big theme here now, your revenue is aligned with those cloud costs. And so unless it costs more than you're selling it for, you're going to dial that up. And that is the future I see for this ecosystem and your company. Is that fair? Do you buy that?
>>Yeah, it is fair. Obviously the public cloud runs on a consumption model, so when you start looking at all the layers of the stack, Snowflake has to be a consumption model because we run on top of other people's consumption models. Otherwise you don't have alignment. I mean, we have conversations with people that build on Snowflake who have trouble with their financial model because they're not running a consumption model. It's a square peg in a round hole. So we all have to align ourselves, so that when they pay a dollar, a portion of that dollar goes to, let's say, AWS, a portion goes to Snowflake, and a portion goes to whatever the uplift is, the application value, data value, whatever it is that goes on top of that. So the whole dollar gets allocated depending on whose value-add we're talking about.
>>Yeah, but you sell value. So you're not a SaaS company.
At least I don't look at you that way. I've always felt like the SaaS pricing model is flawed because it's not aligned with customers. If you get stuck with orphaned licenses, too bad, pay us.
>>Yeah. We're obviously a SaaS model in the sense that it is software as a service, but it's not a SaaS model in the sense that we don't sell use rights. And that's the big difference. When you buy so many users from Salesforce or ServiceNow or whoever, you have just purchased the right for so many users to use that software for this period of time, and the revenue gets recognized ratably, one month at a time, the same amount. Now, we're not that different, because we still do a contract the exact same way a SaaS vendor does it, but we don't recognize the revenue ratably. We recognize the revenue based on the consumption, but over the term of the contract we recognize the entire amount; it just is not neatly organized into these monthly buckets.
>>So what happens if they underspend one quarter? Do they have to catch up by the end of the term? Is that how it works, or is that a negotiation?
>>The spending is totally separate from the consumption itself, because of how they pay for the contract. Let's say they do a three-year contract. They will probably pay for that on an annual basis over that three-year contract. But how they recognize their expenses for Snowflake, and how we recognize the revenue, is based on what they actually consume. It's not like on demand, where you can just decide not to use it and then I don't have any cost; over the three-year period all of that needs to get consumed, or the credits expire. And it's the same way with Amazon: if I don't consume what I buy from Amazon, I still have to pay for it.
>>Well, you're right. I guess you could buy by the drink, but it's way, way more expensive.
>>Correct.
>>And nobody really does that. So, okay. Phase one: a better, simpler cloud enterprise data warehouse. Phase two: you introduced the data cloud, and now we're seeing the rise of the data cloud. What does phase three look like?
>>Phase three is all about applications. We've just learned from the beginning that people were trying to do this, but we weren't instrumented at all for it. So people would use ODBC or JDBC drivers and just use us as a database. The entire application would happen outside Snowflake; we're just a database. You connect to the database, you read or write data, you do data manipulations, and then the application processing all happens outside of Snowflake. Now there are issues with that, because we start to exfiltrate data, meaning that we start to take data out of Snowflake and put it in other places. There's risk in that. There's operational risk, there's governance exposure, security issues, all this kind of stuff. And the other problem is, data gets read; it proliferates. And then the data scientists are like, well, I need that data to stay in one place. That's the whole idea behind the data cloud. We have very big infrastructure clouds.
We have very big application clouds, and then data sort of became the victim there and became more proliferated and more segmented than it's ever been. So all we do all day is send the data to the work, and we said, no, we're going to enable the work to get to the data, and the data stays in place. Then we don't have latency issues, we don't have data quality issues, we don't have lineage issues. So people have responded very, very well to the data cloud idea: yeah, as an enterprise or an institution, I'm the epicenter of my own data cloud, because it's not just my own data.
It's also my ecosystem, the people that I have data networking relationships with. For example, take an investment bank in New York City. They send data to Fidelity, they send data to BlackRock, they send data to Bank of New York, all the regulatory clearing houses, on and on and on. Every night they're running thousands, tens of thousands of jobs pushing that data out there. And they're all on Snowflake already, so it doesn't have to be this way.
>>Yes. So I asked the guys last week, hey, what would you ask Frank? You might remember you came on our program during COVID and I was asking you how you were dealing with it: turn off the news. That was cool. And I asked you at the time whether you would ever go on-prem, and you said, look, I'll never say never, but it defeats the purpose. And you said, we're not going to do a halfway house. Actually, you were more declarative: we're not doing a halfway house, one foot in, one foot out. And then the guys said, well, what about that Dell deal and that Pure deal that you just did? I think I know the answer, but I want to hear it from you. Did a customer come to you, get you in a headlock and say, you've got to do this? Or did it happen that way?
>>It started with a conversation with Michael Dell. It was supposed to be just a friendly chat, hey, how's it going? And obviously Dell is the owner of Data Domain, our first company. But it wasn't easy for Dell and Snowflake to have a conversation, because they're the epitome of the on-premises company and we're the epitome of a cloud company. It's like, what do we have in common here? What can we talk about? But Michael's a very smart, engaging guy, always looking for opportunity, and of course we decided to hook up our CTOs and our product teams and explore ideas. And yeah, we had some starts and restarts and all of that, because it's just naturally not an easy thing to conceive of. But in the end it was like, you know what?
It makes a lot of sense. We can virtualize Dell object storage as if it were S3 storage from Amazon, and then Snowflake, in its analytical processing, will just reference that data, because to us it just looks like a file sitting on S3. And we have such a thing; it's called an external table. That's how we basically do it: it projects a Snowflake semantic and structural model onto an external object.
And we process against it exactly the same way as if it were an internal table. So we just extended that, with our storage partners like Dell and Pure Storage, for it to happen across a network to an on-prem place. So it's very elegant, and it becomes an enterprise architecture rather than just a cloud architecture. I just don't know yet what will come of it, but I've already talked to customers who have to have data on premises; it can't go anywhere, because they process against it where it originates, but there are analytical processes that want to reference attributes of that data. Well, this will do that.
>>Yeah, it is interesting. If I were Dell, I'd be talking to you about, hey, I'm going to try to separate compute from storage on-prem and maybe do some of the work there. I don't even know if it's technically feasible; I'll ask. But to me, that's an example of you extending your ecosystem. So you're talking now about applications, and that's an example of increasing your TAM. I don't know if you ever get to the edge, we'll see, we're not quite there yet, but as you've said before, there's no lack of market for you.
>>Yeah. Obviously Snowflake's genesis was reinventing database management in a cloud computing environment, which is so different from a machine environment or a cluster environment. That's why we're not a fit for a machine-centric environment; it sort of defeats the purpose of how we were built. We are truly a cloud-native solution. Most products in the clouds are actually not cloud native. They originated in machine environments, and you still see that. Almost everything you see in the cloud, by the way, is not cloud native. Our generation of applications only run in the cloud. They can only run in the cloud. They are cloud native. They don't know anything else.
>>Yeah, you're right. A lot of companies would just wrap their stack in Kubernetes and throw it into the cloud and say, we're in the cloud too. And you basically just shifted it.
>>It didn't make sense. They throw it in a container and run it. Right.
>>So, okay, that's cool, but what does that get you? It doesn't change your operational model. So coming back to software development and what you're doing in that regard: one of the things we said about supercloud is that in order to have a supercloud, you've got to have an ecosystem, you've got to have optionality. Hence you're doing things like Apache Iceberg; you said today, well, we're not sure where it's going to go, but we're offering options. But my question is, as it pertains to software development specifically, one of the things we said is you have to have a super PaaS in order to have a supercloud ecosystem, a PaaS layer. That's essentially what you've introduced here, is it not? A platform for application development?
>>Yeah. I mean, what happens today? How do you enable a developer on Snowflake without the developer reading the files out of Snowflake, processing against that data wherever they are, and then putting the result set God knows where? And that's what happens today.
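Editor's note: below is a minimal sketch of the client-side ODBC/JDBC pattern Slootman describes above, where an application treats Snowflake as "just a database" and pulls the result set out to wherever the app runs. The DSN name, credentials, table, and query are illustrative assumptions added for this note, not details from the interview.

```python
# Hypothetical sketch of the "application outside the platform" pattern:
# connect over ODBC, run a query, and copy the rows out for local processing.
import pyodbc

# Assumes a Snowflake ODBC driver and a DSN named "snowflake_dsn" are configured.
conn = pyodbc.connect("DSN=snowflake_dsn;UID=app_user;PWD=********")
cursor = conn.cursor()

# The query executes inside the warehouse, but the result set leaves it...
cursor.execute(
    "SELECT customer_id, event_type, event_ts FROM events WHERE event_ts > ?",
    ("2022-01-01",),
)
rows = cursor.fetchall()

# ...and all downstream processing happens outside the platform's governance
# boundary, which is the exfiltration and governance risk described above.
suspicious = [r for r in rows if r.event_type == "login_failure"]
print(len(suspicious), "rows copied out of the platform for local processing")

conn.close()
```

The contrast drawn in the interview is the opposite flow: push the work to the data so the result set never has to land "God knows where."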
It's the wild west; it's completely ungoverned. And that's the reason why lots of enterprises will not allow Python anything anywhere near their enterprise data. We just know that. We also know it from Streamlit, the large acquisition that we made this year, because they said, look, we have a lot of demand in the Python community, but that's the wild west. That's not the enterprise-grade, high-trust corporate environment. They are strictly segregated today.
Now, do these things sometimes dribble up in the enterprise? Yes, they do. And it's actually intolerable, the risk that enterprises take with things being ungoverned. The whole Snowflake strategy and promise is that when you're in Snowflake, it is an absolutely enterprise-grade environment and experience. And it's really hard to do; it takes enormous investment, but that is what you buy from us. Just having Python is not particularly hard; we could do that in a week. This has taken us years to get to this level of governance and security, and to have all the risks around exfiltration and so on really understood and dealt with. That's also why these things run in private previews and public previews for so long, because we have to squeeze out everything that may not have been understood or foreseen.
>>So there are trade-offs to going into this Snowflake cloud. You get all this great functionality; some people might think it's a walled garden. How would you respond to that?
>>Yeah, and it's true. When you have a Snowflake object, like a Snowflake table, only Snowflake runs that table. And that is very high function. It's very analogous to what Apple did: very high functioning, but you do have to accept the fact that other things cannot get at these objects. So this is the reason why we introduced an open file format like Iceberg, because what Iceberg effectively does is allow any tool to access that particular object. We do it in such a way that a lot of the functionality of Snowflake will address the Iceberg format, which is great, because you're going to get much more function out of our Iceberg implementation than you would get from Iceberg on its own. So we do it in a very high-value-added manner, but other tools can still access the same object in a read or write manner. So it really delivers the original promise of the data lake, which is: hey, I have all these objects, tools come and go, I can use what I want. So you get the best of both worlds, for the most part.
>>It reminds me a little bit of VMware. VMware is a software mainframe; it's just better than doing it on your own.
>>Yep.
>>One of the other hallmarks of a cloud company, and you guys clearly are a cloud company, is startups and innovation. Now, of course, you see that in the ecosystem, and maybe that's the answer to my question, but you guys are kind of whale hunters. <laugh> Your customers tend to be bigger. Is the innovation now the extension of that, the ecosystem? Is that by design?
>>Oh, we have an enormous ISV following, and we're going to have a whole separate conference like this, by the way, just for developers.
>>I hope you guys will be up there too.
>>Yeah. The ISV strategy is very important for many reasons, but ISVs are the people that are really going to unlock a lot of the value and a lot of the promise of data, because you can never do that on your own. And the problem has been that for ISVs it is so expensive and so difficult to build a product that can be used, because the entire enterprise platform infrastructure needs to be built by somebody. Are you really going to run infrastructure, database operations, security, compliance, scalability, economics? How do you do that as a software company, where really you only have your domain expertise that you want to deliver on a platform? You don't want to do all these things.
First of all, you don't know how to do them well. So it is much easier, much faster, when there is already a platform to actually build on; in the old world, that just doesn't exist. And then beyond that, okay, fine, building it is sort of step one. Now I've got to sell it, I've got to market it. How do I do that? Well, in the Snowflake community you have a ready market <laugh>: there are thousands and thousands of customers that are also on Snowflake. So they have the ability to consume that service you just built. They can search it, they can try it, they can test it and decide whether they want to consume it, and then we can monetize it. So all they have to do is cash the check. The net effect of it is that we have drastically lowered the barriers to entry into the world of software: two men or two women and a dog and a handful of files can build something that can then be sold.
>>I wrote a piece in 2012 after the first re:Invent, and I put a big gorilla on the front page and said, how do you compete with the Amazon gorilla? And one of my answers was: you build data ecosystems and you verticalize. And that's what you're doing here.
>>Yeah. There are certain verticals that are farther along than others, obviously, but for example in financial services, which is our largest vertical, the data ecosystem is really developing hardcore now. And that's because they rely so much on those relationships between all the big financial institutions and entities: regulatory clearing houses, investment bankers, retail banks, all this kind of stuff. So it becomes a no-brainer. The network effects kick in so strongly, because they're like, well, this is really the only way to do it. If you and I work at different companies, and we do, and we want to create a secure, compliant data network and connection between us, it would take forever to get our lawyers to agree that it's okay. <laugh> Now it's a matter of minutes to set it up if we're both on Snowflake.
>>It's like procurement: do you have an MSA? Yeah, check. And it just sails right through, versus back and forth and endless negotiations.
>>Today data networking is becoming a core ecosystem in the world of computing.
>>I mean, you talked about the network effects in The Rise of the Data Cloud.
>>Correct.
>>Again, you weren't the first to come up with that notion, but you are applying it here. I want to switch topics a little bit. When I read your press releases, I laugh every time, because they say "No HQ, Bozeman." So I think I know where you land on hybrid work and remote work, but what are your thoughts? You saw Elon the other day said you can't work for us unless you come to the office. Where do you stand?
>>Well, the first aspect is that we really wanted to separate from the idea of a headquarters location, because I feel it's very antiquated. We have many different hubs. There's not one place in the world where all the important people are and where we make all the important decisions; that whole way of thinking is obsolete. I am where I need to be, and it's many different places. It's not like I sit in this incredible place and everybody comes to me. No, we are constantly moving around, and we have engineering hubs, we have regional headquarters for sales, obviously we have them in Malaysia, we have them in Europe. So I wanted to get rid of this headquarters designation.
And the other issue, obviously, is that we were obviously in California, but California is no longer the dominant place where we are resident. 40% of our engineering people are now in Bellevue, Washington. We have hundreds of people in Poland. We're going to have a very strong location in Toronto. And obviously our customers are everywhere. So this idea that everything is happening in one state is just not correct. So we wanted to go to no headquarters. Of course, the SEC doesn't let you do that, because they want you to have a street address where the government can send you mail, and then the question becomes, well, what's an acceptable location? It has to be a place where the CEO and the CFO have residency, by hook or by crook.
That happened to be Bozeman, Montana, because Mike and I are both there. It was not by design; we just did that because we were required to comply with government requirements, which of course we do, but that's why it says what it says. Now, on the topic of where we work: we are super situational about it. It's not, hey, everybody in the office, or everybody is remote; we're not categorical about it. It depends on the function, it depends on the location. But everybody is tethered to an office. In other words, everybody has a relationship with an office. There are a few exceptions of people that are completely remote, but if you get hired on with Snowflake, you will always have an office affiliation, and you can be called into the office by your manager. But for a purpose: a meeting, a training, an event. You don't get called in just to hang out, and the office is no longer your home away from home. We're now into hoteling, so you don't have a fixed place.
>>Last question and I'll let you go. You talked in your keynote a lot about customer alignment, obviously a big deal.
I have been watching; we go to a lot of events, and you'll see a technology company tell a story about their widget or their box, and then you'll see an outcome, and you look at it and shake your head and say, well, the difference between this and that is the square root of zero. When you talk about customer alignment today, we're talking about monetizing data, so that's a whole different conversation, and I wonder if you could close on how that's different. At ServiceNow, you transformed IT; I get that. At Data Domain it was, okay, tape, blow it out. But this feels like a whole new vector, or wave, of growth.
>>Yeah. Monetizing data becomes sort of a byproduct of having a data cloud. You all of a sudden become aware of the fact that, A, hey, I have data, B, that data might actually be quite valuable to other parties, and then C, it's really easy to sell it and monetize it. Because if it were hard, forget it, I don't have time for it. But if it's compliant and relatively effortless, it's pure profit. I just want to reference one or two attributes of this, by the way: hedge funds have been into this sort of thing for a long time, because they procure data from hundreds and hundreds of sources. They are the original data scientists.
But the bigger thing with data is that a lot of digital transformation is finally becoming real. For years it was arm-waving and conceptual and abstract, but it's becoming real. How do we run a supply chain? How do we run healthcare? How do we run cybersecurity? They're being redefined as data problems and data challenges, and they have data solutions. So data strategies are insanely important, because if the solution is through data, then you need to have a data strategy, and in our world that means you have a data cloud and you have all the enablement that allows you to do that. Hospitals are saying data science is going to have a bigger impact on healthcare than life science in the coming 10, 20 years. How do you enable that?
I have conversations with hospital executives who are like, I've got generations of data: clinical, diagnostic, demographic, genomic. And I am envisioning these predictive outcomes over here. I want to be able to predict when somebody is going to get what disease and what I have to do about it. How do I do that? <laugh> The day you go from "I have a lot of data" to "I have these outcomes, now do me a miracle in the middle," well, that's where we come in. We organize that, we unpack it, and then, working through training models, we can start delivering some of these insights. The promise is extraordinary. We can change whole industries like pharma and healthcare. With the effects of data, the economics will change, and the societal outcomes, quality of life, disease, longevity of life, are quite extraordinary.
Supply chain management, that's all around us right now.
>>Well, there are a lot of high-growth companies that were kind of COVID companies, valuations shot up, and now they're trying to figure out what to do. You've been pretty clear: because of what you just talked about, the opportunity is enormous. You're not slowing down, you're amping it up, pun intended. So Frank Slootman, thanks so much for coming on theCUBE. Really appreciate your time.
>>My pleasure.
>>All right. And thank you for watching. Keep it right there for more coverage from Snowflake Summit 2022. You're watching theCUBE.
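Editor's note: the external-table mechanism described in this interview, where Snowflake "projects a semantic and structural model onto an external object" sitting in S3 or S3-compatible on-prem storage, can be sketched roughly as below. The stage URL, credentials, object names, and exact DDL options are illustrative assumptions; consult Snowflake's documentation for the precise syntax for a given storage target.

```python
# Hypothetical sketch: register Parquet objects in external storage as a
# Snowflake external table and query them in place, without copying data in.
import snowflake.connector  # assumes the snowflake-connector-python package

conn = snowflake.connector.connect(
    account="my_account", user="analyst", password="********",
    warehouse="ANALYTICS_WH", database="LAKE_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Stage pointing at object storage that holds Parquet files (bucket is made up).
cur.execute("""
    CREATE STAGE IF NOT EXISTS lake_stage
      URL = 's3://example-data-lake/events/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# External table: Snowflake projects structure onto the files, the data stays put.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ext_events
      LOCATION = @lake_stage
      FILE_FORMAT = (TYPE = PARQUET)
""")

# Queried the same way an internal table would be.
cur.execute("SELECT COUNT(*) FROM ext_events")
print(cur.fetchone()[0])

conn.close()
```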

Published Date : Jun 15 2022



Jason Buffington, Veeam | VeeamON 2022


 

(upbeat music) >> Welcome back to theCUBE's coverage of VeeamON 2022. We're here at the Aria in Las Vegas, Dave Vellante with David Nicholson, my co-host for the week, two days of wall-to-wall coverage. Jason Buffington is here, JBuff, who does some amazing work for Veeam, former analyst from the Enterprise Strategy Group, so he's got a real appreciation for independent data, and we're going to dig into some data. Jason, first of all, welcome back to theCUBE. It's great to see you again. >> Yeah, two and a half years, thanks for having me back. >> Yeah, that's right. (Jason laughs) Seems like a blur. >> No doubt. >> But here's the thing, as analysts you can appreciate this: the trend is your friend, right? And everybody just inundates you now with ransomware. It's the trend. So everybody's talking about ransomware, cyber resiliency, immutability, air gaps, et cetera. Okay, great. The technology's there; it's kind of like the NFL, everybody kind of does the same thing. >> There's a lot of wonderful buzzwords in that sentence. >> Absolutely, but what you guys have done that's different is you brought in some big-time thought leadership, with data and survey work, which of course as an analyst we love, but you drive strategies off of this. So I'll set it up. You've got a new study out that's pivoted off of a February study of 3,600 organizations, and then you followed that up with a thousand organizations that actually got hit with ransomware. So tell us more about the study and the work that you've done there. >> Yeah, I've got to say I have the best job ever. I spent seven years as an analyst, and when I decided I didn't want to be an analyst anymore, I called Veeam and said, I'd like to get in the fight, and they let me in. But they let me do independent research on their behalf, so it's kind of like being in-house counsel; I'm an in-house analyst. And at the beginning of this year, in February, we published a report called the Data Protection Trends Report. It was over 3,000 responses, 28 countries around the world, looking at digital transformation, the effects of COVID, where they are on BaaS and DRaaS. But one of the new areas we wanted to look at was how pervasive is ransomware, and how does that align with BCDR overall? Some of those big thought questions that everyone's trying to solve for. And out of that we said, "Wow, this is really worth double clicking." And so today, actually about an hour ago, we published the Ransomware Trends Report, and it's a thousand organizations, all of which survived. They all had a ransomware attack. One of the things I'm most proud of for Veeam in this particular project: we used an independent research firm, so no one knows it's Veeam that's asking the questions. We don't have any access to the respondents along the way. I wish we did, right? >> Yeah, I bet. >> Go sell them backup software. But of the thousand, 200 were CISOs, 400 were security professionals, which we don't normally interact with, 200 were backup admins, 200 IT ops, and the idea was, "Okay, you've all been through a really bad day. Tell us from your four different views, how did that go? What did you solve for? What did you learn? What are you moving forward with?" And so, yeah, some great learnings all around, helping us understand how we deliver solutions that meet their needs.
And what I like about it is, like you said, it's a blind survey. You used an independent third party who I know are really good, and you guys are really honest about it. It was funny on the analyst call today, for the analyst meeting, when Danny was saying it's 54% and Dave Russell was like, it's 52%, and it actually ended up being 53%. (Jason laughs) Whereas many companies would say 75%. So anyway, what were some of the more striking findings of that study? Let's get into it a little bit. >> So, a couple of the ones that were really startling for me: on average, about one in four organizations say they have not been hit. But since we know that ransomware has a gestation of around 200 days from first intrusion to when you have that attack, that 25% may be wrong. That's 25% in the best case. Another 16% said they only got hit once in the last year. And that means 60%, right on the money, got hit more than once per year. And when you think about it, it's like that school bully: once they take your lunch money and they want lunch money again, they just come right back. Did you fix this hole? Did you fix that hole? Cool, payday. And so that was really, really scary. Once they get in, on average organizations said 47% of their production data was encrypted. Think about that. And we tested for, hey, maybe it's just in the ROBO, on the edge where the tech isn't as good, or maybe it's in the cloud because it's a broad attack surface. Whatever it is, turns out it doesn't matter. >> So this isn't just nibbling around the edges. >> No. >> This is going straight to the heart of the enterprise. >> 47% of production data, regardless of where it's stored, data center, ROBO or cloud, on average was encrypted. But what I thought was really interesting was when you look at the four personas, the security professional and the backup admin, the people responsible for prevention or remediation, saw a much higher rate of infection than the CISOs and the IT pros. I think the meta point there is the closer you are to the problem, the worse this is. 47% is bad; it's worse than that as you get closer to it. >> The other thing that struck me is that a large proportion, I think it was a third, of the companies that paid ransom... >> Oh yeah. >> ...weren't able to recover. Maybe they got the keys and it didn't work, or maybe they never got the keys. >> That's crazy too. A lot of folks watch the movies and think, "Oh, I'm going to pay the Bitcoin, I'm going to get this magic incantation key, and all of a sudden it's like it never happened." That is not how this works. So the question actually was, did you pay and did it work? And 52%, just about half of organizations, said yes, I paid and I was able to recover. A third of those that paid, 27%, they cut the check, they did the ransom, whatever, and they still couldn't get their data back. Almost even money, by the way: 24% paid but could not get back, 19% did not pay but recovered from backup. Veeam's whole job for all of 2022 and '23 needs to be to invert that number and help the other 81% say, "No, I didn't pay, I just recovered." >> Well, in a huge number of cases they attacked the backup corpus. >> Yes. >> I mean, that was... >> 94%. >> 94%? >> 94% of the time, one of the first intrusions is an attempt to get rid of the backup repository, and in two thirds of all cases the backup repository is impacted.
And so when I describe this, I talk about it this way: the ransomware thief is selling a product. They're selling your survivability as a product. And how do you increase the likelihood that someone will buy what you're selling? Get rid of the life preserver. Get rid of their only other option, because then they've got nothing left. So yeah, in two thirds of cases the backup repository goes away. That's why Veeam is so important around cloud and disk and tape, immutable at every level. That's how we do what we do. >> So what's the answer here? We hear things like immutability, we hear terms like air gap. We heard, which we don't hear often, orchestrated recovery and automated recovery; I want to come back to that. So, okay, you're differentiating with some thought leadership, that's nice. >> Yep. >> Okay, good. Thank you. The industry thanks you for that free service. But how about product and practices? How does Veeam differentiate in that regard? >> Sure. Now, full disclosure: when you download that report, for every five or six pages of research the marketing department is allowed to put in one paragraph that says, this is our answer. They call it the Veeam perspective. That's their rebuttal. For five pages of research they get one paragraph, a 250-word count, and you're done. And so there is actually a commercial... >> We're here to buy in. (chuckles) >> Toward the back of that. It's how we pay for the research. >> Everybody sells something. (laughs) >> All right. So let's talk about the tech that actually matters, because there actually are some good insights there. Certainly the first one is immutability. If you don't have a survivable repository, you have no options. And so we provide air gapping whether you are cloud based, on your favorite hyperscaler or one of the tens of thousands of cloud service providers that offer Veeam products, so you can have immutability at the cloud layer. You can certainly have immutability at the object layer, on-prem or on disk; we're happy to use all your favorite dedupe storage. And then tape. It is hard to get more air-gapped than taking the tape out of the drive and sticking it on a shelf, or sticking it in a white van and having it shipped down the street. And the fact that we aren't dependent on any architecture means choose your favorite cloud, choose your favorite disk, choose your favorite tape, and we'll make all of them usable and defendable. So that's super key number one. There are three. >> So platform agnostic, essentially. >> Yeah. >> Cloud platform agnostic. >> Any cloud, any physical, we work happily with everybody. We're just here for your data. So now you know you have at least a repository that can't be tampered with. The next thing is you need to know, do you actually have recoverable data? And that's two different questions. >> How do you know? Right, I mean... >> You don't. One of my colleagues, Chris Hoff, talks about how you can have this Nalgene bottle that makes sure that no water spills. Do you know that it's water? Is it vodka? Is it poison? You don't know. You just know that nothing's spilling out of it. That's an immutable repository. Then you've got to know, can you actually restore the data? And so automate test restores every night, not just "did the backup log work." Only 16% actually test their backups. That breaks my heart. That means 84% got it wrong. >> And that's because they just don't have the resources, or sometimes testing is dangerous. >> It can be dangerous. It can also just be hard.
I mean, how do you spin something up without breaking what's already live? So several years ago Veeam created a sandbox, what we call a data lab. We create a whole framework for you, with a proxy that goes in, and you can stand up whatever you want. You can check whether a file exists, you can ping it, you can ODBC into SQL, you can MAPI to Exchange. I mean, did it actually come up? >> You can actually run water through the recovery pipes. >> Yes. >> And tweak it so that it actually works. >> Exactly. So that's the second thing, and only 16% of organizations do it. >> Wow. >> And then the third thing is orchestration. There's a lot of complexity when you recover one workload. There is a stupid amount of complexity when you try to recover a whole site or a whole system, or, I don't know, 47% of your infrastructure. So what can you do to orchestrate that and reduce that time? Those are the three things we found. >> So on that orchestration piece: a number of customers in the survey were trying to recover manually, which is a formula for failure. I think the largest percentage were using scripts, and I want you to explain why scripts are problematic. And then there was a portion that was actually doing it right; maybe it was a quarter that was doing orchestrated recovery. But talk about why scripts are not the right approach. >> So there were two numbers in there: 16% test the ability to recover, and 25% use orchestration as part of the recovery process. And the problem is, if I'm doing it manually, think about it: I've stood back up these databases, now I have to reconnect the apps, now I have to re-IP. There's lots of stuff to stand up any given application. Scripting says, "Hey, I'm going to write those steps down." But we all know that IT and infrastructure is a living, breathing thing, and so those scripts are good for about a day after you put the application in; after that they start to gather dust pretty quickly. The thing about orchestration is, if you only have a script, you only know as much as the last time you ran the script. But if you build a workflow, you can run the workflow every night, every week, every month, and test it the same way. That's why it's such a key to success. And for us, that's Veeam Disaster Recovery Orchestrator, a product that orchestrates all the stuff that Veeam users know and love about our backup and recovery engine. >> So imagine you're an Excel user using macros: I've got to go in here, click on that, do this, and it sort of watches you and repeats that. But then something changes, new data or a new compliance issue, whatever... >> A directory got renamed. >> So you're going to have to go in and manually change that. What's the technology behind automated orchestration? What's the magic there? >> The magic is a product that we call Orchestrator. It takes all of those steps, and you define each step along the way: you define the IP addresses, you define the paths, you define where it's going to go. And then it runs the job in test mode every night, every week, whatever. And if there's a problem with any step along the way, it gives you the report: fix those things before you need it. That's the power of Orchestrator. >> So what are you guys doing with this study? What can we expect? >> So the report came out today. In a couple of weeks we'll release regional versions of the same data.
The reason that we survey at scale is because we want to know what's different in APJ versus the Americas versus Europe, and across all those different personas. So we'll be releasing regional versions of the data along the way, and then we'll enable roadshows and events and all the other stuff that happens, and our partners get it so they can use it for consulting, et cetera. >> So you saw differences by persona: in terms of their perception, the closer you were to the problem, the more obvious it was. Did you have enough N to discern it clearly? I know that's why you do the drill-downs, but did you sense any preliminary data you can share on regions? Is the West getting hit harder, or...? >> So the attack rate is actually pretty consistent, especially because so many criminals now use ransomware as a service. You're standing it up, you're spreading wide, and you're seeing what hits. Where we actually saw pretty distinct geographic differences is that the cloud is not as available in all segments, and expertise around preventative measures and remediation is not available in all segments, in all regions. So really, geographic split, segment split, and the lack of expertise in some of the more advanced technologies you want to use, that's where things break down. Common attack plane, uncommon disadvantage in recovery. >> Great stuff. I want to dig in more. I probably have a few more questions; if you don't mind, I can email you or give you a call. It's Jason Buffington. Thanks so much for coming on theCUBE. >> Thanks for having me. >> All right, keep it right there. You're watching theCUBE's live coverage of VeeamON 2022. We're here in person in Las Vegas, huge hybrid audience. Keep it right there, we'll be right back. (upbeat music)
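Editor's note: the orchestration point made above, run the recovery workflow on a schedule and verify every step rather than trusting a script written on day one, is a generic pattern. The sketch below is an illustrative, vendor-neutral outline of a nightly test-restore check, not Veeam's product or API; the step names, addresses, and checks are assumptions.

```python
# Illustrative nightly "test restore" workflow: each recovery step is executed
# in an isolated sandbox and then verified, so drift is caught before an outage.
import socket
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RecoveryStep:
    name: str
    action: Callable[[], None]      # e.g. restore a VM, remap an IP
    verify: Callable[[], bool]      # e.g. ping it, open a port, run a query

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap liveness check: can we open a TCP connection to the restored service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_workflow(steps: List[RecoveryStep]) -> bool:
    """Run every step in order and report per-step results, like a nightly drill."""
    ok = True
    for step in steps:
        step.action()
        passed = step.verify()
        ok = ok and passed
        print(f"{step.name}: {'PASS' if passed else 'FAIL'}")
    return ok

# Hypothetical workflow: restore a database VM into a sandbox, then check it answers.
steps = [
    RecoveryStep("restore-db-vm",
                 action=lambda: None,                           # placeholder restore call
                 verify=lambda: port_open("10.0.0.15", 5432)),  # made-up sandbox address
]
run_workflow(steps)
```

Running something like this every night is what turns "the backup log said OK" into "we know this application actually comes back."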

Published Date : May 17 2022



Dipti Borkar, Ahana, and Derrick Harcey, Securonix | CUBE Conversation, July 2021


 

(upbeat music) >> Welcome to theCUBE Conversation. I'm John Furrier, host of theCUBE, here in Palo Alto, California, in our studios. We've got a great conversation around open data lake analytics on AWS with two great companies, Ahana and Securonix. Dipti Borkar, Co-founder and Chief Product Officer at Ahana, is here. Great to see you. And Derrick Harcey, Chief Architect at Securonix. Thanks for coming on, really appreciate you guys spending the time. >> Yeah, thanks so much, John. Thank you for having us, and Derrick, hello again. (laughing) >> Hello, Dipti. >> We had a great conversation around our startup showcase, where you guys were featured last month, this year, 2021. The conversation continues, and a lot of people are interested in this idea of open systems, open source. Obviously open data lakes are really driving a lot of value, especially with machine learning and whatnot. So this is a key point. Can you guys take a step back before we get under the hood and set the table on Securonix and Ahana? What's the big play here? What is the value proposition? >> Sure, I'll give a quick update. Securonix has been in the security business, first user and entity behavioral analytics and then the next generation SIEM platform, for 10 years now. And we really need to take advantage of some cutting-edge technologies in the open source community and drive adoption and momentum, so that we can not only bring in data from our customers so they can find security threats, but also store it in a way that they can use for other purposes within their organization. That's where the open data lake is very critical. >> Yeah, and to add on to that, John, traditionally we've had data warehouses. We've had operational systems move all of their data into the warehouse, and while these systems are really good and built for good use cases, the amount of data is exploding and the types of data are exploding: different types, semi-structured, structured. And so as companies like Securonix in the security space, as well as other verticals, look to get more insights out of their data, there's a new approach emerging where you have a data lake, which AWS has revolutionized and commoditized with S3, and there's analytics built on top of it. And so we're seeing a lot of good advantages that come out of this new approach. >> Well, it's interesting, EC2 and S3 are having their 15th birthday, as they say, Amazon's interesting teenage years. But while I've got you guys here, I want to ask you, can you define the SIEM thing? Because the SIEM market is exploding, and it just changed a little bit. Obviously it's security information and event management, but again, as data proliferates, and it's not stopping anytime soon as cloud-native applications emerge, why is this important? What is this SIEM category? What's it about? >> Yeah, thanks, I'll take that. So obviously SIEM traditionally has been around for about a couple of decades, and it really started with log collection and management and rule-based threat detection. Now, what we call next generation SIEM is really the modernization of a security platform that includes streaming threat detection, behavioral analysis and data analytics. We literally look for thousands of different threat detection techniques, we chain together sequences of events, and we stream everything in real time; it's very important to find threats as quickly as possible.
With the momentum that we see in the industry, and as we see massive customers, we have made a transition from on-premises to the cloud, and we are literally processing tens of petabytes of data for our customers. It's critical that we can ingest data quickly, find threats quickly, and give customers the tools to respond to those security incidents quickly and really get a handle on their security posture. >> Derrick, if I ask you what's different about this next gen SIEM, what would you say? What's the big a-ha moment? What's the key thing? >> The real key is taking off the boundaries of scale. We want to be able to ingest massive quantities of data, we want to be able to do instant threat detection, and we want to be able to search the entire forensic data set across all of the history of our customer base. In the past, we had to make sacrifices, either on the amount of data we ingest or the amount of time we stored that data. And really, the next generation SIEM platform offers advanced capabilities on top of that data set because those boundaries are no longer barriers for us. >> Dipti, any comment before I jump into the question for you? >> Yeah, absolutely. It is about scale, and like I mentioned earlier, the amount of data is only increasing, and so are the types of information. The systems that were built to process this information in the past support maybe terabytes of data. That's where new technologies, open source engines like Presto, come in, which were built to handle internet scale. Presto was created at Facebook to handle these petabytes that Derrick is talking about, that every industry is now seeing, where we're moving from gigs to terabytes to petabytes. And that's where the analytics stack is moving. >> That's a great segue. I want to ask you while I've got you here, because people love to hear the experts weigh in on the definitions: what is open data lake analytics? How would you define that? And then talk about where Presto fits in. >> Yeah, that's a great question. The way I define open data lake analytics is: you have a data lake at the core, which is, let's say, S3, the most popular one, but on top of it there are open aspects. It is open format. Open formats play a very important role because you can have different types of processing, SQL processing, machine learning, other types of workloads, all working on these open formats versus a proprietary format where it's locked in. And it's open interfaces: open interfaces like SQL, JDBC and ODBC are widely accessible to a range of tools, and so it's everywhere. Open source is a very important part of it: as companies like Securonix pick these technologies for their mission-critical systems, they want to know that this is going to be available and open for them for a long period of time, and that's why open source becomes important. And then finally, I would say open cloud, because at the end of the day, while AWS is where a lot of the innovation is happening and a lot of the market is, there are other clouds, and open cloud is something these engines were built for. So that's how I define open data lake analytics: it's analytics with query engines built on top of these open formats, open source, open interfaces and open cloud. Now, Presto comes in where you want to find the needle in the haystack, right?
And so when you have these deep questions about where did the threat come from or who was it, right? You have to ask these questions of your data. And Presto is an open source distributed SQL engine that allows data platform teams to run queries on their data lakes in a high-performance ways, in memory and on these petabytes of data. So that's where Presto fits in. It's one of the defacto query engines for SQL analysis on the data lake. So hopefully that answers the question, gives more context. >> Yeah, I mean, the joke about data lakes has been you don't want to be a data swamp, right? That's what people don't want. >> That's right. >> But at the same time, the needle in the haystack, it's like big data is like a needle in a haystack of needles. So there's a constant struggle to getting that data, the right data at the right time. And what I learned in the last presentation, you guys both presented, your teams presented at the conference was the managed service approach. Could you guys talk about why that approach works well together with you guys? Because I think when people get to the cloud, they replatform, then they start refactoring and data becomes a real big part of that. Why is the managed service the best approach to solving these problems? >> Yeah and interestingly, both Securonix and Ahana have a managed service approach so maybe Derrick can go first and I can go after. >> Yeah, yeah. I'll be happy to go first. You know, we really have found making the transition over the last decade from off premise to the cloud for the majority of our customers that running a large open data lake requires a lot of different skillsets and there's hundreds of technologies in the open source community to choose from and to be able to choose the right blend of skillsets and technologies to produce a comprehensive service is something that customers can do, many customers did do, and it takes a lot of resources and effort. So what we really want to be able to do is take and package up our security service, our next generation SIEM platform to our customers where they don't need to become experts in every aspect of it. Now, an underlying component of that for us is how we store data in an open standards way and how we access that data in an open standards way. So just like we want our customers to get immediate value from the security services that we provide, we also want to be able take advantage of a search service that is offered to us and supported by a vendor like Ahana where we can very quickly take advantage of that value within our core underlying platform. So we really want to be able to make a frictionless effort to allow our customers achieve value as quick as possible. >> That's great stuff. And on the Ahana side, open data lakes, really the ease of use there, it sounds easy to me, but we know it's not easy just to put data in a data lake. At the end of the day, a lot of customers want simplicity 'cause they don't have the staffing. This comes up a lot. How do you leverage their open source participation and/or getting stood up quickly so they can get some value? Because that seems to be the number one thing people want right now. Dipti, how does that work? How do people get value quickly? >> Yeah, absolutely. When you talk about these open source press engines like Presto and others, right? They came out of these large internet companies that have a lot of distributed systems, engineers, PhDs, very kind of advanced level teams. 
And they can manage these distributed systems, build onto them and add features at large scale, but not every company can, and these engines are extremely powerful. So when you combine the power of Presto with the cloud and a managed service, that's where value for everyone comes in. And that's what we did with Ahana: we looked at Presto, which is a great engine, and converted it into a great user experience, so that whether it's a three person platform team or a five person platform team, they still get the same benefit of Presto that a Facebook gets, but at much, much less operational complexity and cost, as well as the ability to depend on a vendor who can then drive the innovation and make it even better. And so that's where managed services really come in. There's thousands of configuration parameters that need to be tuned. With Ahana, you get it out of the box. So you have the best practices that are followed at these larger companies. Our team comes from Facebook, Uber and others, and you get that out of the box; with a few clicks you can get up and running. And so you see value immediately, in 30 minutes you're up and running and you can create your data lake, versus with Hadoop and these prior systems it would take months to receive real value from some of these systems. >> Yeah, we saw the Hadoop scar tissue, it's all great and all good now, but it takes too much resource, standing up clusters, managing it, you can't hire enough people. I got to ask you while you're on that topic, do you guys ship templates? How do you solve the problem of out of the box? You mentioned some out of the box capability. Do you think of these as recipes, templates? What's your thoughts around what you're providing customers to get up and running? >> Yeah, so in the case of Securonix, right, let's say they want to create a Presto cluster. They go into our SaaS console. You essentially put in the number of nodes that you want, the number of workers you want. There's a lot of additional value that we built in, like caching capabilities if you want more performance, and built-in cataloging, that's again another single click. And there isn't really as much of a template. Everybody gets the best tuned Presto for their workloads. Now there are certain workloads where you might have interactive queries in some cases, or you might have transformation, batch ETL, and what we're doing next is actually giving you the knobs so that it comes pre-tuned for the type of workload that you want to run versus you figuring it out. And so that's what I mean by out of the box, where you don't have to worry about these configuration parameters. You get the performance. And maybe Derrick, can you talk a little bit about the benefits of the managed service and the usage as well? >> Yeah, absolutely. So I'll answer the same question and then I'll tie back to what Dipti asked. Really, you know, for our customers, we want it to be very easy for them to ingest security event logs. And there's really hundreds of types of security event logs that we support natively out of the box, but the key for us is a standard that we call the open event format, and that is a normalized schema. We take any data source into that normalized format, whether it comes from a collector device a customer uses on-premise; they send the data up to our cloud, we do streaming analysis and data analytics to determine where the threats are. And once we do that, then we send the data off to a long-term storage format in a standards-based Parquet file. And that Parquet file is natively read by the Ahana service.
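A small sketch of the hand-off Derrick describes: events normalized into a common schema and written out as standards-based Parquet that any engine can read. The column names are illustrative, not the actual Open Event Format, and the file is written locally here for simplicity rather than to S3.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny batch of events already normalized into a common schema.
events = pa.table({
    "event_time": pa.array(["2021-07-01T12:00:00Z", "2021-07-01T12:00:05Z"]),
    "src_ip":     pa.array(["10.0.0.7", "10.0.0.9"]),
    "username":   pa.array(["alice", "bob"]),
    "outcome":    pa.array(["login_failure", "login_success"]),
})

# In the architecture described above the files would land in S3, partitioned
# so the query engine can prune by date; a local path keeps the sketch runnable.
pq.write_table(events, "events-2021-07-01.parquet", compression="snappy")

# Any engine that understands Parquet, whether Presto via Ahana, Spark or
# pandas, can read the same file back without a proprietary export step.
print(pq.read_table("events-2021-07-01.parquet").to_pydict())
```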
So we simply deploy an Ahana cluster that uses the Presto engine that natively supports our open standard file format. And we have a normalized schema that our application can immediately start to see value from. So we handle the collection and streaming ingest, and we simply leverage the engine in Ahana to give us the appropriate scale. We can size up and down and control the cost to give the users the experience that they're paying for. >> I really love this topic because one, not only is it cutting edge, but it's very relevant for modern applications. You mentioned next gen SIEMs, SIEM, security information event management, not SIM as memory card, which I think of all the time because I always want to add more, but this brings up the idea of streaming data real-time, but as more services go to the cloud, Derrick, if you don't mind sharing more on this. Share the journey that you guys gone through, because I think a lot of people are looking at the cloud and saying, and I've been in a lot of these conversations about repatriation versus cloud. People aren't going that way. They're going more innovation with his net new revenue models emerging from the value that they're getting out of understanding events that are happening within the network and the apps, even when they're being stood up and torn down. So there's a lot of cloud native action going on where just controlling and understanding is way beyond the, just put stuff into an event log. It's a whole nother animal. >> Well, there's a couple of paradigm shifts that we've seen major patterns for in the last five or six years. Like I said, we started with the safe streaming ingest platform on premise. We use some different open source technologies. What we've done when we moved to the cloud is we've adopted cloud native services as part of our underlying platform to modernize and make our service cloud native. But what we're seeing as many customers either want to focus on on-premise deployments and especially financial institutions and government institute things, because they are very risk averse. Now we're seeing even those customers are realizing that it's very difficult to maintain the hundreds or thousands of servers that it requires on premise and have the large skilled staff required to keep it running. So what we're seeing now is a lot of those customers deployed some packaged products like our own, and even our own customers are doing a mass migration to the cloud because everything is handled for them as a service. And we have a team of experts that we maintain to support all of our global customers, rather than every one of our global customers having their own teams that we then support on the back end. So it's a much more efficient model. And then the other major approach that many of our customers also went down the path of is, is building their own security data lake. And many customers were somewhat successful in building their own security data lake but in order to keep up with the innovation, if you look at the analyst groups, the Gartner Magic Quadrant on the SIEM space, the feature set that is provided by a packaged product is a very large feature set. And even if somebody was put together all of the open source technologies to meet 20% of those features, just maintaining that over time is very expensive and very difficult. So we want to provide a service that has all of the best in class features, but also leverages the ability to innovate on the backend without the customer knowing. 
So we can do a technology shift to Ahana and Presto from our previous technology set. The customer doesn't know the difference, but they see the value add within the service that we're offering. >> So if I get this right, Derrick, Presto's enabling you guys to do threat detection at a level that you're super happy with as well as giving you the option for give self-service. Is that right for the, is that a kind of a- >> Well, let me clarify our definition. So we do streaming threat detection. So we do a machine learning based behavioral analysis and threat detection on rule-based correlation as well. So we do threat detection during the streaming process, but as part of the process of managing cybersecurity, the customer has a team of security analysts that do threat hunting. And the threat hunting is where Ahana comes in. So a human gets involved and starts searches for the forensic logs to determine what happened over time that might be suspicious and they start to investigate through a series of queries to give them the information that's relevant. And once they find information that's relevant, then they package it up into an algorithm that will do a analysis on an ongoing basis as part of the stream processing. So it's really part of the life cycle of hunting a real time threat detection. >> It's kind of like old adage hunters and farmers, you're farming through the streaming and hunting with the detection. I got to ask you, what would it be the alternative if you go back, I mean, I know cloud's so great because you have cutting edge applications and technologies. Without Presto, where would you be? I mean, what would be life like without these capabilities? What would have to happen? >> Well, the issue is not that we had the same feature set before we moved to Presto, but the challenge was on scale. The cost profile to continue to grow from 100 terabytes to one petabyte, to tens of petabytes, not only was it expensive, but it just, the scaling factors were not linear. So not only did we have a problem with the costs, but we also had a problem with the performance tailing off and keeping the service running. A large Hadoop cluster, for example, our first incarnation of this use, the hive service, in order to query data in a MapReduce cluster. So it's a completely different technology that uses a distributed Hadoop compute cluster to do the query. It does work, but then we start to see resource contention with that, and all the other things in the Hadoop platform. The Presto engine has the beauty of it, not only was it designed for scale, but it's feature built just for a query engine and that's the providing the right tool for the job, as opposed to a general purpose tool. >> Derrick, you've got a very busy job as chief architect. What are you excited about going forward when you look at the cloud technologies? What are you looking at? What are you watching? What are you getting excited about or what worries you? >> Well, that's a good question. What we're really doing, I'm leading up a group called the Securonix Innovation Labs, and we're looking at next generation technologies. We go through and analyze both open source technologies, technologies that are proprietary as well as building own technologies. 
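For a sense of what one of the forensic, threat-hunting queries Derrick describes might look like, here is a hedged sketch using the presto-python-client (prestodb) package against a Presto coordinator. The host, catalog, schema and table are placeholders that mirror the Parquet layout sketched earlier, and the SQL is an illustrative hunt, not a Securonix detection rule.

```python
import prestodb  # pip install presto-python-client

# Placeholder coordinates for wherever the managed Presto cluster is exposed.
conn = prestodb.dbapi.connect(
    host="presto.example.internal",
    port=8080,
    user="threat-hunter",
    catalog="hive",
    schema="security",
)
cur = conn.cursor()

# A typical hunting question: which accounts saw repeated login failures and
# at least one success from the same source IP during a given day?
cur.execute("""
    SELECT username, src_ip,
           count_if(outcome = 'login_failure') AS failures,
           count_if(outcome = 'login_success') AS successes
    FROM security_events
    WHERE event_time >= '2021-07-01T00:00:00Z'
      AND event_time <  '2021-07-02T00:00:00Z'
    GROUP BY username, src_ip
    HAVING count_if(outcome = 'login_failure') >= 5
       AND count_if(outcome = 'login_success') >= 1
""")
for row in cur.fetchall():
    print(row)
```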
And that's where we came across Ahana as part of a comprehensive analysis of different search engines, because we wanted to go through another round of search engine modernization, and we worked together in a partnership, and we're going to market together as part of our modernization efforts that we're continuously going through. So I'm looking forward to iterative continuous improvement over time. And this next journey, what we're seeing because of the growth in cybersecurity, really requires new and innovative technologies to work together holistically. >> Dipti, you got a great company that you co-founded. I got to ask you as the co-founder and chief product officer, you both the lead entrepreneur also, got the keys to the kingdom with the products. You got to balance that 20 miles stare out in the future while driving product excellence. You've got open source as a tailwind. What's on your mind as you go forward with your venture? >> Yeah. Great question. It's been super exciting to have found the Ahana in this space, cloud data and open source. That's where the action is happening these days, but there's two parts to it. One is making our customers successful and continuously delivering capabilities, features, continuing on our ease of use theme and a foundation to get customers like Securonix and others to get most value out of their data and as fast as possible, right? So that's a continuum. In terms of the longer term innovation, the way I see the space, there is a lot more innovation to be done and Presto itself can be made even better and there's a next gen Presto that we're working on. And given that Presto is a part of the foundation, the Linux Foundation, a lot of this innovation is happening together collaboratively with Facebook, with Uber who are members of the foundation with us. Securonix, we look forward to making a part of that foundation. And that innovation together can then benefit the entire community as well as the customer base. This includes better performance with more capabilities built in, caching and many other different types of database innovations, as well as scaling, auto scaling and keeping up with this ease of use theme that we're building on. So very exciting to work together with all these companies, as well as Securonix who's been a fantastic partner. We work together, build features together, and I look at delivering those features and functionalities to be used by these analysts, data scientists and threat hunters as Derrick called them. >> Great success, great partnership. And I love the open innovation, open co-creation you guys are doing together and open data lakes, great concept, open data analytics as well. This is the future. Insights coming from the open and sharing and actually having some standards. I love this topic, so Dipti, thank you very much, and Derrick, thanks for coming on and sharing on this Cube Conversation. Thanks for coming on. >> Thank you so much, John. >> Thanks for having us. >> Thanks. Take care. Bye-bye. >> Okay, it's theCube Conversation here in Palo Alto, California. I'm John furrier, your host of theCube. Thanks for watching. (upbeat music)

Published Date : Jul 30 2021


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Securonix | ORGANIZATION | 0.99+
John | PERSON | 0.99+
Derrick Harcey | PERSON | 0.99+
Derrick | PERSON | 0.99+
Facebook | ORGANIZATION | 0.99+
Ahana | ORGANIZATION | 0.99+
Ahana | PERSON | 0.99+
John Furrier | PERSON | 0.99+
20% | QUANTITY | 0.99+
July 2021 | DATE | 0.99+
Uber | ORGANIZATION | 0.99+
Dipti | PERSON | 0.99+
100 terabytes | QUANTITY | 0.99+
Amazon | ORGANIZATION | 0.99+
10 years | QUANTITY | 0.99+
AWS | ORGANIZATION | 0.99+
hundreds | QUANTITY | 0.99+
Linux Foundation | ORGANIZATION | 0.99+
two parts | QUANTITY | 0.99+
thousands | QUANTITY | 0.99+
Securonix Innovation Labs | ORGANIZATION | 0.99+
tens of petabytes | QUANTITY | 0.99+
30 minutes | QUANTITY | 0.99+
one petabyte | QUANTITY | 0.99+
Dipti Borkar | PERSON | 0.99+
20 miles | QUANTITY | 0.99+
Palo Alto, California | LOCATION | 0.99+
five person | QUANTITY | 0.99+
First | QUANTITY | 0.99+
SQL | TITLE | 0.99+
last month | DATE | 0.99+
both | QUANTITY | 0.99+
One | QUANTITY | 0.98+
15th birthday | QUANTITY | 0.97+
two great companies | QUANTITY | 0.96+
HuBERT | ORGANIZATION | 0.96+
Hadoop | TITLE | 0.96+
S3 | TITLE | 0.96+
hundreds of technologies | QUANTITY | 0.96+
three person | QUANTITY | 0.95+
Parquet | TITLE | 0.94+
first incarnation | QUANTITY | 0.94+
first | QUANTITY | 0.94+
Presto | ORGANIZATION | 0.93+
Gartner | ORGANIZATION | 0.93+
last decade | DATE | 0.92+
terabytes of data | QUANTITY | 0.92+
first log | QUANTITY | 0.91+
single click | QUANTITY | 0.9+
Presto | PERSON | 0.9+
theCUBE | ORGANIZATION | 0.88+

Rob Harris, Stardog | Cube Conversation, March 2021


 

>>hello. >>Welcome to the special key conversation. I'm John ferry, host of the queue here in Palo Alto, California, featuring star dog is a great hot start-up. We've got a great guest, Rob Harris, vice president of solutions consulting for star dog here talking about some of the cloud growth, um, knowledge graphs, the role of data. Obviously there's a huge sea change. You're seeing real value coming out of this COVID as companies coming out of the pandemic, new opportunities, new use cases, new expectations, highly accelerated shift happening, and we're here to break it down. Rob, thanks for joining us on the cube conversation. Great to be here. So got, I'm excited to talk to you guys about your company and specifically the value proposition I've been talking for almost since 2007 around graph databases with Neo four J came out and looking at how data would be part of a real part of the developer mindset. Um, early on, and this more of the development. Now it's mainstream, you're seeing value being created in graph structures. Okay. Not just relational. This has been, uh, very well verified. You guys are in this business. So this is a really hot area, a lot of value being created. It's cool. And it's relevant. So tell us first, what is star dog doing? What's uh, what is the company about? >>Yeah, so I mean, we are an enterprise knowledge graph platform company. We help people be successful at standing up knowledge graphs of the data that they have both inside their company and using public data and tying that all together in order to be able to leverage that connected data and really turn it into knowledge through context and understand it. >>So how did this all come about this from a tech standpoint? What is the, what is the, uh, what was the motivation around this? Because, um, obviously the unstructured wave hit, you're seeing successes like data bricks, for instance, just absolutely crushing it on, on their valuation and their relevance. You seeing the same kind of wave hit almost kind of born back on the Hadoop days with unstructured data. Is that a big part of it? Is it just evolution? What's the big driver here? >>Yeah, no, I think it's a, it's a great question. The driver early is as these data sets have increased for so many companies trying to really bring some understanding to it as they roll it out in their organizations, you know, we've tried to just try to centralize it and that hasn't been sufficient in order to be able to unlock the value of most organization status. So being able to step beyond just, you know, pulling everything together into one place, but really putting that context and meaning around it that the graph can do. So that's where we've really got started at, uh, back in the day is we really looked at the inference and reasoning part of a knowledge graph. How do we bring more context and understanding that doesn't naturally exist within the data? And that really is how we launched off the product. >>I got to ask you around the use cases because one of the things that's really relevant right now is you're seeing a lot of front end development around agile application. Dev ops is brought infrastructure as code. 
You're seeing kind of this huge tsunami of new of applications, but one of the things that people are talking about in some of the developer circles and it's kind of hits the enterprise is this notion of state because you can have an application calling data, but if the data is not addressable and then keeping state and in real time and all these kinds of new, new technical problems, how do you guys look at that? When you look at trying to create knowledge graphs, because maintaining that level of connection, you need data, a ton of it it's gotta be exposed and addressable and then deal dealt with in real time. How do you guys look at it? >>Yeah, that's, that's a great question. What we've done to try to kind of move the ball forward on this is move past, trying to centralize that data into a knowledge graph that is separate from the rest of your data assets, but really build a data virtualization layer, which we have integrated into our product to look at the data where it is in the applications and the unstructured documents and the structure repositories, so that we can observe as state changes in that data and answer questions that are relevant at the time. And we don't have to worry about some sort of synchronous process, you know, loading information into the graph. So that ability to add that virtualization layer, uh, to the graph really enables you to get more of a real time, look at your data as it evolves. >>Yeah. I definitely want to double, double click on that and say, but I want to just drop step back and kind of set the table for the folks that aren't, um, getting in the weeds yet on this. There's kind of a specific definition of enterprise knowledge graph. Could you like just quickly define that? What is the enterprise knowledge graph? Sure. >>Yeah, we, we really see an enterprise knowledge graph as a connected set of data with context. So it's not just storing it like a graph, but connect again and putting meaning around that data through structure, through definitions, et cetera, across the entire enterprise. So looking at not just data within a single application or within a single silo, but broadly through your enterprise, what does your data mean? How is it connected and what does it look like within context each other? >>How should companies reuse their data? >>Boy, that's a broad question, right? Uh, you know, I mean, one of the things, uh, that I think is very important as so many companies have just collected data assets over the years, they collect more and more and more. We have customers that have eight petabytes of data within their data Lake. And they're trying to figure out how to leverage it by actually connecting and putting that context around the data. You can get a lot more meaning out of that old data or the stale data or the unknown data that the people are getting right today. So the ability to reuse the data assets with in context of meeting is where we see people really be able to make huge licks for in their organization like drug companies be able to get drugs to market faster. By looking at older studies, they've done where maybe the meeting was hidden because it was an old system. Nobody knew what the particular codes and meaning were in context of today. So being able to reuse and bring that forward brings real life application to people solving business problems today. >>Rob, I got to get your thoughts on something that we always riff on here on the cube, which is, um, you know, do you take down the data silos or do you leverage them? 
And you know, this came up a lot, many years ago when we first started discussing containers, for instance, and then that we saw that you didn't have to kill the old to bring in the new, um, there's one mindset of, you know, break the silos down, go horizontal scalability on the data, critical data, plane control, plane, other saying, Hey, you know what, just put it, you know, put a wrapper around those, those silos and you know, I'm oversimplifying, but you get the idea. So how should someone who's really struggling with, or, or not struggling, we're putting together an architecture around their future plans around dealing with data and data silos specifically, because certainly as new data comes in there's mechanism for that. But as you have existing data silos, what do companies do? What's the strategy in your opinion? >>Yeah, you know, it is a really interesting question. I was in data warehouse and for a long, long time and a big proponent of moving everything to one place. And, uh, then I really moved into looking into data virtualization and realized that neither of those solutions are complete, that there are some things that have to be centralized and moved the old systems aren't sufficient in order to be able to answer questions or process them. But there are many data silos that we've created within organizations that can be reused. You can leverage the compute, you can leverage the storage that already exist within us. And that's the approach we've taken at start off. We really want to be able to allow you to centralize the data that makes sense, right. To get it out of those old systems, that should be shut down from just a monetary perspective, but the systems that are have actual meeting or that it's too expensive in order to, to remove them, leverage those data silos. And by letting you have both approaches in the same platform, we hope to make this not an either or architectural decision, which is always the difficult question. >>Okay. So you got me on that one. So let me just say that. I want to leverage my data silos. What do I do? Take me through the playbook. What if I got the data silos? What is the star dog recommendation for me? >>Sure. So what, what we generally recommend is you start off with building kind of a model, uh, in the, in the lingo, we sometimes say ontology Euro, some sort of semantic understanding that puts context around what is my data and what does it mean? And then we allow you to map those data silos. We have a series of connectors in our product that whether it's an application and you're connecting through a rest connector, or whether it's a database and you're connecting through ODBC or JDBC map that data into the platform. And then when you issue queries to the startup platform, we federate those queries out to the downstream systems and answer as if that data existed on the graph. So that way we're leveraging the silos where they are without you having to move the data physically into the platform. So you guys are essentially building a >>Data fabric. >>We are, yeah. Data fabric is really the new term. That's been popping up more and more with our customers when they come to us to say, how can we kind of get past the traditional ways of doing data integration and unified data in a single place? Like you said, we don't think the answer is purely all about moving it all to one big Lake. We don't think the answer is all about just creating this virtualization plane, but really being able to leverage the festival. >>All right. 
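To picture the query side of what Rob describes, mapping silos through REST, ODBC or JDBC connectors and then federating queries as if the data already lived in the graph, here is a minimal sketch that sends a SPARQL query to a Stardog server over its standard SPARQL HTTP protocol. The database name, credentials, the tiny example ontology and the idea that the two predicates are backed by external systems are all assumptions made for the example, not details from the interview.

```python
import requests

STARDOG = "http://localhost:5820"   # illustrative server location
DB = "enterprise-kg"                # illustrative database name

query = """
PREFIX ex: <http://example.com/ontology#>
SELECT ?customer ?orderId ?ticketId WHERE {
    ?customer a ex:Customer ;
              ex:placedOrder ?order ;
              ex:openedTicket ?ticket .
    ?order  ex:orderId  ?orderId .
    ?ticket ex:ticketId ?ticketId .
}
LIMIT 10
"""

resp = requests.post(
    f"{STARDOG}/{DB}/query",
    data=query,
    headers={
        "Content-Type": "application/sparql-query",
        "Accept": "application/sparql-results+json",
    },
    auth=("admin", "admin"),  # placeholder credentials
)
resp.raise_for_status()

# If ex:placedOrder and ex:openedTicket were mapped to external sources, say an
# orders database reached over JDBC and a ticketing system, the engine could
# answer this by pushing sub-queries down to those systems rather than
# requiring the rows to be copied into the graph first.
for row in resp.json()["results"]["bindings"]:
    print(row["customer"]["value"], row["orderId"]["value"], row["ticketId"]["value"])
```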
So, so if you, if you believe that, then let's just go to the next level then. So if you believe that they can, don't have to move things around and to have one specific thing, how does a customer deal with their challenge of hybrid cloud and soon to be multi-cloud because that's certainly on the horizon. People want choice. There's going to be architectural. I mean, certainly a cloud operations will be in play, but this on-premise and this cloud, and then soon to be multiple cloud. How do you guys deal with that? That question? >>Yeah, that's a great question. And this is really a, an area that we're very excited about and we've been investing very heavily in is how to have multiple instances of StarTalk running in different clouds or on prem on the clown, coordinate to answer questions, to minimize data movement between the platforms. So we have the ability to run either an agent on prem. For example, if you're running the platform in the cloud or vice versa, you can run it in the cloud. You are two full instances that start off where they will actually cope plan queries to understand where does the data live? Where is it resident and how do I minimize moving data around in order to answer the question? So we really are trying to create that unified data fabric across on-prem or multiple cloud providers, so that any of the nodes in the platform can answer question from any of the datas >>S you know, complexity is always the issue. People cost go up. When you have complexity, you guys are trying to tame it. This is a huge conversation. You bring up multi-cloud and hybrid cloud. And multi-cloud when you think about the IOT edge, and you don't want to move data around, this is what everyone's saying, why move it? Why move data? It's expensive to move data processes where it is, and you kind of have this kind of flexibility. So this idea of unification is a huge concept. Is that enough? And how should customers think about the unification? Because if you can get there, it almost, it is the kind of the Holy grail you're talking about here. So, so this is kind of the prospect of, of having kind of an ideal architecture of unification. So take us, take me through that one step deeper. >>Well, it is, it is kind of interesting because as you really think about unifying your data and really bringing it together, of course it is the Holy grail. And that's what people have been talking about. Um, gosh, since I started in the industry over 20 years ago, how do I get this single plain view of my data, regardless of whether it's physically located or, uh, somehow stitched together, but what are the things that, you know, our founders really strongly believed on when they started the company? Was it isn't enough. It isn't sufficient. There is more value in your data that you don't even know. And unlocking that through either machine learning, which is, of course, we all know it's very hot right now to look at how do I derive new insights out of the data that I already have, or even through logical reasoning, right? And inference looking at, what do I understand about how that data is put together and how it's created in order to create more connections within the data and answer more questions. All those are ways to grow beyond just unifying your data, but actually getting more insights out of it. And I think that is the real Holy grail that people are looking for, not just bringing all the data together, but actually being able to get business value and insights out of that data. Yeah. 
>>Looking for it. You guys have obviously a pretty strong roster of clients that represent that. Um, but I got to ask you, since you brought up the founders, uh, the company, obviously having a founders' DNA, uh, mindset, um, tends to change the culture or drive the culture of the covenant change with age drives the culture of the company. What is the founder's culture inside star, dog? What is the vibe there, if you could, um, what do they talk about the most when you, when they get in that mode of being founders like, Hey, you know, this is the North star, what is, what's the rap like? What's the vibe share? It takes that, take us through some star star, dog culture. >>Sure. So our three founders came out of the rusty of Maryland, all in a PhD program around semantic reasoning and logical understanding and being able to understand data and be able to communicate that as easily as possible is really the core and the fiber of their being. And that's what we see continually under discussion every single day. How can we push the limits to take this technology and your gift easier to use more available, bring more insights to the customers beyond what we've seen in the past. And I find that really exciting to be able to constantly have conversations about how do we push the envelope? How do we look beyond even what Gartner says is five or eight years in the future, but looking even further ahead. So there >>They're into they're into this whole data scene. Then big time they are >>That they are very active in the conferences and posts and you know, all that great. >>They love this agility. They got to love dev ops. I mean, if you're into this knowledge graph scene, so I gotta, I gotta ask you, what's the machine learning angle here, obviously, AI, we know what AI is. AI is essentially combination of many things, machine learning and other computer science and data access. Um, what is the secret sauce behind the machine learning and, and the vibe and the product of, of, uh, >>Yeah, a lot of times w we, the way that we leverage machine learning or the way that we look at it is how do we create those connections between data? So you have multiple different systems and you're trying to bring all that data together. Yeah. It's not always easy to tell, is this rod Harris the same as that rod Harris is this product the same as that product. So when possible we will leverage keys or we'll leverage very, uh, you know, systematic type of understanding of these things are the same, but sometimes you need to reach beyond that. And that's where we leverage a lot of machine learning within the platform, looking at things like linear regression or other approaches around the graph, you know, connectivity, analysis, page rank, things like that to say, where are things the same so that we can build that connections in that connectivity as automatically as possible. >>You don't get a lot of talks on the cube. Also. Now that's new news, new clubhouse app, where people are talking about misinformation, obviously we're in the media business. We love the digital network effect. Everything's networks, the network economy. You starting to see this power of information and value. You guys carved the knowledge graph. So I gotta, I gotta ask you, when you look at this kind of future where you have this, um, complexity and the network effect, um, how are you guys looking at that data access? Because if you don't have the data, you're not going to have that insight, right? 
So you need to have that, that network connection. Is that a limitation or for companies? Is that an, um, cause usually people aren't necessarily their blind spot is their data or their lack of their data. So having things network together is going to be more of the norm in the future. How do you guys see that playing out? Yeah, >>I think you're exactly right. And I think that as you look beyond where we are today, and a lot of times we focus today on the data that a company already has, what do I know? Right. What do I know about you? What, how do I interact with you? How have I interacted with you? I think that as we look at the future, we're going to talk more about data sharing, but leveraging publicly available information about being able to take these insights and leverage them, not just within the walls of my own organization, but being able to share them and, uh, work together with other organizations to bring up a better understanding of you as a person or as a consumer that we could all interact with. Yeah, you're absolutely right. You know, Metcons law still holds true that, you know, more network connections bring more value. I certainly see that growing in the future, probably more around, you know, more data sharing and more openness about leveraging publicly available. >>You know, it's interesting. You mentioned you came from a data warehouse background. I remember when I broken the businessmen 30 years ago, when I started getting computer science, you know, it was, it was, there was, there was pain having a product and an enabling platform. You guys seem to have this enabling platform where there's no one use case. I mean, you, you have an unlimited use case landscape. Um, you could do anything with what you guys have. It's not so much, I mean, there's, low-hanging fruit. So I got to ask you, if you have that, uh, enabling platform, you're creating value for customers. What are some of the areas you see developing, like now in terms of low-hanging fruit and where's the possibilities? How do you guys see that? I'm sure you've probably got a tsunami of activity around corner cases from media to every vertical we do. And that's, you know, >>The exciting part of this job. Uh, part of the exciting part of knowledge press in general is to see all the different ways that they are allowed to use. But we do see some use cases repeated over and over again. Uh, risk management is a very common one. How do I look at all the people and the assets with an organization, the interactions they have to look at hotspots for risk, uh, that I need to correct within my organization for the pre-commercial pharma, that has been a very, very hot area for us recently. How do we look at all the that's available with an organization that's publicly available in order to accelerate drug development in this post COVID world, that's become more and more relevant, uh, for organizations to be able to move forward faster and the kind of bio industry and my sciences. Um, that's a use case that we've seen repeated over and over again. And then this growing idea of the data fabric, the data fabric, looking at metadata within the organization to improve data integration processes, to really reduce the need for moving data without or around the organization as much. Those are the use cases we've seen repeated over and over again over the last >>Awesome Rob. My last question before we wrap up is for the solution architect that's out there that has, you know, got a real tall order. 
They have to put together a scalable organization, people process and technology around a data architecture. That's going to be part of, um, the next gen, the next gen next level activity. And they need headroom for IOT edge and industrial edge, uh, and all use cases. Um, what's your advice to them as they have to look out at and start thinking about architecture? >>Yeah, that's, it's a great question. Uh, I really think that it's important to keep your options open as the technology in the space continues to evolve, right? It's easy to get locked into a single vendor or a single mindset. Um, I've been an architect most of my career, and that's usually a lot of the pitfalls. Things like a knowledge graph are open and flexible. They adhere to standards, which then means you're not locked into a single vendor and you're allowed to leverage this type of technology to grow beyond originally envisioned. So thinking about how you can take advantage of these modern techniques to look at things and not just keep repeating what you've done in the past, the sins of the past have, uh, you know, a lot of times do reappear. So fighting against that as much as possible as gritty is my encouragement. >>Awesome, great insight. And I love this. I love this area. I know you guys got a great trend. You're riding on a very cool, very relevant final minute. Just take a quick minute to give a plug for the company. What's the business model. How do I deploy this? How do I get the software? How do you charge for it? If I'm going to buy this solution or engage with star DOE what do I do? Take me through that. Sure. >>Yeah. We, uh, we are like, uh, you've sat through this whole thing. We are enterprise knowledge graph platform company. So we really help you get started with your business, uh, uh, leveraging and using a knowledge graph fricking organization. We have the ability to deploy on prem. We have on the cloud, we're in the AWS marketplace today. So you can take a look at our software today, who generally are subscription-based based on the size of the install. And we are happy to talk to you any time, just drop by our website, reach out we'll we'll get doctors. >>Rob. Great. Thanks for coming. I really appreciate it. That gradients said, looking forward to seeing you in person, when we get back to real life, hopefully the vaccines are coming on. Thanks to, uh, companies like you guys providing awesome analytics and intelligence for these drug companies and pharma companies. Now you have a few of them in your, on your client roster. So congratulations, looking forward to following up great, great area. Cool and relevant data architecture is changing. Some of it's broken. Some it's being fixed started off as one of the hot startups scaling up beautifully in this new era of cloud computing meets applications and data. So I'm John. Forget the cube. This is a cube conversation from Palo Alto, California. Thanks for watching.

Published Date : Mar 3 2021


SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Rob | PERSON | 0.99+
Rob Harris | PERSON | 0.99+
March 2021 | DATE | 0.99+
John ferry | PERSON | 0.99+
AWS | ORGANIZATION | 0.99+
Maryland | LOCATION | 0.99+
five | QUANTITY | 0.99+
Palo Alto, California | LOCATION | 0.99+
eight years | QUANTITY | 0.99+
John | PERSON | 0.99+
Gartner | ORGANIZATION | 0.99+
three founders | QUANTITY | 0.99+
today | DATE | 0.99+
one | QUANTITY | 0.98+
single | QUANTITY | 0.98+
30 years ago | DATE | 0.98+
2007 | DATE | 0.98+
both | QUANTITY | 0.97+
first | QUANTITY | 0.97+
one place | QUANTITY | 0.95+
over 20 years ago | DATE | 0.93+
both approaches | QUANTITY | 0.92+
rod Harris | COMMERCIAL_ITEM | 0.92+
single silo | QUANTITY | 0.92+
many years ago | DATE | 0.91+
Stardog | ORGANIZATION | 0.9+
pandemic | EVENT | 0.9+
single vendor | QUANTITY | 0.9+
double | QUANTITY | 0.9+
single application | QUANTITY | 0.88+
agile | TITLE | 0.86+
one step | QUANTITY | 0.84+
ODBC | TITLE | 0.84+
single day | QUANTITY | 0.82+
star DOE | ORGANIZATION | 0.81+
star dog | ORGANIZATION | 0.79+
two full instances | QUANTITY | 0.79+
Neo four J | ORGANIZATION | 0.79+
single plain | QUANTITY | 0.76+
eight petabytes of data | QUANTITY | 0.75+
prem | ORGANIZATION | 0.74+
JDBC | TITLE | 0.7+
one big Lake | QUANTITY | 0.69+
thing | QUANTITY | 0.56+
COVID | EVENT | 0.55+
StarTalk | ORGANIZATION | 0.53+
Metcons | TITLE | 0.52+
things | QUANTITY | 0.49+
playbook | COMMERCIAL_ITEM | 0.35+

The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL


 

hello everybody and thank you for joining us today for the virtual verdict of BBC 2020 today's breakout session is entitled the shortest path to Vertica best practices for data warehouse migration ETL I'm Jeff Healey I'll leave verdict and marketing I'll be your host for this breakout session joining me today are Marco guesser and Mauricio lychee vertical product engineer is joining us from yume region but before we begin I encourage you to submit questions or comments or in the virtual session don't have to wait just type question in a comment in the question box below the slides that click Submit as always there will be a Q&A session the end of the presentation will answer as many questions were able to during that time any questions we don't address we'll do our best to answer them offline alternatively visit Vertica forums that formed at vertical comm to post your questions there after the session our engineering team is planning to join the forums to keep the conversation going also reminder that you can maximize your screen by clicking the double arrow button and lower right corner of the sides and yes this virtual session is being recorded be available to view on demand this week send you a notification as soon as it's ready now let's get started over to you mark marco andretti oh hello everybody this is Marco speaking a sales engineer from Amir said I'll just get going ah this is the agenda part one will be done by me part two will be done by Mauricio the agenda is as you can see big bang or piece by piece and the migration of the DTL migration of the physical data model migration of et I saw VTL + bi functionality what to do with store procedures what to do with any possible existing user defined functions and migration of the data doctor will be by Maurice it you want to talk about emeritus Rider yeah hello everybody my name is Mauricio Felicia and I'm a birth record pre-sales like Marco I'm going to talk about how to optimize that were always using some specific vertical techniques like table flattening live aggregated projections so let me start with be a quick overview of the data browser migration process we are going to talk about today and normally we often suggest to start migrating the current that allows the older disease with limited or minimal changes in the overall architecture and yeah clearly we will have to port the DDL or to redirect the data access tool and we will platform but we should minimizing the initial phase the amount of changes in order to go go live as soon as possible this is something that we also suggest in the second phase we can start optimizing Bill arouse and which again with no or minimal changes in the architecture as such and during this optimization phase we can create for example dog projections or for some specific query or optimize encoding or change some of the visual spools this is something that we normally do if and when needed and finally and again if and when needed we go through the architectural design for these operations using full vertical techniques in order to take advantage of all the features we have in vertical and this is normally an iterative approach so we go back to name some of the specific feature before moving back to the architecture and science we are going through this process in the next few slides ok instead in order to encourage everyone to keep using their common sense when migrating to a new database management system people are you often afraid of it it's just often useful to use the analogy of how smooth 
in your old home you might have developed solutions for your everyday life that make perfect sense there for example if your old cent burner dog can't walk anymore you might be using a fork lifter to heap in through your window in the old home well in the new home consider the elevator and don't complain that the window is too small to fit the dog through this is very much in the same way as Narita but starting to make the transition gentle again I love to remain in my analogy with the house move picture your new house as your new holiday home begin to install everything you miss and everything you like from your old home once you have everything you need in your new house you can shut down themselves the old one so move each by feet and go for quick wins to make your audience happy you do bigbang only if they are going to retire the platform you are sitting on where you're really on a sinking ship otherwise again identify quick wings implement published and quickly in Vertica reap the benefits enjoy the applause use the gained reputation for further funding and if you find that nobody's using the old platform anymore you can shut it down if you really have to migrate you can still go to really go to big battle in one go only if you absolutely have to otherwise migrate by subject area use the group all similar clear divisions right having said that ah you start off by migrating objects objects in the database that's one of the very first steps it consists of migrating verbs the places where you can put the other objects into that is owners locations which is usually schemers then what do you have that you extract tables news then you convert the object definition deploy them to Vertica and think that you shouldn't do it manually never type what you can generate ultimate whatever you can use it enrolls usually there is a system tables in the old database that contains all the roads you can export those to a file reformat them and then you have a create role and create user scripts that you can apply to Vertica if LDAP Active Directory was used for the authentication the old database vertical supports anything within the l dubs standard catalogued schemas should be relatively straightforward with maybe sometimes the difference Vertica does not restrict you by defining a schema as a collection of all objects owned by a user but it supports it emulates it for old times sake Vertica does not need the catalog or if you absolutely need the catalog from the old tools that you use it it usually said it is always set to the name of the database in case of vertical having had now the schemas the catalogs the users and roles in place move the take the definition language of Jesus thought if you are allowed to it's best to use a tool that translates to date types in the PTL generated you might see as a mention of old idea to listen by memory to by the way several times in this presentation we are very happy to have it it actually can export the old database table definition because they got it works with the odbc it gets what the old database ODBC driver translates to ODBC and then it has internal translation tables to several target schema to several target DBMS flavors the most important which is obviously vertical if they force you to use something else there are always tubes like sequel plots in Oracle the show table command in Tara data etc H each DBMS should have a set of tools to extract the object definitions to be deployed in the other instance of the same DBMS ah if I talk about youth views 
usually a very new definition also in the old database catalog one thing that you might you you use special a bit of special care synonyms is something that were to get emulated different ways depending on the specific needs I said I stop you on the view or table to be referred to or something that is really neat but other databases don't have the search path in particular that works that works very much like the path environment variable in Windows or Linux where you specify in a table an object name without the schema name and then it searched it first in the first entry of the search path then in a second then in third which makes synonym hugely completely unneeded when you generate uvl we remained in the analogy of moving house dust and clean your stuff before placing it in the new house if you see a table like the one here at the bottom this is usually corpse of a bad migration in the past already an ID is usually an integer and not an almost floating-point data type a first name hardly ever has 256 characters and that if it's called higher DT it's not necessarily needed to store the second when somebody was hired so take good care in using while you are moving dust off your stuff and use better data types the same applies especially could string how many bytes does a string container contains for eurozone's it's not for it's actually 12 euros in utf-8 in the way that Vertica encodes strings and ASCII characters one died but the Euro sign thinks three that means that you have to very often you have when you have a single byte character set up a source you have to pay attention oversize it first because otherwise it gets rejected or truncated and then you you will have to very carefully check what their best science is the best promising is the most promising approach is to initially dimension strings in multiples of very initial length and again ODP with the command you see there would be - I you 2 comma 4 will double the lengths of what otherwise will single byte character and multiply that for the length of characters that are wide characters in traditional databases and then load the representative sample of your cells data and profile using the tools that we personally use to find the actually longest datatype and then make them shorter notice you might be talking about the issues of having too long and too big data types on projection design are we live and die with our projects you might know remember the rules on how default projects has come to exist the way that we do initially would be just like for the profiling load a representative sample of the data collector representative set of already known queries from the Vertica database designer and you don't have to decide immediately you can always amend things and otherwise follow the laws of physics avoid moving data back and forth across nodes avoid heavy iOS if you can design your your projections initially by hand encoding matters you know that the database designer is a very tight fisted thing it would optimize to use as little space as possible you will have to think of the fact that if you compress very well you might end up using more time in reading it this is the testimony to run once using several encoding types and you see that they are l e is the wrong length encoded if sorted is not even visible while the others are considerably slower you can get those nights and look it in look at them in detail I will go in detail you now hear about it VI migrations move usually you can expect 80% of everything to work to be 
You might be thinking about the issue of having too long and too big data types in projection design; we live and die with our projections. You might remember the rules on how default projections come to exist. The way we do it initially would be: just like for the profiling, load a representative sample of the data, collect a representative set of already known queries, and run the Vertica Database Designer. And you don't have to decide immediately; you can always amend things later. Otherwise, follow the laws of physics: avoid moving data back and forth across nodes and avoid heavy I/O if you design your projections initially by hand. Encoding matters too. You know that the Database Designer is a very tight-fisted thing: it will optimize to use as little space as possible. You have to keep in mind that if you compress very well, you might end up spending more time reading the data back. This is the test of running the same query once per encoding type, and you see that RLE, run-length encoding on sorted data, is not even visible, while the others are considerably slower. You can get the slides and look at them in detail; I won't go into more detail here.

Now, BI migrations. Usually you can expect 80 percent of everything to be able to be lifted and shifted. You don't need most of the pre-aggregated tables, because we have live aggregate projections. Many BI tools have specialized query objects for the dimensions and the facts, and we have the possibility to use flattened tables, which are going to be talked about later; you might have to write those by hand. You will be able to switch off caching, because Vertica speeds up everything, and with live aggregate projections, if you have worked with MOLAP cubes before, you very probably won't miss them at all. ETL tools: what you will have to do is, if you load row by row into the old database, consider changing everything to very big transactions, and if you use insert statements with parameter markers, consider writing to named pipes and using Vertica's COPY command instead of single-row inserts.
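As a concrete illustration of that last point, here is a minimal sketch, assuming the open source vertica-python client discussed later in this document; the connection details and the sales_fact table are placeholders.

```python
# A minimal sketch: turn per-row parameterized inserts into one bulk load by
# streaming the whole batch through Vertica's COPY command with vertica-python.
import io
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

rows = [(1, "2020-03-30", 10), (2, "2020-03-30", 7), (3, "2020-03-31", 4)]

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Build one delimited stream for the whole batch...
    payload = "\n".join("|".join(str(v) for v in r) for r in rows)
    # ...and hand it to COPY in a single operation instead of many INSERTs.
    cur.copy(
        "COPY sales_fact (pid, sale_date, qty) FROM STDIN DELIMITER '|' ABORT ON ERROR",
        io.StringIO(payload))
    conn.commit()
```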
Custom functionality: you can see on this slide that Vertica has the biggest number of functions in the database, by far, compared to any other database; we compare them regularly. You might find that many of the functions you have written won't be needed on the new database, so look at the Vertica catalog instead of trying to migrate a function that you don't need. Stored procedures are very often used in the old database to overcome shortcomings that Vertica doesn't have. Very rarely will you actually have to write a procedure that involves a loop; in our experience it is really very, very rare. Usually you can just switch to standard scripting. And this is basically repeating what Mauricio said, so in the interest of time I will skip it.

Look at this one here: most of a data warehouse migration should be automatic. You can automate DDL migration using ODBC, which is crucial. Data profiling is not crucial, but game-changing. The encoding is the same thing: you can automate it using our Database Designer. The physical data model optimization in general is game-changing; you have the Database Designer, use it. For the provisioning, use the old platform's tools to generate the SQL. Having no objects without their owners is crucial. And as for functions and procedures, they are only crucial if they embody the company's intellectual property; otherwise you can almost always replace them with something else. That's it from me for now. Thank you, Marco.

Thank you, Marco. We will now continue our presentation by talking about some of the Vertica optimization techniques that we can implement in order to improve the general efficiency of the data warehouse. Let me start with a few simple messages. Well, the first one is that you are supposed to optimize only if and when it is needed. In most of the cases, just a lift and shift from the old data warehouse to Vertica will provide you the performance you were looking for, or even better, so in those cases it is probably not really needed to optimize anything. In case you want to optimize, or you need to optimize, then keep in mind some of the Vertica peculiarities: for example, implement deletes and updates in the Vertica way, use live aggregate projections in order to avoid, or better, to limit, the GROUP BY executions at query time, use flattened tables in order to avoid or limit joins, and then you can also implement some specific Vertica extensions, like for example time series analysis or machine learning, on top of your data. We will now start by reviewing the first of these bullets: optimize if and when needed.

Well, when you migrate from the old data warehouse to Vertica without any optimization and the performance level is already OK, then probably your migration job is done. But if this is not the case, one very easy optimization technique that you can use is to ask Vertica itself to optimize the physical data model, using the Vertica Database Designer. DBD, which is the Vertica Database Designer, has several interfaces; here I'm going to use what we call the DBD programmatic API, so basically SQL functions. With other databases you might need to hire experts to look at your data, your table definitions, creating indexes or whatever; in Vertica, all you need is to run something like these six simple SQL statements to get a very well optimized physical data model. You see that we start by creating a new design, then we add to the design the tables and the queries, the queries that we want to optimize. We set our target: in this case we are tuning the physical data model in order to maximize query performance, and this is why we are using the query optimization objective in our statement; other possible choices would be to tune in order to reduce storage, or a mix between tuning for storage and tuning for queries. And finally we ask Vertica to produce and deploy this optimized design. It is a matter of literally minutes: in a few minutes what you can get is a fully optimized physical data model. This is something very, very easy to implement.
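Here is a sketch of that six-statement sequence, driven through vertica-python. The DESIGNER_* function names follow the Database Designer programmatic API as I recall it from the documentation, so treat the exact argument lists as assumptions to verify on your Vertica version; the design name, table pattern and file paths are placeholders.

```python
# A sketch of the Database Designer programmatic API sequence described above.
# The DESIGNER_* signatures are recalled from documentation and should be
# verified for your Vertica version; names and paths are placeholders.
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

steps = [
    # 1. create a new design
    "SELECT DESIGNER_CREATE_DESIGN('dwh_design')",
    # 2. add the tables the designer should look at
    "SELECT DESIGNER_ADD_DESIGN_TABLES('dwh_design', 'public.*')",
    # 3. add a representative set of already known queries
    "SELECT DESIGNER_ADD_DESIGN_QUERIES('dwh_design', '/home/dbadmin/queries.sql', true)",
    # 4. tune for query performance ('LOAD' or 'BALANCED' are the alternatives)
    "SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('dwh_design', 'QUERY')",
    # 5. produce and deploy the optimized physical design
    "SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('dwh_design', "
    "'/home/dbadmin/dwh_design.sql', '/home/dbadmin/dwh_deploy.sql')",
    # 6. clean up the design workspace
    "SELECT DESIGNER_DROP_DESIGN('dwh_design')",
]

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for statement in steps:
        cur.execute(statement)
```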
Now keep in mind some of the Vertica peculiarities. Vertica is very well tuned for load and query operations, and it arranges data in ROS containers. A ROS container is a group of files, and we will never, ever change the content of these files. The fact that the ROS container files are never modified is one of the Vertica peculiarities, and this approach allows us to use minimal locks: we can have multiple load operations in parallel against the very same table. Assuming we don't have a primary or unique constraint on the target table, they can run in parallel because they will end up in two different ROS containers. A SELECT in READ COMMITTED requires no lock at all and can run concurrently with an INSERT...SELECT, because the SELECT will work on a snapshot of the catalog taken when the transaction starts; this is what we call snapshot isolation. And recovery, because we never change our ROS files, is very simple and robust. So we have a huge amount of advantages due to the fact that we never change the content of the ROS containers. But on the other side, deletes and updates require a little attention.

So, what about deletes? First, when you delete in Vertica you basically create a new object called a delete vector; it will appear a bit later, in ROS or in memory, and this vector will point to the data being deleted, so that when a query is executed, Vertica will just ignore the rows listed in the delete vectors. And it's not just about deletes: an update in Vertica consists of two operations, a delete and an insert, and a merge consists of either an insert or an update, which in turn is made of a delete and an insert. So basically, if we tune how the delete works, we will also have tuned the update and the merge. So what should we do in order to optimize deletes? Well, remember what we said: every time we delete, we actually create a new object, a delete vector. So avoid committing deletes and updates too often; that reduces the work for the mergeout and replay-delete activities that run afterwards. And be sure that all the interested projections contain the columns used in the delete predicate: this lets Vertica directly access the projection without having to go through the super projection in order to create the delete vector, and the delete will be much, much faster.

Finally, another very interesting optimization technique is to try to segregate the update and delete operations from the query workload, in order to reduce lock contention, and this can be done using partition operations. This is exactly what I want to talk about now. Here you have a typical data warehouse architecture: we have data arriving in a landing zone, where the data is loaded as it comes from the data sources; then we have a transformation layer writing into a staging area, which in turn feeds the partitioned blocks of data in the green data structures we have at the end. Those green data structures at the end are the ones used by the data access tools when they run their queries. Sometimes we might need to change old data, for example because we have late records, or maybe because we want to fix some errors that were originated in the source systems. So what we do in this case is: we copy the partition we want to change, or adjust, from the green area at the end back to the staging area; this is a very fast partition copy operation. Then we run our updates, our adjustment procedures, or whatever we need in order to fix the errors in the data, in the staging area, and at the very same time people continue to query the green data structures at the end, so we will never have contention between the two operations. When the update in the staging area is completed, all we have to do is run a swap partition between the tables, in order to swap the data that we just finished adjusting in the staging zone into the query area, the green one at the end. This swap partition is very fast, it is an atomic operation, and basically what happens is just that we exchange the pointers to the data. This is a very, very effective technique, and a lot of customers use it.
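Here is a sketch of that copy-adjust-swap cycle driven from vertica-python, assuming both tables exist with identical definitions and day-level partitioning; the table names, partition key range and the fix itself are placeholders.

```python
# A sketch of the copy-adjust-swap pattern described above, assuming the
# staging table has the same definition and partitioning as the target.
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # 1. Pull the partition we need to adjust into the staging table (fast).
    cur.execute("SELECT COPY_PARTITIONS_TO_TABLE("
                "'sales_fact', '2020-03-30', '2020-03-30', 'sales_staging')")
    # 2. Run the corrections against the staging copy; queries on sales_fact
    #    keep running untouched, so there is no lock contention.
    cur.execute("UPDATE sales_staging SET qty = 0 WHERE qty < 0")
    conn.commit()
    # 3. Atomically swap the corrected partition back into the query table;
    #    under the hood only pointers to the data are exchanged.
    cur.execute("SELECT SWAP_PARTITIONS_BETWEEN_TABLES("
                "'sales_staging', '2020-03-30', '2020-03-30', 'sales_fact')")
    conn.commit()
```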
So, why flattened tables and live aggregate projections? Well, basically we use flattened tables and live aggregate projections to minimize or avoid joins, which is what flattened tables are used for, and GROUP BYs, which is what live aggregate projections are used for. Now, compared to traditional data warehouses, Vertica can store, process, aggregate and join orders of magnitude more data, because it is a true columnar database; joins and GROUP BYs normally are not a problem at all, they run faster than in any traditional data warehouse. But there are still scenarios where the datasets are so big, and we are talking about petabytes of data, and growing so quickly, that we need something in order to boost GROUP BY and join performance. And this is why you can use live aggregate projections, to perform aggregations at loading time and limit the need for GROUP BYs at query time, and flattened tables, to combine information from different entities at loading time and, again, avoid running joins at query time.

OK, so, live aggregate projections. At this point in time we can define live aggregate projections using four built-in aggregate functions, which are SUM, MIN, MAX and COUNT. Let's see how this works. Suppose that you have a normal table, in this case a table unit_sold with three columns, pid, dtime and quantity, which has been segmented in a given way. On top of this base table, which we call the anchor table, we create a projection. You see that we create the projection using a SELECT that aggregates the data: we take the pid, we take the date portion of the time, and we take the sum of the quantity from the base table, grouping on the first two columns, so pid and the date portion of dtime. What happens in this case when we load data into the base table? All we have to do is load data into the base table. When we load data into the base table, we will of course fill the regular projections; assuming we are running with k-safety 1 we will have two projections, and we will load the data into those two projections with all the detail data we are loading into the table, so pid, dtime and quantity. But at the very same time, without having to do any particular operation and without having to run any ETL procedure, we also get, automatically, the live aggregate projection filled with the data pre-aggregated by pid, the date portion of dtime, and the sum of quantity in the column named total_quantity. You see, this is something that we get for free, without having to run any specific procedure, and this is very, very efficient. So the key concept is that the load operation, from a DML point of view, is executed against the base table: we do not explicitly aggregate data, we don't have any procedure doing the aggregation. The aggregation is automatic, and Vertica brings the data into the live aggregate projection every time we load into the base table.

You see the two SELECTs that we have on this slide, and those two SELECTs produce exactly the same result: running SELECT pid, date, SUM(quantity) against the base table, or running SELECT * from the live aggregate projection, results in exactly the same data. Now, this is of course very useful, but it is much more useful, and we can observe this if we run an EXPLAIN, that if we run the SELECT against the base table, asking for this grouped data, what happens behind the scenes is that Vertica knows there is a live aggregate projection containing data that has already been aggregated during the loading phase, and it rewrites your query to use the live aggregate projection. This happens automatically. You see, this is a query that ran a GROUP BY against unit_sold, and Vertica decided to rewrite it as something executed against the live aggregate projection, because this saves a huge amount of time and effort during the ETL cycle. And it is not just limited to the information you want to aggregate: for example, another query like a COUNT DISTINCT, which you might know can be expensive, will also take advantage of the live aggregate projection, and again, this is something that happens automatically; you don't have to do anything to get it.
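Here is a sketch of that example end to end, with names mirroring the talk; the connection details are placeholders.

```python
# A sketch of the example above: a unit_sold anchor table plus a live
# aggregate projection that keeps SUM(qty) per product and day, maintained by
# Vertica at load time.
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS unit_sold (
            pid   INT,
            dtime TIMESTAMP,
            qty   INT
        ) SEGMENTED BY HASH(pid) ALL NODES
    """)
    # The GROUP BY in the projection definition is what makes this a live
    # aggregate projection: Vertica aggregates while it loads.
    cur.execute("""
        CREATE PROJECTION unit_sold_lap AS
            SELECT pid, dtime::DATE AS sale_date, SUM(qty) AS total_qty
            FROM unit_sold
            GROUP BY pid, dtime::DATE
    """)
    cur.execute("INSERT INTO unit_sold VALUES (1, '2020-03-30 10:00:00', 5)")
    cur.execute("INSERT INTO unit_sold VALUES (1, '2020-03-30 18:30:00', 3)")
    conn.commit()
    # This GROUP BY against the base table returns the same result as reading
    # the projection; EXPLAIN would show the optimizer rewriting it to use it.
    cur.execute("SELECT pid, dtime::DATE, SUM(qty) FROM unit_sold GROUP BY 1, 2")
    print(cur.fetchall())
```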
One thing that we have to keep very, very clear in mind: what we store in the live aggregate projection is basically partially aggregated data. So in this example we have two inserts: you see that the first insert is inserting four rows and the second insert is inserting five rows. Well, for each of these inserts we will have a partial aggregation. Vertica will never know, after the first insert, that there will be a second one, so it calculates the aggregation of the data every time you run an insert. This is a key concept, and it also means that you can maximize the effectiveness of this technique by inserting large chunks of data. If you insert data row by row, this live aggregate projection technique is not very useful, because for every row that you insert you will have one aggregation, so basically the live aggregate projection will end up containing the same number of rows that you have in the base table. But if you insert a large chunk of data every time, the number of aggregated rows that you have in the live aggregate structure is much smaller than in the base data. This is a key concept. You can see how this works by counting the number of rows that you have in the live aggregate projection: if you run the SELECT COUNT(*) from the live aggregate projection, the query on the left side, you get four rows, but if you EXPLAIN this query you see that it was reading six rows. This is because each of those two inserts effectively put a few rows, three rows each, into the live aggregate projection. So this is the key concept: live aggregate projections keep partially aggregated data, and the final aggregation always happens at runtime.

Another object which is very similar to the live aggregate projection is what we call a top-K projection. We actually do not aggregate anything in a top-K projection; we just keep the last rows, or limit the amount of rows that we collect, using the LIMIT ... OVER (PARTITION BY ... ORDER BY ...) clause. Again, in this case we create, on top of the base table, two top-K projections: one to keep the last quantity that has been sold, and the other one to keep the max quantity. In both cases it is just a matter of ordering the data, in the first case using the dtime column, in the second case using quantity, and in both cases we fill the projection with just the last row. And again, this is something that happens when we insert data into the base table, and it happens automatically. If we now run, after the insert, our SELECT against either the max quantity or the last quantity, we get just the very last values; you see that we have much fewer rows in the top-K projections.

We said at the beginning that we can use four built-in functions; you might remember, MIN, MAX, SUM and COUNT. What if I want to create my own specific aggregation on top of the live aggregate structure? This comes up because our customers have very specific needs in terms of live aggregate projections. Well, in this case you can code your own live aggregate projection logic with user-defined functions: you can create a user-defined transform function to implement any sort of complex aggregation while loading data. Basically, after you have implemented this UDTF, you can deploy it using a pre-pass approach, which means the data is aggregated at loading time, during the data ingestion, or using the batch approach, which means the aggregation runs afterwards on the data that has been loaded. Things to remember on live aggregate projections: they are limited to the built-in functions, again SUM, MAX, MIN and COUNT, but you can call your own UDTFs, so you can do whatever you want. They can reference only one table. And for Vertica versions before 9.3 it was impossible to update or delete on the anchor table; this limit has been removed in 9.3, so you can now update and delete data from the anchor table. A live aggregate projection follows the segmentation of the GROUP BY expression, and in some cases the optimizer can decide to pick the live aggregate projection or not, depending on whether the aggregation is consistent or not. And remember that if we insert and commit every single row to the anchor table, we end up with a live aggregate projection that contains exactly the same number of rows as the base table, and in that case using the live aggregate projection or the base table would make no difference.
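Here is a sketch of the top-K variant discussed above, reusing the unit_sold table from the previous sketch; connection details are placeholders.

```python
# A sketch of the top-K projection idea above: keep only the most recent sale
# per product, maintained at load time with a LIMIT ... OVER clause instead of
# an aggregate function.
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # One row per pid: the latest sale, ordered by the time column.
    cur.execute("""
        CREATE PROJECTION unit_sold_last (pid, last_time, last_qty) AS
            SELECT pid, dtime, qty
            FROM unit_sold
            LIMIT 1 OVER (PARTITION BY pid ORDER BY dtime DESC)
    """)
    # Reading the projection returns far fewer rows than the base table holds.
    cur.execute("SELECT * FROM unit_sold_last")
    print(cur.fetchall())
```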
OK, so this is one of the two fantastic techniques that we can implement in Vertica: the live aggregate projection, which is basically there to avoid or limit GROUP BYs. The other one, which we are going to talk about now, is the flattened table, used in order to avoid the need for joins. Remember that Vertica is very fast at running joins, but when we scale up to petabytes of data we need a boost, and this is what we have in order to fix this problem regardless of the amount of data we are dealing with. So, what about flattened tables? Let me start with normalized schemas. Everybody knows what a normalized schema is, and the main purpose of a normalized schema is to reduce data redundancy. The fact that we reduce data redundancy is a good thing, because we obtain fast writes: we only have to write small chunks of data into the right tables. The problem with these normalized schemas is that when you run your queries, you have to put together the information that arrives from the different tables, and that requires running joins. Again, Vertica is normally very good at running joins, but sometimes the amount of data makes them not easy to deal with, and joins are sometimes not easy to tune. What happens in a normal, let's say traditional, data warehouse is that we denormalize the schemas, normally either manually or using an ETL tool. So basically we have, on the left side of this slide, the normalized schemas, where we get very fast writes, and on the other side the wide tables, where we have already run all the joins and pre-aggregations in order to prepare the data for the queries. So we have fast writes on the left and fast reads on the right side of this slide. The problem lies in the middle, because we have pushed all the complexity into the middle, into the ETL that has to transform the normalized schema into the wide table. The way we normally implement this, either manually using procedures that we code or using an ETL tool, is that we have to code an ETL layer that runs the INSERT...SELECT that reads from the normalized schema and writes into the wide table at the end, the one that is used by the data access tools to run the queries. This approach is costly, because of course someone has to code the ETL, and it is slow, because someone has to execute those batches, normally overnight after loading the data, and maybe someone has to check the following morning that everything went fine with the batch. It is resource intensive, and it is also people intensive, because of the people who have to code it and check the results. It is error prone, because it can fail. And it introduces latency, because there is a gap on the time axis between the time t0, when you load the data into the normalized schema, and the time t1, when the data is finally ready to be queried. So what we do in Vertica to facilitate this process is to create flattened tables. With the flattened table, first, you avoid data redundancy, because you don't need to maintain both the normalized schema on the left side and a separate wide table. Second, it is fully automatic: you don't have to do anything, you just insert the data into the wide table, and the ETL that you would have coded is transformed into an INSERT...SELECT by Vertica automatically. You don't have to do anything. It's robust, and the latency is zero: as soon as you load the data into the wide table, you get all the joins executed for you.
So let's have a look at how it works. In this case we have the table we are going to flatten, and basically we have to focus on two different clauses. The first one: you see that there is one column here, dimension_value_1, which can be defined with DEFAULT and then a SELECT, or with SET USING. The difference between DEFAULT and SET USING is when the data is populated: if we use DEFAULT, the data is populated as soon as we load the data into the base table; if we use SET USING, we have to run a refresh. But everything is there; I mean, you don't need an ETL, you don't need to code any transformation, because everything is in the table definition itself, and it's for free, and of course with DEFAULT the latency is zero: as soon as you load the other columns, you have the dimension value as well.

Let's see an example. Suppose we have a dimension table, customer_dimension, on the left side, and a fact table on the right. You see that the fact table uses columns like o_name or o_city, which are basically the result of a SELECT on top of the customer dimension. This is where the join is executed: as soon as we load data into the fact table, directly into the fact table, without of course loading the data that comes from the dimension, all the data from the dimension is populated automatically. So let's have an example here: suppose that we are running this INSERT. As you can see, we are running the INSERT directly into the fact table, and we are loading o_id, customer_id and total. We are not loading name or city; those name and city values will be automatically populated by Vertica for you, because of the definition of the flattened table. You see how it behaves: this is all you need in order to have your wide table, your flattened table, built for you, and it means that at runtime you won't need any join between the fact table and the customer dimension that we used in order to calculate name and city, because the data is already there. This was using DEFAULT.

The other option is using SET USING. The concept is absolutely the same: you see that in this case, on the right side, we have basically replaced the o_name DEFAULT with o_name SET USING, and the same is true for city. The concept, as I said, is the same, but in this case, with SET USING, we have to refresh: you see that we have to run this SELECT REFRESH_COLUMNS with the name of the table; in this case all columns will be refreshed, or you can specify only certain columns, and this brings in the values for name and city, reading from the customer dimension. So this technique is extremely useful. The difference between DEFAULT and SET USING, just to summarize the most important part: you just have to remember that DEFAULT populates your target when you load, SET USING when you refresh, and in some cases you might need to use them both. So in some cases you might want to use both DEFAULT and SET USING: in this example here you see that we define o_name using both DEFAULT and SET USING, and this means that the data is populated either when we load the data into the base table or when we run the refresh.
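Here is a sketch of that pattern, with one column on DEFAULT and one on SET USING; table and column names are placeholders in the spirit of the example in the talk, and the connection details are placeholders too.

```python
# A sketch of the flattened-table pattern above: a fact table whose dimension
# attributes are filled in by Vertica itself, one column at load time
# (DEFAULT) and one at refresh time (SET USING).
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS cust_dim "
                "(cust_id INT, name VARCHAR(100), city VARCHAR(100))")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders_fact (
            o_id    INT,
            cust_id INT,
            total   NUMERIC(10, 2),
            -- populated automatically when the row is loaded
            o_name  VARCHAR(100) DEFAULT (
                SELECT name FROM cust_dim WHERE cust_dim.cust_id = orders_fact.cust_id),
            -- populated when REFRESH_COLUMNS is run
            o_city  VARCHAR(100) SET USING (
                SELECT city FROM cust_dim WHERE cust_dim.cust_id = orders_fact.cust_id)
        )
    """)
    cur.execute("INSERT INTO cust_dim VALUES (42, 'ACME Corp', 'Milan')")
    # Only the base columns are loaded; o_name is derived on the way in.
    cur.execute("INSERT INTO orders_fact (o_id, cust_id, total) VALUES (1, 42, 99.90)")
    conn.commit()
    # Bring the SET USING column up to date, for one column or for all of them.
    cur.execute("SELECT REFRESH_COLUMNS('orders_fact', 'o_city', 'REBUILD')")
    conn.commit()
```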
This is a summary of the techniques that we can implement in Vertica in order to make our data warehouses even more efficient. And, well, that is basically the end of our presentation. Thank you for listening, and now we are ready for the Q&A session.

Published Date : Mar 30 2020


Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives


 

>> Sue: Hello everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives. My name is Sue LeClaire, Director of Marketing at Vertica and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer them offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you. >> Tom: Hello everyone and thanks for joining us today for this talk. My name is Tom Wall and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third party integrations that enable the software ecosystem that surrounds Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how those can be really effective for you and make things easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help to grow the developer community and enable lots of exciting new use cases. So, every developer out there has probably had to deal with a problem like this. You have some business requirements, to maybe build some new Vertica-powered application. Maybe you have to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool, or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there's a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? Well, you have to get creative, using tools like PyODBC, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is a C library and most programming languages know how to call C code, somehow.
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. So that's enough to get the job done, but native integrations are usually a lot smoother and easier. So rather than, for example, in Python trying to fight with PyODBC to configure things, get Unicode working, and compile all the different pieces the right way to make it all work smoothly, it would be much better if you could just PIP install a library and get to work. And with Vertica-Python, a new Python client library, you can actually do that. So that story, I assume, probably sounds pretty familiar to you. Sounds probably familiar to a lot of the audience here because we're all using Vertica. And our challenge, as Big Data practitioners, is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs. Luckily for us, Vertica has been designed to be easy to build with and extended in this fashion. Databases as a whole have had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about it. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain. So implementing that business logic, solving that problem, without having to worry about all of these intense, sometimes tedious, details about what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what the answer is that you want. You don't tell Vertica how to get it. Vertica will figure out the right way to do it for you so that you don't have to worry about it. So this SQL abstraction is very nice because it's a well defined boundary where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems. This goes beyond, though, what's accessible through SQL to Vertica. We've got well defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things like write your own SQL functions, or extend the database software with UDXs, you can do so. If you have a custom data format that might be a proprietary format, or some source system that Vertica doesn't natively support, we have extension points that allow you to use those. They make it very easy to do massive, parallel data movement, loading into Vertica but also exporting from Vertica to send data to other systems. And with these new features in time, we also could do the same kinds of things with Machine Learning models, importing and exporting to tools like TensorFlow.
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties that solve all the different problems that are common in this big data processing world. Whether it's open source streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also BI tools and visualizers and things like that to view and use the data that you keep in your database on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want it to solve the problems that you need to solve. So Vertica has always employed open standards, and integrated with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves. And we're using these new libraries to power some new integrations with some third party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts in exciting ways to solve new problems. And the code for all these things is available now on our GitHub page. And so you can use it however you like, and even help us make it better too. So the first such project that we have is called Vertica-Python. Vertica-Python began at our customer, Uber. And then in late 2018, we collaborated with them and we took it over and made Vertica-Python the first official open source client for Vertica. You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years and it's a very common language to solve lots of different problems and use cases in the Big Data space, from things like DevOps administration and Data Science or Machine Learning, or just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 End Of Life that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2. And also to provide a nice migration path for all of you, our users, that might be worried about the same problems with their own Python code. So Vertica-Python is used already for lots of different tools, including Vertica's admintools, now starting with 9.3.1. It was also used by DataDog to build a Vertica-DataDog integration that allows you to monitor your Vertica infrastructure within DataDog. So here's a little example of how you might use the Python Client to do some work. So here we open a connection, we run a query to find out what node we've connected to, and then we do a little data load by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python Database Client before. So we implement the DB API 2.0 standard and it feels like a Python package. So that includes things like, it's part of the centralized package manager, so you can just PIP install this right now and go start using it.
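The slide itself is not reproduced here, so this is a small sketch in the same spirit as the example Tom describes: open a connection, check which node we landed on, then load a little data with COPY. The host, credentials and the demo table are placeholders.

```python
# A small sketch of basic vertica-python usage: connect, inspect the session,
# then do a little data load through COPY.
import vertica_python

conn_info = {
    "host": "vertica-host",
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "dwh",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Which initiator node did we connect to?
    cur.execute("SELECT node_name FROM current_session")
    print(cur.fetchone()[0])

    # A little data load through the COPY statement.
    cur.execute("CREATE TABLE IF NOT EXISTS demo (id INT, label VARCHAR(32))")
    cur.copy("COPY demo (id, label) FROM STDIN DELIMITER ','", "1,alpha\n2,beta\n")
    conn.commit()

    cur.execute("SELECT COUNT(*) FROM demo")
    print(cur.fetchone()[0])
```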
We also have our client for Golang. So this is called vertica-sql-go. And this is a very similar story, just in a different context, for a different programming language. So vertica-sql-go began as a collaboration with the Micro Focus SecOps Group who builds Micro Focus's security products, some of which use Vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. So most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for Vertica data. And Go is another programming language rising in popularity, 'cause it offers an interesting balance of different programming design trade-offs. So it's got good performance, good concurrency and memory safety. And we liked all those things and we're using it to power some internal monitoring stuff of our own. And here's an example of the code you can write with this client. So this is Go code that does a similar thing. It opens a connection, it runs a little test query, and then it iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it the way you usually package things with Go, by running that command there to acquire this package. And it's important to note here for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourselves and you can collaborate directly with our engineering team and the other Vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality compared to the core Vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases with 40 customer reported issues filed on GitHub. That was done over 78 different pull requests and with lots of community engagement as we did so. So lots of people are using this already, as our GitHub badge shows, with about 5000 downloads a day of people using it in their software. And again, we want to make this easy, not just to use but also to contribute and understand and collaborate with us. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this and we're really excited about where it can go in the future. 'Cause this offers some exciting opportunities for us to collaborate with you more directly than we have ever before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge and implementation details and various best practices. And so maybe you think, "Well, I don't use Python, "I don't use Go, so maybe it doesn't matter to me." But I would argue it really does matter.
Because even if you don't use these tools and languages, there's lots of amazing Vertica developers out there who do. And these clients do act as low level building blocks for all kinds of different interesting tools, both in these Python and Go worlds, but also well beyond that. Because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these to understand exactly how that's the case and what you can do with these things. So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. So these database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs. So what does the programmer need to do to actually process those SQL queries? So these interfaces are specific to a particular language or a technology stack. But the use cases and the architectures and design patterns are largely the same between different languages. They all have a need to do some networking and connect and authenticate and create a session. They all need to be able to run queries and load some data and deal with problems and errors. And then they also have a lot of metadata and Type Mapping, because you want to use these clients the way you use those programming languages. Which might be different than the way that Vertica's data types and Vertica's semantics work. So some of these client interfaces are truly standards. And they are robust enough, in terms of what they design and call for, to support a truly pluggable driver model. Where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. So most of these interfaces aren't as robust as JDBC or ODBC, but that's okay. Because as good as a standard is, every database is unique for a reason. And so you can't really expose all of those unique properties of a database through these standard interfaces. So Vertica's unique in that it can scale to the petabytes and beyond. And you can run it anywhere in any environment, whether it's on-prem or on clouds. So surely there's something about Vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need and common patterns that arise to solve these problems in similar ways. When there isn't enough of a standard to define those common semantics that different databases might have, what you often see is tools will invent plugin layers or glue code to compensate, by defining application wide standards to cover some of these same semantics. Later on, we'll get into some of those details and show off what exactly that means. So if you connect to a Vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls in some client library or tool.
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask Vertica to do the heavy lifting required for that particular API call. And so these APIs usually do the same kinds of things, although some of the details might differ between these different interfaces. But you do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing. Here's an example from vertica-python, which just goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There's actually a lot of moving parts in what happens during a connection. So let's walk through some of that and see what actually goes on. I might have my API call like this where I say Connect and I give it a DNS name, which is my entire cluster. And I give you my connection details, my username and password. And I tell the Python Client to get me a session, give me a connection so I can start doing some work. Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by parsing the connection string. And Vertica being a distributed system, we want to provide high availability, so we might need to do some DNS look-ups to resolve that DNS name, which might be an entire cluster and not just a single machine. So that you don't have to change your connection string every time you add or remove nodes to the database. So we do some high availability and DNS lookup stuff. And then once we connect, we might do Load Balancing too, to balance the connections across the different initiator nodes in the cluster, or in a sub cluster, as needed. Once we land on the node we want to be at, we might do some TLS to secure our connections. And Vertica supports the industry standard TLS protocols, so this looks pretty familiar for everyone who's used TLS anywhere before. So you're going to do a certificate exchange and the client might send the server a certificate too, and then you're going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection, and secured it, then you can start actually beginning to request a session within Vertica. So you're going to send over your user information like, "Here's my username, "here's the database I want to connect to." You might send some information about your application, like a session label, so that you can differentiate on the database with monitoring queries what the different connections are and what their purpose is. And then you might also send over some session settings to do things like auto commit, to change the state of your session for the duration of this connection. So that you don't have to remember to do that with every query that you have. Once you've asked Vertica for a session, before Vertica will give you one, it has to authenticate you. And Vertica has lots of different authentication mechanisms. So there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are, where you're coming from on the network. And then you'll do an auth-specific exchange depending on what the auth mechanism calls for until you are authenticated.
Finally, Vertica trusts you and lets you in, so you're going to establish a session in Vertica, and you might do some note keeping on the client side just to know what happened. So you might log some information, you might record what the version of the database is, you might do some protocol feature negotiation. So if you connect to a version of the database that doesn't support all these protocols, you might decide to turn some functionality off and that sort of thing. But finally, after all that, you can return from this API call and then your connection is good to go. So that connection is just one example of many different APIs. And we're excited here because with vertica-python we're really opening up the Vertica client wire protocol for the first time. And so if you're a low level Vertica developer and you might have used Postgres before, you might know that some of Vertica's client protocol is derived from Postgres. But they do differ in many significant ways. And this is the first time we've ever revealed those details about how it works and why. So not all Postgres protocol features work with Vertica, because Vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over. Whereas Vertica doesn't really have very wide data values; you have VARCHARs, you have LONG VARCHARs, but that's about as wide as you can get. Similarly, the Vertica protocol supports lots of features not present in Postgres. So Load Balancing, for example, which we just went through an example of; Postgres is a single node system, it doesn't really make sense for Postgres to have Load Balancing. But Load Balancing is really important for Vertica because it is a distributed system. Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there but we'll get there soon. So this is really cool, 'cause not only do you have now a Python Client implementation, and you have a Go client implementation of this, but you can use this protocol reference to do lots of other things, too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that Vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so if you need to. But beyond clients, it's also used for other things. So you might use it for mocking and testing things. So rather than connecting to a real Vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. So Uber, for example, this blog post linked here tells a great story of how they route different queries to different Vertica clusters by intercepting these protocol messages, parsing the queries in them and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
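Tying the handshake walkthrough above back to code, here is a sketch of the vertica-python connection options that drive those steps: load balancing, backup nodes for high availability, TLS, a session label, and a session setting like autocommit. I believe these option names are the ones the client supports, but treat them as assumptions to verify; hosts, credentials and the CA file path are placeholders.

```python
# A sketch of the connection options behind the handshake steps described
# above. Option names are believed to match vertica-python; verify against
# the client's documentation. Hosts, credentials and paths are placeholders.
import ssl
import vertica_python

tls_context = ssl.create_default_context(cafile="/path/to/server_ca.pem")

conn_info = {
    "host": "vertica-node1",
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "dwh",
    # Ask the server to balance this connection across initiator nodes.
    "connection_load_balance": True,
    # Fall back to other nodes if the first host is down.
    "backup_server_node": ["vertica-node2", "vertica-node3"],
    # Secure the connection with TLS and verify the server certificate.
    "ssl": tls_context,
    # Show up with a recognizable label in monitoring queries.
    "session_label": "nightly-report",
    # A session setting applied for the life of the connection.
    "autocommit": True,
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT node_name, client_label FROM current_session")
    print(cur.fetchone())
```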
And so we're very interested in hearing your ideas and requests and we're happy to offer advice and collaborate on building some of these things together. So let's take a look now at some of the things we've already built that do these things. So here's a picture of Vertica's Grafana connector, with some data powered from an example that we have in this blog link here. So this has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka, which then gets loaded into Vertica. And then finally, it gets visualized nicely here with Grafana. And Grafana's visualizations make it really easy to analyze the data with your eyes and see when something happens. So in these highlighted sections here, you notice a drop in some of the activity; that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself. So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to do correctly. You've got time zones, daylight savings, leap seconds, negative infinity timestamps, please don't ever use those. As if that wasn't hard enough just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries from Vertica to Grafana's back end architecture, which is implemented in Go, and its front end, which is implemented with JavaScript? Well, you read this from the bottom up in terms of the processing. First, you select the timestamp, and Vertica's timestamp has to be converted to a Go time object. And we have to reconcile the differences that there might be as we translate it. So Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. So that's not too big of a deal when you're querying data, because you just see some extra zeros, not fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's in the Go process, it has to be converted further to render in the JavaScript UI. So there, the Go time object has to be converted to a JavaScript AngularJS Date object. And there too, we have to reconcile those differences. So a lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date into a more human readable format, like we've done in this example here. Here's another picture. This is another picture of some time series data, and this one shows you can actually write your own queries with Grafana to provide answers. So if you look closely here you can see there's actually some functions that might not look too familiar to you if you know Vertica's functions. Vertica doesn't have a dollar underscore underscore time function or a time filter function. So what's actually happening there? How does this actually provide an answer if it's not really real Vertica syntax? Well, it's not sufficient to just know how to manipulate data, it's also really important that you know how to operate with metadata. So, information about how the data works in the data source, Vertica in this case.
So Grafana needs to know how time works in detail for each data source, beyond doing that basic I/O that we just saw in the previous example. So it needs to know, how do you connect to the data source to get some time data? How do you know what time data types and functions there are and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you know which tables have time columns that might be worth rendering in this kind of UI? So Go's database standard doesn't actually really offer many metadata interfaces. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer, whereby every data source can implement hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points with JavaScript in the front end plugins and also with Go in the back end plugins to provide Vertica connectivity inside Grafana. So the way this works is that the plugin framework defines those standardizing functions like time and time filter, and it's our plugin that's going to rewrite them in terms of Vertica syntax. So in this example, time gets rewritten to a Vertica cast, and time filter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica.
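The plugin does this rewriting in its Go and JavaScript layers; purely to illustrate the idea, here is a toy Python sketch of expanding Grafana-style $__time and $__timeFilter macros into Vertica SQL. It is not the plugin's actual code, and the regexes are deliberately simplistic.

```python
# A toy sketch of the macro rewriting idea described above: Grafana-style
# placeholders in the query text get expanded into real Vertica SQL before
# execution. Not the plugin's actual implementation.
import re

def expand_macros(sql: str, t_from: str, t_to: str) -> str:
    """Rewrite Grafana-style time macros into Vertica expressions."""
    # $__time(col)       -> a timestamp cast of the chosen column
    sql = re.sub(r"\$__time\((\w+)\)", r"\1::TIMESTAMP AS time", sql)
    # $__timeFilter(col) -> a BETWEEN predicate over the dashboard's time range
    sql = re.sub(r"\$__timeFilter\((\w+)\)",
                 rf"\1 BETWEEN '{t_from}' AND '{t_to}'", sql)
    return sql

query = ("SELECT $__time(reading_time), AVG(value) FROM readings "
         "WHERE $__timeFilter(reading_time) GROUP BY 1")
print(expand_macros(query, "2020-03-01 00:00:00", "2020-03-02 00:00:00"))
```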
So instead, we have a project called vertica-ml-python, and this serves as a reference architecture for how you can do scalable machine learning with Vertica. This project establishes a familiar machine learning workflow that scales with Vertica. It feels similar to a scikit-learn project, except all the processing, aggregation, and heavy lifting happens in Vertica. That makes for a much more lightweight, scalable approach than you might otherwise be used to. So with vertica-ml-python, you can probably use this yourself, but you can also see how it works. If it doesn't meet all your needs, you can still read the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. This is an older GitHub project, we've actually had it for a couple of years, but it is really useful and important, so I wanted to plug it here. Our User Defined eXtensions framework, or UDXs, allows you to extend the operators that Vertica executes when it does a database load or a database query. With UDXs, you can write your own domain logic in C++, Java, Python, or R, and you can call it within the context of a SQL query. Vertica brings your logic to the data, and makes it fast, scalable, fault tolerant, and correct for you, so you don't have to worry about all those hard problems. Our UDX examples demonstrate how you can use our SDK to solve interesting problems, and some of these examples are complete, usable packages or libraries. For example, we have a curl source that allows you to extract data from any curlable endpoint and load it into Vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a Vertica query, and all kinds of parsers and string processors and things like that. We also have more exciting and interesting things that you might not really think of Vertica being able to do, like a heat map generator, which takes some X/Y coordinates and renders them on top of an image to show you the hotspots in it. The image on the right was actually generated from one of our intern gaming sessions a few years back. All these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or that are unique to your business and your needs. Another exciting benefit is with testing. The test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone that's ever built a Vertica integration or application probably has a need to write some integration tests, and that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic, and easy to use. We've automated the download, installation, and deployment of a Vertica Community Edition, and with a single click, you can run through the tests locally and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we run through all of them very quickly with ease, thanks to all this automation.
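For a flavor of what a Python UDx looks like, here's a sketch of a scalar function loosely modeled on the add2ints example in the UDX documentation; treat the exact class and method names, and the SQL registration statements in the trailing comments, as things to double-check against the SDK docs for your Vertica version.

```python
import vertica_sdk

class add2ints(vertica_sdk.ScalarFunction):
    """Adds two integers, row by row, inside the Vertica engine."""
    def processBlock(self, server_interface, arg_reader, res_writer):
        while True:
            res_writer.setInt(arg_reader.getInt(0) + arg_reader.getInt(1))
            res_writer.next()
            if not arg_reader.next():
                break

class add2ints_factory(vertica_sdk.ScalarFunctionFactory):
    def createScalarFunction(self, srv_interface):
        return add2ints()
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addInt()
        arg_types.addInt()
        return_type.addInt()
    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addInt()

# Once the library is created in Vertica, the function is callable from SQL, e.g.:
#   CREATE LIBRARY pylib AS '/home/dbadmin/add2ints.py' LANGUAGE 'Python';
#   CREATE FUNCTION add2ints AS LANGUAGE 'Python' NAME 'add2ints_factory' LIBRARY pylib;
#   SELECT add2ints(1, 2);
```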
So today, you can see how we do it in vertica-python; in the future, we might want to spin that out into its own stand-alone testbed starter project, so that if you're starting any new Vertica integration, this might be a good starting point for you to get going quickly. That brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. In terms of the client stuff, for Python we are marching towards our 1.0 release, which is when we aim to be protocol complete, supporting all of Vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is a new feature in Vertica 10 (see the short client-side loading sketch below). We have some cursor enhancements to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it, so send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced authentication methods, and other things. But there too, we want to work with you and focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's this question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or robust Vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM or a vendor that resells Vertica packaged as a black-box piece of a larger solution, you might have to manage the whole operational lifecycle of Vertica. There are even fewer standards for doing all these different things compared to the SQL clients. So we started with the SQL clients because that's a well-established pattern and there's lots of downstream work that it can enable. But there's also clearly a need for lots of other open source protocols, architectures, and examples to show you how to do these things where real standards don't exist yet. We talked a little bit about how you could do UDXs or testing or machine learning, but there are all sorts of other use cases too. That's why we're excited to announce here Awesome Vertica, which is a new collection of open source resources available on our GitHub page. If you haven't heard of the awesome manifesto before, I highly recommend you check out the GitHub page on the right. We're not unique here; there are lots of awesome lists for all kinds of different tools and systems out there, and it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this list is itself an open source project, sort of an open source wiki, and you can contribute to it by submitting a PR yourself. We've seeded it with some of our favorite tools and projects, but there's plenty more out there and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
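For the COPY support mentioned above, here's roughly what bulk loading looks like today from vertica-python; this is a sketch, the table and file are placeholders, and the cursor.copy() call shown is the client's existing COPY-from-stdin path rather than the COPY LOCAL protocol work that's still in flight.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS readings "
                "(at TIMESTAMP, sensor INT, value FLOAT)")

    # Stream a local CSV into Vertica through the client connection.
    with open("readings.csv", "rb") as f:
        cur.copy("COPY readings FROM STDIN DELIMITER ',' ABORT ON ERROR", f)

    cur.execute("SELECT COUNT(*) FROM readings")
    print(cur.fetchone()[0])
    conn.commit()
```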
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.

Published Date : Mar 30 2020


Model Management and Data Preparation


 

>> Sue: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica, Data Preparation and Model Management. My name is Sue LeClaire, Director of Managing at Vertica and I'll be your host for this webinar. Joining me is Waqas Dhillon. He's part of the Vertica Product Management Team at Vertica. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternately, you can visit Vertica Forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, and yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Waqas, over to you. >> Waqas: Thank you, Sue. Hi, everyone. My name is Waqas Dhillon and I'm a Product Manager here at Vertica. So today, we're going to go through data preparation and model management in Vertica, and the session would essentially be starting with some introduction and going through some of the machine learning configurations and you're doing machine learning at scale. After that, we have two media sections here. The first one is on data preparation, and so we'd go through data preparation is, what are the Vertica functions for data exploration and data preparation, and then share an example with you. Similarly, in the second part of this talk we'll go through different export models using PMML and how that works with Vertica, and we'll share examples from that, as well. So yeah, let's dive right in. So, Vertica essentially is an open architecture with a rich ecosystem. So, you have a lot of options for data transformation and ingesting data from different tools, and then you also have options for connecting through ODBC, JDBC, and some other connectors to BI and visualization tools. There's a lot of them that Vertica connects to, and in the middle sits Vertica, which you can have on external tables or you can have in place analytics on R, on cloud, or on prem, so that choice is yours, but essentially what it does is it offers you a lot of options for performing your data and analytics on scale, and within that, data analytics machine learning is also a core component, and then you have a lot of options and functions for that. Now, machine learning in Vertica is actually built on top of the architecture that distributed data analytics offers, so it offers a lot of those capabilities and builds on top of them, so you eliminate the overhead data transfer when you're working with Vertica machine learning, you keep your data secure, storing and managing the models really easy and much more efficient. 
You can serve a lot of concurrent users all at the same time, and then it's really scalable and avoids maintenance cost of a separate system, so essentially a lot of benefits here, but one important thing to mention here is that all the algorithms that you see, whether they're analytics functions, advanced analytics functions, or machine learning functions, they are distributed not just across the cluster on different nodes. So, each node gets a distributed work load. On each node, too, there might be multiple tracks and multiple processors that are running with each of these functions. So, highly distributed solution and one of its kind in this space. So, when we talk about Vertica machine learning, it essentially covers all machine learning process and we see it as something starting with data ingestion and doing data analysis and understanding, going through the steps of data preparation, modeling, evaluation, and finally deployment, as well. So, when you're using with Vertica, you're using Vertica for machine learning, it takes care of all these steps and you can do all of that inside of the Vertica database, but when we look at the three main pillars that Vertica machine learning aims to build on, the first one is to have Vertica as a platform for high performance machine learning. We have a lot of functions for data exploration and preparation and we'll go through some of them here. We have distributed in-database algorithms for model training and prediction, we have scalable functions for model evaluation, and finally we have distributed scoring functions, as well. Doing all of the stuff in the database, that's a really good thing, but we don't want it isolated in this space. We understand that a lot of our customers, our users, they like to work with other tools and work with Vertica, as well. So, they might use Vertica for data prep, another two for model training, or use Vertica for model training and take those nodes out to other tools and do prediction there. So, integration is really important part of our overall offering. So, it's a pretty flexible system. We have been offering UdX in four languages, a lot of people find there over the past few years, but the new capability of importing PMML models for in-database scoring and exporting Vertica native-models, for external scoring it's something that we have recently added, and another talk would actually go through the TensorFlow integrations, a really exciting and important milestone that we have where you can bring TensorFlow models into Vertica for in-database scoring. For this talk, we'll focus on data exploration and preparation, importing PMML, and exporting PMML models, and finally, since Vertica is not just a cue engine, but also a data store, we have a lot of really good capability for model storage and management, as well. So, yeah. Let's dive into the first part on machine learning at scale. So, when we say machine learning at scale we're actually having a few really important considerations and they have their own implications. The first one is that we want to have speed, but also want it to come at a reasonable cost. So, it's really important for us to pick the right scaling architecture. Secondly, it's not easy to move big data around. 
It might be easy to do that on a smaller data set, on an Excel sheet, or something of the like, but once you're talking about big data and data analytics at really big scale, it's really not easy to move that data around from one tool to another, so what you'd want to do is bring models to the data instead of having to move this data to the tools, and the third thing here is that some sub-sampling it can actually compromise your accuracy, and a lot of tools that are out there they still force you to take smaller samples of your data because they can only handle so much data, but that can impact your accuracy and the need here is that you should be able to work with all of your data. We'll just go through each of these really quickly. So, the first factor here is scalability. Now, if you want to scale your architecture, you have two main options. The first is vertical scaling. Let's say you have a machine, a server, essentially, and you can keep on adding resources, like RAM and CPU and keep increasing the performance as well as the capacity of that system, but there's a limit to what you can do here, and the limit, you can hit that in terms of cost, as well as in terms of technology. Beyond a certain point, you will not be able to scale more. So, the right solution to follow here is actually horizontal scaling in which you can keep on adding more instances to have more computing power and more capacity. So, essentially what you get with this architecture is a super computer, which stitches together several nodes and the workload is distributed on each of those nodes for massive develop processing and really fast speeds, as well. The second aspect of having big data and the difficulty around moving it around is actually can be clarified with this example. So, what usually happens is, and this is a simplified version, you have a lot of applications and tools for which you might be collecting the data, and this data then goes into an analytics database. That database then in turn might be connected to some VI tools, dashboard and applications, and some ad-hoc queries being done on the database. Then, you want to do machine learning in this architecture. What usually happens is that you have your machine learning tools and the data that is coming in to the analytics database is actually being exported out of the machine learning tools. You're training your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction. With those results that you get from those tools usually ended up back in the distributed database because you want to put it on dashboard or you want to power up some applications with that. So, there's essentially a lot of data overhead that's involved here. There are cons with that, including data governance, data movement, and other complications that you need to resolve here. One of the possible solutions to overcome that difficulty is that you have machine learning as part of the distributed analytical database, as well, so you get the benefits of having it applied on all of the data that's inside of the database and not having to care about all of the data movement there, but if there are some use cases where it still makes sense to at least train the models outside, that's where you can do your data preparation outside of the database, and then take the data out, the prepared data, build your model, and then bring the model back to the analytics database. In this case, we'll talk about Vertica. 
So, the model would be archived, hosted by Vertica, and then you can keep on applying predictions on the new data that's incoming into the database. So, the third consideration here for machine learning on scale is sampling versus full data set. As I mentioned, a lot of tools they cannot handle big data and you are forced to sub-sample, but what happens here, as you can see in the figure on the left most, figure A, is that if you have a single data point, essentially any model can explain that, but if you have more data points, as in figure B, there would be a smaller number of models that could be able to explain that, and in figure C, even more data points, lesser number of models explained, but lesser also means here that these models would probably be more accurate, and the objective for building machine learning models is mostly to have prediction capability and generalization capability, essentially, on unseen data, so if you build a model that's accurate on one data point, it could not have very good generalization capabilities. The conventional wisdom with machine learning is that the more data points that you have for learning the better and more accurate models that you'll get out of your machine learning models. So, you need to pick a tool which can handle all of your data and does not force you to sub-sample that, and doing that, even a simpler model might be much better than a more complex model here. So, yeah. Let's go to data exploration and data preparation part. Vertica's a really powerful tool and it offers a lot of scalability in this space, and as I mentioned, will support the whole process. You can define the problem and you can gather your data and construct your data set inside Vertica, and then consider it a prepared training modeling deployment and managing the model, but this is a really critical step in the overall machine learning process. Some estimate it takes between 60 to 80% of the overall effort of a machine learning process. So, a lot of functions here. You can use part of Vertica, do data exploration, de-duplication, outlier detection, balancing, normalization, and potentially a lot more. You can actually go to our Vertica documentation and find them there. Within Vertica we divide them into two parts. Within data prep, one is exploration functions, the second is transformation functions. Within exploration, you have a rich set functions that you can use in DB, and then if you want to build your own you can use the UDX to do that. Similarly, for transformation there's a lot of functions around time series, pattern matching, outlier detection that you can use to transform that data, and it's just a snapshot of some of those functions that are available in Vertica right now. And again, the good thing about these functions is not just their presence in the database. The good thing is actually their ability to scale on really, really large data set and be able to compute those results for you on that data set in an acceptable amount of time, which makes your machine learning processes really critical. So, let's go to an example and see how we can use some of these functions. As I mentioned, there's a whole lot of them and we'll not be able to go through all of them, but just for our understanding we can go through some of them and see how they work. So, we have here a sample data set of network flows. It's a similar attack from some source nodes, and then there are some victim nodes on which these attacks are happening. 
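Before diving into the data itself, here's a hedged sketch of what a few of those in-database prep functions look like when driven from a client. The table, columns, and especially the exact parameter lists for DETECT_OUTLIERS, BALANCE, and NORMALIZE are illustrative and should be checked against the ML function reference for your Vertica version.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

PREP_STEPS = [
    # Flag rows whose duration looks like an outlier (method and threshold are illustrative).
    "SELECT DETECT_OUTLIERS('flow_outliers', 'uniq_flows', 'duration', 'robust_zscore' "
    "USING PARAMETERS outlier_threshold=3.0)",
    # Rebalance the label distribution before training a classifier.
    "SELECT BALANCE('balanced_flows', 'uniq_flows', 'label', 'hybrid_sampling')",
    # Scale numeric features into a comparable range.
    "SELECT NORMALIZE('normalized_flows', 'balanced_flows', "
    "'duration, packets, bytes', 'zscore')",
]

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for sql in PREP_STEPS:
        cur.execute(sql)
        print(cur.fetchall())   # each function returns a short status or summary row
```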
So yeah, let's just look at the data here real quick. We'll load the data, we'll browse the data, compute some statistics around it, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction, per se, which is what we mostly do in machine learning algorithms, but to just go through the data prep process and see how easy it is to do that with Vertica and what kind of options might be there to help you through that process. So, the first step is loading the data. Since in this case we know the structure of the data, so we create a table and create different column names and data types, but let's say you have a data set for which you do not already know the structure, there's a really cool feature in Vertica called flex tables and you can use that to initially import the data into the database and then go through all of the variables and then assign them variable types. You can also use that if your data is dynamic and it's changing, to board the data first and then create these definitions. So once we've done that, we load the data into the database. It's for one week of data out of the whole data set right now, but once you've done that we'd like to look at the flows just to look at the data, you know how it looks, and once we do select star from flows and just have a limit here, we see that there's already some data duplication, and by duplication I mean rows which have the exact same data for each of the columns. So, as part of the cleaning process, the first thing we'd want to do is probably to remove that duplication. So, we create a table with distinct flows and you can see here we have about a million flows here which are unique. So, moving on. The next step we want to do here, this is essentially time state data and these times are in days of the week, so we want to look at the trends of this data. So, the network traffic that's there, you can call it flows. So, based on hours of the day how does the traffic move and how does it differ from one day to another? So, it's part of an exploration process. There might be a lot of further exploration that you want to do, but we can start with this one and see how it goes, and you can see in the graph here that we have seven days of data, and the weekend traffic, which is in pink and purple here seems a little different from the rest of the days. Pretty close to each other, but yeah, definitely something we can look into and see if there's some real difference and if there's something we want to explore further here, but the thing is that this is just data for one week, as I mentioned. What if we load data for 70 days? You'd have a longer graph probably, but a lot of lines and would not really be able to make sense out of that data. It would be a really crowded plot for that, so we have to come up with a better way to be able to explore that and we'll come back to that in a little bit. So, what are some other things that we can do? We can get some statistics, we can take one sample flow and look at some of the values here. We see that the forward column here and ToS column here, they have zero values, and when we explore further we see that there's a lot of values here or records here for which these columns are essentially zero, so probably not really helpful for our use case. Then, we can look at the flow end. 
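Before moving on to the flow end times, here's a sketch of the loading and de-duplication steps just described. The flex table route is the one you'd take when the schema isn't known up front; the file path, parser choice, and table names are placeholders.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Schema unknown up front? Land the data in a flex table first,
    # then materialize the keys into a queryable view.
    cur.execute("CREATE FLEX TABLE IF NOT EXISTS raw_flows()")
    cur.execute("COPY raw_flows FROM '/data/flows_week1.json' PARSER fjsonparser()")
    cur.execute("SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('raw_flows')")

    # Drop exact duplicate rows by materializing only the distinct flows.
    cur.execute("CREATE TABLE IF NOT EXISTS uniq_flows AS SELECT DISTINCT * FROM flows")
    cur.execute("SELECT COUNT(*) FROM uniq_flows")
    print("unique flows:", cur.fetchone()[0])
    conn.commit()
```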
So, flow end is the end time when the last packet in a flow was sent and you can do a select min flow and max flow to see the data when it started and when it ended, and you can see it's about one week's of data for the first til eighth. Now, we also want to look at the data whether it's balanced or not because balanced data is really important for a lot of classification use cases that we want to try with this and you can see that source address, destination address, source port, and destination port, and you see it's highly in balanced data and so is versus destination address space, so probably something that we need to do, really powerful Vertica balancing functions that you can use within, and just sampling, over-sampling, or hybrid sampling here and that can be really useful here. Another thing we can look at is there's so many statistics of these columns, so off the unique flows table that we created we just use the summarize num call function in Vertica and it gives us a lot of really cool (mumbling) and percentile information on that. Now, if we look at the duration, which is the last record here, we can see that the mean is about 4.6 seconds, but when we look at the percentile information, we see that the median is about 0.27. So, there's a lot of short flows that have duration less than 0.27 seconds. Yes, there would be more and they'd probably bring the mean to the 4.6 value, but then the number of short flows is probably pretty high. We can ask some other questions from the data about the features. We can look at the protocols here and look at the count. So, we see that most of the traffic that we have is for TCP and UDP, which is sort of expected for a data set like this, and then we want to look at what are the most popular network services here? So again, simply queue here, select destination port count, add in the information here. We get the destination port and count for each. So, we can see that most of the traffic here is web traffic, HTTP and HTTPS, followed by domain name resolution. So, let's explore some more. We can look at the label distributions. We see that the labels that are given with that because this is essentially data for which we already know whether something was an anomaly or not, record was anomaly or not, and creating our algorithm based on it. So, we see that there's this background label, a lot of records there, and then anomaly spam seems to be really high. There are anomaly UDB scans and SSS scams, as well. So, another question we can ask is among the SMTP flows, how labels are distributed, and we can say that anomaly spam is highest, and then comes the background spam. So, can we say out of this that SMTP flows, they are spams, and maybe we can build a model that actually answers that question for us? That can be one machine learning model that you can build out of this data set. Again, we can also verify the destination port of flows that were labeled as spam. So, you can expect port 25 for SMTP service here, and we can see that SMTP with destination port 25, you have a lot of counts here, but there are some other destination ports for which the count is really low, and essentially, when we're doing and analysis at this scale, these data points might not really be needed. So, as part of the data prep slash data cleaning we might want to get rid of these records here. So now, what we can do is going back to the graph that I showed earlier, we can try and plot the daily trends by aggregating them. 
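The exploration queries themselves are plain SQL; here's a hedged sketch of a few of them driven from Python. The column names are invented to match the walkthrough, and SUMMARIZE_NUMCOL is spelled the way recent versions document it, so confirm against your release.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

QUERIES = {
    "time span":     "SELECT MIN(flow_end), MAX(flow_end) FROM uniq_flows",
    "protocols":     "SELECT proto, COUNT(*) FROM uniq_flows GROUP BY proto ORDER BY 2 DESC",
    "top services":  "SELECT dst_port, COUNT(*) FROM uniq_flows "
                     "GROUP BY dst_port ORDER BY 2 DESC LIMIT 10",
    # Per-column statistics (count, mean, min, percentiles, max) in one shot.
    "numeric stats": "SELECT SUMMARIZE_NUMCOL(duration, packets, bytes) OVER () FROM uniq_flows",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for name, sql in QUERIES.items():
        cur.execute(sql)
        print(f"-- {name}")
        for row in cur.fetchall():
            print(row)
```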
Again, we take the unique flow and convert into a flow count and to a manageable number that we can then feed into one of the algorithms. Now, PCA principle component analysis, it's a really powerful algorithm in Vertica, and what it essentially does is a lot of times when you have a high number of columns, which might be highly (mumbling) with each other, you can feed them into the PCA algorithm and it will get for you a list of principle components which would be linearly independent from each other. Now, each of these components would explain a certain extent of the variants of the overall data set that you have. So, you can see here component one explains about 73.9% of the variance, and component two explains about 16% of the variance. So, if you combine those two components alone, that would get you for around 90% of the variance. Now, you can use PCA for a lot of different purposes, but in this specific example, we want to see if we combine all the data points that we have together and we do that by day of the week, what sort of information can we get out of it? Is there any insight that this provides? Because once you have two data points, it's really easy to plot them. So, we just apply the PCA, we first (mumbling) it, and then reapply on our data set, and this is the graph we get as a result. Now, you can see component one is on the X axis here, component two on the y axis, and each of these points represents a day of the week. Now, with just two points it's easy to plot that and compare this to the graph that we saw earlier, which had a lot of lines and the more weeks that we added or the more days that we added, the more lines that we'd have versus this graph in which you can clearly tell that five days traffic starting from Monday til Friday, that's closely clustered together, so probably pretty similar to each other, and then Saturday traffic is pretty much apart from all of these days and it's also further away from Sunday. So, these two days of traffic is different from other days of traffic and we can always dive deeper into this and look at exactly what's happening here and see how this traffic is actually different, but with just a few functions and some pretty simple SQL queries, we were already able to get a pretty good insight from the data set that we had. Now, let's move on to our next part of this talk on importing and exporting PMML models to and from Vertica. So, current common practice is when you're putting your machine learning models into production, you'd have a dev or test environment, and in that you might be using a lot of different tools, Scikit and Spark, R, and once you want to deploy these models into production, you'd put them into containers and there would be a pool of containers in the production environment which would be talking to your database that could be your analytical database, and all of the new data that's incoming would be coming into the database itself. So, as I mentioned in one of the slides earlier, there is a lot of data transfer that's happening between that pool of containers hosting your machine learning training models versus the database which you'd be getting data for scoring and then sending the scores back to the database. So, why would you really need to transfer your models? The thing is that no machine learning platform provides everything. 
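Going back to that PCA step for a moment, here's a sketch of what it can look like in code before the model exchange discussion continues. The PCA and APPLY_PCA calls follow the general pattern in Vertica's ML documentation but their parameters are worth double-checking for your version, and the daily_counts table with its day_of_week column is invented to match the walkthrough.

```python
import vertica_python
import matplotlib.pyplot as plt

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Fit PCA on the per-day, per-hour flow counts (all columns except the label).
    cur.execute("SELECT PCA('pca_flows', 'daily_counts', '*' "
                "USING PARAMETERS exclude_columns='day_of_week')")

    # Project each day onto the first two principal components.
    cur.execute("SELECT APPLY_PCA(* USING PARAMETERS model_name='pca_flows', "
                "exclude_columns='day_of_week', key_columns='day_of_week', "
                "num_components=2) OVER () FROM daily_counts")
    rows = cur.fetchall()

days = [r[0] for r in rows]
xs, ys = [r[1] for r in rows], [r[2] for r in rows]
plt.scatter(xs, ys)
for day, x, y in zip(days, xs, ys):
    plt.annotate(day, (x, y))
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()
```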
There might be some really cool algorithms that might compromise, but then Spark might have its own benefits in terms of some additional algorithms or some other stuff that you're looking at and that's the reason why a lot of these tools might be used in the same company at the same time, and then there might be some functional considerations, as well. You might want to isolate your data between data science team and your production environment, and you might want to score your pre-trained models on some S nodes here. You cannot host probably a big solution, so there is a whole lot of use cases where model movement or model transfer from one tool to another makes sense. Now, one of the common methods for transferring models from one tool to another is the PMML standard. It's an XML-based model exchange format, sort of a standard way to define statistical and data mining models, and helps you share models between the different applications that are PMML compliant. Really popular tool, and that's the tool of choice that we have for moving models to and from Vertica. Now, with this model management, this model movement capability, there's a lot of model management capabilities that Vertica offers. So, models are essentially first class citizens of Vertica. What that means is that each model is associated with a DB schema, so the user that initially creates a model, that's the owner of it, but he can transfer the ownership to other users, he can work with the ownership rights in any way that you would work with any other relation in a database would be. So, the same commands that you use for granting access to a model, changing its owner, changing its name, or dropping it, you can use similar commands for more of this one. There are a lot of functions for exploring the contents of models and that really helps in putting these models into production. The metadata of these models is also available for model management and governance, and finally, the import/export part enables you to apply all of these operations to the model that you have imported or you might want to export while they're in the database, and I think it would be nice to actually go through and example to showcase some of these capabilities in our model management, including the PMML model import and export. So, the workflow for export would be that we trained some data, we'll train a logistic regression model, and we'll save it as an in-DB Vertica model. Then, we'll explore the summary and attributes of the model, look at what's inside the model, what the training parameters are, concoctions and stuff, and then we can export the model as PMML and an external tool can import that model from PMML. And similarly, we'll go through and example for export. We'll have an external PMML model trained outside of Vertica, we'll import that PMML model and from there on, essentially, we'll treat it as an in-DB PMML model. We'll explore the summary and attribute of the model in much the same way as in in-DB model. We'll apply the model for in-DB scoring and get the prediction results, and finally, we'll bring some test data. We'll use that on test data for which the scoring needs to be done. So first, we want to create a connection with the database. In this case, we are using a Python Jupyter Notebook. 
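Since models are first-class objects, those management operations are ordinary SQL statements; here's a hedged sketch of a few of them ahead of the notebook walkthrough. The model and user names are placeholders, and the exact statement forms should be confirmed against the model management section of the documentation.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

MGMT = [
    # What models exist, who owns them, and what kind are they?
    "SELECT model_name, schema_name, owner_name, category, model_type FROM v_catalog.models",
    # Inspect a model's metadata and training summary.
    "SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myModel')",
    # Treat the model like any other relation: rename it, hand it to another owner.
    "ALTER MODEL myModel RENAME TO churn_lr_v1",
    "ALTER MODEL churn_lr_v1 OWNER TO analytics_user",
]

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for sql in MGMT:
        cur.execute(sql)
        print(cur.fetchall())
```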
We have the Vertica Python connector here that you can use, really powerful connector, allows you to do a lot of cool stuff to the database using the Jupyter front end, but essentially, you can use any other SQL front end tool or for that matter, any other Python ID which lets you connect to the database. So, exporting model. First, we'll create an logistic regression model here. Select logistic regression, we'll give it a model name, then put relation, which might be a table, time table, or review. There's response column and the predictor columns. So, we get a logistic regression model that we built. Now, we look at the models table and see that the model has been created. This is a table in Vertica that contains a list of all the models that are there in the database. So, we can see here that my model that we just created, it's created with Vertica models as a category, model type is logistic regression, and we have some other metadata around this model, as well. So now, we can look at some of the summary statistics of the model. We can look at the details. So, it gives us the predictor, coefficients, standard error, Z value, and P value. We can look at the regularization parameters. We didn't use any, so that would be a value of one, but if you had used, it would show it up here, the call string and also additional information regarding iteration count, rejected row count, and accepted row count. Now, we can also look at the list of attributes of the model. So, select get model attribute using parameter, model name is myModel. So, for this particular model that we just created, it would give us the name of all the attributes that are there. Similarly, you can look at the coefficients of the model in a column format. So, using parameter name myModel, and in this case we add attribute name equals details because we want all the details for that particular model and we get the predictor name, coefficient, standard error, Z value, and P value here. So now, what we can do is we can export this model. So, we used the select export models and we give it a path to where we want the model to be exported to. We give it the name of the model that needs to be exported because essentially might have a lot of models that you have created, and you give it the category here, which in our example is PMML, and you get a status message here that export model has been successful. So now, let's move onto the importing models example. In much the same way that we created a model in Vertica and exported it out, you might want to create a model outside of Vertica in another tool and then bring that to Vertica for scoring because Vertica contains all of the hard data and it might make sense to host that model in Vertica because scoring happens a lot more quickly than model training. So, in this particular case we do a select import models and we are importing a logistic regression model that was created in Spark. The category here again is PMML. So, we get the status message that the import was successful. Now, let's look at the attributes, look at the models table, and see that the model is really present there. Now previously when we ran this query because we had only myModel there, so that was the only entry you saw, but now once this model is imported you can see that as line item number two here, Spark logistic regression, it's a public schema. 
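To recap the export side of that workflow in code form before continuing with the imported model: here's a sketch driven from vertica-python, with the training table, response, and predictor columns made up for illustration, and the function names matching what the walkthrough describes (LOGISTIC_REG, GET_MODEL_ATTRIBUTE, EXPORT_MODELS).

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Train an in-database logistic regression model.
    cur.execute("SELECT LOGISTIC_REG('myModel', 'train_flows', 'is_anomaly', "
                "'duration, packets, bytes')")

    # Look inside the model: coefficients, standard errors, z and p values.
    cur.execute("SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS "
                "model_name='myModel', attr_name='details')")
    for row in cur.fetchall():
        print(row)

    # Export it as PMML to a directory on the initiator node.
    cur.execute("SELECT EXPORT_MODELS('/tmp/models', 'myModel' "
                "USING PARAMETERS category='PMML')")
    print(cur.fetchone()[0])   # expect a success status message
```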
The category here however is different because it's not an individuated model, rather an imported model, so you get PMML here and then other metadata regarding the model, as well. Now, let's do some of the same operations that we did with the in-DB model so we can look at the summary of the imported PMML model. So, you can see the function name, data fields, predictors, and some additional information here. Moving on. Let's look at the attributes of the PMML model. Select your model attribute. Essentially the same query that we applied earlier, but the difference here is only the model name. So, you get the attribute names, attribute field, and number of rows. We can also look at the coefficient of the PMML model, name, exponent, and coefficient here. So yeah, pretty much similar to what you can do with an in-DB model. You can also perform all operations on an important model and one additional thing we'd want to do here is to use this important model for our prediction. So in this case, we'll data do a select predict PMML and give it some values using parameters model name, and logistic regression, and match by position, it's a really cool feature. This is true in this case. Sector, true. So, if you have model being imported from another platform in which, let's say you have 50 columns, now the names of the columns in that environment in which you're training the model might be slightly different than the names of the column that you have set up for Vertica, but as long as the order is the same, Vertica can actually match those columns by position and you don't need to have the exact same names for those columns. So in this case, we have set that to true and we see that predict PMML gives us a status of one. Now, using the important model, in this case we had a certain value that we had given it, but you can also use it on a table, as well. So in that case, you also get the prediction here and you can look at the (mumbling) metrics, see how well you did. Now, just sort of wrapping this up, it's really important to know the important distinction between using your models in any tool, any single node solution tool that you might already be using, like Python or R versus Vertica. What happens is, let's say you build a model in Python. It might be a single node solution. Now, after building that model, let's say you want to do prediction on really large amounts of data and you don't want to go through the overhead of keeping to move that data out of the database to do prediction every time you want to do it. So, what you can do is you can import that model into Vertica, but what Vertica does differently than Python is that the PMML model would actually be distributed across each mode in the cluster, so it would be applying on the data segments in each of those nodes and they might be different threads running for that prediction. So, the speed that you get here from all prediction would be much, much faster. Similarly, once you build a model for machine learning in Vertica, the objective mostly is that you want to use up all of your data and build a model that's accurate and is not just using a sample of the data, but using all the data that's available to it, essentially. So, you can build that model. The model building process would again go through the same technique. It would actually be distributed across all nodes in a cluster, and it would be using up all the threads and processes available to it within those nodes. 
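And here's the matching import-and-score side as a sketch. The PMML path and test table are placeholders, and the match-by-position option is spelled here as match_by_pos, which is an assumption about the parameter name; confirm it against the documentation for your release.

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Bring in a PMML model trained elsewhere (Spark, in the walkthrough).
    cur.execute("SELECT IMPORT_MODELS('/tmp/models/spark_logistic_reg' "
                "USING PARAMETERS category='PMML')")

    # Score a whole table in-database with the imported model; columns are
    # matched to the model's fields by position rather than by name.
    cur.execute("SELECT PREDICT_PMML(duration, packets, bytes "
                "USING PARAMETERS model_name='spark_logistic_reg', "
                "match_by_pos='true') AS prediction "
                "FROM test_flows LIMIT 10")
    for row in cur.fetchall():
        print(row)
```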
So, really fast model training, but let's say you wanted to deploy it on an edge node and maybe do prediction closer to where the data was being generated; you can export that model in PMML format and deploy it on the edge node. So, it's really helpful for a lot of use cases. And just some closing takeaways from our discussion today. Vertica's a really powerful tool for machine learning, for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or only some of them. Either way, Vertica supports both approaches. In the upcoming releases, we are planning to have more import and export capability for PMML models. Initially, we're supporting k-means, linear regression, and logistic regression, but we keep adding more algorithms, and the plan is to move toward supporting custom models. If you want to do that with the upcoming release, our TensorFlow integration is always there for you to use, but with PMML, this is the starting point for us and we'll keep improving it. Vertica models can be exported in PMML format for scoring on other platforms, and similarly, models that get built in other tools can be imported for in-DB machine learning and in-DB scoring within Vertica. There are a lot of critical model management tools already provided in Vertica and a lot more on the roadmap, which we'll keep developing. Many ML functions and algorithms are already part of the in-DB library and we keep adding to that, as well. So, thank you so much for joining the discussion today, and if you have any questions we'd love to take them now. Back to you, Sue.

Published Date : Mar 30 2020


Jacque Istok, Pivotal | BigData NYC 2017


 

>> Announcer: Live from midtown Manhattan, it's the Cube, covering big data New York City 2017. Brought to you by Silicon Angle Media and its ecosystem sponsors. >> Welcome back everyone, we're here live in New York City for the week, three days of wall to wall coverage of big data NYC, it's big data week here in conjunction with Strata Adup, Strata Data which is an event running right around the corner, this is the Cube, I'm John Furrier with my cohost, Peter Burris, our next guest Jacque Istok who's the head of data at Pivotal. Welcome to the Cube, good to see you again. >> Likewise. >> You guys had big news we covered at VMware, obviously the Kubernetes craze is fantastic, you're starting to see cloud native platforms front and center even in some of these operational worlds like in cloud, data you guys have been here a while with Green Plum and Pivotal's been adding more to the data suite, so you guys are a player in this ecosystem. >> Correct. >> As it grows to be much more developer-centric and enterprise-centric and AI-centric, what's the update? >> I'd like to talk about a couple things, just three quick things here, one focused primarily on simplicity, first and foremost as you said, there's a lot of things going on on the cloud foundry side, a lot of things that we're doing with Kubernetes, etc., super exciting. I will say Tony Berge has written a nice piece about Green Plum in Zitinet, essentially calling Green Plum the best kept secret in the analytic database world. Why I think that's important is, what isn't really well known is that over the period of Pivotal's history, the last four and a half years, we focused really heavily on the cloud foundry side, on dev/ops, on getting users to actually be able to publish code. What we haven't talked about as much is what we're doing on the data side and I find it very interesting to repeatedly tell analysts and customers that the Green Plum business has been and continues to be a profitable business unit within Pivotal, so as we're growing on the cloud foundry side, we're continuing to grow a business that many of the organizations that I see here at Strata are still looking to get to, that ever forgotten profitability zone. >> There's a legacy around Green Plum, I'm not going to say they pivoted, pun intended, Pivotal. There's been added stuff around Green Plum, Green Plum might get lost in the messaging because it's been now one of many ingredients, right? >> It's true and when we formed Pivotal, I think there were 34 some different skews that we have now focused in on over the last two years or so. What's super exciting is again, over that time period, one of the things that we took to heart within the Green Plum side is this idea of extreme agile. As you guys know, Pivotal Labs being the core part of the Pivotal mission helps our customers figure out how to actually build software. We finally are drinking our own champagne and over the last year and a half of Green Plum R&D, we're shipping code, a complete data platform, we're shipping that on a cadence of about four to five weeks which again, a little bit unheard of in the industry, being able to move at that pace. We work through the backlog and what is also super exciting and I'm glad that you guys are able to help me tell the world, we released version five last week. 
Version five is actually the only parallel open source data platform that actually has native ANSI compliance SQL and I feel a little bit like I've rewound the clock 15 years where I have to actually throw in the ANSI compliance, but I think that in a lot of ways, there are SQL alternatives that are out there in the world. They are very much not ANSI compliant and that hurts. >> It's a nuance but it's table stakes in the enterprise. ANSI compliance is just, >> There's a reason you want to be ANSI compliant, because there's a whole swath of analytic applications mainly in the data warehouse world, that were built using ANSI compliant SQL, so why do this with version five? I presume it's got to have something to do with you want to start capturing some of those applications and helping customers modernize. >> That is correct. I think the SQL piece is one part of the data platform, of really a modern data platform. The other parts are again, becoming table stakes. Being able to do text analytics, we've backed Apache Solar within Green Plum, being able to do graph analytics or spatial analytics, anything from classifications, regressions, all of that, actually becomes table stakes and we feel that enterprises have suffered a little bit over the last five or six years. They've had this promise of having a new platform that they can leverage for doing interesting new things, machine learning, AI, etc. but the existing stuff that they were trying to do has been super, super hard. What we're trying to do is bridge those together and provide both in the same platform, out of the gate so that customers can actually use it immediately and I think one of the things we've seen is there's about 1000 to one SQL experienced individuals within the enterprise versus say Haduk experience in individuals. The other thing that I think is actually super important and almost bigger than everything else I talked about is we're the, a lot of the old school postgres deriviants of MBD databases forked their databases at some point in postgres's history, for a variety of reasons from licensing to when they started. Green Plum's no different. We forked right around eight dot too with this last release of version five, we've actually up leveled the postgres base within Green Plum's 8.3. Now in and of itself, it doesn't sound, >> What does that mean? >> We are now taking a 100% commitment both to open source and both to the postgres community. I think if you look at postgres today, in its latest versions, it is a full fledged, mission critical database that can be used anywhere. What we feel is that if we can bring our core engineering developments around parallelism, around analytics and combine that with postgres itself, then we don't have to implement all of the low level database things that a lot of our competitors have to do. What's unique about it is one, Green Plum continues to be open source, which again most of our competitors are not, two if you look at primarily what they're doing, nobody's got that level of commitment to the postgres community which means all of their resources are going to be stuck building core database technology, even building that ANSI SQL compliance in, which we'll get "for free" which will let us focus on things like machine learning, artificial intelligence. >> Just give a quick second and tell about the relevance of postgres because of the success, first of all it's massive, it's everywhere, but it's not going anywhere. 
Just give a quick, for the audience watching, what's the relevance of it. >> Sure like you said, it is everywhere. It is the most full featured, actual database in the open source community. Arguably my SQL has "more" market share, but my SQL projects that generally leverage them are not used for mission critical enterprise applications. Being able to have parity allows us not only to have that database technology baked into Green Plum, but it also gives us all of the community stuff with it. Everything from being able to leverage the most recent ODBC and JDBC libraries, but also integrations into everything from the post GIS travert for geospatial to being able to connect to other types of data sources, etc. >> It's a big community, shows that it's successful, but again, >> And it doesn't come in a red box. >> It does not come in a red box, that is correct. >> Which is not a bad thing. Look, postgres as a technology was developed a long time ago, largely in response to think about analytics and transaction, or analytics and operating applications might have actually come to and we're now living in a world where we can actually see the hardware and a lot of practices, etc. are beginning to find ways where this may start to happen. With Green Plum and postgres both MPP based, so your, by going to this, you're able to stay more modern, more up to date on all the new technology that's coming together to support these richer, more complex classes of applications. >> You're spot on, I suppose I would argue that postgres, I feel came up with as a response to Oracle in the past of, we need an open source alternative to Oracle, but other than that, 100% correct. >> There was always a difference between postgres and MySQL, MySQL always was okay, that's that, let's do that open source, postgres coming out of Berkeley and coming out of some other places, always had a slightly different notion of the types of problems it was going to take on. >> 100% correct, 100%. But to your question before, what does this all mean to customers, I think the one thing that version five really gives us the confidence to say is, and a lot of times I hate lobbing when the ball's out like this, but we welcome and embrace with open arms any terradata customers out there that are looking to save millions if not tens of millions of dollars on a modern platform that can actually run not only on premise, not only on bare metal, but virtually and off premise. We're truly the only MPP platform, the only open source MPP data platform that can allow you to build analytics and move those analytics from Amazon to Azure to back on prem. >> Talk about this, the terradata thing for a second, I want to get down and double click on that. Customers don't want to change code, so what specifically are you guys offering terradata customers specifically. With the release of version five, with a lot of the development that we've done and some of the partnering that we've done, we are now able to take without changing a line of code of your terradata applications, you load the data within the Green Plum platform, you can point those applications directly to Green Plum and run them unchanged, so I think in the past, the reticence to move to any other platform was really the amount of time it would take to actually redevelop all of the stuff that you had. We offer an ability to go from an immediate ROI to a platform that again, bridges that gap, allows you to really be modern. 
>> Peter, I want to talk to you about that importance that we just said because you've been studying the private cloud report, true private cloud which is on premises, coming from a cloud operating model, automating away undifferentiated labor and shipping that to differentiated labor, but this brings up what customers want in hybrid cloud and ultimately having public cloud and private cloud so hybrid sits there. They don't want to change their code basis, this is a huge deal. >> Obviously a couple things to go along with what Jacque said. The first thing is that you're right, people want the data to run where the data naturally needs to run or should run, that's the big argument about public versus hybrid versus what we call true private cloud. The idea that decreasing the workload needs to be located where the data, where it naturally should be located because of the physical, legal, regulatory, intellectual property attributes of the data, being able to do that is really really important. The other thing that Jacque said that goes right into this question John, is that ultimately in too many domains in this analytics world, which is fundamentally predicated on the idea of breaking data out of applications so that you can use it in new and novel and more value creating ways, is that the data gets locked up in a data warehouse. What's valuable in a data warehouse is not the hardware. It's the data. By providing the facility for being able to point an application at a couple of different data source including one that's more modern, or which takes advantage of more modern technology, that can be considerably cheaper, it means the shop can elevate the story about the asset and the asset here is the data and the applications that run against it, not the hardware and the system where the data's stored and located. One of the biggest challenges, we talked earlier just to go on for a second, we talked earlier with a couple of other guests about the fact that the industry still, what your average person still doesn't understand how to value data. How to establish a data asset and one of the reasons is because it's so constantly co-mingled with the underlying hardware. >> And actually I'd even further go on, I think the advent of some of these cloud data warehouses forgets that notion of being able to run it different places and provides one of the things that customers are really looking for which is simplicity. The ability to spin up a quick MPP SQL system within say Amazon for example, almost without a doubt, a lot of the business users that I speak to are willing to sacrifice capabilities within the platform which they are for the simplicity of getting up and going. One of the things that we really focused on in V5 is being able to give that same turnkey feel and so Green Plum exists within the Amazon marketplace, within the Azure marketplace, Google later this quarter, and then in addition to the simplicity, it has all of the functionality that is missing in those platforms, again, all the analytics, all the ability to reach out and federate queries against different types of data, I think it's exciting as we continue to progress in our releases, Green Plum has, for a number of years, had this ability to seamlessly query HGFS. Like a lot of the competitors, but HGFS isn't going away, neither is a generic object store like S3. 
But we continue to extend that to things like Spark, for example, so now you have the ability to actually house your data within a data platform and seamlessly integrate with Spark back and forth. If you want to use Spark, use Spark, but somewhere that data needs to be materialized so that other applications can leverage it as well. >> But even then, people have been saying, well, if you want to put it on this disk, then put it on this disk. The question about Spark versus another database manager is a higher level conversation for many of the shops who are investing millions and millions and millions of dollars in their analytic application portfolios, and all you're trying to do, as I interpret it, is to say, look, the value in the portfolio is the applications and the data. It's not the underlying elements. There's a whole bunch of new elements we can use; you can put it in the cloud, you can put it on premise if that's where the data belongs. Use some of these new and evolving technologies, but you're focused on how the data and the applications continue to remain valuable to the business over time, and not the traditional hardware assets. >> Correct, and I'll again leverage a notion that we get from Labs, which is this idea of user centric design, and so everything that we've been putting into the Greenplum database is around, ideally, the four primary users of our system. Not just the analysts and not just the data scientists, but also the operators and the IT folks. That is where I'd say the last tenet of where we're going really is this idea of coopetition. As the Pivotal Greenplum guy that's been around for 10 plus years, I would tell you very straight up that we are, again, an open source MPP data platform that can rival any other platform out there; whether it's Teradata, whether it's Hadoop, we can beat that platform. >> Why should customers call you up? Why should they call you? There's all this other stuff out there, you've got legacy, you've got Teradata, might have other things, people are knocking at my door, they're getting pounded with sales messages, "buy me, I'm better than the other guy." Why Pivotal Data? >> The first thing I would say is, the latest reviews from Gartner for example... well, actually, let me rewind. I will easily argue that Teradata has been the data warehouse platform for the last 30 years that everyone has tried to emulate. I'd even argue that when Hadoop came on the scene eight years ago, what they did was change the dynamics, and what they're doing now is actually trying to emulate the Teradata success through things like SQL on top of Hadoop. What that has basically gotten us to is we're looking for a Teradata replacement at Hadoop-like prices, and that's what Greenplum has to offer in spades. Now, if you actually extend that just a little bit, I still recognize that not everybody's going to call us; there are still 200 other vendors out there that are selling a similar product or similar kinds of stories. What I would tell you in response to those folks is that Greenplum has been around in production for the last 10 plus years, we're a proven technology for solving problems, and many of those are not. We work very well in this cooperative spirit of, Greenplum can be the end all be all, but I recognize it's not going to be the end all be all, so this is why we have to work within the ecosystem. >> You have to, open source is dominating.
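(For the Spark round trip Istok describes just above, here is a hedged sketch using Spark's generic JDBC reader against Greenplum's PostgreSQL-compatible endpoint. Pivotal also shipped a dedicated Greenplum-Spark connector; plain JDBC is shown only because its API is stable and widely known. Hostnames, credentials, and tables are placeholders, and the PostgreSQL JDBC driver is assumed to be on Spark's classpath.)

```python
# Sketch of moving data between Greenplum and Spark over plain JDBC.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("greenplum-spark-roundtrip")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.2.5")
         .getOrCreate())

jdbc_url = "jdbc:postgresql://gp-master.example.com:5432/analytics"
props = {"user": "gpadmin", "password": "secret", "driver": "org.postgresql.Driver"}

# Pull a table out of Greenplum, do Spark-side work, then materialize the
# result back in the database so other applications can leverage it.
sales = spark.read.jdbc(jdbc_url, "sales", properties=props)
by_region = sales.groupBy("region").sum("amount")
by_region.write.jdbc(jdbc_url, "sales_by_region", mode="overwrite", properties=props)
```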
At the Linux event, we just covered the Open Source Summit: 90% of software written will be open source libraries, 10% is where the value's being added. >> For sure. If you were to start up a new startup right now, would you go with a commercial product? >> No, just postgres, the database is good. >> All right, final question to end the segment. This big data space that's now being called data, certainly Strata Hadoop is now Strata Data, just trying to keep that show going longer. But you've got Microsoft Azure making a lot of waves right now with Microsoft Ignite, so cloud is in the play here, data's changed, so the question is how has this industry changed over the past eight years. You go back to 2010, I saw Greenplum coming prior to even getting bought out, and they were kicking ass, same product evolved. Where has the space gone? What's happened? How would you summarize it to someone who's walking in for the first time, like, hey, back in the old days we used to walk to school in the snow with no shoes on, both ways, and now it's like, get off my lawn, you young developers. Seriously, what is the evolution of that, how would you explain it? >> Again, I would start with Teradata started the industry, by far, and then folks like Netezza and Greenplum came around to really give a lower cost alternative. Hadoop came on the scene some eight years ago, and, what I pride myself on, being at Greenplum for this long, Greenplum implemented the MapReduce paradigm as Hadoop was starting to build, and as it continued to build, we focused on building our own distribution and SQL on Hadoop. I think what we're getting down to is the brass tacks: the business is tired of technological science experiments and they just want to get stuff done. >> And a cost of ownership that's manageable. >> And sustainable. >> And sustainable, and not in a spot where they're going to be locked into a single vendor, hence the open source. >> The ones that are winning today employed what strategy that ended up working out, and what strategy didn't end up working out? If you go back and say, the people who took this path failed, the people who took this approach won, what's the answer there? >> Clearly anybody who was an appliance has long since drifted. I'd also say Greenplum's in this unique position where, >> An appliance too, though. >> Well, pseudo appliance, yes, I still have to respond to that, we were always software. >> You pivoted, luckily. >> But putting that aside, the hardware vendors have gone away, and all of the software competitors that we had have actually either been sunset, sold off, or forgotten, and so Greenplum, here we sit as the sole standard bearer that's been around for the long haul. We are now seeing a spot where we have no competition other than the forgotten, really legacy guys like Teradata. People are longing to get off of legacy and onto something modern; the trick will be whether that modern is some of these new and upcoming players and technologies, or whether it really focuses on solving problems. >> What's the winning strategy?
Stick to your knitting, stick to what you know, or was it more of... >> For us it was twofold. One, it was continuing to service our customers and make them successful, so that was how we built a profitable data platform business, and then the other was to double down on the strategies that seemed to be interesting to organizations, which were cloud, open source, and analytics. And like you said, I talked to one of the folks over at the Air Force, and he was mentioning how, to him, data's actually more important than fuel: being able to understand where the airplanes are, where the fuel is, where the people are, where the missiles are, etc., that's actually more important than the fuel itself. Data is the thing that powers everything. >> Data's the currency of everything now. Great, Jacque, thanks so much for coming on theCUBE. Pivotal Data Platform, Data Suite, Greenplum now with all these other adds, that's great, congratulations. Stay on the path helping customers, you can't lose. >> Exactly. >> TheCUBE here, helping you figure out the big data noise. We're obviously at the Big Data New York City event, our annual Cube Wikibon event, in conjunction with Strata Data across the street. More live coverage here for three days in New York City. I'm John Furrier with Peter Burris, we'll be back after this short break. (electronic music)

Published Date : Sep 27 2017


George Chow, Simba Technologies - DataWorks Summit 2017


 

>> (Announcer) Live from San Jose, in the heart of Silicon Valley, it's theCUBE covering DataWorks Summit 2017, brought to you by Hortonworks. >> Hi everybody, this is George Gilbert, Big Data and Analytics Analyst with Wikibon. We are wrapping up our show on theCUBE today at DataWorks 2017 in San Jose. It has been a very interesting day, and we have a special guest to help us do a survey of the wrap-up, George Chow from Simba. We used to call him Chief Technology Officer, now he's Technology Fellow, but when we was explaining the different in titles to me, I thought he said Technology Felon. (George Chow laughs) But he's since corrected me. >> Yes, very much so >> So George and I have been, we've been looking at both Spark Summit last week and DataWorks this week. What are some of the big advances that really caught your attention? >> What's caught my attention actually is how much manufacturing has really, I think, caught into the streaming data. I think last week was very notable that both Volkswagon and Audi actually had case studies for how they're using streaming data. And I think just before the break now, there was also a similar session from Ford, showcasing what they are doing around streaming data. >> And are they using the streaming analytics capabilities for autonomous driving, or is it other telemetry that they're analyzing? >> The, what is it, I think the Volkswagon study was production, because I still have to review the notes, but the one for Audi was actually quite interesting because it was for managing paint defect. >> (George Gilbert) For paint-- >> Paint defect. >> (George Gilbert) Oh. >> So what they were doing, they were essentially recording the environmental condition that they were painting the cars in, basically the entire pipeline-- >> To predict when there would be imperfections. >> (George Chow) Yes. >> Because paint is an extremely high-value sort of step in the assembly process. >> Yes, what they are trying to do is to essentially make a connection between downstream defect, like future defect, and somewhat trying to pinpoint the causes upstream. So the idea is that if they record all the environmental conditions early on, they could turn around and hopefully figure it out later on. >> Okay, this sounds really, really concrete. So what are some of the surprising environmental variables that they're tracking, and then what's the technology that they're using to build model and then anticipate if there's a problem? >> I think the surprising finding they said were actually, I think it was a humidity or fan speed, if I recall, at the time when the paint was being applied, because essentially, paint has to be... Paint is very sensitive to the condition that is being applied to the body. So my recollection is that one of the finding was that it was a narrow window during which the paint were, like, ideal, in terms of having the least amount of defect. >> So, had they built a digital twin style model, where it's like a digital replica of some aspects of the car, or was it more of a predictive model that had telemetry coming at it, and when it's an outside a certain bounds they know they're going to have defects downstream? >> I think they're still working on the predictive model, or actually the model is still being built, because they are essentially trying to build that model to figure out how they should be tuning the production pipeline. >> Got it, so this is sort of still in the development phase? 
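(As an aside to the paint-defect case Chow describes, here is a purely illustrative sketch of the kind of model involved: predicting a downstream defect from upstream environmental readings such as humidity and fan speed. The features, synthetic data, and choice of a logistic regression are assumptions for illustration; as noted above, Audi's actual model was still being built.)

```python
# Illustrative only: synthetic data standing in for paint-booth telemetry.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
humidity = rng.uniform(30, 80, n)       # percent relative humidity (assumed range)
fan_speed = rng.uniform(800, 1600, n)   # booth fan speed in rpm (assumed range)

# Synthetic ground truth: defects become likely outside a narrow ideal window.
defect = ((humidity > 65) | (fan_speed < 1000)).astype(int)
defect = defect ^ (rng.random(n) < 0.05)   # add a little label noise

X = np.column_stack([humidity, fan_speed])
X_train, X_test, y_train, y_test = train_test_split(X, defect, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
print("defect probability at 70% humidity, 900 rpm:",
      model.predict_proba([[70.0, 900.0]])[0, 1])
```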
>> (George Chow) Yeah, yeah >> And can you tell us, did they talk about the technologies that they're using? >> I remember the... It's a little hazy now because after a couple weeks of conference, so I don't remember the specifics because I was counting on the recordings to come out in a couples weeks' time. So I'll definitely share that. It's a case study to keep an eye on. >> So tell us, were there other ones where this use of real-time or near real-time data had some applications that we couldn't do before because we now can do things with very low latency? >> I think that's the one that I was looking forward to with Ford. That was the session just earlier, I think about an hour ago. The session actually consisted of a demo that was being done live, you know. It was being streamed to us where they were showcasing the data that was coming off a car that's been rigged up. >> So what data were they tracking and what were they trying to anticipate here? >> They didn't give enough detail, but it was basically data coming off of the CAN bus of the car, so if anybody is familiar with the-- >> Oh that's right, you're a car guru, and you and I compare, well our latest favorite is the Porche Macan >> Yes, yes. >> SUV, okay. >> But yeah, they were looking at streaming the performance data of the car as well as the location data. >> Okay, and... Oh, this sounds more like a test case, like can we get telemetry data that might be good for insurance or for... >> Well they've built out the system enough using the Lambda Architecture with Kafka, so they were actually consuming the data in real-time, and the demo was actually exactly seeing the data being ingested and being acted on. So in the case they were doing a simplistic visualization of just placing the car on the Google Map so you can basically follow the car around. >> Okay so, what was the technical components in the car, and then, how much data were they sending to some, or where was the data being sent to, or how much of the data? >> The data was actually sent, streamed, all the way into Ford's own data centers. So they were using NiFi with all the right proxy-- >> (George Gilbert) NiFi being from Hortonworks there. >> Yeah, yeah >> The Hortonworks data flow, okay >> Yeah, with all the appropriate proxys and firewall to bring it all the way into a secure environment. >> Wow >> So it was quite impressive from the point of view of, it was life data coming off of the 4G modem, well actually being uploaded through the 4G modem in the car. >> Wow, okay, did they say how much compute and storage they needed in the device, in this case the car? >> I think they were using a very lightweight platform. They were streaming apparently from the Raspberry Pi. >> (George Gilbert) Oh, interesting. >> But they were very guarded about what was inside the data center because, you know, for competitive reasons, they couldn't share much about how big or how large a scale they could operate at. >> Okay, so Simba has been doing ODBC and JDBC drivers to standard APIs, to databases for a long time. That was all about, that was an era where either it was interactive or batch. So, how is streaming, sort of big picture, going to change the way applications are built? 
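(A minimal sketch of the ingest side of the pipeline Chow just described: a lightweight process on the in-car Raspberry Pi publishing CAN-bus-style readings to a Kafka topic. The broker address, topic name, and fields are placeholders; Ford's actual NiFi proxying and security layers are not shown.)

```python
# Minimal producer sketch using the kafka-python client; all names assumed.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9093",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def read_vehicle_bus():
    """Placeholder for reading speed/rpm/GPS off the car's CAN bus."""
    return {"ts": time.time(), "speed_kph": 87.5, "rpm": 2300,
            "lat": 42.30, "lon": -83.23}

while True:
    producer.send("vehicle-telemetry", read_vehicle_bus())
    time.sleep(1.0)   # one reading per second over the cellular uplink
```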
>> Well, one way to think about streaming is that if you look at many of these APIs, into these systems, like Spark is a good example, where they're trying to harmonize streaming and batch, or rather, to take away the need to deal with it as a streaming system as opposed to a batch system, because it's obviously much easier to think about and reason about your system when it is traditional, like in the traditional batch model. So, the way that I see it also happening is that streaming systems will, you could say will adapt, will actually become easier to build, and everyone is trying to make it easier to build, so that you don't have to think about and reason about it as a streaming system. >> Okay, so this is really important. But they have to make a trade-off if they do it that way. So there's the desire for leveraging skill sets, which were all batch-oriented, and then, presumably SQL, which is a data manipulation everyone's comfortable with, but then, if you're doing it batch-oriented, you have a portion of time where you're not sure you have the final answer. And I assume if you were in a streaming-first solution, you would explicitly know whether you have all the data or don't, as opposed to late arriving stuff, that might come later. >> Yes, but what I'm referring to is actually the programming model. All I'm saying is that more and more people will want streaming applications, but more and more people need to develop it quickly, without having to build it in a very specialized fashion. So when you look at, let's say the example of Spark, when they focus on structured streaming, the whole idea is to make it possible for you to develop the app without having to write it from scratch. And the comment about SQL is actually exactly on point, because the idea is that you want to work with the data, you can say, not mindful, not with a lot of work to account for the fact that it is actually streaming data that could arrive out of order even, so the whole idea is that if you can build applications in a more consistent way, irrespective whether it's batch or streaming, you're better off. >> So, last week even though we didn't have a major release of Spark, we had like a point release, or a discussion about the 2.2 release, and that's of course very relevant for our big data ecosystem since Spark has become the compute engine for it. Explain the significance where the reaction time, the latency for Spark, went down from several hundred milliseconds to one millisecond or below. What are the implications for the programming model and for the applications you can build with it. >> Actually, hitting that new threshold, the millisecond, is actually a very important milestone because when you look at a typical scenario, let's say with AdTech where you're serving ads, you really only have, maybe, on the order about 100 or maybe 200 millisecond max to actually turn around. >> And that max includes a bunch of things, not just the calculation. >> Yeah, and that, let's say 100 milliseconds, includes transfer time, which means that in your real budget, you only have allowances for maybe, under 10 to 20 milliseconds to compute and do any work. So being able to actually have a system that delivers millisecond-level performance actually gives you ability to use Spark right now in that scenario. >> Okay, so in other words, now they can claim, even if it's not per event processing, they can claim that they can react so fast that it's as good as per event processing, is that fair to say? 
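(To ground the point about writing streaming code the way you write batch code, here is a sketch in Spark structured streaming: the same DataFrame operations you would run on a static table, applied to a Kafka stream, with a watermark telling the engine how long to wait for out-of-order events. The broker, topic, and schema are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.)

```python
# Sketch: identical groupBy/agg code would work on a static DataFrame in batch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("unified-batch-stream-sketch").getOrCreate()

schema = StructType([
    StructField("ts", TimestampType()),
    StructField("speed_kph", DoubleType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka.example.com:9093")
          .option("subscribe", "vehicle-telemetry")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

avg_speed = (events
             .withWatermark("ts", "10 minutes")          # tolerate late arrivals
             .groupBy(window(col("ts"), "1 minute"))
             .avg("speed_kph"))

query = avg_speed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```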
>> Yes, yes that's very fair. >> Okay, that's significant. So, what type... How would you see applications changing? We've only got another minute or two, but how do you see applications changing now that, Spark has been designed for people that have traditional, batch-oriented skills, but who can now learn how to do streaming, real-time applications without learning anything really new. How will that change what we see next year? >> Well I think we should be careful to not pigeonhole Spark as something built for batch, because I think the idea is that, you could say, the originators, of Spark know that it's all about the ease of development, and it's the ease of reasoning about your system. It's not the fact that the technology is built for batch, so the fact that you could use your knowledge and experience and an API that actually is familiar, should leverage it for something that you can build for streaming. That's the power, you could say. That's the strength of what the Spark project has taken on. >> Okay, we're going to have to end it on that note. There's so much more to go through. George, you will be back as a favorite guest on the show. There will be many more interviews to come. >> Thank you. >> With that, this is George Gilbert. We are DataWorks 2017 in San Jose. We had a great day today. We learned a lot from Rob Bearden and Rob Thomas up front about the IBM deal. We had Scott Gnau, CTO of Hortonworks on several times, and we've come away with an appreciation for a partnership now between IBM and Hortonworks that can take the two of them into a set of use cases that neither one on its own could really handle before. So today was a significant day. Tune in tomorrow, we have another great set of guests. Keynotes start at nine, and our guests will be on starting at 11. So with that, this is George Gilbert, signing out. Have a good night. (energetic, echoing chord and drum beat)
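(For reference on the latency discussion above, here is a self-contained sketch of how a trigger is expressed in structured streaming, using Spark's built-in rate source for test data. The micro-batch trigger shown is what shipped around the time of this conversation; the millisecond-class continuous mode Chow alludes to arrived slightly later, in Spark 2.3, via the commented-out continuous trigger.)

```python
# Trigger-selection sketch using the built-in "rate" test source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 1000).load()

query = (stream.writeStream
         .format("console")
         .outputMode("append")
         .trigger(processingTime="1 second")   # micro-batch once per second
         # .trigger(continuous="1 second")     # continuous mode, Spark 2.3+
         .start())
query.awaitTermination()
```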

Published Date : Jun 13 2017


Kenneth Knowles, Google - Flink Forward - #FFSF17 - #theCUBE


 

>> Welcome everybody, we're at the Flink Forward conference in San Francisco, at the Kabuki Hotel. Flink Forward U.S. is the first U.S. user conference for the Flink community sponsored by data Artisans, the creators of Flink, and we're here with special guest Kenneth Knowles-- >> Hi. >> Who works for Google and who heads up the Apache Beam Team where, just to set context, Beam is the API Or STK on which developers can build stream processing apps that can be supported by Google's Dataflow, Apache Flink, Spark, Apex, among other future products that'll come along. Ken, why don't you tell us, what was the genesis of Beam, and why did Google open up sort of the API to it. >> So, I can speak as an Apache Beam Team PMC member, that the genesis came from a combined code donation to Apache from Google Cloud Dataflow STK and there was also already written by data Artisans a Flink runner for that, which already included some portability hooks, and then there was also a runner for Spark that was written by some folks at PayPal. And so, sort of those three efforts pointed out that it was a good time to have a unified model for these DAG-based computational... I guess it's a DAG-based computational model. >> Okay, so I want to pause you for a moment. >> Yeah. >> And generally, we try to avoid being rude and cutting off our guests but, in this case, help us understand what a DAG is, and why it's so important. >> Okay, so a DAG is a directed acyclic graph, and, in some sense, if you draw a boxes and arrows diagram of your computation where you say "I read some data from here," and it goes through some filters and then I do a join and then I write it somewhere. These all end up looking what they call the DAG just because of the fact that it is the structure, and all computation sort of can be modeled this way, and in particular, these massively parallel computations profit a lot from being modeled this way as opposed to MapReduce because the fact that you have access to the entire DAG means you can perform transformations and optimizations and you have more opportunities for executing it in different ways. >> Oh, in other words, because you can see the big picture you can find, like, the shortest path as opposed to I've got to do this step, I've got to do this step and this step. >> Yeah, it's exactly like that, you're not constrained to sort of, the person writing the program knows what it is that they want to compute, and then, you know, you have very smart people writing the optimizer and the execution engine. So it may execute an entirely different way, so for example, if you're doing a summation, right, rather than shuffling all your data to one place and summing there, maybe you do some partial summations, and then you just shuffle accumulators to one place, and finish the summation, right? >> Okay, now let me bump you up a couple levels >> Yeah. >> And tell us, so, MapReduce was a trees within the forest approach, you know, lots of seeing just what's a couple feet ahead of you. And now we have the big picture that allows you to find the best path, perhaps, one way of saying it. Tell us though, with Google or with others who are using Beam-compatible applications, what new class of solutions can they build that you wouldn't have done with MapReduce before? >> Well, I guess there's... There's two main aspects to Beam that I would emphasize, there's the portability, so you can write this application without having to commit to which backend you're going to run it on. And there's... 
There's also the unification of streaming and batch which is not present in a number of backends, and Beam as this layer sort of makes it very easy to use sort of batch-style computation and streaming-style computation in the same pipeline. And actually I said there was two things, the third thing that actually really opens things up is that Beam is not just a portability layer across backends, it's also a portability layer across languages, so, something that really only has preliminary support on a lot of systems is Python, so, for example, Beam has a Python STK where you write a DAG description of your computation in Python, and via Beam's portability API's, one of these sort of usually Java-centric engines would be able to run that Python pipeline. >> Okay, so-- >> So, did I answer your question? >> Yes, yes, but let's go one level deeper, which is, if MapReduce, if its sweet spot was web crawl indexing in batch mode, what are some of the things that are now possible with a Beam-style platform that supports Beam, you know, underneath it, that can do this direct acyclic graph processing? >> I guess what I, I'm still learning all the different things that you can do with this style of computation, and the truth is it's just extremely general, right? You can set up a DAG, and there's a lot of talks here at Flink Forward about using a stream processor to do high frequency trading or fraud detection. And those are completely different even though they're in the same model of computation as, you know, you would still use it for things like crawling the web and doing PageRank over. Actually, at the moment we don't have iterative computations so we wouldn't do PageRank today. >> So, is it considered a complete replacement, and then new used cases for older style frameworks like MapReduce, or is it a complement for things where you want to do more with data in motion or lower latency? >> It is absolutely intended as a full replacement for MapReduce, yes, like, if you're thinking about writing a MapReduce pipeline, instead you should write a Beam pipeline, and then you should benchmark it on different Beam backends, right? >> And, so, working with Spark, working with Flink, how are they, in terms of implementing the full richness of the Beam-interface relative to the Google product Dataflow, from which I assumed Beam was derived? >> So, all of the different backends exist in sort of different states as far as implementing the full model. One thing I really want to emphasize is that Beam is not trying to take the intersection on all of these, right? And I think that your question already shows that you know this, we keep sort of a matrix on our website where we say, "Okay there's all these different "features you might want, "and then there's all these backends "you might want to run it on," and it's sort of there's can you do it, can you do it sometimes, and notes about that, we want this whole matrix to be, yes, you can use all of the model on Flink, all of it on Spark, all of it on Google Cloud Dataflow, but so they all have some gaps and I guess, yeah, we're really welcoming contributors in that space. >> So, for someone whose been around for a long time, you might think of it as an ODBC driver, where the capabilities of the databases behind it are different, and so the drivers can only support some subset of a full capability. 
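(A small sketch of the portability points Knowles makes above: the same pipeline, written once against Beam's Python SDK, can be handed to different runners just by changing pipeline options, and the CombinePerKey step is exactly the kind of operation a runner may optimize with partial sums before shuffling, as described earlier. The runner names are real Beam runners; the file paths are placeholders.)

```python
# Word-count style DAG in the Beam Python SDK; swap the runner to retarget it.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # or "FlinkRunner", "DataflowRunner"

with beam.Pipeline(options=options) as p:
    (p
     | "Read"   >> beam.io.ReadFromText("/tmp/input.txt")
     | "Words"  >> beam.FlatMap(lambda line: line.split())
     | "Pair"   >> beam.Map(lambda w: (w, 1))
     | "Count"  >> beam.CombinePerKey(sum)      # runner may pre-combine before the shuffle
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}\t{kv[1]}")
     | "Write"  >> beam.io.WriteToText("/tmp/word_counts"))
```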
>> Yeah, I think that there's, so, I'm not familiar enough with ODBC to say absolutely yes, absolutely no, but yes, it's that sort of a thing, it's like the JVM has many languages on it and ODBC provides this generic database abstraction. >> Is Google's goal with Beam API to make it so that customers demand a level of portability that goes not just for the on-prim products but for products that are in other public clouds, and sort of pry open the API lock in? >> So, I can't say what Google's goals are, but I can certainly say that Beam's goals are that nobody's going to be locked into a particular backend. >> Okay. >> I mean, I can't even say what Beam's goals are, sorry, those are my goals, I can speak for myself. >> Is Beam seeing so far adoption by the sort of big consumer internet companies, or has it started to spread to mainstream enterprises, or is still a little immature? >> I think Beam's still a little bit less mature than that, we're heading into our first stable release, so, we began incubating it as an Apache project about a year ago, and then, around the beginning of the new year, actually right at the end of 2016, we graduated to be an Apache top level project, so right now we're sort of on the road from we've become a top level project, we're seeing contributions ramp up dramatically, and we're aiming for a stable release as soon as possible, our next release we expect to be a stable API that we would encourage users and enterprises to adopt I think. >> Okay, and that's when we would see it in production form on the Google Cloud platform? >> Well, so the thing is that the code and the backends behind it are all very mature, but, right now, we're still sort of like, I don't know how to say it, we're polishing the edges, right, it's still got a lot of rough edges and you might encounter them if you're trying it out right now and things might change out from under you before we make our stable release. >> Understood. >> Yep. All right. Kenneth, thank you for joining us, and for the update on the Beam project and we'll be looking for that and seeing its progress over the next few months. >> Great. Thanks for having me. >> With that, I'm George Gilbert, I'm with Kenneth Knowles, we're at the dataArtisan's Flink Forward user conference in San Francisco at the Kabuki Hotel and we'll be back after a few minutes.

Published Date : Apr 15 2017


Jack Norris - Hadoop Summit 2014 - theCUBE - #HadoopSummit


 

>>theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks (we do Hadoop) and headline sponsor WANdisco (we make Hadoop invincible). >>Okay, welcome back everyone, live here in Silicon Valley in San Jose. This is Hadoop Summit. This is SiliconANGLE and Wikibon's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, joined by my cohost Jeff Kelly, top big data analyst in the, in the community. Our next guest is Jack Norris, CMO of MapR. Security, enterprise: that's the buzz of the show, and it was the buzz of OpenStack Summit, another open source show. And here this year you're just seeing move after move, talking about a couple of critical issues, enterprise grade Hadoop. Hortonworks announced a big acquisition, went all in, as they said, and now Cloudera follows suit with their news today. Are you sitting back saying they're catching up to you guys? I mean, how do you look at that? Because you guys have the security stuff nailed down. So how do you feel about that now? >>I think, if you look at the kind of Hadoop market, it's definitely moving from a test, experimental phase into a production phase. We've got tremendous customers across verticals that are doing some really interesting production use cases. And we recognized very early on that to really meet the needs of customers required some architectural innovation. So combining the open source ecosystem packages with some innovations underneath to really deliver high availability, data protection, disaster recovery features; security is part of that. But if you can't protect the data, if you can't have multitenancy and separate workflows across the cluster, then it doesn't matter how secure it is. You know, you need those. >>I've got to ask you a direct question, since we're here at Hadoop Summit, because we get this question all the time: SiliconANGLE Wikibon is so successful, but people just don't understand the business model, it's free content with some underwriters. So you guys have been very successful, yet people look at MapR as kind of the quiet leader, like, you're doing your business, you're making money. Jeff had some numbers with us that in the Hadoop community, about 20% are paying subscriptions. That's unlike your business model. So explain to the folks out there the business model, and specifically the traction, because you have customers. >>Yeah. Oh no, we've got, we've got over 500 paying customers. We've got at least a $1 million customer in seven different verticals. So we've got breadth and depth, and our business model is simple. We're an enterprise software company that's looking at how to provide the best of open source as well as innovations underneath. >>The most open distribution of Hadoop. But you add that value separately to that, right? So it's not so much that you're proprietary at all. Right. Okay. >>To clarify that, right: if you look at this exciting ecosystem, Hadoop is fairly early in its life cycle. In a commoditization phase like Linux, or relational databases with MySQL, open source kind of equates to the whole technology; here, at the beginning of this life cycle, the early stages of the life cycle, there are some architectural innovations that are really required. If you look at Hadoop, it's an append only file system relying on Linux. And that really limits the types of operations.
That types of use cases that you can do. What map ours done is provide some deep architectural innovations, provide complete read-write file systems to integrate data protection with snapshots and mirroring, et cetera. So there's a whole host of capabilities that make it easy to integrate enterprise secure and, and scale much better. Do you think, >>I feel like you were maybe a little early to the market in the sense that we heard Merv Adrian and his keynote this morning. Talk about, you know, it's about 10 years when you start to get these questions about security and governance and we're about nine years into Hadoop. Do you feel like maybe you guys were a little early and now you're at a tipping point, whereas these more, as more and more deployments get ready to go to production, this is going to be an area that's going to become increasingly important. >>I think, I think our timing has been spectacular because we, we kind of came out at a time when there was some customers that were really serious about Hadoop. We were able to work closely with them and prove our technology. And now as the market is just ramping, we're here with all of those features that they need. And what's a, what's an issue. Is that an incremental improvement to provide those kind of key features is not really possible if the underlying architecture isn't there and it's hard to provide, you know, online real-time capabilities in a underlying platform that's append only. So the, the HDFS layer written in Java, relying on the Linux file system is kind of the, the weak underbelly, if you will, of, of the ecosystem. There's a lot of, a lot of important developments happening yarn on top of it, a lot of really kind of exciting things. So we're actively participating in including Apache drill and on top of a complete read-write file system and integrated Hindu database. It just makes it all come to life. >>Yeah. I mean, those things on top are critical, but you know, it's, it's the underlying infrastructure that, you know, we asked, we keep on community about that. And what's the, what are the things that are really holding you back from Paducah and production and the, and the biggest challenge is they cited worth high availability, backup, and recovery and maintaining performance at scale. Those are the top three and that's kind of where Matt BARR has been focused, you know, since day one. >>So if you look at a major retailer, 2000 nodes and map bar 50 unique applications running on a single cluster on 10,000 jobs a day running on top of that, if you look at the Rubicon project, they recently went public a hundred million add actions, a hundred billion ad auctions a day. And on top of that platform, beats music that just got acquired for $3 billion. Basically it's the underlying map, our engine that allowed them to scale and personalize that music service. So there's a, there's a lot of proof points in terms of how quickly we scale the enterprise grade features that we provide and kind of the blending of deep predictive analytics in a batch environment with online capabilities. >>So I got to ask you about your go to market. I'll see Cloudera and Hortonworks have different business models. Just talk about that, but Cloudera got the massive funding. So you get this question all the time. What do you, how do you counter that army and the arms race? I think >>I just wrote an article in Forbes and he says cash is not a strategy. And I think that was, that was an excellent, excellent article. 
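(An illustrative sketch of the read-write point Norris is making: when the cluster is exposed as an ordinary POSIX file system, for example over an NFS mount such as /mapr/<cluster>, an everyday program can seek into an existing file and overwrite bytes in place, the kind of random write an append-only HDFS file does not allow. The mount path below is assumed.)

```python
# Plain file I/O against an assumed MapR NFS mount path.
PATH = "/mapr/demo.cluster/data/readings.bin"

with open(PATH, "r+b") as f:     # open an existing file for in-place update
    f.seek(4096)                 # jump to an arbitrary offset
    f.write(b"\x00" * 16)        # overwrite 16 bytes without rewriting the file

with open(PATH, "ab") as f:      # appends, of course, work too
    f.write(b"new record\n")
```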
And he goes in and, you know, in this fast growing market, you know, an amount of money isn't necessarily translate to architectural innovations or speeding the development of that. This is a fairly fragmented ecosystem in terms of the stack that runs on top of it. There's no single application or single vendor that kind of drives value. So an acquisition strategy is >>So your field Salesforce has direct or indirect, both mixable. How do you handle the, because Cloudera has got feet on the street and every squirrel will find it, not if they're parked there, parking sales reps and SCS and all the enterprise accounts, you know, they're going to get the, squirrel's going to find a nut once in awhile. Yeah. And they're going to actually try to engage the clients. So, you know, I guess it is a strategy if they're deploying sales and marketing, right? So >>The beauty about that, and in fact, we're all in this together in terms of sharing an API and driving an ecosystem, it's not a fragmented market. You can start with one distribution and move to another, without recompiling or without doing any sort of changes. So it's a fairly open community. If this were a vendor lock-in or, you know, then spending money on brand, et cetera, would, would be important. Our focus is on the, so the sales execution of direct sales, yes, we have direct sales. We also have partners and it depends on the geographies as to what that percentage is. >>And John Schroeder on with the HP at fifth big data NYC has updated the HP relationship. >>Oh, excellent. In fact, we just launched our application gallery app gallery, make it very easy for administrators and developers and analysts to get access and understand what's available in the ecosystem. That's available directly on our website. And one of the featured applications there today is an integration with the map, our sandbox and HP Vertica. So you can get early access, try it and get the best of kind of enterprise grade SQL first, >>First Hadoop app store, basically. Yeah. If you want to call it that way. Right. So like >>Sure. Available, we launched with close to 30, 30 with, you know, a whole wave kind of following that. >>So talk a little bit about, you know, speaking of verdict and kind of the sequel on Hadoop. So, you know, there's a lot of talk about that. Some confusion about the different methods for applying SQL on predicts or map art takes an open approach. I know you'll support things like Impala from, from a competitor Cloudera, talk about that approach from a map arts perspective. >>So I guess our, our, our perspective is kind of unbiased open source. We don't try to pick and choose and dictate what's the right open source based on either our participation or some community involvement. And the reality is with multiple applications being run on the platform, there are different use cases that make difference, you know, make different sense. So whether it's a hive solution or, you know, drill drills available, or HP Vertica people have the choice. And it's part of, of a broad range of capabilities that you want to be able to run on the platform for your workflows, whether it's SQL access or a MapReduce or a spark framework shark, et cetera. >>So, yeah, I mean there is because there's so many different there's spark there's, you know, you can run HP Vertica, you've got Impala, you've got hive. And the stinger initiative is, is that whole kind of SQL on Hadoop ecosystem, still working itself out. 
Are we going to have this many options in a year or two years from now? Or are they complimentary and potentially, you know, each has its has its role. >>I think the major differences is kind of how it deals with the new data formats. Can it deal with self-describing data? Sources can leverage, Jason file does require a centralized metadata, and those are some of the perspectives and advantages say the Apache drill has to expand the data sets that are possible enabled data exploration without dependency on a, on an it administrator to define that, that metadata. >>So another, maybe not always as exciting, but taking workloads from existing systems, moving them to Hadoop is one of the ways that a lot of people get started with, to do whether associated transformation workloads or there's something in that vein. So I know you've announced a partnership with Syncsort and that's one of the things that they focus on is really making it as easy as possible to meet those. We'll talk a little bit about that partnership, why that makes sense for you and, and >>When your customer, I think it's a great proof point because we announced that partnership around mainframe offload, we have flipped comScore and experience in that, in that press release. And if you look at a workload on a mainframe going to duke, that that seems like that's a, that's really an oxymoron, but by having the capabilities that map R has and making that a system of record with that full high availability and that data protection, we're actually an option to offload from mainframe offload, from sand processing and provide a really cost effective, scalable alternative. And we've got customers that had, had tried to offload from the mainframe multiple times in the past, on successfully and have done it successfully with Mapbox. >>So talk a little bit more about kind of the broader partnership strategy. I mean, we're, we're here at Hadoop summit. Of course, Hortonworks talks a lot about their partnerships and kind of their reseller arrangements. Fedor. I seem to take a little bit more of a direct approach what's map R's approach to kind of partnering and, and as that relates to kind of resell arrangements and things like, >>I think the app gallery is probably a great proof point there. The strategy is, is an ecosystem approach. It's having a collection of tools and applications and management facilities as well as applications on top. So it's a very open strategy. We focus on making sure that we have open API APIs at that application layer, that it's very easy to get data in and out. And part of that architecture by presenting standard file system format, by allowing non Java applications to run directly on our platform to support standard database connections, ODBC, and JDBC, to provide database functionality. In addition to kind of this deep predictive analytics really it's about supporting the broadest set of applications on top of a single platform. What we're seeing in this kind of this, this modern architecture is data gravity matters. And the more processing you can do on a single platform, the better off you are, the more agile, the more competitive, right? >>So in terms of, so you're partnering with people like SAS, for example, to kind of bring some of the, some of the analytic capabilities into the platform. 
Can you kind of tell us a little bit about any >>Companies like SAS and revolution analytics and Skytree, and I mean, just a whole host of, of companies on the analytics side, as well as on the tools and visualization, et cetera. Yeah. >>Well, I mean, I, I bring up SAS because I think they, they get the fact that the, the whole data gravity situation is they've got it. They've got to go to where the data is and not have the data come to them. So, you know, I give them credit for kind of acknowledging that, that kind of big data truth ism, that it's >>All going to the data, not bringing the data >>To the computer. Jack talk about the success you had with the customers had some pretty impressive numbers talking about 500 customers, Merv agent. The garden was on with us earlier, essentially reiterating not mentioning that bar. He was just saying what you guys are doing is right where the puck is going. And some think the puck is not even there at the same rink, some other vendors. So I gotta give you props on that. So what I want you to talk about the success you have in specifically around where you're winning and where you're successful, you guys have struggled with, >>I need to improve on, yeah, there's a, there's a whole class of applications that I think Hadoop is enabling, which is about operations in analytics. It's taking this, this higher arrival rate machine generated data and doing analytics as it happens and then impacting the business. So whether it's fraud detection or recommendation engines, or, you know, supply chain applications using sensor data, it's happening very, very quickly. So a system that can tolerate and accept streaming data sources, it has real-time operations. That is 24 by seven and highly available is, is what really moves the needle. And that's the examples I used with, you know, add a Rubicon project and, you know, cable TV, >>The very outcome. What's the primary outcomes your clients want with your product? Is it stability? And the platform has enabled development. Is there a specific, is there an outcome that's consistent across all your wins? >>Well, the big picture, some of them are focused on revenues. Like how do we optimize revenue either? It's a new data source or it's a new application or it's existing application. We're exploding the dataset. Some of it's reducing costs. So they want to do things like a mainframe offload or data warehouse offload. And then there's some that are focused on risk mitigation. And if there's anything that they have in common it's, as they moved from kind of test and looked at production, it's the key capabilities that they have in enterprise systems today that they want to make sure they're in Hindu. So it's not, it's not anything new. It's just like, Hey, we've got SLS and I've got data protection policies, and I've got a disaster recovery procedure. And why can't I expect the same level of capabilities in Hindu that I have today in those other systems. >>It's a final question. Where are you guys heading this year? What's your key objectives. Obviously, you're getting these announcements as flurry of announcements, good success state of the company. How many employees were you guys at? Give us a quick update on the numbers. >>So, you know, we just reported this incredible momentum where we've tripled core growth year over year, we've added a tremendous amount of customers. We're over 500 now. So we're basically sticking to our knitting, focusing on the customers, elevating the proof points here. 
Some of the most significant customers we have in the telco and financial services and healthcare and, and retail area are, you know, view this as a strategic weapon view, this is a huge competitive advantage, and it's helping them impact their business. That's really spring our success. We've, you know, we're, we're growing at an incredible clip here and it's just, it's a great time to have made those calls and those investments early on and kind of reaping the benefits. >>It's. Now I've always said, when we, since the first Hadoop summit, when Hortonworks came out of Yahoo and this whole community kind of burst open, you had to duke world. Now Riley runs at it's a whole different vibe of itself. This was look at the developer vibe. So I got to ask you, and we would have been a big fan. I mean, everyone has enough beachhead to be successful, not about map arbors Hortonworks or cloud air. And this is why I always kind of smile when everyone goes, oh, Cloudera or Hortonworks. I mean, they're two different animals at this point. It would do different things. If you guys were over here, everyone has their quote, swim lanes or beachhead is not a lot of super competition. Do you think, or is it going to be this way for awhile? What's your fork at some? At what point do you see more competition? 10 years out? I mean, Merv was talking a 10 year horizon for innovation. >>I think that the more people learn and understand about Hadoop, the more they'll appreciate these kind of set of capabilities that matter in production and post-production, and it'll migrate earlier. And as we, you know, focus on more developer tools like our sandbox, so people can easily get experienced and understand kind of what map are, is. I think we'll start to see a lot more understanding and momentum. >>Awesome. Jack Norris here, inside the cube CMO, Matt BARR, a very successful enterprise grade, a duke player, a leader in the space. Thanks for coming on. We really appreciate it. Right back after the short break you're live in Silicon valley, I had dupe December, 2014, the right back.
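(For reference on the Apache Drill and self-describing data points made earlier in this segment, here is a hedged sketch of querying raw JSON files with SQL through Drill's REST endpoint, with no centrally defined metadata. The Drillbit host, workspace path, and files are placeholders, and the response handling assumes Drill's standard JSON query API.)

```python
# Hedged sketch: submit SQL over JSON files via a Drillbit's REST API.
import requests

DRILL_URL = "http://drillbit.example.com:8047/query.json"

sql = """
SELECT t.user_id, count(*) AS events
FROM dfs.`/data/clickstream/*.json` t
GROUP BY t.user_id
ORDER BY events DESC
LIMIT 10
"""

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql}, timeout=60)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```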

Published Date: Jun 4, 2014

SUMMARY :

Jack Norris, CMO of MapR, joins theCUBE at Hadoop Summit 2014 in San Jose. He explains how MapR adds the enterprise-grade capabilities customers expect in production — security, SLAs, data protection, and disaster recovery — on top of open-source Hadoop, and describes a growing class of applications that combine operations and analytics, acting on high-arrival-rate, machine-generated data for use cases such as fraud detection, recommendation engines, and sensor-driven supply chains. The conversation also covers SQL on Hadoop and Apache Drill, running engines such as HP Vertica on the same platform, mainframe and data warehouse offload, MapR's go-to-market and partner strategy, and the company's momentum: growth tripled year over year and more than 500 customers across telco, financial services, healthcare, and retail.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Jeff Kelly | PERSON | 0.99+
Jack Norris | PERSON | 0.99+
John Schroeder | PERSON | 0.99+
HP | ORGANIZATION | 0.99+
Jeff | PERSON | 0.99+
$3 billion | QUANTITY | 0.99+
December, 2014 | DATE | 0.99+
Jason | PERSON | 0.99+
Matt BARR | PERSON | 0.99+
10,000 jobs | QUANTITY | 0.99+
Today | DATE | 0.99+
10 year | QUANTITY | 0.99+
Syncsort | ORGANIZATION | 0.99+
Dan | PERSON | 0.99+
Silicon valley | LOCATION | 0.99+
John barrier | PERSON | 0.99+
Java | TITLE | 0.99+
Yahoo | ORGANIZATION | 0.99+
10 years | QUANTITY | 0.99+
24 | QUANTITY | 0.99+
Hadoop | TITLE | 0.99+
Cloudera | ORGANIZATION | 0.99+
Hortonworks | ORGANIZATION | 0.99+
this year | DATE | 0.99+
Jack | PERSON | 0.99+
fifth | QUANTITY | 0.99+
Linux | TITLE | 0.99+
Skytree | ORGANIZATION | 0.99+
each | QUANTITY | 0.99+
both | QUANTITY | 0.99+
today | DATE | 0.98+
one | QUANTITY | 0.98+
Merv | PERSON | 0.98+
about 10 years | QUANTITY | 0.98+
San Jose | LOCATION | 0.98+
Hadoop | EVENT | 0.98+
about 20% | QUANTITY | 0.97+
seven | QUANTITY | 0.97+
over 500 | QUANTITY | 0.97+
a year | QUANTITY | 0.97+
about 500 customers | QUANTITY | 0.97+
SQL | TITLE | 0.97+
seven different verticals | QUANTITY | 0.97+
two years | QUANTITY | 0.97+
single platform | QUANTITY | 0.96+
2014 | DATE | 0.96+
Apache | ORGANIZATION | 0.96+
Hadoop | LOCATION | 0.95+
SiliconANGLE | ORGANIZATION | 0.94+
comScore | ORGANIZATION | 0.94+
single vendor | QUANTITY | 0.94+
day one | QUANTITY | 0.94+
Salesforce | ORGANIZATION | 0.93+
about nine years | QUANTITY | 0.93+
Hadoop Summit 2014 | EVENT | 0.93+
Merv | ORGANIZATION | 0.93+
two different animals | QUANTITY | 0.92+
single application | QUANTITY | 0.92+
top three | QUANTITY | 0.89+
SAS | ORGANIZATION | 0.89+
Riley | PERSON | 0.88+
First | QUANTITY | 0.87+
Forbes | TITLE | 0.87+
single cluster | QUANTITY | 0.87+
Mapbox | ORGANIZATION | 0.87+
map R | ORGANIZATION | 0.86+
map | ORGANIZATION | 0.86+