Ed Walsh and Thomas Hazel, ChaosSearch
>> Welcome to theCUBE, I'm Dave Vellante. And today we're going to explore the ebb and flow of data as it travels into the cloud and the data lake. The concept of the data lake was alluring when it was first coined last decade by Pentaho CTO James Dixon. Rather than being limited to highly structured and curated data that lives in a relational database, in the form of an expensive and rigid data warehouse or data mart, a data lake is formed by flowing data from a variety of sources into a scalable repository, like, say, an S3 bucket, that anyone can access, dive into, extract water, a.k.a. data, from, and analyze data that's much more fine-grained and less expensive to store at scale. The problem became that organizations started to dump everything into their data lakes with no schema on write, no metadata, no context, just shoving it into the data lake to figure out what's valuable at some point down the road. Kind of reminds you of your attic, right? Except this is an attic in the cloud, so it's too big to clean out over a weekend. Well look, it's 2021 and we should be solving this problem by now. A lot of folks are working on it, but often the solutions add other complexities for technology pros. So to understand this better, we're going to enlist the help of ChaosSearch CEO Ed Walsh, and Thomas Hazel, the CTO and founder of ChaosSearch. We're also going to speak with Kevin Miller, who's the vice president and general manager of S3 at Amazon Web Services, and of course they manage the largest and deepest data lakes on the planet. And we'll hear from a customer to get their perspective on this problem and how to go about solving it. But let's get started. Ed, Thomas, great to see you. Thanks for coming on theCUBE. >> Likewise. >> Face to face, it's really good to be here. >> It is nice face to face. >> It's great. >> So, Ed, let me start with you. We've been talking about data lakes in the cloud forever. Why is it still so difficult to extract value from those data lakes? >> Good question. I mean, data analytics at scale has always been a challenge, right? We're making some incremental changes, but as you mentioned, we need to see some step-function changes. In fact, it's the reason ChaosSearch was founded. If you look at it, it's the same challenge around a data warehouse or a data lake: really, it's not just flowing the data in, it's how to get insights out. It falls into a couple of areas, but the business side will always complain, and it's uniform across everything in data lakes and data warehousing. They'll say, "Hey, listen, I typically have to deal with a centralized team to do that data prep, because it's data scientists and DBAs." Most of the time they're a centralized group; sometimes they're in business units, but mostly, because they're scarce resources, they're pooled together. And then it takes a lot of time. It's arduous, it's complicated, it's a rigid process dealing with that team. It's hard to add new data, it's very hard to share data, and there's no way to do governance without locking it down. And of course they'd like it to be more self-serve. So you hear that from the business side constantly. Now underneath, there are some real technology issues, like the fact that we haven't really changed the way we do data prep since the two thousands, right? So if you look at it, it falls into two big areas. One is how to do data prep: a request comes in from a business unit —
"I want to do X, Y, Z with this data. I want to use this type of tool set to do the following." Someone has to be smart about how to put that data in the right schema, as you mentioned. You have to put it in the right format, so the tool sets can analyze that data, before you do anything. And then the second thing — I'll come back to that, 'cause it's the biggest challenge — but the second challenge is how these different data lakes and data warehouses are persisting data, the complexity of managing that data, and also the cost of computing on it. And I'll go through that. But the biggest thing is actually getting from raw data: the rigidness and complexity the business side runs into is that literally someone has to do this ETL process, extract, transform, load. A request comes in — "I need this much data put together in this type of way" — and they're literally, physically duplicating data and putting it together in a schema. They're stitching together almost a data puddle for all these different requests. And anytime they have to do that, someone has to do it, and those very skilled resources are scarce in the enterprise, right? It's the DBAs and the data scientists. And then when they want new data — you give them a data set, and they're always saying, "What can I add to this data now that I've seen the reports? I want to add this data, fresher data." And the same process has to happen. This takes about 60% to 80% of the data scientists' and DBAs' time. It's well-documented. And this is what actually stops the process; this is what's rigid. They have to be rigid, because there's a process around it. That's the biggest challenge of doing this, and in an enterprise it takes weeks or months. I always say three weeks or three months, and no one challenges me beyond that. It also consumes the same skill set of people you want driving digital transformation, data warehousing initiatives, modernization, being data-driven — all these data scientists and DBAs you don't have enough of. So this is not only hurting you getting insights out of your data lakes and warehouses; this resource constraint is hurting those broader initiatives too. >> So that smallest atomic unit is that team, that super-specialized team, right? >> Right. >> Yeah, okay. So you guys talk about activating the data lake. >> Yep. >> For analytics. What's unique about that? What problems are you all solving, you know, when you guys created this magic sauce? >> No, basically, there are a lot of things. I highlighted the biggest one, which is how to do the data prep, but there's also how you're persisting and using the data. In the end, there are a lot of challenges in how to get analytics at scale, and this is really where Thomas and I founded the team to go after this. But I'll try to say it simply. What are we doing? I'll compare and contrast what we do to what you'd do with maybe an Elasticsearch cluster or a BI cluster. What we do is simple: you put your data in S3. Don't move it, don't transform it. In fact, we're against data movement. We literally point at that data, index it, and make it available in a data representation where you can give virtual views to end users. And those virtual views are available immediately over petabytes of data, and it actually gets presented to the end user as an open API. So if you're an Elasticsearch user, you can use all your Elasticsearch tools on this view. If you're a SQL user — Tableau, Looker, all the different tools — same thing, and the same with machine learning next year.
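(Editor's note: as a rough illustration of the pattern Ed describes — index the data where it sits in S3 and expose virtual views through existing APIs — here is a hedged Python sketch. Only the boto3 call is a real API; `ChaosSearchClient` and its methods are invented names for illustration, not ChaosSearch's actual interface.)

```python
import boto3

# Data is already landing in S3; nothing is moved or transformed.
# Assumes AWS credentials and an existing bucket named "app-logs".
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket="app-logs", Prefix="2021/")

class ChaosSearchClient:
    """Hypothetical client, invented for illustration."""

    def create_object_group(self, bucket: str, prefix: str) -> str:
        # Point the index at the raw objects in place; return a group id.
        return f"group::{bucket}/{prefix}"

    def create_view(self, group_id: str, name: str, schema: dict) -> None:
        # Publish a virtual view -- a definition, not a copy of the data.
        print(f"published view {name} over {group_id}: {schema}")

cs = ChaosSearchClient()
group = cs.create_object_group(bucket="app-logs", prefix="2021/")
cs.create_view(group, name="prod_errors",
               schema={"level": "keyword", "msg": "text", "ts": "date"})

# Consumers then hit the same view through the APIs they already use:
#   Elasticsearch tools:          GET /prod_errors/_search
#   SQL tools (Tableau, Looker):  SELECT * FROM prod_errors WHERE level = 'ERROR'
```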
So what we do is make it very simple. Your data is simply there — it's already in S3, it's already landing there. Point us at it. We do the hard part of indexing and making it available, and then we publish an open API, so your users can use exactly the tools they use today. That's dramatic, so I'll give you a before and after. Let's say you're doing Elasticsearch, you're doing log analytics at scale. They're landing their data in S3, and then they're doing ETL — physically duplicating and moving data, and typically deleting a lot of data — to get it into a format that Elasticsearch can use. They're persisting it in a data layer called Lucene, physically sitting in memory, CPU, SSDs — and not one of those, a bunch of those. In the cloud you have to set them up, because they're persisting on EC2; they stand up seven by 24, not a very cost-effective way to do cloud computing. What we do in comparison is literally point at the same S3. In fact, you can run in complete parallel with the data that's being ETLed out; we're just one more use case, read-only, allowing you to take that data and make these virtual views. So we run in complete parallel, but we just give a virtual view to the end users. We don't need this persistence layer — this extra cost layer, this extra time, cost, and complexity. So look at what happens in Elastic: they have a constraint, a trade-off between how much you can keep and how much you can afford to keep. And it becomes unstable over time, because you have to build out a schema, and it's on a server; the more the schema scales out, guess what? You have to add more servers — very expensive, and they're up seven by 24. And they become brittle: you lose one node, and the whole thing has to be put back together. We have none of that cost and complexity. You keep whatever you want in S3, a single persistence layer, very cost-effective. And on cost, we save 50 to 80%. Why? We don't go with the old paradigm of setting it up on servers, spinning them up for persistence, and keeping them up seven by 24. We literally ask, at query time, what do you want to query? We bring up the right compute resources, and then we release those resources after the query is done. So we can do queries at a scale they can't imagine, but we're also able to do the exact same query at 50 to 80% savings. And they don't have to do any of the toil of moving that data, or managing that layer of persistence, which is not only expensive, it becomes brittle. And then — I'll be quick — once you go to BI, it's the same challenge. With the BI systems, the requests are constantly coming from a business unit down to the centralized data team: give me this flavor of data, I want to use this analytic tool on that data set. So they have to build all these pipelines, constantly saying, okay, I'll give you this data and this data — duplicating that data, moving it, stitching it together. And the minute you want more data, they do the same process all over. We completely eliminate that.
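(Editor's note: a back-of-the-envelope model of the before-and-after economics Ed outlines — an always-on, seven-by-24 cluster versus compute brought up per query and released. Every number below is an assumption chosen for illustration; only the 50-to-80% band comes from Ed's claim.)

```python
HOURS_PER_MONTH = 730

# Before: an always-on search cluster persisting hot data on its nodes.
nodes = 12                    # assumed cluster size
node_cost_per_hour = 1.50     # assumed $/node-hour
always_on = nodes * node_cost_per_hour * HOURS_PER_MONTH  # runs seven by 24

# After: data stays in S3; compute is brought up per query, then released.
query_node_hours = 2_000      # assumed node-hours actually consumed by queries
ephemeral_compute = query_node_hours * node_cost_per_hour
s3_storage = 500              # assumed flat object-storage bill, $/month

after = ephemeral_compute + s3_storage
print(f"always-on: ${always_on:,.0f}/mo   on-demand: ${after:,.0f}/mo")
print(f"savings: {1 - after / always_on:.0%}")  # ~73%, inside the 50-80% band
```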
>> And those requests queue up. Thomas, to Ed's point, you don't have to move the data. That's kind of the exciting piece here, isn't it? >> Absolutely. You know, the data lake philosophy has always been solid, right? The problem is we had that Hadoop hangover, right? Where, let's say, we were using that platform in a few too many ways. So I always believed in the data lake philosophy. When James came and coined it, I was like, that's it. However, HDFS — that wasn't really a service. Cloud object storage is a service: the elasticity, the security, the durability, all those benefits are really why we founded on cloud object storage as a first move. >> So Thomas was talking about, you know, being able to essentially shut off the compute so you don't have to keep paying for it. But there are other vendors out there doing something similar — separating compute from storage, they're famous for that. And you have Databricks out there doing their lakehouse thing. Do you compete with those? How do you participate, and how do you differentiate? >> Well, you know, you've heard these terms: data lake, warehouse, now lakehouse. And what everybody wants is simple in, easy in. However, the problem with data lakes was complexity out — driving value. And I said, what if you could have the easy in and the value out? So if you look at, say, Snowflake as a warehousing solution, you have to do all that prep and data movement to get into that system, and it's rigid, static. Now Databricks, that lakehouse, has the exact same thing. Sure, they have a data lake philosophy, but their data ingestion is not data lake philosophy. So I said, what if we had that simple in, with a unique architecture and index technology that makes it virtually accessible, publishable, dynamically, at petabyte scale? So our service connects to the customer's cloud storage, streams the data in, sets up what we call a live indexing stream, and then you go to our data refinery and publish views that can be consumed via the Elastic API — with Kibana or Grafana — or as SQL tables in, say, Looker or Tableau. And so we're getting the benefits of both sides: schema-on-read flexibility with schema-on-write performance. If you can do that, that's the true promise of a data lake. Again, nothing against Hadoop, but schema-on-read, with all that complexity of software, turned into a bit of a data swamp.
>> Well, Hadoop got it started, okay, so we've got to give it credit. But everybody I talk to has got this big bunch of Spark clusters, now saying, all right, this doesn't scale, we're stuck. And so, you know, I'm a big fan of Zhamak Dehghani and her concept of the data mesh, and it's early days. But if you fast-forward to the end of the decade, what do you see as the critical components of this notion people call data mesh, of that analytics stack? You're a visionary, Thomas — how do you see this thing playing out over the next decade? >> I love her thought leadership, to be honest; our core principles were her core principles, you know, 5, 6, 7 years ago. So this idea of decentralization, data as a product, self-serve, and federated computational governance — all of that was our core principle. The trick is, how do you enable that mesh philosophy? I can say we're mesh-ready, meaning we can participate in a way that very few products can. If there are gates getting data into your system — the ETL, the schema management — well, my argument with the data mesh is that producers and consumers have the same rights. I want the consumers, the people, to choose how they want to consume that data, as well as the producer publishing it. And I can say our data refinery is that answer. You know, shoot, I'd love to open up a standard, right? Where we can really talk about producers and consumers and the rights each of them has. But I think she's right on the philosophy. I think as products mature in the cloud, in these data lake capabilities, the trick is those gates. If you have to structure up front, if you have to set up those pipelines, the chance of getting your data into a mesh is the weeks and months that Ed was mentioning. >> Well, I think you're right. I think the problem with data mesh today is the lack of standards. When you draw the conceptual diagrams, you've got a lot of lollipops, which are APIs, but they're all unique primitives. So there aren't standards by which, to your point, the consumer can take the data the way he or she wants it and build their own data products, without having to tap people on the shoulder to say, how can I use this? Where does the data live? And without being able to add their own data. >> You're exactly right. So I'm an organization, I'm generating data, and I can continuously stream it into a lake. And then with the service — the ChaosSearch service — the data is discoverable and configurable by the consumer. Say you go to the corner store: I want to make a certain meal tonight, so I want to pick and choose what I want, how I want it. Imagine if the data mesh truly had that — the producer of information offering all the things you can buy at a grocery store, and you choosing what you want to make for dinner. If it's static — if you have to call up your producer to make the change — was it really a data-mesh-enabled service? I would argue not. >> Ed, bring us home. >> Well, maybe one more thing on this. >> Please, yeah. >> 'Cause some of this, we're talking 2031, but largely these principles are what we have in production today, right? So even the self-service, where you can actually have business context on top of a data lake — we do that today. We talked about getting rid of the physical ETL, which is 80% of the work; the last 20% is done by this refinery, where you can do virtual views, with RBAC, and do all the transformation needed and make it available. And that's available as a role-based access service to your end users — analysts, actually. You don't have to be a data scientist or DBA. In the hands of a data scientist or DBA it's powerful, but the fact of the matter is, all of our employees, regardless of seniority, whether they're in finance or in sales, actually go through and learn how to do this. So you don't have to be IT. And they can come up with their own views, which is one of the things about data lakes that the business units want to do themselves. But more importantly, because they have the context of what they're trying to do, instead of queuing up a very specific request that takes weeks, they're able to do it themselves. >> And without having to put it in different data stores and ETL it, I can do things in real time or near real time. And that's game-changing, and something we haven't been able to do, ever.
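(Editor's note: a minimal sketch of the role-based virtual views Ed describes — one copy of the data in S3, with per-role view definitions that mask or filter rather than duplicate. The SQL, role names, and table names here are hypothetical.)

```python
# One copy of the data; each role gets a view definition, not a pipeline.
VIEWS = {
    # Finance sees full amounts, but customer identifiers are masked.
    "finance_analyst": """
        CREATE VIEW orders_finance AS
        SELECT order_id, amount, region, sha2(customer_email, 256) AS customer
        FROM s3_orders
    """,
    # Sales is row-filtered to its own region and never sees amounts.
    "sales_emea": """
        CREATE VIEW orders_sales AS
        SELECT order_id, customer_email, region
        FROM s3_orders
        WHERE region = 'EMEA'
    """,
}

def provision(role: str) -> str:
    # Granting access means publishing a definition; no data is copied,
    # so there is no per-request ETL pipeline to build and maintain.
    return VIEWS[role]

print(provision("sales_emea"))
```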
>> And then maybe just to wrap it up: listen, 8 years ago, Thomas and his group of founders came up with the concept of how you actually get after analytics at scale and solve the real problems. And it's not one thing — it's not just getting to S3, it's all these different things. What we have in market today is the ability to literally, simply stream it to S3; what we do is automate the process of getting the data into a representation that you can now share and augment, and then we publish an open API, so you can actually use the tools you want. First use case, log analytics: hey, it's easy to just stream your logs in, and we give you Elasticsearch types of services. Same thing with SQL, and you'll see machine learning go mainstream next year. So listen, I think we have the data lake, you know, 3.0 now, and we're just stretching our legs right now, having fun. >> Well, you had to say it — log analytics. But I really do believe in this concept of building data products and data services, because I want to sell them, I want to monetize them, and being able to do that quickly and easily, so people can consume them — that's the future. So guys, thanks so much for coming on the program. Really appreciate it.
Sanjeev Mohan, SanjMo & Nong Li, Okera | AWS Startup Showcase
(cheerful music) >> Hello everyone, welcome to today's session of theCUBE's presentation of the AWS Startup Showcase, New Breakthroughs in DevOps, Data Analytics, and Cloud Management Tools, featuring Okera from the cloud management and migration track. I'm John Furrier, your host. We've got two great special guests today: Nong Li, founder and CTO of Okera, and Sanjeev Mohan, principal at SanjMo and former research vice president of big data and advanced analytics at Gartner. He's a legend, been around the industry for a long time, seen the big data trends of the past and present, and knows the future. Got a great lineup here. Gentlemen, thank you for this: life in the trenches, lessons learned across compliance, cloud migration, analytics, and use cases for the Fortune 1000. Thanks for joining us. >> Thanks for having us. >> So Sanjeev, great to see you. I know you've seen this movie — as I was saying in the open, at Gartner you saw all the visionaries, the leaders; you know everything about this space. It's changing extremely fast, and one of the big topics right out of the gate is not just innovation — we'll get to that, that's the fun part — but the regulatory compliance and audit piece of it. It's keeping people up at night, and frankly, if not done right, it slows things down. This is a big part of the showcase here: to solve these problems. Share your thoughts — what's your take on this wide-ranging issue? >> So, thank you, John, for bringing this up, and I'm so happy you mentioned the fact that there's this notion that it can slow things down. Well, I have to say that the old way of doing governance slowed things down, because it was very much about command and control. But the new approach to data governance is, in my opinion, actually liberating data. If you want to democratize or monetize — whatever you want to call it — you cannot do it till you know you can trust that data and it's governed in some way. So data governance has actually become very interesting. And today, to talk about three different areas within regulatory compliance, for example: we all know about the EU's GDPR; we know California has CCPA, and in fact California is now getting an even more stringent version called CPRA in a couple of years, which is more aligned with GDPR. That's the first area; we know we need to comply with it, we don't have any way out. But then there are other areas: there's insider trading, and there's how you secure the data that comes from third parties — you know, vendors, partners, suppliers. So Nong, I'd love to hand it over to you and see if you can maybe throw some light on how our customers are handling these use cases. >> Yeah, absolutely, and I love what you said about balancing agility and liberation in the face of what may be seen as things that slow you down. So we work with customers across verticals, with old and new regulations — so you know, you brought up GDPR. One of our clients is using this to great effect to power their ecosystem. They're a very large retail company with operations and customers across the world; obviously the importance of GDPR, and the regulations it imposes on them, is very top of mind, and at the same time, being able to do effective targeting analytics on customer information is equally critical, right? So they're exactly at that spot where they need this customer insight to power their business, and the regulatory concerns are extremely prevalent for them.
So in the context of GDPR, you'll hear about things like consent management and the right to be forgotten, right? I, as a customer of that retailer, should be able to say, "I don't want my information used for this purpose. Use it for this, but not this." And you can imagine, at very, very large scale, when you have a billion customers, managing that — all the data you've collected over time through all of your devices, all of your telemetry — is really, really challenging. And they're leveraging Okera, embedded into their analytics platform, so they can do both, right? Their data scientists and analysts who need to do everything they do to power the business don't have to think about these very granular customer filtering requirements, and they leverage us to handle that. So that's kind of new, right — GDPR, relatively new stuff at this point. But we obviously also work with customers that have regulations from a long, long time ago, right? You also mentioned insider trading and the supply chain. So we'll talk to customers who want really data-driven decisions on their supply chain, everything about their production pipeline. They want to understand all of that, and of course that makes sense: whether you're the CFO or whoever is making business decisions, you need that information readily available. And supply chains, as we know, get more and more complex, more and more integrated into manufacturing and other verticals. So you're a little bit stuck, right? You want to be data-driven on those supply chain analytics, but at the same time, knowing the details of the supply chain across all of your dependencies exposes your internal team to long blackout periods or insider trading concerns, right? For example, if you knew Apple was buying a bunch of something, that's information maybe only a select few people can have. And the way that manifests in data policies is that you need very, very scalable, per-employee data restriction policies, so people can do their jobs more easily, right? If we talk about speeding things up: instead of a very complex process to get them approved under SEC regulations and all that kind of stuff, you can now give them access to exactly the part of the supply chain they need, and no more, limiting their exposure and the company's exposure. One of our customers is able to do this with a two-orders-of-magnitude, 100x reduction in the policies needed to manage a system like that.
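(Editor's note: a minimal sketch of the consent-aware filtering Nong describes, assuming a per-purpose consent flag is stored with each customer record. The shape of the policy — default-deny, applied by the platform rather than by the analyst — is the point; the names and structures are invented, not Okera's implementation.)

```python
from dataclasses import dataclass, field

@dataclass
class Customer:
    id: str
    email: str
    # Per-purpose consent, e.g. {"targeting": True, "resale": False}
    consents: dict = field(default_factory=dict)

def rows_for_purpose(customers, purpose: str):
    """Yield only rows the customer has consented to for this purpose.
    Default is deny: a revoked or missing consent simply drops the row,
    and a forgotten (deleted) customer never appears at all -- the
    analyst's query never has to know the rule exists."""
    for c in customers:
        if c.consents.get(purpose, False):
            yield c

crowd = [
    Customer("c1", "a@example.com", {"targeting": True}),
    Customer("c2", "b@example.com", {"targeting": False}),
    Customer("c3", "c@example.com"),  # no recorded consent -> excluded
]
assert [c.id for c in rows_for_purpose(crowd, "targeting")] == ["c1"]
```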
>> When I hear you talking like that, I think the old days of "Oh yeah, regulatory, it kind of slows down innovation, got to go faster" had pretty basic variables, not a lot of combinations of things to check. Now, with cloud, there seem to be combinations. Sanjeev, how complicated has the regulatory compliance and audit environment gotten in the past few years? Because I hear security in the supply chain, I hear insider threats — these are security channels, not just compliance-department, G&A kinds of functions. You're talking about large-scale, potentially combinatorial access and distribution. It seems complicated. How much more complicated is it now than it was a few years ago? >> So, the way I look at it — and I'm mentioning these companies just as an example — when PayPal or eBay, all these companies, started, they started in California. Anybody who ever did business on eBay or PayPal, guess where that data was? In the US, in some data center. Today you cannot do that. Today, data residency laws are really tough, so now these organizations have to really understand what data needs to remain where. On top of that, we now have so many regulations. Earlier on, if you were in healthcare you needed to be HIPAA compliant, or in banking, PCI DSS. But today, in the cloud, you really need to know: what data do I have, what sensitive data do I have, how do I discover it? So data discovery becomes really important. And what roles do I have? For example, let's say I work for a bank in the US, and I decide to move to Germany. The old-school way is that a new rule gets created for me, because of German... >> John: New email address, all these new things happen, right? >> Right, exactly. So you end up with this mass of rules, and these are all static. >> Rules and tools, oh my god. >> Yeah. So Okera actually makes a lot of this dynamic, which reduces your cloud migration overhead. And Nong used some great examples — in fact, sorry if I take just a second, without mentioning any names: one of the largest banks in the world is going global in the digital space for the first time, and they're taking Okera with them. So... >> Well, that's the point, and this is my next topic: cloud migration. I want to bring this up because of the complexity. When you're in that old-school kind of data center, waterfall world, with these old rules and tools, you have to roll all this out, and it's a pain in the butt for everybody — a huge hassle. Cloud gives you the agility, we know that, and cloud's becoming more secure. I think people now see on-premises — certainly there are things that will stay on-premises for security reasons, I get that — but when you start getting into agility, and you now have cloud regions, you can start being more programmatic. So I want to get you guys' thoughts on cloud migration. Companies are now lifting and shifting, replatforming — and what's the refactoring beyond that? Because you can replatform in the cloud and still hold back. Once you're in the cloud, the companies that are winning are the ones refactoring in the cloud, doing things differently with new services. Sanjeev, you start. >> Yeah, so in fact a lot of people tell me, "You know, we're just going to lift and shift into the cloud." But then you're literally using cloud as a data center. You still have all the, if I may say, junk you had on-prem; you've just moved it into the cloud, and now you're paying for it. In the cloud, nothing is free: every bit of storage, every bit of processing, you're going to pay for. The most successful companies are the ones that are replatforming — taking advantage of platform as a service or software as a service. That includes things like pay-as-you-go: you pay for exactly the amount you use, so you can scale up and scale down, or scale out and scale in, pretty quickly, you know? You're handling that demand. So without replatforming, you are not really utilizing your- >> John: It's just hosting. >> Yeah, you're just hosting. >> It's basically hosting if you're not doing anything right there. >> Right. The reason people sometimes resist replatforming is that there's a hidden cost we don't really talk about: PaaS adds 3x to IaaS cost.
So some organizations that are very mature, with a few thousand people in the IT department, will say, "No, we just want to run it in the cloud; we have the expertise, and it's cheaper for us." But in the long run, to get the most benefit, people should think of using cloud as a service. >> Nong, what's your take? Because you see examples of companies — I'll just call one out, Snowflake, for instance — that are essentially a data warehouse in the cloud. They refactored and they replatformed, and they have a competitive advantage with the scale; they have things that others just hosting, or even on-premises, don't have. There's a new model developing with real advantages. How should companies think about this when they have to manage these data lakes and all these new access methods, but want to maintain operational stability, control, and growth? >> Yeah, there are a few topics that all (indistinct) this topic. (indistinct) enterprises moving to the cloud, they do this maybe for some cost savings, but a ton of it is agility, right? The speed the business can run at is just so much faster. So we'll work with companies in the context of cloud migration for data, where they might have a data warehouse they've been using for 20 years, building policies over that time, right? And back then, it taking a long time to get approval for access and those kinds of things made more sense. If it took you months to procure physical infrastructure and get machines shipped to your data center, then data access taking that long feels okay, right? It's the same rate everything else is moving at. In the cloud, you can spin up new infrastructure instantly, so you don't want approvals for getting policies and creating rules — all that stuff Sanjeev was talking about. That being slow is a huge, huge problem. So this is a very common environment we see, where they're trying to do that kind of thing. And then, for replatforming: again, they've been building these roles and processes and policies for 20 years. What they don't want is to take another 20 years migrating all that stuff into the cloud, right? That's probably an experience nobody wants to repeat, and frankly, for many of them, the people who did it originally may or may not still be involved. So we work with a lot of companies like that. They want stability — they've got to keep the business running as normal — and they've got to get moving into the new infrastructure, doing it in a new way, with all the lessons learned. So, as Sanjeev said, one of these big banks we work with had the classical story: on-premises data warehousing, maybe a little bit of Hadoop, moving onto AWS, S3, Snowflake, that kind of setup, with extremely intricate policies — and then, let's go reimagine how we can do this faster, right? What we like to talk about is: as an organization, you need a design where, if you onboarded 1,000 more data users, that's got to be way, way easier than the first 10 you onboarded, right? You've got to get it to be easier over time, in a really, really significant way.
I want to have the maximum agility, but I don't want to get caught in the weeds on some of these tripwires around access and authorization." >> Yeah, absolutely, I think it's really important to get the balance of it right. Because if you are an enterprise, or if you have diverse teams, you want them to have the ability to use tools as best of breed for their purpose, right? But you don't want to have it be so that every tool has its own access and provisioning and whatever, that's definitely going to be a security risk, or at least a lot of friction, for you to get things going. So we think about that really hard, I think we've seen great success with things like SSO and Okta, right? Unifying authentication. We think there's a very, very similar thing about to happen with authorization. You want that single control plane that can integrate with all the tools, and still get the best of what you need, but it's much, much easier (indistinct). >> Okta's a great example, if people don't want to build their own thing and just go with that, same with what you guys are doing. That seems to be the dots that are connecting you, Sanjeev. The ease of use, but yet the stability factor. >> Right. Yeah, because John, today I may want to bring up a SQL editor to go into Snowflake, just as an example. Tomorrow, I may want to use the Azure Bot, you know? I may not even want to go to Snowflake, I may want to go to an underlying piece of data, or I may use Power BI, you know, for some reason, and come from the Azure side, so the point is that, unless we are able to control, in some sort of a centralized manner, we will not get that consistency. And security, you know, is all or nothing. You cannot say "Well, I secured my Snowflake, but if you come through HDFS, Hadoop, or some, you know, that is outside of my realm, or my scope," what's the point? So that is why it is really important to have a watertight way, in fact I'm using just a few examples, maybe tomorrow I decide to use a data catalog, or I use Denodo as my data virtualization and I run a query. I'm the same identity, but I'm using different tools. I may use it from home, over VPN, or I may use it from the office, so you want this kind of flexibility, all encompassed in a policy, rather than a separate rule if you do this and this, if you do that, because then you end up with literally thousands of rules. >> And it's never going to stop, either, it's like fashion, the next tool's going to come out, it's going to be cool, and people are going to want to use it, again, you don't want to have to then move the train from the compliance side this way or that way, it's a lot of hassle, right? So we have that one capability, you can bring on new things pretty quickly. Nong, am I getting it right, this is kind of like the trend, that you're going to see more and more tools and/or things that are relevant or, certain use cases that might justify it, but yet, AppSec review, compliance review, I mean, good luck with that, right? >> Yeah, absolutely, I mean we certainly expect tools to continue to get more and more diverse, and better, right? Most innovation in the data space, and I think we... This is a great time for that, a lot of things that need to happen, and so on and so forth. So I think one of the early goals of the company, when we were just brainstorming, is we don't want data teams to not be able to use the tools because it doesn't have the right security (indistinct), right? Often those tools may not be focused on that particular area.
They're great at what they do, but we want to make sure they're enabled, they make some enterprise investments, and they see broader adoption much more easily. A lot of those things. >> And I can hear the sirens in the background, that's someone who's not using your platform, they need some help there. But that's the case, I mean if you don't get this right, there are some consequences, and I think one of the things I would like to bring up on the next track, to talk through with you guys, is the persona pigeonhole role, "Oh yeah, a data person, the developer, the DevOps, the SRE," you start to see now, developers and with cloud developers, and data folks, people, however they get pigeonholed, kind of blending in, okay? You got data services, you got analytics, you got data scientists, you got more democratization, all these things are being kicked around, but the notion of a developer now is a data developer, because cloud is about DevOps, data is now a big part of it, it's not just some department, it's actually blending in. Just a cultural shift, can you guys share your thoughts on this trend of data people versus developers now becoming kind of one, do you guys see this happening, and if so, how? >> So, John, when I started my career, I was a DBA, and then a data architect. Today, I think you cannot have a DBA who's not a developer. That's just my opinion. Because there is so much CICD, DevOps, that happens today, and you know, you write your code in Python, you put it in version control, you deploy using Jenkins, you roll back if there's a problem. And then, you are interacting, you're building your data to be consumed as a service. People in the past, you would have a thick client that would connect to the database over TCP/IP. Today, people don't want to connect over TCP/IP necessarily, they want to go over HTTP. And they want an API gateway in the middle. So, if you're a data architect or DBA, now you have to worry about, "I have a REST API call that's coming in, how am I going to secure that, and make sure that people are allowed to see that?" And that was just yesterday. >> Exactly. Got to build an abstraction layer. You got to build an abstraction layer. The old days, you have to worry about schema, and do all that, it was hard work back then, but now, it's much different. You got serverless, functions are going to show way... It's happening. >> Correct, GraphQL, and semantic layer, that just blows me away because, it used to be, it was all in database, then we took it out of database and we put it in a BI tool. So we said, like BusinessObjects started this whole trend. So we're like "Let's put the semantic layer there," well okay, great, but that was when everything was surrounding BusinessObjects and Oracle Database, or some other database, but today what if somebody brings Power BI or Tableau or Qlik, you know? Now you don't have semantic layer access. So you cannot have it in the BI layer, so you move it down to its own layer. So now you've got a semantic layer, then where do you store your metrics? Same story repeats, you have a metrics layer, then the data scientists want to do feature engineering, where do you store your features? You have a feature store.
And before you know it, this stack has disaggregated over and over and over, and then you've got layers and layers of specialization that are happening, there's query accelerators like Dremio or Trino, so you've got your data here, which Nong is trying really hard to protect, and then you've got layers and layers and layers of abstraction, and networks are fast, so the end user gets great service, but it's a nightmare for architects to bring all these things together. >> How do you tame the complexity? What's the bottom line? >> Nong? >> Yeah, so, I think... So there's a few things you need to do, right? So, we need to re-think how we express security permissions, right? I think you guys have just maybe in passing (indistinct) talked about creating all these rules and all that kind of stuff, that's been the way we've done things forever. We've got to think about policies and mechanisms that are much more dynamic, right? You need to really think about not having to do any additional work for the new things you add to the system. That's really, really core to solving the complexity problem, right? 'Cause that gets you those orders of magnitude reduction, the system's got to be more expressive and map to those policies. That's one. And then second, it's got to be implemented at the right layer, right, to Sanjeev's point, close to the data, and it can service all of those applications and use cases at the same time, and have that uniformity and breadth of support. So those two things have to happen. >> Love this universal data authorization vision that you guys have. Super impressive, we had a CUBE Conversation earlier with Nick Halsey, who's a veteran in the industry, and he likes it. That's a good sign, 'cause he's seen a lot of stuff, too, Sanjeev, like yourself. This is a new thing, you're seeing compliance being addressed, and with programmatic, I'm imagining there's going to be bots someday, very quickly with AI that's going to scale that up, so they kind of don't get in the innovation way, they can still get what they need, and enable innovation. You've got cloud migration, which is only going faster and faster. Nong, you mentioned speed, that's what CloudOps is all about, developers want speed, not things in days or hours, they want it in minutes and seconds. And then finally, ultimately, how does it scale up for the people operating and/or programming? These are three major pieces. What happens next? Where do we go from here? The customer's sitting there saying "I need help, I need trust, I need scale, I need security." >> So, I just wrote a blog, if I may diverge a bit, on data observability. And you know, so there are a lot of these little topics that are critical, DataOps is one of them, so to me data observability is really having a transparent view of, what is the state of your data in the pipeline, anywhere in the pipeline? So you know, when we talk to these large banks, these banks have like 1000, over 1000 data pipelines working every night, because they've got a hundred, 200 data sources from which they're bringing data in. Then they're doing all kinds of data integration, they have, you know, we talked about Python or Informatica, or whatever data integration, data transformation product you're using, so you're combining this data, writing it into an analytical data store, something's going to break. So, to me, data observability becomes a very critical thing, because it shows me something broke, walk me down the pipeline, so I know where it broke.
Maybe the data drifted. And I know Okera does a lot of work in data drift, you know? So this is... Nong, jump in any time, because I know we have use cases for that. >> Nong, before you get in there, I just want to highlight a quick point. I think you're onto something there, Sanjeev, because we've been reporting, and we believe, that data workflows are intellectual property. And they have to be protected. Nong, go ahead, your thoughts, go ahead. >> Yeah, I mean, the observability thing is critically important. I would say when you want to think about what's next, I think it's really effectively bridging tools and processes and systems and teams that are focused on data production, with the data analysts, data scientists, that are focused on data consumption, right? I think bridging those two, which cover a lot of the topics we talked about, that's kind of where security almost meets, that's kind of where you got to draw it. I think for observability and pipelines and data movement, understanding that is essential. And I think broadly, on all of these topics, where all of us can be better, is if we're able to close the loop, get the feedback loop of success. So data drift is an example of the loop rarely being closed. It drifts upstream, and downstream users can take forever to figure out what's going on. And we'll have similar examples related to buy-ins, or data quality, all those kind of things, so I think that's really a problem that a lot of us should think about. How do we make sure that loop is closed as quickly as possible? >> Great insight. Quick aside, as the founder CTO, how's life going for you, you feel good? I mean, you started a company, doing great, it's not drifting, it's right in the stream, mainstream, right in the wheelhouse of where the trends are, you guys really have crosshairs on the real issues, how you feeling, tell us a little bit about how you see the vision. >> Yeah, I obviously feel really good, I mean we started the company a little over five years ago, there are kind of a few things that we bet would happen, and I think those things were out of our control, I don't think we would've predicted GDPR, security, and those kinds of things being as prominent as they are. Those things have really matured, probably as best as we could've hoped, so that feels awesome. Yeah, (indistinct) really expanded in these years, and it feels good. Feels like we're in the right spot. >> Yeah, it's great, data's a competitive advantage, and certainly has a lot of issues. It could be a blocker if not done properly, and you're doing great work. Congratulations on your company. Sanjeev, thanks for kind of being my cohost in this segment, great to have you on, been following your work, and you continue to unpack it at your new place that you started. SanjMo, good to see your Twitter handle taking on the name of your new firm, congratulations. Thanks for coming on. >> Thank you so much, such a pleasure. >> Appreciate it. Okay, I'm John Furrier with theCUBE, you're watching today's session presentation of AWS Startup Showcase, featuring Okera, a hot startup, check 'em out, great solution, with a really great concept. Thanks for watching. (calm music)
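To ground the idea that runs through this conversation, a single dynamic policy evaluated for every tool instead of thousands of static per-tool rules, here is a minimal sketch in Python. It assumes a hypothetical attribute-based policy engine; the class, the attribute names, and the rules are invented for illustration and are not Okera's actual API.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user: str
    user_country: str     # where the user is located at request time
    role: str             # e.g. "analyst"
    tool: str             # "snowflake", "power_bi", "denodo", ...
    dataset: str
    data_region: str      # where the data must legally remain

# EU-resident data may only be served to users in these countries
# (an invented stand-in for a real residency rule).
EU_COUNTRIES = {"DE", "FR", "IE", "NL"}

def evaluate(req: AccessRequest) -> bool:
    """One attribute-based policy, evaluated the same way no matter
    which tool issued the query. A user moving from the US to Germany
    changes an attribute, not the rule set."""
    if req.data_region == "EU" and req.user_country not in EU_COUNTRIES:
        return False                      # data residency check
    return req.role in {"analyst", "data_engineer"}   # uniform role check

# The same policy answers a Snowflake query and a Power BI query:
print(evaluate(AccessRequest("sanjeev", "DE", "analyst",
                             "power_bi", "eu_customers", "EU")))   # True
print(evaluate(AccessRequest("sanjeev", "US", "analyst",
                             "snowflake", "eu_customers", "EU")))  # False
```

Because the decision is computed from attributes at request time, adding a new tool or a new region adds no rules, which is the property that keeps the rule count flat as the stack disaggregates.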
Keynote Analysis | Virtual Vertica BDC 2020
(upbeat music) >> Narrator: It's theCUBE, covering the Virtual Vertica Big Data Conference 2020. Brought to you by Vertica. >> Dave Vellante: Hello everyone, and welcome to theCUBE's exclusive coverage of the Vertica Virtual Big Data Conference. You're watching theCUBE, the leader in digital event tech coverage. And we're broadcasting remotely from our studios in Palo Alto and Boston. And, we're pleased to be covering wall-to-wall this digital event. Now, as you know, originally BDC was scheduled this week at the new Encore Hotel and Casino in Boston. Their theme was "Win big with big data". Oh sorry, "Win big with data". That's right, got it. And, I know the community was really looking forward to that, you know, meet up. But look, we're making the best of it, given these uncertain times. We wish you and your families good health and safety. And this is the way that we're going to broadcast for the next several months. Now, we want to unpack Colin Mahony's keynote, but, before we do that, I want to give a little context on the market. First, theCUBE has covered every BDC since its inception, since the BDC's inception that is. It's a very intimate event, with a heavy emphasis on user content. Now, historically, the data engineers and DBAs in the Vertica community, they comprised the majority of the content at this event. And, that's going to be the same for this virtual, or digital, production. Now, theCUBE is going to be broadcasting for two days. What we're doing, is we're going to be concurrent with the Virtual BDC. We got practitioners that are coming on the show, DBAs, data engineers, database gurus, we got security experts coming on, and really a great line up. And, of course, we'll also be hearing from Vertica Execs, Colin Mahony himself right off the keynote, folks from product marketing, partners, and a number of experts, including some from Micro Focus, which is the, of course, owner of Vertica. But I want to take a moment to share a little bit about the history of Vertica. The company, as you know, was founded by Michael Stonebraker. And, Vertica started, really they started out as a SQL platform for analytics. It was the first, or at least one of the first, to really nail the MPP column store trend. Not only did Vertica have an early mover advantage in MPP, but the efficiency and scale of its software, relative to traditional DBMS, and also other MPP players, is underscored by the fact that Vertica, and the Vertica brand, really thrives to this day. But, I have to tell you, it wasn't without some pain. And, I'll talk a little bit about that, and really talk about how we got here today. So first, you know, you think about traditional transaction databases, like Oracle or IBM DB2, or even enterprise data warehouse platforms like Teradata. They were simply not purpose-built for big data. Vertica was. Along with a whole bunch of other players, like Netezza, which was bought by IBM, Aster Data, which is now Teradata, Actian ParAccel, which was the basis for Redshift, Amazon's Redshift, Greenplum was bought, in the early days, by EMC. And, these companies were really designed to run as massively parallel systems that smoked traditional RDBMS and EDW for particular analytic applications. You know, back in the big data days, I often joked that, like an NFL draft, there was a run on MPP players, like when you see a run on pulling guards. You know, once one goes, they all start to fall. And that's what you saw with the MPP columnar stores, IBM, EMC, and then HP getting into the game.
So, it was like 2011, and Leo Apotheker, he was the new CEO of HP. Frankly, he had no clue, in my opinion, what to do with Vertica, and totally missed one of the biggest trends of the last decade, the data trend, the big data trend. HP picked up Vertica for a song, it wasn't disclosed, but my guess is that it was around 200 million. So, rather than build a bunch of smart tokens around Vertica, which I always call the diamond in the rough, Apotheker basically permanently altered HP for years. He kind of ruined HP, in my view, with a 12 billion dollar purchase of Autonomy, which turned out to be one of the biggest disasters in recent M&A history. HP was forced to spin-merge, and ended up selling most of its software to Microsoft, Micro Focus. (laughs) Luckily, during its time at HP, CEO Meg Whitman largely was distracted with what to do with the mess that she inherited from Apotheker. So, Vertica was left alone. Now, the upshot is Colin Mahony, who was then the GM of Vertica, and still is. By the way, he's really the CEO, and he just doesn't have the title, I actually think they should give that to him. But anyway, he's been at the helm the whole time. And Colin, as you'll see in our interview, is a rockstar, he's got technical and business chops, people love him in the community. Vertica's culture is really engineering-driven and they're all about data. Despite the fact that Vertica is a 15-year-old company, they've really kept pace, and not been polluted by legacy baggage. Vertica, early on, embraced Hadoop and the whole open-source movement. And that helped give it tailwinds. It leaned heavily into cloud, as we're going to talk about further this week. And they got a good story around machine intelligence and AI. So, whereas many traditional database players are really getting hurt, and some are getting killed, by cloud database providers, Vertica's actually doing a pretty good job of servicing its install base, and is in a reasonable position to compete for new workloads. On its last earnings call, the Micro Focus CFO, Stephen Murdoch, he said they're investing 70 to 80 million dollars in two key growth areas, security and Vertica. Now, Micro Focus is running its SUSE play on these two parts of its business. What I mean by that, is they're investing and allowing them to be semi-autonomous, spending on R&D and go to market. And, they have no hardware agenda, unlike when Vertica was part of HP, or HPE, I guess HP, before the spin out. Now, let me come back to the big trend in the market today. And there's something going on around analytic databases in the cloud. You've got companies like Snowflake and AWS with Redshift, as we've reported numerous times, and they're doing quite well, they're gaining share, especially of new workloads that are emerging, particularly in the cloud native space. They combine scalable compute, storage, and machine learning, and, importantly, they're allowing customers to scale compute and storage independent of each other. Why is that important? Because you don't have to buy storage every time you buy compute, or vice versa, in chunks. So, if you can scale them independently, you've got granularity. Vertica is keeping pace. In talking to customers, Vertica is leaning heavily into the cloud, supporting all the major cloud platforms, as we heard from Colin earlier today, adding Google.
And, while my research shows that Vertica has some work to do in cloud and cloud native, to simplify the experience, it has a more robust, mature stack, which supports many different environments, you know, deep SQL, ACID properties, and DNA that allows Vertica to compete with these cloud-native database suppliers. Now, Vertica might lose out in some of those native workloads. But, I have to say, my experience in talking with customers, if you're looking for a great MPP column store that scales and runs in the cloud, or on-prem, Vertica is in a very strong position. Vertica claims to be the only MPP columnar store to allow customers to scale compute and storage independently, both in the cloud and in hybrid environments on-prem, et cetera, cross clouds, as well. So, while Vertica may be at a disadvantage in a pure cloud native bake-off, its more robust, mature stack, combined with its multi-cloud strategy, gives Vertica a compelling set of advantages. So, we heard a lot of this from Colin Mahony, who announced Vertica 10.0 in his keynote. He really emphasized Vertica's multi-cloud affinity, its Eon Mode, which really allows that separation, or scaling of compute, independent of storage, both in the cloud and on-prem. Vertica 10, according to Mahony, is making big bets on in-database machine learning, he talked about that and AI, along with some advanced regression techniques. He talked about PMML models, Python integration, which was actually something that they talked about doing with Uber and some other customers. Now, Mahony also stressed the trend toward object stores. And, Vertica now supports, let's see, S3 with Eon, S3 Eon in Google Cloud, in addition to AWS, and then Pure and HDFS, as well, they all support Eon Mode. Mahony also stressed, as I mentioned earlier, a big commitment to on-prem and the whole cloud optionality thing. So 10.0, according to Colin Mahony, is all about really doubling down on these industry waves. As they say, enabling native PMML models, running them in Vertica, and really doing all the work that's required around ML and AI, they also announced support for TensorFlow. So, object store optionality is important, is what he talked about in Eon Mode, with the news of support for Google Cloud, as well as HDFS. And finally, a big focus on deployment flexibility. Migration tools, which are a critical focus really on improving ease of use, and you hear this from a lot of customers. So, these are the critical aspects of Vertica 10.0, and an announcement that we're going to be unpacking all week, with some of the experts that I talked about. So, I'm going to close with this. My long-time co-host, John Furrier, and I have talked for some time about this new cocktail of innovation. No longer is Moore's law the, really, mainspring of innovation. It's now about taking all these data troves, bringing machine learning and AI into that data to extract insights, and then operationalizing those insights at scale, leveraging cloud. And, one of the things I always look for from cloud is, if you've got a cloud play, you can attract innovation in the form of startups. It's part of the success equation, certainly for AWS, and I think it's one of the challenges for a lot of the legacy on-prem players. Vertica, I think, has done a pretty good job in this regard. And, you know, we're going to look this week for evidence of that innovation. One of the interviews that I'm personally excited about this week, is a new-ish company, I would consider them a startup, called Zebrium.
What they're doing, is they're applying AI to do autonomous log monitoring for IT ops. And, I'm interviewing Larry Lancaster, who's their CEO, this week, and I'm going to press him on why he chose to run on Vertica and not a cloud database. This guy is a hardcore tech guru and I want to hear his opinion. Okay, so keep it right there, stay with us. We're all over the Vertica Virtual Big Data Conference, covering in-depth interviews and following all the news. So, theCUBE is going to be interviewing these folks, two days, wall-to-wall coverage, so keep it right there. We're going to be right back with our next guest, right after this short break. This is Dave Vellante and you're watching theCUBE. (upbeat music)
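To make the compute-storage separation argument concrete, here is a back-of-the-envelope sketch. All prices are invented placeholders, not Vertica, AWS, or any vendor's list prices; the point is only the shape of the comparison between a coupled, shared-nothing cluster and an Eon-style design that grows storage on an object store independently of compute.

```python
import math

STORAGE_PER_TB_MONTH = 23.0   # assumed object-store cost, $/TB-month
NODE_PER_MONTH = 1500.0       # assumed cost of one compute node, $/month

def coupled_cost(data_tb: float, tb_per_node: float) -> float:
    """Coupled model: every new chunk of data forces a new node,
    whether or not the workload needs its CPU."""
    nodes = math.ceil(data_tb / tb_per_node)
    return nodes * NODE_PER_MONTH

def separated_cost(data_tb: float, nodes_needed: int) -> float:
    """Eon-style model: storage grows on the object store while
    compute is sized to the query workload alone."""
    return data_tb * STORAGE_PER_TB_MONTH + nodes_needed * NODE_PER_MONTH

# 500 TB of mostly cold data, but a workload that only needs 4 nodes:
print(f"coupled:   ${coupled_cost(500, 10):,.0f}/month")   # 50 nodes' worth
print(f"separated: ${separated_cost(500, 4):,.0f}/month")
```

Under these made-up numbers the coupled cluster runs $75,000 a month against $17,500 for the separated design, which is the granularity point: you buy compute only when queries demand it, not whenever data lands.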
Doug Merritt, Splunk | Splunk .conf19
>> Announcer: Live from Las Vegas, it's theCUBE! Covering Splunk .conf19. Brought to you by Splunk. Okay, welcome back, everyone. This is day three live CUBE coverage here in Las Vegas for Splunk's .conf. It's the 10-year anniversary of their big customer event. I'm John Furrier, theCUBE. This is our seventh year covering, riding the wave with Splunk. From scrappy startup to public company, massive growth, now a market leader continuing to innovate. We're here with the CEO, Doug Merritt of Splunk. Thanks for joining me, good to see you. >> Thank you for being here, thanks for having me. >> John: How ya feelin'? (laughs) >> Exhausted and energized simultaneously. (laughs) it was a fun week. >> You know, every year when we have the event we discuss Splunk's success and the loyalty of the customer base, the innovation, you guys are providing the value, you got a lot of happy customers, and you got a great ecosystem and partner network growing. You're now growing even further, every year it just gets better. This year has been a lot of big highlights, new branding, so you got that next level thing goin' on, new platform, tweaks, bringing this cohesive thing. What's your highlights? I mean, what's the big, there's so much goin' on, what's your highlights? >> So where you started is always my highlight of the show, is being able to spend time with customers. I have never been at a company where I feel so fortunate to have the passion and the dedication and the enthusiasm and the gratitude of customers as we have here. And so that, I tell everyone at Splunk this is similar to a holiday function for a kid for me where the energy keeps me going all year long, so that always is number one, and then around the customers, what we've been doing with the technology architecture, the platform, and the depth and breadth of what we've been working on honestly for four plus years. It really, I think, has come together in a unique way at this show. >> Last year you had a lot of announcements that were intentional announcements, it's coming. They're coming now, they're here, they're shipping. >> They're here, they're here. >> What is some of the feedback you're hearing, because a lot of it has a theme where, you know, we kind of pointed this out a couple of years ago, it's like a security show now, but it's not a security show, but there's a lot of security in there. What are some of the key things that have come out of the oven that people should know about that are being delivered here? >> So the core of what we're trying to communicate with Data-to-Everything is that you need a very multifaceted data platform to be able to handle the huge variety of data that we're all dealing with, and Splunk has been known and been very successful at being able to index data, messy, non-structured data, and make sense of it even though it's not structured in the index, and that's been, still is, incredibly valuable. But we started almost four years ago on a journey of adding in stream processing before the data gets anywhere, to our index or anywhere else, it's moving all around the world, how do you actually find that data and then begin to take advantage of it in-flight? And we announced the beta of Data Stream Processor last year, but it went to production this year, four years of development, a ton of patents, a 40 plus person, 50 plus person, development team behind that, a lot of hard engineering, and a really elegant interface to get that there.
And then on the other end, to complement the index, data is landing all over the place, not just in our index, and we're very aware that different structures exist for different needs. A data warehouse has different properties than a relational database, which has different properties than a NoSQL column store in-memory database, and data is going to only continue to be more dispersed. So again, four plus years ago we started on what now is Data Fabric Search, which we pre-announced in beta format last year. That went production at this show, with the ability to address a distributed Splunk landscape, but more importantly we demoed the integration with HDFS and S3 landscapes as the proof point that we've built a connector framework, so that this really isn't just an incredibly high-speed, high-cardinality search processing engine, it really is a federated search engine as well. So now we can operate on data in the stream when it's in motion. We obviously still have all the great properties of the Splunk index, and I was really excited about Splunk 8.0 and all the features in that, and we can go get data wherever it lives across a distributed Splunk environment, but increasingly across the more and more distributed data environment. >> So this is a data platform. This is absolutely a data platform, so that's very clear. So the success of platforms, in the enterprise at least, not just small and medium-sized businesses, you can have a tool and kind of look like a platform, there's some apps out there that I would point to and say, "Hey, that looks like a tool, it's really not a platform." You guys are a platform. But the success of a platform comes down to two things, ecosystem and apps, because if you're in a platform that's enabling value, you got to have those. Talk about how you see the ecosystem success and the app success. Is that happening in your view? >> It is happening. We have over 2,000 apps on our Splunkbase framework, which is where any of our customers can go and download the application to help draw value out of a Palo Alto firewall, or ensure integration with a ServiceNow trouble ticketing system, and thousands of other examples that exist. And that has grown from less than 300 apps, when I first got here six years ago, to over 2,000 today. But that is still the earliest inning, the earliest pitch, of the journey. Why aren't there 20,000, 200,000, two million apps out there? A piece of it is we have had to up the game on how you interface with the platform, and for us that means through a stable set of services, well-mannered, well-articulated, consistently maintained services, and that's been a huge push with the core Splunk index, but it's also a big amount of work that we've been doing on everything from the separation between Phantom runbooks and playbooks with the underlying orchestration automation, it's a key component of our Stream Processor, you know, what transformations are you doing, what enrichments are you doing? That has to live separate from the underlying technology, the Kafka transport mechanism, or Kinesis, or whatever happens in the future.
So that investment to make sure we got an effective and stable set of services has been key, but then you complement that with the amazing set of partners that are out here, and making sure they're educated and enabled on how to take advantage of the platform, and then feather in things like the Splunk Ventures announcement, the Innovation Fund and Social Impact Fund, to further double down on, hey, we are here to help in every way. We're going to help with enablement, we're going to help with sell-through and marketing, and we'll help with investment. >> Yeah, I think this is smart, and I think one of the things I'll point out is that feedback we heard from customers in conversations we had here on theCUBE and the hallway is, there's a lot of great feedback on the automation, the machine learning toolkit, which is a good telltale sign of the engagement level of how they're dealing with data, and this kind of speaks to data as a value... The value creation from data seems to be the theme. It's not just data for data's sake, I mean, managing data is all hard stuff, but value from the data. You mentioned the Ventures, you got a lot of tech for good stuff goin' on. You're investing in companies where they're standing up data-driven companies to solve world problems, you got other things, so you guys are adjusting. In the middle innings of the data game, platform update, business model changes. Talk about some of the consumption changes, now you got Splunk Cloud, what's goin' on on (laughs) how you charge, how are customers consuming, what moves did you guys make there and what's the result? >> Yeah, it's a great intro. Data is awesome, but we all have data to get to decisions first and actions second. Without an action there is no point in gathering data, and so many companies have been working their tails off to digitize their landscapes. Why, well you want a more flexible landscape, but why the flexibility? Because there's so much data being generated that if you can get effective decisions and then actions, that landscape can adapt very, very rapidly, which goes back to machine learning and eventual AI-type opportunities. So that is absolutely, squarely where we've been focused, is translating that data into value and into actual outcomes, which is why our orchestration automation piece was so important. One of the gating factors that we felt has existed is, for the Splunk index, and it's only for the Splunk index, the pricing mechanism has been data volume, and that's a little bit contrary to the promise, which is you don't know where the value is going to be within data, and whether it's a gigabyte or whether it's a petabyte, why shouldn't you be able to put whatever data you want in to experiment? And so we came out with some updates in pricing a month and change ago that we were reiterating at the show and will continue to drive on a, hopefully, very aggressive and clear marketing and communications framework, that for people that have adjusted to the data volume metric, we're trying to make that much simpler. There's now a limited set of bands, or tiers, from 100 gigs to unlimited, so that you really get visibility on, all right, I think that I want to play with five terabytes, I know what that band looks like and it's very liberal.
So that if you wind up with six and a half terabytes you won't be penalized, and then there's a complementary metric which I think is ultimately going to be the more long-lived metric for our infrastructurally-bound products, which is virtual CPU or virtual core. And when I think about our index, stream processing, federated search, the execution of automation, all those are basically a function of how much infrastructure you're going to throw at the problem, whether it's CPU or whether it's storage or network. So I can see a day when Splunk Enterprise and the index, and everything else at that lower level, or at that infrastructure layer, are all just a series of virtual CPUs or virtual cores. But I think, with both, we're offering choice, we really are customer-centric, and whether you want a more liberal data volume or whether you want to switch to an infrastructure metric, we're there, and our job is to help you understand the value translation on both of those, because all that matters is turning it into action and into doing. >> It's interesting, in the news yesterday quantum supremacy was announced. Google claims it, IBM's debating it, but quantum computing just points to the trend that more compute's coming. So this is going to be a good thing for data. You mentioned the pricing thing, this brings up a topic we've been hearing all week on theCUBE, which is that diverse data's actually great for machine learning, great for AI. So bringing in diverse data gives you more aperture into data, and that actually helps. With the diversity comes confusion, and this is where the pricing seems to hit. You're trying to create, if I get this right, pricing that matches the needs of the diverse use of data. Is that kind of how you guys are thinkin' about it? >> Meets the needs of diverse data, and also provides a lot of clarity for people on when you get to a certain threshold that we stop charging you altogether, right? Once you get above 10s of terabytes to 100 terabytes, just put as much data in as you want. The foundation of Splunk, going back to the first days, is we're the only technology that still exists on the index side that takes raw, non-formatted data, doesn't force you to cleanse or scrub it in any way, and then takes all that raw data and actually provides value through the way that we interact with the data with our query language. And that design architecture, I've said it for five, six years now, is completely unique in the industry. Everybody else thinks that you've got to get to the data you want to operate on, and then put it somewhere, and the way that life works is much more organic and emergent. You've got chaos happening, and then how do you find patterns and value out of that chaos? Well, that chaos winds up being pretty voluminous. So how do we help more organizations? Some of the leading organizations are at five to 10 petabytes of data per day going through the index. How do we help everybody get there? 'Cause you don't know which nugget across that petabyte or 10-petabyte set is going to be the key to solving a critical issue, so let's make it easy for you to put that data in to find those nuggets, but then once you know what the pattern is, now you're in a different world, now you're in the structured data world of metrics, or KPIs, or events, or multidimensional data that is much more curated, and by nature that's going to be more fine-grained. There's not as much volume there as there is in the raw data. >> Doug, I notice also at the event here there's a focus on verticals.
Can you comment on the strategy there, is that by design? Is there a vertical focus? >> It's definitely by design. >> Share some insight into that. >> So we launched with an IT operations focus, we wound up progressing over the years to a security operations focus, and then our doubling down with Omnition, SignalFx, VictorOps, and now Streamlio is a new acquisition on the DevOps and next gen app dev buying centers. As a company and how we go to market and what we are doing with our own solutions, we stay incredibly focused on those three very technical buying centers, but we've also seen that data is data. So the data you're bringing in to solve a security problem can be used to solve a manufacturing problem, or a logistics and supply chain problem, or a customer sentiment analysis problem, and so how do you make use of that data across those different buying centers? We've set up a verticals group to seed, continue to seed, the opportunity within those different verticals. >> And that's compatible with the horizontally scalable Splunk platform. That's kind of why that exists, right? >> Right, the overall platform that was in every keynote, starting with mine, is completely agnostic and horizontal. The solutions on top, the security operations, ITOps, and DevOps, are very specific to those users, but they're using the horizontal platform, and then you wind up walking into the Accenture booth and seeing how they've taken similar data that the SecOps teams gathered to actually provide insight on effective rail transport for DB Cargo, or effective cell tower triangulation and capacity for a major Australian cell company, or effective manufacturing and logistics supply chain optimization for a manufacturer and all their different retail distribution centers. >> Awesome, you know, I know you've talked with Jeff Frick in the past, and Stu Miniman and Dave Vellante about user experience, I know that's something that's near and dear to your heart. You guys, it has been rumored, there's going to be some user experience work done on the onboarding for your Splunk Cloud and making it easier to get into this new Splunk platform. What can we expect on the user experience side? (laughs)
So the facilities at VictorOps gave, on allowing distributed teams and virtual teams to very quickly get to resolution. You're going to find those baked into all products like Mission Control 'cause it's one of the key facilities of, that Tim talked about in his keynote, of indulgent design, mobility, high collaboration, 'cause luckily people still matter, and while ML is helping all of us be more productive it isn't taking away the need for us, but how do you get us to cooperate effectively? And so our cloud-based apps, I encourage any of you out there, go try Splunk Investigate, it's a beautiful product and I think you'll be blown away by it. >> Great success on the product side, and then great success on the customer side, you got great, loyal customers. But I got to ask you about the next level Splunk. As you look at this event, what jumps out at me is the cohesiveness of the story around the platform and the apps, ecosystem's great, but the new branding, Data-to-Everything. It's not product-specific 'cause you have product leadership. This is a whole next level Splunk. What is the next level Splunk vision? >> And I love the pink and orange, in bold colors. So when I've thought about what are the issues that are some of the blockers to Splunk eventually fulfilling the destiny that we could have, the number one is awareness. Who the heck is Splunk? People have very high variance of their understanding of Splunk. Log aggregation, security tool, IT tool, and what we've seen over and over is it is much more this data platform, and certainly with the announcements, it's becoming more of this data fabric or platform that can be used for anything. So how do we bring awareness to Splunk? Well, let's help create a category, and it's not up to us to create the category, it's up to all of you to create the category, but Data-to-Everything in our minds represents the power of data, and while we will continue internally to focus on those technical buying centers, everything is solvable with data. So we're trying to really reinforce the importance of data and the capabilities that something like Splunk brings. Cloud becomes a really important message to that because that makes it, execution to that, 'cause it makes it so much easier for people to immediately try something and get value, but on-prem will always be important as well 'cause data has gravity, data has risk, data has cost to move. And there are so many use cases where you would just never push data to the cloud, and it's not because we don't love cloud. If you have a factory that's producing 100 terabytes an hour in a area where you've got poor bandwidth, there's no option for a cloud connect there of high scale, so you better be able to process, make sense of, and act on that data locally. >> And you guys are great in the cloud too, on-premise, but final word, I want to get your thoughts to end this segment, I know you got to run, thanks for your time, and congratulations on all your success. Data for good. There's a lot of tech for bad kind of narratives goin' on, but there's a real resurgence of tech for good. A lot of people, entrepreneurs, for-profit, for-nonprofit, are doing ventures for good. Data is a real theme. Data for good is something that you have, that's part of the Data-to-Everything. Talk about the data for good real quick. >> Yeah, we were really excited about what we've done with Splunk4Good as our nonprofit focused entity. 
The Splunk Pledge, which is a classic 1-1-1 approach to make sure that we're able to help organizations that need the help do something meaningful within their world, and then the Splunk Social Impact Fund, which is trying to put our money where our mouth is to ensure that if funding and scarcity of funds is an issue of getting to effective outcomes, that we can be there to support. At this show we've featured three awesome charities, Conservation International, NetHope, and the Global Emancipation Network, that are all trying to tackle really thorny problems, different problems in different ways, but data winds up being at the heart of one of the ways to unlock what they're trying to get done. We're really excited and proud that we're able to actually make meaningful donations to all three of those, but it is a constant theme within Splunk, and I think something that all of us, from the tech community and non-tech community, are going to have to help evangelize, is with every invention and with every thing that occurs in the world there is the power to take it and make a less noble execution of it, you know, there's always potential harmful activities, and then there's the power to actually drive good, and data is one of those. >> Awesome. >> Data can be used as a weapon, it can be used negatively, but it also needs to be liberated so that it can be used positively. While we're all kind of concerned about our own privacy and really, really personal data, we're not going to get to the type of healthcare and genetic, massive shifts in changes and benefits without having a way to begin to share some of this data. So putting controls around data is going to be important, putting people in the middle of the process to decide what happens to their data, and some consequences around misuse of data is going to be important. But continuing to keep a mindset of all good happens as we become more liberal, globalization is good, free flow of good-- >> The value is in the data. >> Free flow of people, free flow of data ultimately is very good. >> Doug, thank you so much for spending the time to come on theCUBE, and again congratulations on great culture. It's also worth noting, just to give you a plug here, because it's, I think, very valuable, one of the best places to work for women in tech. You guys recently got some recognition on that. That is a huge accomplishment, congratulations. >> Thank you, thank you, we had a great diversity track here which is really important as well. But we love partnering with you guys, thank you for spending an entire week with us and for helping to continue to evangelize and help people understand what the power of technology and data can do for them. >> Hey, video is data, and we're bringin' that data to you here on theCUBE, and of course, CUBE cloud coming soon. I'm John Furrier here live at Splunk .conf with Doug Merritt the CEO. We'll be back with more coverage after this short break. (futuristic music)
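To make the two meters Doug describes concrete, here is a toy comparison of the ingest-volume band model against the virtual-CPU model. Every number below is an invented placeholder rather than Splunk pricing; the sketch only shows how the two metrics respond differently to a heavy-ingest, modest-compute workload.

```python
VOLUME_BANDS = [        # (band ceiling in GB/day, assumed $/year)
    (100, 20_000),
    (1_000, 120_000),
    (10_000, 600_000),
]
VCPU_PRICE = 1_200      # assumed $/vCPU-year

def volume_quote(gb_per_day: float) -> int:
    """Band model: the price lands on the tier that covers daily ingest,
    so a spike from 5 TB to 6.5 TB a day stays inside one band."""
    for ceiling, price in VOLUME_BANDS:
        if gb_per_day <= ceiling:
            return price
    raise ValueError("above the top band: negotiate an unlimited tier")

def vcpu_quote(vcpus: int) -> int:
    """Infrastructure model: pay for compute capacity, ingest freely."""
    return vcpus * VCPU_PRICE

# Heavy ingest but a modest 64-vCPU search workload:
print(volume_quote(5_000))   # 600000 under the band model
print(vcpu_quote(64))        # 76800 under the vCPU model
```

The point is not the made-up totals but the shape: the volume meter keys on how much data lands, while the vCPU meter keys on how much infrastructure the work actually consumes, which is why Doug calls it the more long-lived metric for infrastructurally-bound products.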
Colin Mahony, Vertica | MIT CDOIQ 2019
>> From Cambridge, Massachusetts, it's theCUBE, covering MIT Chief Data Officer and Information Quality Symposium 2019, brought to you by SiliconANGLE Media. >> Welcome back to Cambridge, Massachusetts everybody, you're watching theCUBE, the leader in tech coverage. My name is Dave Vellante here with my cohost Paul Gillin. This is day one of our two-day coverage of the MIT CDOIQ conference. CDO, Chief Data Officer, IQ, information quality. Colin Mahony is here, he's a good friend and long time CUBE alum. I haven't seen you in a while, >> I know >> But thank you so much for taking some time, you're like a special guest here >> Thank you, yeah it's great to be here, thank you. >> Yeah, so, this is not, you know, something that you would normally attend. I caught up with you, invited you in. This conference started as, like, back office governance, information quality, kind of wonky stuff, hidden. And then when the big data meme took off, kind of around the time we met. The Chief Data Officer role emerged, the whole Hadoop thing exploded, and then this conference kind of got bigger and bigger and bigger. Still intimate, but very high level, very senior. It's kind of come full circle as we've been saying, you know, information quality still matters. You have been in this data business forever, so I wanted to invite you in just to get your perspectives, we'll talk about what's new with what's going on in your company, but let's go back a little bit. When we first met and even before, you saw it coming, you kind of invested your whole career into data. So, take us back 10 years, I mean it was so different, remember it was Batch, it was Hadoop, but it was cool. There was a lot of cool >> It's still cool. (laughs) projects going on, and it's still cool. But, take a look back. >> Yeah, so it's changed a lot, look, I got into it a while ago, I've always loved data, I had no idea the explosion and the three V's of data that we've seen over the last decade. But, data's really important, and it's just going to get more and more important. But as I look back I think what's really changed, and even if you just go back a decade I mean, there's an insatiable appetite for data. And that is not slowing down, it hasn't slowed down at all, and I think everybody wants that perfect solution that they can ask any question of and get immediate answers to. We went through the Hadoop boom, I'd argue that we're going through the Hadoop bust, but what people actually want is still the same. You know, they want real answers, accurate answers, they want them quickly, and they want it against all their information and all their data. And I think that Hadoop evolved a lot as well, you know, it started as one thing 10 years ago, with MapReduce, and I think in the end what it's really been about is disrupting the storage market. But if you really look at what's disrupting storage right now, public clouds, S3, right? That's the new data lake. So there's always a lot of hype cycles, everybody talks about, you know, now it's Cloud, everything, for maybe the last 10 years it was a lot of Hadoop, but at the end of the day I think what people want to do with data is still very much the same. And a lot of companies are still struggling with it, hence the role for Chief Data Officers to really figure out how do I monetize data on the one hand and how do I protect that asset on the other hand. >> Well so, and the cool thing is, so this conference is not a tech conference, really.
And we love tech, we love talking about this, this is why I love having you on. We kind of have a little Vertica thread that I've created here, so Colin essentially, is the current CEO of Vertica, I know that's not your title, you're GM and Senior Vice President, but you're running Vertica. So, Michael Stonebraker's coming on tomorrow, >> Yeah, excellent. >> Chris Lynch is coming on tomorrow, >> Oh, great, yeah. >> we've got Andy Palmer >> Awesome, yeah. >> coming up as well. >> Pretty cool. (laughs) >> So we have this connection, why is that important? It's because, you know, Vertica is a very cool company and is all about data, and it was all about disrupting, sort of, the traditional relational database. It's kind of doing more with data, and if you go back to the roots of Vertica, it was like how do you do things faster? How do you really take advantage of data to really drive new business? And that's kind of what it's all about. And the tech behind it is really cool, we did your conference for many, many years. >> It's coming back by the way. >> Is it? >> Yeah, this March, so March 30th. >> Oh, wow, mark that down. >> At Boston, at the new Encore Hotel. >> Well we better have theCUBE there, bro. (laughs) >> Yeah, that's great. And yeah, you've done that conference >> Yep. >> haven't you before? So very cool customers, kind of leading edge, so I want to get to some of that, but let's talk about the disruption for a minute. So you guys started with the whole architecture, MPP and so forth. And you talked about Cloud, Cloud really disrupted Hadoop. What are some of the other technology disruptions that you're seeing in the market space? >> I think, I mean, you know, it's hard not to talk about AI and machine learning, and what one means versus the other, who knows right? But I think one thing that is definitely happening is people are leveraging the volumes of data and they're trying to use all the processing power and storage power that we have to do things that humans either are too expensive to do or simply can't do at the same speed and scale. And so, I think we're going through a renaissance where a lot more is being automated, certainly on the Vertica roadmap, and our path has always been initially to get the data in and then we want the platform to do a lot more for our customers, lots more analytics, lots more machine learning in the platform. So that's definitely been a lot of the buzz around, but what's really funny is when you talk to a lot of customers they're still struggling with just some basic stuff. Forget about the predictive thing, first you've got to get to what happened in the past. Let's give accurate reporting on what's actually happening. The other big thing I think as a disruption is, I think IOT, for all the hype that it's getting, is very real. And every device is kicking off lots of information, the feedback loop of A/B testing or quality testing for predictive maintenance, it's happening almost instantly. And so you're getting massive amounts of new data coming in, it's all this machine sensor type data, you got to figure out what it means really quick, and then you actually have to do something and act on it within seconds. And that's a whole new area for so many people.
It's not their traditional enterprise data warehouse, and you know, back to your comment on Stonebraker, he got a lot of this right from the beginning, you know, and I think he looked at the architectures, he took a lot of the best in class designs, we didn't necessarily invent everything, but we put a lot of that together. And then I think the other thing you've got to do is constantly re-invent your platform. We came out with our Eon Mode to run cloud native, we just got rated the best cloud data warehouse from a net promoter score rating perspective, so, but we got to keep going you know, we got to keep re-inventing ourselves, but leverage everything that we've done in the past as well. >> So one of the things that you said, which is kind of relevant for here, Paul, is you're still seeing a real data quality issue that customers are wrestling with, and that's a big theme here, isn't it? >> Absolutely, and the, what goes around comes around, as Dave said earlier, we're still talking about information quality 13 years after this conference began. Have the tools to improve quality improved all that much? >> I think the tools have improved, I think that's another area where machine learning, if you look at Tamr, and I know you're going to have Andy here tomorrow, they're leveraging a lot of the augmented things you can do with the processing to make it better. But I think one thing that makes the problem worse now, is it's gotten really easy to pour data in. It's gotten really easy to store data without having to have the right structure, the right quality, you know, 10 years ago, 20 years ago, everything was perfect before it got into the platform. Right, everything was, there was quality, everything was there. What's been happening over the last decade is you're pumping data into these systems, nobody knows if it's redundant data, nobody knows if the quality's any good, and the amount of data is massive. >> And it's cheap to store >> Very cheap to store. >> So people keep pumping it in. >> But I think that creates a lot of issues when it comes to data quality. So, I do think the technology's gotten better, I think there's a lot of companies that are doing a great job with it, but I think the challenge has definitely upped. >> So, go ahead. >> I'm sorry. You mentioned earlier that we're seeing the death of Hadoop, but I'd like you to elaborate on that because (Dave laughs) Hadoop actually came up this morning in the keynote, it's part of what GlaxoSmithKline did. Came up in a conversation I had with the CEO of Experian last week, I mean, it's still out there, why do you think it's in decline? >> I think, I mean first of all if you look at the Hadoop vendors that are out there, they've all been struggling. I mean some of them are shutting down, two of them have merged and they've got killed lately. I think there are some very successful implementations of Hadoop. I think Hadoop as a storage environment is wonderful, I think you can process a lot of data on Hadoop, but the problem with Hadoop is it became the panacea that was going to solve all things data. It was going to be the database, it was going to be the data warehouse, it was going to do everything. >> That's usually the kiss of death, isn't it? >> It's the kiss of death. And it, you know, the killer app on Hadoop, ironically, became SQL. I mean, SQL's the killer app on Hadoop. If you want a SQL engine, you don't need Hadoop.
But what we did was, in the beginning Mike sort of made fun of it, Stonebraker, and joked a lot that he'd heard of MapReduce, it's called Group By, (Dave laughs) and that created a lot of tension between the early Vertica and Hadoop. I think, in the end, we embraced it. We sit next to Hadoop, we sit on top of Hadoop, we sit behind it, we sit in front of it, it's there. But I think what the reality check of the industry has been, certainly by the business folks in these companies is it has not fulfilled all the promises, it has not fulfilled a fraction of the promises that they bet on, and so they need to figure those things out. So I don't think it's going to go away completely, but I think its best success has been disrupting the storage market, and I think there's some much larger disruptions of technologies that frankly are better than HDFS to do that. >> And the Cloud was a gamechanger >> And a lot of them are in the cloud. >> Which is ironic, 'cause you know, Cloudera, (Colin laughs) they didn't really have a cloud strategy, neither did Hortonworks, neither did MapR and, it just so happened Amazon had one, Google had one, and Microsoft has one, so, it's just convenient to-- >> Well, how is that affecting your business? We've seen this massive migration to the cloud (mumbles) >> It's actually been great for us, so one of the things about Vertica is we run everywhere, and we made a decision a while ago, we had our own data warehouse as a service offering. It might have been ahead of its time, never really took off, what we did instead is we pivoted and we said "you know what? "We're going to invest in that experience "so it's a SaaS-like experience, "but we're going to let our customers "have full control over the cloud. "And if they want to go to Amazon they can, "if they want to go to Google they can, "if they want to go to Azure they can." And we really invested in that and that experience. We're up on the Amazon marketplace, we have lots of customers running up on Amazon Cloud as well as Google and Azure now, and then about two years ago we went down and did this endeavor to completely re-architect our product so that we could separate compute and storage so that our customers could actually take advantage of the cloud economics as well. That's been huge for us, >> So you scale independent-- >> Scale independently, cloud native, add compute, take away compute, and for our existing customers, they're loving the hybrid aspect, they love that they can still run on Premise, they love that they can run up on a public cloud, they love that they can run in both places. So we will continue to invest a lot in that. And it is really, really important, and frankly, I think cloud has helped Vertica a lot, because being able to provision hardware quickly, being able to tie in to these public clouds, into our customers' accounts, give them control, has been great and we're going to continue on that path. >> Because Vertica's an ISV, I mean you're a software company. >> We're a software company. >> I know you were a part of HP for a while, and HP wanted to mash that in and run it on its hardware, but software runs great in the cloud. And then to you it's another hardware platform. >> It's another hardware platform, exactly. >> So give us the update on Micro Focus, Micro Focus acquired Vertica as part of the HPE software business, how many years ago now? Two years ago? >> Less than two years ago. >> Okay, so how's that going, >> It's going great. >> Give us the update there.
>> Yeah, so first of all it is great, HPE and HP were wonderful to Vertica, but it's great being part of a software company. Micro Focus is a software company. And more than just a software company it's a company that has a lot of experience bridging the old and the new. Leveraging all of the investments that you've made but also thinking about cloud and all these other things that are coming down the pike. I think for Vertica it's been really great because, as you've seen Vertica has gotten its identity back again. And that's something that Micro Focus is very good at. You can look at what Micro Focus did with SUSE, the Linux company, which actually you know, now just recently spun out of Micro Focus but, letting organizations like Vertica that have this culture, have this product, have this passion, really focus on our market and our customers and doing the right thing by them has been just really great for us and operating as a software company. The other nice thing is that we do integrate with a lot of other products, some of which came from the HPE side, some of which came from Micro Focus, security products is an example. The other really nice thing is we've been doing this insource thing at Micro Focus where we open up our source code to some of the other teams in Micro Focus and they've been contributing now in amazing ways to the product. In ways that we would just never be able to scale, but with 4,000 engineers strong in Micro Focus, we've got a much larger development organization that can actually contribute to the things that Vertica needs to do. And as we go into the cloud and as we do a lot more operational aspects, the experience that these teams have has been incredible, and security's another great example there. So overall it's been great, we've had four different owners of Vertica, our job is to continue what we do on the innovation side in the culture, but so far Micro Focus has been terrific. >> Well, I'd like to say, you're kind of getting that mojo back, because you guys as an independent company were doing your own thing, and then you did for a while inside of HP, >> We did. >> And that obviously changed, 'cause they wanted more integration, but, and Micro Focus, they know what they're doing, they know how to do acquisitions, they've been very successful. >> It's a very well run company, operationally. >> The SUSE piece was really interesting, spinning that out, because now RHEL is part of IBM, so now you've got SUSE as the lone independent. >> Yeah. >> Yeah. >> But I want to ask you, go back to a technology question, is NoSQL the next Hadoop? Are these databases, it seems to be that the hot fad now is NoSQL, it can do anything. Is the promise overblown? >> I think, I mean NoSQL has been out almost as long as Hadoop, and I, we always say not only SQL, right? Mike's said this from day one, best tool for the job. Nothing is going to do every job well, so I think that there are, whether it's key value stores or other types of NoSQL engines, document DB's, now you have some of these DB's that are running on different chips, >> Graph, yeah. >> there's always, yeah, graph DBs, there's always going to be specialty things. I think one of the things about our analytic platform is we can do, time series is a great example. Vertica's a great time series database. We can compete with specialized time series databases. But we also offer a lot of, the other things that you can do with Vertica that you wouldn't be able to do on a database like that. 
So, I always think there's going to be specialty products, I also think some of these can do a lot more workloads than you might think, but I don't see as much around the NoSQL movement as say I did a few years ago. >> But so, and you mentioned the cloud before as kind of, your position on it I think is a tailwind, not to put words in your mouth, >> Yeah, yeah, it's a great tailwind. >> You're in the Amazon marketplace, I mean they have products that are competitive, right? >> They do, they do. >> But, so how are you differentiating there? >> I think the way we differentiate, whether it's Redshift from Amazon, or BigQuery from Google, or even what Azure DB does is, first of all, Vertica, I think, from a feature, functionality and performance standpoint, is ahead. Number one, I think the second thing, and we hear this from a lot of customers, especially at the C-level is they don't want to be locked into these full stacks of the clouds. Having the ability to take a product and run it across multiple clouds is a big thing, because the stack lock-in now, the full stack lock-in of these clouds is scary. It's really easy to develop in their ecosystems but you get very locked into them, and I think a lot of people are concerned about that. So that works really well for Vertica, but I think at the end of the day it's just, it's the robustness of the product, we continue to innovate, when you look at separating compute and storage, believe it or not, a lot of these cloud-native databases don't do that. And so we can actually leverage a lot of the cloud hardware better than the native cloud databases do themselves. So, like I said, we have to keep going, those guys aren't going to stop, and we actually have great relationships with those companies, we work really well with the clouds, they seem to care just as much about their cloud ecosystem as their own database products, and so I think that's going to continue as well. >> Well, Colin, congratulations on all the success >> Yeah, thank you, yeah. >> It's awesome to see you again and really appreciate you coming to >> Oh thank you, it's great, I appreciate the invite, >> MIT. >> it's great to be here. >> All right, keep it right there everybody, Paul and I will be back with our next guest from MIT, you're watching theCUBE. (electronic jingle)
SUMMARY :
Dave Vellante and Paul Gillin interview Colin Mahony, GM and Senior Vice President running Vertica at Micro Focus, at MIT CDOIQ 2019. Mahony argues that what people want from data hasn't changed even as Hadoop has gone from boom to bust: fast, accurate answers against all of their data, with public cloud object storage like S3 emerging as the new data lake. He describes how Vertica keeps reinventing itself, including Eon Mode, which separates compute and storage so customers can run cloud native on Amazon, Google, or Azure as well as on premise, and he credits Micro Focus with letting Vertica get its identity back while contributing engineering scale. He also discusses why data quality remains hard now that it is so cheap to pour data in, the fading NoSQL hype, and how Vertica differentiates from cloud-native warehouses such as Redshift and BigQuery by avoiding full-stack lock-in.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave | PERSON | 0.99+ |
Andy Palmer | PERSON | 0.99+ |
Paul Gillin | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Amazon | ORGANIZATION | 0.99+ |
Colin Mahoney | PERSON | 0.99+ |
Paul | PERSON | 0.99+ |
Colin | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
Chris Lynch | PERSON | 0.99+ |
HPE | ORGANIZATION | 0.99+ |
Michael Stonebraker | PERSON | 0.99+ |
HP | ORGANIZATION | 0.99+ |
Micro Focus | ORGANIZATION | 0.99+ |
Hadoop | TITLE | 0.99+ |
Colin Mahony | PERSON | 0.99+ |
last week | DATE | 0.99+ |
Andy | PERSON | 0.99+ |
March 30th | DATE | 0.99+ |
NoSQL | TITLE | 0.99+ |
Mike | PERSON | 0.99+ |
Experian | ORGANIZATION | 0.99+ |
tomorrow | DATE | 0.99+ |
SQL | TITLE | 0.99+ |
two day | QUANTITY | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
Boston | LOCATION | 0.99+ |
Cambridge, Massachusetts | LOCATION | 0.99+ |
4,000 engineers | QUANTITY | 0.99+ |
Two years ago | DATE | 0.99+ |
SUSE | TITLE | 0.99+ |
Azure DB | TITLE | 0.98+ |
second thing | QUANTITY | 0.98+ |
20 years ago | DATE | 0.98+ |
10 years ago | DATE | 0.98+ |
one | QUANTITY | 0.98+ |
Vertica | TITLE | 0.98+ |
Hortonworks | ORGANIZATION | 0.97+ |
MapReduce | TITLE | 0.97+ |
one thing | QUANTITY | 0.97+ |
Mandy Chessell, IBM | Dataworks Summit EU 2018
>> Announcer: From Berlin, Germany, it's the Cube covering Dataworks Summit Europe 2018. Brought to you by Hortonworks. (electronic music) >> Well hello, welcome to the Cube, I'm James Kobielus. I'm the lead analyst for big data analytics within the Wikibon team of SiliconANGLE Media. I'm hosting the Cube this week at Dataworks Summit 2018 in Berlin, Germany. It's been an excellent event. Hortonworks, the host, had... We've completed two days of keynotes. They made an announcement of the Data Steward Studio as the latest of their offerings and demonstrated it this morning, to address GDPR compliance, which of course is hot and heavy, coming down on enterprises both in the EU and around the world including in the U.S., and the May 25th deadline is fast approaching. One of Hortonworks' prime partners is IBM. And today on this Cube segment we have Mandy Chessell. Mandy is a distinguished engineer at IBM who did an excellent keynote yesterday all about metadata and metadata management. Mandy, great to have you. >> Hi and thank you. >> So I wonder if you can just reprise or summarize the main takeaways from your keynote yesterday on metadata and its role in GDPR compliance, and so forth, and the broader strategies that enterprise customers have regarding managing their data in this new multi-cloud world where Hadoop and open source platforms are critically important for storing and processing data. So Mandy go ahead. >> So, metadata's not new. I mean it's basically information about data. And a lot of companies are trying to build a data catalog, which is not a catalog, you know, actually containing their data, it's a catalog that describes their data. >> James: Is it different than an index or a glossary? How's the catalog different from-- >> Yeah, so catalog actually includes both. So it is a list of all the data sets plus links to glossary definitions of what those data items mean within the data sets, plus information about the lineage of the data. It includes information about who's using it, what they're using it for, how it should be governed. >> James: It's like a governance repository. >> So governance is part of it. So the governance part is really saying, "This is how you're allowed to use it," "this is how the data's classified," "these are the automated actions that are going to happen "on the data as it's used "within the operational environment." >> James: Yeah. >> So there's that aspect to it, but there is the collaboration side. Hey, I've been using this data set, it's great. Or, actually this data set is full of errors, we can't use it. So you've got feedback to data set owners as well as exchange and collaboration between data scientists working with the data. So it's really, it is a central resource for an organization that has a strong data strategy, is interested in becoming a data-driven organization as such, so, you know, this becomes their major catalog over their data assets, and how they're using it. So when a regulator comes in and says, "can you show us, show me that you're "managing personal data?" The data catalog will have the information about where personal data's located, what type of infrastructure it's sitting on, how it's being used by different services. So they can really show that they know what they're doing and then from that they can show how those processes use the metadata in order to use the data appropriately day to day.
So Apache Atlas, so it's basically a catalog, if I understand correctly, at least for IBM and Hortonworks, it's Hadoop, it's Apache Atlas, and Apache Atlas is essentially an open source metadata code base. >> Mandy: Yes, yes. >> So explain what Atlas is in this context. >> So yes, Atlas is a collection of code, but it supports a server, a graph-based metadata server. It also supports-- >> James: A graph-based >> Both: Metadata server >> Yes >> James: I'm sorry, so explain what you mean by graph-based in this context. >> Okay, so it runs using the JanusGraph graph repository. And this is very good for metadata 'cause if you think about what it is it's connecting dots. It's basically saying this data set means this value and needs to be classified in this way and this-- >> James: Like a semantic knowledge graph >> It is, yes actually. And on top of it we impose a type system that describes the different types of things you need to control and manage in a data catalog, but the graph, the Atlas component gives you that graph-based, sorry, graph-based repository underneath, but on top we've built what we call the open metadata and governance libraries. They run inside Atlas so when you run Atlas you will have all the open metadata interfaces, but you can also take those libraries and connect them and load them actually into another vendor's product. And what they're doing is allowing metadata to be exchanged between repositories of different types. And this becomes incredibly important as an organization increases their maturity and their use of data because you can't just have knowledge about data in a single server, it just doesn't scale. You need to get that knowledge into every runtime environment, into the data tools that people are using across the organization. And so it needs to be distributed. >> Mandy I'm wondering, the whole notion of what you catalog in that repository, does it include, or does Apache Atlas support adding metadata relevant to data derivative assets like machine learning models-- >> Mandy: Absolutely. >> So forth. >> Mandy: Absolutely, so we have base types in the upper metadata layer, but also it's a very flexible and extensible type system. So, if you've got a specialist machine learning model that needs additional information stored about it, that can easily be added to the runtime environment. And then it will be managed through the open metadata protocols as if it was part of the native type system. >> Because of course as an analyst, one of my core areas is artificial intelligence and one of the hot themes in artificial, well, there's a broad umbrella called AI safety. >> Mandy: Yeah. >> And one of the core subsets of that is something called explicable AI, being able to identify the lineage of a given algorithmic decision back to what machine learning models, fed from what data. >> Mandy: Yeah. >> Through what action, like when, let's say, a self-driving vehicle hits a human being, for legal, you know, discovery, whatever. So what I'm getting at, what I'm working through to is the extent to which the Hortonworks, IBM big data catalog running Atlas can be a foundation for explicable AI either now or in the future. I, as an analyst at least, see lots of enterprises that are exploring this topic, but it's not to the point where it's in production, explicable AI, but where clearly companies like IBM are exploring building a stack or an architecture for doing this kind of thing in a standardized way. What are your thoughts there?
Is IBM working on bringing, say, Atlas and the overall big data catalog into that kind of a use case? >> Yes, yeah, so if you think about what's required, you need to understand the data that was used to train the AI, what data's been fed to it since it was deployed because that's going to change its behavior, and then also a view of how that data's going to change in the future so you can start to anticipate issues that might arise from the model's changing behavior. And this is where the data catalog can actually associate and maintain information about the data that's being used with the algorithm. You can also associate the checking mechanism that's constantly monitoring the profile of the data so you can see where the data is changing over time, that will obviously affect the behavior of the machine learning model. So it's really about providing, not just information about the model itself, but also the data that's feeding it, how those characteristics are changing over time so that you know the model is continuing to work into the future. >> So tell us about the IBM, Hortonworks partnership on metadata and so forth. >> Mandy: Okay. >> How is that evolving? So, you know, your partnership is fairly tight. You clearly, you've got ODPI, you've got the work that you're doing related to the big data catalog. What can we expect to see in the near future in terms of, initiatives building on all of that for governance of big data in the multi-cloud environment? >> Yeah so Hortonworks started the Apache Atlas project a couple of years ago with a number of their customers. And they built a base repository and a set of APIs that allow it to work in the Hadoop environment. We came along last year, formed our partnership. That partnership includes this open metadata and governance layer. So since then we worked with ING as well and ING brings the, sort of, user perspective, this is the organization's use of the data. And, so between the three of us we are basically transforming Apache Atlas from a Hadoop-focused metadata repository to an enterprise-focused metadata repository. Plus enabling other vendors to connect into the open metadata ecosystem. So we're standardizing types, standardizing format, the format of metadata, there's a protocol for exchanging metadata between repositories. And this is all coming from that three-way partnership where you've got a consuming organization, you've got a company who's used to building enterprise middleware, and you've got Hortonworks with their knowledge of open source development in their Hadoop environment. >> Quick out of left field, as you develop this architecture, clearly you're leveraging Hadoop HDFS for storage. Are you looking at, or at least evaluating, maybe using blockchain for more distributed management of the metadata in these heterogeneous environments in the multi-cloud, or not? >> So Atlas itself does run on HDFS, but doesn't need to run on HDFS, it's got other storage environments so that we can run it outside of Hadoop. When it comes to blockchain, so blockchain is for sharing data between partners, small amounts of data that basically express agreements, so it's like a ledger. There are some aspects that we could use for metadata management. It's more that we actually need to put metadata management into blockchain. So the agreements and contracts that are stored in blockchain are only meaningful if we understand the data that's there, what its quality is, where it came from, what it means.
And so actually there's a very interesting distributed metadata question that comes with the blockchain technology. And I think that's an important area of research. >> Well Mandy we're at the end of our time. Thank you very much. We could go on and on. You're a true expert and it's great to have you on the Cube. >> Thank you for inviting me. >> So this is James Kobielus with Mandy Chessell of IBM. We are here this week in Berlin at Dataworks Summit 2018. It's a great event and we have some more interviews coming up so thank you very much for tuning in. (electronic music)
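For readers who want to make Chessell's description concrete, here is a minimal sketch of registering a custom type and cataloging an entity through Apache Atlas's v2 REST API. The endpoint paths follow Atlas's documented v2 interface; the host, credentials, type name, and attribute names are illustrative assumptions, not details taken from the interview.

```python
# A hedged sketch, assuming a default single-node Atlas install on port 21000
# with demo credentials. It registers a hypothetical "ML model" entity type
# (extending Atlas's built-in DataSet super-type so models participate in
# lineage) and then catalogs one model instance.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # assumption: default Atlas endpoint
AUTH = ("admin", "admin")                      # assumption: demo credentials

model_typedef = {
    "entityDefs": [{
        "name": "demo_ml_model",               # hypothetical type name
        "superTypes": ["DataSet"],
        "attributeDefs": [
            {"name": "algorithm", "typeName": "string",
             "isOptional": False, "cardinality": "SINGLE"},
            {"name": "trainingDataset", "typeName": "string",
             "isOptional": True, "cardinality": "SINGLE"},
        ],
    }]
}
requests.post(f"{ATLAS}/types/typedefs", json=model_typedef, auth=AUTH).raise_for_status()

# Catalog one model instance so analysts can find it and trace its lineage.
entity = {
    "entity": {
        "typeName": "demo_ml_model",
        "attributes": {
            "qualifiedName": "fraud_model@demo_cluster",  # Atlas's unique key
            "name": "fraud_model",
            "algorithm": "gradient_boosted_trees",
            "trainingDataset": "hdfs://demo/transactions/2018/q1",
        },
    }
}
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
print(resp.json())  # GUIDs Atlas assigned to the new entity
```

Because the type extends DataSet, the cataloged model can then be linked into lineage and governance classifications like any other data asset, which is the property the explicable AI discussion above depends on.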
SUMMARY :
James Kobielus interviews Mandy Chessell, distinguished engineer at IBM, at Dataworks Summit 2018 in Berlin. Chessell explains that a data catalog describes an organization's data: glossary definitions, lineage, usage, governance classifications, and collaboration around data sets, so that when a regulator asks, the organization can show it is managing personal data under GDPR. She describes Apache Atlas as a graph-based metadata server built on JanusGraph with a flexible, extensible type system, plus open metadata and governance libraries that let metadata be exchanged with other vendors' repositories. The three-way partnership among IBM, Hortonworks, and ING is transforming Atlas from a Hadoop-focused into an enterprise-focused metadata repository, with applications ranging from explicable AI to, eventually, metadata for blockchain-stored agreements.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
James Kobielus | PERSON | 0.99+ |
Mandy Chessell | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
ING | ORGANIZATION | 0.99+ |
James | PERSON | 0.99+ |
three | QUANTITY | 0.99+ |
Berlin | LOCATION | 0.99+ |
Mandy | PERSON | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
May 25th | DATE | 0.99+ |
last year | DATE | 0.99+ |
U.S. | LOCATION | 0.99+ |
two days | QUANTITY | 0.99+ |
Atlas | TITLE | 0.99+ |
yesterday | DATE | 0.99+ |
Berlin, Germany | LOCATION | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
Data Steward Studio | ORGANIZATION | 0.99+ |
both | QUANTITY | 0.99+ |
Both | QUANTITY | 0.98+ |
EU | LOCATION | 0.98+ |
GDPR | TITLE | 0.98+ |
One | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
Dataworks Summit 2018 | EVENT | 0.97+ |
Dataworks Summit EU 2018 | EVENT | 0.96+ |
this week | DATE | 0.94+ |
single server | QUANTITY | 0.94+ |
Hadoop | TITLE | 0.94+ |
today | DATE | 0.93+ |
this morning | DATE | 0.93+ |
three-way partnership | QUANTITY | 0.93+ |
Wikibon | ORGANIZATION | 0.91+ |
Hortonworks' | ORGANIZATION | 0.9+ |
Atlas | ORGANIZATION | 0.89+ |
Dataworks Summit Europe 2018 | EVENT | 0.89+ |
couple of years ago | DATE | 0.87+ |
Apache Atlas | TITLE | 0.86+ |
Cube | COMMERCIAL_ITEM | 0.83+ |
Apache | ORGANIZATION | 0.82+ |
JanusGraph | TITLE | 0.79+ |
hot themes | QUANTITY | 0.68+ |
Hadoop HDFS | TITLE | 0.63+ |
Jagane Sundar, WANdisco | AWS Summit SF 2018
>> Voiceover: Live from the Moscone Center, it's theCUBE. Covering AWS Summit San Francisco 2018. Brought to you by Amazon Web Services. >> Welcome back, I'm Stu Miniman and this is theCUBE's exclusive coverage of AWS Summit here in San Francisco. Happy to welcome back to the program Jagane Sundar, who is the CTO of WANdisco. Jagane, great to see you, how have you been? >> Well, been great Stu, thanks for having me. >> All right so, every show we go to now, data really is at the center of it, you know. I'm an infrastructure guy, you know, data is so much of the discussion here, here in the cloud, in the keynotes, they were talking about it. IOT of course, data is so much involved in it. We've watched WANdisco from the days that we were talking about big data. Now it's you know, there's AI, there's ML. Data's involved, but tell us what is WANdisco's position in the marketplace today, and the updated role of data? >> So, we have this notion, this brand new industry segment called live data. Now this is more than just itty-bitty data or big data, in fact this is cloud-scale data located in multiple regions around the world and changing all the time. So you have East Coast data centers with data, West Coast data centers with data, European data centers with data, all of this is changing at the same time. Yet, your need for analytics and business intelligence based on that is across the board. You want your analytics to be consistent with the data from all these locations. That, in a sense, is the live data problem. >> Okay, I think I understand it but, you know, we're not talking about like, in the storage world there was like hot data, what's hot and cold data. And we talked about real-time data for streaming data and everything like that. But how do you compare and contrast, you know, you said global in scope, talked about multi-region, really talking distributed. From an architectural standpoint, what's enabling that to be kind of the discussion today? Is it the likes of Amazon and their global reach? And where does WANdisco fit into the picture? >> So Amazon's clearly a factor in this. The fact that you can start up a virtual machine in any part of the world in a matter of minutes and have data accessible to that VM in an instant changes the business of globally accessible data. You're not simply talking about a primary data center and a disaster recovery data center anymore. You have multiple data centers, the data's changing in all those places, and you want analytics on all of the data, not part of the data, not on the primary data center, how do you accomplish that, that's the challenge. >> Yeah, so drill into it a little bit for us. Is this a replication technology? Is this just a service that I can spin up? When you say live, can I turn it off? How do those kinds of things work when I think about all the cloud dynamics and levers? >> So it is indeed based on active-active replication, using a mathematically strong algorithm called Paxos. In a minute, I'll contrast that with other replication technologies, but the essence of this is that by using this replication technology as a service, so if you are going up to Amazon's web services and you're purchasing some analytics engine, be it Hive or Redshift or any analytics engine, and you want to have that be accessible from multiple data centers, be available in the face of data center or entire region failure, and the data should be accessible, then you go with our live data platform. >> Yeah so, we want you to compare and contrast.
What I think about, you know, I hear active-active, speed of light's always a challenge. You know globally, you have inconsistency, it's challenging, there's things like Google Spanner out there to look at those. You know, how does this fit compared to the way we've thought of things like replication and globally distributed systems in the past? >> Interesting question. So, ours is great for analytics applications, but something like Google Spanner is more like a MySQL database replacement that runs in multiple data centers. We don't cater to that database-transaction type of application. We cater to analytics applications: batch, very fast streaming applications, enterprise data warehouse-type analytics applications, all of those. Now if you take a look inside and see what kind of replication technology will be used, you'll find that we're better than the other two different types. There are two different types of existing replication technologies. One is log shipping. The traditional Oracle, GoldenGate-type: ship the log once the change is made to the primary. The second is, take a snapshot and copy differences between snapshots. Both have their deficiencies. Snapshot of course is time-based, and it happens once in a while. You'll be lucky if you can get one day RPO with those sorts of things. Also, there's an interesting anecdote that comes to mind when I say that, because the Hadoop folks, in their HDFS, implemented a version of snapshot and snapdiff. The unfortunate truth is that it was engineered such that, if you have a lot of changes happening, the snapshot and snapdiff code might consume too much memory and bring down your NameNode. That's undesirable, now your backup facility just brought down your main data capability. So snapshot has its deficiencies. Log shipping is always active/passive. Contrast that with our technology of live data, where you can have multiple data centers filled with data. You can write your data to any of these data centers. It makes for a much more capable system. >> Okay, can you explain, how does this fit with AWS and can it live in multi-clouds, what about on-premises, the whole you know, multi and hybrid cloud discussion? >> Interesting, so the answer is yes. It can live in multiple regions within the same cloud, multiple regions within different clouds. It'll also bridge data that exists on your on-prem Hadoop or other big data systems, or object store systems in the cloud: S3 or Azure, or any of the blob stores available in the cloud. And when I say this, I mean in a live data fashion. That means you can write to your on-prem storage, you can also write to your cloud buckets at the same time. We'll keep it consistent and replicated. >> Yeah, what are you hearing from customers when it comes to where their data lives? I know last time I interviewed David Richards, your CEO, he said the data lakes really used to be on premises, now there's a massive shift moving to the public clouds. Is that continuing, what's kind of the breakdown, what are you hearing from customers? >> So I cannot name a single customer of ours who is not thinking about the cloud. Every one of them has a presence on premise. They're looking to grow in the cloud. On-prem does not appear to be on a growth path for them. They're looking at growing in the cloud, they're looking at bursting into the cloud, and they're almost all looking at multi-cloud as well. That's been our experience. >> At the beginning of the conversation we talked about data.
How are customers doing, you know, exploiting and leveraging data, or making sure that they aren't having data become a liability for them? >> So there are so many interesting use cases I'd love to talk about, but the one that jumps out at me is a major auto manufacturer. Telematics data coming in from a huge number, hundreds of thousands, of cars on the road. They chose to use our technology because they can feed their West Coast car telematics into their West Coast data center, while simultaneously writing East Coast car data into the East Coast data center. We do the replication, we build the live data platform for them, they run their standard analytics applications, be it Hadoop-sourced or some other analytics applications, they get consistent answers. Whether you run the analytics application on the East Coast or the West Coast, you will get the same exact answer. That is very valuable because if you are doing things like fault detection, you really don't want spurious detection because the data on the West Coast was not quite consistent and your analytics application was led astray. That's a great example. We also have another example with a top three bank that has a regulatory concern where they need to operate out of their backup data centers, so-called backup data centers, once every three months or so. Now with live data, there is no notion of active data center and backup data center. All data centers are active, so this particular regulatory requirement is extremely simple for them to implement. They just run their queries on one of the other data centers and prove to the regulators that their data is indeed live. I could go on and on about a number of these. We also have a top two retailer who has got such a volume of data that they cannot manage it in one Hadoop cluster. They use our technology to create the live data data lake. >> One of the challenges always, customers love the idea of global but governance, compliance, things like GDPR pop up. Does that play into your world? Or is that a bit outside of what WANdisco sees? >> It actually turns out to be an important consideration for us because if you think about it, when we replicate, the data flows through us. So we can be very careful about not replicating data that is not supposed to be replicated. We can also be very careful about making sure that the data is available in multiple regions within the same country if that is the requirement. So GDPR does play a big role in the reason why many of our customers, particularly in the financial industry, end up purchasing our software. >> Okay, so this new term live data, are there any other partners of yours that are involved in this? As always, you want like a bit of an ecosystem to help build out a wave. >> So our most important partners are the cloud vendors. And they're multi-region by nature. There is no idea of a single data center or a single region cloud, so Microsoft, Amazon with AWS, these are all important partners of ours, and they're promoting our live data platform as part of their strategy of building huge hybrid data lakes. >> All right, Jagane give us a little view looking forward. What should we expect to see with live data and WANdisco through the rest of 2018?
>> Looking forward, we expect to see our footprint grow in terms of dealing with a variety of applications, all the way from batch Pig scripts that used to run once a day, to Hive that's maybe once every 15 minutes, to data warehouses that are almost instant and queryable by human beings, to streaming data that pours things into Kafka. We see the whole footprint of analytics databases growing. We see cross-capability, meaning perhaps an Amazon Redshift to an Azure SQL EDW replication. Those things are very interesting to us, to our customers, because some of them have strengths in certain areas and others have strengths in other areas. Customers want to exploit both of those. So we see ourselves as being the glue for all world-scale analytics applications. >> All right well, Jagane, I appreciate you sharing with us everything that's happening at WANdisco. This new idea of live data, we look forward to catching up with you and the team in the future and hearing more about the customers and everything on there. We'll be back with lots more coverage here from AWS Summit here in San Francisco. I'm Stu Miniman, you're watching theCUBE. (electronic music)
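To make the active-active idea above concrete, here is a toy sketch of why agreeing on a single global order of updates lets every data center accept writes and still converge on identical data. It is only an illustration: a real system such as WANdisco's reaches agreement with a fault-tolerant consensus protocol like Paxos running across wide-area networks, whereas the single in-process sequencer below merely stands in for that machinery.

```python
# Toy illustration of totally ordered, active-active replication. Every
# replica applies the same operations in the same agreed order, so all
# copies converge no matter which data center accepted each write. This is
# NOT a fault-tolerant consensus implementation.
import itertools

class Sequencer:
    """Stand-in for consensus: assigns one global sequence number per write."""
    def __init__(self):
        self._counter = itertools.count()
    def order(self, op):
        return (next(self._counter), op)

class DataCenter:
    def __init__(self, name):
        self.name = name
        self.log = []    # (seq, op) pairs, applied in sequence order
        self.store = {}
    def apply(self, seq_op):
        self.log.append(seq_op)
        _, (key, value) = seq_op
        self.store[key] = value

sequencer = Sequencer()
east, west = DataCenter("us-east"), DataCenter("us-west")
replicas = [east, west]

def write(origin, key, value):
    """Any data center may accept a write; all replicas apply it in order."""
    seq_op = sequencer.order((key, value))
    for replica in replicas:   # in reality, delivered by the WAN protocol
        replica.apply(seq_op)

# Writes land on both coasts concurrently, yet the replicas converge.
write(east, "car-42:odometer", 10150)
write(west, "car-97:odometer", 53021)
write(west, "car-42:odometer", 10162)
assert east.store == west.store
print(east.store)
```

Log shipping, by contrast, fixes one replica as the writer, which is why Sundar describes it as always active/passive: remove the agreed global order and concurrent writers would diverge.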
SUMMARY :
Stu Miniman interviews Jagane Sundar, CTO of WANdisco, at AWS Summit San Francisco 2018. Sundar defines "live data" as cloud-scale data located in multiple regions and changing all the time, where analytics must stay consistent across every location. WANdisco's platform is built on Paxos-based active-active replication, which he contrasts with log shipping (always active/passive) and snapshot-diff approaches, and it spans multiple regions in one cloud, multiple clouds, and on-premises Hadoop and object storage. Customer examples include an auto manufacturer replicating car telematics between coasts, a bank showing regulators that every data center is active, and a retailer whose data exceeds one Hadoop cluster. Looking forward, he sees WANdisco as the glue for world-scale analytics, from batch Pig and Hive jobs to streaming and cross-cloud data warehouse replication.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Amazon | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Amazon Web Services | ORGANIZATION | 0.99+ |
David Richards | PERSON | 0.99+ |
Jagane | PERSON | 0.99+ |
San Francisco | LOCATION | 0.99+ |
Jagane Sundar | PERSON | 0.99+ |
Stu Miniman | PERSON | 0.99+ |
WANdisco | ORGANIZATION | 0.99+ |
GDPR | TITLE | 0.99+ |
Stu | PERSON | 0.99+ |
One | QUANTITY | 0.99+ |
East Coast | LOCATION | 0.99+ |
Both | QUANTITY | 0.99+ |
second | QUANTITY | 0.99+ |
two | QUANTITY | 0.98+ |
MySQL | TITLE | 0.98+ |
West Coast | LOCATION | 0.98+ |
two different types | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
both | QUANTITY | 0.98+ |
one day | QUANTITY | 0.98+ |
Kafka | TITLE | 0.98+ |
S3 | TITLE | 0.97+ |
Moscone Center | LOCATION | 0.97+ |
Oracle | ORGANIZATION | 0.96+ |
once a day | QUANTITY | 0.95+ |
Google Spanner | TITLE | 0.95+ |
single data center | QUANTITY | 0.95+ |
NameNode | TITLE | 0.94+ |
hundreds of thousands | QUANTITY | 0.94+ |
today | DATE | 0.93+ |
theCUBE | ORGANIZATION | 0.92+ |
Azure | TITLE | 0.91+ |
WANdisco | TITLE | 0.9+ |
snapdiff | TITLE | 0.89+ |
SQL EDW | TITLE | 0.89+ |
Redshift | TITLE | 0.88+ |
single customer | QUANTITY | 0.87+ |
AWS Summit | EVENT | 0.87+ |
AWS Summit San Francisco 2018 | EVENT | 0.86+ |
single region | QUANTITY | 0.85+ |
2018 | DATE | 0.84+ |
snapshot | TITLE | 0.81+ |
Jagane | ORGANIZATION | 0.76+ |
three bank | QUANTITY | 0.74+ |
once every 15 minutes | QUANTITY | 0.73+ |
European | LOCATION | 0.73+ |
AWS Summit SF 2018 | EVENT | 0.71+ |
once | QUANTITY | 0.7+ |
Cloud | TITLE | 0.65+ |
every three months | QUANTITY | 0.64+ |
GoldenGate | ORGANIZATION | 0.57+ |
of cars | QUANTITY | 0.55+ |
minute | QUANTITY | 0.53+ |
Paxos | ORGANIZATION | 0.53+ |
HDFS | TITLE | 0.53+ |
Hive | TITLE | 0.49+ |
Hadoop | ORGANIZATION | 0.41+ |
BLOB | TITLE | 0.4+ |
Jeff Veis, Actian | BigData NYC 2017
>> Live from Midtown Manhattan, it's the Cube. Covering big data, New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay welcome back everyone, live here in New York City it's the Cube special annual presentation of BIGDATA NYC. This is our annual event in New York City where we talk to all the thought leaders and experts, CEOs, entrepreneurs and anyone shaping the agenda with the Cube. In conjunction with STRATA DATA, which was formerly called STRATA HADOOP, Hadoop World, the Cube's NYC event. BIGDATA NYC we keep separate from that when we're here. With us is Jeff Veis, the chief marketing officer of Actian and a Cube alumni. Formerly with HPE, been on many times. Good to see you. >> Good to see you. >> Well you're a marketing genius, we've talked before at HPE. You got so much experience in data and analytics, you've seen the swath of the spectrum across the board, from what I call classic enterprise to cutting edge. To now full on cloud, AI, machine learning, IOT. Lot of stuff going on, on premise seems to be hot still. There's so much going on from the large enterprises dealing with how to better use their analytics. At Actian you're heading up marketing, what's the positioning? What're you doing there? >> Well the shift that we see and what's unique about Actian, which has just a very differentiated and robust portfolio, is the shift to what we refer to as hybrid data. And it's a shift that people aren't talking about, most of the competition here. They have that next best mouse trap, that one thing. So it's either move your database to the cloud or buy this appliance or move to this piece of open source. And it's not that they don't have interesting technologies but I think they're missing the key point. Which is never before have we seen the creation side of data and the consumption of data becoming more diverse, more dynamic. >> And more in demand too, people want both sides. Before we go any deeper I just want you to take a minute to define what hybrid data actually means. What does that term mean for the people that want to understand this term deeper? >> Well it's understanding that it's not just the location of it. Of course there's hybrid computing, which is on-premises and cloud. And that's an important part of it. But it's also about where and how that data is created. What time domain is that data going to be consumed and used, and that's so important. A lot of analytics, a lot of the guys across the street are kind of thinking about reporting and analytics in that old world way of: we collect lots of data and then we deliver analytics. But increasingly analytics is being used almost in real time or near real time. Because people are doing things with the data in the moment. Then another dimension of it is ad hoc discovery. Where you can have not one or two or three data scientists but dozens if not hundreds of people. All with copies of Tableau and Qlik attacking and hitting that data. And of course it's not one data source but multiple as they find adjacencies with data. A lot of the data may be outside of the four walls. So when you look at consumption and creation of data the net net is you need not one solution but a collection of best fits. >> So a hybrid between consumption and creation so that's the two hybrids. I mean hybrid implies, you know little bit of this little bit of that. >> That's the bridge that you need to be able to cross. Which is where do I get that data? And then where's that data going? >> Great so let's get into Actian.
Give us the update, obviously Actian has got a huge portfolio. We've covered you guys, you know, best. Been on the Cube many times. They've cobbled together all these solutions that can be very effective for customers. Take us through the value proposition that this hybrid data enables with Actian. >> Well if you decompose it from our viewpoint there are three pillars. They've kind of stood the test of time, in one sense. They're critical: the ability to manage the data, the ability to connect the data. In the old days we said integrate, but now I think basically all apps, all kinds of data sources are connected in some sense. Sometimes very temporal. And then finally the analytics. So you need those three pillars and you need to be able to orchestrate across them. And what we have is a collection of solutions that span that. They can do transactional data, they can do graph data and object oriented data. Today we're announcing a new generation of our analytics, specifically on Hadoop. And that's Vector H. Love to be able to talk to that today with the native Spark integration. >> Let's get into the news. Hard news here at BIGDATA NYC is you guys announced the latest support for Apache Spark with Vector H. So Actian Vector in Hadoop, hence the H. What is it? >> Is Spark glue for hybrid data environments or is it something you layer over different underlying databases? >> Well I think it's fair to say it is becoming the glue. In fact we had a previous technology that did a yeoman's job at doing some of the work. Now we have Spark and that community. The thing though is if you wanted to take advantage of Spark it was kind of like the old days of Hadoop. Assembly was required, and that is increasingly not what organizations are looking for. They want to adopt the technology but they want to use it and get on with their day job. What we have done... >> Machine learning, putting algorithms in place, managing software. >> It could be very exotic things such as predictive machine learning. Next generation AI. But for every one of those there's easily a dozen if not a hundred uses of being able to reach and extract data in their native formats. Being able to grab a Parquet file and, without any transformation, analyze it. Or being able to talk to an application and being able to interface with that, being able to do reads and writes with zero penalty. So the ACID compliance component of databases is critical, and a lot of the traditional Hadoop approaches are pretty much read-only vehicles. And that meant they were limited in the use cases they could address. >> Let's talk about the hard news. What specifically was announced? >> Well we have a technology called Vector. Vector, just to establish the baseline here, runs single node, Windows, Linux, and there's a community edition. So your users can download and use that right now. We have Vector H which was designed for scale-out on Hadoop and it takes advantage of YARN. And allows you to scale out across your Hadoop cluster, petabytes if you like. What we've added to that solution is now native Spark integration and that native Spark integration gives you three key things. Number one, zero penalty for real time updates. We're the only ones to the best of our knowledge that can do that. In other words you can update the data and you will not slow down your analytics performance. Every other Hadoop-based analytic tool has to, if you will, stop the clock and flush out the new data to be able to do updates.
Because of our architecture and our deep knowledge with transactional processing you don't slow down. That means you can always be assured you'll have fresh data running. The second thing is Spark-powered direct query access. So we can get at not just Vector formats, we have an optimized data format, which is the fastest you'd find in analytic databases, but what's so important is you can hit ORC, Parquet and other data file formats through Spark and without any transformation, be it to ingest or analyze information. The third one and certainly not the least is something that I think you're going to be talking a lot more about. Which is native Spark data frame support. Data frames. >> What's the impact of that? >> Well data frames will allow you to be able to talk to Spark SQL and SparkR-based applications. So now you're not just going to the data, you're going to other applications. And that means that you're able to interface directly to the system of record applications that are running. Using this lingua franca of data frames that now has hit a maturity point where you're seeing pretty broad adoption. And by doing native integration with that we've just simplified the ability to connect directly to dozens of enterprise applications and get the information you need. >> Jeff, would you describe what you're offering now as a form of, sort of, a data virtualization layer that sits in front of all these back end databases, but uses data frames from Spark, or am I misconstruing? >> Well it's a little less a virtualization layer and maybe more a superhighway. That we're able to say this analytics tool... You know in the old days it was one of two things. Either you had to do a formal traditional integration and transform that data, right? You had to go from French to German, once it was in German you could read it. Or what you had to do was you had to be able to query and bring in that information. But you had to slow down your performance because that transformation had not occurred. Now what we're able to do is use this Spark native connector. So you can have the best of both worlds and if you will, it is creating an abstraction layer but it's really for connectivity as opposed to an overall one. What we're not doing is virtualizing the data. That's the key point, there are some people that are pushing data cataloging and cleansing products and abstracting the entire data from you. You're still aware of where the native format is, you're still able to write to it with zero penalty. And that's critical for performance. When you start to build lots of abstraction layers, truly traditional ones, you simplify some things but usually you pay a performance penalty. And just to make a point, in the benchmarks we're running compared to Hive and Polor for example, use cases that may take nearly two hours, with Vector H we can do in less than two minutes. And we've been able to uphold that for over a year. That is because Vector in its core technology has columnar capabilities and, this is a mouthful, but multi-level in-memory capability. And what does that mean? You ask. >> I was going to ask but keep going. >> I can imagine the performance latency is probably great. I mean you have in memory that everyone kind of wants. >> Well a lot of in-memory, where it is used, is just held at the RAM level. And it's the ability to read data in RAM and take advantage of it. And we do that and of course that's a positive but we go down to the cache level.
We get down much, much lower, because we would rather that data be in the CPU if at all possible. And with these high performance cores it's quite possible. So we have some tricks that are special and unique to Vector so that we actually optimize the in-memory capability. The other last thing we do is, you know, Hadoop and HDFS are not particularly smart about where they place the data. And the last thing you want is your data rolling across lots of different data nodes. That just kills performance. What we're able to do is think about the co-location of the data. Look at the jobs and look at the performance, and we're able to squeeze optimization in there. And that's how we're able to get 50, 100, sometimes in excess of 500 times faster than some of the other well known SQL on Hadoop performers. So that, combined now with this native Spark integration, means people don't have to do the plumbing. They can get out of the basement and up to the first floor. They can take advantage of open source innovation, yet get what we're claiming is the fastest analytics database in Hadoop. >> So, I got to ask you. I mean you've been, and I mentioned in the intro, an industry veteran. CMO, chief marketing officer. I mean it's challenging with Actian 'cause there's so many things to focus on. How are you attacking the marketing of Actian? Because you have a portfolio, and hybrid data is a good position. I like how you bring that to the forefront and kind of give it a simple positioning. But as you look at Actian's value proposition and engage your customer base and potentially prospective customers, how are you iterating the marketing message, the positioning, and engaging with clients? >> Well it's a fair question, and it is daunting when you have multiple products. And you've got to have a simple compelling message; less is more to get signal above noise today. At least that's how I feel. So we're hanging our hats on hybrid data. And we're going to take it to the moon or go down with the ship on that. But we've been getting some pretty good feedback. >> What's been the feedback on hybrid data? Because I'm a big fan of hybrid cloud, but I've been saying it's a methodology, it's not a product. On premise cloud is growing and so is public, so hybrid hangs together in the cloud thing. So with data, you're bridging two worlds. Consumption and creation. >> Well, what's interesting, when you say hybrid data, people put their own definitions around it. In an unaided way they say, you know, with all the technology and all the trends, that's actually, at the end of the day, what nets out my situation. I do have data that's hybrid data and it's becoming increasingly more hybrid. And god knows the people that are demanding and wanting to use it aren't able to use it. And the last thing I need, and I'm really convinced of this, is a lot of people talk about platforms, we love to use the P word. Nobody buys a platform, because people are trying to address their use cases. But they don't want to do it in this siloed, kind of brick wall way, where I address one use case but it won't function elsewhere. What they are looking for is a collection of best fit solutions that can cooperate together. The secret sauce for us is we have a cloud control plane. All our technologies, whether on premise or in the cloud, touch that. And it allows us to orchestrate and do things together. Sometimes it's very intimate and sometimes it's broader. >> What exactly is the control plane?
>> It does everything from administration, it can go down to billing, and it can also handle scheduling and transactional performance. Now on one extreme we use it for backup recovery for our transactional database. And we have a cloud based backup recovery service, and it all gets administered through the control plane. So it knows exactly when it's appropriate to back up, because it understands that database, and it takes care of it. It was relatively simple for us to create. In a more intimate sense, we were the first company, and it was called Actian X, which I know we were talking about before. We named our product after X before our friends at Apple did. So I like to think we were pioneers. >> San Francisco had the iPhone, don't get confused there, remember. >> I got to give credit where credit's due. >> And give it up. >> But what Actian X is, and we announced it back in April, is it takes the same Vector technology I just talked about. So it's material, and we combined it with our integrated transactional database, which has over 10,000 users around the world. And what we did is we dropped in this high performance columnar database for free. I'm going to say that again, for free, in our transactional platform system. So every one of our customers, as soon as they upgraded to Actian X, got a rocket ship of a columnar high performance database inside their transactional database. The data is fresh, it moves over into the columnar format. And the reporting takes off.
In the real world today, where you can have real concurrency, real enterprise grade performance, along with the innovation. >> And the hybrid gives them some flexibility. That's the new tag line, that's kind of the main thing. I understand that currently hybrid data means basically flexibility for the customer. >> Yeah, it's use the data you need for what you use it for, and have the systems work for you, rather than you work for the systems. >> Okay, check out Actian. Jeff Veis, friend of theCUBE, alumni now, the CMO at Actian. We're following your progress, so congratulations on the new opportunity. More CUBE coverage after this short break. I'm John Furrier, with James Kobielus, here inside theCUBE in New York City for our BigData NYC event all week, in conjunction with Strata Data right next door. We'll be right back. (tech music)
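For readers who want to see the kind of Spark access Jeff describes, the sketch below uses plain PySpark to read Parquet and ORC files in place and expose them through the data frame interface he calls a lingua franca. This is generic Spark usage with hypothetical paths, not Actian's actual Vector H connector API, which the interview does not show.

```python
# A minimal, generic PySpark sketch of the capabilities described above:
# hitting columnar file formats directly and registering them as DataFrames.
from pyspark.sql import SparkSession

# In a real Vector H deployment this would run on a YARN-managed Hadoop
# cluster rather than locally; the paths below are hypothetical.
spark = SparkSession.builder.appName("vectorh-style-access").getOrCreate()

# Spark reads Parquet and ORC in place, with no transformation step.
orders = spark.read.parquet("hdfs:///lake/orders.parquet")
events = spark.read.orc("hdfs:///lake/events.orc")
events.printSchema()  # inspecting the ORC data required no conversion

# Registered as a view, the data is reachable from Spark SQL or SparkR.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS n FROM orders GROUP BY status").show()
```

The value of the data frame layer is that Spark SQL and SparkR applications can query the same registered view without caring what file format sits underneath.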
Adam Wilson & Joe Hellerstein, Trifacta - Big Data SV 17 - #BigDataSV - #theCUBE
>> Commentator: Live from San Jose, California. It's theCUBE covering Big Data Silicon Valley 2017. >> Okay, welcome back everyone. We are here live in Silicon Valley for Big Data SV, (mumbles) event in conjunction with Strata + Hadoop, our companion event to Big Data NYC, and we're here breaking down the Big Data world as it evolves and goes to the next level up on the step function: AI, machine learning, IoT really forcing people to focus on a clear line of sight on the data. I'm John Furrier with our analyst from Wikibon, George Gilbert, and our next guests, two executives from Trifacta: the co-founder and Chief Strategy Officer, Joe Hellerstein, and Adam Wilson, the CEO. Guys, welcome to theCUBE. Welcome back. >> Great to be here. >> Good to be here. >> Founder, co-founder? >> Co-founder. >> Co-founder. He's one of multiple co-founders. I remember it 'cause you guys were one of the first sites that have the (mumbles) in the about section on all the management team. Just to show you how technical you guys are. Welcome back. >> And if you're Trifacta, you have to have three founders, right? So that's part of the tri, right? >> The triple threat, so to speak. Okay, so a big year for you guys. Give us the update. I mean, also we had Alation announce this partnership going on and some product movement. >> Yup. >> But it's a turbulent time right now. You have a lot of things happening in multiple theaters, the technical theater and the business theater. And also within the customer base. It's a land grab, it seems, on the metadata and who's going to control what. What's happening? What's going on in the marketplace and what's the update from you guys? >> Yeah, yeah. Last year was an absolutely spectacular year for Trifacta. It was four times growth in bookings, three times growth in customers. You know, it's been really exciting for us to see the technology get in the hands of some of the largest companies on the planet and to see what they're able to do with it. From the very beginning, we really believed in this idea of self service and democratization. We recognize that the wrangling of the data is often where a lot of the time and the effort goes. In fact, up to 80% of the time and effort goes there in a lot of these analytic projects, and to the extent that we can help take the data from (mumbles) in a more productive way and allow more people in an organization to do that, that's going to create information agility that we feel really good about, and our customers are telling us it's having an impact on their use of Big Data and Hadoop. And I think you're seeing that transition where, you know, in the very beginning there was a lot of offloading, a lot of like, hey, we're going to grab some cost savings, but then at some point, people scratched their heads and said, well, wait a minute. What about the strategic asset that we were building? That was going to change the way people work with the data. Where is that piece of it? And I think as people started figuring out, in order to get our (mumbles), we've got to have users and use cases on these clusters, and the data lake itself is not a use case. Tools like Trifacta have been absolutely instrumental in really fueling that maturity in the market, and we feel great about what's happening there. >> I want to drill down some more before we get to some of these questions for Joe too, because I think you mentioned, you got some quotes. I just want to double-click on that.
It always comes up in the business model question for people. What's your business model? >> Sure. >> And doing democratization is really hard. Sometimes democratization doesn't appear until years later, so it's one of those elusive things. You see it and you believe it, but then making it happen are two different things. >> Yeah, sure. >> So. And appreciate that the vision they-- (mumbles) But ultimately, at the end of the day, that business model comes down to how you're organized. Proof points. >> Yup. >> Customers, partnerships. >> Yeah. >> We had Alation on, Stephanie (mumbles). Can you just share and connect the dots on the business model? >> Sure. >> With respect to the product, customers, partners. How is that specifically evolving? >> Adam: Sure. >> Give some examples. >> Sure, yeah. And I would say kind of-- we felt from the beginning that, you know, we wanted to turn what was traditionally a very complex, messy problem dealing with data into, you know, a user experience problem that was powered by machine learning, and so a lot of it was down to, you know, how we were going to build and architect the technology needed (mumbles) for really getting the power in the hands of the people who know the data best. But it's important, and I think this is often lost in Silicon Valley, where the focus on innovation is all around technology, to recognize that the business model also has to support democratization, so one of the first things we did coming in was to release a free version of the product. So Trifacta Wrangler, that is now being used by over 4500 companies, tens of thousands of users, and the power of that in terms of getting people something of value that they could start using right away on spreadsheets and files and small data and allowing them to get value, but then also, for us, the exchange is that we're actually getting a chance to curate at scale usage data across all of these-- >> Is this a (mumbles) product? >> It's a hybrid product. >> Okay. >> So the data stays local. It never leaves their local laptop. The metadata is hashed and put into the cloud and now we're-- >> (mumbles) to that. >> Absolutely. And so now we can use that as training data, so that as more people wrangle, the product itself gets smarter based on that. >> That's good. >> So that's creating real tangible value for customers, and for us it's a source of very strategic advantage, and so we think that combination of the technology innovation but also making sure that we can get this in the hands of users and they can get going, and as their problem grows to be bigger and more complicated, not just spreadsheets and files on the desktop but something more complicated, then we're right there along with them with products built for that. >> How about partnerships with Alation? How do they (mumbles)? What are all the deals you got going on there? >> So Alation has been a great partner for us for a while, and we've really deepened the integration with the announcements today. We think that cataloging and data wrangling are very complementary and they're a natural fit. We've got customers like Munich Re, like eBay, as well as MarketShare that are using both solutions in concert with one another, and so we really felt that it was natural to tighten that coupling and to help people go from inventorying what's going on in their data lakes and their clusters to then cleansing, standardizing, essentially making it fit for purpose, and then ensuring that metadata can roundtrip back into the catalog.
And so that's really been an extension of what we're doing also at the technical level with technologies like Cloudera Navigator, with Atlas, and with the project that Joe's involved with at Berkeley called Ground. So I don't know if you want to talk-- >> Yeah, tell him about Ground. >> Sure. So part of our outlook on this, and this speaks to the way that the landscape in the industry is shaping up, is that we're not going to see customers buying in to lock-in on the key components of the stack (mumbles). So for example, storage: HDFS. This is open, and that's key, I think, for all the players in this space, that HDFS is not a product from a storage vendor. It's an open platform, and you can change vendors along the way and you can roll your own and so on. So metadata, to my mind, is going to move in the same direction. The storage of metadata, the basic componentry that keeps the metadata, that's got to be open to give people the confidence that they're going to pour the basic descriptions of what's in their business and what their people are doing into a place that they know they can count on and that will be vendor neutral. So the catalog vendors are, in my mind, providing functionality above that basic storage that relates to how do you search the catalog, what does the catalog do for you to suggest things, to suggest data sets that you should be looking at. So that's value added on top, but below that, what we're seeing is Hortonworks and Cloudera coming out with products or open source in the metadata space, and what would be a shame is if the two vendors ended up kind of pointing guns inward and kind of killing the metadata storage. So one of the things that I got interested in, in my dual role as a professor at Berkeley and also as a founder of a company in this space, was we want to ensure that there's a free, open, vendor neutral metadata solution. So we began building out a project called Ground, which is both a platform for metadata storage that can be sitting underneath catalog vendors and other metadata value adds, and it's also a platform for research, much as we did with Spark previously at Berkeley. So Ground is a project in our new lab at Berkeley, the RISELab, which is the successor to the AMPLab that gave us Spark. And Ground has now got, you know, collaborators from Cloudera, from LinkedIn. Capital One has significantly invested in Ground and is putting engineers behind it, and contributors are coming also from some startups to build out an open source platform for metadata. >> How long has Ground been around? >> Joe: Ground's been around for about 12 months. It's very-- >> So it's brand new. How do people get involved? >> Brand new. >> Just standard, similar to the way the AMPLab was? Just jump in and-- >> Yeah, you know-- >> Go away and-- >> It comes up on GitHub. There's (mumbles) to go download and play with. It's in alpha. And you know, we hope we (mumbles) and the usual open source drill. >> This is interesting. I like this idea, because one thing you've been riffing on theCUBE all the time is how do you make data addressable? Because ultimately, you know, real time, you need to have access to data at really, really low latency to get the insights to make it work. Hence the data swamp problem, right? So, how do you guys see that? 'Cause now I can just pop in. I can hear the objections. Oh, security! You know. How do you guys see the protections?
I'd love to help get my data in there and get something back in return in a community model. Security? Is it the hashing? What's the-- How do you get any security (mumbles)? Or what are the issues? >> Yeah, so I mean the straightforward issues are the traditional issues of authorization and encryption, and those are issues that are reasonably well plumbed out in the industry, and you can go out and you can take the solutions from people like Cloudera or from Hortonworks, and those solutions plug in quite nicely actually to a variety of platforms. And I feel like that level of enterprise security is understood. It's work for vendors to work with that technology, so when we went out, we made sure we were Kerberized in all the right ways at Trifacta to work with these vendors, and that we integrated well with Navigator, we integrated with Atlas. That was, you know, there was some labor there, but it's understood. There's also-- >> It's solvable basically. >> It's solvable basically, and pluggable. There are research questions there which, you know, on another day we could talk about, but for instance, if you don't trust your cloud hosting service, what do you do? And that's like an open area that we're working on at Berkeley. Intel SGX is a really interesting technology; that's probably a topic for another day. >> But you know, I think it's important-- >> Next time we get you in the studio in Palo Alto, we'd love to drill down on that. >> I think it's important though that, you know, when we talk about self service, the first question that comes up is, I'm only going to let you self service as far as I can govern what's going on, right? And so I think those things-- >> Restrictions, guard rails-- >> Really go hand in hand here. >> About handcuffs. >> Yeah so, right. Because that's always the first thing that kind of comes out, where people say, okay, wait a minute, now is this-- if I've now got, you know-- you've got an increasing number of knowledge workers who think that is their-- and believe that it is their inalienable right to have access to data. >> Well that's the (mumbles) democratization. That's the top down, you know, governance control point. >> So how do you balance that? And I think you can't solve for one side of that equation without the other, right? And that's really, really critical. >> Democratization is anarchization, right? >> Right, exactly. >> Yes, exactly. But it's hard though. I mean, you look at all the big trends, you know, web 1.0, web (mumbles), all had those democratization trends, but they took six years to play out, and I think it might be more accelerated with cloud, to your point about this new stack. Okay George, go ahead. You wanted to get in there. >> I wanted to ask you about, you know, what we were talking about earlier and what customers are faced with, which is, you know, a lot of choice and specialization, because building something end to end and having it fully functional is really difficult. So... What are the functional points where you start driving the guard rails in that IT cares about, and then what are the user experience points where you have critical mass so that the end users then draw other compliant tools in? You with me? On sort of the IT side and the user side, and then which tools start pulling those standards?
Well, I would say at the highest level, to me what's been very interesting, especially with what's happened in open source, is that people have now gotten accustomed to the idea that, like, I don't have to go buy a big monolithic stack where the innovation moves only as fast as the slowest product in the stack or the portfolio. I can grab onto things and I can download them today and be using them tomorrow. And that has, I think, changed the entire approach that companies like Trifacta are taking to how we build and release product to market, how we interoperate with partners like Alation and Waterline, and how we integrate with the platform vendors like Cloudera, MapR, and Hortonworks, because we recognize that we are going to have to be maniacally focused on one piece of this puzzle and go very, very deep, but then play incredibly well, you know, with all the rest of the ecosystem. And so I think that has really colored our entire product strategy and how we go to market, and I think customers, you know, they want the flexibility to change their minds, and the subscription model is all about that, right? You've got to earn it every single year. >> So what's the future of (mumbles)? 'Cause that brings up a good point; we were kind of critical of Google, and you mentioned you guys had-- I saw in some news that you guys were involved with Google. >> Yup. >> Being enterprise ready is not just, hey, we have the great tech and you buy from us, damn it, we're Google. >> Right. >> I mean, you have to have sales people. You have to have automation mechanisms to create great product. Will the future of wrangling and data prep go into-- where does it end up? Because enterprises want, they want certain things. They're finicky about things. >> Right, right. >> As you guys know. So how does the future of data prep deal with the, I won't say the slowness of the enterprise, but they're more conservative, more SLA driven than they are price performance. >> But they're also more fragmented than ever before, and you know, while that may not be a great thing for the customers, for a company that's all about harmonizing data that's actually a phenomenal opportunity, right? Because we want to be the decision that customers make that guarantees that all their other decisions are changeable, right? And I go and-- >> Well they have legacy systems of record. This is the challenge, right? So I got the old Oracle monolithic-- >> That's fine. And that's good-- >> The more the merrier, right? >> Does that impact you guys at all? How did you guys handle that situation? >> To me, to us that is more fragmentation, which creates more need for wrangling, because that introduces more complexity, right? >> You guys do well in that environment. >> Absolutely. And that, you know, is only getting bigger, worse, and more complicated. And especially as people go from (mumbles) to cloud, as people start thinking about moving from just looking at transactions to interactions to now looking at behavior data and the IoT-- >> You're welcome in that environment. >> So we welcome that. In fact, that's where-- we went to solve this problem for Hadoop and Big Data first because we wanted to solve the problems at scale that were the most complicated, and over time we can always move downstream to sort of more structured and smaller data, and that's kind of what's happened with our business.
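The "flexibility to change their minds" that Adam mentions can be pictured as prep logic that is deliberately not bound to one storage or execution decision. The sketch below is our generic PySpark illustration, with invented paths and column names; it is not Trifacta's engine.

```python
# A sketch of decoupled data prep: the same transformation runs unchanged
# whether the files sit on local disk, in HDFS, or in S3. Only the URI
# changes; the recipe itself stays portable. All paths are hypothetical.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("portable-prep").getOrCreate()

def standardize(df: DataFrame) -> DataFrame:
    # One prep recipe, independent of where the data is stored.
    return (df.withColumn("email", F.lower(F.col("email")))
              .dropDuplicates(["customer_id"]))

for uri in ["file:///tmp/customers.parquet",
            "hdfs:///lake/customers.parquet",
            "s3a://corp-lake/customers.parquet"]:
    cleaned = standardize(spark.read.parquet(uri))
    print(uri, cleaned.count())  # same logic against each backend
```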
I guess I want to circle back to this issue of which part of this value chain of refining data is-- if I'm understanding you right, the data wrangling is the anchor, and once a company has made that choice, then all the other tool choices have to revolve around it? Is that a-- >> Well, think about it this way. I mean, the bulk of the time, when you talk to the analysts, and also the bulk of the labor cost in these things, is in getting the data from its raw form into usage. That whole process of wrangling, which is not really just data prep. It's all the things you do all day long to kind of massage these data sets and get 'em from here to there and make 'em work. That space is where the labor cost is. That also means that space is where the value add is, because that's where your people power or your business context is really getting poured in, to understand what do I have, what am I doing with it, and what do I want to get out of it. As we move from bottom line IT to top line value generation with data, it becomes all the more so, right? Because now it's not just a matter of getting the reports out every month. It's also, what did that brilliant analyst in sales do to that dataset to get that much lift? I need to learn from her and do a similar thing. Alright? So, that whole space is where the value is. What that means is that, you know, you don't want that space to be tied to a particular BI tool or a particular execution engine. So when we say that we want to be a decision in the middle that enables all the other decisions, what you really want to make sure is that that work process in there is not tightly bound to the rest of the stack. Okay? And so you want to particularly pick technologies in that space that will play nicely with different storage, that play nicely with different execution environments. Today it's Hadoop, tomorrow it's Amazon, the next day it's Google, and they have different engines back there potentially. And you certainly want it to play nicely with all the analytics and visualizations-- >> So decouple from all that? >> You want to decouple that, and you want to not lock yourself in, 'cause that's where the creativity's happening on the consumption side, and that's where the mess that you talked about is just growing on the production side. So data production is just getting more complicated. Data consumption's getting more interesting. >> That's actually a really, really good point. >> Elaborating on that, does that mean that you have to open up interfaces at either the UI layer or at the sort of data definition layer? Or does that just mean other companies have to do the work to tie in to the styles? The styles and structures that you have already written? >> In fact it's sort of the opposite. We do the work to tie in to a lot of these other decisions in this infrastructure, you know. We don't pretend for a minute that people are going to sort of pick a solution like Trifacta and then build their organization around it. To your point, there's tons of legacy technology out there. There are all kinds of things moving. Absolutely. So a big part of Trifacta being the decoder ring for data is saying, listen, we are going to interoperate with your existing investments, and we're going to make sure that you can always get at your data, you can always take it from whatever state it's in to whatever state you need it to be in, and you can change your mind along the way.
And that puts a lot of onus on us, and that's the reason why we have to be so focused on this space and not jump into visualization and analytics, and not jump into storage and processing, and not try to do the other things to the right or left. Right? >> So final question. I'd like you guys both to take a stab at it. You know, just going to pivot off of what Joe was saying. Some of the most interesting things are happening in the data exploration kind of discovery area, from creativity to insights to game changing stuff. >> Yup. >> Ventures potentially. >> Joe: Yup. >> The problem of the complexity, that's in conflict. >> Yeah. >> So how do we resolve this? I mean, besides the Trifacta solution, which you guys are taming, creating a platform for that, how do people in industry work together to solve that problem? What's the approach? >> So I think actually there's a couple sort of heartening trends on this front that make me pretty optimistic. One of these is that the incentive structures in the enterprises we work with are becoming quite aligned between IT and the line of business. It's no longer the case that the line of business are these annoying people that are distracting IT from their bottom line function. IT's bottom line function is being translated into a what's-your-value-for-the-business question. And the answer for a savvy IT management person is, I will try to empower the people around me to be rabid fans, and I will also try to make sure that they do their own work so I don't have to learn how to do it for them. Right? And so, that I think is happening-- >> Guys, to the (mumbles) business guys, a bunch of annoying guys who don't get what I need, right? So it works both ways, right? >> It does, it does. And I see that that's improving sort of in the industry as the corporate missions around data change, right? So it's no longer that the IT guys really only need to take care of executives and everyone else doesn't matter. Their function really is to serve the business, and I see that alignment. The other thing that I think is a huge opportunity, and part of why we're excited to be so tightly coupled with Google, and also have our stuff running in Amazon and at Microsoft, is as people replatform to the cloud, a lot of legacy becomes shed, or at least becomes deprecated. And so there is a real-- >> Or containerized or some sort of microservice. >> Yeah. >> Right, right. >> And so, people are peeling off business functions, and as part of that cost savings to migrate it to the cloud, they're also simplifying. And you know, things will get complicated again. >> What about the (mumbles) solution architects out there that kind of reboot their careers? Because the old way was, hey, I got networks, I got apps and stacks, and so that gives the guys who could be the new heroes coming in. >> Right. >> And thinking differently about enabling that creativity. >> In the midst of all that, everything you said is true. IT is a massive place and it always will be. And tools that can come in and help are absolutely going to be (mumbles). >> This is obvious now. The tension's obviously eased a bit in the sense that there's a clear line of sight that top line and bottom line are working together now. You mentioned that earlier. Okay. Adam, take a stab at it. (mumbling) >> I was just going to-- hey, I know, it's great. I was just going to give an example, I think, that illustrates that point. So you know, one of our customers is Pepsi.
And Pepsi came to us and they said, listen, we work with retailers all over the world, and their reality is that, when they place orders with us, they often get it wrong. And sometimes they order too much and then they return it, it spoils, and that's bad for us. Or they order too little and they stock out and we miss revenue opportunities. So they said, we actually have to be better at demand planning and forecasting than the orders that are literally coming in the door. So how do we do that? Well, we're getting all of the customers to give us their point of sale data. We're combining that with geospatial data, with weather data. We're looking at historical data and industry averages, but as you can see, they were like-- we're stitching together data across a whole variety of sources, and they said the best people to do this are actually the category managers and the people responsible for the brands, 'cause they literally live inside those businesses and they understand it. And so what happened was, the IT organization was saying, look, listen, we don't want to be the people doing the janitorial work on the data. We're going to give that work over to people who understand it, and they're going to be more productive and get to better outcomes with that information, and that frees us up to go find new and interesting sources of information to bring in. And I think that collaborative model that you're starting to see emerge, where they can now be the data heroes in a different way, by not being the bottleneck on provisioning, but rather going out and figuring out how do we share the best stuff across the organization, how do we find new sources of information to bring in that people can leverage to make better decisions. That's an incredibly powerful place to be, and you know, I think that that model is really what's going to be driving a lot of the thinking at Trifacta and in the industry over the next couple of years. >> Great. Adam Wilson, CEO of Trifacta. Joe Hellerstein, Chief Strategy Officer of Trifacta and also a professor at Berkeley. Great story. Getting the (mumbles) right is hard, but under the hood stuff's complicated, and again, congratulations on sharing the Ground project. Ground open source. Open source lab kind of thing at-- in Berkeley. Exciting new stuff. Thanks so much for coming on theCUBE. I appreciate it, great conversation. I'm John Furrier, George Gilbert. You're watching theCUBE here at Big Data SV in conjunction with Strata + Hadoop. Thanks for watching. >> Great. >> Thanks guys.
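One mechanism from earlier in this segment is worth sketching: the free Wrangler edition keeps data on the laptop and sends only hashed metadata to the cloud, where aggregate usage trains the product. The sketch below is our own illustration of that idea with invented field names and a hypothetical per-install salt; Trifacta's actual telemetry format is not shown in the interview.

```python
# Illustrative only: usage telemetry leaves the machine only after
# identifying metadata is one-way hashed; raw data never leaves at all.
import hashlib
import json

def hash_token(value: str, salt: str = "per-install-salt") -> str:
    """One-way hash so column names never leave the laptop in clear text."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def telemetry_record(columns, transform_ops):
    # Only hashed schema tokens and the sequence of transform operations
    # applied to them are reported -- never the data values themselves.
    return json.dumps({
        "columns": [hash_token(c) for c in columns],
        "ops": transform_ops,  # e.g. ["split", "filter", "deduplicate"]
    })

print(telemetry_record(["customer_name", "age"], ["split", "filter"]))
```

Because the hash is one-way, the vendor can learn that the same schema tokens recur across wrangling sessions without ever seeing column names or data values in the clear.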
Nenshad Bardoliwalla, Paxata - #BigDataNYC 2016 - #theCUBE
>> Voiceover: Live from New York, it's theCUBE, covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to New York City, everybody. Nenshad Bardoliwalla is here, he's the co-founder and chief product officer at Paxata, a company that, three years ago, I want to say three years ago, came out of stealth on theCUBE. >> October 27, 2013. >> Right, and we were at the Warwick Hotel across the street from the Hilton. Yeah, Prakash came on theCUBE and came out of stealth. Welcome back. >> Thank you very much. >> Great to see you guys. Taking the world by storm. >> Great to be here, and of course, Prakash sends his apologies. He couldn't be here, so he sent his stunt double. (Dave and George laugh) >> Great, so give us the update. What's the latest? >> So there are a lot of great things going on in our space. The thing that we announced here at the show is what we're calling Paxata Connect, OK? Just in the same way that we created the self-service data preparation category, and now there are 50 companies that claim they do self-service data prep, we are moving the industry to the next phase of what we are calling our business information platform. Paxata Connect is one of the first major milestones in getting to that vision of the business information platform. What Paxata Connect allows our customers to do is, number one, to have visual, completely declarative, point-and-click browsing access to a variety of different data sources in the enterprise. For example, we support, we are the only company that we know of that supports connecting to multiple, simultaneous, different Hadoop distributions in one system. So a Paxata customer can connect to MapR, they can connect to Hortonworks, they can connect to Cloudera, and they can federate across all of them, which is a very powerful aspect of the system. >> And part of this involves, when you say declarative, it means you don't have to write a program to retrieve the data. >> Exactly right. Exactly right. >> Is this going into HDFS, into Hive, or? >> Yes it is. In fact, so Hadoop is one part of, this multi-source Hadoop capability is one part of Paxata Connect. The second is, as we've moved into this information platform world, our customers are telling us they want read-write access to more than just Hadoop. Hadoop is obviously a very important part, but we're actually supporting NoSQL data sources like Cloudant, MongoDB, we're supporting read and write, we're supporting, for the first time, relational databases; we already supported read, but now we actually support write to relational databases. So Paxata is really becoming kind of this fabric, a business-centric information fabric, that allows people to move data from anywhere to any destination, and transform it, profile it, explore it along the way. >> Excellent. Let's get into some of the use cases. >> Yeah, tell us where the banks are. The sense at the conference is that everyone sort of got their data lakes to some extent up and running. Now where are they pushing to go next? >> Sure, that's an excellent question. So we have really focused on the enterprise segment, as you know. So the customers that are working with Paxata from an industry perspective, banking is, of course, a very important one; we were really proud to share the stage yesterday with both Citi and Standard Chartered Bank, two of our flagship banking customers.
But Paxata is also heavily used in the United States government, in the intelligence community, I won't say any more about that. It's used heavily in retail and consumer products, it's used heavily in the high-tech space, it's used heavily by data service providers, that is, companies whose entire business is based on data. But to answer your question specifically, what's happening in the data lake world is that a lot of folks, the early adopters, have jumped onto the data lake bandwagon. So they're pouring terabytes and petabytes of data into the data lake. And then the next question the business asks is, OK, now what? Where's the data, right? One of the simplest use cases, but actually one that's very pervasive for our customers, is they say, "Look, our business people don't even know what's in Hadoop right now." And by the way, I will also say that the data lake is not just Hadoop; Amazon S3 is also serving as a data lake. The capabilities inside Microsoft's cloud are also serving as a data lake. Even the notion of a data lake is becoming this sort of polymorphic distributed thing. So what they do is, they want to be able to get what we like to say is first eyes on data. We let people with Paxata, especially with the release of Connect, just point and click their way and actually explore the data in all of the native systems before they even bring it in to something like Paxata. So they can actually sneak preview thousands of database tables or thousands of compressed data sets inside of Amazon S3, or thousands of data sets inside of Hadoop, and now the business people for the first time can point and click and actually see what is in the data lake in the first place. So step number one is, the industry has so far taken an approach where a lot of IT-driven use cases have motivated people to go to the data lake. But now we obviously want to show, all of our companies want to show, business value, so tools and platforms like Paxata that sit on top of the data lake, that can federate across multiple data lakes and provide business-centric access to that information, are the first significant use case pattern we're seeing. >> Just a clarification: could there be two roles, where one is a slightly more technical business user who exposes summarized views, so that the ultimate end user doesn't have to see the thousands of tables? >> Absolutely, that's a great question. So when you look at self-service, if somebody wants to roll out a self-service strategy, there are multiple roles in an organization that actually need to intersect with self-service. There is a pattern in organizations where people say, "We want our people to get access to all the data." Of course it's governed, they have to have the right passwords and SSO and all that, but they're the companies who say, yes, the users really need to be able to see all of the data across these different tables. But there's a different role, who also uses Paxata extensively, who are the curators, right?
These are the people who say, look, I'm going to provision the raw data, provide the views, provide even some normalization or transformation, and then land that data back into another layer, as people call it, the data relay; they go from layer zero to layer one to layer two, they're different directory structures, but the point is, there's a natural processing frame that they're going through with their data, and then the analysts can go pick up the curated data that's created by the data stewards. >> One of the other big challenges that our research is showing, that chief data officers express, is that they get this data in the data lake. So they've got the data sources, you're providing access to it; the other piece is they want to trust that data. There's obviously a governance piece, but then there's a data quality piece, maybe you could talk about that? >> Absolutely. So use case number one is about access. The second reason that people are not so-- So, why are people doing data prep in the first place? They are trying to make information-driven decisions that actually help move their business forward. So if you look at researchers from firms like Forrester, they'll say there are two reasons that slow down the latency of going from raw data to decision. Number one is access to data. That's the use case we just talked about. Number two is the trustworthiness of data. Our approach is very different on that. Once people actually can find the data that they're looking for, the big paradigm shift in the self-service world is that, instead of trying to process data based on transforming the metadata attributes, like, I'm going to draw a workflow diagram, bring in this table, aggregate with this operator, then split it this way, filter it, which is the classic ETL paradigm, the, I don't want to say profound, but maybe the very obvious thing we did was to say, "What if people could actually look at the data in the first place?" >> And sort of program it by example? >> We can tell, that's right. Because our eyes can tell us, our brains help us to say, we can immediately look at a data set, right? You look at an age column, let's say. There are values in the age column of 150 years. Maybe 20 years from now there may be someone who, on Earth, lives to 150 years. But pretty much-- >> Highly unlikely. >> The customers at the banks you work with are not 150 years old, right? So just being able to look at the data, to get to the point that you're asking, quality is about data being fit for a specific purpose. In order for data to be fit for a specific purpose, the person who needs the data needs to make the decision about what is quality data. Both of you may have access to the same transactional data, raw data, that the IT team has landed in the Hadoop cluster. But now you pull it up for one use case, you pull it up for another use case, and because your needs are different, what constitutes quality to you and where you want to make the investment is going to be very different. So by putting the power of that capability into the hands of the person who actually knows what they want, that is how we are actually able to change the paradigm and really compress the latency from "Here's my raw data" to "Here's the decision I want to make on that data."
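Nenshad's age-column example is easy to make concrete. Here is a minimal profiling sketch, with invented data and thresholds, of the fit-for-purpose check an analyst would otherwise apply by eye:

```python
# A sketch of "first eyes on data" quality checking: flag values that
# cannot be right for this purpose. Data and bounds are invented.
import pandas as pd

customers = pd.DataFrame({"name": ["Ana", "Raj", "Lee"],
                          "age": [34, 150, 58]})

# Fit-for-purpose rule: a bank's customers are not 150 years old.
implausible = customers[(customers["age"] < 18) | (customers["age"] > 110)]
print(implausible)                   # the rows a human would catch at a glance
print(customers["age"].describe())   # a quick profile of the column
```

The bounds are the analyst's call, which is the point: what counts as quality depends on the purpose, so the rule belongs in the hands of the person who knows the data.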
Now, what happens in terms of governance, or more importantly, just trust, when the pipeline, you know, has to go beyond where you're working on it, to some of the analytics or some of the basic ingest? To say, "I know this data came from here "and it's going there." >> That's right, how do we verify the fidelity of these data sources? It's a fantastic question. So, in my career, having worked in BI for a couple of decades, I know I look much younger but it actually has been a couple of decades. Remember, the camera adds about 15 pounds, for those of you watching at home. (Dave and George laugh) >> George: But you've lost already. >> Thank you very much. >> So you've lost net 30. (Nenshad laughs) >> Or maybe I'm back to where I'm supposed to be. What I've seen as the two models of governance in the enterprise when it comes to analytics and information management, right? There's model one, which is, we're going to build an enterprise data warehouse, we're going to know all the possible questions people are going to ask in advance, we're going to preprogram the ETL routines, we're going to put something like a MicroStrategy or BusinessObjects, an enterprise-reporting factory tool. Then you spend 10 million dollars on that project, the users come in and for the first time they use the system, and they say, "Oh, I kind of want to change this, this way. "I want to add this calculation." It takes them about five minutes to determine that they can't do it for whatever reason, and what is the first feature they look for in the product in order to move forward? Download to Excel, right? So you invested 15 million dollars to build a download to Excel capability which they already had before. So if you lock things down too much, the point is, the end users will go around you. They've been doing it for 30 years and they'll keep doing it. Then we have model two. Model two is, Excel spreadsheet. Excel Hell, or spreadmarts. There are lots of words for these things. You have a version of the data, you have a version of the data, I have a version of the data. We all started from the same transactional data, yet you're the head of sales, so suddenly your forecast looks really rosy. You're the head of finance, you really don't like what the forecast looks like. And I'm the product guy, so why am I even looking at the forecast in the first place, but somehow I got access to the data, right? These are the two polarities of the enterprise that we've worked with for the last 30 years. We wanted to find sort of a middle path, which is to say, let's give people the freedom and flexibility to be able to do the transformations they need to. If they want to add a column, let them add a column. If they want to change a calculation, let them add a a calculation. But, every single step in the process must be recorded. It must be versioned, it must be auditable. It must be governed in that way. So why the large banks and the intelligence community and the large enterprise customers are attracted to Paxata is because they have the ability to have perfect retraceability for every decision that they make. I can actually sit next to you and say, "This is why the data looks like this. "This is how this value, which started at one million, "became 1.5 million." That covers the Paxata part. But then the answer to the question you asked is, how do you even extend that to a broader ecosystem? 
I think that's really about some of the metadata interchange initiatives that a lot of the vendors in the Hadoop space, but also in the traditional enterprise space, have had for the last many years. If you look at something like Apache Atlas or Cloudera Navigator, they are systems designed to collect, aggregate, and connect these different metadata steps so you can see in an end-to-end flow, this is the raw data that got ingested into Hadoop. These are the transformations that the end user did in Paxata in order to make it ready for analytics. This is how it's getting consumed in something like Zoom Data, and you actually have the entire life cycle of data now actually manifested as a software asset. >> So those not, in other words, those are not just managing within the perimeter of Hadoop. They are managers of managers. >> That's right, that's right. Because the data is coming from anywhere, and it's going to anywhere. And then you can add another dimension of complexity which is, it's not just one Hadoop cluster. It's 10 Hadoop clusters. And those 10 Hadoop clusters, three of them are in Amazon. Four of them are in Microsoft. Three of them are in Google Cloud platform. How do you know what people are doing with data then? >> How is this all presented to the user? What does the user see? >> Great question. The trick to all of this, of self service, first you have to know very clearly, who is the person you are trying to serve? What are their technical skills and capabilities, and how can you get them productive as fast as possible? When we created this category, our key notion was that we were going to go after analysts. Now, that is a very generic term, right? Because we are all, in some sense, analysts in our day-to-day lives. But in Paxata, a business analyst, in an enterprise organizational context, is somebody that has the ability to use Microsoft Excel, they have to have that skill or they won't be successful with today's Paxata. They have to know what a VLOOKUP is, because a VLOOKUP is a way to actually pull data from a second data source into one. We would all know that as a join or a lookup. And the third thing is, they have to know what a pivot table is and know how a pivot table works. Because the key insight we had is that, of the hundreds of millions of analysts, people who use Excel on a day-to-day basis, a lot of their work is data prep. But Excel, being an amazing generic tool, is actually quite bad for doing data prep. So the person we target, when I go to a customer and they say, "Are we a good candidate to use Paxata?" and we're talking to the actual person who's going to use the software, I say, "Do you know what a VLOOKUP is, yes or no? "Do you know what a pivot table is, yes or no?" If they have that skill, when they come into Paxata, we designed Paxata to be very attractive to those people. So it's completely point-and-click. It's completely visual. It's completely interactive. There's no scripting inside that whole process, because do you think the average Microsoft Excel analyst wants to script, or they want to use a proprietary wrangling language? I'm sorry, but analysts don't want to wrangle. Data scientists, the 1% of the 1%, maybe they like to wrangle, but you don't have that with the broader analyst community, and that is a much larger market opportunity that we have targeted. >> Well, very large, I mean, a lot of people are familiar with those concepts in Excel, and if they're not, they're relatively easy to learn. >> Nenshad: That's right. Excellent. 
All right, Nenshad, we have to leave it there. Thanks very much for coming on theCUBE, appreciate it. >> Thank you very much for having me. >> Congratulations for all the success. >> Thank you. >> All right, keep it right there, everybody. We'll be back with our next guest. This is theCUBE, we're live from New York City at Big Data NYC. We'll be right back. (electronic music)
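The multi-source federation Nenshad describes at the top of this segment can also be pictured with plain Spark: read from two different Hadoop clusters by addressing each cluster's namenode, join the results, and write them back to a relational database. The URIs, table names, and credentials below are hypothetical, and this is a generic sketch rather than Paxata Connect's actual API, which the interview does not show.

```python
# A generic sketch of federating across sources: two Hadoop clusters
# plus a relational write target. All endpoints are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation-sketch").getOrCreate()

# Read from two different clusters (e.g. one MapR, one Cloudera) by
# addressing each cluster's namenode explicitly.
trades = spark.read.parquet("hdfs://cluster-a-nn:8020/lake/trades")
accounts = spark.read.parquet("hdfs://cluster-b-nn:8020/lake/accounts")

joined = trades.join(accounts, "account_id")

# Write the result back to a relational database over JDBC, mirroring
# the read-write relational support described in the interview.
joined.write.jdbc(url="jdbc:postgresql://db-host:5432/marts",
                  table="trade_summary",
                  mode="overwrite",
                  properties={"user": "etl", "password": "secret"})
```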