Ben Sharma, Tony Fisher, Zaloni - BigData SV 2017 - #BigDataSV - #theCUBE
>> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. (rhythmic music) >> Hey, welcome back, everyone. We're live in Silicon Valley for Big Data SV, Big Data Silicon Valley, in conjunction with Strata + Hadoop. This is the week where it all happens in Silicon Valley around the emergence of Big Data as it goes to the next level. The Cube is actually on the ground covering it like a blanket. I'm John Furrier. My cohost, George Gilbert with Wikibon. And our next guests, we have two executives from Zaloni: Ben Sharma, who's the founder and CEO, and Tony Fisher, SVP of strategy. Guys, welcome back to The Cube. Good to see you. >> Thank you for having us back. >> You guys are great guests. You were in New York for Big Data NYC, and a lot is going on, certainly, here, and it's just getting kicked off with Strata + Hadoop, they've got the sessions today, but you guys have already got some news out there. Give us the update. What's the big discussion at the show? >> So yeah, 2016 was a great year for us. A lot of growth. We tripled our customer base, and a lot of interest in data lakes, as customers are going from, say, pilots and POCs into production implementations and so forth. And in conjunction with that, this week we launched a solution that we call Data Lake in a Box, appropriately, right? So what that means is we're bringing the full stack together for customers, so that we can get a data lake up and running in an eight-week time frame, with enterprise-grade data ingestion from their source systems hydrated into the data lake and ready for analytics. >> So is it a pretty big box, and is it waterproof? (all laughing) I mean, this is the big discussion now, pun intended. But the data lake is evolving, so I wanted to get your take on it. This has kind of been a theme that's been leading up and is now front and center here on The Cube.
Already the data lake has changed, also we've heard, I think Dave Vellante in New York said data swamp. But using the data is critical on a data lake. So as it goes to a more mature model of leveraging the data, what are the key trends right now? What are you guys seeing? Because this is a hot topic that everyone is talking about. >> Well, that's a good distinction that we like to make, the difference between a data swamp and a data lake. >> And a data lake is much more governed. It has the rigor, it has the automation, it has a lot of the concepts that people are used to from traditional architectures, only we apply them in the scale-out architecture. So we put together a maturity model that really maps out a customer's journey throughout the big data and the data lake experience. And at each phase of this, we can see what the customer's doing, what their trends are and where they want to go, and we can advise them on the right way to move forward. And so a lot of the customers we see are in what we call the ignore stage. I'd say most of the people we talk to are just ignoring. They don't have things active, but they're doing a lot of research. They're trying to figure out what's next. And we want to move them from there. The next stage up is called store. And store is basically just the sandbox environment. "I'm going to stick stuff in there." "I'm going to hope something comes out of it." No collaboration. But then, moving forward, there's the managed phase, the automated phase, and the optimized phase. And our goal is to move them up into those phases as quickly as possible. And Data Lake in a Box is an effort to do that, to leapfrog them into a managed data lake environment. >> So that's kind of where the swamp analogy comes in, because the data lake, the swamp is kind of dirty, where you can almost think, "Okay, the first step is store it."
And then they get busy or they try to figure out how to operationalize it, and then it's kind of like, "Uh ..." So to your point, they're trying to get to that. So you guys get 'em to that setup, and then move them quickly to value? Is that kind of the approach? >> Yeah. So, time to value is critical, right? So how do you reduce the time to insight, from the time the data is produced by the data producer till the time you can make the data available to the data consumer for analytics and downstream use cases? So that's kind of our core focus in bringing these solutions to the market. >> Dave Vellante and I often talk, and George always talks, about how the value of data at the right time at the right place is the critical linchpin for the value, whether it's app-driven or whatever. So the data lake, you never know what data in the data lake will need to be pulled out and put into either real time or an app. So you have to assume at any given moment there's going to be data value. >> Sure. >> So that, conceptually, people can get that. But how do you make that happen? Because that's a really hard problem. How do you guys tackle that when a customer says, "Hey, I want to do the data lake. I've got to have the coverage. I got to know who's accessing stuff. But at the end of the day, I got to move the data to where it's valuable." >> Sure. So the approach we have taken is an integrated platform with a common metadata layer. Metadata is the key. So, using this common metadata layer, being able to do managed ingestion from various different sources, being able to do data validation and data quality, being able to manage the life cycle of the data, being able to generate these insights about the data itself, so that you can use that effectively for data science or for downstream applications and use cases, is critical, based on our experience of taking these applications from, say, a POC or pilot phase into a production phase.
>> And what's the next step, once you guys get to that point with the metadata? Because, like, I get that, it's like everyone's got the metadata focus. Now, you've got the data engineer, the geek, the supergeek, and then you've got the data scientist, then the analysts, then there will probably be a new category, a bot or something, AI will do something. But you can have a spectrum of applications on the data side. How do they get access to the metadata? Is it through the machine learning? Do you guys have anything unique there that makes that seamless, or is that the end goal? >> Sure, do you want to take that? >> Yes, sure, it's a multi-pronged answer, but I'll start and you can jump in. One of the things we provide as part of our overall platform is a product called Mica. And Mica is really the kind of on-ramp to the data. And all those people that you just named, we love them all, but their access to the data is through a self-service data preparation product, and key to that is the metadata repository. So, all the metadata is out there; we call it a catalog at that point, and so they can go in, look at the catalog, get a sense for the data, get an understanding of the form and function of the data, see who uses it, see where it's used, and determine if that's the data that they want, and if it is, they have the ability to refine it further, or they can put it in a shopping cart. If they have access to it, they can get it immediately and refine it; if they don't have access to it, there's an automatic request so that they can get access to it. And so it's an on-ramp concept, of having a card catalog of all the information that's out there, how it's being used, how it's been refined, to allow the end users to make sure that they've got the right data and that they're positioned for their ultimate application.
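As a rough illustration of the catalog-and-shopping-cart flow described in this answer, here is a minimal Python sketch. It is not Zaloni's API: the `DatasetEntry` and `Catalog` classes, their fields, and the sample dataset and user names are all hypothetical, chosen only to show a cart that either hands over a dataset immediately or files an automatic access request.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record (hypothetical shape): metadata about a dataset."""
    name: str
    description: str
    allowed_users: set = field(default_factory=set)  # who may pull it today
    used_by: list = field(default_factory=list)      # who has pulled it so far


@dataclass
class Catalog:
    """Hypothetical in-memory catalog with a shopping-cart style checkout."""
    entries: dict = field(default_factory=dict)
    access_requests: list = field(default_factory=list)

    def register(self, entry):
        self.entries[entry.name] = entry

    def search(self, keyword):
        """Return names of datasets whose name or description mentions keyword."""
        kw = keyword.lower()
        return [e.name for e in self.entries.values()
                if kw in e.name.lower() or kw in e.description.lower()]

    def add_to_cart(self, user, dataset_name, cart):
        """Add the dataset if the user is entitled; otherwise file a request."""
        entry = self.entries[dataset_name]
        if user in entry.allowed_users:
            cart.append(dataset_name)
            entry.used_by.append(user)
            return "added"
        self.access_requests.append((user, dataset_name))
        return "access requested"


catalog = Catalog()
catalog.register(DatasetEntry("claims_2016", "Cleaned insurance claims", {"ana"}))

cart = []
ana_result = catalog.add_to_cart("ana", "claims_2016", cart)  # entitled user
raj_result = catalog.add_to_cart("raj", "claims_2016", cart)  # not entitled
```

In this sketch the entitled user gets the dataset straight away, while the other user's attempt is queued in `access_requests` for someone to approve, which mirrors the "automatic request" mentioned in the answer.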
>> And just to add to what Tony said, because we are using this common metadata layer, and capturing metadata at every instance, if you will, we are serving it up to the data consumers, using a rich catalog, so that a lot of our enterprise customers are now starting to create what they consider a data marketplace or a data portal within their organization, so that they're able to catalog not just the data that's in the data lake, but also data that's in other data stores. And provide one single unified view of these data sets, so that your data scientists can come in and see, is this a data set that I can use for my model building? What are the different attributes of this data set? What is the quality of the data? How fresh is the data? And those kinds of traits, so that they are effective in their analytical journey. >> I think that's the key thing that's interesting to me, is that you're seeing the big data explosion over the past ten years, eight years, we've been covering it on The Cube since the Hadoop world started. But now, it's the data set world, so it's a big data set in this market. The data sets are the key because that's what data scientists want to wrangle around with, and sling data sets with whatever tooling they want to use. Is that kind of the same trend that you guys see? >> That's correct. And also what we're seeing in the marketplace is that customers are moving from a single architecture to a distributed architecture, where they may have a hybrid environment with some things being instantiated in the cloud, some things being on-prem. So how do you now provide a unified interface across these multiple environments, and in a governed way, so that the right people have access to the right data, and it's not the data swamp? >> Okay, so let's go back to the maturity model because I like that framework. So now you've just complicated the heck out of it.
'Cause now you've got cloud, and then on-prem, and then now, how do you put that prism of the maturity model on hybrid, so how does that cross-connect there? And a second follow-up to that is, where are the customers on this progress bar? I'm sure they're different by customer but, so, maturity model to the hybrid, and then trends in the customer base that you're seeing? >> Alright, I'll take the second one, and then you can take the first one, okay? So, the vast majority of the people that we work with, and the prospects, customers, analysts we've talked to, other industry dignitaries, they put the vast majority of the customers in the ignore stage. Really just doing their research. So a good 50% plus of most organizations are still in that stage. And then, the data swamp environment, where I'm using it to store stuff and hopefully I'll get something good out of it, that's another 25% of the population. And so, most of the customers are there, and we're trying to move them kind of rapidly up and into a managed and automated data lake environment. The other trend along these lines that we're seeing, that's pretty interesting, is the emergence of IT in the big data world. It used to be a business user's world, and business users built these sandboxes, and business users did what they wanted to. But now, we see organizations that are really starting to bring IT into the fold, because they need the governance, they need the automation, they need the type of rigor that they're used to in other data environments, and that has been lacking in the big data environment. >> And you've got people cracking the code on the IoT side, which has created another dimension of complexity. On the numbers, the 50% that ignore, is that profile more for Fortune 1000? >> It's larger companies, it's Fortune and Global 2000.
>> Got it, okay, and in terms of the hybrid maturity model, how's that, and add a third dimension, IoT, we've got a multi-dimensional chess game going here. >> I think the way we think about it is that there are different patterns of data sets coming in. So they could be batch, they could be files, or database extracts, or they could be streams, right? So as long as you think about a converged architecture that can handle these different patterns, then you can map different use cases, whether they are IoT and streaming use cases, versus what we are seeing, which is that a lot of companies are trying to replace their operational analytics platforms with a data lake environment, and they're building their operational analytics on top of the data lake, correct? So you need to think more from an abstraction layer, how do you abstract it out? Because one of the challenges that we see customers facing is that they don't want to get sticky with one cloud service provider because they may have multiple cloud service providers, >> John: It's a multi-cloud world right now. >> So how do you leverage that, where you have one cloud service provider in one geo, another cloud service provider in another geo, and still being able to have an abstraction layer on top of it, so that you're building applications? >> So do you guys provide that data layer across that abstraction? >> That is correct, yes, so we leverage the ecosystem, but what we do is add the data management and data governance layer, we provide that abstraction, so that you can be on-prem, you can be in cloud service provider one, or cloud service provider two. You still have the same controls and the same governance functions as you build your data lake environment. >> And this is consistent with some of the Cube interviews we had all day today, and other Cube interviews, where, when you have the cloud, you're renting basically, but you own your data. You get to have a nice ...
And that metadata seems to be the key, that's the key, right? For everything. >> That's right. And now what we're seeing is that a lot of our enterprise customers are looking at bringing in some of the public cloud infrastructure into their on-prem environment as they are going to be available in appliances and things like that, right? So how do you then make sure that whatever you're doing in a non-enterprise cloud environment you are also able to extend it to the enterprise-- >> And the consequence to the enterprise is that the enterprise has multiple jobs, if they don't have a consistent data layer ... >> Sure, yeah. >> It's just more redundancy. >> Exactly. >> Not redundancy, duplication actually. >> Yeah, duplication and difficulty of rationalizing it together. >> So let me drill down into a little more detail on the transition between these sort of maturity phases? And then the movement into production apps. I'm curious to know, we've heard Tableau, Excel, Power BI, Qlik I guess, being-- sort of adapting to being front ends to big data. But for their experience to work, they can't really handle big data sets. So you need the MPP SQL database on the data lake. And I guess the question there is, is there value to be gotten, or measurable value to be gotten, just from turning the data lake into, you know, an interactive BI kind of platform? And sort of as the first step along that maturity model. >> One of the patterns we are seeing is that the serving layer is becoming more and more mature in the data lake, so that earlier it used to be mainly batch types of workloads. Now, with MPP engines running on the data lake itself, you are able to connect your existing BI applications, whether it's Tableau, Qlik, Power BI, and others, to these engines so that you are able to get low-latency query response times and are able to slice and dice your data sets in the data lake itself. >> But essentially, you still have to sample the data.
You can't handle the full data set unless you're working with something like Zoomdata. >> Yeah, so there are physical limitations, obviously. And then there is also this next generation of BI tools which work in a converged manner in the data lake itself. So there's, like, Zoomdata, Arcadia, and others that are able to kind of run inside the data lake itself instead of you having to have an external environment like the other BI tools, so we see that as a pattern. But if you already are an enterprise that has onboarded a BI platform, how you leverage that with the data lake as part of the next-generation architecture is a key trend that we are seeing. >> So your metadata helps make that move from swamp to curated data lake. >> That's right, and not only that, what we have done, as Tony was mentioning, in our Mica product is we have a self-service catalog and then we provide a shopping cart experience where you can actually source data sets into the shopping cart, and we let them provision a sandbox. And when they provision the sandbox, they can actually launch Tableau or whatever the BI tool of choice is on that sandbox, so that they can actually-- and that sandbox could exist in the data lake or it could exist on a relational data store or an MPP data store that's outside of the data lake. That's part of your modern data architecture. >> But further to your point, if people have to throw out all of their decision support applications and their BI applications in order to change their data infrastructure, they're not going to do it. >> Understood. >> So you have to make that environment work, and that's what Ben's referring to with a lot of the new accelerator tools and things that will sit on top of the data lake. >> Guys, thanks so much for coming on The Cube. Really appreciate it. I'll give you guys the final word in the segment ... What do you expect this week? I mean, obviously, we've been seeing the consolidation.
You're starting to see the swim lanes with Spark and open source, and you see the cloud and IoT colliding, there's a huge intersection with deep learning, AI is certainly hyped up now beyond all recognition but it's essentially deep learning. Neural networks meets machine learning. That's been around before, but it's now freely available with cloud and compute. And so it's kind of an interesting dynamic that's rockin' the big data world. Your thoughts on what we're going to see this week and how that relates to the industry? >> I'll take a stab at it and you may feel free to jump in. I think what we'll see is that a lot of customers that have been playing with big data for a couple of years are now getting to a point where what worked for one or two use cases now needs to be scaled out and provided at an enterprise scale. So they're looking at a management and governance layer to put on top of the platform, so they can enable machine learning and AI and all those use cases, because the business is asking for them. Right? The business is asking how they can bring in TensorFlow and run it on the data lake itself, right? So we see those kinds of requirements coming up more and more frequently. >> Awesome. Tony? >> What he said. >> And enterprise readiness certainly has to be table-- there's a lot of table stakes in the enterprise. It's not, like, easy to get into, you can see Google kind of just putting their toe in the water with the Google cloud, TensorFlow, great highlight, they've got Spanner, so all these other things like latency are rearing their heads again. So these are all kind of table stakes. >> Yeah, and the other thing, moving forward with respect to machine learning and some of the advanced algorithms, what we're doing now and some of the research we're doing is actually using machine learning to manage the data lake, which is a new concept, so when we get to the optimized phase of our maturity model, a lot of that has to do with self-correcting and self-automating.
>> I need some machine learning and some AI, so does George, and we need machine learning to watch the machine learn, and then algorithms for algorithms. It's a crazy world, exciting time for us. >> Are we going to have a bot next time when we come here? (all laughing) >> We're going to chat off of Messenger, we just came from South by Southwest. Guys, thanks for coming on The Cube. Great insight and congratulations on the continued momentum. This is The Cube breakin' it down with experts, CEOs, entrepreneurs, all here inside The Cube. Big Data SV, I'm John Furrier with George Gilbert. We'll be back after this short break. Thanks! (upbeat electronic music)
Nenshad Bardoliwalla, Paxata - #BigDataNYC 2016 - #theCUBE
>> Voiceover: Live from New York, it's The Cube, covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to New York City, everybody. Nenshad Bardoliwalla is here, he's the co-founder and chief product officer at Paxata, a company that, three years ago, I want to say three years ago, came out of stealth on The Cube. >> October 27, 2013. >> Right, and we were at the Warwick Hotel across the street from the Hilton. Yeah, Prakash came on The Cube and came out of stealth. Welcome back. >> Thank you very much. >> Great to see you guys. Taking the world by storm. >> Great to be here, and of course, Prakash sends his apologies. He couldn't be here so he sent his stunt double. (Dave and George laugh) >> Great, so give us the update. What's the latest? >> So there are a lot of great things going on in our space. The thing that we announced here at the show is what we're calling Paxata Connect, OK? We are moving just in the same way that we created the self-service data preparation category, and now there are 50 companies that claim they do self-service data prep. We are moving the industry to the next phase of what we are calling our business information platform. Paxata Connect is one of the first major milestones in getting to that vision of the business information platform. What Paxata Connect allows our customers to do is, number one, to have visual, completely declarative, point-and-click browsing access to a variety of different data sources in the enterprise. For example, we support, we are the only company that we know of that supports connecting to multiple, simultaneous, different Hadoop distributions in one system. So a Paxata customer can connect to MapR, they can connect to Hortonworks, they can connect to Cloudera, and they can federate across all of them, which is a very powerful aspect of the system. 
>> And part of this involves, when you say declarative, it means you don't have to write a program to retrieve the data. >> Exactly right. Exactly right. >> Is this going into HDFS, into Hive, or? >> Yes it is. In fact, so Hadoop is one part of, this multi-source Hadoop capability is one part of Paxata Connect. The second is, as we've moved into this information platform world, our customers are telling us they want read-write access to more than just Hadoop. Hadoop is obviously a very important part, but we're actually supporting NoSQL data sources like Cloudant, MongoDB, we're supporting read and write, we're supporting, for the first time, relational databases, we already supported read, but now we actually support write to relational databases. So Paxata is really becoming kind of this fabric, a business-centric information fabric, that allows people to move data from anywhere to any destination, and transform it, profile it, explore it along the way. >> Excellent. Let's get into some of the use cases. >> Yeah, tell us where the banks are. The sense at the conference is that everyone sort of got their data lakes to some extent up and running. Now where are they pushing to go next? >> Sure, that's an excellent question. So we have really focused on the enterprise segment, as you know. So the customers that are working with Paxata from an industry perspective, banking is, of course, a very important one, we were really proud to share the stage yesterday with both Citi and Standard Chartered Bank, two of our flagship banking customers. But Paxata is also heavily used in the United States government, in the intelligence community, I won't say any more about that. It's used heavily in retail and consumer products, it's used heavily in the high-tech space, it's used heavily by data service providers, that is, companies whose entire business is based on data.
But to answer your question specifically, what's happening in the data lake world is that a lot of folks, the early adopters, have jumped onto the data lake bandwagon. So they're pouring terabytes and petabytes of data into the data lake. And then the next question the business asks is, OK, now what? Where's the data, right? One of the simplest use cases, but actually one that's very pervasive for our customers, is they say, "Look, we don't even know, our business people, they don't even know what's in Hadoop right now." And by the way, I will also say that the data lake is not just Hadoop, but Amazon S3 is also serving as a data lake. The capabilities inside Microsoft's cloud are also serving as a data lake. Even the notion of a data lake is becoming this sort of polymorphic distributed thing. So what they do is, they want to be able to get what we like to say is first eyes on data. We let people with Paxata, especially with the release of Connect, just point and click their way and actually explore the data in all of the native systems before they even bring it in to something like Paxata. So they can actually sneak preview thousands of database tables or thousands of compressed data sets inside of Amazon S3, or thousands of data sets inside of Hadoop, and now the business people for the first time can point and click and actually see what is in the data lake in the first place. So step number one is, we have taken the approach so far in the industry of, there have been a lot of IT-driven use cases that have motivated people to go to the data lake approach. But now, we obviously want to show, all of our companies want to show business value, so tools and platforms like Paxata that sit on top of the data lake, that can federate across multiple data lakes and provide business-centric access to that information, that is the first significant use case pattern we're seeing.
>> Just a clarification, could there be two roles, where one is a slightly more technical business user who exposes views summarizing the data, so that the ultimate end user doesn't have to see the thousands of tables? >> Absolutely, that's a great question. So when you look at self-service, if somebody wants to roll out a self-service strategy, there are multiple roles in an organization that actually need to intersect with self-service. There is a pattern in organizations where people say, "We want our people to get access to all the data." Of course it's governed, they have to have the right passwords and SSO and all that, but there are companies who say, yes, the users really need to be able to see all of the data across these different tables. But there's a different role, who also uses Paxata extensively, who are the curators, right? These are the people who say, look, I'm going to provision the raw data, provide the views, provide even some normalization or transformation, and then land that data back into another layer, as people call it in the data lake, they go from layer zero to layer one to layer two, they're different directory structures, but the point is, there's a natural processing frame that they're going through with their data, and then from the curated data that's created by the data stewards, the analysts can go pick it up. >> One of the other big challenges that our research is showing, that chief data officers express, is that they get this data in the data lake. So they've got the data sources, you're providing access to it, the other piece is they want to trust that data. There's obviously a governance piece, but then there's a data quality piece, maybe you could talk about that? >> Absolutely. So use case number one is about access. The second reason that people are not so -- So, why are people doing data prep in the first place? They are trying to make information-driven decisions that actually help move their business forward.
So if you look at research from firms like Forrester, they'll say there are two reasons for the latency of going from raw data to decision. Number one is access to data. That's the use case we just talked about. Number two is the trustworthiness of data. Our approach is very different on that. Once people actually can find the data that they're looking for, the big paradigm shift in the self-service world is that, instead of trying to process data based on transforming the metadata attributes, like I'm going to draw on a work flow diagram, bring in this table, aggregate with this operator, then split it this way, filter it, which is the classic ETL paradigm. The, I don't want to say profound, but maybe the very obvious thing we did was to say, "What if people could actually look at the data in the first place --" >> And sort of program it by example? >> We can tell, that's right. Because our eyes can tell us, our brains help us to say, we can immediately look at a data set, right? You look at an age column, let's say. There are values in the age column of 150 years. Maybe 20 years from now there may be someone who, on Earth, lives to 150 years. But pretty much -- >> Highly unlikely. >> The customers at the banks you work with are not 150 years old, right? So just being able to look at the data, to get to the point that you're asking, quality is about data being fit for a specific purpose. In order for data to be fit for a specific purpose, the person who needs the data needs to make the decision about what is quality data. Both of you may have access to the same transactional data, raw data, that the IT team has landed in the Hadoop cluster. But now you pull it up for one use case, you pull it up for another use case, and because your needs are different, what constitutes quality to you and where you want to make the investment is going to be very different.
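The age-column example and the "fit for a specific purpose" point can be sketched in a few lines of Python. This is only an illustration under assumptions, not how Paxata works: the `flag_out_of_range` helper, the sample customer rows, and the two age ranges are all hypothetical.

```python
def flag_out_of_range(rows, column, low, high):
    """Return the rows whose value in `column` is missing or outside [low, high]."""
    flagged = []
    for row in rows:
        value = row.get(column)
        if value is None or not (low <= value <= high):
            flagged.append(row)
    return flagged


# The same raw data, as landed once by the IT team.
customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 150},   # the implausible value from the example
    {"id": 3, "age": None},  # missing value
    {"id": 4, "age": 17},
]

# Two consumers, two definitions of "quality" for the same rows
# (both ranges are illustrative assumptions).
marketing_issues = flag_out_of_range(customers, "age", 0, 120)
lending_issues = flag_out_of_range(customers, "age", 18, 120)

marketing_ids = [row["id"] for row in marketing_issues]
lending_ids = [row["id"] for row in lending_issues]
```

Here the marketing check flags only the implausible and missing ages, while the lending check also flags the 17-year-old, so the same raw data passes or fails depending on who is asking.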
So by putting the power of that capability into the hands of the person who actually knows what they want, that is how we are actually able to change the paradigm and really compress the latency from "Here's my raw data" to "Here's the decision I want to make on that data." >> Let me ask, it sounds like, having put all of the self-service capabilities together, you've democratized access to this data. Now, what happens in terms of governance, or more importantly, just trust, when the pipeline, you know, has to go beyond where you're working on it, to some of the analytics or some of the basic ingest? To say, "I know this data came from here and it's going there." >> That's right, how do we verify the fidelity of these data sources? It's a fantastic question. So, in my career, having worked in BI for a couple of decades, I know I look much younger but it actually has been a couple of decades. Remember, the camera adds about 15 pounds, for those of you watching at home. (Dave and George laugh) >> George: But you've lost already. >> Thank you very much. >> So you've lost net 30. (Nenshad laughs) >> Or maybe I'm back to where I'm supposed to be. What I've seen is the two models of governance in the enterprise when it comes to analytics and information management, right? There's model one, which is, we're going to build an enterprise data warehouse, we're going to know all the possible questions people are going to ask in advance, we're going to preprogram the ETL routines, we're going to put in something like MicroStrategy or BusinessObjects, an enterprise reporting factory tool. Then you spend 10 million dollars on that project, the users come in and use the system for the first time, and they say, "Oh, I kind of want to change this, this way. I want to add this calculation." It takes them about five minutes to determine that they can't do it for whatever reason, and what is the first feature they look for in the product in order to move forward?
Download to Excel, right? So you invested 15 million dollars to build a download to Excel capability which they already had before. So if you lock things down too much, the point is, the end users will go around you. They've been doing it for 30 years and they'll keep doing it. Then we have model two. Model two is Excel spreadsheets. Excel Hell, or spreadmarts. There are lots of words for these things. You have a version of the data, you have a version of the data, I have a version of the data. We all started from the same transactional data, yet you're the head of sales, so suddenly your forecast looks really rosy. You're the head of finance, you really don't like what the forecast looks like. And I'm the product guy, so why am I even looking at the forecast in the first place, but somehow I got access to the data, right? These are the two polarities of the enterprise that we've worked with for the last 30 years. We wanted to find sort of a middle path, which is to say, let's give people the freedom and flexibility to be able to do the transformations they need to. If they want to add a column, let them add a column. If they want to change a calculation, let them change a calculation. But every single step in the process must be recorded. It must be versioned, it must be auditable. It must be governed in that way. So the reason the large banks and the intelligence community and the large enterprise customers are attracted to Paxata is that they have the ability to have perfect retraceability for every decision that they make. I can actually sit next to you and say, "This is why the data looks like this. This is how this value, which started at one million, became 1.5 million." That covers the Paxata part. But then the answer to the question you asked is, how do you even extend that to a broader ecosystem?
I think that's really about some of the metadata interchange initiatives that a lot of the vendors in the Hadoop space, but also in the traditional enterprise space, have had for the last many years. If you look at something like Apache Atlas or Cloudera Navigator, they are systems designed to collect, aggregate, and connect these different metadata steps so you can see in an end-to-end flow, this is the raw data that got ingested into Hadoop. These are the transformations that the end user did in Paxata in order to make it ready for analytics. This is how it's getting consumed in something like Zoomdata, and you actually have the entire life cycle of data now actually manifested as a software asset. >> So those, in other words, are not just managing within the perimeter of Hadoop. They are managers of managers. >> That's right, that's right. Because the data is coming from anywhere, and it's going to anywhere. And then you can add another dimension of complexity which is, it's not just one Hadoop cluster. It's 10 Hadoop clusters. And those 10 Hadoop clusters, three of them are in Amazon. Four of them are in Microsoft. Three of them are in Google Cloud Platform. How do you know what people are doing with data then? >> How is this all presented to the user? What does the user see? >> Great question. The trick to all of this, to self-service, is that first you have to know very clearly, who is the person you are trying to serve? What are their technical skills and capabilities, and how can you get them productive as fast as possible? When we created this category, our key notion was that we were going to go after analysts. Now, that is a very generic term, right? Because we are all, in some sense, analysts in our day-to-day lives. But in Paxata, a business analyst, in an enterprise organizational context, is somebody that has the ability to use Microsoft Excel, they have to have that skill or they won't be successful with today's Paxata.
They have to know what a VLOOKUP is, because a VLOOKUP is a way to actually pull data from a second data source into one. We would all know that as a join or a lookup. And the third thing is, they have to know what a pivot table is and know how a pivot table works. Because the key insight we had is that, of the hundreds of millions of analysts, people who use Excel on a day-to-day basis, a lot of their work is data prep. But Excel, being an amazing generic tool, is actually quite bad for doing data prep. So the person we target, when I go to a customer and they say, "Are we a good candidate to use Paxata?" and we're talking to the actual person who's going to use the software, I say, "Do you know what a VLOOKUP is, yes or no? Do you know what a pivot table is, yes or no?" If they have that skill, when they come into Paxata, we designed Paxata to be very attractive to those people. So it's completely point-and-click. It's completely visual. It's completely interactive. There's no scripting inside that whole process, because do you think the average Microsoft Excel analyst wants to script, or wants to use a proprietary wrangling language? I'm sorry, but analysts don't want to wrangle. Data scientists, the 1% of the 1%, maybe they like to wrangle, but you don't have that with the broader analyst community, and that is a much larger market opportunity that we have targeted. >> Well, very large, I mean, a lot of people are familiar with those concepts in Excel, and if they're not, they're relatively easy to learn. >> Nenshad: That's right. >> Excellent. All right, Nenshad, we have to leave it there. Thanks very much for coming on The Cube, appreciate it. >> Thank you very much for having me. >> Congratulations for all the success. >> Thank you. >> All right, keep it right there, everybody. We'll be back with our next guest. This is The Cube, we're live from New York City at Big Data NYC. We'll be right back. (electronic music)
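The comparison drawn in the interview, that a VLOOKUP pulls data from a second source into the first and is what a database person would call a join or a lookup, can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Paxata or Excel implements it; the sample rows, the column names, and the `vlookup_join` helper are all hypothetical.

```python
# Hypothetical sample data: names and values are illustrative only.
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 250},
    {"order_id": 2, "customer_id": "C2", "amount": 90},
    {"order_id": 3, "customer_id": "C9", "amount": 40},  # no matching customer
]
customers = [
    {"customer_id": "C1", "region": "East"},
    {"customer_id": "C2", "region": "West"},
]


def vlookup_join(left_rows, right_rows, key, column, default=None):
    """Add `column` from right_rows to each left row, matched on `key`.

    Behaves like an exact-match VLOOKUP (equivalently, a left join that
    brings over one column): the first matching right row wins, and left
    rows without a match receive `default`.
    """
    lookup = {}
    for row in right_rows:
        # setdefault keeps the first value seen for each key.
        lookup.setdefault(row[key], row[column])
    # Build new dicts so the input rows are left unmodified.
    return [{**row, column: lookup.get(row[key], default)} for row in left_rows]


enriched = vlookup_join(orders, customers, key="customer_id", column="region")
for row in enriched:
    print(row)
```

Every order is kept, including the one whose customer is missing from the second table; that row simply gets an empty region, which is the left-join behavior an analyst would expect from a lookup.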
ENTITIES
Entity | Category | Confidence |
---|---|---|
Citi | ORGANIZATION | 0.99+ |
October 27, 2013 | DATE | 0.99+ |
George | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Nenshad | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Prakash | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
New York City | LOCATION | 0.99+ |
Nvidia | ORGANIZATION | 0.99+ |
Cisco | ORGANIZATION | 0.99+ |
Earth | LOCATION | 0.99+ |
15 million dollars | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
30 years | QUANTITY | 0.99+ |
Forrester | ORGANIZATION | 0.99+ |
Excel | TITLE | 0.99+ |
thousands | QUANTITY | 0.99+ |
50 companies | QUANTITY | 0.99+ |
10 million dollars | QUANTITY | 0.99+ |
Standard Chartered Bank | ORGANIZATION | 0.99+ |
Nenshad Bardoliwalla | PERSON | 0.99+ |
two reasons | QUANTITY | 0.99+ |
one million | QUANTITY | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
two roles | QUANTITY | 0.99+ |
two polarities | QUANTITY | 0.99+ |
1.5 million | QUANTITY | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
150 years | QUANTITY | 0.99+ |
Hadoop | TITLE | 0.99+ |
Paxata | ORGANIZATION | 0.99+ |
second reason | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
two models | QUANTITY | 0.99+ |
second | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
Both | QUANTITY | 0.99+ |
three years ago | DATE | 0.99+ |
first time | QUANTITY | 0.98+ |
New York | LOCATION | 0.98+ |
both | QUANTITY | 0.98+ |
1% | QUANTITY | 0.97+ |
third thing | QUANTITY | 0.97+ |
one system | QUANTITY | 0.97+ |
about five minutes | QUANTITY | 0.97+ |
Paxata | PERSON | 0.97+ |
first feature | QUANTITY | 0.97+ |
Data | LOCATION | 0.96+ |
one part | QUANTITY | 0.96+ |
United States government | ORGANIZATION | 0.95+ |
thousands of tables | QUANTITY | 0.94+ |
20 years | QUANTITY | 0.94+ |
Model two | QUANTITY | 0.94+ |
10 Hadoop clusters | QUANTITY | 0.94+ |
terabytes | QUANTITY | 0.93+ |