Felix Van de Maele, Collibra, Data Citizens 22

(upbeat techno music) >> Collibra is a company that was founded in 2008, right before the so-called modern big data era kicked into high gear. The company was one of the first to focus its business on data governance. Now, historically, data governance and data quality initiatives were back office functions, and they were largely confined to regulated industries that had to comply with public policy mandates. But as the cloud went mainstream, the tech giants showed us how valuable data could become, and the value proposition for data quality and trust evolved from primarily a compliance-driven issue to becoming a linchpin of competitive advantage. But data in the decade of the 2010s was largely about getting the technology to work. You had these highly centralized technical teams that were formed, and they had hyper-specialized skills to develop data architectures and processes to serve the myriad data needs of organizations. And it resulted in a lot of frustration with data initiatives for most organizations, which didn't have the resources of the cloud guys and the social media giants to really attack their data problems and turn data into gold. This is why today, for example, there's quite a bit of momentum to rethinking monolithic data architectures. You hear about initiatives like Data Mesh and the idea of data as a product. They're gaining traction as a way to better serve the data needs of decentralized business users. You hear a lot about data democratization. So these decentralization efforts around data, they're great, but they create a new set of problems. Specifically, how do you deliver a self-service infrastructure to business users and domain experts? Now, the cloud is definitely helping with that, but also, how do you automate governance? This becomes especially tricky as protecting data privacy has become more and more important. 
In other words, while it's enticing to experiment and run fast and loose with data initiatives, kind of like the Wild West, to find new veins of gold, it has to be done responsibly. As such, the idea of data governance has had to evolve to become more automated and intelligent. Governance and data lineage are still fundamental to ensuring trust as data moves like water through an organization. No one is going to use data that is untrusted. Metadata has become increasingly important for data discovery and data classification. As data flows through an organization, the ability to continuously check for data flaws and automate that data quality becomes a functional requirement of any modern data management platform. And finally, data privacy has become a critical adjacency to cyber security. So you can see how data governance has evolved into a much richer set of capabilities than it was 10 or 15 years ago. Hello and welcome to theCUBE's coverage of Data Citizens, made possible by Collibra, a leader in so-called data intelligence and the host of Data Citizens 2022, which is taking place in San Diego. My name is Dave Vellante and I'm one of the hosts of our program, which is running in parallel to Data Citizens. Now, at theCUBE we like to say we extract the signal from the noise, and over the next couple of days we're going to feature some of the themes from the keynote speakers at Data Citizens, and we'll hear from several of the executives. Felix Van de Maele, who is the co-founder and CEO of Collibra, will join us, along with one of the other founders of Collibra, Stan Christiaens, who's going to join my colleague Lisa Martin. I'm also going to sit down with Laura Sellers. She's the Chief Product Officer at Collibra. We'll talk about some of the announcements and innovations they're making at the event, and then we'll dig in further to data quality with Kirk Haslbeck. He's the Vice President of Data Quality at Collibra. 
He's an amazingly smart dude who founded OwlDQ, a company that he sold to Collibra last year. Now, many companies didn't make it through the Hadoop era. You know, they missed the industry waves and they became driftwood. Collibra, on the other hand, has evolved its business. It has leveraged the cloud, expanded its product portfolio, and leaned in heavily to some major partnerships with cloud providers, as well as receiving a strategic investment from Snowflake earlier this year. So, it's a really interesting story that we're thrilled to be sharing with you. Thanks for watching and I hope you enjoy the program. (upbeat rock music) Last year theCUBE covered Data Citizens, Collibra's customer event, and the premise that we put forth prior to that event was that despite all the innovation that's gone on over the last decade or more with data, you know, starting with the Hadoop movement, we had data lakes, we had Spark, the ascendancy of programming languages like Python, the introduction of frameworks like TensorFlow, the rise of AI, low code, no code, et cetera, businesses still find it's too difficult to get more value from their data initiatives. And we said at the time, you know, maybe it's time to rethink data innovation. While a lot of the effort has been focused on, you know, more efficiently storing and processing data, perhaps more energy needs to go into thinking about the people and the process side of the equation. Meaning, making it easier for domain experts to gain insights from data, trust the data, and begin to use that data in new ways, fueling data products, monetization, and insights. Data Citizens 2022 is back and we're pleased to have Felix Van de Maele, who is the founder and CEO of Collibra. He's on theCUBE. We're excited to have you, Felix. Good to see you again. >> Likewise, Dave. Thanks for having me again. >> You bet. 
All right, we're going to get the update from Felix on the current data landscape, how he sees it, why data intelligence is more important now than ever, and get current on what Collibra has been up to over the past year and what's changed since Data Citizens 2021, and we may even touch on some of the product news. So Felix, we're living in a very different world today with businesses and consumers. They're struggling with things like supply chains and uncertain economic trends, and we're not just snapping back to the 2010s, that's clear, and that's really true as well in the world of data. So what's different in your mind, in the data landscape of the 2020s, from the previous decade, and what challenges does that bring for your customers? >> Yeah, absolutely, and I think you said it well, Dave, in the intro: that rising complexity and fragmentation in the broader data landscape hasn't gotten any better over the last couple of years. When we talk to our customers, that level of fragmentation, the complexity, how do we find data that we can trust, that we know we can use, has only gotten more difficult. So that trend is continuing. I think what is changing is that the trend has become much more acute. Well, the other thing we've seen over the last couple of years is that the level of scrutiny that organizations are under with respect to data, as data becomes more mission critical, as data becomes more impactful and important, the level of scrutiny with respect to privacy, security, and regulatory compliance is only increasing as well. Which again, is really difficult in this environment of continuous innovation, continuous change, continuously growing complexity and fragmentation. So, it's become much more acute. And to your earlier point, we do live in a different world, and in the past couple of years we could probably just kind of brute force it, right? We could focus on the top line. There was enough investment to be had. 
I think nowadays organizations are in a very different environment where there's much more focus on cost control, productivity, efficiency, how do we truly get the value from that data? So again, I think it's just another incentive for organizations to now truly look at data and to scale with data, not just from a technology and infrastructure perspective, but how do we actually scale data from an organizational perspective, right? You said it, the people and process, how do we do that at scale? And that's only becoming much more important, and we do believe that the economic environment that we find ourselves in today is going to be a catalyst for organizations to really take that more seriously, if you will, than they maybe have in the past. >> You know, I don't know, when you guys founded Collibra, if you had a sense as to how complicated it was going to get, but you've been on a mission to really address these problems from the beginning. How would you describe your mission, and what are you doing to address these challenges? >> Yeah, absolutely. We started Collibra in 2008. So, in some sense, in the last financial crisis, and that was really the start of Collibra, where we found product market fit, working with large financial institutions to help them cope with the increasing compliance requirements that they were faced with because of the financial crisis. And kind of here we are again, in a very different environment of course, almost 15 years later, but with data only becoming more important. But our mission to deliver trusted data for every user, every use case, and across every source, frankly, has only become more important. 
So, while it has been an incredible journey over the last 14, 15 years, I think we're still relatively early in our mission to, again, be able to provide everyone, and that's why we call it Data Citizens, we truly believe that everyone in the organization should be able to use trusted data in an easy manner. That mission is only becoming more important, more relevant. We definitely have a lot more work ahead of us because we're still relatively early in that journey. >> Well, that's interesting, because you know, in my observation it takes 7 to 10 years to actually build a company, and then the fact that you're still in the early days is kind of interesting. I mean, Collibra's had a good 12 months or so since we last spoke at Data Citizens. Give us the latest update on your business. What do people need to know about your current momentum? >> Yeah, absolutely. Again, there's a lot of tailwind. Organizations are only maturing their data practices, and we've seen that kind of transform or influence a lot of the business growth that we've seen, the broader adoption of the platform. We work with some of the largest organizations in the world, whether it's Adobe, Heineken, Bank of America, and many more. We now have over 600 enterprise customers, all industry leaders, in every single vertical. So it's really exciting to see that and continue to partner with those organizations. On the partnership side, again, a lot of momentum in the market with some of the cloud partners like Google, Amazon, Snowflake, Databricks, and others, right? Those kind of new modern data infrastructures, modern data architectures, are definitely all moving to the cloud. It's a great opportunity for us, our partners, and of course our customers, to help them kind of transition to the cloud even faster. And so we see a lot of excitement and momentum there. 
We did an acquisition about 18 months ago around data quality and data observability, which we believe is an enormous opportunity. Of course data quality isn't new, but I think there are a lot of reasons why we're so excited about quality and observability now. One is around leveraging AI and machine learning, again, to drive more automation. And a second is that those data pipelines that are now being created in the cloud, in these modern data architectures, have become mission critical. They've become real time. And so monitoring, observing those data pipelines continuously has become absolutely critical, so we're really excited about that as well. And on the organizational side, I'm sure you've heard the term data mesh, something that's gaining a lot of momentum, rightfully so. It's really the type of governance that we always believed in: federated, focused on domains, giving a lot of ownership to different teams. I think that's the way to scale data organizations, and so that aligns really well with our vision, and from a product perspective, we've seen a lot of momentum with our customers there as well. >> Yeah, you know, a couple things there. I mean, the acquisition of OwlDQ, you know, Kirk Haslbeck and their team. It's interesting, you know, the whole data quality area used to be this back office function and really confined to highly regulated industries. It's come to the front office, it's top of mind for Chief Data Officers. Data mesh, you mentioned, you guys are a connective tissue for all these different nodes on the data mesh. That's key. And of course we see you at all the shows. You're a critical part of many ecosystems and you're developing your own ecosystem. So, let's chat a little bit about the products. 
We're going to go deeper into products later on, at Data Citizens 22, but we know you're debuting some new innovations, you know, whether it's under the covers in security, sort of making data more accessible for people, or just dealing with workflows and processes, as you talked about earlier. Tell us a little bit about what you're introducing. >> Yeah, absolutely. We're super excited, a ton of innovation. And if we think about the big theme, like I said, we're still relatively early in this journey towards that mission of data intelligence, that really bold and compelling mission. Either customers are just starting on that journey, and we want to make it as easy as possible for the organization to actually get started, because we know it's important that they do. Or, for organizations and customers that have been with us for some time, there's still a tremendous amount of opportunity to kind of expand the platform further. And again, to make it easier to accomplish that mission and vision around the Data Citizen, that everyone has access to trustworthy data in a very easy way. So that's really the theme of a lot of the innovation that we're driving: a lot of ease of adoption, ease of use, but also then, how do we make sure that, as Collibra becomes this kind of mission critical enterprise platform, from a security, performance, architecture, scale, and supportability perspective, we're truly able to deliver that kind of enterprise mission critical platform. And so that's the big theme. From an innovation perspective, from a product perspective, there's a lot of new innovation that we're really excited about. A couple of highlights. One is around data marketplace. Again, a lot of our customers have plans in that direction. How do we make it easy? How do we make data available through a true kind of shopping experience? 
So that anybody in the organization can, in a very easy, search-first way, find the right data product, find the right dataset, that they can then consume. Usage analytics: how do we help organizations drive adoption? Tell them where they're working really well and where they have opportunities. Homepages, again, to make things easy for anyone in your organization to kind of get started with Collibra. You mentioned Workflow Designer. Again, we have a very powerful enterprise platform, and one of our key differentiators is the ability to really drive a lot of automation through workflows. And now we've provided a new low-code, no-code kind of workflow designer experience, so customers can really take it to the next level. There's a lot more new product around Collibra Protect, which is in partnership with Snowflake, which has been a strategic investor in Collibra, focused on how do we make access governance easier? How are we able to make sure that, as you move to the cloud, things like access management and masking around sensitive data, PII data, are managed in a much more effective way. Really excited about that product. There's more around data quality. Again, how do we get that deployed as easily, and quickly, and widely as we can? Moving that to the cloud has been a big part of our strategy. So, we launched our data quality cloud product, as well as making use of those native compute capabilities in platforms like Snowflake, Databricks, Google, Amazon, and others. And so we are beta testing a capability that we call pushdown, so we're actually pushing down the compute and data quality monitoring into the underlying platform, which again, from a scale, performance, and ease of use perspective, is going to make a massive difference. And then more broadly, we talked a little bit about the ecosystem. Again, integrations, we talk about being able to connect to every source. 
Integrations are absolutely critical, and we're really excited to deliver new integrations with Snowflake, Azure, and Google Cloud Storage as well. So there's a lot coming out. The team has been at work really hard, and we are really, really excited about what we're bringing to market. >> Yeah, a lot going on there. I wonder if you could give us your closing thoughts. I mean, you talked about, you know, the marketplace. You know, you think about Data Mesh, you think of data as product, one of the key principles, you think about monetization. This is really different than what we've been used to in data, where just getting the technology to work has been so hard. So, how do you see sort of the future, and, you know, give us your closing thoughts please? >> Yeah, absolutely. And I think we're really at a pivotal moment, and I think you said it well. We all know the constraints and the challenges with data, how to actually do data at scale. And while we've seen a ton of innovation on the infrastructure side, we fundamentally believe that just getting a faster database is important, but it's not going to fully solve the challenges and truly deliver on the opportunity. And that's why now is really the time to deliver this data intelligence vision, this data intelligence platform. We are still early, and making it as easy as we can is kind of our mission. And so I'm really, really excited to see how the market is going to evolve over the next few quarters and years. I think the trend is clearly there. We talked about Data Mesh. This kind of federated approach, the focus on data products, is just another signal, we believe, that a lot of organizations are now at the point where they understand the need to go beyond just the technology. 
And to really, really think about how to actually scale data as a business function, just like we've done with IT, with HR, with sales and marketing, with finance. That's how we need to think about data. I think now is the time, given the economic environment that we are in, with much more focus on control, much more focus on productivity and efficiency. Now is the time we need to look beyond just the technology and infrastructure to think of how to scale data, how to manage data at scale. >> Yeah, it's a new era. The next 10 years of data won't be like the last, as I always say. Felix, thanks so much. Good luck in San Diego. I know you're going to crush it out there. >> Thank you, Dave. >> Yeah, it's a great spot for an in-person event, and of course the content post-event is going to be available at collibra.com, and you can of course catch theCUBE coverage at theCUBE.net and all the news at siliconangle.com. This is Dave Vellante for theCUBE, your leader in enterprise and emerging tech coverage. (upbeat techno music)
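
The "pushdown" idea mentioned in the interview (running data quality checks inside the underlying data platform instead of pulling the data out) can be illustrated with a small sketch. This is not Collibra's implementation. It assumes a hypothetical `orders` table with an `email` column, uses Python's built-in `sqlite3` module as a stand-in for a cloud warehouse, and computes one simple quality metric (the share of null values) in two ways.

```python
import sqlite3


def null_rate_pulled(conn, table, column):
    """Pull every row to the client and compute the null rate locally."""
    # Illustration only: identifiers are interpolated directly, so never
    # pass untrusted table or column names to this function.
    rows = conn.execute(f"SELECT {column} FROM {table}").fetchall()
    if not rows:
        return 0.0
    return sum(1 for (value,) in rows if value is None) / len(rows)


def null_rate_pushed_down(conn, table, column):
    """Let the database engine aggregate; only a single row comes back."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    if not total:
        return 0.0
    return nulls / total


def demo():
    """Build a tiny hypothetical table and run both variants of the check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, email TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, "a@example.com"), (2, None), (3, "c@example.com"), (4, None)],
    )
    return (
        null_rate_pulled(conn, "orders", "email"),
        null_rate_pushed_down(conn, "orders", "email"),
    )
```

Both functions return the same metric. The difference is where the work happens: the first moves every row across the connection, while the second moves only the aggregate, which is the scale and performance argument made in the interview.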

Published Date : Nov 2 2022



Robert Walsh, ZeniMax | PentahoWorld 2017

>> Announcer: Live from Orlando, Florida, it's theCUBE, covering Pentaho World 2017. Brought to you by Hitachi Vantara. (upbeat techno music) (coughs) >> Welcome to Day Two of theCUBE's live coverage of Pentaho World, brought to you by Hitachi Vantara. I'm your host Rebecca Knight, along with my co-host Dave Vellante. We're joined by Robert Walsh. He is the Technical Director, Enterprise Business Intelligence at ZeniMax. Thanks so much for coming on the show. >> Thank you, good morning. >> Good to see ya. >> I should say congratulations are in order (laughs) because your company, ZeniMax, has been awarded the Pentaho Excellence Award for the Big Data category. I want to talk about the award, but first tell us a little bit about ZeniMax. >> Sure. So the company itself, most people know us by the games versus the corporate name. We make a lot of games. We're the third biggest company for gaming in America. And we make a lot of games such as Quake, Fallout, Skyrim, Doom. We have a game launching this week called Wolfenstein. And so, most people know us by the games versus the corporate entity, which is ZeniMax Media. >> Okay, okay. And as you said, you're the third largest gaming company in the country. So, tell us what you do there. >> So, myself and my team, we are primarily responsible for the ingestion and the evaluation of all the data from the organization. That includes really two main buckets. So, very simplistically, we have the business world. So, the traditional money, users, demographics, people, sales. And on the other side we have the game. That's where a lot of people see the fun in what we do, such as what people are doing in the game, where in the game they're doing it, and why they're doing it. So, we get a lot of data on gameplay behavior based on our playerbase. And we try and fuse those two together for a single view of the customer. >> And that data comes from, is it the console? Does it come from the ... What's the data flow? 
>> Yeah, so we actually support many different platforms. So, we have games on the console. So, Microsoft, Sony, PlayStation, Xbox, as well as the PC platform. Macs, for example, Android, and iOS. We support all platforms. So, the big challenge that we have is trying to unify that ingestion of data across all these different platforms in a unified way, to facilitate downstream the reporting that we do as a company. >> Okay, so who ... When you're playing the game on a Microsoft console, whose data is that? Is it the user's data? Is it Microsoft's data? Is it ZeniMax's data? >> I see. So, many games that we actually release have a server-side component. Most of our games are actually an online world. So, if you disconnect today, people are still playing in that world. It never ends. So, in that situation, we have all the servers that people connect to from their desktop, from their console. Not all, but most data we generate for the game comes from the servers that people connect to. We own those. >> Dave: Oh, okay. >> Which greatly simplifies getting that data from the people. >> Dave: So, it's your data? >> Exactly. >> What is the data telling you these days? >> Oh, wow, depends on the game. I think people realize what people do in games, what games have become. So, we have one game right now called Elder Scrolls Online, and this year we released the ability to buy in-game homes. And you can buy furniture for your in-game homes. So, you can furnish them. People can come and visit. And you can buy items, and weapons, and pets, and skins. And what's really interesting is that part of the reason why we exist is to look at patterns and trends based on how people interact with that environment. So for example, we'll see the American playerbase buy very different items compared to, say, the European playerbase, based on social differences. 
And so, that helps immensely for the people who continuously develop the game to add items and features that people want to see and want to leverage. >> That is fascinating that Americans and Europeans are buying different furniture for their online homes. So, just give us some examples of the differences that you're seeing between these two groups. >> So, it's not just the homes, it applies to everything that they purchase as well. It's quite interesting. So, when it comes to the Americans versus Europeans, for example, what we find is that Europeans prefer much more cosmetic, passive experiences. Whereas the Americans are much more into things that stand out, things that are ... I'm trying to avoid stereotypes right now. >> Right, exactly. >> It is what it is. >> Americans like ostentatious stuff. >> Robert: Exactly. >> We get it. >> Europeans are a bit more passive in that regard. And so, we do see that. >> Rebecca: Understated maybe. >> Thank you, that's a much better way of putting it. But games often have to be tweaked based on the environment. A different way of looking at it is that a lot of companies in Korea, in Asia, release these games in the West, and they will have to tweak the game completely before it releases in these environments. Because players will behave differently and expect different things. And these games have become global. We have people playing all over the world, all at the same time. So, how do you facilitate it? How do you support these different users with different needs in this one environment? Again, that's why BI has grown substantially in the gaming industry in the past five, ten years. >> Can you talk about the evolution of how you've been able to interact and essentially affect the user behavior, or respond to that behavior? You mentioned BI. So, you know, go back ten years, it was very reactive. Not a lot of real time stuff going on. Are you now in the position to affect the behavior in real time, in a positive way? >> We're very close to that. 
We're not quite there yet. So yes, that's a very good point. So, five, ten years ago most games were traditional boxes. You make a game, you get a box, Walmart or GameStop, and then you're finished. The relationship with the customer ends. Now, we have this concept that's used often, which is games as a service. We provide an online environment, a service around a game, and people will play those games for weeks, months, if not years. And so, the shift as well from a BI tech standpoint is one item where we've been able to streamline the ingest process. So, we're not real time, but we can be hourly. Which is pretty responsive. But also, the fact that these games have become these online environments has enabled us to get this information. Five years ago, when the game was in a box, on the shelf, there was no connective tissue between us and them to interact and facilitate. With the games now being online, we can leverage BI. We can be more real time. We can respond quicker. But it's also due to the fact that now games themselves have changed to facilitate that interaction. >> Can you, Robert, paint a picture of the data pipeline? We started there with sort of the different devices. And you're bringing those in as sort of a blender. But take us through the data pipeline and how you're ultimately embedding or operationalizing those analytics. >> Sure. So, the game data, the game and the business information, game data is most likely 90, 95% of our total data footprint. We generate a lot more game information than we do business information. It's just due to how much we can track. We can do so. And so, a lot of these games will generate various game events, game logs that we can ingest into a single data lake. And we use Amazon S3 for that. But it's not just the game data. So, we have databases for financial information, account users, and so we will ingest the game events as well as the databases into one single location. 
At that point, however, it's still very raw. It's still very basic. We enable the analysts to actually interact with that. And they can go in there and get their feet wet, but it's still very raw. The next step is really taking that raw information that is disjointed and separated, and unifying that into a single model that they can use in a much more performant way. In that first step, the analysts have the burden of a lot of the ETL work, to manipulate the data, to transform it, to make it useful. Which they can do. But they should be doing the analysis, not ingesting the data. And so, the progression from there into our warehouse is the next step of that pipeline. And so in there, we create these models and structures. And they're often born out of what the analysts are seeing and using in that initial data lake stage. So, if they're repeating analysis, if they're doing this on a regular basis, and the company wants something that's automated and auditable and productionized, then that's a great use case for promotion into our warehouse. You've got this initial staging layer. We have a warehouse where it's structured information. And we allow the analysts into both of those environments. So, they can pick their poison in respects. Structured data over here, raw and vast over here, based on their use case. >> And what are the roles ... Just one more follow up, >> Yeah. >> if I may? Who are the people that are actually doing this work? Building the models, cleaning the data, and shoring data. You've got data scientists. You've got quality engineers. You've got data engineers. You've got application developers. Can you describe the collaboration between those roles? >> Sure. Yeah, so we as a BI organization, we have two main groups. We have our engineering team. That's the one I drive. Then we have reporting, and that's a team. Now, we are really one single unit. We work as a team but we separate those two functions. And so, in my organization we have two main groups. 
We have our big data team, which is doing that initial ingestion. Now, we ingest billions of rows of data a day. Terabytes of data a day. And so, we have a team just dedicated to ingestion, standardization, and exposing that first stage. Then we have our second team, who are the warehouse engineers, who are actually here today somewhere. And they're the ones who are doing the modeling, the structuring. I mean the data modeling, making the data usable and promoting that into the warehouse. On the reporting team, basically we are there to support them. We provide these tool sets to engage and let them do their work. And so, in that team they have a split of people who do a lot of report development, visualization, data science. A lot of the individuals there will do all three, two of the three, or one of the three. But they also have segmentation across your day-to-day reporting, which has to function, as well as the more deep analysis for data science or predictive analysis. >> And that data warehouse is on-prem? Is it in the cloud? >> Good question. Everything that I talked about is all in the cloud. About a year and a half, two years ago, we made the leap into the cloud. We drank the Kool-Aid. As of Q2 next year at the very latest, we'll be 100% cloud. >> And the database infrastructure is Amazon? >> Correct. We use Amazon for all the BI platforms. >> Redshift or is it... >> Robert: Yes. >> Yeah, okay. >> That's where I actually want to go, because you were talking about the architecture. So, I know you've mentioned Amazon Redshift. Cloudera is another one of your solution providers. And of course, we're here at Pentaho World, Pentaho. You've described Pentaho as the glue. Can you expand on that a little bit? >> Absolutely. So, I've been talking about these two environments, these two worlds, data lake to data warehouse. They're both different in how they're developed, but it's really a single pipeline, as you said. 
And so, how do we get data from this raw form into this modeled structure? And that's where Pentaho comes into play. That's the glue. It's the glue between these two environments; while they're conceptually very different, they serve a singular purpose. But we need a way to unify that pipeline. And so, Pentaho we use very heavily to take this raw information, to transform it, ingest it, and model it into Redshift. And we can automate, we can schedule, we can provide error handling. And so it gives us the framework. And it's self-documenting, to be able to track and understand from A to B, from raw to structured, how we do that. And again, Pentaho is allowing us to make that transition. >> Pentaho 8.0 just came out yesterday. >> Hmm, it did? >> What are you most excited about there? Do you see any changes? We keep hearing a lot about the ability to scale with Pentaho World. >> Exactly. So, there's three things that really appeal to me actually on 8.0. So, things that we were missing that they've actually filled in with this release. So firstly, on the streaming component from earlier, the real-time piece we were missing: we're looking at using Kafka and queuing for a lot of our ingestion purposes. And Pentaho is releasing in this new version the mechanism to connect to that environment. That was good timing. We need that. Also, to get into more critical detail, the logs that we ingest, the data that we handle, we use Avro and Parquet. When we can. We use JSON, Avro, and Parquet. Pentaho can handle JSON today. Avro and Parquet are coming in 8.0. And then lastly, to the point you made as well, is where they're going with their system: they want to go into streaming, into all this information. It's very large and it has to go big. And so, they're adding, again, the ability to add worker nodes and scale their environment horizontally. And that's really a requirement before these other things can come into play. So, those are the things we're looking for. 
Our data lake can scale on demand. Our Redshift environment can scale on demand. Pentaho has not been able to but with this release they should be able to. And that was something that we've been hoping for for quite some time. >> I wonder if I can get your opinion on something. A little futures-oriented. You have a choice as an organization. You could just roll your own open source, best-of-breed open source tools, and slog through that. And if you're an internet giant or a huge bank, you can do that. >> Robert: Right. >> You can take tooling like Pentaho, which is an end-to-end data pipeline, and this dramatically simplifies things. A lot of the cloud guys, Amazon, Microsoft, I guess to a certain extent Google, they're sort of picking off pieces of the value chain. And they're trying to come up with an as-a-service, fully integrated pipeline. Maybe not best of breed but convenient. How do you see that shaking out generally? And then specifically, is that a challenge for Pentaho from your standpoint? >> So, you're right. That's why they're trying to fill these gaps in their environment. Between what Pentaho does and what they're offering, there's no comparison right now. They're not there yet. They're a long way away. >> Dave: You're saying the cloud guys are not there. >> No way. >> Pentaho is just so much more functional. >> Robert: They're not close. >> Okay. >> So, that's the first step. However, though, what I've been finding in the cloud, there's lots of benefits from the ease of deployment, the scaling. You use a lot of DevOps support, DBA support. But the tools that they offer right now feel pretty bare bones. They're very generic. They have a place but they're not designed for a singular purpose. Redshift is the only real piece of the pipeline that is a true Amazon product, but that came from a company called ParAccel ten years ago. They licensed that from a separate company. >> Dave: What a deal that was for Amazon! (Rebecca and Dave laugh) >> Exactly. 
And so, we like it because of the functionality ParAccel put in many years ago. Now, they've developed upon that. And it made it easier to deploy. But that's the core reason behind it. Now, for our big data environment, we use Databricks. Databricks is a cloud solution. They deploy into Amazon. And so, what I've been finding more and more is that companies that are specialized in an application or function, who have their product support cloud deployment, are to me a sweet middle ground. So, Pentaho is also talking about next year looking at Amazon deployment solutioning for their tool set. So, to me it's not really about going all Amazon. Oh, let's use all Amazon products. They're cheap and cheerful. We can make it work. We can hire ten engineers and hack out a solution. I think what's more applicable is people like Pentaho, whatever people in the industry who have the expertise and are specialized in that function, who can allow their products to be deployed in that environment and leverage the Amazon advantages, the Elastic Compute, storage model, the deployment methodology. That is where I see the sweet spot. So, if Pentaho can get to that point, for me that's much more appealing than looking at Amazon trying to build out some things to replace Pentaho x years down the line. >> So, their challenge, if I can summarize, they've got to stay functionally ahead. Which they're way ahead now. They got to maintain that lead. They have to curate best of breed like Spark, for example, from Databricks. >> Right. >> Whatever's next and curate that in a way that is easy to integrate. And then look at the cloud's infrastructure. >> Right. Over the years, these companies have been looking at ways to deploy into a data center easily and efficiently. Now, the cloud is the next option. How do they support and implement into the cloud in a way where we can leverage their tool set but in a way where we can leverage the cloud ecosystem? And that's the gap. 
And I think that's what we look for in companies today. And Pentaho is moving towards that. >> And so, that's a lot of good advice for Pentaho? >> I think so. I hope so. Yeah. If they do that, we'll be happy. So, we'll definitely take that. >> Is it Pen-ta-ho or Pent-a-ho? >> You've been saying Pent-a-ho with your British accent! But it is Pen-ta-ho. (laughter) Thank you. >> Dave: Cheap and cheerful, I love it. >> Rebecca: I know -- >> Bless your cotton socks! >> Yes. >> I've had it-- >> Dave: Cord and Bennett. >> Rebecca: Man, okay. Well, thank you so much, Robert. It's been a lot of fun talking to you. >> You're very welcome. >> We will have more from Pen-ta-ho World (laughter) brought to you by Hitachi Vantara just after this. (upbeat techno music)
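
The raw-to-structured flow Robert describes (take raw records, transform them, model them for the warehouse, and handle errors along the way) can be illustrated with a minimal Python sketch. This is not Pentaho's API or ZeniMax's pipeline: the field names, the sample events, and the functions `transform` and `run_pipeline` are all hypothetical, chosen only to show the shape of such a job.

```python
import json
from collections import defaultdict

# Hypothetical raw events, one JSON document per line, as they might sit in a data lake.
RAW_LINES = [
    '{"player_id": "p1", "game": "Doom", "minutes": 42}',
    '{"player_id": "p2", "game": "Fallout", "minutes": "17"}',
    'not valid json',
    '{"player_id": "p1", "game": "Doom", "minutes": 8}',
]


def transform(line):
    """Parse one raw line and normalize it into a flat, typed record."""
    event = json.loads(line)
    return {
        "player_id": str(event["player_id"]),
        "game": str(event["game"]),
        "minutes": int(event["minutes"]),
    }


def run_pipeline(lines):
    """Return (modeled_rows, errors).

    modeled_rows aggregates minutes per (player, game), standing in for a
    warehouse table; errors lists (line_number, exception_name) for rejects.
    """
    totals = defaultdict(int)
    errors = []
    for number, line in enumerate(lines, start=1):
        try:
            record = transform(line)
        except (ValueError, KeyError, TypeError) as exc:
            # Error handling: keep the job running and record what was rejected.
            errors.append((number, type(exc).__name__))
            continue
        totals[(record["player_id"], record["game"])] += record["minutes"]
    rows = [
        {"player_id": player, "game": game, "total_minutes": minutes}
        for (player, game), minutes in sorted(totals.items())
    ]
    return rows, errors


rows, errors = run_pipeline(RAW_LINES)
```

Keeping the rejected lines in a separate list, rather than failing the whole run, is one way to get the auditable, schedulable behavior mentioned in the interview; a real job would likely write both outputs to durable storage.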

Published Date : Oct 27 2017


ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Rebecca Knight	PERSON	0.99+
Rebecca	PERSON	0.99+
Robert Walsh	PERSON	0.99+
Robert	PERSON	0.99+
Dave	PERSON	0.99+
Pentaho	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
Microsoft	ORGANIZATION	0.99+
Asia	LOCATION	0.99+
Walmart	ORGANIZATION	0.99+
America	LOCATION	0.99+
ZeniMax Media	ORGANIZATION	0.99+
ZeniMax	ORGANIZATION	0.99+
ParAccel	TITLE	0.99+
second team	QUANTITY	0.99+
Google	ORGANIZATION	0.99+
two	QUANTITY	0.99+
two main groups	QUANTITY	0.99+
two groups	QUANTITY	0.99+
Wolfenstein	TITLE	0.99+
one	QUANTITY	0.99+
Orlando, Florida	LOCATION	0.99+
Sony	ORGANIZATION	0.99+
two functions	QUANTITY	0.99+
three	QUANTITY	0.99+
both	QUANTITY	0.99+
90, 95%	QUANTITY	0.99+
next year	DATE	0.99+
Kool-Aid	ORGANIZATION	0.99+
100%	QUANTITY	0.99+
iOS	TITLE	0.99+
today	DATE	0.99+
Doom	TITLE	0.99+
yesterday	DATE	0.99+
Hitachi Vantara	ORGANIZATION	0.99+
two main buckets	QUANTITY	0.98+
Gamestop	ORGANIZATION	0.98+
Fallout	TITLE	0.98+
two environments	QUANTITY	0.98+
first step	QUANTITY	0.98+
one item	QUANTITY	0.98+
Five years ago	DATE	0.98+
Android	TITLE	0.98+
one game	QUANTITY	0.98+
Pentaho World	TITLE	0.98+
three things	QUANTITY	0.98+
first stage	QUANTITY	0.98+
Pen-ta-ho World	ORGANIZATION	0.98+
Pentaho Excellence Award	TITLE	0.98+
this year	DATE	0.98+

Arun Murthy, Hortonworks - Spark Summit East 2017 - #SparkSummit - #theCUBE

>> [Announcer] Live, from Boston, Massachusetts, it's the Cube, covering Spark Summit East 2017, brought to you by Databricks. Now, your hosts, Dave Vellante and George Gilbert. >> Welcome back to snowy Boston everybody, this is The Cube, the leader in live tech coverage. Arun Murthy is here, he's the founder and vice president of engineering at Hortonworks, father of YARN, can I call you that, godfather of YARN, is that fair, or? (laughs) Anyway. He's so, so modest. Welcome back to the Cube, it's great to see you. >> Pleasure to have you. >> Coming off the big keynote, (laughs) you ended the session this morning, so that was great. Glad you made it in to Boston, and uh, lot of talk about security and governance, you know we've been talking about that for years, it feels like it's truly starting to come into the mainstream Arun, so. >> Well I think it's just a reflection of what customers are doing with the tech now. Now, three, four years ago, a lot of it was pilots, a lot of it was, you know, people playing with the tech. But increasingly, it's about, you know, people actually applying stuff in production, having data, system of record, running workloads both on-prem and on the cloud, cloud is sort of becoming more and more real at mainstream enterprises. So a lot of it means, as you take any of the examples today, any interesting app will have some sort of real time data feed, it's probably coming out from a cell phone or sensor, which means that data is actually not, in most cases not coming on-prem, it's actually getting collected in a local cloud somewhere, it's just more cost effective, why would you put up 25 data centers if you don't have to, right? So then you got to connect that data with production data you have or customer data you have or data you might have purchased and then join them up, run some interesting analytics, do geo-based real time threat detection, cyber security. 
A lot of it means that you need a common way to secure data, govern it, and that's where we see the action, I think it's a really good sign for the market and for the community that people are pushing on these dimensions of the broader, because, getting pushed in this dimension because it means that people are actually using it for real production workloads. >> Well in the early days of Hadoop you really didn't talk that much about cloud. >> Yeah. >> You know, and now, >> Absolutely. >> It's like, you know, duh, cloud. >> Yeah. >> It's everywhere, and of course the whole hybrid cloud thing comes into play, what are you seeing there, what are things you can do in a hybrid, you know, or on-prem that you can't do in a public cloud and what's the dynamic look like? >> Well, it's definitely not an either or, right? So what we're seeing is increasingly interesting apps need data which are born in the cloud and they'll stay in the cloud, but they also need transactional data which stays on-prem, you might have an EDW for example, right? >> Right. >> There's not a lot of, you know, people want to solve business problems and not just move data from one place to another, right? Or back from one place to another, so it's not interesting to move an EDW to the cloud, and similarly it's not interesting to bring your IOT data or sensor data back on-prem, right? Just makes sense. So naturally what happens is, you know, at Hortonworks we talk about a kind of modern app, or a modern data app, which means a modern data app has to span, has to sort of, you know, it has to span both on-prem data and cloud data. >> Yeah, you talked about that in your keynote. Years ago, Furrier said that data is the new development kit. And now you're seeing the apps are just so dang rich, >> Exactly, exactly. >> And they have to span >> Absolutely. >> physical locations, >> Yeah. 
>> But then this whole thing of IOT comes up, we've been having a conversation on The Cube, last several Cubes of, okay, how much stays out, how much stays in, there's a lot of debates about that, there's reasons not to bring it in, but you talked today about some of the important stuff will come back. >> Yeah. >> So the way this is, this all is going to be, you know, there's a lot of data that should be born in the cloud and stay there, the IOT data, but then what will happen increasingly is, key summaries of the data will move back and forth, so key summaries of your EDW will move to the cloud, sometimes key summaries of your IOT data, you know, you want to do some sort of historical training in analytics, that will come back on-prem, so I think there's a bi-directional data movement, but it just won't be all the data, right? It'll be key interesting summaries of the data but not all of it. >> And a lot of times, people say well it doesn't matter where it lives, cloud should be an operating model, not a place where you put data or applications, and while that's true and we would agree with that, from a customer standpoint it matters in terms of performance and latency issues and cost and regulation, >> And security and governance. >> Yeah. >> Absolutely. >> You need to think those things through. >> Exactly, so I mean, so that's what we're focused on, to make sure that you have a common security and governance model regardless of where data is, so you can think of it as infrastructure you own and infrastructure you lease. >> Right. >> Right? Now, the details matter of course, when you go to the cloud you use S3 for example or ADLS from Microsoft, but you got to make sure that there's a common sort of security and governance front on top of it, in front of it, as an example one of the things that, you know, in the open source community, Ranger's a really sort of key project right now from a security authorization and authentication standpoint. 
We've done a lot of work with our friends at Microsoft to make sure, you can actually now manage data in WASB which is their object store, data stream, natively with Ranger, so you can set a policy that says only Dave can access these files, you know, George can access these columns, that sort of stuff is natively done on the Microsoft platform thanks to the relationship we have with them. >> Right. >> So that's actually really interesting for the open source communities. So you've talked about sort of commodity storage at the bottom layer and even if they're different sort of interfaces and implementations, it's still commodity storage, and now what's really helpful to customers is that they have a common security model, >> Exactly. >> Authorization, authentication, >> Authentication, lineage, provenance, >> Oh okay. >> You want to make sure all of these are common services across. >> But you've mentioned the different data patterns, like the stuff that might be streaming in on the cloud, what, assuming you're not putting it into just a file system or an object store, and you want to sort of merge it with >> Yeah. >> Historical data, so what are some of the data stores other than the file system, in other words, newfangled databases to manage this sort of interaction? >> So I think what you're saying is, we certainly have the raw data, the raw data is going to land up in whatever cloud native storage, >> Yeah. >> It's going to be Amazon, WASB, ADLS, Google Storage. But then increasingly you want, so now the patterns change so you have raw data, you have some sort of an ETL process, what's interesting in the cloud is that even the processed data or, if you take the unstructured raw data and structure it, that structured data also needs to live on the cloud platform, right? The reason that's important is because A, it's cheaper to use the native platform rather than set up your own database on top of it. 
The other one is you also want to take advantage of all the native services that the cloud storage provides, so for example, linking your application. So automatically data in WASB, you know, if you can set up a policy and easily say this structured data table that I have, which is a summary of all the IOT activity in the last 24 hours, you can, using the cloud provider's technologies you can actually make it show up easily in Europe, like you don't have to do any work, right? So increasingly what we at Hortonworks have focused a lot on is to make sure that we, all of the compute engines, whether it's Spark or Hive or, you know, or MapReduce, it doesn't really matter, they're all natively working on the cloud provider's storage platform. >> [George] Okay. >> Right, so, >> Okay. >> That's a really key consideration for us. >> And the follow up to that, you know, there's a bit of a misconception that Spark replaces Hadoop, but it actually can be a processing, a compute engine for, >> Yeah. >> That can complement or replace some of the compute engines in Hadoop, help us frame how you talk about it with your customers. >> For us it's really simple, like in the past, the only option you had on Hadoop to do any computation was MapReduce, that was, I started working on MapReduce 11 years ago, so as you can imagine, it's a pretty good run for any technology, right? Spark is definitely the interesting sort of engine for sort of the, anything from machine learning to ETL for data on top of Hadoop. But again, what we focus a lot on is to make sure that every time we bring in, so right now, when we started on HDP, the first HDP had about nine open source projects, literally just nine. Today, the last one we shipped was 2.5, HDP 2.5 had about 27 I think, like it's a huge sort of explosion, right? But the problem with that is not just that we have 27 projects, the problem is that you've got to make sure each of the 27 work with all the 26 others. >> It's a QA nightmare. 
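
To put a rough number on "each of the 27 work with all the 26 others": if every pair of projects needs to be checked together, the count of distinct pairs for n projects is n(n-1)/2. The short Python sketch below works that out for the approximate project counts quoted in the interview; the function name `pairwise_integrations` is illustrative, and treating integration effort as one test per pair is a simplifying assumption.

```python
from itertools import combinations


def pairwise_integrations(n_projects):
    """Number of distinct project pairs, assuming each pair is tested once."""
    return n_projects * (n_projects - 1) // 2


# Approximate project counts quoted in the interview.
nine = pairwise_integrations(9)
twenty_seven = pairwise_integrations(27)

# Cross-check the closed form against an explicit enumeration of pairs.
assert twenty_seven == sum(1 for _ in combinations(range(27), 2))

print(nine, twenty_seven)
```

Under that assumption, tripling the number of projects from about nine to about 27 takes the pair count from 36 to 351, which is why the integration work grows much faster than the project list itself.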
>> Exactly. So that integration is really key, so same thing with Spark, we want to make sure you have security and YARN (mumbles), like you saw in the demo today, you can now run Spark SQL but also make sure you get row-level (mumbles) masking, all of the enterprise capabilities that you need, and I was at a financial services company three or four weeks ago in Chicago. Today, to do the equivalent of what I showed today in the demo, they need literally, they have a classic EDW, and they have to maintain anywhere between 1500 to 2500 views of the same database, that's a nightmare as you can imagine. Now the fact that you can do this on the raw data using whether it's Hive or Spark or Pig or MapReduce, it doesn't really matter, it's really key, and that's the thing we push to make sure things like YARN security work across all the stacks, all the open source techs. >> So that makes life better, a simplification use case if you will, >> Yeah. >> What are some of the other use cases that you're seeing things like Spark enable? >> Machine learning is a really big one. Increasingly, every product is going to have some, people call it machine learning and AI and deep learning, there's a lot of techniques out there, but the key part is you want to build a predictive model, in the past (mumbles) everybody wanted to build a model and score what's happening in the real world against the model, but equally important, make sure the model gets updated as more data comes in on and actually as the model scores does get smaller over time. So that's something we see all over, so for example, even within our own product, it's not just us enabling this for the customer, for example at Hortonworks we have a product called SmartSense which allows you to optimize how people use Hadoop. Where the, what are the opportunities for you to explore deficiencies within your own Hadoop system, whether it's Spark or Hive, right? So we now put machine learning into SmartSense. 
And show you that customers who are running queries like you are running, Mr. Customer X, other customers like you are tuning Hadoop this way, they're running this sort of config, they're using these sort of features in Hadoop. That allows us to actually make the product itself better all the way down the pipe. >> So you're improving the scoring algorithm or you're sort of replacing it with something better? >> What we're doing there is just helping them optimize their Hadoop deploys. >> Yep. >> Right? You know, configuration and tuning and kernel settings and network settings, we do that automatically with SmartSense. >> But the customer, you talked about scoring and trying to, >> Yeah. >> They're tuning that, improving that and increasing the probability of its accuracy, or is it? >> It's both. >> Okay. >> So the thing is what they do is, you initially come with a hypothesis, you have some amount of data, right? I'm a big believer that over time, more data, you're better off spending more, getting more data into the system than to tune that algorithm finely, right? >> Interesting, okay. >> Right, so you know, for example, you know, talk to any of the big guys like Facebook because they'll do the same, what they'll say is it's much better to get, to spend your time getting 10x data to the system and improving the model rather than spending 10x the time and improving the model itself on day one. >> Yeah, but that's a key choice, because you got to >> Exactly. >> Spend money on doing either, >> One of them. >> And you're saying go for the data. >> Go for the data. >> At least now. >> Yeah, go for data, what happens is the good part of that is it's not just the model, it's the, what you got to really get through is the entire end to end flow. >> Yeah. 
>> All the way from data aggregation to ingestion to collection to scoring, all that aspect, you're better off sort of walking through the paces like building the entire end to end product rather than spending time in a silo trying to make a lot of change. >> We've talked to a lot of machine learning tool vendors, application vendors, and it seems like we got to the point with Big Data where we put it in a repository, then we started doing better at curating it and understanding it, then starting to do a little bit of exploration with business intelligence, but with machine learning, we don't have something that does this end to end, you know, from acquiring the data, building the model to operationalizing it, where are we on that, who should we look to for that? >> It's definitely very early, I mean if you look at, even the EDW space, for example, what is EDW? EDW is ingestion, ETL, and then sort of fast query layer, OLAP, BI, on and on and on, right? So that's the full EDW flow, I don't think as a market, I mean, it's really early in this space, that as an overall industry we have that end to end sort of industrialized design concept, it's going to take time, but a lot of people are ahead, you know, the Googles of the world are ahead, over time a lot of people will catch up. >> We got to go, I wish we had more time, I had so many other questions for you but I know time is tight in our schedule, so thanks so much Arun, >> Appreciate it. >> For coming on, appreciate it, alright, keep right there everybody, we'll be back with our next guest, it's The Cube, we're live from Spark Summit East in Boston, right back. (upbeat music)
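
The policy example from the interview ("only Dave can access these files, George can access these columns") and the point about replacing 1500 to 2500 database views with masking on the raw data can be illustrated with a small Python sketch. This is not Apache Ranger's API or policy format: the `POLICIES` table, the sample rows, and the `mask` and `read` functions are hypothetical, meant only to show how one policy table applied at read time can stand in for many per-user views.

```python
# Hypothetical policy table; user names mirror the spoken example only.
POLICIES = {
    # dave may read every column unmasked.
    "dave": {"columns": {"customer", "amount", "card_number"}, "masked": set()},
    # george may read two columns, one of them masked.
    "george": {"columns": {"customer", "card_number"}, "masked": {"card_number"}},
}

# Made-up rows standing in for the raw data.
ROWS = [
    {"customer": "acme", "amount": 120, "card_number": "4111111111111111"},
    {"customer": "globex", "amount": 75, "card_number": "5500005555555559"},
]


def mask(value):
    """Hide all but the last four characters of a value."""
    text = str(value)
    return "*" * (len(text) - 4) + text[-4:]


def read(user, rows):
    """Return the rows as the given user is allowed to see them."""
    policy = POLICIES.get(user)
    if policy is None:
        raise PermissionError(f"no policy grants {user!r} access")
    result = []
    for row in rows:
        visible = {}
        for column in policy["columns"]:
            value = row[column]
            # Apply masking per column instead of maintaining a separate view.
            visible[column] = mask(value) if column in policy["masked"] else value
        result.append(visible)
    return result


dave_view = read("dave", ROWS)
george_view = read("george", ROWS)
```

Adding a new user or changing what a user may see then means editing one policy entry, rather than creating or altering another view of the same table.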

Published Date : Feb 9 2017


ENTITIES

Entity	Category	Confidence
Dave	PERSON	0.99+
George Gilbert	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Arun Murthy	PERSON	0.99+
Europe	LOCATION	0.99+
Microsoft	ORGANIZATION	0.99+
10x	QUANTITY	0.99+
Boston	LOCATION	0.99+
Chicago	LOCATION	0.99+
Amazon	ORGANIZATION	0.99+
George	PERSON	0.99+
Arun	PERSON	0.99+
Wasabi	ORGANIZATION	0.99+
25 data centers	QUANTITY	0.99+
Today	DATE	0.99+
Hadoop	TITLE	0.99+
Wasabi	LOCATION	0.99+
YARN	ORGANIZATION	0.99+
Facebook	ORGANIZATION	0.99+
ADLS	ORGANIZATION	0.99+
Hortonworks	ORGANIZATION	0.99+
Horton Works	ORGANIZATION	0.99+
today	DATE	0.99+
Databricks	ORGANIZATION	0.99+
1500	QUANTITY	0.98+
SmartSense	TITLE	0.98+
S3	TITLE	0.98+
Boston, Massachusetts	LOCATION	0.98+
One	QUANTITY	0.98+
27 projects	QUANTITY	0.98+
three	DATE	0.98+
Google	ORGANIZATION	0.98+
Furio	PERSON	0.98+
Spark	TITLE	0.98+
2500 views	QUANTITY	0.98+
first	QUANTITY	0.97+
Spark Summit East	LOCATION	0.97+
both	QUANTITY	0.97+
Spark SQL	TITLE	0.97+
Google Storage	ORGANIZATION	0.97+
26	QUANTITY	0.96+
Ranger	ORGANIZATION	0.96+
four weeks ago	DATE	0.95+
one	QUANTITY	0.94+
each	QUANTITY	0.94+
four years ago	DATE	0.94+
11 years ago	DATE	0.93+
27 work	QUANTITY	0.9+
MapReduce	TITLE	0.89+
Hive	TITLE	0.89+
this morning	DATE	0.88+
EDW	TITLE	0.88+
about nine open source	QUANTITY	0.88+
day one	QUANTITY	0.87+
nine	QUANTITY	0.86+
years	DATE	0.84+
Olap	TITLE	0.83+
Cube	ORGANIZATION	0.81+
a lot of data	QUANTITY	0.8+
